We saw that there are two types of regression algorithms available in machine learning:
1. Linear Regression
2. Non-Linear Regression
Here we will talk about non-linear regression in detail.
In non-linear regression, the relationship between the independent and dependent variables is not linear.
In other words, the dependent variable is a non-linear function of the independent variable.
For example, let's look at this data-set:
You can download the data-set from the following link: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv
This data-set contains the GDP of China for each year from 1960 to 2014.
    import pandas as pd

    data = pd.read_csv(r"china_gdp.csv")
    data.head(5)
Index | Year | Value |
---|---|---|
0 | 1960 | 5.918412e+10 |
1 | 1961 | 4.955705e+10 |
2 | 1962 | 4.668518e+10 |
3 | 1963 | 5.009730e+10 |
4 | 1964 | 5.906225e+10 |
The question is, can we predict the GDP of China using the year as the input?
Here we will develop a machine learning model that can predict the GDP of China in any given year.
Now, let's come back to non-linear regression. Our data-set contains two columns, Year and Value.
Here the dependent variable is "Value" and the independent variable is "Year".
Let's observe the relationship between them using a scatter plot.
Observe that the relationship is not linear.
    import matplotlib.pyplot as plt

    plt.scatter(data["Year"], data["Value"])
When the relationship between the variables in a data-set is not linear, we use a non-linear regression algorithm.
The best-fit Curve
You can observe that the points in the scatter plot follow a curve.
This means that we cannot find a best-fit line for our data. Instead, we will find the best-fit curve, and the equation of that curve will be our model.
Let's look at some curves from mathematics that can fit the above relationship.
Polynomial Equation
In higher secondary school, you may have studied the concept of "Polynomial Equations".
Let's revisit that concept. A polynomial is an expression of the form $a x^{d} + b x^{d-1} + c x^{d-2} + \dots + n x^{d-d}$, where $d$ is the degree of the polynomial, $x$ is the independent variable, and $a, b, c, \dots, n$ are coefficients.
In non-linear regression, we are going to use a polynomial of some degree $d$ that can fit our data-set.
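As a minimal sketch of the definition above, NumPy's `np.polyval` evaluates a polynomial from its coefficients, listed from the highest power down. The coefficient values below are arbitrary and chosen only for illustration.

```python
import numpy as np

# Coefficients a, b, c, n of a degree-3 polynomial, highest power first.
# These values are arbitrary, for illustration only: 2x^3 - x^2 + 0.5x + 3
coeffs = [2.0, -1.0, 0.5, 3.0]

x = np.array([0.0, 1.0, 2.0])

# Evaluate the polynomial at each x.
y = np.polyval(coeffs, x)

# The same values computed term by term, to show what polyval does.
y_manual = 2.0 * x**3 - 1.0 * x**2 + 0.5 * x + 3.0

print(y)         # values 3.0, 4.5 and 16.0
print(y_manual)  # identical to y
```

Changing the length of `coeffs` changes the degree: four coefficients give a cubic, three give a quadratic, and so on.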
Cubic Equation
So, let's look at the cubic polynomial, where $d = 3$.
$Y = W_0 X^3 + W_1 X^2 + W_2 X + W_3$, where $W_0, W_1, W_2, W_3$ are the weights, or coefficients.
Let's demonstrate the concept using NumPy.
    import numpy as np

    x = np.arange(-5.0, 5.0, 0.1)
    y = 1*(x**3) + 1*(x**2) + 1*(x) + 3
    ynoise = 20*np.random.normal(size=x.size)  # for generating noise.
    ydata = y + ynoise
    plt.scatter(x, ydata)
    plt.plot(x, y, 'r')

If the data-set shows a quadratic relationship
Y = X**2

    x = np.arange(-5.0, 5.0, 0.1)  # independent variable
    y = 1*(x**2)
    ynoise = 2*np.random.normal(size=x.size)  # for generating noise.
    ydata = y + ynoise  # dependent variable.
    plt.scatter(x, ydata)
    plt.plot(x, y, 'r')

So in short,
we have to choose, or guess, a polynomial degree that can fit the relationship in our data-set.
After choosing the degree, as in linear regression, we have to find the coefficients (weights) for that equation that give the minimum error.
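The two steps above can be sketched with NumPy alone. This is only an illustration on synthetic data, not the method used in the rest of this post: `np.polyfit` finds the least-squares coefficients for a given degree, and the mean squared error shows how well each guessed degree fits. The seed and the degrees tried are assumptions made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic data following the cubic used earlier, plus noise.
x = np.arange(-5.0, 5.0, 0.1)
y_true = x**3 + x**2 + x + 3
y_data = y_true + 20 * rng.normal(size=x.size)

mse_by_degree = {}
for degree in (1, 2, 3):
    # Step 2: least-squares coefficients for this degree, highest power first.
    weights = np.polyfit(x, y_data, degree)
    y_fit = np.polyval(weights, x)
    # Mean squared error between the noisy data and the fitted curve.
    mse_by_degree[degree] = float(np.mean((y_data - y_fit) ** 2))

# Step 1: compare the guesses. A lower error means a closer fit.
for degree, mse in mse_by_degree.items():
    print(f"degree {degree}: mean squared error = {mse:.1f}")
```

On data like this, the straight line (degree 1) leaves a much larger error than the cubic, which matches the shape we generated.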
Python Library For Choosing Polynomial Equation.
You don't need to remember all types of polynomial equations to fit your data.
Python has a pre-built library, scikit-learn, that makes our lives easy.
The sklearn.preprocessing module has a class, PolynomialFeatures, and we just have to import it.
PolynomialFeatures() will generate a two-dimensional array with the X values raised to the powers 0, 1, 2, ..., d.
In our case, let's choose degree 4. It will generate features with powers 0, 1, 2, 3, 4.
    from sklearn.preprocessing import PolynomialFeatures

    # guess a degree of polynomial you think can fit in your relationship shape.
    degree = 4
    pf = PolynomialFeatures(degree)
    X = np.asanyarray(data["Year"]).reshape(-1,1)  # converting the features as an array
    Y = np.asanyarray(data["Value"]).reshape(-1,1)
    X_poly = pf.fit_transform(X)  # creating polynomial features.
    X_poly[:5,:]

Output
    array([[1.00000000e+00, 1.96000000e+03, 3.84160000e+06, 7.52953600e+09, 1.47578906e+13],
           [1.00000000e+00, 1.96100000e+03, 3.84552100e+06, 7.54106668e+09, 1.47880318e+13],
           [1.00000000e+00, 1.96200000e+03, 3.84944400e+06, 7.55260913e+09, 1.48182191e+13],
           [1.00000000e+00, 1.96300000e+03, 3.85336900e+06, 7.56416335e+09, 1.48484527e+13],
           [1.00000000e+00, 1.96400000e+03, 3.85729600e+06, 7.57572934e+09, 1.48787324e+13]])

Observe that our independent variable is now converted into $X^0, X^1, X^2, X^3, X^4$.
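The column order is easier to see on small numbers. The sketch below uses two toy inputs (not the GDP data) and degree 3, so each row should read as 1, x, x squared, x cubed.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# Two toy samples of a single feature (illustrative values only).
X_small = np.array([[2.0], [3.0]])

pf_small = PolynomialFeatures(degree=3)
X_small_poly = pf_small.fit_transform(X_small)

# Each row holds x^0, x^1, x^2, x^3 for one sample:
# first row  -> 1, 2, 4, 8
# second row -> 1, 3, 9, 27
print(X_small_poly)
```

The leading column of ones is the bias (intercept) term that PolynomialFeatures includes by default.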
Polynomial Regression Model Development
Now, let's look at our input features and the output feature.
Input Feature
    print("Shape of the input feature: ", X_poly.shape)
    X_poly[:5,:]

Output

    Shape of the input feature: (55, 5)
    array([[1.00000000e+00, 1.96000000e+03, 3.84160000e+06, 7.52953600e+09, 1.47578906e+13],
           [1.00000000e+00, 1.96100000e+03, 3.84552100e+06, 7.54106668e+09, 1.47880318e+13],
           [1.00000000e+00, 1.96200000e+03, 3.84944400e+06, 7.55260913e+09, 1.48182191e+13],
           [1.00000000e+00, 1.96300000e+03, 3.85336900e+06, 7.56416335e+09, 1.48484527e+13],
           [1.00000000e+00, 1.96400000e+03, 3.85729600e+06, 7.57572934e+09, 1.48787324e+13]])
Output Feature
    print("Shape of the output feature: ", Y.shape)
    Y[:5,:]

Output

    Shape of the output feature: (55, 1)
    array([[5.91841165e+10],
           [4.95570502e+10],
           [4.66851785e+10],
           [5.00973033e+10],
           [5.90622549e+10]])

So, you can now observe that our problem has been converted into a multiple regression problem, as we have 5 independent variables and only one dependent variable.
    from sklearn import linear_model

    regression = linear_model.LinearRegression()
    regression.fit(X_poly, Y)
    ypred = regression.predict(X_poly)
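As an alternative sketch, not the approach used in this post, scikit-learn's `make_pipeline` can chain the feature expansion and the linear model into one object. The toy data below lies exactly on a quadratic so that the result is easy to check by hand. It is not the GDP data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy data on an exact quadratic, y = 2x^2 + 1 (illustrative only).
X_toy = np.arange(0.0, 10.0).reshape(-1, 1)
y_toy = 2.0 * X_toy.ravel() ** 2 + 1.0

# The pipeline expands the features, then fits the linear model on them.
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X_toy, y_toy)

# New inputs pass through the same feature expansion automatically.
pred = model.predict(np.array([[12.0]]))
print(pred)  # close to 2 * 12**2 + 1 = 289
```

The benefit of this design is that prediction needs only the raw input. There is no separate transformed array to keep in sync with the model.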
Model Evaluation
Now, let's see whether the curve we obtained from our model fits the relationship in the data.
    plt.scatter(data["Year"], data["Value"])  # The original relation of the data.
    plt.plot(data["Year"], ypred, 'r')  # the curve generated by our model.

Here, you can observe that our model with polynomial degree 4 follows the trend of our data-set. Now, let's check whether it predicts correctly.
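Alongside the plot and the spot checks on individual years, a fit can also be summarized with a number. The sketch below is an illustration on synthetic data, not a result for the GDP model. It uses `r2_score` and `mean_squared_error` from `sklearn.metrics`, and the noisy quadratic and the seed are assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.preprocessing import PolynomialFeatures

rng = np.random.default_rng(0)  # fixed seed so the run is repeatable

# Synthetic stand-in for a curved relationship: a noisy quadratic.
X_demo = np.arange(-5.0, 5.0, 0.1).reshape(-1, 1)
y_demo = X_demo.ravel() ** 2 + 2 * rng.normal(size=X_demo.shape[0])

# Same recipe as in this post: expand the features, then fit a linear model.
X_demo_poly = PolynomialFeatures(degree=2).fit_transform(X_demo)
demo_model = LinearRegression().fit(X_demo_poly, y_demo)
y_demo_pred = demo_model.predict(X_demo_poly)

# R^2 near 1 means the curve explains most of the variation in the data.
r2 = r2_score(y_demo, y_demo_pred)
mse = mean_squared_error(y_demo, y_demo_pred)
print(f"R^2 = {r2:.3f}, MSE = {mse:.3f}")
```

Note that these scores are computed on the same points the model was fitted to, so they describe the fit and not how well the model predicts unseen years.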
    data.tail(5)
Index | Year | Value |
---|---|---|
50 | 2010 | 6.039659e+12 |
51 | 2011 | 7.492432e+12 |
52 | 2012 | 8.461623e+12 |
53 | 2013 | 9.490603e+12 |
54 | 2014 | 1.035483e+13 |
    year = [[2010]]
    year_poly = pf.transform(year)  # reuse the already fitted transformer
    regression.predict(year_poly)

Output
    array([[6.23355422e+12]])

You can compare the value in the data-set for the year 2010 with the predicted value. They are very close to each other. Let's try to predict the GDP for the year 2018. Note that the data-set above only contains data up to 2014.
    year = [[2018]]
    year_poly = pf.transform(year)  # reuse the already fitted transformer
    regression.predict(year_poly)

Output
    array([[1.37272028e+13]])

The GDP of China in 2018 was $13.37 trillion. So, the value predicted by our model, about 13.73 trillion dollars, is approximately correct! In the next post, we will talk about overfitting and regularization.