Polynomial Regression Using Python


Non-Linear Regression

In previous posts, we talked about regression algorithms.
We saw that there are two types of regression algorithms available in machine learning:
 1. Linear Regression
 2. Non-Linear Regression

Here we will talk about Non-Linear Regression in detail.

In a non-linear regression algorithm, the independent and dependent variables do not have a linear relationship.

In other words, the dependent variable is a non-linear function of the independent variable.

For example, let's look at this data-set, which you can download from the following link: https://s3-api.us-geo.objectstorage.softlayer.net/cf-courses-data/CognitiveClass/ML0101ENv3/labs/china_gdp.csv
import pandas as pd

data = pd.read_csv(r"china_gdp.csv")

data.head(5)
Year Value
0 1960 5.918412e+10
1 1961 4.955705e+10
2 1962 4.668518e+10
3 1963 5.009730e+10
4 1964 5.906225e+10

This is the data-set containing the GDP of China with the corresponding years.

The question is: can we predict the GDP of China using the year as the input?
Here we will develop a machine learning model that can predict the GDP of China for any given year.

Now, let's come back to non-linear regression. Our data-set contains two features, Year and Value.
Here the dependent variable is "Value" and the independent variable is "Year". Let's observe the relationship between them using a scatter plot.
import matplotlib.pyplot as plt

plt.scatter(data["Year"],data["Value"])

Observe that the relationship is not linear.

When the variables in a data-set are not linearly related, we use a non-linear regression algorithm.

The Best-Fit Curve

You can observe that the scatter plot traces a curve.
This means that we cannot find a best-fit line for our model; instead, we will find a best-fit curve, and the equation of that curve will be our formula, or our model.

Let's look at some curves from the world of mathematics that can fit the above relationship.

Polynomial Equation 

In higher secondary school, you may have studied the concept of "polynomial equations".
Let's rewind that concept: a polynomial equation is an equation of the form a·x^d + b·x^(d-1) + c·x^(d-2) + ... + n·x^0, where d is the degree of the polynomial, x is the independent variable, and a, b, c, ..., n are the coefficients.
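
For a quick illustration, here is a tiny sketch (using NumPy, with made-up coefficients) of evaluating such a polynomial at a point:

import numpy as np

# A hypothetical degree-3 polynomial: y = 2*x^3 + 0*x^2 - 1*x + 5
coefficients = [2, 0, -1, 5]  # listed from the highest degree down

# np.polyval evaluates the polynomial at the given x
print(np.polyval(coefficients, 2))  # 2*8 + 0*4 - 1*2 + 5 = 19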

In non-linear regression, we are going to use a polynomial of degree n that can fit our data-set.

Cubic Equation

So, let's look at the cubic polynomial, where d = 3: Y = W0·X^3 + W1·X^2 + W2·X + W3, where W0, W1, W2, W3 are the weights, or coefficients.

Let's demonstrate the concept using NumPy.
import numpy as np

x = np.arange(-5.0, 5.0, 0.1)  # independent variable
y = 1*(x**3) + 1*(x**2) + 1*(x) + 3

ynoise = 20*np.random.normal(size=x.size)  # for generating noise

ydata = y + ynoise  # dependent variable

plt.scatter(x, ydata)
plt.plot(x, y, 'r')
If the data-set shows a quadratic relationship, Y = X^2:

x = np.arange(-5.0, 5.0, 0.1)  # independent variable
y = 1*(x**2)

ynoise = 2*np.random.normal(size=x.size)  # for generating noise

ydata = y + ynoise  # dependent variable

plt.scatter(x, ydata)
plt.plot(x, y, 'r')
So, in short,
we have to choose (or guess) a degree for the polynomial equation that can fit the relationship in our data-set.

After choosing the degree, just as in the linear regression algorithm, we have to choose the coefficients, or weights, for that particular equation that give the minimum error.
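
As a sketch of this idea, here is a minimal example using NumPy's np.polyfit, which, for a chosen degree, picks the coefficients that minimize the squared error. It assumes the x and ydata arrays from the quadratic example above:

# np.polyfit returns the coefficients (highest degree first) that
# minimize the sum of squared errors for the chosen degree.
coeffs = np.polyfit(x, ydata, deg=2)
print(coeffs)  # roughly [1, 0, 0], i.e. y ≈ 1*x**2, up to the noise

The same call with deg=3 would fit the cubic example. Below we will use scikit-learn instead, since it generalizes more naturally to multiple features.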

Python Library for Choosing a Polynomial Equation

You don't need to remember all types of polynomial equations to fit your data.

Python has a pre-built library for making our lives easy: scikit-learn's sklearn.preprocessing module has a class called PolynomialFeatures, and we just have to import it.

PolynomialFeatures

PolynomialFeatures() generates a multi-dimensional array with the X values raised to the powers 0, 1, 2, ..., d. In our case, let's choose degree 4: it will generate features with degrees 0, 1, 2, 3, and 4.
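
To see what this transformation produces, here is a toy sketch with arbitrary values (x = 2 and x = 3, degree 2):

from sklearn.preprocessing import PolynomialFeatures

toy = PolynomialFeatures(degree=2)
print(toy.fit_transform([[2], [3]]))
# [[1. 2. 4.]
#  [1. 3. 9.]]

Each row is [x^0, x^1, x^2]. Now let's apply the same idea to our data-set.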
from sklearn.preprocessing import PolynomialFeatures

# Guess a degree of polynomial that you think can fit the shape of the relationship.
degree = 4

pf = PolynomialFeatures(degree)

X = np.asanyarray(data["Year"]).reshape(-1,1)   # converting the feature into a column array
Y = np.asanyarray(data["Value"]).reshape(-1,1)

X_poly = pf.fit_transform(X)  # creating the polynomial features
X_poly[:5,:]
Output
    array([[1.00000000e+00, 1.96000000e+03, 3.84160000e+06, 7.52953600e+09,
            1.47578906e+13],
           [1.00000000e+00, 1.96100000e+03, 3.84552100e+06, 7.54106668e+09,
            1.47880318e+13],
           [1.00000000e+00, 1.96200000e+03, 3.84944400e+06, 7.55260913e+09,
            1.48182191e+13],
           [1.00000000e+00, 1.96300000e+03, 3.85336900e+06, 7.56416335e+09,
            1.48484527e+13],
           [1.00000000e+00, 1.96400000e+03, 3.85729600e+06, 7.57572934e+09,
            1.48787324e+13]])

Observe that our independent variable has now been converted into five features: X^0, X^1, X^2, X^3, X^4.
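
If you are using a recent version of scikit-learn (1.0 or later, an assumption about your environment), you can confirm the generated features from the transformer itself:

print(pf.get_feature_names_out())  # ['1' 'x0' 'x0^2' 'x0^3' 'x0^4']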


Polynomial Regression Model Development 


Now, let's observe our input feature and the output feature.

Input Feature

print("Shape of the input feature: ", X_poly.shape)

X_poly[:5,:]
Output
    Shape of the input feature:  (55, 5)
    




    array([[1.00000000e+00, 1.96000000e+03, 3.84160000e+06, 7.52953600e+09,
            1.47578906e+13],
           [1.00000000e+00, 1.96100000e+03, 3.84552100e+06, 7.54106668e+09,
            1.47880318e+13],
           [1.00000000e+00, 1.96200000e+03, 3.84944400e+06, 7.55260913e+09,
            1.48182191e+13],
           [1.00000000e+00, 1.96300000e+03, 3.85336900e+06, 7.56416335e+09,
            1.48484527e+13],
           [1.00000000e+00, 1.96400000e+03, 3.85729600e+06, 7.57572934e+09,
            1.48787324e+13]])

Output Feature

print("Shape of the output feature: ", Y.shape)

Y[:5,:]
    Shape of the output feature:  (55, 1)
    




    array([[5.91841165e+10],
           [4.95570502e+10],
           [4.66851785e+10],
           [5.00973033e+10],
           [5.90622549e+10]])


So, you can now observe that our problem has been converted into a multiple regression problem: we have 5 independent variables (X^0 through X^4) and one dependent variable, and the model to fit is Y = W0·X^0 + W1·X^1 + W2·X^2 + W3·X^3 + W4·X^4.
from sklearn import linear_model

regression = linear_model.LinearRegression()

regression.fit(X_poly, Y)  # learn the weights W0..W4 from the polynomial features

ypred = regression.predict(X_poly)  # predictions on the training years
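
As a side note, scikit-learn can also chain the two steps into a single estimator with a Pipeline. A minimal sketch of the same degree-4 model (the names model and ypred_pipe are just illustrative):

from sklearn.pipeline import make_pipeline

# The pipeline applies PolynomialFeatures and then fits LinearRegression,
# so predict() accepts raw years and transforms them internally.
model = make_pipeline(PolynomialFeatures(degree=4), linear_model.LinearRegression())
model.fit(X, Y)
ypred_pipe = model.predict(X)

This way you cannot forget to transform new inputs before predicting.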

Model Evaluation 

Now, let's see whether the curve we obtained from our model captures the relationship in the data.
plt.scatter(data["Year"], data["Value"])  # the original relationship in the data

plt.plot(data["Year"], ypred, 'r')  # the curve generated by our model
Here, you can observe that our model with polynomial degree 4 follows the trend of our data-set. Before spot-checking individual years, we can also quantify the fit, for example with the R² score (a quick sketch, using sklearn.metrics):
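from sklearn.metrics import r2_score

# R² close to 1 means the curve explains most of the variance in the data.
print("R2 score: ", r2_score(Y, ypred))

Now, let's check whether individual predictions are close to the actual values.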
data.tail(5)
Year Value
50 2010 6.039659e+12
51 2011 7.492432e+12
52 2012 8.461623e+12
53 2013 9.490603e+12
54 2014 1.035483e+13
year = [[2010]]

year_poly = pf.fit_transform(year)  # transform the input year into polynomial features

regression.predict(year_poly)
Output
    array([[6.23355422e+12]])


You can compare the value in the data-set for the year 2010 with the predicted value: they are very close to each other. Now let's try to predict the GDP for the year 2018. Observe that the data-set above contains data only up to 2014.
year = [[2018]]

year_poly = pf.fit_transform(year)

regression.predict(year_poly)
Output
    array([[1.37272028e+13]])
The GDP of China in 2018 was roughly $13.9 trillion, so the value predicted by our model is approximately correct!

In the next post, we will talk about overfitting and regularization.