Polynomial Regression: Importance, Step-by-Step Implementation

Last updated:
29th Jan, 2021

Introduction

In the vast field of Machine Learning, which algorithm do most of us study first? For many, it is Linear Regression. Often the first algorithm and program one writes when starting out in Machine Learning, Linear Regression remains important and effective for data that follows a linear trend.

What if the dataset we come across does not follow a linear trend? What if a linear regression model cannot capture the relationship between the independent and dependent variables?

This is where another type of regression, Polynomial Regression, comes in. True to its name, Polynomial Regression is a regression algorithm that models the relationship between the dependent variable (y) and the independent variable (x) as an nth-degree polynomial. In this article, we shall look at the algorithm and the math behind Polynomial Regression, along with its implementation in Python.

What is Polynomial Regression?

As defined earlier, Polynomial Regression is a special case of linear regression in which a polynomial of a specified degree n is fit to non-linear data, that is, data with a curvilinear relationship between the dependent and independent variables. It still counts as linear regression because the equation is linear in its coefficients, even though it is non-linear in x.

y = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3 + ... + b_n x_1^n

Here,

y is the dependent variable (output variable)

x_1 is the independent variable (predictor)

b_0 is the bias

b_1, b_2, ..., b_n are the weights in the regression equation.

As the degree n increases, the polynomial becomes more complex and the model becomes more prone to overfitting, which is discussed later in this article.
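To make the equation concrete, here is a minimal sketch that evaluates a degree-3 polynomial for a few inputs. The coefficient values and the helper name predict are hypothetical and chosen only for illustration; they do not come from any fitted model.

```python
import numpy as np

# Hypothetical coefficients b0..b3 of a degree-3 polynomial (illustrative values only).
b = np.array([1.0, 2.0, -0.5, 0.1])

def predict(x, coeffs):
    """Evaluate y = b0 + b1*x + b2*x^2 + ... + bn*x^n for each value in x."""
    x = np.asarray(x, dtype=float)
    powers = np.arange(len(coeffs))      # exponents 0, 1, ..., n
    # Each row holds [1, x, x^2, ..., x^n] for one input value.
    features = x[:, None] ** powers
    return features @ coeffs

y_hat = predict([0.0, 1.0, 2.0], b)
print(y_hat)
```

Adding one more coefficient to b raises the degree by one, which is exactly the knob that makes the curve more flexible.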

Comparison of Regression Equations

Simple Linear Regression ===>         y = b_0 + b_1 x

Multiple Linear Regression ===>     y = b_0 + b_1 x_1 + b_2 x_2 + b_3 x_3 + ... + b_n x_n

Polynomial Regression ===>         y = b_0 + b_1 x_1 + b_2 x_1^2 + b_3 x_1^3 + ... + b_n x_1^n

The three equations differ in subtle ways. Simple and Multiple Linear Regression differ from Polynomial Regression in that every variable appears only to the first power, so they have degree 1. Multiple Linear Regression involves several variables x_1, x_2, and so on. The Polynomial Regression equation has only one variable, x_1, but raises it to powers up to n, which is what distinguishes it from the other two.
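The comparison above can also be read the other way round: a polynomial regression on one variable looks like a multiple linear regression whose inputs are the powers of that variable. The short sketch below, using made-up input values, shows the columns that scikit-learn's PolynomialFeatures generates for degree 3.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

# One predictor, three samples (made-up values for illustration).
x = np.array([[1.0], [2.0], [3.0]])

# degree=3 expands each x into the columns [1, x, x^2, x^3].
expanded = PolynomialFeatures(degree=3).fit_transform(x)
print(expanded)
```

Each row of the result is what a linear model would receive as its inputs, which is why the usual linear regression machinery can be reused unchanged.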

Need for Polynomial Regression

In the diagrams below, the first shows a straight line fitted to a set of non-linear data points. A straight line cannot follow the curvature of such data, so even after training the loss remains high and the model's error is large.

With Polynomial Regression, on the other hand, the fitted curve follows the data points closely. This indicates that the polynomial equation captures the relationship between the variables in the dataset. Thus, when the data points are arranged in a non-linear manner, we need a Polynomial Regression model.
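The difference between the two fits can be checked numerically as well as visually. The following sketch is an illustration on synthetic data (a noisy quadratic generated with a fixed seed, not the dataset used later in this article) and compares the training mean squared error of a straight line with that of a degree-2 polynomial.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PolynomialFeatures

# Synthetic non-linear data: a quadratic curve plus a little noise (assumed for illustration).
rng = np.random.default_rng(0)
X = np.linspace(-3, 3, 40).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + X.ravel() + 2 + rng.normal(scale=0.3, size=40)

# Straight-line fit.
linear = LinearRegression().fit(X, y)
mse_linear = mean_squared_error(y, linear.predict(X))

# Degree-2 polynomial fit on the expanded features.
X_poly = PolynomialFeatures(degree=2).fit_transform(X)
poly = LinearRegression().fit(X_poly, y)
mse_poly = mean_squared_error(y, poly.predict(X_poly))

print("Linear MSE:    ", mse_linear)
print("Polynomial MSE:", mse_poly)
```

On curved data like this, the polynomial model's error should come out well below the straight line's.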

Implementation of Polynomial Regression in Python

From here, we shall build a Machine Learning model in Python implementing Polynomial Regression. We shall compare the results obtained with Linear Regression and Polynomial Regression. Let us first understand the problem that we are going to solve with Polynomial Regression.

Problem Description

Consider a start-up looking to hire several candidates from another company. That company has different openings for different job roles, and the start-up has details of the salary for each role there. When a candidate mentions his or her previous salary, the start-up's HR needs to verify it against this data. The dataset has two descriptive columns, Position and Level. Because Level is a numeric encoding of Position, the code below uses only Level as the independent variable. The dependent variable (output) is Salary, which is to be predicted using Polynomial Regression.

On visualizing the above table in a graph, we see that the data is non-linear in nature. In other words, as the level increases the salary increases at a higher rate thus giving us a curve as shown below.

Step 1: Data Pre-Processing

The first step in building any Machine Learning model is to import the libraries. Here, only three basic libraries need to be imported. The dataset is then read from my GitHub repository, and the independent and dependent variables are assigned: the independent variables are stored in X and the dependent variable is stored in y.

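import numpy as np

import matplotlib.pyplot as plt

import pandas as pd

# Load the dataset from the GitHub repository
dataset = pd.read_csv('https://raw.githubusercontent.com/mk-gurucharan/Regression/master/PositionSalaries_Data.csv')

# Independent variable(s): column index 1 up to, but not including, the last column
X = dataset.iloc[:, 1:-1].values

# Dependent variable: the last column
y = dataset.iloc[:, -1].values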

Here, in the term [:, 1:-1], the first colon means that all rows are taken, and 1:-1 selects the columns from index 1 (the second column) up to, but not including, the last column, which is referred to by -1.
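The slicing is easier to see on a tiny table. The sketch below builds a hypothetical three-row DataFrame with the same column layout (Position, Level, Salary); the row values are invented for illustration and are not read from the CSV file.

```python
import pandas as pd

# Hypothetical miniature of the dataset (values invented for illustration).
df = pd.DataFrame({
    "Position": ["Analyst", "Manager", "Director"],
    "Level": [1, 2, 3],
    "Salary": [45000, 80000, 150000],
})

X = df.iloc[:, 1:-1].values   # columns from index 1 up to, not including, the last
y = df.iloc[:, -1].values     # last column only

print(X.shape)  # (3, 1): a 2-D array holding only the Level column
print(y.shape)  # (3,): a 1-D array holding the Salary column
```

Writing 1:-1 rather than just 1 keeps X two-dimensional, which is the shape scikit-learn estimators expect for their inputs.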

Step 2: Linear Regression Model

In the next step, we shall build a Linear Regression model and use it to predict the salary from the independent variable in X. For this, the class LinearRegression is imported from sklearn.linear_model. It is then fitted on the variables X and y for training.

from sklearn.linear_model import LinearRegression

regressor = LinearRegression()

regressor.fit(X, y)

Once the model is built, visualizing the results gives the following graph.

As is clearly seen, a straight line cannot capture the relationship in this non-linear dataset. Thus, we need Polynomial Regression to model the relationship between the variables.
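A plot like the one described can be produced along the following lines. This is a sketch under stated assumptions: the ten level and salary values are stand-ins typed in by hand so that the example runs without downloading the CSV, and the colours, labels, and output file name linear_fit.png are arbitrary choices.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LinearRegression

# Stand-in data (assumed values, not loaded from the article's CSV).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000]) * 1000.0

regressor = LinearRegression().fit(X, y)

fig, ax = plt.subplots()
ax.scatter(X, y, color="red", label="Actual salary")                # observed points
ax.plot(X, regressor.predict(X), color="blue", label="Linear fit")  # fitted straight line
ax.set_xlabel("Level")
ax.set_ylabel("Salary")
ax.set_title("Linear Regression")
ax.legend()
fig.savefig("linear_fit.png")
```

In an interactive session, the backend line can be dropped and plt.show() used instead of saving to a file.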

Step 3: Polynomial Regression Model

In this next step, we shall fit a Polynomial Regression model on this dataset and visualize the results. For this, we import another class, PolynomialFeatures, from the sklearn.preprocessing module and specify the degree of the polynomial equation to be built. PolynomialFeatures expands X into its polynomial terms, and the LinearRegression class is then used to fit the polynomial equation to the dataset.

from sklearn.preprocessing import PolynomialFeatures

from sklearn.linear_model import LinearRegression

poly_reg = PolynomialFeatures(degree = 2)

X_poly = poly_reg.fit_transform(X)

lin_reg = LinearRegression()

lin_reg.fit(X_poly, y)

In the above case, we set the degree of the polynomial to 2. On plotting the graph, we see that a curve is fitted, but the predicted curve (in green) still deviates considerably from the real data (in red). Thus, in the next step we increase the degree of the polynomial to 3 and 4 and compare the results.
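One way to run this comparison is to loop over the degrees and record how well each model fits the training data. The sketch below does so with a scikit-learn pipeline and the training R² score; the data are hand-typed stand-in values (an assumption made so the example runs offline), and the pipeline is a convenience rather than the approach used in the code above.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Stand-in data (assumed values, not loaded from the article's CSV).
X = np.arange(1, 11).reshape(-1, 1)
y = np.array([45, 50, 60, 80, 110, 150, 200, 300, 500, 1000]) * 1000.0

train_r2 = {}
for degree in (2, 3, 4):
    # The pipeline expands X into polynomial terms, then fits a linear model on them.
    model = make_pipeline(PolynomialFeatures(degree=degree), LinearRegression())
    model.fit(X, y)
    train_r2[degree] = model.score(X, y)   # R^2 measured on the training data
    print(degree, round(train_r2[degree], 4))
```

Because each higher-degree model contains the lower-degree ones as special cases, the training score cannot get worse as the degree grows, which is why it is not a reliable guide for choosing the degree on its own.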

On comparing the results of Polynomial Regression with degrees 3 and 4, we see that as the degree increases, the model fits the training data more closely. Thus, a higher degree enables the polynomial to fit the training data more accurately. However, a closer fit to the training data does not mean better predictions on new data; this is exactly how overfitting arises. It is therefore important to choose the value of n carefully to prevent overfitting.

What is Overfitting?

As the name suggests, overfitting is a situation in statistics in which a function (or, in this case, a Machine Learning model) is fit too closely to a limited set of data points. This causes the function to perform poorly on new data points.

In Machine Learning, a model that overfits a given set of training data performs very badly on a completely new set of points (say, the test dataset), because it has memorized the training points instead of generalizing from them.
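The gap between training and test performance can be measured directly. The following sketch uses synthetic data (a noisy sine curve with fixed seeds, assumed purely for illustration), holds out part of it as a test set, and records the mean squared error on both parts for a few polynomial degrees.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data: a sine curve plus noise (assumed for illustration).
rng = np.random.default_rng(42)
X = rng.uniform(-1, 1, size=(30, 1))
y = np.sin(3 * X.ravel()) + rng.normal(scale=0.2, size=30)

# Hold out 40% of the points; the model never sees them during fitting.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0
)

train_mse, test_mse = {}, {}
for degree in (1, 3, 10):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    model.fit(X_train, y_train)
    train_mse[degree] = mean_squared_error(y_train, model.predict(X_train))
    test_mse[degree] = mean_squared_error(y_test, model.predict(X_test))
    print(degree, train_mse[degree], test_mse[degree])
```

The training error keeps falling as the degree rises; a test error that stops falling, or starts rising, while the training error still improves is the usual sign of overfitting.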

In polynomial regression, the chance of overfitting the training data grows as the degree of the polynomial is increased. The example above shows a typical case of this. It can be corrected by trying several degrees and choosing the one that performs best on data not used for training, for example a held-out validation set.
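One common way to make this trial and error systematic is k-fold cross-validation. The sketch below is an illustration on synthetic quadratic data (an assumption, not the salary dataset): it scores degrees 1 to 6 with scikit-learn's cross_val_score and picks the degree with the lowest average validation error.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Synthetic data from a known quadratic plus noise (assumed for illustration).
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(60, 1))
y = 1 + 2 * X.ravel() - 3 * X.ravel() ** 2 + rng.normal(scale=0.2, size=60)

cv = KFold(n_splits=5, shuffle=True, random_state=0)
cv_mse = {}
for degree in range(1, 7):
    model = make_pipeline(PolynomialFeatures(degree), LinearRegression())
    # scikit-learn reports negated MSE so that higher is always better; flip the sign back.
    scores = cross_val_score(model, X, y, cv=cv, scoring="neg_mean_squared_error")
    cv_mse[degree] = -scores.mean()

best_degree = min(cv_mse, key=cv_mse.get)
print("Cross-validated MSE per degree:", cv_mse)
print("Selected degree:", best_degree)
```

With very small datasets, each fold holds only a handful of points, so the selected degree can vary from one split to another and is best treated as a guide.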

Conclusion

To conclude, Polynomial Regression is used in many situations where there is a non-linear relationship between the dependent and independent variables. The algorithm is sensitive to outliers, but this can be mitigated by treating the outliers before fitting the regression curve. In this article, we introduced the concept of Polynomial Regression and walked through an example of its implementation in Python on a simple dataset.

Pavan Vadapalli

Blog Author
Director of Engineering @ upGrad. Motivated to leverage technology to solve problems. Seasoned leader for startups and fast moving orgs. Working on solving problems of scale and long term technology strategy.
Frequently Asked Questions (FAQs)

1. What do you mean by linear regression?

Linear regression is a type of predictive analysis in which the value of an unknown (dependent) variable is estimated with the help of one or more independent variables. It describes the relationship between one dependent variable and one or more independent variables by fitting a trend line through a set of data points. Linear regression can be used to build a prediction model from seemingly random data, such as cancer diagnoses or stock prices. There are several methods for fitting a linear regression. One of the most prevalent is the ordinary least-squares approach, which estimates the unknown coefficients by minimizing the sum of the squared vertical distances between the data points and the trend line.
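As a small illustration of the ordinary least-squares idea, the sketch below fits a trend line to four made-up points using NumPy's least-squares solver. The points are chosen to lie exactly on the line y = 1 + 2x so that the result is easy to check by hand.

```python
import numpy as np

# Made-up points that lie exactly on y = 1 + 2x.
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([3.0, 5.0, 7.0, 9.0])

# Design matrix with a column of ones (intercept) and a column of x values (slope).
A = np.column_stack([np.ones_like(x), x])

# Solve for the coefficients that minimize the sum of squared vertical distances.
coeffs, residuals, rank, singular_values = np.linalg.lstsq(A, y, rcond=None)
intercept, slope = coeffs

print("intercept:", intercept)
print("slope:", slope)
```

With noisy data the points no longer sit on the line, and the same call returns the line with the smallest total squared error.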

2. What are some of Linear Regression's drawbacks?

In most cases, regression analysis is used in research to establish that there is a link between variables. However, correlation does not imply causation: a link between two variables does not mean that one causes the other. Even a line that fits the data points well in a simple linear regression does not guarantee a cause-and-effect relationship. A linear regression model can tell you whether the variables are correlated. Further investigation and statistical analysis are required to determine the exact nature of the link and whether one variable causes the other.

3. What are the basic assumptions of linear regression?

In linear regression, there are three key assumptions. First and foremost, the dependent and independent variables must have a linear relationship, which can be checked with a scatter plot of the two. Second, there should be little or no multicollinearity among the independent variables in the dataset, meaning that the independent variables should not be strongly correlated with one another. How much multicollinearity is acceptable depends on the domain. The third is homoscedasticity: the assumption that the errors have constant variance across all values of the independent variables, which is one of the most essential assumptions.
