HomeMachine Learning & AIStep-by-Step Guide to Implementing Linear Regression with Python

Step-by-Step Guide to Implementing Linear Regression with Python

Linear regression is a commonly used statistical technique to model relationships between two variables. 

This article will cover the key steps to implement simple linear regression in Python from scratch. We’ll use a salary dataset to predict salaries based on years of experience. 

Read on to learn this versatile machine-learning technique.  

Understanding Linear Regression 

Linear regression establishes a linear relationship between a dependent variable (y) and one or more independent variables (x). The goal is to find the best linear model that predicts the value of y from x. 

The linear equation takes the form:

y = mx + b

Where m is the slope and b is the y-intercept.

The key assumptions are:

  1. Linear relationship: The dependent variable changes linearly concerning changes in the independent variable. A linear regression line best captures this relationship.
  2. No multicollinearity: The independent variables should not be highly correlated. 
  3. Minimal errors: The differences between the observed and predicted values of y (residuals) should follow a normal distribution centred around 0.
  4. Homoscedasticity: The variance of residuals should not change substantially across the range of values for the independent variables

In Python, the Sklearn library provides easy tools for building linear regression models. We’ll use this step-by-step to build our model.

Import Libraries and Data 

multiple linear regression

First, we import pandas and numpy for data manipulation, matplotlib and seaborn for visualisation, and sklearn modules for modelling. We load the salary dataset into a pandas frame. 

Information is available on the number of years individuals have been working (x) and their earnings (y). The number of years someone has worked can be used to determine their income.

Split Data into Training and Test Sets  

It divides the dataset into an 80:20 ratio of a training set and a test set using Sklearn’s train_test_split method. The model will be trained on the training data, while the test data will be used later to evaluate performance.

We separate the independent (years of experience) and dependent (salary) variables from the data into X and y arrays for modelling.

Train the Linear Regression Model    

Sklearn provides a linear regression class to train linear regression models. We initialise a Linear Regression estimator and call the .fit() method, passing in the training data (X_train and y_train).

The .fit() method learns the linear relationship between years of experience (input) and salaries (output) from the training data. This step trains and fits the model to the data.

Make Predictions on Test Data 

Now that the model is trained, it can be used to make predictions on new test data. 

The test input data (X_test) is passed to the .predict() method to predict the output values. This generates the model’s predicted salaries (y_pred_test) corresponding to the test years of experience data.

Predictions on training data (X_train) are also made for comparison.

Evaluate Model Performance 

To determine model effectiveness, we compare predicted salaries to actual salaries visually and numerically:

Visual Evaluation 

We plot the training data and draw the regression line obtained from training. We also show the test data to inspect visually how well the line fits new unseen data. The model makes reasonable predictions.

Numerical Evaluation

Key metrics are:

  1. Coefficient (m) and intercept (b) values: The linear equation learned by the model gives insight into variable relationships.
  2. Difference between actual and predicted salaries (residuals) Lower residuals indicate a better fit. We aim to minimise residuals.
  3. Score metrics like mean absolute error, mean squared error and R-squared numerically quantify model performance. We omit them here for simplicity.

There is potential to improve performance by tuning parameters, adding polynomial terms, trying other algorithms, etc. However, our basic model sufficiently demonstrates the linear regression workflow.

upgrad referral

Conclusion 

This walkthrough covered the essential steps for linear regression in Python: importing data, splitting it into train/test sets, training a model, making predictions, and evaluating performance. We used the sklearn library to build a working model to predict salaries quickly with Python.

Vamshi Krishna sanga
Vamshi Krishna sanga
Vamshi Krishna Sanga, a Computer Science graduate with a master’s degree in Management, is a seasoned Product Manager in the EdTech sector. With over 5 years of experience, he's adept at ideating, defining, and delivering E-learning Digital Solutions across various platforms
RELATED ARTICLES

Title image box

Add an Introductory Description to make your audience curious by simply setting an Excerpt on this section

Get Free Consultation

Most Popular