Step-by-Step Guide to Implementing Linear Regression with Python

August 26, 2024

977

Table of Contents

Linear regression is a commonly used statistical technique to model relationships between two variables.

This article will cover the key steps to implement simple linear regression in Python from scratch. We’ll use a salary dataset to predict salaries based on years of experience.

Read on to learn this versatile machine-learning technique.

Understanding Linear Regression

Linear regression establishes a linear relationship between a dependent variable (y) and one or more independent variables (x). The goal is to find the best linear model that predicts the value of y from x.

The linear equation takes the form:

y = mx + b

Where m is the slope and b is the y-intercept.

The key assumptions are:

Linear relationship: The dependent variable changes linearly concerning changes in the independent variable. A linear regression line best captures this relationship.
No multicollinearity: The independent variables should not be highly correlated.
Minimal errors: The differences between the observed and predicted values of y (residuals) should follow a normal distribution centred around 0.
Homoscedasticity: The variance of residuals should not change substantially across the range of values for the independent variables

In Python, the Sklearn library provides easy tools for building linear regression models. We’ll use this step-by-step to build our model.

Import Libraries and Data

multiple linear regression

First, we import pandas and numpy for data manipulation, matplotlib and seaborn for visualisation, and sklearn modules for modelling. We load the salary dataset into a pandas frame.

Information is available on the number of years individuals have been working (x) and their earnings (y). The number of years someone has worked can be used to determine their income.

Split Data into Training and Test Sets

It divides the dataset into an 80:20 ratio of a training set and a test set using Sklearn’s train_test_split method. The model will be trained on the training data, while the test data will be used later to evaluate performance.

We separate the independent (years of experience) and dependent (salary) variables from the data into X and y arrays for modelling.

Train the Linear Regression Model

Sklearn provides a linear regression class to train linear regression models. We initialise a Linear Regression estimator and call the .fit() method, passing in the training data (X_train and y_train).

The .fit() method learns the linear relationship between years of experience (input) and salaries (output) from the training data. This step trains and fits the model to the data.

Make Predictions on Test Data

Now that the model is trained, it can be used to make predictions on new test data.

The test input data (X_test) is passed to the .predict() method to predict the output values. This generates the model’s predicted salaries (y_pred_test) corresponding to the test years of experience data.

Predictions on training data (X_train) are also made for comparison.

Evaluate Model Performance

To determine model effectiveness, we compare predicted salaries to actual salaries visually and numerically:

Visual Evaluation

We plot the training data and draw the regression line obtained from training. We also show the test data to inspect visually how well the line fits new unseen data. The model makes reasonable predictions.

Numerical Evaluation

Key metrics are:

Coefficient (m) and intercept (b) values: The linear equation learned by the model gives insight into variable relationships.
Difference between actual and predicted salaries (residuals) Lower residuals indicate a better fit. We aim to minimise residuals.
Score metrics like mean absolute error, mean squared error and R-squared numerically quantify model performance. We omit them here for simplicity.

There is potential to improve performance by tuning parameters, adding polynomial terms, trying other algorithms, etc. However, our basic model sufficiently demonstrates the linear regression workflow.

Conclusion

This walkthrough covered the essential steps for linear regression in Python: importing data, splitting it into train/test sets, training a model, making predictions, and evaluating performance. We used the sklearn library to build a working model to predict salaries quickly with Python.

Step-by-Step Guide to Implementing Linear Regression with Python

Understanding Linear Regression

Import Libraries and Data

Split Data into Training and Test Sets

Train the Linear Regression Model

Make Predictions on Test Data

Evaluate Model Performance

Visual Evaluation

Numerical Evaluation

Conclusion

Top AI and ML Certifications to Boost Your Career in the US

Top AI Jobs in the US: Most Wanted Roles for 2025

ML vs. DL: What U.S. Professionals Need to Know About These AI Technologies

Title image box

Most Popular

Why Data Science Networking Matters for US Online Learners

Top AI and ML Certifications to Boost Your Career in the US

Salaries for Accountants in the US in 2025: What You Can Expect at Different Career Levels

Your 2025 Guide to Becoming a Cloud Developer in the US

EDITOR PICKS

Why Data Science Networking Matters for US Online Learners

Top AI and ML Certifications to Boost Your Career in the US

Salaries for Accountants in the US in 2025: What You Can Expect at Different Career Levels

POPULAR POSTS

Why Data Science Networking Matters for US Online Learners

Top AI and ML Certifications to Boost Your Career in the US

Salaries for Accountants in the US in 2025: What You Can Expect at Different Career Levels

POPULAR CATEGORY

Step-by-Step Guide to Implementing Linear Regression with Python

Understanding Linear Regression

Import Libraries and Data

Split Data into Training and Test Sets

Train the Linear Regression Model

Make Predictions on Test Data

Evaluate Model Performance

Visual Evaluation

Numerical Evaluation

Conclusion

Title image box

Most Popular

Get Free Consultation

EDITOR PICKS

POPULAR POSTS

POPULAR CATEGORY