Linear regression is a commonly used statistical technique to model relationships between two variables.
This article will cover the key steps to implement simple linear regression in Python from scratch. We’ll use a salary dataset to predict salaries based on years of experience.
Read on to learn this versatile machine-learning technique.
Understanding Linear Regression
Linear regression establishes a linear relationship between a dependent variable (y) and one or more independent variables (x). The goal is to find the best linear model that predicts the value of y from x.
The linear equation takes the form:
y = mx + b
Where m is the slope and b is the y-intercept.
The key assumptions are:
- Linear relationship: The dependent variable changes linearly concerning changes in the independent variable. A linear regression line best captures this relationship.
- No multicollinearity: The independent variables should not be highly correlated.
- Minimal errors: The differences between the observed and predicted values of y (residuals) should follow a normal distribution centred around 0.
- Homoscedasticity: The variance of residuals should not change substantially across the range of values for the independent variables
In Python, the Sklearn library provides easy tools for building linear regression models. We’ll use this step-by-step to build our model.
Import Libraries and Data
First, we import pandas and numpy for data manipulation, matplotlib and seaborn for visualisation, and sklearn modules for modelling. We load the salary dataset into a pandas frame.
Information is available on the number of years individuals have been working (x) and their earnings (y). The number of years someone has worked can be used to determine their income.
Split Data into Training and Test Sets
It divides the dataset into an 80:20 ratio of a training set and a test set using Sklearn’s train_test_split method. The model will be trained on the training data, while the test data will be used later to evaluate performance.
We separate the independent (years of experience) and dependent (salary) variables from the data into X and y arrays for modelling.
Train the Linear Regression Model
Sklearn provides a linear regression class to train linear regression models. We initialise a Linear Regression estimator and call the .fit() method, passing in the training data (X_train and y_train).
The .fit() method learns the linear relationship between years of experience (input) and salaries (output) from the training data. This step trains and fits the model to the data.
Make Predictions on Test Data
Now that the model is trained, it can be used to make predictions on new test data.
The test input data (X_test) is passed to the .predict() method to predict the output values. This generates the model’s predicted salaries (y_pred_test) corresponding to the test years of experience data.
Predictions on training data (X_train) are also made for comparison.
Evaluate Model Performance
To determine model effectiveness, we compare predicted salaries to actual salaries visually and numerically:
Visual Evaluation
We plot the training data and draw the regression line obtained from training. We also show the test data to inspect visually how well the line fits new unseen data. The model makes reasonable predictions.
Numerical Evaluation
Key metrics are:
- Coefficient (m) and intercept (b) values: The linear equation learned by the model gives insight into variable relationships.
- Difference between actual and predicted salaries (residuals) Lower residuals indicate a better fit. We aim to minimise residuals.
- Score metrics like mean absolute error, mean squared error and R-squared numerically quantify model performance. We omit them here for simplicity.
There is potential to improve performance by tuning parameters, adding polynomial terms, trying other algorithms, etc. However, our basic model sufficiently demonstrates the linear regression workflow.
Conclusion
This walkthrough covered the essential steps for linear regression in Python: importing data, splitting it into train/test sets, training a model, making predictions, and evaluating performance. We used the sklearn library to build a working model to predict salaries quickly with Python.