Linear regression is a statistical technique for understanding the relationship between an independent variable x and a dependent variable y. Before conducting linear regression, make sure the assumptions below are met.
If any of these assumptions is violated, the results of the linear regression can be unreliable.
For each assumption, the sections below explain how to check whether it is met and what to do if it is violated.
Assumption-1: Linear Relationship
There is a linear relationship between the dependent variable (y) and the independent variable (x).
The easiest way to check this assumption is to prepare a scatter plot of x vs. y, which lets you see the relationship between the variables. If the points fall roughly along a straight line, there is some kind of linear relationship between them, and the assumption is fulfilled.
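As a numeric complement to the scatter plot, the sketch below compares a straight-line fit with a quadratic fit on synthetic data. It is only an illustration: the data, the `r_squared` helper, and the idea of reading a large gain from the squared term as a hint of curvature are assumptions made for this example, not part of the procedure described above.

```python
import numpy as np

def r_squared(x, y, degree):
    """R^2 of a least-squares polynomial fit of the given degree."""
    coeffs = np.polyfit(x, y, degree)
    fitted = np.polyval(coeffs, x)
    ss_res = np.sum((y - fitted) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return 1.0 - ss_res / ss_tot

# Synthetic data (made up for illustration): one linear, one curved relationship.
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 100)
y_linear = 3.0 + 2.0 * x + rng.normal(0, 1, x.size)
y_curved = 3.0 + 0.5 * x**2 + rng.normal(0, 1, x.size)

# How much does adding a squared term improve on the straight line?
gain_linear = r_squared(x, y_linear, 2) - r_squared(x, y_linear, 1)
gain_curved = r_squared(x, y_curved, 2) - r_squared(x, y_curved, 1)

print(f"R^2 gain from squared term, linear data: {gain_linear:.4f}")
print(f"R^2 gain from squared term, curved data: {gain_curved:.4f}")
```

A gain close to zero is consistent with a linear relationship, while a clearly larger gain suggests curvature that is worth inspecting on the scatter plot.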
Solutions to try if this assumption is violated:
If the scatter plot of x and y values shows no linear relationship between them, you have the following options:
i. Apply a non-linear transformation to the independent or dependent variable, for example by taking its log, square root, or reciprocal.
ii. Add another independent variable to the model.
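The first option can be sketched with synthetic data in which y grows exponentially with x. The data-generating formula and the `r_squared` helper are assumptions chosen for illustration; the point is only that taking the log of y can turn a curved relationship into a nearly straight one.

```python
import numpy as np

def r_squared(x, y):
    """R^2 of a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    return 1.0 - np.sum((y - fitted) ** 2) / np.sum((y - y.mean()) ** 2)

# Synthetic exponential relationship with multiplicative noise (illustrative only).
rng = np.random.default_rng(1)
x = np.linspace(1, 10, 200)
y = 2.0 * np.exp(0.5 * x) * np.exp(rng.normal(0, 0.1, x.size))

r2_raw = r_squared(x, y)          # straight line through the raw values
r2_log = r_squared(x, np.log(y))  # straight line after log-transforming y

print(f"R^2 with raw y: {r2_raw:.3f}")
print(f"R^2 with log y: {r2_log:.3f}")
```

Which transformation helps depends on the shape of the relationship, so it is worth re-checking the scatter plot after transforming.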
Assumption-2: Independence
The residuals are independent. In particular, there is no correlation between successive residuals in time series data; for example, the residuals should not steadily grow larger as time goes on.
The easiest way to check this assumption is to look at a residual time series plot, which is a graph of residuals vs. time. Ideally, most of the residual autocorrelations should fall within the 95% confidence bands around zero, which are located at approximately +/- 2 divided by the square root of n (where n is the sample size). You can also test this assumption formally with the Durbin-Watson test.
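The two checks above can be sketched numerically. The example computes the Durbin-Watson statistic and the lag-1 autocorrelation, and compares the latter with the approximate 95% band of 2 divided by the square root of n. The two residual series are simulated directly rather than taken from a fitted model, and the AR(1) coefficient of 0.8 is an arbitrary choice for illustration.

```python
import numpy as np

def durbin_watson(residuals):
    """Durbin-Watson statistic. Values near 2 indicate little lag-1 correlation,
    values well below 2 positive correlation, values well above 2 negative."""
    return np.sum(np.diff(residuals) ** 2) / np.sum(residuals ** 2)

def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at the given lag (lag >= 1)."""
    centred = residuals - residuals.mean()
    return np.sum(centred[lag:] * centred[:-lag]) / np.sum(centred ** 2)

rng = np.random.default_rng(2)
n = 500
independent = rng.normal(0, 1, n)  # stand-in for well-behaved residuals

# Stand-in for serially correlated residuals: each value carries over
# 80% of the previous one (an AR(1) process, chosen for illustration).
correlated = np.zeros(n)
for t in range(1, n):
    correlated[t] = 0.8 * correlated[t - 1] + independent[t]

band = 2 / np.sqrt(n)  # approximate 95% band around zero
for name, res in [("independent", independent), ("correlated", correlated)]:
    print(f"{name}: DW = {durbin_watson(res):.2f}, "
          f"lag-1 autocorrelation = {autocorrelation(res, 1):.3f}, "
          f"band = +/- {band:.3f}")
```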
Solutions to try if this assumption is violated:
Here are a few solutions you can try based on how this assumption is violated:
- If the serial correlation is positive, add lags of the dependent or independent variable to the model.
- If the serial correlation is negative, check that none of the variables are over-differenced.
- If the correlation is seasonal (periodic), add seasonal dummy variables to the model.
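The first remedy can be illustrated with a small simulation. Here y depends on its own previous value, so a model that uses only x leaves positively correlated residuals, while adding the lag of y removes most of that correlation. The data-generating process, its coefficients, and the helper names are assumptions made for this sketch.

```python
import numpy as np

def lag1_autocorrelation(residuals):
    """Sample lag-1 autocorrelation of the residuals."""
    centred = residuals - residuals.mean()
    return np.sum(centred[1:] * centred[:-1]) / np.sum(centred ** 2)

def ols_residuals(columns, y):
    """Residuals from a least-squares fit of y on the given columns plus an intercept."""
    X = np.column_stack([np.ones(len(y))] + list(columns))
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Synthetic series (illustrative): y carries over 70% of its previous value.
rng = np.random.default_rng(3)
n = 500
x = rng.normal(0, 1, n)
y = np.zeros(n)
for t in range(1, n):
    y[t] = 0.7 * y[t - 1] + 1.0 * x[t] + rng.normal(0, 0.5)

# Drop the first observation so every row has a lagged value available.
y_now, y_prev, x_now = y[1:], y[:-1], x[1:]

ac_without_lag = lag1_autocorrelation(ols_residuals([x_now], y_now))
ac_with_lag = lag1_autocorrelation(ols_residuals([x_now, y_prev], y_now))

print(f"lag-1 autocorrelation, model with x only:      {ac_without_lag:.3f}")
print(f"lag-1 autocorrelation, model with x and lag y: {ac_with_lag:.3f}")
```

The sketch reports the lag-1 autocorrelation rather than the Durbin-Watson statistic, because that statistic is known to be less reliable when a lagged dependent variable is among the predictors.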
Assumption-3: Homoscedasticity
The residuals have constant variance at every level of x. When this is not the case, the residuals are said to suffer from heteroscedasticity, which makes the results of the analysis hard to trust. Specifically, heteroscedasticity increases the variance of the regression coefficient estimates, but the model does not account for this. As a result, the model is much more likely to declare a term statistically significant when in fact it is not.
The easiest way to detect heteroscedasticity is to create a fitted value vs. residual plot. After fitting a regression line to a data set, prepare a scatter plot of the model’s fitted values against the corresponding residuals. If the residuals spread out more as the fitted values increase, producing a cone shape, heteroscedasticity is present.
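A rough numeric counterpart to the fitted value vs. residual plot is to check whether the absolute residuals grow with the fitted values. The sketch below does this on synthetic data; the `spread_trend` helper and the use of a simple correlation as the indicator are assumptions for illustration, not a formal test.

```python
import numpy as np

def fitted_and_residuals(x, y):
    """Fit a straight line and return its fitted values and residuals."""
    slope, intercept = np.polyfit(x, y, 1)
    fitted = intercept + slope * x
    return fitted, y - fitted

def spread_trend(fitted, residuals):
    """Correlation between fitted values and absolute residuals.
    A clearly positive value matches the cone shape described above."""
    return np.corrcoef(fitted, np.abs(residuals))[0, 1]

# Synthetic data (illustrative): same noise draws, two different spreads.
rng = np.random.default_rng(4)
x = rng.uniform(1, 10, 500)
noise = rng.normal(0, 1, 500)
y_constant = 2.0 + 3.0 * x + noise      # same spread at every x
y_cone = 2.0 + 3.0 * x + noise * x      # spread grows with x

trend_constant = spread_trend(*fitted_and_residuals(x, y_constant))
trend_cone = spread_trend(*fitted_and_residuals(x, y_cone))

print(f"spread trend, constant variance: {trend_constant:.3f}")
print(f"spread trend, growing variance:  {trend_cone:.3f}")
```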
Solutions to try if this assumption is violated:
i. Transformation of the dependent variable:
A common way to transform the dependent variable is to take its log. For example, suppose we use population size to predict the total number of fruit shops in a town. Here, population size is the independent variable, and the number of fruit shops is the dependent variable. Instead of predicting the number of fruit shops itself, we can use population size to predict the log of the number of fruit shops. This approach often reduces or removes heteroscedasticity.
ii. Weighted regression:
This form of regression assigns a weight to every data point based on the variance of its fitted value. Data points with higher variances receive smaller weights, which shrinks their squared residuals. With proper weights, heteroscedasticity can be eliminated.
iii. Redefine the dependent variable:
A common way to redefine the dependent variable is to use a rate instead of the raw value. Consider the example from solution i. Rather than using population size to predict the number of fruit shops in a town, use population size to predict the number of fruit shops per capita.
In many cases, this approach reduces the variability that arises among larger populations, because we measure the number of fruit shops per person instead of the absolute number of fruit shops.
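Two of these remedies, weighted regression and redefining the dependent variable as a rate, can be sketched on synthetic fruit shop data. All numbers are made up for illustration, and the noise is built so that its standard deviation is proportional to population size; under that assumption, weights of 1 / population^2 are the natural choice. In real data the variance structure has to be estimated, so the weights here are not a general recipe.

```python
import numpy as np

def fit_line(x, y, weights=None):
    """Fit y = intercept + slope * x by least squares.
    With weights, minimises sum(weights * residual**2)."""
    X = np.column_stack([np.ones_like(x), x])
    if weights is None:
        weights = np.ones_like(x)
    sw = np.sqrt(weights)
    coef, *_ = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)
    return coef  # [intercept, slope]

def spread_trend(x, residuals):
    """Correlation between x and the absolute residuals (near 0 = even spread)."""
    return np.corrcoef(x, np.abs(residuals))[0, 1]

# Synthetic towns (illustrative): noise grows in proportion to population.
rng = np.random.default_rng(5)
population = rng.uniform(1_000, 100_000, 500)
shops = 0.002 * population + rng.normal(0, 1, 500) * 0.0004 * population

# Ordinary regression: residual spread grows with population.
b0, b1 = fit_line(population, shops)
trend_raw = spread_trend(population, shops - (b0 + b1 * population))

# Weighted regression: small weights for high-variance (large) towns.
w0, slope_wls = fit_line(population, shops, weights=1.0 / population**2)
scaled = (shops - (w0 + slope_wls * population)) / population
trend_wls = spread_trend(population, scaled)

# Rate: predict shops per capita instead of the raw count.
rate = shops / population
r0, r1 = fit_line(population, rate)
trend_rate = spread_trend(population, rate - (r0 + r1 * population))

print(f"spread trend, raw counts:          {trend_raw:.3f}")
print(f"spread trend, weighted regression: {trend_wls:.3f}")
print(f"spread trend, shops per capita:    {trend_rate:.3f}")
```

For the weighted fit, the residuals are divided by population before checking the spread, because that is the scale on which the weighted model treats all points equally.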
Assumption-4: Normality
The model’s residuals are normally distributed.
How to check the normality assumption:
i. Visual check using Q-Q plots:
A Q-Q plot (quantile-quantile plot) helps you see whether a model’s residuals follow a normal distribution. The normality assumption is fulfilled when the points on the plot roughly form a straight diagonal line.
ii. Using formal statistical tests:
You can also check the normality assumption with formal statistical tests such as Shapiro-Wilk, Jarque-Bera, Kolmogorov-Smirnov, or D’Agostino-Pearson. Keep in mind that these tests are sensitive to large sample sizes: when the sample is large, they often conclude that the residuals are not normal, even if the departure from normality is small. For this reason, graphical methods like the Q-Q plot are often the better way to check this assumption.
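The points of a Q-Q plot can be computed without any plotting library, which also gives a rough numeric summary of how straight the line is. In the sketch below, the plotting positions (i - 0.5) / n, the helper names, and the use of the correlation between theoretical and sample quantiles as a summary are assumptions for illustration; the two residual samples are simulated.

```python
import numpy as np
from statistics import NormalDist

def qq_points(residuals):
    """Return (theoretical, sample) quantile pairs for a normal Q-Q plot."""
    sample = np.sort(np.asarray(residuals, dtype=float))
    n = sample.size
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return theoretical, sample

def qq_correlation(residuals):
    """Correlation of the Q-Q points; values close to 1 mean a nearly straight line."""
    theoretical, sample = qq_points(residuals)
    return np.corrcoef(theoretical, sample)[0, 1]

# Simulated residuals (illustrative): one normal sample, one strongly skewed.
rng = np.random.default_rng(6)
normal_residuals = rng.normal(0, 1, 300)
skewed_residuals = rng.exponential(1.0, 300) - 1.0

qq_normal = qq_correlation(normal_residuals)
qq_skewed = qq_correlation(skewed_residuals)

print(f"Q-Q correlation, normal residuals: {qq_normal:.3f}")
print(f"Q-Q correlation, skewed residuals: {qq_skewed:.3f}")
```

Plotting the pairs returned by `qq_points` as a scatter plot gives the Q-Q plot itself, which remains the more informative check.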
Solutions to try if this assumption is violated:
First, check that outliers are not having an outsized influence on the distribution. If outliers are present, verify that they are real values and not data entry errors.
Next, you can apply a non-linear transformation to the dependent or independent variable. For example, you can take the square root, log, or reciprocal of the variable.
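The effect of a transformation on the residuals can be sketched with synthetic data whose noise is multiplicative, so that the raw residuals are right-skewed. The data-generating formula and the use of sample skewness as a quick symmetry indicator are assumptions for illustration; a skewness near zero is only a rough sign of symmetry, not proof of normality.

```python
import numpy as np

def skewness(values):
    """Sample skewness: about 0 for symmetric data, positive for a long right tail."""
    centred = values - values.mean()
    return np.mean(centred ** 3) / np.mean(centred ** 2) ** 1.5

def residuals(x, y):
    """Residuals from a straight-line least-squares fit of y on x."""
    slope, intercept = np.polyfit(x, y, 1)
    return y - (intercept + slope * x)

# Synthetic data (illustrative): noise acts multiplicatively on y.
rng = np.random.default_rng(7)
x = rng.uniform(0, 2, 500)
y = np.exp(1.0 + 0.5 * x + rng.normal(0, 0.6, 500))

skew_raw = skewness(residuals(x, y))          # model fitted to raw y
skew_log = skewness(residuals(x, np.log(y)))  # model fitted to log y

print(f"residual skewness with raw y: {skew_raw:.2f}")
print(f"residual skewness with log y: {skew_log:.2f}")
```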