Homoscedasticity In Machine Learning: Detection, Effects & How to Treat
Updated on Sep 23, 2022 | 8 min read | 8.6k views
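By the end of this tutorial, you will know what Homoscedasticity and Heteroscedasticity are, how to detect Heteroscedasticity, how it affects a linear regression model, and how to treat it.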
Homoscedasticity means “having the same variance”. One of the main assumptions of Linear Regression is that the errors, or residual terms (Y_actual – Y_pred), are homoscedastic.
In other words, Linear Regression assumes that the variance of the error terms is the same for all instances, whatever the values of the independent variables.
Let’s understand it with the help of an example. Consider we have two variables – Carpet area of the house and price of the house. As the carpet area increases, the prices also increase.
So we fit a linear regression model and see that the errors have the same variance throughout. The graph in the image below has Carpet Area on the X-axis and Price on the Y-axis.
As you can see, the data points lie close to the linear regression line, with a similar spread around it throughout.
Also, if we plot the residuals against Carpet Area, we’d see them scattered in a band of roughly constant width around a straight line parallel to the X-axis. This is a clear sign of Homoscedasticity.
When this condition is violated, it means there is Heteroscedasticity in the model. Considering the same example as above, let’s say that for houses with a smaller carpet area the errors or residuals are very small. As the carpet area increases, the variance in the prices increases, which results in larger error or residual terms. When we plot the values again, we see the typical cone shape, which strongly indicates the presence of Heteroscedasticity in the model.
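To make the contrast concrete, here is a minimal sketch in Python with NumPy. Everything in it is simulated: the carpet areas, the price formula and the noise levels are made-up values chosen for illustration, not real housing data. It fits a straight line to one homoscedastic and one heteroscedastic dataset and compares the spread of the residuals for small and large houses.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 2000
carpet_area = rng.uniform(500, 3000, size=n)  # hypothetical carpet areas

# Hypothetical relationship: price = 50 + 0.1 * area, in arbitrary units.
noise_const = rng.normal(0, 20, size=n)                     # same spread everywhere
noise_cone = rng.normal(0, 1, size=n) * 0.02 * carpet_area  # spread grows with area

price_homo = 50 + 0.1 * carpet_area + noise_const
price_hetero = 50 + 0.1 * carpet_area + noise_cone


def residual_spread(x, y):
    """Fit y = a + b*x by least squares and return the residual standard
    deviation for the lower and the upper half of x."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    small = x < np.median(x)
    return residuals[small].std(), residuals[~small].std()


homo_low, homo_high = residual_spread(carpet_area, price_homo)
hetero_low, hetero_high = residual_spread(carpet_area, price_hetero)

print("Homoscedastic:   small houses", homo_low, "large houses", homo_high)
print("Heteroscedastic: small houses", hetero_low, "large houses", hetero_high)
```

In the homoscedastic case the two numbers should be close to each other, while in the heteroscedastic case the residual spread for large houses should be clearly bigger, which is the cone shape seen in a residual plot.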
Specifically speaking, Heteroscedasticity is a systematic increase or decrease in the variance of the residuals over the range of the independent variables. This is an issue because Homoscedasticity is an assumption of linear regression: all errors should have the same variance.
In the simplest terms, the easiest way to check for Heteroscedasticity is to plot the residuals against the fitted values. If the spread of the residuals changes systematically, then there is Heteroscedasticity. Typically the spread increases as the fitted values increase, producing a cone-shaped plot.
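A plot is the quickest check, but the same idea can be turned into a number. The sketch below is one possible way to do that, not a method from this article: it computes the studentized (Koenker) form of the Breusch-Pagan statistic by hand with NumPy. The squared residuals are regressed on the predictor, and n times the R-squared of that auxiliary regression is compared with the 5% critical value of a chi-square distribution with 1 degree of freedom (about 3.841). The function name `breusch_pagan_lm` and the simulated data are assumptions made for illustration.

```python
import numpy as np


def breusch_pagan_lm(x, y):
    """Return n * R^2 from regressing the squared OLS residuals on x.
    Single-predictor version of the studentized Breusch-Pagan statistic."""
    X = np.column_stack([np.ones_like(x), x])

    # Step 1: ordinary least squares fit of y on x.
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    squared_resid = (y - X @ beta) ** 2

    # Step 2: auxiliary regression of the squared residuals on x.
    gamma, *_ = np.linalg.lstsq(X, squared_resid, rcond=None)
    fitted = X @ gamma
    ss_res = np.sum((squared_resid - fitted) ** 2)
    ss_tot = np.sum((squared_resid - squared_resid.mean()) ** 2)
    r_squared = 1 - ss_res / ss_tot

    return len(x) * r_squared


rng = np.random.default_rng(1)
x = rng.uniform(1, 10, size=1000)
y_homo = 2 + 3 * x + rng.normal(0, 2, size=1000)      # constant noise
y_hetero = 2 + 3 * x + rng.normal(0, 1, size=1000) * x  # noise grows with x

CHI2_CRITICAL_DF1 = 3.841  # 5% critical value, chi-square with 1 degree of freedom

lm_homo = breusch_pagan_lm(x, y_homo)
lm_hetero = breusch_pagan_lm(x, y_hetero)
print("Homoscedastic data:  ", lm_homo)
print("Heteroscedastic data:", lm_hetero, "critical value:", CHI2_CRITICAL_DF1)
```

A statistic well above the critical value suggests Heteroscedasticity. For homoscedastic data the statistic will usually, though not always, stay below it.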
Heteroscedasticity can be either Pure or Impure. When we fit the right model (linear or non-linear) and there is still a visible pattern in the residuals, it is called Pure Heteroscedasticity.
However, if we fit the wrong model and then observe a pattern in the residuals, it is a case of Impure Heteroscedasticity. The measures needed to overcome it depend on the type of Heteroscedasticity. They also depend on the domain you’re working in and vary from domain to domain.
As we discussed earlier, the linear regression model assumes that the errors are homoscedastic. If that assumption is broken, the coefficient estimates are still unbiased, but their standard errors, and therefore the p-values and confidence intervals built on them, can no longer be trusted.
If Heteroscedasticity is present, the instances with high variance will also have a larger impact on the fitted model than they should, which we don’t want.
If you detect the presence of Heteroscedasticity, then there are multiple ways to tackle it. First, let’s consider an example where we have 2 variables: Population of City and Number of Infections of COVID-19.
Now in this example, there will be a huge difference in the number of infections in large metro cities vs small tier-3 cities. Population of City will be the independent variable and Number of Infections will be the dependent variable.
Suppose we fit a regression model to this data and observe Heteroscedasticity similar to the image above. So now we know that there is Heteroscedasticity present in the model and it needs to be fixed.
Now the first step would be to identify the source of Heteroscedasticity. In our case, it is the variable with a large variance: the population, which ranges from small tier-3 cities to large metros.
There can be multiple ways to deal with Heteroscedasticity, but we’ll look at three such methods.
We can modify the variables/features we have to reduce the impact of this large variance on the model predictions. One way to do this is to express the features as rates and percentages rather than actual values.
This would make the features convey slightly different information, but it is worth trying. Whether this approach can be used will also depend on the problem and the data.
This method involves the least modification of the features and often helps solve the problem, and in some cases even improves the model’s performance.
So in our case, we can change the variable “Number of Infections” to “Rate of Infections”, that is, the number of infections divided by the city population. This will help reduce the variance because, quite obviously, the number of infections in cities with a large population will be large.
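As a rough illustration of this idea, the sketch below simulates city data and compares raw counts with rates. The populations, the 1% average infection share and the assumption that the noise scales with population are all invented for the example, so treat the numbers as a toy setup rather than real COVID-19 data.

```python
import numpy as np

rng = np.random.default_rng(2)
n_cities = 2000
population = rng.uniform(50_000, 10_000_000, size=n_cities)  # hypothetical cities

# Assumption: about 1% of each city is infected, and the noise is
# proportional to the population, so big cities have a much larger spread.
infections = population * (0.01 + rng.normal(0, 0.002, size=n_cities))

large = population > np.median(population)

# Raw counts: residual spread around the fitted line, large vs small cities.
slope, intercept = np.polyfit(population, infections, deg=1)
count_resid = infections - (intercept + slope * population)
count_ratio = count_resid[large].std() / count_resid[~large].std()

# Rates: infections per 1,000 residents, large vs small cities.
rate_per_1000 = infections / population * 1000
rate_ratio = rate_per_1000[large].std() / rate_per_1000[~large].std()

print("Spread ratio (large / small cities), raw counts:", count_ratio)
print("Spread ratio (large / small cities), rates:     ", rate_ratio)
```

A ratio well above 1 for the raw counts and a ratio near 1 for the rates indicates that, under these assumptions, switching to rates removes most of the Heteroscedasticity. With real data the noise may behave differently, so the residuals should be checked again after the change.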
Weighted regression is a modification of normal regression where the data points are assigned certain weights according to their variance. The ones with large variance are given small weights and the ones with less variance are given larger weights.
The regression is then fitted by minimizing the weighted sum of squared residuals, so the small weights reduce the influence of the high-variance points.
When the correct weights are used, Heteroscedasticity is replaced by Homoscedasticity. But how do we find the correct weights? One quick way is to use the inverse of the variable that the variance grows with as the weight.
So in our case, the weight will be Inverse of City Population.
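Here is a minimal sketch of weighted regression written directly with NumPy. The helper `weighted_least_squares` is a hypothetical name, and the data is simulated under the assumption that the error variance is proportional to the city population, which is exactly the case where weights of 1 / population are appropriate. Multiplying each row by the square root of its weight and then running ordinary least squares minimizes the weighted sum of squared residuals.

```python
import numpy as np


def weighted_least_squares(x, y, weights):
    """Minimise sum(weights * (y - a - b*x)**2) and return (a, b)."""
    X = np.column_stack([np.ones_like(x), x])
    sqrt_w = np.sqrt(weights)
    # Scaling rows by sqrt(weight) turns the weighted problem into plain OLS.
    coef, *_ = np.linalg.lstsq(X * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef


rng = np.random.default_rng(3)
n_cities = 2000
population_m = rng.uniform(0.05, 10, size=n_cities)  # millions, hypothetical

# Assumption: error variance is proportional to population.
noise = rng.normal(0, 1, size=n_cities) * np.sqrt(population_m)
infections_k = 10 * population_m + noise  # thousands of infections, made up

weights = 1 / population_m  # inverse of City Population
intercept, slope = weighted_least_squares(population_m, infections_k, weights)

residuals = infections_k - (intercept + slope * population_m)
scaled_residuals = residuals * np.sqrt(weights)

large = population_m > np.median(population_m)
raw_ratio = residuals[large].std() / residuals[~large].std()
scaled_ratio = scaled_residuals[large].std() / scaled_residuals[~large].std()

print("Estimated slope:", slope)
print("Spread ratio of raw residuals:     ", raw_ratio)
print("Spread ratio of weighted residuals:", scaled_ratio)
```

If the weights match the true variance structure, the weighted residuals should have a similar spread for small and large cities. If the variance grows faster than the population, a different weight (for example the inverse of the squared population) may fit better.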
Transforming the data is the last resort as by doing that you lose the interpretability of the feature.
What that means is you no longer can easily explain what the feature is showing.
Common choices are the log transformation and the Box-Cox transformation.
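The sketch below illustrates the log transformation on simulated data. It assumes multiplicative noise, meaning the errors grow in proportion to the value being predicted, which is the situation where taking logs helps most. The variable names and numbers are invented for the example. The log transformation is also the special case of Box-Cox with lambda equal to 0.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 2000
x = rng.uniform(1, 100, size=n)

# Assumption: multiplicative noise, so the spread of y grows with x.
y = 5 * x * np.exp(rng.normal(0, 0.3, size=n))


def spread_ratio(x, y):
    """Fit a straight line and return the residual standard deviation in the
    upper half of x divided by that in the lower half."""
    slope, intercept = np.polyfit(x, y, deg=1)
    residuals = y - (intercept + slope * x)
    upper = x > np.median(x)
    return residuals[upper].std() / residuals[~upper].std()


raw_ratio = spread_ratio(x, y)                  # original scale
log_ratio = spread_ratio(np.log(x), np.log(y))  # both variables log-transformed

print("Spread ratio on the original scale:", raw_ratio)
print("Spread ratio after the log transform:", log_ratio)
```

A ratio close to 1 after the transformation suggests the residual variance is now roughly constant. The price is the one mentioned above: the model now predicts log values, so the coefficients are harder to explain.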
There can be many reasons for Heteroscedasticity in your data. It also highly varies from one domain to another.
So it is essential to have the knowledge of that as well before you start with the above processes to remove Heteroscedasticity.
In this blog, we discussed Homoscedasticity and Heteroscedasticity, how to detect Heteroscedasticity in a linear regression model, and how to treat it.