In the last segment, you learnt about the new considerations that need to be made when moving to multiple linear regression. Rahim has already talked about overfitting. Let's now look at the next aspect, i.e., multicollinearity.
Multicollinearity refers to the phenomenon of having related predictor (independent) variables in the input data set. In simple terms, in a model built using several independent variables, some of these variables might be interrelated, making their presence in the model redundant. Dropping some of these related independent variables is one common way of dealing with multicollinearity.
Multicollinearity affects the following aspects of a regression model:
Interpretation: The usual reading of a coefficient, i.e., the change in the response for a unit change in that variable with all the other variables held constant, no longer holds cleanly, because the 'other' variables move together with it.
Inference: The coefficients become unstable, they can swing wildly or even flip sign across samples, and the associated p-values are no longer reliable.
It is thus essential to detect and deal with any multicollinearity present in a model before interpreting it. Let's see how you can detect it.
You saw two basic ways of detecting multicollinearity:
Looking at the pairwise correlations between the independent variables, e.g., through a correlation matrix or heatmap (a minimal sketch follows this list).
Checking the variance inflation factor (VIF), which also captures the association of one variable with a group of the other variables, something pairwise correlations can miss.
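As a quick illustration of the first method, here is a minimal sketch using pandas; the DataFrame and its column names ('area', 'rooms', 'bathrooms') are hypothetical placeholders, not the data set used in the course demonstration:

```python
import pandas as pd

# Hypothetical predictors for illustration only
X = pd.DataFrame({
    "area": [1200, 1500, 1700, 2000, 2200, 2500],
    "rooms": [2, 3, 3, 4, 4, 5],
    "bathrooms": [1, 2, 2, 2, 3, 3],
})

# Pairwise correlations between the independent variables;
# values close to +1 or -1 hint at multicollinearity
print(X.corr())
```

Pairwise correlations only look at two variables at a time, which is why the VIF, discussed next, is the more general check.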
The VIF is given by:
VIF_i = 1 / (1 - R_i^2)
Here, 'i' refers to the i-th independent variable, and R_i^2 is the R-squared value obtained when that variable is represented as a linear combination of the rest of the independent variables, i.e., regressed on them. You will see the VIF in action during the Python demonstration on multiple linear regression; a small sketch of the computation also follows the heuristic below.
The common heuristic we follow for the VIF values is:
> 10: VIF value is definitely high, and the variable should be eliminated.
Between 5 and 10: Can be okay, but the variable is worth inspecting.
< 5: Good VIF value. No need to eliminate this variable.
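To make this heuristic concrete, here is a minimal sketch of how VIF values could be computed with statsmodels; the DataFrame and its column names are hypothetical placeholders and not the course's actual data set:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

# Hypothetical predictors for illustration only
X = pd.DataFrame({
    "area": [1200, 1500, 1700, 2000, 2200, 2500],
    "rooms": [2, 3, 3, 4, 4, 5],
    "bathrooms": [1, 2, 2, 2, 3, 3],
})

# Add an intercept column so it does not distort the VIFs of the predictors
X_const = add_constant(X)

# VIF_i = 1 / (1 - R_i^2), computed for each column in turn
vif = pd.DataFrame({
    "feature": X_const.columns,
    "VIF": [variance_inflation_factor(X_const.values, i)
            for i in range(X_const.shape[1])],
})
print(vif)
```

Any predictor whose VIF crosses the chosen cut-off (10, or 5 if you want to be stricter) is a candidate for the treatments discussed next.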
But once you have detected the multicollinearity present in the data set, how exactly do you deal with it? Rahim answers this question in the following video.
Some methods that can be used to deal with multicollinearity are as follows (a sketch of the first method appears after this list):
Dropping variables: Drop one of the highly interrelated variables, e.g., the one with the highest VIF, preferring to keep the variables that are easier to interpret from a business point of view.
Creating new variables: Derive a new feature from the interrelated variables, e.g., an interaction or a combined feature, and use it in place of the originals.
Variable transformations: Apply a technique such as principal component analysis (PCA) to obtain a set of uncorrelated components.
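As an illustration of the first method, here is a hedged sketch of a helper that repeatedly drops the predictor with the highest VIF until every remaining VIF falls below a chosen threshold; the function name, the threshold, and the data are assumptions made for illustration, not part of the course demonstration:

```python
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant


def drop_high_vif(X, threshold=10.0):
    """Iteratively drop the predictor with the highest VIF until all VIFs fall below the threshold."""
    X = X.copy()
    while X.shape[1] > 1:
        X_const = add_constant(X)
        # Skip index 0 (the constant term) when computing the VIFs
        vifs = pd.Series(
            [variance_inflation_factor(X_const.values, i)
             for i in range(1, X_const.shape[1])],
            index=X.columns,
        )
        if vifs.max() <= threshold:
            break
        # Drop the single worst offender and recompute on the next pass
        X = X.drop(columns=[vifs.idxmax()])
    return X


# Hypothetical predictors for illustration only
X = pd.DataFrame({
    "area": [1200, 1500, 1700, 2000, 2200, 2500],
    "rooms": [2, 3, 3, 4, 4, 5],
    "bathrooms": [1, 2, 2, 2, 3, 3],
})
print(drop_high_vif(X, threshold=5.0).columns.tolist())
```

Dropping one variable at a time and recomputing the VIFs matters, because removing a single predictor can bring the VIFs of the remaining ones back down.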