Before you move on to the model building part, there is still one theoretical aspect left to be addressed - the significance of the derived beta coefficient. When you fit a straight line through the data, you obviously get the two parameters of the straight line, i.e. the intercept ($\beta_0$) and the slope ($\beta_1$). Now, while $\beta_0$ is not of much importance right now, there are a few aspects surrounding $\beta_1$ which need to be checked and verified.
The first question we ask is, "Is the beta coefficient significant?" What does this mean?
Suppose you have a dataset for which the scatter plot looks like the following:
Now, if you run a linear regression on this dataset in Python, Python will fit a line on the data which, say, looks like the following:
Now, you can clearly see that the data is randomly scattered and doesn't seem to follow a linear trend, or any trend in general. But Python will anyway fit a line through the data using the least squares method. You can see, though, that the fitted line is of no use in this case.
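To make this concrete, here is a minimal sketch (the random data, the variable names and the choice of numpy and matplotlib are assumptions for illustration, not part of the original example) of fitting a least squares line through data that has no underlying trend:

```python
# Minimal illustrative sketch: least squares happily fits a line
# even when X and y are completely unrelated.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
X = rng.uniform(0, 10, size=100)   # random X values
y = rng.uniform(0, 10, size=100)   # random y values, independent of X

# np.polyfit returns [slope, intercept] for a degree-1 fit
slope, intercept = np.polyfit(X, y, deg=1)
print(f"slope = {slope:.3f}, intercept = {intercept:.3f}")

plt.scatter(X, y, alpha=0.6)
plt.plot(X, intercept + slope * X, color="red", label="fitted line")
plt.legend()
plt.show()
```

The fit runs without complaint, which is exactly why you need a separate check of whether the slope is significant.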
Hence, every time you perform a linear regression, you need to test whether the fitted line is a significant one or not, or to put it simply, you need to test whether $\beta_1$ is significant or not. This is where the idea of hypothesis testing on $\beta_1$ comes in. Please note that the following text will assume the knowledge of hypothesis testing, which was covered in one of the earlier modules.
You start by saying that $\beta_1$ is not significant, i.e. there is no relationship between X and y.
So in order to perform the hypothesis test, we first propose the null hypothesis that $\beta_1 = 0$, and the alternative hypothesis thus becomes that $\beta_1 \neq 0$.
Let's first discuss the implications of this hypothesis test. If you fail to reject the null hypothesis, that would mean that $\beta_1$ is zero, which would simply mean that $\beta_1$ is insignificant and of no use in the model. Similarly, if you reject the null hypothesis, it would mean that $\beta_1$ is not zero and the line fitted is a significant one.
Now, how do you perform the hypothesis test? You first compute the t-score (which is very similar to the Z-score), which is given by

$t = \dfrac{\bar{X} - \mu}{s/\sqrt{n}}$

where $\mu$ is the population mean and $s$ is the sample standard deviation, which, when divided by $\sqrt{n}$, is also known as the standard error.
Using this, the t-score for $\hat{\beta}_1$ comes out to be (since the null hypothesis is that $\beta_1$ is equal to zero):

$t = \dfrac{\hat{\beta}_1 - 0}{SE(\hat{\beta}_1)} = \dfrac{\hat{\beta}_1}{SE(\hat{\beta}_1)}$
Now, in order to perform the hypothesis test, you need to derive the p-value for the given beta. Please note that the formula for $SE(\hat{\beta}_1)$, the standard error of $\hat{\beta}_1$ appearing in the t-score above, is out of scope of this course.
Let's do a quick recap of how you calculate the p-value anyway: you compute the t-score for $\hat{\beta}_1$, look at the t-distribution with $n - 2$ degrees of freedom, find the probability of obtaining a value at least as extreme as the observed t-score, and double it since the test is two-tailed.
Now, if the p-value turns out to be less than 0.05, you can reject the null hypothesis and state that $\beta_1$ is indeed significant.
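As a rough sketch of these steps (the t-score and sample size below are made up, and scipy is assumed to be available; this is not the course's own code), you could convert a t-score into a two-tailed p-value and apply the 0.05 rule like this:

```python
# Illustrative sketch: turning a t-score for beta_1 into a two-tailed p-value.
from scipy import stats

n = 30              # assumed number of observations
t_score = 2.4       # assumed t-score of beta_1_hat
df = n - 2          # degrees of freedom for simple linear regression

# probability of seeing a |t| at least this extreme under the null hypothesis
p_value = 2 * stats.t.sf(abs(t_score), df)
print(f"p-value = {p_value:.4f}")

if p_value < 0.05:
    print("Reject the null hypothesis: beta_1 is significant.")
else:
    print("Fail to reject the null hypothesis: beta_1 is insignificant.")
```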
Please note that all of the above steps will be performed by Python automatically.
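For instance, a library such as statsmodels reports the t-score and p-value of every coefficient as part of the fitted model's summary; the sketch below (using simulated data, purely as an illustration) shows where those numbers appear:

```python
# Illustrative sketch: statsmodels computes the t-score and p-value for you.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
X = rng.uniform(0, 10, size=100)
y = 3 + 0.5 * X + rng.normal(0, 1, size=100)   # data with a genuine linear trend

X_const = sm.add_constant(X)       # adds the intercept (beta_0) column
model = sm.OLS(y, X_const).fit()

# The coefficient table in the summary lists, for each parameter, its
# estimate, standard error, t-score and the p-value of this hypothesis test.
print(model.summary())
```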
Why does the test statistic for $\beta_1$ follow a t-distribution instead of a normal distribution?