In the previous lectures, you learnt the process of segmented univariate analysis. Let’s now move on to the next step of segmented univariate analysis — the comparison of averages.
By now, you know how to group the data by categorical variables and compare the averages. But you should be careful while comparing averages, especially if the difference in average values is small. Let’s see why this is important.
You would have noticed that the two data sets created by Anand have different distributions of boys' and girls' scores. In the first data set, every girl scored higher than every boy. The difference in averages is only 1, but in this case you can still say that girls score higher than boys.
On the other hand, in the second data set, the difference in averages is again 1, but it is difficult to conclude that girls score higher than boys, since the scores within each group are spread over a much wider range. Here the difference is not as significant as in the first case, because with this much variation in scores, a small difference in means could arise from randomness alone.
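To make the contrast concrete, here is a minimal sketch in Python. The scores are hypothetical values invented for illustration (they are not Anand's actual data), and the helper name `summarise` is an assumption. In both data sets the gap in means is exactly 1, but only in the first one do the two groups not overlap.

```python
from statistics import mean, stdev

# Hypothetical scores, invented for illustration only.
tight = {
    "boys": [69.5, 70, 70, 70, 70.5],
    "girls": [70.75, 71, 71, 71, 71.25],
}
wide = {
    "boys": [50, 60, 70, 80, 90],
    "girls": [51, 61, 71, 81, 91],
}

def summarise(groups):
    """Return (gap in means, larger group spread, do the ranges overlap?)."""
    boys, girls = groups["boys"], groups["girls"]
    gap = mean(girls) - mean(boys)
    # Sample standard deviation as a simple measure of spread.
    spread = max(stdev(boys), stdev(girls))
    # True if the lowest girl's score is at or below the highest boy's score.
    overlap = min(girls) <= max(boys)
    return gap, spread, overlap

for name, groups in [("tight", tight), ("wide", wide)]:
    gap, spread, overlap = summarise(groups)
    print(f"{name}: gap={gap:.2f}, spread={spread:.2f}, ranges overlap={overlap}")
```

The same gap of 1 sits next to a spread of well under 1 mark in the first data set and more than 15 marks in the second, which is why the averages alone are not enough to draw a conclusion.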
“Don’t blindly believe in the averages of the buckets; you need to observe the distribution of each bucket closely and ask yourself if the difference in means is significant enough to draw a conclusion. If the difference in means is small, you may not be able to draw inferences. In such cases, a technique called hypothesis testing is used to ascertain whether the difference in means is significant or due to randomness.” Don’t worry if the concept of hypothesis testing is not yet clear; it will be covered separately in a module on hypothesis testing.
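As a preview of the idea, the sketch below uses one simple kind of hypothesis test, an exact permutation test. This is a technique chosen here for illustration, and the hypothesis module may well use a different test. The scores are hypothetical, and `permutation_p_value` is an illustrative helper, not a library function. The test asks: if the boy/girl labels were assigned at random, how often would the gap in means be at least as large as the one observed?

```python
from itertools import combinations
from statistics import mean

def permutation_p_value(group_a, group_b):
    """Exact two-sided permutation test for a difference in means."""
    pooled = group_a + group_b
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    indices = range(len(pooled))
    extreme = total = 0
    # Try every possible way of relabelling the pooled scores.
    for chosen in combinations(indices, n_a):
        chosen_set = set(chosen)
        a = [pooled[i] for i in chosen_set]
        b = [pooled[i] for i in indices if i not in chosen_set]
        total += 1
        # Small tolerance guards against floating-point rounding.
        if abs(mean(a) - mean(b)) >= observed - 1e-12:
            extreme += 1
    return extreme / total

# Hypothetical scores, invented for illustration only.
tight_boys = [69.5, 70, 70, 70, 70.5]
tight_girls = [70.75, 71, 71, 71, 71.25]
wide_boys = [50, 60, 70, 80, 90]
wide_girls = [51, 61, 71, 81, 91]

p_tight = permutation_p_value(tight_girls, tight_boys)
p_wide = permutation_p_value(wide_girls, wide_boys)
print(f"tight data set: p = {p_tight:.4f}")
print(f"wide data set:  p = {p_wide:.4f}")
```

A small p-value means that random labelling rarely produces a gap this large, so the difference is unlikely to be due to randomness. A large p-value means that a gap of this size is easy to get by chance. With these made-up scores, the tightly clustered data set gives a far smaller p-value than the widely spread one, even though the gap in means is 1 in both.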
If you have not yet downloaded the National Achievement Survey data set, you can download it from the link below.
In the next lecture, you will learn to compare metrics other than the mean.