Let’s summarise what has been covered so far in this session before you move on to the rest of it.
First, you saw how instead of finding the mean and standard deviation for the entire population, it is sometimes beneficial to find the mean and standard deviation for only a small representative sample. You may have to do this because of time and/or money constraints.
For example, for an office of 30,000 employees, we wanted to find the average commute time. So, instead of asking all employees, we asked only 100 of them and collected the data. We found the mean to be 36.6 minutes and the standard deviation to be 10 minutes.
However, it would not be fair to infer that the population mean is exactly equal to the sample mean. A sample never represents the population perfectly, so the sampling process introduces some error. Hence, the sample mean’s value has to be reported with some margin of error.
For example, the mean commute time for the office of 30,000 employees could be reported as 36.6 ± 3 minutes, 36.6 ± 1 minutes, or 36.6 ± 10 minutes, i.e., 36.6 minutes ± some margin of error.
At this point, however, you do not yet know how to calculate this margin of error.
Then, we moved on to sampling distributions, some of the properties of which would help you find this margin of error.
We created a sampling distribution, i.e., the probability distribution of 100 sample means, each computed from a sample of size 5.
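To make the construction concrete, here is a minimal sketch of how such a sampling distribution could be built. The population below is synthetic (30,000 normally distributed values, chosen only for illustration) and is not the data used in the session; the function name `build_sampling_distribution` and its parameters are likewise hypothetical.

```python
import random
import statistics


def build_sampling_distribution(population, sample_size=5, num_samples=100, seed=0):
    """Draw num_samples random samples (without replacement within each sample)
    and return the list of their means."""
    rng = random.Random(seed)
    return [
        statistics.mean(rng.sample(population, sample_size))
        for _ in range(num_samples)
    ]


# Assumption: a synthetic population standing in for real data.
population_rng = random.Random(42)
population = [population_rng.gauss(36.6, 10) for _ in range(30_000)]

# 100 sample means, each from a sample of size 5, as in the session.
sample_means = build_sampling_distribution(population, sample_size=5, num_samples=100)

print("Number of sample means:", len(sample_means))
print("Mean of sample means:", round(statistics.mean(sample_means), 2))
print("Std. dev. of sample means:", round(statistics.stdev(sample_means), 2))
```

Plotting a histogram of `sample_means` would give the shape of the sampling distribution; the printed values are its mean and its standard deviation (the standard error).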
A sampling distribution, which is essentially the distribution of the sample means of a population, has some interesting properties, which are collectively called the central limit theorem. It states that no matter how the original population is distributed, the sampling distribution will follow these three properties:
Sampling distribution’s mean ($\mu_{\bar{X}}$) = Population mean ($\mu$),
Sampling distribution’s standard deviation (Standard error) = $\frac{\sigma}{\sqrt{n}}$, where $\sigma$ is the population’s standard deviation and n is the sample size, and
For n > 30, the sampling distribution is approximately a normal distribution.
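The first two properties above can be checked with a short simulation. The sketch below assumes a deliberately skewed synthetic population (exponentially distributed values), which is not the session’s data, and the helper `sampling_summary` is a hypothetical name. Samples are drawn with replacement, so that the standard error formula (population standard deviation divided by the square root of n) applies directly.

```python
import math
import random
import statistics


def sampling_summary(population, sample_size, num_samples=2000, seed=1):
    """Return (mean of sample means, standard deviation of sample means)."""
    rng = random.Random(seed)
    means = [
        statistics.mean(rng.choices(population, k=sample_size))  # with replacement
        for _ in range(num_samples)
    ]
    return statistics.mean(means), statistics.pstdev(means)


# Assumption: a skewed synthetic population, to show the result does not
# depend on the population being normal.
population_rng = random.Random(7)
population = [population_rng.expovariate(1.0) for _ in range(20_000)]
mu = statistics.mean(population)
sigma = statistics.pstdev(population)

print("Population mean:", round(mu, 3))
for n in (5, 30, 100):
    mean_of_means, empirical_se = sampling_summary(population, n)
    theoretical_se = sigma / math.sqrt(n)
    print(
        f"n={n}: mean of sample means={mean_of_means:.3f}, "
        f"empirical S.E.={empirical_se:.3f}, sigma/sqrt(n)={theoretical_se:.3f}"
    )
```

For each n, the mean of the sample means should land close to the population mean and the empirical standard error close to the theoretical one. The third property (normality) is not tested here; a histogram of the sample means for each n would be one way to inspect it.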
To verify these properties, we performed sampling using the data collected for our upGrad game from the first session on inferential statistics. The values for the sampling distribution thus created ($\mu_{\bar{X}}$ = 2.348, S.E. = 0.4248) were quite close to the values predicted by theory ($\mu_{\bar{X}}$ = 2.385, S.E. = 0.44).
To summarise, the notations and formulas for populations, samples and sampling distributions are as follows:
Before moving on to the next lecture, let's spend some time attempting a few practice questions.