You have learnt how to conduct univariate analysis on categorical variables. Now, let's look at quantitative or numeric variables.
Prerequisites
In this segment, Anand will take you through various summary metrics. These concepts are essential for this topic and for the forthcoming topics in other modules, so make sure that you familiarise yourself with them before moving ahead.
Mean: This is the sum of all the data values, divided by the total number of sample values.
Student name | Score (out of 20 marks) |
Raj | 12 |
Pawrush | 14 |
Srijan | 19 |
Anjali | 20 |
Anamika | 20 |
In the above example, the mean value would be the sum of all the score values (85) divided by the number of values (5), which is 17.
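The calculation above can be checked with a few lines of Python. This is a minimal sketch that reuses the scores from the table; the variable names are chosen here for illustration only.

```python
# Scores from the table above (out of 20 marks)
scores = {"Raj": 12, "Pawrush": 14, "Srijan": 19, "Anjali": 20, "Anamika": 20}

total = sum(scores.values())   # sum of all the score values
count = len(scores)            # number of values
mean_score = total / count     # mean = sum / count

print(total, count, mean_score)  # prints: 85 5 17.0
```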
Mode: In your sample data, the value that has the highest frequency is the mode.
Note: There can be more than one mode in a sample. For instance, there can be elections in which three parties participate, two of those get 40% of the votes each, and the third party gets 20% of the votes. In this case, there are two modes since two parties have the highest (equal) number of votes.
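To see the multiple-mode case in code, here is a small sketch using `statistics.multimode` from the Python standard library (available from Python 3.8). The party labels "A", "B" and "C" and the 100-vote total are assumptions made purely to mirror the election example.

```python
from statistics import multimode

# Hypothetical election: 100 votes split 40% / 40% / 20% across three parties
votes = ["A"] * 40 + ["B"] * 40 + ["C"] * 20

# multimode returns every value that shares the highest frequency
modes = multimode(votes)
print(modes)  # prints: ['A', 'B']

# With a single most frequent value, the list has one entry
score_modes = multimode([12, 14, 19, 20, 20])
print(score_modes)  # prints: [20]
```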
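Median: If you arrange the sample data in ascending order of value, from left to right, the value in the middle is called the median. If the number of values is even, the median is the average of the two middle values. In the above example, the sorted scores are 12, 14, 19, 20 and 20, so the median is 19.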
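As a quick illustration of the median, the sketch below sorts the student scores by value and picks the middle one, then compares the result with `statistics.median` from the Python standard library. The four-value list at the end is an assumed variation, added only to show the even-count case.

```python
from statistics import median

scores = [20, 12, 19, 14, 20]   # student scores, deliberately unsorted

# Sort by value (not by frequency), then take the middle element
sorted_scores = sorted(scores)                        # [12, 14, 19, 20, 20]
middle_value = sorted_scores[len(sorted_scores) // 2]

print(middle_value, median(scores))  # prints: 19 19

# With an even number of values, the median is the average of the two middle values
even_median = median([12, 14, 19, 20])
print(even_median)  # prints: 16.5
```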
Let’s now learn how to analyse quantitative variables.
Mean and median are single values that give a broad representation of the entire data. As Anand clearly stated, it is very important to understand when to use these metrics to avoid inaccurate analysis.
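While the mean gives an average of all the values, the median gives a typical value that could be used to represent the entire group. The mean is pulled towards extreme values, whereas the median is not, so the two can differ widely when the data is skewed or contains outliers. As a simple rule of thumb, question any use of the mean, and check the median as well, because the median is often a better measure of ‘representativeness’.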
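The difference between the two metrics is easiest to see on skewed data. The sketch below uses a made-up list of six monthly incomes (in thousands), one of which is far larger than the rest; the numbers are hypothetical and only serve to illustrate the point.

```python
from statistics import mean, median

# Hypothetical monthly incomes in thousands; the last value is an extreme one
incomes = [30, 32, 35, 38, 40, 500]

income_mean = mean(incomes)      # pulled upwards by the single large value
income_median = median(incomes)  # average of the two middle values, 35 and 38

print(income_mean, income_median)  # prints: 112.5 36.5
```

Here, five of the six people earn 40 or less, so the median of 36.5 describes a typical person far better than the mean of 112.5.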
Let’s now look at some other summary descriptive statistics such as mode, interquartile difference and standard deviation.
Standard deviation and interquartile difference are both used to represent the spread of the data. The interquartile difference (also known as the interquartile range, or IQR) is the difference between the 75th and the 25th percentiles.
Interquartile difference is a much better metric than standard deviation if there are outliers in the data. This is because the standard deviation is computed from every value and is therefore influenced by outliers, whereas the interquartile difference depends only on the middle 50% of the data and is largely unaffected by them.
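The effect of an outlier on the two spread metrics can be sketched with the Python standard library (`statistics.quantiles` requires Python 3.8 or later). The data values and the helper name `interquartile_difference` are assumptions made for this illustration, and `quantiles` uses its default 'exclusive' method; other tools may compute quartiles slightly differently.

```python
from statistics import quantiles, stdev

def interquartile_difference(values):
    """Return Q3 - Q1 using the default 'exclusive' quartile method."""
    q1, _, q3 = quantiles(values, n=4)
    return q3 - q1

# Hypothetical data: the second list replaces the largest value with an outlier
data = [10, 12, 13, 15, 16, 18, 20]
with_outlier = [10, 12, 13, 15, 16, 18, 200]

print(stdev(data), stdev(with_outlier))  # standard deviation grows sharply
print(interquartile_difference(data),
      interquartile_difference(with_outlier))  # interquartile difference stays the same
```

In this example, only the largest value changes, so the quartiles (and hence the interquartile difference) do not move at all, while the standard deviation becomes many times larger.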
You also saw how box plots are used to understand the spread of data.