Now, let’s move to the most interesting part of EDA: getting useful insights from the data. So far, you have seen two types of variables: categorical (ordered / unordered) and quantitative (or numeric). In this segment, you will learn how to perform univariate analysis on unordered categorical variables.
You saw how plots can be used to extract meaningful information from unordered categorical variables. Compare this with the answer you gave to the question before the lecture: how would your approach to analysing unordered categorical variables change after studying this?
It is important to note that rank-frequency plots enable you to extract meaning even from seemingly trivial unordered categorical variables such as country, name of an artist, name of a GitHub user, etc.
The objective here is not to put excessive focus on power laws or rank-frequency plots, but rather to understand that non-trivial analysis is possible even on unordered categorical variables and that plots can help you in that process.
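To make the idea of a rank-frequency analysis concrete, here is a minimal Python sketch. The helper name rank_frequency and the toy country list are hypothetical and used only for illustration; they are not the data set from the lecture. The sketch counts how often each category occurs, sorts the categories from most to least frequent, and assigns ranks starting at 1.

```python
from collections import Counter
import math


def rank_frequency(values):
    """Return (rank, category, frequency) tuples, most frequent first.

    Hypothetical helper for illustration; not taken from the lecture.
    """
    counts = Counter(values)
    # most_common() returns (category, count) pairs sorted by descending count
    return [
        (rank, category, freq)
        for rank, (category, freq) in enumerate(counts.most_common(), start=1)
    ]


# Made-up toy data: an unordered categorical variable (country)
countries = ["India"] * 8 + ["USA"] * 4 + ["UK"] * 2 + ["Brazil"]

table = rank_frequency(countries)
for rank, category, freq in table:
    # log10(rank) and log10(freq) are the coordinates on a log-log plot
    print(rank, category, freq,
          round(math.log10(rank), 2), round(math.log10(freq), 2))
```

Plotting the last two printed columns against each other gives the rank-frequency plot on a log-log scale; if the data roughly follows a power law, the points fall close to a straight line.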
Let us now see how a power law distribution is created in Excel.
Download the data set used in the lecture here.
Why plotting on a log-log scale helps
The objective of using a log scale is to make the plot readable by changing the scale. For example, the first-ranked item had a frequency of 29000; the second-ranked had 3500; the seventh had 700; and most others had very low frequencies such as 100, 80, 21, etc. The range of frequencies is too wide to display legibly on a linear axis: the few large values dominate, and the many small ones are squashed together near zero.
Plotting on a log scale compresses the values to a smaller scale, which makes the plot easy to read.
This happens because, for large values of x, log(x) is a much smaller number than x (log here means the logarithm to base 10). For example, log(10) = 1, log(100) = 2, log(1000) = 3 and so on. Thus, log(29000) is approx. 4.5; log(3500) is approx. 3.5; and so on. What was earlier varying from 29000 to 1 is now compressed between 4.5 and 0, making the values easier to read on a plot.
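The compression described above can be checked with a short Python sketch. It uses the frequencies quoted in this segment (with 1 added as the smallest value) and takes the base-10 logarithm of each; the variable names are chosen only for illustration.

```python
import math

# Frequencies quoted in the text, plus 1 as the smallest possible value
frequencies = [29000, 3500, 700, 100, 80, 21, 1]

# Base-10 logarithm of each frequency
log_frequencies = [math.log10(f) for f in frequencies]

for f, lf in zip(frequencies, log_frequencies):
    print(f"{f:>6} -> {lf:.2f}")

# Spread of the raw values versus the log-transformed values
print("raw range:", min(frequencies), "to", max(frequencies))
print("log range:", round(min(log_frequencies), 2),
      "to", round(max(log_frequencies), 2))
```

The raw values span four orders of magnitude, while their logarithms all lie between 0 and roughly 4.5, which is why every point remains visible on a log-scaled axis.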
To summarise, the major takeaways from this lecture are:
In the next lecture, you will study how to conduct univariate analysis on ordered categorical variables.