In the previous lecture, you learnt how to fix missing values. You will now learn how to standardise values.
Let’s see how to go about standardising quantitative values in a data set.
Scaling ensures that the values have a common scale, which makes analysis easier. For example, let's take a data set containing the grades of students studying in different universities. Some of the universities give grades on a scale of 4, while others give grades on a scale of 10. Therefore, you cannot assume that a GPA of 3 on a scale of 4 is equal to a GPA of 3 on a scale of 10, even though they are quantitatively the same. Thus, for the purpose of analysis, these values need to be brought to a common scale, such as the percentage scale.
One of the concepts that surely caught your attention is outliers. Removing outliers is an important step in data cleaning. An outlier may disproportionately affect the results of your analysis. This may lead to faulty interpretations. It is also important to understand that there is no fixed definition of an outlier. It is left up to the judgment of the analyst to decide the criteria on which the data would be categorised as abnormal or an outlier. We will look into one such method in the next session.
Let’s summarise what you learnt about standardising variables. You could use this as a checklist for future data cleaning exercises.
Standardise units: Ensure that all observations under a variable have a common and consistent unit, e.g., convert lbs to kgs, miles/hr to km/hr, etc.
Scale values if required: Make sure that the observations under a variable have a common scale.
Standardise precision: This allows you to present the data better, e.g., 4.5312341 kgs to 4.53 kgs.
Remove outliers: Remove high and low values that would disproportionately affect the results of your analysis.
Now that you have learned how to standardise numeric values, let's see how to standardise text values, which is equally important.
(Note: In the video, the date would be 2016/10/23 instead of 2016/10/20.)
Let's summarise what you learnt about standardising text. You could use this as a checklist for future data cleaning exercises.
Remove extra characters: For example, common prefixes/suffixes, leading/trailing/multiple spaces, etc. These are irrelevant to the analysis.
Standardise the case: There are various cases that string variables may take, e.g., UPPERCASE, lowercase, Title Case, Sentence case, etc.
Standardise the format: E.g., 23/10/16 to 2016/10/23, 'Modi, Narendra' to 'Narendra Modi', etc.
Some of the Excel functions that can be used to convert the case of the text are given below.
PROPER: This function uppercases the first letter of the text and the letter that follows any character other than a letter of the alphabet.
Examples: PROPER(best course)= Best Course, PROPER(best@course)= Best@Course
UPPER: This function capitalises all the letters of the text.
LOWER: This is used to convert all the letters of the text to lowercase.
In the next session, you will learn how to fix invalid values.