In the previous lecture, you learnt how to fix rows and columns. Let’s now study another common data quality issue: how to treat missing values.
In the following lecture, Anand will walk you through some important points that you should keep in mind while treating missing values.
The most important takeaway from this lecture is that good methods add information, and bad methods exaggerate information.
In case you can add information from reliable external sources, you should use it to replace missing values. But often, it is better to let missing values be and continue with the analysis rather than extrapolate the available information.
Let's summarise how to deal with missing values:
Set values as missing values: Identify the values that indicate missing data and yet are not recognised by the software, e.g., treat blank strings, 'NA', 'XX', '99', etc. as missing and replace such data indicating missing values with the blank cell in Excel.
Adding is good, exaggerating is bad: You should try to get information from reliable external sources as much as possible; but if you cannot, then it is better to keep missing values as such rather than exaggerating the existing rows/columns.
Delete rows, columns: Delete rows if the number of missing values is insignificant, as this would not impact the analysis. Columns could be removed if the missing values are quite significant in number.
Fill partial missing values using business judgement: Missing time zone, century, etc. These values are easily identifiable.