Welcome to the session on ‘Data Cleaning’.
In the previous session, you learnt about the various sources of data. Once you have procured the data, the next step is to clean it to get rid of data quality issues.
Data can have many kinds of quality issues, which is why data cleaning is one of the most time-consuming steps in data analysis. In real-world scenarios, the data you need to analyse often comes from third parties such as clients, and the methods used to collect and enter it often lead to errors, which makes cleaning the data crucial.
For example, there could be formatting errors (such as ill-formatted or unclearly named rows and columns), missing values, repeated rows and spelling inconsistencies. These issues can make the data difficult to analyse and can lead to errors or irrelevant results, so they need to be corrected before the data is analysed.
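To make these issue types concrete, here is a minimal sketch in plain Python of how three of them (missing values, repeated rows and spelling inconsistencies) might be detected. The sample records, the column names and the helper functions are all hypothetical and exist only for illustration; they are not part of the session material, and real datasets usually need more careful checks.

```python
# Hypothetical sample records; the columns and values are made up for illustration.
rows = [
    {"name": "Asha", "city": "Mumbai", "age": "34"},
    {"name": "Ravi", "city": "mumbai ", "age": ""},
    {"name": "Asha", "city": "Mumbai", "age": "34"},
    {"name": "Meera", "city": "Delhi", "age": "29"},
]


def find_missing(rows):
    """Return (row_index, column) pairs whose value is None, empty or whitespace."""
    return [
        (i, col)
        for i, row in enumerate(rows)
        for col, val in row.items()
        if val is None or not str(val).strip()
    ]


def find_duplicates(rows):
    """Return the indices of rows that exactly repeat an earlier row."""
    seen = set()
    dupes = []
    for i, row in enumerate(rows):
        key = tuple(sorted(row.items()))
        if key in seen:
            dupes.append(i)
        else:
            seen.add(key)
    return dupes


def find_inconsistent_spellings(rows, column):
    """Group raw values that differ only by letter case or surrounding spaces."""
    groups = {}
    for row in rows:
        raw = row[column]
        groups.setdefault(raw.strip().lower(), set()).add(raw)
    # Keep only the groups that contain more than one distinct raw spelling.
    return {key: sorted(vals) for key, vals in groups.items() if len(vals) > 1}


missing = find_missing(rows)
duplicates = find_duplicates(rows)
inconsistent = find_inconsistent_spellings(rows, "city")

print("Missing values:", missing)
print("Repeated rows:", duplicates)
print("Inconsistent spellings:", inconsistent)
```

Here the spelling check assumes that values differing only in case or surrounding whitespace refer to the same thing; genuine misspellings would need a different approach, such as comparing against a list of known valid values.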
You will learn how to identify the various quality issues in data and learn the techniques to clean it.
Though data cleaning is often done in a somewhat haphazard way and it is difficult to define a ‘single structured process’, we will study data cleaning as a series of steps in this session.
People you will hear from in this session
Subject Matter Expert:
CEO, Gramener
Gramener is one of the most prominent data analytics and visualisation companies in India. Anand, currently the CEO, was previously the Chief Data Scientist at Gramener and has extensive experience in management consulting and equity research.