Explore Courses
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Birla Institute of Management Technology Birla Institute of Management Technology Post Graduate Diploma in Management (BIMTECH)
  • 24 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Popular
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science & AI (Executive)
  • 12 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
University of MarylandIIIT BangalorePost Graduate Certificate in Data Science & AI (Executive)
  • 8-8.5 Months
upGradupGradData Science Bootcamp with AI
  • 6 months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
OP Jindal Global UniversityOP Jindal Global UniversityMaster of Design in User Experience Design
  • 12 Months
Popular
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Rushford, GenevaRushford Business SchoolDBA Doctorate in Technology (Computer Science)
  • 36 Months
IIIT BangaloreIIIT BangaloreCloud Computing and DevOps Program (Executive)
  • 8 Months
New
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Popular
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
Golden Gate University Golden Gate University Doctor of Business Administration in Digital Leadership
  • 36 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
Popular
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
Bestseller
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
IIIT BangaloreIIIT BangalorePost Graduate Certificate in Machine Learning & Deep Learning (Executive)
  • 8 Months
Bestseller
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in AI and Emerging Technologies (Blended Learning Program)
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
ESGCI, ParisESGCI, ParisDoctorate of Business Administration (DBA) from ESGCI, Paris
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration From Golden Gate University, San Francisco
  • 36 Months
Rushford Business SchoolRushford Business SchoolDoctor of Business Administration from Rushford Business School, Switzerland)
  • 36 Months
Edgewood CollegeEdgewood CollegeDoctorate of Business Administration from Edgewood College
  • 24 Months
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with Concentration in Generative AI
  • 36 Months
Golden Gate University Golden Gate University DBA in Digital Leadership from Golden Gate University, San Francisco
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Deakin Business School and Institute of Management Technology, GhaziabadDeakin Business School and IMT, GhaziabadMBA (Master of Business Administration)
  • 12 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science (Executive)
  • 12 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityO.P.Jindal Global University
  • 12 Months
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (AI/ML)
  • 36 Months
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDBA Specialisation in AI & ML
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
New
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGrad KnowledgeHutupGrad KnowledgeHutAzure Administrator Certification (AZ-104)
  • 24 Hours
KnowledgeHut upGradKnowledgeHut upGradAWS Cloud Practioner Essentials Certification
  • 1 Week
KnowledgeHut upGradKnowledgeHut upGradAzure Data Engineering Training (DP-203)
  • 1 Week
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
Loyola Institute of Business Administration (LIBA)Loyola Institute of Business Administration (LIBA)Executive PG Programme in Human Resource Management
  • 11 Months
Popular
Goa Institute of ManagementGoa Institute of ManagementExecutive PG Program in Healthcare Management
  • 11 Months
IMT GhaziabadIMT GhaziabadAdvanced General Management Program
  • 11 Months
Golden Gate UniversityGolden Gate UniversityProfessional Certificate in Global Business Management
  • 6-8 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
IU, GermanyIU, GermanyMaster of Business Administration (90 ECTS)
  • 18 Months
Bestseller
IU, GermanyIU, GermanyMaster in International Management (120 ECTS)
  • 24 Months
Popular
IU, GermanyIU, GermanyB.Sc. Computer Science (180 ECTS)
  • 36 Months
Clark UniversityClark UniversityMaster of Business Administration
  • 23 Months
New
Golden Gate UniversityGolden Gate UniversityMaster of Business Administration
  • 20 Months
Clark University, USClark University, USMS in Project Management
  • 20 Months
New
Edgewood CollegeEdgewood CollegeMaster of Business Administration
  • 23 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 5 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
upGradupGradUI/UX Bootcamp
  • 3 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
upGradupGradDigital Marketing Accelerator Program
  • 05 Months

Data Preprocessing In Data Mining: Steps, Missing Value Imputation, Data Standardization

Updated on 24 November, 2022

6.14K+ views
8 min read

The most time-consuming part of a Data Scientist’s job is to prepare and preprocess the data at hand. The data we get in real-life scenarios is not clean and suitable for modelling. The data needs to be cleaned, brought to a certain format and transformed before feeding to the Machine Learning models.

At the end of this tutorial, you will know the following

Why Data Preprocessing?

When data is retrieved by scrapping websites and gathering it from other data sources, it is generally full of discrepancies. It can be formatting issues, missing values, garbage values and text and even errors in the data. Several preprocessing steps need to be done to make sure that the data that is fed to the model is up to the mark so that the model can learn and generalize on it.

Data Cleaning

The first and most essential step is to clean the irregularities in the data. Without doing this step, we cannot make much sense out of the statistics of the data. These can be formatting issues, garbage values and outliers.

Formatting issues

We need the data to be in a tabular format most of the times but it is not the case. The data might have missing or incorrect column names, blank columns. Moreover, when dealing with unstructured data such as Images and Text, it becomes utmost essential to get the 2D or 3D data loaded in Dataframes for modelling.

Garbage Values

Many instances or complete columns might have certain garbage values appended to the actual required value. For example, consider a column “rank” which has the values such as: “#1”, “#3”, “#12”, “#2” etc. Now, it is important to remove all the preceding “#” characters to be able to feed the numeric value to the model.

Outliers

Many times certain numeric values are either too large or too low than the average value of the specific column. These are considered as outliers. Outliers need special treatment and are a sensitive factor to treat. These outliers might be measurement errors or they might be real values as well. They either need to be removed completely or handled separately as they might contain a lot of important information.

Missing Values

It is seldom the case that your data will contain all the values for every instance. Many values are missing or filled with garbage entry. These missing values need to be treated. These values can have multiple reasons why they might be missing. They could be missing due to some reason such as sensor error or other factors, or they can also be missing completely at random.

Read: Data Mining Projects in India

Dropping

The most straightforward and easiest way is to drop the rows where values are missing. Doing this has many disadvantages like loss of crucial information. It might be a good step to drop the missing values when the amount of data you have is huge. But if the data is less and there are a lot of missing values, you need better ways to tackle this issue.

Mean/Median/Mode imputation

The quickest way to impute missing values is by simply imputing the mean value of the column. However, it has disadvantages because it disturbs the original distribution of the data. You can also impute the median value or the mode value which is generally better than the simple mean.

Linear interpolation & KNN

More smart ways can also be used to impute missing values. 2 of which are Linear Interpolations using multiple models by treating the column with blank values as the feature to be predicted. Another way is to use clustering by KNN. KNN makes clusters of the values in a particular feature and then assigns the value closest to the cluster.

Data Standardization

In a data set with multiple numerical features, all the features might not be on the same scale. For example, a feature “Distance” has distances in meters such as 1300, 800, 560, etc. And another feature “time” has times in hours such as 1, 2.5, 3.2, 0.8, etc. So, when these two features are fed to the model, it considers the feature with distances as more weightage as its values are large. To avoid this scenario and to have faster convergence, it is necessary to bring all the features on the same scale.

Normalization

A common way to scale the features is by normalizing them. It can be implemented using Scikit-learn’s Normalizer. It works not on the columns, but on the rows. L2 normalization is applied to each observation so that the values in a row have a unit norm after scaling.

Min Max Scaling

Min Max scaling can be implemented using Scikit-learn’s Min MaxScaler class. It subtracts the minimum value of the features and then divides by the range, where the range is the difference between the original maximum and original minimum. It preserves the shape of the original distribution, with default range in 0-1.

upGrad’s Exclusive Data Science Webinar for you –

ODE Thought Leadership Presentation

 

 

Standard Scaling

Standard Scaler also can be implemented using Scikit-learn’s class. It standardizes a feature by subtracting the mean and then scaling to unit variance, where unit variance means dividing all the values by the standard deviation. It makes the mean of the distribution 0 and standard deviation as 1.

Discretization

A lot of times data is not in numeric form instead of in categorical form. For example, consider a feature “temperature” with values as “High”, “Low”, “Medium”. These textual values need to be encoded in numerical form to able for the model to train upon. 

Categorical Data  

Categorical Data is label encoded to bring it in numerical form. So “High”, “Medium” and “Low” can be Label Encoded to 3,2, and 1. Categorical features can be either nominal or ordinal. Ordinal categorical features are those which have a certain order. For example, in the above case, we can say that 3>2>1 as the temperatures can be measured/quantified. 

However, in an example where a feature of “City” which has values like “Delhi”, “Jammu” & “Agra”, cannot be measured. In other words, when we label encode them as 3, 2, 1, we cannot say that 3>2>1 because “Delhi” > ”Jammu” won’t make much sense. In such cases, we use One Hot Encoding.

Continuous Data

Features with continuous values can also be discretized by binning the values into bins of specific ranges. Binning means converting a numerical or continuous feature into a discrete set of values, based on the ranges of the continuous values. This comes in handy when you want to see the trends based on what range the data point falls in. 

For example, say we have marks for 7 kids ranging from 0-100. Now, we can assign every kid’s marks to a particular “bin”. Now we can divide into 3 bins with ranges 0 to 50, 51-70, and 71-100 belonging to bins 1,2, and 3 respectively. Therefore, the feature will now only contain one of these 3 values. Pandas offers 2 functions to achieve binning quickly: qcut and cut.

Pandas qcut takes in the number of quantiles and divides the data points to each bin based on the data distribution.

Pandas cut, on the other hand, takes in the custom ranges defined by us and divides the data points in those ranges.

Related read: Data Preprocessing in Machine Learning

Learn data science courses from the World’s top Universities. Earn Executive PG Programs, Advanced Certificate Programs, or Masters Programs to fast-track your career.

Conclusion

Data Preprocessing is an essential step in any Data Mining and Machine Learning task. All the steps we discussed are certainly not all but do cover most of the basic part of the process. Data preprocessing techniques are different for NLP and Image data as well. Make sure to try examples of above steps and implement in your Data Mining pipeline.

If you are curious to learn about data science, check out IIIT-B & upGrad’s PG Diploma in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.

Frequently Asked Questions (FAQs)

1. What is data preprocessing and what is its significance?

This is a technique to furnish the raw unstructured data which is in the form of images, text, videos. This data is first preprocessed to remove inconsistencies, errors, and redundancies so that it can be analyzed later.
The raw data is transformed into relevant data that can be understood by the machines. Preprocessing the data is an important step to transform the data for modelling. Without processing, it is practically useless.

2. What are the steps involved in data preprocessing?

Data preprocessing involves various steps to complete the whole process. The data is first cleaned to remove noises and fill the missing values. After this, the data is integrated from multiple sources to combine into a single data set. These steps are then followed by transformation, reduction, and discretization.
The transformation of the raw data involves normalizing the data. Reduction and discretization basically deal with reducing the attributes and dimensions of the data. This is followed by compressing this large set of data.

3. What is the difference between univariate and multivariate methods?

The univariate method is the simplest method to handle an outlier. It does not overview any relationship since it is a single variate and its main purpose is to analyze the data and determine the pattern associated with it. Mean, median, and mode are examples of patterns found in the univariate data.
On the other hand, the multivariate method is for analyzing three or more variables. It is more precise than the earlier method since, unlike the univariate method, the multivariate method deals with relationships and patterns. Additive Tree, Canonical Correlation Analysis, and Cluster Analysis are some of the ways to perform multivariate analysis.