Data Preprocessing in Machine Learning: 11 Key Steps You Must Know!
By Kechit Goyal
Updated on Jun 27, 2025 | 33 min read | 160.74K+ views
Did you know? In machine learning, data quality is as important as data quantity. More data can improve model performance, but its quality and structure make the real difference. In fact, data practitioners spend up to 80% of their time on data preprocessing and management, ensuring the data is properly prepared for machine learning models.
Data preprocessing in machine learning transforms raw data into a clean, structured format suitable for model training. This process involves handling missing values, encoding categorical variables, and scaling numerical data.
These steps are essential for ensuring the model functions effectively. It is widely used in fraud detection, sentiment analysis, and recommendation systems.
In this blog, you’ll discover 11 key steps and effective techniques of data preprocessing. We will also explore its critical role in enhancing the performance of machine learning models.
Want to strengthen your machine learning skills for effective data preprocessing and analysis? upGrad’s Artificial Intelligence & Machine Learning - AI ML Courses can equip you with tools and strategies to stay ahead in your career. Enroll today!
Data preprocessing is a critical step in the machine learning pipeline that transforms raw data into a format suitable for modeling. This process involves cleaning, structuring, and enhancing the dataset to improve model accuracy and efficiency.
By handling missing values, detecting outliers, encoding categorical variables, and addressing data imbalances, we ensure that the model can learn meaningful patterns without being influenced by irrelevant or incomplete data.
Want to enhance your understanding of key ML concepts like data preprocessing and advance your career? Check out these courses to gain the knowledge and hands-on experience needed to handle large datasets and improve your ML models confidently:
Let's explore the 11 key steps to prepare your dataset for machine learning model training. We will use the Titanic dataset from Kaggle for detailed explanations and code samples.
Step 1: Acquire the Dataset

Before starting data preprocessing, it’s essential to acquire a dataset that aligns with your modeling objectives. The quality, structure, and source of the dataset significantly impact the downstream steps, such as feature engineering, model validation, and performance.
1. Data Sources: Well-established platforms such as Kaggle, UCI Machine Learning Repository, and open APIs like Quandl and OpenWeatherMap offer structured, labeled datasets ideal for experimentation and benchmarking. Be sure to validate that the dataset is properly maintained and updated.
2. File Formats: Datasets are commonly distributed as CSV, JSON, Excel, or Parquet files. pandas can read each of these directly (read_csv, read_json, read_excel, read_parquet), so pick the format your tooling handles most efficiently.
3. Use Case Alignment: Ensure the dataset matches the goals of your machine learning project. For example, a labeled dataset like Titanic suits binary classification (survived vs. not), while time-stamped sensor readings suit forecasting tasks.
4. Quality Consideration: Always evaluate the dataset for errors, duplicates, or inconsistencies. Good data preprocessing in machine learning starts with clean, well-structured data, and addressing these quality issues early is key (see the audit sketch after this list).
5. Dataset Size and Performance: The size of the dataset will determine the algorithms you can use. Large datasets may require deep learning approaches, while smaller datasets may work well with traditional machine learning algorithms.
6. Licensing and Ethics: Ensure that the dataset you acquire is properly licensed and ethically sourced, especially if it is used for research or commercial purposes. Review licensing agreements to ensure proper use.
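As a minimal sketch of points 4 and 5 — assuming the Titanic train.csv has already been downloaded from Kaggle into the current folder — you can audit a candidate dataset for duplicates and missing data before committing to it:
Sample Code:
import pandas as pd
# Load a candidate dataset (the file path is an assumption; adjust to your download location)
df = pd.read_csv('train.csv')
# Quick quality audit: size, duplicate rows, and share of missing values per column
print("Rows, columns:", df.shape)
print("Duplicate rows:", df.duplicated().sum())
print("Missing-value share per column:")
print(df.isnull().mean().round(3).sort_values(ascending=False))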
Step 2: Set the Working Directory

Before loading your dataset, it’s essential to set the working directory. This defines the folder where your dataset is stored and ensures that your program can access it correctly. It’s especially important in environments where the default working directory is not automatically set.
Sample Code:
import os
# Set the working directory where your dataset is stored
# Replace '/path/to/your/dataset' with the actual path to your dataset
os.chdir('/path/to/your/dataset') # Adjust to the actual path where the dataset is located
# Verify the current working directory
print("Current working directory:", os.getcwd())
Explanation: os.chdir() changes Python’s current working directory to the folder containing your dataset, and os.getcwd() returns the active directory so you can confirm the change took effect.
Error Handling: To make sure the path exists, you can add an error handling check:
if os.path.exists('/path/to/your/dataset'):
os.chdir('/path/to/your/dataset')
print("Working directory set successfully!")
else:
print("Error: The specified directory does not exist.")
Explanation: os.path.exists() checks if the provided directory path exists. If it does, the program changes to that directory; otherwise, it prints an error message.
Portable Path Usage: For better portability across different operating systems (e.g., Windows, macOS, Linux), it's a good practice to use os.path.join() to create file paths:
dataset_path = os.path.join('path', 'to', 'your', 'dataset')
os.chdir(dataset_path)
This ensures that the code works on any operating system, avoiding issues with file path separators (\ vs /).
Output Explanation: The script prints the absolute path of the current working directory, confirming that subsequent relative file reads (such as pd.read_csv('train.csv')) will resolve against your dataset folder.
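As a further option — a hedged sketch, not something this tutorial requires — the standard-library pathlib module offers an object-oriented way to build the same portable paths:
Sample Code:
import os
from pathlib import Path
# The '/' operator joins path segments in an OS-independent way
dataset_path = Path('path') / 'to' / 'your' / 'dataset'
if dataset_path.exists():
    os.chdir(dataset_path)
    print("Working directory set to:", Path.cwd())
else:
    print("Error: The specified directory does not exist.")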
Step 3: Import Essential Libraries

Efficient data preprocessing in machine learning requires specialized libraries that handle numerical operations, data manipulation, and visualization. The following Python libraries are crucial for tasks like cleaning, analyzing, and visualizing the data to prepare it for modeling.
Code:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
# Simple verification
print("All libraries imported successfully.")
Explanation:
NumPy (np) provides fast array operations and numerical functions used throughout preprocessing.
pandas (pd) supplies the DataFrame structure for loading, cleaning, and transforming tabular data.
Matplotlib (plt) is the core plotting library for charts and figures.
Seaborn (sns) builds on Matplotlib with higher-level statistical visualizations such as heatmaps, boxplots, and pairplots.
Output:
All libraries imported successfully.
This confirms that the necessary libraries are ready for use. You can now proceed with tasks such as normalization, missing-value handling, outlier detection, and data visualization.
Additional Libraries to Consider: scikit-learn (preprocessing utilities, model selection, and metrics), SciPy (statistical functions such as zscore, used later for outlier detection), and imbalanced-learn (resampling techniques such as SMOTE, used in Step 9).
Also Read: 50 Statistical Functions in Microsoft Excel
Step 4: Load the Dataset

Now, it’s time to load the dataset into your workspace so the data is ready for processing. Loading the data correctly is crucial, as it sets the stage for all subsequent preprocessing steps.
Code:
# Load the dataset
dataset = pd.read_csv('train.csv') # Adjust the file path as needed
# Preview the dataset
print(dataset.head())
# Optionally, preview the structure of the dataset
print(dataset.info()) # Shows column data types and non-null counts
Explanation: pd.read_csv() reads the CSV file into a pandas DataFrame. head() previews the first five rows so you can sanity-check the columns, while info() lists each column’s data type and non-null count.
Output: The first five rows of the Titanic dataset are printed, followed by a summary of its 12 columns, their data types, and non-null counts.
Error Handling Example (Optional): If the file path is incorrect or the file doesn’t exist, the code will throw a FileNotFoundError. You can add error handling to catch such errors:
Code Example: This example uses a try-except block to handle potential errors when loading the dataset.
try:
dataset = pd.read_csv('train.csv')
print("Dataset loaded successfully!")
except FileNotFoundError:
print("Error: The file 'train.csv' was not found. Please check the path.")
Explanation: The try block attempts to load the file; if pd.read_csv() cannot find it, the except block catches the FileNotFoundError and prints a clear message instead of letting the program crash.
Example Output:
If the file is found:
Dataset loaded successfully!
If the file is not found:
Error: The file 'train.csv' was not found. Please check the path.
This ensures the program handles missing files gracefully without crashing.
Also Read: Machine Learning Basics: Key Concepts and Essential Elements Explained
Step 5: Explore and Inspect the Dataset

Understanding the dataset’s structure is essential for identifying potential issues such as missing values, skewed distributions, or data type mismatches. This step helps determine the right preprocessing techniques for each feature.
Code:
# Inspect dataset dimensions and summary statistics
print(dataset.shape)
# Statistical summary for numeric columns
print(dataset.describe())
# Missing value count per column
print(dataset.isnull().sum())
# Check data types of each column
print(dataset.dtypes)
Explanation: shape returns the dataset’s (rows, columns) dimensions; describe() summarizes the numeric columns (count, mean, standard deviation, quartiles); isnull().sum() counts missing values per column; and dtypes reports each column’s data type.
Output: The dataset contains 891 rows and 12 columns.
(891, 12)
Summary:
PassengerId ... Fare
count 891.000000 ... 891.000000
mean 446.000000 ... 32.204208
std 257.353842 ... 49.693429
Missing Values:
Age 177
Cabin 687
Embarked 2
Explanation: Age is missing in 177 rows, Cabin in 687 of 891 rows (roughly 77%), and Embarked in only 2 rows. These very different levels of missingness call for different strategies, which the next step applies.
Key Considerations:
1. Skewed Distributions: After reviewing the summary statistics, check for skewness in features like Age or Fare. For highly skewed distributions, consider applying transformations such as a log transformation (a quick check follows after this list).
2. Handling Categorical Features: After checking the dtypes, make sure that categorical variables like Sex and Embarked are encoded properly before feeding them into models. If they are still strings, you may need to use Label Encoding or One-Hot Encoding.
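As a small sketch of consideration 1 — reusing the dataset loaded above — pandas can quantify skewness directly, and value_counts() previews the categories that will need encoding:
Sample Code:
# Skewness of key numeric features; values far from 0 indicate a skewed distribution
print(dataset[['Age', 'Fare']].skew())
# Category counts for features that will need encoding later
print(dataset['Sex'].value_counts())
print(dataset['Embarked'].value_counts())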
Enhance your understanding of machine learning and advance your skills with upGrad’s Advanced Generative AI Certification Course. In just 5 months, gain expertise in prompt engineering and GenAI-powered workflows to automate tasks.
Step 6: Handle Missing Data

Properly handling missing data is essential for model performance, as missing values can introduce bias and reduce the quality of your predictions. Depending on the nature of the feature and the amount of missing data, you can either drop or impute missing values.
Code:
# Handle missing data
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())  # Impute missing 'Age' with the median
dataset = dataset.drop(columns=['Cabin'])  # Drop 'Cabin' due to its high share of missing values
dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])  # Impute missing 'Embarked' with the mode
# Check for missing values
print(dataset.isnull().sum())
Explanation: The median is used for Age because it is robust to outliers; Cabin is dropped because roughly 77% of its values are missing, leaving too little signal to impute reliably; and Embarked has only 2 missing values, so the mode (the most frequent port) is a safe fill. Assigning each result back, rather than calling fillna with inplace=True on a column, avoids chained-assignment warnings in recent pandas versions. The final check confirms no missing values remain.
Output:
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 0
SibSp 0
Parch 0
Ticket 0
Fare 0
Embarked 0
Key Considerations:
1. Imputation Strategies: Use the mean or median for numeric features (the median when outliers are present), the mode for categorical features, and model-based approaches such as KNN imputation when missingness depends on other features (see the sketch after this list).
2. Dropping vs. Imputing: Drop a column when most of its values are missing, as with Cabin; impute when missingness is modest, since dropping rows or columns discards potentially useful information.
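A hedged sketch of the model-based alternative mentioned above, using scikit-learn’s KNNImputer on the numeric columns (the column choice is an illustrative assumption):
Sample Code:
from sklearn.impute import KNNImputer
# Fill each missing numeric value using the 5 most similar rows
numeric_cols = ['Age', 'SibSp', 'Parch', 'Fare']
imputer = KNNImputer(n_neighbors=5)
dataset[numeric_cols] = imputer.fit_transform(dataset[numeric_cols])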
Also Read: Binomial Theorem: Mean, SD, Properties & Related Terms
Step 7: Detect and Handle Outliers

Outliers can skew results and significantly affect model performance, so detecting and handling them is a crucial part of data preprocessing. In this step, we’ll use both visualization and statistical methods to detect outliers, particularly in numerical features like Fare.
Sample Code:
# Boxplot to detect outliers in 'Fare'
sns.boxplot(x=dataset['Fare'])
plt.show()
# Z-score method for 'Fare'
from scipy.stats import zscore
z_scores = zscore(dataset['Fare'])
dataset = dataset[abs(z_scores) < 3]  # Keep only rows within 3 standard deviations of the mean
Explanation: sns.boxplot() draws the distribution of Fare, showing outliers as points beyond the whiskers. zscore() standardizes Fare, and the boolean filter keeps only the rows whose absolute Z-score is below 3.
Output: The boxplot visualizes the distribution of the Fare column, with potential outliers appearing as points outside the whiskers. The Z-score filter then removes rows whose absolute Z-score exceeds 3, leaving the dataset free of extreme outliers.
Key Considerations:
1. Z-score Method: The Z-score method works well when the data is normally distributed. If the data is skewed, other methods like IQR-based filtering might be more appropriate.
2. Other Methods for Handling Outliers: IQR-based filtering (dropping values outside Q1 − 1.5×IQR to Q3 + 1.5×IQR), winsorization (capping extreme values at chosen percentiles), and log transformation (compressing a long right tail, as applied to Fare later in this guide). An IQR sketch follows below.
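A minimal sketch of the IQR-based alternative mentioned above, applied to Fare:
Sample Code:
# IQR-based outlier filtering for 'Fare'
Q1 = dataset['Fare'].quantile(0.25)
Q3 = dataset['Fare'].quantile(0.75)
IQR = Q3 - Q1
lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
dataset = dataset[dataset['Fare'].between(lower, upper)]  # Keep only in-range fares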
Step 8: Feature Engineering

Feature engineering is a crucial step in improving model performance. It involves creating new features, transforming existing ones, or combining multiple features to extract more meaningful information from the data. Properly engineered features can significantly improve the performance of ML models by providing them with more useful and informative inputs.
Feature engineering can include: creating derived features (such as the Age_Group bins below), combining existing features, extracting components from dates, encoding categorical variables, scaling numeric features, and applying log transformations — each illustrated in this step.
Sample Code:
# Feature Engineering: Create a new feature for 'Age_Group'
dataset['Age_Group'] = pd.cut(dataset['Age'], bins=[0, 12, 18, 30, 50, 80],
labels=["Child", "Teen", "Adult", "Middle_Aged", "Senior"])
# Preview the dataset with new feature
print(dataset[['Age', 'Age_Group']].head())
Explanation: pd.cut() bins the continuous Age column into five labeled ranges (0–12, 12–18, 18–30, 30–50, 50–80), producing a new categorical feature Age_Group.
Output:
    Age    Age_Group
0  22.0        Adult
1  38.0  Middle_Aged
2  26.0        Adult
3  35.0  Middle_Aged
4  35.0  Middle_Aged
This new feature can help the model differentiate between passengers in different age groups and may improve prediction accuracy.
Additional Feature Engineering Techniques:
1. Combining Features: You can create new features by combining existing ones.
Sample Code:
dataset['Age_Pclass'] = dataset['Age'] * dataset['Pclass']
Explanation: This creates a new feature Age_Pclass by multiplying Age and Pclass. This may help capture relationships between age and class (e.g., younger passengers in higher classes might have different survival rates).
2. Extracting Information from Dates: If you have datetime columns, you can extract meaningful features, such as day of the week, month, year, or time of day.
Sample Code (the Titanic dataset has no datetime column, so 'Date' below is a hypothetical placeholder):
dataset['Year'] = pd.to_datetime(dataset['Date'], errors='coerce').dt.year
Explanation: pd.to_datetime() parses the column into datetime format (errors='coerce' turns unparseable values into NaT), and .dt.year extracts the year. The same pattern yields month, day of week, or hour.
3. Handling Categorical Features:
You might need to encode categorical features before passing them to models. For instance, Label Encoding or One-Hot Encoding can be used to convert categories into numeric values.
Sample Code:
dataset = pd.get_dummies(dataset, columns=['Embarked'], drop_first=True)
Explanation: pd.get_dummies() creates binary columns for each category in Embarked, while drop_first=True avoids the multicollinearity problem by removing the first column.
4. Scaling Features: Features such as Age or Fare may have large numeric ranges, so scaling or normalizing them can help improve model performance, especially for algorithms sensitive to feature scale.
Sample Code:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dataset[['Age', 'Fare']] = scaler.fit_transform(dataset[['Age', 'Fare']])
Explanation: StandardScaler() standardizes the Age and Fare columns by removing the mean and scaling to unit variance, ensuring that these features contribute equally to the model.
5. Log Transformations: If a feature is highly skewed (such as Fare), a log transformation can help normalize it.
Sample Code:
dataset['Fare'] = np.log1p(dataset['Fare'])
Explanation: np.log1p() applies log(1 + x) to the Fare column to reduce right skew and make the distribution more symmetric. Apply it to the raw (unscaled) fares — log1p is undefined for values of −1 or below, so it should come before standardization.
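One more Titanic-specific sketch — a commonly used derived feature, though not part of the original steps: the honorific title embedded in the Name column often correlates with age, sex, and survival:
Sample Code:
# Extract the honorific (e.g., 'Mr', 'Mrs', 'Miss') that precedes the '.' in each name
dataset['Title'] = dataset['Name'].str.extract(r' ([A-Za-z]+)\.', expand=False)
print(dataset['Title'].value_counts().head())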
Also Read: 15 Essential Advantages of Machine Learning for Businesses in 2025
Step 9: Handle Data Imbalance

Data imbalance can significantly affect model performance, especially in classification tasks. When one class in the target variable is underrepresented compared to others, the model may become biased towards the majority class, leading to inaccurate predictions for the minority class. To handle this, techniques like oversampling and undersampling can be used.
Code (Oversampling with SMOTE):
from imblearn.over_sampling import SMOTE
# Separate features and target; SMOTE requires numeric, fully imputed features,
# so drop identifiers and keep only numeric columns (encode categoricals first to include them)
X = dataset.drop(columns=['Survived', 'PassengerId']).select_dtypes(include='number')
y = dataset['Survived']
# Apply SMOTE for oversampling
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)
# Verify the class distribution after resampling
print(pd.Series(y_res).value_counts())
Explanation: SMOTE (Synthetic Minority Over-sampling Technique) balances the classes by interpolating new synthetic samples between existing minority-class neighbors, rather than simply duplicating rows. fit_resample() returns the balanced feature matrix and target, and value_counts() verifies the new class distribution.
Output: After applying SMOTE, the class distribution of the target variable Survived is now balanced.
0 549
1 549
Name: Survived, dtype: int64
Key Considerations:
1. Oversampling vs. Undersampling: Oversampling (such as SMOTE) adds synthetic minority-class samples and keeps all original data but can amplify noise; undersampling discards majority-class samples, which can lose information on smaller datasets (see the sketch after this list).
2. Evaluation Metrics: Accuracy can be misleading in imbalanced datasets since a model might predict the majority class most of the time. Instead, focus on Precision, Recall, F1-Score, and ROC AUC, which provide a more balanced evaluation of performance.
3. Impact on Model Performance: Apply resampling only to the training split. Resampling before the train/test split leaks synthetic information into the evaluation set and inflates the measured performance.
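A hedged sketch of the undersampling counterpart, using imbalanced-learn’s RandomUnderSampler:
Sample Code:
from imblearn.under_sampling import RandomUnderSampler
# Randomly drop majority-class rows until both classes are the same size
rus = RandomUnderSampler(random_state=42)
X_under, y_under = rus.fit_resample(X, y)
print(pd.Series(y_under).value_counts())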
Looking to strengthen your ML skills alongside data preprocessing? upGrad’s Introduction to Natural Language Processing course covers essential NLP techniques such as tokenization, RegEx, phonetic hashing, and spam detection. Enroll Now!
Step 10: Visualize the Data

Visualizations are essential for uncovering patterns, trends, and anomalies that might not be obvious from raw data alone. They help you understand feature distributions, identify outliers, detect missing data, and confirm that the data is ready for modeling.
In this step, we’ll use visualizations to inspect missing data and the distribution of the Age feature before and after preprocessing.
Sample Code (Visualizing Data Before and After Preprocessing):
# Visualize missing data distribution using a heatmap
sns.heatmap(dataset.isnull(), cbar=False, cmap='viridis')
plt.show()
# Visualize the distribution of 'Age' before and after filling missing values
sns.histplot(dataset['Age'], kde=True)
plt.show()
Explanation: sns.heatmap(dataset.isnull(), ...) renders the DataFrame’s missingness mask as a grid, so any remaining missing values appear as contrasting streaks. sns.histplot(..., kde=True) plots the Age distribution with a kernel density curve overlaid.
Output Explanation: After the imputation in Step 6, the heatmap should show no missing-value streaks, and the Age histogram shows a visible spike at the median — a direct trace of median imputation and a useful check on how preprocessing reshaped the distribution.
Additional Visualizations for Data Exploration:
1. Boxplot for Outlier Detection: A boxplot is useful for detecting outliers in numerical features like Fare and Age. It visualizes the distribution, median, and potential outliers as points outside the box’s whiskers.
Sample Code:
sns.boxplot(x=dataset['Fare'])
plt.show()
Explanation: sns.boxplot() creates a boxplot of the Fare feature. Outliers are displayed as points outside the whiskers, helping to identify extreme values that might influence model training.
2. Correlation Heatmap: It helps visualize the relationships between numerical features. This is important for detecting multicollinearity and selecting relevant features for your model.
Sample Code:
corr_matrix = dataset.corr(numeric_only=True)  # Restrict to numeric columns; required in recent pandas
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()
Explanation: dataset.corr(numeric_only=True) computes pairwise Pearson correlations between the numeric columns, and sns.heatmap(..., annot=True) displays them with each coefficient printed in its cell. Strongly correlated feature pairs can signal multicollinearity worth addressing before modeling.
3. Pairplot: This plot visualizes pairwise relationships between selected features. This is particularly useful for understanding interactions between features like Age, Fare, and Pclass and their relationship with the target variable, Survived.
Sample Code:
sns.pairplot(dataset[['Age', 'Fare', 'Pclass', 'Survived']])
plt.show()
Explanation: sns.pairplot() draws a grid of scatter plots for every pair of the selected features, with distributions on the diagonal. It makes relationships easy to spot — for example, how Fare varies with Pclass, or how Age and Fare relate to Survived.
Also Read: 15 Key Techniques for Dimensionality Reduction in Machine Learning
Step 11: Split the Dataset into Training and Testing Sets

Splitting the dataset into training and testing subsets is crucial for a fair evaluation of model performance. The model is trained on the training set and evaluated on the testing set to assess how well it generalizes to unseen data.
Code:
from sklearn.model_selection import train_test_split
# Features (X) and target (y)
X = dataset.drop(columns=['PassengerId', 'Name', 'Ticket', 'Survived'])
y = dataset['Survived']
# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
# Output the shape of the sets
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")
Explanation: train_test_split() shuffles the rows and allocates 80% to training and 20% to testing (test_size=0.2), with random_state=42 making the split reproducible. Identifier and free-text columns (PassengerId, Name, Ticket) are dropped from the features because they carry no generalizable signal.
Output:
Training set shape: (712, 7)
Testing set shape: (179, 7)
Importance of Splitting the Dataset: Evaluating on data the model has never seen guards against overfitting and gives an honest estimate of how the model will generalize.
Notes: If your dataset is imbalanced, consider stratified splitting, which keeps the same proportion of each class in both training and testing sets. In train_test_split(), set stratify=y so that the distribution of the target variable (Survived) is the same in both sets, as shown in the sketch below.
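A minimal sketch of the stratified variant described in the note above:
Sample Code:
# Stratified split: preserves the Survived class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
print(y_train.value_counts(normalize=True))
print(y_test.value_counts(normalize=True))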
With all the preprocessing steps completed, the dataset is now clean, structured, and ready for training machine learning models. By addressing missing data, outliers, imbalanced datasets, and adding useful features, we ensure that the data is in the best possible state for model building.
Looking to strengthen your data analysis skills for machine learning? upGrad’s Introduction to Data Analysis using Excel course provides comprehensive training in data cleaning, analysis, and visualization. Enroll today!
Also Read: Linear Regression Model in Machine Learning: Concepts, Types, And Challenges in 2025
Let's explore key preprocessing techniques that enhance model performance across industries. These methods are crucial for ensuring data quality and model accuracy.
Effective data preprocessing is critical in transforming raw, unstructured data into a clean, structured format suitable for machine learning models. The quality and preparation of the data directly influence the accuracy and reliability of the model's predictions.
Below are advanced techniques that ensure data is optimally prepared for machine learning, enabling models to learn from high-quality information:
1. Data Cleaning
Data cleaning is a critical step in identifying and rectifying errors, inconsistencies, and inaccuracies within the dataset. The process targets elements that can distort model performance, such as duplicate records, inconsistent labels, and impossible values.
Example: Removing duplicate customer records, standardizing inconsistent category labels ('NY' vs. 'New York'), and correcting impossible values such as negative ages.
2. Data Reduction
Data reduction involves reducing the volume of data while retaining critical information. This is often achieved through techniques like feature selection, dimensionality reduction, or data aggregation to enhance computational efficiency and model accuracy.
Example: Applying Principal Component Analysis (PCA) to compress dozens of correlated sensor readings into a few components, or keeping only the top-ranked features by importance.
These techniques are critical in fields like Internet of Things (IoT) and data mining, where large volumes of data need efficient processing and analysis.
3. Data Transformation
Data transformation ensures that the features are in a format suitable for model analysis. It includes techniques like scaling, normalization, encoding categorical variables, and log transformation to ensure that the data meets the assumptions of the model and is ready for analysis.
Example: Log-transforming a right-skewed Fare column, min-max scaling numeric features to [0, 1], or one-hot encoding a categorical Embarked column.
These techniques also matter wherever data crosses system boundaries — for example, when a RESTful API serves features to a model — since consistent, well-typed formats make querying and processing efficient.
4. Data Enrichment
Data enrichment involves enhancing the dataset by adding new, valuable information from external or internal sources, thereby increasing its completeness and relevance. This helps in generating more meaningful features that can improve model performance.
Example: Joining internal customer records with external demographic or geographic data, or appending weather history to sales records to explain demand swings.
5. Data Validation
Data validation ensures that the data is accurate, complete, and consistent. This process involves checking for data integrity, ensuring proper formatting, and verifying the correctness of data values.
Example: Checking that ages fall within a plausible range, that dates parse correctly, and that identifier columns contain no duplicates before training begins.
When developing models in Node.js for customer segmentation or creating Vue.js dashboards to visualize trends, the raw data is often incomplete or contains anomalies. Applying these techniques ensures the data is clean, accurate, and optimized for better model performance.
If you are interested in learning the basics of data visualization, check out upGrad’s Case Study using Tableau, Python, and SQL. This 10-hour free program will help you gain expertise on creating dashboards and analyzing churn rates for applications.
Also Read: ML Types Explained: A Complete Guide to Data Types in Machine Learning
Let's explore how these data preprocessing techniques in machine learning are applied in various industries, showcasing their impact on practical applications.
Data preprocessing is a vital step in machine learning that ensures the dataset is in an optimal state for modeling. The way data is prepared can significantly impact the model's performance, as clean, structured data allows algorithms to identify patterns and make accurate predictions.
Below is a detailed breakdown of the preprocessing techniques and their real-world applications across different industries:
1. Handling Missing Data
Many machine learning algorithms cannot handle missing values directly. Missing data can lead to incomplete models, biased results, or errors during training.
Common Techniques Used: Imputation techniques — the mean, median, or mode, or more advanced methods like KNN imputation and multivariate imputation by chained equations (MICE) — ensure the model learns from complete data, improving the quality of predictions (see the sketch below).
Real-world Application: In healthcare, patient records often have missing vitals or lab values; imputing them lets diagnostic models train on complete records instead of discarding patients.
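A hedged sketch of the MICE-style approach mentioned above, via scikit-learn’s IterativeImputer (still experimental, hence the enabling import); the Titanic numeric columns are reused purely for illustration:
Sample Code:
from sklearn.experimental import enable_iterative_imputer  # noqa: F401 — required to enable the estimator
from sklearn.impute import IterativeImputer
# Each feature with missing values is iteratively modeled from the other features
cols = ['Age', 'SibSp', 'Parch', 'Fare']
imputer = IterativeImputer(random_state=42)
dataset[cols] = imputer.fit_transform(dataset[cols])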
2. Feature Scaling
Different features can have vastly different scales (e.g., Age vs. Income), leading algorithms to give undue weight to certain features.
Common Techniques Used: Feature scaling techniques like Min-Max Scaling (scaling features to a range of [0, 1]) or Standardization (removing the mean and scaling to unit variance) ensure that features contribute equally to the model, improving convergence speed and model performance.
Real-world Application: In finance, credit-scoring models combine features on very different scales (age, income, outstanding balance); scaling keeps high-magnitude features from dominating distance-based algorithms and speeds up gradient-based training.
3. Encoding Categorical Variables
Most machine learning algorithms require numerical input. Categorical features (e.g., Gender, Embarked) must be encoded to allow the algorithm to process them.
Common Techniques Used: Techniques like Label Encoding and One-Hot Encoding are used to convert categorical data into numerical representations. Label Encoding assigns an integer to each category, while One-Hot Encoding creates binary columns for each category, preserving the non-ordinal nature of the categories.
Real-world Application: E-commerce recommendation systems encode categorical attributes such as product category, brand, and user segment so that models can learn purchase patterns across them.
4. Outlier Detection and Removal
Outliers can distort statistical analyses, skew distributions, and affect model accuracy, particularly in algorithms sensitive to data distribution (e.g., linear regression, SVM).
Common Techniques Used: Outliers can be detected using methods such as the Z-score (values with Z-scores greater than 3 standard deviations are considered outliers) or Interquartile Range (IQR). Once identified, outliers can be removed or transformed to prevent them from influencing model training.
Real-world Application: In manufacturing, faulty sensor spikes can distort quality-prediction models; detecting and removing those readings keeps the model focused on true process behavior.
5. Addressing Data Imbalance
Imbalanced datasets (e.g., a classification problem with more instances of one class than another) can lead to biased models that perform poorly on the minority class.
Common Techniques Used: Techniques like SMOTE generate synthetic data points for the minority class, while undersampling reduces the number of majority class instances. These methods ensure balanced class distributions and improve the model’s ability to predict the minority class effectively.
Real-world Application: Fraud detection and rare-disease diagnosis are classic imbalanced problems — positive cases are a small fraction of the data — so resampling (or class weighting) is needed for the model to learn the minority class.
6. Feature Engineering
Raw data may not always contain the best features for model performance. Feature engineering involves creating new, informative features that better represent the underlying patterns in the data.
Common Techniques Used: Examples include creating interaction features, binning continuous features into categorical ones (Age_Group), or extracting date/time components from a timestamp. This process involves applying domain knowledge to enrich the dataset, improving the model’s predictive power.
Real-world Application: In retail demand forecasting, deriving features such as day-of-week, holiday flags, and rolling sales averages from raw timestamps often improves accuracy more than switching algorithms.
7. Data Normalization
Some algorithms, like k-means clustering or support vector machines, are sensitive to the scale of the features. Without normalization, features with larger ranges will dominate the learning process.
Common Techniques Used: Normalization transforms data onto a common scale, ensuring that each feature contributes equally to the model. For example, Min-Max Normalization scales the data to a fixed range, whereas Z-score Normalization standardizes it to a mean of 0 and a standard deviation of 1 (see the comparison sketch below).
Real-world Application: In customer segmentation with k-means, normalizing spend, visit frequency, and tenure ensures clusters reflect overall behavior rather than whichever feature has the largest raw range.
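A small sketch contrasting the two normalization techniques on the same illustrative values:
Sample Code:
import numpy as np
from sklearn.preprocessing import MinMaxScaler, StandardScaler
# A toy feature column with a wide range (values are illustrative)
values = np.array([[18.0], [25.0], [40.0], [62.0], [80.0]])
print("Min-Max [0, 1]:", MinMaxScaler().fit_transform(values).ravel())
print("Z-score (mean 0, std 1):", StandardScaler().fit_transform(values).ravel())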
Effective data preprocessing in machine learning is key to building reliable models, ensuring they perform optimally on practical tasks. By refining data quality, we enable algorithms to learn more accurately and make better predictions.
Data preprocessing in machine learning ensures that raw data is cleaned, transformed, and appropriately structured for efficient model training. It involves tasks like handling missing values, detecting and removing outliers, performing feature scaling, and encoding categorical variables. These steps enhance the performance of ML models, enabling algorithms to identify meaningful patterns and make reliable predictions.
However, applying these concepts with confidence in practical scenarios can be challenging. That’s where upGrad steps in, with expert-led programs that blend theory with hands-on training.
Here are a few additional upGrad courses to help you get started:
Want to gain expertise in machine learning in 2025? Reach out to upGrad for personalized counseling and expert guidance. You can also visit your nearest upGrad offline center to explore the right learning path for your goals.
Reference:
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
Dataset:
Titanic Data