
Data Preprocessing in Machine Learning: 11 Key Steps You Must Know!

By Kechit Goyal

Updated on Jun 27, 2025 | 33 min read | 160.74K+ views


Did you know? In machine learning, data quality is as important as data quantity. More data can improve model performance, but its quality and structure make the real difference. In fact, data practitioners spend up to 80% of their time on data preprocessing and management, ensuring the data is properly prepared for machine learning models.

Data preprocessing in machine learning transforms raw data into a clean, structured format suitable for model training. This process involves handling missing values, encoding categorical variables, and scaling numerical data. 

These steps are essential for ensuring the model functions effectively. Data preprocessing is widely used in applications such as fraud detection, sentiment analysis, and recommendation systems.

In this blog, you’ll discover 11 key steps and effective techniques of data preprocessing. We will also explore its critical role in enhancing the performance of machine learning models.

Want to strengthen your machine learning skills for effective data preprocessing and analysis? upGrad’s Artificial Intelligence & Machine Learning - AI ML Courses can equip you with tools and strategies to stay ahead in your career. Enroll today!

Data Preprocessing in Machine Learning: 11 Key Steps

Data preprocessing is a critical step in the machine learning pipeline that transforms raw data into a format suitable for modeling. This process involves cleaning, structuring, and enhancing the dataset to improve model accuracy and efficiency.

By handling missing values, detecting outliers, encoding categorical variables, and addressing data imbalances, we ensure that the model can learn meaningful patterns without being influenced by irrelevant or incomplete data. 

Want to enhance your understanding of key ML concepts like data preprocessing and advance your career? Check out upGrad's courses to gain the knowledge and hands-on experience needed to handle large datasets and improve your ML models with confidence.

Let's explore the 11 key steps to prepare your dataset for machine learning model training. We will use the Titanic dataset from Kaggle for detailed explanations and code samples.

Step 1: Acquiring the Dataset

Before starting data preprocessing, it’s essential to acquire a dataset that aligns with your modeling objectives. The quality, structure, and source of the dataset significantly impact the downstream steps, such as feature engineering, model validation, and performance.

1. Data Sources: Well-established platforms such as Kaggle, UCI Machine Learning Repository, and open APIs like Quandl and OpenWeatherMap offer structured, labeled datasets ideal for experimentation and benchmarking. Be sure to validate that the dataset is properly maintained and updated.

2. File Formats:

  • Choose file formats compatible with your toolchain, such as CSV for tabular data, JSON for nested key-value pairs, or XLSX for business spreadsheets.
  • Pandas in Python supports CSV and JSON natively, making it easy to load the data.
  • For large-scale datasets, formats like Parquet or even SQL databases may be preferred for efficient storage and querying.

3. Use Case Alignment: Ensure the dataset matches the goals of your machine learning project. For example:

  • Customer churn models: Transactional data, such as customer activity and demographics.
  • Disease prediction models: Diagnostic data from Electronic Medical Records (EMR) systems or public health datasets.

4. Quality Consideration: Always evaluate the dataset for errors, duplicates, or inconsistencies. Good data preprocessing in machine learning starts with clean, well-structured data, and addressing these quality issues is key.

5. Dataset Size and Performance: The size of the dataset will determine the algorithms you can use. Large datasets may require deep learning approaches, while smaller datasets may work well with traditional machine learning algorithms.

6. Licensing and Ethics: Ensure that the dataset you acquire is properly licensed and ethically sourced, especially if it is used for research or commercial purposes. Review licensing agreements to ensure proper use.
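
To make the file-format point above concrete, here is a minimal sketch of loading a few common formats with Pandas. The file names are placeholders, and reading Parquet additionally requires the pyarrow or fastparquet package:

Sample Code:

import pandas as pd

# Placeholder file names; replace with your own dataset files
df_csv = pd.read_csv('data.csv')              # tabular data
df_json = pd.read_json('data.json')           # nested key-value records
df_parquet = pd.read_parquet('data.parquet')  # columnar format suited to large datasets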

Step 2: Setting the Working Directory

Before loading your dataset, it’s essential to set the working directory. This defines the folder where your dataset is stored and ensures that your program can access it correctly. It's especially important in environments where the default working directory is not automatically set.

Sample Code:

import os

# Set the working directory where your dataset is stored
# Replace '/path/to/your/dataset' with the actual path to your dataset
os.chdir('/path/to/your/dataset')  # Adjust to the actual path where the dataset is located

# Verify the current working directory
print("Current working directory:", os.getcwd())

Explanation:

  • os.chdir() changes the current working directory to the specified path. This allows easy access to files stored in that directory.
  • os.getcwd() returns the current working directory. This can be used to verify that the directory was correctly changed.

Error Handling: To make sure the path exists, you can add an error handling check:

if os.path.exists('/path/to/your/dataset'):
    os.chdir('/path/to/your/dataset')
    print("Working directory set successfully!")
else:
    print("Error: The specified directory does not exist.")

Explanation: os.path.exists() checks if the provided directory path exists. If it does, the program changes to that directory; otherwise, it prints an error message.

Portable Path Usage: For better portability across different operating systems (e.g., Windows, macOS, Linux), it's a good practice to use os.path.join() to create file paths:

dataset_path = os.path.join('path', 'to', 'your', 'dataset')
os.chdir(dataset_path)

This ensures that the code works on any operating system, avoiding issues with file path separators (\ vs /).

Output Explanation:

  • The os.getcwd() method will confirm the directory change by printing the current working directory.
  • Error handling ensures the code won’t crash if the provided directory path is invalid.

Strengthen your expertise in data preprocessing and other core statistical concepts with upGrad's Master’s in Artificial Intelligence and Machine Learning – IIITB Program. Advance your career by using Copilot for instant code generation and error debugging.

Step 3: Importing Essential Libraries for Data Preprocessing

Efficient data preprocessing in machine learning requires specialized libraries that handle numerical operations, data manipulation, and visualizations. The following Python libraries are crucial for tasks like cleaning, analyzing, and visualizing the data to prepare it for modeling.

  • NumPy: Optimized for numerical operations and array manipulations. It provides powerful tools for working with large datasets and matrices.
  • Pandas: A versatile library that provides DataFrame structures for handling and manipulating structured data, making it easy to clean, filter, and transform data.
  • Matplotlib & Seaborn: Visualization libraries to help explore feature distributions, identify outliers, and create informative plots. While Matplotlib is more flexible, Seaborn builds on it for statistical plotting with simpler syntax and better aesthetics.

Code:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Simple verification
print("All libraries imported successfully.")

Explanation:

  • NumPy (np): This library is used for handling numerical data, especially arrays and matrices, and supports advanced mathematical operations.
  • Pandas (pd): Provides DataFrame and Series structures to work with structured/tabular data and supports a wide variety of data manipulation tasks such as filtering, joining, and aggregating data.
  • Matplotlib (plt) & Seaborn (sns):
    • Matplotlib is used for creating static, animated, and interactive visualizations. It’s highly customizable and can be used for a wide range of plot types.
    • Seaborn is built on top of Matplotlib and provides a higher-level interface to create more attractive and informative statistical plots easily.

Output: This output confirms that all the necessary libraries are now ready for use. You can proceed with tasks such as normalization, missing value handling, outlier detection, and data visualization. These libraries will help in efficiently processing and analyzing data, making your model preparation process smoother.

All libraries imported successfully.

Additional Libraries to Consider:

  • Scikit-learn: Useful for machine learning tasks, including data preprocessing operations like scaling, imputation, and encoding categorical variables.
  • SciPy: A library that builds on NumPy and provides additional functionality for optimization, integration, and statistics.
  • Imbalanced-learn: A library specifically designed to handle imbalanced datasets, providing functions for over-sampling, under-sampling, and SMOTE.

Also Read: 50 Statistical Functions in Microsoft Excel

Step 4: Load the Dataset into Your Workspace

Now, it’s time to load the dataset into your workspace, which ensures the data is ready for processing. Loading the data correctly is crucial as it sets the stage for all subsequent preprocessing steps.

Code:

# Load the dataset
dataset = pd.read_csv('train.csv')  # Adjust the file path as needed

# Preview the dataset
print(dataset.head())

# Optionally, preview the structure of the dataset
print(dataset.info())  # Shows column data types and non-null counts

Explanation:

  • pd.read_csv(): Reads the CSV file into a Pandas DataFrame, which is a 2D table-like structure for easy manipulation. This is the most common method for loading CSV files.
    • Note: You can also load datasets in other formats using pd.read_excel() for XLSX files, pd.read_json() for JSON files, or pd.read_sql() for SQL queries.
  • dataset.head(): Displays the first 5 rows of the dataset to give a quick overview of the data, making it easier to spot any obvious issues or understand the data structure.
  • dataset.info(): Displays useful information about the dataset, such as column names, data types, non-null values, and memory usage.

Output:

  • The head() function outputs the first five rows of the dataset, allowing you to confirm its structure and get a snapshot of the columns (e.g., PassengerId, Survived, Pclass, Fare, and Embarked).
  • info() provides a concise summary, displaying data types, the number of non-null values, and memory usage, helping identify missing values and understanding the dataset’s general structure.
    PassengerId  Survived  Pclass  ...     Fare Cabin Embarked
    0            1         0       3  ...   7.2500   NaN        S
    1            2         1       1  ...  71.2833   C85        C
    2            3         1       3  ...   7.9250   NaN        S
    3            4         1       1  ...  53.1000  C123        S
    4            5         0       3  ...   8.0500   NaN        S

Error Handling Example (Optional): If the file path is incorrect or the file doesn’t exist, the code will throw a FileNotFoundError. You can add error handling to catch such errors:

Code Example: This example uses a try-except block to handle potential errors when loading the dataset.

try:
    dataset = pd.read_csv('train.csv')
    print("Dataset loaded successfully!")
except FileNotFoundError:
    print("Error: The file 'train.csv' was not found. Please check the path.")

Explanation:

  • try: The code attempts to load the dataset using pd.read_csv('train.csv'). If the file is found, it prints a success message.
  • except FileNotFoundError: If the file is not found, a FileNotFoundError is raised, and the program prints an error message instead of crashing.

Example Output:

If the file is found:

Dataset loaded successfully!

If the file is not found:

Error: The file 'train.csv' was not found. Please check the path.

This ensures the program handles missing files gracefully without crashing.

Also Read: Machine Learning Basics: Key Concepts and Essential Elements Explained

Step 5: Understand the Dataset

Understanding the dataset's structure is essential for identifying potential issues such as missing values, skewed distributions, or data type mismatches. This step helps determine the right preprocessing techniques for each feature.

Code:

# Inspect dataset dimensions and summary statistics
print(dataset.shape)

# Statistical summary for numeric columns
print(dataset.describe())

# Missing value count per column
print(dataset.isnull().sum())

# Check data types of each column
print(dataset.dtypes)

Explanation:

  • dataset.shape: Returns the number of rows and columns in the dataset. This is useful to understand the size of your dataset.
  • dataset.describe(): Provides summary statistics for numerical columns, such as:
    • count: Number of non-null entries.
    • mean: Average value.
    • std: Standard deviation.
    • min: Minimum value.
    • 25%, 50%, 75%: Quartiles (25th, median, and 75th percentiles).
    • max: Maximum value.
  • dataset.isnull().sum(): Counts the number of missing (NaN) values in each column, helping to identify features that may need imputation or removal.
  • dataset.dtypes: Displays the data type of each column. This helps to check if any columns need conversion (e.g., converting "Sex" from string to a category or numeric type).

Output: The dataset contains 891 rows and 12 columns.

(891, 12)

Summary:
     PassengerId  ...        Fare
count   891.000000  ...  891.000000
mean    446.000000  ...   32.204208
std     257.353842  ...   49.693429

Missing Values:
Age         177
Cabin       687
Embarked      2

Explanation:

  • describe() method: It shows that Fare has a wide range (from 0 to about 512) with a mean of roughly 32, while Age has a mean of around 30 and a standard deviation of about 14. The "count" value shows how many valid entries there are for each column.
  • Columns with missing entries (like Age with 177 missing values and Cabin with 687 missing values) need attention.
  • Missing Values:
    • Age has 177 missing values, which could be imputed (mean, median, or model-based imputation).
    • Cabin has 687 missing values, suggesting it might need to be dropped due to a high proportion of missing data.
    • Embarked has only 2 missing values, which could likely be imputed using the mode (most frequent value).

Key Considerations:

1. Skewed Distributions: After reviewing the summary statistics, check for skewness in features like Age or Fare. For highly skewed distributions, consider applying transformations (e.g., log transformation).

2. Handling Categorical Features: After checking the dtypes, make sure that categorical variables like Sex and Embarked are encoded properly before feeding them into models. If they are still strings, you may need to use Label Encoding or One-Hot Encoding.
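
As a quick, minimal sketch of the checks suggested above (column names follow the Titanic example):

Sample Code:

# Measure skewness of numeric features; a strong positive skew suggests a log transform
print("Fare skew:", dataset['Fare'].skew())
print("Age skew:", dataset['Age'].skew())

# List columns still stored as strings that will need encoding later
print(dataset.select_dtypes(include='object').columns.tolist())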

Enhance your understanding of machine learning and advance your skills with upGrad’s Advanced Generative AI Certification Course. In just 5 months, gain expertise in prompt engineering and GenAI-powered workflows to automate tasks.

Step 6: Handle Missing Data

Properly handling missing data is essential for model performance, as missing values can introduce bias and reduce the quality of your predictions. Depending on the nature of the feature and the amount of missing data, you can either drop or impute missing values.

Code:

# Handle missing data
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())  # Impute missing 'Age' with the median
dataset = dataset.drop(columns=['Cabin'])  # Drop the 'Cabin' column due to high missing data
dataset['Embarked'] = dataset['Embarked'].fillna(dataset['Embarked'].mode()[0])  # Impute missing 'Embarked' with the mode

# Check for missing values
print(dataset.isnull().sum())

Explanation:

  • fillna(): Replaces missing values with a specified value. In this case:
    • For Age, we use the median because it's a numerical feature, and median imputation is more robust to outliers compared to the mean.
    • For Embarked, we use the mode (most frequent value) since it's a categorical feature.
  • drop(): Removes the Cabin column entirely because it has a significant amount of missing data. Dropping columns with more than 50% missing values is a common strategy.
  • isnull().sum(): This checks if any missing values remain in the dataset, confirming that all missing data has been addressed.

Output:

  • All missing values have been addressed: the Age column was imputed with the median, the Embarked column was imputed with the mode, and the Cabin column was dropped.
  • After this step, the dataset is now complete and ready for further preprocessing steps, such as feature encoding, scaling, or training a machine learning model.

PassengerId    0
Survived       0
Pclass         0
Name           0
Sex            0
Age            0
SibSp          0
Parch          0
Ticket         0
Fare           0
Embarked       0

Key Considerations:

1. Imputation Strategies:

  • Numerical features: For features like Age, imputation with the mean, median, or KNN imputation is commonly used. The median is generally preferred for skewed distributions.
  • Categorical features: For features like Embarked or Sex, imputation with the mode is commonly used. Alternatively, most frequent values or a prediction model (such as regression or classification) can be employed.

2. Dropping vs. Imputing:

  • When the amount of missing data is minimal (usually less than 5-10%), imputation is the preferred method. This is because dropping rows or columns can result in data loss.
  • When a feature has a significant proportion of missing data (e.g., more than 50%), it may be better to drop the feature entirely, as it won't provide enough information for the model to learn.
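
For the model-based imputation mentioned above, scikit-learn's KNNImputer fills a missing value from the most similar rows. A minimal sketch, shown here as an alternative to the median fill used earlier (not part of the main pipeline):

Sample Code:

from sklearn.impute import KNNImputer

# Impute 'Age' using the 5 most similar rows, based on a few numeric columns
numeric_cols = ['Age', 'Fare', 'SibSp', 'Parch', 'Pclass']
knn_imputer = KNNImputer(n_neighbors=5)

dataset_knn = dataset.copy()
dataset_knn[numeric_cols] = knn_imputer.fit_transform(dataset_knn[numeric_cols])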

Also Read: Binomial Theorem: Mean, SD, Properties & Related Terms

Step 7: Check for Outliers

Outliers can skew the results and significantly affect model performance. Detecting and handling outliers is a crucial step in the data preprocessing process. In this step, we'll use both visualization and statistical methods to detect outliers in the dataset, particularly in numerical features like Fare.

Sample Code:

# Boxplot to detect outliers in 'Fare'
sns.boxplot(x=dataset['Fare'])
plt.show()

# Z-score method for 'Fare'
from scipy.stats import zscore
z_scores = zscore(dataset['Fare'])
dataset = dataset[np.abs(z_scores) < 3]  # Remove rows where the absolute z-score for 'Fare' exceeds 3

Explanation:

  • Box Plots:
    • The boxplot is used to visualize the distribution of the Fare column and identify extreme values (outliers) by showing the IQR. Outliers are typically any values that fall outside 1.5 times the IQR (above the upper quartile or below the lower quartile).
    • It gives a visual overview of the spread of the data, including the median, quartiles, and potential outliers.
  • Z-score method:
    • The Z-score measures how many standard deviations away a data point is from the mean. Outliers are defined as values that are more than 3 standard deviations away from the mean.
    • In this case, we remove rows where the absolute Z-score for Fare exceeds 3, meaning the data point is extreme in either direction.

Output: The boxplot visualizes the distribution of the Fare column, showing potential outliers as points outside the whiskers of the box. The Z-score method removes rows where the Z-score is greater than 3, ensuring the dataset is cleaned of extreme outliers.

Key Considerations:

1. Z-score Method: The Z-score method works well when the data is normally distributed. If the data is skewed, other methods like IQR-based filtering might be more appropriate.

2. Other Methods for Handling Outliers:

  • IQR (Interquartile Range): You can calculate the IQR to detect outliers. Typically, values outside 1.5 times the IQR above the 75th percentile or below the 25th percentile are considered outliers.
  • Visualization with Histograms: In addition to the boxplot, histograms or Kernel Density Estimates (KDEs) can help visualize the overall distribution of a feature and identify any extreme values.
  • Outlier Removal vs. Transformation: If outliers are important to the problem (for example, in financial datasets), instead of removing them, you could transform the data to reduce the impact of outliers.
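
Here is a minimal sketch of the IQR-based filtering mentioned above, applied to Fare:

Sample Code:

# IQR-based outlier filter for 'Fare'
Q1 = dataset['Fare'].quantile(0.25)
Q3 = dataset['Fare'].quantile(0.75)
IQR = Q3 - Q1

lower, upper = Q1 - 1.5 * IQR, Q3 + 1.5 * IQR
dataset_iqr = dataset[dataset['Fare'].between(lower, upper)]
print("Rows kept after IQR filtering:", len(dataset_iqr))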

Step 8: Feature Engineering

Feature engineering is a crucial step to improving model performance. It involves creating new features, transforming existing ones, or combining multiple features to extract more meaningful information from the data. Properly engineered features can significantly improve the performance of ML models by providing them with more useful and informative inputs.

Feature engineering can include:

  • Creating new features from existing ones: For example, combining Age and Pclass to create a new feature like Age_Class.
  • Transforming features for better model compatibility: This includes scaling, normalizing, encoding categorical features, or applying mathematical transformations.

Sample Code:

# Feature Engineering: Create a new feature for 'Age_Group'
dataset['Age_Group'] = pd.cut(dataset['Age'], bins=[0, 12, 18, 30, 50, 80], 
                               labels=["Child", "Teen", "Adult", "Middle_Aged", "Senior"])

# Preview the dataset with new feature
print(dataset[['Age', 'Age_Group']].head())

Explanation:

  • pd.cut(): This function is used to bin numerical data into discrete intervals. In this case, the Age column is divided into categories such as Child, Teen, Adult, etc.
    • bins: Specifies the intervals (in years) for categorizing the Age data.
    • labels: Specifies the labels for the bins created in the Age column. The resulting feature will categorize each passenger into one of the defined age groups.
  • Creating Age_Group: The new Age_Group column is based on the Age of each passenger. This new feature can help capture age-related patterns in the data that could improve the model's performance.

Output:

  • A new feature, Age_Group, has been created based on Age. Each passenger has been categorized into one of the five age groups: Child, Teen, Adult, Middle_Aged, or Senior.

This feature can potentially help the model better differentiate between passengers from different age groups and may improve prediction accuracy.

Age  Age_Group
0  22.0      Adult
1  38.0      Adult
2  26.0      Adult
3  35.0      Adult
4  35.0      Adult

Additional Feature Engineering Techniques:

1. Combining Features: You can create new features by combining existing ones.

Sample Code:

dataset['Age_Pclass'] = dataset['Age'] * dataset['Pclass']

Explanation: This creates a new feature Age_Pclass by multiplying Age and Pclass. This may help capture relationships between age and class (e.g., younger passengers in higher classes might have different survival rates).

2. Extracting Information from Dates: If you have datetime columns, you can extract meaningful features, such as day of the week, month, year, or time of day.

Sample Code:

# The Titanic dataset has no datetime column, so 'Date' below is a hypothetical example
dataset['Year'] = pd.to_datetime(dataset['Date'], errors='coerce').dt.year

Explanation: This converts a datetime column to Pandas datetime format and extracts the year using the .dt accessor. (Note: the Titanic dataset does not actually contain a date column, so 'Date' here is purely illustrative.)

3. Handling Categorical Features:

You might need to encode categorical features before passing them to models. For instance, Label Encoding or One-Hot Encoding can be used to convert categories into numeric values.

Sample Code:

dataset = pd.get_dummies(dataset, columns=['Embarked'], drop_first=True)

Explanation: pd.get_dummies() creates binary columns for each category in Embarked, while drop_first=True drops the column for the first category to avoid multicollinearity (the dummy variable trap).

4. Scaling Features: Features such as Age or Fare may have large numeric ranges, so scaling or normalizing them can help improve model performance, especially for algorithms sensitive to feature scale.

Sample Code:

from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
dataset[['Age', 'Fare']] = scaler.fit_transform(dataset[['Age', 'Fare']])

Explanation: StandardScaler() standardizes the Age and Fare columns by removing the mean and scaling to unit variance, ensuring that these features contribute equally to the model.

5. Log Transformations: If a feature is highly skewed (such as Fare), a log transformation can help normalize it.

Sample Code:

dataset['Fare'] = np.log1p(dataset['Fare'])

Explanation: np.log1p() applies a logarithmic transformation to the Fare column to handle skewness, making the distribution more symmetric.

Also Read: 15 Essential Advantages of Machine Learning for Businesses in 2025

Step 9: Handle Data Imbalance

Data imbalance can significantly affect model performance, especially in classification tasks. When one class in the target variable is underrepresented compared to others, the model may become biased towards the majority class, leading to inaccurate predictions for the minority class. To handle this, techniques like oversampling and undersampling can be used.

Code (Oversampling with SMOTE):

from imblearn.over_sampling import SMOTE

# Separate features and target
# SMOTE works on numeric features only, so drop the target and keep numeric columns here
# (encode useful categorical columns such as 'Sex' beforehand if you want to retain them)
X = dataset.drop(columns=['Survived']).select_dtypes(include=[np.number])
y = dataset['Survived']

# Apply SMOTE for oversampling
smote = SMOTE(random_state=42)
X_res, y_res = smote.fit_resample(X, y)

# Verify the class distribution after resampling
print(pd.Series(y_res).value_counts())

Explanation:

  • SMOTE: SMOTE generates synthetic data points for the minority class by creating samples that are similar to the existing minority class instances. This helps balance the class distribution and prevents the model from being biased towards the majority class.
    • random_state=42 ensures that the sampling process is reproducible, so the results can be consistent across different runs of the code.
  • fit_resample(): This function applies SMOTE to the features X and target y, generating synthetic samples for the minority class and returning the resampled datasets X_res and y_res.
  • pd.Series(y_res).value_counts(): This checks the class distribution of the resampled target variable to ensure that both classes are now balanced.

Output: After applying SMOTE, the class distribution of the target variable Survived is now balanced.

  • Both classes (0 and 1) have 549 samples, ensuring that the model will treat both classes with equal importance.
  • This helps in improving the prediction accuracy for the minority class.

0    549
1    549
Name: Survived, dtype: int64

Key Considerations:

1. Oversampling vs. Undersampling:

  • Oversampling (e.g., SMOTE) creates synthetic data for the minority class, helping balance the dataset while preserving all data. It’s especially useful for smaller datasets.
  • Undersampling reduces the majority class but may discard important information. SMOTE is generally preferred when you want to retain as much information as possible.

2. Evaluation Metrics: Accuracy can be misleading in imbalanced datasets since a model might predict the majority class most of the time. Instead, focus on Precision, Recall, F1-Score, and ROC AUC, which provide a more balanced evaluation of performance.

3. Impact on Model Performance:

  • Balancing the dataset helps the model focus on both classes equally, improving its ability to predict the minority class. Using SMOTE in small, imbalanced datasets helps reduce model bias.
  • For large datasets, SMOTE should be used carefully to avoid overfitting, as it may generate too many synthetic samples that don’t represent real-world data.
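
To act on the metrics advice above, scikit-learn can summarize precision, recall, F1-score, and ROC AUC once a model has been trained. A minimal sketch using tiny illustrative arrays; in practice, use your held-out test labels and the model's predictions:

Sample Code:

from sklearn.metrics import classification_report, roc_auc_score

# Tiny illustrative arrays; replace with your real test labels and model outputs
y_test  = [0, 0, 1, 1, 0, 1]
y_pred  = [0, 1, 1, 1, 0, 0]
y_proba = [0.2, 0.6, 0.9, 0.8, 0.1, 0.4]

print(classification_report(y_test, y_pred))
print("ROC AUC:", roc_auc_score(y_test, y_proba))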

Looking to strengthen your ML skills alongside data preprocessing? upGrad’s Introduction to Natural Language Processing course covers essential NLP techniques such as tokenization, RegEx, phonetic hashing, and spam detection. Enroll Now!

Step 10: Visualizations

Visualizations are essential for uncovering patterns, trends, and anomalies in the data that might not be immediately obvious from raw data alone. They help you better understand the distributions of features, identify outliers, detect missing data, and ensure that the data is ready for modeling.

In this step, we’ll use visualizations to inspect missing data and the distribution of the Age feature before and after preprocessing.

Sample Code (Visualizing Data Before and After Preprocessing):

# Visualize missing data distribution using a heatmap
sns.heatmap(dataset.isnull(), cbar=False, cmap='viridis')
plt.show()

# Visualize the distribution of 'Age' (run the same plot before Step 6 to compare pre- and post-imputation)
sns.histplot(dataset['Age'], kde=True)
plt.show()

Explanation:

  • The first visualization shows the missing data distribution using a heatmap.
  • The second visualization plots the Age distribution using histplot(); running it both before and after the imputation in Step 6 shows how filling missing values changes the shape.

Output Explanation:

  • Missing Data Heatmap: The heatmap will show where the missing data points exist within the dataset. Each column will represent a feature, and rows will represent individual samples. Missing values will be highlighted by a different color (e.g., dark spots on the heatmap), making it easy to locate them visually.
  • Age Distribution Before and After Imputation: The histogram shows the distribution of Age after missing values were filled. Comparing it with the same plot made before Step 6 reveals how imputation changes the shape of the distribution (median imputation, for example, adds a spike at the median).

Additional Visualizations for Data Exploration:

1. Boxplot for Outlier Detection: A boxplot is useful for detecting outliers in numerical features like Fare and Age. It visualizes the distribution, median, and potential outliers as points outside the box’s whiskers.

Sample Code:

sns.boxplot(x=dataset['Fare'])
plt.show()

Explanation: sns.boxplot() creates a boxplot of the Fare feature. Outliers are displayed as points outside the whiskers, helping to identify extreme values that might influence model training.

2. Correlation Heatmap: It helps visualize the relationships between numerical features. This is important for detecting multicollinearity and selecting relevant features for your model.

Sample Code:

corr_matrix = dataset.corr(numeric_only=True)  # numeric_only skips non-numeric columns such as Name or Sex
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm')
plt.show()

Explanation:

  • dataset.corr(numeric_only=True) calculates the correlation matrix between numerical features, ignoring text columns.
  • sns.heatmap() visualizes the matrix with annotations to show the strength of the correlations.
  • The coolwarm colormap visually distinguishes positive and negative correlations, helping identify strongly correlated features that might require dimensionality reduction or feature selection.

3. Pairplot: This plot visualizes pairwise relationships between selected features. This is particularly useful for understanding interactions between features like Age, Fare, and Pclass and their relationship with the target variable, Survived.

Sample Code:

sns.pairplot(dataset[['Age', 'Fare', 'Pclass', 'Survived']])
plt.show()

Explanation:

  • sns.pairplot() creates a grid of scatter plots for the selected features (Age, Fare, Pclass, Survived).
  • The diagonal of the grid shows the distribution of each feature, and the off-diagonal plots show the pairwise relationships between them.
  • This helps in detecting trends and correlations between the features, making it easier to spot any meaningful patterns before model training.

Also Read: 15 Key Techniques for Dimensionality Reduction in Machine Learning

Step 11: Split the Dataset

Splitting the dataset into training and testing subsets is crucial to ensure a fair evaluation of model performance. The model is trained on the training set and evaluated on the testing set to assess its generalization capabilities on unseen data.

Code:

from sklearn.model_selection import train_test_split

# Features (X) and target (y)
X = dataset.drop(columns=['PassengerId', 'Name', 'Ticket', 'Survived'])
y = dataset['Survived']

# Split the dataset (80% training, 20% testing)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Output the shape of the sets
print(f"Training set shape: {X_train.shape}")
print(f"Testing set shape: {X_test.shape}")

Explanation:

  • train_test_split(): This function splits the dataset into training and testing subsets. It takes the features (X) and target (y) and divides them into training (80%) and testing (20%) sets. The test_size=0.2 parameter ensures that 20% of the data is reserved for testing.
    • random_state=42 ensures reproducibility, meaning that each time you run the code with the same random state, the dataset will be split in the same way.
  • Features (X): All columns except the identifiers (PassengerId, Name, Ticket) and the target (Survived) are used as features.
  • Target (y): The target variable (Survived) is what the model will try to predict.

Output:

  • The training set contains 712 samples (80% of the dataset) and 7 features.
  • The testing set contains 179 samples (20% of the dataset) and 7 features.
  • This split ensures that the model is trained on one subset and tested on a completely separate subset, allowing for a more accurate evaluation of the model's performance.

Training set shape: (712, 7)
Testing set shape: (179, 7)

Importance of Splitting the Dataset

  • Training and Testing: This split helps assess the model's ability to generalize to new, unseen data. Training the model on the entire dataset and then testing it on the same data would lead to overfitting, where the model performs well on the training data but poorly on unseen data.
  • Ensures Fair Evaluation: By holding out the test data during training, you can evaluate the model’s true predictive power, ensuring that it’s not biased by having seen the test data during training.

Note: If your dataset is imbalanced, consider stratified splitting (keeping the same proportion of each class in both the training and testing sets). In train_test_split(), you can set stratify=y so that the distribution of the target variable (Survived) is the same in both sets, as shown in the sketch below.
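
Sample Code:

# Stratified split: preserve the 'Survived' class ratio in both subsets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)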

With all the preprocessing steps completed, the dataset is now clean, structured, and ready for training machine learning models. By addressing missing data, outliers, imbalanced datasets, and adding useful features, we ensure that the data is in the best possible state for model building.

Looking to strengthen your data analysis skills for machine learning? upGrad’s Introduction to Data Analysis using Excel course provides comprehensive training in data cleaning, analysis, and visualization. Enroll today!

Also Read: Linear Regression Model in Machine Learning: Concepts, Types, And Challenges in 2025

Let's explore key preprocessing techniques that enhance model performance across industries. These methods are crucial for ensuring data quality and model accuracy.

5 Key Techniques for Effective Data Preprocessing

Effective data preprocessing is critical in transforming raw, unstructured data into a clean, structured format suitable for machine learning models. The quality and preparation of the data directly influence the accuracy and reliability of the model's predictions.

Below are advanced techniques that ensure data is optimally prepared for machine learning, enabling models to learn from high-quality information:

1. Data Cleaning

Data cleaning is a critical step in identifying and rectifying errors, inconsistencies, and inaccuracies within the dataset. This process involves following elements that can distort model performance:

  • Missing Data Handling: Missing values can be addressed using imputation techniques like mean, median, mode, or advanced methods such as KNN imputation or multiple imputation by chained equations. Choosing the right method depends on the nature of the missing data (MCAR, MAR, or MNAR).
  • Duplicate Removal: Removing duplicate records ensures that repeated data points do not skew the model. This is especially important in large datasets where duplicates might be introduced during data collection or merging.
  • Outlier Treatment: Outliers can significantly impact models like linear regression or SVMs. Detection methods such as the Z-score (values beyond 3 standard deviations) or the IQR method (values outside 1.5 times the interquartile range) can help identify and address extreme values.

Example:

  • Imputing missing Age values using the median or replacing missing Embarked values with the mode.
  • Removing duplicate records of the same transaction or user in a customer dataset.
  • Correcting mis-entered values, such as the literal string "NaN" or placeholder entries like "12345" in a phone number column.
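
A minimal sketch of the cleaning operations listed above, reusing the Titanic DataFrame loaded earlier (the "NaN" string replacement is only needed if missing values were exported as text):

Sample Code:

# Drop exact duplicate rows
dataset = dataset.drop_duplicates()

# Treat literal "NaN" strings as real missing values before imputation
dataset = dataset.replace('NaN', np.nan)

# Impute as shown earlier: median for numeric columns, mode for categorical columns
dataset['Age'] = dataset['Age'].fillna(dataset['Age'].median())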

2. Data Reduction

Data reduction involves reducing the volume of data while retaining critical information. This is often achieved through techniques like feature selection, dimensionality reduction, or data aggregation to enhance computational efficiency and model accuracy.

  • Feature Selection: Reduces the number of input variables by identifying and retaining the most relevant features. Methods like mutual information, RFE, and L1 regularization (Lasso) can help eliminate redundant or irrelevant features.
  • Dimensionality Reduction: Techniques like Principal Component Analysis (PCA), t-SNE, or Autoencoders help reduce the feature space by transforming high-dimensional data into fewer dimensions. PCA, for example, identifies the directions that maximize variance, enabling us to reduce data dimensions while preserving critical information.

Example:

  • Using PCA to reduce the number of features in an image dataset while maintaining the most important variations in pixel values.
  • Employing L1 regularization to select a subset of important features for a predictive model in a sales forecast problem.

These techniques are critical in fields like Internet of Things (IoT) and data mining, where large volumes of data need efficient processing and analysis.
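
As a minimal sketch of the dimensionality-reduction idea above, scikit-learn's PCA can keep just enough components to explain most of the variance. Numeric columns only, scaled first because PCA is sensitive to feature scale:

Sample Code:

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# Use only numeric, non-missing data; PCA cannot handle NaNs or text columns
numeric_data = dataset.select_dtypes(include=[np.number]).dropna()
scaled = StandardScaler().fit_transform(numeric_data)

# Keep enough principal components to explain ~95% of the variance
pca = PCA(n_components=0.95)
reduced = pca.fit_transform(scaled)
print(reduced.shape, pca.explained_variance_ratio_.sum())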

3. Data Transformation

Data transformation ensures that the features are in a format suitable for model analysis. It includes techniques like scaling, normalization, encoding categorical variables, and log transformation to ensure that the data meets the assumptions of the model and is ready for analysis.

  • Scaling and Normalization: Min-Max scaling transforms features to a range of [0, 1], while Z-score standardization (mean = 0, standard deviation = 1) standardizes the distribution of features. This is essential for models like KNN, SVM, or neural networks, which are sensitive to the scale of input data.
  • One-Hot Encoding: Categorical variables are transformed into binary columns, where each column represents a category. This is essential for algorithms like logistic regression or decision trees.
  • Log Transformation: For skewed distributions, a log transformation can help normalize data, especially for features with exponential growth patterns (like income or house prices).

Example:

  • Scaling income data for a loan prediction model to ensure that loan amount and income features are comparable.
  • One-Hot Encoding the Embarked feature (C, Q, S) for a Titanic dataset.
  • Applying log transformation to the Fare feature to reduce its skewness before training a regression model.

These transformations are also important when data is exposed through RESTful APIs or production pipelines, where a consistent, preprocessed format makes querying and downstream processing more efficient.
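
A minimal sketch of Min-Max scaling on the Titanic columns used earlier (an alternative to the StandardScaler shown in Step 8):

Sample Code:

from sklearn.preprocessing import MinMaxScaler

# Scale 'Age' and 'Fare' into the [0, 1] range
scaler = MinMaxScaler()
dataset[['Age', 'Fare']] = scaler.fit_transform(dataset[['Age', 'Fare']])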

4. Data Enrichment

Data enrichment involves enhancing the dataset by adding new, valuable information from external or internal sources, thereby increasing its completeness and relevance. This helps in generating more meaningful features that can improve model performance.

  • Merging External Data: Enriching data by integrating external datasets (e.g., weather data, demographic information, or economic indicators) can provide additional context for the problem being solved.
  • Creating Derived Features: New features can be derived from existing data, such as creating age groups from the age column or deriving customer tenure from the subscription start date.

Example:

  • Enriching a sales dataset by appending weather data to account for seasonal sales fluctuations.
  • Generating new features like customer lifetime value from transactional history for a churn prediction model.
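
A minimal sketch of both enrichment ideas, using hypothetical file and column names (sales.csv, weather.csv, date, signup_date):

Sample Code:

# Merge external weather data onto a sales table by date (hypothetical files and columns)
sales = pd.read_csv('sales.csv')
weather = pd.read_csv('weather.csv')
enriched = sales.merge(weather, on='date', how='left')

# Derived feature: customer tenure in days from a subscription start date
enriched['tenure_days'] = (pd.Timestamp.today() - pd.to_datetime(enriched['signup_date'])).dt.days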

5. Data Validation

Data validation ensures that the data is accurate, complete, and consistent. This process involves checking for data integrity, ensuring proper formatting, and verifying the correctness of data values.

  • Data Integrity Checks: Verifying the consistency of data entries, like ensuring email columns follow proper formats and that age values are within a reasonable range.
  • Data Type Validation: Ensuring that numeric columns contain only numerical values and categorical columns contain appropriate categories (e.g., no misclassified numeric data in a city field).
  • Consistency Validation: Ensuring that there are no conflicting values (e.g., start date greater than end date).

Example:

  • Validating that a date of birth feature has reasonable ranges (e.g., age > 0).
  • Ensuring that price values are not negative or that transaction date is formatted correctly in a financial dataset.
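
A minimal sketch of such checks on a small, hypothetical table; each assert raises immediately if a rule is violated:

Sample Code:

import pandas as pd

# Tiny illustrative frame with hypothetical columns
df = pd.DataFrame({
    'price': [10.0, 25.5, 3.2],
    'age': [34, 51, 22],
    'start_date': pd.to_datetime(['2024-01-01', '2024-02-01', '2024-03-01']),
    'end_date': pd.to_datetime(['2024-01-31', '2024-02-28', '2024-03-31']),
})

# Integrity checks: fail fast if the data violates basic expectations
assert (df['price'] >= 0).all(), "price should not be negative"
assert df['age'].between(0, 120).all(), "age outside a plausible range"
assert (df['start_date'] <= df['end_date']).all(), "start_date occurs after end_date"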

Whether you are building customer segmentation models or dashboards to visualize trends (for example, with a Node.js backend and a Vue.js front end), the raw data is often incomplete or contains anomalies. Applying these techniques ensures the data is clean, accurate, and optimized for better model performance.

If you are interested in learning the basics of data visualization, check out upGrad’s Case Study using Tableau, Python, and SQL. This 10-hour free program will help you gain expertise on creating dashboards and analyzing churn rates for applications. 

Also Read: ML Types Explained: A Complete Guide to Data Types in Machine Learning

Let's explore how these data preprocessing techniques in machine learning are applied in various industries, showcasing their impact on practical applications.

Importance of Data Preprocessing in ML Across Industries

Data preprocessing is a vital step in machine learning that ensures the dataset is in an optimal state for modeling. The way data is prepared can significantly impact the model's performance, as clean, structured data allows algorithms to identify patterns and make accurate predictions.

Below is a detailed breakdown of the preprocessing techniques and their real-world applications across different industries:

1. Handling Missing Data

Many machine learning algorithms cannot handle missing values directly. Missing data can lead to incomplete models, biased results, or errors during training.

Common Techniques Used: Imputation techniques (e.g., using the mean, median, mode, or more advanced methods like KNN imputation or multivariate imputation by chained equations) ensure that the model learns from complete data, improving the quality of predictions.

Real-time Application:

  • Healthcare: In medical datasets (e.g., patient records), missing values in features like Age, Blood Pressure, or Diagnosis can be imputed, allowing for a more complete analysis of patient health and improving predictive models for disease outcomes or risk predictions.
  • E-commerce: Missing product details (like Price, Brand, or Category) can be imputed, ensuring a more accurate recommendation system for users.

2. Feature Scaling

Different features can have vastly different scales (e.g., Age vs. Income), leading algorithms to give undue weight to certain features.

Common Techniques Used: Feature scaling techniques like Min-Max Scaling (scaling features to a range of [0, 1]) or Standardization (removing the mean and scaling to unit variance) ensure that features contribute equally to the model, improving convergence speed and model performance.

Real-time Application:

  • Financial Industry: Stock price prediction models use features like trading volume and price data. Without proper scaling, certain features may dominate the model, leading to inaccurate predictions.
  • Sports Analytics: Models predicting player performance or team rankings must scale features such as points per game, assists, and minutes played to ensure equal importance.

3. Encoding Categorical Variables

Most machine learning algorithms require numerical input. Categorical features (e.g., Gender, Embarked) must be encoded to allow the algorithm to process them.

Common Techniques Used: Techniques like Label Encoding and One-Hot Encoding are used to convert categorical data into numerical representations. Label Encoding assigns an integer to each category, while One-Hot Encoding creates binary columns for each category, preserving the non-ordinal nature of the categories.

Real-time Application:

  • Retail: Customer segmentation often involves categorical variables such as Age group, Gender, and Location. These need to be encoded to build models for targeted marketing and product recommendations.
  • Telecom: Customer churn prediction models use categorical variables like Subscription Plan and Region, which need encoding to analyze and predict customer behavior.
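
A minimal sketch contrasting the two encodings on a hypothetical customer table:

Sample Code:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Hypothetical customer data
customers = pd.DataFrame({'Gender': ['F', 'M', 'F'], 'Region': ['North', 'South', 'East']})

# Label Encoding: one integer per category (implies an ordering, so use with care)
customers['Gender_encoded'] = LabelEncoder().fit_transform(customers['Gender'])

# One-Hot Encoding: one binary column per category, no implied order
customers = pd.get_dummies(customers, columns=['Region'])
print(customers)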

4. Outlier Detection and Removal

Outliers can distort statistical analyses, skew distributions, and affect model accuracy, particularly in algorithms sensitive to data distribution (e.g., linear regression, SVM).

Common Techniques Used: Outliers can be detected using methods such as the Z-score (values with an absolute Z-score above 3) or the Interquartile Range (IQR). Once identified, outliers can be removed or transformed to prevent them from influencing model training.

Real-time Application:

  • Banking: In fraud detection, outliers in transaction amounts or frequencies can indicate suspicious activity. Removing or handling outliers allows fraud detection models to focus on normal transaction patterns.
  • Manufacturing: In predictive maintenance, removing outliers in sensor readings (such as temperature or pressure) can improve the accuracy of models predicting equipment failures.

5. Addressing Data Imbalance

Imbalanced datasets (e.g., a classification problem with more instances of one class than another) can lead to biased models that perform poorly on the minority class.

Common Techniques Used: Techniques like SMOTE generate synthetic data points for the minority class, while undersampling reduces the number of majority class instances. These methods ensure balanced class distributions and improve the model’s ability to predict the minority class effectively.

Real-time Application:

  • Healthcare: In disease prediction, the number of healthy patients may greatly outnumber those with a disease, causing models to be biased. SMOTE can balance the dataset, improving predictions for minority diseases.
  • Credit Risk Modeling: In loan approval systems, the minority class (e.g., defaulting customers) may be underrepresented, leading to inaccurate predictions. Balancing the data helps in predicting defaults more accurately.

6. Feature Engineering

Raw data may not always contain the best features for model performance. Feature engineering involves creating new, informative features that better represent the underlying patterns in the data.

Common Techniques Used: Examples include creating interaction features, binning continuous features into categorical ones (Age_Group), or extracting date/time components from a timestamp. This process involves applying domain knowledge to enrich the dataset, improving the model’s predictive power.

Real-time Application:

  • E-commerce: Customer behavior models can be improved by creating new features from raw data (e.g., time spent on website * items in cart) for better product recommendations.
  • Energy Sector: Feature engineering can be used to create new features from time-series data (e.g., daily energy consumption patterns), helping in demand forecasting and price optimization.

7. Data Normalization

Some algorithms, like k-means clustering or support vector machines, are sensitive to the scale of the features. Without normalization, features with larger ranges will dominate the learning process.

Common Techniques Used: Normalization transforms data onto a common scale, ensuring that each feature contributes equally to the model. For example, Min-Max Normalization scales the data to a fixed range, whereas Z-score Normalization standardizes the data to have a mean of 0 and a standard deviation of 1.

Real-time Application:

  • Finance: In risk assessment models, normalizing financial data (like assets and liabilities) ensures that no single variable dominates the decision-making process.
  • Transportation: In route optimization, normalizing data like travel time, cost, and distance ensures that each feature contributes equally to model predictions.

Effective data preprocessing in machine learning is key to building reliable models, ensuring they perform optimally on practical tasks. By refining data quality, we enable algorithms to learn more accurately and make better predictions.

If you want to gain expertise in machine learning with cloud computing, check out upGrad’s Professional Certificate Program in Cloud Computing and DevOps. This program will help you build the core principles of DevOps, AWS, GCP, and more.

Advance Your Skills in Data Preprocessing and Machine Learning with upGrad

Data preprocessing in machine learning ensures that raw data is cleaned, transformed, and appropriately structured for efficient model training. It involves tasks like handling missing values, detecting and removing outliers, performing feature scaling, and encoding categorical variables. These steps enhance the performance of ML models, enabling algorithms to identify meaningful patterns and make reliable predictions.

However, applying these concepts with confidence in practical scenarios can be challenging. That’s where upGrad steps in, with expert-led programs that blend theory with hands-on training.

upGrad also offers additional courses in machine learning and data science to help you get started.

Want to gain expertise in machine learning and data preprocessing in 2025? Reach out to upGrad for personalized counseling and expert guidance. You can also visit your nearest upGrad offline center to explore the right learning path for your goals.


Reference:
https://www.forbes.com/sites/gilpress/2016/03/23/data-preparation-most-time-consuming-least-enjoyable-data-science-task-survey-says/
Dataset:
Titanic Data 

Frequently Asked Questions (FAQs)

1. What are the common challenges in data preprocessing in machine learning?

2. How do data imbalances affect machine learning models?

3. What is the importance of normalizing data in machine learning?

4. What is the difference between data cleaning and data preprocessing?

5. How do you handle categorical variables in data preprocessing?

6. What is feature engineering, and why is it important?

7. How can data preprocessing improve the performance of neural networks?

8. What is the role of data augmentation in machine learning preprocessing?

9. How does data preprocessing in machine learning differ for time-series data?

10. Can data preprocessing in machine learning affect model interpretability?

11. What is the importance of handling outliers in data preprocessing?

Kechit Goyal

