How to Perform Cross-Validation in Machine Learning?
Updated on Mar 28, 2025 | 19 min read | 10.6k views
Cross-validation is a critical step in model selection, helping you evaluate the performance of machine learning models and avoid overfitting. It allows you to assess how well a model generalizes to unseen data.
This is done by splitting the dataset into multiple subsets for training and validation. By using cross-validation, you can identify the best model for your project, improving its accuracy and reliability.
This blog will walk you through the cross-validation process, different types of cross-validation techniques, and how they can help you select the most effective model.
Cross-validation in machine learning ensures your model performs well on unseen data and helps prevent overfitting. It lets you evaluate how well your model generalizes to different subsets of your data, making it a crucial tool for selecting the best model.
Here’s why cross-validation matters in model selection: it gives you a clearer, more reliable picture of how your model will perform in real conditions, so the model you select is not only accurate but also generalizes well to new data.
Also Read: Top 14 Most Common Data Mining Algorithms You Should Know
Now that you have a basic understanding of cross-validation and model selection, let’s dive into how you can perform them in practice.
Each step of the cross-validation process serves a specific purpose in strengthening the model's ability to generalize to new data. By breaking the dataset into multiple subsets and using different combinations of training and testing sets, cross-validation reduces the bias that might arise from random splits.
Let’s break down how you can perform cross-validation in machine learning and the key metrics involved.
Start by splitting your dataset into K equal-sized subsets (folds). In K-fold cross-validation, the model is trained on K-1 folds and validated on the remaining fold, and the process is repeated K times so that each fold serves as the validation set exactly once.
Sample Code:
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.model_selection import KFold
# Load the Iris dataset
data = load_iris()
X, y = data.data, data.target
# Initialize KFold with 5 splits
kf = KFold(n_splits=5)
# Print the splits (training and validation data indices)
for train_index, test_index in kf.split(X):
    print("Train indices:", train_index, "\nTest indices:", test_index)
Explanation: You can use KFold from sklearn.model_selection to split the data into 5 folds. The kf.split() method generates indices for the train and test sets in each iteration.
Output (first fold, abridged):
Train indices: [ 30  31  32 ... 147 148 149]
Test indices: [ 0  1  2 ... 27 28 29]
With 150 samples and 5 folds, each validation fold contains 30 consecutive indices.
The splitting ensures that every data point gets a chance to be part of the training and validation sets, providing a better evaluation.
Also Read: Cross-Validation in Python: Everything You Need to Know About
For each fold, train the model using the training set (K-1 folds) and evaluate it on the validation set (the remaining fold). This gives you a performance score for each fold.
Sample Code:
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
# Initialize the Logistic Regression model
model = LogisticRegression(max_iter=200)
# Initialize KFold
kf = KFold(n_splits=5)
accuracy_scores = []
# Perform cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Test the model
    y_pred = model.predict(X_test)

    # Evaluate accuracy
    accuracy = accuracy_score(y_test, y_pred)
    accuracy_scores.append(accuracy)
print(f"Accuracy Scores for each fold: {accuracy_scores}")
Explanation: For each fold, you split the data into train and test sets. You train the Logistic Regression model on the train set and make predictions on the test set. The accuracy for each fold is calculated and stored.
Output:
Accuracy Scores for each fold: [0.9667, 1.0, 1.0, 0.9667, 0.9667]
If you are using Scikit-learn, the cross_val_score() function automatically splits the data, trains the model, and returns the performance scores for each fold.
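For instance, here is a minimal sketch using cross_val_score with the X, y, and model objects defined above:
from sklearn.model_selection import cross_val_score

# cross_val_score splits the data, fits the model on each training fold,
# and returns one score per validation fold (accuracy by default for classifiers)
scores = cross_val_score(model, X, y, cv=5)
print("Scores per fold:", scores)
print("Mean accuracy:", scores.mean())
This replaces the manual loop above with a single call, which is handy once you are comfortable with what each fold represents.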
Also Read: 10 Interesting R Project Ideas For Beginners [2025]
Once you have the scores from each fold, you calculate the average performance score. Common metrics include:
Accuracy: Measures the proportion of correctly classified instances over the total instances. It's suitable when the classes are balanced.
F1-Score: The harmonic mean of precision and recall, used when you need to balance both metrics.
Precision and Recall: These two metrics are crucial when dealing with imbalanced datasets, where one class is much more frequent than the other. Precision tells you the proportion of positive predictions that were actually correct. In contrast, recall indicates how many of the actual positives the model was able to identify.
The relationship between precision and recall is often an inverse one: improving one can sometimes lead to a decrease in the other. For example, if you adjust the model to be more selective and increase precision, it may miss some true positives, decreasing recall. On the other hand, focusing on recall may increase false positives, which would decrease precision.
The choice between precision and recall depends on the problem at hand: when false positives are costly (for example, flagging a legitimate email as spam), precision matters more; when false negatives are costly (for example, missing a disease diagnosis), recall matters more.
In many applications, you’ll want to balance precision and recall, which can be achieved using the F1 score, the harmonic mean of both metrics.
These metrics help evaluate your model’s ability to generalize beyond just accuracy, particularly in imbalanced data situations.
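As a quick numeric sketch of that harmonic-mean relationship (the values below are illustrative, not taken from the Iris example):
# Illustrative precision and recall values
precision, recall = 0.8, 0.6

# F1 is the harmonic mean: 2PR / (P + R)
f1 = 2 * (precision * recall) / (precision + recall)
print(round(f1, 3))  # 0.686
Because the harmonic mean is pulled toward the smaller of the two values, a model cannot score a high F1 by excelling at only one of precision or recall.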
Sample Code:
from sklearn.metrics import precision_score, recall_score, f1_score
# Initialize lists to store scores
precision_scores, recall_scores, f1_scores = [], [], []
# Perform cross-validation
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]

    # Train the model
    model.fit(X_train, y_train)

    # Test the model
    y_pred = model.predict(X_test)

    # Calculate precision, recall, and F1 score
    precision = precision_score(y_test, y_pred, average='weighted')
    recall = recall_score(y_test, y_pred, average='weighted')
    f1 = f1_score(y_test, y_pred, average='weighted')

    precision_scores.append(precision)
    recall_scores.append(recall)
    f1_scores.append(f1)
print(f"Precision Scores: {precision_scores}")
print(f"Recall Scores: {recall_scores}")
print(f"F1 Scores: {f1_scores}")
Explanation: You calculate precision, recall, and F1 scores for each fold, using the weighted average to handle multiclass classification.
Output:
Precision Scores: [0.96, 1.0, 1.0, 0.96, 0.96]
Recall Scores: [0.96, 1.0, 1.0, 0.96, 0.96]
F1 Scores: [0.96, 1.0, 1.0, 0.96, 0.96]
After completing cross-validation, you’ll have multiple performance scores (one for each fold). Calculate the mean score and the standard deviation across the folds. The mean score gives the overall performance, and the standard deviation shows how consistent the model is across different subsets.
Sample Code:
import numpy as np
# Calculate the mean and standard deviation of the accuracy scores
mean_accuracy = np.mean(accuracy_scores)
std_accuracy = np.std(accuracy_scores)
print(f"Mean Accuracy: {mean_accuracy}")
print(f"Standard Deviation of Accuracy: {std_accuracy}")
Explanation: The mean gives an overall performance score, while the standard deviation reflects how consistent the model is across different folds.
Output:
Mean Accuracy: 0.98
Standard Deviation of Accuracy: 0.016
A model with high variance in its cross-validation scores might indicate overfitting.
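One way to check for this, assuming the model, kf, X, and y from the earlier steps, is to compare per-fold training scores against validation scores using cross_validate:
from sklearn.model_selection import cross_validate

# return_train_score=True also records the score on each training fold;
# a large gap between train and test scores (or a high test-score spread)
# is a warning sign of overfitting
cv_results = cross_validate(model, X, y, cv=kf, return_train_score=True)
print("Train scores:", cv_results['train_score'])
print("Test scores: ", cv_results['test_score'])
print("Test-score std:", np.std(cv_results['test_score']))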
After running cross-validation on multiple models (e.g., decision trees, SVM, random forest), compare the performance metrics of each model. The one with the best and most consistent performance across the folds should be your final model.
Sample Code:
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
# Define models
models = {
    "Logistic Regression": LogisticRegression(max_iter=200),
    "Random Forest": RandomForestClassifier(),
    "SVM": SVC()
}
# Perform cross-validation for each model
for name, model in models.items():
    accuracy_scores = []
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train, y_test = y[train_index], y[test_index]
        model.fit(X_train, y_train)
        y_pred = model.predict(X_test)
        accuracy = accuracy_score(y_test, y_pred)
        accuracy_scores.append(accuracy)
    mean_accuracy = np.mean(accuracy_scores)
    print(f"{name} Mean Accuracy: {mean_accuracy}")
Explanation: This code tests multiple models and outputs the mean accuracy for each, helping you compare their performance.
Output:
Logistic Regression Mean Accuracy: 0.98
Random Forest Mean Accuracy: 1.0
SVM Mean Accuracy: 0.98
If you’re tuning hyperparameters, you can combine cross-validation with grid search or random search. This allows you to evaluate different hyperparameter combinations and choose the one that gives the best cross-validation score.
Tools to use: GridSearchCV or RandomizedSearchCV in Scikit-learn.
Sample Code:
from sklearn.model_selection import GridSearchCV
# Define the parameter grid
param_grid = {'C': [0.1, 1, 10], 'kernel': ['linear', 'rbf']}
# Apply GridSearchCV with cross-validation
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X, y)
print(f"Best Parameters: {grid_search.best_params_}")
print(f"Best Score: {grid_search.best_score_}")
Explanation: This code performs hyperparameter tuning using GridSearchCV, automatically tuning parameters and evaluating the best-performing combination via cross-validation.
Output:
Best Parameters: {'C': 1, 'kernel': 'rbf'}
Best Score: 0.98
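If the parameter grid is large, RandomizedSearchCV samples a fixed number of combinations instead of trying every one. Here is a minimal sketch with the same SVC model (the distribution bounds are illustrative):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import loguniform

# Sample 10 random parameter combinations, each scored with 5-fold cross-validation
param_dist = {'C': loguniform(0.01, 100), 'kernel': ['linear', 'rbf']}
random_search = RandomizedSearchCV(SVC(), param_dist, n_iter=10, cv=5, random_state=42)
random_search.fit(X, y)
print(f"Best Parameters: {random_search.best_params_}")
print(f"Best Score: {random_search.best_score_}")
Random search is usually the better choice when some hyperparameters are continuous or the full grid would be too large to evaluate exhaustively.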
This step-by-step process helps in performing cross-validation effectively, evaluating multiple models, and selecting the one that offers the best performance. This ensures reliable model selection.
Also Read: Python Cheat Sheet: From Fundamentals to Advanced Concepts for 2025
Next, let’s explore the different types of cross-validation for model selection.
Cross-validation is a powerful technique used in machine learning to evaluate model performance. The choice of cross-validation method largely depends on the dataset and problem at hand.
Let’s explore the most common types of cross-validation, when to use them, and how to implement them in Python.
K-Fold cross-validation splits the dataset into K equal-sized subsets (or folds). The model is trained on K-1 of these folds and validated on the remaining fold. This process repeats K times, with each fold serving as the validation set once. This method provides a more robust evaluation by using different portions of the data for both training and validation.
What sets K-Fold apart from a simple train-test split is that it helps reduce bias by averaging the performance over multiple test sets. This ensures the model generalizes better.
While more computationally expensive, especially for large datasets, K-Fold cross-validation offers a more accurate estimate of model performance. Also, it is less sensitive to random variations in the data split.
When to Use: K-Fold cross-validation is most effective when working with homogeneous datasets, particularly when there is no significant class imbalance. For instance, in a customer churn prediction task, where the dataset consists of a balanced number of customers who left and those who stayed, K-Fold cross-validation can be ideal.
This method ensures that the model is validated across different data subsets, providing a solid understanding of how it generalizes to unseen data. When the dataset is balanced and the problem is not highly complex, K-Fold delivers a good evaluation without the added complexity of specialized methods like stratified sampling or leave-one-out approaches.
Code Example:
from sklearn.model_selection import KFold, cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
# Load dataset
data = load_iris()
X, y = data.data, data.target
# Initialize KFold
kf = KFold(n_splits=5, shuffle=True, random_state=42)
# Model
model = RandomForestClassifier()
# Perform cross-validation
scores = cross_val_score(model, X, y, cv=kf)
print("K-Fold Cross Validation Scores:", scores)
Output:
K-Fold Cross Validation Scores: [1. 0.96 0.96 0.96 1. ]
This indicates how well the model performed across 5 different data splits.
Stratified K-Fold cross-validation is an enhancement of the standard K-Fold method, where the data is divided into K subsets, but with a key difference: each fold maintains the same class distribution as the original dataset. This ensures that the proportion of each class in the training and validation sets is consistent across all folds.
This method is particularly beneficial when dealing with imbalanced datasets, where some classes are underrepresented. In such cases, traditional K-Fold cross-validation might result in folds that don't accurately represent the minority class, leading to biased model evaluation.
Stratified K-Fold mitigates this issue by ensuring that all classes are properly represented in each fold, resulting in a more reliable performance estimate for imbalanced classification problems.
When to Use: Stratified K-Fold cross-validation is ideal for datasets with imbalanced classes, such as fraud detection. In this case, the dataset may have a significantly larger number of non-fraudulent transactions compared to fraudulent ones.
Each fold maintains the same proportion of fraud and non-fraud cases as the original dataset, resulting in a more accurate evaluation. It prevents the model from being biased towards the majority class. This ensures that the model's performance on rare events (fraudulent transactions) is properly assessed and optimized.
Code Example:
from sklearn.model_selection import StratifiedKFold
# Initialize StratifiedKFold
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
# Perform cross-validation
stratified_scores = cross_val_score(model, X, y, cv=skf)
print("Stratified K-Fold Cross Validation Scores:", stratified_scores)
Output:
Stratified K-Fold Cross Validation Scores: [1. 1. 1. 1. 1. ]
Here, you can see how each fold maintains class distribution and yields consistent results.
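To see the stratification directly, you can print the class counts in each validation fold (a small check using numpy's bincount and the skf object defined above):
import numpy as np

# With 150 Iris samples and 5 stratified folds, each validation fold
# should contain about 10 samples of each of the 3 classes
for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    print(f"Fold {fold} class counts:", np.bincount(y[test_index]))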
Leave-One-Out Cross-Validation (LOOCV) is a special case of K-Fold cross-validation where K is set to the number of data points in the dataset. In LOOCV, each data point is used once as the validation set, and the remaining n-1 data points are used to train the model. This process is repeated for every data point in the dataset, resulting in as many training-validation cycles as there are data points.
While LOOCV provides a very thorough evaluation and ensures that every data point is used for both training and validation, it is computationally expensive, especially for large datasets. Since the model is trained and validated n times, it can be time-consuming. However, LOOCV is particularly useful when dealing with small datasets, where every individual data point is valuable for model evaluation.
When to Use: Leave-One-Out Cross-Validation (LOOCV) is particularly useful when working with small datasets, where every data point is valuable. For example, in rare disease prediction, where data is limited and each sample carries significant weight, LOOCV ensures that every data point is used as both training and validation data.
This method maximizes the use of available data, providing a more reliable performance estimate. However, LOOCV can be computationally expensive for larger datasets, so it’s best suited for situations where the dataset is small enough to handle the increased computational load.
Code Example:
from sklearn.model_selection import LeaveOneOut
# Initialize LeaveOneOut
loo = LeaveOneOut()
# Perform cross-validation
loo_scores = cross_val_score(model, X, y, cv=loo)
print("Leave-One-Out Cross Validation Scores:", loo_scores)
Output:
Leave-One-Out Cross Validation Scores: [1. 1. 1. 1. 1. ... ]
This method ensures that each data point is validated once, but for large datasets, it can be very slow due to the large number of iterations.
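You can confirm how many fits LOOCV requires with get_n_splits, which simply equals the number of samples:
# One train/validate cycle per sample: 150 iterations for the Iris dataset
print("Number of LOOCV iterations:", loo.get_n_splits(X))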
ShuffleSplit is a variation of cross-validation that generates multiple train-test splits by randomly shuffling the data and splitting it into train and test sets. This method is useful when you want to ensure a random distribution in each split.
When to Use: ShuffleSplit is useful when you need control over the number of train-test splits and prefer randomization. For instance, in customer segmentation, you can evaluate the model’s performance on different data subsets with each split.
Across multiple iterations, most data points end up being used for both training and testing (though, unlike K-Fold, ShuffleSplit does not guarantee that every point appears in a test set), which helps gauge generalization and guard against overfitting.
Code Example:
from sklearn.model_selection import ShuffleSplit
# Initialize ShuffleSplit
ss = ShuffleSplit(n_splits=5, test_size=0.2, random_state=42)
# Perform cross-validation
shuffle_split_scores = cross_val_score(model, X, y, cv=ss)
print("ShuffleSplit Cross Validation Scores:", shuffle_split_scores)
Output:
ShuffleSplit Cross Validation Scores: [0.96 0.98 0.96 1. 1. ]
Explanation: This technique ensures random train-test splits, which could help test the model’s robustness across different data subsets.
Scikit-learn offers built-in iterators like KFold and StratifiedKFold to streamline the process of splitting the data for cross-validation. These iterators automatically handle the logic of dividing the dataset into K folds and ensuring the data is shuffled (if needed). KFold divides the data into equally sized folds, while StratifiedKFold ensures that each fold maintains the same class distribution as the original dataset, making it ideal for imbalanced datasets.
Using these iterators eliminates the need for manual data handling, ensuring consistency and reducing the chances of errors. Additionally, they offer convenient features like controlling the shuffle of data before splitting, making it easier to implement cross-validation with minimal code. These iterators make the process efficient and are highly customizable, allowing users to tweak the number of splits or control how the data is distributed across folds.
KFold Iterator:
from sklearn.model_selection import KFold
kf = KFold(n_splits=5)
for train_index, test_index in kf.split(X):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test model here
StratifiedKFold Iterator:
from sklearn.model_selection import StratifiedKFold
skf = StratifiedKFold(n_splits=5)
for train_index, test_index in skf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Train and test model here
By using Scikit-learn's built-in functions, you can easily implement these techniques and enhance your model’s robustness.
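As a closing sketch that ties these pieces together, the snippet below combines a shuffled StratifiedKFold with cross_validate to compute several metrics in one call (it assumes the model, X, and y from the earlier examples):
from sklearn.model_selection import StratifiedKFold, cross_validate

# Shuffled, stratified 5-fold CV with accuracy and weighted F1 computed together
skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
results = cross_validate(model, X, y, cv=skf, scoring=['accuracy', 'f1_weighted'])
print("Accuracy per fold:", results['test_accuracy'])
print("Weighted F1 per fold:", results['test_f1_weighted'])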
Also Read: Data Structures in Python
You’ll get a better understanding of cross-validation and model selection by seeing how they are used in real-life applications.
Cross-validation is not just a theoretical concept but a practical, powerful tool used across industries. By ensuring that a model is robust and generalizes well to unseen data, it helps businesses make more informed and reliable decisions.
Below are some real-life examples showcasing how cross-validation and model selection impact industries:
1. Fraud Detection in Finance
Fraud detection is crucial in the finance industry for minimizing risks and losses. Cross-validation plays a key role in evaluating machine learning models used for detecting fraudulent transactions. By ensuring the models generalize well across different transaction patterns, cross-validation prevents overfitting to past data.
Impact: Cross-validation helps financial institutions test multiple models, ensuring that the chosen model doesn’t just memorize past fraudulent patterns but performs reliably on unseen data. This minimizes false positives (incorrectly flagged transactions), thus improving fraud detection accuracy.
Example: Cross-validation compares models like logistic regression, decision trees, and random forests, identifying the best-performing algorithm for real-time fraud detection.
Outcome: Better model selection via cross-validation leads to more accurate fraud detection, reducing financial losses and reputational damage.
2. Recommendation Systems in E-Commerce
E-commerce platforms rely on recommendation systems to suggest products to users based on their behaviors and preferences. Cross-validation helps evaluate recommendation models to prevent overfitting to specific users’ past behavior, ensuring suggestions remain relevant and accurate over time.
Impact: Cross-validation ensures the recommendation algorithms are robust and perform well on new user data. It helps businesses select the best model for delivering personalized recommendations, enhancing user experience and driving sales.
Example: Cross-validation is used to test collaborative filtering models and hybrid approaches, helping e-commerce platforms determine which model maximizes user engagement.
Outcome: A well-validated recommendation system improves customer satisfaction, boosts conversion rates, and increases overall sales.
3. Predictive Maintenance in Manufacturing
Predictive maintenance relies on data from sensors, machine logs, and historical records to forecast when equipment will fail. Cross-validation is crucial in testing the models that predict these failures, ensuring they generalize to new equipment and varying operational conditions.
Impact: Cross-validation enables manufacturers to assess model performance across different maintenance scenarios. This ensures predictive maintenance models are robust, helping avoid unexpected downtimes and reduce maintenance costs.
Example: Cross-validation helps evaluate machine learning models like random forests and support vector machines, ensuring they accurately predict machinery failures.
Outcome: Better model accuracy leads to timely maintenance interventions, reducing downtime and saving operational costs.
4. Medical Diagnostics and Healthcare
In healthcare, cross-validation is essential to evaluate diagnostic models that predict diseases or health outcomes from medical data, like images or patient records. Cross-validation ensures that these models are not overfitting to the training data, leading to more reliable results.
Impact: By using cross-validation, healthcare providers can select the best-performing diagnostic models, ensuring they perform well on new patient data. This results in more accurate diagnoses and improved patient care.
Example: Cross-validation is used to evaluate convolutional neural networks (CNNs) for detecting diseases from medical images, ensuring the model can generalize to new image datasets.
Outcome: Accurate model selection improves diagnostic accuracy, leading to better treatment plans and overall health outcomes for patients.
5. Customer Segmentation in Marketing
Customer segmentation is crucial for targeting the right audience with the right campaigns. Cross-validation helps evaluate clustering models like K-means and DBSCAN, ensuring that the segmentation is meaningful and applies to the entire customer base.
Impact: Cross-validation prevents the segmentation model from overfitting to one specific segment, ensuring that the customer segments identified are valid across the entire dataset. This helps businesses make more data-driven marketing decisions.
Example: Cross-validation compares clustering algorithms like K-means and DBSCAN to determine which one produces the most accurate and meaningful customer segments based on purchasing behavior.
Outcome: Accurate customer segmentation through cross-validation leads to more targeted marketing campaigns, driving higher engagement and increased sales.
By selecting the right model using cross-validation, companies can improve efficiency, reduce costs, and enhance user experience.
Also Read: Machine Learning Course Syllabus: A Complete Guide to Your Learning Path
Now that you have a good knowledge of cross-validation and model selection, let’s explore how upGrad can take your learning journey forward.
Still unsure about how to implement cross-validation and select the best model for your project? upGrad’s specialized certification courses will guide you through advanced techniques in cross-validation, model selection, and performance evaluation.
Gain hands-on experience in optimizing your models and make more informed data-driven decisions.
upGrad offers several relevant courses you can enroll in to build these skills.
If you're unsure about the next step in your learning journey, you can contact upGrad’s personalized career counseling for guidance on choosing the best path tailored to your goals. You can also visit your nearest upGrad center and start hands-on training today!