Explore Courses
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Birla Institute of Management Technology Birla Institute of Management Technology Post Graduate Diploma in Management (BIMTECH)
  • 24 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Popular
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science & AI (Executive)
  • 12 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
University of MarylandIIIT BangalorePost Graduate Certificate in Data Science & AI (Executive)
  • 8-8.5 Months
upGradupGradData Science Bootcamp with AI
  • 6 months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
OP Jindal Global UniversityOP Jindal Global UniversityMaster of Design in User Experience Design
  • 12 Months
Popular
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Rushford, GenevaRushford Business SchoolDBA Doctorate in Technology (Computer Science)
  • 36 Months
IIIT BangaloreIIIT BangaloreCloud Computing and DevOps Program (Executive)
  • 8 Months
New
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Popular
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
Golden Gate University Golden Gate University Doctor of Business Administration in Digital Leadership
  • 36 Months
New
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
Popular
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
Bestseller
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
IIIT BangaloreIIIT BangalorePost Graduate Certificate in Machine Learning & Deep Learning (Executive)
  • 8 Months
Bestseller
Jindal Global UniversityJindal Global UniversityMaster of Design in User Experience
  • 12 Months
New
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in AI and Emerging Technologies (Blended Learning Program)
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
ESGCI, ParisESGCI, ParisDoctorate of Business Administration (DBA) from ESGCI, Paris
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration From Golden Gate University, San Francisco
  • 36 Months
Rushford Business SchoolRushford Business SchoolDoctor of Business Administration from Rushford Business School, Switzerland)
  • 36 Months
Edgewood CollegeEdgewood CollegeDoctorate of Business Administration from Edgewood College
  • 24 Months
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with Concentration in Generative AI
  • 36 Months
Golden Gate University Golden Gate University DBA in Digital Leadership from Golden Gate University, San Francisco
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA by Liverpool Business School
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA (Master of Business Administration)
  • 15 Months
Popular
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Business Administration (MBA)
  • 12 Months
New
Deakin Business School and Institute of Management Technology, GhaziabadDeakin Business School and IMT, GhaziabadMBA (Master of Business Administration)
  • 12 Months
Liverpool John Moores UniversityLiverpool John Moores UniversityMS in Data Science
  • 18 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityMaster of Science in Artificial Intelligence and Data Science
  • 12 Months
Bestseller
IIIT BangaloreIIIT BangalorePost Graduate Programme in Data Science (Executive)
  • 12 Months
Bestseller
O.P.Jindal Global UniversityO.P.Jindal Global UniversityO.P.Jindal Global University
  • 12 Months
WoolfWoolfMaster of Science in Computer Science
  • 18 Months
New
Liverpool John Moores University Liverpool John Moores University MS in Machine Learning & AI
  • 18 Months
Popular
Golden Gate UniversityGolden Gate UniversityDBA in Emerging Technologies with concentration in Generative AI
  • 3 Years
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (AI/ML)
  • 36 Months
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDBA Specialisation in AI & ML
  • 36 Months
Golden Gate University Golden Gate University Doctor of Business Administration (DBA)
  • 36 Months
Bestseller
Ecole Supérieure de Gestion et Commerce International ParisEcole Supérieure de Gestion et Commerce International ParisDoctorate of Business Administration (DBA)
  • 36 Months
Rushford, GenevaRushford Business SchoolDoctorate of Business Administration (DBA)
  • 36 Months
Liverpool Business SchoolLiverpool Business SchoolMBA with Marketing Concentration
  • 18 Months
Bestseller
Golden Gate UniversityGolden Gate UniversityMBA with Marketing Concentration
  • 15 Months
Popular
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Corporate & Financial Law
  • 12 Months
Bestseller
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Intellectual Property & Technology Law
  • 12 Months
Jindal Global Law SchoolJindal Global Law SchoolLL.M. in Dispute Resolution
  • 12 Months
IIITBIIITBExecutive Program in Generative AI for Leaders
  • 4 Months
New
IIIT BangaloreIIIT BangaloreExecutive Post Graduate Programme in Machine Learning & AI
  • 13 Months
Bestseller
upGradupGradData Science Bootcamp with AI
  • 6 Months
New
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
KnowledgeHut upGradKnowledgeHut upGradSAFe® 6.0 Certified ScrumMaster (SSM) Training
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutCertified ScrumMaster®(CSM) Training
  • 16 Hours
upGrad KnowledgeHutupGrad KnowledgeHutLeading SAFe® 6.0 Certification
  • 16 Hours
KnowledgeHut upGradKnowledgeHut upGradPMP® certification
  • Self-Paced
upGrad KnowledgeHutupGrad KnowledgeHutAWS Solutions Architect Certification
  • 32 Hours
upGrad KnowledgeHutupGrad KnowledgeHutAzure Administrator Certification (AZ-104)
  • 24 Hours
KnowledgeHut upGradKnowledgeHut upGradAWS Cloud Practioner Essentials Certification
  • 1 Week
KnowledgeHut upGradKnowledgeHut upGradAzure Data Engineering Training (DP-203)
  • 1 Week
MICAMICAAdvanced Certificate in Digital Marketing and Communication
  • 6 Months
Bestseller
MICAMICAAdvanced Certificate in Brand Communication Management
  • 5 Months
Popular
IIM KozhikodeIIM KozhikodeProfessional Certification in HR Management and Analytics
  • 6 Months
Bestseller
Duke CEDuke CEPost Graduate Certificate in Product Management
  • 4-8 Months
Bestseller
Loyola Institute of Business Administration (LIBA)Loyola Institute of Business Administration (LIBA)Executive PG Programme in Human Resource Management
  • 11 Months
Popular
Goa Institute of ManagementGoa Institute of ManagementExecutive PG Program in Healthcare Management
  • 11 Months
IMT GhaziabadIMT GhaziabadAdvanced General Management Program
  • 11 Months
Golden Gate UniversityGolden Gate UniversityProfessional Certificate in Global Business Management
  • 6-8 Months
upGradupGradContract Law Certificate Program
  • Self paced
New
IU, GermanyIU, GermanyMaster of Business Administration (90 ECTS)
  • 18 Months
Bestseller
IU, GermanyIU, GermanyMaster in International Management (120 ECTS)
  • 24 Months
Popular
IU, GermanyIU, GermanyB.Sc. Computer Science (180 ECTS)
  • 36 Months
Clark UniversityClark UniversityMaster of Business Administration
  • 23 Months
New
Golden Gate UniversityGolden Gate UniversityMaster of Business Administration
  • 20 Months
Clark University, USClark University, USMS in Project Management
  • 20 Months
New
Edgewood CollegeEdgewood CollegeMaster of Business Administration
  • 23 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
The American Business SchoolThe American Business SchoolMBA with specialization
  • 23 Months
New
Aivancity ParisAivancity ParisMSc Artificial Intelligence Engineering
  • 24 Months
Aivancity ParisAivancity ParisMSc Data Engineering
  • 24 Months
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGrad KnowledgeHutupGrad KnowledgeHutData Engineer Bootcamp
  • Self-Paced
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
KnowledgeHut upGradKnowledgeHut upGradBackend Development Bootcamp
  • Self-Paced
upGradupGradUI/UX Bootcamp
  • 3 Months
upGradupGradCloud Computing Bootcamp
  • 7.5 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 5 Months
upGrad KnowledgeHutupGrad KnowledgeHutSAFe® 6.0 POPM Certification
  • 16 Hours
upGradupGradDigital Marketing Accelerator Program
  • 05 Months
upGradupGradAdvanced Certificate Program in GenerativeAI
  • 4 Months
New
upGradupGradData Science Bootcamp with AI
  • 6 Months
Popular
upGradupGradFull Stack Software Development Bootcamp
  • 6 Months
Bestseller
upGradupGradUI/UX Bootcamp
  • 3 Months
PwCupGrad CampusCertification Program in Financial Modelling & Analysis in association with PwC India
  • 4 Months
upGradupGradCertificate Course in Business Analytics & Consulting in association with PwC India
  • 06 Months
upGradupGradDigital Marketing Accelerator Program
  • 05 Months

Cross Validation in Python: Everything You Need to Know About

Updated on 08 January, 2024

11.14K+ views
10 min read

In Data Science, validation is probably one of the most important techniques used by Data Scientists to validate the stability of the ML model and evaluate how well it would generalize to new data. Validation ensures that the ML model picks up the right (relevant) patterns from the dataset while successfully canceling out the noise in the dataset. Essentially, the goal of validation techniques is to make sure ML models have a low bias-variance factor. 

Today we’re going to discuss at length on one such model validation technique – Cross-Validation.

What is Cross-Validation?

Cross-Validation is a validation technique designed to evaluate and assess how the results of statistical analysis (model) will generalize to an independent dataset. Cross-Validation is primarily used in scenarios where prediction is the main aim, and the user wants to estimate how well and accurately a predictive model will perform in real-world situations.

Cross-Validation seeks to define a dataset by testing the model in the training phase to help minimize problems like overfitting and underfitting. However, you must remember that both the validation and the training set must be extracted from the same distribution, or else it would lead to problems in the validation phase.

Learn data science certification course from the World’s top Universities. Earn Executive PG Programs, Advanced Certificate Programs, or Masters Programs to fast-track your career.

Benefits of Cross-Validation

  • It helps evaluate the quality of your model.
  • It helps to reduce/avoid problems of overfitting and underfitting.
  • It lets you select the model that will deliver the best performance on unseen data.

 Read: Python Projects for Beginners

What are Overfitting and Underfitting?

Overfitting refers to the condition when a model becomes too data-sensitive and ends up capturing a lot of noise and random patterns that do not generalize well to unseen data. While such a model usually performs well on the training set, its performance suffers on the test set.

Underfitting refers to the problem when the model fails to capture enough patterns in the dataset, thereby delivering a poor performance for both the training as well as the test set.

Going by these two extremities, the perfect model is one that performs equally well for both training and test sets.

Source 

Cross-Validation: Different Validation Strategies

Validation strategies are categorized based on the number of splits done in a dataset. Now, let’s look at the different Cross-Validation strategies in Python.

1. Validation set 

 This validation approach divides the dataset into two equal parts – while 50% of the dataset is reserved for validation, the remaining 50% is reserved for model training. Since this approach trains the model based on only 50% of a given dataset, there always remains a possibility of missing out on relevant and meaningful information hidden in the other 50% of the data. As a result, this approach generally creates a higher bias in the model.

Source 

Python code:

train, validation = train_test_split(data, test_size=0.50, random_state = 5)

2. Train/Test split 

In this validation approach, the dataset is split into two parts – training set and test set. This is done to avoid any overlapping between the training set and the test set (if the training and test sets overlap, the model will be faulty). Thus, it is crucial to ensure that the dataset used for the model must not contain any duplicated samples in our dataset. The train/test split strategy lets you retrain your model based on the whole dataset without altering any hyperparameters of the model.

Source 

However, this approach has one significant limitation – the model’s performance and accuracy largely depend on how it is split. For instance, if the split isn’t random, or one subset of the dataset has only a part of the complete information, it will lead to overfitting. With this approach, you cannot be sure which data points will be in which validation set, thereby creating different results for different sets. Hence, the train/test split strategy should only be used when you have enough data at hand.

Python code:

 >>> from sklearn.model_selection import train_test_split

>>> X, y = np.arange(10).reshape((5, 2)), range(5)

>>> X

array([[0, 1],

       [2, 3],

       [4, 5],

       [6, 7],

       [8, 9]])

>>> list(y)

[0, 1, 2, 3, 4]

3. K-fold 

As seen in the previous two strategies, there is the possibility of missing out on important information in the dataset, which increases the probability of bias-induced error or overfitting. This calls for a method that reserves abundant data for model training while also leaving sufficient data for validation.

Enter the K-fold validation technique. In this strategy, the dataset is split into ‘k’ number of subsets or folds, wherein k-1 subsets are reserved for model training, and the last subset is used for validation (test set). The model is averaged against the individual folds and then finalized. Once the model is finalized, you can test it using the test set. 

Source 

Here, each data point appears in the validation set exactly once while remaining in the training set k-1 number of times. Since most of the data is used for fitting, the problem of underfitting significantly reduces. Similarly, the issue of overfitting is eliminated since a majority of data is also used in the validation set.

 Read: Python vs Ruby: Complete Side-by-side comparison

The K-fold strategy is best for instances where you have a limited amount of data, and there’s a substantial difference in the quality of folds or different optimal parameters between them. 

Python code:

 from sklearn.model_selection import KFold # import KFold

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4]]) # create an array

y = np.array([1, 2, 3, 4]) # Create another array

kf = KFold(n_splits=2) # Define the split – into 2 folds

kf.get_n_splits(X) # returns the number of splitting iterations in the cross-validator

print(kf)

KFold(n_splits=2, random_state=None, shuffle=False)

 

Check out all trending Python tutorial concepts in 2024.

4. Leave one out

 The leave one out cross-validation (LOOCV) is a special case of K-fold when k equals the number of samples in a particular dataset. Here, only one data point is reserved for the test set, and the rest of the dataset is the training set. So, if you use the “k-1” object as training samples and “1” object as the test set, they will continue to iterate through every sample in the dataset. It is the most useful method when there’s too little data available.

Source 

 Since this approach uses all data points, the bias is typically low. However, as the validation process is repeated ‘n’ number of times (n=number of data points), it leads to greater execution time. Another notable constraint of the methods is that it may lead to a higher variation in testing model effectiveness as you test the model against one data point. So, if that data point is an outlier, it will create a higher variation quotient.

 Python code:  

>>> import numpy as np

>>> from sklearn.model_selection import LeaveOneOut

>>> X = np.array([[1, 2], [3, 4]])

>>> y = np.array([1, 2])

>>> loo = LeaveOneOut()

>>> loo.get_n_splits(X)

2

>>> print(loo)

LeaveOneOut()

>>> for train_index, test_index in loo.split(X):

…    print(“TRAIN:”, train_index, “TEST:”, test_index)

…    X_train, X_test = X[train_index], X[test_index]

…    y_train, y_test = y[train_index], y[test_index]

…    print(X_train, X_test, y_train, y_test)

TRAIN: [1] TEST: [0]

[[3 4]] [[1 2]] [2] [1]

TRAIN: [0] TEST: [1]

[[1 2]] [[3 4]] [1] [2]

5. Stratification

upGrad’s Exclusive Data Science Webinar for you –

Watch our Webinar on How to Build Digital & Data Mindset?

Typically, for the train/test split and the K-fold, the data is shuffled to create a random training and validation split. Thus, it allows for different target distribution in different folds. Similarly, stratification also facilitates target distribution over different folds while splitting the data.

In this process, data is rearranged in different folds in a way that ensures each fold to become a representative of the whole. So, if you are dealing with a binary classification problem where each class consists of 50% of the data, you can use stratification to arrange the data in a way that each class includes half of the instances.

The stratification process is best suited for small and unbalanced datasets with multiclass classification.

Python code: 

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, random_state=None)

# X is the feature set and y is the target

for train_index, test_index in skf.split(X,y): 

    print(“Train:”, train_index, “Validation:”, val_index) 

    X_train, X_test = X[train_index], X[val_index] 

    y_train, y_test = y[train_index], y[val_index]

Read: Data Frames in Python – Tutorial

When to Use each of these five Cross-Validation strategies?

As we mentioned before, each Cross-Validation technique has unique use cases, and hence, they perform best when applied correctly to the right scenarios. For instance, if you have enough data, and the scores and optimal parameters (of the model) for different splits are likely to be similar, the train/test split approach will work excellently.

However, if the scores and optimal parameters vary for different splits, the K-fold technique will be best. For instances where you have too little data, the LOOCV approach works best, whereas, for small and unbalanced datasets, stratification is the way to go. 

We hope this detailed article helped you gain an in-depth idea of Cross-Validation in Python.

If you are curious to learn about data science, check out IIIT-B & upGrad’s Executive PG Program in Data Science which is created for working professionals and offers 10+ case studies & projects, practical hands-on workshops, mentorship with industry experts, 1-on-1 with industry mentors, 400+ hours of learning and job assistance with top firms.

Frequently Asked Questions (FAQs)

1. What is the ‘permutation test’ in ML?

By generating a test statistic on the dataset and then for many random permutations of that data, a permutation test is used to assess the statistical significance of a model. The initial test statistic value should fall into one of the null hypothesis distribution's tails if the model is significant. To find the p-value, you need to just count the number of test-statistics that are as severe as or more extreme than the initial test statistics and then divide that number by the total number of test-statistics we computed. Given that the null hypothesis is true, the P-value is the chance of getting a result at least as severe as the test statistic.

2. What are the disadvantages of cross validation in machine learning?

1. Cross Validation significantly lengthens the training period. Previously, you could only train your model on one training set; now, you can train it on several training sets using Cross Validation.
2.In most cases, the structure you're studying develops over time in predictive modelling. As a result, you may notice variations in the training and validation sets.
3. Cross Validation requires a lot of computing power.

3. How can I detect overfitting in ML models?

Before you evaluate the data, detecting overfitting is nearly impossible. It can help with the difficulty of generalising data sets, which is an intrinsic feature of overfitting. As a result, the data may be divided into distinct subsets to make training and testing easier. The proportion of accuracy seen in both data sets can be used to determine whether or not overfitting is present. If the model performs better on the training set than on the test set, there are chances that it's overfitting.
Another suggestion is to begin with a very basic ML model to act as a baseline. Later, when you test complex algorithms, you'll have a benchmark against which to judge if the added complexity is worthwhile.
Validation measures such as accuracy and loss can also be used to detect overfitting. When the model is impacted by overfitting, the validation measures generally grow until they plateau or begin to decline.