
Cross Validation in R: Usage, Models & Measurement

By Rohit Sharma

Updated on Jun 30, 2023 | 13 min read

When you start your journey into data science and machine learning, the tendency is to jump straight to model building and algorithms, and to skip learning how to test how well those models hold up on real-world data.

Cross-Validation in R is a form of model validation that improves on simple hold-out validation: it repeatedly evaluates the model on different subsets of the data and exposes the bias-variance trade-off, giving a clearer picture of how the model will perform beyond the data it was trained on. This article is an end-to-end guide to model validation in R and explains why it is needed.

The Instability of Learning Models

To understand this, consider three plots that illustrate how models of increasing complexity fit the same data:

The plots show models learned for the dependency of an article's price on its size. We fit equations of increasing complexity between the two variables to produce the plots.

The first plot has a high error on the training points, so it does not perform well on the test set either. This is "underfitting": the model fails to capture the actual pattern in the data.

The second plot captures the true dependency of price on size with minimal training error, so the relationship generalizes well.

In the last plot, we establish a relationship with almost no training error at all. The fit follows every fluctuation and every bit of noise in the data points, arranging itself to minimize the training error and producing overly complicated patterns. This is known as "overfitting", and it makes the model very fragile: there can be a large gap between training and test performance.

In data science, we compare various models and look for the one that performs best. But it is sometimes hard to tell whether an improved score means the relationship has been captured better or the model is simply overfitting the data. Validation techniques give us the answer, and they also help us obtain a pattern that generalizes better.

What is Overfitting & Underfitting?

Underfitting in machine learning refers to capturing too little of the underlying pattern. When we run the model on the training and test sets, it performs poorly on both.

Overfitting in machine learning means capturing noise along with the pattern, so the learned relationship does not generalize to data the model was not trained on. The model performs extremely well on the training set but poorly on the test set.

What is Cross-Validation?

Cross-Validation tests the model's ability to predict new data that was not used in estimation, so that problems like overfitting or selection bias are flagged. It also gives insight into how well the model generalizes to an independent dataset.

Steps to organize Cross-Validation:

  1. We keep aside one part of the dataset as a sample specimen.
  2. We train the model with the remaining part of the dataset.
  3. We use the reserved sample set for testing; it helps quantify how well the model performs.

Statistical model validation

In statistics, model validation confirms that the outputs of a statistical model are an acceptable description of the real data. It checks that the model's outputs are consistent with those of the data-generating process, so that the main goals of the analysis are actually met.

Validation is generally evaluated not only on the data used to construct the model but also on data that was not used in construction. So, validation usually tests some of the model's predictions.

What is the use of cross-validation?

Cross-Validation is primarily used in applied machine learning to estimate the skill of a model on future data. That is, we use a given sample to estimate how the model is expected to perform when making predictions on data it did not see during training.

Does Cross-Validation reduce Overfitting?

Cross-Validation is a strong safeguard against overfitting. The idea is to use the initial training data to generate many smaller train-test splits and then use these splits to tune the model. In standard k-fold Cross-Validation, we divide the data into k subsets, called folds.

Methods Used for Cross-Validation in R

There are many methods that data scientists use to perform Cross-Validation. We discuss some of them here.

1. Validation Set Approach

The Validation Set Approach estimates the error rate of a model by setting aside a testing dataset. We build the model using the other set of observations, known as the training dataset, then apply the fitted model to the testing dataset and calculate the testing error. This makes it easy to see whether the model overfits.

R code:
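The snippet below is a minimal sketch of the approach, assuming the built-in mtcars dataset and a simple linear model; the dataset, variables, and split ratio are only illustrative.

# validation set approach: one random train/test split on mtcars
set.seed(123)
train_index <- sample(1:nrow(mtcars), size = floor(0.8 * nrow(mtcars)))
train_data <- mtcars[train_index, ]
test_data <- mtcars[-train_index, ]
# fit a linear model on the training data only
fit <- lm(mpg ~ wt + hp, data = train_data)
# predict on the held-out test data and compute the test RMSE
preds <- predict(fit, newdata = test_data)
sqrt(mean((test_data$mpg - preds)^2))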

The code above creates a training dataset and a separate testing dataset. We use the training dataset to build a predictive model and then apply it to the testing dataset to check the error rate.

2. Leave-one-out cross-validation (LOOCV)

Leave-one-out Cross-Validation (LOOCV) is a special case of k-fold Cross-Validation in which the number of folds equals the number of instances in the dataset. For every instance, the learning algorithm is run once, with that instance as the single-item test set and all the remaining instances as the training set. In statistics, a similar process is called jackknife estimation.

R Code Snippet:
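A minimal sketch using the caret package, again with the built-in mtcars dataset as an illustrative example:

library(caret)
# leave-one-out cross-validation: each observation serves as the test set exactly once
ctrl <- trainControl(method = "LOOCV")
model <- train(mpg ~ wt + hp, data = mtcars, method = "lm", trControl = ctrl)
print(model)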

More generally, we can leave p training examples out at a time, which creates a validation set of the same size for each iteration. This process is known as Leave-P-Out Cross-Validation (LPOCV).

3. k-Fold Cross-Validation

k-fold Cross-Validation is a resampling procedure used to evaluate machine learning models on a limited data sample.

The procedure has a single parameter, k, which is the number of groups that a given data sample is to be split into; hence the name k-fold Cross-Validation.

Data scientists often use it in applied machine learning to estimate the skill of a machine learning model on unseen data.

It is comparatively simple to understand, and it generally gives a less biased and less optimistic estimate of model skill than a single train/test split.

The general procedure is built-up with a few simple steps:

  1. Shuffle the dataset to randomize it.
  2. Split the dataset into k groups of roughly equal size.
  3. For each unique group:

Take that group as the test data set and treat all the remaining groups together as the training data set. Fit a model on the training set, run it on the test set, and note down the evaluation score.

R code Snippet:
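As a sketch, cv.glm() from the boot package runs k-fold Cross-Validation on a fitted GLM; the mtcars formula below is only an example:

library(boot)
# fit the model on the full data, then estimate its prediction error with 10-fold CV
fit <- glm(mpg ~ wt + hp, data = mtcars)
cv_result <- cv.glm(data = mtcars, glmfit = fit, K = 10)
# delta[1] is the raw cross-validated estimate of prediction error (MSE)
cv_result$delta[1]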

4. Stratified k-fold Cross-Validation

Stratification is a rearrangement of the data that makes sure each fold is a good representative of the whole. Consider a binary classification problem in which each class makes up 50% of the data: stratification ensures that every fold preserves roughly that 50/50 split.

When both bias and variance are a concern, stratified k-fold Cross-Validation is generally the best choice.

R Code Snippet:
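A minimal sketch using createFolds() from caret, which stratifies on the outcome by default; the iris Species labels are only an example:

library(caret)
set.seed(42)
# build 5 folds stratified on the class labels
folds <- createFolds(iris$Species, k = 5, list = TRUE, returnTrain = FALSE)
# each fold should preserve the original class proportions
sapply(folds, function(idx) table(iris$Species[idx]))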

5. Adversarial Validation

The basic idea is to check how similar the features and their distributions are between the training and test sets. If the two sets are not easy to tell apart, their distributions are similar and the usual validation methods should work well.

With real-world datasets, however, the train and test sets are sometimes very different, and internal Cross-Validation scores then fall outside the range of the actual test score. This is where adversarial validation comes into play.

It checks the degree of similarity between the training and test sets in terms of feature distribution. We merge the train and test sets, label each row zero or one (zero for train, one for test), and evaluate it as a binary classification task.

We create a new target variable that is 0 for each row in the train set and 1 for each row in the test set.

Now we combine the train and test datasets.

Using this newly created target variable, we fit a classification model and predict each row's probability of belonging to the test set. If the classifier separates the two sets easily, their distributions differ substantially.
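A sketch of these steps, assuming train_data and test_data are data frames with identical feature columns; the object names and the choice of a logistic regression classifier are only illustrative:

library(caret)
# label the origin of each row: 0 for train, 1 for test
train_data$is_test <- 0
test_data$is_test <- 1
combined <- rbind(train_data, test_data)
combined$is_test <- factor(combined$is_test, labels = c("train", "test"))
# fit a binary classifier and score it with cross-validated AUC
ctrl <- trainControl(method = "cv", number = 5, classProbs = TRUE, summaryFunction = twoClassSummary)
adv_model <- train(is_test ~ ., data = combined, method = "glm", family = "binomial", metric = "ROC", trControl = ctrl)
# ROC (AUC) near 0.5: train and test look alike; near 1: they are easy to tell apart
print(adv_model)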

6. Cross-Validation for time series

A time-series dataset cannot be split randomly, because that would destroy the temporal ordering of the data. In a time series problem, we perform Cross-Validation as shown below.

For time-series Cross-Validation, we create folds in a forward-chaining fashion.

If, for example, we have n years of annual consumer demand for a particular product, we make the folds like this:

fold 1: training group 1, test group 2

fold 2: training group 1,2, test group 3

fold 3: training group 1,2,3, test group 4

fold 4: training group 1,2,3,4, test group 5

fold 5: training group 1,2,3,4,5, test group 6

fold n: training group 1 to n-1, test group n

A new train and test set are selected progressively. We start with a train set containing the minimum number of observations required to fit the model; with every fold, the train set grows and the test set moves one step forward.

R Code Snippet:
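A minimal sketch using tsCV() from the forecast package; the yearly demand series below is made up for illustration:

library(forecast)
# a made-up annual demand series
demand <- ts(c(120, 135, 150, 148, 162, 170, 181, 178, 195, 210), start = 2014, frequency = 1)
# tsCV() re-fits on an expanding window and returns one-step-ahead forecast errors
errors <- tsCV(demand, forecastfunction = naive, h = 1)
# cross-validated RMSE of the one-step-ahead forecasts
sqrt(mean(errors^2, na.rm = TRUE))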

Setting h = 1 means that we consider the error of one-step-ahead forecasts.

Example of K Fold Cross Validation in R

K-fold cross validation in R is computationally fast and is one of the most effective ways to estimate the accuracy and error of a regression model. Let us walk through an example.

For instance, you have this dataset in R:

#create data frame

df <- data.frame(y=c(6, 8, 12, 14, 14, 15, 17, 22, 24, 23),

                 x1=c(2, 5, 4, 3, 4, 6, 7, 5, 8, 9),

                 x2=c(14, 12, 12, 13, 7, 8, 7, 4, 6, 5))

#view data frame

df

 y x1 x2
 6  2 14
 8  5 12
12  4 12
14  3 13
14  4  7
15  6  8
17  7  7
22  5  4
24  8  6
23  9  5

You can use the following code to fit a multiple linear regression model to this dataset and evaluate its performance with k-fold cross validation using k = 5 folds:

library(caret)
#specify the cross-validation method
ctrl <- trainControl(method = "cv", number = 5)
#fit a regression model and use k-fold CV to evaluate performance
model <- train(y ~ x1 + x2, data = df, method = "lm", trControl = ctrl)
#view summary of k-fold CV               
print(model)
Linear Regression 
10 samples
 2 predictor
No pre-processing
Resampling: Cross-Validated (5 fold) 
Summary of sample sizes: 8, 8, 8, 8, 8 
Resampling results:
  RMSE Rsquared MAE     
  3.018979 1 2.882348

Tuning parameter ‘intercept’ was held constant at a value of TRUE

You will be able to interpret the output on the basis of the following:

No pre-processing took place, which means the data wasn’t scaled in any way before fitting the models. 

The model was assessed using 5-fold cross-validation resampling.

For each training set, the sample size was 8.

RMSE: The root mean squared error measures the average difference between the model's predictions and the actual observations. A lower RMSE indicates that the model predicts the actual observations more accurately.

R-squared: This measures how strongly the model's predictions correlate with the actual observations. A higher R-squared indicates that the model predicts the actual observations more accurately.

MAE: The mean absolute error gives the average absolute difference between the actual observations and the model predictions. A lower MAE indicates that the model can predict the actual observations more accurately. 

The three metrics in the output describe how well the model performed on unseen data. We can fit different models, compare these metrics, and pick the model with the lowest test error rates as the best one.

The code to assess the final model fit is as follows:

#view final model
model$finalModel
Call:
lm(formula = .outcome ~ ., data = dat)
Coefficients:
(Intercept) x1 x2  
    21.2672 0.7803 -1.1253 

The final model will be:

y = 21.2672 + 0.7803*(x1) – 1.1253*(x2)
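As a quick check, the fitted model object can also be used directly for prediction; the x1 and x2 values below are made up for illustration:

# predict y for a hypothetical new observation with the cross-validated model
new_obs <- data.frame(x1 = 5, x2 = 10)
predict(model, newdata = new_obs)
# the same value follows from the equation: 21.2672 + 0.7803*5 - 1.1253*10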

How to measure the model’s bias-variance?

With k-fold Cross-Validation, we obtain k estimates of the model error, one per fold. For an ideal model, these errors would sum to zero. To measure the model's bias, we take the average of all the errors; the lower this average, the better the model.

For the model's variance, we take the standard deviation of all the errors. A small standard deviation means the model does not vary much across different subsets of the training data.

The focus should be on striking a balance between bias and variance. If we reduce the variance while keeping the bias under control, we can reach an equilibrium that ultimately produces a model with better predictions.
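As a small sketch, given the per-fold errors from a k-fold run (the numbers below are made up), the bias and variance proxies are simply their mean and standard deviation:

# hypothetical per-fold errors from 5-fold cross-validation
fold_errors <- c(2.8, 3.1, 2.9, 3.4, 3.0)
mean(fold_errors)  # average error: a proxy for the model's bias
sd(fold_errors)    # spread across folds: a proxy for the model's variance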

Wrapping up

In this article, we discussed Cross-Validation and its application in R, along with ways to avoid overfitting. We covered different procedures, including the validation set approach, LOOCV, k-fold Cross-Validation, and stratified k-fold, together with example implementations of each approach in R.

Frequently Asked Questions (FAQs)

1. What is R programming?

2. Where is cross-validation required?

3. What are the applications of R?
