Data Preprocessing in R: Ultimate Tutorial [2024]
Updated on Nov 30, 2022 | 6 min read | 8.3k views
In this data preprocessing in R tutorial, you’ll learn the fundamentals of how to perform data preprocessing. The tutorial assumes you are familiar with the basics of R and programming.
We’ll start our data preprocessing in R tutorial by importing the data set first. After all, you can’t preprocess the data if you don’t have the data in the first place.
In our case, the data is stored in the Data.csv file in the working directory. You can set the working directory with the command setwd("desired location").
Here’s how you’ll start the process:
dataset <- read.csv("Data.csv")
Here’s our dataset:
## | | Country | Age | Salary | Purchased |
## | 1 | France | 44 | 72000 | No |
## | 2 | Spain | 27 | 48000 | Yes |
## | 3 | Germany | 30 | 54000 | No |
## | 4 | Spain | 38 | 61000 | No |
## | 5 | Germany | 40 | NA | Yes |
## | 6 | France | 35 | 58000 | Yes |
## | 7 | Spain | NA | 52000 | No |
## | 8 | France | 48 | 79000 | Yes |
## | 9 | Germany | 50 | 83000 | No |
## | 10 | France | 37 | 67000 | Yes |
As you can see, there are missing values (NA) in the Salary and Age columns of our dataset: row 5 has no Salary and row 7 has no Age. Now that we have identified the issue, we can start fixing it.
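On a larger dataset you can't spot missing values by eye, so it helps to count them per column. The sketch below rebuilds the ten rows shown above as an inline data frame (an assumption that stands in for Data.csv, so the example runs without the file) and counts the NA values with is.na() and colSums().

```r
# Rebuild the ten example rows inline; this stands in for read.csv("Data.csv").
dataset <- data.frame(
  Country   = c("France", "Spain", "Germany", "Spain", "Germany",
                "France", "Spain", "France", "Germany", "France"),
  Age       = c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37),
  Salary    = c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000),
  Purchased = c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes")
)

# Count the NA values in each column.
na_counts <- colSums(is.na(dataset))
print(na_counts)

# Show only the rows that contain at least one NA.
incomplete_rows <- dataset[!complete.cases(dataset), ]
print(incomplete_rows)
```

For this data, the counts should be 1 for Age, 1 for Salary, and 0 for the other two columns.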
No other issues seem to be present in our dataset, so we only have to handle the missing values. We can fix this problem by replacing each NA value with the average (mean) of its column. Here’s how:
dataset$Age <- ifelse(is.na(dataset$Age),
                      ave(dataset$Age, FUN = function(x) mean(x, na.rm = TRUE)),
                      dataset$Age)
dataset$Salary <- ifelse(is.na(dataset$Salary),
                         ave(dataset$Salary, FUN = function(x) mean(x, na.rm = TRUE)),
                         dataset$Salary)
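Notice how we used the ave() function here. Called without a grouping variable, it returns the average of the whole column, where FUN is a function of x that calculates the mean excluding NA values (na.rm = TRUE). The ifelse() call for Age reads as follows:
if a value in dataset$Age is NA,
take the average of the column,
else,
take whatever is present in dataset$Age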
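The same replacement can also be written with logical indexing instead of ifelse() and ave(). This is a minimal sketch and not the tutorial's own method: impute_mean() is a hypothetical helper name, and the two vectors are copied from the example table.

```r
# Age and Salary columns from the example, with their missing values.
age    <- c(44, 27, 30, 38, 40, 35, NA, 48, 50, 37)
salary <- c(72000, 48000, 54000, 61000, NA, 58000, 52000, 79000, 83000, 67000)

# Hypothetical helper: replace each NA with the mean of the observed values.
impute_mean <- function(x) {
  x[is.na(x)] <- mean(x, na.rm = TRUE)
  x
}

age_filled    <- impute_mean(age)
salary_filled <- impute_mean(salary)

print(round(age_filled[7], 2))     # mean of the nine observed ages
print(round(salary_filled[5], 2))  # mean of the nine observed salaries
```

The filled-in values are 349 / 9 (about 38.78) for Age and 574000 / 9 (about 63777.78) for Salary, the same means the ifelse() version computes.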
Here’s how the mean() function behaves with and without na.rm:
# define x = 1 2 3
x <- 1:3
# introduce a missing value
x[1] <- NA
# the mean is NA because x contains an NA
mean(x)
## [1] NA
# the mean of the remaining values (2 and 3)
mean(x, na.rm = TRUE)
## [1] 2.5
After identifying and fixing the problem, our dataset looks like this:
## | | Country | Age | Salary | Purchased |
## | 1 | France | 44.00000 | 72000.00 | No |
## | 2 | Spain | 27.00000 | 48000.00 | Yes |
## | 3 | Germany | 30.00000 | 54000.00 | No |
## | 4 | Spain | 38.00000 | 61000.00 | No |
## | 5 | Germany | 40.00000 | 63777.78 | Yes |
## | 6 | France | 35.00000 | 58000.00 | Yes |
## | 7 | Spain | 38.77778 | 52000.00 | No |
## | 8 | France | 48.00000 | 79000.00 | Yes |
## | 9 | Germany | 50.00000 | 83000.00 | No |
## | 10 | France | 37.00000 | 67000.00 | Yes |
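Categorical data is non-numeric data that belongs to particular categories. The Country column in our dataset is categorical data. In R versions before 4.0.0, the read.csv() function converted all string variables to categorical variables (factors) by default; since R 4.0.0 the default is stringsAsFactors = FALSE, so string columns stay as character vectors. Either way, you can’t rely on the automatic conversion in every case.
Here’s how you can explicitly convert specific variables to factors: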
dataset$Country = factor(dataset$Country,
                         levels = c('France', 'Spain', 'Germany'),
                         labels = c(1, 2, 3))
dataset$Purchased = factor(dataset$Purchased,
                           levels = c('No', 'Yes'),
                           labels = c(0, 1))
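One detail worth checking after this step is that the result is still a factor: the numeric labels become the level names "1", "2", "3" rather than numbers. The sketch below illustrates this on a short vector of countries (the vector itself is made up from values in the example) and shows one way to recover plain integers if a function needs them.

```r
# A few country values taken from the example data.
country <- c("France", "Spain", "Germany", "Spain", "Germany")

# Encode with the same levels and labels used in the tutorial.
encoded <- factor(country,
                  levels = c("France", "Spain", "Germany"),
                  labels = c(1, 2, 3))

print(encoded)
print(class(encoded))    # still a factor, not a numeric vector
print(levels(encoded))   # the labels are stored as character strings

# Convert the labels to integers via character to avoid reading internal codes.
codes <- as.integer(as.character(encoded))
print(codes)
```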
Now, we have to split our dataset into two separate datasets: one for training our machine learning model and the other for testing it.
To do so, we’ll first install the caTools package (if it is not already installed) and load it with library(). Afterwards, we’ll call the set.seed() function so that the random split is reproducible: running the code again with the same seed gives the same split. Use the following code:
# install.packages("caTools")  # run once if the package is not installed
library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.8)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)
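You must have noticed that we have kept the split ratio as 80:20. This is a common choice for training and test sets, though not the only one. The sample.split() function takes the Purchased column and returns a logical vector of randomized TRUE and FALSE values according to the split ratio, while keeping the proportion of each label roughly the same in both sets. Rows marked TRUE go to the training set and rows marked FALSE go to the test set.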
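To make the idea of a label-preserving split concrete, here is a rough base-R imitation. It is a sketch under stated assumptions, not caTools' actual implementation: stratified_flags() is a hypothetical helper, and the Purchased values are copied from the example table.

```r
set.seed(123)  # fixes the random number stream so the split is repeatable

purchased <- c("No", "Yes", "No", "No", "Yes", "Yes", "No", "Yes", "No", "Yes")

# Hypothetical helper: flag roughly `ratio` of each class as TRUE (training).
stratified_flags <- function(y, ratio) {
  flags <- logical(length(y))
  for (cls in unique(y)) {
    idx <- which(y == cls)
    n_train <- round(length(idx) * ratio)
    # Sample positions within idx, then map them back to row numbers.
    flags[idx[sample.int(length(idx), n_train)]] <- TRUE
  }
  flags
}

split <- stratified_flags(purchased, 0.8)

# Cross-tabulate labels against the split flags.
print(table(purchased, split))
```

With five "Yes" and five "No" rows and a ratio of 0.8, four rows of each label are flagged TRUE, giving eight training rows and two test rows.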
Feature scaling is required when different features in your dataset have different ranges. In our case, the Age and Salary columns have different ranges, which can cause problems in training our ML model.
When one feature has a significantly larger range than another, it dominates the Euclidean distance between observations. Distance-based models then effectively ignore the smaller feature, which can lead to poor results.
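A quick calculation shows how lopsided this can be. The sketch below takes the Age and Salary values of the first two rows of the example and measures how much each feature contributes to the squared Euclidean distance; the variable names are chosen for illustration only.

```r
# Two people from the example data: (Age, Salary).
a <- c(Age = 44, Salary = 72000)
b <- c(Age = 27, Salary = 48000)

# Euclidean distance on the raw, unscaled values.
raw_dist <- sqrt(sum((a - b)^2))

# Share of the squared distance contributed by each feature.
contrib <- (a - b)^2 / sum((a - b)^2)

print(raw_dist)
print(contrib)
```

The distance comes out almost exactly equal to the salary difference of 24000, because the 17-year age gap contributes only a tiny fraction of the squared distance.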
Note that some modelling packages in R scale features automatically, but it’s still important to know how to do it yourself. Apply the scale() function to the Age and Salary columns (columns 2 and 3):
training_set[,2:3] = scale(training_set[,2:3])
test_set[,2:3] = scale(test_set[,2:3])
This standardizes Age and Salary so that each has a mean of 0 and a standard deviation of 1 within its set. The two features end up on comparable scales, so neither dominates the other during machine learning implementations.
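One caveat: the code above scales the test set with its own mean and standard deviation. A common alternative is to reuse the training set's values so that both sets are transformed identically. The sketch below shows that variant; the train and test data frames are assumptions (the eight complete rows of the example as training data, and rows 5 and 7 with their imputed values rounded as test data).

```r
# Assumed split: the eight complete rows for training, rows 5 and 7 for testing.
train <- data.frame(
  Age    = c(44, 27, 30, 38, 35, 48, 50, 37),
  Salary = c(72000, 48000, 54000, 61000, 58000, 79000, 83000, 67000)
)
test <- data.frame(Age = c(40, 39), Salary = c(63778, 52000))

# Standardize the training set; scale() stores the values it used as attributes.
train_scaled <- scale(train)
centers <- attr(train_scaled, "scaled:center")
scales  <- attr(train_scaled, "scaled:scale")

# Apply the training-set mean and standard deviation to the test set.
test_scaled <- scale(test, center = centers, scale = scales)

print(round(colMeans(train_scaled), 10))  # column means after scaling
print(apply(train_scaled, 2, sd))         # column standard deviations after scaling
print(test_scaled)
```

After scaling, the training columns have mean 0 and standard deviation 1, while the test columns generally do not, since they were transformed with the training set's parameters.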
We hope that you found our data preprocessing in R tutorial helpful. It would be best to understand the tutorial before you try testing it out yourself. Understanding the concepts is much more important than using them.
What are your thoughts on our data preprocessing in R tutorial? Share them in the comments below.