The Ultimate R Cheat Sheet for Data Science Enthusiasts
Updated on Feb 11, 2025 | 18 min read | 6.7k views
R powers analysis in industries such as healthcare, finance, and marketing, supporting tasks like predictive modeling, risk analysis, and customer segmentation. It provides functions for vector operations, string handling, statistical modeling, and machine learning. Mastering these functions helps you transform raw data into actionable insights.
In this blog, we will cover the basics of vectors, strings, and data transformation, providing hands-on examples to help you get started.
Data transformation is a key component of any data analysis process. Without it, raw data can’t be effectively analyzed or used for decision-making. R provides powerful functions to handle large datasets, clean data, and prepare them for analysis.
For example, when dealing with inconsistent customer data, R’s dplyr and tidyr packages can clean, reshape, and organize the data into an analysis-ready format. These tools streamline the data wrangling process, minimizing human error and enhancing workflow efficiency.
dplyr helps clean large datasets by providing intuitive functions like mutate() for adding or modifying columns, filter() for subsetting data, and arrange() for sorting data.
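As a rough illustration of these three verbs, here is a minimal sketch. It assumes dplyr is installed; the customers data frame and its columns are invented for the example.

```r
library(dplyr)

# Hypothetical customer data, invented for this sketch
customers <- data.frame(
  name  = c("Ana", "Ben", "Cleo", "Dev"),
  spend = c(120, 45, 300, 80)
)

result <- customers %>%
  mutate(spend_with_tax = spend * 1.1) %>%  # add a derived column
  filter(spend > 50) %>%                    # keep rows with spend above 50
  arrange(desc(spend))                      # sort from highest to lowest spend

print(result)
```

Each verb takes a data frame and returns a data frame, which is what makes them easy to combine.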
tidyr, on the other hand, reshapes data into a tidy format in which each variable forms its own column, which reduces the risk of misaligned or missing values. Its spread() and gather() functions do this reshaping; current tidyr releases supersede them with pivot_wider() and pivot_longer(), which are the better choice for new code.
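The reshaping idea can be sketched as follows. This assumes tidyr is installed and uses pivot_longer() and pivot_wider(), the current tidyr reshaping functions; the wide table is made up for illustration.

```r
library(tidyr)

# Hypothetical wide-format sales table, invented for this sketch
wide <- data.frame(
  store = c("A", "B"),
  q1    = c(10, 20),
  q2    = c(15, 25)
)

# Wide to long: one row per store and quarter
long <- pivot_longer(wide, cols = c(q1, q2),
                     names_to = "quarter", values_to = "sales")

# Long back to wide: one column per quarter
wide_again <- pivot_wider(long, names_from = quarter, values_from = sales)

print(long)
print(wide_again)
```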
R offers a variety of functions to streamline data manipulation:
Understanding the basics of R will lay the foundation for mastering data transformation techniques.
Before diving into specific functions, it's important to understand some core concepts. An R programming cheat sheet can be a helpful reference as you familiarize yourself with these foundational ideas. These concepts set the stage for efficient R programming and help streamline your work with data.
You can access documentation for any function using the help() function or the ? operator. To get details on packages, use library(help = package_name). For quick references, explore R's official online documentation.
Additionally, R users often rely on external resources for troubleshooting and learning, such as RDocumentation.org for package-specific information, or Stack Overflow for community-driven support and practical coding solutions. These platforms provide valuable insights and answers to common R-related questions.
Example:
# Accessing help for the mean function using help()
help(mean)
# Or using the ? operator
?mean
Output:
This will display the documentation for the mean function in R.
R packages enhance R’s functionality, offering more efficient solutions than base R for tasks like data wrangling or visualization. For example, dplyr simplifies data manipulation with concise, readable code.
To install a package, use install.packages("packageName"), and to load it, use library(packageName). Popular repositories include CRAN, Bioconductor (for bioinformatics), and GitHub, offering a vast selection of packages to streamline your analysis.
Example:
# Step 1: Install the dplyr package (this step is only needed once)
install.packages("dplyr")
# Step 2: Load the dplyr package into the R session
library(dplyr)
# Step 3: Example usage of a function from the dplyr package
# Creating a sample data frame
data <- data.frame(
Name = c("John", "Jane", "Sam", "Sue", "Alex"),
Age = c(25, 30, 22, 28, 35),
Score = c(85, 92, 78, 88, 91)
)
# Step 4: Use the filter function from dplyr to filter data
# Example: Filter individuals with Age greater than 25
filtered_data <- filter(data, Age > 25)
# Step 5: Display the filtered data
print(filtered_data)
Output:
Name Age Score
1 Jane 30 92
2 Sue 28 88
3 Alex 35 91
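Because install.packages() only needs to run once, one common pattern is to install a package only when it is missing. The sketch below is one way to do that; missing_packages() is a hypothetical helper written for this example, not a built-in function.

```r
# Hypothetical helper: return the names of packages that are not installed
missing_packages <- function(pkgs) {
  installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  pkgs[!installed]
}

# Base packages are used here so the sketch runs on any R installation
needed <- c("stats", "utils")
to_install <- missing_packages(needed)

# Install only what is missing (skipped when nothing is missing)
if (length(to_install) > 0) {
  install.packages(to_install)
}
print(to_install)
```

In practice you would replace the needed vector with the packages your script depends on, such as "dplyr".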
The working directory is where R searches for files and saves results. Use getwd() to check it and setwd() to change it. Proper directory management keeps your project files organized.
After setting the working directory, use functions like read.csv() and write.csv() to read and write files. This ensures efficient file handling in your R projects.
Example:
# Print the current working directory
current_dir <- getwd()
cat("Current Working Directory:", current_dir, "\n")
# Set a new working directory
# Replace this path with the path of the folder you want to set as the working directory
new_dir <- "C:/Users/YourName/Documents"
setwd(new_dir)
# Verify the working directory has been changed
cat("New Working Directory:", getwd(), "\n")
# Create a new text file in the new working directory
file_name <- "example_file.txt"
file_path <- file.path(new_dir, file_name)
# Write a message to the file
writeLines("Hello, this is a test file.", file_path)
cat("File has been created at:", file_path, "\n")
# Read the contents of the file to verify it's been written
file_contents <- readLines(file_path)
cat("Contents of the file:", file_contents, "\n")
Output:
Current Working Directory: C:/Users/YourName/CurrentDirectory
New Working Directory: C:/Users/YourName/Documents
File has been created at: C:/Users/YourName/Documents/example_file.txt
Contents of the file: Hello, this is a test file.
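The read.csv() and write.csv() functions mentioned above follow the same pattern. Here is a minimal round-trip sketch; it writes to a temporary file (via tempfile()) rather than a fixed folder, and the scores data frame is made up for the example.

```r
# Write to a temporary file so the sketch does not depend on a specific folder
csv_path <- tempfile(fileext = ".csv")

scores <- data.frame(Name = c("John", "Jane"), Score = c(85, 92))

# row.names = FALSE avoids writing an extra column of row numbers
write.csv(scores, csv_path, row.names = FALSE)

# Read the file back and inspect the contents
scores_back <- read.csv(csv_path)
print(scores_back)
```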
R offers various operators for different tasks: assignment operators (<-), arithmetic operators (+, -, *, /, etc.), logical operators (&, |, !), and comparison operators (==, !=, >, <, >=, <=).
These operators are essential for tasks like performing calculations, filtering data frames with logical conditions, and comparing values for decision-making in your analysis.
Example:
# Assignment Operator
x <- 5 # Assigning 5 to x
y <- 10 # Assigning 10 to y
# Arithmetic Operators
z <- x + y # Addition
w <- x - y # Subtraction
v <- x * y # Multiplication
u <- y / x # Division
t <- x %% y # Modulus (remainder)
s <- x^2 # Exponentiation (x squared)
# Comparison Operators
is_equal <- x == y # Check if x is equal to y
is_greater <- x > y # Check if x is greater than y
is_less <- x < y # Check if x is less than y
# Logical Operators
and_condition <- (x > 0 & y > 0) # Logical AND
or_condition <- (x > 0 | y < 0) # Logical OR
not_condition <- !(x == y) # Logical NOT
# Print the results
cat("Arithmetic results:\n")
cat("x + y =", z, "\n")
cat("x - y =", w, "\n")
cat("x * y =", v, "\n")
cat("y / x =", u, "\n")
cat("x %% y =", t, "\n")
cat("x^2 =", s, "\n\n")
cat("Comparison results:\n")
cat("Is x equal to y? ", is_equal, "\n")
cat("Is x greater than y? ", is_greater, "\n")
cat("Is x less than y? ", is_less, "\n\n")
cat("Logical results:\n")
cat("x > 0 AND y > 0? ", and_condition, "\n")
cat("x > 0 OR y < 0? ", or_condition, "\n")
cat("NOT (x == y)? ", not_condition, "\n")
Output:
Arithmetic results:
x + y = 15
x - y = -5
x * y = 50
y / x = 2
x %% y = 5
x^2 = 25
Comparison results:
Is x equal to y? FALSE
Is x greater than y? FALSE
Is x less than y? TRUE
Logical results:
x > 0 AND y > 0? TRUE
x > 0 OR y < 0? TRUE
NOT (x == y)? TRUE
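The same operators also work element by element on whole columns, which is how they are used to filter data frames. The following is a small sketch with invented data; it also shows %/% (integer division), the companion to %%.

```r
# Hypothetical data, invented for this sketch
people <- data.frame(
  Name  = c("John", "Jane", "Sam", "Sue"),
  Age   = c(25, 30, 22, 28),
  Score = c(85, 92, 78, 88)
)

# & and | work element by element, so they produce one TRUE/FALSE per row
keep <- people$Age > 24 & people$Score >= 88
print(keep)

# Use the logical vector to select rows
selected <- people[keep, ]
print(selected)

# %/% is integer division; %% is the remainder
quotient  <- 17 %/% 5
remainder <- 17 %% 5
cat("17 %/% 5 =", quotient, " 17 %% 5 =", remainder, "\n")
```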
Understanding these basics will help you feel comfortable navigating the R environment. Now that you’ve got the essentials, let’s move on to working with vectors.
Vectors are the foundation of R's data structure system, providing a simple and efficient way to store multiple elements of the same type. They are important for a wide range of operations and are the building blocks for more complex data structures. Below are some common operations and functions for working with vectors.
You can create vectors in R using the c() function, which stands for "combine." This function allows you to combine individual elements into a vector, such as numbers, characters, or logical values, forming a one-dimensional array.
Example:
# Creating a vector with numbers from 1 to 5
numbers <- c(1, 2, 3, 4, 5)
# Print the created vector
print("The vector 'numbers' is:")
print(numbers)
# Adding 10 to each element of the vector
numbers_plus_ten <- numbers + 10
print("The vector 'numbers' after adding 10 to each element is:")
print(numbers_plus_ten)
# Calculating the sum of all elements in the vector
sum_numbers <- sum(numbers)
print("The sum of elements in the vector 'numbers' is:")
print(sum_numbers)
# Finding the length of the vector
length_numbers <- length(numbers)
print("The length of the vector 'numbers' is:")
print(length_numbers)
# Accessing specific elements of the vector
third_element <- numbers[3]
print("The third element in the vector 'numbers' is:")
print(third_element)
Output:
[1] "The vector 'numbers' is:"
[1] 1 2 3 4 5
[1] "The vector 'numbers' after adding 10 to each element is:"
[1] 11 12 13 14 15
[1] "The sum of elements in the vector 'numbers' is:"
[1] 15
[1] "The length of the vector 'numbers' is:"
[1] 5
[1] "The third element in the vector 'numbers' is:"
[1] 3
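Because a vector holds elements of a single type, c() converts mixed inputs to a common type. A short sketch of that behavior:

```r
# Numbers and logicals are converted to character when a string is present
mixed <- c(1, "two", TRUE)
print(mixed)
print(class(mixed))

# Logicals become 1 and 0 when combined with numbers
num_and_lgl <- c(1.5, TRUE, FALSE)
print(num_and_lgl)
print(class(num_and_lgl))
```

Checking class() after building a vector is a quick way to catch unintended conversions.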
Vector functions perform various operations on vectors. Common functions include length() for counting the elements, sum() for adding them, and mean() for computing the average. These operations are essential for manipulating and analyzing data in vectorized formats.
Example:
# Define the vector
numbers <- c(1, 2, 3, 4, 5)
# Apply the built-in vector functions
vector_length <- length(numbers)
vector_sum <- sum(numbers)
vector_mean <- mean(numbers)
# Print the results
cat("Length:", vector_length, "\n")
cat("Sum:", vector_sum, "\n")
cat("Mean:", vector_mean, "\n")
Output:
Length: 5
Sum: 15
Mean: 3
To select specific elements from a vector, use indexing with square brackets. Unlike many programming languages, R starts indexing at 1. A positive index retrieves or modifies the value at that position, while a negative index excludes that position from the result.
Example:
# Define a vector of numbers
numbers <- 0:9
# Select the 4th element (R indexing starts at 1)
cat("The 4th element:", numbers[4], "\n")
# Select a range of elements (positions 2 to 4, both ends included)
cat("Elements 2 to 4:", numbers[2:4], "\n")
Output:
The 4th element: 3
Elements 2 to 4: 1 2 3
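In R specifically, indexing is 1-based, a negative index drops a position, and a logical condition inside the brackets keeps only the matching values. A minimal sketch of these forms:

```r
numbers <- c(10, 20, 30, 40, 50)

first_element <- numbers[1]                # R indexing starts at 1
without_first <- numbers[-1]               # a negative index drops that position
last_element  <- numbers[length(numbers)]  # the last element, via length()
above_25      <- numbers[numbers > 25]     # logical indexing keeps matching values

print(without_first)
print(above_25)
```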
R allows vectorized operations, enabling efficient calculations across entire vectors. For instance, adding 2 to every element in a vector is straightforward. Vectorized operations eliminate the need for explicit loops, making code faster and more concise.
Example:
# Create a vector of numbers
numbers <- c(1, 2, 3, 4, 5)
# Add 2 to every element in the vector using vectorized operation
result <- numbers + 2
# Print the result
print(result)
Output:
[1] 3 4 5 6 7
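To make the comparison with loops concrete, the sketch below computes the same total twice, once with a vectorized expression and once with an explicit loop. The prices and quantities vectors are made up for the example.

```r
prices     <- c(10, 20, 30)
quantities <- c(2, 1, 4)

# Element-wise multiplication, then a single sum
total_vectorized <- sum(prices * quantities)

# The same calculation written as an explicit loop
total_loop <- 0
for (i in seq_along(prices)) {
  total_loop <- total_loop + prices[i] * quantities[i]
}

cat("Vectorized:", total_vectorized, " Loop:", total_loop, "\n")
```

Both approaches give the same result; the vectorized form is shorter and is usually the more idiomatic choice in R.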
Now that you know how to work with vectors, let’s move on to handling strings, which are another common data type in R.
String manipulation is a frequent task in data processing, and R offers several functions for finding, subsetting, and modifying strings. Base R provides functions like grep(), sub(), and gsub(), but for more efficient and user-friendly string operations, the stringr package is highly recommended.
Functions like str_detect(), str_replace(), and str_sub() from stringr are faster and offer a more consistent syntax, making them useful for complex string manipulations.
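A brief sketch of these three stringr functions follows. It assumes the stringr package is installed; the fruits vector is invented for illustration.

```r
library(stringr)

fruits <- c("apple", "banana", "cherry")

has_an   <- str_detect(fruits, "an")       # TRUE where the pattern occurs
replaced <- str_replace(fruits, "a", "A")  # replaces the first match in each string
first3   <- str_sub(fruits, 1, 3)          # characters 1 to 3 of each string

print(has_an)
print(replaced)
print(first3)
```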
The grep() function searches a character vector for elements that match a specified pattern. By default it returns the positions (indices) of the matching elements, which you can then use to filter or extract the relevant data from a larger dataset.
Example:
# Create a vector of strings
text <- c("apple", "banana", "cherry")
# Use grep() to find elements that match the pattern "an"
matches <- grep("an", text)
# Print the results
print(matches)
Output:
[1] 2
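Two closely related forms are worth knowing: grep() with value = TRUE returns the matching strings rather than their positions, and grepl() returns one TRUE or FALSE per element. A short sketch:

```r
text <- c("apple", "banana", "cherry")

idx    <- grep("an", text)                # positions of the matches
values <- grep("an", text, value = TRUE)  # the matching strings themselves
flags  <- grepl("an", text)               # one TRUE/FALSE per element

print(idx)
print(values)
print(flags)
```

The logical vector from grepl() is convenient for subsetting, for example text[flags].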
You can extract parts of strings using the substr() function, which takes the start and stop positions of the substring you want to extract.
Example:
# Define the string
string <- "banana"
# Extract the substring from position 1 to position 3
substring_result <- substr(string, 1, 3)
# Print the result
print(substring_result)
Output:
[1] "ban"
The gsub() function in R is used to replace all instances of a specified pattern in a string with a new value. It allows for powerful string manipulation by applying regular expressions to search and modify text.
Example:
# Example program for mutating strings with gsub()
# Define the string
original_string <- "I love banana"
# Use gsub() to replace "banana" with "orange"
mutated_string <- gsub("banana", "orange", original_string)
# Print the original and mutated strings
cat("Original String: ", original_string, "\n")
cat("Mutated String: ", mutated_string, "\n")
Output:
Original String: I love banana
Mutated String: I love orange
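The difference between gsub() and its sibling sub() shows up when a pattern occurs more than once. The sketch below also includes one regular-expression pattern; the example strings are made up.

```r
sentence <- "banana bread and banana cake"

first_only  <- sub("banana", "orange", sentence)   # replaces the first match only
all_matches <- gsub("banana", "orange", sentence)  # replaces every match

# The pattern is a regular expression: collapse runs of whitespace to one space
tidy <- gsub("\\s+", " ", "too    many   spaces")

print(first_only)
print(all_matches)
print(tidy)
```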
The paste() function combines multiple strings into one by inserting a separator, if specified. strsplit() does the opposite, breaking a string into a list of substrings based on a delimiter, useful for data parsing and manipulation.
Example:
# Joining strings using paste()
joined_string <- paste("Hello", "World", sep = " ")
print(joined_string)
# Splitting strings using strsplit()
splitted_strings <- strsplit("apple,orange,banana", ",")
print(splitted_strings)
Output:
[1] "Hello World"
[[1]]
[1] "apple" "orange" "banana"
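Two details often come up alongside these functions: the collapse argument of paste(), which joins the elements of one vector into a single string, and the fact that strsplit() returns a list. A minimal sketch:

```r
fruits <- c("apple", "orange", "banana")

csv_line   <- paste(fruits, collapse = ",")  # collapse joins a vector into one string
item_names <- paste0("item_", 1:3)           # paste0() uses no separator

# strsplit() returns a list; [[1]] extracts the character vector for the first input
parts <- strsplit(csv_line, ",")[[1]]

print(csv_line)
print(item_names)
print(parts)
```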
These string-handling functions are critical for text data processing in R. Let's take a closer look at working with data frames, which are central to R data manipulation.
Data frames are the go-to structure for handling tabular data. R provides a rich set of functions to manipulate and transform data within data frames. For a quick reference, you can consult an R programming cheat sheet to streamline your work with data frames.
# Create a data frame using the data.frame() function
df <- data.frame(Name = c("John", "Anna", "Peter"),
Age = c(23, 25, 30))
# Print the data frame
print(df)
Output:
Name Age
1 John 23
2 Anna 25
3 Peter 30
To access columns in a data frame, use the $ operator followed by the column name. For example, df$column_name will retrieve the data in that specific column, making it easy to reference and manipulate data directly.
Example:
# Creating a data frame
data <- data.frame(
Name = c('Alice', 'Bob', 'Charlie', 'David'),
Age = c(25, 30, 35, 40),
City = c('New York', 'Los Angeles', 'Chicago', 'Houston')
)
# Accessing columns using $ operator
name_column <- data$Name # Accessing the Name column
age_column <- data$Age # Accessing the Age column
city_column <- data$City # Accessing the City column
# Display the results
cat("Name Column:\n")
print(name_column)
cat("\nAge Column:\n")
print(age_column)
cat("\nCity Column:\n")
print(city_column)
Output:
Name Column:
[1] "Alice" "Bob" "Charlie" "David"
Age Column:
[1] 25 30 35 40
City Column:
[1] "New York" "Los Angeles" "Chicago" "Houston"
Subsetting data frames allows you to extract specific rows or columns using indexing techniques. You can use single or double square brackets to access parts of the dataframe, filtering data based on conditions or selecting desired columns efficiently.
Example:
# Creating a sample data frame
data <- data.frame(
Name = c('Alice', 'Bob', 'Charlie', 'David', 'Eve'),
Age = c(25, 30, 35, 40, 45),
City = c('New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix')
)
# Subsetting examples
first_row <- data[1, ]
second_column <- data[, 2]
# Print results
cat("First Row:\n")
print(first_row)
cat("\nSecond Column:\n")
print(second_column)
Output:
First Row:
Name Age City
1 Alice 25 New York
Second Column:
[1] 25 30 35 40 45
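Filtering by condition and the difference between single and double brackets can be sketched as follows, using a smaller version of the same made-up data:

```r
data <- data.frame(
  Name = c("Alice", "Bob", "Charlie", "David", "Eve"),
  Age  = c(25, 30, 35, 40, 45)
)

over_30  <- data[data$Age > 30, ]          # rows that satisfy a condition
age_df   <- data["Age"]                    # single brackets keep a data frame
age_vec  <- data[["Age"]]                  # double brackets return the column as a vector
names_35 <- data[data$Age >= 35, "Name"]   # rows by condition, one column by name

print(over_30)
print(class(age_df))
print(class(age_vec))
print(names_35)
```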
Mutating data frames involves adding new columns, removing existing ones, or modifying current data. Common operations include applying functions, creating new variables based on conditions, or transforming existing values to meet specific requirements in data analysis.
Example:
# Creating a sample DataFrame
data <- data.frame(
Name = c('John', 'Alice', 'Bob'),
Age = c(23, 25, 22)
)
# Displaying the original DataFrame
cat("Original DataFrame:\n")
print(data)
# Adding a new column 'Gender'
data$Gender <- c('M', 'F', 'M')
# Modifying the 'Age' column (e.g., adding 1 year to each person's age)
data$Age <- data$Age + 1
# Displaying the mutated DataFrame
cat("\nMutated DataFrame:\n")
print(data)
Output:
Original DataFrame:
Name Age
1 John 23
2 Alice 25
3 Bob 22
Mutated DataFrame:
Name Age Gender
1 John 24 M
2 Alice 26 F
3 Bob 23 M
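The other two operations mentioned above, creating a variable from a condition and removing a column, might look like this. The cutoff of 23 and the Group labels are arbitrary choices for the sketch.

```r
data <- data.frame(Name = c("John", "Alice", "Bob"), Age = c(23, 25, 22))

# Create a column based on a condition (the cutoff of 23 is arbitrary)
data$Group <- ifelse(data$Age >= 23, "older", "younger")

# Remove a column by assigning NULL to it
data$Age <- NULL

print(data)
```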
With a solid understanding of data frames, let's explore how to load and import data into R to make the most of your R data manipulation skills.
Working with external data is crucial for analysis. R provides various functions to load data from different sources, such as CSV files, Excel, and databases.
The readRDS() function is specifically used to load R-specific objects, including those with metadata, which are saved in .rds format. Unlike read.csv(), which is used for tabular data, readRDS() preserves R objects' structure and attributes.
The following table summarizes key functions used to import data into R.
Function | What It Does | Example Code |
read.csv() | Loads data from a CSV file. | data <- read.csv("file.csv") |
read.table() | Loads data from a general text file. | data <- read.table("file.txt", header=TRUE) |
readRDS() | Reads an R object saved as an RDS file. | data <- readRDS("data.rds") |
read_excel() | Reads Excel files; requires loading the readxl package first with library(readxl). | data <- read_excel("file.xlsx") |
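To illustrate how the RDS format preserves an object's structure, the sketch below saves a data frame with a factor column using saveRDS(), the counterpart of readRDS(), and reads it back. The survey data and the temporary file path are invented for the example.

```r
# A data frame with a factor column, whose type a CSV file would not record
survey <- data.frame(
  id    = 1:3,
  level = factor(c("low", "high", "low"), levels = c("low", "high"))
)

rds_path <- tempfile(fileext = ".rds")
saveRDS(survey, rds_path)        # write the object to disk
restored <- readRDS(rds_path)    # read it back

print(identical(survey, restored))
print(levels(restored$level))
```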
Now that we've covered key import functions, let’s explore techniques for generating and manipulating random data in data tables.
Generating random data is a common task in R, useful for testing algorithms or simulating datasets. R provides several functions to create random values from different distributions. Once the data is generated, transforming it is equally important for analysis.
Here, you will learn various ways to generate random data and efficiently convert it within data tables.
R offers powerful functions for generating random data from various statistical distributions. These functions include sample(), rnorm(), and runif(). Below, we will explain these functions and provide examples for generating random numbers, normal distributions, and uniform distributions.
Example:
sample(1:10, 5) # Randomly selects 5 numbers from 1 to 10
Example:
rnorm(5, mean = 0, sd = 1) # Generates 5 random numbers from a standard normal distribution
Example:
runif(5, min = 0, max = 1) # Generates 5 random numbers between 0 and 1
Generating random data can be especially useful for simulations or when creating synthetic datasets. Next, we will look at how to ensure reproducibility in random sampling.
Random sampling is often used in data analysis for selecting subsets of data. To ensure reproducibility of your results, it is essential to control the random number generation. This is where the set.seed() function comes in.
set.seed(42)
sample(1:10, 5) # Always returns the same set of numbers with seed 42
You can use sample() to draw random samples either with or without replacement. By default, it samples without replacement.
Example:
sample(1:10, 5, replace = TRUE) # Randomly selects 5 numbers with replacement
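One way to see what set.seed() does is to draw twice with the same seed and compare. This is a small sketch; the seed value 123 is arbitrary.

```r
set.seed(123)
first_draw <- rnorm(3)

set.seed(123)            # resetting the seed restarts the same sequence
second_draw <- rnorm(3)

third_draw <- rnorm(3)   # no reset here, so the sequence simply continues

print(identical(first_draw, second_draw))
print(identical(second_draw, third_draw))
```

Setting the seed once at the top of a script is usually enough to make the whole run repeatable.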
After learning random sampling, it's important to learn how to transform data efficiently. Let's now explore how to transform data using data.table and dplyr packages.
Once data is generated or imported into R, transforming it is essential for analysis. The data.table and dplyr packages provide powerful tools to manipulate data tables.
Example:
library(data.table)
dt <- data.table(A = 1:5, B = letters[1:5])
dt[, C := A * 2] # Adds a new column 'C' which is twice the value of 'A'
In data.table, .() is shorthand for list() and is used to build the columns of a result. To chain operations, append another set of square brackets to the previous one, as in dt[...][...].
Example:
dt[, .(Sum = sum(A), Mean = mean(A)), by = B] # Groups by 'B' and calculates sum and mean for 'A'
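Chaining in data.table is done by appending a second set of square brackets to the first. The sketch below assumes data.table is installed; the table and the column names G and Total are made up for the example.

```r
library(data.table)

dt <- data.table(A = 1:6, G = c("x", "y", "x", "y", "x", "y"))

# .() is shorthand for list(); the second [] operates on the grouped result
summary_dt <- dt[, .(Total = sum(A)), by = G][order(-Total)]

print(summary_dt)
```

The first bracket groups and summarizes, and the second sorts the summary from largest to smallest total.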
Example:
library(dplyr)
df <- data.frame(A = 1:5, B = letters[1:5])
df %>% mutate(C = A * 2) # Adds a new column 'C' to the data frame
The pipe operator %>% is widely used to chain multiple operations together in dplyr. This enhances readability and efficiency when applying transformations.
By chaining multiple transformations, you can create more complex data manipulation pipelines in a single, readable line of code. This method greatly enhances the clarity and efficiency of your data processing.
Example:
df %>% filter(A > 2) %>% select(A) # Filters rows where A > 2 and selects column A