Cluster Analysis in R: The Complete Guide You Will Ever Need
By Rohit Sharma
Updated on Mar 28, 2025 | 9 min read | 6.5k views
If you’ve ever dipped even a toe into the world of data science, or worked with a language like Python, you will have heard of R. Cluster analysis in R is a powerful technique for data segmentation and pattern recognition. However, assessing the quality and validity of the resulting clusters is essential to ensure meaningful insights.
Developed as a GNU project, R is both a language and an environment designed for graphics and statistical computing. It is similar to the S language and can thus be considered an implementation of it.
As a language, R is highly extensible. It provides a wide variety of statistical and graphical techniques, such as time-series analysis, linear and non-linear modelling, clustering, classification, and classical statistical tests.
It is one of these techniques that we will explore more deeply here: clustering, or cluster analysis.
In the simplest of terms, clustering is a data segmentation method whereby data is partitioned into several groups on the basis of similarity.
How is the similarity assessed? On the basis of inter-observation distance measures, which can be either Euclidean or correlation-based.
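As a quick illustration (using the built-in iris data purely as an example), base R’s dist() function computes Euclidean distances between observations, and a correlation-based distance can be derived from cor():

# Inter-observation distance measures, illustrated on the built-in iris data
x <- as.matrix(iris[, 1:4])            # keep only the numeric columns

# Euclidean distance between every pair of observations
d_euclidean <- dist(x, method = "euclidean")

# Correlation-based distance: 1 minus the Pearson correlation between observations
# (observations are rows, so the matrix is transposed before calling cor())
d_correlation <- as.dist(1 - cor(t(x)))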
Cluster analysis is one of the most popular and in a way, intuitive, methods of data analysis and data mining. It is ideal for cases where there is voluminous data and we have to extract insights from it. In this case, the bulk data can be broken down into smaller subsets or groups.
The smaller groups formed and derived from the whole dataset are known as clusters. These are obtained by performing one or more statistical operations. Each cluster, though containing different elements, shares the property that its members are more similar to one another than to the members of other clusters.
Even without the fancy name of cluster analysis, the same idea is used a lot in day-to-day life.
At the individual level, we make clusters of the things we need to pack when we set out on a vacation. First clothes, then toiletries, then books, and so on. We make categories and then tackle them individually.
Companies use cluster analysis, too, when they carry out segmentation on their email lists and categorize customers on the basis of age, economic background, previous buying behaviour, etc.
Cluster analysis is also referred to as unsupervised machine learning or pattern recognition. Unsupervised, because there are no predefined labels telling us which group each sample belongs to; learning, because the algorithm works out the grouping on its own.
Three methods are most often used for clustering: agglomerative hierarchical clustering, clustering by similarity aggregation (the Condorcet method), and k-means clustering.
Agglomerative Hierarchical Clustering (AHC)
This is the most common type of hierarchical clustering. The AHC algorithm works in a bottom-up manner: it begins by regarding each data point as a cluster in itself (called a leaf).
It then merges the two clusters that are most similar. These new, bigger clusters are called nodes. The merging is repeated until the entire dataset comes together as a single, big cluster called the root.
Visualizing and drawing each step of the AHC process leads to the generation of a tree called a dendrogram.
Reversing the AHC process, starting from a single all-encompassing cluster and splitting it repeatedly, is known as divisive clustering.
The dendrogram can also be generated and plotted directly in R, as in the sketch below.
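A minimal sketch of AHC in base R, again using the iris measurements as example data; hclust() builds the tree bottom-up and plot() draws the dendrogram:

# Agglomerative hierarchical clustering on the iris measurements
x <- as.matrix(iris[, 1:4])
d <- dist(x, method = "euclidean")     # inter-observation distances

hc <- hclust(d, method = "complete")   # repeatedly merge the two closest clusters
plot(hc, labels = FALSE, main = "AHC dendrogram")

groups <- cutree(hc, k = 3)            # cut the tree into three clusters
table(groups)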
In conclusion, if you want an algorithm that is good at identifying small clusters, go for AHC. If you want one that is good at identifying large clusters, then the divisive clustering method should be your choice.
Clustering by Similarity Aggregation
Clustering by similarity aggregation, sometimes called the Condorcet method, works as follows:
Individual objects are compared in pairs. For each pair of objects (A, B), two quantities are computed: m(A, B), the number of attributes on which A and B take the same value, and d(A, B), the number of attributes on which they take different values.
The Condorcet criterion for the pair (A, B) is then:
c(A, B) = m(A, B) − d(A, B)
For an individual value A and a cluster S, the Condorcet criterion is:
c(A, S) = Σ c(A, Bi), where the summation runs over all elements Bi ∈ S.
With these quantities defined, clusters are built by assigning each value A to the cluster S for which c(A, S) is the largest, provided that value is at least 0.
Finally, the global Condorcet criterion is calculated by summing c(A, SA) over all individual values A, where SA is the cluster that contains A.
The above steps are repeated until the global Condorcet criterion no longer improves or the maximum number of iterations is reached.
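There is no single standard base R function for this method, so the snippet below is only a minimal sketch of the Condorcet criterion for categorical data; the helper names condorcet_pair() and condorcet_cluster() are invented purely for illustration:

# Sketch of the Condorcet criterion on categorical observations
condorcet_pair <- function(a, b) {
  m <- sum(a == b)                     # attributes on which A and B agree
  d <- sum(a != b)                     # attributes on which they differ
  m - d                                # c(A, B) = m(A, B) - d(A, B)
}

condorcet_cluster <- function(a, cluster_rows) {
  # c(A, S): sum of c(A, Bi) over every member Bi of cluster S
  sum(apply(cluster_rows, 1, condorcet_pair, b = a))
}

# Tiny example with three categorical observations
obs <- rbind(c("red", "small", "round"),
             c("red", "large", "round"),
             c("blue", "small", "square"))
condorcet_pair(obs[1, ], obs[2, ])        # pairwise criterion c(A, B)
condorcet_cluster(obs[1, ], obs[2:3, ])   # criterion c(A, S) against a cluster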
K-Means Clustering
This is one of the most popular partitioning algorithms. The analyst first chooses a number of clusters, k, and all of the available data (also called data points or observations) is grouped into exactly k clusters. The algorithm proceeds roughly as follows: k initial centroids are picked at random; each data point is assigned to its nearest centroid; each centroid is then recomputed as the mean of the points assigned to it; and the assignment and recomputation steps are repeated until the assignments stop changing.
The distance between a data point and a centroid can be calculated using one of several distance measures. The most popular of these, the Euclidean distance, is calculated as:
d(x, y) = √( Σ (xi − yi)² ), where the sum runs over all coordinates i of the two points.
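As a small sanity check (the points below are arbitrary), the hand computation of the Euclidean distance matches base R’s dist():

# Euclidean distance between two points, by hand and via dist()
p <- c(1, 2, 3)
q <- c(4, 6, 3)

sqrt(sum((p - q)^2))   # 5
dist(rbind(p, q))      # the same value, returned as a 'dist' object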
Each time the algorithm is run, slightly different groups may be returned. The initial placement of the k centroids is random, which makes k-means very sensitive to this first choice. As a result, it is hard to reproduce exactly the same clustering unless the number of groups and observations is small, or the algorithm is run with several random starts.
In the beginning, we have to choose a value of k, which dictates how the results turn out. As a rough starting point, a commonly cited rule of thumb is:
k ≈ √(n / 2)
Here, n is the number of data points in the dataset.
Regardless of any formula, the right number of clusters depends heavily on the nature of the dataset and the industry or business it belongs to. Hence, it is advisable to rely on one’s own experience and intuition as well.
With the wrong cluster size, the grouping may not be effective and can lead to overfitting. Due to overfitting, new data points might not find a place in any cluster, because the algorithm has eked out every little detail and all generalization is lost.
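A minimal k-means run in base R on the scaled iris measurements (an example dataset); the nstart argument repeats the random initialisation several times to reduce the sensitivity to the first choice described above:

set.seed(42)                                # make the random initialisation reproducible
x <- scale(iris[, 1:4])                     # scale variables before clustering

km <- kmeans(x, centers = 3, nstart = 25)   # 25 random starts, keep the best solution
km$size                                     # number of observations in each cluster
km$tot.withinss                             # total within-cluster sum of squares
head(km$cluster)                            # cluster assignments of the first observations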
The Silhouette Coefficient measures the compactness and separation of clusters. It quantifies how well each data point fits within its assigned cluster compared to neighboring clusters. The coefficient ranges from -1 to 1, with values closer to 1 indicating better cluster quality.
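One way to compute the Silhouette Coefficient in R is via the cluster package; the sketch below assumes a k-means solution on scaled iris data:

library(cluster)

x  <- scale(iris[, 1:4])
km <- kmeans(x, centers = 3, nstart = 25)

sil <- silhouette(km$cluster, dist(x))   # silhouette width for every observation
mean(sil[, "sil_width"])                 # average silhouette: closer to 1 is better
plot(sil)                                # per-cluster silhouette plot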
The Dunn Index evaluates cluster separation by considering the ratio between the smallest inter-cluster distance and the largest intra-cluster distance. Higher Dunn Index values indicate better-defined and well-separated clusters.
The Calinski-Harabasz Index measures the ratio of between-cluster dispersion to within-cluster dispersion. It seeks to maximize the inter-cluster distance while minimizing the intra-cluster distance. Higher index values indicate better cluster quality.
The Elbow method helps determine the optimal number of clusters by plotting the sum of squared distances (SSD) against different values of k. The point at which the SSD curve exhibits an “elbow” shape suggests the appropriate number of clusters, balancing compactness and separation.
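The elbow plot can be drawn with base R alone: fit k-means for a range of k values and plot the total within-cluster sum of squares (the iris data is again just an example):

x <- scale(iris[, 1:4])

k_values <- 1:10
wss <- sapply(k_values, function(k) kmeans(x, centers = k, nstart = 25)$tot.withinss)

plot(k_values, wss, type = "b",
     xlab = "Number of clusters k",
     ylab = "Total within-cluster sum of squares")
# Look for the 'elbow' where the curve stops dropping sharply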
The Gap statistic compares the observed within-cluster dispersion to an expected reference distribution. It calculates the optimal number of clusters based on the maximum gap between the observed and expected values. This technique helps avoid overfitting and provides more robust cluster validation.
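The cluster package provides clusGap() for the Gap statistic; a sketch, assuming k-means as the underlying clustering function:

library(cluster)

x <- scale(iris[, 1:4])

# B reference datasets are simulated to estimate the expected dispersion
gap <- clusGap(x, FUNcluster = kmeans, K.max = 10, B = 50, nstart = 25)
print(gap, method = "firstSEmax")   # suggested number of clusters
plot(gap)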
Hierarchical Consensus Clustering combines multiple clustering runs to generate a consensus dendrogram. It enhances the stability and robustness of clustering results by identifying stable clusters. By assessing the consensus among different clustering outcomes, this technique improves the reliability of the clustering process.
Bootstrap Evaluation involves resampling the dataset and applying the clustering algorithm multiple times. It helps estimate the stability and uncertainty of the clustering results. By examining the consistency of cluster assignments across different bootstrap samples, one can assess the reliability and robustness of the clusters.
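One common way to do this in R is clusterboot() from the fpc package; the sketch below assumes k-means through the kmeansCBI interface:

library(fpc)

x <- scale(iris[, 1:4])

# Resample the data B times, recluster, and measure how stable each cluster is
cb <- clusterboot(x, B = 100, clustermethod = kmeansCBI, k = 3, seed = 123)
cb$bootmean   # mean Jaccard similarity per cluster: values close to 1 indicate stability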
So, where exactly are the powerful clustering methods used? We cursorily mentioned a few examples above. Below are some more instances:
Medicine
On the basis of patients’ age and genetic makeup, doctors are able to provide better diagnoses. This ultimately leads to treatment that is more beneficial and better aligned with the patient, and new medicines can also be discovered this way. The application of clustering in medicine is termed nosology.
Sociology
In social spheres, clustering people on the basis of demographics, age, occupation, place of residence, and so on helps governments enforce laws and shape policies that suit diverse groups.
Marketing
In marketing, clustering goes by the name of segmentation or typological analysis. It is used to explore and select potential buyers of a particular product. Companies then test the elements of each cluster to find out which customers are most likely to be retained.
Web Usage Analysis
Here, the web pages a user has accessed in the past are taken as input for the clustering algorithm. These pages are then clustered, and a profile of the user, based on their browsing activity, is generated. From personalization to cyber safety, this result can be leveraged in many places.
Retail
Outlets also benefit from clustering customers on the basis of age, colour preferences, style preferences, past purchases, and so on. This helps retailers create customized experiences and plan future offerings aligned with customer desires.
To ensure accurate cluster analysis, it helps to follow a few best practices: scale or normalize variables before computing distances, choose a distance measure suited to the data, validate the clusters with metrics such as those described above, and compare the results of more than one algorithm.
As is evident, cluster analysis is a valuable method, no matter the language or environment it is implemented in. Whether one wants to derive insights, tease out patterns, or carve out profiles, it is a practical tool whose results can be put to immediate use. Proficiency with the various clustering algorithms lets one perform accurate and truly valuable data analysis.
Learn data science courses from the World’s top Universities. Earn Executive PG Programs, Advanced Certificate Programs, or Masters Programs to fast-track your career.