
Cluster Analysis in R: The Only Guide You Will Ever Need

By Rohit Sharma

Updated on Mar 28, 2025 | 9 min read | 6.5k views


If you’ve ever dipped so much as a toe into the world of data science, you will have heard of R. Cluster analysis in R is a powerful data segmentation and pattern recognition technique. However, assessing the quality and validity of the obtained clusters is essential to ensure meaningful insights.

Developed as a GNU project, R is both a language and an environment designed for graphics and statistical computing. It is similar to the S language and can thus be considered an implementation of S.

As a language, R is highly extensible. It provides a variety of statistical and graphical techniques such as time-series analysis, linear modelling, non-linear modelling, clustering, classification, and classical statistical tests.

It is one of these techniques that we will be exploring more deeply here: clustering, or cluster analysis!

What is cluster analysis?

In the simplest of terms, clustering is a data segmentation method whereby data is partitioned into several groups on the basis of similarity. 

How is the similarity assessed? On the basis of inter-observation distance measures. These can be either Euclidean or correlation-based distance measures.

Cluster analysis is one of the most popular and, in a way, most intuitive methods of data analysis and data mining. It is ideal for cases where we have voluminous data and need to extract insights from it, since the bulk data can be broken down into smaller subsets or groups.

The little groups that are formed and derived from the whole dataset are known as clusters. These are obtained by performing one or more statistical operations. The clusters, though containing different elements, share the following properties:

  1. Their number is not known in advance.
  2. They are obtained by carrying out a statistical operation.
  3. Each cluster contains objects that are similar and have common characteristics.

Even without the ‘fancy’ name of cluster analysis, the same idea is used a lot in day-to-day life.

At the individual level, we make clusters of the things we need to pack when we set out on a vacation. First clothes, then toiletries, then books, and so on. We make categories and then tackle them individually.

Companies use cluster analysis, too, when they carry out segmentation on their email lists and categorize customers on the basis of age, economic background, previous buying behaviour, etc. 

Cluster analysis is also referred to as ‘unsupervised machine learning’ or pattern recognition. Unsupervised, because we do not tell the algorithm in advance which samples belong to which category; learning, because the algorithm works out the clusters from the data itself.

3 Methods of Clustering

We have three methods that are most often used for clustering. These are:

  1. Agglomerative Hierarchical Clustering
  2. Relational clustering/ Condorcet method
  3. k-means clustering

1. Agglomerative Hierarchical Clustering

This is the most common type of hierarchical clustering. The algorithm for AHC works in a bottom-up manner. It begins by regarding each data point as a cluster in itself (called a leaf). 

It then merges the two clusters that are the most similar. These new, bigger clusters are called nodes. The merging is repeated until the entire dataset comes together as a single, big cluster called the root.

Visualizing and drawing each step of the AHC process leads to the generation of a tree called a dendrogram. 
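To make this concrete, here is a minimal sketch in base R, using the built-in USArrests dataset purely as an illustration (any numeric data frame works the same way):

  # Standardize the data so that no single variable dominates the distances
  df <- scale(USArrests)
  # Inter-observation distance matrix (Euclidean, as discussed above)
  d <- dist(df, method = "euclidean")
  # Agglomerative hierarchical clustering: leaves are merged bottom-up
  hc <- hclust(d, method = "complete")
  # Draw the dendrogram, then cut the tree into, say, 4 clusters
  plot(hc, cex = 0.6)
  groups <- cutree(hc, k = 4)
  table(groups)

Once you decide where to cut the tree, cutree() turns the dendrogram back into flat cluster labels.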

Reversing the AHC process, starting with the whole dataset as one cluster and splitting it step by step, gives divisive clustering.


In conclusion, if you want an algorithm that is good at identifying small clusters, go for AHC. If you want one that is good at identifying large clusters, then the divisive clustering method should be your choice.

2. Relational clustering/ Condorcet method

‘Clustering by Similarity Aggregation’ is another name for this method. It works as follows:

The individual objects in pairs are compared to build up the global clustering. To each pair of individuals (A, B), two values are assigned: m(A, B), the number of attributes on which A and B agree, and d(A, B), the number of attributes on which they differ.

The Condorcet criterion for the pair (A, B) is then:

c(A, B) = m(A, B) − d(A, B)

For an individual A and a cluster S, the Condorcet criterion becomes:

c(A, S) = Σi c(A, Bi)

where the summation runs over all members Bi ∈ S.

With these criteria in place, clusters are constructed so that, for each member A, c(A, S) is at least 0 and is the largest value that A attains across all clusters.

Finally, the global Condorcet criterion is calculated by summing c(A, SA) over all individuals A, where SA is the cluster that contains A.

The above steps are repeated until the global Condorcet criterion no longer improves or the maximum number of iterations is reached.
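Base R does not ship a Condorcet clustering routine, so the sketch below hand-rolls the pairwise criterion for categorical data; the helper names condorcet_pair and condorcet_point are our own, for illustration only:

  # m(A, B): attributes on which A and B agree; d(A, B): attributes on which they differ
  condorcet_pair <- function(a, b) sum(a == b) - sum(a != b)   # c(A, B)

  # c(A, S): sum of c(A, Bi) over every member Bi of cluster S
  condorcet_point <- function(A, S) sum(apply(S, 1, condorcet_pair, b = A))

  # Toy categorical data: one row per individual, one column per attribute
  x <- rbind(c("red", "S"), c("red", "S"), c("blue", "L"))
  condorcet_pair(x[1, ], x[2, ])                    # 2: the pair agrees on both attributes
  condorcet_point(x[1, ], x[2:3, , drop = FALSE])   # 0: agreement and disagreement cancel out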


3. k-means clustering

This is one of the most popular partitioning algorithms. The dataset is partitioned into k clusters, and every available data point (also called an observation) is assigned to exactly one of them. Here is a breakdown of how the algorithm proceeds:

  1. Select k data points (rows) at random; these serve as the initial centroids, one for each cluster.
  2. Each data point is then assigned to the centroid closest to it.
  3. As more and more data points get assigned, each centroid is recalculated as the average of all the data points currently assigned to it.
  4. Continue assigning data points and shifting the centroid as needed.
  5. Repeat steps 3 and 4 until no data points change cluster.

The distance between a data point and a centroid is calculated using one of the following methods:

  1. Euclidean distance
  2. Manhattan distance
  3. Minkowski distance

The most popular of these, the Euclidean distance, is calculated as follows:

d(x, y) = √( (x1 − y1)² + (x2 − y2)² + … + (xn − yn)² )

Each time the algorithm is run, different groups may be returned as a result, because the very first choice of centroids is completely random. This makes k-means very sensitive to that first choice. As a result, it becomes almost impossible to get the same clustering twice unless the number of groups and overall observations is small.
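Here is a minimal sketch with base R’s kmeans(), again on the built-in USArrests data; fixing the seed and using nstart to keep the best of several random starts are the standard ways to tame this sensitivity:

  set.seed(42)                                # make the random initialization reproducible
  df <- scale(USArrests)                      # standardize features first
  km <- kmeans(df, centers = 4, nstart = 25)  # 25 random starts; the best solution is kept
  km$size                                     # number of points per cluster
  km$centers                                  # final centroids (cluster means)
  km$tot.withinss                             # total within-cluster sum of squares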

How to assign a value to k?

In the beginning, we’ll randomly assign a value to k, which will dictate the direction that the results head in. To make a sensible first choice, a common rule of thumb is:

k ≈ √(n/2)

Here, n is the number of data points in the dataset.

Regardless of the presence of a formula, the number of clusters would be heavily dependent on the nature of the dataset, the industry and business it belongs to, etc. Hence, it is advisable to pay heed to one’s own experience and intuition as well.

With the wrong cluster size, the grouping may not be as effective and can lead to overfitting. Due to overfitting, new data points might not find a place in any cluster, since the algorithm has eked out the little details and all generalization is lost.

Cluster Validity Metrics

Silhouette Coefficient

The Silhouette Coefficient measures the compactness and separation of clusters. It quantifies how well each data point fits within its assigned cluster compared to neighboring clusters. The coefficient ranges from -1 to 1, with values closer to 1 indicating better cluster quality.
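In R, silhouette() from the cluster package (which ships with standard R installations) computes this directly; the k-means fit below is just an example input:

  library(cluster)                            # provides silhouette()
  df <- scale(USArrests)
  km <- kmeans(df, centers = 4, nstart = 25)
  sil <- silhouette(km$cluster, dist(df))
  mean(sil[, "sil_width"])                    # average silhouette width; closer to 1 is better
  plot(sil)                                   # per-point silhouette widths, grouped by cluster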

Dunn Index

The Dunn Index evaluates cluster separation by considering the ratio between the smallest inter-cluster distance and the largest intra-cluster distance. Higher Dunn Index values indicate better-defined and well-separated clusters.
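The clValid package offers a ready-made dunn() function; since the ratio is simple, here is a small self-contained base-R version (dunn_index is our own helper name, not a standard API):

  # Dunn index: smallest inter-cluster distance / largest intra-cluster distance
  dunn_index <- function(d, clusters) {
    dm <- as.matrix(d)
    same <- outer(clusters, clusters, "==")   # TRUE where two points share a cluster
    inter <- dm[!same]                        # distances across different clusters
    intra <- dm[same & upper.tri(dm)]         # within-cluster distances (diagonal excluded)
    min(inter) / max(intra)
  }
  df <- scale(USArrests)
  km <- kmeans(df, centers = 4, nstart = 25)
  dunn_index(dist(df), km$cluster)            # higher is better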

Calinski-Harabasz Index

The Calinski-Harabasz Index measures the ratio of between-cluster dispersion to within-cluster dispersion. It seeks to maximize the inter-cluster distance while minimizing the intra-cluster distance. Higher index values indicate better cluster quality.
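A k-means fit already contains the between- and within-cluster sums of squares, so the index is one line of arithmetic; ch_index below is our own illustrative helper (the fpc package’s calinhara() is a packaged alternative):

  # Calinski-Harabasz: (B / (k - 1)) / (W / (n - k)); higher is better
  ch_index <- function(km, n) {
    k <- nrow(km$centers)
    (km$betweenss / (k - 1)) / (km$tot.withinss / (n - k))
  }
  df <- scale(USArrests)
  km <- kmeans(df, centers = 4, nstart = 25)
  ch_index(km, nrow(df))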

Cluster Validity Techniques

Elbow Method

The Elbow method helps determine the optimal number of clusters by plotting the sum of squared distances (SSD) against different values of k. The point at which the SSD curve exhibits an “elbow” shape suggests the appropriate number of clusters, balancing compactness and separation.
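A minimal sketch: run k-means for a range of k values and plot the total within-cluster sum of squares against k:

  df <- scale(USArrests)
  set.seed(42)
  wss <- sapply(1:10, function(k) kmeans(df, centers = k, nstart = 25)$tot.withinss)
  plot(1:10, wss, type = "b",
       xlab = "Number of clusters k",
       ylab = "Total within-cluster sum of squares")  # look for the bend (the 'elbow')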

Gap Statistic

The Gap statistic compares the observed within-cluster dispersion to an expected reference distribution. It calculates the optimal number of clusters based on the maximum gap between the observed and expected values. This technique helps avoid overfitting and provides more robust cluster validation.
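The cluster package implements this as clusGap(); B controls how many reference datasets are drawn:

  library(cluster)                            # provides clusGap()
  df <- scale(USArrests)
  set.seed(42)
  gap <- clusGap(df, FUNcluster = kmeans, nstart = 25, K.max = 10, B = 50)
  plot(gap)                                   # pick the k where the gap statistic peaks
  print(gap, method = "firstmax")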

Hierarchical Consensus Clustering

Hierarchical Consensus Clustering combines multiple clustering runs to generate a consensus dendrogram. It enhances the stability and robustness of clustering results by identifying stable clusters. By assessing the consensus among different clustering outcomes, this technique improves the reliability of the clustering process.
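Dedicated packages exist for this (for example ConsensusClusterPlus on Bioconductor); the self-contained sketch below hand-rolls the idea: repeat k-means on random subsamples, record how often each pair of points lands in the same cluster, and build a dendrogram from that consensus matrix:

  df <- scale(USArrests)
  n <- nrow(df); runs <- 100; k <- 4
  co  <- matrix(0, n, n)                      # co-assignment counts
  cnt <- matrix(0, n, n)                      # how often each pair was sampled together
  set.seed(1)
  for (r in 1:runs) {
    idx <- sample(n, size = round(0.8 * n))   # random 80% subsample
    cl  <- kmeans(df[idx, ], centers = k, nstart = 5)$cluster
    together <- outer(cl, cl, "==")
    co[idx, idx]  <- co[idx, idx] + together
    cnt[idx, idx] <- cnt[idx, idx] + 1
  }
  consensus <- co / pmax(cnt, 1)              # pairwise co-clustering rates in [0, 1]
  hc <- hclust(as.dist(1 - consensus), method = "average")
  plot(hc)                                    # the consensus dendrogram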

Bootstrap Evaluation

Bootstrap Evaluation involves resampling the dataset and applying the clustering algorithm multiple times. It helps estimate the stability and uncertainty of the clustering results. By examining the consistency of cluster assignments across different bootstrap samples, one can assess the reliability and robustness of the clusters.
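The fpc package wraps this pattern in clusterboot(); the call below assumes its k-means interface (kmeansCBI) and its bootmean output, which reports the mean Jaccard stability of each cluster across bootstrap samples:

  library(fpc)                                # provides clusterboot() (assumed installed)
  df <- scale(USArrests)
  set.seed(2)
  cb <- clusterboot(df, B = 100, clustermethod = kmeansCBI, krange = 4)
  cb$bootmean                                 # mean Jaccard similarity per cluster;
                                              # values above ~0.75 suggest stable clusters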

Applications of Cluster Analysis

So, where exactly are these powerful clustering methods used? We briefly mentioned a few examples above. Below are some more instances:

Medicine and health

On the basis of patients’ age and genetic makeup, doctors are able to provide a better diagnosis. This ultimately leads to treatment that is more beneficial and better aligned with the patient. New medicines can also be discovered this way. Clustering in medicine is termed nosology.

Sociology

In social spheres, clustering people on the basis of demographics, age, occupation, residence location, etc. helps the government to enforce laws and shape policies that suit diverse groups.

Marketing

In marketing, the term clustering is replaced by segmentation or typological analysis. It is used to explore and select potential buyers of a particular product. Companies then test the elements of each cluster to learn which customers are most likely to be retained.


Cyber profiling

Here, the web pages a user has accessed in the past are the input to the clustering algorithm. These web pages are then clustered, and the result is a profile of the user based on his browsing activity. From personalization to cyber safety, this result can be leveraged anywhere.

Retail

Outlets also benefit from clustering customers on the basis of age, colour preferences, style preferences, past purchases, etc. This helps retailers to create customized experiences and also plan future offerings aligned to customer desires.


Best Practices for Cluster Validity Assessment

To ensure accurate cluster analysis, consider the following best practices:

  1. Preprocess the data: Cleanse and normalize the data to remove noise and ensure consistent scaling before performing clustering analysis.
  2. Evaluate multiple metrics: Relying on a single metric may provide limited insights. Assess cluster validity using multiple metrics to obtain a comprehensive understanding.
  3. Combine multiple techniques: Employ a combination of evaluation techniques to validate clustering results from different perspectives, enhancing their reliability.
  4. Consider domain knowledge: Incorporate domain expertise to interpret and validate the clustering outcomes in the specific problem or application context.

Conclusion 

As is evident, cluster analysis is a highly valuable method, no matter the language or environment it is implemented in. Whether one wants to derive insights, eke out patterns, or carve out profiles, cluster analysis is a highly useful tool with results that can be practically implemented. Proficiency in working with the various clustering algorithms can lead one to perform accurate and truly valuable data analysis.
