Hierarchical Clustering in Python [Concepts and Analysis]
Updated on 25 November, 2022
With the increase in the flow of raw data and the need to analyze it, unsupervised learning has become popular over time. It is used to draw insights from datasets consisting of input data without labeled target values. Before we start discussing hierarchical clustering in Python and applying the algorithm on various datasets, let us revisit the basic idea of clustering.
Clustering mainly deals with the classification of raw data: it groups together the data points that are most similar to each other. These groups are called clusters, and they are formed based on the similarity or clustering metric defined.
Introduction
Hierarchical clustering deals with data in the form of a tree or a well-defined hierarchy. The process works on two clusters at a time, and the algorithm relies on a similarity or distance matrix for its computational decisions: which two clusters to merge, or how to divide a cluster into two. These two options give us the two types of hierarchical clustering. If you are a beginner interested in learning more about data science, check out our data science courses from top universities.
One of the algorithm's critical aspects is the similarity matrix (also known as the proximity matrix), as the whole algorithm proceeds based on it. There are several proximity methods, discussed further along in the article.
Types
Hierarchical clustering has two types:
- Agglomerative clustering
- Divisive clustering
The two types differ in how they develop the hierarchy: agglomerative clustering builds it bottom-up, whereas divisive clustering builds it top-down.
Agglomerative starts with every point as an individual cluster and merges them, two at a time, on each iteration. Divisive starts by treating the entire dataset as one cluster and divides it until all points become individual clusters.
The result is a set of nested clusters that can be perceived as a hierarchical tree. The best way to view it is to convert the set structure into a dendrogram to view the hierarchy.
The following gives a simple example of a dendrogram versus the cluster representation:
Here, the clustering may work either way, but the result will be a collection of clusters. The data points 1, 2, 3, 4, 5, and 6 are merged two at a time, and the resulting hierarchy can be seen in the dendrogram on the left. The same view helps in understanding how the cluster decisions were made.
Deciding the number of clusters
One of the most useful features of this algorithm is that you may extract as many clusters as you want once the algorithm terminates. This is quite different from the K-means algorithm, where the number of clusters must be passed as a hyperparameter: once the algorithm completes its computation, we have exactly that many clusters, and if we need a different number later, the only option is to change the parameter and train the model again.
With hierarchical clustering, by contrast, you can settle on the number of clusters later. You can take the two clusters at the final merge; if not satisfied, you may take the five clusters formed a few levels further down. It is up to you. Hence, once trained, you do not need to retrain the model to get more or fewer clusters: you simply cut the dendrogram at the level you desire.
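To make the "cutting" concrete, here is a minimal sketch using SciPy's fcluster function; X_demo is a hypothetical toy array used only for illustration:
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# A hypothetical toy dataset, used only to illustrate the cutting idea
X_demo = np.array([[1, 2], [1, 4], [1, 0], [10, 2], [10, 4], [10, 0]])

Z = linkage(X_demo, method='ward')  # build the hierarchy once

# Cut the same tree at different levels -- no retraining needed
labels_2 = fcluster(Z, t=2, criterion='maxclust')  # ask for two clusters
labels_5 = fcluster(Z, t=5, criterion='maxclust')  # ask for five clusters
print(labels_2)
print(labels_5)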
As we have the concepts down, let us discuss the working of hierarchical clustering in Python.
For the experiment, we are going to use the scikit-learn library for the clustering algorithms. We will also use the scipy.cluster.hierarchy module from SciPy to visualize and understand the "cutting" process for limiting the number of clusters.
import numpy as np

X = np.array([[3, 5],
              [12, 9],
              [13, 17],
              [14, 14],
              [60, 52],
              [55, 63],
              [69, 59]])
It would look something like this on a plot:
Well, we do see that we have two definitive clusters, at the top and bottom corners. Let us see if the algorithm can figure it out or not.
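To reproduce such a plot yourself, a plain matplotlib scatter of the X array above is enough (a sketch; the styling is up to you):
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1])  # raw points, before any clustering
plt.show()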
We would be using the AgglomerativeClustering function from the sklearn.cluster module.
from sklearn.cluster import AgglomerativeClustering

# Note: in newer scikit-learn versions (1.2+), the affinity parameter is renamed to metric
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='ward')
cluster.fit_predict(X)
Here, we do specify the number of clusters, but unlike in K-means it is not needed to build the hierarchy; we pass it so that the prediction classes come out clearly. The fit_predict function both trains the model and predicts the cluster labels for X.
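As a quick sanity check, the labels are also stored on the fitted estimator; the exact 0/1 assignment may vary between runs:
print(cluster.labels_)  # e.g. [1 1 1 1 0 0 0] -- the cluster numbering is arbitrary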
It is important to note that agglomerative clustering is used far more often than divisive clustering, as it is simpler to implement. Merging clusters based on a proximity matrix is easier than devising a mechanism to split a cluster into two.
Read: Scikit-learn in Python: Features, Prerequisites, Pros & Cons
To clearly understand what happened above, look at the steps involved in the algorithm:
Working of the algorithm
Here are the steps to execute the agglomerative clustering:
- Define each data point as a cluster
- Calculate the initial proximity matrix
- Merge the two clusters that are the "closest" (most similar) according to the metric
- Update the proximity matrix and repeat the previous step until a single cluster remains
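To expose the mechanics, here is a minimal, unoptimized sketch of these four steps in plain NumPy, assuming single (MIN) linkage; it is illustrative only, not a replacement for scikit-learn's implementation:
import numpy as np

def naive_single_linkage(X):
    # Step 1: every point starts as its own cluster
    clusters = [[i] for i in range(len(X))]
    merges = []
    while len(clusters) > 1:
        best = None
        # Steps 2 and 4: (re)compute the proximity between every pair of
        # clusters; with single (MIN) linkage, that is the closest pair of points
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(np.linalg.norm(X[i] - X[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # Step 3: merge the two closest clusters
        merges.append((list(clusters[a]), list(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]
    return merges

X = np.array([[3, 5], [12, 9], [13, 17], [14, 14], [60, 52], [55, 63], [69, 59]])  # same toy data as above
for left, right, dist in naive_single_linkage(X):
    print(left, '+', right, 'at distance %.2f' % dist)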
So, the only thing remaining to understand is the impact of the different proximity methods, also known as measures of inter-cluster similarity.
The main methods (passed as the linkage option in code) include:
- MIN or single linkage
- MAX or complete linkage
- Average linkage
- Centroid linkage
- Ward's method (an objective-function approach)
The effect of each method can be easily visualized by varying the linkage option when creating the dendrograms.
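For instance, a small loop over four of the methods (a sketch, assuming the X array defined earlier) makes the differences easy to compare:
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import dendrogram, linkage

plt.figure(figsize=(10, 7))
for i, method in enumerate(['single', 'complete', 'average', 'centroid'], start=1):
    plt.subplot(2, 2, i)  # one dendrogram per linkage method
    dendrogram(linkage(X, method))
    plt.title(method)
plt.tight_layout()
plt.show()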
To visualize the output of the model, we just need a small code snippet as follows:
import matplotlib.pyplot as plt

plt.scatter(X[:, 0], X[:, 1], c=cluster.labels_, cmap='winter')
plt.show()
As you can see, there are two different clusters in the opposite corners. You may as well play around with the number of clusters and see different results; the whole process can be driven by cutting dendrograms. To understand that, let us write a small snippet to visualize how the dendrograms are created.
We are going to use the dendrogram and linkage functions from the scipy.cluster.hierarchy module. Here, we compute the linkage we want to use and pass the resulting matrix to the dendrogram function to render the hierarchy.
from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(X, 'complete')
labelList = list(range(1, 8))

plt.figure(figsize=(10, 7))
dendrogram(linked,
           orientation='top',
           labels=labelList,
           distance_sort='descending',
           show_leaf_counts=True)
plt.show()
Here, you can visualize how the clusters are formed on each iteration. You can cut the dendrogram at any level you want, and you will end up with that many clusters. Hence, thanks to this hierarchy creation, you may vary the number of clusters after just one run through the algorithm and data. This is what gives hierarchical clustering an edge over other algorithms like K-means.
Now, let us look at how to use hierarchical clustering in Python on a commonly used dataset: Iris. We will read the dataset from a local CSV file and take a glance at what the data looks like and what we need to classify.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
%matplotlib inline

data = pd.read_csv('iris.csv')
data.head()
As you can see, the target variable is the 'variety' column. It is in string format and needs to be converted into numbers, as the model requires numeric input. To do this, we use the LabelEncoder from sklearn's preprocessing module: a simple fit and transform converts the strings into numbers.
from sklearn import preprocessing

le = preprocessing.LabelEncoder()
le.fit(data['variety'])
data['variety'] = le.transform(data['variety'])
Now, if we create a dendrogram on this data, we can trace the merges across iterations. The snippet below uses ward linkage; if we run the same code with single, complete, or centroid linkage, the dendrograms differ a bit. The logic remains the same, but different linkages definitely affect the order in which clusters merge.
from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(data, 'ward')

plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.show()
Now, applying clustering to the dataset, we will use two different linkages so you can clearly see the difference each makes to the resulting clusters (a side-by-side sketch follows the snippet below). As we already know from the label encoder that there are 3 different classes, we may start with 3 clusters.
from sklearn.cluster import AgglomerativeClustering

cluster = AgglomerativeClustering(n_clusters=3, affinity='euclidean', linkage='complete')
cluster.fit_predict(data)

plt.figure(figsize=(10, 7))
plt.scatter(data['sepal.length'], data['petal.length'], c=cluster.labels_)
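To see both linkages side by side, as promised above, here is a sketch reusing the encoded data frame and the same two columns plotted above:
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

plt.figure(figsize=(14, 5))
for i, link in enumerate(['ward', 'complete'], start=1):
    labels = AgglomerativeClustering(n_clusters=3, linkage=link).fit_predict(data)
    plt.subplot(1, 2, i)
    plt.scatter(data['sepal.length'], data['petal.length'], c=labels)
    plt.title(link)
plt.show()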
As you can see from the figures above, the two linkages make visibly different predictions in the 3-cluster setting. Look at the ward linkage first: it predicts the labels well, keeping the upper cluster intact even though a few values mix between the two lower clusters. Complete linkage, on the other hand, breaks that cluster apart and misclassifies some of the values.
As we know from the proximity methods, complete linkage tends to break larger clusters, as we can see above. Ward's method and single linkage are less vulnerable to this particular issue. That was a simple dataset; let us now see how the algorithm fares on a noisier one.
One such dataset is the pulsar prediction dataset, also called the HTRU2 dataset. It is larger, containing about 18,000 samples. From an ML perspective, that is a fairly regular size, even on the small side, but it is considerably heavier than the Iris dataset. Implementing hierarchical clustering in Python on a varied dataset lets us analyze how its performance holds up.
pulsar_data = pd.read_csv('pulsar_stars.csv')
pulsar_data.head()
First, we need to normalize the dataset so that the clustering does not get biased by extreme values.
from sklearn.preprocessing import normalize
pulsar_data = normalize(pulsar_data)
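As an aside, note that sklearn's normalize rescales each sample (each row) to unit norm. If the goal is specifically to tame extreme values per feature, column-wise standardization is a common alternative; the sketch below is only a suggestion, and the rest of the article keeps the normalize approach:
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Standardize each feature (column) to zero mean and unit variance,
# applied to the raw frame rather than the already-normalized array
raw = pd.read_csv('pulsar_stars.csv')
pulsar_scaled = StandardScaler().fit_transform(raw)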
We will reuse the standard code, but this time we time the computation, on both the Iris and HTRU2 runs.
%%time
from scipy.cluster.hierarchy import dendrogram, linkage

linked = linkage(pulsar_data, 'ward')

plt.figure(figsize=(10, 7))
dendrogram(linked)
plt.show()
Generating the dendrogram took about 6 seconds on the Iris dataset versus 13 minutes 54 seconds on HTRU2. But even that is nothing compared to the change in predictions across linkages that you observe in the model trained on the HTRU2 dataset.
Let us follow the same procedure as before, this time making predictions with every linkage.
The following snippet produces the clustering predictions for one linkage at a time:
# Run once for each linkage: 'single', 'complete', 'average', 'ward'
cluster = AgglomerativeClustering(n_clusters=2, affinity='euclidean', linkage='average')
cluster.fit_predict(pulsar_data)

plt.figure(figsize=(10, 7))
plt.scatter(pulsar_data[:, 1], pulsar_data[:, 7], c=cluster.labels_)
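Rather than editing the linkage string by hand for each run, the four fits can be scripted; this is a sketch reusing pulsar_data from above, and note that each fit is expensive at this dataset size:
import matplotlib.pyplot as plt
from sklearn.cluster import AgglomerativeClustering

plt.figure(figsize=(14, 10))
for i, link in enumerate(['single', 'complete', 'average', 'ward'], start=1):
    labels = AgglomerativeClustering(n_clusters=2, linkage=link).fit_predict(pulsar_data)
    plt.subplot(2, 2, i)
    plt.scatter(pulsar_data[:, 1], pulsar_data[:, 7], c=labels)  # same two features as above
    plt.title(link)
plt.show()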
Yes, it is indeed surprising how much the predictions differ from each other. This shows the importance of the proximity matrix in hierarchical clustering.
As you can see, single linkage lumps almost all the points into one cluster, since the minimum distance between two clusters defines its proximity metric; this makes it vulnerable to noisy data. Complete linkage does split the data into two clusters, but it may have broken a large cluster purely because of its proximity measure.
Average linkage is a trade-off between the two. It is less affected by noise; it may still break large clusters, but with a lower probability, and it handles the classification better.
Objective-function methods like Ward's are sometimes used to initialize other clustering methods like K-means. Like average linkage, Ward's method sits between the single- and complete-linkage extremes, and it is mainly used in customized solutions to lessen the probability of misclassification. Here, too, we see it performing well.
Learn: Cluster Analysis in Data Mining: Applications, Methods & Requirements
Time and Space Complexity
Just to build intuition, consider how the proximity matrix is defined and calculated. It must store the distance between every pair of clusters, which gives a space complexity of O(n²). That is a large number: with 1,000,000 points, the matrix holds 10¹² entries. Taking a rough average and approximating the size of one entry as a byte, that is about 1 TB of data, and it needs to be stored in RAM, not on the hard drive.
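The back-of-the-envelope arithmetic, assuming one byte per matrix entry:
n = 1_000_000                     # number of data points
entries = n * n                   # the proximity matrix stores n^2 distances
bytes_needed = entries * 1        # assume one byte per entry
print(bytes_needed / 1e12, 'TB')  # -> 1.0 TB, and it must fit in RAM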
Next comes time complexity. Since the proximity matrix must be scanned at every iteration and there are on the order of n merge steps, the overall complexity is O(n³). That is computationally expensive, especially on big datasets.
It may be possible to bring this down to O(n² log n), but that is still expensive compared to other clustering algorithms like K-means. If you want to learn more about analyzing the space and time complexity of algorithms and optimizing cost functions, you may head over to upGrad's Programs in Data Science and Machine Learning.
Limitations
- We have already discussed the first limitation: space and time complexity. Hierarchical clustering is clearly unfavourable for big datasets. Even if the time complexity can be managed with faster machines, the space complexity is too high, especially when the proximity matrix must be loaded into RAM. The speed issue worsens further when implementing hierarchical clustering in pure Python, which is slow and will definitely suffer on big tasks.
- Secondly, there is no universally optimal proximity method. As we have seen, each linkage has its own problems and limitations, which leaves the internal mechanism of the algorithm unoptimized.
- Clustering decisions are not retractable: once a merge or split has been applied at a given iteration, it will not be changed in later iterations. So if, due to structural inaccuracies, the algorithm at any point combines or splits the wrong clusters, the mistake is irrevocable.
- If we look closely at the algorithm, there is no clear objective function being minimized. Other algorithms optimize a definite function; in K-means, for example, we minimize an explicit cost function, which is not the case with hierarchical clustering.
Check out: Top 9 Data Science Algorithms Every Data Scientist Should Know
Conclusion
Even though it has clear limitations on large datasets, this type of clustering algorithm is appealing for small to medium-scale datasets. The hierarchical clustering algorithm in Python has not seen much architectural development, largely because of its steep time and space complexity.
And it is true that this is the era of Big Data, which demands algorithms that scale well. Still, when we are not sure of the number of clusters, or we need to refine the analysis efficiently, hierarchical clustering in Python can be a satisfactory choice.
With this, you now know how to implement hierarchical clustering in Python.
For understanding more such algorithms and applications of methods in Machine Learning and Data Science, do have a look at the course offerings by upGrad. We have cumulative programs for any of the career paths that you want to follow.
The programs are curated by top professionals as well as professors at IIIT-B. For more information, head over to upGrad. If you are curious about learning data science to be at the front of fast-paced technological advancements, check out upGrad & IIIT-B's Executive PG Programme in Data Science.
Frequently Asked Questions (FAQs)
1. How to perform hierarchical clustering in Python?
Hierarchical clustering is a type of unsupervised machine learning algorithm used for grouping data points. It groups elements together based on the similarities in their characteristics. To perform hierarchical clustering, follow the steps below:
Treat every data point as a cluster in the beginning. The number of clusters at the start will therefore be K, where K is an integer representing the total number of data points.
Build a cluster by joining the two closest data points, leaving you with K-1 clusters.
Continue forming more clusters to arrive at K-2 clusters, and so on.
Repeat this step until one big cluster is formed.
Once you are left with only a single big cluster, dendrograms are used to divide it into multiple clusters based on the problem statement.
This is the entire process for performing hierarchical clustering in Python.
2. Which are the two types of hierarchical clustering?
There are two main types of hierarchical clustering. They are:
Agglomerative Clustering
This clustering method is also known as AGNES (Agglomerative Nesting). This algorithm uses the bottom-up approach. Here, every object is considered to be a single-element cluster. The two clusters with similar characteristics are combined to form a bigger cluster. This method is followed until you are left with a single big cluster.
Divisive Hierarchical Clustering
This clustering method is also known as DIANA (Divisive Analysis). This algorithm follows the top-down approach, which is the inverse of the one used by AGNES. Here, the root node will consist of a huge cluster of all the elements. After every step, the most heterogeneous cluster is divided, and this process is continued until you are left with a single cluster.
3. Which type of hierarchical clustering algorithm is more widely used?
As you know, there are two types of hierarchical clustering algorithms – Agglomerative and Divisive Clustering. Of the two, the agglomerative algorithm is more commonly preferred for performing hierarchical clustering.
In this method, you group objects based on their similarities using a bottom-up approach: starting from individual nodes, you build up to a single big cluster of nodes carrying similar characteristics.