
Stemming & Lemmatization in Python: Which One To Use?

Updated on 30 June, 2023


Natural Language Processing (NLP) is a branch of artificial intelligence concerned with processing human language and extracting useful features from it. NLP models learn from prior experience, which here means training on very large text datasets gathered from sources such as social media, web scraping, survey forms, and other data collection techniques.

The first step after data gathering is cleaning the data; it is then converted into a numerical, machine-readable form that a model can interpret. The conversion is a separate topic, so this article concentrates on cleaning. Within cleaning, inflection is an important concept to understand before moving on to stemming and lemmatization.

Inflection

Textual data consists of sentences made up of words and other characters that may or may not affect our predictions. Very common words such as "is", "there", and "and" are called stop words. They can be removed easily by checking each token against a stop-word list, but what about different forms of the same word?
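As a rough illustration of stop-word removal, the sketch below filters tokens against a small hand-picked set. The `STOP_WORDS` set and the `remove_stop_words` helper are assumptions made for this example; real projects usually rely on a curated list such as the one NLTK provides.

```python
# Hypothetical, hand-picked stop-word set used only for illustration.
STOP_WORDS = {"is", "there", "and", "the", "a", "an", "of", "to", "in"}

def remove_stop_words(sentence):
    """Lowercase the sentence, split on whitespace, and drop stop words."""
    tokens = sentence.lower().split()
    return [token for token in tokens if token not in STOP_WORDS]

print(remove_stop_words("There is a cat and a dog in the garden"))
# ['cat', 'dog', 'garden']
```

Note that this only removes whole words. It does nothing about different forms of the same word, which is the problem discussed next.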


You don’t want your machine to treat ‘study’ and ‘studying’ as different words, because the intent behind them is the same. Such variation in the form of a word is known as inflection, and handling it is a routine part of NLP preprocessing. Stemming and lemmatization both address it, using different approaches. Let’s look at how they differ and which one is better to use.

Stemming

Stemming is a text normalization technique that reduces the variety of word forms in a text. It strips a word down to its stem by removing prefixes or suffixes, depending on the word under consideration, according to a predefined set of rules.

The resulting stems may or may not be meaningful words. The main purpose is to group similar words together so that further preprocessing can be optimized. For example, play, playing, and played all reduce to the stem “play”. This also helps search engines reduce search time, because matching is done on the key element of each word.
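To make the idea of rule-based stripping concrete, here is a minimal sketch of a toy suffix stripper. It is not the Porter algorithm or any NLTK implementation; the `SUFFIX_RULES` list, the `toy_stem` function, and the `min_stem_len` parameter are assumptions chosen for illustration.

```python
# Toy rule set (an assumption for this sketch), checked in order.
SUFFIX_RULES = ["ing", "ed", "es", "s"]

def toy_stem(word, min_stem_len=3):
    """Strip the first matching suffix, keeping at least `min_stem_len` letters."""
    for suffix in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word

for w in ["play", "playing", "played", "plays", "studies"]:
    print(w, "->", toy_stem(w))
# play -> play
# playing -> play
# played -> play
# plays -> play
# studies -> studi
```

The last line shows why stems are not always real words: the rule removes "es" from "studies" and leaves "studi".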

Two failure cases need to be discussed regarding stemming: overstemming and understemming. Removing prefixes and suffixes works for many words, but some words are stripped more than necessary.

This can produce junk stems with no meaning, and it can merge unrelated words into a single stem. When stripping goes too far in this way, it is known as overstemming. Understemming is the reverse: the stemmer strips too little, so related forms of a word end up with different stems.

NLTK Stemming: Understand With This Program

Here is a program that helps you better understand NLTK stemming:

from nltk.stem import PorterStemmer

# Four inflected forms of the same verb
e_words = ["wait", "waiting", "waited", "waits"]
ps = PorterStemmer()
for w in e_words:
    # Each form reduces to the stem "wait"
    root_word = ps.stem(w)
    print(root_word)

 

Stemming Pros

Evaluating the pros and cons of stemming and lemmatization in Python can help you better compare the two and conclude which one is the best. So, let’s start with the pros of stemming: 

  • Enhanced Model Performance: Stemming lowers the number of distinct words that an algorithm must process, which can enhance model performance. It can also speed up and improve the efficiency of the algorithm. 
  • Organizing Comparable Terms: Even though they have different forms, words with similar implications can be clustered together. When identifying relevant subjects or themes inside a document, as is the case with activities like document classification, this technique might be helpful. 
  • Easy to Compare and Comprehend: Since stemming often shrinks the vocabulary, texts are considerably simpler to compare, analyze, and comprehend. This is beneficial for projects like sentiment analysis, where the objective is to ascertain the sentiment of a document. 

Stemming Drawbacks

  • Overstemming: It occurs when a stemming algorithm reduces distinct words to the same stem even though they are not closely related in meaning (a false positive). For instance, the Porter stemmer reduces “universal,” “university,” and “universe” to the same stem, “univers”. 
  • Understemming: It occurs when inflected forms of the same word are reduced to different stems when they should share one. This is also known as a false negative. 
  • Language Difficulties: It becomes more challenging to create stemmers as the spelling, morphology, and character encoding of the target language become more complex.
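The two error types above can be counted against a small set of hand-labelled word groups. The sketch below is one possible way to do this, not a standard NLTK utility; the `stemming_errors` function, the two toy stemmers, and the `groups` data are all hypothetical and exist only to illustrate the definitions.

```python
from itertools import combinations

def stemming_errors(stem, groups):
    """Return (understemming, overstemming) pair counts for a stemmer.

    `groups` is a list of word lists; words in one list are assumed related.
    """
    labelled = [(word, gid) for gid, words in enumerate(groups) for word in words]
    under = over = 0
    for (w1, g1), (w2, g2) in combinations(labelled, 2):
        same_stem = stem(w1) == stem(w2)
        if g1 == g2 and not same_stem:
            under += 1  # related words were kept apart
        elif g1 != g2 and same_stem:
            over += 1   # unrelated words were merged
    return under, over

def truncate_stem(word):
    """Aggressive toy stemmer: keep only the first seven letters."""
    return word[:7]

def plural_stem(word):
    """Light toy stemmer: strip a single trailing "s"."""
    return word[:-1] if word.endswith("s") else word

# Hand-labelled groups, chosen only for illustration.
groups = [
    ["universe", "universes"],
    ["university", "universities"],
    ["data", "datum"],
]

print(stemming_errors(truncate_stem, groups))  # (1, 4)
print(stemming_errors(plural_stem, groups))    # (2, 0)
```

The aggressive stemmer merges the "universe" and "university" groups (overstemming), while the light one fails to join "university" with "universities" (understemming). Neither joins "data" with "datum".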

Lemmatization

Lemmatization is another approach to normalizing text by reducing words to a root form. It has the same goal of grouping words with similar intent, but the difference is that the resulting words are meaningful.

Words are not stripped using predefined rules; instead, each word is looked up in a dictionary and mapped to its base form, which is called the lemma. The conversion takes more time because the words are first matched with their parts of speech, which is itself a time-consuming process.

This ensures that the root word is a real word, which helps in deriving good results in analysis. Lemmatization is useful when cleaner data is required for further analysis and the extra processing time is acceptable. One drawback of this technique is that, because it relies on the grammar and vocabulary of a language, each language requires its own dictionary and corpora, which means more data to manage.
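The dictionary lookup keyed by part of speech can be sketched as follows. This is a toy model, not how NLTK's WordNet lemmatizer is implemented; the `LEMMA_TABLE` entries and the `toy_lemmatize` function are assumptions made for this example.

```python
# Tiny hand-written lemma dictionary keyed by (word, part of speech).
# Hypothetical data: "n" = noun, "v" = verb, "a" = adjective.
LEMMA_TABLE = {
    ("studies", "n"): "study",
    ("studies", "v"): "study",
    ("studying", "v"): "study",
    ("leaves", "n"): "leaf",
    ("leaves", "v"): "leave",
    ("better", "a"): "good",
}

def toy_lemmatize(word, pos="n"):
    """Return the dictionary lemma for (word, pos), or the word itself if unknown."""
    return LEMMA_TABLE.get((word.lower(), pos), word)

print(toy_lemmatize("leaves", "n"))   # leaf
print(toy_lemmatize("leaves", "v"))   # leave
print(toy_lemmatize("studying"))      # studying (no noun entry in the table)
print(toy_lemmatize("better", "a"))   # good
```

The same surface form can map to different lemmas depending on its part of speech, which is why the part-of-speech step matters and why it adds processing time.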


Lemmatization In Python: Use Cases

Lemmatization in Python reduces ambiguity in text. For example, both bicycle and bicycles are mapped to the root word bicycle. In essence, it converts all words that share a meaning but differ in form to a single base form.

It reduces the number of distinct words in the text and helps create precise features for training a machine-learning system. The cleaner the data, the more accurate your machine-learning system is likely to be.

Lemmatization Pros

  • Unlike stemming algorithms, lemmatization does more than simply clip words off. 
  • Words are examined depending on their POS to produce lemmas that take context into account. 
  • Lemmatization also creates terms that belong in dictionaries. 

Lemmatization Drawbacks

  • Lemmatization is slower than stemming. 
  • The extra time goes into the morphological analysis and dictionary lookups needed to determine each word’s base form. 

Stemming and Lemmatization In Python: Code To Distinguish Between Them

Stemming Code:

import nltk
from nltk.stem.porter import PorterStemmer

# word_tokenize relies on NLTK's tokenizer data, which must be downloaded once via nltk.download
porter_stemmer = PorterStemmer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))

 

Output:

Stemming for studies is studi
Stemming for studying is studi
Stemming for cries is cri
Stemming for cry is cri

 

Lemmatization Code:

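import nltk
from nltk.stem import WordNetLemmatizer

# The lemmatizer relies on the WordNet corpus, which must be downloaded once via nltk.download
wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    # lemmatize() treats every word as a noun unless a pos argument is given
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))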

 

Output:

Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry

 

Evaluating The Output

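The stemmer produces the same output (studi) for both studies and studying. The NLTK lemmatizer, however, returns study for studies and leaves studying unchanged. This is because lemmatize() treats every word as a noun by default; passing the part of speech, for example pos="v" for a verb, lets it reduce studying to study as well. Because its output consists of real words, lemmatization is a good choice when creating feature sets for training machine-learning models, provided the part of speech is supplied.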

Which One to Use?

Now comes the point of picking one of the two. The choice is highly subjective, because the use case you are targeting plays a major role.

If you want to analyze a chunk of text and time is a constraint, you can opt for stemming. It is faster but less accurate, and the stems are produced by rule-based stripping, so they may not be meaningful words.

Lemmatization has the added advantage of producing meaningful, accurate root words that group the different forms of a word. If you can afford more computing resources and more time, it can be the better choice, and it should be adopted where precise analysis is needed. It can also suit some search techniques in search engines, where the root word is enough to fetch the results the user wants.

Python Implementation

The NLTK (Natural Language Toolkit) package is a Python implementation of common NLP tasks. The library provides the required tools such as stemmers, lemmatizers, stop-word removal, custom parse trees, and much more. It also gives access to corpus data from prominent sources, which can be downloaded through the package itself.

Stemming has many implementations, but one of the oldest and most popular is the Porter stemmer algorithm. The Snowball stemmer is also used in some projects. To understand the difference between stemming and lemmatization more clearly, look at the code below and its output:

import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer
word_stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('flies'))
print(word_stemmer.stem('flies'))

 

Output:

fly
fli

 

The first output is from the lemmatizer and the second from the stemmer. The lemmatizer returned the dictionary root word, while the stemmer applied its suffix rules and produced “fli”, which is not a real word.


Conclusion

NLP is growing every day, and new methods evolve with time. Most of them focus on extracting the right information from text data efficiently, with minimum loss and with the noise eliminated. Both stemming and lemmatization are widely used. What matters is that the analysis is carried out on clean data.

Frequently Asked Questions (FAQs)

1. What are the two types of AI algorithms used to cluster documents?

Hierarchical clustering and non-hierarchical clustering are the two types of algorithms used to cluster documents. Hierarchical clustering builds a hierarchy of clusters by repeatedly merging or splitting groups of documents, so that pairs of clusters are linked together at each level. While the result is simple to read and comprehend, the technique may not be as efficient as non-hierarchical clustering, and it can struggle when the data contains a lot of noise. Non-hierarchical clustering assigns documents to a set of clusters and refines that assignment by moving documents between clusters, without building a hierarchy. This approach is comparatively quicker and is often considered more dependable and stable.

2. Is lemmatization preferred for sentiment analysis?

Lemmatization and stemming are both highly effective procedures. However, lemmatization yields a dictionary word when a term is reduced to its root form, whereas stemming may not. When the meaning of a term isn't critical to the study, stemming is recommended. When the meaning of a word is vital for analysis, lemmatization is advised. As a result, if you had to pick one approach for sentiment analysis, lemmatization would be the one to go with.

3. How are stemming and lemmatization used for document clustering?

Document clustering, also known as text clustering, is a method of analyzing text documents by grouping them together. Its applications range from automated document organization to topic extraction and fast information retrieval. Stemming and lemmatization reduce the number of distinct tokens needed to convey the same information, which improves the overall technique. After this preprocessing step, features are calculated by measuring the frequency of each token, and a clustering algorithm is then applied.
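The effect of normalization on token-frequency features can be sketched with the standard library alone. The `light_stem` function below is a toy stand-in for a real stemmer or lemmatizer, and both it and `term_frequencies` are assumptions made for this example.

```python
from collections import Counter

def light_stem(word):
    """Toy normaliser (an assumption): strip a few common English suffixes."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def term_frequencies(document, normalise=None):
    """Count tokens in a document, optionally normalising each token first."""
    tokens = document.lower().split()
    if normalise is not None:
        tokens = [normalise(token) for token in tokens]
    return Counter(tokens)

doc = "the player played while other players were playing"
raw = term_frequencies(doc)
stemmed = term_frequencies(doc, normalise=light_stem)

print(len(raw), len(stemmed))  # 8 6
print(stemmed["play"])         # 2
```

With fewer distinct tokens, each document's frequency vector is shorter and related word forms contribute to the same feature, which is the property clustering algorithms benefit from.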