Stemming & Lemmatization in Python: Which One To Use?
Updated on Jun 30, 2023 | 9 min read | 6.9k views
Natural Language Processing (NLP) is a set of techniques for processing human language and extracting important features from it. It is a branch of Artificial Intelligence concerned with building intelligent agents from previous experience. The previous experience here refers to training performed over huge datasets of textual data gathered from sources such as social media, web scraping, survey forms, and many other data collection techniques.
The step after data gathering is cleaning this data and converting it into a machine-readable, numerical form that the machine can interpret. While the conversion process is a topic of its own, cleaning is the first step to be performed. In this cleaning task, inflection is an important concept that needs a clear understanding before moving on to stemming and lemmatization.
We know textual data comprises sentences made of words and other characters that may or may not impact our predictions. Commonly used words such as "is", "there", and "and" are called stop words. These can be removed easily by maintaining a corpus of them, but what about different forms of the same word?
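As an illustration, stop-word removal can be as simple as filtering against a list. The list below is a small hand-built sample for the sketch; NLTK ships a fuller English stop-word corpus in `nltk.corpus.stopwords`.

```python
# A minimal stop-word filter using a small illustrative list.
# NLTK provides a fuller list via nltk.corpus.stopwords.
stop_words = {"is", "there", "and", "the", "a"}

sentence = "there is a cat and a dog"
filtered = [w for w in sentence.split() if w not in stop_words]
print(filtered)  # ['cat', 'dog']
```

Stop words are gone, but "cat" and "cats" would still count as different words, which is where stemming and lemmatization come in.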
You don’t want your machine to consider ‘study’ and ‘studying’ as different words, as the intent behind them is the same and both convey the same meaning. The different forms a word takes are known as inflection, and handling inflection is a common task in NLP. This is the base idea behind stemming and lemmatization, which take different approaches to it. Let’s discover the differences between them and see which one is better to use.
Stemming is a text normalization technique that focuses on reducing the ambiguity of words. It strips a word down to its stem by removing prefixes or suffixes, depending on the word under consideration, reducing words according to a defined set of rules.
The resulting words may or may not be actual meaningful root words. The main purpose is to group similar words together so that further preprocessing can be optimized. For example, words like play, playing, and played all belong to the stem word "play". This also helps reduce search time in search engines, as more focus is given to the key element.
Two failure cases need to be discussed regarding stemming: over-stemming and under-stemming. While removing prefixes and suffixes resolves many cases, some words get stripped more than required.
This can produce garbage stems with no meaning. This is the main disadvantage of stemming, and when it happens drastically it is known as over-stemming. Under-stemming is the reverse, where the stemming process removes too little and related word forms end up with different stems.
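Both failure modes can be seen with NLTK's PorterStemmer; the word choices below are just illustrative examples:

```python
from nltk.stem import PorterStemmer

ps = PorterStemmer()

# Over-stemming: words with different meanings collapse to one stem.
print(ps.stem("universal"), ps.stem("university"), ps.stem("universe"))

# Under-stemming: related forms fail to collapse to the same stem.
print(ps.stem("alumnus"), ps.stem("alumni"))
```

All three of universal, university, and universe collapse to a single stem even though they mean different things, while alumnus and alumni stay apart even though they are forms of the same word.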
Here is a program that helps you better understand NLTK stemming:
```python
from nltk.stem import PorterStemmer

e_words = ["wait", "waiting", "waited", "waits"]
ps = PorterStemmer()
for w in e_words:
    rootWord = ps.stem(w)
    print(rootWord)
```
Evaluating the pros and cons of stemming and lemmatization in Python can help you compare the two and conclude which one is best for your use case. The main pros of stemming are its speed and simplicity, since it applies fixed rules without any dictionary lookup; its cons are the over-stemming and under-stemming cases described above and the fact that the resulting stems may not be meaningful words.
Lemmatization is another approach for normalizing text and reducing words to their root form. It has the same motive of grouping words with similar intent into one group, but the difference is that here the resulting words are meaningful.
Words are not stripped with pre-defined rules but are looked up in a dictionary; the dictionary form of a word is called its lemma. Here the conversion takes more time because each word is first matched with its part of speech, which is itself a time-consuming process.
This ensures that the root word has a literal meaning, which helps in deriving good results in analysis. It is useful when cleaner data is required for further analysis and we can afford to spend more time on data cleaning. One drawback of this technique is that, as it depends on the grammar and vocabulary of the language, different languages require separate corpora, leading to more and more data handling.
Lemmatization in Python reduces ambiguity in writing. For example, the forms bicycle and bicycles are both reduced to the root word bicycle. In essence, it changes all words with the same meaning but distinct surface forms to their original form.
It lessens the number of distinct words in the provided text and aids in creating precise features for the machine-learning training system. The cleaner the data is, the smarter and more precise your machine-learning system will be.
Stemming Code:
```python
import nltk
from nltk.stem.porter import PorterStemmer

nltk.download('punkt')  # tokenizer models required by word_tokenize (run once)

porter_stemmer = PorterStemmer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Stemming for {} is {}".format(w, porter_stemmer.stem(w)))
```
Output:
Stemming for studies is studi
Stemming for studying is studi
Stemming for cries is cri
Stemming for cry is cri
Lemmatization Code:
```python
import nltk
from nltk.stem import WordNetLemmatizer

nltk.download('punkt')    # tokenizer models required by word_tokenize (run once)
nltk.download('wordnet')  # lemma dictionary used by WordNetLemmatizer (run once)

wordnet_lemmatizer = WordNetLemmatizer()
text = "studies studying cries cry"
tokenization = nltk.word_tokenize(text)
for w in tokenization:
    print("Lemma for {} is {}".format(w, wordnet_lemmatizer.lemmatize(w)))
Output:
Lemma for studies is study
Lemma for studying is studying
Lemma for cries is cry
Lemma for cry is cry
When stemming studies and studying, the output is the same (studi). The NLTK lemmatizer, however, returns study for studies but leaves studying unchanged, because lemmatize() treats words as nouns by default; passing the part of speech would reduce studying to study as well. Lemmatization is therefore an excellent choice when creating feature sets to train machines.
Now comes the point of picking between the two. The choice is highly subjective, as the use case you are targeting plays a major role here.
If you want to analyze a chunk of text but time is a constraint, you can opt for stemming: it performs the reduction in less time, but with a lower success rate, since the stems are produced algorithmically and may not have any meaning.
Adopting lemmatization gives the added advantage of meaningful and accurate root words clubbed together from different forms. If you can afford good computing resources and more time, this can be the better choice, and it should be adopted where precise analysis is wanted. Stemming, on the other hand, can suffice for search-engine-style use cases where the root word is enough to fetch the results the user wants.
The NLTK (Natural Language Toolkit) package is a Python implementation of tasks around NLP. The library has all the required tools, such as stemmers, lemmatizers, stop-word removal, custom parse trees, and much more. It also ships corpus data from prominent sources within the package itself.
The stemming technique has many implementations, but the most popular and oldest one is the Porter stemmer algorithm. The Snowball stemmer is also used in some projects. To understand the difference between stemming and lemmatization more clearly, look at the code below and its output:
```python
import nltk
from nltk.stem import PorterStemmer
from nltk.stem import WordNetLemmatizer

nltk.download('wordnet')  # lemma dictionary used by WordNetLemmatizer (run once)

word_stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()
print(lemmatizer.lemmatize('flies'))
print(word_stemmer.stem('flies'))
```
Output:
fly
fli
The first output is from the lemmatizer and the second from the stemmer. You can see the difference: the lemmatizer gave the root word as output, while the stemmer just trimmed the word from the end.
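To see how the Porter and Snowball stemmers differ, NLTK's own documentation compares them on words like "generously"; Snowball (also known as Porter2) handles some suffixes more gracefully:

```python
from nltk.stem import PorterStemmer, SnowballStemmer

porter = PorterStemmer()
snowball = SnowballStemmer("english")

print(porter.stem("generously"))    # Porter strips further down
print(snowball.stem("generously"))  # Snowball keeps a meaningful stem
```

Snowball also supports more than a dozen other languages via the language argument, e.g. SnowballStemmer("german").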
NLP is growing every day, and new methods evolve with time. Most of them focus on how to efficiently extract the right information from text data with minimum loss while eliminating noise. Both techniques are popularly used; what matters is that the analysis is carried out on clean data.