Tokenization in Natural Language Processing
Updated on Dec 30, 2024 | 7 min read | 5.9k views
When dealing with textual data, the most basic step is to tokenize the text. ‘Tokens’ are the individual units the text is broken into: words, sentences, or any other minimal unit. Tokenization, therefore, is simply the process of breaking text into these separate units.
By the end of this tutorial, you will know what tokenization is, why it is needed, and how to perform word-level and sentence-level tokenization with several popular Python libraries.
Tokenization is the most fundamental step in an NLP pipeline.
But why is that?
These words or tokens are later converted into numeric values so that the computer can understand and make sense of them. The tokens are cleaned, pre-processed, and then converted into numeric values through vectorization, and the resulting vectors can be fed to machine learning algorithms and neural networks.
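To make vectorization concrete, here is a minimal, purely illustrative sketch (a toy example, not taken from any of the libraries covered below) that maps tokens to a vocabulary and builds a simple count vector:

# Toy example: turn a list of tokens into a count vector
tokens = ['tokenization', 'is', 'essential', 'in', 'nlp', 'tokenization', 'is', 'fun']
# Build a vocabulary: each unique token gets an integer index
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
# Count how often each vocabulary entry occurs in the token list
vector = [0] * len(vocab)
for token in tokens:
    vector[vocab[token]] += 1
print(vocab)   # {'essential': 0, 'fun': 1, 'in': 2, 'is': 3, 'nlp': 4, 'tokenization': 5}
print(vector)  # [1, 1, 1, 2, 1, 2]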
Tokenization can be performed not only at the word level but also at the sentence level: the text can be tokenized with either words or sentences as the tokens. Let’s discuss a few ways to perform tokenization.
Tokenization Using Python’s split() Function
Python’s built-in split() method returns a list of tokens split on the separator you pass to it. By default, it splits on whitespace.
Word Tokenization
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."
Tokens = Mystr.split()
# Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods,', 'and', 'ways?', 'Tokenization', 'is', 'essential', 'in', 'NLP', 'tasks.']
Sentence Tokenization
The same text can be split into sentences by passing "." as the separator.
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks."
Tokens = Mystr.split(".")
# Output:
>> ['This is a tokenization tutorial', ' We are learning different tokenization methods, and ways? Tokenization is essential in NLP tasks', '']
Though this seems straightforward, it has several flaws. Notice that it also splits after the last ".", leaving an empty string at the end of the list, and it does not treat "?" as the start of a new sentence, because split() accepts only the single separator ".".
Text data in real-life scenarios is messy and rarely arranged neatly into words and sentences. It often contains noise that makes tokenizing this way unreliable. Therefore, let’s move ahead to better and more robust ways of tokenization.
Must Read: Top 10 Deep Learning Techniques You Should Know
Tokenization Using Regular Expressions (RegEx)
A Regular Expression (RegEx) is a sequence of characters used to match patterns in text. We use RegEx to find particular patterns, words, or characters and then replace them or perform other operations on them. Python provides the re module for working with RegEx. Let’s see how we can tokenize text using re.
Word Tokenization
import re

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
Tokens = re.findall(r"[\w]+", Mystr)
# Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']
So, what happened here?
The re.findall() function finds every substring that matches the pattern and stores the matches in a list. The pattern "[\w]+" matches any word character: a letter, a digit, or the underscore ("_"). The "+" means one or more consecutive occurrences. So the function scans through the string and emits a token whenever it hits whitespace or any special character other than an underscore.
Please notice that although "NLP's" is a single word, our regular expression broke it into "NLP" and "s" because of the apostrophe.
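If you want contractions and possessives to stay together, one option (a small sketch, not part of the original examples) is to include the apostrophe inside the character class:

import re

Mystr = "Tokenization is essential in NLP's tasks."
# Adding the apostrophe to the character class keeps "NLP's" as one token
re.findall(r"[\w']+", Mystr)
# Output:
>> ['Tokenization', 'is', 'essential', 'in', "NLP's", 'tasks']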
Sentence Tokenization
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
Tokens = re.compile('[.!?] ').split(Mystr)
# Output:
>> ['This is a tokenization tutorial', 'We are learning different tokenization methods, and ways', "Tokenization is essential in NLP's tasks."]
Here we combined multiple sentence-ending characters into a single pattern and called the split() method of the compiled expression. Whenever it hits any of these three characters followed by a space, it treats what follows as a separate sentence. This is an advantage of RegEx over Python’s split() method, which cannot split on multiple characters at once.
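As an extra sketch (not from the original article), the same idea works with re.split() directly, and a small filter removes the empty string left behind when the text ends with a sentence terminator:

import re

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
# Split on '.', '!' or '?' followed by optional spaces, then drop empty leftovers
Sentences = [s for s in re.split(r'[.!?] *', Mystr) if s]
Sentences
# Output:
>> ['This is a tokenization tutorial', 'We are learning different tokenization methods, and ways', "Tokenization is essential in NLP's tasks"]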
Also Read: Applications of Natural Language Processing
Tokenization Using NLTK
The Natural Language Toolkit (NLTK) is a Python library built specifically for NLP tasks. It comes with built-in functions and modules for the individual stages of a complete NLP pipeline. Let’s have a look at how NLTK handles tokenization.
Word Tokenization
NLTK has a dedicated module, nltk.tokenize, for tokenization tasks. For word tokenization, one of the functions it provides is word_tokenize().
from nltk.tokenize import word_tokenize
# word_tokenize needs the 'punkt' models; download them once with nltk.download('punkt')

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
word_tokenize(Mystr)
# Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', '.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', ',', 'and', 'ways', '?', 'Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']
Please notice that word_tokenize treats punctuation marks as separate tokens. To prevent this, remove punctuation and special characters before this step.
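Alternatively, NLTK’s RegexpTokenizer lets you tokenize and drop punctuation in a single step; here is a brief sketch (not part of the original article):

from nltk.tokenize import RegexpTokenizer

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
# Keep only runs of word characters, so punctuation never becomes a token
tokenizer = RegexpTokenizer(r'\w+')
tokenizer.tokenize(Mystr)
# Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']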
Sentence Tokenization
from nltk.tokenize import sent_tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
sent_tokenize(Mystr)
# Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]
Tokenization Using spaCy
spaCy is probably one of the most advanced libraries for NLP tasks, with support for several dozen languages. When working with English, the first step is usually to download an English core model. Here we import the English language class from spacy.lang.en, which gives us the tokenizer; the downloadable pretrained pipelines additionally provide the tagger, parser, NER, and word vectors.
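For reference, downloading and loading a pretrained English pipeline typically looks like the sketch below (en_core_web_sm is spaCy’s small English model; the blank English class used in this tutorial needs no download):

# Run once in the shell: python -m spacy download en_core_web_sm
import spacy

# The pretrained pipeline bundles the tokenizer with the tagger, parser and NER
nlp = spacy.load("en_core_web_sm")
doc = nlp("Tokenization is essential in NLP's tasks.")
[token.text for token in doc]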
Word Tokenization
from spacy.lang.en import English

nlp = English()
Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
my_doc = nlp(Mystr)

Tokens = []
for token in my_doc:
    Tokens.append(token.text)
Tokens
# Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', '.', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', ',', 'and', 'ways', '?', 'Tokenization', 'is', 'essential', 'in', 'NLP', "'s", 'tasks', '.']
Here, calling the nlp object on Mystr produces a Doc whose elements are the word tokens. We then iterate through them and store their text in a separate list.
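The same loop is often written as a one-line list comprehension; a quick equivalent sketch:

from spacy.lang.en import English

nlp = English()
my_doc = nlp("Tokenization is essential in NLP's tasks.")
# Equivalent to the loop above
Tokens = [token.text for token in my_doc]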
Sentence Tokenization
from spacy.lang.en import English

nlp = English()
sent_tokenizer = nlp.create_pipe('sentencizer')   # spaCy v2 API
nlp.add_pipe(sent_tokenizer)
# Note: in spaCy v3+, this is simply nlp.add_pipe('sentencizer')

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
my_doc = nlp(Mystr)

Sents = []
for sent in my_doc.sents:
    Sents.append(sent.text)
Sents
# Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]
For sentence tokenization, we call the create_pipe method to create the sentencizer component, which marks sentence boundaries, and then add it to the nlp object’s pipeline (in spaCy v3+ the single call nlp.add_pipe('sentencizer') does both). When we pass the text string to the nlp object this time, it creates sentence tokens as well, and they can be added to a list in the same way as in the previous example.
Tokenization Using Keras
Keras is currently one of the most popular deep learning frameworks. It also offers a module dedicated to text preprocessing, keras.preprocessing.text (also available as tensorflow.keras.preprocessing.text in recent TensorFlow releases). This module has the text_to_word_sequence function, which creates word-level tokens from text. Let’s have a quick look.
from keras.preprocessing.text import text_to_word_sequence

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
Tokens = text_to_word_sequence(Mystr)
Tokens
# Output:
>> ['this', 'is', 'a', 'tokenization', 'tutorial', 'we', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'tokenization', 'is', 'essential', 'in', "nlp's", 'tasks']
Please notice that it kept the word "NLP's" as a single token, because the default filter strips punctuation but not apostrophes. Plus, this Keras tokenizer lowercases all the tokens, which is an added bonus.
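Keras also provides a Tokenizer class that goes one step further and maps tokens to integer indices, which is essentially the vectorization step mentioned at the start of this tutorial; a brief sketch (not part of the original article):

from keras.preprocessing.text import Tokenizer

texts = ["This is a tokenization tutorial.", "Tokenization is essential in NLP tasks."]
# Build the vocabulary from the corpus, then convert each text to integer ids
tok = Tokenizer()
tok.fit_on_texts(texts)
print(tok.word_index)                 # token -> index, e.g. {'is': 1, 'tokenization': 2, ...}
print(tok.texts_to_sequences(texts))  # each text as a list of those indices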
Tokenization Using Gensim
Gensim is another popular library for NLP tasks and topic modelling. The gensim.utils module offers a tokenize function, which can be used for our tokenization tasks.
Word Tokenization
from gensim.utils import tokenize

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
# tokenize() returns a generator, so wrap it in list() to see the tokens
list(tokenize(Mystr))
# Output:
>> ['This', 'is', 'a', 'tokenization', 'tutorial', 'We', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'Tokenization', 'is', 'essential', 'in', 'NLP', 's', 'tasks']
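Gensim also ships a simple_preprocess helper that tokenizes and lowercases in one call and, by default, drops tokens shorter than 2 or longer than 15 characters; a quick sketch, not part of the original article:

from gensim.utils import simple_preprocess

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
# Lowercased tokens; one-character tokens such as 'a' are dropped by the default min_len=2
simple_preprocess(Mystr)
# Output:
>> ['this', 'is', 'tokenization', 'tutorial', 'we', 'are', 'learning', 'different', 'tokenization', 'methods', 'and', 'ways', 'tokenization', 'is', 'essential', 'in', 'nlp', 'tasks']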
Sentence Tokenization
For sentence tokenization, we can use the split_sentences function from the gensim.summarization.textcleaner module. Note that the gensim.summarization package was removed in Gensim 4.0, so this requires an older Gensim 3.x release.
from gensim.summarization.textcleaner import split_sentences   # available in Gensim 3.x only

Mystr = "This is a tokenization tutorial. We are learning different tokenization methods, and ways? Tokenization is essential in NLP's tasks."
Tokens = split_sentences(Mystr)
Tokens
# Output:
>> ['This is a tokenization tutorial.', 'We are learning different tokenization methods, and ways?', "Tokenization is essential in NLP's tasks."]
Conclusion
In this tutorial, we discussed various ways to tokenize your text data depending on the application. Tokenization is an essential step of the NLP pipeline, but the data needs to be cleaned before proceeding to it.
If you’re interested in learning more about machine learning & AI, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies & assignments, IIIT-B Alumni status, 5+ practical hands-on capstone projects & job assistance with top firms.