Introduction to Optical Character Recognition [OCR] For Beginners

Updated on Nov 24, 2022

Table of Contents

Tools Used
Steps Involved in the Text Extraction Process
Recognition and Extraction of Text
Getting Relevant Information
Accuracy
What Next?

Optical character recognition (OCR) is used to extract information from images of bills, receipts, or anything else that carries written content. To build such a solution, OpenCV can be used to process the images, which are then fed into the Tesseract OCR engine to extract the text.

However, text extraction works well only if the image is clear and the text is legible. In retail applications, where text is extracted from invoices, the invoice may be covered with watermarks, or a shadow on the bill may obscure the information to be captured.

Capturing specific pieces of information from long pages of text can also be arduous. To tackle these problems, it is prudent to include an image processing module in the information extraction pipeline that deals with these difficulties.

OCR comprises several sub-processes: text localization, character segmentation, and recognition of the segmented characters, although a few systems manage without segmentation. Such methods are built with a variety of techniques, such as applying the least squares method to reduce the error rate and support vector machines to match the characters.

Often, however, convolutional neural networks (CNNs) are employed to detect the presence of a character in an image. Text can be viewed as a sequence of characters. Detecting and identifying these characters with higher accuracy is a problem that can be addressed with a special type of neural network, namely recurrent neural networks (RNNs) and, in particular, long short-term memory (LSTM) networks.

Character outlines are gathered into blobs. Blobs are organized into text lines, and the lines and regions are then analyzed for fixed-pitch or proportional text. Text lines are divided into words according to the kind of spacing between characters. Recognition is split into two passes. In the first pass, an attempt is made to recognize each word in turn, and every word that is recognized satisfactorily is passed to an adaptive classifier as training data. In the second pass, words that were not recognized well enough are revisited with the adapted classifier.
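The spacing-based split of a text line into words can be illustrated with a small sketch. This is not Tesseract's actual algorithm; the function name `split_line_into_words`, the `(x_start, x_end)` blob representation, and the fixed `gap_threshold` are assumptions made for illustration only.

```python
from typing import List, Tuple

# One character blob on a text line, reduced to its horizontal extent.
Box = Tuple[int, int]  # (x_start, x_end)


def split_line_into_words(blobs: List[Box], gap_threshold: int) -> List[List[Box]]:
    """Group blobs into words: a gap wider than gap_threshold starts a new word."""
    words: List[List[Box]] = []
    for blob in sorted(blobs):
        previous_end = words[-1][-1][1] if words else None
        if previous_end is not None and blob[0] - previous_end <= gap_threshold:
            words[-1].append(blob)   # small gap: same word
        else:
            words.append([blob])     # first blob or large gap: new word
    return words


# Five blobs with a wide gap after the third one.
line = [(0, 8), (10, 18), (20, 28), (40, 48), (50, 58)]
words = split_line_into_words(line, gap_threshold=5)
print([len(word) for word in words])  # [3, 2]
```

A real engine would estimate the threshold per line from the observed gaps, since fixed-pitch and proportional fonts space their characters differently.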

The input image is examined and processed in parts, and the text is fed into the LSTM model line by line. Tesseract, an optical character recognition engine, is available for various operating systems. It uses a combination of CNN and LSTM architectures to identify and extract text from image data. However, noise or shadows in the image reduce the extraction accuracy.

To minimize noise and improve image quality, the image can be pre-processed using the OpenCV library. Such pre-processing steps can include finding the region of interest (ROI), cropping the image, removing noise (or unwanted regions), thresholding, dilation and erosion, and detecting contours or edges. Once these steps are completed, the OCR engine can read the image and extract the relevant text from it far more reliably.

Tools Used

1. OpenCV

OpenCV is a library written in C/C++ that also provides bindings for Python. It is commonly used for processing image data. The library offers a plethora of predefined functions that implement the necessary transformations on images. All the aforementioned operations, such as dilation, erosion, slicing, and edge detection, can easily be performed with it.

2. Tesseract OCR Engine

Originally developed at Hewlett-Packard, later released as open source, and for many years maintained with Google's sponsorship, Tesseract is a widely used library for text recognition. It can detect and identify text in various languages. Processing is quite fast, and the textual output for an image is available almost immediately. Many scanning applications leverage this library and rely on its extraction techniques.

Steps Involved in the Text Extraction Process

(1) First, image processing techniques such as contour detection, noise removal, and erosion and dilation are applied to the incoming noisy image.

(2) Next, watermarks and shadows are removed from the bill.

(3) The bill is then segmented into parts.

(4) The segmented parts are passed through the Tesseract OCR engine to get the complete text.

(5) Finally, using regular expressions (regex), we extract the vital information, such as the total amount, the date of purchase, and the expense per item.

Consider a specific kind of image with text: invoices and bills. They usually carry watermarks, mostly of the company issuing the bill. As mentioned earlier, these watermarks impede efficient text extraction. Oftentimes, the watermarks themselves contain text.

Such watermark text can be regarded as noise, because the Tesseract engine recognizes text of every size in a line. Like watermarks, shadows also reduce the accuracy with which the engine extracts text. Shadows are removed by enhancing the contrast and brightness of the image.

For images that have stickers or watermarks, a multi-step process is carried out. The process involves converting the image to grayscale, applying morphological transformations, applying thresholding (binary inversion or Otsu's method), extracting the darker pixels in the darker region, and lastly, pasting the darker pixels into the watermark region. Shadows are handled separately.

To remove shadows, dilation is first applied to the grayscale image. On top of this, a median blur with an appropriate kernel suppresses the text. The output of this step is an image that contains the shadows and any other discolorations present. Next, a simple difference is computed between the original image and this background image. Finally, after thresholding, we obtain an image with no shadows.

Recognition and Extraction of Text

A convolutional neural network model can be built and trained on the printed text found in images. The model can then be used to detect text in other, similar images with the same font. The Tesseract OCR engine is used to recover text from the images that have been processed with the computer vision algorithms.

For optical character recognition, we have to perform text localization, followed by character segmentation, and then recognition of the characters. All of these steps are carried out by Tesseract OCR. The engine is considerably more accurate on printed text than on handwritten text.
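In Python, Tesseract is commonly called through the `pytesseract` wrapper, which is not named in this article and is assumed here. The sketch wraps `pytesseract.image_to_string` in a hypothetical helper, `extract_text`, and lets the caller swap in another engine, which makes the cleanup logic easy to try without Tesseract installed. The defaults `--oem 1` (LSTM engine) and `--psm 6` (a single uniform block of text) are illustrative choices.

```python
from typing import Any, Callable, Optional


def build_tesseract_config(psm: int = 6, oem: int = 1) -> str:
    """Build the Tesseract command-line options for engine and page segmentation mode."""
    return f"--oem {oem} --psm {psm}"


def extract_text(
    image: Any,
    lang: str = "eng",
    psm: int = 6,
    engine: Optional[Callable[..., str]] = None,
) -> str:
    """Run OCR on a pre-processed image and return the non-empty, stripped lines."""
    if engine is None:
        # Imported lazily: requires the pytesseract package and a Tesseract install.
        import pytesseract
        engine = pytesseract.image_to_string
    raw = engine(image, lang=lang, config=build_tesseract_config(psm))
    lines = [line.strip() for line in raw.splitlines()]
    return "\n".join(line for line in lines if line)


# Usage (assumes a pre-processed image array named `clean` already exists):
# text = extract_text(clean)
```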

Getting Relevant Information

For invoices specifically, vital information such as the date of purchase and the total amount can be readily obtained from the extracted text using multiple regular expressions. The total amount printed on the invoice can be extracted with a regular expression because it usually appears at the end of the invoice. Many such useful pieces of information can be stored by date so that they are easily accessible.
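A small sketch of this extraction with Python's `re` module is given here. The patterns, the helper name `extract_invoice_fields`, and the sample receipt text are assumptions; real invoices vary widely in date format and wording, so the patterns would need to be adapted.

```python
import re
from typing import Dict, Optional, Union

# Dates such as 24/11/2022 or 24-11-22.
DATE_RE = re.compile(r"\b(\d{1,2}[/-]\d{1,2}[/-]\d{2,4})\b")
# A line that starts with "Total" or "Grand total" and ends with an amount.
# "Subtotal" does not match because the line must start with the keyword.
TOTAL_RE = re.compile(
    r"^\s*(?:grand\s+)?total\b[^\d\n]*(\d+(?:[.,]\d{2})?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_invoice_fields(text: str) -> Dict[str, Optional[Union[str, float]]]:
    """Pull the purchase date and the total amount out of OCR text."""
    date_match = DATE_RE.search(text)
    # The total usually appears at the end, so take the last matching line.
    totals = TOTAL_RE.findall(text)
    total = float(totals[-1].replace(",", ".")) if totals else None
    return {"date": date_match.group(1) if date_match else None, "total": total}


sample = """ACME STORE
Date: 24/11/2022
Milk 2.50
Bread 1.75
Subtotal 4.25
Tax 0.30
TOTAL 4.55
"""
info = extract_invoice_fields(sample)
print(info)  # {'date': '24/11/2022', 'total': 4.55}
```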

Accuracy

Accuracy of text retrieval can be defined as the ratio of the number of words correctly extracted by Tesseract OCR, that is, extracted words that actually appear on the invoice, to the total number of words present in the image. Higher accuracy indicates more effective pre-processing and a better ability of Tesseract OCR to extract information.
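This word-level accuracy can be computed as in the sketch below. It assumes a manually transcribed ground truth is available, compares words case-insensitively, and ignores word order; the function name `word_accuracy` and the sample strings are illustrative.

```python
from collections import Counter


def word_accuracy(extracted: str, ground_truth: str) -> float:
    """Fraction of ground-truth words that also appear in the OCR output."""
    truth = Counter(ground_truth.lower().split())
    found = Counter(extracted.lower().split())
    total = sum(truth.values())
    if total == 0:
        return 0.0
    # Multiset intersection: each ground-truth word counts at most as often
    # as it was actually extracted.
    correct = sum((truth & found).values())
    return correct / total


truth_text = "Total amount 4.55 paid on 24/11/2022"
ocr_text = "Total amount 4.S5 paid on 24/11/2022"  # one misread word
accuracy = word_accuracy(ocr_text, truth_text)
print(f"{accuracy:.2%}")  # 83.33%
```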

What Next?

If you’re interested in learning more about machine learning, check out IIIT-B & upGrad’s PG Diploma in Machine Learning & AI, which is designed for working professionals and offers 450+ hours of rigorous training, 30+ case studies and assignments, IIIT-B alumni status, 5+ practical hands-on capstone projects, and job assistance with top firms.

Pavan Vadapalli
