Text Recognition and Extraction In Images

Dharmaraj

In this post, I will show you how to extract text from an image using OpenCV and OCR. This process is usually called “text recognition” (locating where the text sits in an image is “text detection”). As you can tell from the title, we will build a simple Python program that extracts text for us. The result can also be exported to a text document, so that we keep a record of it. We are going to look at two methods: Tesseract and EasyOCR.

OpenCV (Open Source Computer Vision) is a library of programming functions mainly aimed at real-time computer vision. In Python, OpenCV lets us load an image and apply operations such as resizing, pixel manipulation, and object detection. In this article, we will use it to read and preprocess images before passing them to an OCR engine.

OCR stands for Optical Character Recognition. It is the process of taking images or scanned documents and converting the text they contain into normal, editable text.

Tesseract is an open-source text recognition (OCR) Engine. It can be used directly, or (for programmers) using an API to extract printed text from images. It can be used with the existing layout analysis to recognize text within a large document, or it can be used in conjunction with an external text detector to recognize text from an image of a single text line.

Once Tesseract itself is installed, you can install its Python wrapper using pip. After installing Tesseract, do not forget to edit the “PATH” environment variable and add the Tesseract installation directory to it.

$ pip install pytesseract
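If you are unsure whether Tesseract ended up on your PATH, a small check can help. The following is only a sketch: `find_tesseract` is a hypothetical helper name, and the fallback location is simply the Windows path used in the example in this post, so it may not match your machine.

```python
import os
import shutil

# Assumption: this is the Windows install location used in this post's
# example; change it if Tesseract lives elsewhere on your system.
WINDOWS_DEFAULT = r"C:\Program Files\Tesseract-OCR\tesseract.exe"


def find_tesseract(fallback=WINDOWS_DEFAULT):
    """Return a path to the tesseract executable, or None if it is not found."""
    # shutil.which searches the directories listed in the PATH variable.
    on_path = shutil.which("tesseract")
    if on_path:
        return on_path
    # Otherwise try the fallback location, if that file exists.
    if os.path.isfile(fallback):
        return fallback
    return None


if __name__ == "__main__":
    path = find_tesseract()
    print(path or "tesseract not found; install it or add it to PATH")
```

If a path is returned, it can be assigned to `pytesseract.pytesseract.tesseract_cmd`, which is what the example program does with a hard-coded string.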

import cv2
import pytesseract
# Path to the Tesseract executable (Windows default; adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Read the image
img = cv2.imread('demo.jpg')
# Configuration: English, LSTM engine (--oem 1), automatic page segmentation (--psm 3)
config = '-l eng --oem 1 --psm 3'
# Run OCR with pytesseract
text = pytesseract.image_to_string(img, config=config)
print(text)

The program above prints the text recognized in the image. Because we ran Tesseract without any preprocessing, accuracy is likely to be low on anything but a clean image. We should preprocess the image first and then apply Tesseract.
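The configuration string passed to pytesseract combines three Tesseract command-line options: `-l` selects the language, `--oem` selects the OCR engine mode (0 to 3, where 1 is the LSTM engine only and 3 is the default choice based on what is available), and `--psm` selects the page segmentation mode (0 to 13, where 3 is fully automatic segmentation). A small helper can make these easier to vary; `build_config` below is a hypothetical name used for illustration, not part of pytesseract.

```python
def build_config(lang="eng", oem=3, psm=3):
    """Assemble a Tesseract option string for pytesseract's `config` argument."""
    # Tesseract defines engine modes 0-3 and page segmentation modes 0-13.
    if oem not in range(4):
        raise ValueError("oem must be between 0 and 3")
    if psm not in range(14):
        raise ValueError("psm must be between 0 and 13")
    # Note the double hyphens: "--oem" and "--psm" are long options.
    return f"-l {lang} --oem {oem} --psm {psm}"


if __name__ == "__main__":
    print(build_config())              # defaults
    print(build_config(oem=1, psm=6))  # LSTM only, single uniform block of text
```

Trying a different `--psm` value is often the first thing worth experimenting with when the layout of the image is unusual, for example a single line or a single word.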

Preprocessing and Detection

Tesseract's accuracy can drop for many reasons, such as noise, low resolution, or poor contrast, so you need to make sure the image is appropriately preprocessed. This includes rescaling, binarization, noise removal, etc.
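Binarization is probably the least obvious of these steps. To give a rough idea of what Otsu's method does (the automatic threshold selection that OpenCV exposes as `cv2.THRESH_OTSU`), here is a pure-Python sketch on a flat list of grayscale values. It is an illustration of the idea only, not OpenCV's implementation, and the function names are made up for this example.

```python
def otsu_threshold(pixels):
    """Return the threshold t (0-255) that maximizes between-class variance.

    Pixels with value <= t form the dark class, the rest the bright class.
    """
    hist = [0] * 256
    for p in pixels:
        hist[p] += 1

    total = len(pixels)
    sum_all = sum(value * count for value, count in enumerate(hist))

    best_t, best_var = 0, -1.0
    w0 = 0    # number of pixels in the dark class so far
    sum0 = 0  # sum of their gray values
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0:
            continue  # no dark pixels yet
        w1 = total - w0
        if w1 == 0:
            break     # every pixel is already in the dark class
        mean0 = sum0 / w0
        mean1 = (sum_all - sum0) / w1
        between_var = w0 * w1 * (mean0 - mean1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t


def binarize(pixels, t):
    """Map every pixel to 0 (dark) or 255 (bright) around threshold t."""
    return [255 if p > t else 0 for p in pixels]


if __name__ == "__main__":
    # Made-up gray values: three dark "ink" pixels, three bright "paper" pixels.
    sample = [10, 12, 11, 200, 205, 198]
    t = otsu_threshold(sample)
    print(t, binarize(sample, t))
```

The point of the method is that the threshold is derived from the image's own histogram, so you do not have to pick a fixed cut-off by hand for every picture.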

import cv2
import pytesseract
# Path to the Tesseract executable (Windows default; adjust for your system)
pytesseract.pytesseract.tesseract_cmd = r'C:\Program Files\Tesseract-OCR\tesseract.exe'
# Read the image
img = cv2.imread('happy.jpg')
# Convert to grayscale
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
# Noise removal with a 3x3 median filter
noise = cv2.medianBlur(gray, 3)
# Thresholding: convert to a binary image using Otsu's method.
# This matters most for colored or unevenly lit images; if you skip it,
# Tesseract may not detect the text correctly and can give wrong results.
thresh = cv2.threshold(noise, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# Configuration: English, default engine (--oem 3), automatic page segmentation (--psm 3)
config = '-l eng --oem 3 --psm 3'
# Run OCR with pytesseract on the preprocessed image
text = pytesseract.image_to_string(thresh, config=config)
print(text)

HAPPINESS IS
enjoying the little things
in life.
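To keep a record of the result, as mentioned in the introduction, the recognized text can be written to a text file. The sketch below assumes the text is already available as a Python string; `save_text` and the file name `ocr_result.txt` are hypothetical choices for this example.

```python
from pathlib import Path


def save_text(text, path="ocr_result.txt"):
    """Write recognized text to a UTF-8 file and return the path written."""
    out = Path(path)
    # OCR output often has stray leading or trailing blank lines; trim them
    # and end the file with a single newline.
    out.write_text(text.strip() + "\n", encoding="utf-8")
    return out


if __name__ == "__main__":
    # Stand-in for the string returned by pytesseract.image_to_string
    recognized = "HAPPINESS IS\nenjoying the little things\nin life.\n\n"
    written = save_text(recognized)
    print(f"Saved OCR result to {written}")
```

Writing with an explicit UTF-8 encoding avoids surprises when the recognized text contains non-ASCII characters.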

EasyOCR is built with Python and the PyTorch deep learning library, so having a GPU can speed up the whole detection process. The detection part uses the CRAFT algorithm, and the recognition model is a CRNN. The recognition model is composed of 3 main components: feature extraction (currently ResNet), sequence labeling (LSTM), and decoding (CTC). EasyOCR doesn’t have many software dependencies and can be used directly through its API.

EasyOCR is built on PyTorch, so install PyTorch first and then install EasyOCR using the following command.

$ pip install easyocr

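import cv2
import pandas as pd
import easyocr
# Read the image
img = cv2.imread('happy.jpg')
# Optional preprocessing, same steps as in the Tesseract example
gray = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
noise = cv2.medianBlur(gray, 3)
thresh = cv2.threshold(noise, 0, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)[1]
# Load the English model (downloaded on first use)
reader = easyocr.Reader(['en'])
# paragraph=True merges nearby lines into one block of text;
# note that the original image is passed here, not the thresholded one
result = reader.readtext(img, paragraph=True)
# Each row holds the bounding box (column 0) and the text (column 1)
df = pd.DataFrame(result)
print(df[1])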

HAPPINESS IS enjoying the little things in life.
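When `readtext` is called without paragraph mode, each entry in the result is a (bounding box, text, confidence) triple, which makes it easy to discard uncertain detections. The sketch below works on a made-up result list of that shape, so the coordinates and confidence values are illustrative only, and `confident_text` is a hypothetical helper name.

```python
def confident_text(result, min_confidence=0.5):
    """Keep the text of detections whose confidence is at least min_confidence."""
    return [text for _bbox, text, conf in result if conf >= min_confidence]


# Made-up sample in the (bounding box, text, confidence) shape; the numbers
# are not real EasyOCR output.
sample = [
    ([[10, 10], [200, 10], [200, 40], [10, 40]], "HAPPINESS IS", 0.98),
    ([[10, 50], [260, 50], [260, 80], [10, 80]], "enjoying the little things", 0.91),
    ([[10, 90], [90, 90], [90, 120], [10, 120]], "in 1ife.", 0.32),
]

if __name__ == "__main__":
    print(" ".join(confident_text(sample)))
```

The right cut-off depends on the images; 0.5 here is an arbitrary starting point rather than a recommended value.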

Tesseract and EasyOCR are both well suited to scanning clean documents and give high accuracy on them. I would say that both are good go-to tools if your task is scanning books, PDFs, and printed text on a clean white background. In particular, Tesseract works well for scanned print documents, whereas EasyOCR works well for extracting text from general scenes and random pictures.