For the past couple of months, my colleague and I have been working on a research project.
The goal is simple – detect characters from a real-world image. However, the intermediate steps involved don’t make the task as straightforward as you might think!
Before discussing the technicalities of the project, it’s important to know what OCR is.
OCR – the heart of text detection
Short for Optical Character Recognition, OCR is used to identify glyphs – be they handwritten or printed. This way, every glyph is detected and separately assigned a character by the computer.
While OCR has gained traction in recent times, it is not a new concept. In fact, it is this very technology that bank employees use to read cheques and bank statements.
For this project we chose Tesseract as our OCR engine. Originally developed at Hewlett-Packard and now maintained by Google, it is the engine behind the image-to-text feature in the Google Keep app.
The nitty-gritty of the project
We have limited our scope to printed text – specifically, street signs – and are attempting to convert the captured images to .txt files. This is how our code is intended to work:
If it works, it would be possible to scale down the file size considerably – very handy for storing the names of places on smartphones, which almost always come equipped with a camera these days. Ideally, such a task would be easy to accomplish: perfect lighting, no perspective distortion or warping, and no background noise.
Reality, unsurprisingly, is quite the opposite. Hence, we are preprocessing the images before feeding them to Tesseract, which is known to work best with binary (black-and-white) images.
According to our plan, we shall implement a three-step method:
- remove perspective distortion from the image
- binarize the image
- pass the image through Tesseract
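The three steps above can be sketched as a skeleton in Python. Everything here is a placeholder sketch, not our final implementation: the perspective step is an identity stub, the fixed threshold stands in for the real binarization, and step 3 assumes the `tesseract` command-line tool is installed (it is skipped gracefully when it is not).

```python
import os
import shutil
import subprocess
import tempfile

import numpy as np

def remove_perspective(gray):
    # Step 1 (placeholder): the real version will estimate a homography
    # from the sign's corners and warp the image to a frontal view.
    return gray

def binarize(gray, threshold=128):
    # Step 2 (placeholder): a fixed global threshold. Tesseract works
    # best on binary images like the one this produces.
    return np.where(gray > threshold, 255, 0).astype(np.uint8)

def save_pgm(img, path):
    # Tesseract reads PNM images, which are trivial to write from numpy.
    height, width = img.shape
    with open(path, "wb") as f:
        f.write(b"P5\n%d %d\n255\n" % (width, height))
        f.write(img.tobytes())

def run_pipeline(gray):
    # Step 3: hand the cleaned-up image to the Tesseract CLI, if present.
    binary = binarize(remove_perspective(gray))
    if shutil.which("tesseract") is None:
        return binary, None  # engine not installed on this machine
    with tempfile.TemporaryDirectory() as tmp:
        img_path = os.path.join(tmp, "sign.pgm")
        out_base = os.path.join(tmp, "sign")
        save_pgm(binary, img_path)
        subprocess.run(["tesseract", img_path, out_base],
                       check=True, capture_output=True)
        with open(out_base + ".txt") as f:
            return binary, f.read()
```

On a machine without Tesseract the function still returns the binarized image, so the first two steps can be tested in isolation.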
Training the Tesseract engine
Before processing the images, the OCR engine needs to be ‘trained’ in order to work properly. For this reason, I downloaded jTessBoxEditor – a Java program for editing box files (the files Tesseract generates when detecting glyphs). Since the project runs on Ubuntu, I had to download and install the Java Runtime Environment (JRE) to run jTessBoxEditor.
Since my portion of the project involves training the engine, I need to generate sample data for it. The engine needs to be fed samples of Times New Roman, Calibri, and Arial – the three fonts we came across in our images.
Our progress so far
Tesseract is still being trained, and the sample data is yet to be generated. Realizing that the three fonts were already available in my Windows installation, I copied the font files over to Ubuntu and installed them successfully. One step down, several more to go!
On the image processing side, we are currently evaluating a Python implementation of ‘font and background colour independent text binarization’, a technique pioneered by T. Kasar, J. Kumar and A. G. Ramakrishnan.
I modified the code to work with Python 3, in order to avoid discrepancies between the various modules of our project. Here is the link:
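For context, Kasar et al.’s algorithm works from an edge map: it groups edge pixels into connected components (‘edge boxes’) and estimates foreground and background colours per box, which is what makes it independent of font and background colour. As a much simpler global baseline to compare against – not their method – Otsu’s threshold can be written in plain numpy:

```python
import numpy as np

def otsu_threshold(gray):
    # Histogram of intensities (0..255) as probabilities.
    hist = np.bincount(gray.ravel(), minlength=256).astype(np.float64)
    probs = hist / gray.size
    omega = np.cumsum(probs)                # weight of class 0 up to each t
    mu = np.cumsum(probs * np.arange(256))  # cumulative mean
    mu_total = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        # Between-class variance for every candidate threshold t.
        sigma_b = (mu_total * omega - mu) ** 2 / (omega * (1.0 - omega))
    sigma_b = np.nan_to_num(sigma_b)
    # The threshold maximizing between-class variance.
    return int(np.argmax(sigma_b))

def binarize_otsu(gray):
    t = otsu_threshold(gray)
    return np.where(gray > t, 255, 0).astype(np.uint8)
```

A global threshold like this breaks down exactly where Kasar et al.’s per-component approach shines: images where text and background colours vary across the scene.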
A web forum also suggested that the input images be enlarged or shrunk to make the text legible to the engine. This task calls for ImageMagick, a command-line (CLI) tool for image manipulation. I therefore downloaded a number of grayscale text images (in the desired fonts, of course) and decided to convert them all to PNG.
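The resizing the forum suggests can be illustrated without ImageMagick: a crude nearest-neighbour enlargement in numpy (a stand-in for the proper resampling a real tool would do) shows what ‘enlarging the input’ means for small glyphs.

```python
import numpy as np

def enlarge(gray, factor=3):
    # Nearest-neighbour upscaling: repeat every pixel `factor` times
    # along both axes. Crude, but it shows how small glyphs gain the
    # pixel area an OCR engine needs to segment them reliably.
    return np.repeat(np.repeat(gray, factor, axis=0), factor, axis=1)
```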
For some reason, I’m not able to do so, and have failed to convert any of them.
As an example, here is a sample command:
magick convert gray25.gif gray25.png
This is the error message I get in Terminal:
No command 'magick' found, did you mean:
 Command 'magic' from package 'magic' (universe)
magick: command not found
I’ve tried re-installing ImageMagick several times, but to no avail. I need to go through yet more web forums for a solution to this problem.
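One likely culprit: the `magick` entry point only exists in ImageMagick 7, while the package in Ubuntu’s apt repositories is still version 6, which installs its tools under individual names such as `convert`. A small wrapper – a sketch, assuming one of the two binaries is on the PATH – can pick whichever is present:

```python
import shutil
import subprocess

def find_imagemagick():
    # ImageMagick 7 ships a single `magick` binary; version 6 (the one
    # in Ubuntu's apt repositories) ships `convert` instead.
    tool = shutil.which("magick") or shutil.which("convert")
    if tool is None:
        raise RuntimeError("ImageMagick does not appear to be installed")
    return tool

def build_convert_command(src, dst, tool=None):
    # Both `magick src dst` and `convert src dst` convert between
    # formats based on the output file's extension.
    tool = tool or find_imagemagick()
    return [tool, src, dst]

def convert_image(src, dst):
    subprocess.run(build_convert_command(src, dst), check=True)
```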
What’s the scope?
This is a question almost everyone asks whenever I discuss my project. Indeed, it doesn’t look very promising at first sight, due to the tedious nature of the steps involved.
However, its scope is quite vast – ranging from the preservation of ancient texts and languages, to translation and transliteration of public signage, to converting street signs to audio for the visually impaired. In fact, it could even serve as a last resort for driverless vehicles navigating an area when GPS fails.
We are limited only by our imagination – and once imagination is merged with technology, it can be used to achieve miracles!
1. Font and Background Color Independent Text Binarization; a research paper:
2. Perspective rectification of document images using fuzzy set and morphological operations; a research paper:
3. jTessBoxEditor; a how-to guide:
4. AptGet/HowTo; a how-to guide: