Image Processing – The Visualizer

This has been quite a busy month, especially on the programming front. On the good side of things, I have made some headway in the image-to-text project I previously mentioned. On the other hand, my teammate and I really need to hasten our work, and complete our project within the next two months. There is much to do, in so little time!

This project has also opened my mind about Python codes, and has given me compelling reasons to continue learning more about it. Being a staunch C++ programmer since my high school days, it took me quite some time to realize its immense value in today’s computer systems, and gain confidence in it .

Python – an introduction

Python was released by Guido van Rossum in 1991. An ardent fan of Monty Python’s Flying Circus, Van Rossum initiated it as a hobby project in December 1989, as an interpreter for ABC, another programming language at the time. Python eventually became an interpreted language in its own right, and now has a large developer base, spread all over the planet.

One reason for its global appeal is its strong emphasis on keeping lines of code as readable and clutter-free as possible. This is implemented by PEPs (Python Enhancement Proposals) – design documents that provide the rationale behind a feature, and its technical specifications.

The strict adherence to readability in Python is obvious from the fact that it lacks any curly braces or semicolons for indentation, and completely relies on the use of tabs and spaces for the same. This, unfortunately, acts as a double-edged sword, since Python is unforgiving when programmers add too many or too little spaces, or use a mix of spaces and tabs in their code. More often than not, Python code developed by rookie programmers will not run, due to incorrect indentation. Hence, it might help to go through the PEP 8 guide, and correct all indentation errors in the code.

Nevertheless, Python has become immensely popular over the last few decades, and continues to go mainstream, so much that de-facto libraries like OpenGL, OpenCL, Unity and TensorFlow have all been developed in C++, and are available for integration with Python, using wrappers.

What are wrappers?

Wrappers are interfaces that allow a program to build on to an existing piece of code or program, without disturbing it. This allows the programmer to extend the capabilities of a program, or a portion of it, while hiding a few features of the original program (abstraction). Hence, the portability of the program code is increased.

Through wrappers, Python lifts a big burden off the programmer’s shoulders, since he or she no longer has to translate the entire code into another language, and simply has to decide what features to include, and which ones to exclude, while designing the new interface.

Python wheels – libraries installed using pip

Another reason why I am impressed by Python modules is their ease of installation. With just a single ‘pip install’ command in Terminal, a user can download and install any Python library he or she needs, and deploy it immediately.

This has been made possible through wheels – a packaging format created by the Python community. These are maintained on PyPI (Python Packaging Index), Python’s official repository for third-party software. With its origins dating back to September 2000, it continues to grow and improve to this day. It currently houses over 100,000 packages, enough for a programmer to be spoilt for choice.

On the other hand, a C/C++ user, more often than not, is expected to manually download a zipped file from the Internet, decompress it, and build it from source – an unpleasant and tiresome experience in my past programming ventures. Undoubtedly, Python programmers are at an advantage over here!

Tuples

One data type that piqued my interest was the ‘tuple’, a concept borrowed from mathematics. It is a finite sequence of elements. It is extremely valuable in creating CSV files, and computational mathematics, especially matrix operations.

At this point, many of you might be wondering, “Even if all its elements are numbers, a tuple is just a row matrix. How is computational mathematics going to benefit from tuples?”

Enter the matrix

To counter this thought, I shall now share an example, from my project. Remember the colour images I wished to convert to grayscale? Well, that requires each pixel of the image to be read as a matrix, and what better way to do that than by tuples!

Here is some of the pixel information of this image, in matrix notation:

[[[ 71 65 53]

[ 73 67 55]

[ 76 70 58]

…

[168 178 143]

[166 176 139]

[164 174 137]]

[[ 58 48 38]

[ 57 47 37]

[ 55 45 35]

…

[221 215 199]

[221 215 199]]]

Though it is only part of the whole image, it is clear that each pixel is represented as a tuple of three elements. These elements are the red(R), green(G) and blue(B) values corresponding to their pixels. Python needs to read the image as a matrix of three-element tuples, before any kind of image manipulation can be done.

Taking this matrix of tuples, the image gets converted to grayscale using a simple formula, that computes the average of the R,G and B values of each pixel.

This causes the program to generate this output:

By the way, here is the complete Python program, if you wish to have a look:

https://github.com/the-visualizer/image-to-text/blob/master/img2grey.py

Note: Although the saved grayscale has the same resolution as the colour input, the pyplot function is generating this output on my machine, for some unknown reason.

I suspect it is caused by a missing module from my machine, since it works fine on my friend’s laptop. Hopefully, I shall fix this bug soon!

External Links

PEP 8 – the Style Guide for Python code:

https://pep8.org/

Function wrapper and python decorator; a blog post:

https://amaral.northwestern.edu/blog/function-wrapper-and-python-decorator

Tuples; a chapter in How to Think Like a Computer Scientist: Learning with Python 3:

http://openbookproject.net/thinkcs/python/english3e/tuples.html

RGB to Grayscale Conversion; a tutorial:

https://samarthbhargav.wordpress.com/2014/05/05/image-processing-with-python-rgb-to-grayscale-conversion/

Since the past couple of months, me and my colleague have been working on a research project.

The goal is simple – detect characters from a real-world image. However, the intermediate steps involved don’t make the task as straightforward as you might think!

Before discussing the technicalities of the project, it’s important to know what OCR is.

OCR – the heart of text detection

Short for Optical Character Recognition, it is used to identify glyphs – be it handwritten or printed. This way, all glyphs are detected and are separately assigned a character by the computer.

While OCR has gained traction in recent times, is not a new concept. In fact, it is this very technology that bank employees use to read cheques and bank statements.

For this project we chose Tesseract as our OCR engine. It has been developed by Google, and is what is used in their Google Keep app to convert images to text.

The project’s nitty-gritties

We have limited our scope to printed text – specifically, street signs – and are attempting to convert the captured images to .txt files. This is how our code is intended to work:

If it works, then it would be possible to scale down the file size – a very handy tool for storing names of places in smart-phones, which always come equipped with a camera these days. Ideally, such a task would be easy to accomplish, with perfect lighting, no perspective distortions or warping, and no background noise.

Reality, unsurprisingly, is quite the opposite. Hence, we are trying to process the images before feeding them to Tesseract, which is known to work best with binary (black and white) images.

According to our plan, we shall implement a three-step method:

remove perspective distortion from the image
binarize the image
pass the image through Tesseract

Training the Tesseract engine

Before processing the images, the OCR engine needs to be ‘trained’ in order to work properly. For this reason, I downloaded jTessBoxEditor – a Java program for editing boxfiles (files generated by Tesseract when detecting glyphs). Since the project uses Ubuntu’s OS, I had to download and install Java Runtime Environment (JRE) to run jTessBoxEditor.

Since my portion of the project involves training the engine, I need to generate sample data for it. The engine needs to be fed samples of Times New Roman, Calibri, and Arial – the three fonts we came across in our images.

Our progress so far

Tesseract is still being trained, and the sample data is yet to be generated. After a while, realizing that these fonts would be available in my Windows installation, I copied the font files to Ubuntu, and successfully installed the fonts. One step down, several more to go!

On the image processing side, we are currently evaluating a Python implementation of ‘font and background colour independent text binarization’, a technique pioneered by T Kasar, J Kumar and A G Ramakrishnan.

I modified the code to work with python3, in order to avoid discrepancies between the various modules of our project. Here is the link:

https://github.com/the-visualizer/image-to-text

A web forum also suggested that the input images be enlarged or shrunk, in order to make the text legible. This task requires ImageMagick, a software that uses a CLI (command line interface) for image manipulation. Therefore, I downloaded a bunch of grayscale text images (with the desired font, of course), and decided to convert all of them to PNG.

For some reason, I’m not able to do so, and have failed to convert any of them.

As an example, here is a sample command:

magick convert gray25.gif gray25.png

This is the error message I get in Terminal:

No command 'magick' found, did you mean:

 Command 'magic' from package 'magic' (universe)

magick: command not found

I’ve tried re-installing ImageMagick several times, but to no avail. I need to go through yet more web forums for a solution to this problem.

What’s the scope?

This is a question almost everyone asks whenever I discuss my project. Indeed, it doesn’t look very promising at first sight, due to the tedious nature of the steps involved.

However, its scope is quite vast – ranging from preservation of ancient texts and languages to transliteration and transliteration of public signage, and converting street signs to audio for the visually impaired. In fact, it may be used as a last resort for driverless vehicles to navigate an area when GPS fails.

We are only limited by our imaginations. Once merged with technology, they can be used to achieve miracles!

External Links

1.Font and Background Color Independent Text Binarization; a research paper:

https://www.researchgate.net/profile/Thotreingam_Kasar/publication/228680780_Font_and_background_color_independent_text_binarization/links/0fcfd5081158485343000000.pdf

2.Perspective rectification of document images using fuzzy set and morphological operations; a research paper:

http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.105.3716&rep=rep1&type=pdf

3.jTessBoxEditor; a how-to guide:

http://vietocr.sourceforge.net/training.html

4.AptGet/HowTo; a how-to guide:

https://help.ubuntu.com/community/AptGet/Howto