OCR and Text-to-Speech | Nehal Nagaraj Shet

In this project, we discuss how to improve the efficiency of Tesseract OCR for Kannada. Kannada has roughly 44 million native speakers and is also spoken as a second or third language by over 12.9 million non-native speakers in Karnataka, which adds up to more than 56 million people who speak and write Kannada. Research in Optical Character Recognition (OCR) is popular because of its application potential in banks, post offices, defense organizations, library automation and similar settings. In this report, we propose a technique for improving the efficiency of Tesseract OCR.

* INTRODUCTION

Optical character recognition or optical character reader (OCR) is the electronic or mechanical conversion of images of typed, handwritten or printed text into machine-encoded text, whether from a scanned document, a photo of a document, a scene-photo (for example the text on signs and billboards in a landscape photo) or from subtitle text superimposed on an image (for example from a television broadcast). Widely used as a form of data entry from printed paper data records – whether passport documents, invoices, bank statements, computerized receipts, business cards, mail, printouts of static-data, or any suitable documentation – it is a common method of digitizing printed texts so that they can be electronically edited, searched, stored more compactly, displayed on-line, and used in machine processes such as cognitive computing, machine translation, (extracted) text-to-speech, key data and text mining. OCR is a field of research in pattern recognition, artificial intelligence and computer vision.

Why Kannada OCR? Our main aim was to serve society by building something new that would be of real help to people, so we chose to work in the domain of literature. One of the most common problems in the field of literature in India is converting printed text into a digitized format. We wanted to digitize literature written in the Kannada script, which was also the objective of Kannada Ganaka Parishat. We realized this work could also contribute to the ongoing Digital India campaign. Hence, we decided to work on Optical Character Recognition (OCR). OCR can make your life easier by:
• Making paper-based information searchable in seconds, rather than hours.
• Reducing or eliminating costly data entry by automatically grabbing the information you need from paper and putting it where it needs to go.
• Enabling entirely new ways to process documents that can eliminate “human touches”, thereby reducing costs and dramatically reducing processing times.

As early as the 1960s, engineers were trying to develop a way for machines to recognize text. Unlike humans, computers cannot see, nor can they inherently tell one typeface from another in order to identify a character or word. An early capability could read a single typeface called OCR-A, but there was no way to ensure that everyone used this one typeface, so its usefulness was limited. After a spell back at the proverbial drawing board, a new way of performing OCR was developed. This new method relied on pattern recognition, whereby the computer didn’t have to recognize the whole letter “R” to know it was an “R.” Instead, the computer would look for common points, patterns and combinations of lines and shapes to determine which letter it was reading.

Using points, patterns and lines helped speed up the OCR process and gave it the flexibility to read hundreds of fonts. This technology has expanded to handwriting as well, though it works best on structured forms rather than free-form handwriting. Over the last 15 years, OCR technology has become faster and more accurate. Modern OCR software comes with multiple language packs and the ability to read scientific symbols and other less common font types.
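The idea of recognizing a letter from its lines and shapes rather than from an exact match can be illustrated with a toy sketch in Python. This is only an illustration under simplifying assumptions and is not how Tesseract or this project works: the 5x5 glyph grids, the row and column ink counts used as features, and the function names `extract_features` and `classify` are all hypothetical choices made for this example.

```python
# Toy feature-based character recognition (illustrative assumption only).
# Each glyph is a 5x5 grid where '#' marks ink and '.' marks background.
GLYPHS = {
    "I": ["..#..",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
    "L": ["#....",
          "#....",
          "#....",
          "#....",
          "#####"],
    "T": ["#####",
          "..#..",
          "..#..",
          "..#..",
          "..#.."],
}


def extract_features(glyph):
    """Return ink counts per row followed by ink counts per column.

    These counts are a crude stand-in for "lines and shapes": a long
    horizontal stroke shows up as one large row count, a vertical
    stroke as one large column count.
    """
    rows = [row.count("#") for row in glyph]
    cols = [sum(row[c] == "#" for row in glyph) for c in range(len(glyph[0]))]
    return rows + cols


def classify(glyph, templates=GLYPHS):
    """Return the label of the template with the closest feature vector."""
    target = extract_features(glyph)

    def distance(label):
        reference = extract_features(templates[label])
        # Sum of absolute differences between the two feature vectors.
        return sum(abs(a - b) for a, b in zip(target, reference))

    return min(templates, key=distance)


# A noisy "T": one pixel missing from the top bar and one stray pixel.
noisy_t = ["####.",
           "..#..",
           "..#..",
           "..#.#",
           "..#.."]

print(classify(noisy_t))  # prints: T
```

Because the comparison is made on features instead of exact pixels, the damaged "T" is still closer to the "T" template than to "I" or "L". Real OCR engines use far richer features and models, but the principle of matching patterns rather than whole letters is the same.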

Images are taken from the internet.

Codebase: https://github.com/hoplite2k/ocr