MLDOCR (formerly Thai OCR)

Multi-Language Dictionary OCR (MLDOCR) is a program that takes as input a bitmap image of a page scanned from a book or magazine. It then extracts each character and creates a text file suitable for editing.

The program works with a scanned image; it does not scan the document directly. The image must be scanned at 300 DPI in 256 shades of grey, not in full colour. It doesn't matter which scanner is used to create the image. The document can contain Thai, Khmer, Lao, Burmese, English, and any of a dozen European languages.
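
As a rough illustration only, the scanning requirements above could be checked before analysis begins. The sketch below assumes the "bitmap image" is a Windows BMP file with a standard 40-byte info header, which this page does not actually state. The names check_bmp_header and make_header are hypothetical and are not part of MLDOCR.

```python
import struct

METRES_PER_INCH = 0.0254  # BMP headers store resolution in pixels per metre


def check_bmp_header(data, dpi=300, tolerance=2):
    """Return a list of problems; an empty list means the header looks acceptable."""
    if len(data) < 54 or data[:2] != b"BM":
        return ["not a BMP file with a 40-byte info header"]
    # The info header starts at byte 14 of the file.
    (_size, _width, _height, _planes, bits, _compression, _image_size,
     x_ppm, y_ppm, _used, _important) = struct.unpack_from("<IiiHHIIiiII", data, 14)
    problems = []
    # 8 bits per pixel allows 256 levels. Confirming that the palette is
    # really grey is left out of this sketch.
    if bits != 8:
        problems.append(f"expected 8 bits per pixel, found {bits}")
    x_dpi = round(x_ppm * METRES_PER_INCH)
    y_dpi = round(y_ppm * METRES_PER_INCH)
    if abs(x_dpi - dpi) > tolerance or abs(y_dpi - dpi) > tolerance:
        problems.append(f"expected about {dpi} DPI, found {x_dpi} x {y_dpi}")
    return problems


def make_header(bits, pixels_per_metre):
    """Build a minimal 54-byte header for a 1 x 1 image, for demonstration only."""
    file_header = struct.pack("<2sIHHI", b"BM", 54, 0, 0, 54)
    info_header = struct.pack("<IiiHHIIiiII", 40, 1, 1, 1, bits, 0, 0,
                              pixels_per_metre, pixels_per_metre, 0, 0)
    return file_header + info_header


print(check_bmp_header(make_header(8, 11811)))  # 11811 pixels per metre is about 300 DPI
print(check_bmp_header(make_header(24, 2835)))  # a 72 DPI full-colour header
```

Some scanners write a resolution of zero into the header, so a real check would probably need to let the user override it.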

The screen captures shown here are from a debugging version of the program; they are NOT the final user interface.

This image was captured after page boundary analysis (blue lines), and during block analysis (purple lines). The large green area shows the column of text already analysed. The pink bits in the second column show the program progressing line by line looking for the end of that block of text.
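
One common way to walk down a column line by line and detect where a block ends is a horizontal projection: count the dark pixels in each row, treat runs of inked rows as text lines, and end the block when the blank gap grows too large. The sketch below assumes that technique; this page does not say how MLDOCR itself does it. The names find_text_lines and split_blocks and the threshold values are hypothetical.

```python
def find_text_lines(rows, ink_threshold=128):
    """Find text lines in a column of greyscale pixels.

    rows is a list of pixel rows, each a list of grey values from 0 (black)
    to 255 (white). Returns (top, bottom) row indices, bottom inclusive,
    for every run of consecutive rows that contain at least one dark pixel.
    """
    lines = []
    start = None
    for y, row in enumerate(rows):
        has_ink = any(value < ink_threshold for value in row)
        if has_ink and start is None:
            start = y                      # first inked row of a new line
        elif not has_ink and start is not None:
            lines.append((start, y - 1))   # a blank row closes the line
            start = None
    if start is not None:
        lines.append((start, len(rows) - 1))
    return lines


def split_blocks(lines, max_gap=3):
    """Group lines into blocks; more than max_gap blank rows ends a block."""
    blocks = []
    for line in lines:
        if blocks and line[0] - blocks[-1][-1][1] - 1 <= max_gap:
            blocks[-1].append(line)
        else:
            blocks.append([line])
    return blocks


WHITE = [255, 255, 255, 255]
INK = [255, 0, 0, 255]
column = [WHITE, INK, INK, WHITE, INK,
          WHITE, WHITE, WHITE, WHITE, WHITE, INK, WHITE]
print(split_blocks(find_text_lines(column)))
```

On a real scan the blank-row test would need some tolerance for specks and skew, which this sketch ignores.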

The screen captures on this page show a bilingual (English and French) document being analysed.


Note that the program runs at about 1% of its normal speed when the debugging interface shown in these images is turned on.

This image shows the result of the block analysis. Each block is outlined in purple.

One block of text.

One line of text.

The program isolates individual characters.
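
A standard way to isolate shapes within a line of text is connected-component labelling: every group of touching dark pixels becomes one candidate shape with its own bounding box. The sketch below assumes that approach, which may differ from what MLDOCR actually does. The name connected_components and the threshold of 128 are hypothetical.

```python
from collections import deque


def connected_components(rows, ink_threshold=128):
    """Return bounding boxes (left, top, right, bottom), all inclusive,
    one per 8-connected group of dark pixels, sorted left to right."""
    height = len(rows)
    width = len(rows[0]) if rows else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0] or rows[y0][x0] >= ink_threshold:
                continue
            # Flood-fill outward from this unvisited dark pixel.
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            left = right = x0
            top = bottom = y0
            while queue:
                x, y = queue.popleft()
                left, right = min(left, x), max(right, x)
                top, bottom = min(top, y), max(bottom, y)
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        nx, ny = x + dx, y + dy
                        if (0 <= nx < width and 0 <= ny < height
                                and not seen[ny][nx]
                                and rows[ny][nx] < ink_threshold):
                            seen[ny][nx] = True
                            queue.append((nx, ny))
            boxes.append((left, top, right, bottom))
    return sorted(boxes)


W, B = 255, 0
line = [
    [W, B, W, W, W],
    [W, W, W, W, W],
    [W, B, W, B, B],
    [W, B, W, W, B],
]
print(connected_components(line))
```

A mark that sits above or below its base character, as Thai vowel and tone marks do, comes out as a separate box here, so a later step would have to attach it to the character it belongs to.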

Result of character analysis.

If the program can't recognise a character, it displays a message with a menu that lets you select a language or edit the pixels.

It presents a list of all possible characters for the selected language, and asks you which one it is, or whether it should be ignored. In this way, it expands its internal list of character features, so it should get better at recognising characters over time.
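
The idea of a feature list that grows as the user identifies unknown characters can be sketched with a nearest-neighbour lookup. This is an assumption made for illustration; the page does not describe how MLDOCR stores or compares features. The class CharacterStore, its methods, and the two-number feature vectors are all hypothetical, and the "ignore" choice is not modelled.

```python
import math


class CharacterStore:
    """A growing list of (feature vector, label) samples."""

    def __init__(self, max_distance=0.5):
        self.samples = []
        self.max_distance = max_distance

    def recognise(self, features):
        """Return the label of the closest stored sample, or None if nothing
        is stored within max_distance (the character counts as unknown)."""
        best_label = None
        best_distance = math.inf
        for known, label in self.samples:
            distance = math.dist(known, features)
            if distance < best_distance:
                best_label, best_distance = label, distance
        return best_label if best_distance <= self.max_distance else None

    def teach(self, features, label):
        """Store the answer the user gave for a character that was unknown."""
        self.samples.append((tuple(features), label))


store = CharacterStore()
shape = (0.2, 0.8)                 # made-up feature values for one character
if store.recognise(shape) is None:
    store.teach(shape, "A")        # stands in for the user's menu choice
print(store.recognise((0.25, 0.8)))
```

Each answer adds one more sample, so later characters with similar features are matched without asking again.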

The program will work with Thai, Burmese, Lao, and Cambodian (Khmer) documents, as well as English and European language documents, as shown here. The primary output is a text file that can be edited with a word processor. (The plan is to have a secondary output identifying unrecognised characters and where they were found; this has not been programmed yet.)

At the moment, the program does not maintain the document formatting; I imagine that as the program gets better and more sophisticated, this will be added.

There will also be a secondary analysis, which looks up each word in the output file and validates it against the MLD project dictionaries.
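
A minimal sketch of such a lookup pass is given below, assuming the dictionary is available as a set of lower-case words. The function validate_words and the sample word list are hypothetical; the format of the MLD project dictionaries is not described on this page.

```python
import re


def validate_words(text, dictionary):
    """Return (word, line number) pairs for words missing from the dictionary.

    dictionary is a set of lower-case words. Line numbers start at 1.
    """
    unknown = []
    for line_number, line in enumerate(text.splitlines(), start=1):
        # Match runs of letters only, so digits and punctuation are skipped.
        for word in re.findall(r"[^\W\d_]+", line):
            if word.lower() not in dictionary:
                unknown.append((word, line_number))
    return unknown


words = {"the", "cat", "sat", "on", "mat"}
print(validate_words("The cat sat\non teh mat.", words))
```

This only suits languages that separate words with spaces. Thai, Lao, Khmer, and Burmese text is normally written without spaces between words, so those languages would need a word segmentation step before any lookup.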

I would also like to add an optional third stage, which would provide a translation to another language; for example, a Thai document could be scanned and analysed to produce an editable Thai text file, as well as an English translation. The quality of the translation would improve as the grammar analysis algorithms improve, but would likely never approach the quality of a human translator. However, it would be good enough for use as a "rough translation" to get a basic understanding of a document in the other language.

Note that this program is not yet available on this Web site, as it is not finished. To be notified when it becomes available, send me an e-mail.

Copyright © 1998-2006 Doug Anderson
Last modified: 21 September 2006