MLDOCR (formerly Thai OCR)
Multi-Language Dictionary OCR (MLDOCR) is a program that takes
as input a bitmap image, which is a page scanned from a book
or magazine. It then extracts each character and creates a text
file, suitable for editing.
The program works with a scanned image; it does not scan the
document directly. The image must be scanned at 300 DPI, 256
shades of grey, not full colour. It doesn't matter which scanner
is used to create the image. The document can contain Thai, Khmer,
Lao, Burmese, English, and any of a dozen European languages.
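As a rough illustration of the scan requirements above, the sketch below checks the resolution and bit depth recorded in an image header. It assumes the "bitmap image" is a Windows BMP file with a standard 40-byte BITMAPINFOHEADER, which this page does not actually state. The function names are made up for the example and are not part of MLDOCR.

```python
import struct

INCHES_PER_METRE = 39.3701  # BMP headers store resolution in pixels per metre


def read_bmp_info(header: bytes) -> dict:
    """Parse the first 54 bytes of a BMP file (file header + BITMAPINFOHEADER)."""
    if len(header) < 54 or header[:2] != b"BM":
        raise ValueError("not a BMP file with a BITMAPINFOHEADER")
    # The 40-byte info header starts at offset 14, after the 14-byte file header.
    (info_size, width, height, _planes, bits_per_pixel, _compression,
     _image_size, x_ppm, y_ppm, _colours_used, _colours_important) = (
        struct.unpack_from("<IiiHHIIiiII", header, 14))
    if info_size < 40:
        raise ValueError("unsupported BMP header variant")
    return {
        "width": width,
        "height": abs(height),  # a negative height means top-down row order
        "bits_per_pixel": bits_per_pixel,
        "dpi_x": round(x_ppm / INCHES_PER_METRE),
        "dpi_y": round(y_ppm / INCHES_PER_METRE),
    }


def meets_scan_requirements(info: dict) -> bool:
    """True for a 300 DPI image with 8 bits per pixel (256 levels).

    An 8-bit BMP is palette based, so this does not prove that the palette
    is actually grey; a fuller check would inspect the colour table too.
    """
    return (info["dpi_x"] == 300 and info["dpi_y"] == 300
            and info["bits_per_pixel"] == 8)
```

For a file on disk (the name `page.bmp` is hypothetical), the header could be read with `read_bmp_info(open("page.bmp", "rb").read(54))`.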
The screen captures shown here are from a debugging version
of the program; they are NOT the final user interface.
This image was captured after page boundary analysis (blue lines),
and during block analysis (purple lines). The large green area
shows the column of text already analysed. The pink bits in the
second column show the program progressing line by line looking
for the end of that block of text.
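The line-by-line search for the end of a block can be pictured with a simple row scan. This is only a minimal sketch, not the program's actual algorithm. It assumes the page is a list of rows of grey values (0 = black, 255 = white), that a column spans pixel columns `left` to `right`, and that a block ends when the blank gap between two lines exceeds `max_gap` rows. The function names, the threshold and the gap value are all invented for the example.

```python
def row_has_ink(row, left, right, ink_threshold=128):
    """True if any pixel of the row inside the column is darker than the threshold."""
    return any(pixel < ink_threshold for pixel in row[left:right])


def find_text_lines(image, left, right, ink_threshold=128):
    """Return (top, bottom) row ranges, bottom exclusive, one per run of inked rows."""
    lines = []
    start = None
    for y, row in enumerate(image):
        inked = row_has_ink(row, left, right, ink_threshold)
        if inked and start is None:
            start = y                      # first inked row of a new line
        elif not inked and start is not None:
            lines.append((start, y))       # blank row closes the current line
            start = None
    if start is not None:
        lines.append((start, len(image)))  # line runs to the bottom edge
    return lines


def find_block_end(image, left, right, max_gap=3, ink_threshold=128):
    """Return the row just past the first block, or None if the column is blank.

    Lines are added to the block until the blank gap before the next line
    is larger than max_gap rows.
    """
    lines = find_text_lines(image, left, right, ink_threshold)
    if not lines:
        return None
    end = lines[0][1]
    for top, bottom in lines[1:]:
        if top - end > max_gap:
            break
        end = bottom
    return end
```

A real page would need a gap threshold derived from the measured line spacing rather than a fixed number of rows.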
The screen captures on this page show a bilingual (English and French)
document being analysed.
Note that the interface shown in these images is the debugging
interface; it is NOT the final user interface. The program runs at
about 1% of its normal speed when debugging is turned on.
This image shows the result of the block analysis. Each block
is outlined in purple.
One block of text.
One line of text.
The program isolates individual characters.
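One common way to isolate characters in a line is a column projection: any column with no ink is treated as a gap between characters. The sketch below shows that technique under the same assumed image representation (rows of grey values, 0 = black). Whether MLDOCR works this way is not stated here, and `split_characters` is a hypothetical name.

```python
def split_characters(line_rows, ink_threshold=128):
    """Return (left, right) column ranges, right exclusive, one per character.

    A column counts as inked if any pixel in it is darker than the threshold.
    Marks stacked above or below a base letter share its columns, so they
    stay in the same range as the base letter.
    """
    if not line_rows:
        return []
    width = len(line_rows[0])
    column_has_ink = [
        any(row[x] < ink_threshold for row in line_rows) for x in range(width)
    ]
    spans = []
    start = None
    for x, inked in enumerate(column_has_ink):
        if inked and start is None:
            start = x                    # first inked column of a character
        elif not inked and start is not None:
            spans.append((start, x))     # empty column ends the character
            start = None
    if start is not None:
        spans.append((start, width))     # character touches the right edge
    return spans
```

Touching or overlapping characters would come out as a single range with this approach, so a practical program needs extra rules for those cases.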
Result of character analysis.
If it can't recognise the character, it displays a message with
a menu that lets you select a language or edit the pixels.
It presents a list of all possible characters for the selected
language, and asks you which it is, or if it should be ignored.
In this way, it expands its internal list of character features,
so as time goes on, it should get better and better at recognising
characters.
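The idea of a feature list that grows as the user identifies characters can be sketched as a nearest-neighbour lookup. Everything below is an assumption made for illustration: the features (fraction of inked pixels in each zone of a 2 x 2 grid), the distance cut-off, and the names `zone_features` and `FeatureList`. The features MLDOCR really uses are not described on this page.

```python
import math


def zone_features(glyph, zones=2, ink_threshold=128):
    """Split a glyph into zones x zones cells; return the inked fraction of each."""
    height, width = len(glyph), len(glyph[0])
    features = []
    for zy in range(zones):
        for zx in range(zones):
            y0, y1 = zy * height // zones, (zy + 1) * height // zones
            x0, x1 = zx * width // zones, (zx + 1) * width // zones
            cells = [glyph[y][x] for y in range(y0, y1) for x in range(x0, x1)]
            ink = sum(1 for pixel in cells if pixel < ink_threshold)
            features.append(ink / len(cells) if cells else 0.0)
    return tuple(features)


class FeatureList:
    """Known (features, character) pairs; grows each time the user labels a glyph."""

    def __init__(self, max_distance=0.25):
        self.max_distance = max_distance
        self.samples = []

    def recognise(self, features):
        """Return the closest known character, or None if nothing is near enough."""
        best_char, best_distance = None, math.inf
        for known, char in self.samples:
            distance = math.dist(known, features)
            if distance < best_distance:
                best_char, best_distance = char, distance
        return best_char if best_distance <= self.max_distance else None

    def teach(self, features, char):
        """Store the user's answer so similar glyphs are recognised next time."""
        self.samples.append((tuple(features), char))
```

When `recognise` returns `None`, the program would ask the user which character it is and pass the answer to `teach`, which is how the list improves over time.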
The program will work with Thai, Burmese, Lao, and Cambodian
(Khmer) documents, as well as English and European language documents,
as shown here. The primary output is a text file that can be edited
with a word processor. (The plan is to have a secondary output
identifying unrecognised characters and where they were found;
this has not been programmed yet.)
At the moment, the program does not maintain the document formatting;
I imagine that as the program gets better and more sophisticated,
this will be added.
There will also be a secondary analysis, involving the
lookup of words in the output file, validating each word with
the MLD project dictionaries.
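A minimal sketch of such a word lookup is shown below, assuming the dictionary is available as a set of lower-case words and the text uses spaces between words (as English and French do). Thai, Lao, Khmer and Burmese are written without spaces between words, so they would need a word-segmentation step that this example does not attempt. The function name and the sample dictionary are hypothetical.

```python
import re

# Runs of letters, optionally joined by an apostrophe or hyphen (e.g. "l'eau").
WORD_PATTERN = re.compile(r"[^\W\d_]+(?:['’-][^\W\d_]+)*")


def find_unknown_words(text, dictionary):
    """Return the words of text, in order, that are not in the dictionary."""
    words = WORD_PATTERN.findall(text)
    return [word for word in words if word.lower() not in dictionary]


sample_dictionary = {"the", "cat", "sat", "on", "mat"}
suspects = find_unknown_words("The cat sat on teh mat.", sample_dictionary)
```

Words flagged this way are likely recognition errors, although a word may also simply be missing from the dictionary.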
I would also like to add an optional third stage, which
would provide a translation to another language; for example,
a Thai document could be scanned and analysed to produce an editable
Thai text file, as well as an English translation. The quality
of the translation would improve as the grammar analysis algorithms
improve, but would likely never approach the quality of a human
translator. However, it would be good enough for use as a "rough
translation" to get a basic understanding of a document in
the other language.
Note that this program is not finished, so it is not yet
available on this Web site.
To be notified when it is available, send me an e-mail.
Contact me with questions or comments about this page.
Copyright © 1998-2006 Doug Anderson
Last modified: 21 September 2006