OCR/OTR software needed

Largo · Apr 25, 2004

I'm looking for some optical character/text recognition software. I have no prior experience in this field.

Strict requirements:
* take 1 or more widely used *BITMAP* formats.
* pixel color format: 1 or more of 8bit palettized, 888rgb and 8888cmyk/rgb-none format.
* virtually error-free for large type, and also very high reliability for small type. Character omission must be extremely low or nonexistent, if necessary at the cost of a higher false positive rate.
* fairly rapid.
* fully automated operation. (no user interaction required)
* cheap or free. This is for academic-interest use, currently with no budget.
* memory restriction: max ~200MByte available for the application.
* output to raw text, 8bit ASCII cp437/cp865. Templates/macroscripting for output formatting a big plus.
* win32api app on x86 (IA-32). Processor requirement may be i80586 or i80686 instruction set compatibility (max Pentium III set). Pentium 4 or later specifics are not tolerated.
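
For the output requirement above, here is a minimal sketch of turning recognized text into raw 8-bit cp437/cp865 bytes. It assumes the OCR step hands back a Unicode string; the function name `to_codepage` is made up for illustration. Characters the code page cannot represent are replaced with `?` rather than dropped, so nothing is omitted silently.

```python
def to_codepage(text: str, codepage: str = "cp437") -> bytes:
    """Encode recognized text as 8-bit bytes in a DOS code page.

    Characters the code page cannot represent become '?', so no
    character position in the output is silently lost.
    """
    return text.encode(codepage, errors="replace")


if __name__ == "__main__":
    sample = "Blåbærsyltetøy 42"
    # cp865 (Nordic) has 'ø'; cp437 does not, so it is replaced there.
    print(to_codepage(sample, "cp865"))
    print(to_codepage(sample, "cp437"))
```

Whether replacing or rejecting unrepresentable characters is the better policy depends on how the text is used afterwards; `errors="strict"` would raise an exception instead.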

Human-assisted mode is a plus, but full automation is most important. Built-in bulk processing features are welcome but not required (it will be scripted).
Targeted input: computer-generated text and scans/photos of printed text, i.e. almost anything you might see displayed on a computer screen. No handwriting recognition capability is necessary. Some custom pre-recognition filtering can be arranged if that helps satisfy the requirements.
As long as a single popular bitmap format is accepted, that's fine. (Bulk) conversion can be done as needed.
Target text is largely monochromatic or at least can be preprocessed into low-variance color on bulk scale, but inputs may have potentially any background.
Key text orientations are *dead* horizontal left-to-right and *dead* vertical up and down. Expected deviation from this is very low. Some degree of compensation for scan misalignment is needed, but the general text orientation can be specified.
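
The scripted bulk processing mentioned above could look roughly like this. It is only a sketch: it assumes the OCR program is a command-line tool that takes the image path as its last argument and prints the recognized text to stdout, which may not hold for any given product. `run_batch` and its parameters are names invented here.

```python
import subprocess
import sys
from pathlib import Path


def run_batch(image_dir, out_dir, ocr_cmd, pattern="*.bmp"):
    """Run an external OCR command on every matching image.

    ocr_cmd is a list of arguments; the image path is appended as the
    last argument and the command's stdout is saved as <name>.txt.
    Returns the list of images for which the command failed.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    failed = []
    for image in sorted(Path(image_dir).glob(pattern)):
        result = subprocess.run(ocr_cmd + [str(image)], capture_output=True)
        if result.returncode != 0:
            # Keep going; failed images are reported at the end.
            failed.append(image)
            continue
        (out / (image.stem + ".txt")).write_bytes(result.stdout)
    return failed
```

Usage would be something like `run_batch("scans", "text", ["some_ocr.exe"])`, where `some_ocr.exe` is a placeholder for whatever program ends up being used.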

Ice · Apr 25, 2004

http://www.research.att.com/projects/tts/demo.html

oh wait... that's text-to-speech. maybe someone needs it :confused:

Largo · Apr 25, 2004

I've done some searching, but it seems OCR is not available cheaply, unless you count those limited programs shipped with some flatbed scanners.
One software package I looked at which would fit the requirements starts at $1500 for an entry edition, going up to $5000 for the full-featured one. Fooking hell, that's like buying a nice new full-featured PC just for the basic version.

Martz · Apr 25, 2004

I had a look on SourceForge. There is a relatively active project called DocMgr here: http://docmgr.sourceforge.net/

The best I could find. :mushroom:

Largo · Apr 26, 2004

Thanks. I have looked it over. It doesn't do OCR by itself but via http://jocr.sourceforge.net/ (GnuOCR).
The documentation at that URL is extremely sparse, so I'm having to test it. The source package yielded little documentation on the proper usage of the switches.
Notes about gOCR so far:
* greyscale only (stated in the source package). I am assuming it takes dark/black to mean text color.
* completely undocumented from a user point of view, apart from the switch list printed at the command prompt.
* available precompiled for Windows: a binary of ~360k. No other files hinting at a learning database, font data, filtering or documentation of any sort.
It seems targeted specifically at scans of prints/books, i.e. neat black type on a (near-)white background, with the image rotated prior to invoking the program. Thus a lot of preprocessing is required for color images, especially those with a multicolored non-uniform background, because of the "to greyscale" hammer applied.
Currently it looks like I would have to spend weeks writing the necessary filtering and image subsection extraction software first to make this one usable, besides the work to automate the process and organize the data collection.
Currently I can't recommend it (gOCR) to anyone except someone wanting to explore the source, and perhaps the casual text scanner.
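
To give an idea of the simplest form of the preprocessing described above, here is a rough sketch: convert RGB pixels to greyscale, force the result to dark text on white with a fixed threshold, and write it out as a binary PGM. The assumptions are that the pixels are already decoded into rows of (r, g, b) tuples and that the OCR program can read PGM; neither is established in this thread. A single global threshold will not cope with multicolored non-uniform backgrounds, so this covers only the easy case.

```python
def to_grey(pixels):
    """Convert rows of (r, g, b) tuples to rows of 0-255 luma values."""
    return [[(299 * r + 587 * g + 114 * b) // 1000 for r, g, b in row]
            for row in pixels]


def binarize(grey, threshold=128, light_text=False):
    """Map every pixel to 0 (text) or 255 (background).

    Set light_text=True when the text is brighter than its background,
    so the result is always dark text on white.
    """
    out = []
    for row in grey:
        if light_text:
            out.append([0 if v >= threshold else 255 for v in row])
        else:
            out.append([0 if v < threshold else 255 for v in row])
    return out


def write_pgm(path, grey):
    """Write rows of 0-255 values as a binary (P5) PGM file."""
    height, width = len(grey), len(grey[0])
    with open(path, "wb") as f:
        f.write(b"P5\n%d %d\n255\n" % (width, height))
        for row in grey:
            f.write(bytes(row))
```

The threshold and the `light_text` switch would have to be picked per batch of images, which fits the case where the text color is known in advance.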

Martz · Apr 26, 2004

Perhaps you could contribute to one of the existing projects to get it to do what you want, or fork a project which already has some of the groundwork done. Sounds like a PITA either way
