October 24, 2007
3:50 pm

The first official alpha version of Google's OCRopus scanning software for Linux was released yesterday. OCRopus is built on top of HP's venerable open-source Tesseract optical character recognition (OCR) engine and is distributed under the Apache License 2.0.

OCRopus is a state-of-the-art document analysis and OCR system, featuring pluggable layout analysis, pluggable character recognition, statistical natural language modeling, and multi-lingual capabilities.

OCRopus uses Tesseract for character recognition but has its own layout analysis system, which is optimized for accuracy. Language modeling is handled by the OpenFST library, though that part of the system still has some performance issues. OCRopus is designed to be modular, so other character recognition and language modeling components can be swapped in, which should eventually make it possible to add support for non-Latin languages. An embedded Lua interpreter handles scripting and configuration; the developers chose Lua over Python because it is slimmer and easier to embed. This release also includes some new image cleanup and de-skewing code.
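To make the modular design more concrete, here is a minimal sketch in Python of what a pluggable OCR pipeline could look like. It is not OCRopus code and does not use the OCRopus, Tesseract, or OpenFST APIs: every name in it (`Pipeline`, `split_lines`, `toy_recognizer`, `prefer_known_words`) is hypothetical, and plain strings stand in for page images. It only illustrates the idea that layout analysis, character recognition, and language modeling sit behind separate interfaces and can be replaced independently.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical stand-ins: a "page" and a "region" are plain strings here,
# where a real system would pass images.
LayoutAnalyzer = Callable[[str], List[str]]   # page -> text-line regions
Recognizer = Callable[[str], List[str]]       # region -> candidate readings
LanguageModel = Callable[[List[str]], str]    # candidates -> chosen reading


@dataclass
class Pipeline:
    """Three independent stages; any one can be swapped for another."""
    layout: LayoutAnalyzer
    recognizer: Recognizer
    language_model: LanguageModel

    def run(self, page: str) -> List[str]:
        return [
            self.language_model(self.recognizer(region))
            for region in self.layout(page)
        ]


def split_lines(page: str) -> List[str]:
    """Toy layout analysis: every non-blank line is one region."""
    return [line for line in page.splitlines() if line.strip()]


def toy_recognizer(region: str) -> List[str]:
    """Toy recognizer: offers a reading with 0 read as o, plus the raw text."""
    return [region.replace("0", "o"), region]


def prefer_known_words(candidates: List[str]) -> str:
    """Toy language model: pick the candidate with the most vocabulary hits."""
    vocab = {"hello", "world"}
    return max(candidates, key=lambda c: sum(w in vocab for w in c.lower().split()))


if __name__ == "__main__":
    default = Pipeline(split_lines, toy_recognizer, prefer_known_words)
    print(default.run("hell0 w0rld\n\nfoo"))

    # Swapping one stage leaves the other two untouched.
    raw = Pipeline(split_lines, lambda region: [region], prefer_known_words)
    print(raw.run("hell0 w0rld\n\nfoo"))
```

In a design along these lines, supporting a new script would mean supplying a different recognizer and language model while keeping the layout stage, which is the kind of substitution the modular architecture is meant to allow.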

