Tesseract Open Source OCR Engine
Tesseract is a free optical character recognition engine originally developed at Hewlett-Packard and currently developed by Google. It is a raw OCR engine: it has no document layout analysis, no output formatting, and no graphical user interface. It only processes a TIFF or BMP image of a single column and creates text from it. It can distinguish fixed-pitch from proportional text. The engine ranked in the top 3 for character accuracy in 1995. It reads a binary, grey or color image and outputs text.
Tesseract can process English, French, Italian, German, Spanish, Brazilian Portuguese and Dutch, and can be trained to work in other languages as well.
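As a rough illustration of the image-in, text-out behaviour described above, the following shell sketch wraps the upstream `tesseract` command line (`tesseract imagename outputbase [-l lang] [configfile]`). It assumes the `tesseract` binary and the English traineddata are installed. The function names and the file name `page.tif` are hypothetical and chosen only for this example.

```shell
# Derive an output base name from an image path, e.g. scans/page.tif -> page.
ocr_basename() {
    name=$(basename "$1")
    printf '%s\n' "${name%.*}"
}

# Recognize one image and let tesseract write plain text to <base>.txt.
# $1 = image path, $2 = language code (defaults to eng).
ocr_text() {
    tesseract "$1" "$(ocr_basename "$1")" -l "${2:-eng}"
}

# Print the recognized text on standard output instead of a file
# (the special output base "stdout" needs version 3.03.00 or later).
ocr_stdout() {
    tesseract "$1" stdout -l "${2:-eng}"
}

# Write a searchable PDF to <base>.pdf via the stock "pdf" config
# (needs version 3.03.00 or later).
ocr_pdf() {
    tesseract "$1" "$(ocr_basename "$1")" -l "${2:-eng}" pdf
}
```

With these functions sourced into a shell, `ocr_text page.tif` should leave the recognized text in `page.txt`, and `ocr_pdf page.tif` should produce `page.pdf`. Recognition quality depends on the scan, so treat the output as a starting point rather than a guaranteed transcription.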
- Sources inherited from project openSUSE:Leap:42.1
- Checkout Package:
  osc -A https://api.opensuse.org checkout openSUSE:Leap:42.1:Ports/tesseract-ocr && cd $_
Source Files
Filename | Size |
---|---|
3.04.00.tar.gz | 2.16 MB (2264427 bytes) |
tesseract-ocr.changes | 6.35 KB (6503 bytes) |
tesseract-ocr.spec | 3.73 KB (3821 bytes) |
Latest Revision
Stephan Kulow (coolo) accepted request 337049 from Ismail Dönmez (namtrac) (revision 3)
- Update to version 3.04.00:
  * Added OpenCL support (experimental).
  * Many bug fixes.
  From version 3.03.00:
  * Added new training tool text2image to generate box/tif file pairs from text and truetype fonts.
  * Added support for PDF output with searchable text.
  * Removed entire IMAGE class and all code in image directory.
  * Tesseract executable: support for output to stdout; limited support for one page images from stdin (especially on Windows)
  * Added Renderer to API to allow document-level processing and output of document formats, like hOCR, PDF.
  * Major refactor of word-level recognition, beam search, eliminating dead code.
  * Refactored classifier to make it easier to add new ones.
  * Generalized feature extractor to allow feature extraction from greyscale.
  * Improved sub/superscript treatment.
  * Improved baseline fit.
  * Added set_unicharset_properties to training tools.
  * Many bug fixes.
  * More training source data included.
- Added new build requirements cairo-devel, doxygen, libicu-devel and pango-devel.
- Recommend tesseract-ocr-traineddata-english instead of tesseract-ocr-traineddata-american (based on new (3.04.00) tesseract-ocr traineddata files).