Tesseract Open Source OCR Engine

Tesseract is a free optical character recognition (OCR) engine originally developed at Hewlett-Packard and currently developed by Google. It is a raw OCR engine: it has no document layout analysis, no output formatting, and no graphical user interface. It processes only a TIFF or BMP image of a single column and produces text from it. It can distinguish fixed-pitch from proportional text. The engine ranked among the top three in character accuracy in 1995. It reads a binary, greyscale, or color image and outputs text.

Tesseract can process English, French, Italian, German, Spanish, Brazilian Portuguese, and Dutch, and can be trained to work with other languages as well.
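
As a rough illustration of the image-in, text-out workflow described above, the following Python sketch wraps the tesseract command-line tool. It assumes the tesseract executable is on the PATH, that the matching traineddata package is installed, and that the output base "stdout" is supported (this package's changelog lists stdout output as added in 3.03.00). The helper names and the file name page.tif are hypothetical, chosen only for this example.

```python
import shutil
import subprocess
from typing import List, Optional


def build_tesseract_command(image_path: str, lang: str = "eng",
                            output_base: str = "stdout") -> List[str]:
    """Build the argument list: tesseract <image> <outputbase> -l <lang>.

    Hypothetical helper for illustration. "stdout" as the output base
    asks tesseract to print the recognized text instead of writing a file.
    """
    return ["tesseract", image_path, output_base, "-l", lang]


def ocr_image(image_path: str, lang: str = "eng") -> Optional[str]:
    """Run tesseract on one image and return the recognized text.

    Returns None when the tesseract executable is not installed.
    """
    if shutil.which("tesseract") is None:
        return None
    result = subprocess.run(build_tesseract_command(image_path, lang),
                            capture_output=True, text=True, check=True)
    return result.stdout


if __name__ == "__main__":
    # "page.tif" is a placeholder; "deu" selects German traineddata.
    print(build_tesseract_command("page.tif", "deu"))
```

The language is passed as a three-letter code (for example eng, fra, ita, deu, spa, por, nld), and recognition only works for languages whose traineddata files are installed.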

Source Files
Filename               Size (bytes)  Size
3.04.00.tar.gz         2264427       2.16 MB
tesseract-ocr.changes  6503          6.35 KB
tesseract-ocr.spec     3821          3.73 KB
Latest Revision
Stephan Kulow (coolo) accepted request 337049 from Ismail Dönmez (namtrac) (revision 3)
- Update to version 3.04.00:
  * Added OpenCL support (experimental).
  * Many bug fixes.
  From version 3.03.00:
  * Added new training tool text2image to generate box/tif file
    pairs from text and TrueType fonts.
  * Added support for PDF output with searchable text.
  * Removed entire IMAGE class and all code in image directory.
  * Tesseract executable: support for output to stdout; limited
    support for one-page images from stdin (especially on Windows).
  * Added Renderer to API to allow document-level processing and
    output of document formats, like hOCR, PDF.
  * Major refactor of word-level recognition, beam search,
    eliminating dead code.
  * Refactored classifier to make it easier to add new ones.
  * Generalized feature extractor to allow feature extraction from
    greyscale.
  * Improved sub/superscript treatment.
  * Improved baseline fit.
  * Added set_unicharset_properties to training tools.
  * Many bug fixes.
  * More training source data included.
- Added new build requirements cairo-devel, doxygen, libicu-devel
  and pango-devel.
- Recommend tesseract-ocr-traineddata-english instead of
  tesseract-ocr-traineddata-american (based on new (3.04.00)
  tesseract-ocr traineddata files).