Text Charset and Language Guesser
mguesser is a standalone part of libudmsearch (the core of the mnoGoSearch search engine, http://mnogosearch.org) that guesses the charset and language of a text. Guessing uses the "N-Gram-Based Text Categorization" technique, which is also implemented in TextCat, a language guesser written in Perl (http://www.let.rug.nl/~vannoord/TextCat/). mguesser is significantly faster than TextCat, especially on large texts. This package consists of N-gram based algorithms written in C, as well as a number of maps for texts in various languages and charsets. See the "maps" directory of this package for the currently supported languages and charsets.
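To illustrate the general idea of N-gram based text categorization, here is a minimal Python sketch of the rank-based ("out-of-place") comparison described by Cavnar and Trenkle. It is not mguesser's C code and does not read its map files; the function names, the tiny `SAMPLES` texts, and the `max_n` and `top` defaults are all assumptions chosen for illustration.

```python
from collections import Counter

def ngram_profile(text, max_n=3, top=300):
    """Return character n-grams of a text, most frequent first."""
    counts = Counter()
    for word in text.lower().split():
        padded = "_" + word + "_"  # mark word boundaries
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; break ties alphabetically so the
    # ranking is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top]]

def out_of_place(doc_profile, lang_profile):
    """Sum of rank differences; n-grams missing from lang_profile
    get a fixed maximum penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    penalty = len(lang_profile)
    return sum(
        abs(i - rank[gram]) if gram in rank else penalty
        for i, gram in enumerate(doc_profile)
    )

def guess(text, profiles):
    """Return the name of the profile closest to the text."""
    doc = ngram_profile(text)
    return min(profiles, key=lambda name: out_of_place(doc, profiles[name]))

# Hypothetical toy training data; real maps are built from much
# larger corpora, one per language and charset pair.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and runs away",
    "german": "der schnelle braune fuchs springt über den faulen hund und rennt weg",
}
PROFILES = {name: ngram_profile(sample) for name, sample in SAMPLES.items()}
```

With profiles built this way, `guess(some_text, PROFILES)` returns the key of the closest profile. Toy profiles like these are only reliable on text close to the training samples; accuracy depends on the size and quality of the training corpora.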
Source Files
Filename | Size | Changed |
---|---|---|
mguesser-0.4.tar.bz2 | 126 KB | about 8 years ago |
mguesser-fix_printf_format.patch | 296 Bytes | about 10 years ago |
mguesser-makefile.patch | 572 Bytes | about 8 years ago |
mguesser.spec | 1.96 KB | over 6 years ago |