A CRF-based word segmenter in Java. Supports Arabic and Chinese.
Some languages require more extensive token pre-processing than splitting on whitespace and punctuation; this step is usually called segmentation. The Stanford Word Segmenter currently supports Arabic and Chinese. The provided segmentation schemes have been found to work well for a variety of applications.