TextCat
TextCat is an implementation of the text categorization algorithm
presented in Cavnar, W. B. and J. M. Trenkle, "N-Gram-Based Text
Categorization", in Proceedings of Third Annual Symposium on Document
Analysis and Information Retrieval, Las Vegas, NV, UNLV
Publications/Reprographics, pp. 161-175, 11-13 April 1994.
This paper was available at:
- http://msen.com/~wei/JT-homepage.html
- http://spd.erim.org/jt_papers/
It is now available from John
Trenkle's homepage, as papers/sdr94ps.gz.
I have applied the technique to implement a written language
identification program. At the moment, the system knows about 69
natural languages (counting Esperanto as a natural language).
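The general idea of the paper can be sketched in a few lines of Python. This is only an illustration of rank-ordered n-gram profiles compared with an "out-of-place" distance, not the text_cat code itself. The function names, the underscore padding, the alphabetical tie-breaking, the limits max_n=5 and top_k=300, and the fixed penalty for unseen n-grams are all assumptions of this sketch, and the two tiny training samples are made up for the example.

```python
import re
from collections import Counter


def ngram_profile(text, max_n=5, top_k=300):
    """Return the top_k most frequent n-grams (n = 1..max_n), most frequent first."""
    counts = Counter()
    # Keep runs of letters only; digits and punctuation act as separators.
    for token in re.findall(r"[^\W\d_]+", text.lower()):
        padded = "_" + token + "_"  # mark word boundaries (assumed padding scheme)
        for n in range(1, max_n + 1):
            for i in range(len(padded) - n + 1):
                counts[padded[i:i + n]] += 1
    # Sort by descending count; ties are broken alphabetically so ranks are stable.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [gram for gram, _ in ranked[:top_k]]


def out_of_place(doc_profile, lang_profile, max_penalty=300):
    """Sum of rank differences; n-grams missing from lang_profile cost max_penalty."""
    rank = {gram: i for i, gram in enumerate(lang_profile)}
    total = 0
    for i, gram in enumerate(doc_profile):
        total += abs(i - rank[gram]) if gram in rank else max_penalty
    return total


def classify(text, lang_profiles):
    """Return the key of the language profile closest to the text's profile."""
    doc = ngram_profile(text)
    return min(lang_profiles, key=lambda lang: out_of_place(doc, lang_profiles[lang]))


# Hypothetical, far too small training samples; real profiles need much more text.
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and the cat sat on the mat",
    "dutch": "de snelle bruine vos springt over de luie hond en de kat zat op de mat",
}

if __name__ == "__main__":
    profiles = {lang: ngram_profile(sample) for lang, sample in SAMPLES.items()}
    print(classify("the dog and the cat", profiles))
```

A real language model would be built from a much larger text sample per language, and the choice of penalty for unseen n-grams affects how short inputs are scored.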
Installation
- Edit the text_cat script so that its first line points to your Perl binary.
- Edit the text_cat script so that $opt_d points to the LM directory.
Usage
text_cat -h displays usage information.
Remotely related links
- Survey on the State of the Art in Human Language Technology:
contains a chapter on language identification (for both spoken and
written language).
- LIFI: Language Identification From Images. Quote: "The Language
Identification From Images project (LIFI) is concerned with the
automated identification of the script (alphabet) used in a document
image. Our initial phase, from 1994 to 1995, focused on
machine-printed documents. The second phase, from 1997 to 1998,
focused on handwritten documents. We will soon begin a new project
concerning how to use our script identification techniques to segment
multi-script document images."
- Bibliography on Automatic Spoken Language Identification. This
bibliography lists research on the automatic identification of spoken
language. There are also some links on the identification of written
language.
- The World's Main Languages. Lots of information on languages
and the Internet.