The Trigram class can be used to compare blocks of text based on their local structure, which is a good indicator of the language used. It could also be used within a language to discover and compare the characteristic footprints of various registers or authors. As all n-gram implementations should, it has a method to make up nonsense words.
Python, 172 lines
For some reason I am on the W3C's www-international mailing list, where I read this message:
mentioning that people use n-grams to guess languages. That is to say, you look at the micro-structure of a block of text, and count how many times sequences of length n occur. If you count pairs it is called a 2-gram (or bigram), and so on for any value of n. I have used a 3-gram, or trigram.
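The counting step can be sketched in a few lines. This is a hypothetical illustration of trigram counting in general, not the recipe's Trigram class itself:

```python
from collections import defaultdict

def trigrams(text):
    """Count overlapping three-character sequences in a block of text.

    A minimal sketch of trigram counting, not the recipe's code.
    """
    counts = defaultdict(int)
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    return counts

# "banana" yields the trigrams ban, ana, nan, ana
print(dict(trigrams("banana")))  # {'ban': 1, 'ana': 2, 'nan': 1}
```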
I combined this with a vector search as described by Maciej Ceglowski in his famous O'Reilly article:
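The vector search amounts to treating each text's trigram counts as a vector and taking the cosine of the angle between two such vectors: 1.0 means identical local structure, 0.0 means nothing in common. A sketch of that idea, with hypothetical function names (the recipe's own implementation may differ):

```python
import math
from collections import defaultdict

def trigrams(text):
    """Count overlapping three-character sequences (illustrative sketch)."""
    counts = defaultdict(int)
    for i in range(len(text) - 2):
        counts[text[i:i + 3]] += 1
    return counts

def similarity(a, b):
    """Cosine similarity between two trigram-count vectors.

    The counts are treated as coordinates in a high-dimensional space;
    the closer the angle between the vectors, the more alike the texts.
    """
    dot = sum(count * b[key] for key, count in a.items() if key in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b)

english = trigrams("the quick brown fox jumps over the lazy dog")
print(similarity(english, english))  # 1.0 (identical structure)
```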
It would be quicker, simpler, and more memory efficient to use a bigram, for perhaps no worse results. If speed really mattered, it might also be tempting to use an array.array of ints indexed by the characters' ordinal numbers, rather than nested dictionaries.
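The array.array idea might look like the sketch below: a flat 256x256 table of counts indexed by the two characters' ordinals. This assumes byte-oriented (single-byte-encoded) input and is an illustration of the suggestion, not tested against the recipe:

```python
from array import array

def bigram_table(text):
    """Count character pairs in a flat array.array of unsigned ints.

    Index is ord(first) * 256 + ord(second), so this sketch only works
    for single-byte (non-unicode) text, as discussed above.
    """
    counts = array('L', [0]) * 65536  # 256 * 256 zeroed slots
    for a, b in zip(text, text[1:]):
        counts[ord(a) * 256 + ord(b)] += 1
    return counts

table = bigram_table("abab")
print(table[ord('a') * 256 + ord('b')])  # 2
```

Lookups and updates are then plain integer indexing with no hashing, at the cost of a fixed 64K-entry table per sample.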
On the other hand, converting everything to unicode would make it slower and more complicated (because you have to be sure of the source material's encoding), but would of course be more useful.
The greatest improvement is probably to be found in integrating ad-hoc short-cuts (akin to search engine stopwords), or hybridising with other techniques.
how the pros do it (bigram combined with character distribution and encoding analysis):
Maciej Ceglowski (of the O'Reilly article above) uses what seems to be a cleverly augmented trigram method.
Maciej Ceglowski's Language Guesser:
GPL source (perl):
Wikipedia recommends removing the spaces first:
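Stripping whitespace before counting would be a one-line preprocessing step, something like this hypothetical sketch:

```python
def trigrams_no_spaces(text):
    """List the trigrams of a text with all whitespace removed first,
    so that word boundaries don't dilute the counts.
    A sketch of the preprocessing idea, not the recipe's code."""
    squeezed = "".join(text.split())
    return [squeezed[i:i + 3] for i in range(len(squeezed) - 2)]

print(trigrams_no_spaces("to be"))  # ['tob', 'obe']
```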