Web Page Language Classification: Bayes, Characters and n-grams
Most search engines start by focusing on a single language (e.g., English) because it is simpler, requires almost no character-encoding handling, and has a wide audience. Whether you want to index only pages written in English or support all the languages of the world, fast page language classification is one of the first tasks you will have to deal with.
Simple word-based classification techniques like Naive Bayes will do the trick, but they require a very large training set, especially for foreign languages. Even though memory and processing power have become cheaper in recent years, they are not free, and in a startup especially you may need to optimize every single function.
For this reason, you may want to consider character-based classification. In its simplest form, you compute the frequency with which each letter of the alphabet appears in each language, then compute the distance (e.g., Euclidean distance) between the letter frequencies of the input text and each language model, and pick the closest model. The memory requirements of this solution are very small (i.e., one float for each of the 26 letters, per language), and CPU usage is easy to bound as well (e.g., you can stop after the first N characters of the input text), trading precision for speed.
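To make the idea concrete, here is a minimal Python sketch of letter-frequency classification. It is only an illustration under some assumptions of mine: the function names, the use of Euclidean distance, and the tiny one-sentence training samples are all hypothetical, and a real model would be trained on far more text per language.

```python
import math
import string
from collections import Counter

LETTERS = string.ascii_lowercase  # the 26 letters; accented characters are ignored


def letter_profile(text, max_chars=None):
    """Return the relative frequency of a-z in text as a list of 26 floats."""
    if max_chars is not None:
        text = text[:max_chars]  # bound CPU cost: look only at the first N characters
    counts = Counter(c for c in text.lower() if c in LETTERS)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(LETTERS)
    return [counts[c] / total for c in LETTERS]


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify(text, models, max_chars=None):
    """Return the language whose profile is closest to the profile of text."""
    profile = letter_profile(text, max_chars)
    return min(models, key=lambda lang: euclidean(profile, models[lang]))


# Toy training data (hypothetical, far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then goes home",
    "german": "der schnelle braune fuchs springt über den faulen hund und geht heim",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y vuelve a casa",
}
MODELS = {lang: letter_profile(sample) for lang, sample in SAMPLES.items()}
```

Calling `classify(page_text, MODELS, max_chars=1000)` would then return the key of the closest model. Each model here is just 26 floats, which is where the small memory footprint comes from.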
If that is not enough, character n-grams (e.g., sequences of 2 or 3 adjacent characters) will probably work even better, but they require a larger memory footprint.
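The n-gram variant can be sketched in the same spirit. Again, this is only an illustrative sketch: the function names, the sparse-dictionary representation, the Euclidean distance, and the toy samples are my assumptions, not a reference implementation.

```python
import math
from collections import Counter


def ngram_profile(text, n=2):
    """Return relative frequencies of character n-grams as a sparse dict."""
    text = text.lower()
    grams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    total = sum(grams.values())
    if total == 0:  # text shorter than n characters
        return {}
    return {gram: count / total for gram, count in grams.items()}


def sparse_distance(p, q):
    """Euclidean distance between two sparse profiles (missing keys count as 0)."""
    keys = p.keys() | q.keys()
    return math.sqrt(sum((p.get(k, 0.0) - q.get(k, 0.0)) ** 2 for k in keys))


def classify_ngrams(text, models, n=2):
    """Return the language whose n-gram profile is closest to that of text."""
    profile = ngram_profile(text, n)
    return min(models, key=lambda lang: sparse_distance(profile, models[lang]))


# Toy training data (hypothetical, far too small for real use).
NGRAM_SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog and then goes home",
    "german": "der schnelle braune fuchs springt über den faulen hund und geht heim",
    "spanish": "el rápido zorro marrón salta sobre el perro perezoso y vuelve a casa",
}
NGRAM_MODELS = {lang: ngram_profile(s, n=2) for lang, s in NGRAM_SAMPLES.items()}
```

Each model now stores one float per distinct n-gram seen in training rather than one per letter, which is the memory cost mentioned above. Keeping only the most frequent n-grams per language is one possible way to cap it.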
The following graph shows the frequency of each letter of the alphabet across the 5 most common European languages. For some letters the difference in usage is large: for example, the letter “A” is used twice as often in Spanish as in German, while “H” is frequently used in German and English but almost never in the other languages.


