Archive for March, 2010
Web Pages Language Classification: Bayes, Characters and n-grams
Most search engines start by focusing on only one language (e.g., English) because it is simpler, requires almost no character-encoding handling, and has a wide audience. Whether you want to index only pages written in English or support all the languages of the world, fast page language classification is one of the first tasks that you will have to deal with.
Simple word-based classification techniques like Naive Bayes will do the trick but require a very big training set, especially for foreign languages. Even though memory and processing power have become less and less expensive in recent years, they are not free, and in a startup especially you may need to optimize every single function.
For this reason, you may want to consider character-based classification. In its simplest form, you just compute the frequency with which alphabet characters appear in each language and then compute the distance (e.g., the geometric, or Euclidean, distance) between the text and your models. The memory requirements of this solution are very small (i.e., a float for each of the 26 letters per language) and CPU usage can be easily bounded as well (e.g., you can stop after N characters of the input text), trading off precision for speed.
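The letter-frequency approach can be sketched in a few lines of Python. This is only an illustration under some assumptions: the function names, the `max_chars` cutoff, and the two toy training sentences are all made up for this example, and "geometric distance" is read as the Euclidean distance. A real model would be trained on a much larger corpus per language.

```python
import math
import string
from collections import Counter

ALPHABET = string.ascii_lowercase  # the 26 letters a-z


def letter_profile(text, max_chars=None):
    """Return the relative frequency of each of the 26 letters in text."""
    if max_chars is not None:
        text = text[:max_chars]  # bound CPU: look at the first N characters only
    counts = Counter(ch for ch in text.lower() if ch in ALPHABET)
    total = sum(counts.values())
    if total == 0:
        return [0.0] * len(ALPHABET)
    return [counts[ch] / total for ch in ALPHABET]


def euclidean(p, q):
    """Euclidean distance between two equal-length frequency vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def classify(text, models, max_chars=None):
    """Return the language whose profile is closest to the text's profile."""
    profile = letter_profile(text, max_chars)
    return min(models, key=lambda lang: euclidean(profile, models[lang]))


# Toy training data (hypothetical, far too small for real use).
SAMPLES = {
    "english": "the quick brown fox jumps over the lazy dog while the cat sleeps in the house",
    "spanish": "el rapido zorro marron salta sobre el perro perezoso mientras la gata duerme en la casa",
}
MODELS = {lang: letter_profile(text) for lang, text in SAMPLES.items()}
```

Calling `classify(page_text, MODELS, max_chars=1000)` would then look only at the first 1000 characters of `page_text`; lowering that cutoff makes the check cheaper and less precise.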
If that is not enough, n-grams of characters (e.g., sequences of 2 or 3 adjacent characters) will probably work even better but require a higher memory footprint.
The following graph shows the frequency of the alphabet letters across the 5 most common European languages. For some letters the difference in usage is pretty high, e.g., the letter “A” is used twice as much in Spanish as in German, while “H” is frequently used in German and English but almost never used in the other languages.
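As a rough sketch of the n-gram variant, the snippet below counts overlapping character sequences and compares two profiles. The function names are hypothetical, and keeping only letters and spaces is an assumption of this example. The memory cost is easy to see: with 26 letters there are up to 26^2 = 676 possible 2-grams and 26^3 = 17,576 possible 3-grams per language, instead of 26 single-letter frequencies.

```python
from collections import Counter


def char_ngrams(text, n=2):
    """Count overlapping sequences of n adjacent characters (letters and spaces only)."""
    cleaned = "".join(ch for ch in text.lower() if ch.isalpha() or ch == " ")
    return Counter(cleaned[i:i + n] for i in range(len(cleaned) - n + 1))


def ngram_profile(text, n=2):
    """Turn raw n-gram counts into relative frequencies."""
    counts = char_ngrams(text, n)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {gram: count / total for gram, count in counts.items()}


def profile_distance(p, q):
    """Euclidean distance between two sparse n-gram profiles."""
    grams = set(p) | set(q)
    return sum((p.get(g, 0.0) - q.get(g, 0.0)) ** 2 for g in grams) ** 0.5
```

Storing the profiles as sparse dictionaries means that only the n-grams actually seen in the training text take up memory.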
TurboTax costs $5 less without Cookies

With the exception of last year (changing job and state, I needed professional help), I have always filed my taxes using TurboTax. It is nice, simple to use, and reasonably priced.
This year was no exception, but before finding out that Chase and Bank of America customers get a 35% discount on it, I went to their website to check out the prices. Oddly, I discovered that visiting www.turbotax.com without accepting cookies shows lower prices ($5 less) than the ones offered to those who visit it with cookies enabled.
Glitch in the system? Naaaaa, I say it is an A/B comparison to see what customers are willing to pay.
Google integrates Profile Results and search links to Social Networks
Searching for names on Google now shows a “Profile Search” results box with the two best hits, an invitation to create your Google Profile (in case you are doing a vanity search) and quick links to the MySpace, Facebook, Classmates and LinkedIn search pages.
Google Search Suggestions: Men are More Worried about Manhood than IQ
Analysing query logs is really amusing sometimes. This is a screenshot from Google’s Search Suggestions for queries which start with the word “average”.
Here are a few extrapolations from these suggestions:
1) Men are more worried about the length of their penis than their IQ.
2) Height is more important for men, weight for women.
3) Salary is a concern only once you have established that you have a big penis, you are smart and tall.
Want your URL Shortener? Buy a Domain and Use Bit.ly, like Techcrunch and New York Times
With the spread of Twitter and other social sharing communities there is a lot of buzz around URL shortening. Every major company wants to have its own URL shortening domain: Google (goo.gl), TechCrunch (tcrn.ch), New York Times (nyti.ms), FourSquare (4sq.com), Fox News (fxn.ws), Delicious (icio.us), Bing (fa.il), …
Did they all really set up highly reliable servers and databases to do that? The answer is no.
Most of them just bought a short domain and set up their web server to redirect all requests to Bit.ly. Some examples:
TechCrunch: http://tcrn.ch/cNYWLR -> http://bit.ly/cNYWLR
New York Times: http://nyti.ms/dzy2b7 -> http://bit.ly/dzy2b7
Fox News: http://fxn.ws/cH1usB -> http://bit.ly/cH1usB
Now that you know the trick, building your own URL shortening service may be easier than you thought. And you get statistics too, simply by adding a plus (+) at the end of the URL (e.g., http://fxn.ws/cH1usB+) or by using the Bit.ly API.
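The redirect trick described above could look something like the following sketch, built on Python's standard `http.server` module. Everything here is an assumption made for illustration: the names `bitly_target`, `RedirectHandler` and `serve`, the port, and the choice of a 301 status code. The sites listed above most likely do this in their web server configuration rather than in application code.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

BITLY_BASE = "http://bit.ly"


def bitly_target(path):
    """Map a request path on the short domain to the same path on Bit.ly."""
    return BITLY_BASE + path


class RedirectHandler(BaseHTTPRequestHandler):
    """Answer every GET/HEAD request with a redirect to Bit.ly."""

    def do_GET(self):
        self.send_response(301)  # assumption: a permanent redirect
        self.send_header("Location", bitly_target(self.path))
        self.end_headers()

    do_HEAD = do_GET


def serve(host="127.0.0.1", port=8080):
    """Run the redirector until interrupted (hypothetical host and port)."""
    HTTPServer((host, port), RedirectHandler).serve_forever()
```

Calling `serve()` starts the redirector locally. Because the path is passed through unchanged, a trailing plus sign is forwarded too, so `/cH1usB+` would be redirected to `http://bit.ly/cH1usB+`.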




