Google Search Suggestions: Men are More Worried about Manhood than IQ
Analysing query logs is really amusing sometimes. This is a screenshot from Google’s Search Suggestions for queries which start with the word “average”.
Here are a few extrapolations from these suggestions:
1) Men are more worried about the length of their penis than their IQ.
2) Height is more important for men, weight for women.
3) Salary is a concern only once you have established that you have a big penis, you are smart and tall.
Want your URL Shortener? Buy a Domain and Use Bit.ly, like Techcrunch and New York Times
With the diffusion of Twitter and other social sharing communities there is a lot of buzz over URL shortening. Every major company wants to have its own URL shortening domain: Google (goo.gl), TechCrunch (tcrn.ch), New York Times (nyti.ms), FourSquare (4sq.com), Fox News (fxn.ws), Delicious (icio.us), Bing (fa.il), …
Did they all really set up some highly reliable servers and databases to do that? The answer is no.
Most of them just bought a short domain and set up their web server to redirect all the requests to Bit.ly. Some examples:
TechCrunch: http://tcrn.ch/cNYWLR -> http://bit.ly/cNYWLR
New York Times: http://nyti.ms/dzy2b7 -> http://bit.ly/dzy2b7
Fox News: http://fxn.ws/cH1usB -> http://bit.ly/cH1usB
Now that you know the trick, building your own URL shortening service may be easier than you thought. And you get statistics too, simply by adding a plus (+) at the end of the URL (e.g., http://fxn.ws/cH1usB+) or by using the Bit.ly API.
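The mapping behind the examples above can be sketched in a few lines of Python. This is only an illustration of the URL rewrite, not a server: the function names to_bitly and stats_url are made up for this sketch, and it assumes (as in the three examples) that the path on the branded domain is reused unchanged on bit.ly.

```python
from urllib.parse import urlsplit

def to_bitly(short_url):
    """Return the bit.ly URL a branded short URL would redirect to.

    Assumption: the branded domain simply forwards the same path to bit.ly,
    as in the TechCrunch, New York Times and Fox News examples.
    """
    parts = urlsplit(short_url)
    return parts._replace(netloc="bit.ly").geturl()

def stats_url(short_url):
    """Return the statistics page URL: the short URL with a trailing plus."""
    return short_url + "+"

print(to_bitly("http://tcrn.ch/cNYWLR"))
print(stats_url("http://fxn.ws/cH1usB"))
```

In a real setup the same rewrite would live in the web server's redirect rules rather than in application code.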
How to Update Flash Player on Apple TV
The folks at Boxee recently updated their player to version 0.9.20.10710. The new interface is pretty slick and installers are now available for Ubuntu, Apple TV, OS X and Windows.
Unfortunately, after having installed it on Apple TV v3.0.2, it is not possible to play any TV Show due to an outdated Flash Player. No worries, here is how to unlock the Flash magic:
- SSH into your Apple TV:
  ssh frontrow@<ip_address>
- Go into the temporary directory:
  cd /tmp
- Download the latest version of Flash Player for OS X:
  wget 'http://fpdownload.macromedia.com/get/flashplayer/current/install_flash_player_osx_ub.dmg'
- Mount the DMG file:
  sudo hdiutil attach install_flash_player_osx_ub.dmg
- Go into the Internet Plug-Ins directory:
  cd "/Library/Internet Plug-Ins/"
- Install the new Flash Player:
  sudo pax -r -z -f "/Volumes/Install Flash Player 10 UB/Adobe Flash Player.pkg/Contents/Archive.pax.gz"
Now restart your Apple TV and enjoy watching movies on Boxee (if the CPU is powerful enough).
Reiser4 Performances on Ubuntu 9.10
Since its debut in 2004, there has been a lot of controversy about Reiser4, the game-changing file-system developed by Namesys. It has been available for Linux for quite a while and there are great reviews on the Internet, but most distributions, including Ubuntu Karmic 9.10, do not offer it as an option by default.
Since the performance of my loyal Dell x300 is disk-bound, and I had success improving it with SquashFS last year (unfortunately it is read-only), I decided to do my own review of Reiser4. I downloaded the source of my kernel (apt-get source linux-image-2.6.31-20-generic) and its build dependencies (apt-get build-dep linux-image-2.6.31-20-generic), applied the right Reiser4 patch, and copied over my current .config file from the /boot directory (disable the DEBUG symbols in the kernel or your image will be huge). A couple of minutes later I was already running make-kpkg.
The goal of my first test was to compare Reiser4 with other well-known choices. I created a partition of about 5Gb and copied over my /usr directory (3549986832 bytes) using rsync (-aHAX) for each file-system.
| FS | Time | KB/s | CPU % | Disk (Gb) |
| ext4 | - | 2983 | - | - |
| jfs | - | 2817 | - | - |
| reiser3 | 16:57 | 3412 | 20 | 3.4 |
| reiser4 | 12:40 | 4570 | 25 | 3.4 |
Using Reiser4 the CPU utilization was definitely higher, but the disk throughput increased significantly (+53% versus ext4). If the ccreg40 plugin is used for file creation instead of the default reg40, Reiser4 offers many additional options on how the files are laid out, whether and how they are compressed, and which compression/encryption algorithm is used. I tested the impact of some of its compression schemes (conv, ultim and force) and the two compression algorithms (gzip1 and lzo1).
I was expecting the best compression from LZO1, but instead it gave the highest throughput (+70% with respect to ext4) at the expense of a higher CPU utilization (+70% with respect to Reiser3):
| Compression | Time | KB/s | CPU % | Disk (Gb) |
| conv | 11:41 | 4954 | 30 | 3.8 |
| force | 11:33 | 5004 | 27 | 3.4 |
| ultim | 11:23 | 5085 | 28 | 3.6 |
Using GZIP1 led to the highest compression ratio (-48% vs. ext4) but a lower CPU utilization (-19% with respect to LZO1) and a lower disk throughput (+42% over ext4):
| Compression | Time | KB/s | CPU % | Disk (Gb) |
| conv | 14:32 | 3972 | 27 | 2.7 |
| force | 13:56 | 4155 | 22 | 2.4 |
| ultim | 13:39 | 4241 | 23 | 1.8 |
The results can be difficult to interpret. LZO1 seems to give the maximum disk throughput, but it leaves less CPU available for applications and the files on disk are bigger (thus there is more to read). GZIP1, on the other hand, offers lower disk performance, but there is about half as much data to read from the disk.
To help me decide, I devised a comprehensive score (lower is better) which takes into account CPU, disk utilization and throughput, assuming that they all have the same importance:
score = (max_throughput / throughput) * (cpu / min_cpu) * (size / min_size)
Using this formula, the winning configuration is GZIP1 with “ultim” (1.38). With the default options, Reiser3 and Reiser4 obtain a score of 2.82 and 2.63 respectively. GZIP1 does slightly better than that with “force” (1.79) and “conv” (2.59). LZO1 is best with “force” (2.59), ahead of “ultim” (2.80) and “conv” (3.25).
Using Reiser4 with GZIP1 and “ultim” should provide a 42% increase in disk throughput and reduce by 48% the amount of data to read, making my disk about 2.95 times faster. Let’s hope it’s true.
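For anyone who wants to re-run the numbers, here is a small Python sketch of the score. It assumes the throughput, CPU and disk figures are copied as-is from the tables above; ext4 and jfs are left out because their CPU and disk columns are empty. The dictionary keys are just labels chosen for this sketch.

```python
# (throughput in KB/s, CPU %, disk usage in Gb), copied from the tables above
results = {
    "reiser3 default": (3412, 20, 3.4),
    "reiser4 default": (4570, 25, 3.4),
    "lzo1 conv": (4954, 30, 3.8),
    "lzo1 force": (5004, 27, 3.4),
    "lzo1 ultim": (5085, 28, 3.6),
    "gzip1 conv": (3972, 27, 2.7),
    "gzip1 force": (4155, 22, 2.4),
    "gzip1 ultim": (4241, 23, 1.8),
}

# Best observed value of each metric across all configurations
max_throughput = max(t for t, _, _ in results.values())
min_cpu = min(c for _, c, _ in results.values())
min_size = min(s for _, _, s in results.values())

def score(throughput, cpu, size):
    """Lower is better; every factor is 1.0 for the best observed value."""
    return (max_throughput / throughput) * (cpu / min_cpu) * (size / min_size)

scores = {name: score(*row) for name, row in results.items()}

# Print the configurations from best to worst
for name, value in sorted(scores.items(), key=lambda kv: kv[1]):
    print(f"{name:16s} {value:.2f}")
```

Since the three factors are simply multiplied, weighting one of them more heavily would only require raising that factor to a power.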
Top Shared Domains on Twitter, February 2010
Almost every magazine, tech blog, and news site (even CNN Money!) announced this week that Twitter receives more than 50 million tweets per day.
Great, but how many of those are junk or spam? Twitter does not care because (good or bad) every tweet translates into growth for them, but I am trying to answer this question in a paper that I am currently writing.
In the meantime, the following are the 100 most shared domains on Twitter:
| 5.8864% | youtube.com |
| 5.3170% | twitpic.com |
| 4.5964% | facebook.com |
| 4.4783% | formspring.me |
| 3.7426% | twitlonger.com |
| 1.6916% | tweetphoto.com |
| 1.6064% | twitcam.com |
| 1.1491% | fun140.com |
| 0.9761% | foursquare.com |
| 0.9400% | plurk.com |
| 0.8683% | twitter.com |
| 0.8161% | amazon.com |
| 0.7341% | blip.fm |
| 0.5434% | funwebsites.org |
| 0.5052% | mashable.com |
| 0.5046% | tinychat.com |
| 0.4638% | flickr.com |
| 0.4157% | news.bbc.co.uk |
| 0.4098% | askbiography.com |
| 0.3841% | pollpigeon.com |
| 0.3791% | nytimes.com |
| 0.3342% | etsy.com |
| 0.3329% | news.yahoo.com |
| 0.3059% | friendfeed.com |
| 0.3049% | spreadmytweets.com |
| 0.2931% | twittascope.com |
| 0.2914% | digg.com |
| 0.2810% | huffingtonpost.com |
| 0.2730% | theonlyexception.paramore.net |
| 0.2681% | oohja.com |
| 0.2656% | cnn.com |
| 0.2424% | epicpetwars.com |
| 0.2341% | techcrunch.com |
| 0.2281% | reuters.com |
| 0.2281% | guardian.co.uk |
| 0.2219% | ardenal.info |
| 0.2203% | dezireweb.info |
| 0.2193% | activities.myspace.com |
| 0.2010% | cgi.ebay.com |
| 0.1951% | cv-library.co.uk |
| 0.1911% | online.wsj.com |
| 0.1906% | last.fm |
| 0.1851% | tweetmyjobs.com |
| 0.1819% | justin.tv |
| 0.1789% | engadget.com |
| 0.1697% | myspace.com |
| 0.1617% | google.com |
| 0.1614% | dealspl.us |
| 0.1520% | getafreelancer.com |
| 0.1480% | am6.jp |
| 0.1460% | limelinx.com |
| 0.1455% | newzfor.me |
| 0.1431% | ustream.tv |
| 0.1425% | examiner.com |
| 0.1408% | roflquiz.com |
| 0.1391% | lolquiz.com |
| 0.1353% | pleaserobme.com |
| 0.1317% | washingtonpost.com |
| 0.1294% | news.cnet.com |
| 0.1294% | thetweettank.com |
| 0.1250% | gizmodo.com |
| 0.1243% | telegraph.co.uk |
| 0.1237% | ezinearticles.com |
| 0.1179% | simplyhired.com |
| 0.1178% | whoisdirectory.com |
| 0.1177% | itunes.apple.com |
| 0.1099% | sports.espn.go.com |
| 0.1090% | blogtv.com |
| 0.1076% | caltweet.com |
| 0.1035% | twitition.com |
| 0.1029% | vimeo.com |
| 0.1007% | marketwatch.com |
| 0.0969% | wired.com |
| 0.0947% | businessweek.com |
| 0.0937% | twitgoo.com |
| 0.0933% | money.cnn.com |
| 0.0919% | readwriteweb.com |
| 0.0918% | fwix.com |
| 0.0917% | en.wikipedia.org |
| 0.0917% | fanfeedr.com |
| 0.0867% | raptr.com |
| 0.0857% | followtweeter.com |
| 0.0852% | fastfollowertrain.com |
| 0.0847% | askgetanswer.com |
| 0.0846% | gowalla.com |
| 0.0845% | dealnay.com |
| 0.0820% | usatoday.com |
| 0.0815% | businessinsider.com |
| 0.0803% | sports.yahoo.com |
| 0.0798% | spotcrime.com |
| 0.0794% | trademytweets.com |
| 0.0777% | maps.google.com |
| 0.0769% | twibbon.com |
| 0.0757% | msnbc.msn.com |
| 0.0752% | sfbay.craigslist.org |
| 0.0737% | helium.com |
| 0.0722% | astore.amazon.com |
| 0.0722% | boingboing.net |
| 0.0713% | zazzle.com |
| 0.0710% | petitionspot.com |
| 0.0704% | foxnews.com |
I was expecting to see YouTube among the first domains but not Facebook. Formspring is a total surprise for me, but all its tweets (questions and answers) are automated so it is understandable. Looking at photo/video sharing sites yFrog did not seem to have caught on, since the most popular are Twitcam, Tweetphoto and Twitpic.
Foursquare’s automated tweets are increasing with the number of their users. All the tweets about Amazon contain referral codes. Among the news websites it seems that BBC is the most shared, followed by the New York Times, Huffington Post and CNN. The most popular tech blog is Mashable, which wins over TechCrunch and Engadget.
Why Twitter Clients still lack Classification, Clustering and Ranking?
On Facebook the average user has about 130 friends and I believe that the average user of Twitter follows a similar number of people.
Considering one or two Tweets per day from each, plus 10 or more from accounts like CNN or ABC, it’s reasonable to think that you would have to look at 250 messages per day. And if you follow more people, or automated services, this number is even higher.
Who has the time to read all this? I doubt you will keep looking at your Twitter stream the entire day looking for those few gems. And if you do, most days you will learn that Bob is sipping a cappuccino and Jane bought new boots, instead of something really useful.
So I wonder: why do Twitter clients (e.g., TweetDeck, Seesmic, …) not learn from news products (e.g., Google News) and start adding classification, clustering and ranking of followers/tweets?
Classification would help by probabilistically separating the tweets I may care about (e.g., technology, search, …) from the ones I do not (e.g., you are watching Lost, eating a pizza, …), pretty much like the spam filters do in modern email clients.
Clustering would put together all the tweets/discussions about the same topic (e.g., comments on the new Bruce Willis movie) so I do not see them scattered in pieces here and there and I can quickly understand what the general opinion on it is.
Ranking could then take advantage of both classification and clustering, understand whom I care about the most among the people I follow, and rank the tweets in my stream accordingly.
With 41 million tweets per day (39% containing a link, mostly spam) we could already take advantage of smarter Twitter clients.
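To give an idea of what the classification step could look like, here is a minimal naive Bayes sketch in Python, the same family of technique that many spam filters use. Everything in it is an assumption made for illustration: the labels “interesting” and “boring”, the four made-up training tweets, and the functions train and classify. Clustering and ranking are not covered.

```python
import math
from collections import Counter

def train(examples):
    """Count words per label from a list of (text, label) pairs."""
    word_counts = {}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts.setdefault(label, Counter()).update(text.lower().split())
    return word_counts, label_counts

def classify(text, word_counts, label_counts):
    """Return the most likely label using add-one (Laplace) smoothing."""
    vocabulary = set()
    for counts in word_counts.values():
        vocabulary.update(counts)
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, counts in word_counts.items():
        # Log prior of the label plus the log likelihood of each word
        score = math.log(label_counts[label] / total_docs)
        total_words = sum(counts.values())
        for word in text.lower().split():
            score += math.log((counts[word] + 1) / (total_words + len(vocabulary)))
        if score > best_score:
            best_label, best_score = label, score
    return best_label

# Hypothetical training tweets, invented for this sketch
examples = [
    ("new search engine ranking algorithm released", "interesting"),
    ("google launches new technology for web search", "interesting"),
    ("eating a pizza right now", "boring"),
    ("watching lost on tv tonight", "boring"),
]
word_counts, label_counts = train(examples)
print(classify("new web search technology", word_counts, label_counts))
print(classify("eating pizza watching lost", word_counts, label_counts))
```

A real client would need far more training data, ideally collected from what each user actually clicks on, replies to or ignores.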
Average Query Length on Major Search Engines, February 2010
With the increase in popularity of Internet access, people use the Web for almost everything. Web search engines are used as recipe books, calculators, encyclopedias, howtos, DIY references, and so on.
Over the last few years users have become better at formulating their queries, and it is kind of funny to think that at the beginning they were typing queries like “could you please find for me a recipe for apple pie?”.
Here are some query length statistics from 2006, 2007, and 2009.
But how good/bad are people at writing queries today? What is the distribution of the lengths? Does it vary between search engines?
This morning I decided to investigate that. After the analysis of some log files, here are some updated statistics as of February 2010:
| Bing | Ask | Yahoo | ||
| 1 | 26.79% | 46.76% | 49.90% | 54.15% |
| 2 | 23.39% | 18.81% | 13.03% | 18.11% |
| 3 | 18.72% | 15.92% | 16.09% | 12.31% |
| 4 | 12.78% | 8.40% | 6.72% | 7.08% |
| 5 | 8.23% | 5.23% | 6.42% | 3.73% |
| 6 | 4.55% | 1.94% | 3.77% | 2.47% |
| 7 | 2.76% | 1.40% | 0.71% | 0.97% |
| 8 | 1.36% | 0.71% | 2.24% | 0.68% |
| 9 | 1.02% | 0.77% | 0.81% | 0.33% |
| 10 | 0.41% | 0.06% | 0.31% | 0.18% |
| avg. length | 2.93 | 2.27 | 2.39 | 2.06 |
(Disclaimer: while I did my best to compute those statistics, due to increasingly strict privacy constraints they were made on a relatively small sample of queries and therefore may not be perfectly accurate.)
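The “avg. length” row can be recomputed from the distribution as a weighted mean. Here is a small Python sketch, assuming the percentages are copied column by column from the table above (the columns are numbered by position rather than named) and that queries longer than 10 terms are ignored:

```python
# Percentage of queries with 1..10 terms, one list per data column of the table
columns = [
    [26.79, 23.39, 18.72, 12.78, 8.23, 4.55, 2.76, 1.36, 1.02, 0.41],
    [46.76, 18.81, 15.92, 8.40, 5.23, 1.94, 1.40, 0.71, 0.77, 0.06],
    [49.90, 13.03, 16.09, 6.72, 6.42, 3.77, 0.71, 2.24, 0.81, 0.31],
    [54.15, 18.11, 12.31, 7.08, 3.73, 2.47, 0.97, 0.68, 0.33, 0.18],
]

def average_length(percentages):
    """Weighted mean of the query length, normalised by the column total."""
    total = sum(percentages)
    weighted = sum(terms * share for terms, share in enumerate(percentages, start=1))
    return weighted / total

averages = [average_length(column) for column in columns]
for position, average in enumerate(averages, start=1):
    print(f"column {position}: {average:.2f}")
```

Dividing by the column total instead of by 100 keeps the result stable when the rounded percentages do not add up to exactly 100.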
Facebook’s Email could really Take Down Gmail Supremacy
According to some statistics from Google, people spend four times more time surfing the Internet than driving their car. However, when asked what a browser is, they had no clue. The first Internet users were hackers who spent most of their time on terminals, chatting through IRC, using Pine for their emails and reading a few newsgroups. The first Netscape Navigator was still a long way off.
Nowadays, the Internet is a platform, and technologies like Ajax and Flash have reduced (perhaps even cancelled) the gap between online and local applications. Most of the people who use the Internet every day, I am sure, do not know the difference between email, Outlook, Gmail or Facebook Messages. In recent years many providers have even been pushing for a Web OS, with applications (e.g., Excel) and storage somewhere in the cloud.
Tech people (2% at most?) know that it is a bad idea. The rest of the world (98%) will hardly notice the difference. After all, their spreadsheet looks the same even in a browser.
Today, I read about Facebook’s idea of launching their own email platform: it is a genius idea.
Just a few days ago Facebook reached 400 million users, and most of them log in every day. They definitely take a look at their Walls and the homepage, and check their messages. Facebook does not have to do anything more than set up a few SMTP/IMAP servers and tell everybody that they now have an email address <username>@fbmail.com, and the deal is done.
I am sure that almost everyone who managed to create and use a Gmail account is on Facebook, so why bother checking both? Just sync their address book the first time and goodbye Gmail. People do not even know what the 8 Gb of space that Google gives them for their email is; they do not use labels or stars, and they neither install add-ons nor use the IMAP capabilities.
For Facebook, this is a great move. They will be able to look into your email stream and figure out what your interests are to improve their targeted advertising. You will spend even more time on their site. They already managed to keep everybody logged in through the chat, and now, with the introduction of email and a better search experience (a tailored web search), people will have no reason to leave.
They are already the biggest photo sharing website of the world (Flickr who?). They are probably the biggest “forum” site of the world. Now they will also conquer email.
Good idea. Very good idea.
Most Used URLs Shortener on Twitter, January 2010
URL shorteners are definitely a hot business right now: Twitter made them popular by restricting tweets to only 140 characters, and while developing a URL shortener is pretty simple, the amount and quality of data that they can collect (e.g., the number of times a URL has been clicked on) is amazing.
It is easy to imagine how interested big search engines like Google, Bing or Ask.com are in the click streams of these companies. Traditional search engines generally discover pages through crawling, which is getting increasingly difficult due to the ever-growing size of the web. With the expansion of Twitter and the data of Bit.ly, users will “report” the hot pages directly to them, and clicks will tell their importance.
According to my studies, in January 2010 the Twitter crowd produced about 41 million tweets per day, and of those about 38% contained a URL. Pretty impressive, considering that a few months ago there were only 26 million tweets per day and 22% contained URLs.
The table below shows the 100 most used URL shorteners and their relative percentages of URLs in the Twitter stream for January 2010.
| Percentage #URLs | Service Name |
| 69.63% | bit.ly |
| 7.17% | tinyurl.com |
| 6.50% | ow.ly |
| 2.55% | url4.eu |
| 1.83% | is.gd |
| 1.82% | cli.gs |
| 1.42% | goo.gl |
| 1.05% | tl.gd |
| 0.74% | ff.im |
| 0.72% | 4sq.com |
| 0.51% | su.pr |
| 0.51% | j.mp |
| 0.44% | s1z.us |
| 0.42% | lnk.ms |
| 0.42% | wp.me |
| 0.36% | shar.es |
| 0.31% | tiny.cc |
| 0.25% | ping.fm |
| 0.23% | fb.me |
| 0.22% | digg.com |
| 0.21% | fwix.com |
| 0.20% | r2u.at |
| 0.19% | dlvr.it |
| 0.16% | tr.im |
| 0.13% | siga.st |
| 0.13% | post.ly |
| 0.13% | nxy.in |
| 0.12% | mnt.to |
| 0.11% | nyti.ms |
| 0.09% | ur1.ca |
| 0.07% | u.nu |
| 0.07% | 3.ly |
| 0.06% | fxn.ws |
| 0.06% | uol.com |
| 0.05% | kele.es |
| 0.05% | sbne.ws |
| 0.05% | flic.kr |
| 0.05% | p.gs |
| 0.05% | kl.am |
| 0.05% | ad.vu |
| 0.04% | blip.fm |
| 0.04% | idek.net |
| 0.04% | ur.ly |
| 0.04% | trim.su |
| 0.03% | eca.sh |
| 0.03% | url.ie |
| 0.03% | digs.by |
| 0.03% | tcrn.ch |
| 0.03% | fk.cm |
| 0.03% | htxt.it |
| 0.02% | moby.to |
| 0.02% | om.ly |
| 0.02% | minu.me |
| 0.02% | tgam.ca |
| 0.02% | icio.us |
| 0.02% | vur.me |
| 0.02% | uurl.in |
| 0.02% | bub.bz |
| 0.02% | ning.it |
| 0.02% | mltp.ly |
| 0.02% | que.es |
| 0.02% | awe.sm |
| 0.02% | trim.li |
| 0.01% | flne.ws |
| 0.01% | vf.cx |
| 0.01% | 76k.com |
| 0.01% | askp.me |
| 0.01% | olha.biz |
| 0.01% | rp.pe |
| 0.01% | job.bs |
| 0.01% | znl.me |
| 0.01% | twa.lk |
| 0.01% | zz.gd |
| 0.01% | twib.es |
| 0.01% | rago.ca |
| 0.01% | sp2.ro |
| 0.01% | twlv.net |
| 0.01% | tynt.com |
| 0.01% | pk.gd |
| 0.01% | doms.bz |
| 0.01% | xr.com |
| 0.01% | hyux.com |
| 0.01% | bit2.ca |
| 0.01% | bz9.cc |
| 0.01% | tol.bz |
| 0.01% | act.ly |
| 0.01% | blip.tv |
| 0.01% | 9mp.com |
| 0.01% | dw.am |
| 0.01% | f1a.me |
| 0.01% | fwd4.me |
| 0.01% | amzn.com |
| 0.01% | bte.tc |
| 0.01% | gmed.net |
| 0.01% | r.im |
| 0.01% | sn.im |
| 0.01% | vai.la |
| 0.01% | boo.fm |
| 0.01% | elmo.st |
| 0.01% | im.ly |
(Disclaimer: some of those may not in fact be URL shorteners. The list was too long, and my life too short, to have the time to go through each of them and verify what their business is. If you find any errors, please feel free to let me know.)
Use shared_clone() to Share Variables among Perl Threads
Sharing variables across threads is generally very annoying in Perl. You have to declare the variable as shared before using it, and pay attention to the values you put in it.
Things get especially messy with multi-level hashes, since you are obliged to pre-declare each level as shared.
Luckily, there is a way to make things easier. If you upgrade threads::shared to version 1.32 using CPAN and can afford to waste some memory for a little while, you can create your objects normally and then create shared copies of them using shared_clone().
This function will recursively traverse the object, create a shared clone of each element in it, and return you a nice reference which you can pass around to your threads.
At that point, to save memory, you can undef() the original object and keep only the clone.
This works flawlessly for read-only objects, but it still requires some caution when you want to modify them or add/append data to them, since the new values need to be declared as shared first.
