Why Do Twitter Clients Still Lack Classification, Clustering and Ranking?

On Facebook the average user has about 130 friends, and I believe that the average Twitter user follows a similar number of people.

Considering one or two Tweets per day from each, plus 10 or more from accounts like CNN or ABC, it’s reasonable to think that you would have to look at 250 messages per day. And if you follow more people, or automated services, this number is even higher.

Who has the time to read all this? I doubt you will keep looking at your Twitter stream the entire day looking for those few gems. And if you do, most days you will learn that Bob is sipping a cappuccino and Jane bought new boots, instead of something really useful.

So I wonder: why do Twitter clients (e.g., TweetDeck, Seesmic, …) not learn from news products (e.g., Google News) and start adding classification, clustering and ranking of followers/tweets?

Classification would help by probabilistically separating the tweets I may care about (e.g., technology, search, …) from the ones I do not (e.g., you are watching Lost, eating a pizza, …), pretty much like spam filters do in modern email clients.
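To make the spam-filter analogy concrete, here is a minimal sketch of a naive Bayes classifier with add-one smoothing. It is only an illustration: the training tweets, the two labels and the classify() helper are all made up, and a real client would need far more training data and better tokenization.

```perl
use strict;
use warnings;

# Tiny hand-made training set (hypothetical examples, not real tweets).
my @training = (
    [ interesting => 'new search engine ranking algorithm released' ],
    [ interesting => 'google updates search technology' ],
    [ boring      => 'eating a pizza and watching lost' ],
    [ boring      => 'sipping a cappuccino this morning' ],
);

# Count words per label, tweets per label, and the overall vocabulary.
my (%word_count, %total_words, %doc_count, %vocab);
for my $example (@training) {
    my ($label, $text) = @$example;
    $doc_count{$label}++;
    for my $word (split ' ', lc $text) {
        $word_count{$label}{$word}++;
        $total_words{$label}++;
        $vocab{$word} = 1;
    }
}

# Return the label with the highest log-probability for a tweet.
sub classify {
    my ($text) = @_;
    my $vocab_size = keys %vocab;
    my ($best_label, $best_score);
    for my $label (sort keys %doc_count) {
        # Prior: fraction of training tweets carrying this label.
        my $score = log($doc_count{$label} / @training);
        for my $word (split ' ', lc $text) {
            my $seen = $word_count{$label}{$word} // 0;
            # Add-one smoothing keeps unseen words from zeroing the score.
            $score += log(($seen + 1) / ($total_words{$label} + $vocab_size));
        }
        ($best_label, $best_score) = ($label, $score)
            if !defined $best_score || $score > $best_score;
    }
    return $best_label;
}

print classify('new search technology'), "\n";
print classify('watching lost with a pizza'), "\n";
```

With this toy data the first tweet is labeled interesting and the second boring. In a real client the training labels could come from the tweets the user clicks, replies to or ignores.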

Clustering would put together all the tweets/discussions about the same topic (e.g., comments on the new Bruce Willis movie) so that I do not see them scattered in pieces here and there and can quickly understand what the general opinion is.

Ranking could then take advantage of both classification and clustering, understand whom I care about the most among the people I follow, and rank the tweets in my stream accordingly.

With 41 million tweets per day (39% containing a link, most of them spam), we could already take advantage of smarter Twitter clients.

Internet & Search, Technology

Average Query Length on Major Search Engines, February 2010

With the increase in popularity of Internet access, people use the Web for almost everything. Web search engines are used as recipe books, calculators, encyclopedias, how-tos, DIY references, and so on.

In recent years users have become better at formulating their queries, and it is kind of funny to think that in the beginning they were typing queries like “could you please find for me a recipe for apple pie?”.

Here are some query length statistics from 2006, 2007, and 2009.

But how good or bad are people at writing queries today? What is the distribution of the lengths? Does it vary between search engines?

This morning I decided to investigate that. After the analysis of some log files, here are some updated statistics as of February 2010:


      length   Google     Bing      Ask    Yahoo
           1   26.79%   46.76%   49.90%   54.15%
           2   23.39%   18.81%   13.03%   18.11%
           3   18.72%   15.92%   16.09%   12.31%
           4   12.78%    8.40%    6.72%    7.08%
           5    8.23%    5.23%    6.42%    3.73%
           6    4.55%    1.94%    3.77%    2.47%
           7    2.76%    1.40%    0.71%    0.97%
           8    1.36%    0.71%    2.24%    0.68%
           9    1.02%    0.77%    0.81%    0.33%
          10    0.41%    0.06%    0.31%    0.18%
 avg. length    2.93     2.27     2.39     2.06


(Disclaimer: while I did my best to compute those statistics, due to increasingly strict privacy constraints they were computed on a relatively small sample of queries and therefore may not be perfectly accurate.)
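For readers who want to check the last row of the table, the average length is simply a weighted mean of the lengths, using the percentages as weights. Here is a small sketch for the Google column. The average_length() helper is a name made up for this example, and it assumes that queries longer than 10 can be ignored, since the table does not list them.

```perl
use strict;
use warnings;

# Percentage of queries for each query length (1..10), from the Google column.
my @google = (26.79, 23.39, 18.72, 12.78, 8.23, 4.55, 2.76, 1.36, 1.02, 0.41);

# Weighted mean: sum of (length * share) divided by the sum of the shares.
sub average_length {
    my @shares = @_;
    my ($weighted, $total) = (0, 0);
    for my $i (0 .. $#shares) {
        $weighted += ($i + 1) * $shares[$i];
        $total    += $shares[$i];
    }
    return $weighted / $total;
}

printf "Google avg. length: %.2f\n", average_length(@google);   # prints 2.93
```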

Internet & Search

Facebook’s Email Could Really Take Down Gmail’s Supremacy

According to some statistics from Google, people spend 4x more time surfing the Internet than driving their car. However, when asked what a browser is, they had no clue. The first Internet users were hackers who spent most of their time on terminals, chatting through IRC, using Pine for their email and reading a few newsgroups. The first Netscape Navigator was still far away.

Nowadays, the Internet is a platform, and technologies like Ajax and Flash have reduced (perhaps even eliminated) the gap between online and local applications. Most of the people who use the Internet every day, I am sure, do not know the difference between email, Outlook, Gmail or Facebook Messages. In recent years many providers have even been pushing for a Web OS, with applications (e.g., Excel) and storage somewhere in the cloud.

Tech people (2% at most?) know that it is a bad idea. The rest of the world (98%) will hardly notice the difference. After all, their spreadsheet looks the same even in a browser.

Today, I read about Facebook’s idea of launching their own email platform: it is a genius idea.

Just a few days ago Facebook reached 400 million users, and most of them log in every day. They definitely take a look at their Walls and the homepage, and check their messages. Facebook does not have to do anything more than set up a few SMTP/IMAP servers and tell everybody that they now have an email address <username>@fbmail.com, and the deal is done.

I am sure that almost everyone who managed to create and use a Gmail account is on Facebook, so why bother checking both? Just sync their address book the first time and goodbye Gmail. Most people do not even know what the 8 GB of space that Google gives them for their email amounts to; they do not use labels or stars, and they neither install add-ons nor use the IMAP capabilities.

For Facebook, this is a great move. They will be able to look into your email stream and figure out what your interests are to improve their targeted advertising. You will spend even more time on their site. They already managed to keep everybody logged in through the chat, and now, with the introduction of email and a better search experience (a tailored web search), people will have no reason to leave.

They are already the biggest photo sharing website in the world (Flickr who?). They are probably the biggest “forum” site in the world. Now they will also conquer email.

Good idea. Very good idea.

Internet & Search, Technology

Most Used URL Shorteners on Twitter, January 2010

URL shorteners are definitely a hot business right now: Twitter made them popular by restricting tweets to only 140 characters, and while developing a URL shortener is pretty simple, the amount and quality of data that they can collect (e.g., the number of times a URL has been clicked) is amazing.

It is easy to imagine how interested big search engines like Google, Bing or Ask.com are in the click streams of these companies. Traditional search engines generally discover pages through crawling, which is getting more and more difficult due to the ever growing size of the web. With the expansion of Twitter and the data of Bit.ly, users will “report” the hot pages directly to them, and clicks will tell their importance.

According to my studies, in January 2010 the Twitter crowd produced about 41 million tweets per day, and about 38% of those contained a URL. Pretty impressive, considering that a few months ago there were only 26 million tweets per day and 22% contained URLs.

The table below shows the top 100 most used URL shorteners and their relative percentages of URLs in the Twitter stream for January 2010.


Percentage of URLs, Service Name
69.63% bit.ly
7.17% tinyurl.com
6.50% ow.ly
2.55% url4.eu
1.83% is.gd
1.82% cli.gs
1.42% goo.gl
1.05% tl.gd
0.74% ff.im
0.72% 4sq.com
0.51% su.pr
0.51% j.mp
0.44% s1z.us
0.42% lnk.ms
0.42% wp.me
0.36% shar.es
0.31% tiny.cc
0.25% ping.fm
0.23% fb.me
0.22% digg.com
0.21% fwix.com
0.20% r2u.at
0.19% dlvr.it
0.16% tr.im
0.13% siga.st
0.13% post.ly
0.13% nxy.in
0.12% mnt.to
0.11% nyti.ms
0.09% ur1.ca
0.07% u.nu
0.07% 3.ly
0.06% fxn.ws
0.06% uol.com
0.05% kele.es
0.05% sbne.ws
0.05% flic.kr
0.05% p.gs
0.05% kl.am
0.05% ad.vu
0.04% blip.fm
0.04% idek.net
0.04% ur.ly
0.04% trim.su
0.03% eca.sh
0.03% url.ie
0.03% digs.by
0.03% tcrn.ch
0.03% fk.cm
0.03% htxt.it
0.02% moby.to
0.02% om.ly
0.02% minu.me
0.02% tgam.ca
0.02% icio.us
0.02% vur.me
0.02% uurl.in
0.02% bub.bz
0.02% ning.it
0.02% mltp.ly
0.02% que.es
0.02% awe.sm
0.02% trim.li
0.01% flne.ws
0.01% vf.cx
0.01% 76k.com
0.01% askp.me
0.01% olha.biz
0.01% rp.pe
0.01% job.bs
0.01% znl.me
0.01% twa.lk
0.01% zz.gd
0.01% twib.es
0.01% rago.ca
0.01% sp2.ro
0.01% twlv.net
0.01% tynt.com
0.01% pk.gd
0.01% doms.bz
0.01% xr.com
0.01% hyux.com
0.01% bit2.ca
0.01% bz9.cc
0.01% tol.bz
0.01% act.ly
0.01% blip.tv
0.01% 9mp.com
0.01% dw.am
0.01% f1a.me
0.01% fwd4.me
0.01% amzn.com
0.01% bte.tc
0.01% gmed.net
0.01% r.im
0.01% sn.im
0.01% vai.la
0.01% boo.fm
0.01% elmo.st
0.01% im.ly




(Disclaimer: some of those may not in fact be URL shorteners. The list was too long, and my life too short, to go through each of them and verify what their business is. If you find any errors, please feel free to let me know.)
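The post does not describe how the percentages were computed, so the following is only a sketch of one way such a tally could be made: extract the host of every URL found in a set of tweets and report each host's share. The sample tweets, their short codes and the host_shares() helper are all made up for illustration.

```perl
use strict;
use warnings;

# Made-up sample tweets; the short codes are placeholders, not real links.
my @tweets = (
    'Great read http://bit.ly/aaa111',
    'Photos from today http://tinyurl.com/bbb222 and http://bit.ly/ccc333',
    'No link in this one',
    'Breaking news http://ow.ly/ddd444',
);

# Return a hash reference mapping each URL host to its percentage of all URLs.
sub host_shares {
    my @texts = @_;
    my %count;
    my $total = 0;
    for my $text (@texts) {
        # Capture the host part of every http(s) URL in the tweet.
        while ($text =~ m{https?://([^/\s]+)}gi) {
            $count{ lc $1 }++;
            $total++;
        }
    }
    my %share;
    $share{$_} = 100 * $count{$_} / $total for keys %count;
    return \%share;
}

my $shares = host_shares(@tweets);
for my $host (sort { $shares->{$b} <=> $shares->{$a} || $a cmp $b } keys %$shares) {
    printf "%6.2f%% %s\n", $shares->{$host}, $host;
}
```

A tally like this counts every host, which is why a list built this way can include domains that are not really shorteners.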

Internet & Search

Use shared_clone() to Share Variables among Perl Threads

Sharing variables across threads is generally very annoying in Perl. You have to declare the variable as shared before using it, and pay attention to the values you put in it.

Things get especially messy with multi-level hashes, since you are obligated to pre-declare each level as shared.

Luckily, there is a way to make things easier. If you upgrade threads::shared to version 1.32 using CPAN and can afford to waste some memory for a little while, you can create your objects normally and then create shared copies of them using shared_clone().

This function will recursively traverse the object, create a shared clone of each element in it, and return you a nice reference which you can pass around to your threads.

At that point, to save memory, you can undef() the original object and keep only the clone.

This works flawlessly for read-only objects, but it still requires some caution when you want to modify them or add data to them, since any new nested value must itself be shared before you store it.
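As a rough illustration of the steps above, here is a minimal sketch. The $config structure and its keys are made up for the example, and it assumes a Perl built with thread support.

```perl
use strict;
use warnings;
use threads;
use threads::shared;   # exports shared_clone() and lock()

# Build a nested structure the ordinary way, with no per-level sharing.
my $config = {
    hosts  => [ 'alpha', 'beta' ],
    limits => { max_conn => 10 },
};

# Recursively copy it into shared memory, then drop the original.
my $shared = shared_clone($config);
undef $config;

# Worker threads can read the shared copy directly.
my @workers = map {
    threads->create(sub { return scalar @{ $shared->{hosts} } });
} 1 .. 2;
my @counts = map { $_->join } @workers;

# Adding a nested value later: clone it first, while holding a lock.
{
    lock(%$shared);
    $shared->{extra} = shared_clone({ note => 'added later' });
}

print "hosts seen by each thread: @counts\n";
```

Storing a plain (non-shared) reference into the shared hash would make Perl die at runtime, which is why the value added later is cloned as well.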

Technology

Perl: if you chomp() to split(), skip the first

In Perl it is common to write a readline() while loop over a file to read its content into memory.

When the file contains tab-separated data, many use chomp() to remove the newline from each input line and then split(/\t/) to separate the values into an array.

Today, trying to improve the performance of one program I wrote, I discovered that I was spending the same amount of time on both functions. Eliminating one of them would have doubled the speed of my code.

If your input data do not contain spaces, you can skip the chomp() and use split(/\s+/); if they do, use split(/[\t\n]/) instead. Keep in mind that split(/\s+/) also collapses consecutive tabs, so empty fields in the middle of a line are lost.
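Here is a small sketch comparing the two approaches on sample lines (the data is made up). It relies on the fact that split() drops trailing empty fields, so the newline can simply be treated as one more separator.

```perl
use strict;
use warnings;

my $line = "alpha\tbeta gamma\tdelta\n";

# Classic approach: strip the newline first, then split on tabs.
my $copy = $line;
chomp $copy;
my @classic = split /\t/, $copy;

# Single call: the newline is just another separator, and split()
# discards the empty field it leaves at the end of the line.
my @single = split /[\t\n]/, $line;

# When the fields contain no spaces, splitting on whitespace also works.
my @no_spaces = split /\s+/, "alpha\tbeta\tgamma\n";

print join('|', @classic), "\n";     # alpha|beta gamma|delta
print join('|', @single), "\n";      # alpha|beta gamma|delta
print join('|', @no_spaces), "\n";   # alpha|beta|gamma
```

Whether this actually saves time depends on the data and the Perl version, so it is worth benchmarking on your own input before relying on it.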

Technology

Boulder Creek Inspires People... Even to Pee!

Today, walking by Boulder Creek, I saw some people doing a photo shoot. It happens often around here, especially when young couples take their engagement pictures. It was inspiring.

Apparently, I was not the only one inspired by it.

The man in the orange coat decided to pee in the river next to the couple’s photo shoot in full daylight!

Travel

Free Wifi and Toolbars are Used to Monitor the Pages you Visit

If your browser sports a toolbar (e.g., from Yahoo, Google, MyWebSearch, …), or if you are using Google’s Chrome browser or the free WiFi that Google offers in airports and on planes, somebody is gathering data on the pages you visit, how long you stay on them, and what else you do online.

This is the sometimes shocking truth that most people ignore when they use free web products without wondering why and how those companies provide them for free.

After all, if you think about it, to install a wireless connection in your home you pay the cable company, the cost of the router and the electricity to make it work 24/7. And that is just enough for one family, within a radius of about 20 feet. An airport definitely needs more, so why is Google so happy to offer it to you for free?

And what about GMail? It is a wonderful free product and you have 8 GB of space for your emails. On the other hand, buying an 8 GB memory card for your camera sets you back $20.

Google Chrome is a great browser, but why would Google invest the salary of 50 of its own engineers to develop a free browser when there are already plenty of alternatives out there? Why would Yahoo develop a special toolbar to put in your browser when there is already a search box in its top right corner?

There are lots of other examples like those on the web. Almost all free software on the Internet comes with a toolbar nowadays (e.g., if you installed uTorrent, it came with the Ask.com toolbar).

The answer is simple: they want your data.

Those companies are not looking for your address or SSN. They are interested in your hobbies, the news you like, which pages you visit, and what you buy. They are trying to create a profile of you and then use it to provide better targeted ads, increasing the likelihood that you will click on them and therefore make them money.

Clicks and time spent on each page can also help web search engines to improve and train their ranking algorithms. If everybody stops a 5-minute YouTube video after a couple of minutes, it is probably not that great. The same goes for a page full of text abandoned after a few seconds. On the other hand, if the average time spent on a page is 3 minutes and you spend only one there, it is probably just not that relevant to you.

GMail is a great example of this technology. While you read your email, perhaps discussing your pal's recent vacation in Hawaii, Google's servers are busy at work extracting the important keywords from those messages and showing you flight and vacation offers on the right side of the screen.

The free WiFi that Google offered around the Christmas holidays allowed them to gather plenty of data on what people were buying this season, information that could then be used to improve Google Checkout. At the same time, they could monitor in real time which news people were looking at, which definitely helped improve Google News.

Should you stop using all those products? You cannot, we both know it. However, you can take steps to reduce your exposure: delete all the browser toolbars (what do you use them for, anyway?), start using a free browser like Firefox, and install extensions like CookieSafe and Adblock Plus.

Featured, Internet & Search

Prosthesis Technology: the Amazing Progress of Recent Years


Losing my eyesight, my legs or arms has always been among my biggest fears.

I would love to be able to do research 24/7 in all possible fields and help to improve the lives of the less fortunate, but my knowledge of the medical field is very limited for now and I still have to finish my PhD in Computer Science before jumping into something else.

Luckily, other people are doing that, and their progress blew my mind. If you have not done so already, go read the article “Bionic Legs, i-Limbs, and Other Super Human Prostheses You’ll Envy” in this month’s Fast Company. It’s amazing to see the progress made in prosthesis technology in recent years.

Computer research is fun, but this is on a totally different level. Go guys, you are the real heroes.

http://www.fastcompany.com/magazine/142/super-human.html
Health, Technology

Search Results: 65% of Users’ Attention Goes to the First Three

According to a 2004 eye/click tracking study (Eye-tracking analysis of user behavior in WWW search) done at Cornell University on the Google results page, users spend about 50% of their time on the page reading the snippets of the first and second results, and 14% reading the third one.

Here is the percentage of time spent reading each snippet compared to the percentage of clicks on that result.

         eye     clicks
   1    28.43%   56.36%
   2    25.08%   13.45%
   3    14.72%    9.82%
   4     8.70%    4.00%
   5     6.02%    4.73%
   6     4.01%    3.27%
   7     3.01%    0.36%
   8     3.68%    2.91%
   9     3.01%    1.45%
  10     2.34%    2.55%



The eye-tracking data can be especially useful for anyone who is trying to train a relevance/ranking system and uses Discounted Cumulative Gain as a metric.
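As a sketch of how the two could fit together: standard DCG discounts the relevance grade at position i by log2(i + 1), and one could imagine replacing that discount with the observed share of attention per position. The second function below is a hypothetical variant written for this example and is not part of the cited study; the relevance grades are made up as well.

```perl
use strict;
use warnings;

# Share of viewing time per result position (1..10), from the "eye" column above.
my @eye = (28.43, 25.08, 14.72, 8.70, 6.02, 4.01, 3.01, 3.68, 3.01, 2.34);

# Standard DCG: each relevance grade is divided by log2(position + 1).
sub dcg {
    my @rel = @_;
    my $sum = 0;
    for my $i (0 .. $#rel) {
        $sum += $rel[$i] / (log($i + 2) / log(2));
    }
    return $sum;
}

# Hypothetical variant: weight each grade by the observed attention at its
# position, normalized so that the first position has weight 1.
sub attention_gain {
    my @rel = @_;
    my $sum = 0;
    for my $i (0 .. $#rel) {
        last if $i > $#eye;   # no attention data beyond position 10
        $sum += $rel[$i] * $eye[$i] / $eye[0];
    }
    return $sum;
}

my @grades = (3, 2, 3, 0, 1, 2);   # made-up relevance grades for six results
printf "DCG: %.3f  attention-weighted: %.3f\n",
    dcg(@grades), attention_gain(@grades);
```

The attention weights fall off much faster than the logarithmic discount after the third position, so a metric built this way would reward getting the top three results right even more strongly.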

Featured, Internet & Search