Posts Tagged ‘Internet & Search’

How to Easily Stop Internet Piracy: Make People Pay for Bandwidth

In the last few days there have been many debates and threads about the PIPA/SOPA legislation. Sites went black (even Wikipedia), people added “STOP SOPA” banners to their avatars, and the technology news is overwhelmed with those messages. Plenty of people say that this legislation will destroy the web, freedom of speech, and innovation.

While I do not have a particular position on the topic, I do think that there could be a much easier way for the government to stop Internet piracy, one that does not require any form of censorship: oblige service providers to charge people for the bandwidth they use instead of a monthly flat rate, much like they already do with cellphone data.

Let’s assume that normal Internet users use 50 GB/month and currently pay $50/month for the service. That’s $1/GB of bandwidth. If you do a lot of file sharing, you will definitely use more, and you’ll pay for it. After a while people will stop because it will become too expensive.
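To make the arithmetic concrete, here is a minimal sketch of metered billing using the numbers above ($50 for 50 GB, so $1/GB). The function name and the 500 GB heavy-user figure are made up for illustration; a real pricing scheme would likely be more nuanced.

```python
def metered_bill(usage_gb, flat_price=50.0, typical_usage_gb=50.0):
    """Monthly bill if the current flat price were converted to a per-GB rate."""
    rate_per_gb = flat_price / typical_usage_gb  # $50 / 50 GB = $1 per GB
    return usage_gb * rate_per_gb


typical_bill = metered_bill(50)    # a "normal" user, as assumed in the post
heavy_bill = metered_bill(500)     # hypothetical heavy file sharer (assumed figure)
print(f"typical: ${typical_bill:.2f}, heavy: ${heavy_bill:.2f}")
```

Under these assumptions the typical user pays the same as today, while the bill of a user moving ten times the data grows tenfold.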

In theory, this should work just fine. One could argue that I could legally download/watch things on regular channels (e.g., Hulu, Amazon, …) and those would still count towards my bandwidth, effectively increasing my bill. True, but I think we could find a fair monthly allowance that covers pretty much everybody. My hunch is that people who really contribute to Internet piracy use way more bandwidth than regular people.

Another solution could be to have a list of sites which do not count toward the bandwidth. Unfortunately, this would be hard to regulate and check, while I can always look at my router’s traffic stats to check that my cable bill is right.

Finally, there could be legislation that requires cable providers to report to the authorities statistics on the traffic of people who go above a certain monthly bandwidth. I am sure the entire process could be automated, including building a system that automatically estimates the probability that the user is doing something illegal.


Average Query Length on Major Search Engines, February 2010

With the increase in popularity of Internet access, people use the Web for almost everything. Web search engines are used as recipe books, calculators, encyclopedias, how-tos, DIY references, and so on.

In recent years users have become better at formulating their queries, and it is kind of funny to think that at the beginning they were typing queries like “could you please find for me a recipe for apple pie?”.

Here are some query length statistics from 2006, 2007, and 2009.

But how good/bad are people at writing queries today? What is the distribution of the lengths? Does it vary between search engines?

This morning I decided to investigate that. After analyzing some log files, here are some updated statistics as of February 2010:


Query length   Google    Bing      Ask       Yahoo
1              26.79%    46.76%    49.90%    54.15%
2              23.39%    18.81%    13.03%    18.11%
3              18.72%    15.92%    16.09%    12.31%
4              12.78%     8.40%     6.72%     7.08%
5               8.23%     5.23%     6.42%     3.73%
6               4.55%     1.94%     3.77%     2.47%
7               2.76%     1.40%     0.71%     0.97%
8               1.36%     0.71%     2.24%     0.68%
9               1.02%     0.77%     0.81%     0.33%
10              0.41%     0.06%     0.31%     0.18%
avg. length     2.93      2.27      2.39      2.06


(Disclaimer: while I did my best to compute those statistics, due to increasingly tight privacy constraints they were computed on a relatively small sample of queries and therefore may not be perfectly accurate.)
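For readers who want to check the “avg. length” row, the sketch below recomputes it as a weighted mean of the lengths 1 to 10, using the percentages from the table. It assumes that queries longer than 10 are negligible, which the table does not state, and it normalizes by the column sum because two columns add up to 100.01% after rounding.

```python
# Percentage of queries per query length (1..10), copied from the table.
DISTRIBUTION = {
    "Google": [26.79, 23.39, 18.72, 12.78, 8.23, 4.55, 2.76, 1.36, 1.02, 0.41],
    "Bing":   [46.76, 18.81, 15.92, 8.40, 5.23, 1.94, 1.40, 0.71, 0.77, 0.06],
    "Ask":    [49.90, 13.03, 16.09, 6.72, 6.42, 3.77, 0.71, 2.24, 0.81, 0.31],
    "Yahoo":  [54.15, 18.11, 12.31, 7.08, 3.73, 2.47, 0.97, 0.68, 0.33, 0.18],
}


def average_length(percentages):
    """Weighted mean of the lengths 1..n, weighted by their share of queries."""
    weighted = sum(length * pct for length, pct in enumerate(percentages, start=1))
    return weighted / sum(percentages)


averages = {engine: average_length(p) for engine, p in DISTRIBUTION.items()}
for engine, avg in averages.items():
    print(f"{engine}: {avg:.2f}")
```

The results agree with the averages reported in the table to two decimal places.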


Facebook’s Email Could Really Take Down Gmail’s Supremacy

According to some statistics from Google, people spend 4x more time surfing the Internet than driving their car. However, when asked what a browser is, they had no clue. The first Internet users were hackers who spent most of their time on terminals, chatting through IRC, using Pine for their emails, and reading a few newsgroups. The first Netscape Navigator was still far off.

Nowadays, the Internet is a platform, and technologies like Ajax and Flash have reduced (perhaps even eliminated) the gap between online and local applications. Most of the people who use the Internet every day, I am sure, do not know the difference between email, Outlook, Gmail or Facebook Messages. In recent years many providers have even been pushing for a Web OS, with applications (e.g., Excel) and storage somewhere in the cloud.

Tech people (2% at most?) know that it is a bad idea. The rest of the world (98%) will hardly notice the difference. After all, their spreadsheet looks the same even in a browser.

Today, I read about Facebook’s idea of launching their own email platform: it is a genius idea.

Just a few days ago Facebook reached 400 million users, and most of them log in every day. They definitely take a look at their Walls and the homepage, and check their messages. Facebook does not have to do anything more than set up a few SMTP/IMAP servers and tell everybody that they now have an email address <username>@fbmail.com, and the deal is done.

I am sure that almost everyone who managed to create and use a Gmail account is on Facebook, so why bother checking both? Just sync their address book the first time and goodbye Gmail. People do not even know what to do with the 8 GB of space that Google is giving them for their emails; they do not use labels or stars, and they do not install add-ons or use the IMAP capabilities.

For Facebook, this is a great move. They will be able to look into your email stream and figure out what your interests are to improve their targeted advertising. You will spend even more time on their site. They already managed to keep everybody logged in through the chat, and now, with the introduction of email and a better search experience (a tailored web search), people will have no reason to leave.

They are already the biggest photo sharing website in the world (Flickr who?). They are probably the biggest “forum” site in the world. Now they will also conquer email.

Good idea. Very good idea.


Most Used URL Shorteners on Twitter, January 2010

URL shorteners are definitely a hot business right now: Twitter made them popular by restricting tweets to only 140 characters, and while developing a URL shortener is pretty simple, the amount and quality of data that they can collect (e.g., the number of times a URL has been clicked on) is amazing.

It is easy to imagine how interested big search engines like Google, Bing or Ask.com are in the click streams of these companies. Traditional search engines generally discover pages through crawling, which is getting increasingly difficult due to the ever-growing size of the web. With the expansion of Twitter and the data of Bit.ly, users will “report” the hot pages directly to them, and clicks will tell their importance.

According to my studies, in January 2010 the Twitter crowd produced about 41 million tweets per day, and about 38% of those contained a URL. Pretty impressive, considering that a few months ago there were only 26 million tweets per day and 22% contained URLs.

The table below shows the top 100 most used URL shorteners and their relative percentages of URLs in the Twitter stream for January 2010.


Percentage of URLs / Service Name
69.63% bit.ly
7.17% tinyurl.com
6.50% ow.ly
2.55% url4.eu
1.83% is.gd
1.82% cli.gs
1.42% goo.gl
1.05% tl.gd
0.74% ff.im
0.72% 4sq.com
0.51% su.pr
0.51% j.mp
0.44% s1z.us
0.42% lnk.ms
0.42% wp.me
0.36% shar.es
0.31% tiny.cc
0.25% ping.fm
0.23% fb.me
0.22% digg.com
0.21% fwix.com
0.20% r2u.at
0.19% dlvr.it
0.16% tr.im
0.13% siga.st
0.13% post.ly
0.13% nxy.in
0.12% mnt.to
0.11% nyti.ms
0.09% ur1.ca
0.07% u.nu
0.07% 3.ly
0.06% fxn.ws
0.06% uol.com
0.05% kele.es
0.05% sbne.ws
0.05% flic.kr
0.05% p.gs
0.05% kl.am
0.05% ad.vu
0.04% blip.fm
0.04% idek.net
0.04% ur.ly
0.04% trim.su
0.03% eca.sh
0.03% url.ie
0.03% digs.by
0.03% tcrn.ch
0.03% fk.cm
0.03% htxt.it
0.02% moby.to
0.02% om.ly
0.02% minu.me
0.02% tgam.ca
0.02% icio.us
0.02% vur.me
0.02% uurl.in
0.02% bub.bz
0.02% ning.it
0.02% mltp.ly
0.02% que.es
0.02% awe.sm
0.02% trim.li
0.01% flne.ws
0.01% vf.cx
0.01% 76k.com
0.01% askp.me
0.01% olha.biz
0.01% rp.pe
0.01% job.bs
0.01% znl.me
0.01% twa.lk
0.01% zz.gd
0.01% twib.es
0.01% rago.ca
0.01% sp2.ro
0.01% twlv.net
0.01% tynt.com
0.01% pk.gd
0.01% doms.bz
0.01% xr.com
0.01% hyux.com
0.01% bit2.ca
0.01% bz9.cc
0.01% tol.bz
0.01% act.ly
0.01% blip.tv
0.01% 9mp.com
0.01% dw.am
0.01% f1a.me
0.01% fwd4.me
0.01% amzn.com
0.01% bte.tc
0.01% gmed.net
0.01% r.im
0.01% sn.im
0.01% vai.la
0.01% boo.fm
0.01% elmo.st
0.01% im.ly




(Disclaimer: some of those may not in fact be URL shorteners. The list was too long, and my life too short, to go through each of them and verify what their business is. If you find any errors, please feel free to let me know.)
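For the curious, this kind of ranking can be produced with very little code. The sketch below is not the script behind the table; it is a minimal illustration that extracts URLs from tweet texts with a simple regular expression, groups them by host name, and reports the fraction of tweets containing a URL plus each host’s share of all URLs. The sample tweets and the function name are made up, and the regular expression is deliberately naive (it would, for example, keep punctuation that directly follows a host name).

```python
import re
from collections import Counter
from urllib.parse import urlparse

URL_PATTERN = re.compile(r"https?://\S+")  # naive URL matcher, good enough for a sketch


def shortener_shares(tweets):
    """Return (fraction of tweets with a URL, {host: share of all URLs})."""
    hosts = Counter()
    tweets_with_url = 0
    for text in tweets:
        urls = URL_PATTERN.findall(text)
        if urls:
            tweets_with_url += 1
        for url in urls:
            host = urlparse(url).netloc.lower()
            if host.startswith("www."):
                host = host[4:]
            hosts[host] += 1
    total_urls = sum(hosts.values())
    # most_common() sorts by count, so the dict is ordered from most to least used.
    shares = {h: c / total_urls for h, c in hosts.most_common()} if total_urls else {}
    url_fraction = tweets_with_url / len(tweets) if tweets else 0.0
    return url_fraction, shares


# Hypothetical sample, only to show the shape of the output.
sample = [
    "reading this http://bit.ly/abc123",
    "no link here",
    "two links http://bit.ly/xyz and http://tinyurl.com/q1",
    "lunch time",
]
fraction, shares = shortener_shares(sample)
print(f"{fraction:.0%} of tweets contain a URL")
for host, share in shares.items():
    print(f"{share:.2%} {host}")
```

Grouping by host name is also why the disclaimer above applies: any domain that shows up often enough lands in the list, whether or not it is actually a shortener.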


PubSubHubbub: a 1987 idea with HTTP/XML and Peer-to-Peer Sprinkled on It

If you read tech blogs like ReadWriteWeb or TechCrunch you have probably heard of PubSubHubbub, a distributed publishing method recently announced by folks at Google.

Tech bloggers are going crazy about it and have written thousands of posts without really knowing what it is and who will benefit from it. It is one of the buzzwords of the moment and nobody wants to miss out on it.

But anybody who studied Computer Science in college will probably remember the Publish-Subscribe model from some of the introductory classes. When simple poll models, in which whoever is interested in new data constantly asks for it, are too expensive or not scalable, everybody switches to a push model, in which whoever is interested in the data (the subscriber) lets the creator (the publisher) know and receives updates whenever there is something new. This was invented in 1987.

Add some HTTP/XML, sprinkle some ideas from peer-to-peer systems, and more than 20 years later you have PubSubHubbub.

Seriously, that is the idea. The publisher picks some hubs and lets them know that it will be publishing something. When it has some updates, it pings them (with an HTTP POST) to let them know. Once alerted, the hubs go fetch the full content and distribute it to all the subscribers who previously registered with the hub for that particular feed.
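The flow just described can be sketched in a few lines. This is an in-memory toy, not the PubSubHubbub protocol itself: there is no HTTP, the ping is a plain method call, and the class, method, and feed names are all invented for illustration.

```python
from collections import defaultdict


class Hub:
    """In-memory stand-in for a hub; a real hub would speak HTTP."""

    def __init__(self, fetch):
        self._fetch = fetch                     # function: topic URL -> current content
        self._subscribers = defaultdict(list)   # topic URL -> list of callbacks

    def subscribe(self, topic, callback):
        """A subscriber registers interest in one feed."""
        self._subscribers[topic].append(callback)

    def ping(self, topic):
        """The publisher signals an update; the hub fetches and fans out."""
        content = self._fetch(topic)
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(topic, content)
        return len(callbacks)


# Hypothetical feed store standing in for the publisher's site.
feeds = {"http://example.com/feed": "post #1"}
hub = Hub(fetch=feeds.get)

inbox_a, inbox_b = [], []
hub.subscribe("http://example.com/feed", lambda topic, content: inbox_a.append(content))
hub.subscribe("http://example.com/feed", lambda topic, content: inbox_b.append(content))

feeds["http://example.com/feed"] = "post #2"    # the publisher updates its feed...
delivered = hub.ping("http://example.com/feed") # ...and pings the hub
print(delivered, inbox_a, inbox_b)
```

Note that the publisher’s site is fetched once per update regardless of how many subscribers there are; the fan-out cost lands entirely on the hub.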

Yes, it will make your little blog scale better since people will crush the Hub and not your site (but isn’t it on WordPress/Blogger servers anyway, so why do you care?), but how many blogs/publishers out there have this kind of problem? And if they are important enough to receive that much traffic, shouldn’t they actually think about that as a business?

Finally, why isn’t anybody talking about the Hubs? The system is “simple” for publishers and subscribers, but who designs, runs and maintains the Hubs? If a Hub goes down, all its subscribers lose the updates (yes, they can go to another Hub), so those systems need to be redundant and scalable (they have to download the content and distribute it).

The only interest one can have in creating and maintaining a public/free Hub (as Google has done) is to get a hold of the data. Instead of crawling millions of blogs (publishers) to check if they have been updated, they will let you know. At the same time, you will know who (subscribers) is interested in what, and as Google has shown in the past years, that is pretty handy information.
