More data usually beats better algorithms

Friday, April 4th, 2008

Anand Rajaraman teaches a data mining class at Stanford, and he has found that more data usually beats better algorithms:

Team A came up with a very sophisticated algorithm using the Netflix data. Team B used a very simple algorithm, but they added in additional data beyond the Netflix set: information about movie genres from the Internet Movie Database (IMDB). Guess which team did better?

Team B got much better results, close to the best results on the Netflix leaderboard!
[...]
Another fine illustration of this principle comes from Google. Most people think Google’s success is due to their brilliant algorithms, especially PageRank. In reality, the two big innovations that Larry and Sergey introduced, that really took search to the next level in 1998, were:

  1. The recognition that hyperlinks were an important measure of popularity — a link to a webpage counts as a vote for it.
  2. The use of anchortext (the text of hyperlinks) in the web index, giving it a weight close to the page title.

First generation search engines had used only the text of the web pages themselves. The addition of these two additional data sets — hyperlinks and anchortext — took Google’s search to the next level. The PageRank algorithm itself is a minor detail — any halfway decent algorithm that exploited this additional data would have produced roughly comparable results.

The same principle also holds true for another area of great success for Google: the AdWords keyword auction model. Overture had previously proved that the model of having advertisers bid for keywords could work. Overture ranked advertisers for a given keyword based purely on their bids. Google added some additional data: the clickthrough rate (CTR) on each advertiser’s ad. Thus, to a first approximation, Google ranks advertisers by the product of their bid and their CTR (this was true in the first version of AdWords; they now use more considerations). This simple change made Google’s ad marketplace much more efficient than Overture’s. Notice that the algorithm itself is quite simple; it is the addition of the new data that made the difference.

Leave a Reply