This Psychologist Might Outsmart the Math Brains Competing for the Netflix Prize

Thursday, February 28th, 2008


Many of the contestants begin, like Cinematch does, with something called the k-nearest-neighbor algorithm — or, as the pros call it, kNN. This is what Amazon.com uses to tell you that “customers who purchased Y also purchased Z.” Suppose Netflix wants to know what you’ll think of Not Another Teen Movie. It compiles a list of movies that are “neighbors” — films that received a high score from users who also liked Not Another Teen Movie and films that received a low score from people who didn’t care for that Jaime Pressly yuk-fest. It then predicts your rating based on how you’ve rated those neighbors. The approach has the advantage of being quite intuitive: If you gave Scream five stars, you’ll probably enjoy Not Another Teen Movie.
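To make the neighbor idea concrete, here is a minimal sketch of an item-based kNN predictor. It is not Cinematch or any contestant's code: the toy ratings, the names `item_similarity` and `predict`, and the choice of cosine similarity are all assumptions made for illustration.

```python
from math import sqrt

# Hypothetical toy data (user -> {movie: stars}); not taken from the Netflix dataset.
ratings = {
    "ann": {"Scream": 5, "Not Another Teen Movie": 4, "Children of Men": 2},
    "bob": {"Scream": 4, "Not Another Teen Movie": 5, "Children of Men": 1},
    "cat": {"Scream": 1, "Not Another Teen Movie": 2, "Children of Men": 5},
    "dan": {"Scream": 5, "Children of Men": 2},  # has not rated the target movie
}

def item_similarity(a, b):
    """Cosine similarity between two movies, over users who rated both."""
    common = [u for u, r in ratings.items() if a in r and b in r]
    if not common:
        return 0.0
    dot = sum(ratings[u][a] * ratings[u][b] for u in common)
    norm_a = sqrt(sum(ratings[u][a] ** 2 for u in common))
    norm_b = sqrt(sum(ratings[u][b] ** 2 for u in common))
    return dot / (norm_a * norm_b)

def predict(user, movie, k=2):
    """Similarity-weighted average of the user's ratings on the k nearest movies."""
    seen = ratings[user]
    neighbors = sorted(
        ((item_similarity(movie, other), stars)
         for other, stars in seen.items() if other != movie),
        reverse=True,
    )[:k]
    total = sum(sim for sim, _ in neighbors)
    if total == 0:
        return None  # no usable neighbors
    return sum(sim * stars for sim, stars in neighbors) / total

print(predict("dan", "Not Another Teen Movie", k=2))
```

In this toy data Scream is the closest neighbor of Not Another Teen Movie, so dan's five stars for Scream pull his predicted rating up, while his two stars for the less similar Children of Men pull it down by a smaller amount.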

BellKor uses kNN, but it also employs more abstruse algorithms that identify dimensions along which movies, and movie watchers, vary. One such scale would be “highbrow” to “lowbrow”; you can rank movies this way, and users too, distinguishing between those who reach for Children of Men and those who prefer Children of the Corn.

Of course, this system breaks down when applied to people who like both of those movies. You can address this problem by adding more dimensions — rating movies on a “chick flick” to “jock movie” scale or a “horror” to “romantic comedy” scale. You might imagine that if you kept track of enough of these coordinates, you could use them to profile users’ likes and dislikes pretty well. The problem is, how do you know the attributes you’ve selected are the right ones? Maybe you’re analyzing a lot of data that’s not really helping you make good predictions, and maybe there are variables that do drive people’s ratings that you’ve completely missed.
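The effect of adding a dimension can be sketched in a few lines. The coordinates below are invented for illustration, and the `score` function (a dot product between a user's tastes and a movie's coordinates) is an assumed, simplified stand-in for whatever the real systems compute.

```python
# Hypothetical hand-picked coordinates in [-1, 1]; the numbers are invented.
# axis 0: lowbrow (-1) to highbrow (+1)
# axis 1: romantic comedy (-1) to horror (+1)
movies = {
    "Children of Men": (0.8, 0.3),
    "Children of the Corn": (-0.8, 0.9),
}

def score(user_taste, movie, dims):
    """Dot product of user tastes and movie coordinates over the first `dims` axes.

    Positive means the user is predicted to like the movie.
    """
    coords = movies[movie]
    return sum(user_taste[d] * coords[d] for d in range(dims))

# A hypothetical viewer: mildly highbrow, strongly drawn to dark, scary films.
fan = (0.1, 1.0)

for dims in (1, 2):
    print(dims, {title: round(score(fan, title, dims), 2) for title in movies})
```

With only the highbrow axis, the two movies sit on opposite sides of zero, so no single taste value can score both positively. Once the second axis is added, this viewer's appetite for dark material is enough to lift both.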

BellKor (along with lots of other teams) deals with this problem by means of a tool called singular value decomposition, or SVD, that determines the best dimensions along which to rate movies. These dimensions aren’t human-generated scales like “highbrow” versus “lowbrow”; typically they’re baroque mathematical combinations of many ratings that can’t be described in words, only in pages-long lists of numbers. At the end, SVD often finds relationships between movies that no film critic could ever have thought of but that do help predict future ratings.
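As a rough illustration of how such a dimension can be computed rather than chosen, the sketch below finds the single strongest dimension of a tiny ratings table using power iteration, a textbook way to get the largest singular value and its vectors. The data and the function name `top_singular_triplet` are assumptions; contest entries worked with a huge, mostly empty matrix and used more elaborate variants, so treat this as a toy.

```python
from math import sqrt

# Hypothetical ratings minus 3 stars; rows are users, columns are
# (Scream, Not Another Teen Movie, Children of Men).
A = [
    [2, 1, -1],
    [1, 2, -2],
    [-2, -1, 2],
    [2, 1, -1],
]

def top_singular_triplet(A, iterations=200):
    """Power iteration on A^T A; returns (sigma, u, v) for the largest singular value."""
    rows, cols = len(A), len(A[0])
    v = [1.0] + [0.0] * (cols - 1)  # arbitrary start; must not be orthogonal to the answer
    for _ in range(iterations):
        Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
        w = [sum(A[i][j] * Av[i] for i in range(rows)) for j in range(cols)]
        norm = sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    Av = [sum(A[i][j] * v[j] for j in range(cols)) for i in range(rows)]
    sigma = sqrt(sum(x * x for x in Av))
    u = [x / sigma for x in Av]
    return sigma, u, v

sigma, u, v = top_singular_triplet(A)
print("movie coordinates:", [round(x, 2) for x in v])
print("user coordinates: ", [round(x, 2) for x in u])
```

Nobody told the procedure what the dimension means. It simply places the two teen comedies at one end and Children of Men at the other, with each user positioned according to which end they favored.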

The danger is that it’s all too easy to find apparent patterns in what’s really random noise. If you use these mathematical hallucinations to predict ratings, you fail. Avoiding that disaster — called overfitting — is a bit of an art; and being very good at it separates masters like BellKor from the rest of the field.
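One simple, widely used guard against overfitting is shrinkage: pull any estimate built on little data back toward a safe default. The sketch below applies it to a movie's average rating. It illustrates the general idea only and is not a description of BellKor's actual method; the function name and the `strength` value are assumptions.

```python
def shrunk_mean(stars, global_mean, strength=5.0):
    """Average rating pulled toward global_mean.

    Behaves as if `strength` extra ratings equal to the global mean had been
    cast, so movies with few ratings move a lot and well-rated movies barely move.
    """
    return (sum(stars) + strength * global_mean) / (len(stars) + strength)

GLOBAL_MEAN = 3.6  # hypothetical average over all movies

obscure = [5]                     # a single five-star rating
popular = [5] * 100 + [4] * 100   # 200 ratings averaging 4.5

print(round(shrunk_mean(obscure, GLOBAL_MEAN), 2))
print(round(shrunk_mean(popular, GLOBAL_MEAN), 2))
```

A raw average would rank the obscure movie above the popular one on the strength of one vote, which is exactly the kind of pattern in noise the paragraph above warns about. After shrinkage the obscure movie lands near the global mean while the popular one stays close to its raw 4.5.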

In other words: The computer scientists and statisticians at the top of the leaderboard have developed elaborate and carefully tuned algorithms for representing movie watchers by lists of numbers, from which their tastes in movies can be estimated by a formula. Which is fine, in Gavin Potter’s view — except people aren’t lists of numbers and don’t watch movies as if they were.

Potter likes to use what psychologists know about human behavior. “The fact that these ratings were made by humans seems to me to be an important piece of information that should be and needs to be used,” he says. [...] One such phenomenon is the anchoring effect, a problem endemic to any numerical rating scheme. If a customer watches three movies in a row that merit four stars — say, the Star Wars trilogy — and then sees one that’s a bit better — say, Blade Runner — they’ll likely give the last movie five stars. But if they started the week with one-star stinkers like the Star Wars prequels, Blade Runner might get only a 4 or even a 3. Anchoring suggests that rating systems need to take account of inertia — a user who has recently given a lot of above-average ratings is likely to continue to do so. Potter finds precisely this phenomenon in the Netflix data; and by being aware of it, he’s able to account for its biasing effects and thus more accurately pin down users’ true tastes.
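The article does not say how Potter models this, so the following is only a guess at the simplest possible correction: assume each rating is nudged by some fraction of how far the user's recent ratings sit from their usual average, then subtract that nudge. The names `recent_offset` and `debiased`, the three-rating window, and the `carryover` value of 0.3 are all hypothetical.

```python
def recent_offset(history, user_mean, window=3):
    """How far the user's last few ratings sit above or below their usual mean."""
    recent = history[-window:]
    if not recent:
        return 0.0
    return sum(recent) / len(recent) - user_mean

def debiased(observed, history, user_mean, carryover=0.3, window=3):
    """Remove the assumed inertia effect from an observed rating."""
    return observed - carryover * recent_offset(history, user_mean, window)

USER_MEAN = 3.5  # hypothetical long-run average for this user

# Blade Runner rated right after three four-star movies...
after_good_run = debiased(5, [4, 4, 4], USER_MEAN)
# ...versus right after three one-star movies.
after_bad_run = debiased(4, [1, 1, 1], USER_MEAN)

print(round(after_good_run, 2), round(after_bad_run, 2))
```

Under these assumptions the two observed ratings, 5 and 4, come out much closer together once the recent streak is accounted for, which is the kind of adjustment the anchoring effect calls for.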
