Spam Goes Literary

Tuesday, August 8th, 2006

“Spam Goes Literary” notes the latest spam tactic: padding ads with passages from classic works, like those posted on Project Gutenberg. People have complained to Greg Newby, the director of Project Gutenberg:

“No, we don’t send spam,” he says. “We’re not doing anything other than trying to give away good literature.”

A better person to blame (or thank) would be Paul Graham. He’s not a spammer; he’s a programmer famous for creating one of the first really good spam filters.

In 2002, he was trying to write a little program to separate spam from ordinary e-mail. It did what you’d expect; it looked for keywords like “click” as in “click here to buy our product.” Graham says the results were less than spectacular.

“For one thing, spammers could just replace the ‘I’ in click with a ‘1’ and you’d be out of luck,” he says. “And they did in fact start doing that.”
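To see why the keyword approach breaks down, here is a minimal sketch of such a filter. The keyword list and the function name `keyword_filter` are assumptions made for illustration; the only keyword the article mentions is “click.”

```python
# Hypothetical keyword list; the article only mentions "click".
KEYWORDS = {"click"}


def keyword_filter(message: str) -> bool:
    """Return True if any word in the message is a known spam keyword."""
    words = message.lower().split()
    # Strip common punctuation so "click!" still matches "click".
    return any(word.strip(".,!?\"'") in KEYWORDS for word in words)


caught = keyword_filter("Click here to buy our product")
missed = keyword_filter("Cl1ck here to buy our product")
```

With this sketch the first message is flagged and the second is not, because “cl1ck” is not on the list. Every new misspelling would need a new hand-written rule.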

Graham tried something different. He wrote a program to find out how to best separate spam from real e-mail. To train it, he fed it a good helping of spam and a separate sample of real e-mail.

The program looked at each word and counted how many times it appeared in spam or legitimate mail. It found, for instance, that words like “lunch” tend to be in legitimate e-mails. And words like “Viagra” or “cl1ck” are more likely to be in spam.
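The counting step might look something like the sketch below. The sample messages, the tokenizer, and the names `tokenize` and `count_words` are all invented for illustration; this is not Graham’s code.

```python
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def count_words(messages: list[str]) -> Counter:
    """Count how many times each token appears across a set of messages."""
    counts = Counter()
    for message in messages:
        counts.update(tokenize(message))
    return counts


# Tiny made-up training samples, one pile of spam and one of real mail.
spam_messages = ["Cl1ck here to buy Viagra", "Buy Viagra now, cl1ck here"]
ham_messages = ["Lunch at noon tomorrow?", "Are we still on for lunch?"]

spam_counts = count_words(spam_messages)
ham_counts = count_words(ham_messages)
```

Here “cl1ck” and “viagra” show up only in `spam_counts`, while “lunch” shows up only in `ham_counts`. The filter learns “cl1ck” from the mail itself, so nobody has to anticipate the misspelling.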

“This was 50 lines of code,” Graham said. “It took me a day to write.”

He ran this simple filter on his incoming e-mail. It evaluated all the words in each e-mail, and calculated an overall probability that the e-mail was spam.
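One common way to turn per-word counts into an overall probability is a naive Bayes style calculation, sketched below. This is a generic textbook version with add-one smoothing and even prior odds, both assumptions of this sketch; it is not necessarily the formula Graham used. The training messages and function names are invented for illustration.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lowercase the text and split it into word-like tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())


def train(messages: list[str]) -> Counter:
    """Count, for each token, how many messages contain it."""
    counts = Counter()
    for message in messages:
        counts.update(set(tokenize(message)))
    return counts


def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham):
    """Combine per-word evidence into one probability that text is spam."""
    log_odds = 0.0  # assumption: spam and real mail are equally likely a priori
    for word in set(tokenize(text)):
        # Add-one smoothing keeps unseen words from producing zero probabilities.
        p_word_given_spam = (spam_counts[word] + 1) / (n_spam + 2)
        p_word_given_ham = (ham_counts[word] + 1) / (n_ham + 2)
        log_odds += math.log(p_word_given_spam / p_word_given_ham)
    return 1 / (1 + math.exp(-log_odds))


# Tiny made-up training samples.
spam_messages = ["cl1ck here to buy viagra", "buy viagra now cl1ck here"]
ham_messages = ["lunch at noon tomorrow", "are we still on for lunch"]
spam_counts = train(spam_messages)
ham_counts = train(ham_messages)

p_spam = spam_probability("cl1ck to buy viagra", spam_counts, ham_counts,
                          len(spam_messages), len(ham_messages))
p_ham = spam_probability("lunch tomorrow", spam_counts, ham_counts,
                         len(spam_messages), len(ham_messages))
```

On this toy data the first message scores close to 1 and the second well below 0.5. A real filter would train on thousands of messages and choose a cutoff above which mail is treated as spam.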

Remarkably, it caught more than 99 percent of new spam, and let all his real e-mail through.

“I was so delighted,” Graham said. “It got practically all my spam the first time try.”

And this is why the spammers have had to resort to literature. Filters like the one Graham wrote are everywhere now. In order to get past them, spammers try to make the text of their e-mails look more like something you’d actually write.
