16 Aug 2002 raph   » (Master)

Paul Graham and Spam

xach figured it out: the person analyzing spam is, indeed, Paul Graham. In fact, he has a new piece up.

I'm writing something more detailed about trust now, filling out some email conversation that we've started. However, I'm struck by one feature of his new scheme, so I'll just blog my response here:

Very good. I do believe you're on to something here. The idea of everybody building their own corpus _is_ powerful. Your analysis makes sense to me, and I do believe your tool will do better than most.

I take issue with one thing, though: your assertion that probabilities are superior to scores. Most people not trained in statistics will find adding up scores more transparent than computing Bayesian probabilities. Further, you've got an awful lot of voodoo constants in there: 2x, 0.01, 0.99, 0.20, 0.9. Lastly, the Bayesian probability computation assumes all the probabilities are independent (if I recall my stats correctly), which is definitely not valid in this application.

In fact, I do believe that your probabilities and SpamAssassin-like scores are equivalent. Use the transform: score = log p - log (1-p), or p = exp(score) / (1 + exp(score)). Take the 15 scores with greatest absolute values (see, already a more intuitive formulation), and simply add them. Your voodoo probabilities of 0.01, 0.99, 0.20, and 0.9 are now scores of -4.6, 4.6, -1.4, and 2.2, respectively.
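
The equivalence can be checked numerically. The sketch below is only an illustration, not code from either tool. It assumes the probabilities are combined with the usual naive Bayes product rule, prod(p) / (prod(p) + prod(1-p)), and all function names are made up for the example.

```python
import math

def prob_to_score(p):
    # score = log p - log (1-p), using natural logs
    return math.log(p) - math.log(1.0 - p)

def score_to_prob(score):
    # inverse transform: p = exp(score) / (1 + exp(score))
    return math.exp(score) / (1.0 + math.exp(score))

def combine_probs(probs):
    # Assumed combining rule (naive Bayes, independent evidence):
    # prod(p) / (prod(p) + prod(1 - p))
    spam = 1.0
    ham = 1.0
    for p in probs:
        spam *= p
        ham *= 1.0 - p
    return spam / (spam + ham)

def top_scores(scores, n=15):
    # keep the n scores with the greatest absolute values
    return sorted(scores, key=abs, reverse=True)[:n]

def combine_scores(scores, n=15):
    # the "score" formulation: just add the strongest n scores
    return sum(top_scores(scores, n))

# The four probability constants, converted to scores
constants = [0.01, 0.99, 0.20, 0.9]
print([round(prob_to_score(p), 1) for p in constants])

# Both routes give the same overall probability
scores = [prob_to_score(p) for p in constants]
print(combine_probs(constants), score_to_prob(combine_scores(scores)))
```

Under that assumption the two printed probabilities agree up to floating-point error, so adding scores and multiplying probabilities are the same computation in different units.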

Perhaps the best way to look at this is that Bayesian probability can give theoretical justification to a "score" system. That might be interesting to some.
