JoeEmail - 0.9.107


How it works

This program is an attempt to implement Paul Graham's "A Plan for Spam". It's a pretty slick idea, and it works far better than anything else I've tried. Any errors in the implementation are all mine!

An incoming email is broken into "tokens" (i.e. words). Each token is looked up in a database that records how many times it has occurred in spam and in non-spam messages. From these counts and the total numbers of spam and non-spam messages, the program computes the probability that a message containing that token is spam. The fifteen most interesting tokens (the ones whose probabilities are furthest from 0.5) are then combined according to Bayes' Rule, which is how you mathematically update probabilities based on new information. (For example, Weatherman A, who is right 90% of the time, says there is an 80% chance of rain tomorrow; Weatherman B, who is right 85% of the time, says there is a 50% chance of rain tomorrow. What would you say the chances of rain are tomorrow? See http://www.mathpages.com/home/kmath267.htm for a detailed explanation.) After applying Bayes' Rule we get a probability* that the message is spam. That number is attached to the email, color-coded, and reported back to the user to deal with as they see fit.
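
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions, not JoeEmail's actual source: the function names, the tokenizer regex, and the dictionary-based "database" are all made up for the example. The constants (doubling the non-spam count, ignoring tokens seen fewer than 5 times, clamping to 0.01..0.99, and 0.4 for unknown tokens) are the values suggested in Graham's essay; JoeEmail's own defaults may differ.

```python
import re
from math import prod


def tokenize(text):
    """Split a message into a set of lowercase tokens (hypothetical tokenizer)."""
    return set(re.findall(r"[a-z0-9$'-]+", text.lower()))


def token_probability(token, spam_counts, ham_counts, n_spam, n_ham):
    """Estimate the probability that a message containing `token` is spam.

    spam_counts / ham_counts map token -> number of occurrences;
    n_spam / n_ham are the total numbers of spam and non-spam messages.
    """
    b = spam_counts.get(token, 0)
    g = 2 * ham_counts.get(token, 0)  # assumption: bias against false positives
    if g + b < 5 or n_spam == 0 or n_ham == 0:
        return 0.4  # assumption: rare or unknown tokens lean slightly innocent
    spam_freq = min(1.0, b / n_spam)
    ham_freq = min(1.0, g / n_ham)
    p = spam_freq / (spam_freq + ham_freq)
    return max(0.01, min(0.99, p))  # never let one token be fully decisive


def spam_probability(text, spam_counts, ham_counts, n_spam, n_ham, top_n=15):
    """Combine the `top_n` most interesting token probabilities."""
    probs = [
        token_probability(t, spam_counts, ham_counts, n_spam, n_ham)
        for t in tokenize(text)
    ]
    # "Interesting" means furthest from the neutral value 0.5.
    probs.sort(key=lambda p: abs(p - 0.5), reverse=True)
    interesting = probs[:top_n]
    spam_like = prod(interesting)
    ham_like = prod(1.0 - p for p in interesting)
    return spam_like / (spam_like + ham_like)
```

With a toy database in which "viagra" appears only in spam and "meeting" only in non-spam, a message containing "viagra" scores close to 1 and a message containing only "meeting" scores close to 0. A message with no tokens at all comes out at exactly 0.5.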

JoeEmail gives you some control over the parameters in the algorithm; expect to see more as time goes on. There are also options for automatically deleting spam once you have a "mature" database that is working well for you. See Program Help for details on how to use these features.

*There is some controversy over whether this really is "Bayesian"; see this article by Gary Robinson if you are concerned about it. My $0.02 is that Graham really does use Bayes' Rule, but the simple form that assumes the probabilities of the tokens are independent. If this were a federally funded scientific study on which lives depended, that wouldn't be a good assumption. However, we are just trying to filter out spam, and experience so far has shown me that the assumptions and biases in Graham's approach are plenty good enough (both to catch spam and to provide a safety factor against false positives). Robinson also complains that what is used isn't a real probability, and gives several suggestions for how to make it one and generally improve the accuracy. Well, he has a point. As I build in more user control of the spam algorithm I'll likely include some of these as options.
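
To make the independence point concrete, here is a tiny sketch of the combining formula, assuming equal prior odds of spam and non-spam. The function name `combine` is made up for the example. It shows why the result is not a "real" probability when tokens are correlated: counting the same piece of evidence twice makes the filter more confident than it should be.

```python
def combine(probabilities):
    """Combine per-token spam probabilities as if they were independent."""
    spam_like = 1.0
    ham_like = 1.0
    for p in probabilities:
        spam_like *= p
        ham_like *= 1.0 - p
    return spam_like / (spam_like + ham_like)


# One 0.9 clue gives a combined score of 0.9.
one_clue = combine([0.9])

# Two tokens that always appear together carry no extra information,
# yet treating them as independent pushes the score to about 0.988.
same_clue_twice = combine([0.9, 0.9])
```

For sorting mail into "probably spam" and "probably not", this kind of overconfidence mostly just pushes scores toward 0 or 1, which is part of why the simple approach works well enough in practice.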