Bayesian: The Bag of Words

Ever wonder how your email filter determines which emails are junk and which are not? If I told you it came down to statistics, you might be surprised. You might even want to leave this article now because I just said statistics, but I promise you there is great fun in store. Plus, after seeing it applied here, you may find it useful for other things in life, or at least I hope you do.

What is the Bayesian?
What we are really aiming for here is an application of Bayes' theorem. Bayes' theorem states that the probability of A given B is equal to the probability of B given A times the probability of A, divided by the probability of B, or P(A|B) = (P(B|A)*P(A))/P(B). So you ask, what is that going to tell me? It tells you that the conditional probability of event A given B is related to the converse conditional probability of B given A. In plain English, you get the probability of an event occurring by taking into consideration the prior, the likelihood, and the evidence.

Filtering Spam
Now, personally I think this is a cool application for the theorem and it isn’t overly complicated.

First, let’s define a few things:

P(S) – Is the probability that the message is spam.
P(H) – Is the probability that the message is ham (or the good stuff.)
P(S|W) – Is the probability that a message containing the word W is spam.
P(W|S) – Is the probability that the word W appears in a message, given that the message is spam.
P(W|H) – Is the probability that the word W appears in a message, given that the message is ham.

And then our formula becomes:

P(S|W) = (P(W|S)*P(S))/(P(W|S)*P(S)+P(W|H)*P(H))
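
To make the formula concrete, here is a minimal Python sketch of it. The function name spam_given_word, its parameter names, and the sample numbers are my own for illustration, and it assumes P(H) = 1 - P(S), meaning every message is either spam or ham.

```python
def spam_given_word(p_w_spam, p_w_ham, p_spam=0.5):
    """Return P(S|W) given P(W|S), P(W|H), and the prior P(S).

    Assumes every message is either spam or ham, so P(H) = 1 - P(S).
    """
    p_ham = 1.0 - p_spam
    numerator = p_w_spam * p_spam
    denominator = numerator + p_w_ham * p_ham
    if denominator == 0:
        # The word was never seen in either class, so it carries no
        # evidence. Returning 0.5 is an assumption, not part of the formula.
        return 0.5
    return numerator / denominator


# Hypothetical numbers: W appears in 40% of spam and 10% of ham.
print(spam_given_word(0.4, 0.1))             # equal priors
print(spam_given_word(0.4, 0.1, p_spam=0.7))  # prior that favors spam
```

With equal priors the example works out to 0.2/(0.2+0.05) = 0.8, and raising the spam prior pushes the result higher.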

Next, we need to talk about how sensitive our filter is going to be in determining what is spam and what isn't, which comes down to the values we pick for P(S) and P(H). There are current statistics that show a majority of email is spam, so we could adjust P(H) and P(S) accordingly. A good estimate would be 30/70, that is, P(H) = 0.3 and P(S) = 0.7. Here we are going to assume that there is no bias between the two and set both P(S) and P(H) to 0.5. This simplifies our equation to:

P(S|W) = P(W|S)/(P(W|S)+P(W|H))

The 0.5 drops out because it is a common factor of both the numerator and the denominator.

Last, the system needs to be trained, and this can be the most tedious part of creating such a filter. The training samples must be real emails, not pre-generated ones. A person must then decide whether or not each email is spam. Once the decision is made, the words in that email need to be sent to a database and noted as spam or ham. This process is repeated for each email, and the counts for each word are aggregated. Once training has been completed, the probability of occurrence for each word is calculated from those counts.
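
The training loop described above might look something like the following sketch. A few things here are my assumptions rather than anything prescribed by the method: an in-memory pair of Counter objects stands in for the database, each word is counted at most once per email, the tokenize and train helpers are hypothetical names, and the four sample messages are made up purely for illustration.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its distinct words (a simple assumption)."""
    return set(re.findall(r"[a-z']+", text.lower()))


def train(labeled_messages):
    """Build a word -> P(S|W) table from (text, is_spam) pairs.

    Uses the equal-prior form P(S|W) = P(W|S) / (P(W|S) + P(W|H)).
    """
    spam_counts, ham_counts = Counter(), Counter()
    n_spam = n_ham = 0
    for text, is_spam in labeled_messages:
        words = tokenize(text)
        if is_spam:
            spam_counts.update(words)
            n_spam += 1
        else:
            ham_counts.update(words)
            n_ham += 1

    table = {}
    for word in set(spam_counts) | set(ham_counts):
        # Fraction of spam (or ham) emails that contain the word.
        p_w_s = spam_counts[word] / n_spam if n_spam else 0.0
        p_w_h = ham_counts[word] / n_ham if n_ham else 0.0
        table[word] = p_w_s / (p_w_s + p_w_h)
    return table


# Made-up, hand-labeled sample messages (True means spam).
messages = [
    ("win free money now", True),
    ("free prize inside", True),
    ("lunch meeting at noon", False),
    ("free lunch on friday", False),
]
table = train(messages)
print(table["free"], table["money"], table["lunch"])
```

A real filter would need far more than four emails, and words seen in only one class end up at exactly 0 or 1, which is one reason larger training sets matter.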

Running the Filter
To run the filter, the probability calculated for each word during training is applied to every word in the email. Once those probabilities have been aggregated, the filter gives you a single number for the spamicity of the email. If the number is below your threshold, you likely have a legitimate email. If the number is above the threshold, you have spam. Could it be that simple?
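
Here is one way the scoring step could be sketched in Python. The article does not say how the per-word probabilities are aggregated, so this assumes the common naive Bayes combination, which treats words as independent. The word table, the 0.9 threshold, the 0.5 score for unknown words, and the clamping to the range 0.01 to 0.99 are all illustrative assumptions, and math.prod needs Python 3.8 or later.

```python
import math
import re


def spamicity(text, word_probs, unknown=0.5):
    """Combine per-word P(S|W) values into one score for the whole email.

    Assumed combination rule (naive Bayes, words treated as independent):
        p = (p1*...*pn) / (p1*...*pn + (1-p1)*...*(1-pn))
    """
    words = set(re.findall(r"[a-z']+", text.lower()))
    # Clamp so a single word at exactly 0 or 1 cannot decide the result
    # alone (the clamp bounds are an assumption).
    probs = [min(max(word_probs.get(w, unknown), 0.01), 0.99) for w in words]
    spam_side = math.prod(probs)
    ham_side = math.prod(1.0 - p for p in probs)
    return spam_side / (spam_side + ham_side)


def is_spam(text, word_probs, threshold=0.9):
    """Flag the email when its spamicity is above the threshold."""
    return spamicity(text, word_probs) > threshold


# Hypothetical trained table of P(S|W) values.
word_probs = {"free": 0.67, "money": 0.95, "lunch": 0.05, "meeting": 0.10}

print(is_spam("Free money", word_probs))
print(is_spam("Lunch meeting", word_probs))
```

For long emails the products can underflow toward zero, so a sturdier version would sum logarithms instead of multiplying raw probabilities.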