The contents of this write-up are based largely on the work of Paul Graham, one of the pioneers of using Bayesian filters to separate spam from non-spam e-mails. The two essays of his that I am basing most of this on are:
"A Plan for Spam" - http://www.paulgraham.com/spam.html
"Better Bayesian Filtering" - http://paulgraham.com/better.html
Note, however, that I am not just reiterating the above articles but, hopefully, explaining them more fully and in a more accessible way (the articles, for example, use code samples in LISP, a language clearly not made for the mortal mind), as well as adding my own commentary on spam filtering and the issues surrounding it.
To start off we should first define what we choose to call spam. I am going to call spam any e-mail that I am sent, that has been sent automatically, and that I do not want. I make the distinction of it having been sent automatically since I get e-mail from individual people that I do not want, yet it is not spam. This definition also, potentially, includes e-mails from so-called 'opt-in' lists. In many cases I am subscribed to those lists simply because I forgot to check or uncheck a box, not because I actually wanted those e-mails, so I'd rather that they disappeared from my sight too. I know that this does not follow a technical or (potential) legal definition of spam, but I feel confident that it is a fairly good 'practical definition' of spam.
Currently our most effective anti-spam solutions are based on filters that recognize features within the e-mail. One of the best known of these filtering programs is SpamAssassin. Basically, these programs work by looking for certain known strings within an e-mail. For example, this is the report that came with a recent piece of spam that I received:
SPAM: -------------------- Start SpamAssassin results ------------------
SPAM: The WELL is running spam tagging software called SpamAssassin.
SPAM: This software has flagged this message as probable spam. For more
SPAM: information on SpamAssassin on The WELL, and how to configure it
SPAM: to work for you, please go to http://www.well.com/help/spam/
SPAM:
SPAM: Content analysis details: (8.50 hits, 6 required)
SPAM: CUM_SHOT (2.3 points) BODY: Possible porn - Cum Shot
SPAM: BEST_PORN (0.5 points) BODY: Possible porn - Best, Largest Porn Collections
SPAM: EXTRA_CASH (0.4 points) BODY: Offers Extra Cash
SPAM: HOT_NASTY (0.4 points) BODY: Possible porn - Hot, Nasty, Wild, Young
SPAM: CLICK_BELOW (0.3 points) BODY: Asks you to click below
SPAM: SPAM_PHRASE_03_05 (1.1 points) BODY: Spam phrases score is 03 to 05 (medium)
SPAM: score: 3
SPAM: BIG_FONT (0.3 points) BODY: FONT Size +2 and up or 3 and up
SPAM: HTML_FONT_COLOR_RED (0.3 points) BODY: HTML font color is red
SPAM: LINK_TO_NO_SCHEME (1.3 points) BODY: Contains link without http:// prefix
SPAM: CLICK_HERE_LINK (0.3 points) BODY: Tells you to click on a URL
SPAM: MSG_ID_ADDED_BY_MTA_3 (0.9 points) 'Message-Id' was added by a relay (3)
SPAM: CTYPE_JUST_HTML (0.4 points) HTML-only mail, with no text version
SPAM:
SPAM:
SPAM: Please remove this report if reporting this message to SpamCop.
SPAM: -------------------- End of SpamAssassin results -----------------
The e-mail in question was trying to sell me hard-core porn videos.
So we can see that SpamAssassin is finding certain strings in the text or HTML of the e-mail. It assigns a point value to each phrase that has triggered it, and then marks the e-mail as spam because the total point value is greater than a threshold that I have specified.
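To make that scoring model concrete, here is a minimal sketch of how such a rule-based filter works in principle. This is not SpamAssassin's actual implementation; the rule names are borrowed from the report above and the patterns and weights are placeholders of my own.

# Illustrative sketch of rule-based scoring; not SpamAssassin's real code.
# Rule names are taken from the report above; patterns and weights are placeholders.
import re

RULES = [
    ("EXTRA_CASH",      re.compile(r"extra cash", re.I), 0.4),
    ("CLICK_HERE_LINK", re.compile(r"click here", re.I), 0.3),
    ("HTML_FONT_COLOR_RED", re.compile(r'color="?red', re.I), 0.3),
]

def score(body, threshold=6.0):
    """Sum the weights of every rule that matches, then compare to the threshold."""
    hits = [(name, points) for name, pattern, points in RULES if pattern.search(body)]
    total = sum(points for _, points in hits)
    return total, total >= threshold, hits

Every hand-tuned weight in such a rule table is a judgement call by the developers, which is exactly the problem discussed next.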
But there is a large problem with this. What do the points really mean? Why does "Cum Shot" get a score nearly six times higher than "Hot, Nasty"? The problem is that no one really knows what these numbers mean, not even the developers who assigned the values. This e-mail got a score of 8.5, so it failed the spam test and was marked as spam. But let's compare this to the scores of two other e-mails. First, the Everything Daily Report: the EDR gets a spam score of 1.4, well below my threshold of 6, having triggered the following tests: "NO_REAL_NAME, SPAM_PHRASE_00_01, SUBJECT_FREQ". Second, a hand-typed e-mail from my girlfriend: this e-mail gets a score of 3.6, having triggered the following tests: "FROM_ENDS_IN_NUMS, MIME_LONG_LINE_QP, NO_REAL_NAME, SPAM_PHRASE_03_05".
Hold on... let's look at that again... EDR - 1.4, personally written e-mail from my girlfriend - 3.6??? How does a hand-written e-mail get a score over 2.5 times higher than an automated, strictly formatted e-mail sent by E2? This points to a major weakness in this kind of rule-based scoring.
The other two major problems with programs like SpamAssassin are false positives and false negatives. Of the two, false positives are by far the greater problem. If you have effective spam filters then you are unlikely to check your spam folder for real e-mails that may have fallen through the cracks, and one of those could have been an important e-mail that you simply never saw. So it is vital that a filter produces no false positives. So far SpamAssassin has given me perhaps 5 or 10 false positives over the last year and several thousand e-mails, so it's not doing too badly there. False negatives are less important, but still annoying. Currently, for every 10 pieces of spam that I get, about 2 will slip through the filters and end up in my inbox. That is not an acceptable rate.
There is also one problem that applies to all 'feature-recognizing' filters: once you know what the filters are looking for, it's very easy to get e-mail past them. The filters are not dynamic and don't learn, so you end up with an endless cycle of increasingly clever spam against ever more broad-sweeping filters. The situation is fairly similar to that of virus writers against anti-virus software. Ultimately the anti-spam filters will always lag behind the spam.
So what's the solution to this problem?
Well, it's looking as if statistical filtering could provide the ideal solution to these problems. Basically, statistical filtering works by assigning to each e-mail a probability that it is spam: it analyzes the words in the e-mail, looks up the probability of each word appearing in spam, and then uses Bayes's Rule to work out the overall probability that the e-mail is spam.
So let us outline the system that Graham gives in his two articles. First we need a large corpus of spam and one of non-spam. We feed these into a program that counts the number of occurrences of every word (where a 'word' can also be a number such as "$9.99", an IP address, or a URL, and the e-mail headers are included in the data that is parsed) and stores the counts in two tables, one for spam and one for non-spam. We then generate a third table giving the probability that each word appears in a spam e-mail, using the following formula (translated from LISP; all errors are mine). Note that it is more 'valuable' for a word to appear in the non-spam corpus than in the spam corpus, so we double the number of occurrences of any word in the non-spam corpus.
let nbad equal the number of spam e-mails in the corpus
let ngood equal the number of non-spam e-mails in the corpus
for all words w in {list_of_spam_words, list_of_non-spam_words}
let b equal the number of occurrences of w in spam e-mails
let g equal the number of occurrences of w in non-spam e-mails
let p_is_spam equal (b / nbad) / ((2g / ngood) + (b / nbad))
next w
Note that the values of b/nbad and g/ngood are capped at 1. Also note that p_is_spam is clamped to the range 0.001 to 0.999; in other words, it can be at most 0.999 and at least 0.001. The reason for giving more weight to non-spam words is to reduce false positives. We can easily see that if a word regularly appears in both your spam and your non-spam e-mails (if, for example, your SO enjoys e-mailing you kinky stories) then we want to make sure that an e-mail containing that word still has a good chance of being classed as non-spam rather than being thrown straight into the spam box.
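Here is a minimal sketch of this table-building step in Python rather than LISP. The variable names follow the pseudocode above; the whitespace tokenizer and function names are my own simplifying assumptions, not Graham's actual parser.

# Build the per-word spam-probability table from two corpora of e-mails.
# Variable names (nbad, ngood, b, g) follow the pseudocode above; the
# whitespace tokenizer is a simplifying assumption, not Graham's parser.
from collections import Counter

def tokenize(text):
    return text.split()

def build_table(spam_mails, good_mails):
    nbad, ngood = len(spam_mails), len(good_mails)
    bad_counts = Counter(w for mail in spam_mails for w in tokenize(mail))
    good_counts = Counter(w for mail in good_mails for w in tokenize(mail))

    table = {}
    for w in set(bad_counts) | set(good_counts):
        b = min(1.0, bad_counts[w] / nbad)             # capped at 1, as noted above
        g = min(1.0, 2 * good_counts[w] / ngood)       # non-spam occurrences count double
        table[w] = min(0.999, max(0.001, b / (g + b))) # clamp into [0.001, 0.999]
    return table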
So now that we have our nice table of probabilities, how do we use it?
Simple: every e-mail that we now receive is run through a small program. We first take the 15 words that are furthest from a neutral 0.5 chance of appearing in spam (so, for example, 0.2 is just as far away as 0.8); basically we take the 15 highest values of |p_is_spam(w) - 0.5|, where p_is_spam is simply looked up in the table we generated earlier. We then apply Bayes's Rule to these numbers to come up with a probability that the entire e-mail is spam. Since we have taken 15 words, we will label their individual spam probabilities a through o and apply the following formula:
probability that the e-mail is spam = abc...o / (abc...o + (1-a)(1-b)(1-c)...(1-o))
If this final value is over 0.9 then we categorize the e-mail as being spam.
There is one problem remaining, though: how do we classify words that we haven't come across before? Since most spam tends towards the same vocabulary, Graham suggests in his articles using a value of 0.4. This value is sufficiently neutral not to skew the filter too much, yet it takes advantage of the regular vocabulary of spam to give the word the benefit of the doubt and push it slightly towards the non-spam side.
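Putting the classification step into the same sketch (again my own Python rendering under the assumptions above, with 0.4 as the default for unknown words and 0.9 as the spam threshold):

# Classify a message using the probability table built earlier.
UNKNOWN_P = 0.4        # default for words not seen in either corpus
SPAM_THRESHOLD = 0.9

def classify(text, table, n_interesting=15):
    probs = [table.get(w, UNKNOWN_P) for w in set(tokenize(text))]
    # Keep the 15 probabilities furthest from the neutral 0.5.
    interesting = sorted(probs, key=lambda p: abs(p - 0.5), reverse=True)[:n_interesting]

    prod, inv_prod = 1.0, 1.0
    for p in interesting:
        prod *= p               # abc...o
        inv_prod *= (1.0 - p)   # (1-a)(1-b)...(1-o)
    p_spam = prod / (prod + inv_prod)
    return p_spam, p_spam > SPAM_THRESHOLD

To see how sharply the combination formula separates things, note that two tokens with probabilities 0.9 and 0.8 alone already combine to (0.9 x 0.8) / (0.9 x 0.8 + 0.1 x 0.2) = 0.72 / 0.74, or about 0.97, comfortably over the 0.9 threshold.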
One of the most important parts of the process is creating the original two word lists from your spam and non-spam corpora. It is very important, especially for the non-spam corpus, to use your own e-mails, so that words that commonly appear in your own mail can be picked up and correctly classified. For example, someone who does a lot of coding will most likely have words like "C" or "char" appear often in their e-mail, and you want to make sure that those words become good indicators of non-spam.
So how effective is all this?
The only numbers I have to go on for the effectiveness of this technique are the results that Graham has published in his articles. With his first filters (he altered his treatment of certain words afterwards) he claims 99.5% of spam filtered out, with 0.03% false positives, although all the false positives were 'spamesque'. I hope to be able to verify these numbers soon by writing my own filtering software.
Graham notes that once he altered his parsing routines to be more discriminating, the performance of the filters went up (although the improvement from 99.5% to 99.75% seems fairly negligible to me). The improvements he made were to make the parsing routines more fine-grained, for example differentiating between a word appearing in the body of an e-mail and one appearing in the subject line. He also allowed unknown words to degrade until they match a known word. This means that a word such as "Sexy!!!!!!" could degrade down to a known word such as "Sexy!".
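As a rough illustration of the degrading idea: the exact order in which decorations are stripped is not spelled out here, so the chain below (collapse trailing punctuation, drop it, then lower-case) is my own guess, in the same Python sketch as before.

# Strip decorations from an unknown token until it matches a known one.
# The degradation order used here is an assumption for illustration only.
def degrade(token, table):
    bare = token.rstrip("!?.")
    candidates = []
    if bare != token:
        candidates.append(bare + "!")   # "Sexy!!!!!!" -> "Sexy!"
        candidates.append(bare)         # "Sexy!!!!!!" -> "Sexy"
    candidates.append(bare.lower())     # "SEXY" -> "sexy"
    for candidate in candidates:
        if candidate in table:
            return candidate
    return None  # still unknown; the filter falls back to the 0.4 default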
So far this form of probabilistic filtering is showing great promise, yet there are still improvements that can be made. The main problem is e-mails that are written in a non-spam style. Any e-mail that is written in a more natural style (without huge HTML tags, exclamation marks and so on) is far more likely to get through these filters. Graham calls these the "spam of the future". An example of this sort of e-mail is:
From: Cindy
To: #####@well.com
Subject: why havn't you been online lately?
Date: Sun, 9 Mar 2003 22:26:02 GMT
Hi Gorgeous.
Here are the pics I promised
http://www.Cindy.com/breasts
If my father could see me now he would kill me :)
Anyway's I hope the pics turn you on :)
Here are some of my home movies
http://www.Cindy.com/dirtymovies
Yours Truly
Cindy Stellick
Although it is easy to see at a glance that this is spam, the style in which it is written is quite natural and uses words that are more likely to appear in our non-spam corpus of e-mails.
Thankfully this sort of e-mail is harder for spammers to write, and will become increasingly hard to write as statistical filtering takes off. To succeed, it needs to contain enough words from our non-spam corpus to offset any 'spam words' that are included as well, which means the e-mails have to be more personalized towards each person they are sent to.
To conclude, I think that this sort of mail filtering will be the start of a new war on spam, one where we will be a step ahead for the first time. If enough people use effective filtering software, or if ISPs start to use it, then the response rate to spam will hopefully drop so low that it will no longer be economically worthwhile for spammers to hawk their wares in this way.
At last we shall be free...