|
Some words appear more often in spam emails than in regular emails,
and this fact has been used by a number of systems to identify and
filter spam.
To visualize what this kind of filtering picks up on in both spam and non-spam emails, I wrote a little script to colorize my emails word by word, based on each word's spamliness. The spamliness index is computed by the pseudo-Bayesian formula:

  sp = [occurrences in spam] / [spam corpus size]
  np = [occurrences in nonspam] / [nonspam corpus size]
  spamliness = sp / (sp + np), with 0/0 == 0

Each word is assigned a color proportional to its spamliness: nonspam words are black, and spam-only words are red. Examples showing typical spam and nonspam emails side by side are here, here, and here.

I should note that the spam and nonspam corpora were unrelated to the actual test examples: they came from the Ling-Spam corpus [1], which pairs legitimate discussion from a linguistics mailing list with a corpus of spam from 1999.

The program to do this is here. It has some of my directory names in it, but it should be easy to generalize.

[1] I. Androutsopoulos, J. Koutsias, K.V. Chandrinos, G. Paliouras, and C.D. Spyropoulos, "An Evaluation of Naive Bayesian Anti-Spam Filtering". In Potamias, G., Moustakis, V. and van Someren, M. (Eds.), Proceedings of the Workshop on Machine Learning in the New Information Age, 11th European Conference on Machine Learning (ECML 2000), Barcelona, Spain, pp. 9-17, 2000. |
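The spamliness formula and the black-to-red coloring described above can be sketched roughly as follows. This is not the original script, only a minimal illustration under several assumptions: "corpus size" is taken to mean the total number of word tokens in each corpus, the color is a linear blend from black to pure red, the output is HTML `<span>` elements, and the tiny in-line corpora and all function names (`spamliness`, `color_for`, `colorize`) are made up for the example.

```python
from collections import Counter
import html
import string


def spamliness(word, spam_counts, spam_size, nonspam_counts, nonspam_size):
    """sp / (sp + np), with 0/0 treated as 0 (unseen words count as nonspam)."""
    sp = spam_counts[word] / spam_size if spam_size else 0.0
    nsp = nonspam_counts[word] / nonspam_size if nonspam_size else 0.0
    total = sp + nsp
    return sp / total if total else 0.0


def color_for(score):
    """Map a score in [0, 1] to a hex color: 0 -> black, 1 -> pure red (assumed scale)."""
    red = round(255 * score)
    return f"#{red:02x}0000"


def colorize(text, spam_counts, spam_size, nonspam_counts, nonspam_size):
    """Wrap each whitespace-separated word in a span colored by its spamliness."""
    spans = []
    for word in text.split():
        # Look the word up in lowercase with surrounding punctuation removed.
        key = word.lower().strip(string.punctuation)
        score = spamliness(key, spam_counts, spam_size, nonspam_counts, nonspam_size)
        spans.append(
            f'<span style="color:{color_for(score)}">{html.escape(word)}</span>'
        )
    return " ".join(spans)


# Toy corpora, invented for illustration only (not Ling-Spam data).
spam_docs = ["free money now", "win free prizes now"]
nonspam_docs = ["the syntax of free word order", "a workshop on syntax"]

spam_counts = Counter(w for doc in spam_docs for w in doc.split())
nonspam_counts = Counter(w for doc in nonspam_docs for w in doc.split())

# Assumption: corpus size = total token count in that corpus.
spam_size = sum(spam_counts.values())
nonspam_size = sum(nonspam_counts.values())

print(colorize("Free money for syntax fans!",
               spam_counts, spam_size, nonspam_counts, nonspam_size))
```

Dividing by corpus size before taking the ratio keeps a large nonspam corpus from drowning out a small spam corpus (or vice versa), which is why the formula normalizes `sp` and `np` separately rather than comparing raw counts.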