Spam Filter Improvement Through Measurement
Date
2009-04-27T18:24:57Z
Authors
Lynam, Thomas Richard
Publisher
University of Waterloo
Abstract
This work supports the thesis that sound quantitative evaluation for
spam filters leads to substantial improvement in the classification
of email. To this end, new laboratory testing methods and datasets
are introduced, and evidence is presented that their adoption at the Text
REtrieval Conference (TREC) and elsewhere has led to improvements in
state-of-the-art spam filtering. While many of these improvements have been discovered
by others, the best-performing method known at this time -- spam filter
fusion -- was demonstrated by the author.
This work describes four principal dimensions of spam filter evaluation
methodology and spam filter improvement. An initial study investigates
the application of twelve open-source filter configurations in a laboratory
environment, using a stream of 50,000 messages captured from a single
recipient over eight months. The study measures the impact of user
feedback and on-line learning on filter performance using methodology
and measures which were released to the research community as the
TREC Spam Filter Evaluation Toolkit.
The toolkit was used as the basis of the TREC Spam Track, which the
author co-founded with Cormack. The Spam Track, in addition to evaluating
a new application (email spam), addressed the issue of testing systems
on both private and public data. While streams of private messages
are most realistic, they are not easy to come by and cannot be shared
with the research community as archival benchmarks. Using the toolkit,
participant filters were evaluated on both kinds of data, and the differences
were found not to substantially confound evaluation; as a result, public corpora
were validated as research tools. Over the course of TREC and similar
evaluation efforts, a dozen or more archival benchmarks --
some private and some public -- have become available.
The toolkit and methodology have spawned improvements in the state
of the art every year since their deployment in 2005. In 2005, 2006,
and 2007, the Spam Track yielded new best-performing systems based
on sequential compression models, orthogonal sparse bigram features,
logistic regression, and support vector machines. Using the TREC participant
filters, we develop and demonstrate methods for on-line filter fusion
that outperform all other reported on-line personal spam filters.
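As a rough illustration of on-line filter fusion under user feedback, the sketch below combines the spam probabilities of several filters by averaging their log-odds, then reveals the true label to every filter after each message. Log-odds averaging is assumed here only as one simple combination rule; the abstract does not specify the fusion methods used in the thesis. The names TokenCountFilter, fuse, and run_online are hypothetical, and the toy filter stands in for a real participant filter.

```python
import math


def log_odds(p, eps=1e-6):
    """Convert a probability to log-odds, clamping away from 0 and 1."""
    p = min(max(p, eps), 1.0 - eps)
    return math.log(p / (1.0 - p))


def fuse(scores):
    """Average the log-odds of the individual spam probabilities
    and map the mean back to a probability (assumed fusion rule)."""
    mean = sum(log_odds(p) for p in scores) / len(scores)
    return 1.0 / (1.0 + math.exp(-mean))


class TokenCountFilter:
    """Hypothetical toy filter based on smoothed per-token spam/ham counts."""

    def __init__(self):
        self.spam = {}
        self.ham = {}

    def score(self, message):
        total = 0.0
        for tok in set(message.lower().split()):
            s = self.spam.get(tok, 0) + 1  # add-one smoothing
            h = self.ham.get(tok, 0) + 1
            total += math.log(s / h)
        total = max(-30.0, min(30.0, total))  # keep exp() well-behaved
        return 1.0 / (1.0 + math.exp(-total))

    def train(self, message, is_spam):
        counts = self.spam if is_spam else self.ham
        for tok in set(message.lower().split()):
            counts[tok] = counts.get(tok, 0) + 1


def run_online(stream, filters, threshold=0.5):
    """Classify each message with the fused score, then give every
    filter the true label as feedback. Returns the error rate."""
    errors = 0
    total = 0
    for message, is_spam in stream:
        fused = fuse([f.score(message) for f in filters])
        predicted = fused >= threshold
        errors += int(predicted != is_spam)
        total += 1
        for f in filters:
            f.train(message, is_spam)  # on-line learning from feedback
    return errors / total if total else 0.0
```

Each message is scored before its label is revealed, so the error rate reflects performance on unseen mail at every point in the stream, which matches the chronological, feedback-driven setting the abstract describes.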
Keywords
evaluation methodology, spam filtering, spam corpora, spam fusion