[R] Re: Problems for 13 year old
Jim_Garrett@bd.com
Jim_Garrett at bd.com
Mon Jan 27 17:11:03 CET 2003
How about spam filtering?
Granted, there's some infrastructure involved, which means gratification is
not instant. But it involves something that most people who use computers
care about: e-mail, and spam.
I mention this because the following web site sparked some interest in
statistics among some acquaintances who were otherwise very cool to it:
http://www.paulgraham.com/spam.html
This outlines a "Bayesian" spam filter. I'm not sure it's wholly Bayesian,
but it comes close, the author's ideas are good, and I hear that it performs
well, in fact better than many commercial spam filters.
Moreover, the web site virtually gushes about the virtues of statistical
methods. The interesting thing about the filter is that you get to see
what "features" it's discovering.
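
To make the "features" idea concrete, here is a minimal sketch of per-word
spam scoring, loosely in the spirit of the essay above. It is a simplified
illustration, not the author's exact method: the toy messages are made up,
and the clamping bounds (0.01 and 0.99), the 0.4 score for unseen words, and
the choice of 15 words are assumptions. It is written in Python, but nothing
here is specific to that language and the same logic would port to R.

```python
import re
from collections import Counter

# Made-up toy messages, purely for illustration.
SPAM = ["free money now", "win free prize now", "free lunch offer"]
HAM = ["lunch meeting tomorrow", "project meeting notes", "notes for lunch"]

def tokenize(text):
    """Split a message into lowercase word-like tokens."""
    return re.findall(r"[a-z0-9$'-]+", text.lower())

def token_probs(spam_msgs, ham_msgs):
    """Estimate, for each token, how strongly it points to spam (0 to 1)."""
    spam_counts = Counter(t for m in spam_msgs for t in tokenize(m))
    ham_counts = Counter(t for m in ham_msgs for t in tokenize(m))
    probs = {}
    for tok in set(spam_counts) | set(ham_counts):
        # Occurrences per message in each corpus, capped at 1.
        b = min(1.0, spam_counts[tok] / len(spam_msgs))
        g = min(1.0, ham_counts[tok] / len(ham_msgs))
        # Clamp so that no single word is treated as absolute proof.
        probs[tok] = max(0.01, min(0.99, b / (b + g)))
    return probs

def top_features(probs, k=5):
    """The k tokens whose scores are farthest from the neutral 0.5."""
    return sorted(probs.items(), key=lambda kv: abs(kv[1] - 0.5), reverse=True)[:k]

def score(text, probs, n=15, unknown=0.4):
    """Combine the n most decisive token scores into one spam probability."""
    ps = [probs.get(t, unknown) for t in set(tokenize(text))]
    ps = sorted(ps, key=lambda p: abs(p - 0.5), reverse=True)[:n]
    prod_spam = prod_ham = 1.0
    for p in ps:
        prod_spam *= p
        prod_ham *= 1.0 - p
    return prod_spam / (prod_spam + prod_ham)

probs = token_probs(SPAM, HAM)
print(top_features(probs))
print(score("free prize inside", probs))
print(score("project meeting notes", probs))
```

Printing top_features is the pedagogically useful part: it shows which words
the filter has learned to treat as strong evidence for or against spam.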
A quick search also indicated that Mozilla apparently offers a plug-in for
the same spam filter. That would offer a quick way to get the filter up
and running with real e-mail. But I don't know if Mozilla offers
interesting diagnostics about which features it's using, which is the
pedagogically interesting part. Mozilla mentions it here:
http://www.mozilla.org/mailnews/spam.html
Of course, you can use any number of classification techniques to
distinguish spam from other e-mail; you just need data. Hastie, Tibshirani,
and Friedman's _The Elements of Statistical Learning_ demonstrates a couple
of types of models applied to the spam problem, and points to data at
ftp.ics.uci.edu
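
As one example of "any number of classification techniques", here is a
minimal sketch of logistic regression fitted by plain gradient descent. The
two features (percentage of words that are "free", percentage of characters
that are "!") and all of the numbers are invented for illustration; they are
not taken from the book or from the UCI data, and the learning rate and
iteration count are arbitrary assumptions. Again this is Python, though the
same few lines would be easy to write in R.

```python
import math

# Invented toy data: [pct of words that are "free", pct of chars that are "!"]
X = [[2.0, 1.5], [1.2, 0.9], [1.8, 2.1],   # spam
     [0.0, 0.1], [0.1, 0.0], [0.0, 0.3]]   # not spam
y = [1, 1, 1, 0, 0, 0]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, w, b):
    """Estimated probability that a message with features x is spam."""
    return sigmoid(b + sum(wj * xj for wj, xj in zip(w, x)))

def fit_logistic(X, y, lr=0.5, epochs=2000):
    """Fit weights w and intercept b by batch gradient descent on log-loss."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for xi, yi in zip(X, y):
            err = predict(xi, w, b) - yi      # gradient of log-loss w.r.t. z
            for j, xj in enumerate(xi):
                grad_w[j] += err * xj
            grad_b += err
        w = [wj - lr * g / n for wj, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

w, b = fit_logistic(X, y)
print(w, b)
print(predict([1.6, 1.2], w, b))   # features typical of the spam rows
print(predict([0.0, 0.2], w, b))   # features typical of the non-spam rows
```

The fitted weights play the same role as the word scores in a Bayesian
filter: they show which features the model leans on.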
Ideally, you would do some exploration to design a filter, implement it in
R, and then integrate it with your nephew's e-mail program. This would be
a long-term project, maybe even a science-fair project, with long-term
benefits (educational and practical). I know this can be done with Linux,
but I have no idea about Mac OS 9! It's probably a stretch for typical
13-year-olds, but for the right 13-year-old, it would be a blast.
Good luck!
Jim Garrett
Baltimore, Maryland, USA