POPFile Roolz
digimind
thwack
The other day as I upgraded POPFile to the latest release, I decided to add a new bucket. Up until now I only had it classifying my e-mail into two buckets: personal and spam. I figured I'd add one for all the worms and viruses that try to come in through attachments, so I added the worm bucket.

I went into my history and reclassified a few e-mails that had previously been classified as spam but should now be in the worm bucket. (Gross!) I only had to reclassify 5 - FIVE! - before POPFile recognized all the rest. I keep going back into my history to check random ones, but they all say "would now classify as worm".

Oh yeah, that's one of many useful new features added in the latest release. On message-view pages, if the message would now be classified into a different bucket than the one it originally went to, it says so. That helps avoid unnecessarily reclassifying stuff.

Likewise, I upgraded. I didn't notice the additional Perl modules I needed to install. Oh well. I was up to 99.01% accuracy over 8300+ e-mails.

Can't complain about accuracy like that.

Now it's up to 100% after I upgraded and reset the stats. ;-)

I've had 99.31% accuracy over 13,525 e-mails, and didn't reset the stats, so that includes training of the new worm bucket. Awesome. :)

Perl modules... pheh... I ditched the multiplatform version months ago when stuff just wouldn't work anymore and nobody could figure out why the process kept disappearing. Windows version works much better for me.

vat es das?*




*totally made up accent, any resemblance to a real accent or language is purely coincidental

Don't worry, you've got Spam Assassin. :)

I provided a link, but yeah, it's not totally obvious that it's a spam filter. It works totally on training though, not using blacklists or whitelists or wordlists.

Here's a quick explanation that I wish I got at first, because it took me a while to grasp the concept. When you train POPFile by telling it "this message was spam" or "...personal", whatever, it marks EVERY word in that e-mail, headers and all, as representing whatever you said it is, and stores that in a database. Over time, an entire dictionary of words builds up, each one with a record of how often you trained it to be spam, personal, or whatever other "buckets" you specified. Using the scores of ALL the words in an e-mail combined, it places that e-mail in the appropriate bucket. It actually just adds a classification header and/or modifies the subject line. It's up to your e-mail client of choice to implement filters.
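If it helps to see the idea in code, here's a rough sketch of that kind of word-counting classifier. To be clear, this is NOT POPFile's actual code (POPFile is written in Perl and does a lot more). The class name, the tokenizer, and the add-one smoothing are all my own assumptions, just to show how per-bucket word counts can be combined into one score per bucket.

```python
import math
import re
from collections import Counter, defaultdict


class WordBucketClassifier:
    """Toy word-count classifier (hypothetical, not POPFile's implementation)."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # bucket -> word -> times trained
        self.message_counts = Counter()          # bucket -> messages trained

    @staticmethod
    def tokenize(text):
        # Assumption: a "word" is any run of letters, digits, or apostrophes.
        return re.findall(r"[a-z0-9']+", text.lower())

    def train(self, bucket, text):
        # Every word in the message (headers included) counts toward the bucket.
        self.word_counts[bucket].update(self.tokenize(text))
        self.message_counts[bucket] += 1

    def scores(self, text):
        words = self.tokenize(text)
        vocab = set()
        for counts in self.word_counts.values():
            vocab.update(counts)
        total_messages = sum(self.message_counts.values())
        result = {}
        for bucket, counts in self.word_counts.items():
            bucket_total = sum(counts.values())
            # Start from how common the bucket is, then add each word's log score.
            score = math.log(self.message_counts[bucket] / total_messages)
            for word in words:
                # Add-one smoothing so unseen words don't zero out a bucket.
                score += math.log((counts[word] + 1) / (bucket_total + len(vocab)))
            result[bucket] = score
        return result

    def classify(self, text):
        bucket_scores = self.scores(text)
        return max(bucket_scores, key=bucket_scores.get)


clf = WordBucketClassifier()
clf.train("spam", "Subject: cheap pills buy now")
clf.train("personal", "Subject: lunch on friday?")
clf.train("worm", "Subject: your document see attached file document.pif")
print(clf.classify("Subject: see the attached file"))
```

With that made-up training data, the last line picks the worm bucket, because "see", "attached", and "file" have only ever been trained there.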

That's why I was impressed by how little training it took to add another bucket. Over 7 months of training I have built a spam bucket containing over 77,000 words, and a personal bucket around 53,000. Since all the virus/worm e-mails I got were previously classified as spam, I was thinking it would take a while to retrain those words out of the spam bucket and into the new worm bucket. But it turns out I only needed to re/train 5 messages, creating a worm bucket with 10,000 words. Now every other old message I would retrain to the worm bucket already says that's where it would currently be classified anyway. And the new ones coming in are being classified correctly too.

Actually, now I think I know why it was so easy to retrain, but I don't need to bore you with that. :)
