hoochie
Q. I've fed DSPAM thousands of spam messages and am only getting marginal accuracy. What's up?

A. Your problem might be that you've fed DSPAM thousands of spam, but have not fed it enough nonspam for it to learn adequately. It's typically bad practice to feed a statistical filter a grossly unbalanced corpus of mail, and if you're using a version of DSPAM that has a "training buffer" enabled by default, feeding it a ton of spam can also cause it to start watering down its results until you feed it more ham. This watering down gets stronger the higher your spam ratio is, in an attempt to prevent false positives, so the more spam you feed it, the worse your accuracy will get. There are a few things you can do to remedy this:

Turn off the training buffer ("Feature tb=5" in dspam.conf) if it is turned on, or lower the buffering level. Since 5 is DSPAM's default, you'll want to use a value lower than 5. A value of 0 disables this protection entirely. Find the value that gives you the best spam filtering without allowing too many false positives.

The better solution may be to feed DSPAM enough nonspam to exceed the training threshold (2500 messages). This will not only disengage the statistical sedation feature (the training buffer), but will also allow other algorithms to kick in, such as Bayesian Noise Reduction, which only engage after training.

Try deleting your database and retraining using the dspam_train tool instead of dspam_corpus. dspam_corpus isn't really designed for building highly accurate pretrained databases.

If this doesn't work, or you're showing TI+IC values over 2500 in dspam_stats for your user, another common problem is incorrect training parameters. When a message is retrained in DSPAM, be careful to specify it as an error, not as corpus-fed spam. Check your command-line arguments, and make sure you're using --source=error and NOT --source=corpus.
--source=corpus is for messages that have not been processed by DSPAM. --source=error is for messages that have been processed by DSPAM and were erroneously classified. It's important not to specify corpus training on missed spam, because with --source=corpus DSPAM only learns the message as if it were new; it doesn't relearn it. So you'll end up with 1 spam tick mark and 1 innocent tick mark, instead of the correct result: 1 spam tick mark and 0 innocent tick marks.
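
The tick-mark arithmetic above can be sketched with a toy model. This is only an illustration of the counting described in the answer, not DSPAM's actual code: the TokenCounts class and the functions deliver_as_innocent, retrain_as_spam, and simulate are hypothetical names, and the model assumes that error retraining reverses the earlier innocent mark while corpus training only adds a new mark.

```python
from dataclasses import dataclass


@dataclass
class TokenCounts:
    """Toy per-token "tick marks"; a stand-in, not DSPAM's real storage."""
    spam: int = 0
    innocent: int = 0


def deliver_as_innocent(counts: TokenCounts) -> None:
    # The filter misses a spam and records it as innocent.
    counts.innocent += 1


def retrain_as_spam(counts: TokenCounts, source: str) -> None:
    if source == "corpus":
        # Assumed behavior: learned as a brand-new message,
        # so the earlier innocent mark is left in place.
        counts.spam += 1
    elif source == "error":
        # Assumed behavior: relearned, so the earlier innocent
        # mark is reversed and a spam mark is added.
        counts.spam += 1
        counts.innocent = max(0, counts.innocent - 1)
    else:
        raise ValueError(f"unknown source: {source!r}")


def simulate(source: str) -> tuple[int, int]:
    """Return (spam, innocent) marks after one missed spam is retrained."""
    counts = TokenCounts()
    deliver_as_innocent(counts)
    retrain_as_spam(counts, source)
    return counts.spam, counts.innocent


print(simulate("corpus"))  # (1, 1): the wrong mark is still there
print(simulate("error"))   # (1, 0): the wrong mark has been undone
```

In this sketch the only difference between the two paths is whether the stale innocent mark gets removed, which is why using --source=corpus on missed spam leaves the token looking evenly split.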