SpamAssassin Config Error

system · March 8, 2005, 1:14pm

I have encountered a problem with the spamassassin configuration on iworx.

First, the spamd bayes database is stored in /tmp. By definition, this is a temporary directory, and in some cases is a ramdisk (which would cause loss of spamd’s homedir on reboot). Additionally, this means that ALL domains using spamassassin will use a global bayes database. This is a terrible idea in a webhosting environment, since CustomerA has different criteria on what constitutes spam than CustomerB.

It appears that the developers either overlooked the following option, or noted its warning that it’s incompatible with SQL implementations, and settled for a global database.

(source: spamd(1) manpage)
SPAMD(1) User Contributed Perl Documentation SPAMD(1)
[…]
--virtual-config-dir=dir Enable pattern based Virtual configs
(needs -x)
[…]
--virtual-config-dir=pattern
This option specifies where per-user preferences can be found for
virtual users, for the -x switch. The pattern is used as a base
pattern for the directory name. Any of the following escapes can
be used:

       %u -- replaced with the full name of the current user, as sent by
       spamc.
       %l -- replaced with the ’local part’ of the current username.  In
       other words, if the username is an email address, this is the part
       before the "@" sign.
       %d -- replaced with the ’domain’ of the current username.  In other
       words, if the username is an email address, this is the part after
       the "@" sign.
       %% -- replaced with a single percent sign (%).

       So for example, if "/vhome/users/%u/spamassassin" is specified, and
       spamc sends a virtual username of "jm@example.com", the directory
       "/vhome/users/jm@example.com/spamassassin" will be used.

       The set of characters allowed in the virtual username for this path
       are restricted to:

               A-Z a-z 0-9 - + _ . , @ =

       All others will be replaced by underscores ("_").

       This path must be a writable directory.  It will be created if it
       does not already exist.  If a file called user_prefs exists in this
       directory (note: not in a ".spamassassin" subdirectory!), it will
       be loaded as the user’s preferences.  The auto-whitelist and/or
       Bayes databases for that user will be stored in this directory.

       Note that this requires that -x is used, and cannot be combined
       with SQL- or LDAP-based configuration.
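
The pattern expansion described in the manpage excerpt above can be sketched in a few lines of Python. This is only an illustration of the documented behaviour, not spamd's own code; the function name `expand_virtual_config_dir` is made up for this example.

```python
import re

# Characters the manpage lists as allowed in a virtual username;
# everything else is replaced by an underscore.
_DISALLOWED = re.compile(r"[^A-Za-z0-9\-+_.,@=]")


def expand_virtual_config_dir(pattern: str, username: str) -> str:
    """Expand %u, %l, %d and %% in a --virtual-config-dir style pattern."""
    safe = _DISALLOWED.sub("_", username)
    # Split into the part before and after the first "@".
    local, _, domain = safe.partition("@")
    values = {"u": safe, "l": local, "d": domain, "%": "%"}
    # A single left-to-right pass, so "%%u" becomes the literal text "%u".
    return re.sub(r"%([uld%])", lambda m: values[m.group(1)], pattern)


if __name__ == "__main__":
    print(expand_virtual_config_dir("/vhome/users/%u/spamassassin", "jm@example.com"))
    print(expand_virtual_config_dir("/vhome/%d/%l/spamassassin", "jm@example.com"))
```

A pattern built from `%d` and `%l` would give every mailbox its own directory, grouped by domain, which is the per-customer separation argued for above.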

The documentation recommends per-user databases:

(from Mail::SpamAssassin::Conf(3), User Contributed Perl Documentation)
By default, each user has their own, in their “~/.spamassassin”
directory with mode 0700/0600, but for system-wide SpamAssassin
use, you may want to reduce disk space usage by sharing this across
all users. (However it should be noted that Bayesian filtering
appears to be more effective with an individual database per user.)

Additionally, there appears to be no way to “move” the user-added configuration options. Since SA options are read “down” and will clobber, the order is important.

Finally, how is someone supposed to “tell” SA when it’s made a mistake? Will use of SA require use of Horde so that server-side folders can be seen? How can users send the sa-learn command? Does use of SA require IMAP?

IWorx-Paul · March 8, 2005, 2:24pm

Hi Chris,

Regarding the Bayes Database setup, it’s clearly not ideal the way it is. In fact it is the way it is only as a result of the spamassassin defaults (which just means we haven’t done any specific configuration of the Bayes feature).

Indeed, I was not aware of the --virtual-config-dir option, thanks for bringing that up, that should make per-user bayes databases feasible. I’m looking into it.

Additionally, there appears to be no way to “move” the user-added configuration options. Since SA options are read “down” and will clobber, the order is important.

Is there a particular scenario you’re concerned about here? The SiteWorx level Spam Preferences will “clobber” the Global (NodeWorx) level Spam Preferences by design - but it seems like you’re concerned only about ordering within the SiteWorx level Preferences, is that right? Maybe if you describe the problem that prompted this question I’ll better understand your concern.

Finally, how is someone supposed to “tell” SA when it’s made a mistake? Will use of SA require use of Horde so that server-side folders can be seen? How can users send the sa-learn command? Does use of SA require IMAP?

Re: sa-learn, this isn’t implemented yet, and will make more sense when the bayes database setup is improved.

Re: IMAP - What are you really getting at here? No, IMAP isn’t needed to use SA.

Paul

IWorx-Paul · March 8, 2005, 2:47pm

Actually it looks like the --virtual-config-dir option isn’t compatible with the --sql-config option, so that won’t work.

However there’s also an option to store bayes databases in a SQL database, so I’m looking into that as well.

Paul

IWorx-Paul · March 8, 2005, 2:49pm

Sorry for repeating the obvious, I looked back at your post and saw that you said that option wasn’t compatible w/ --sql-config.

system · March 9, 2005, 8:54am

Is there a particular scenario you’re concerned about here? The SiteWorx level Spam Preferences will “clobber” the Global (NodeWorx) level Spam Preferences by design - but it seems like you’re concerned only about ordering within the SiteWorx level Preferences, is that right? Maybe if you describe the problem that prompted this question I’ll better understand your concern.

This is primarily an organizational problem. Let’s say I want to add a header. I create the config option in the interface and all’s well. In fact, I add several. Later, I realize that I forgot to clear the headers first. If I add the command to clear the headers, it will do so, in order. Therefore my headers will be added, then deleted. There’s no way for me to “move” a particular rule (like clear headers) up (or down) in the config order, and there’s no way to “insert here” a rule. So I’m left with the only option: delete all configs that may be affected, and re-enter them in the correct order. While this would solve it, it forces me to duplicate a lot of work, and would be very frustrating. Kind of like punishment for not doing it right the first time.
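
To make the ordering problem concrete, here is a toy Python model of directives applied top to bottom. It is a sketch only: the tuples stand in for `add_header` and `clear_headers` lines, and `effective_headers` and `move_directive` are hypothetical helpers, not anything in InterWorx or SpamAssassin.

```python
def effective_headers(directives):
    """Apply directives in order and return the headers that survive."""
    headers = {}
    for name, *args in directives:
        if name == "clear_headers":
            headers.clear()  # wipes everything added so far
        elif name == "add_header":
            header_name, value = args
            headers[header_name] = value
    return headers


def move_directive(directives, old_index, new_index):
    """Return a copy of the list with one directive moved to a new position."""
    reordered = list(directives)
    reordered.insert(new_index, reordered.pop(old_index))
    return reordered


if __name__ == "__main__":
    entered = [
        ("add_header", "Status", "_YESNO_"),
        ("add_header", "Level", "_STARS(*)_"),
        ("clear_headers",),  # added last, so it deletes both headers
    ]
    print(effective_headers(entered))
    print(effective_headers(move_directive(entered, 2, 0)))
```

With a “move” operation like this in the interface, the forgotten `clear_headers` could be placed first instead of deleting and re-entering every rule.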

Re: sa-learn, this isn’t implemented yet, and will make more sense when the bayes database setup is improved.
Re: IMAP - What are you really getting at here? No, IMAP isn’t needed to use SA.

This is a general SA issue, as well as a Bayes issue. Let’s say that SA marks an email as spam. If it is, cool, and it’s put into … where? The spam folder? No, not possible on pop3 (which only understands INBOX). What if it’s marked spam, but it’s really not? Ordinarily I’d just run sa-learn, but there’s no way for me to do that here. I could potentially do this by having a this-was-marked-as-spam-but-isn’t folder and a this-was-marked-as-ham-but-isn’t folder, but the server-wide bayes filtering, coupled with no folder access is a problem. In IMAP, it could be a write-only shared (and subscribed) folder that I could use to train SA (and bayes), but pop3 again doesn’t grok folders. (and that would carry its own problems of oops, wrong folder). If we’re just marking with SA, and relying on the MUA to sort based on the SA header, it’s a reduction of usefulness (especially since many MUAs can’t sort based on X-headers).

I VERY_MUCH appreciate you guys adding SA and ClamAV to the offerings, and I just want to give some feedback to improve the implementation and configuration. Thanks for an excellent control panel!

IWorx-Paul · March 9, 2005, 10:00am

This is primarily an organizational problem. Let’s say I want to add a header. I create the config option in…

Thanks for the explanation, and good point, we’ll give it some thought.

This is a general SA issue, as well as a Bayes issue. Let’s say that SA marks an email as spam. If it is, cool, and it’s put into … where? The spam folder?

Indeed, right now it obviously just delivers the Spam to your inbox, tagged in whatever way your configuration defines. The user can then decide to filter the spam to another folder if they choose, and if they know how to do that. I agree that this isn’t ideal - I’d prefer to have the option of delivering tagged Spam to an IMAP folder by the MDA. That is definitely the direction we’re headed, and there is active development in vpopmail right now that’s adding this very feature in vdelivermail, which would make it very easy.
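
As a rough illustration of folder delivery based on the SA tag, the decision an MDA would have to make could look like the Python sketch below. It assumes the message carries SpamAssassin's `X-Spam-Flag: YES` header; the function name and the folder name `Spam` are placeholders, and this is not how vdelivermail is implemented.

```python
from email import message_from_string


def choose_folder(raw_message: str, spam_folder: str = "Spam") -> str:
    """Pick a delivery folder from the SpamAssassin tag on a message."""
    msg = message_from_string(raw_message)
    # A missing header is treated the same as "not spam".
    flag = (msg.get("X-Spam-Flag") or "").strip().upper()
    return spam_folder if flag == "YES" else "INBOX"


if __name__ == "__main__":
    tagged = "X-Spam-Flag: YES\nSubject: cheap pills\n\nbody\n"
    clean = "Subject: meeting notes\n\nbody\n"
    print(choose_folder(tagged))
    print(choose_folder(clean))
```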

I’ve got the Bayes problem worked out now. We’ll have per-user (not just per-unix-user, but per-email-box-user) auto-whitelisting and bayes database functioning shortly, and a straightforward way to train your personal bayes database with your spam and ham, using two IMAP folders.

One folder will be named “Learn Spam” and the other will be called “Learn Ham”. Periodically (probably once per day) these IMAP folders will be used to train your bayes database. Once the training is done the folders will be emptied automatically. So you’d put false positives in “Learn Ham” and false negatives in “Learn Spam”.

And just to be clear, this isn’t available yet, it’s under development and testing right now.
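
For readers wondering what such a periodic training pass might involve, here is a hedged Python sketch. It assumes Maildir++ style folders (`.Learn Spam/cur` and `.Learn Ham/cur` under the mailbox) and uses sa-learn's `--spam`, `--ham` and `-u` options; the paths and function names are illustrative guesses, not the InterWorx implementation.

```python
import subprocess
from pathlib import Path

# IMAP folder name -> sa-learn mode. Folder names come from the post above.
FOLDERS = {"Learn Spam": "--spam", "Learn Ham": "--ham"}


def build_sa_learn_commands(maildir: Path, user: str):
    """Build one sa-learn command per training folder (assumed Maildir++ layout)."""
    commands = []
    for folder, flag in FOLDERS.items():
        cur = maildir / f".{folder}" / "cur"
        commands.append(["sa-learn", flag, "-u", user, str(cur)])
    return commands


def train_and_empty(maildir: Path, user: str, run=subprocess.run):
    """Train on each folder, then empty it only if sa-learn succeeded."""
    for cmd in build_sa_learn_commands(maildir, user):
        folder = Path(cmd[-1])
        if not folder.is_dir():
            continue  # the user never created this folder
        run(cmd, check=True)  # raises on failure, so nothing is deleted
        for message in folder.iterdir():
            message.unlink()


if __name__ == "__main__":
    # Hypothetical mailbox path, for display only.
    for cmd in build_sa_learn_commands(Path("/path/to/Maildir"), "jm@example.com"):
        print(" ".join(cmd))
```

A daily cron entry calling `train_and_empty` for each mailbox would match the once-per-day schedule mentioned above.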

Paul

system · March 9, 2005, 10:59am

Will the /tmp problem be resolved with this fix, or before? On my system, /tmp is a tmpfs partition, and my .spamassassin folder won’t survive a reboot.

IWorx-Paul · March 9, 2005, 11:04am

Yes, it’ll be resolved. There will be no .spamassassin folder anywhere.

system · March 9, 2005, 11:21am

Erm, does that mean putting bayes into SQL? That may be problematic. It’s not uncommon for a bayes database to be huge … hundreds of thousands of rows, and each match would require the row be updated.


 update blah set nspam=nspam+1 where word='the';
 update blah set nspam=nspam+1 where word='word';
 update blah set nspam=nspam+1 where word='also';
 update blah set nspam=nspam+1 where word='foo';
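
The write pattern in question can be tried out with a small Python sketch. It uses an in-memory SQLite database purely as a stand-in for MySQL, and the table `token_counts` is hypothetical, not SpamAssassin's real Bayes schema. It shows the same per-token updates, grouped into a single transaction per message.

```python
import sqlite3


def make_db():
    """Create an in-memory database with a toy token table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE token_counts (word TEXT PRIMARY KEY, nspam INTEGER NOT NULL)"
    )
    return conn


def learn_spam(conn, tokens):
    """Bump the spam count of every token, one transaction per message."""
    rows = [(token,) for token in tokens]
    with conn:  # commits once at the end, or rolls back on error
        # Make sure each token has a row before incrementing it.
        conn.executemany(
            "INSERT OR IGNORE INTO token_counts (word, nspam) VALUES (?, 0)", rows
        )
        conn.executemany(
            "UPDATE token_counts SET nspam = nspam + 1 WHERE word = ?", rows
        )


if __name__ == "__main__":
    db = make_db()
    learn_spam(db, ["the", "word", "also", "foo"])
    learn_spam(db, ["the", "foo"])
    print(db.execute("SELECT word, nspam FROM token_counts ORDER BY word").fetchall())
```

Each token still costs one row update; batching only reduces the number of commits, so the volume concern remains worth measuring on a real mailbox.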

IWorx-Paul · March 9, 2005, 11:49am

Erm, does that mean putting bayes into SQL?

Yes. This provides by far the most flexibility.

It’s not uncommon for a bayes database to be huge … hundreds of thousands of rows and each match would require the row be updated.

I can’t imagine there being an order of magnitude difference between the flat-file version and the SQL version. MySQL is pretty good at what it does after all.

Paul

system · March 11, 2005, 1:04pm

Héhé, I’m happy to see this thread.

I thought Paul was sulking at me and did not want to answer my questions about Bayes and sa-learn.
OK, it is true that I had not asked the questions in the right place (see the announcement http://interworx.info/forums/showpost.php?p=2611&postcount=11 and the NodeWorx request http://interworx.info/forums/showthread.php?t=445 lol, I hadn’t noticed the bayes database was in /tmp).

For me, having a common database for all domains is not the problem; there is strength in numbers when it comes to spam.
On the other hand, it is true that for the “autolearn” functions, sa-learn normally needs to have been run already on at least 200 spam and ham mails.
I personally thought of doing this using IMAP, but not all customers can do this.

But Paul is here, and:

I’ve got the Bayes problem worked out now. We’ll have per-user (not just per-unix-user, but per-email-box-user) auto-whitelisting and bayes database functioning shortly, and a straightforward way to train your personal bayes database with your spam and ham, using two IMAP folders.

One folder will be named “Learn Spam” and the other will be called “Learn Ham”. Periodically (probably once per day) these IMAP folders will be used to train your bayes database. Once the training is done the folders will be emptied automatically. So you’d put false positives in “Learn Ham” and false negatives in “Learn Spam”.

And just to be clear, this isn’t available yet, it’s under development and testing right now.

That surprised me.

Ok, I’ve already set up a bayes database in MySQL, but I’m curious about the IMAP folders and running sa-learn automatically.

How will you route the spam mails to the spam IMAP folder and the ham mails to the ham IMAP folder?
After that, will you run a cron job for sa-learn ham and sa-learn spam?

If you do this, I’ll be really impressed.

For the auto-whitelist in MySQL, it is not a real problem.

I have to say that I had really been waiting for this spam/clam option in InterWorx, and I was curious how you’d do it. I was afraid you had just run spam filtering without any Bayes or SA options, because I could understand it was not an easy issue for multi-hosting (but I was really happy to have it the way it is now).
So if you resolve the Bayes and SA problem, well, well, well…

Come have a drink with me.

Good job, Paul (and the other unknown people working in the shadows).

Pascal
PS: erff, it’s too hard to use idiomatic expressions in English, so excuse me in advance.