Large-scale global Bayes tuning?

Discussion:

Kris Deugau

2008-04-09 16:12:43 UTC

Anyone have any suggestions on tuning a large global Bayes db for
stability and sanity? I've got my fingers in the pie of a moderately
large mail cluster, but I haven't yet found a Bayes configuration that's
sane and stable for any extended period. Wiping it completely about
once a week seems to provide "acceptable" filtering performance (we have
a number of addon rulesets), but I still see spam in my inbox with
BAYES_00 - a sure sign of a mistuned Bayes database.

Past experience with (much) smaller systems has shown stable behaviour
with bayes_expiry_max_db_size set to 1500000 (~40M BDB Bayes): daily
expiry runs delete ~25-35K tokens at a mail volume of ~3K/day. However, the
larger system (MySQL, currently set with max_db_size at 3000000, on-disk
files running ~100M) only seems to be expiring that same 25-35K tokens
even though autolearn is picking up ~1.5M+ from ~300K messages on a
daily basis. Reading through the docs on token expiry I would guess it
should be far more aggressive than it is. (Among other things, I really
don't want to bump up max_db_size by two orders of magnitude; up to ~5M
should be fine, and I could see as high as 7.5M if really necessary.)
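
As a rough check on that guess, the sketch below runs the arithmetic on the figures quoted above. It assumes every autolearned token is a new entry (an upper bound, since many learned tokens only update existing entries) and that an expiry run aims to keep the larger of 75% of bayes_expiry_max_db_size or 100,000 tokens, which is how the SpamAssassin configuration docs describe the target as far as I recall. The function names are illustrative only.

```python
def expiry_target(max_db_size: int) -> int:
    """Token count an expiry run aims to keep (assumed: max of 75% of the cap or 100,000)."""
    return max(max_db_size * 3 // 4, 100_000)


def days_to_fill(max_db_size: int, learned_per_day: int, expired_per_day: int) -> float:
    """Days for an empty database to reach the cap, treating every learned token as new."""
    net_growth = learned_per_day - expired_per_day
    if net_growth <= 0:
        raise ValueError("database is not growing under these figures")
    return max_db_size / net_growth


if __name__ == "__main__":
    cap = 3_000_000      # current max_db_size on the cluster
    learned = 1_500_000  # ~1.5M tokens autolearned per day
    expired = 30_000     # midpoint of the observed 25-35K expired per day
    print(f"expiry target: {expiry_target(cap):,} tokens")
    print(f"days to reach cap from empty: {days_to_fill(cap, learned, expired):.1f}")
    print(f"excess over target once at cap: {cap - expiry_target(cap):,} tokens")
```

Under those assumptions the cap is reached in about two days, and a run at the cap would need to drop roughly 750K tokens to hit its target, far more than the 25-35K actually being expired.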

I'm not even really sure what questions to ask to get more detail;
sa-learn -D doesn't really spit out *enough* detail about the expiry
process to know for sure if something is going wrong there.

-kgd

John Hardin

2008-04-09 16:26:32 UTC

autolearn is picking up ~1.5M+ from ~300K messages on a daily basis.

Push your autolearn thresholds out to reduce the overall volume of learned
spam and ham?
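
To see what pushing the thresholds out does to learning volume, here is a simplified sketch. The score-only decision is an assumption (SpamAssassin's real autolearn logic applies further conditions), the values 0.1 and 12.0 are the stock bayes_auto_learn_threshold_nonspam and bayes_auto_learn_threshold_spam settings as far as I know, and the sample scores and the wider thresholds are made up for illustration.

```python
from collections import Counter
from typing import Iterable, Optional


def autolearn_decision(score: float, ham_below: float = 0.1, spam_above: float = 12.0) -> Optional[str]:
    """Return "ham", "spam", or None (not learned) from the message score alone."""
    if score < ham_below:
        return "ham"
    if score > spam_above:
        return "spam"
    return None


def learned_counts(scores: Iterable[float], ham_below: float, spam_above: float) -> Counter:
    """Count how many messages would be learned as ham and as spam."""
    decisions = (autolearn_decision(s, ham_below, spam_above) for s in scores)
    return Counter(d for d in decisions if d is not None)


if __name__ == "__main__":
    sample_scores = [-2.5, -0.4, 0.0, 1.3, 4.8, 6.1, 12.5, 14.0, 22.7]  # hypothetical
    print("stock thresholds:", dict(learned_counts(sample_scores, 0.1, 12.0)))
    print("pushed out:      ", dict(learned_counts(sample_scores, -1.0, 15.0)))
```

Wider thresholds learn from fewer messages, which cuts the daily token inflow at the cost of ignoring more of the middle range.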
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
***@impsec.org FALaholic #11174 pgpk -a ***@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
People seem to have this obsession with objects and tools as being
dangerous in and of themselves, as though a weapon will act of its
own accord to cause harm. A weapon is just a force multiplier. It's
*humans* that are (or are not) dangerous.
-----------------------------------------------------------------------
4 days until Thomas Jefferson's 265th Birthday

Kris Deugau

2008-04-09 16:38:41 UTC

Post by John Hardin

autolearn is picking up ~1.5M+ from ~300K messages on a daily basis.

Push your autolearn thresholds out to reduce the overall volume of
learned spam and ham?

I've thought about that. It makes it more difficult to get Bayes data
on the critical messages in that middle range though. :(

-kgd

John Hardin

2008-04-09 16:50:16 UTC

Post by Kris Deugau

Post by John Hardin

autolearn is picking up ~1.5M+ from ~300K messages on a daily basis.

Push your autolearn thresholds out to reduce the overall volume of learned
spam and ham?

I've thought about that. It makes it more difficult to get Bayes data
on the critical messages in that middle range though. :(

How varied is the character of your message traffic? Is manual learning an
option, especially with larger autolearn thresholds?

Then at least you'd be able to reseed your bayes with a known-good corpus.

Kris Deugau

2008-04-09 19:56:44 UTC

Post by John Hardin
How varied is the character of your message traffic? Is manual learning
an option, especially with larger autolearn thresholds?

What is this... "manual learning"... you speak of? <g>

Not really an option in the short term, although in the long term I'd
*like* to have a system similar to what I've mostly trained users to do
on the much smaller systems - forward misclassified mail to a suitable
role account as an attachment for manual processing (whitelist,
blacklist, feed to Bayes, write/adjust rules, etc). Of course, that
requires someone to *do* the manual processing.... :(

I've been taking my own FNs and feeding them back in; that's really the
only misclassified mail I have easy access to. No FPs noticed so far....

Post by John Hardin
Then at least you'd be able to reseed your bayes with a known-good corpus.

*nod* I've thought about exporting the database from the smaller system
and pulling it in to the cluster to see how the accuracy is.

"Tokens don't get expired according to my understanding of the expiry
algorithm" about sums up the immediate problem; overall filter accuracy
is pretty good on the whole.

-kgd

Michael Scheidell

2008-04-09 16:27:39 UTC

Organization: ViaNet Internet Solutions
Date: Wed, 09 Apr 2008 12:12:43 -0400
Subject: Large-scale global Bayes tuning?
Anyone have any suggestions on tuning a large global Bayes db for
stability and sanity? I've got my fingers in the pie of a moderately
large mail cluster, but I haven't yet found a Bayes configuration that's
sane and stable for any extended period. Wiping it completely about
once a week seems to provide "acceptable" filtering performance (we have
a number of addon rulesets), but I still see spam in my inbox with
BAYES_00 - a sure sign of a mistuned Bayes database.

Bayes on cluster begs the question: what if you didn't replicate the bayes
tables, and left them server specific?

Since (depending on configurations) some of the servers might get 'spam
only' (higher mx records), maybe just take one of the 'valid' bayes tables
and manually copy it (sa-learn backup, sa-learn clear, restore) every week
or so.

Only way I could get a cluster of 9 to work right.
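
The weekly copy could be scripted roughly as follows. The sa-learn options used (--backup, which dumps to stdout, --clear, and --restore) are standard sa-learn options; the function names and the dump file name are placeholders, and how the dump file travels between nodes is left out.

```python
import subprocess
from pathlib import Path


def backup_bayes(dump_file: Path) -> None:
    """On the node with the 'valid' database: write the sa-learn --backup dump to a file."""
    with dump_file.open("wb") as out:
        subprocess.run(["sa-learn", "--backup"], stdout=out, check=True)


def restore_commands(dump_file: Path) -> list[list[str]]:
    """Commands to run, in order, on each node that receives the copy."""
    return [
        ["sa-learn", "--clear"],                    # wipe the local Bayes data
        ["sa-learn", "--restore", str(dump_file)],  # load the copied dump
    ]


def reseed_node(dump_file: Path) -> None:
    """Clear and restore the local Bayes database from the dump file."""
    for cmd in restore_commands(dump_file):
        subprocess.run(cmd, check=True)
```

Running backup_bayes(Path("bayes.dump")) on the source node and reseed_node(Path("bayes.dump")) on each target, from cron, would match the "every week or so" schedule.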
--
Michael Scheidell, CTO

|SECNAP Network Security

Winner 2008 Network Products Guide Hot Companies
FreeBSD SpamAssassin Ports maintainer
Charter member, ICSA labs anti-spam consortium

Kris Deugau

2008-04-09 16:36:56 UTC

Post by Michael Scheidell
Bayes on cluster begs the question: what if you didn't replicate the bayes
tables, and left them server specific?

It may yet take that. :( (If only for overall cluster reliability -
any one of the current three machines could handle the current load
without any trouble, but we're likely going to stuff ClamAV on them as
well.) Unfortunately that means doing mistake-training on *each*
machine - autolearn on its own just doesn't cut it.

I'm dogfooding pretty much that exact scenario on one machine; it's got
its own local Bayes DB that I'm hand-training with my own mail.

Post by Michael Scheidell
Since (depending on configurations) some of the servers might get 'spam
only' (higher mx records), maybe just take one of the 'valid' bayes tables
and manually copy it (sa-learn backup, sa-learn clear, restore) every week
or so.

Mmmh. Access is for both inbound and outbound mail, through a
load-balancer; the type of mail seen on any one system is pretty much
identical over time.

Michael Scheidell

2008-04-09 16:52:40 UTC

Organization: ViaNet Internet Solutions
Date: Wed, 09 Apr 2008 12:36:56 -0400
Subject: Re: Large-scale global Bayes tuning?

Post by Michael Scheidell
Bayes on cluster begs the question: what if you didn't replicate the bayes
tables, and left them server specific?

It may yet take that. :( (If only for overall cluster reliability -
any one of the current three machines could handle the current load
without any trouble, but we're likely going to stuff ClamAV on them as
well.) Unfortunately that means doing mistake-training on *each*
machine - autolearn on its own just doesn't cut it.

I'm dogfooding pretty much that exact scenario on one machine; it's got
its own local Bayes DB that I'm hand-training with my own mail.

You could also take mysql off of one or several, have them load balance to
the other mysql servers, run a caching (global) dns server and clamav on one
of them.

What about DCC? I assume with those volumes you are running a local DCC
server, and having the other boxes talk to it?

Mmmh. Access is for both inbound and outbound mail, through a

Keep a couple for outbound only, won't need bayes too much on those.
We have an engineering spec for a 9x9 (9 nodes in a cluster, 9 clusters in a
group) to support up to 2MM users, and we do a lot of task and load
splitting like that.
--
Michael Scheidell, CTO

|SECNAP Network Security

2008-04-09 20:05:17 UTC

Hi Kris,

Post by Kris Deugau
Anyone have any suggestions on tuning a large global Bayes db for
stability and sanity? I've got my fingers in the pie of a
moderately large mail cluster, but I haven't yet found a Bayes
configuration that's sane and stable for any extended
period. Wiping it completely about once a week seems to provide
"acceptable" filtering performance (we have a number of addon
rulesets), but I still see spam in my inbox with BAYES_00 - a sure
sign of a mistuned Bayes database.

Spam hitting BAYES_00 points to the bayes database being
polluted. That can happen if the autolearn levels are not low
enough. Some manual learning can help to keep the Bayes database in
tune. A more aggressive expiry won't necessarily prevent
mistuning. You'll have to do some MySQL tuning for performance. In
a large setup, manual learning isn't always possible. You can have
some rules to identify some "good" and "bad" messages which are
representative of the userbase.

Regards,
-sm