distribute bayes with rsync

Discussion:

Reindl Harald

2014-10-17 09:59:30 UTC

Hi

does SA need anything to recognize an rsynced bayes on similar setups and
load the new version, or is it reopened anyway for each connection by
the spamd child?

in case of clamd, rsyncing "/var/lib/clamav/" is enough

background:

* a perfectly trained bayes on the inbound spamfirewall
* after an account was recently hacked and sent spam
(luckily not massive thanks to rate-limits) which would have
been clearly caught by SA/spamass-milter, i am considering
installing SA on the submission servers as well and just
rsyncing the bayes per cronjob

ls /var/lib/spamass-milter/.spamassassin/
total 8.4M
-rw------- 1 sa-milt sa-milt 32K 2014-10-17 11:52 bayes_journal
-rw------- 1 sa-milt sa-milt 324K 2014-10-17 09:50 bayes_seen
-rw------- 1 sa-milt sa-milt 11M 2014-10-17 11:45 bayes_toks
-rw------- 1 sa-milt sa-milt 98 2014-08-21 17:47 user_prefs

Joolee

2014-10-17 10:02:15 UTC

File-based bayes is actually quite slow. Isn't it an option for you to use
an SQL-replicated master-slave set or cluster, or perhaps Redis
master-slave replication?

It would be nice if we'd be able to use a clustered nosql database though.

Kind regards,
Peter Overtoom

Axb

2014-10-17 10:11:19 UTC

Post by Joolee
It would be nice if we'd be able to use a clustered nosql database though.

nosql like what?

something like Cassandra? CouchDB? for Bayes they're slower than file
DB (tested, dumped).

If your traffic size justifies it, till Redis cluster is released, load
balancing/sentinel is probably the way to go.

Reindl Harald

2014-10-17 10:11:57 UTC

Post by Joolee
File-based bayes is actually quite slow. Isn't it an option for you to
use an SQL-replicated master-slave set or cluster, or perhaps Redis
master-slave replication?

performance is not a problem here: no high traffic on the submission
servers, and on the inbound spamfirewall SA only faces around 5% of all
delivery attempts, 80000 legit mails per month, 15000 per month caught
by SA, and the other 450000 delivery attempts are caught by
postscreen or based on PTR/HELO/SPF by smtpd before the milter

i just want to know if i need to reload spamd after rsync

below "/var/lib/spamass-milter/" live the training messages and scripts to
update or rebuild from scratch, making it perfectly maintainable

in case of rsync the training folders and scripts are not needed on the
other machines, just the result below ".spamassassin", and the point is:
if no reload is needed, the rsync itself could happen with a restricted
user and no root command would be needed

RW

2014-10-17 17:45:57 UTC

On Fri, 17 Oct 2014 11:59:30 +0200

Post by Reindl Harald
Hi
does SA need anything to recognize a rsynced bayes on similar setups
to load the new version or is it anyways reopened for each connection
by spamd child?

I think so, and I don't recall any special handling being needed for
sa-learn, which can create a new token file. There is a lock-file that
can be used but I doubt it would be needed in this case.

I wouldn't just rsync it though. However the file is copied, I'd finish
with an atomic mv so there is no risk of a partial file.
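
A minimal sketch of that copy-then-rename idea in Python, assuming the new file has already been fetched to some staging path. The function name and the temp-file prefix are made up for illustration; nothing here is part of SpamAssassin. The temporary file is created in the destination directory so that the final rename stays on one filesystem, which is what makes it atomic on POSIX systems.

```python
import os
import shutil
import tempfile


def install_atomically(src_path: str, dest_path: str) -> None:
    """Copy src_path over dest_path so readers never see a partial file."""
    dest_dir = os.path.dirname(os.path.abspath(dest_path))
    # The temp file lives next to the destination: a rename is only
    # atomic when source and target are on the same filesystem.
    fd, tmp_path = tempfile.mkstemp(dir=dest_dir, prefix=".bayes_tmp_")
    try:
        with os.fdopen(fd, "wb") as tmp, open(src_path, "rb") as src:
            shutil.copyfileobj(src, tmp)
            tmp.flush()
            os.fsync(tmp.fileno())  # make sure the data is on disk first
        # os.replace() maps to rename(2): the old file is swapped for the
        # new one in a single step.
        os.replace(tmp_path, dest_path)
    except BaseException:
        # Clean up the temp file if anything went wrong before the rename.
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise
```

It would be called once per database file, for example install_atomically("/tmp/staging/bayes_toks", "/var/lib/spamass-milter/.spamassassin/bayes_toks"), where the staging path is hypothetical. mkstemp creates the file with mode 0600, matching the listing above; ownership is whatever user runs the script.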

Post by Reindl Harald
in case of clamd rsync "/var/lib/clamav/" is enough
* a perfect trained bayes on the inbound spamfirewall
* after recently a account was hacked and sent spam
(luckily not massive by rate-limits) which would have
been clearly caught by SA/spamass-milter i consider
to install SA also on the submission servers and just
rsync the bayes per cronjob

This is not ideal; a well-trained incoming database won't be
well-trained for outgoing mail.

Reindl Harald

2014-10-17 18:04:11 UTC

Post by RW
On Fri, 17 Oct 2014 11:59:30 +0200

Post by Reindl Harald
does SA need anything to recognize a rsynced bayes on similar setups
to load the new version or is it anyways reopened for each connection
by spamd child?

I think so, and I don't recall any special handling being needed for
sa-learn, which can create a new token file. There is a lock-file that
can be used but I doubt it would be needed in this case.
I wouldn't just rsync it though. However the file is copied, I'd finish
with an atomic mv so there is no risk of a partial file.

rsync is atomic as long as you don't use "--inplace": it creates a
temp-file and renames it into place
http://stackoverflow.com/questions/3769263/are-rsync-operations-atomic-at-file-level

Post by RW

This is not ideal; a well-trained incoming database won't be
well-trained for outgoing mail

the 2000 ham samples are incoming and outgoing legit mail
for safety the high scores of bayes are adjusted lower:
BAYES_99 only 5.0 instead of 5.5, BAYES_999 0.5
should be safe given no RBLs and only URIBL
_______________________________________________________________

in the meantime it's implemented on both submission servers
my "sa-learn.sh" at the end does the sync automatically

great: with the bayes weight 2.0 lower, i copied the mail body from
the abuse mail we got after the hacked account into a new
mail and tried to send it to my gmail address: score 10.1,
while the milter rejects above 8.0

Oct 17 18:09:07 mail spamd[30845]: spamd: identified spam (10.1/7.0) for sa-milt:189 in 0.1 seconds, 2724 bytes.
Oct 17 18:09:07 mail spamd[30845]: spamd: result: Y 10 - ADVANCE_FEE_3_NEW,ADVANCE_FEE_4_NEW,ADVANCE_FEE_5_NEW,ALL_TRUSTED,BAYES_50,LOTTO_DEPT,RP_MATCHES_RCVD,TVD_APPROVED scantime=0.1,size=2724,user=sa-milt,uid=189,required_score=7.0,rhost=localhost,raddr=127.0.0.1,rport=33091,mid=<***@thelounge.net>,bayes=0.500000,autolearn=disabled
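
For checking such test mails against the milter's reject limit without reading the log by eye, the first log line can be parsed mechanically. The following is only a rough Python sketch: the function names and the regular expression are illustrative assumptions modelled on the "identified spam" line shown above (the "clean message" variant is assumed from typical spamd logs), and the 8.0 default is the reject threshold mentioned in this post.

```python
import re

# Matches e.g. "spamd: identified spam (10.1/7.0) for sa-milt:189 in ..."
_VERDICT_RE = re.compile(
    r"spamd: (?P<verdict>identified spam|clean message) "
    r"\((?P<score>-?\d+(?:\.\d+)?)/(?P<required>-?\d+(?:\.\d+)?)\) "
    r"for (?P<user>[^:\s]+):(?P<uid>\d+)"
)


def parse_spamd_verdict(line):
    """Return score details from a spamd verdict log line, or None."""
    match = _VERDICT_RE.search(line)
    if match is None:
        return None
    return {
        "is_spam": match.group("verdict") == "identified spam",
        "score": float(match.group("score")),
        "required": float(match.group("required")),
        "user": match.group("user"),
        "uid": int(match.group("uid")),
    }


def milter_would_reject(verdict, reject_above=8.0):
    """True if the score is above the milter's reject threshold."""
    return verdict["score"] > reject_above
```

Fed the first log line above, parse_spamd_verdict yields a score of 10.1 against a required score of 7.0, and milter_would_reject returns True for the 8.0 limit.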

RW

2014-10-20 16:03:29 UTC

On Fri, 17 Oct 2014 20:04:11 +0200
Reindl Harald wrote:
a perfect trained bayes on the inbound spamfirewall

Post by Reindl Harald

Post by RW

Post by Reindl Harald
* after recently a account was hacked and sent spam
(luckily not massive by rate-limits) which would have
been clearly caught by SA/spamass-milter i consider
to install SA also on the submission servers and just
rsync the bayes per cronjob

This is not ideal; a well-trained incoming database won't be
well-trained for outgoing mail

the 2000 ham samples are incoming and outgoing legit mail

If possible it's better to keep them separate, because there will be
token frequencies that are very different between the two types of ham.

For example, if a spammer is sending out spam spoofing a bank, you don't
want to have legitimate incoming mail from that bank in your ham corpus.
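
To make the bank example concrete, here is a deliberately simplified per-token estimate in Python. It is not SpamAssassin's actual Bayes computation, which smooths and combines token probabilities in a more involved way, and all counts are invented for illustration. It only shows the direction of the effect: the same token looks hammy when inbound bank mail is in the ham corpus and spammy when it is not.

```python
def token_spam_probability(spam_hits, ham_hits, n_spam, n_ham):
    """Naive estimate of P(spam | token) from per-corpus frequencies."""
    spam_freq = spam_hits / n_spam
    ham_freq = ham_hits / n_ham
    if spam_freq + ham_freq == 0:
        return 0.5  # token never seen: no evidence either way
    return spam_freq / (spam_freq + ham_freq)


# Hypothetical token "examplebank" with 2000 spam and 2000 ham samples.
# Mixed corpus: 50 legitimate inbound bank mails sit in the ham set.
mixed = token_spam_probability(spam_hits=20, ham_hits=50,
                               n_spam=2000, n_ham=2000)

# Outbound-only ham: local users never send mail mentioning the bank.
outbound_only = token_spam_probability(spam_hits=20, ham_hits=0,
                                       n_spam=2000, n_ham=2000)

print(f"mixed corpus:      {mixed:.3f}")
print(f"outbound-only ham: {outbound_only:.3f}")
```

With these made-up counts the token scores about 0.286 in the mixed corpus and 1.0 with outbound-only ham, so a hacked account sending bank-spoofing spam would get less help from Bayes when the shared database is used.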

Reindl Harald

2014-10-20 16:59:08 UTC

Post by RW
On Fri, 17 Oct 2014 20:04:11 +0200
a perfect trained bayes on the inbound spamfirewall

Post by Reindl Harald

Post by RW

This is not ideal; a well-trained incoming database won't be
well-trained for outgoing mail

the 2000 ham samples are incoming and outgoing legit mail

If possible it's better to keep them separate, because there will be
token frequencies that are very different between the two types of ham.
For example, if a spammer is sending out spam spoofing a bank, you don't
want to have legitimate incoming mail from that bank in your ham corpus

no autolearning, hand-fed bayes, and it was *a lot* of work to collect 2000
clear spam and 2000 clear ham samples (with the help of some users
forwarding mails as eml) in total - hence i don't want to maintain a
second one

the ham should contain samples of any type of legit mail here
new spam is regularly forwarded to me for training

IMHO the new spam is the most important because i think if someone
hacks mail accounts, then it is to send out the most recent crap with them

lowered and/or disabled some rules that make no sense in the context of
authenticated MUAs from dialup home networks, lowered the impact of the
bayes in general and tested with the two intrusions attached in abuse
mails as mail body - both would have been rejected by the milter and so
far not a single mail has come anywhere near the FP range

looks like the goal is achieved, rate-controls and so on are also tuned to
make dictionary attacks harder - they have become really frequent recently