Training Bayes On A Gateway

Discussion:

John Traweek CCNA, Sec+

2014-10-09 19:43:05 UTC

I've built a gateway server using sa-exim to filter email for our
corporate Microsoft Exchange environment. It's working pretty well, but
I have Bayes turned off because I am unsure how to train it in this
type of environment. Has someone written a how-to article on
efficiently and continually training Bayes in an environment like this?
I was thinking that if specific users could forward spam to some mailbox
on Exchange and have sa-exim POP it or something to "learn" it, that
would be ideal, but maybe there is a better way. Any ideas are
appreciated, the easier the better. TIA...

________________________________

John Traweek CCNA, Sec+
Executive Director, Information Technology
Proud PCI Associate for 18 years
PCI: the data company

________________________________

Heritage Square . 4835 LBJ Freeway, Suite 1100 . Dallas, TX 75244 . 214.530.0394

Did you know last year, PCI raised over 9 million dollars in donations for our clients? Ask us how!

This Email is covered by the Electronic Communications Privacy Act, 18 U.S.C. Sections 2510-2521 and is legally privileged. The information contained in this Email is only for the intended recipient. If the reader of this message is not the intended recipient, you are hereby notified that any dissemination, distributions or copying of this communication is strictly prohibited. If you have received this communication in error, please notify us by telephone 1.800.395.4724 X160, and destroy the original message.

Reindl Harald

2014-10-09 19:51:54 UTC

I just decided to stay on spamass-milter, which implies a single user
and therefore one central Bayes database. It is trained by a simple
script from two folders (ham and spam), with autolearning disabled.
Users are advised to forward samples as attachments, which get added
after review; so far that has been no more than 5 per day. The rest is
caught because I currently receive mail for 10 email addresses,
including some alias lists, and so see all sorts of crap.

The ham folder just contains a lot of my legit mail, as long as it
doesn't contain sensitive data.

The machine itself is inbound-only, with Postfix transport tables after
the filters, and so should match your scenario.

so far the results are impressive

The first script is a wrapper that runs as root, takes care of
permissions, and removes duplicates to optimize the training in case of
a complete rebuild. The sample .eml files are renamed with Konqueror to
"YYYY-mm-dd-#" and so get an automatic number, which makes it easy to
remove outdated spam samples and rebuild in a year or two.
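
The date-based naming can be used to find outdated samples without
looking at file timestamps. The following is only a minimal sketch, not
part of the scripts in this post: the function name
`list_outdated_samples` is hypothetical, and it assumes every sample is
named "YYYY-mm-dd-#.eml" exactly as described above.

```shell
#!/usr/bin/env bash
# Print sample files whose "YYYY-mm-dd" name prefix is older than a cutoff.
# Hypothetical helper; assumes the "YYYY-mm-dd-#.eml" naming scheme.
# Usage: list_outdated_samples DIR CUTOFF   (CUTOFF given as YYYY-mm-dd)
list_outdated_samples() {
    local dir=$1 cutoff=$2 file name stamp
    for file in "$dir"/*.eml; do
        [ -e "$file" ] || continue      # the directory had no .eml files
        name=${file##*/}                # strip the directory part
        stamp=${name:0:10}              # the YYYY-mm-dd prefix
        # Skip files that do not follow the naming scheme
        [[ $stamp =~ ^[0-9]{4}-[0-9]{2}-[0-9]{2}$ ]] || continue
        # Dates in this format sort correctly as plain strings
        if [[ $stamp < $cutoff ]]; then
            printf '%s\n' "$file"
        fi
    done
}
```

For example, `list_outdated_samples /var/lib/spamass-milter/training/spam 2014-01-01`
would list the spam samples named for dates before 2014. Reviewing that
list before deleting anything, and then running a rebuild, seems the
safer order.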

The second script does the training itself. It runs as the milter user
and is called with "su" from the wrapper; the milter user has /bin/dash
as its shell instead of /sbin/nologin.

[***@mail-gw:~]$ cat /scripts/sa-learn.sh
#!/usr/bin/bash
# Home directory and name of the milter user
SA_MILTER_HOME="/var/lib/spamass-milter"
SA_MILTER_USER="sa-milt"
# Ensure the permissions of the training files
chown -R "root:$SA_MILTER_USER" "$SA_MILTER_HOME/training/ham/"
chown -R "root:$SA_MILTER_USER" "$SA_MILTER_HOME/training/spam/"
chmod 750 "$SA_MILTER_HOME/training/ham/"
chmod 750 "$SA_MILTER_HOME/training/spam/"
chmod 640 "$SA_MILTER_HOME"/training/ham/*.eml
chmod 640 "$SA_MILTER_HOME"/training/spam/*.eml
# Remove duplicates in both folders (fdupes -f omits the first file of each set)
/usr/bin/fdupes -r -f "$SA_MILTER_HOME/training/ham/" | grep -v '^$' | xargs -r -d '\n' rm -v 2> /dev/null
/usr/bin/fdupes -r -f "$SA_MILTER_HOME/training/spam/" | grep -v '^$' | xargs -r -d '\n' rm -v 2> /dev/null
# Run the worker script as the milter user
/usr/bin/su -c "$SA_MILTER_HOME/training/learn.sh $1" "$SA_MILTER_USER"

[***@mail-gw:~]$ cat /var/lib/spamass-milter/training/learn.sh
#!/usr/bin/bash
SA_MILTER_HOME="/var/lib/spamass-milter"
SA_MILTER_USER="sa-milt"
# Refuse to run as anyone but the milter user
if [ "$(whoami)" != "$SA_MILTER_USER" ]; then
    /bin/echo "The script 'learn.sh' must be run as user '$SA_MILTER_USER'"
    exit 1
fi
cd "$SA_MILTER_HOME" || exit 1
SHOW_HELP="0"
# Accepted arguments: "rebuild", nothing, or a whole number of days
if [ "$1" == "rebuild" ] || [ "$1" == "" ] || [[ "$1" =~ ^[0-9]+$ ]]; then
    if [ "$1" == "rebuild" ]; then
        # Complete rebuild requested: reset Bayes first
        /usr/bin/sa-learn --clear
        # Spam training
        MY_TIME=$(/usr/bin/date "+%d-%m-%Y %H:%M:%S")
        echo "$MY_TIME: Processing spam samples"
        nice -n 19 /usr/bin/sa-learn --progress --spam "$SA_MILTER_HOME"/training/spam/*.eml
        echo ""
        # Ham training
        MY_TIME=$(/usr/bin/date "+%d-%m-%Y %H:%M:%S")
        echo "$MY_TIME: Processing ham samples"
        nice -n 19 /usr/bin/sa-learn --progress --ham "$SA_MILTER_HOME"/training/ham/*.eml
        echo ""
    else
        # Default to the last day, or use the parameter
        if [ "$1" == "" ]; then
            TRAIN_DAYS="1"
        else
            TRAIN_DAYS="$1"
        fi
        # Spam training (nice is applied to sa-learn, not to find)
        MY_TIME=$(/usr/bin/date "+%d-%m-%Y %H:%M:%S")
        echo "$MY_TIME: Processing spam samples"
        /usr/bin/find "$SA_MILTER_HOME/training/spam/" -type f -name '*.eml' -mtime "-$TRAIN_DAYS" -print0 | xargs -0 -r nice -n 19 /usr/bin/sa-learn --spam
        echo ""
        # Ham training
        MY_TIME=$(/usr/bin/date "+%d-%m-%Y %H:%M:%S")
        echo "$MY_TIME: Processing ham samples"
        /usr/bin/find "$SA_MILTER_HOME/training/ham/" -type f -name '*.eml' -mtime "-$TRAIN_DAYS" -print0 | xargs -0 -r nice -n 19 /usr/bin/sa-learn --ham
        echo ""
    fi
else
    SHOW_HELP="1"
fi
if [ "$1" == "--help" ] || [ "$1" == "-h" ] || [ "$SHOW_HELP" == "1" ]; then
    echo "Bayes maintenance script"
    echo "Usage:"
    echo " rebuild: reset Bayes completely and rebuild it from the samples"
    echo " <days>: age in days of the samples to train (default: 1)"
    exit
fi
MY_TIME=$(/usr/bin/date "+%d-%m-%Y %H:%M:%S")
echo "$MY_TIME: Done"
echo ""
# Show the Bayes statistics and the database files
nice -n 19 /usr/bin/sa-learn --dump magic
echo ""
/usr/bin/ls -l -h --time-style=long-iso "$SA_MILTER_HOME/.spamassassin/"
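
To preview which samples an incremental run would pick up, the same
`find` selection can be run on its own. This is only a sketch: the
function name `recent_samples` is hypothetical and not part of the
scripts above. Note that `-mtime -N` matches files modified less than
N*24 hours ago, so the default of 1 means "the last 24 hours" rather
than "the current calendar day".

```shell
#!/usr/bin/env bash
# List the .eml samples an incremental training run would select.
# Hypothetical helper for dry runs; it never calls sa-learn.
# Usage: recent_samples DIR [DAYS]   (DAYS defaults to 1, as in learn.sh)
recent_samples() {
    local dir=$1 days=${2:-1}
    # -mtime -N: modified less than N*24 hours ago
    find "$dir" -type f -name '*.eml' -mtime "-$days" | sort
}
```

Running `recent_samples /var/lib/spamass-milter/training/spam 7` would
print the spam samples from the last week without training anything.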

John Hardin

2014-10-09 20:14:21 UTC

This topic comes up fairly regularly. Did you search the list archives on
terms like "exchange bayes" ?

There's no explicit coverage of this in the wiki, but these pages may
help:

http://wiki.apache.org/spamassassin/SiteWideBayesFeedback

http://wiki.apache.org/spamassassin/RemoteImapFolder

...though I've heard Exchange has deprecated public IMAP folders.

Jason W.

2014-10-09 22:58:26 UTC

Since the OP mentioned exim, I'll share a bit of how I did something
similar. While I have Exchange in the picture, most of my users are not on
it.

I wanted to be able to fully reject mail at SMTP time if SpamAssassin
flags it as spam (SA itself does not block mail <g>) and not worry about
whether exim would change the log format if I did a 'fakereject'. SMTP
rejects are nice since I do not quarantine spam. I didn't see this
covered on the SA wiki or elsewhere, so figured I'd share and maybe
help someone out.

I use exim's native SA integration, not sa-exim. I also use dovecot for my
IMAP users' mailboxes, and this is where my spam mail goes.

In my data ACL within exim.conf, I have:

---------------

# Call SA and add some headers to the email delivered via normal means
# if it's non-spam
warn spam = spam:true/defer_ok
     add_header = X-Spam-Score: $spam_score ($spam_bar)
     add_header = X-Spam-Report: $spam_report

# If it's spam (defined as an SA score > 5), then run my custom deliver
# script against the copy of the email in the exim mail spool.
# Exim's mail spool copy won't have the above added headers, so need to
# do so here to see them in the spam mailbox.
warn condition = ${if >{$spam_score_int}{50}{1}{0}}
     condition = ${run{/home/spam/bin/deliver incoming-spam $spool_directory/scan/$message_id/$message_id.eml 'X-Spam-Score: $spam_score\nX-Spam-Report: $spam_report'}}

deny condition = ${if >{$spam_score_int}{50}{1}{0}}
     message = .....

-------------

/home/spam/bin/deliver contains:

----------------

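#!/bin/bash

# Arguments: target mailbox, spool file, then the extra header text
MAILBOX=$1
FILE=$2
shift
shift
HEADERS="$*"

# mktemp avoids a predictable file name in /tmp
TMPFILE=$(mktemp /tmp/deliver.XXXXXX) || exit 1

# Write the extra headers first; echo -e turns \n into a real newline
echo -e "$HEADERS" >> "$TMPFILE"
# Exim writes out a standard mbox-style From line, remove it
tail -n +2 "$FILE" >> "$TMPFILE"

# Dovecot's deliver must run as root to do direct delivery
sudo /usr/libexec/dovecot/deliver -d spam -m "$MAILBOX" < "$TMPFILE"

rm -f "$TMPFILE"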
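
The post does not say how the collected spam is later fed to Bayes. One
possible follow-up step, sketched under stated assumptions: the spam
user's mailbox is stored in Maildir format (not confirmed above), the
messages have been reviewed, and the command runs as the user that owns
the Bayes database. The names `learn_maildir` and `SA_LEARN` are
hypothetical; `sa-learn --spam` and `sa-learn --ham` accept directories
of messages.

```shell
#!/usr/bin/env bash
# Command used for training; set SA_LEARN=echo for a dry run.
SA_LEARN=${SA_LEARN:-sa-learn}

# Feed every message in a Maildir folder to the Bayes learner.
# Hypothetical helper; assumes Maildir storage (cur/ and new/ subdirs).
# Usage: learn_maildir spam|ham MAILDIR_FOLDER
learn_maildir() {
    local kind=$1 folder=$2
    case $kind in
        spam|ham) ;;
        *) echo "usage: learn_maildir spam|ham MAILDIR_FOLDER" >&2
           return 2 ;;
    esac
    # Collect whichever of the two message directories exist
    local dirs=()
    [ -d "$folder/cur" ] && dirs+=("$folder/cur")
    [ -d "$folder/new" ] && dirs+=("$folder/new")
    if [ ${#dirs[@]} -eq 0 ]; then
        echo "no cur/ or new/ under $folder" >&2
        return 1
    fi
    "$SA_LEARN" "--$kind" "${dirs[@]}"
}
```

With `SA_LEARN=echo`, calling `learn_maildir spam FOLDER` only prints
the arguments that would be passed, which is a cheap way to check the
folder path before training for real.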

--
HTH, YMMV, HANW :)

Jason

The path to enlightenment is /usr/bin/enlightenment.

Ted Mittelstaedt

2014-10-10 05:47:55 UTC

I collect spam this way: periodically I scan the mail logs looking for
"unknown user" entries and sort the results. Usernames/email addresses
that are repeatedly being "guessed" get an alias entry added that
forwards the spam to a spam mailbox. I have about 20 of these now that
are aliased to the spambox, and that box gets tons and tons of spam.

Ham is just my own email folders: all legitimate mail I get, once I
finish dealing with it, goes into an archive, and that archive is
periodically fed into the Bayes learner.
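
The log scan described above could look roughly like this. It is a
sketch only: the post names neither the MTA nor its log format, so the
assumption that rejection lines contain "unknown user" and name the
recipient as `to=<address>` is hypothetical, as is the function name
`guessed_recipients`.

```shell
#!/usr/bin/env bash
# Count rejected recipients in a mail log, most frequently guessed first.
# Assumes (hypothetically) that rejection lines contain "unknown user"
# and name the recipient as to=<address>; adjust both to the real log.
# Usage: guessed_recipients LOGFILE [MIN_HITS]   (MIN_HITS defaults to 2)
guessed_recipients() {
    local log=$1 min=${2:-2}
    grep -i 'unknown user' "$log" \
        | sed -n 's/.*to=<\([^>]*\)>.*/\1/p' \
        | tr '[:upper:]' '[:lower:]' \
        | sort | uniq -c | sort -rn \
        | awk -v min="$min" '$1 >= min { print $1, $2 }'
}
```

Each output line is a hit count followed by an address. Addresses near
the top of the list are the candidates for an alias that points at the
spam mailbox.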

Ted
