Discussion:
23_bayes_ignore_header.cf
Axb
2014-10-14 07:08:05 UTC
Permalink
Updated (in case you're using it.....)

http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf
Robert Schetterer
2014-10-14 08:08:07 UTC
Permalink
Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf
I am not really sure,

but including

bayes_ignore_header X-getmail-filter-classifier

might be a good idea too

http://pyropus.ca/software/getmail/configuration.html

Filter_classifier — run the message through an external program, and
insert the output of the program into X-getmail-filter-classifier:
header fields in the message. Messages can be dropped by having the
filter return specific exit codes.


Best Regards
MfG Robert Schetterer
--
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64
Franziskanerstraße 15, 81669 München

Sitz der Gesellschaft: München, Amtsgericht München: HRB 199263
Vorstand: Patrick Ben Koetter, Marc Schiffbauer
Aufsichtsratsvorsitzender: Florian Kirstein
Axb
2014-10-14 08:13:55 UTC
Permalink
Post by Robert Schetterer
Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf
i am not really sure
but include
bayes_ignore_header X-getmail-filter-classifier
might be a good idea too
http://pyropus.ca/software/getmail/configuration.html
Filter_classifier — run the message through an external program, and
insert the output of the program into X-getmail-filter-classifier:
header fields in the message. Messages can be dropped by having the
filter return specific exit codes.
added! thanks
Alessio Cecchi
2014-10-14 08:37:28 UTC
Permalink
Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf
I suggest these:

from qmail-scanner:
bayes_ignore_header X-Qmail-Scanner-Diagnostics
bayes_ignore_header X-Qmail-Scanner-MOVED-X-Spam-Status
bayes_ignore_header X-Originating-IP

from cloudmark:
bayes_ignore_header X-Spam-CMAE-Analysis
bayes_ignore_header X-CMAE-Match
bayes_ignore_header X-CMAE-Score
bayes_ignore_header X-CMAE-Analysis

from commtouch:
bayes_ignore_header X-Spam-CTCH-RefID
bayes_ignore_header X-CTCH-SenderID
bayes_ignore_header X-CTCH-SenderID-TotalMessages
bayes_ignore_header X-CTCH-SenderID-TotalSuspected
bayes_ignore_header X-CTCH-SenderID-TotalBulk
bayes_ignore_header X-CTCH-SenderID-TotalConfirmed
bayes_ignore_header X-CTCH-SenderID-TotalRecipients

from dcc:
bayes_ignore_header X-Spam-DCC

from sophos:
bayes_ignore_header X-PMX-Spam

Thanks
Axb
2014-10-14 08:44:51 UTC
Permalink
have you verified that some of these are not included?

X-Originating-IP will not be included as it can be used to help detect
ham or spam
Post by Alessio Cecchi
Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf
bayes_ignore_header X-Qmail-Scanner-Diagnostics
bayes_ignore_header X-Qmail-Scanner-MOVED-X-Spam-Status
bayes_ignore_header X-Originating-IP
bayes_ignore_header X-Spam-CMAE-Analysis
bayes_ignore_header X-CMAE-Match
bayes_ignore_header X-CMAE-Score
bayes_ignore_header X-CMAE-Analysis
bayes_ignore_header X-Spam-CTCH-RefID
bayes_ignore_header X-CTCH-SenderID
bayes_ignore_header X-CTCH-SenderID-TotalMessages
bayes_ignore_header X-CTCH-SenderID-TotalSuspected
bayes_ignore_header X-CTCH-SenderID-TotalBulk
bayes_ignore_header X-CTCH-SenderID-TotalConfirmed
bayes_ignore_header X-CTCH-SenderID-TotalRecipients
bayes_ignore_header X-Spam-DCC
bayes_ignore_header X-PMX-Spam
Thanks
Alessio Cecchi
2014-10-14 08:50:14 UTC
Permalink
Post by Axb
have you verified that some of these are not included?
Yes, twice.
Post by Axb
X-Originating-IP will not be included as it can be used to help detect
ham or spam
Ok, thanks
Axb
2014-10-14 08:58:23 UTC
Permalink
Post by Alessio Cecchi
Post by Axb
have you verified that some of these are not included?
Yes, twice.
Post by Axb
X-Originating-IP will not be included as it can be used to help detect
ham or spam
Ok, thanks
sorted / uniq'd / committed
RW
2014-10-14 11:51:39 UTC
Permalink
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.

In particular X-Delivered-To can be very useful to Bayes when users
have multiple addresses.
Axb
2014-10-14 11:58:27 UTC
Permalink
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....
Post by RW
In particular X-Delivered-To can be very useful to Bayes when users
have multiple addresses.
oops, that sneaked in.. fixed
Reindl Harald
2014-10-14 12:02:05 UTC
Permalink
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....
but aren't those IPs mostly dynamic ones from botnets, so that you end
up with a lot of tokens over time?
Tom Hendrikx
2014-10-14 13:34:01 UTC
Permalink
Post by Reindl Harald
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....
but are those IP's not mostly dynamic ones from botnets and so you end
in a lot of tokens over the time?
Or it is a good machine that is rooted, and then cleaned up and restored
to business by its whitehat admin. But it still is blocked by your
self-inflicted 'bayes poison' :)

Use rbls for ip-based reputation, not bayes.

Tom
Axb
2014-10-14 13:47:09 UTC
Permalink
Post by Tom Hendrikx
Post by Reindl Harald
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....
but are those IP's not mostly dynamic ones from botnets and so you end
in a lot of tokens over the time?
OMG... my 4.3GB SCSI2 disk will explode if I have a few (more) pointless
tokens...
Post by Tom Hendrikx
Or it is a good machine that is rooted, and then cleaned up and restored
to business by its whitehat admin. But it still is blocked by your
self-inflicted 'bayes poison' :)
sensibly tuned expiration doesn't permit self inflicted 'bayes poison'
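For reference, expiry is controlled from the config file. A minimal sketch, assuming the stock Bayes store and that these two directives behave as documented for your SpamAssassin version; the size value is only an illustration, not a recommendation:

```
# local.cf (illustrative values, adjust to your own traffic)

# let SpamAssassin expire old tokens automatically
bayes_auto_expire 1

# upper bound on the number of tokens kept before expiry trims the database
bayes_expiry_max_db_size 150000
```

With expiry active, tokens that stop appearing in mail are eventually dropped, which is the mechanism the remark above relies on.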
Reindl Harald
2014-10-14 14:09:03 UTC
Permalink
Post by Axb
Post by Tom Hendrikx
Post by Reindl Harald
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....
but are those IP's not mostly dynamic ones from botnets and so you end
in a lot of tokens over the time?
OMG... my 4.3GB SCSI2 disk will explode if I have a few (more) pointless
tokens...
the question is: are they pointless and if yes why store them
Post by Axb
Post by Tom Hendrikx
Or it is a good machine that is rooted, and then cleaned up and restored
to business by its whitehat admin. But it still is blocked by your
self-inflicted 'bayes poison' :)
sensibly tuned expiration doesn't permit self inflicted 'bayes poison'
in case of hand maintained bayes that won't happen
anyways, that's what "local.cf" is for
RW
2014-10-14 15:07:52 UTC
Permalink
On Tue, 14 Oct 2014 13:58:27 +0200
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized...
As I do with, for example:

X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]

in this spam Bayes found

0.999-4--HX-AntiAbuse:32007

These numbers seem to be very good indicators for me.


Most of the headers in the file have never appeared in my ham, so
they'll be pure spam indicators if they are ever faked. In general
it's difficult for a spammer to gain an overall advantage against
an average per user database using faked headers.

Whatever the merits of this on system-wide Bayes (if any beyond
reducing token count), I think it would have a negative effect on
per user Bayes.
Axb
2014-10-14 21:54:56 UTC
Permalink
Post by RW
On Tue, 14 Oct 2014 13:58:27 +0200
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized...
X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
in this spam Bayes found
0.999-4--HX-AntiAbuse:32007
These numbers seem to be very good indicators for me.
Most of the headers in the file have never appeared in my ham, so
they'll be pure spam indicators if they are ever faked. In general
it's difficult for a spammer to gain an overall advantage against
an average per user database using faked headers.
Whatever the merits of this on system-wide Bayes (if any beyond
reducing token count), I think it would have a negative effect on
per user Bayes.
oooooooooooook..
now here's a surprise (it's all in the code :)

the Bayes.pm plugin already includes:


# Which headers should we scan for tokens? Don't use all of them, as it's easy
# to pick up spurious clues from some. What we now do is use all of them
# *less* these well-known headers; that way we can pick up spammers' tracking
# headers (which are obviously not well-known in advance!).

# Received is handled specially
$IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise
|Delivered-To |Delivery-Date
|(?:X-)?Envelope-To
|X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text

|Subject # not worth a tiny gain vs. to db size increase

# Date: can provide invalid cues if your spam corpus is
# older/newer than ham
|Date

# List headers: ignore. a spamfiltering mailing list will
# become a nonspam sign.
|X-List|(?:X-)?Mailing-List
|(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
|Unsubscribe|Host|Id|Manager|Admin|Comment
|Name|Url)
|X-Unsub(?:scribe)?
|X-Mailman-Version |X-Been[Tt]here |X-Loop
|Mail-Followup-To
|X-eGroups-(?:Return|From)
|X-MDMailing-List
|X-XEmacs-List

# gatewayed through mailing list (thanks to Allen Smith)
|(?:X-)?Resent-(?:From|To|Date)
|(?:X-)?Original-(?:From|To|Date)

# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?
|X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
|X-Antispam |X-RBL-Warning |X-Mailscanner
|X-MDaemon-Deliver-To |X-Virus-Scanned
|X-Mass-Check-Id
|X-Pyzor |X-DCC-\S{2,25}-Metrics
|X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
|X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
|X-SpamCop-[^:]+
|X-SMTPD |(?:X-)?Spam-Apparently-To
|SPAM |X-Perlmx-Spam
|X-Bogosity

# some noisy Outlook headers that add no good clues:
|Content-Class |Thread-(?:Index|Topic)
|X-Original[Aa]rrival[Tt]ime

# Annotations from IMAP, POP, and MH:
|(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
|Lines |Content-Length
|X-UIDL? |X-IMAPbase

# Annotations from Bugzilla
|X-Bugzilla-[^:]+

# Annotations from VM: (thanks to Allen Smith)
|X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
|Summary-Format|VHeader|v\d-Data|Message-Order)

# Annotations from Gnus:
| X-Gnus-Mail-Source
| Xref

)}x;

# Note only the presence of these headers, in order to reduce the
# hapaxen they generate.
$MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
|X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
|D(?:KIM|omainKey)-Signature
)}ix;

funny...
Tom Hendrikx
2014-10-15 07:19:10 UTC
Permalink
Post by Axb
Post by RW
On Tue, 14 Oct 2014 13:58:27 +0200
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized...
X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
in this spam Bayes found
0.999-4--HX-AntiAbuse:32007
These numbers seem to be very good indicators for me.
Most of the headers in the file have never appeared in my ham, so
they'll be pure spam indicators if they are ever faked. In general
it's difficult for a spammer to gain an overall advantage against
an average per user database using faked headers.
Whatever the merits of this on system-wide Bayes (if any beyond
reducing token count), I think it would have a negative effect on
per user Bayes.
oooooooooooook..
now here's a suprise (it's all in the code :)
# Which headers should we scan for tokens? Don't use all of them, as it's easy
# to pick up spurious clues from some. What we now do is use all of them
# *less* these well-known headers; that way we can pick up spammers' tracking
# headers (which are obviously not well-known in advance!).
# Received is handled specially
$IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise
|Delivered-To |Delivery-Date
|(?:X-)?Envelope-To
|X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text
|Subject # not worth a tiny gain vs. to db size increase
# Date: can provide invalid cues if your spam corpus is
# older/newer than ham
|Date
# List headers: ignore. a spamfiltering mailing list will
# become a nonspam sign.
|X-List|(?:X-)?Mailing-List
|(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
|Unsubscribe|Host|Id|Manager|Admin|Comment
|Name|Url)
|X-Unsub(?:scribe)?
|X-Mailman-Version |X-Been[Tt]here |X-Loop
|Mail-Followup-To
|X-eGroups-(?:Return|From)
|X-MDMailing-List
|X-XEmacs-List
# gatewayed through mailing list (thanks to Allen Smith)
|(?:X-)?Resent-(?:From|To|Date)
|(?:X-)?Original-(?:From|To|Date)
# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?
|X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
|X-Antispam |X-RBL-Warning |X-Mailscanner
|X-MDaemon-Deliver-To |X-Virus-Scanned
|X-Mass-Check-Id
|X-Pyzor |X-DCC-\S{2,25}-Metrics
|X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
|X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
|X-SpamCop-[^:]+
|X-SMTPD |(?:X-)?Spam-Apparently-To
|SPAM |X-Perlmx-Spam
|X-Bogosity
|Content-Class |Thread-(?:Index|Topic)
|X-Original[Aa]rrival[Tt]ime
|(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
|Lines |Content-Length
|X-UIDL? |X-IMAPbase
# Annotations from Bugzilla
|X-Bugzilla-[^:]+
# Annotations from VM: (thanks to Allen Smith)
|X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
|Summary-Format|VHeader|v\d-Data|Message-Order)
| X-Gnus-Mail-Source
| Xref
)}x;
# Note only the presence of these headers, in order to reduce the
# hapaxen they generate.
$MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
|X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
|D(?:KIM|omainKey)-Signature
)}ix;
funny...
Doing this in code has some drawbacks, just like the tld listing: it's
not visible to most people (like this thread nicely illustrates), and
you actually want to have it configurable. This one actually is
configurable, so now there are 2 tuneables for this problem: the code
(mostly static, hidden from view and unreachable for 99% of the users),
and the config file.

I propose to simplify, and move the code-wise exclusion to a config file
too: one tuneable (and one location to look at) is better than two.
Besides, the config file is far easier to read for the not so
regex-capable admin :)
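
A rough sketch of what that could look like, assuming the built-in list were expressed with the existing bayes_ignore_header directive. The header names are taken from the Bayes.pm excerpt quoted above; the file name is hypothetical, and as far as I know the directive takes literal header names, so patterns such as X-Bugzilla-[^:]+ would need some other mechanism:

```
# hypothetical 20_bayes_ignore_defaults.cf
# (these headers are already ignored in code today; illustration only)
bayes_ignore_header X-Spam-Status
bayes_ignore_header X-Spam-Level
bayes_ignore_header X-Virus-Scanned
bayes_ignore_header X-Pyzor
bayes_ignore_header X-Bogosity
bayes_ignore_header X-Mailman-Version
```

An admin could then read, extend, or comment out entries in one place without touching the plugin code.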

Regards,
Tom
Martin Gregorie
2014-10-15 10:54:32 UTC
Permalink
Post by Tom Hendrikx
I propose to simplify, and move the code-wise exclusion to a config file
too: one tuneable (and one location to look at) is better than two.
Besides, the config file is far easier to read for the not so
regex-capable admin :)
That sounds good to me and, while this sort of change is in the air, can
we do the same with the TLD list please? I ask because, as of last
Friday, the TLD update that adds .link to the list hadn't filtered
through into the Fedora package update.

It also seems like a good idea to include these two configuration items in
sa-update's repertoire: the TLD list seems unlikely to stabilise any
time soon, so it would be particularly useful for it.


Martin
Axb
2014-10-15 11:06:36 UTC
Permalink
Anthony Cartmell
2014-10-15 08:54:52 UTC
Permalink
Post by Axb
oooooooooooook..
now here's a suprise (it's all in the code :)
<snip>
Post by Axb
# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?
For some time now MailScanner has recommended that users modify the
MailScanner header names to include their business name to add some
uniqueness.

So my mails contain "X-Fonant-MailScanner" and
"X-Fonant-MailScanner-SpamCheck" headers, for example.

The regexp might need updating to account for this, if we think
MailScanner headers are common enough to warrant this?

I also note that your 23_bayes_ignore_header.cf file has:

bayes_ignore_header X-ServerMaster-MailScanner

but that only covers one particular MailScanner user.
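
Until the regexp handles such names, a site can cover its own variant locally. A minimal sketch, using the two header names mentioned above as the example (substitute whatever name your MailScanner installation is configured with):

```
# local.cf
bayes_ignore_header X-Fonant-MailScanner
bayes_ignore_header X-Fonant-MailScanner-SpamCheck
```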

Anthony
--
www.fonant.com - Quality web sites
Tel. 01903 867 810
Fonant Ltd is registered in England and Wales, company No. 7006596
Registered office: Amelia House, Crescent Road, Worthing, West Sussex,
BN11 1QR
Axb
2014-10-15 09:09:46 UTC
Permalink
Post by Anthony Cartmell
Post by Axb
oooooooooooook..
now here's a suprise (it's all in the code :)
<snip>
Post by Axb
# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?
For some time now MailScanner has recommended that users modify the
MailScanner header names to include their business name to add some
uniqueness.
So my mails contain "X-Fonant-MailScanner" and
"X-Fonant-MailScanner-SpamCheck" headers, for example.
The regexp might need updating to account for this, if we think
MailScanner headers are common enough to warrant this?
bayes_ignore_header X-ServerMaster-MailScanner
but that only covers one particular MailScanner user.
that crept in when merging Reindl's data

removed
RW
2014-10-15 13:34:49 UTC
Permalink
On Tue, 14 Oct 2014 23:54:56 +0200
Post by Axb
Post by RW
On Tue, 14 Oct 2014 13:58:27 +0200
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you
may want it to be tokenized...
X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
in this spam Bayes found
0.999-4--HX-AntiAbuse:32007
These numbers seem to be very good indicators for me.
Most of the headers in the file have never appeared in my ham, so
they'll be pure spam indicators if they are ever faked. In general
it's difficult for a spammer to gain an overall advantage against
an average per user database using faked headers.
Whatever the merits of this on system-wide Bayes (if any beyond
reducing token count), I think it would have a negative effect on
per user Bayes.
oooooooooooook..
now here's a suprise (it's all in the code :)
It wasn't a surprise to me. Many of them I agree with, some I don't. On
the whole I don't care enough to patch it.

I'm not against ignoring things that obviously, or empirically, don't
help, what I didn't want was a huge list being imposed on everyone,
which was your original plan.

I certainly would patch it if X-Delivered-To were included;
Delivered-To, (?:X-)?Envelope-To definitely shouldn't be there IMO.
Post by Axb
|Subject # not worth a tiny gain vs. to db size increase
I'd forgotten about that one. The subject is already tokenized
through the body. And it probably made a lot of sense when spammers
weren't taking statistical filters seriously. But word frequencies
can be different in the subject, and spammers are now very good at
denying Bayes useful tokens. I think it's unfortunate that that
exclusion is unconditional.

The trouble is that a lot of this is a judgement about
cost/benefit. But for me Bayes uses 20 millipennies of storage, and
catches 72% of spam at BAYES_9*, and Bogofilter uses 200 millipennies
but catches 94% of spam. To me that's 180 millipennies well spent and
I wouldn't begrudge Bayes a similar amount - I might even go to
whole pennies.

Axb
2014-10-14 14:10:52 UTC
Permalink
Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related to
AV/filter products.
David F. Skoll
2014-10-14 14:17:05 UTC
Permalink
On Tue, 14 Oct 2014 16:10:52 +0200
Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.
I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?

I think being too clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?

Regards,

David.
Axb
2014-10-14 14:32:58 UTC
Permalink
Post by David F. Skoll
On Tue, 14 Oct 2014 16:10:52 +0200
Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.
I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being to clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?
David,

The "bayes_ignore" file will not become a part of SA default .cf files.
My intention is to keep a central repository in case somebody else wants
to use it, instead of maintaining it only in my local repo.

I believe in *some* massaging, as in "works for me".

I assume it depends on how you feed bayes and what kind of traffic you
deal with.

The concept of keeping Bayes from learning other filters' output is
ancient (there's a commented example in local.cf) but as with so much
in SA tuning, it's trial and possible error till you feel cozy.
Adam Katz
2014-10-14 21:08:34 UTC
Permalink
Post by David F. Skoll
Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.
I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being to clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?
The purpose of bayes_ignore_header is twofold:

1. Prevent inheriting other systems' false positives (ensure better
independence)
2. Prevent relying upon headers that won't exist at delivery time (e.g.
added by the mailbox server)

This is why it's so important to ignore other spam engines, which
basically fit into both of those categories.
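
As a sketch of point 1, ignoring headers written by upstream spam engines looks like this. The header names are ones already suggested earlier in this thread; which of them actually show up depends on your mail flow:

```
# local.cf: do not let Bayes learn from other filters' verdicts
bayes_ignore_header X-Spam-DCC
bayes_ignore_header X-PMX-Spam
bayes_ignore_header X-CMAE-Score
bayes_ignore_header X-getmail-filter-classifier
```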
Axb
2014-10-14 21:37:36 UTC
Permalink
Post by Adam Katz
Post by David F. Skoll
Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.
I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being to clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?
1. Prevent inheriting other systems' false positives (ensure better
independence)
2. Prevent relying upon headers that won't exist at delivery time (e.g.
added by the mailbox server)
This is why it's so important to ignore other spam engines, which
basically fit into both of those categories.
I'd love to have the option (switch) to use Bayes on msg bodies ONLY,
though I doubt anybody would be a taker for such a project.
(I'd even be willing to "$pon$or" such an addition to SA)
Reindl Harald
2014-10-14 22:53:58 UTC
Permalink
Post by Axb
Post by Adam Katz
Post by David F. Skoll
Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.
I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being to clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?
1. Prevent inheriting other systems' false positives (ensure better
independence)
2. Prevent relying upon headers that won't exist at delivery time (e.g.
added by the mailbox server)
This is why it's so important to ignore other spam engines, which
basically fit into both of those categories.
I'd love to have the option (switch) to use Bayes on msg bodies ONLY,
though I doubt anybody would be a taker for such a project.
(I'd even be willing to "$pon$or" such an addition to SA)
or something like the opposite of what we have now:

bayes_include_header received
bayes_include_header subject
bayes_include_header x-mailer
Jeff Mincy
2014-10-14 23:04:59 UTC
Permalink
From: Axb <***@gmail.com>
Date: Tue, 14 Oct 2014 23:37:36 +0200
Post by Adam Katz
Post by David F. Skoll
Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.
I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being to clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?
1. Prevent inheriting other systems' false positives (ensure better
independence)
2. Prevent relying upon headers that won't exist at delivery time (e.g.
added by the mailbox server)
This is why it's so important to ignore other spam engines, which
basically fit into both of those categories.
I'd love to have the option (switch) to use Bayes on msg bodies ONLY,
though I doubt anybody would be a taker for such a project.
(I'd even be willing to "$pon$or" such an addition to SA)

Wouldn't that be fairly easy to implement by intercepting the call to
_tokenize_headers in Plugin/Bayes.pm?

# Tokenize the headers
my %hdrs = $self->_tokenize_headers ($msg);
while( my($prefix, $value) = each %hdrs ) {
push(@tokens, $self->_tokenize_line ($value, "H$prefix:", 0));
}

-jeff