23_bayes_ignore

Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf

I am not really sure, but including

bayes_ignore_header X-getmail-filter-classifier

might be a good idea too

http://pyropus.ca/software/getmail/configuration.html

Filter_classifier — run the message through an external program, and
insert the output of the program into X-getmail-filter-classifier:
header fields in the message. Messages can be dropped by having the
filter return specific exit codes.

Best Regards
Robert Schetterer

--
[*] sys4 AG

http://sys4.de, +49 (89) 30 90 46 64
Franziskanerstraße 15, 81669 München

Registered office: München, District Court München: HRB 199263
Executive Board: Patrick Ben Koetter, Marc Schiffbauer
Chairman of the Supervisory Board: Florian Kirstein

Axb

2014-10-14 08:13:55 UTC

Post by Robert Schetterer

Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf

I am not really sure, but including
bayes_ignore_header X-getmail-filter-classifier
might be a good idea too
http://pyropus.ca/software/getmail/configuration.html
Filter_classifier — run the message through an external program, and
insert the output of the program into X-getmail-filter-classifier:
header fields in the message. Messages can be dropped by having the
filter return specific exit codes.

added! thanks

Alessio Cecchi

2014-10-14 08:37:28 UTC

Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf

I suggest these:

from qmail-scanner:
bayes_ignore_header X-Qmail-Scanner-Diagnostics
bayes_ignore_header X-Qmail-Scanner-MOVED-X-Spam-Status
bayes_ignore_header X-Originating-IP

from cloudmark:
bayes_ignore_header X-Spam-CMAE-Analysis
bayes_ignore_header X-CMAE-Match
bayes_ignore_header X-CMAE-Score
bayes_ignore_header X-CMAE-Analysis

from commtouch:
bayes_ignore_header X-Spam-CTCH-RefID
bayes_ignore_header X-CTCH-SenderID
bayes_ignore_header X-CTCH-SenderID-TotalMessages
bayes_ignore_header X-CTCH-SenderID-TotalSuspected
bayes_ignore_header X-CTCH-SenderID-TotalBulk
bayes_ignore_header X-CTCH-SenderID-TotalConfirmed
bayes_ignore_header X-CTCH-SenderID-TotalRecipients

from dcc:
bayes_ignore_header X-Spam-DCC

from sophos:
bayes_ignore_header X-PMX-Spam

Thanks

Axb

2014-10-14 08:44:51 UTC

have you verified that some of these are not included?

X-Originating-IP will not be included as it can be used to help detect
ham or spam

Post by Alessio Cecchi

Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf

bayes_ignore_header X-Qmail-Scanner-Diagnostics
bayes_ignore_header X-Qmail-Scanner-MOVED-X-Spam-Status
bayes_ignore_header X-Originating-IP
bayes_ignore_header X-Spam-CMAE-Analysis
bayes_ignore_header X-CMAE-Match
bayes_ignore_header X-CMAE-Score
bayes_ignore_header X-CMAE-Analysis
bayes_ignore_header X-Spam-CTCH-RefID
bayes_ignore_header X-CTCH-SenderID
bayes_ignore_header X-CTCH-SenderID-TotalMessages
bayes_ignore_header X-CTCH-SenderID-TotalSuspected
bayes_ignore_header X-CTCH-SenderID-TotalBulk
bayes_ignore_header X-CTCH-SenderID-TotalConfirmed
bayes_ignore_header X-CTCH-SenderID-TotalRecipients
bayes_ignore_header X-Spam-DCC
bayes_ignore_header X-PMX-Spam
Thanks

Alessio Cecchi

2014-10-14 08:50:14 UTC

Post by Axb
have you verified that some of these are not included?

Yes, twice.

Post by Axb
X-Originating-IP will not be included as it can be used to help detect
ham or spam

Ok, thanks

Axb

2014-10-14 08:58:23 UTC

Post by Alessio Cecchi

Post by Axb
have you verified that some of these are not included?

Yes, twice.

Post by Axb
X-Originating-IP will not be included as it can be used to help detect
ham or spam

Ok, thanks

sorted / uniq'd / committed

2014-10-14 11:51:39 UTC

On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

In particular X-Delivered-To can be very useful to Bayes when users
have multiple addresses.

Axb

2014-10-14 11:58:27 UTC

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....

Post by RW
In particular X-Delivered-To can be very useful to Bayes when users
have multiple addresses.

Oops, that sneaked in... fixed

Reindl Harald

2014-10-14 12:02:05 UTC

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....

But aren't those IPs mostly dynamic ones from botnets, so you end up
with a lot of tokens over time?

Tom Hendrikx

2014-10-14 13:34:01 UTC

Post by Reindl Harald

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....

But aren't those IPs mostly dynamic ones from botnets, so you end up
with a lot of tokens over time?

Or it is a good machine that is rooted, and then cleaned up and restored
to business by its whitehat admin. But it still is blocked by your
self-inflicted 'bayes poison' :)

Use rbls for ip-based reputation, not bayes.

Tom

Axb

2014-10-14 13:47:09 UTC

Post by Tom Hendrikx

Post by Reindl Harald

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....

But aren't those IPs mostly dynamic ones from botnets, so you end up
with a lot of tokens over time?

OMG... my 4.3GB SCSI2 disk will explode if I have a few (more) pointless
tokens...

Post by Tom Hendrikx
Or it is a good machine that is rooted, and then cleaned up and restored
to business by its whitehat admin. But it still is blocked by your
self-inflicted 'bayes poison' :)

sensibly tuned expiration doesn't permit self inflicted 'bayes poison'

Reindl Harald

2014-10-14 14:09:03 UTC

Post by Tom Hendrikx

Post by Reindl Harald

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you may
want it to be tokenized... or not?
or if it only sends ham....

But aren't those IPs mostly dynamic ones from botnets, so you end up
with a lot of tokens over time?

OMG... my 4.3GB SCSI2 disk will explode if I have a few (more) pointless
tokens...

The question is: are they pointless, and if so, why store them?

Post by Tom Hendrikx
Or it is a good machine that is rooted, and then cleaned up and restored
to business by its whitehat admin. But it still is blocked by your
self-inflicted 'bayes poison' :)

sensibly tuned expiration doesn't permit self inflicted 'bayes poison'

With a hand-maintained Bayes that won't happen.
Anyway, that's what "local.cf" is for.

2014-10-14 15:07:52 UTC

On Tue, 14 Oct 2014 13:58:27 +0200

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you may
want it to be tokenized...

As I do with, for example:

X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]

in this spam Bayes found

0.999-4--HX-AntiAbuse:32007

These numbers seem to be very good indicators for me.

Most of the headers in the file have never appeared in my ham, so
they'll be pure spam indicators if they are ever faked. In general
it's difficult for a spammer to gain an overall advantage against
an average per user database using faked headers.

Whatever the merits of this on system-wide Bayes (if any beyond
reducing token count), I think it would have a negative effect on
per user Bayes.

Axb

2014-10-14 21:54:56 UTC

Post by RW
On Tue, 14 Oct 2014 13:58:27 +0200

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you may
want it to be tokenized...

X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
in this spam Bayes found
0.999-4--HX-AntiAbuse:32007
These numbers seem to be very good indicators for me.
Most of the headers in the file have never appeared in my ham, so
they'll be pure spam indicators if they are ever faked. In general
it's difficult for a spammer to gain an overall advantage against
an average per user database using faked headers.
Whatever the merits of this on system-wide Bayes (if any beyond
reducing token count), I think it would have a negative effect on
per user Bayes.

oooooooooooook..
now here's a surprise (it's all in the code :)

the Bayes.pm plugin already includes:

# Which headers should we scan for tokens? Don't use all of them, as it's easy
# to pick up spurious clues from some. What we now do is use all of them
# *less* these well-known headers; that way we can pick up spammers' tracking
# headers (which are obviously not well-known in advance!).

# Received is handled specially
$IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise
|Delivered-To |Delivery-Date
|(?:X-)?Envelope-To
|X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text

|Subject # not worth a tiny gain vs. to db size increase

# Date: can provide invalid cues if your spam corpus is
# older/newer than ham
|Date

# List headers: ignore. a spamfiltering mailing list will
# become a nonspam sign.
|X-List|(?:X-)?Mailing-List
|(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
|Unsubscribe|Host|Id|Manager|Admin|Comment
|Name|Url)
|X-Unsub(?:scribe)?
|X-Mailman-Version |X-Been[Tt]here |X-Loop
|Mail-Followup-To
|X-eGroups-(?:Return|From)
|X-MDMailing-List
|X-XEmacs-List

# gatewayed through mailing list (thanks to Allen Smith)
|(?:X-)?Resent-(?:From|To|Date)
|(?:X-)?Original-(?:From|To|Date)

# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?
|X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
|X-Antispam |X-RBL-Warning |X-Mailscanner
|X-MDaemon-Deliver-To |X-Virus-Scanned
|X-Mass-Check-Id
|X-Pyzor |X-DCC-\S{2,25}-Metrics
|X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
|X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
|X-SpamCop-[^:]+
|X-SMTPD |(?:X-)?Spam-Apparently-To
|SPAM |X-Perlmx-Spam
|X-Bogosity

# some noisy Outlook headers that add no good clues:
|Content-Class |Thread-(?:Index|Topic)
|X-Original[Aa]rrival[Tt]ime

# Annotations from IMAP, POP, and MH:
|(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
|Lines |Content-Length
|X-UIDL? |X-IMAPbase

# Annotations from Bugzilla
|X-Bugzilla-[^:]+

# Annotations from VM: (thanks to Allen Smith)
|X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
|Summary-Format|VHeader|v\d-Data|Message-Order)

# Annotations from Gnus:
| X-Gnus-Mail-Source
| Xref

)}x;

# Note only the presence of these headers, in order to reduce the
# hapaxen they generate.
$MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
|X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
|D(?:KIM|omainKey)-Signature
)}ix;
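
As a rough illustration of how such an ignore pattern could be applied, here is a minimal Python sketch. It copies only a handful of the alternatives quoted above, and it assumes the pattern is matched case-insensitively against the whole header name. The function names are made up for this example and are not part of SpamAssassin.

```python
import re

# A small, hand-picked subset of the alternatives quoted above; the list
# in Bayes.pm is much longer. re.VERBOSE plays the role of Perl's /x flag.
IGNORED_HDRS = re.compile(
    r"""(?: (?:X-)?Sender
          | Delivered-To | Delivery-Date
          | (?:X-)?Envelope-To
          | Subject
          | Date
          | X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
          | X-DCC-\S{2,25}-Metrics
        )""",
    re.VERBOSE | re.IGNORECASE,
)


def is_ignored(header_name):
    """Return True if the whole header name matches the ignore pattern.

    Assumption: full-name, case-insensitive matching.
    """
    return IGNORED_HDRS.fullmatch(header_name) is not None


def headers_to_tokenize(header_names):
    """Keep only the header names that the ignore pattern does not cover."""
    return [name for name in header_names if not is_ignored(name)]
```

With this subset, `X-Spam-Status` and `Delivered-To` are skipped, while `X-Originating-IP` and `X-Delivered-To` would still be tokenized.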

funny...

Tom Hendrikx

2014-10-15 07:19:10 UTC

Post by RW
On Tue, 14 Oct 2014 13:58:27 +0200

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you may
want it to be tokenized...

oooooooooooook..
now here's a surprise (it's all in the code :)
# Which headers should we scan for tokens? Don't use all of them, as it's easy
# to pick up spurious clues from some. What we now do is use all of them
# *less* these well-known headers; that way we can pick up spammers' tracking
# headers (which are obviously not well-known in advance!).
# Received is handled specially
$IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise
|Delivered-To |Delivery-Date
|(?:X-)?Envelope-To
|X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text
|Subject # not worth a tiny gain vs. to db size increase
# Date: can provide invalid cues if your spam corpus is
# older/newer than ham
|Date
# List headers: ignore. a spamfiltering mailing list will
# become a nonspam sign.
|X-List|(?:X-)?Mailing-List
|(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
|Unsubscribe|Host|Id|Manager|Admin|Comment
|Name|Url)
|X-Unsub(?:scribe)?
|X-Mailman-Version |X-Been[Tt]here |X-Loop
|Mail-Followup-To
|X-eGroups-(?:Return|From)
|X-MDMailing-List
|X-XEmacs-List
# gatewayed through mailing list (thanks to Allen Smith)
|(?:X-)?Resent-(?:From|To|Date)
|(?:X-)?Original-(?:From|To|Date)
# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?
|X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
|X-Antispam |X-RBL-Warning |X-Mailscanner
|X-MDaemon-Deliver-To |X-Virus-Scanned
|X-Mass-Check-Id
|X-Pyzor |X-DCC-\S{2,25}-Metrics
|X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
|X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
|X-SpamCop-[^:]+
|X-SMTPD |(?:X-)?Spam-Apparently-To
|SPAM |X-Perlmx-Spam
|X-Bogosity
|Content-Class |Thread-(?:Index|Topic)
|X-Original[Aa]rrival[Tt]ime
|(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
|Lines |Content-Length
|X-UIDL? |X-IMAPbase
# Annotations from Bugzilla
|X-Bugzilla-[^:]+
# Annotations from VM: (thanks to Allen Smith)
|X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
|Summary-Format|VHeader|v\d-Data|Message-Order)
| X-Gnus-Mail-Source
| Xref
)}x;
# Note only the presence of these headers, in order to reduce the
# hapaxen they generate.
$MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
|X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
|D(?:KIM|omainKey)-Signature
)}ix;
funny...

Doing this in code has some drawbacks, just like the tld listing: it's
not visible to most people (like this thread nicely illustrates), and
you actually want to have it configurable. This one actually is
configurable, so now there are 2 tuneables for this problem: the code
(mostly static, hidden from view and unreachable for 99% of the users),
and the config file.

I propose to simplify, and move the code-wise exclusion to a config file
too: one tuneable (and one location to look at) is better than two.
Besides, the config file is far easier to read for the not so
regex-capable admin :)

Regards,
Tom

Martin Gregorie

2014-10-15 10:54:32 UTC

Post by Tom Hendrikx
I propose to simplify, and move the code-wise exclusion to a config file
too: one tuneable (and one location to look at) is better than two.
Besides, the config file is far easier to read for the not so
regex-capable admin :)

That sounds good to me and, while this sort of change is in the air, can
we do the same with the TLD list please? I ask because, as of last
Friday, the TLD update that adds .link to the list hadn't filtered
through into the Fedora package update.

It also seems like a good idea to include these two configuration items in
sa_update's repertoire: the TLD list seems unlikely to stabilise any
time soon, so it would be particularly useful for it.

Martin

Axb

2014-10-15 11:06:36 UTC


Anthony Cartmell

2014-10-15 08:54:52 UTC

Post by Axb
oooooooooooook..
now here's a surprise (it's all in the code :)

<snip>

Post by Axb
# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?

For some time now MailScanner has recommended that users modify the
MailScanner header names to include their business name to add some
uniqueness.

So my mails contain "X-Fonant-MailScanner" and
"X-Fonant-MailScanner-SpamCheck" headers, for example.

The regexp might need updating to account for this, if we think
MailScanner headers are common enough to warrant this?
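
One way to cover site-specific names would be to allow an optional organisation label between "X-" and "MailScanner". The Python sketch below only illustrates the idea. The pattern and helper name are hypothetical, it assumes the label contains just letters, digits and underscores, and it covers only the two header suffixes mentioned in this thread.

```python
import re

# Hypothetical pattern: "X-", an optional site label plus hyphen, then
# "MailScanner" with an optional "-SpamCheck" suffix.
MAILSCANNER_HDR = re.compile(
    r"X-(?:[A-Za-z0-9_]+-)?MailScanner(?:-SpamCheck)?",
    re.IGNORECASE,
)


def is_mailscanner_header(header_name):
    """Return True for plain or site-prefixed MailScanner header names."""
    return MAILSCANNER_HDR.fullmatch(header_name) is not None


for name in ("X-MailScanner",
             "X-Fonant-MailScanner",
             "X-Fonant-MailScanner-SpamCheck",
             "X-Mailer"):
    print(name, is_mailscanner_header(name))
```

The optional label keeps the plain `X-MailScanner` form working while also accepting any single-label site prefix, not just one particular user's.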

I also note that your 23_bayes_ignore_header.cf file has:

bayes_ignore_header X-ServerMaster-MailScanner

but that only covers one particular MailScanner user.

Anthony

--
www.fonant.com - Quality web sites
Tel. 01903 867 810
Fonant Ltd is registered in England and Wales, company No. 7006596
Registered office: Amelia House, Crescent Road, Worthing, West Sussex,
BN11 1QR

Axb

2014-10-15 09:09:46 UTC

Post by Anthony Cartmell

Post by Axb
oooooooooooook..
now here's a surprise (it's all in the code :)

<snip>

Post by Axb
# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?

For some time now MailScanner has recommended that users modify the
MailScanner header names to include their business name to add some
uniqueness.
So my mails contain "X-Fonant-MailScanner" and
"X-Fonant-MailScanner-SpamCheck" headers, for example.
The regexp might need updating to account for this, if we think
MailScanner headers are common enough to warrant this?
bayes_ignore_header X-ServerMaster-MailScanner
but that only covers one particular MailScanner user.

that crept in when merging Reindl's data

removed

2014-10-15 13:34:49 UTC

On Tue, 14 Oct 2014 23:54:56 +0200

Post by RW
On Tue, 14 Oct 2014 13:58:27 +0200

Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200

Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam

It's really no different to other headers you are ignoring.

for example, if you get a flood of 419s from the same source, you
may want it to be tokenized...

oooooooooooook..
now here's a surprise (it's all in the code :)

It wasn't a surprise to me. Many of them I agree with, some I don't. On
the whole I don't care enough to patch it.

I'm not against ignoring things that obviously, or empirically, don't
help, what I didn't want was a huge list being imposed on everyone,
which was your original plan.

I certainly would patch it if X-Delivered-To were included;
Delivered-To, (?:X-)?Envelope-To definitely shouldn't be there IMO.

Post by Axb
|Subject # not worth a tiny gain vs. to db size increase

I'd forgotten about that one. The subject is already tokenized
through the body. And it probably made a lot of sense when spammers
weren't taking statistical filters seriously. But word frequencies
can be different in the subject, and spammers are now very good at
denying Bayes useful tokens. I think it's unfortunate that that
exclusion is unconditional.

The trouble is that a lot of this is a judgement about
cost/benefit. But for me Bayes uses 20 millipennies of storage, and
catches 72% of spam at BAYES_9*, and Bogofilter uses 200 millipennies
but catches 94% of spam. To me that's 180 millipennies well spent and
I wouldn't begrudge Bayes a similar amount - I might even go to
whole pennies.

Axb

2014-10-14 14:10:52 UTC

Post by Axb
Updated (in case you're using it.....)
http://svn.apache.org/repos/asf/spamassassin/trunk/rulesrc/sandbox/axb/23_bayes_ignore_header.cf

and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related to
AV/filter products.

David F. Skoll

2014-10-14 14:17:05 UTC

On Tue, 14 Oct 2014 16:10:52 +0200

Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.

I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?

I think being too clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?
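
To make the "one token rather than three" point concrete, here is a small Python sketch comparing two tokenizers. Both regular expressions are assumptions chosen for this example and say nothing about how SpamAssassin's Bayes tokenizer actually works.

```python
import re

# Naive tokenizer: every run of letters/digits is a separate token.
NAIVE = re.compile(r"[A-Za-z0-9]+")

# Alternative: allow single dots or hyphens inside a token, so host names
# and hyphenated words survive as one unit.
UNIT = re.compile(r"[A-Za-z0-9]+(?:[.-][A-Za-z0-9]+)*")


def tokenize(text, pattern=UNIT):
    """Return all non-overlapping matches of the pattern, in order."""
    return pattern.findall(text)


text = "Visit www.example.com today."
print(tokenize(text, NAIVE))  # ['Visit', 'www', 'example', 'com', 'today']
print(tokenize(text))         # ['Visit', 'www.example.com', 'today']
```

The trailing full stop after "today" is not absorbed, because a dot is only kept when more letters or digits follow it.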

Regards,

David.

Axb

2014-10-14 14:32:58 UTC

Post by David F. Skoll
On Tue, 14 Oct 2014 16:10:52 +0200

Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.

I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being too clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?

David,

The "bayes_ignore" file will not become a part of SA default .cf files.
My intention is to keep a central repository in case somebody else wants
to use it, instead of maintaining it in my local repo.

I believe in *some* massaging, as in "works for me".

I assume it depends on how you feed bayes and what kind of traffic you
deal with.

The concept of keeping Bayes from learning other filters' stuff is
ancient (there's a commented example in local.cf), but as with so much
in SA tuning, it's trial and possible error till you feel cozy.

Adam Katz

2014-10-14 21:08:34 UTC

Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.

I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being too clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?

The purpose of bayes_ignore_header is twofold:

1. Prevent inheriting other systems' false positives (ensure better
independence)
2. Prevent relying upon headers that won't exist at delivery time (e.g.
added by the mailbox server)

This is why it's so important to ignore other spam engines, which
basically fit into both of those categories.

Axb

2014-10-14 21:37:36 UTC

Post by Adam Katz

Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.

I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being too clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?

The purpose of bayes_ignore_header is twofold:
1. Prevent inheriting other systems' false positives (ensure better
independence)
2. Prevent relying upon headers that won't exist at delivery time (e.g.
added by the mailbox server)
This is why it's so important to ignore other spam engines, which
basically fit into both of those categories.

I'd love to have the option (switch) to use Bayes on msg bodies ONLY,
though I doubt anybody would be a taker for such a project.
(I'd even be willing to "$pon$or" such an addition to SA)

Reindl Harald

2014-10-14 22:53:58 UTC

Post by Adam Katz

Post by Axb
and to avoid further discussions of what header may pollute bayes or
not, I've removed all header entries which are not directly related
to AV/filter products.

I'm not sure I agree with being too clever about Bayes. Surely by its
very nature, the Bayes algorithm will itself indicate which tokens
are relevant and which are not? Isn't that the whole point of Bayes?
I think being too clever about massaging the data that gets fed to
Bayes may be counter-productive. For sure, *some* massaging is in order;
a token should be a semantic unit, so something like "www.example.com"
should probably be one token rather than three, but beyond that I wonder
if it's good or not to massage the data?

I'd love to have the option (switch) to use Bayes on msg bodies ONLY,
though I doubt anybody would be a taker for such a project.
(I'd even be willing to "$pon$or" such an addition to SA)

or something like the opposite of what we have now:

bayes_include_header received
bayes_include_header subject
bayes_include_header x-mailer
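
The difference between the existing deny-list approach and the allow-list idea above can be sketched in a few lines of Python. This is only an illustration: `bayes_include_header` is the hypothetical option proposed in this post, and `select_headers` and its parameters are invented for the example.

```python
def select_headers(headers, ignore=(), include=None):
    """Pick the header names a tokenizer would look at.

    headers: iterable of header names from one message.
    ignore:  names to drop (deny-list, in the spirit of bayes_ignore_header).
    include: if given, only these names are kept (allow-list, in the
             spirit of the hypothetical bayes_include_header).
    Names are compared case-insensitively.
    """
    ignore_set = {name.lower() for name in ignore}
    include_set = None if include is None else {name.lower() for name in include}
    selected = []
    for name in headers:
        key = name.lower()
        if include_set is not None and key not in include_set:
            continue  # allow-list active and this header is not on it
        if key in ignore_set:
            continue  # explicitly ignored
        selected.append(name)
    return selected


msg = ["Received", "Subject", "X-Mailer", "X-Spam-DCC", "X-PMX-Spam"]
print(select_headers(msg, ignore=["X-Spam-DCC", "X-PMX-Spam"]))
print(select_headers(msg, include=["received", "subject", "x-mailer"]))
```

For this sample message both calls keep the same three headers. The difference shows up with headers nobody listed: the deny-list would still tokenize them, the allow-list would not.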

Jeff Mincy

2014-10-14 23:04:59 UTC

From: Axb <***@gmail.com>
Date: Tue, 14 Oct 2014 23:37:36 +0200

Post by Adam Katz