Post by RW
On Tue, 14 Oct 2014 13:58:27 +0200
Post by Axb
Post by RW
On Tue, 14 Oct 2014 10:44:51 +0200
Post by Axb
have you verified that some of these are not included?
X-Originating-IP will not be included as it can be used to help
detect ham or spam
It's really no different to other headers you are ignoring.
for example, if you get a flood of 419s from the same source, you may
want it to be tokenized...
X-AntiAbuse: Originator/Caller UID/GID - [514 32007] / [47 12]
in this spam Bayes found
0.999-4--HX-AntiAbuse:32007
These numbers seem to be very good indicators for me.
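For illustration, here is a rough sketch of how a header token such as HX-AntiAbuse:32007 could be derived from the header above. The helper name header_tokens and the split-on-non-word-characters rule are assumptions made for this example. The real Bayes tokenizer applies further rules that are not shown here.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of SpamAssassin): build "H<name>:<word>"
# tokens from one header. Splitting on \W+ is an assumed simplification.
sub header_tokens {
    my ($name, $value) = @_;
    # Break the value into runs of letters, digits, and underscores.
    my @words = grep { length } split /\W+/, $value;
    # Prefix each word with "H" plus the header name, as in the token above.
    return map { "H$name:$_" } @words;
}

my @tokens = header_tokens(
    'X-AntiAbuse',
    'Originator/Caller UID/GID - [514 32007] / [47 12]',
);
print "$_\n" for @tokens;
```

Under these assumptions the list includes HX-AntiAbuse:32007, the token that Bayes reported with a probability of 0.999.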
Most of the headers in the file have never appeared in my ham, so
they'll be pure spam indicators if they are ever faked. In general
it's difficult for a spammer to gain an overall advantage against
an average per user database using faked headers.
Whatever the merits of this on system-wide Bayes (if any beyond
reducing token count), I think it would have a negative effect on
per user Bayes.
oooooooooooook..
now here's a surprise (it's all in the code :)
the Bayes.pm plugin already includes:
# Which headers should we scan for tokens? Don't use all of them, as it's easy
# to pick up spurious clues from some. What we now do is use all of them
# *less* these well-known headers; that way we can pick up spammers' tracking
# headers (which are obviously not well-known in advance!).
# Received is handled specially
$IGNORED_HDRS = qr{(?: (?:X-)?Sender # misc noise
|Delivered-To |Delivery-Date
|(?:X-)?Envelope-To
|X-MIME-Auto[Cc]onverted |X-Converted-To-Plain-Text
|Subject # not worth a tiny gain vs. to db size increase
# Date: can provide invalid cues if your spam corpus is
# older/newer than ham
|Date
# List headers: ignore. a spamfiltering mailing list will
# become a nonspam sign.
|X-List|(?:X-)?Mailing-List
|(?:X-)?List-(?:Archive|Help|Id|Owner|Post|Subscribe
|Unsubscribe|Host|Id|Manager|Admin|Comment
|Name|Url)
|X-Unsub(?:scribe)?
|X-Mailman-Version |X-Been[Tt]here |X-Loop
|Mail-Followup-To
|X-eGroups-(?:Return|From)
|X-MDMailing-List
|X-XEmacs-List
# gatewayed through mailing list (thanks to Allen Smith)
|(?:X-)?Resent-(?:From|To|Date)
|(?:X-)?Original-(?:From|To|Date)
# Spamfilter/virus-scanner headers: too easy to chain from
# these
|X-MailScanner(?:-SpamCheck)?
|X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
|X-Antispam |X-RBL-Warning |X-Mailscanner
|X-MDaemon-Deliver-To |X-Virus-Scanned
|X-Mass-Check-Id
|X-Pyzor |X-DCC-\S{2,25}-Metrics
|X-Filtered-B[Yy] |X-Scanned-By |X-Scanner
|X-AP-Spam-(?:Score|Status) |X-RIPE-Spam-Status
|X-SpamCop-[^:]+
|X-SMTPD |(?:X-)?Spam-Apparently-To
|SPAM |X-Perlmx-Spam
|X-Bogosity
# some noisy Outlook headers that add no good clues:
|Content-Class |Thread-(?:Index|Topic)
|X-Original[Aa]rrival[Tt]ime
# Annotations from IMAP, POP, and MH:
|(?:X-)?Status |X-Flags |X-Keywords |Replied |Forwarded
|Lines |Content-Length
|X-UIDL? |X-IMAPbase
# Annotations from Bugzilla
|X-Bugzilla-[^:]+
# Annotations from VM: (thanks to Allen Smith)
|X-VM-(?:Bookmark|(?:POP|IMAP)-Retrieved|Labels|Last-Modified
|Summary-Format|VHeader|v\d-Data|Message-Order)
# Annotations from Gnus:
| X-Gnus-Mail-Source
| Xref
)}x;
# Note only the presence of these headers, in order to reduce the
# hapaxen they generate.
$MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
|X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
|D(?:KIM|omainKey)-Signature
)}ix;
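A minimal sketch of how two patterns like these could be applied to header names. The patterns here are shortened stand-ins for the full lists quoted above, classify_header is a hypothetical helper rather than a function from Bayes.pm, and anchoring with \A and \z is an assumption about how the name is matched.

```perl
use strict;
use warnings;

# Shortened stand-ins for the quoted patterns (assumption: only a few
# alternatives are kept so the example stays small).
my $IGNORED_HDRS = qr{(?: (?:X-)?Sender
  |Delivered-To |Date
  |X-Spam(?:-(?:Status|Level|Flag|Report|Hits|Score|Checker-Version))?
  |X-Virus-Scanned
)}x;

my $MARK_PRESENCE_ONLY_HDRS = qr{(?: X-Face
  |X-(?:Gnu-?PG|PGP|GPG)(?:-Key)?-Fingerprint
  |D(?:KIM|omainKey)-Signature
)}ix;

# Hypothetical helper: decide how one header name would be treated.
sub classify_header {
    my ($name) = @_;
    # Well-known headers are dropped before tokenizing.
    return 'ignore'        if $name =~ /\A$IGNORED_HDRS\z/;
    # For these, only the presence of the header is noted.
    return 'presence-only' if $name =~ /\A$MARK_PRESENCE_ONLY_HDRS\z/;
    # Everything else, including unknown tracking headers, is tokenized.
    return 'tokenize';
}

for my $hdr (qw(Date X-Spam-Status DKIM-Signature X-AntiAbuse X-Originating-IP)) {
    printf "%-18s %s\n", $hdr, classify_header($hdr);
}
```

With these shortened patterns, Date and X-Spam-Status are ignored, DKIM-Signature is reduced to a presence marker, and X-AntiAbuse and X-Originating-IP fall through to normal tokenization, which matches the behaviour discussed earlier in the thread.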
funny...