Manually training SpamAssassin by forwarding mail

Discussion:

Sander Holthaus - Orange XL

2005-02-03 00:59:21 UTC

I've been interested in letting customers manually train the
SpamAssassin Bayes filter on ham and spam (to reduce false positives and
negatives). However, I can only find documentation on this for local
mailboxes and IMAP. Most users, however, retrieve their mail through POP and
use Outlook (Express) as their mail client. Is there a way to train SpamAssassin
with such a setup (e.g. forwarding mail with Outlook (Express) using SMTP)?

Kind Regards,
Sander Holthaus

Will Yardley

2005-02-03 01:05:32 UTC

There are various schemes to do this; the tricky part is getting people
to submit emails in a consistent format. If you can get them to forward
them as message/rfc822 attachments, it probably wouldn't be too hard to
write a program to extract them and train. I imagine this would be too
complicated for many users, though.

One scheme that we've used is to have specially named IMAP folders that
users can place misclassified emails in for training. You can then
have a server-side robot that trains the filter and then discards the
emails.

Matt Kettler

2005-02-03 01:07:49 UTC

Only if you can somehow get the users to forward an unmangled message,
complete with original headers, as an attachment. You can then have a
script strip off the attachments and feed those to sa-learn.

The fundamental problem with normal forwarding is that, from an SA
perspective, the forwarded message looks very little like the original: new
headers, different encoding, extra text often added to the body, superfluous
MIME sections dropped, others added.

Since SA learns from the message headers, and some of the message encoding
has an impact on learning, these changes cause problems.
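
A minimal sketch of the "strip off the attachments" step, written in Python with only the standard library. The function name `extract_attached_messages` is hypothetical, and the sketch assumes the client really did attach the original as a message/rfc822 part; re-serializing the parsed message is not guaranteed to be byte-identical to what SpamAssassin first saw.

```python
from email import message_from_bytes
from typing import List


def _collect(part, found: List[bytes]) -> None:
    """Gather attached originals without descending into them."""
    if part.get_content_type() == "message/rfc822":
        payload = part.get_payload()
        # For message/rfc822 the parser stores the embedded message as a
        # one-element list of message objects.
        if isinstance(payload, list) and payload:
            found.append(payload[0].as_bytes())
        return  # anything nested inside the original stays part of it
    if part.is_multipart():
        for sub in part.get_payload():
            _collect(sub, found)


def extract_attached_messages(raw_report: bytes) -> List[bytes]:
    """Return every message/rfc822 attachment of a forwarded report."""
    found: List[bytes] = []
    _collect(message_from_bytes(raw_report), found)
    return found
```

Each returned item could then be handed to sa-learn. A report forwarded inline (no attachment) yields an empty list, which is a convenient signal to reject it rather than learn from mangled text.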

Kevin Sullivan

2005-02-04 03:32:25 UTC

If you want to do a lot of programming, you could save all incoming
messages for a few days in a database somewhere. When a user forwards a
message to a special "ham" or "spam" mailbox, you pull the message-id from
the message and use it to recover the original message from your database.
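
A rough sketch of such a store in Python, using SQLite from the standard library. Everything here is an assumption made for illustration: the table layout, the seven-day retention window, and the helper names `open_store`, `remember` and `recover`. How the Message-ID is obtained from the user's forwarded report is left out, since that depends on how the client forwards.

```python
import sqlite3
import time
from email import message_from_bytes
from typing import Optional

RETENTION_SECONDS = 7 * 24 * 3600  # hypothetical "few days" window


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the store of pristine incoming messages."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS seen ("
        "message_id TEXT PRIMARY KEY, received REAL, raw BLOB)"
    )
    return db


def remember(db: sqlite3.Connection, raw: bytes,
             now: Optional[float] = None) -> None:
    """Save a copy of an incoming message and expire old entries."""
    msg_id = message_from_bytes(raw).get("Message-ID")
    if msg_id is None:
        return  # nothing to key on, so this message cannot be recovered
    now = time.time() if now is None else now
    db.execute("INSERT OR REPLACE INTO seen VALUES (?, ?, ?)",
               (str(msg_id).strip(), now, raw))
    db.execute("DELETE FROM seen WHERE received < ?",
               (now - RETENTION_SECONDS,))
    db.commit()


def recover(db: sqlite3.Connection, message_id: str) -> Optional[bytes]:
    """Return the stored original for a reported Message-ID, if still kept."""
    row = db.execute("SELECT raw FROM seen WHERE message_id = ?",
                     (message_id.strip(),)).fetchone()
    return row[0] if row else None
```

Reports that arrive after the retention window simply return nothing, so the scheme degrades to "ignored" rather than learning from a mangled copy.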

-Kevin

Peter Marshall

2005-02-04 13:17:55 UTC

My question is the same as Henrik's. I have a bunch of email that is spam
(either tagged by SpamAssassin or not tagged at all). I forwarded it as
an attachment to a "spam" mailbox. What do I have to do now before I
can get Bayes to learn the messages? I read that you have to remove the
headers. Could anyone give me a little more detail?

Thanks,
Peter

Kevin Sullivan

2005-02-04 14:41:05 UTC

There's no 100% good way to do this; it depends on how the message was
mangled by the client (and possibly server). The only guaranteed way is
(as I described) to save a copy at the same point as it is inspected by
SpamAssassin so you can use it later.

That being said, forwarding a message as an attachment will usually
preserve the headers pretty well. The Perl MailTools and MIME-tools
modules have procedures to pull out attachments and save them in the Unix
format which sa-learn wants.

Sorry I don't have any ready-made scripts for this; my users dump messages
into shared IMAP mailboxes which don't need any preprocessing before being
fed to sa-learn.
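
For illustration, here is a small Python sketch of the "save them in the Unix format" step, on the assumption that this means an mbox file (which sa-learn can read with its --mbox option). It uses Python's standard mailbox module rather than the Perl modules named above, and `append_to_mbox` is a hypothetical helper name.

```python
import mailbox
from typing import Iterable


def append_to_mbox(path: str, raw_messages: Iterable[bytes]) -> None:
    """Append already-extracted messages to an mbox file."""
    box = mailbox.mbox(path)  # the file is created if it does not exist
    try:
        box.lock()
        for raw in raw_messages:
            # add() writes the "From " separator line and escapes body
            # lines that begin with "From ".
            box.add(raw)
        box.flush()
    finally:
        box.close()  # also releases the lock if it was taken
```

The resulting file could then be learned with something like `sa-learn --spam --mbox /path/to/file`, run as whichever user owns the Bayes database.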

-Kevin

Sander Holthaus - Orange XL

2005-02-04 15:08:53 UTC

Basically, I've got two options. All mail that is received is backed up on
the mail server before any headers are added. I could match those copies
with mail received in the spam-learn and ham-learn accounts. However, mail is
backed up only for a limited amount of time before being moved, after which
the mail server no longer has access to it. So unless people report mail
that found its way through the filters on a very regular basis, it won't be
a foolproof solution.

The other option sounds more viable. I would only need to strip off the
X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my
setup for Bayes anyhow), BUT I have no guarantee that the message is in its
original format. Some MIME boundary rewriting may be done by the mail server
(where necessary), as is converting 8bit to 7bit where possible. And I think
that there are many client-side mail-filtering engines, spam scanners and
virus scanners out there that may do some rewriting as well.
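
A minimal Python sketch of that header stripping, covering only the three header names mentioned above. `strip_scanner_headers` is a hypothetical helper, and passing the message through Python's email package is itself a re-serialization, so the rest of the message is not guaranteed to come out byte-identical.

```python
from email import message_from_bytes

# Header names taken from the setup described above.
STRIP_EXACT = {"x-scanned-by", "x-sanitized"}
STRIP_PREFIX = "x-spam-"


def strip_scanner_headers(raw: bytes) -> bytes:
    """Drop scanner-added headers and return the message as bytes."""
    msg = message_from_bytes(raw)
    for name in list(msg.keys()):
        lowered = name.lower()
        if lowered in STRIP_EXACT or lowered.startswith(STRIP_PREFIX):
            del msg[name]  # removes every occurrence of that header
    return msg.as_bytes()
```

Headers added by other software in the path (Received lines, client-side scanners) are deliberately left alone here, since there is no reliable way to tell which of them were present when the message was first scanned.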

Given all this, I'm not sure that training SpamAssassin on forwarded
messages that may or may not be in the format in which SpamAssassin
first received them is a good idea. But I don't have enough
knowledge of SpamAssassin's internal workings and its Bayes filter to be
sure...

Kind Regards,
Sander Holthaus

Stuart Johnston

2005-02-04 16:20:24 UTC

Peter Marshall wrote:
> My question is the same as Henrik's. I have a bunch of email that is spam
> (either tagged by SpamAssassin or not tagged at all). I forwarded it as
> an attachment to a "spam" mailbox. What do I have to do now before I
> can get Bayes to learn the messages? I read that you have to remove the
> headers. Could anyone give me a little more detail?

I use a modified version of the DMZS-sa-learn.pl from:
http://www.dmzs.com/tools/files/spam.phtml

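When someone forwards a spam to me, I move the message to a special IMAP
folder that gets processed by the script. My additions look something like:

    use Email::MIME;
    ...
    # Parse the forwarded message that wraps the original spam.
    my $msg = Email::MIME->new($raw_message_body);

    # Top-level MIME parts of the forwarded message.
    my @parts = $msg->parts;

    foreach my $part (@parts) {
        # Each message/rfc822 part is one attached original message.
        if ($part->content_type =~ m|message/rfc822|) {
            # sa_learn() is defined elsewhere in the script.
            sa_learn($part->body_raw);
        }
    }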

I've tested this with messages forwarded as attachments from Outlook and
Thunderbird. I'm not sure how effective it is, though. I'm sure that it
still loses something in the translation. All-IMAP is really the way
to go if you can.

Stuart Johnston

Sander Holthaus - Orange XL

2005-02-04 18:07:32 UTC

Would it be an idea to strip the Delivered-To header from the message, since
it carries no information for distinguishing between ham and spam?

Also, I was wondering whether anybody who is using spam-learn and ham-learn
addresses has any protection built in to stop non-system users from mailing
to those addresses?

Kind Regards,
Sander Holthaus

Peter Marshall

2005-02-04 18:19:14 UTC

But I have no IMAP, only POP. Users would forward mail (as an attachment) to
a mailbox, and then I have to run sa-learn ... I assume as root?

Will the code you posted work for this setup as well?

Would there be big problems with just running it on the messages forwarded
as attachments?

Can users also forward as an attachment mail that was
already marked as spam? Is there any advantage to this?

Thanks,
Peter

Stuart Johnston

2005-02-04 18:34:54 UTC

Peter Marshall wrote:
> But I have no IMAP, only POP. Users would forward mail (as an attachment) to
> a mailbox, and then I have to run sa-learn ... I assume as root?
>
> Will the code you posted work for this setup as well?
>
> Would there be big problems with just running it on the messages forwarded
> as attachments?

The code I posted only shows how you can extract the attached spam from
the email. You'll need to write your own code to integrate it into your
particular setup.

BTW, in Outlook, you can easily attach multiple spams to one message and
this code should handle it.

>
> Can users also forward as an attachment mail that was
> already marked as spam? Is there any advantage to this?

If you use Bayes auto learn, I suspect that this wouldn't do much.
Otherwise, it might help.

Stuart Johnston

Sander Holthaus - Orange XL

2005-02-04 18:47:40 UTC

Stuart Johnston wrote:
> The code I posted only shows how you can extract the attached
> spam from the email. You'll need to write your own code to
> integrate it into your particular setup.
>
> BTW, in Outlook, you can easily attach multiple spams to one
> message and this code should handle it.

CTRL-a, right click, "Forward Items" will indeed do the trick.

> >
> > Can users also forward as an attachment mail that was
> > already marked as spam? Is there any advantage to this?
>
> If you use Bayes auto learn, I suspect that this wouldn't do much.
> Otherwise, it might help.

I would check the headers of the forwarded messages to see if their
spam score is above your auto-learning threshold. If it is, relearning is
probably pointless. You might also wonder why they received the message at
all (I would think that something that is good enough to autolearn is good
enough to refuse or discard).
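
A hedged Python sketch of that check. It assumes the forwarded original still carries an X-Spam-Status header in the usual SpamAssassin layout (a `score=` or, in older releases, `hits=` field, plus an `autolearn=` field). The default of 12.0 mirrors the stock bayes_auto_learn_threshold_spam setting but should be replaced with your own value; `spam_status` and `worth_relearning` are hypothetical names. Note that auto-learning is decided on a score computed somewhat differently from the one in the header, so the `autolearn=` field is the more direct signal.

```python
import re
from email import message_from_bytes
from typing import Optional, Tuple

_SCORE = re.compile(r"\b(?:score|hits)=(-?\d+(?:\.\d+)?)")
_AUTOLEARN = re.compile(r"\bautolearn=(\w+)")


def spam_status(raw: bytes) -> Tuple[Optional[float], Optional[str]]:
    """Return (score, autolearn value) from X-Spam-Status, if present."""
    header = message_from_bytes(raw).get("X-Spam-Status")
    if header is None:
        return None, None
    text = " ".join(str(header).split())  # unfold continuation lines
    score = _SCORE.search(text)
    learned = _AUTOLEARN.search(text)
    return (float(score.group(1)) if score else None,
            learned.group(1) if learned else None)


def worth_relearning(raw: bytes, spam_threshold: float = 12.0) -> bool:
    """Guess whether learning this reported spam would add anything."""
    score, learned = spam_status(raw)
    if learned == "spam":
        return False  # already auto-learned as spam
    return score is None or score < spam_threshold
```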

Kind Regards,
Sander Holthaus

Joe Polk

2005-02-04 19:36:10 UTC

First, I had understood that Bayes can learn previously tagged emails without
the SpamAssassin tags being stripped. Has this changed?

Second, all of my users use a webmail client, though they can use OE if they
wish. It is probably best for them to use IMAP so that server-side scanning
can be set up more easily. I currently have two scripts that run nightly. The
first takes everything in the user's /home/user/mail/Spam folder, learns it
as spam, then empties the folder. The second does the same for Ham, but moves
that mail to a Cleaned folder. All the user has to do is move untagged spam
into Spam and false positives into Ham.
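
The two nightly jobs could look roughly like the following Python sketch. It is not the actual scripts; it assumes the Spam, Ham and Cleaned folders are single mbox files under the user's mail directory, and it does no locking against the IMAP server, which a real job would need. `run_nightly` and its `runner` parameter are hypothetical; the sa-learn options used (--spam, --ham, --mbox) are real ones.

```python
import subprocess
from pathlib import Path


def run_nightly(mail_dir: Path, runner=subprocess.run) -> None:
    """Learn the Spam and Ham folders, then empty them."""
    spam = mail_dir / "Spam"
    ham = mail_dir / "Ham"
    cleaned = mail_dir / "Cleaned"

    if spam.exists():
        runner(["sa-learn", "--spam", "--mbox", str(spam)], check=True)
        spam.write_bytes(b"")  # learned spam is simply discarded

    if ham.exists():
        runner(["sa-learn", "--ham", "--mbox", str(ham)], check=True)
        # Keep the false positives: append them to Cleaned, then empty Ham.
        with cleaned.open("ab") as out:
            out.write(ham.read_bytes())
        ham.write_bytes(b"")
```

Because `check=True` raises if sa-learn fails, a folder is only emptied after its contents were learned. Called from cron once per user, e.g. `run_nightly(Path("/home/user/mail"))`.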

--
<<JAV>>

Kevin Sullivan

2005-02-04 15:23:33 UTC

--On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
> Basically, I've got two options. All mail that is received is backed up on
> the mail server before any headers are added. I could match those copies
> with mail received in the spam-learn and ham-learn accounts. However, mail
> is backed up only for a limited amount of time before being moved, after
> which the mail server no longer has access to it. So unless people
> report mail that found its way through the filters on a very regular
> basis, it won't be a foolproof solution.

You don't really need a 100% solution; something which works 80% of the
time would probably be fine. But you may not want to do the programming
needed to automate this.

> The other option sounds more viable. I would only need to strip off the
> X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored in my
> setup for Bayes anyhow), BUT I have no guarantee that the message is in
> its original format. Some MIME boundary rewriting may be done by the
> mail server (where necessary), as is converting 8bit to 7bit where
> possible. And I think that there are many client-side mail-filtering
> engines, spam scanners and virus scanners out there that may do some
> rewriting as well.

You'll probably find that the various changes don't affect Bayes that much.
When a rewritten message is learned, you may make Bayes miss email which
(in an ideal world) it would have caught, but I think it will tend to
score such messages around 50% ("I don't know if this is ham or spam") rather
than classifying them incorrectly. And there should be enough unchanged
tokens in the messages to let Bayes work anyway.

So I say strip off what you can but don't obsess about the rest. Feed it
into bayes and see how it works, and only try to fix it if you see bayes
misclassifying email.

-Kevin

Sander Holthaus - Orange XL

2005-02-04 16:19:52 UTC

> --On 02/04/05 16:08:53 +0100 Sander Holthaus - Orange XL wrote:
> > Basically, I've got two options. All mail that is received is backed up
> > on the mail server before any headers are added. I could match those
> > copies with mail received in the spam-learn and ham-learn accounts.
> > However, mail is backed up only for a limited amount of time before
> > being moved, after which the mail server no longer has access to it. So
> > unless people report mail that found its way through the filters on a
> > very regular basis, it won't be a foolproof solution.
>
> You don't really need a 100% solution; something which works
> 80% of the time would probably be fine. But you may not want
> to do the programming needed to automate this.

I don't have the time for it yet, but I should be able to make something in
Perl. Personally, I'm no big fan of the 80% rule in programming, as that last
undone 20% usually forms 80% of my problems :-)

> > The other option sounds more viable. I would only need to strip off
> > the X-Scanned-By, X-Spam-* and X-Sanitized headers (which are ignored
> > in my setup for Bayes anyhow), BUT I have no guarantee that the
> > message is in its original format. Some MIME boundary rewriting may be
> > done by the mail server (where necessary), as is converting 8bit to
> > 7bit where possible. And I think that there are many client-side
> > mail-filtering engines, spam scanners and virus scanners out there that
> > may do some rewriting as well.
>
> You'll probably find that the various changes don't affect
> Bayes that much.
> When a rewritten message is learned, you may make Bayes miss
> email which (in an ideal world) it would have caught, but I
> think it will tend to score such messages around 50% ("I don't
> know if this is ham or spam") rather than classifying them
> incorrectly. And there should be enough unchanged tokens in
> the messages to let Bayes work anyway.
>
> So I say strip off what you can but don't obsess about the
> rest. Feed it into Bayes and see how it works, and only try
> to fix it if you see Bayes misclassifying email.

I'm not sure I know of a good way to check whether Bayes is
misclassifying, but I should be able to get some of that information from the
log files. Perhaps throwing away mail that has been rewritten or reformatted
would be a solution, though I don't know if such mail can be recognized easily.
We'll see :-)

Thanks for all the help and suggestions!

Kind Regards,
Sander Holthaus