spamc -L apparently not working properly


Sergio Durigan Junior

11 years ago

Hey there,

I am using Debian Wheezy here (therefore, Exim + Dovecot for e-mail),
and I am still deciding how to run SpamAssassin. I am divided between
running it by directly calling spamassassin, or by running spamd and
calling spamc. Both methods are going to be used via my .procmailrc.

Well, but so far I have been testing spamd + spamc because it is the
Debian recommended way. I still haven't enabled it via .procmailrc, and
just did tests by calling spamc via CLI. However, I am seeing a strange
behavior when I try to feed spamd with a false-negative message. Here's
what I am doing:

#> spamc -c < spam.file
0.0/5.0
#> spamc -L spam < spam.file
(successful message saying that the spam was learned)
#> spamc -c < spam.file
0.0/5.0

I have already updated my Bayesian database, restarted the spamd
service, etc. I was expecting that I'd get a high rate after feeding
the spam to SpamAssassin, but that's not happening. Any suggestions?

I am running spamd with the following options:

--create-prefs --max-children 5 --helper-home-dir --allow-tell

And the version I am using is:

SpamAssassin version 3.3.2
running on Perl version 5.14.2

Comments and suggestions are appreciated. Thanks!

--
Sergio

John Hardin

11 years ago

Post by Sergio Durigan Junior
I am using Debian Wheezy here (therefore, Exim + Dovecot for e-mail),
and I am still deciding how to run SpamAssassin. I am divided between
running it by directly calling spamassassin, or by running spamd and
calling spamc. Both methods are going to be used via my .procmailrc.

Not directly addressing your other questions but: running spamassassin
directly is only really suitable for *very* low-traffic environments, as
that will parse and compile all of the rules and other config *per
message*, which is a lot of overhead. spamc+spamd is strongly recommended
for production use.
--
John Hardin KA7OHZ http://www.impsec.org/~jhardin/
***@impsec.org FALaholic #11174 pgpk -a ***@impsec.org
key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
The more you believe you can create heaven on earth the more
likely you are to set up guillotines in the public square to
hasten the process. -- James Lileks
-----------------------------------------------------------------------
3 days until Veterans Day

Sergio Durigan Junior

11 years ago

Post by John Hardin
Not directly addressing your other questions but: running spamassassin
directly is only really suitable for *very* low-traffic environments,
as that will parse and compile all of the rules and other config *per
message*, which is a lot of overhead. spamc+spamd is strongly
recommended for production use.

Thanks a lot for the input, John. I guess I will end up using spamd and
spamc, after all. I'll just wait for the answer to my question, and
then I'll set everything up here.

Regards,

--
Sergio

John Hardin

11 years ago

Post by Sergio Durigan Junior
#> spamc -c < spam.file
0.0/5.0
#> spamc -L spam < spam.file
(successful message saying that the spam was learned)
#> spamc -c < spam.file
0.0/5.0
I have already updated my Bayesian database, restarted the spamd
service, etc. I was expecting that I'd get a high rate after feeding
the spam to SpamAssassin, but that's not happening. Any suggestions?

Try using sa-learn to train Bayes.

The big thing to keep in mind is that the user running the training needs
to be the same user that spamd is running as; if not, depending on your
bayes database config, you may be training a different Bayes database than
the one spamd is reading.


Sergio Durigan Junior

11 years ago

Post by John Hardin

Post by Sergio Durigan Junior
#> spamc -c < spam.file
0.0/5.0
#> spamc -L spam < spam.file
(successful message saying that the spam was learned)
#> spamc -c < spam.file
0.0/5.0
I have already updated my Bayesian database, restarted the spamd
service, etc. I was expecting that I'd get a high rate after feeding
the spam to SpamAssassin, but that's not happening. Any suggestions?

Try using sa-learn to train Bayes.

I don't think sa-learn can help with spamd. Its own manpage mentions
that, for spamd users, "spamc -L" is the way to go.

Post by John Hardin
The big thing to keep in mind is that the user running the training
needs to be the same user that spamd is running as; if not, depending
on your bayes database config, you may be training a different Bayes
database than the one spamd is reading.

Hm, really? I thought spamd kept a global Bayes database, and that
everyone calling "spamc -L" would end up feeding this database, and not
some local one.

--
Sergio

Amir 'CG' Caspi

11 years ago

Post by Sergio Durigan Junior
I don't think sa-learn can help with spamd. Its own manpage mentions
that, for spamd users, "spamc -L" is the way to go.
Hm, really? I thought spamd kept a global Bayes database, and that
everyone calling "spamc -L" would end up feeding this database, and not
some local one.

It depends on how spamc is called. If spamd is running as root and spamc
is called with the -u flag, then spamd will su to the named user, and will
then use that user's local database (and local prefs, if allow_user_prefs
is enabled). spamc -L -u would work on the local database; spamc -L
(without -u) would work on the database applicable to the spamd user.

It all depends on whether you want your users to have individual databases
tailored to their own spam/ham, or a global database.

--- Amir

Sergio Durigan Junior

11 years ago

...

My spamd is currently running as root, but I am thinking about changing
it to run using Debian's pre-setup user (debian-spamd). Unless you guys
have better recommendations.

Post by Amir 'CG' Caspi
It all depends on whether you want your users to have individual databases
tailored to their own spam/ham, or a global database.

The problem with having a user-tailored database is that I will have to
run sa-update for every user, right? Currently, Debian provides the
aforementioned spamd user (debian-spamd) and runs sa-update on behalf of
it. Therefore, I believe using a global database is probably better in
this case. What's your opinion?

--
Sergio

Amir 'CG' Caspi

11 years ago

Post by Sergio Durigan Junior
The problem with having a user-tailored database is that I will have to
run sa-update for every user, right?

No, or at least, not that I've seen. If spamd is running as root, it will
load the sa-update rules from the root installation
(/var/lib/spamassassin); it will only su to the user when called by spamc,
and then it will only load that user's local Bayes DB and local rules (if
enabled); it doesn't have to load any of the main rules, which are kept in
memory from when spamd was first initiated (and were loaded from the root
installation). This is also why it's important to restart spamd when
sa-update actually updates rules (the sa-update cron script should do this
for you).

At least, this is how it works on my system, which has a pretty vanilla
install of SA.

Even if your users are running spamassassin rather than spamc, it should be
able to read the rules in the root install location, as long as your users
have read permission. If you're running on a virtual host platform with
multiple chroot environments (e.g. cPanel, Parallels Pro Control Panel,
etc.) then you may need to run sa-update for each environment, but you
should still only need the one root install (and one sa-update command)
for running spamd as root.

Post by Sergio Durigan Junior
What's your opinion?

I would run spamd as root and initiate spamc with the -u option, to allow
each user to have his/her own Bayes DB. However, again, it really depends
on what kind of email system you're running, and how you want to handle
spam. If you're running a corporate server, you might prefer a global DB;
if you're running a server with personal users whose email characteristics
vary widely, you might prefer per-user DBs. For my setup, I prefer
per-user DBs.

--- Amir

Sergio Durigan Junior

11 years ago

...

Thanks for the opinion. I was considering doing that, and your message
was the final word I needed.

Now everything is set up per-user, and I am feeding the Bayes DB with
what I have.

Thanks,

--
Sergio

Karsten Bräckelmann

11 years ago

Post by Sergio Durigan Junior

Post by Amir 'CG' Caspi
I would run spamd as root and initiate spamc with the -u option, to allow
each user to have his/her own Bayes DB. However, again, it really depends
on what kind of email system you're running, and how you want to handle
spam. If you're running a corporate server, you might prefer a global DB;
if you're running a server with personal users whose email characteristics
vary widely, you might prefer per-user DBs. For my setup, I prefer
per-user DBs.

You mentioned using SA from procmail, so there usually is no need for
the -u user option (see that other sub-thread about this option).

Running the spamd daemon as root and calling spamc as the receiving user
is an easy way to get per-user Bayes databases. Keep in mind though,
this requires Bayes training per user, and every user needs its own
$HOME or related options.

Post by Sergio Durigan Junior
Thanks for the opinion. I was considering doing that, and your message
was the final word I needed.
Now everything is set up per-user, and I am feeding the Bayes DB with
what I have.

What I wrote above was partially triggered by this. Not "the Bayes DB",
which sounds like a single one to me, but one Bayes db per user. Which
requires initial training per user.

--
char *t="\10pse\0r\0dtu\***@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Karsten Bräckelmann

11 years ago

Post by Amir 'CG' Caspi

Post by Sergio Durigan Junior
I don't think sa-learn can help with spamd. Its own manpage mentions
that, for spamd users, "spamc -L" is the way to go.

Fundamentally, there is no difference between sa-learn and spamc -L.

...

The latter is incorrect -- spamc by default sends the effective user ID,
and spamd switches users before processing the mail (assuming the daemon
has been started as root). The -u user option is only necessary to
change that default.


Amir 'CG' Caspi

11 years ago

Post by Karsten Bräckelmann
The latter is incorrect -- spamc by default sends the effective user ID,
and spamd switches users before processing the mail (assuming the daemon
has been started as root). The -u user option is only necessary to
change that default.

Whoops, you're perfectly right. On a system where spamc is run as some
fixed user (e.g. nobody), you need the -u option to get the per-user
options to work correctly. If spamc is being run as the receiving user
already (e.g. via procmail, barring some weird setuid behavior) then you
don't need the -u option (although it won't break anything if you use it,
it's just unnecessary).

Sorry for the incomplete info.

--- Amir

John Hardin

11 years ago

...

Not true. sa-learn is just fine for spamd with a global Bayes database,
and it's recommended for administrative simplicity if you have that
environment.

Post by Sergio Durigan Junior

Post by John Hardin
The big thing to keep in mind is that the user running the training
needs to be the same user that spamd is running as; if not, depending
on your bayes database config, you may be training a different Bayes
database than the one spamd is reading.

Hm, really? I thought spamd kept a global Bayes database, and that
everyone calling "spamc -L" would end up feeding this database, and not
some local one.

Global vs. per-user Bayes databases is a site-specific config. However, it
should be consistent - spamd should be reading from and training to the
bayes database of the user running spamc, so off the top of my head I
don't know why it doesn't appear to be working for you.

What are the Bayes database statistics before and after running spamc -L?
(sa-learn --dump magic)
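
To compare the statistics before and after training, the two message counters can be pulled out of the dump. This is a minimal sketch, assuming the `sa-learn --dump magic` output keeps the column layout quoted later in this thread (count in the third field, label `nspam` or `nham` last). `bayes_counts` is a hypothetical helper name, not part of SpamAssassin.

```shell
# Hypothetical helper (not part of SpamAssassin): print the nspam and nham
# counters from `sa-learn --dump magic` output read on stdin.
# Assumes the count is the third whitespace-separated field of each line.
bayes_counts() {
    awk '
        $NF == "nspam" { nspam = $3 }
        $NF == "nham"  { nham  = $3 }
        END { printf "nspam=%d nham=%d\n", nspam, nham }
    '
}
```

One way to use it would be to run `sa-learn --dump magic | bayes_counts` once before and once after `spamc -L spam < spam.file`, as the same user whose database spamd reads. If nspam does not move, the training may have gone to a different database.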

I use a global database and sa-learn, so I don't have any direct
experience with spamc -L quirks, sorry. That's why I suggested sa-learn.


Sergio Durigan Junior

11 years ago

Post by John Hardin

Post by Sergio Durigan Junior
I don't think sa-learn can help with spamd. Its own manpage mentions
that, for spamd users, "spamc -L" is the way to go.

Not true. sa-learn is just fine for spamd with a global Bayes
database, and it's recommended for administrative simplicity if you
have that environment.

Aha, interesting, thanks for explaining.

...

Nice, thank you. I am more inclined to use a per-user database, and
call "spamc -u myuser -L spam". Let's see how that goes.

--
Sergio

Karsten Bräckelmann

11 years ago

Post by Sergio Durigan Junior
Nice, thank you. I am more inclined to use a per-user database, and
call "spamc -u myuser -L spam". Let's see how that goes.

The real difference between sa-learn and spamc -L is how to feed it.

The spamc way expects a single message on STDIN, which usually is most
applicable for integration with your MUA. It also easily enables mail
storage and SA to be on different machines.

sa-learn expects the message(s) as file names. It requires direct access to
the mail storage, but enables training of entire mail folders with a
single command.
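
The difference in feeding can be sketched as follows, assuming a directory that holds one message per file. `learn_dir_via_spamc` and the `SPAMC` override variable are hypothetical names introduced only for this illustration.

```shell
# Hypothetical helper: learn every regular file in a directory through spamc,
# one message per invocation, because spamc reads a single message on stdin.
# SPAMC may be overridden (e.g. for a dry run); it defaults to plain "spamc".
learn_dir_via_spamc() {
    class=$1    # "spam" or "ham"
    dir=$2      # directory with one message per file
    for msg in "$dir"/*; do
        [ -f "$msg" ] || continue
        "${SPAMC:-spamc}" -L "$class" < "$msg"
    done
}

# With direct access to the mail storage, sa-learn takes the folder itself:
#   sa-learn --spam /path/to/spam-folder
```

The loop is only needed on the spamc side. It trades one process per message for the ability to train from a machine that cannot read the Bayes database directly.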


Karsten Bräckelmann

11 years ago

Post by Sergio Durigan Junior
#> spamc -c < spam.file
0.0/5.0
#> spamc -L spam < spam.file
(successful message saying that the spam was learned)
#> spamc -c < spam.file
0.0/5.0

You mentioned that's a fresh install, actually not even in production
yet. The Bayes sub-system requires some training (minimum of 200 ham and
spam each) by default, before Bayes rules kick in for scanning.

Instead of -c check only, use the -R option to print the report. You'll
notice there is no BAYES_xx rule (yet).
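
A quick way to check a report for Bayes hits is to search it for rule names of the form BAYES_xx. This is a minimal sketch, assuming a grep that supports `-o` (GNU and BSD grep do). `bayes_rules_hit` is a hypothetical helper name.

```shell
# Hypothetical helper: read a SpamAssassin report on stdin and print every
# BAYES_xx rule name found in it. The exit status is non-zero when no such
# rule appears, which is what an untrained database is expected to produce.
bayes_rules_hit() {
    grep -oE 'BAYES_[0-9]+'
}
```

For example, `spamc -R < spam.file | bayes_rules_hit || echo "no Bayes rule fired"` would report whether Bayes took part in the scan at all.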

Post by Sergio Durigan Junior
I have already updated my Bayesian database, restarted the spamd

I'm curious -- what does updating your Bayes db mean?

Post by Sergio Durigan Junior
service, etc. I was expecting that I'd get a high rate after feeding
the spam to SpamAssassin, but that's not happening. Any suggestions?

In addition to required initial training:

The Bayesian classifier works on a per-token (think: word) basis. Thus,
depending on the tokens in the message and existing ones in the db, the
impact of learning can vary quite a lot -- from hardly noticeable to
clear detection.


Sergio Durigan Junior

11 years ago

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
#> spamc -c < spam.file
0.0/5.0
#> spamc -L spam < spam.file
(successful message saying that the spam was learned)
#> spamc -c < spam.file
0.0/5.0

You mentioned that's a fresh install, actually not even in production
yet. The Bayes sub-system requires some training (minimum of 200 ham and
spam each) by default, before Bayes rules kick in for scanning.
Instead of -c check only, use the -R option to print the report. You'll
notice there is no BAYES_xx rule (yet).

...

Thanks. I had used -R before, without much success. But yeah, I found
some discussions on this list about Bayes databases, and people saying
that at least 200 messages are needed before Bayes can start doing its
job.

BTW, one spam has just sneaked in right now. On the one hand I'm sad
because of those false-negatives, but OTOH I'm happy because I'll be
able to train the database faster :-).

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
I have already updated my Bayesian database, restarted the spamd

I'm curious -- what does updating your Bayes db mean?

Oh, I only meant that I ran "sa-learn" or "spamc -L". Sorry if that is
a wrong nomenclature.

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
service, etc. I was expecting that I'd get a high rate after feeding
the spam to SpamAssassin, but that's not happening. Any suggestions?

The Bayesian classifier works on a per-token (think: word) basis. Thus,
depending on the tokens in the message and existing ones in the db, the
impact of learning can vary quite a lot -- from hardly noticeable to
clear detection.

All right. Since I don't have a good database yet (only 4 or 5 spams
learned), I won't worry about it for now. Let's see when I have a
bigger DB...

Thanks a lot,

--
Sergio

Karsten Bräckelmann

11 years ago

Post by Sergio Durigan Junior

Post by Karsten Bräckelmann
You mentioned that's a fresh install, actually not even in production
yet. The Bayes sub-system requires some training (minimum of 200 ham and
spam each) by default, before Bayes rules kick in for scanning.
Instead of -c check only, use the -R option to print the report. You'll
notice there is no BAYES_xx rule (yet).

Thanks. I had used -R before, without much success. But yeah, I found
some discussions on this list about Bayes databases, and people saying
that at least 200 messages are needed before Bayes can start doing its
job.
BTW, one spam has just sneaked in right now. On the one hand I'm sad
because of those false-negatives, but OTOH I'm happy because I'll be
able to train the database faster :-).

...

You don't have any kind of archive of spam? If you do, train on recent ones,
feel free to exceed the minimum limit, but don't bother too much with
old spam. It changes much faster over time than ham does.

Also, at least until you have reached the minimum required training, do train
with identified spam, too. Same with ham. For now, keep training in a
ratio somewhere between 1:1 and your actual spam-to-ham ratio.

Post by Sergio Durigan Junior

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
service, etc. I was expecting that I'd get a high rate after feeding
the spam to SpamAssassin, but that's not happening. Any suggestions?

The Bayesian classifier works on a per-token (think: word) basis. Thus,
depending on the tokens in the message and existing ones in the db, the
impact of learning can vary quite a lot -- from hardly noticeable to
clear detection.

All right. Since I don't have a good database yet (only 4 or 5 spams
learned), I won't worry about it for now. Let's see when I have a
bigger DB...

...

Do train. Spam, as well as ham. If you got some recent-ish archives.

Post by Sergio Durigan Junior
Thanks a lot,

You're welcome. :)


Sergio Durigan Junior

11 years ago

Post by Karsten Bräckelmann
You don't have any kind of archive of spam? If you do, train on recent ones,
feel free to exceed the minimum limit, but don't bother too much with
old spam. It changes much faster over time than ham does.
Also, at least until you have reached the minimum required training, do train
with identified spam, too. Same with ham. For now, keep training in a
ratio somewhere between 1:1 and your actual spam-to-ham ratio.

[Note: By ham I assume you mean false-positives, and not just regular
e-mail.]

No, (un)fortunately I don't. I've been running this server for 5 months
now, and only received about 10 spams so far. I decided to start
running SA now because I've received 5 spams in the last 3 days, which
triggered my internal alarm.

Post by Karsten Bräckelmann
Do train. Spam, as well as ham. If you got some recent-ish archives.

Will do. However, I don't have false-positives (ham) to train. As I
said above, I only have about 10 spam messages, which I already used to
train Bayes. Not sure if it is possible/would be good to search for
recent spam archives on the net. I believe not...

--
Sergio

John Hardin

11 years ago

Post by Sergio Durigan Junior
[Note: By ham I assume you mean false-positives, and not just regular
e-mail.]

No. Train with correctly classified ham as well.

David B Funk

11 years ago

...

For Bayes to work it needs at least 200 examples of Ham (e-mail that
you want) and 200 examples of Spam (e-mail that you don't want).
It doesn't matter whether the messages were classified correctly or
incorrectly by the rules-based SA engine, only what you consider
Ham/Spam (i.e., correctly classified by you).
In essence you are "teaching" the Bayes system how to recognize
your preferences in e-mail classifying.

So the messages you've kept in your INBOX should be good for Ham.

--
Dave Funk University of Iowa
<dbfunk (at) engineering.uiowa.edu> College of Engineering
319/335-5751 FAX: 319/384-0549 1256 Seamans Center
Sys_admin/Postmaster/cell_admin Iowa City, IA 52242-1527
#include <std_disclaimer.h>
Better is not better, 'standard' is better. B{

Karsten Bräckelmann

11 years ago

...

You're assuming wrong.

Ham is good mail, messages you want (or actually subscribed to),
messages sent to you with your consent. Spam is junk, unsolicited mail
sent to you without your consent. Regardless of SA classification or
score.

False positives and negatives are messages mis-classified by SA.


Sergio Durigan Junior

11 years ago

...

Nice, thanks both of you for the answers.

I am now feeding SA with ham from my INBOX, while I also feed it with
false-negatives (interestingly, I am receiving now *much* more spam than
I was a week ago...).

So, I now have yet another question. I let auto_learn active for SA,
and now for every false-negative SA will learn that it is not spam,
although it is. I'm now thinking that maybe auto_learn is not a good
idea, at least until I have a good enough Bayes database (strangely, SA
did not catch *any* spam in the last 48 hours...). Can you confirm
this?

Thanks a lot, and sorry if I'm asking too much :-).

--
Sergio

Karsten Bräckelmann

11 years ago

Post by Sergio Durigan Junior
Nice, thanks both of you for the answers.
I am now feeding SA with ham from my INBOX, while I also feed it with
false-negatives (interestingly, I am receiving now *much* more spam than
I was a week ago...).

Given what you stated about your spam volume before, entirely possible.
However, you're not using a catch-all, are you?

Post by Sergio Durigan Junior
So, I now have yet another question. I let auto_learn active for SA,
and now for every false-negative SA will learn that it is not spam,

No. False negative (not classified spam, although it is) is NOT what
triggers auto-learn ham.

Post by Sergio Durigan Junior
although it is. I'm now thinking that maybe auto_learn is not a good
idea, at least until I have a good enough Bayes database (strangely, SA
did not catch *any* spam in the last 48 hours...). Can you confirm
this?
Thanks a lot, and sorry if I'm asking too much :-).

Just leave auto-learn enabled. And, yet again, do train both ham and
spam (all, not only mis-classified messages) for initial training.

Auto-learning in SA Bayes is much more than a pure feedback loop, as you
described. A message just being classified ham (< 5.0) is NOT learned as
ham. Neither are messages scored spam (>= 5.0) learned as spam.

(1) The thresholds for auto-learning are 0.1 and 12.0 by default. Not
the required_score threshold of 5.0 default.
(2) Certain rules are not considered for auto-learning, to prevent self-
feeding.
(3) A minimum of header and body rules are required, to prevent biasing.

See M::SA::Plugin::AutoLearnThreshold docs for more details.
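
Point (1) can be illustrated with a deliberately simplified sketch. It only compares a score against the two default thresholds (the `bayes_auto_learn_threshold_nonspam` and `bayes_auto_learn_threshold_spam` settings) and ignores points (2) and (3), so it is not how SA itself decides. `autolearn_candidate`, `NONSPAM_T` and `SPAM_T` are hypothetical names used only for this illustration.

```shell
# Simplified illustration only: classify a score against the default
# auto-learn thresholds (0.1 and 12.0). The real plugin also excludes
# certain rules and requires minimum header and body points.
# NONSPAM_T and SPAM_T are illustrative variable names, not SA settings.
autolearn_candidate() {
    score=$1
    awk -v s="$score" -v lo="${NONSPAM_T:-0.1}" -v hi="${SPAM_T:-12.0}" 'BEGIN {
        if (s + 0 < lo + 0)      print "ham"
        else if (s + 0 > hi + 0) print "spam"
        else                     print "no"
    }'
}
```

A message scoring 4.9 is ham by the required_score of 5.0, yet `autolearn_candidate 4.9` prints `no`. This is why a missed spam with a modest score is usually not auto-learned as ham.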

The autolearn field near the end of the X-Spam-Status header tells you
whether SA auto-learned the message. Unsurprisingly, it looks like
autolearn=(ham|spam|no|unavailable)

In your case, I'd say just let SA do its job. Monitor the results, and
train both ham and spam, at the very least until BAYES_xx rules show up
in X-Spam-Status headers.

Keep training Bayes after that, to improve performance. Definitely do
train on false positives and negatives.

Wait, observe, and learn how to read X-Spam headers. :)
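
Reading the autolearn field can be scripted. This is a minimal sketch, assuming the message carries an X-Spam-Status header that may be folded across several lines. `autolearn_status` is a hypothetical helper name.

```shell
# Hypothetical helper: read a message (or just its headers) on stdin and
# print the value of autolearn= from the X-Spam-Status header.
autolearn_status() {
    awk '
        /^$/ { exit }                                    # end of headers
        /^X-Spam-Status:/ { in_hdr = 1; hdr = $0; next }
        in_hdr && /^[ \t]/ { hdr = hdr " " $0; next }    # folded continuation
        in_hdr { in_hdr = 0 }
        END {
            if (match(hdr, /autolearn=[a-z]+/))
                print substr(hdr, RSTART + 10, RLENGTH - 10)
        }
    '
}
```

Running `autolearn_status < message.file` on a few missed spams shows quickly whether any of them were auto-learned as ham.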


Sergio Durigan Junior

11 years ago

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
Nice, thanks both of you for the answers.
I am now feeding SA with ham from my INBOX, while I also feed it with
false-negatives (interestingly, I am receiving now *much* more spam than
I was a week ago...).

Given what you stated about your spam volume before, entirely possible.
However, you're not using a catch-all, are you?

No, I'm not.

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
So, I now have yet another question. I let auto_learn active for SA,
and now for every false-negative SA will learn that it is not spam,

No. False negative (not classified spam, although it is) is NOT what
triggers auto-learn ham.

All right, I misunderstood things then. I assumed that because of
sa-learn --dump magic output:

...
0.000 0 37 0 non-token data: nham
...

And this number increases every time I receive a message (whether it is
a false-negative or a true-negative). Since I have too little spam to
train, it is hard to keep up with the number of ham received.

But I will read the docs and learn how this works.

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
although it is. I'm now thinking that maybe auto_learn is not a good
idea, at least until I have a good enough Bayes database (strangely, SA
did not catch *any* spam in the last 48 hours...). Can you confirm
this?
Thanks a lot, and sorry if I'm asking too much :-).

Just leave auto-learn enabled. And, yet again, do train both ham and
spam (all, not only mis-classified messages) for initial training.

I am already doing that, thanks for the advice.

Post by Karsten Bräckelmann
Auto-learning in SA Bayes is much more than a pure feedback loop, as you
described. A message just being classified ham (< 5.0) is NOT learned as
ham. Neither are messages scored spam (>= 5.0) learned as spam.
(1) The thresholds for auto-learning are 0.1 and 12.0 by default. Not
the required_score threshold of 5.0 default.
(2) Certain rules are not considered for auto-learning, to prevent self-
feeding.
(3) A minimum of header and body rules are required, to prevent biasing.
See M::SA::Plugin::AutoLearnThreshold docs for more details.
The autolearn field near the end of the X-Spam-Status header tells you
whether SA auto-learned the message. Unsurprisingly, it looks like
autolearn=(ham|spam|no|unavailable)

...

Great, thanks a lot for the pointers and the explanation.

Post by Karsten Bräckelmann
In your case, I'd say just let SA do its job. Monitor the results, and
train both ham and spam, at the very least until BAYES_xx rules show up
in X-Spam-Status headers.
Keep training Bayes after that, to improve performance. Definitely do
train on false positives and negatives.
Wait, observe, and learn how to read X-Spam headers. :)

Nice, I will keep monitoring everything the way I'm doing. And I will
definitely read more about the headers and SA in general.

Thanks a lot for the replies and the patience. It's been very
educational :-).

--
Sergio

Karsten Bräckelmann

11 years ago

...

nham is the "Number of HAM" learned, in messages. Same for nspam. Keep
training until both are at least 200 -- accuracy should improve
dramatically after that.

Keep an eye on the X-Spam-Status header, autolearn bit.

If that happens frequently for FNs, there's a problem somewhere. We'd
need the X-Spam headers and preferably the full, raw message put up on a
pastebin for debugging. After some initial training.

There's one thing worrying in your comment: "whether false-negative or
true-negative". You DO have spam also, right? I mean, classified spam is
not just silently discarded without you ever seeing it? That would be
really bad at this stage. Take it, verify it, learn it.


Sergio Durigan Junior

11 years ago

Post by Karsten Bräckelmann
nham is the "Number of HAM" learned, in messages. Same for nspam. Keep
training until both are at least 200 -- accuracy should improve
dramatically after that.

I figured that out.

Post by Karsten Bräckelmann
Keep an eye on the X-Spam-Status header, autolearn bit.
If that happens frequently for FNs, there's a problem somewhere. We'd
need the X-Spam headers and preferably the full, raw message put up on a
pastebin for debugging. After some initial training.

For all messages that I received since I started using SA (about 20
messages, of which 5 were false-negatives, and the rest were
true-negatives), autolearn seems to be working OK, i.e., when messages
score below the threshold, autolearn works, and when messages score
above the threshold, I see "autolearn=no".

Post by Karsten Bräckelmann
There's one thing worrying in your comment: "whether false-negative or
true-negative". You DO have spam also, right? I mean, classified spam is
not just silently discarded without you ever seeing it? That would be
really bad at this stage. Take it, verify it, learn it.

I do receive spam. About 1 or 2 per day. But so far SA hasn't been
able to catch any of them, and all spam I receive has been marked as ham
so far. The message headers are OK, there is nothing apparently wrong
with SA, but it is just not catching most of my spam. I assume this is
normal behavior since I just started using SA a few days ago.

For every spam message that I received, I analyze its headers, verify
that everything is OK with SA, and then feed it to sa-learn.

--
Sergio

Karsten Bräckelmann

11 years ago

Post by Sergio Durigan Junior
For all messages that I received since I started using SA (about 20
messages, of which 5 were false-negatives, and the rest were
true-negatives), [...]

Given you state below no spam has been identified yet, you're confusing
terms.

SA tests for spam. Thus a positive result is "classified spam", and "not
spam" is a negative test result. True means the result is correct,
whereas false indicates a mis-classification by the test.

False (mis-classified) negatives (rated not-spam) are spam that SA
failed to classify as spam.

If you prefer, refer to them as missed spam, or (in)correctly classified
ham and spam.

Post by Sergio Durigan Junior
I do receive spam. About 1 or 2 per day. But so far SA hasn't been
able to catch any of them, and all spam I receive has been marked as ham
so far. The message headers are OK, there is nothing apparently wrong
with SA, but it is just not catching most of my spam. I assume this is
normal behavior since I just started using SA a few days ago.

No, that is not normal. In fact, since no spam has been identified at
all yet, there is something really broken or mis-configured.

I suggest starting a new thread (not a reply) about this. For starters,
we'd need details about your environment and how you set up SA. Plus
some X-Spam-Status headers of ham and (missed) spam.

--
char *t="\10pse\0r\0dtu\***@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Sergio Durigan Junior

11 years ago

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
For all messages that I received since I started using SA (about 20
messages, of which 5 were false-negatives, and the rest were
true-negatives), [...]

Given you state below no spam has been identified yet, you're confusing
terms.
SA tests for spam. Thus a positive result is "classified spam", and "not
spam" is a negative test result. True means the result is correct,
whereas false indicates a mis-classification by the test.
False (mis-classified) negatives (rated not-spam) are spam that SA
failed to classify as spam.

...

I don't think I am confusing terms.

false-negative: spam that got classified as ham
false-positive: ham that got classified as spam
true-negative: ham
true-positive: spam
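
The four terms above can be written down as a small function. This is only an illustrative sketch: the function name and the sample mailbox are made up for the example, with the counts taken from the figures mentioned earlier in the thread (about 20 messages, 5 of them missed spam, none flagged).

```python
def classify_outcome(is_spam: bool, flagged_as_spam: bool) -> str:
    """Name the outcome of a spam test, where "positive" means "flagged as spam"."""
    # "true"/"false" says whether the filter's verdict matches reality.
    correctness = "true" if is_spam == flagged_as_spam else "false"
    # "positive"/"negative" is the filter's verdict itself.
    result = "positive" if flagged_as_spam else "negative"
    return f"{correctness}-{result}"


# Hypothetical mailbox: (actually spam?, flagged by the filter?)
# 5 spam messages that were missed, 15 ham messages correctly passed.
messages = [(True, False)] * 5 + [(False, False)] * 15

counts = {}
for is_spam, flagged in messages:
    label = classify_outcome(is_spam, flagged)
    counts[label] = counts.get(label, 0) + 1

print(counts)  # {'false-negative': 5, 'true-negative': 15}
```

With no true-positives at all, every spam message in this sample is a false-negative, which is the situation described in this thread.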

Maybe my terms aren't the correct ones, and if that's the case, sorry
about it.

Post by Karsten Bräckelmann
If you prefer, refer to them as missed spam, or (in)correctly classified
ham and spam.

OK, I will make use of those terms if it makes things clearer for you.

Post by Karsten Bräckelmann

Post by Sergio Durigan Junior
I do receive spam. About 1 or 2 per day. But so far SA hasn't been
able to catch any of them, and all spam I receive has been marked as ham
so far. The message headers are OK, there is nothing apparently wrong
with SA, but it is just not catching most of my spam. I assume this is
normal behavior since I just started using SA a few days ago.

No, that is not normal. In fact, since no spam has been identified at
all yet, there is something really broken or mis-configured.

...

Indeed, no spam has been classified at all since I started running SA.

An interesting fact is that, before I started using SA, I had some spams
left in my INBOX. Well, when I decided that it was time to use SA, I
manually fed those spams to spamc (for testing purposes), and SA
correctly identified almost all of them! But now, as I said, SA is
failing to classify the spam I've been receiving.

Post by Karsten Bräckelmann
I suggest starting a new thread (not a reply) about this. For starters,
we'd need details about your environment and how you set up SA. Plus
some X-Spam-Status headers of ham and (missed) spam.

OK, fair enough. Unfortunately, I don't have any spam messages left. I
used them all to feed sa-learn, and then deleted them. But as soon as I
get another misclassified spam, I will start another thread on this
topic, with all the information requested (BTW, I am using a default
Debian SA configuration, and did not modify anything so far).

Thanks,

--
Sergio

Karsten Bräckelmann

11 years ago

Post by Sergio Durigan Junior

Post by Karsten Bräckelmann
Given you state below no spam has been identified yet, you're confusing
terms.

Gnah. I was falsely thinking "received" when I wrote "identified" there.

Post by Sergio Durigan Junior
I don't think I am confusing terms.

True, my bad, sorry.

Post by Sergio Durigan Junior

Post by Karsten Bräckelmann
If you prefer, refer to them as missed spam, or (in)correctly classified
ham and spam.

OK, I will make use of those terms if it makes things clearer for you.

That however should not be possible. I guess I am entirely capable of
handling the terms FP and FN... ;)

Post by Sergio Durigan Junior
Indeed, no spam has been classified at all since I started running SA.
An interesting fact is that, before I started using SA, I had some spams
left in my INBOX. Well, when I decided that it was time to use SA, I
manually fed those spams to spamc (for testing purposes), and SA
correctly identified almost all of them! But now, as I said, SA is
failing to classify the spam I've been receiving.

'sa-learn --dump magic' still shows less than 200 nham / nspam, right?
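
For anyone wanting to automate that check, here is a rough sketch that pulls nspam and nham out of the dump text and compares them with the 200-message minimum mentioned earlier in the thread. The sample text is made up and only mimics the column layout that 'sa-learn --dump magic' typically prints (count in the third column); the function names are hypothetical, and the pattern may need adjusting if your output differs.

```python
import re

# Illustrative sample only; the numbers are invented, not taken from a real database.
SAMPLE_DUMP = """\
0.000          0          3          0  non-token data: bayes db version
0.000          0         12          0  non-token data: nspam
0.000          0         35          0  non-token data: nham
0.000          0       4210          0  non-token data: ntokens
"""

# Assumed layout: four whitespace-separated columns, then "non-token data: <name>".
LINE_RE = re.compile(
    r"\s*\S+\s+\S+\s+(\d+)\s+\S+\s+non-token data: (nspam|nham)\s*$"
)


def learned_counts(dump_text):
    """Return {'nspam': N, 'nham': M} for the lines found in the dump text."""
    counts = {}
    for line in dump_text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counts[match.group(2)] = int(match.group(1))
    return counts


def bayes_ready(counts, minimum=200):
    """True once both the spam and the ham count reach the minimum."""
    return counts.get("nspam", 0) >= minimum and counts.get("nham", 0) >= minimum


counts = learned_counts(SAMPLE_DUMP)
print(counts, bayes_ready(counts))
```

In practice the text would come from running the command yourself, for example by piping its output into a script like this one.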

Post by Sergio Durigan Junior

Post by Karsten Bräckelmann
I suggest starting a new thread (not a reply) about this. For starters,
we'd need details about your environment and how you set up SA. Plus
some X-Spam-Status headers of ham and (missed) spam.

OK, fair enough. Unfortunately, I don't have any spam messages left. I
used them all to feed sa-learn, and then deleted them. But as soon as I
get another misclassified spam, I will start another thread on this
topic, with all the information requested

Until that issue is resolved, please keep the spam for potential further
post-receiving tests.

Post by Sergio Durigan Junior
(BTW, I am using a default Debian SA configuration, and did not modify
anything so far).

Not strictly SA configuration, but you probably want to change the
following Debian defaults in /etc/default/spamassassin

ENABLED=0
CRON=0

to 1, which enables the spamd daemon system-wide, as well as sa-update.

If you didn't yet run sa-update, do so now. Restart spamd afterward.
FWIW, this counts as "modifying SA config", since it updates the stock
rule-set.

--
char *t="\10pse\0r\0dtu\***@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}

Sergio Durigan Junior

11 years ago

Post by Karsten Bräckelmann
'sa-learn --dump magic' still shows less than 200 nham / nspam, right?

Yes, it does.

Post by Karsten Bräckelmann
Until that issue is resolved, please keep the spam for potential further
post-receiving tests.

Will certainly do.

Post by Karsten Bräckelmann
Not strictly SA configuration, but you probably want to change the
following Debian defaults in /etc/default/spamassassin
ENABLED=0
CRON=0
and enable the spamd daemon system-wide, as well as sa-update.
If you didn't yet run sa-update, do so now. Restart spamd afterward.
FWIW, this counts as "modifying SA config", since it updates the stock
rule-set.

Oh, I did that, yeah. I meant to say that I did not touch any file
under /etc/spamassassin. So my /etc/spamassassin/local.cf, for example,
is exactly what is shipped with Debian.

Thanks,

--
Sergio
