Setting SpamAssassin scores for SURBL lists

Discussion:

Jeff Chan

2004-09-05 10:22:48 UTC

Eric Kolve and I were looking at how to best set the default SpamCopURI
scores for the various SURBL lists and at first we tried looking at the

http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
# The following block of scores were generated using the mass-checking
# scripts, and a perceptron to determine the optimum scores which
# resulted in minimum false positives or negatives. The scores are
# weighted to produce roughly 1 false positive in 2500 non-spam messages
# using the default threshold of 5.0.
score URIBL_AB_SURBL 0 2.007 0 0.417
score URIBL_OB_SURBL 0 1.996 0 3.213
score URIBL_PH_SURBL 0 0.839 0 2.000
score URIBL_SC_SURBL 0 3.897 0 4.263
score URIBL_WS_SURBL 0 0.539 0 1.462

I was trying to figure out what the different score columns meant,

$ perldoc Mail::SpamAssassin::Conf
[...]
If four valid scores are listed, then the score that is used
depends on how SpamAssassin is being used. The first score is used
when both Bayes and network tests are disabled (score set 0). The
second score is used when Bayes is disabled, but network tests are
enabled (score set 1). The third score is used when Bayes is
enabled and network tests are disabled (score set 2). The fourth
score is used when Bayes is enabled and network tests are enabled
(score set 3).
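
To make the four columns concrete, here is a minimal Python sketch of the selection rule the perldoc excerpt describes. The function name `select_score` and its arguments are made up for illustration and are not part of SpamAssassin; the single-score case is included on the understanding that one listed score applies to every score set.

```python
def select_score(scores, bayes, network):
    """Pick the score that applies for a given Bayes/network configuration.

    `scores` holds either one value (used for every score set) or four
    values ordered as score sets 0, 1, 2, 3.
    """
    if len(scores) == 1:
        return scores[0]
    if len(scores) != 4:
        raise ValueError("expected one or four scores")
    # Score set index: bit 0 = network tests enabled, bit 1 = Bayes enabled.
    score_set = (2 if bayes else 0) + (1 if network else 0)
    return scores[score_set]


# The URIBL_SC_SURBL line from 50_scores.cf quoted above.
sc_scores = [0, 3.897, 0, 4.263]
print(select_score(sc_scores, bayes=False, network=True))
print(select_score(sc_scores, bayes=True, network=True))
print(select_score(sc_scores, bayes=True, network=False))
```

The zeros in the first and third columns follow from the SURBL rules being network tests: with network tests disabled they cannot fire, so no score is assigned.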

We wondered if we could somehow use those scores with SpamCopURI
and were unable to come up with a good answer.

Theo suggested looking at Spam versus ham rates as a good way to

OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
  121405    22516    98889  0.185  0.00   0.00  (all messages)
 100.000  18.5462  81.4538  0.185  0.00   0.00  (all messages as %)
  13.453  70.3766   0.4925  0.993  1.00   1.00  SURBL_WS
   3.807  20.3811   0.0334  0.998  0.50   1.00  SURBL_SC
   2.650  14.2565   0.0071  1.000  0.50   1.00  SURBL_AB
   0.019   0.0933   0.0020  0.979  0.50   1.00  SURBL_PH
  12.624  67.6275   0.1001  0.999  0.50   1.00  SURBL_OB

which shows a pretty high FP rate for WS, less for the others.
Do you happen to have access to any more recent corpus check data
like this? Could be useful to have another snapshot for a more
complete picture.
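
For readers unfamiliar with the S/O column: the figures quoted above are consistent with S/O being the spam hit rate divided by the sum of the spam and ham hit rates. The sketch below checks that against three rows of the table. The formula is inferred from the numbers, not taken from the hit-frequencies source, so treat it as an assumption; the function name is illustrative.

```python
def s_over_o(spam_pct, ham_pct):
    """Spam/overall ratio from the SPAM% and HAM% hit rates of one rule."""
    total = spam_pct + ham_pct
    if total == 0:
        return 0.0  # rule never hit; ratio is undefined, report 0
    return spam_pct / total


# (SPAM%, HAM%, S/O as printed) for three rows of the table above.
rows = {
    "SURBL_WS": (70.3766, 0.4925, 0.993),
    "SURBL_SC": (20.3811, 0.0334, 0.998),
    "SURBL_PH": (0.0933, 0.0020, 0.979),
}

for name, (spam_pct, ham_pct, printed) in rows.items():
    computed = round(s_over_o(spam_pct, ham_pct), 3)
    print(name, computed, "matches" if computed == printed else "differs")
```

Note that S/O alone hides the absolute FP rate: WS has an S/O of 0.993 yet still hits about 0.49% of ham, which is why the HAM% column matters when choosing a score.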

high spam + low ham is good from an FP standpoint, but having a "significant"
(for your definition thereof) ham hitrate means the score shouldn't be too

[Theo's wild guess scores for Justin's June data: -- Jeff C.]

WS 1.2
SC 2.5
AB 3.5
OB 1.8

      OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
        416072   365031    51041  0.877  0.00   0.00  (all messages)
       100.000  87.7327  12.2673  0.877  0.00   0.00  (all messages as %)
set1    30.923  35.2466   0.0000  1.000  0.99   0.00  URIBL_SC_SURBL
set1    72.231  82.3273   0.0274  1.000  0.98   1.00  URIBL_OB_SURBL
set1    19.375  22.0847   0.0000  1.000  0.98   1.00  URIBL_AB_SURBL
set1    74.883  85.2939   0.4310  0.995  0.74   0.00  URIBL_WS_SURBL
set1     0.001   0.0000   0.0059  0.000  0.48   0.00  URIBL_PH_SURBL

      OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
        119215    67094    52121  0.563  0.00   0.00  (all messages)
       100.000  56.2798  43.7202  0.563  0.00   0.00  (all messages as %)
set3    39.217  69.6605   0.0288  1.000  0.98   1.00  URIBL_OB_SURBL
set3    10.340  18.3727   0.0000  1.000  0.97   0.00  URIBL_SC_SURBL
set3     5.998  10.6582   0.0000  1.000  0.94   1.00  URIBL_AB_SURBL
set3    42.730  75.5522   0.4797  0.994  0.73   0.00  URIBL_WS_SURBL
set3     0.008   0.0089   0.0058  0.608  0.49   0.00  URIBL_PH_SURBL

WS 1.3
SC 4.0
AB 3.0
OB 2.2
since the hit rates and S/O are a bit higher for me, related to the fact I ran
more spam through than Justin did.

Those final scores look like an excellent fit to the data to me.
Also while the PH spam hit rate [from Justin's stats] is low,
the data is of hand checked phishing scams, which deserve to be
blocked due to their potential danger and damage.
Therefore I would tend to give PH a medium-high score like
3 to 5.

So we'll probably adjust the default scores on SpamCopURI
to something like:

WS 1.3
SC 4.0
AB 3.0
OB 2.2
PH 4.5

and we recommend SpamCopURI users do likewise. Please be
sure to use the latest version of SpamCopURI with
multi.surbl.org:

http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
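
As a rough illustration of what the proposed scores mean in practice, the sketch below adds up the scores of the lists a URI hits and compares the total with the default threshold of 5.0 mentioned in the 50_scores.cf comment. It assumes the SURBL scores simply sum with whatever other rules fire (the `other_points` argument); the short list names and function names are illustrative only, not SpamCopURI rule names.

```python
DEFAULT_THRESHOLD = 5.0

# Proposed default scores from the message above, keyed by list name.
PROPOSED_SCORES = {"WS": 1.3, "SC": 4.0, "AB": 3.0, "OB": 2.2, "PH": 4.5}


def total_score(hits, scores=PROPOSED_SCORES):
    """Sum the scores of every SURBL list the message hit."""
    return sum(scores[name] for name in hits)


def is_spam(hits, other_points=0.0, threshold=DEFAULT_THRESHOLD):
    """True if SURBL hits plus points from other rules reach the threshold."""
    return total_score(hits) + other_points >= threshold


print(is_spam(["WS"]))        # a WS-only hit stays well under 5.0
print(is_spam(["WS", "OB"]))  # two lower-scored lists together
print(is_spam(["SC", "OB"]))  # SC plus one more list
print(is_spam(["PH"]))        # PH alone, before any other rules
```

With these values no single list tags a message by itself, and WS, the list with the highest ham hit rate, contributes the least.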

One thing that stood out for me is that the FP rate (ham%) for
ws.surbl.org is way too high at about 0.45 to 0.5% across
multiple corpora. That FP rate needs to be reduced for WS
to be more fully useful.

I think Chris or maybe Raymond suggested that they had a way to
reduce FPs in WS further. If so, ***please*** try to apply it.
We need to get the FPs to be much less than 0.5%. The other
lists have FP rates 5 to 50 times lower.

Basically the higher the FP rate, the less useful a list is.

Does anyone have other corpus stats to share, in particular
FP rates?

Jeff C.

--
Jeff Chan
mailto:***@surbl.org
http://www.surbl.org/

Jeff Chan

2004-09-05 10:41:57 UTC

Seeing those data, it would be very interesting if we could test a separate
list. Is that possible? I would like to test the Prolo and Joe's list
combined, without the rest of the WS list. I can generate the data for a
test like that. I have seen almost zero FPs in the data I compose, so
perhaps it's better to separate the lists. I think people would benefit
from a less FP-stuffed list. The current WS list is just compiled out of
too many data sources, I think.

If you can make the different lists available to me by rsync,
I can easily set up some temporary local SURBLs for testing
them. Thank you rbldnsd! :-)

Unfortunately I don't have my own test corpora, so I need to
rely on the generosity of others who do. So I'd probably
need to ask Theo, Daniel, Justin or others with corpora to
test against them.

Jeff C.

Ryan Thompson

2004-09-05 17:32:57 UTC

Post by Jeff Chan
Basically the higher the FP rate, the less useful a list is.

... or, rather, the lower it ought to be scored.

Post by Jeff Chan
Does anyone have other corpus stats to share, in particular
FP rates?

Sure. All of these messages were received in the past 10 days. A lot has
happened since June. :-)

WS: 44004/54185s, 61/19150s

OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
   73335    54185    19150  0.739  0.00   0.00  (all messages)
 100.000  73.8870  26.1130  0.739  0.00   0.00  (all messages as %)
  60.087  81.2107   0.0836  0.999  0.00   0.00  WS_SURBL

HOWEVER... I decided to go through the ham hits (61 of them), and look
for false positive domains to submit. I found several, but, for the most
part, they've *already* been cleaned up and are no longer listed in WS.
(30 out of the 61 were in a massive mailing list thread for a single
domain that has since been whitelisted).

And, in that 19K ham corpus, I found the following FPs still listed
in WS:

buckeye-express.com -- Used in a personal email address, looks legit;
    7 examples
nm.ru -- Used in a personal email address, looks legit
advanstar.com -- Legit uses; found in a well-known dental newsletter;
    also personal email address of one of the editors; 3 messages
00fun.com -- Confirmed, more than one user on our system sent or
    received eCards from them
northstarconferences.com -- Legit conference host site subscribed to
    by two users; 9 messages in this corpus
mardox.com -- Search engine; registered 1875 days ago, and *looks*
    like the user did actually submit their site to them.
postsnet.com -- Registered exactly one year ago, 51 NANAS, blank home
    page, ehh... but I have 4 different legit newsletters with links
    to them.
webspawner.com -- Created in 1996; free host/email
npdor.com -- Surveys; been around since 1999. 103 NANAS, but they've
    been advertised by some reputable "word of the day" mailers
    (dictionary.com). Maybe a good candidate for UC. :-) 2 examples
imninc.com -- Domain is 507 days old; they do newsletters. At least
    one of them is legit. :-)
worldhealth.net -- It's 3468 days old today (1995). One of our users
    attended a conference of theirs, and signed up for a newsletter.
hoteldiscounts.com -- 2459 days old (1997), found in actual room
    booking confirmations for Comfort Inn.

(I'll re-post these in another thread, just so everybody sees them).

AND, I found 2 spams that were incorrectly hand-classified as ham.

So, if I take those out, the numbers look more like:

WS: 44006/54187s, 0/19148s

OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
   73335    54187    19148  0.739  0.00   0.00  (all messages)
 100.007  73.8897  26.1103  0.739  0.00   0.00
  60.087  81.2111   0.0000  1.000  0.00   0.00  WS_SURBL

Is that more like what you had in mind..? No, I'm not making that up.
:-)

Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
hand-check.
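
For anyone who wants to automate the first half of that, here is a minimal sketch that lists the messages in a ham mbox whose spam-status header names a given rule, so they can then be checked by hand. It assumes the corpus is an mbox file and that each message already carries an X-Spam-Status style header listing the rules that fired; the function names, the default header, and the file name `ham.mbox` in the usage line are assumptions, not something from this thread.

```python
import mailbox
import re


def rule_names(status_header):
    """Return the set of rule-like tokens found in a spam-status header."""
    # Splitting on anything that is not a letter, digit or underscore copes
    # with folded headers and comma-separated test lists alike.
    return set(re.findall(r"[A-Za-z0-9_]+", status_header or ""))


def find_rule_hits(mbox_path, rule="WS_SURBL", header="X-Spam-Status"):
    """List the subjects of messages in an mbox whose header names `rule`."""
    subjects = []
    box = mailbox.mbox(mbox_path, create=False)
    try:
        for message in box:
            if rule in rule_names(str(message.get(header, ""))):
                subjects.append(str(message.get("Subject", "(no subject)")))
    finally:
        box.close()
    return subjects
```

Usage would look like `find_rule_hits("ham.mbox")`. Whole tokens are compared, so a search for `WS_SURBL` does not also match `URIBL_WS_SURBL`; pass whichever rule name your setup actually reports.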

- Ryan

--
Ryan Thompson <***@sasknow.com>

SaskNow Technologies - http://www.sasknow.com
901-1st Avenue North - Saskatoon, SK - S7K 1Y4

Tel: 306-664-3600 Fax: 306-244-7037 Saskatoon
Toll-Free: 877-727-5669 (877-SASKNOW) North America

Jeff Chan

2004-09-05 20:50:19 UTC

Post by Ryan Thompson

Post by Jeff Chan
Basically the higher the FP rate, the less useful a list is.

... or, rather, the lower it ought to be scored.

Yes, but please remember that not everyone has the ability to
"score" their SURBL hits. Not everyone using SURBLs is using
SpamAssassin.

Post by Ryan Thompson

Post by Jeff Chan
Does anyone have other corpus stats to share, in particular
FP rates?

Thanks for sharing your data. I know this can be a somewhat
painful subject for people, but it's very important to clean
up the false positives and make the lists better and more useful.

Post by Ryan Thompson
Sure. All of these messages were received in the past 10 days. A lot has
happened since June. :-)

WS: 44004/54185s, 61/19150s

OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
   73335    54185    19150  0.739  0.00   0.00  (all messages)
 100.000  73.8870  26.1130  0.739  0.00   0.00  (all messages as %)
  60.087  81.2107   0.0836  0.999  0.00   0.00  WS_SURBL

HOWEVER... I decided to go through the ham hits (61 of them), and look
for false positive domains to submit.

That kind of checking should become a policy. For people who can
do that kind of checking, they should do it every time. Every
tool we have for reducing FPs should be used.

Letting FPs in just hurts the usefulness of the lists.

Post by Ryan Thompson
I found several, but, for the most
part, they've *already* been cleaned up and are no longer listed in WS.
(30 out of the 61 were in a massive mailing list thread for a single
domain that has since been whitelisted).
And, in that 19K ham corpus, I found the following FPs still listed
buckeye-express.com -- Used in a personal email address, looks legit;
    7 examples
nm.ru -- Used in a personal email address, looks legit
advanstar.com -- Legit uses; found in a well-known dental newsletter;
    also personal email address of one of the editors; 3 messages
00fun.com -- Confirmed, more than one user on our system sent or
    received eCards from them
northstarconferences.com -- Legit conference host site subscribed to
    by two users; 9 messages in this corpus
mardox.com -- Search engine; registered 1875 days ago, and *looks*
    like the user did actually submit their site to them.
postsnet.com -- Registered exactly one year ago, 51 NANAS, blank home
    page, ehh... but I have 4 different legit newsletters with links
    to them.
webspawner.com -- Created in 1996; free host/email
npdor.com -- Surveys; been around since 1999. 103 NANAS, but they've
    been advertised by some reputable "word of the day" mailers
    (dictionary.com). Maybe a good candidate for UC. :-) 2 examples
imninc.com -- Domain is 507 days old; they do newsletters. At least
    one of them is legit. :-)
worldhealth.net -- It's 3468 days old today (1995). One of our users
    attended a conference of theirs, and signed up for a newsletter.
hoteldiscounts.com -- 2459 days old (1997), found in actual room
    booking confirmations for Comfort Inn.

Thanks. I agree those look like false positives and have
whitelisted all of them across SURBLs. Signing up for a
newsletter and then forgetting about it does not make a message
spam.

Instead of having these go into SURBLs, they should be checked
**before** they get added. Hopefully they would be detected
then and not get added to begin with. Wouldn't that be better?

Should hand-checking catch these as mostly legitimate?

Are we hand-checking? If not we should!

Post by Ryan Thompson
(I'll re-post these in another thread, just so everybody sees them).
AND, I found 2 spams that were incorrectly hand-classified as ham.

WS: 44006/54187s, 0/19148s

OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
   73335    54187    19148  0.739  0.00   0.00  (all messages)
 100.007  73.8897  26.1103  0.739  0.00   0.00
  60.087  81.2111   0.0000  1.000  0.00   0.00  WS_SURBL

Is that more like what you had in mind..? No, I'm not making that up.
:-)

Looks good, but this corpus is perhaps too small to make
representative measurements for emails in general. That
said, any reduction in FPs is important and welcome.

Post by Ryan Thompson
Anyone with ham corpora, just search for WS_SURBL hits and give 'em a
hand-check.
- Ryan

Thanks for your stats and checking, and yes please anyone else
with ham corpora, please check for FPs.

Jeff C.

Chris Santerre

2004-09-07 13:42:05 UTC

*snip*

Post by Jeff Chan
Thanks for your stats and checking, and yes please anyone else
with ham corpora, please check for FPs.

There is one SARE ninja testing guru that will come online with SURBL when
3.0 is released. I expect a LOT of testing, because he is addicted to it :)
He's actually been trying to work with another ninja to test for SURBL FPs
already. But the initial tests failed :(

So we should have a really great tester online soon.

With human errors, I think an SO ratio of 99% is pretty cool. Of course we
won't even have that unless we strive for zero FPs. It's a goal. I hand check
pretty much everything, but sometimes one slips by :)

--Chris

Jeff Chan

2004-09-07 14:17:56 UTC

Post by Chris Santerre
There is one SARE ninja testing guru that will come online with SURBL when
3.0 is released. I expect a LOT of testing, because he is addicted to it :)
He's actually been trying to work with another ninja to test for SURBL FPs
already. But the initial tests failed :(
So we should have a really great tester online soon.

Excellent news!

Post by Chris Santerre
With human errors, I think an SO ratio of 99% is pretty cool. Of course we
won't even have that unless we strive for zero FPs. It's a goal. I hand check
pretty much everything, but sometimes one slips by :)
--Chris

It's good to have more testing to try to close the loop
by checking things. Things are probably still a little too
open-ended. Additional feedback and testing should
be able to improve our accuracy.

Jeff C.