Jeff Chan
2004-09-05 10:22:48 UTC
Eric Kolve and I were looking at how to best set the default SpamCopURI
scores for the various SURBL lists and at first we tried looking at the
http://spamassassin.apache.org/full/3.0.x/dist/rules/50_scores.cf
# The following block of scores were generated using the mass-checking
# scripts, and a perceptron to determine the optimum scores which
# resulted in minimum false positives or negatives. The scores are
# weighted to produce roughly 1 false positive in 2500 non-spam messages
# using the default threshold of 5.0.
score URIBL_AB_SURBL 0 2.007 0 0.417
score URIBL_OB_SURBL 0 1.996 0 3.213
score URIBL_PH_SURBL 0 0.839 0 2.000
score URIBL_SC_SURBL 0 3.897 0 4.263
score URIBL_WS_SURBL 0 0.539 0 1.462
I was trying to figure out what the different score columns meant,
$ perldoc Mail::SpamAssassin::Conf
[...]
If four valid scores are listed, then the score that is used
depends on how SpamAssassin is being used. The first score is used
when both Bayes and network tests are disabled (score set 0). The
second score is used when Bayes is disabled, but network tests are
enabled (score set 1). The third score is used when Bayes is
enabled and network tests are disabled (score set 2). The fourth
score is used when Bayes is enabled and network tests are enabled
(score set 3).
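As a rough illustration of the perldoc text above, here is a sketch of how one of the four columns gets picked. This is not SpamAssassin's own code, and the function name select_score is made up for the example; only the score values come from the 50_scores.cf excerpt.

```python
# Sketch only: mirrors the perldoc wording, not SpamAssassin's own code.
# The name select_score is made up for this example.

def select_score(scores, bayes_enabled, net_enabled):
    """Pick one of four per-rule scores by score set number (0..3)."""
    if len(scores) != 4:
        raise ValueError("expected exactly four scores")
    # Score set number: +1 if network tests are on, +2 if Bayes is on.
    score_set = (2 if bayes_enabled else 0) + (1 if net_enabled else 0)
    return scores[score_set]

# The URIBL_SC_SURBL line from the 50_scores.cf excerpt.
sc_scores = [0, 3.897, 0, 4.263]

for bayes in (False, True):
    for net in (False, True):
        print(bayes, net, select_score(sc_scores, bayes, net))
```

The zeros in score sets 0 and 2 are presumably there because the SURBL rules are network tests and cannot fire when network tests are disabled.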
We wondered if we could somehow use those scores with SpamCopURI
and were unable to come up with a good answer.
Theo suggested looking at Spam versus ham rates as a good way to
OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
  121405    22516    98889  0.185  0.00   0.00  (all messages)
 100.000  18.5462  81.4538  0.185  0.00   0.00  (all messages as %)
  13.453  70.3766   0.4925  0.993  1.00   1.00  SURBL_WS
   3.807  20.3811   0.0334  0.998  0.50   1.00  SURBL_SC
   2.650  14.2565   0.0071  1.000  0.50   1.00  SURBL_AB
   0.019   0.0933   0.0020  0.979  0.50   1.00  SURBL_PH
  12.624  67.6275   0.1001  0.999  0.50   1.00  SURBL_OB
which shows a pretty high FP rate for WS, less for the others.
Do you happen to have access to any more recent corpus check data
like this? Could be useful to have another snapshot for a more
complete picture.
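For anyone reading the S/O column in the table above: the values appear consistent with SPAM% / (SPAM% + HAM%), so 1.000 means the rule only hit spam. A small sketch under that assumption follows; the formula is inferred from the numbers and not taken from the mass-check scripts, and the function name s_over_o is made up here.

```python
# Sketch only: assumes S/O = spam% / (spam% + ham%), which is inferred
# from the table and not taken from the mass-check scripts themselves.

def s_over_o(spam_pct, ham_pct):
    """Fraction of a rule's (percentage-normalized) hits that are spam."""
    total = spam_pct + ham_pct
    if total == 0:
        return None  # the rule never hit anything
    return spam_pct / total

# (SPAM%, HAM%) pairs copied from the table above.
rows = {
    "SURBL_WS": (70.3766, 0.4925),
    "SURBL_SC": (20.3811, 0.0334),
    "SURBL_PH": (0.0933, 0.0020),
}

for name, (spam, ham) in rows.items():
    print(f"{name}  {s_over_o(spam, ham):.3f}")
```

With these inputs WS comes out at about 0.993, matching the table. Note that the HAM% column on its own is the FP rate being discussed here.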
[Theo's wild guess scores for Justin's June data: -- Jeff C.]
high spam + low ham is good from an FP standpoint, but having a "significant"
(for your definition thereof) ham hitrate means the score shouldn't be too
WS 1.2
SC 2.5
AB 3.5
OB 1.8
      OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
        416072   365031    51041  0.877  0.00   0.00  (all messages)
       100.000  87.7327  12.2673  0.877  0.00   0.00  (all messages as %)
set1    30.923  35.2466   0.0000  1.000  0.99   0.00  URIBL_SC_SURBL
set1    72.231  82.3273   0.0274  1.000  0.98   1.00  URIBL_OB_SURBL
set1    19.375  22.0847   0.0000  1.000  0.98   1.00  URIBL_AB_SURBL
set1    74.883  85.2939   0.4310  0.995  0.74   0.00  URIBL_WS_SURBL
set1     0.001   0.0000   0.0059  0.000  0.48   0.00  URIBL_PH_SURBL
      OVERALL%    SPAM%     HAM%    S/O  RANK  SCORE  NAME
        119215    67094    52121  0.563  0.00   0.00  (all messages)
       100.000  56.2798  43.7202  0.563  0.00   0.00  (all messages as %)
set3    39.217  69.6605   0.0288  1.000  0.98   1.00  URIBL_OB_SURBL
set3    10.340  18.3727   0.0000  1.000  0.97   0.00  URIBL_SC_SURBL
set3     5.998  10.6582   0.0000  1.000  0.94   1.00  URIBL_AB_SURBL
set3    42.730  75.5522   0.4797  0.994  0.73   0.00  URIBL_WS_SURBL
set3     0.008   0.0089   0.0058  0.608  0.49   0.00  URIBL_PH_SURBL
WS 1.3
SC 4.0
AB 3.0
OB 2.2
since the hit rates and S/O are a bit higher for me, related to the fact I ran
more spam through than Justin did.
Those final scores look like an excellent fit to the data to me.
Also while the PH spam hit rate [from Justin's stats] is low,
the data is of hand checked phishing scams, which deserve to be
blocked due to their potential danger and damage.
Therefore I would tend to give PH a medium-high score like
3 to 5.
So we'll probably adjust the default scores on SpamCopURI
to something like:
WS 1.3
SC 4.0
AB 3.0
OB 2.2
PH 4.5
and we recommend SpamCopURI users do likewise. Please be
sure to use the latest version of SpamCopURI with
multi.surbl.org:
http://sourceforge.net/projects/spamcopuri/
http://search.cpan.org/dist/Mail-SpamAssassin-SpamCopURI/
One thing that stood out for me is that the FP rate (ham%) for
ws.surbl.org is way too high at about 0.45 to 0.5% across
multiple corpora. That FP rate needs to be reduced for WS
to be more fully useful.
I think Chris or maybe Raymond suggested that they had a way to
reduce FPs in WS further. If so, ***please*** try to apply it.
We need to get the FPs to be much less than 0.5%. The other
lists have FP rates 5 to 50 times lower.
Basically the higher the FP rate, the less useful a list is.
Does anyone have other corpus stats to share, in particular
FP rates?
Jeff C.
--
Jeff Chan
mailto:***@surbl.org
http://www.surbl.org/