Discussion:
German spam corpus / foreign language spam
Daniel Roethlisberger
2002-08-21 12:05:29 UTC
I've been lurking on the SA lists since I installed SA on a production
machine a while back. While SA does a surprisingly accurate job of
detecting English-language spam, it does not do well on German-language
spam, which I have been getting in increasing amounts lately. I get
lousy results with the out-of-the-box scores; very little spam is
actually caught.

What is the strategy with respect to foreign language spam recognition
in SA? I've seen extremely few non-English rules. Is there foreign
language rule development going on? Has anybody done work on German
spam?

In any case, I've started spam/nonspam corpora consisting of only German
(and Swiss-German, respectively) messages, to be able to help with
German rules. Anybody willing to contribute to the corpus, feel free to
resend/bounce German spam in a sane way to ***@roe.ch . I cannot be
bothered to subscribe to SAsightings just for the odd German spam every
hundred++ messages. How about a list for foreign language spam
sightings?

Has anybody done this before or am I on the edge of duplicating effort
here?

I've been thinking about this a bit. I think it would be best if there
were general provisions for foreign language rules. In the spirit of
the ok_languages option, let users easily enable or disable rules in
certain languages. Like a foreign_rules option which could be used to
control which foreign rulesets are active. Usually people would want to
use checks in all languages which are in the ok_languages list.

Is there any development or are there plans along those lines? Are there
other people willing to contribute effective filtering rules for
German-language spam?

Any kind of feedback is welcome, even flames ;)

Cheers,
Dan
--
Daniel Roethlisberger <***@roe.ch>
OpenPGP key id 0x804A06B1 (1024/4096 DSA/ElGamal)
144D 6A5E 0C88 E5D7 0775 FCFD 3974 0E98 804A 06B1
>> privacy through technology, not legislation <<
-------------------------------------------------------
This sf.net email is sponsored by: OSDN - Tired of that same old
cell phone? Get a new here for FREE!
https://www.inphonic.com/r.asp?r=sourceforge1&refcode1=vs3390
Justin Mason
2002-08-21 12:22:10 UTC
Post by Daniel Roethlisberger
What is the strategy with respect to foreign language spam recognition
in SA? I've seen extremely few non-English rules. Is there foreign
language rule development going on? Has anybody done work on German
spam?
There are a few people working on it -- we'd be happy to give you whatever
help you need, and CVS commit access, as long as you know what you're
doing when writing the rules ;)

The main thing limiting the number of non-English rules is that the
core developers use English as their primary language.

Some guidelines:

- when writing rules, don't forget: .* in regexps Is Bad. It'll kill
SpamAssassin performance, and can cause hangs. Use ".{0,20}" or
similar instead.

- be conservative about body tests; stuff that can come up in opt-in
commercial mails can cause trouble and FPs, and you really need to
test them carefully. But this is really where keeping a good corpus
according to masses/CORPUS_POLICY helps.

- body tests should be designed based on frequency analysis, rather than
thinking, "the phrase 'call now' sounds a bit spammish, that'd make a
good test". Again, see masses/CORPUS_POLICY.

- don't forget that, no matter what lang a spam is in, forged headers
are forged headers, and Razor is Razor ;) SpamAssassin isn't totally
ineffective in that respect.
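
A minimal sketch of the first guideline, written in Python rather than
Perl. The two patterns and both sample sentences are made up for
illustration and are not real SpamAssassin rules. It only shows the
difference in reach: the unbounded gap lets two unrelated words far
apart count as a hit, while the bounded gap limits how much text the
regex engine may span between them.

```python
import re

# Hypothetical patterns, for illustration only.
UNBOUNDED = re.compile(r"jetzt.*bestellen", re.IGNORECASE)
BOUNDED = re.compile(r"jetzt.{0,20}bestellen", re.IGNORECASE)

# The two words are close together, as in a typical sales phrase.
near = "Jetzt gleich bestellen und sparen"
# The two words are unrelated and far apart in an ordinary sentence.
far = "Jetzt ist der Bericht fertig, ich werde morgen die Unterlagen bestellen"


def hits(pattern, text):
    """Return True if the pattern matches anywhere in the text."""
    return pattern.search(text) is not None


for label, text in (("near", near), ("far", far)):
    print(label, "unbounded:", hits(UNBOUNDED, text), "bounded:", hits(BOUNDED, text))
```

The bound of 20 is simply the figure from the guideline above; a real
rule would pick it from what the corpus shows.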
Post by Daniel Roethlisberger
I've been thinking about this a bit. I think it would be best if there
were general provisions for foreign language rules. In the spirit of
the ok_languages option, let users easily enable or disable rules in
certain languages. Like a foreign_rules option which could be used to
control which foreign rulesets are active. Usually people would want to
use checks in all languages which are in the ok_languages list.
It's a point of discussion, alright.

My opinion is that even English-language users might get
German-language spam, since spammers aren't famed for their ability to
keep a well-targeted list. Or an English-language user at a domain in .de
might be "targeted" for German-language spam as a result, and would want
them.

Also, rules with .* or similar, that cause hangs, have in the past cropped
up in lang-specific tests. So for QA reasons it makes more sense to run
all the foreign-language rules too, so hang bugs will show up no matter
what locale you're running SpamAssassin in.

--j.
--
'Justin Mason' => { url => http://jmason.org/ , blog => http://taint.org/ }


Malte S. Stretz
2002-08-21 12:52:51 UTC
Post by Justin Mason
Post by Daniel Roethlisberger
What is the strategy with respect to foreign language spam recognition
in SA? I've seen extremely few non-English rules. Is there foreign
language rule development going on? Has anybody done work on German
spam?
There are a few people working on it -- we'd be happy to give you whatever
help you need, and CVS commit access, as long as you know what you're
doing when writing the rules ;)
I'm accidentally native German, too :o) So if you need any assistance,
contact me. I'll send you a whole bunch of spams.de, just have to sort them
out. Would it be possible for you to make the spam corpus more or less
publicly available? Oh, and it would be really nice if you found some rules
against that dialer scam :)
Post by Justin Mason
[...]
Post by Daniel Roethlisberger
I've been thinking about this a bit. I think it would be best if there
were general provisions for foreign language rules. In the spirit
of the ok_languages option, let users easily enable or disable rules in
certain languages. Like a foreign_rules option which could be used to
control which foreign rulesets are active. Usually people would want to
use checks in all languages which are in the ok_languages list.
It's a point of discussion, alright.
My opinion is that even English-language users might get
German-language spam, since spammers aren't famed for their ability to
keep a well-targeted list. Or an English-language user at a domain in
.de might be "targeted" for German-language spam as a result, and would
want them.
Currently rules tagged with 'lang xx' are run only if the correct
environment variables are set. That's IMO a kinda broken approach. My
proposal is to run all rules by default but have a config option where you
can disable some languages if you're pretty sure that you won't ever
receive any spam in that language. (To me it seems like the German spammers
are much better at targeting than the US and Chinese ones though ;-)
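
The two selection policies contrasted above can be sketched roughly as
follows. This is an illustration in Python, not SpamAssassin code: the
rule names, the `Rule` type and the `disabled_langs` parameter are all
hypothetical, and how SpamAssassin actually reads the locale from the
environment is not modelled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Rule:
    name: str
    lang: Optional[str] = None  # None means the rule is language-independent


# Hypothetical rules, for illustration only.
RULES = [Rule("FORGED_HEADER"), Rule("DE_DIALER", "de"), Rule("FR_PROMO", "fr")]


def active_locale_gated(rules, locale_lang):
    """Behaviour as described above: a 'lang xx' rule runs only when the
    language taken from the environment matches."""
    return [r.name for r in rules if r.lang is None or r.lang == locale_lang]


def active_default_on(rules, disabled_langs=frozenset()):
    """Proposed behaviour: run everything unless a language is disabled."""
    return [r.name for r in rules if r.lang not in disabled_langs]


print(active_locale_gated(RULES, "en"))
print(active_default_on(RULES))
print(active_default_on(RULES, {"fr"}))
```

With an English locale the first policy silently drops both
language-specific rules, which is the behaviour being criticised; the
second keeps them unless the site opts out.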
Post by Justin Mason
[...]
Malte
--
-- Coding is art.
--
Daniel Roethlisberger
2002-08-21 13:14:23 UTC
Post by Malte S. Stretz
I'm accidentally native German, too :o) So if you need any assistance,
contact me. I'll send you a whole bunch of spams.de, just have to sort
them out.
Well yes, send me whatever you got (attached as mbox would be most
convenient I guess).
Post by Malte S. Stretz
Would it be possible for you to make the spam corpus more or less
publicly available?
Of course, at least the non-private part of it. How do you folks handle
this with English corpora? In the main CVS repository on sf.net? Or on
private servers?
Post by Malte S. Stretz
Oh, and it would be really nice if you found some rules against that
dialer scam :)
That would be a good place to start, yes. But I'd rather wait for the
more general questions on foreign language rules to be sorted out first.
Post by Malte S. Stretz
Currently rules tagged with 'lang xx' are run only if the correct
environment variables are set. That's IMO a kinda broken approach.
[...]

Oh, didn't know that (see other subthread).

Cheers,
Dan
--
Daniel Roethlisberger <***@roe.ch>
OpenPGP key id 0x804A06B1 (1024/4096 DSA/ElGamal)
144D 6A5E 0C88 E5D7 0775 FCFD 3974 0E98 804A 06B1
>> privacy through technology, not legislation <<
Malte S. Stretz
2002-08-21 14:01:02 UTC
Post by Daniel Roethlisberger
Post by Malte S. Stretz
I'm accidentally native German, too :o) So if you need any assistance,
contact me. I'll send you a whole bunch of spams.de, just have to sort
them out.
Well yes, send me whatever you got (attached as mbox would be most
convenient I guess).
I sent it as a bzipped Maildir, hope you don't mind :o)
Post by Daniel Roethlisberger
Post by Malte S. Stretz
Would it be possible for you to make the spam corpus more or less
publicly available?
Of course, at least the non-private part of it. How do you folks handle
this with English corpora? In the main CVS repository on sf.net? Or on
private servers?
There's so much English spam that everybody's got enough of their own ;-)

But a separate module (eg. spam-archive-de) in the spamassassin CVS might be
useful. Or a public IMAP server. But a private mailserver from where you
can download bzipped mboxes should be enough (maybe password protected).

Malte
--
-- Coding is art.
--
Daniel Roethlisberger
2002-08-21 13:01:24 UTC
Post by Justin Mason
There are a few people working on it --
Feel free to speak up, those few people ;-)
Post by Justin Mason
we'd be happy to give you whatever help you need, and CVS commit
access, as long as you know what you're doing when writing the rules
;)
No hurry, but thanks for the rule design advice. I'm quite Perl and
regex savvy, but still rather newish to SA. And before I even start
thinking about rule design, I need a large enough spam corpus, sorted into
German and non-German, spam and nonspam, to see just how well the rules
are doing. Which is what I am building up now.
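
As a rough sketch of what "seeing how well the rules are doing" means
with such a sorted corpus: count how often a candidate pattern hits in
the spam pile and in the nonspam pile. The Python below is only an
illustration. The candidate pattern and the two tiny corpora are
invented, and the real procedure under masses/ is not reproduced here.

```python
import re


def hit_rate(pattern, messages):
    """Fraction of messages in which the pattern matches at least once."""
    if not messages:
        return 0.0
    hits = sum(1 for text in messages if pattern.search(text))
    return hits / len(messages)


# Invented toy corpora; a real check needs far more mail than this.
spam = [
    "Jetzt kostenlos testen und gewinnen",
    "Kostenlos und bequem testen",
    "Verdienen Sie Geld von zu Hause",
    "Heute noch kostenlos testen",
]
nonspam = [
    "Das Protokoll der Sitzung liegt bei",
    "Koennen wir den Termin verschieben?",
    "Die neue Version kannst du kostenlos testen",
    "Danke fuer die schnelle Antwort",
]

# Hypothetical candidate rule with a bounded gap.
candidate = re.compile(r"kostenlos.{0,20}testen", re.IGNORECASE)

print("spam hit rate:   ", hit_rate(candidate, spam))
print("nonspam hit rate:", hit_rate(candidate, nonspam))
```

A phrase that looks spammy can still turn up in legitimate mail, so
both numbers matter, not just the spam hit rate.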

I was thinking that it would make sense to join with those who are
already working on German spam rules or, if there's nobody, to kick off a
German language rules development effort. I don't plan on running a
one-man show if it can be avoided; a German-rules team would be nice.
Post by Justin Mason
My opinion is that even English-language users might get
German-language spam, since spammers aren't famed for their ability to
keep a well-targeted list. Or an English-language user at a domain in
.de might be "targeted" for German-language spam as a result, and
would want them.
Yes, but English-language users can sufficiently up the scores on
foreign language spam with ok_languages, which together with the
non-language-specific rules tackles spam well enough, from what I've
seen.

Therefore I think it best to allow easy enabling/disabling of language
specific tests in foreign languages. This will also limit the number of
rules that SA sites run. From the discussion about dumping rules which
match too infrequently I gathered that this is a main concern.

If foreign language rules are grouped, and SA provides for easy
activation/deactivation or language selection, this should not become a
problem. I wouldn't think that many people in English-speaking countries
would want to run all the German, French, Swedish, Finnish, Russian, and
whatnot <insert-more-languages-here> rules too. Sites wanting to run
with all languages could still set foreign_rules to all.

Or is this thinking flawed in some way?

Cheers,
Dan
--
Daniel Roethlisberger <***@roe.ch>
OpenPGP key id 0x804A06B1 (1024/4096 DSA/ElGamal)
144D 6A5E 0C88 E5D7 0775 FCFD 3974 0E98 804A 06B1
>> privacy through technology, not legislation <<
Klaus Heinz
2002-08-21 18:32:13 UTC
Post by Daniel Roethlisberger
Post by Justin Mason
There are a few people working on it --
Feel free to speak up, those few people ;-)
Some time ago in June, 'zsolt /at/ gmx at' was asking on
de.admin.net-abuse.mail for spam mails in German to build up a corpus.
I don't know what came out of this.

At the time of 2.21 I sent an almost complete version of 30_text_de.cf
to the initiator of this file (H Stich?), but it hasn't been integrated/
submitted into CVS yet and is probably way out of date now.

Since then I have been slowly collecting German-language spam, but there's
not much of it in my mail (should I deplore this? probably not :-).
Post by Daniel Roethlisberger
I was thinking that it would make sense to join with those who are
already working on German spam rules or, if there's nobody, to kick off a
German language rules development effort. I don't plan on running a
one-man show if it can be avoided; a German-rules team would be nice.
I still have other things to do, but contributing some translations and a
few rules might be possible. On the other hand, my ability to do
meaningful checks is limited, as I don't have a big corpus of (German)
emails.

ciao
Klaus



Daniel Roethlisberger
2002-08-22 07:14:31 UTC
Post by Klaus Heinz
At the time of 2.21 I sent an almost complete version of 30_text_de.cf
to the initiator of this file (H Stich?), but it hasn't been
integrated/ submitted into CVS yet and is probably way out of date
now.
Well, translated rule descriptions are nice, sure enough, but what we
need is a 25_body_tests_de.cf / 25_head_tests_de.cf .
Post by Klaus Heinz
I still have other things to do, but contributing some translations
and a few rules might be possible. On the other hand, my ability to do
meaningful checks is limited, as I don't have a big corpus of (German)
emails.
How about a German mailing list for German rules development? What would
the SA dev team think about that? (I'd be happy to set one up, if it should
not be an sf.net list)

Cheers
Dan
--
Daniel Roethlisberger <***@roe.ch>
OpenPGP key id 0x804A06B1 (1024/4096 DSA/ElGamal)
144D 6A5E 0C88 E5D7 0775 FCFD 3974 0E98 804A 06B1
>> privacy through technology, not legislation <<
Craig R.Hughes
2002-08-22 16:23:29 UTC
I thought we'd sort of decided that after 2.40 we'd switch the
meaning of the "lang" prefix to actually mean "run these rules
if language matches", where "language" is what TextCat spits out
for a message, as opposed to using the locale which is how
things currently work.
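
A rough sketch of that per-message gating, in Python. Everything here is
a stand-in: `guess_language` is a naive stopword counter, not TextCat,
and the rule names and the rule table are hypothetical. The point is only
that the language comes from the message itself rather than from the
locale of the machine running the filter.

```python
# Tiny stopword lists used by the stand-in language guesser.
STOPWORDS = {
    "de": {"und", "der", "die", "das", "nicht", "sie", "ist"},
    "en": {"and", "the", "is", "not", "you", "of", "to"},
}

# Hypothetical rules mapped to their language; None = language-independent.
RULES = {"DE_DIALER": "de", "EN_CALL_NOW": "en", "FORGED_HEADER": None}


def guess_language(text):
    """Very naive stand-in for a real language guesser such as TextCat."""
    words = text.lower().split()
    scores = {lang: sum(w in sw for w in words) for lang, sw in STOPWORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else None


def rules_for_message(text, rules=RULES):
    """Select language-independent rules plus those matching the guess."""
    detected = guess_language(text)
    return sorted(
        name for name, lang in rules.items() if lang is None or lang == detected
    )


print(rules_for_message("Das ist nicht die Rechnung und der Vertrag"))
print(rules_for_message("This is not the invoice you asked of me"))
```

When the guesser returns nothing, only the language-independent rules
run in this sketch; whether to fall back to running everything instead
is a policy choice the thread leaves open.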

C
Post by Justin Mason
My opinion is that even English-language users might get
German-language spam, since spammers aren't famed for their ability to
keep a well-targeted list. Or an English-language user at a domain in .de
might be "targeted" for German-language spam as a result, and would want
them.
Also, rules with .* or similar, that cause hangs, have in the past cropped
up in lang-specific tests. So for QA reasons it makes more sense to run
all the foreign-language rules too, so hang bugs will show up no matter
what locale you're running SpamAssassin in.