Discussion: UTF-8 Spam rules
Kevin A. McGrail
2013-09-16 14:12:03 UTC
Anyone have some examples of rules designed to catch words by content in
UTF-8 encoded messages? I'm doing some work on improving this.

Regards,
KAM
Jay Sekora
2013-09-19 19:09:27 UTC
Post by Kevin A. McGrail
Anyone have some examples of rules designed to catch words by content in
UTF-8 encoded messages? I'm doing some work on improving this.
Are you trying to match UTF-8 encoded messages as a stream of bytes, or
are you using normalize_charset? (And if the latter, how is it working
for you? I asked on this list a while back whether the advice I'd seen
that normalize_charset is dangerous resource-wise was still valid, and
didn't get any replies.)

I guess I don't have anything to offer other than that I really want to
see what you come up with, too. :-)

Jay
Kevin A. McGrail
2013-09-20 18:20:58 UTC
Post by Jay Sekora
Post by Kevin A. McGrail
Anyone have some examples of rules designed to catch words by content in
UTF-8 encoded messages? I'm doing some work on improving this.
Are you trying to match UTF-8 encoded messages as a stream of bytes,
or are you using normalize_charset? (And if the latter, how is it
working for you? I asked on this list a while back whether the advice
I'd seen that normalize_charset is dangerous resource-wise was still
valid, and didn't get any replies.)
I guess I don't have anything to offer other than that I really want
to see what you come up with, too. :-)
Right now, I'm just having problems with really putting a nail in the
coffin of spams using UTF-8 in From and Subject headers.

For Example:

From: "=?utf-8?B?RNGWcmVjdCDOknV5?=" <***@wholesalefirst-munged.co>
Subject: =?utf-8?B?VG9wIM6ScmFuZHMgQXQgV2hvbGVzYWxlIM6hctGWY9GWbmc=?=
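
To see what those encoded words actually contain, they can be decoded
with Perl's core Encode module (a quick check, assuming a UTF-8
terminal):

perl -CS -MEncode -le 'print decode("MIME-Header", shift)' \
  '=?utf-8?B?RNGWcmVjdCDOknV5?='
# prints: Dіrect Βuy -- with Cyrillic і and Greek Β, not the ASCII letters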

As of yet, I'm not using normalize_charset; I'm still researching what
hits things the best. Most of these still look REALLY spammy from a
pathway analysis, though.
David F. Skoll
2013-09-20 18:30:27 UTC
On Fri, 20 Sep 2013 14:20:58 -0400
Post by Kevin A. McGrail
As of yet, I'm not using normalize_charset; I'm still researching what
hits things the best.
You won't like my answer, but...

You really *have* to normalize everything to Unicode (possibly using UTF-8
as the canonical on-disk format) before trying to apply rules or extract
Bayes tokens. Then you can do nice things like blocking CJK spams
with a rule like:

header CJK_SUBJECT Subject =~ /\p{CJK_Unified_Ideographs}/

and have absolute confidence it will work no matter how the subject is
encoded.

I haven't looked extremely closely at the SpamAssassin code so I'm not
sure how its normalization works nor whether it can do the necessary
transformations for a subject rule such as my example to work.
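
Outside of SA, that normalization step is easy enough to sketch in plain
Perl (a minimal example; the sample subject below is hypothetical):

use strict;
use warnings;
use Encode qw(decode);

my $raw     = '=?utf-8?B?5L2g5aW9?=';       # hypothetical CJK subject
my $subject = decode('MIME-Header', $raw);  # RFC 2047 words -> Unicode string
print "CJK subject\n" if $subject =~ /\p{InCJKUnifiedIdeographs}/;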

Regards,

David.
Kevin A. McGrail
2013-09-27 21:19:19 UTC
Post by David F. Skoll
You won't like my answer, but...
You really *have* to normalize everything to Unicode (possibly using UTF-8
as the canonical on-disk format) before trying to apply rules or extract
Bayes tokens. Then you can do nice things like blocking CJK spams
header CJK_SUBJECT Subject =~ /\p{CJK_Unified_Ideographs}/
and have absolute confidence it will work no matter how the subject is
encoded.
I haven't looked extremely closely at the SpamAssassin code so I'm not
sure how its normalization works nor whether it can do the necessary
transformations for a subject rule such as my example to work.
Your answer helps because I think I'm hitting a bit of a systemic
issue. I'm banging away at this.

Regards,
KAM
Karsten Bräckelmann
2013-09-26 03:15:41 UTC
Post by Kevin A. McGrail
Post by Kevin A. McGrail
Anyone have some examples of rules designed to catch words by content in
UTF-8 encoded messages? I'm doing some work on improving this.
Right now, I'm just having problems with really putting a nail in the
coffin of spams using UTF-8 in From and Subject headers.
Using UTF-8 encoded headers (or body) is absolutely no sign of spam
whatsoever. Have a look at this mail's headers. I know you know, but
your wording was just unfortunate.
Post by Kevin A. McGrail
Subject: =?utf-8?B?VG9wIM6ScmFuZHMgQXQgV2hvbGVzYWxlIM6hctGWY9GWbmc=?=
What exactly is your problem? These match your sample.

header FOO_FROM From =~ /Dіrect Βuy/
header FOO_SUBJ Subject =~ /Top Βrands At Wholesale Ρrіcіng/

This one, though, doesn't.

header BAR_FROM From =~ /Direct Buy/

Confused yet? The From header rules look identical, you say?

Indeed, they do. Look identical. They aren't. The patterns are UTF-8
encoded, the latter one I typed in manually based on "what I see". The
first set of patterns are straight *copied* from a UTF-8 capable MUA.

A hex dump visualizes the differences.

00000000 44 D1 96 72 65 63 74 20 CE 92 75 79 D..rect ..uy

00000000 54 6F 70 20 CE 92 72 61 6E 64 73 20 41 74 20 57 Top ..rands At W
00000010 68 6F 6C 65 73 61 6C 65 20 CE A1 72 D1 96 63 D1 holesale ..r..c.
00000020 96 6E 67 .ng

So yeah, there are UTF-8 chars injected that are not part of ASCII, but
look identical to the ASCII chars the reader takes them for.

D196 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
CE92 GREEK CAPITAL LETTER BETA (U+0392)
CEA1 GREEK CAPITAL LETTER RHO (U+03A1)
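
(That table can be reproduced with a one-liner, assuming the lookalike
string is pasted from a UTF-8 terminal:)

perl -CSA -Mcharnames=:full -le \
  'printf "U+%04X %s\n", ord, charnames::viacode(ord) for split //, shift' \
  'Dіrect'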


That analysis done -- again, what exactly is your problem?

Matching a fixed string? Be sure to copy the UTF-8 encoded non-ASCII
chars, rather than typing in visually equivalent chars. SA can handle
UTF-8 strings in rules at least since SA 3.2 on Perl 5.8.x.

Matching specific words with either ASCII or non-ASCII chars? Hardcoded
custom rules, or better M::SA::Plugin::ReplaceTags rules.

Matching *any* UTF-8 non-ASCII char? Not a good idea. (I know you never
would, just for completeness in this post.)
Post by Kevin A. McGrail
As of yet, I'm not using normalize_charset; I'm still researching what
hits things the best. Most of these still look REALLY spammy from a
pathway analysis, though.
Never used normalize_charset myself. But from a glimpse at the docs,

"detect character sets and normalize message content to Unicode."

it appears that option would only make sense with non-ASCII content that
is NOT UTF-8 encoded, to use UTF-8 encoded rules.
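
(For reference, the option itself is a one-line switch in local.cf, off
by default in the 3.x line:)

normalize_charset 1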
--
char *t="\10pse\0r\0dtu\***@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}
Kevin A. McGrail
2013-09-27 22:46:02 UTC
Post by Karsten Bräckelmann
Post by Kevin A. McGrail
Post by Kevin A. McGrail
Anyone have some examples of rules designed to catch words by content in
UTF-8 encoded messages? I'm doing some work on improving this.
Right now, I'm just having problems with really putting a nail in the
coffin of spams using UTF-8 in From and Subject headers.
Using UTF-8 encoded headers (or body) is absolutely no sign of spam
whatsoever. Have a look at this mail's headers. I know you know, but
your wording was just unfortunate.
Agreed. I don't view UTF-8 as an indicator of spam. But I do see its
use on the uptick, especially to add obfuscation.
Post by Karsten Bräckelmann
Post by Kevin A. McGrail
Subject: =?utf-8?B?VG9wIM6ScmFuZHMgQXQgV2hvbGVzYWxlIM6hctGWY9GWbmc=?=
What exactly is your problem? These match your sample.
header FOO_FROM From =~ /Dіrect Βuy/
header FOO_SUBJ Subject =~ /Top Βrands At Wholesale Ρrіcіng/
This one, though, doesn't.
header BAR_FROM From =~ /Direct Buy/
Confused yet? The From header rules look identical, you say?
Exactly. I know they are UTF-8 encoded variants that look identical.
Post by Karsten Bräckelmann
That analysis done -- again, what exactly is your problem?
Problem #1: What's the best way to write rules to catch these variants
that use UTF-8 encoded lookalike characters?

D[іi]rect [ΒB]uy isn't exactly scalable

And ReplaceTags is just shorthand for writing the same thing.

But I think the answer is: I need to come up with "stock" ReplaceTags
entries for the [B] and [i] variants I'm seeing in the wild and use
those.
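
Something like this, maybe (a sketch only; the lookalike sets cover just
the characters seen in this thread, and the alternations avoid bracketed
classes so each multi-byte UTF-8 sequence still matches if the pattern
is applied byte-wise):

loadplugin Mail::SpamAssassin::Plugin::ReplaceTags

replace_start <
replace_end   >

replace_tag   B  (?:B|Β)
replace_tag   I  (?:i|і)

header        KAM_FUZZY_DIRECTBUY  From =~ /D<I>rect <B>uy/
describe      KAM_FUZZY_DIRECTBUY  From header lookalike of "Direct Buy"
score         KAM_FUZZY_DIRECTBUY  0.5
replace_rules KAM_FUZZY_DIRECTBUY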

Problem #2: How to effectively cut and paste this information and create rules.

- I use SecureCRT to log in via SSH and edit rules with vim. Is there
any way to set this up so I can cut and paste the UTF-8 decoded
information? I'm guessing not, really, though I've been playing with
various "set utf-8" settings. I likely need to switch MY methods for
creating rules, which is the systemic issue I'm hitting.
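
For what it's worth, the usual recipe seems to be setting the SecureCRT
session's character encoding to UTF-8 and making vim agree with the
terminal; something along these lines in ~/.vimrc (assuming the remote
locale is UTF-8 as well):

" terminal and buffer encodings must agree for pasted UTF-8 to survive
set encoding=utf-8
set termencoding=utf-8
set fileencodings=ucs-bom,utf-8,latin1
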
Post by Karsten Bräckelmann
Never used normalize_charset myself. But from a glimpse at the docs,
"detect character sets and normalize message content to Unicode." it
appears that option would only make sense with non-ASCII content that
is NOT UTF-8 encoded, to use UTF-8 encoded rules.
That's a good point.

Regards,
KAM
