Post by Kevin A. McGrail
Anyone have some examples of rules designed to catch words by content in
UTF-8 encoded messages? I'm doing some work on improving this.
Right now, I'm just having problems with really putting a nail in the
coffin of spams using UTF8 from and Subjects.
Using UTF-8 encoded headers (or body) is absolutely no sign of spam
whatsoever. Have a look at this mail's headers. I know you know, but
your wording was just unfortunate.
Post by Kevin A. McGrail
Subject: =?utf-8?B?VG9wIM6ScmFuZHMgQXQgV2hvbGVzYWxlIM6hctGWY9GWbmc=?=
What exactly is your problem? These match your sample.
header FOO_FROM From =~ /Dіrect Βuy/
header FOO_SUBJ Subject =~ /Top Βrands At Wholesale Ρrіcіng/
This one, though, doesn't.
header BAR_FROM From =~ /Direct Buy/
Confused yet? The From header rules look identical, you say?
Indeed, they do. Look identical. They aren't. The patterns are UTF-8
encoded, the latter one I typed in manually based on "what I see". The
first set of patterns are straight *copied* from a UTF-8 capable MUA.
A hex dump visualizes the differences.
00000000 44 D1 96 72 65 63 74 20 CE 92 75 79 D..rect ..uy
00000000 54 6F 70 20 CE 92 72 61 6E 64 73 20 41 74 20 57 Top ..rands At W
00000010 68 6F 6C 65 73 61 6C 65 20 CE A1 72 D1 96 63 D1 holesale ..r..c.
00000020 96 6E 67 .ng
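The same comparison can be reproduced without a hex editor. A minimal Python sketch, assuming the copied From pattern consists of exactly the bytes in the first dump above; the helper names hexbytes and differing_positions are made up for illustration:

```python
# Compare the hand-typed pattern with the one copied from the MUA.
TYPED = "Direct Buy"               # plain ASCII, typed by hand
COPIED = "D\u0456rect \u0392uy"    # U+0456 and U+0392 stand in for 'i' and 'B'


def hexbytes(text):
    """Return the UTF-8 encoding of text as space-separated hex pairs."""
    return " ".join(f"{b:02X}" for b in text.encode("utf-8"))


def differing_positions(a, b):
    """Return the character indexes at which two equal-length strings differ."""
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


if __name__ == "__main__":
    print(hexbytes(TYPED))
    print(hexbytes(COPIED))
    print(differing_positions(TYPED, COPIED))
```

Both strings are ten characters long and render the same in most fonts, but only the copied one encodes to the bytes in the dump.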
So yeah, there are non-ASCII UTF-8 chars injected that look identical
to the ASCII chars the reader takes them for.
D196 CYRILLIC SMALL LETTER BYELORUSSIAN-UKRAINIAN I (U+0456)
CE92 GREEK CAPITAL LETTER BETA (U+0392)
CEA1 GREEK CAPITAL LETTER RHO (U+03A1)
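To find such characters without reading hex by eye, the quoted Subject header can be decoded and every non-ASCII character listed with its UTF-8 bytes and Unicode name. A rough Python sketch using only the standard library; the function names decode_subject and non_ascii_report are hypothetical, and this is an inspection aid, not something SA does internally:

```python
import unicodedata
from email.header import decode_header

# The Subject header quoted earlier in this post (RFC 2047, base64, UTF-8).
RAW_SUBJECT = "=?utf-8?B?VG9wIM6ScmFuZHMgQXQgV2hvbGVzYWxlIM6hctGWY9GWbmc=?="


def decode_subject(raw):
    """Decode an RFC 2047 encoded header value into a Unicode string."""
    parts = []
    for chunk, charset in decode_header(raw):
        if isinstance(chunk, bytes):
            chunk = chunk.decode(charset or "ascii", errors="replace")
        parts.append(chunk)
    return "".join(parts)


def non_ascii_report(text):
    """List (index, UTF-8 hex, code point, Unicode name) per non-ASCII char."""
    report = []
    for index, ch in enumerate(text):
        if ord(ch) > 127:
            report.append((
                index,
                ch.encode("utf-8").hex().upper(),
                f"U+{ord(ch):04X}",
                unicodedata.name(ch, "UNKNOWN"),
            ))
    return report


if __name__ == "__main__":
    for row in non_ascii_report(decode_subject(RAW_SUBJECT)):
        print(*row)
```

Run against the sample Subject, this reports four lookalike characters: one Greek Beta, one Greek Rho, and the Cyrillic i twice.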
That analysis done -- again, what exactly is your problem?
Matching a fixed string? Be sure to copy the UTF-8 encoded non-ASCII
chars, rather than typing in visually equivalent chars. SA can handle
UTF-8 strings in rules at least since SA 3.2 on Perl 5.8.x.
Matching specific words with either ASCII or non-ASCII chars? Use hardcoded
custom rules or, better, Mail::SpamAssassin::Plugin::ReplaceTags rules.
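The idea behind that second approach is to expand each ambiguous letter into a character class that accepts the ASCII char or its lookalikes. The sketch below shows only that idea in Python. It is not ReplaceTags syntax, and the LOOKALIKES table is a hypothetical, deliberately tiny mapping built from the three characters found above:

```python
import re

# Hypothetical lookalike table: ASCII letter -> chars accepted in its place.
LOOKALIKES = {
    "i": "i\u0456",   # Latin i, Cyrillic U+0456
    "B": "B\u0392",   # Latin B, Greek Beta U+0392
    "P": "P\u03a1",   # Latin P, Greek Rho U+03A1
}


def lookalike_pattern(phrase):
    """Compile a regex matching phrase with any listed lookalike swapped in."""
    parts = []
    for ch in phrase:
        if ch in LOOKALIKES:
            parts.append("[" + LOOKALIKES[ch] + "]")
        else:
            parts.append(re.escape(ch))
    return re.compile("".join(parts))


if __name__ == "__main__":
    rule = lookalike_pattern("Direct Buy")
    print(bool(rule.search("From: Direct Buy <x@example.com>")))
    print(bool(rule.search("From: D\u0456rect \u0392uy <x@example.com>")))
```

A single pattern then covers the plain ASCII spelling and the mixed-script variants, so there is no need to copy each obfuscated form from a sample message.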
Matching *any* UTF-8 non-ASCII char? Not a good idea. (I know you never
would, just for completeness in this post.)
Post by Kevin A. McGrail
As of yet, I'm not using normalize_charset and researching what hits
things the best. Most of these still look REALLY spammy from a pathway
analysis though.
I have never used normalize_charset myself. But from a glance at the docs,
"detect character sets and normalize message content to Unicode."
it appears that the option only makes sense for non-ASCII content that
is NOT UTF-8 encoded, so that UTF-8 encoded rules can still match it.
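To illustrate that reading of the docs (and only that; this is not how SA implements the option), here is a small Python sketch. It assumes a message body that arrived as ISO-8859-7 and a rule pattern stored as UTF-8 bytes; normalize_to_utf8 is a made-up helper name:

```python
import re

# A rule pattern stored as UTF-8 bytes: Greek Beta (U+0392) followed by "uy".
RULE = re.compile("\u0392uy".encode("utf-8"))


def normalize_to_utf8(raw, declared_charset):
    """Transcode raw bytes from the declared charset to UTF-8."""
    return raw.decode(declared_charset, errors="replace").encode("utf-8")


# The same visible text, encoded in a legacy Greek charset.
LEGACY_BODY = "Direct \u0392uy".encode("iso-8859-7")

if __name__ == "__main__":
    # Raw legacy bytes: the UTF-8 pattern cannot match.
    print(RULE.search(LEGACY_BODY) is not None)
    # After transcoding to UTF-8 the same pattern matches.
    print(RULE.search(normalize_to_utf8(LEGACY_BODY, "iso-8859-7")) is not None)
```

Content that is already UTF-8 passes through such a step unchanged, which is why the option would add nothing for the sample headers in this thread.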
--
char *t="\10pse\0r\0dtu\***@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}