Reading through your reply, I see we need to start with the basics.
You are mixing up different types of encoding and not quite grasping
the difference between a character and a byte.
Character Encoding
ASCII is a fixed width character encoding: a simple lookup table, where
each char corresponds directly to one numeric value. Strictly speaking
ASCII is 7 bit (128 values), but in practice a char occupies exactly
1 byte, and the single-byte extensions (ISO-8859-1 and friends) use the
full 256 values.
Since it is a fixed width, single-byte encoding, there's an upper limit
on the number of different chars it can represent: 256. This becomes a
problem when you want to support more chars: regional Latin-based chars
like the German Umlauts, chars specific to French, Spanish, Norwegian,
etc. And Greek, Cyrillic, the Hebrew alphabet, Chinese and Japanese
characters...
That's where UTF-8 enters the picture. It's a variable length charset
encoding. For backward compatibility, the first 128 values (7 bit) are
identical to ASCII, covering the common Latin chars. The 8th bit is
used to extend the number of characters available by extending the
bit-length: the following byte becomes part of the same character.
In simple terms, there are 128 different byte values with the 8th bit
set. Each of these includes the following byte to form a 16 bit encoded
char. Since that second byte can hold 256 different bit-strings, we end
up with 128*256 characters 16 bit wide, plus the 128 characters 8 bit
wide. (The actual encoding is slightly different, and UTF-8 characters
can range from 1 up to 4 bytes in length.)
Character Encoding is usually transparent, not visible to the user. The
Chinese(?) chars in this thread are an example. The Umlaut in my name is
another. You see a single character, which internally is represented by
2 or more bytes.
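You can make those bytes visible on the command line. A quick sketch,
assuming a UTF-8 terminal and the standard 'od' tool (the Umlaut takes
2 bytes, the Chinese char 3):

  $ echo -n "ä" | od -An -tx1
   c3 a4
  $ echo -n "环" | od -An -tx1
   e7 8e af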
String Encoding
Base64 and quoted-printable are string encodings. They are unaware of a
possible (variable length) charset encoding. They don't even care about
the original string at all. They work strictly on a byte basis, not
differentiating between "human readable text" in whatever charset and a
binary blob.
String encodings are commonly used to ensure there are no bytes with the
8th bit set, so the data survives transports (like mail headers) that
are limited to 7-bit US-ASCII.
In raw mail headers, encoded words look like "=?utf-8?B?...?=".
The "B" indicates Base64 string encoding of the "..." encoded text.
It is this string encoding that SA decodes by default for header rules.
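To get a feel for it, both directions of the string encoding can be
reproduced in a shell. Just a sketch; the actual Base64 output is
omitted here, and the payload in the second command is a placeholder:

  # Encode the raw UTF-8 bytes the way a MUA does when it builds an
  # =?utf-8?B?...?= encoded word.
  $ echo -n "《环球旅讯》原创:在线旅游" | base64

  # Decode the "..." payload of such an encoded word back to raw bytes.
  $ echo -n "PASTE_THE_B64_PAYLOAD_HERE" | base64 -d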
Post by Alex
Post by Karsten Bräckelmann
I assume your actual problem is with the SUB_VERYLONG rule hitting.
Since the above test rule shows the complete decoded Subject, we can
tell it's 13 chars long, clearly below the "verylong" threshold of 20
chars.
Visually counting the individual characters, including the colon, is
indeed 13. However, there are spaces, which should have negated the
rule (\S), no? Also, wc shows me the string is 41 chars long.
Correct, 13 characters. There are no spaces, though.
The string is 39 *bytes* long. (You included some line endings.) 'wc'
historically defaults to bytes, but supports chars, too.
$ echo -n "《环球旅讯》原创:在线旅游" | wc --chars --bytes
13 39
Post by Alex
Post by Karsten Bräckelmann
That is not caused by the encoding, though, but because the regex
operates on bytes rather than characters.
Is not each character exactly one byte? Or are you referring to the
fact that it takes untold multiple bytes to produce one encoded
character?
With UTF-8 charset encoding, a character may be up to 4 bytes long.
Doing the math (39 bytes for 13 chars), it is quite obvious the
Chinese(?) chars each take 3 bytes.
(The "encoding" I refered to in that quote is the Base64 string encoding
blowing up the length of the encoded string.)
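Since SA rules are Perl regexes, the byte vs character distinction is
easy to demonstrate with a one-liner. A sketch, assuming a UTF-8
terminal; the utf8 pragma makes Perl treat the literal as characters,
which is roughly what normalize_charset arranges inside SA:

  $ perl -e 'print length("《环球旅讯》原创:在线旅游"), "\n"'
  39
  $ perl -Mutf8 -e 'print length("《环球旅讯》原创:在线旅游"), "\n"'
  13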
Post by Alex
Post by Karsten Bräckelmann
To make the regex matching aware of UTF-8 encoding, and match chars
instead of (raw) bytes, we will need the normalize_charset option.
header TEST Subject =~ /^.{10}/
normalize_charset 1
Why wouldn't it make sense for this to be the default? What is the
utility in trying to match on an encoded string?
I think I'm also confused by your reference above that header rules
are matched against the decoded string. What then would be the purpose
of normalize_charset here? Does normalize here mean to decode it?
It probably makes sense to enable normalize_charset by default. It is
disabled for historical reasons, and due to possible
side-effects or increased resource usage (particularly a concern with
earlier Perl versions). This needs to be investigated.
It's worth pointing out that charset normalization is *not* a magical
UTF-8 support switch.
(a) Charset normalization does affect regex wildcard matching, changing
the meaning of /./ from one byte to one UTF-8 character. This is only
relevant with counted ranges like the .{10} above, as can be seen here.
Moreover, it does make a much more significant difference with e.g.
Chinese. The occasional 2 bytes a German Umlaut takes makes almost no
difference overall even in German text. (Think an arbitrary boundary of
200 chars in pure US-ASCII, or an arbitrary boundary of 195 chars in
UTF-8 encoded German text, both being 200 bytes.)
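The same one-liner style shows what (a) means for a counted range like
the 20 char "verylong" boundary discussed above (again, -Mutf8 merely
stands in for what normalize_charset does inside SA):

  $ perl -e 'print "《环球旅讯》原创:在线旅游" =~ /^.{20}/ ? "hit\n" : "miss\n"'
  hit
  $ perl -Mutf8 -e 'print "《环球旅讯》原创:在线旅游" =~ /^.{20}/ ? "hit\n" : "miss\n"'
  miss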
(b) Normalization is not needed for using UTF-8 encoded strings in
regex-based rules. You can almost freely write REs including multi-byte
UTF-8 chars [1]. While internally represented by more than one byte,
your editor or shell will show you a single character.
Try it with a test header rule directly matching one of those Chinese(?)
characters.
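Something along these lines should do; rule name and score are just
placeholders, and no normalize_charset is needed since both the rule
and the decoded header are plain UTF-8 byte sequences:

  header   TEST_CJK_LITERAL  Subject =~ /环球旅讯/
  score    TEST_CJK_LITERAL  0.001
  describe TEST_CJK_LITERAL  Subject contains the literal string 环球旅讯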
Post by Alex
Post by Karsten Bräckelmann
Along with yet another modification of the test rule, now matching the
first 10 chars only.
got hit: "《环球旅讯》原创:在"
The effect is clear. That 10 (chars) long match with normalize_charset
enabled is even longer than the above 20 (byte) match.
Okay, I think I understand. So I don't want to avoid scanning encoded
headers, [...]
Nit-picking, but in the spirit of the whole thread: decoded. ;)
Header rules are matched against decoded (readable) strings of encoded
(gibberish) headers in the raw mail.
[1] Validating that claim of "freely" before sending, and trying to
break it, I found there are caveats.
First of all, a multi-byte char cannot simply be made optional by a
trailing question mark. The multi-byte char needs to be enclosed in
(grouping) brackets for the optional quantifier to affect all of its
bytes.
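A sketch of that caveat, using the Umlaut from my name and made-up rule
names; without normalize_charset the regex engine sees raw bytes:

  # Intended to match "foo" with an optional trailing "ä". As raw bytes
  # the ? binds only to the last byte (\xa4) of the two-byte sequence
  # \xc3 \xa4, so the first byte stays mandatory and plain "foo" fails.
  header TEST_OPT_BROKEN  Subject =~ /fooä?/

  # Grouping makes the whole two-byte character optional, as intended.
  header TEST_OPT_FIXED   Subject =~ /foo(?:ä)?/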
Worse, enabling charset normalization completely breaks UTF-8 chars
in the regex. At least in my ad-hoc --cf command line testing.
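For reference, the kind of ad-hoc testing meant here is roughly this,
with msg.eml being a placeholder for a saved copy of the mail; the
--cf option injects single configuration lines:

  $ spamassassin --cf='normalize_charset 1' \
                 --cf='header TEST Subject =~ /^.{10}/' \
                 --test-mode < msg.eml | grep TEST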
--
char *t="\10pse\0r\0dtu\***@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}