Reading through your reply, I see we need to start with the basics.
You are mixing up different types of encoding and not quite grasping
the difference between a character and a byte.
Character Encoding
ASCII is a fixed width character encoding: a simple lookup table, where
each char corresponds directly to one numeric value. Strictly speaking
ASCII is 7 bit (128 values), but in practice a char occupies exactly
1 byte, and the single-byte extensions (ISO-8859-1 and friends) use the
full 256 values.
Since it is a fixed width, single-byte encoding, there's an upper limit
on the number of different chars it can represent: 256. This becomes a
problem when you want to support more chars: regional Latin-based chars
like the German Umlauts, chars specific to French, Spanish, Norwegian,
etc. And Greek, Cyrillic, the Hebrew alphabet, Chinese and Japanese
characters...
That's where UTF-8 enters the picture. It's a variable length charset
encoding. For backward compatibility, the first 128 values (7 bit) are
identical to ASCII, covering the common Latin chars. The 8th bit is
used to extend the number of characters available by extending the
bit-length: the following byte becomes part of the same character.
In simple terms, there are 128 different byte values with the 8th bit
set. Each of these includes the following byte to form a 16 bit encoded
char. Since that second byte can hold 256 different bit-strings, we end
up with 128*256 characters 16 bit wide, plus the 128 characters 8 bit
wide. (The actual encoding is slightly different, and UTF-8 characters
can range from 1 up to 4 bytes in length.)
Character Encoding is usually transparent, not visible to the user. The
Chinese(?) chars in this thread are an example. The Umlaut in my name is
another. You see a single character, which internally is represented by
2 or more bytes.
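You can make those bytes visible on the command line. A quick sketch,
assuming a UTF-8 terminal and the standard 'od' tool (the Umlaut takes
2 bytes, the Chinese char 3):

  $ echo -n "ä" | od -An -tx1
   c3 a4
  $ echo -n "环" | od -An -tx1
   e7 8e af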
String Encoding
Base64 and quoted-printable are string encodings. They are unaware of a
possible (variable length) charset encoding. They don't even care about
the original string at all. They work strictly on a byte basis, not
differentiating between "human readable text" in whatever charset and a
binary blob.
String encodings are commonly used to ensure there are no bytes with the
8th bit set, so the data survives transports (like mail headers) that
are limited to 7-bit US-ASCII.
In raw mail headers, encoded words look like "=?utf-8?B?...?=".
The "B" indicates Base64 string encoding of the "..." encoded text.
It is this string encoding that SA decodes by default for header rules.
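To get a feel for it, both directions of the string encoding can be
reproduced in a shell. Just a sketch; the actual Base64 output is
omitted here, and the payload in the second command is a placeholder:

  # Encode the raw UTF-8 bytes the way a MUA does when it builds an
  # =?utf-8?B?...?= encoded word.
  $ echo -n "《环球旅讯》原创:在线旅游" | base64

  # Decode the "..." payload of such an encoded word back to raw bytes.
  $ echo -n "PASTE_THE_B64_PAYLOAD_HERE" | base64 -d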
Post by Alex
Post by Karsten Bräckelmann
I assume your actual problem is with the SUB_VERYLONG rule hitting.
Since the above test rule shows the complete decoded Subject, we can
tell it's 13 chars long, clearly below the "verylong" threshold of 20
chars.
Visually counting the individual characters, including the colon, is
indeed 13. However, there are spaces, which should have negated the
rule (\S), no? Also, wc shows me the string is 41 chars long.
Correct, 13 characters. There are no spaces, though.
The string is 39 *bytes* long. (You included some line endings.) 'wc'
historically defaults to bytes, but supports chars, too.
$ echo -n "《环球旅讯》原创:在线旅游" | wc --chars --bytes
13 39
Post by Alex
Post by Karsten Bräckelmann
That is not caused by the encoding, though, but because the regex
operates on bytes rather than characters.
Is not each character exactly one byte? Or are you referring to the
fact that it takes untold multiple bytes to produce one encoded
character?
With UTF-8 charset encoding, a character may be up to 4 bytes long.
Doing the math (39 bytes for 13 chars), it is quite obvious the
Chinese(?) chars each take 3 bytes.
(The "encoding" I refered to in that quote is the Base64 string encoding
blowing up the length of the encoded string.)
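Since SA rules are Perl regexes, the byte vs character distinction is
easy to demonstrate with a one-liner. A sketch, assuming a UTF-8
terminal; the utf8 pragma makes Perl treat the literal as characters,
which is roughly what normalize_charset arranges inside SA:

  $ perl -e 'print length("《环球旅讯》原创:在线旅游"), "\n"'
  39
  $ perl -Mutf8 -e 'print length("《环球旅讯》原创:在线旅游"), "\n"'
  13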
Post by Alex
Post by Karsten Bräckelmann
To make the regex matching aware of UTF-8 encoding, and match chars
instead of (raw) bytes, we will need the normalize_charset option.
header TEST Subject =~ /^.{10}/
normalize_charset 1
Why wouldn't it make sense for this to be the default? What is the
utility in trying to match on an encoded string?
I think I'm also confused by your reference above that header rules
are matched against the decoded string. What then would be the purpose
of normalize_charset here? Does normalize here mean to decode it?
It probably makes sense to enable normalize_charset by default. It is
disabled for historical reasons, and due to possible
side-effects or increased resource usage (particularly a concern with
earlier Perl versions). This needs to be investigated.
It's worth pointing out that charset normalization is *not* a magical
UTF-8 support switch.
(a) Charset normalization does affect regex wildcard matching, changing
the meaning of /./ from one byte to one UTF-8 character. This is only
relevant with counted ranges like the .{10} above, as can be seen here.
Moreover, it does make a much more significant difference with e.g.
Chinese. The occasional 2 bytes a German Umlaut takes makes almost no
difference overall even in German text. (Think an arbitrary boundary of
200 chars in pure US-ASCII, or an arbitrary boundary of 195 chars in
UTF-8 encoded German text, both being 200 bytes.)
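The same one-liner style shows what (a) means for a counted range like
the 20 char "verylong" boundary discussed above (again, -Mutf8 merely
stands in for what normalize_charset does inside SA):

  $ perl -e 'print "《环球旅讯》原创:在线旅游" =~ /^.{20}/ ? "hit\n" : "miss\n"'
  hit
  $ perl -Mutf8 -e 'print "《环球旅讯》原创:在线旅游" =~ /^.{20}/ ? "hit\n" : "miss\n"'
  miss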
(b) Normalization is not needed for using UTF-8 encoded strings in
regex-based rules. You can almost freely write REs including multi-byte
UTF-8 chars [1]. While internally represented by more than one byte,
your editor or shell will show you a single character.
Try it with a test header rule directly matching one of those Chinese(?)
characters.
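Something along these lines should do; rule name and score are just
placeholders, and no normalize_charset is needed since both the rule
and the decoded header are plain UTF-8 byte sequences:

  header   TEST_CJK_LITERAL  Subject =~ /环球旅讯/
  score    TEST_CJK_LITERAL  0.001
  describe TEST_CJK_LITERAL  Subject contains the literal string 环球旅讯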
Post by Alex
Post by Karsten Bräckelmann
Along with yet another modification of the test rule, now matching the
first 10 chars only.
got hit: "《环球旅讯》原创:在"
The effect is clear. That 10 (chars) long match with normalize_charset
enabled is even longer than the above 20 (byte) match.
Okay, I think I understand. So I don't want to avoid scanning encoded
headers, [...]
Nit-picking, but in the spirit of the whole thread: decoded. ;)
Header rules are matched against decoded (readable) strings of encoded
(gibberish) headers in the raw mail.
[1] Validating that claim of "freely" before sending, and trying to
break it, I found there are caveats.
First of all, a multi-byte char cannot simply be made optional by a
trailing question mark. The multi-byte char needs to be enclosed in
(grouping) brackets for the optional quantifier to affect all of its
bytes.
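A sketch of that caveat, using the Umlaut from my name and made-up rule
names; without normalize_charset the regex engine sees raw bytes:

  # Intended to match "foo" with an optional trailing "ä". As raw bytes
  # the ? binds only to the last byte (\xa4) of the two-byte sequence
  # \xc3 \xa4, so the first byte stays mandatory and plain "foo" fails.
  header TEST_OPT_BROKEN  Subject =~ /fooä?/

  # Grouping makes the whole two-byte character optional, as intended.
  header TEST_OPT_FIXED   Subject =~ /foo(?:ä)?/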
Worse, enabling charset normalization completely breaks UTF-8 chars
in the regex. At least in my ad-hoc --cf command line testing.
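For reference, the kind of ad-hoc testing meant here is roughly this,
with msg.eml being a placeholder for a saved copy of the mail; the
--cf option injects single configuration lines:

  $ spamassassin --cf='normalize_charset 1' \
                 --cf='header TEST Subject =~ /^.{10}/' \
                 --test-mode < msg.eml | grep TEST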
--
char *t="\10pse\0r\0dtu\***@ghno\x4e\xc8\x79\xf4\xab\x51\x8a\x10\xf4\xf4\xc4";
main(){ char h,m=h=*t++,*x=t+2*h,c,i,l=*x,s=0; for (i=0;i<l;i++){ i%8? c<<=1:
(c=*++x); c&128 && (s+=h); if (!(h>>=1)||!t[s+h]){ putchar(t[s]);h=m;s=0; }}}