Mudcat Café message #2373610

The Mudcat Café™
Thread #112265 Message #2373610
Posted By: Artful Codger
24-Jun-08 - 06:49 PM
Thread Name: Tech: htmlesc.py: Mac script to escape text
Subject: RE: Tech: htmlesc.py: Mac script to escape text

Also, if you want me to respond to any questions or comments on the script, please send me a private message, since I won't necessarily be monitoring this thread.

treewind: Actually, the encoding is UTF-16, not UTF-8. UTF-8 is an 8-bit encoding which uses 1 to 4 bytes per Unicode code point. (I speak of code points because a single character may actually be a composite of several code points, and there are code points which specify relative positioning, display order and the like.) Translating between UTF-8 byte sequences and Unicode code point values involves a lot of bit-twiddling, whereas with UTF-16 you can just combine byte pairs into an integer value (shifting one or the other up 8 bits) and you have the Unicode code point. Because of this, it's quite easy to examine a hexadecimal dump of a UTF-16 file and determine exactly what the encoded characters are. For characters with values above 0xFFFF you get into UTF-16 surrogate pairs, which are not as straightforward, but that is how characters need to be encoded for Mudcat, since you don't have the ability to change the page encoding for just the text you're supplying.
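For anyone who wants to see the byte-pair arithmetic spelled out, here is a rough sketch in Python. It is only an illustration, not code from htmlesc.py; the function name utf16_bytes_to_code_points is made up for this example, and it assumes the input is raw UTF-16 with no BOM.

```python
def utf16_bytes_to_code_points(data, big_endian=True):
    """Turn raw UTF-16 bytes (no BOM) into a list of code point values.

    Hypothetical helper for illustration only.
    """
    order = "big" if big_endian else "little"
    # Combine each byte pair into one 16-bit unit.
    units = [int.from_bytes(data[i:i + 2], order)
             for i in range(0, len(data), 2)]

    code_points = []
    i = 0
    while i < len(units):
        unit = units[i]
        is_high = 0xD800 <= unit <= 0xDBFF
        has_low = i + 1 < len(units) and 0xDC00 <= units[i + 1] <= 0xDFFF
        if is_high and has_low:
            # Surrogate pair: 10 bits from each half, offset by 0x10000.
            low = units[i + 1]
            code_points.append(
                0x10000 + ((unit - 0xD800) << 10) + (low - 0xDC00))
            i += 2
        else:
            # Below 0x10000 the 16-bit unit is the code point itself.
            code_points.append(unit)
            i += 1
    return code_points


sample = "\u00e9\U0001d11e".encode("utf-16-be")   # e-acute, then a G clef
print([hex(cp) for cp in utf16_bytes_to_code_points(sample)])
# ['0xe9', '0x1d11e']
```

The first character takes one byte pair (00 E9); the second is above 0xFFFF, so it takes two pairs (D8 34 DD 1E) that have to be recombined.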

UTF-8 is the standard for certain byte-stream protocols, like e-mail, and the Mac filesystem uses oddly-normalized UTF-8 for its internal representation. That may be where you got the impression UTF-8 is best. But for text files, UTF-16 is preferable to UTF-8, because (bear with me here)...

UTF-16 files are either stored in the native byte order of the operating system, or with a byte ordering specified by an explicit code point at the start of the file, called a Byte Order Mark (BOM). When an editor encounters a UTF-16 BOM, it knows immediately it's dealing with a UTF-16 file, and whether the bytes are stored in big-endian or little-endian order. It also silently strips the BOM from the start of the text. There are very few programs that don't know how to deal with UTF-16 BOMs.
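As a rough sketch of what an editor does with a UTF-16 BOM, the following Python checks the first two bytes and strips them. The function name sniff_utf16_bom is invented for this example; the BOM constants come from Python's standard codecs module. It deliberately ignores UTF-32, whose little-endian BOM starts with the same two bytes.

```python
import codecs


def sniff_utf16_bom(data):
    """Return (codec_name, remaining_bytes) based on a leading UTF-16 BOM.

    Hypothetical helper; returns (None, data) when no UTF-16 BOM is present.
    """
    if data.startswith(codecs.BOM_UTF16_BE):    # bytes FE FF
        return "utf-16-be", data[2:]
    if data.startswith(codecs.BOM_UTF16_LE):    # bytes FF FE
        return "utf-16-le", data[2:]
    return None, data


raw = codecs.BOM_UTF16_LE + "Mudcat".encode("utf-16-le")
codec_name, body = sniff_utf16_bom(raw)
text = body.decode(codec_name) if codec_name else None
```

Python's generic "utf-16" codec does the same thing on its own: decoding with it reads the BOM, picks the byte order and drops the BOM from the result.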

There is a similar BOM for UTF-8. It's actually the same code point as for UTF-16 (U+FEFF), which, encoded in UTF-8, always comes out as the same distinctive three-byte sequence (EF BB BF), since UTF-8 has no byte-order variants. HOWEVER, it is not common for programs to prepend the BOM to UTF-8 files, since UTF-8 is touted as "byte-order independent". It was designed to serialize Unicode through byte streams, where the byte order is invariant. Consequently, when programs (including most editors) do encounter a UTF-8 BOM, they usually don't recognize it as such, and instead treat it as raw text! This can cause programs (like compilers) to report syntax errors and reject your file.
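To make the UTF-8 BOM problem concrete, here is a small Python sketch. The variable names and sample text are made up for illustration. It shows the bytes the BOM code point (U+FEFF) turns into, and the difference between a decoder that treats the BOM as text ("utf-8") and one that knows to strip it ("utf-8-sig", a standard Python codec).

```python
import codecs

# The BOM code point encoded as UTF-8: bytes EF BB BF.
bom_bytes = "\ufeff".encode("utf-8")

# A file that some editor saved with a UTF-8 BOM in front.
file_bytes = codecs.BOM_UTF8 + "x = 1\n".encode("utf-8")

# A plain UTF-8 decode keeps the BOM as an invisible first character,
# which is what trips up tools that are not expecting it.
naive = file_bytes.decode("utf-8")

# The BOM-aware codec strips it (and also accepts files with no BOM).
clean = file_bytes.decode("utf-8-sig")

print(bom_bytes.hex())          # efbbbf
print(naive.startswith("x"))    # False
print(clean.startswith("x"))    # True
```

If you have to feed such a file to a program that chokes on the BOM, decoding with "utf-8-sig" and re-encoding with plain "utf-8" is one way to strip it.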

More information than you want, probably, but there it is...