The Mudcat Café TM
Thread #112265   Message #2374653
Posted By: Artful Codger
26-Jun-08 - 06:43 AM
Thread Name: Tech: htmlesc.py: Mac script to escape text
Subject: RE: Tech: htmlesc.py: Mac script to escape text
Lack of a BOM is a decided disadvantage. Without it, UTF-8 is indistinguishable from any of the myriad legacy 8-bit character sets. And, sadly, with it, most programs will do the wrong thing (unlike with the UTF-16 BOM).
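
To make the ambiguity concrete, here is a rough Python sketch. The sample word and the two legacy character sets (Latin-1 and Mac-Roman) are just my picks for illustration. The same BOM-less bytes decode without any error under all three interpretations, so nothing in the file itself says which reading is the right one.

```python
# Assumption: the sample text and the two legacy code pages are illustrative only.
text = "café"
raw = text.encode("utf-8")  # UTF-8 bytes, no BOM

# Every one of these decodes succeeds; only one gives back the original text.
as_utf8 = raw.decode("utf-8")
as_latin1 = raw.decode("latin-1")
as_macroman = raw.decode("mac_roman")

# ascii() is used so the output is safe on any terminal encoding.
for name, decoded in [("utf-8", as_utf8),
                      ("latin-1", as_latin1),
                      ("mac_roman", as_macroman)]:
    print(name, ascii(decoded), decoded == text)
```

A program could guess by checking whether the bytes happen to be well-formed UTF-8, but that is a heuristic, not a declaration.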

That single BOM character (U+FEFF) does double duty: for UTF-16 and UTF-32 it tells a program which byte ordering it's dealing with, and because the character's byte sequence differs in each encoding, it also identifies whether the encoding is UTF-7, UTF-8, UTF-16 or UTF-32! That's quite a lot of information for such a little tag.
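
For the curious, a minimal BOM sniffer might look like the sketch below. The function name `sniff_bom` is my own invention; the BOM constants come from Python's standard `codecs` module. UTF-7 is left out to keep it short. Note that the UTF-32 checks have to come first, because the UTF-32 little-endian BOM begins with the same two bytes as the UTF-16 little-endian one.

```python
import codecs

# Order matters: the UTF-32-LE BOM (FF FE 00 00) starts with the UTF-16-LE BOM (FF FE).
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
]

def sniff_bom(data):
    """Return the encoding named by a leading BOM, or None if there is no BOM.

    Hypothetical helper for illustration; it does not handle UTF-7.
    """
    for bom, encoding in _BOMS:
        if data.startswith(bom):
            return encoding
    return None

print(sniff_bom(codecs.BOM_UTF8 + b"hello"))  # utf-8
print(sniff_bom(b"hello"))                    # None
```

One known wrinkle: a UTF-16 little-endian file whose first character after the BOM is U+0000 would be mistaken for UTF-32 by this check.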

UTF-8 makes files smaller only for the Western languages. For the Asian languages, where the need for Unicode is immensely greater, UTF-8 is more bloated than UTF-16: most CJK characters take three bytes in UTF-8 but only two in UTF-16. UTF-8 must also be converted into UTF-16 or UTF-32 in order to process it at all efficiently, and the conversion involves a lot of bit twiddling. Anyway, in these days of multi-megabyte software updates pushed over the Internet, plain text file size is a complete non-issue.
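
The size difference is easy to check for yourself. In this sketch the two sample strings are my own choices, and I encode with "utf-16-le" so that no BOM is counted in the totals.

```python
# Assumption: both sample strings are illustrative; any ASCII or CJK text behaves similarly.
samples = {
    "English": "Hello, world",
    "Japanese": "日本語のテキスト",
}

sizes = {}
for label, text in samples.items():
    utf8_len = len(text.encode("utf-8"))
    utf16_len = len(text.encode("utf-16-le"))  # explicit byte order, so no BOM is added
    sizes[label] = (len(text), utf8_len, utf16_len)
    print(label, "chars:", len(text), "UTF-8 bytes:", utf8_len, "UTF-16 bytes:", utf16_len)
```

The English sample doubles in size going to UTF-16, while the Japanese sample is half again as large in UTF-8 as in UTF-16.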

Yes, 7-bit ASCII is a subset of the UTF-8 encoding, but so what? In terms of code point values, every Unicode encoding has ASCII as a subset. And the whole idea of Unicode is to transcend the woeful limitations of ASCII and the mess produced by incompatible legacy character sets.
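
Here is a small sketch of the distinction between bytes and code point values. The sample string is arbitrary. Only UTF-8 reproduces the ASCII bytes, but every Unicode encoding round-trips to the same code point values.

```python
# Assumption: the sample string is arbitrary; any pure-ASCII text works the same way.
text = "Plain ASCII"
ascii_bytes = text.encode("ascii")

# At the byte level, only UTF-8 matches ASCII.
same_bytes_utf8 = text.encode("utf-8") == ascii_bytes
same_bytes_utf16 = text.encode("utf-16-le") == ascii_bytes
print(same_bytes_utf8)   # True
print(same_bytes_utf16)  # False

# At the code point level, all the encodings agree with ASCII.
code_points = [ord(ch) for ch in text]
round_trips = {}
for encoding in ("utf-8", "utf-16-le", "utf-32-le"):
    decoded = text.encode(encoding).decode(encoding)
    round_trips[encoding] = [ord(ch) for ch in decoded] == code_points
print(round_trips)
```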

No, Mac OS X hasn't made any Unicode encoding the standard for plain-text files. The default in most editors, including TextEdit, is to interpret text files using a locale-dependent character set like Mac-Roman, no doubt for backwards-compatibility. Oddly, TextEdit won't let you change the encoding to be Unicode, nor will it let you save a new file as plain text, only as RTF. Mac config files are mostly XML; I'm assuming they're actually UTF-8 encoded, but of course mine look like plain ASCII, so I can't tell.