The Mudcat Café™
Thread #135626   Message #3095388
Posted By: Artful Codger
14-Feb-11 - 07:38 PM
Thread Name: Tech: Non-ASCII character display problems
Subject: RE: Tech: Non-ASCII character display problems
All of this holds true with a UTF-16 encoding as well, so the superiority of using UTF-8 for the web page itself has still not been demonstrated. In fact, it may still require an extra step to convert the UTF-8 input to UTF-16, which the ColdFusion scripts can handle directly. This is an unnecessary complication.
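To illustrate the extra hop (a minimal Python sketch, not the site's ColdFusion code; the byte string and encodings here are just assumptions for demonstration):

    raw = b'caf\xc3\xa9'              # "café" as UTF-8 bytes from a form submission
    text = raw.decode('utf-8')        # the extra step: decode the UTF-8 first
    utf16 = text.encode('utf-16-le')  # re-encode into the UTF-16 the scripts consume

With UTF-16 input, that first decode/re-encode pass would not be needed.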

If input remains handled on the thread display page, the argument for UTF-8 is both stronger and weaker. It would allow legacy messages to be included unconverted (the character size is still 8-bit), even though high-bit characters would largely get blitzed. (Note that the arbitrary byte sequences introduced by high-bit characters violate the UTF-8 encoding scheme--not every byte sequence is valid--so you're just trusting that browsers will behave benignly.) It would probably still allow users to switch the view encoding to see what was originally meant. Sadly, the number of posts with wonky displays would skyrocket unless emulation were used along the lines I suggested. In that case, emulation would also be necessary to view the message with a different source encoding applied, and then using UTF-8 rather than UTF-16 is a distinct liability (two conversions required instead of one--one must go through UTF-16 to produce either the equivalent UTF-8 or character references).
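To make the invalid-sequence point concrete (a Python illustration, assuming a Latin-1 legacy message; not code from the site):

    legacy = 'café'.encode('latin-1')   # b'caf\xe9' -- a legacy 8-bit message
    legacy.decode('latin-1')            # 'café' -- the right legacy charset recovers it
    legacy.decode('utf-8')              # raises UnicodeDecodeError: 0xE9 opens a
                                        # multi-byte sequence that never completes

A UTF-8 consumer has no guaranteed behaviour for that stray 0xE9 byte--replacing it, dropping it, or rejecting the whole message are all plausible outcomes.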

Storage is a separate issue entirely. If (for new posts) everything is converted to character references, it conforms to byte-oriented storage, as is probably used now. But you're much more likely to exceed length boundaries, particularly for title information. Storing raw UTF-8 would reduce this risk in most cases, but then you have to know that you stored UTF-8 rather than some legacy 8-bit encoding, and I'm not sure how this would impact DB searching. Even character references have an impact on searches: does the DB understand them? If not, how can it normalize the different forms they may take (mnemonic or numeric? zero-padded or not?) for comparison? That problem most likely exists at present, but the relative disuse of escapes, compared with raw ISO-Latin-1 text, masks it. In any case, most people search only by ASCII words, where char refs are a non-issue. And if users no longer had to encode text themselves, most char refs encountered in new posts would match one another, being produced consistently by the system.
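For instance, all of the following spell the same character, so a search layer would have to fold them together somehow (a Python sketch using html.unescape just to show the folding; whether the DB could do anything similar is the open question):

    from html import unescape
    # one character, four spellings -- and each escape is also far longer
    # than the single byte it replaces, hence the length-boundary worry
    forms = ['&eacute;', '&#233;', '&#0233;', '&#xE9;']
    assert {unescape(f) for f in forms} == {'\u00e9'}   # all four are 'é'
    print([len(f) for f in forms])                      # [8, 6, 7, 6] vs. 1 byte raw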