The Mudcat Café TM
Thread #135626   Message #3094640
Posted By: Artful Codger
13-Feb-11 - 05:57 PM
Thread Name: Tech: Non-ASCII character display problems
Subject: RE: Tech: Non-ASCII character display problems
Ahem, Mudcat should escape all chars above 127, i.e. all chars not in the ASCII (7-bit) set. (If you're going to quibble, quibble about something of import.)
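
For what it's worth, here is a minimal sketch of that escaping rule in Python. I have no idea what Mudcat actually runs on, so the function name `escape_non_ascii` is my own invention for illustration; the only real machinery is Python's standard `xmlcharrefreplace` error handler.

```python
def escape_non_ascii(text: str) -> str:
    """Replace every character above code point 127 with a decimal
    numeric character reference; 7-bit ASCII passes through untouched."""
    # Hypothetical helper name; the "xmlcharrefreplace" handler is standard Python.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(escape_non_ascii("Café Ó Riada"))  # Caf&#233; &#211; Riada
```

Note that this does not escape `&`, `<` or `>`; those would need separate HTML escaping.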

Note that with a UTF-8 encoding specified for the web page, Mudcat would likely receive the input as UTF-8 multi-byte sequences rather than as straight Unicode values or character references (as I suspect happens now). So for pages lacking a UTF-8 header (like all the pages we have as of this writing), Mudcat would first have to decode the input to Unicode code points before generating character references. Scripting languages provide conversion filters that can easily be attached to streams to handle such encoding conversions (since 16-bit Unicode chars have become the internal lingua franca for most), so it's not a big task to implement; it just needs to be done. Of course, no conversion would be necessary to store new messages as UTF-8.
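
To show the kind of stream filter I mean, here is a rough Python sketch, assuming the submitted form data arrives as a byte stream in UTF-8. The function name `utf8_stream_to_references` and the sample input are made up for illustration; `io.TextIOWrapper` is the standard-library wrapper that does the decoding.

```python
import io


def utf8_stream_to_references(byte_stream) -> str:
    """Decode a UTF-8 byte stream to Unicode text, then emit pure ASCII
    with numeric character references for everything above 127."""
    # Attach a decoding filter to the stream (hypothetical input source).
    reader = io.TextIOWrapper(byte_stream, encoding="utf-8", newline="")
    text = reader.read()
    # Same escaping rule as for any other non-ASCII character.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


raw = io.BytesIO("Seán Ó Sé".encode("utf-8"))
print(utf8_stream_to_references(raw))
```

A character outside the 16-bit range (a musical clef, say) comes out as a single reference to its full code point, so nothing special is needed for those.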

I'm assuming that Mudcat doesn't store threads as actual web pages, but rather constructs them on the fly from DB information. So maybe we should be talking about adding encoding attributes to the messages themselves. As I see it, you'd only need three attribute values: unknown 8-bit (ISO-8859-1 assumed, but iffy), UTF-8 and ASCII (wholly 7-bit, with or without character escapes). A sweep through the DB could type the existing messages either as ASCII or as unknown (if they had any 8-bit chars), and a revised input system would ensure that new messages were always properly encoded either as ASCII or UTF-8. Mudcat might still have to create pages without a page encoding specified (if a thread contained a message with unknown encoding) so that viewers and moderators could change the view encoding. In this case, it would have to convert high-bit chars and multi-byte UTF-8 sequences to references on the fly, and resolve the current input problems somehow.
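
A sketch of the three-valued attribute and the DB sweep, again in Python and again with invented names (`MessageEncoding`, `classify_existing`, `encode_new`); I'm assuming stored messages can be read back as raw bytes, which may not match Mudcat's actual schema.

```python
from enum import Enum


class MessageEncoding(Enum):
    # The three attribute values proposed above (names are hypothetical).
    ASCII = "ascii"
    UTF8 = "utf-8"
    UNKNOWN_8BIT = "unknown-8bit"


def classify_existing(raw: bytes) -> MessageEncoding:
    """Sweep rule: wholly 7-bit messages are ASCII; any message with a
    high-bit byte is typed as unknown 8-bit, since we can't be sure."""
    return MessageEncoding.ASCII if raw.isascii() else MessageEncoding.UNKNOWN_8BIT


def encode_new(text: str) -> tuple[MessageEncoding, bytes]:
    """Revised input rule: new messages are always stored as ASCII or UTF-8."""
    if text.isascii():
        return MessageEncoding.ASCII, text.encode("ascii")
    return MessageEncoding.UTF8, text.encode("utf-8")


print(classify_existing(b"plain old text"))
print(classify_existing(b"Caf\xe9"))
print(encode_new("Café"))
```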

Alternatively, since questionable messages would already be tagged, they could be displayed in a UTF-8 encoded page with ISO-8859-1 encoding emulated; a selector would let users select another emulation encoding. Moderators and possibly the original posters (if registered) would be able to apply the selected encoding permanently (as a separate operation), converting the post to UTF-8 or ASCII-with-escapes and obviating the need to translate the message in the future. (The operation would not be reversible unless, instead of converting to UTF-8/ASCII, the selected encoding was stored as the encoding type. This might be allowed for a provisional period of time until a later sweep made permanent conversions.)
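
Roughly what I have in mind by "emulation", sketched in Python. The names `view_unknown` and `make_permanent` are hypothetical, and I'm assuming the tagged message is available as its original raw bytes.

```python
def view_unknown(raw: bytes, assumed: str = "iso-8859-1") -> str:
    """Interpret an unknown-encoding message under a selectable encoding,
    for display in a UTF-8 page. Undecodable bytes become U+FFFD."""
    return raw.decode(assumed, errors="replace")


def make_permanent(raw: bytes, chosen: str) -> bytes:
    """Moderator action: convert the stored bytes to UTF-8 for good.
    Not reversible unless the original bytes are kept somewhere."""
    return raw.decode(chosen).encode("utf-8")


raw = b"\x93Caf\xe9\x94"  # curly quotes and e-acute as sent by a Windows client
print(repr(view_unknown(raw)))            # default: the quotes show up as control chars
print(repr(view_unknown(raw, "cp1252")))  # user picks Windows-1252 in the selector
print(make_permanent(raw, "cp1252"))
```

The default ISO-8859-1 view never fails, since every byte maps to something, but as the example shows it can map to the wrong thing, which is why the selector is needed.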

Even better, make the web pages UTF-16 and decode all the source material. This would be more efficient for both the browser and the Mudcat input processing to handle. Messages could still be stored as UTF-8 or escaped ASCII to reduce the storage space needed. And there would be no need to spawn a separate page for input (though that still sounds like a nice option).
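
On the storage point, here is a quick Python comparison of how many bytes one sample string takes in each form. This is only an illustration on a single made-up string (the helper name `stored_sizes` is mine), not a benchmark; the proportions depend entirely on how much non-ASCII text a message contains.

```python
def stored_sizes(text: str) -> dict[str, int]:
    """Byte counts for one message under each storage form discussed."""
    escaped = text.encode("ascii", "xmlcharrefreplace")
    return {
        "utf-8": len(text.encode("utf-8")),
        "escaped-ascii": len(escaped),
        # "utf-16-le" avoids counting the 2-byte byte-order mark.
        "utf-16": len(text.encode("utf-16-le")),
    }


print(stored_sizes("Café"))  # {'utf-8': 5, 'escaped-ascii': 9, 'utf-16': 8}
```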