The Mudcat Café TM
Thread #135626   Message #3093659
Posted By: Artful Codger
11-Feb-11 - 09:34 PM
Thread Name: Tech: Non-ASCII character display problems
Subject: Tech: Non-ASCII character display problems
This thread is intended to discuss and propose solutions for a common problem here at Mudcat: Why doesn't the text I post appear to all other users the way I posted it? This thread is aimed at techies, so geek-speak is encouraged—the more precise we can be, the better.

This problem has been discussed many times before, mostly as off-topic digressions in other threads. (Feel welcome to post links to noteworthy threads.) I'm hoping this thread will consolidate the discussions and proposals, and help Max to implement a robust solution that will play nicely with all the old, improperly encoded pages as well.

In the meantime, I strongly urge posters to use HTML character references for all non-ASCII characters (and literal &, < and > characters)—see the guide "Entering special characters." Scripts and online converters ease the burden somewhat.
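To make the recommendation concrete, here's a minimal sketch (Python, purely for illustration; this is not a Mudcat tool, just the sort of conversion the scripts and online converters perform) that escapes literal &amp;, &lt;, &gt; and everything outside ASCII:

```python
def to_char_refs(text: str) -> str:
    """Convert &, <, > and all non-ASCII characters to HTML character references."""
    out = []
    for ch in text:
        cp = ord(ch)
        if ch == "&":
            out.append("&amp;")
        elif ch == "<":
            out.append("&lt;")
        elif ch == ">":
            out.append("&gt;")
        elif cp > 0x7F:
            # Numeric reference based on the Unicode code point
            out.append(f"&#{cp};")
        else:
            out.append(ch)
    return "".join(out)

print(to_char_refs("Café & <b>"))  # Caf&#233; &amp; &lt;b&gt;
```

Text escaped this way survives any page-encoding confusion, because every character it contains is plain 7-bit ASCII.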


To get things (re)started, let me summarize what the problem appears to be, based on my experimentation:

Mudcat currently leaves the character encoding of its web pages undeclared. Consequently, the interpretation of any characters outside the (7-bit) ASCII range is left to the browser, which normally applies the poster's or viewer's default or currently selected codepage setting (typically the same as for their OS and locale).

Internally, the browser probably converts all text to Unicode—you can post virtually any text to the message entry box, and it will show up properly there, regardless of the encoding used for your source. In short, the source encoding is immaterial (unless it was opened using the wrong file encoding, so that it displayed garbage characters to begin with).

But the poster's effective codepage is very material. When Mudcat retrieves the entered text (as a stream of values), the values returned for each character depend upon the poster's effective codepage. If the character is mapped within that codepage (regardless of its Unicode mapping), the codepage's value is returned. If it is outside, one of two things probably happens, both of which have the same result: either the browser hands Mudcat a stream of values that translate to an HTML character reference (on the presumption that the receiver can only handle 8-bit characters), or the browser hands Mudcat the Unicode value for that character, and Mudcat converts it into a character reference. Either way, the right thing is done.

The result is that characters in the ASCII range (a subset of most codepages) and characters outside the poster's codepage mapping are handled properly, but characters within the high-bit half of the poster's codepage mapping are left unencoded, and currently Mudcat has no clue what encoding was used to enter those characters. Nor does another user's browser; it can only interpret the high-bit characters according to the viewer's default/selected codepage. And that's where the inconsistencies occur. Without knowing the codepage, these characters can't be converted (by value) to character references because the codepage value may not match the Unicode value.
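A quick demonstration of that inconsistency (Python here is just for illustration): the single byte 0xE9, which one poster typed as "é" under Windows-1252, reads as a completely different letter under another viewer's codepage.

```python
# One poster's "é" under Windows-1252 is the single byte 0xE9 ...
raw = "é".encode("cp1252")
print(raw)                   # b'\xe9'

# ... but viewers on other codepages see other letters entirely:
print(raw.decode("cp1252"))  # é  (Western European viewer)
print(raw.decode("cp1251"))  # й  (Cyrillic viewer)
print(raw.decode("cp1253"))  # ι  (Greek viewer)
```

Without knowing which codepage produced the byte, there is no way to recover the intended character.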

It may be that Mudcat can somehow enquire (when the text is initially retrieved) what the poster's effective codepage is. The browser surely knows. If so, a solution at the time of input is simple: get the poster's encoding and use it to convert the value stream to Unicode; then convert all values above 0x7F to character references.
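If the poster's codepage can be obtained, the input-time conversion just described might look like this (a Python sketch; `poster_codepage` stands for whatever label the enquiry returns, which is an assumption on my part):

```python
def bytes_to_refs(raw: bytes, poster_codepage: str) -> str:
    """Decode the submitted byte stream using the poster's codepage,
    then replace every character above 0x7F with a numeric character
    reference, leaving only safe 7-bit ASCII in the stored text."""
    text = raw.decode(poster_codepage)
    return "".join(f"&#{ord(c)};" if ord(c) > 0x7F else c for c in text)

# The same bytes yield different, but each correct, references once
# the right codepage is known:
print(bytes_to_refs(b"Caf\xe9", "cp1252"))  # Caf&#233;
print(bytes_to_refs(b"Caf\xe9", "cp1251"))  # Caf&#1081;
```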

Another possibility would be to explicitly declare a page encoding of (7-bit) ASCII. Then, hopefully, all characters beyond the ASCII range would be handed to Mudcat as character references. However, I don't know how this would affect the display of high-bit chars in existing pages. In any case, it's time to get modern and move to native Unicode.

Declaring the web page encoding as something like UTF-8 may also solve the problem, with no enquiry or re-encoding required at all, but it would only work for newly created pages. Grishka suggested separating message entry off into a new page for which a UTF-8 encoding could be specified. In new pages (having an explicit UTF-8 encoding declared), the text could be used directly (in fact, there would be no need to separate off the message entry). To add text to old pages (no encoding declared), all non-ASCII characters could be converted to character references first. There are some downsides to this approach, but it's viable.
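The two storage paths in that scheme could be sketched as a single branch (Python, purely illustrative; `page_is_utf8` is a hypothetical flag for whether the destination page declares UTF-8):

```python
def prepare_for_page(text: str, page_is_utf8: bool) -> str:
    """On a page with a declared UTF-8 encoding, store the text as-is;
    on a legacy page (no declared encoding), first escape everything
    outside ASCII as numeric character references."""
    if page_is_utf8:
        return text
    return "".join(f"&#{ord(c)};" if ord(c) > 0x7F else c for c in text)

print(prepare_for_page("Café", True))   # Café
print(prepare_for_page("Café", False))  # Caf&#233;
```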

If nothing else is done, it would at least be useful to flag (during preview) which high-bit characters are unencoded. When preparing preview text, Mudcat could test the values of the input characters, and if they fell in the high-bit range, it could apply special formatting to them (underlining, pastel background, foreground color, blink...). It could even refuse to accept posts until the high-bit chars were resolved or a suitable input encoding was selected. Similarly, Submit could scan the text and redirect to a preview page if the page contained high-bit characters, so that they could be corrected.
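The preview-flagging and submit-time checks described above could be sketched like so (Python again for illustration only; the span styling is just one possible way to highlight offending characters):

```python
import re

HIGH_BIT = re.compile(r"[^\x00-\x7F]")

def flag_high_bit(text: str) -> str:
    """Wrap each unencoded non-ASCII character in a styled span so it
    stands out in the preview (the styling here is illustrative)."""
    return HIGH_BIT.sub(
        lambda m: '<span style="background:#fdd;text-decoration:underline">'
                  + m.group(0) + "</span>",
        text,
    )

def has_high_bit(text: str) -> bool:
    """Submit-time check: if True, redirect to a preview page so the
    poster can resolve the characters before the post is accepted."""
    return HIGH_BIT.search(text) is not None

print(flag_high_bit("Café"))
```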

So there's the topic; discuss!