The Mudcat Café TM
Thread #135056   Message #3093674
Posted By: Artful Codger
11-Feb-11 - 10:12 PM
Thread Name: Tech: Entering special characters (moderated)
Subject: RE: Tech: Entering special characters
Now, how can you tell if you've entered text properly? The short answer is, not easily. That's because what you see (given your system, locale and browser settings) isn't necessarily what other people will see, so previewing will often mislead you in this regard. You just have to know which characters are valid "raw" and which aren't, or are suspect. The ASCII set I listed is safe; anything else isn't.

The input system here does automatically encode some characters for you, and if you preview your text, you'll see the resulting escapes in the input box (in the preview area, they'll show as you expect them to do). Unfortunately, the characters most likely to be converted are the ones you're least likely to use. To explain this, I have to get a bit nerdy.

The input/display system is still centered around codepages. Until fairly recently, memory was relatively limited, so early character sets were limited too: specifically to a set of 256 characters, corresponding to the number of values one can represent in a byte. As you can imagine, different regions needed different sets of characters, and there was also an increasing demand for symbols of all sorts. So a variety of codepages developed (various mappings of these 256 values to different sets of characters), and schemes developed for switching between them, so that a user could intermix characters that resided in different codepages. A user's default codepage would thus receive the lion's share of usage, with occasional forays into other codepages, managed largely under the covers.

To add to the confusion, different systems evolved their own sets of codepages; it was only late in the game that standards committees tried to impose some uniformity. To make matters worse, most text files had no provision to indicate which codepage was used to create them, so if one was transferred to a different system and opened, it might be interpreted using a different character mapping, and the result could be garbage.

That's the behavior we see here. What appears to be happening is this: Your browser typically has a default codepage setting active, appropriate to your locale, and it's used for most of the text you're likely to peruse. But by default (i.e., when no encoding is explicitly stated for a web page, as currently on Mudcat), HTML pages are supposed to be encoded using only 7-bit ASCII (which is a subset of most other codepages, at least in Western Europe and the Americas). This comprises only the first 128 values. How, then, should the browser handle the remaining 128 values when they are encountered? It could skip them, or report that the page is invalid, or replace them with some "I don't know" character like the question mark. But most browsers interpret these values according to your default codepage. The result is that if you enter a character value in the range 128-255, it is rendered differently from one user environment to another.
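To make that concrete, here is a small Python sketch (my own illustration, not anything Mudcat or a browser actually runs) that takes one byte value from the 128-255 range and decodes it under a few codepages. The helper name `interpretations` and the choice of codepages are just assumptions for the demo.

```python
def interpretations(byte_value, codepages):
    """Decode a single byte under each codepage and return {codepage: character}."""
    raw = bytes([byte_value])
    return {cp: raw.decode(cp) for cp in codepages}

# Hypothetical selection: one Western European and three Cyrillic codepages.
codepages = ("cp1252", "iso8859_5", "koi8_r", "cp866")

# 0xC1 is an arbitrary value from the "trouble range" (128-255).
seen = interpretations(0xC1, codepages)
for cp, ch in seen.items():
    print(f"{cp:10} U+{ord(ch):04X} {ch}")
```

The same stored byte comes out as a different character for each viewer, which is exactly the ambiguity described above.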

But you're only typing in characters, and the browser shows you those characters, and you can type in characters that are outside your codepage and they'll show up, too! So why doesn't the input system preserve the characters it knows you're entering, encoding them if necessary? Well, it kinda does and it kinda doesn't. For manipulating the characters internally, it probably converts them all to 16-bit or 32-bit Unicode. But when it returns collected input (as from the message box), it converts the text according to the specified encoding for the web page. If no encoding is specified, it returns (byte) values to the Mudcat software according to the poster's default or selected codepage, but without indicating which codepage that was. If the character isn't in the poster's codepage, one of two possibilities seems to be occurring (both with the same result): either the browser encodes the character as a character reference and returns that sequence value-by-value to Mudcat, or it hands Mudcat the Unicode value for that character, and the Mudcat software encodes the value as a character reference (since you can't store a 16-bit value in a single byte). Either way, life is good: the character ends up stored as a character reference, and should display properly on all systems (font problems aside).

The bad news is that characters in the 128-255 range according to the poster's effective codepage are left unencoded, and are ambiguously rendered. This means that even though a character like a right double quote has a Unicode value well beyond the problem range, it will likely be handed to Mudcat using one of the problem range values instead of being automatically encoded.
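You can imitate this split behavior in Python. This is only a simulation under the assumption that the poster's codepage is Windows-1252 (`cp1252`); the function name `submit_as_codepage` is made up for the example, and I'm not claiming this is the code the browser or Mudcat uses. Python's `xmlcharrefreplace` error handler turns anything the codepage can't represent into a numeric character reference and leaves everything else as raw bytes.

```python
def submit_as_codepage(text, codepage):
    """Encode text in the poster's codepage; characters the codepage
    lacks become numeric character references (&#NNNN;)."""
    return text.encode(codepage, errors="xmlcharrefreplace")

# Curly double quotes (U+201C, U+201D) around a Cyrillic word.
message = "\u201c\u041f\u0440\u0438\u0432\u0435\u0442\u201d"

submitted = submit_as_codepage(message, "cp1252")
print(submitted)
```

The Cyrillic letters come through as safe escapes, but the curly quotes, which do exist in `cp1252`, are handed over as single bytes in the 128-255 range, unlabeled and ambiguous.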

So let's consider an example: suppose a user copies some Cyrillic text off the net (the source doesn't matter) and pastes it into a Mudcat message. What will happen? We have to consider two scenarios.

If the poster is a Western European, generally using Western European settings, his codepage will not include Cyrillic characters, so they'll all be automatically encoded as escapes and will display properly to virtually all users. Not so for non-ASCII characters like the right double quote, long dashes, or copyright symbol, all of which are included in the poster's codepage. These will remain unencoded, and will be ambiguously displayed to other users.

If the poster is a Russian, his codepage might be KOI-8 or ISO-8859-5 or CP-866, all of which map Cyrillic, but to different values in the trouble range. In this case, all the text will remain unencoded, mostly with trouble range values, so that Western European viewers and even Russian viewers with a different default codepage will see garbage.
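Here is a rough sketch of that second scenario, again in Python and again only a simulation. I'm assuming the poster's codepage is KOI8-R (Python's `koi8_r` codec) and picking three other viewer codepages for comparison; the variable names are my own.

```python
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # a Cyrillic word ("Privet")

# The poster's codepage covers Cyrillic, so nothing gets escaped:
# every letter is stored as one raw byte in the 128-255 range.
stored = text.encode("koi8_r", errors="xmlcharrefreplace")
print(stored)

# What viewers with various default codepages would then see.
views = {cp: stored.decode(cp, errors="replace")
         for cp in ("koi8_r", "cp1252", "cp866", "iso8859_5")}
for cp, shown in views.items():
    print(f"{cp:10} {shown}")
```

Only the viewer who happens to share the poster's codepage gets the original text back; everyone else, including other Russian users, gets garbage.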

That's why it's so important to encode all non-ASCII characters as escapes, and not rely on the input system to do it for you.
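If you'd rather not look up escapes by hand, a few lines of Python will do the conversion before you paste. This is just one possible helper (the name `escape_non_ascii` is mine), and it produces decimal numeric character references rather than named ones.

```python
import html

def escape_non_ascii(text):
    """Replace every non-ASCII character with a decimal character reference."""
    # Note: &, < and > are left alone; those still need escaping by hand.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")

# Curly quotes, Cyrillic, a copyright sign and an accented letter.
sample = "\u201c\u041f\u0440\u0438\u0432\u0435\u0442\u201d \u00a9 caf\u00e9"
escaped = escape_non_ascii(sample)
print(escaped)

# Sanity check: decoding the references gives back the original text.
print(html.unescape(escaped) == sample)
```

The output contains nothing but plain ASCII, so it no longer matters which codepage the poster or the reader is using.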

There have been many suggestions about how to cure this situation, so that what you enter is what others see, and you don't have to do any special encoding (aside from the occasional ampersand and angle bracket). But each solution has drawbacks in regard to old threads, where the text has already been entered improperly and no automated fix is feasible. As I said at the beginning of this thread, don't post your suggestions for fixing the problem here; post them to another thread (like this one).