The Mudcat Café™
Thread #135626   Message #3095155
Posted By: GUEST,Grishka
14-Feb-11 - 01:23 PM
Thread Name: Tech: Non-ASCII character display problems
Subject: RE: Tech: Non-ASCII character display problems
Since we do not seem to have any real "geeks" here (thank goodness), I have done some further research. Here are the results as I understood them, and the arguments that follow from them, which support my concept above. Please read this even if you have seen most of it before, so that the discussion can move on.
  1. What happens when posting a message to the server?

    Under the HTTP protocol, the browser sends the server a sequence of 7-bit characters, in which the percent sign serves as an escape character. Its primary function is to reserve some characters for separating argument names (in our case: Subject, UserName, MemberID, Body etc.) from their "values". The "value" of the "Body" parameter is the actual message.

    But the percent sign is also used to encode arbitrary Unicode characters, similarly to the &-escapes in HTML: the character is transformed to UTF-8, and the % escape is applied to every resulting byte (i.e. %xx, with xx being the hex representation of that byte).
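    This byte-by-byte escaping can be sketched with Python's standard library (Python here only as a stand-in for what the browser does; the field name "Body" is taken from the post above):

```python
from urllib.parse import quote

# "é" is U+00E9; encoded as UTF-8 it becomes the two bytes 0xC3 0xA9,
# and each byte is escaped as %xx in the POST body.
print(quote("é"))                     # %C3%A9

# A form field as it would appear in the request body, with "=" left
# unescaped because it separates the name from the value.
print(quote("Body=café", safe="="))   # Body=caf%C3%A9
```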

    If a page containing an entry area is declared as using a Unicode-type encoding such as UTF-8 (either because it contains such a "meta" instruction or by the user's choice or browser default), any message entered there will be sent by the browser in this format, without any further transformation.

    If, however, a conventional one-byte encoding is assumed on the sending page, browsers will try to produce a compatible posting (since they assume the receiver to be Unicode-deaf). That is, if the user enters a character with a Unicode value above 127, the browser looks it up in the assumed codepage and, if successful, transports the resulting single byte using the % escape. (If not, the result is unpredictable and will not allow backward inference. Internet Explorer takes the liberty of using some other Windows codepage it deems competent for that character, and if it does not know any, it actually posts HTML escapes!) Obviously, the server script can only make real sense of this mechanism if it expects a particular encoding and the client uses exactly that one, or one that does not differ from it for the characters sent.
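    Both behaviours can be imitated in Python (an illustration only, not what any browser literally runs; the xmlcharrefreplace handler merely mimics the IE fallback described above):

```python
from urllib.parse import quote

# Browser assumes ISO 8859-1 (Latin-1): "é" is looked up in the codepage
# and posted as the single escaped byte 0xE9.
print(quote("é", encoding="latin-1"))   # %E9

# A character with no slot in the codepage, e.g. a Chinese character:
# Python's xmlcharrefreplace error handler produces the same kind of
# HTML escape that IE falls back to posting.
print("中".encode("latin-1", errors="xmlcharrefreplace"))   # b'&#20013;'
```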

    Consequences for Mudcat: If posters (such as our Gaelic friends) cannot be trusted to use HTML escapes consistently, any page containing an entry area should specify some encoding in its meta tags. ISO 8859-1 would do for Western languages, but if posters in other languages are welcome as well (and cannot be trusted either), Unicode is required, which means: declare the submitting page UTF-8.


  2. What happens on the server side?

    If we can go by the file name extension, the script ThreadNewMess-Sub.cfm has been written in the ColdFusion language, which (like all such languages) decodes the received sequence into a structure of variables. Our message, for example, will be the structure member named "Body", probably a Unicode string, less likely a sequence of bytes (0 to 255). Either way, it will contain the complete information if the message was sent from a page with encoding UTF-8. I tested and verified this by manually changing the encoding to UTF-8 and then entering Chinese characters – the preview (as processed by the script) reproduced them exactly, byte by byte, not escaped.
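    The server-side decoding step can be sketched in Python (parse_qs standing in for what ColdFusion does; the raw body shown is an assumed example, with field names taken from the post above):

```python
from urllib.parse import parse_qs

# Raw POST body as received over the wire, assuming the submitting
# page was declared UTF-8.
body = "Subject=Test&Body=caf%C3%A9"

# The language decodes the escaped sequence into named variables.
fields = parse_qs(body, encoding="utf-8")
print(fields["Body"][0])   # café
```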

    Consequences for Mudcat: If the script can rely on UTF-8 being used for posts, it can easily HTML-escape any character above 127. (BTW, Artful tells us that transformations to HTML escapes are already being done; I suspect these were effected by the browser while posting, see above.)
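    Just how easy this escaping would be is shown by the following sketch (the function name is my own invention, not anything in the Mudcat scripts):

```python
def html_escape_non_ascii(text: str) -> str:
    # Replace every character above 127 with its numeric HTML escape;
    # plain ASCII passes through untouched.
    return "".join(c if ord(c) < 128 else f"&#{ord(c)};" for c in text)

print(html_escape_non_ascii("café"))         # caf&#233;
print(html_escape_non_ascii("ASCII only"))   # ASCII only
```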


  3. What is being sent by Mudcat to the reader?

    HTML escapes will display the same regardless of encoding. Conversions from UTF-8 to HTML and vice versa are absolutely safe.

    In contrast, conversions from conventional single-byte encodings to HTML or Unicode are successful only if the codepage is recognized correctly, otherwise the information is ruined beyond repair. In particular, the reader will no longer be able to recover the message by changing the assumed encoding.
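    The two points above can be demonstrated in a few lines of Python (a sketch of the failure mode, not of any Mudcat code):

```python
# The two UTF-8 bytes of "é", misread as Latin-1, become two
# unrelated characters - the information is scrambled.
garbled = "é".encode("utf-8").decode("latin-1")
print(garbled)   # Ã©

# Worse, decoding with the wrong single-byte codepage never raises an
# error, so nothing flags which codepage the sender actually meant:
# the same byte 0xE6 "successfully" decodes to different letters.
print(b"\xe6".decode("latin-1"))   # æ
print(b"\xe6".decode("cp1250"))    # ć
```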

    Consequences for Mudcat: Converting old posts requires human intervention, or a very clever algorithm indeed. Practically, this means we will have to live with those single-byte messages for quite a while. Now if an ordinary thread page were declared UTF-8, all legacy diacritics would inevitably become illegible, forcing the reader to change the assumed encoding (and then perhaps change it back later on to read the new, genuine UTF-8 messages).

    This is considered unfriendly (Never be rude to an Irishman ...). Since encoding seems to be a property of a whole HTML file, not allowing changes within it, the consequence is to use two HTML files separating reading (single-byte) from writing (UTF-8).

    Grishka1 and Grishka2 are two methods of achieving this, see above.

    It is a matter of taste whether the reading pages should specify an encoding (single-byte such as ISO 8859-1), or leave it to the reader's browser default as it does now. If we assume that non-English messages are most likely to be read by people of corresponding background, the latter option is to be preferred.

I would very much welcome a statement from Joe, Max, or others in charge, as to what they consider important and desirable in this context, and which constraints of whatever nature we should take into account. Is there a chance of any reform at all, or are we wasting our time and effort? Are we going up a blind alley?

Of course we should not only discuss technology, but also the practical consequences affecting all of us.