The Mudcat Café TM
Thread #135626   Message #3095618
Posted By: GUEST,Grishka
15-Feb-11 - 08:51 AM
Thread Name: Tech: Non-ASCII character display problems
Subject: RE: Tech: Non-ASCII character display problems
Artful Codger, if we want our proposals to have a chance of becoming reality, we should present them as complete concepts, as I did with "Grishka1" and "Grishka2". You may start with "AC1", which I understand to be: keep the thread page as it is, but let it use UTF-16. This will indeed display most Western messages correctly and allow posting in Unicode. The major drawback is that legacy posts using other codepages can no longer be viewed by switching codepages in the browser. Your "selector", which asks the server script to do the switching ("emulation"), would be quite extravagant, and so slow that experimenting would be discouraged. Also, the size of most HTML files would nearly double.
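
To put a number on the size remark, here is a minimal Python sketch. The page fragment is made up for illustration and is not Mudcat's actual markup; the point is only that ASCII-heavy HTML takes two bytes per character in UTF-16 but one in UTF-8, while Cyrillic text takes two bytes per character in both.

```python
# Hypothetical page fragment; markup and text are invented for illustration.
page = "<html><body><p>Subject: RE: Tech: Non-ASCII character display problems</p></body></html>"

utf8_size = len(page.encode("utf-8"))
utf16_size = len(page.encode("utf-16-le"))  # little-endian, no byte order mark

print("UTF-8 bytes: ", utf8_size)
print("UTF-16 bytes:", utf16_size)

# Cyrillic letters need two bytes each in either encoding.
word = "Вещая"
print(len(word.encode("utf-8")), len(word.encode("utf-16-le")))
```

Since most Mudcat pages are largely ASCII markup and English text, the "nearly doubled" estimate follows directly.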

Posting from UTF-16 pages produces exactly the same byte sequences as posting from UTF-8 pages, because browsers do not submit form data in UTF-16 and use UTF-8 for the transport instead. I don't know yet exactly how the string is presented to the CFM script (by the ColdFusion software handling input on the server), but I am sure it will be easy to extract the information we need. If asked, I'll find out how.
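
As a rough illustration of what would travel over the wire, the Python sketch below percent-encodes a Cyrillic word as UTF-8 and decodes it again. It assumes the form is submitted in the usual URL-encoded manner; how ColdFusion then hands the value to the CFM script is not covered here and remains to be checked.

```python
from urllib.parse import quote, unquote

word = "Вещая"

# quote() encodes to UTF-8 by default, then percent-escapes each byte.
on_the_wire = quote(word)
print(on_the_wire)

# The receiving side reverses both steps to recover the Unicode string.
received = unquote(on_the_wire)
print(received == word)
```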

Note that my above concepts do not use UTF-8 to display any postings.

Storing message bodies in the server database is not a big issue. I understand it to be current policy to encourage "ampersand escapes" (numeric character references) anyway, so I think the script should produce and store these, simulating an AC-compliant user. A Unicode solution, possibly supported by the CF system, can be discussed as an alternative. How to facilitate full-text searches is another topic; Google does it well enough for my taste.
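
What "produce and store these" could look like is sketched below in Python, purely as an illustration; the real script is CFM, and the function name `to_ampersand_escapes` is my own invention. Every non-ASCII character becomes a decimal numeric character reference, so the stored text is plain ASCII.

```python
import html


def to_ampersand_escapes(text: str) -> str:
    """Replace each non-ASCII character with a decimal reference such as &#233;."""
    # ASCII characters pass through unchanged, including '&' and '<';
    # escaping those for HTML would be a separate step.
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


escaped = to_ampersand_escapes("Вещая café")
print(escaped)

# A browser (or html.unescape) turns the references back into characters.
print(html.unescape(escaped))
```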

As for thread titles, Mudcat can continue to restrict them to characters up to code point 255 and to store them as single-byte strings, meaning ISO 8859-1. Since they will arrive at the server in Unicode, the script can effectively test them and refuse thread creation if necessary, issuing an error message. Another option is to accept any title and store it as UTF-8, but transform it to "ampersand escapes" when writing an HTML page (if the thread was created after time X). Cyrillic titles would then be displayed correctly but would have to be somewhat short, assuming the database field is addressed and sized in terms of bytes. (Then again, Вещая takes exactly as much UTF-8 space as Vyeshchaya.)
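
Both title options can be sketched in a few lines of Python. This is an illustration only, with function names of my own choosing: the first function is the "code point 255" test, the second measures how many bytes a title would occupy if stored as UTF-8.

```python
def fits_latin1(title: str) -> bool:
    """True if every character has a code point of 255 or less (ISO 8859-1)."""
    return all(ord(ch) <= 255 for ch in title)


def utf8_length(title: str) -> int:
    """Number of bytes the title occupies when stored as UTF-8."""
    return len(title.encode("utf-8"))


for title in ("Café songs", "Вещая", "Vyeshchaya"):
    print(title, fits_latin1(title), utf8_length(title))
```

Five Cyrillic letters at two bytes each come to ten bytes, the same as the ten ASCII letters of the transliteration.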

Converting existing messages in the database: We agree that this needs human intervention and therefore time. It should be regarded as a separate project, the only connection to the current one being that the legacy problem will hopefully one day become a matter of the past. Joe has indicated that such a conversion is desirable; it would of course be much easier if he were helped by other users, supported by the script. AC's marker to recognize problematic messages would be quite useful; if, however, any change of database design is taboo, a mark at the beginning of the Body (like a "BOM") can serve as a makeshift. As soon as the whole database is converted (if ever), UTF-8 can be used for everything.
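
The makeshift mark could work roughly as in the Python sketch below. The marker string and the function names are hypothetical; nothing of the sort exists in the Mudcat database today, and any string that cannot occur at the start of a normal message would do.

```python
# Hypothetical marker; an HTML comment stays invisible if it ever reaches a page.
LEGACY_MARK = "<!--legacy-codepage-->"


def mark_body(body: str) -> str:
    """Prefix the Body with the marker unless it is already marked."""
    return body if is_marked(body) else LEGACY_MARK + body


def is_marked(body: str) -> bool:
    """True if the Body starts with the marker."""
    return body.startswith(LEGACY_MARK)


def strip_mark(body: str) -> str:
    """Remove the marker, for example once the message has been converted."""
    return body[len(LEGACY_MARK):] if is_marked(body) else body


body = mark_body("Some old message typed under a legacy codepage")
print(is_marked(body))
print(strip_mark(body))
```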

I am still waiting for an official signal, or for questions. If Q (14 Feb 11 - 09:55 PM) sums up the general opinion, I can put my time to good use elsewhere.