Subject: Tech: Non-ASCII character display problems From: Artful Codger Date: 11 Feb 11 - 09:34 PM

This thread is intended to discuss and propose solutions for a common problem here at Mudcat: why doesn't the text I post appear to all other users the way I posted it? This thread is aimed at techies, so geek-speak is encouraged—the more precise we can be, the better.

This problem has been discussed many times before, mostly as off-topic digressions in other threads. (Feel welcome to post links to noteworthy threads.) I'm hoping this thread will consolidate the discussions and proposals, and help Max to implement a robust solution that also plays nicely with all the old, improperly encoded pages. In the meantime, I strongly urge posters to use HTML character references for all non-ASCII characters (and for literal &, < and > characters)—see the guide "Entering special characters." Scripts and online converters ease the burden somewhat.

To get things (re)started, let me summarize what the problem (based on my experimentation) appears to be: Mudcat currently omits the encoding declaration on its web pages. Consequently, the interpretation of any characters outside the (7-bit) ASCII range is left to the browser, which normally applies the poster's or viewer's default or currently selected codepage setting (typically the same as for their OS and locale).

Internally, the browser probably converts all text to Unicode—you can post virtually any text to the message entry box, and it will show up properly there, regardless of the encoding used for your source. In short, the source encoding is immaterial (unless the source was opened using the wrong file encoding, so that it displayed garbage characters to begin with). But the poster's effective codepage is very material: when Mudcat retrieves the entered text (as a stream of values), the values returned for each character depend upon the poster's effective codepage.
If the character is mapped within that codepage (regardless of its Unicode mapping), the codepage's value is returned. If it is outside, one of two things probably happens, both of which have the same result: either the browser hands Mudcat a stream of values that translate to an HTML character reference (on the presumption that the receiver can only handle 8-bit characters), or the browser hands Mudcat the Unicode value for that character, and Mudcat converts it into a character reference. Either way, the right thing is done.

The result is that characters in the ASCII range (a subset of most codepages) and characters outside the poster's codepage mapping are handled properly, but characters within the high-bit half of the poster's codepage mapping are left unencoded, and currently Mudcat has no clue what encoding was used to enter those characters. Nor does another user's browser; it can only interpret the high-bit characters according to the viewer's default/selected codepage. And that's where the inconsistencies occur. Without knowing the codepage, these characters can't be converted (by value) to character references, because the codepage value may not match the Unicode value.

It may be that Mudcat can somehow enquire (when the text is initially retrieved) what the poster's effective codepage is. The browser surely knows. If so, a solution at the time of input is simple: get the poster's encoding and use it to convert the value stream to Unicode; then convert all values beyond x7F to character references.

Another possibility would be to explicitly declare a page encoding of (7-bit) ASCII. Then, hopefully, all characters beyond the ASCII range would be handed to Mudcat as character references. However, I don't know how this would affect the display of high-bit chars in existing pages. In any case, it's time to get modern and move to native Unicode.
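[Editor's note: the ambiguity described above is easy to demonstrate. Python is used here purely as an illustration (Mudcat itself runs ColdFusion): the same high-bit byte value denotes three different letters under three common single-byte codepages, so the raw byte alone cannot be converted to the right character reference.]

```python
# One and the same high-bit byte is a different character in each
# legacy codepage, so a byte stream without a declared encoding
# is inherently ambiguous.
raw = bytes([0xE9])
print(raw.decode("cp1252"))  # é  (Western European)
print(raw.decode("cp1251"))  # й  (Cyrillic)
print(raw.decode("cp1253"))  # ι  (Greek)
```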
Declaring the web page file encoding as something like UTF-8 may also solve the problem, with no inquiry or re-encoding required at all, but it would only work for newly created pages. Grishka suggested separating message entry off into a new page which could have a UTF-8 encoding specified. In new pages (having an explicit UTF-8 encoding declared), the text could be used directly (in fact, there would be no need to separate off the message entry). To add to old pages (no encoding declared), all non-ASCII characters could be converted to character references first. There are some downsides to this approach, but it's viable.

If nothing else is done, it would at least be useful to flag (during preview) which high-bit characters are unencoded. When preparing preview text, Mudcat could test the values of the input characters, and if they fell in the high-bit range, it could apply special formatting to them (underlining, pastel background, foreground color, blink...). It could even refuse to accept posts until the high-bit chars were resolved or a suitable input encoding was selected. Similarly, Submit could scan the text and redirect to a preview page if the page contained high-bit characters, so that they could be corrected.

So there's the topic; discuss! |
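[Editor's note: the preview-flagging idea in the post above is a few lines of code in any language. A hedged Python sketch follows; the highlight style is this editor's invention, not an existing Mudcat feature.]

```python
import re

def flag_high_bit(text: str) -> str:
    # Wrap every character above 0x7F in a highlighted span so the
    # poster can spot unencoded high-bit characters during preview.
    return re.sub(
        r"[^\x00-\x7f]",
        lambda m: f'<span style="background:#fdd">{m.group(0)}</span>',
        text,
    )

print(flag_high_bit("naïve"))  # the ï gets wrapped; plain ASCII passes through
```

A Submit handler could simply refuse (or redirect to preview) whenever the flagged and original strings differ.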
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 11 Feb 11 - 10:04 PM

There is a related problem: that of font coverage for the Unicode character set. No font contains all the defined Unicode characters; each font covers only some parts of the domain, with frequent holes within the covered parts. So even though a character may be visible to you, it may not be visible to another user who is using a similar, though not identical, font, with holes where you have visible characters.

The good news is that, for the European languages at least, coverage is fairly complete for virtually all the characters you're likely to encounter. Also, browsers are savvy enough that if they can't find a character in the default or specified font, they will find the closest font they know of which does have the character. So you're pretty sure of getting the character displayed, if it's available on your system. But people are finicky: they like things to look uniform, so it's a bit jarring if they're reading Times New Roman text and suddenly there's a jump to Arial just to display an Irish long-r. Better if the entire text (or a certain span) were displayed in Arial to begin with. (Though I personally dislike reading sans-serif text.) In many cases, users can address this by bracketing their posts with style or font directives—if they know how. But they're still guessing (1) that the font will exist on all or most other users' systems and (2) that all the needed characters will be defined in their font implementation.

Font issues also tie into the high-bit character problem. For old, improperly encoded posts, some have suggested fixing the display problems by bracketing the text with encoding or font specs, or changing the encoding for the whole page. I don't believe encoding can be changed on the fly—an encoding spec applies to the entire web page file (though you might be able to specify a different encoding for an included file).
And hopefully fonts which cater to specific codepages rather than to Unicode are becoming historical relics. If you know the encoding, it's much better to re-encode the text than to dance around the problems it causes. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 13 Feb 11 - 11:24 AM Here is my first proposal, let's call it Grishka1 for reference. It solves the problem completely for all future posts, to new or old threads, with minimal programming effort. The display of old messages is unchanged. The following changes are being proposed:
"Grishka1b": The script distinguishes between old and new threads, the latter being completely in UTF-8 to save space and bandwidth. "Grishka2" may be a frameset (the top frame being the thread, the bottom frame containing the entry/preview area as a UTF-8 page), but I am not yet sure whether it will work and be desirable. |
Subject: RE: Tech: Non-ASCII character display problems From: Taconicus Date: 13 Feb 11 - 11:38 AM ASCII not, read not. |
Subject: RE: Tech: Non-ASCII character display problems From: The Fooles Troupe Date: 13 Feb 11 - 02:46 PM "It will routinely transform all characters above 255" No comment needed for those who understand and see the funny side ... :-) |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 13 Feb 11 - 03:55 PM Foolestroupe, always look at the bright side (broit soid) of "geeks", good-on-ya-mite. Those who know me will smirk at my figuring here as a "tech geek" anyway. I wish we got rid of this problem quickly to proceed to even better fun and problems more worthy of our grey cells. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 13 Feb 11 - 04:15 PM Of course I meant 127 instead of 255 (still quite an age), and I accidentally escaped the <em>. Well, I think you know what I meant. Now please comment on the topic, preferably those in charge. |
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 13 Feb 11 - 05:57 PM

Ahem, Mudcat should escape all chars above 127, i.e. all chars not in the ASCII (7-bit) set. (If you're going to quibble, quibble about something of import.)

Note that with a UTF-8 file encoding specified for the web page, Mudcat would likely receive the input as UTF-8 multi-byte sequences rather than as straight Unicode values or character references (as I suspect happens now), so for pages lacking a UTF-8 header (like all the pages we have as of this writing), Mudcat would first have to decode to 16-bit Unicode codepoints before generating character references. Scripting languages provide conversion filters that can easily be attached to streams to handle such encoding conversions (since 16-bit Unicode chars have become the internal lingua franca for most), so it's not a big task to implement; it just needs to be done. Of course, no conversion would be necessary to store new messages as UTF-8.

I'm assuming that Mudcat doesn't store threads as actual web pages, but rather constructs them on the fly from DB information. So maybe we should be talking about adding encoding attributes to the messages themselves. As I see it, you'd only need three attribute values: unknown 8-bit (ISO-8859-1 assumed, but iffy), UTF-8, and ASCII (wholly 7-bit, with or without character escapes). A sweep through the DB could type the existing messages either as ASCII or unknown (if they had any 8-bit chars), and a revised input system would ensure that new messages were always properly encoded either as ASCII or UTF-8. Mudcat might still have to create pages without a page encoding specified (if a thread contained a message with unknown encoding) so that viewers and moderators could change the view encoding—in this case, it would have to convert high-bit and multi-char UTF-8 stuff to references on the fly—and resolve the current input problems somehow.
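[Editor's note: the decode-then-escape step described above is compact in any scripting language. A Python sketch follows as an illustration only; the real implementation would be ColdFusion on the server.]

```python
def utf8_to_refs(raw: bytes) -> str:
    # Decode the UTF-8 byte stream to Unicode codepoints first, then
    # emit numeric character references for everything above 0x7F.
    text = raw.decode("utf-8")
    return "".join(
        ch if ord(ch) <= 0x7F else f"&#{ord(ch)};"
        for ch in text
    )

print(utf8_to_refs("Bünde".encode("utf-8")))  # B&#252;nde
```

Escaping after decoding matters: escaping the raw UTF-8 bytes individually would produce two bogus references per two-byte character instead of one correct one.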
Alternatively, since questionable messages would already be tagged, they could be displayed in a UTF-8 encoded page with ISO-8859-1 encoding emulated; a selector would let users select another emulation encoding—moderators and possibly the original posters (if registered) would be able to apply the selected encoding permanently (as a separate operation), converting the post to UTF-8 or ASCII-with-escapes, obviating the need to translate the message in the future. (The operation would not be reversible unless, instead of converting to UTF-8/ASCII, the selected encoding was stored as the encoding type. This might be allowed for a provisional period until a later sweep made permanent conversions.)

Even better, make the web pages UTF-16 and decode all the source material. This would be more efficient for both the browser and the Mudcat input processing to handle. Messages could still be stored as UTF-8 or escaped ASCII to reduce the storage space needed. And there would be no need to spawn a separate page for input (though that still sounds like a nice option). |
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 13 Feb 11 - 06:01 PM „txt ‚txt |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 14 Feb 11 - 01:23 PM Since we do not seem to have any real "geeks" here (thank goodness), I have done some further research. Here are the results as I understood them, and the resulting arguments, which support my above concept. Please take notice of this, even if you have read most of it before, so that the discussion can move on.
Of course we should not only discuss technology, but also the practical consequences affecting all of us. |
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 14 Feb 11 - 07:38 PM

All of this holds true with a UTF-16 encoding as well, so the superiority of using UTF-8 for the web page itself has still not been demonstrated. In fact, it may still require an extra step to convert the UTF-8 input to UTF-16, which the ColdFusion scripts can handle directly. This is an unnecessary complication.

If input remains handled on the thread display page, the argument for UTF-8 is both stronger and weaker. It would allow legacy messages to be included unconverted (the character size is still 8-bit), even though high-bit characters would largely get blitzed. (Note that arbitrary sequences introduced by high-bit characters violate the encoding scheme—not every byte combination is valid UTF-8—so you're just trusting that browsers will behave benignly.) It would probably still allow users to switch view encoding to see what was originally meant. Sadly, the number of posts with wonky displays would skyrocket unless emulation was used along the lines I suggested. In that case, emulation would also be necessary to view the message with a different source encoding applied, and then using UTF-8 rather than UTF-16 is a distinct liability (two conversions required instead of one—one must go through UTF-16 to produce either the equivalent UTF-8 or character references).

Storage is a separate issue entirely. If (for new posts) everything is converted to character references, it conforms to byte-oriented storage, as is probably used now. But you're much more likely to exceed some length boundaries, particularly for title information. Storing raw UTF-8 would reduce this in most cases, but then you have to know that you stored UTF-8 rather than some legacy 8-bit encoding, and I'm not sure how this would impact DB searching. Even character references have an impact on searches: does the DB understand them?
If not, how can it normalize the different forms they may take (mnemonic or numeric? zero-padded or not?) for comparison? That problem most likely exists at present, but the relative disuse of escapes in comparison to ISO-Latin-1 masks it. In any case, most people search only by ASCII words, where char refs are a non-issue. And if users no longer had to encode text, most char refs encountered in new posts would match, being produced by the system. |
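[Editor's note: the normalization problem raised above (mnemonic vs. numeric vs. zero-padded references) is solved in most scripting environments by resolving all reference forms to literal characters before comparing. A Python sketch using the standard `html` module:]

```python
import html

def normalize(s: str) -> str:
    # Resolve &eacute;, &#233; and &#0233; to the same literal
    # character so search comparisons ignore the escape form used.
    return html.unescape(s)

# All three spellings collapse to the same string:
print(normalize("caf&eacute;"), normalize("caf&#233;"), normalize("caf&#0233;"))
```

A DB that stored normalized text (or normalized both sides of a comparison) would make the escape form a non-issue for search.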
Subject: RE: Tech: Non-ASCII character display problems From: Q (Frank Staplin) Date: 14 Feb 11 - 09:55 PM ḐŲĤ |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 15 Feb 11 - 08:51 AM

Artful Codger, if we want our considerations to have a chance of becoming reality, we should present them as complete concepts, as I did with "Grishka1" and "Grishka2". You may start with "AC1", which I understand to be: keep the thread page as it is, but let it use UTF-16. This will indeed display most western messages correctly, and allow posting in Unicode. The major drawback is that legacy posts using other codepages cannot be viewed by codepage switching in the browser. Your "selector", asking the server script to do it ("emulation"), would be quite extravagant and so slow that experimenting would be discouraged. Also, the size of most HTML files would be nearly doubled.

Posting from UTF-16 pages produces exactly the same byte sequences as from UTF-8 pages, because the HTTP protocol enforces UTF-8 for the transport. I don't know yet how exactly the string is presented to the CFM script (by the ColdFusion software handling input on the server), but I am sure it will be easy to extract the information we need. If asked, I'll find out how. Note that my above concepts do not use UTF-8 to display any postings.

Storing message bodies in the server database is not a big issue. I understand it to be current policy to encourage "ampersand escapes" anyway, so I think the script should produce and store these, simulating an AC-compliant user. A Unicode solution, possibly supported by the CF system, can be discussed as an alternative. How to facilitate full-text searches is another topic – Google does it all right for my taste.

As for thread titles, Mudcat can continue to restrict them to characters up to code number 255 and to store them as single-byte strings, meaning ISO 8859-1. Since they will arrive at the server in Unicode, the script can effectively test them and refuse thread creation if necessary, issuing an error message.
Another option is to accept any title, store it as UTF-8, but transform it to "ampersand" form when writing an HTML page (if the thread was created after time X). Cyrillic titles would then be displayed correctly but would have to be somewhat short, assuming the database field is addressed and sized in terms of bytes. (Well, Вещая takes exactly as much UTF-8 space as Vyeshchaya.)

Converting existing messages in the database: we agree that this needs human interference and therefore time. It should be regarded as a separate project, the only interference with the current one being that it will hopefully one day become a matter of the past. Joe has indicated such a conversion to be desirable; it would of course be much easier if he were helped by other users, supported by the script. AC's marker to recognize problematic messages would be quite useful; if, however, any change of database design is taboo, a mark at the beginning of the Body (like a "BOM") can serve as a makeshift. As soon as the whole database is converted (if ever), UTF-8 can be used for everything.

I am still waiting for an official signal, or questions. If Q (14 Feb 11 - 09:55 PM) sums up the general opinion, I can use my time very well otherwise. |
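[Editor's note: the byte arithmetic behind the title-length remark above checks out, since each Cyrillic letter occupies two bytes in UTF-8. A quick Python verification; the 35-byte figure for the ampersand form is this editor's addition.]

```python
title = "Вещая"  # 5 Cyrillic letters
assert len(title.encode("utf-8")) == 10         # 5 letters x 2 bytes each
assert len("Vyeshchaya".encode("utf-8")) == 10  # same as the transliteration

# The "ampersand escape" form is much longer: 7 bytes per letter.
refs = "".join(f"&#{ord(c)};" for c in title)
print(refs)       # &#1042;&#1077;&#1097;&#1072;&#1103;
print(len(refs))  # 35
```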
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 15 Feb 11 - 09:35 AM

Concept Grishka3: The thread page stays as it is; just the Preview checkbox is removed, and the "Submit Message" button is labeled "Reply". Clicking on it inevitably leads to the preview page, which is completely unchanged, but declared UTF-8. The contents, if any, of the entry box are reproduced, processed as follows: if the script encounters any characters above 127, it tentatively converts them as if from ISO 8859-1 to UTF-8, both in the preview and in the corresponding entry box, and includes a big red warning ("3 characters have been converted, please check ...").

Posters using a codepage similar to ISO 8859-1 can usually ignore the warning; the others have to examine the preview for the transformed characters. If they do not want to do that, they may click "Reply" before entering anything. If they pasted their text from elsewhere, they should repaste it into the new box (Ctrl-A, Ctrl-V). Writers of plain English text and diligent escapists (users of the htmlesc software) will never see the warning.

The extra benefit is that at least one preview is compulsory (though it is still possible to enter B***S*** then). |
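[Editor's note: the Grishka3 conversion-and-warning step might look like the following in outline. Python stands in for the actual ColdFusion script, and the warning wording is illustrative.]

```python
def tentative_preview(raw: bytes):
    # Tentatively assume ISO 8859-1 for the submitted bytes, and count
    # the high-bit characters so the preview can show the red warning.
    text = raw.decode("iso-8859-1")
    n = sum(1 for ch in text if ord(ch) > 0x7F)
    warning = (f"{n} characters have been converted, please check ..."
               if n else "")
    return text, warning

text, warning = tentative_preview(b"fa\xe7ade")
print(text)     # façade
print(warning)  # 1 characters have been converted, please check ...
```

Posters whose codepage really was ISO 8859-1 (or close to it) see a correct preview and can ignore the warning; everyone else is prompted to inspect exactly the characters that were guessed.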
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 15 Feb 11 - 09:57 AM Grishka3b: Same as 3, but direct submitting is allowed. If the script encounters no character above 127, it proceeds as it does now, otherwise as in Grishka3. My own favourite is still Grishka1. BTW, a normal Submit button can also be made to open a new tab, as I found out two minutes ago, so AC's "option" can be considered: <FORM ACTION="ThreadNewMess-Sub.cfm" METHOD="POST" TARGET="_blank"> |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 15 Feb 11 - 05:30 PM

Without questioning the value of this research, I feel the need to ask whether anyone has actually looked at whether the problem is "big enough" to merit the rather drastic changes being proposed. While I've occasionally seen a few broken chars when a few "particular someones" paste from another website, the offense is less serious than the far more common typographical errors that we tolerate from the few fat-fingered members who don't read before submitting. (Several specific persons who frequently post broken chars do so only when copying from a few fairly specific and unique other websites. Even their posts are clean when posting from most sites. I do wonder why.) If the problem were really significant, I'd expect to have seen complaints in the threads where char errors happen, and I don't find any such objections in any threads I've read.

An additional question is whether changes made now will be valid for the eventual (pending) standards for "web fonts." The W3C committee has "standardized" the "@font-face rule" for using CSS to designate a remote font, and the capability is reported (June 2009) as being available in all major browsers, but most such implementations violate proprietary rights of the font makers whose fonts are used, and the type foundries are objecting. If CSS is used, there's still a problem if the machine browsing the site doesn't have the font specified, and at present Mudcat likely would be forced to "buy" the specified font in order to put it on the server.

Microsoft has provided for the use of EOT (Embedded Open Type), in which the (encrypted) font is embedded in a document. They haven't had much success with getting people to use it, although IE can, and in some cases does, use it. Other browsers probably can use EOT by now, but it's unclear how much additional burden that use places on website servers.
"All the browsers but Microsoft's, meanwhile, have embraced a technique called "naked" or "raw" font linking, which means uploading ordinary desktop fonts onto servers." (Technology Review, June 2009) This method clearly, in most cases, violates the foundries' licensing terms, and while they've shown commendable restraint, failure to do "something different" may soon result in lawsuits like the ones over music and videos. A fairly recent change in copyright rules means that unlike a short time ago a font can be copyrighted as software, and most of the decent ones are ©. The "Web Open Font Format" (WOFF) is being pursued actively, particularly by a couple of "startup" companies**, but as proposed now that would require web sites to pay annual fees to lease the use of fonts. That seems to be an unreasonable burden to foist on mudcat when the use, as now, of the fonts on users machines is legal and "already paid for" (if you're using legal copies of the programs that provided them). ** The startup company called Small Batch is offering WOFF as "Typekit" and another startup named Kernest is offering to "broker" leases. Mozilla has signed on and typographers are circulating petitions for a standard. Implementation of WOFF also would require significant changes in all browsers, which really means that it won't do a lot of good until our people who are still using Win98/WinME/WinXP buy new computers compatible with the newest browsers. Clearly defining and describing the particular user practices that result in broken chars now, with the existing mudcat setup, and telling people "how to not do that" would be helpful. The ones who are doing it will ignore the request to not do it, and the rest of us will shrug and go on with our reading and posting. John |
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 15 Feb 11 - 06:51 PM

It doesn't matter what format is used for transport—as long as the target encoding (affecting the web page entry area) is UTF-x, the user input will be received and handled unambiguously. The browser and OS take care of that; neither the user nor Mudcat need be concerned.

The web pages are not stored; they are constructed from constituent information. So the page encoding should be whatever makes things easiest from the Mudcat scripting side. Yes, UTF-16 may mean that the pages sent (for display) are nearly twice the size (though probably not, if the transport is always UTF-8), and that may mean that UTF-8 is preferable, although it increases the burden on the Mudcat input handling.

Escaping is only encouraged presently because failing to do so results in inconsistent display. If the encoding is changed to UTF, there is no longer any need for users to escape text (except for &, < and >).

Can the HTML tag dictate whether a new tab, rather than a new window, is opened? I greatly prefer tabs, but since many pop-ups make no sense as tabs (and may resize the viewing window(!)), I must leave my default browser setting to open windows. Managing separate windows, however, is a pain, so I'm strongly opposed to forcing a new tab/window to be opened for input. It also adds to management problems when one is simultaneously researching and composing a response; I have enough tabs and windows to manage as it is!

Changing emulation could be streamlined by popping up a "change encoding" tab/window, like the one you're proposing for input. Then the entire thread would not have to be redisplayed, and the display of other messages would not be affected.
Selecting a new encoding directly from the window would be much simpler and more efficient timewise (even with the round-trip to the server) than having to find the encoding using the browser's interface—most users don't even know how to do this! And the setting I proposed to "fix" the encoding (i.e., apply that encoding for other users thereafter) could be incorporated into that interface. Then, only one user has to go through the pain of finding the right encoding for a message. Since ISO-Latin-1 would be assumed as the emulation default, most legacy messages would display properly from the start (as they do now, to most users), even if they were improperly encoded. Without emulation, these messages would appear blotto in a page with UTF-8 encoding specified. Leaving the display pages with no encoding specified is to mire them in the obsolescent past. The sooner legacy messages are updated to Unicode, the better for Mudcat's future. |
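[Editor's note: the server-side "emulation" both posters describe is at bottom an encode/decode round trip: recover the bytes as the page's assumed encoding read them, then decode them under the encoding the poster actually used. A Python sketch; the codepage names are examples.]

```python
def reinterpret(text: str, assumed: str, actual: str) -> str:
    # Recover the original byte stream (it was read as `assumed`),
    # then decode those same bytes as `actual` instead.
    return text.encode(assumed).decode(actual)

# A Cyrillic post read as ISO-Latin-1 shows accented gibberish:
garbled = "Ïðèâåò"  # what a Latin-1 viewer sees
print(reinterpret(garbled, "iso-8859-1", "cp1251"))  # Привет
```

"Fixing" the encoding permanently would just mean running this once and storing the result as UTF-8 or as character references.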
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 16 Feb 11 - 04:09 AM I think we should do first things first, that is stop garbled characters due to codepage discrepancies; stop (resp. minimize) burdening posters with HTML escaping. (The problem is explained in AC's moderated thread.) In order to increase our chances of a quick and waterproof solution, this goal should be achieved
The "target _blank" code, normally resulting in a new tab being created, is already in use by the links labeled "Printer Friendly"; try them out. (Popups, in contrast, are usually effected by JavaScript, which is not the Mudcat way.) (Neither is specifying any fonts other than "Arial", John.) My models Grishka1 and Grishka3 can alternatively be realised without "target _blank". In this case posters of plain English will not notice any difference at all between Grishka3 and the current model. It is also very easy to offer both behaviours via two buttons – this may be the best idea. |
Subject: RE: Tech: Non-ASCII character display problems From: The Fooles Troupe Date: 16 Feb 11 - 07:03 AM QUOTE Can the HTML tag dictate whether a new tab, rather than a new window, is opened? I greatly prefer tabs, but since many pop-ups make no sense as tabs (and may resize the viewing window(!)), I must leave my default browser setting to open windows. Managing separate windows, however is a pain, so I'm strongly opposed to forcing a new tab/window to be opened for input. UNQUOTE Interestingly enough, there are Firefox plugins that when certain of their settings are enabled, can force all 'open in new window's to 'open in new tab's - eg 'Tab Mix Plus', which is one I use on both Windows (7) and Ubuntu. |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 16 Feb 11 - 04:15 PM Foolestroupe - Since it's the common belief that IE is the most archaic and obsolete browser around, my ability to set whether to "open in same Window," or "open in new Window," or "open in new tab" in IE7 (which even Microsoft says is 2 generations obsolete) would imply that all reasonably current browsers probably have that feature. The USER can set which to do (but a website can "force an override"). I set to "open in same window" and either click with the mouse wheel to open in a new tab, or right click to choose which to do. No plugin required. Some of our members have indicated they still may be using really old IE that does not support tabs. Win98 may not be able to run IE versions since tabs were available, and WinXP has no support for "optional updates" so those users may not have updated to the latest IE version (with tabs) that they could run. John |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 16 Feb 11 - 05:05 PM

Since WinXP, Microsoft has supplied some fonts in the form that they call by the technical name of "Big Fonts." In all of those fonts, all the ASCII/ANSI English characters and all the principal "European language" characters are included. The character numbers "printed" by those fonts are the Unicode numbers. Internal font coding in Office programs since Office 2003 has been UTF-16 for the fonts provided by Microsoft. Anyone typing from a Windows computer with WinXP/Office 2003 or later and using one of the big fonts should not need to code any character they can type.

If a person gets a font from somewhere else, it probably is NOT a "big font," and coding of characters outside the "font range" would be necessary. A font from another source may also force "font page" use, which could cause characters to be differently encoded. Fonts with the same names existed on earlier Windows versions, but installing WinXP/Office 2003 or anything later should have upgraded them to "big."

The Microsoft Office "big fonts" are: Arial, Arial Black, Arial Bold, Arial Narrow, Bookman Old Style, Courier New, Garamond, Impact, Tahoma, Times New Roman, Trebuchet (Central and Eastern European languages only), and Verdana®.

At least since Office 2003, Office program installation has selected Tahoma as the default font, and claims have been made that it was "designed for improved web visibility" (but I've failed to find any real advantage to using it). Nearly all surviving Windows computers should have, or can get, the Microsoft "Arial Unicode" font, which contains all Unicode characters up to hex FFFD, but the chars beyond those in the "Big Fonts" will probably require borrowing a CJK keyboard (and someone who can use it) to enter them without coding.

"Currently in the Microsoft Windows operating systems, the two systems of storing text — code pages and Unicode — coexist.
However, Unicode-based systems are replacing code page–based systems. For example, Microsoft Windows® NT 4.0, Microsoft Windows 2000, Microsoft Windows XP, Microsoft Office 97 and later, Microsoft Internet Explorer 4.0 and later, and Microsoft SQL Server 7.0 and later all support Unicode." Unicode Support in Office 2003 Code pages are used by Office only for (some of) the "little fonts." They might also be used for a "free font" from someplace chosen at random. As an example, using Times New Roman (a Big Font) the "shortcut" allowing US users who don't have a "euro key" to use Alt-Numpad 0128 to enter € returns the correct Unicode Hex character number 20AC if you use the Alt-X toggle in Word to flip it back to the char value. Any character that comes "off the keyboard" should be sent by its Unicode char number. So the broken characters that appear at mudcat are due to people using "little fonts" that still use code paging rather than Unicode, loss of the Unicode char values by the mudcat database, or people using something other than Windows programs. John |
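[Editor's note: John's euro-sign example can be checked directly. Alt-Numpad 0128 enters the byte 0x80, which is the euro sign in the Windows-1252 code page; in Unicode the same character is the code point U+20AC that Word's Alt-X toggle reveals. In Python:]

```python
# Byte 0x80 in the Windows-1252 code page is the euro sign,
# whose Unicode code point is U+20AC.
euro = b"\x80".decode("cp1252")
print(euro)             # €
print(hex(ord(euro)))   # 0x20ac
```

This is exactly the codepage-to-Unicode mapping that "big fonts" and Unicode-aware programs perform behind the scenes.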
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 18 Feb 11 - 04:38 AM Joe Offer, please comment. |
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 18 Feb 11 - 05:06 AM You're losin' me, Grishka. I do pretty well technically, but this is more complicated than I have time for. What I'd like first of all, is some way to copy text with distorted characters, paste it somewhere and convert it to the normal character set, and paste the corrected text back into Mudcat. If there's a simple change that can be added to Mudcat pages to make all text universally readable, that would also be nice. But it has to be simple, and I'm not seeing simple so far. -Joe- |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 18 Feb 11 - 06:11 AM Joe, thanks for reacting; this is exactly the kind of feedback we have been missing. "What I'd like first of all ..." has been described by me in a post to AC's thread (18 Jan 11 - 06:07 PM), now deleted. I shall repost it here, if desired. "If there's a simple change that can be added to Mudcat pages to make all text universally readable, that would also be nice." That is exactly what I am striving to provide, at least for any text posted in the future. My models Grishka1 and Grishka3 are really as simple as you can get them; anything AC is suggesting is much more complex. Please try to read my descriptions (13 Feb 11 - 11:24 AM and 15 Feb 11 - 09:35 AM); if you find them incomprehensible, just ask. Once we agree on a model, i.e. we find it desirable provided it works as I claim, we can focus on the details. "I'm not seeing simple so far." That's the problem with threads like this one: the simplest ideas are least likely to be noticed. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 18 Feb 11 - 12:25 PM Joe, "What I'd like first of all ..." will be your reward if you invest some time to discuss here: a little tool, operating on HTML source code directly. Complete description of the simplest version of Grishka3b (without "target _blank"):
|
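The conversion this thread keeps recommending — replacing every character above 0x7F (plus literal &, < and >) with an HTML numeric character reference — is small enough to sketch. This is an illustrative version in the spirit of Grishka's and Artful Codger's tools, not a reproduction of either:

```java
public class HtmlEscaper {
    // Replace &, <, > and everything outside 7-bit ASCII
    // with decimal numeric character references (&#nnnn;).
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);            // handles surrogate pairs
            if (cp == '&' || cp == '<' || cp == '>' || cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Fräulein & Co. €"));
        // Fr&#228;ulein &#38; Co. &#8364;
    }
}
```

Text escaped this way survives any code-page confusion, because every byte of the result is plain ASCII and the character identity is carried in the reference itself.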
Subject: RE: Tech: Non-ASCII character display problems From: Jeri Date: 18 Feb 11 - 12:56 PM Joe does a lot around here, but he doesn't write any of the code. That's 100% Max. |
Subject: RE: Tech: Non-ASCII character display problems From: Bill D Date: 18 Feb 11 - 01:24 PM An aside, but since John in Kansas mentioned it... Since it's the common belief that IE is the most archaic and obsolete browser around, my ability to set whether to "open in same window," "open in new window," or "open in new tab" in IE7 (which even Microsoft says is two generations obsolete) would imply that all reasonably current browsers probably have that feature. The USER can set which to do (but a website can "force an override"). I set "open in same window" and either click with the mouse wheel to open in a new tab, or right-click to choose which to do. No plugin required. It is true that all browsers support the ability of a website to force "open in new window" or "open in new tab" (the target="_blank" attribute). But this is not always desirable for the user. I use the web filter Proxomitron, and I found a filter written specifically to remove and override that attribute. It is easy enough TO open a link in a new tab or window, but *I* prefer to have the option. (To explore this, search on **Grypen target blank**.) |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 18 Feb 11 - 04:46 PM Jeri, I am aware of that, and of course I would very much welcome Max to join the discussion, or rather to tell us about his own ideas and considerations (as I wrote before). But he doesn't seem to be around, and Joe seems to speak for Mudcat in terms of policy. The first question, yet unanswered, is: assuming it works, would it be worth the effort for the officials, or how much effort? If the answer is sufficiently positive, we can proceed to the details – and maybe Max doesn't need any help with these. Joe, please keep us informed about your and Max's ideas, if Max is not going to write himself. Would you like your reward (ca. 150 lines of Java code, to be processed like CopyUnicode. Usage: Copy, click on "Capture", click on codepage names until satisfied with the display window, paste back; no browser required.)? |
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 19 Feb 11 - 12:52 AM Grishka, how are you going to make good on that No more ranting by Artful Codger pledge? [grin] -Joe- |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 19 Feb 11 - 02:52 PM Joe, "how are you going to make good on that No more ranting by Artful Codger pledge?" Full money-back-guarantee-no-questions-asked! Of course the asterisk-small-print is part of the contract. ;-) *inasmuch as justified and concerning this topic. (Actually we are playing good cop / bad cop; you must know that game.) Is it a deal? And do you want your sales commission? |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 19 Feb 11 - 09:15 PM Re. forcing new window or new tab: one nearly universal use that I find probably acceptable is that when you click to get a "printable version," nearly all sites open the pv in a new tab or new window. A couple of "government sites" open every click in "new," which is somewhat annoying.1 Additionally annoying is that the pdf documents the gov (e.g. SCOTUS) has thoughtfully provided for download all have the same filename, and if you save a dozen without inserting a different name for each one you end up with only the last overwrite. (Some Microsoft servers have begun to slip into the same "one name for all" habit.) But all that "public information" is supposed to be a secret anyway, and it's only there to give the illusion that you've been informed. 1 While the current HTML standard is "posted" at the W3C site, actually reading it requires you to pretty much click to a new URL for each page (or paragraph) of the document and then to click on to each subparagraph of the paragraph; and I believe I saw a note that the 4.01 Spec is close to 400 pages. Three clicks away from the TOC and I'm pretty much lost as to where I am in the doc, without a whole lot of assembly and reformatting, so I may never read the whole thing (again). John |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 21 Feb 11 - 03:38 AM Joe and Max, please tell us why you deem my summary of 18 Feb 11 - 12:25 PM too "complicated". Is it so incomprehensible that you do not want to reflect about it? Or do you fear it might not do the trick, and find my arguments (further above) incomprehensible or unconvincing? Or do you fear it will be too much work compared to the benefit? Or do you plan an implementation later, provided my claims hold water? Or are even better ideas on their way? If we let the discussion sleep now, it is bound to start again from zero in a couple of months, as it did before. Since we do not really enjoy that, it would be a great relief to have some kind of Mudcat-official preliminary result, as specific as possible. "this is more complicated than I have time for." We want to save time for all of us, and at the same time improve the quality of Mudcat considerably. "What I'd like first of all, is some way to copy text with distorted characters, paste it somewhere and convert it to the normal character set, and paste the corrected text back into Mudcat." For this I wrote a tool for you, finished and tested. It can even correct text you accidentally misconverted. To get started with it you have to invest 15 minutes. It will save you many, many hours, nerves, and errors. If you fear it may not work or cause damage, see that CopyUnicode thread. Silver paper with it? Goldverschnürt sogar? ("Even tied with gold ribbon?") Just ask. Happy P-Day across the pond! Yes, we can. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 23 Feb 11 - 05:22 AM I obviously failed to convince those in charge so far, and we are left puzzled once more – tant pis ("too bad"). The tool being finished, I feel I should post it nevertheless, in case some adepts like Artful Codger or Jon Banjo want to study it and either recommend it to Mudcat or use it themselves when asked once more to convert something (saving the trouble of reconstructing the HTML tags manually). Feel free to adapt it to your needs and taste. Note the total absence of potentially harmful code for file manipulation, internet access, system calls, etc. Farewell geekdom, hello life! import java.awt.*; |
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 23 Feb 11 - 05:35 AM Grishka, can you please contact me by e-mail? -Joe- joe@mudcat.org |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST Date: 23 Feb 11 - 05:40 AM ترجم هذه الصفحة |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,.gargoyle Date: 23 Feb 11 - 05:43 AM ترجم هذه الصفحة ("translate this page") — works fine in Arabic; what is the issue?
|
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 23 Feb 11 - 05:46 AM Those are all ampersand codes, garg. How did you create them? -Joe- |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 23 Feb 11 - 07:10 AM Like this? ت ت ر ر ج ج م م ه ه ذ ذ ه ه   ا ا ل ل ص ص ف ف ح ح ة ة John |
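Each letter in John's post appears twice — presumably he typed both the literal character and its decimal numeric character reference, and the site rendered both identically. Going from a decimal reference back to the character is mechanical; this is an illustrative sketch (the helper name `fromNcr` is made up for the example):

```java
public class NcrDemo {
    // Parse a decimal numeric character reference like "&#1578;" to its character.
    static String fromNcr(String ncr) {
        // Strip the leading "&#" and trailing ";" and convert the code point.
        int cp = Integer.parseInt(ncr.substring(2, ncr.length() - 1));
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        // 1578 is decimal for U+062A, ARABIC LETTER TEH (ت),
        // the first letter of the phrase posted above.
        System.out.println(fromNcr("&#1578;")); // ت
    }
}
```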
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,.gargoyle Date: 23 Feb 11 - 07:11 AM It helps to shut off the nasty 7S script.
|
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 23 Feb 11 - 07:51 AM Coded ✱ thru ❇: ✱ ✲ ✳ ✴ ✵ ✶ ✷ ✸ ✹ ✺ ✻ ✼ ✽ ✾ ✿ ❀ ❁ ❂ ❃ ❄ ❅ ❆ ❇ Typed in Word using Times New Roman, copied and pasted: ✱ ✲ ✳ ✴ ✵ ✶ ✷ ✸ ✹ ✺ ✻ ✼ ✽ ✾ ✿ ❀ ❁ ❂ ❃ ❄ ❅ ❆ ❇ I see the correct glyphs with both methods, in Word and in Preview, but when I clicked Preview, the pasted glyphs were converted (without my permission!!!) to their decimal codes for ✱ thru ❇. (I've noticed this automatic conversion previously here.) Does anyone else see just two rows of asterisks? John |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 23 Feb 11 - 08:44 AM Explanation: When I paste glyphs outside the ANSI range from Word into the Reply to Thread box, check the Preview box, and click Submit Message, the glyphs generally appear in the preview window as they were in Word, and as they looked when I pasted them. In the Reply to Thread box, though, the pasted glyphs have been transformed into decimal &-code. The ones I coded in hex &-code remain as I typed them. The transformation to code appears to work only for character numbers below where the Unicode charts "go seriously oriental," but I see no real reason to provide for those characters, at least at this time. The range of characters that appears (today) to be "always" correctly converted to decimal Unicode char values seems to encompass all of the "languages" likely to appear here. Locally regionalized keyboards may "print symbols" for some language-specific chars, and people typing in a language for which their keyboard lacks a "foreign char" that they want may not know the correct code. Those are both "typos" that require no modification of the 'cat. If you misspell a werd it's gonna post misspelled, even if what you type (accidentally?) is 癦 instead of €. This conversion is, so far as I've noticed, a relatively new feature at the 'cat, and may have been "intermittent" while the details are being worked out - or maybe it's been here forever and I just didn't notice it. The ONLY broken characters I see in posts here with any consistency are "curlies" that the posting person's computer pastes as "symbols" from a "pseudo-font" that has no direct conversion to Unicode characters. Those do not particularly affect the intelligibility of what's posted, any more than the occasional typo.
Their only use is to give all the Windows users instant recognition of the one regular who uses a Mac sloppily - and probably the reverse for intelligent enough Mac users (if that's not an oxymoron?). I don't see broken chars that uniquely identify 'nix posters, but it's not been a sufficient concern for me to look for them. The "Practice Threads" elicit some wailing about things that don't work as expected. These are 90% user error and 100% things there's no useful reason to post. For those who can use the escapes (and tags) correctly, they may be cute, and they're not particularly harmful; but it's no real problem if they don't post as intended, since the "cutes" have no essential use in conversation. John |
Subject: RE: Tech: Non-ASCII character display problems From: Bill D Date: 23 Feb 11 - 10:47 AM Having set this browser to Times New Roman, I see all the characters above ↑ ▲ properly. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 23 Feb 11 - 11:28 AM Joe, I entirely trust you to have good reasons to do whatever you are doing. It is not necessary to explain them to me personally in order to stop me pestering the 'cat in this matter, because I won't any more. What I felt I should contribute, I did above, to the best of my competencies, admittedly quite limited. (Of course I shall continue to answer questions.) To focus the discussion in general, only public statements will do. What you wrote here on 18 Feb 11 - 05:06 AM is much better than nothing, for a start. John (23 Feb 11 - 07:51 AM), it's your browser doing that by itself, unfortunately only with "irrelevant" characters; see my post of 14 Feb 11 - 01:23 PM above, reflecting my experiments. To all: the above tool CodepageTurner is designed to convert the raw bytes of "legacy" messages already stored in the Mudcat database, of unknown codepages. This is an operation usually performed by Joe, whom I understood to desire such software (in his post of 18 Feb 11 - 05:06 AM), or his "clones", or "kind souls" he asks for help. When posting, please use Artful's tool "htmlesc" or anything equivalent. |
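The problem CodepageTurner addresses is that the database holds raw high-bit bytes with no record of which code page they were typed under, so the only recourse is to decode the same bytes under candidate code pages until the text reads sensibly. A minimal sketch of that idea (not Grishka's actual tool; the charset names are standard Java ones):

```java
import java.nio.charset.Charset;

public class CodepageGuess {
    public static void main(String[] args) {
        // One stored byte, three plausible interpretations.
        byte[] stored = { (byte) 0xE9 };
        String[] candidates = { "windows-1252", "ISO-8859-5", "ISO-8859-7" };
        for (String cs : candidates) {
            // A human (or a heuristic) picks the decoding that looks right
            // for the language the message appears to be in.
            System.out.println(cs + " -> " + new String(stored, Charset.forName(cs)));
        }
        // windows-1252 -> é, ISO-8859-5 -> щ, ISO-8859-7 -> ι
    }
}
```

Once the right code page is identified, converting the decoded string to numeric character references fixes the message permanently, since the references are code-page independent.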
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 23 Feb 11 - 04:08 PM Hi, Grishka- Don't get me wrong - I'm enjoying this, and I like what you've been doing. However, I did have something I wanted to discuss with you directly. -Joe- joe@mudcat.org |
Subject: RE: Tech: Non-ASCII character display problems From: Jack Campin Date: 04 Nov 16 - 09:22 PM I think that may have been spam, but in this context it's extremely hard to tell. How does Google's new international font set (Noto) affect this? I can't install it on this machine, but I'm unusually trailing-edge. |
Subject: RE: Tech: Non-ASCII character display problems From: Jack Campin Date: 06 Nov 16 - 09:38 PM Some insane mod seems to have deleted a post in which I asked: What effect does Google's new Noto font set have on this? Is it possible or desirable to set things up so that "alien" text is handled by one of those? (For me, it would not be too good, since my machine is too old to recognize Noto fonts as valid).
|