Subject: Tech: Non-ASCII character display problems From: Artful Codger Date: 11 Feb 11 - 09:34 PM

This thread is intended to discuss and propose solutions for a common problem here at Mudcat: why doesn't the text I post appear to all other users the way I posted it? This thread is aimed at techies, so geek-speak is encouraged—the more precise we can be, the better.

This problem has been discussed many times before, mostly as off-topic digressions in other threads. (Feel welcome to post links to noteworthy threads.) I'm hoping this thread will consolidate the discussions and proposals, and help Max to implement a robust solution that also plays nicely with all the old, improperly encoded pages. In the meantime, I strongly urge posters to use HTML character references for all non-ASCII characters (and for literal &, < and > characters)—see the guide "Entering special characters." Scripts and online converters ease the burden somewhat.

To get things (re)started, let me summarize what the problem (based on my experimentation) appears to be: Mudcat currently omits the encoding declaration on its web pages. Consequently, the interpretation of any characters outside the (7-bit) ASCII range is left to the browser, which normally applies the poster's or viewer's default or currently selected codepage setting (typically the same as for their OS and locale).

Internally, the browser probably converts all text to Unicode—you can post virtually any text to the message entry box, and it will show up properly there, regardless of the encoding used for your source. In short, the source encoding is immaterial (unless the source was opened using the wrong file encoding, so that it displayed garbage characters to begin with). But the poster's effective codepage is very material: when Mudcat retrieves the entered text (as a stream of values), the values returned for each character depend upon the poster's effective codepage.
If the character is mapped within that codepage (regardless of its Unicode mapping), the codepage's value is returned. If it is outside, one of two things probably happens, both of which have the same result: either the browser hands Mudcat a stream of values that translate to an HTML character reference (on the presumption that the receiver can only handle 8-bit characters), or the browser hands Mudcat the Unicode value for that character, and Mudcat converts it into a character reference. Either way, the right thing is done.

The result is that characters in the ASCII range (a subset of most codepages) and characters outside the poster's codepage mapping are handled properly, but characters within the high-bit half of the poster's codepage mapping are left unencoded, and currently Mudcat has no clue what encoding was used to enter those characters. Nor does another user's browser; it can only interpret the high-bit characters according to the viewer's default/selected codepage. And that's where the inconsistencies occur. Without knowing the codepage, these characters can't be converted (by value) to character references, because the codepage value may not match the Unicode value.

It may be that Mudcat can somehow enquire (when the text is initially retrieved) what the poster's effective codepage is. The browser surely knows. If so, a solution at the time of input is simple: get the poster's encoding and use it to convert the value stream to Unicode; then convert all values beyond x7F to character references.

Another possibility would be to explicitly declare a page encoding of (7-bit) ASCII. Then, hopefully, all characters beyond the ASCII range would be handed to Mudcat as character references. However, I don't know how this would affect the display of high-bit chars in existing pages. In any case, it's time to get modern and move to native Unicode.
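[Editor's note: the ambiguity described above is easy to demonstrate. Python is used here purely as an illustration (Mudcat itself runs ColdFusion): the same high-bit byte value denotes three different letters under three common single-byte codepages, so the raw byte alone cannot be converted to the right character reference.]

```python
# One and the same high-bit byte is a different character in each
# legacy codepage, so a byte stream without a declared encoding
# is inherently ambiguous.
raw = bytes([0xE9])
print(raw.decode("cp1252"))  # é  (Western European)
print(raw.decode("cp1251"))  # й  (Cyrillic)
print(raw.decode("cp1253"))  # ι  (Greek)
```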
Declaring the web page file encoding as something like UTF-8 may also solve the problem, with no inquiry or re-encoding required at all, but it would only work for newly created pages. Grishka suggested separating message entry off into a new page which could have a UTF-8 encoding specified. In new pages (having an explicit UTF-8 encoding declared), the text could be used directly (in fact, there would be no need to separate off the message entry). To add to old pages (no encoding declared), all non-ASCII characters could be converted to character references first. There are some downsides to this approach, but it's viable.

If nothing else is done, it would at least be useful to flag (during preview) which high-bit characters are unencoded. When preparing preview text, Mudcat could test the values of the input characters, and if they fell in the high-bit range, it could apply special formatting to them (underlining, pastel background, foreground color, blink...). It could even refuse to accept posts until the high-bit chars were resolved or a suitable input encoding was selected. Similarly, Submit could scan the text and redirect to a preview page if the page contained high-bit characters, so that they could be corrected.

So there's the topic; discuss! |
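[Editor's note: the preview-flagging idea in the post above is a few lines of code in any language. A hedged Python sketch follows; the highlight style is this editor's invention, not an existing Mudcat feature.]

```python
import re

def flag_high_bit(text: str) -> str:
    # Wrap every character above 0x7F in a highlighted span so the
    # poster can spot unencoded high-bit characters during preview.
    return re.sub(
        r"[^\x00-\x7f]",
        lambda m: f'<span style="background:#fdd">{m.group(0)}</span>',
        text,
    )

print(flag_high_bit("naïve"))  # the ï gets wrapped; plain ASCII passes through
```

A Submit handler could simply refuse (or redirect to preview) whenever the flagged and original strings differ.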
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 11 Feb 11 - 10:04 PM

There is a related problem: that of font coverage for the Unicode character set. No font contains all the defined Unicode characters; each font covers only some parts of the domain, with frequent holes within the covered parts. So even though a character may be visible to you, it may not be visible to another user who is using a similar, though not identical, font, with holes where you have visible characters.

The good news is that, for the European languages at least, coverage is fairly complete for virtually all the characters you're likely to encounter. Also, browsers are savvy enough that if they can't find a character in the default or specified font, they will find the closest font they know of which does have the character. So you're pretty sure of getting the character displayed, if it's available on your system. But people are finicky: they like things to look uniform, so it's a bit jarring if they're reading Times New Roman text and suddenly there's a jump to Arial just to display an Irish long-r. Better if the entire text (or a certain span) were displayed in Arial to begin with. (Though I personally dislike reading sans-serif text.) In many cases, users can address this by bracketing their posts with style or font directives—if they know how. But they're still guessing (1) that the font will exist on all or most other users' systems and (2) that all the needed characters will be defined in their font implementation.

Font issues also tie into the high-bit character problem. For old, improperly encoded posts, some have suggested fixing the display problems by bracketing the text with encoding or font specs, or changing the encoding for the whole page. I don't believe encoding can be changed on the fly—an encoding spec applies to the entire web page file (though you might be able to specify a different encoding for an included file).
And hopefully fonts which cater to specific codepages rather than to Unicode are becoming historical relics. If you know the encoding, it's much better to re-encode the text than to dance around the problems it causes. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 13 Feb 11 - 11:24 AM Here is my first proposal, let's call it Grishka1 for reference. It solves the problem completely for all future posts, to new or old threads, with minimal programming effort. The display of old messages is unchanged. The following changes are being proposed:
"Grishka1b": The script distinguishes between old and new threads, the latter being completely in UTF-8 to save space and bandwidth. "Grishka2" may be a frameset (the top frame being the thread, the bottom frame containing the entry/preview area as a UTF-8 page), but I am not yet sure whether it will work and be desirable. |
Subject: RE: Tech: Non-ASCII character display problems From: Taconicus Date: 13 Feb 11 - 11:38 AM ASCII not, read not. |
Subject: RE: Tech: Non-ASCII character display problems From: The Fooles Troupe Date: 13 Feb 11 - 02:46 PM "It will routinely transform all characters above 255" No comment needed for those who understand and see the funny side ... :-) |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 13 Feb 11 - 03:55 PM Foolestroupe, always look at the bright side (broit soid) of "geeks", good-on-ya-mite. Those who know me will smirk at my figuring here as a "tech geek" anyway. I wish we got rid of this problem quickly to proceed to even better fun and problems more worthy of our grey cells. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 13 Feb 11 - 04:15 PM Of course I meant 127 instead of 255 (still quite an age), and I accidentally escaped the <em>. Well, I think you know what I meant. Now please comment on the topic, preferably those in charge. |
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 13 Feb 11 - 05:57 PM

Ahem, Mudcat should escape all chars above 127, i.e. all chars not in the ASCII (7-bit) set. (If you're going to quibble, quibble about something of import.)

Note that with a UTF-8 file encoding specified for the web page, Mudcat would likely receive the input as UTF-8 multi-byte sequences rather than as straight Unicode values or character references (as I suspect happens now), so for pages lacking a UTF-8 header (like all the pages we have as of this writing), Mudcat would first have to decode to 16-bit Unicode codepoints before generating character references. Scripting languages provide conversion filters that can easily be attached to streams to handle such encoding conversions (since 16-bit Unicode chars have become the internal lingua franca for most), so it's not a big task to implement; it just needs to be done. Of course, no conversion would be necessary to store new messages as UTF-8.

I'm assuming that Mudcat doesn't store threads as actual web pages, but rather constructs them on the fly from DB information. So maybe we should be talking about adding encoding attributes to the messages themselves. As I see it, you'd only need three attribute values: unknown 8-bit (ISO-8859-1 assumed, but iffy), UTF-8, and ASCII (wholly 7-bit, with or without character escapes). A sweep through the DB could type the existing messages either as ASCII or unknown (if they had any 8-bit chars), and a revised input system would ensure that new messages were always properly encoded either as ASCII or UTF-8. Mudcat might still have to create pages without a page encoding specified (if a thread contained a message with unknown encoding) so that viewers and moderators could change the view encoding—in this case, it would have to convert high-bit and multi-char UTF-8 stuff to references on the fly—and resolve the current input problems somehow.
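[Editor's note: the decode-then-escape step described above is compact in any scripting language. A Python sketch follows as an illustration only; the real implementation would be ColdFusion on the server.]

```python
def utf8_to_refs(raw: bytes) -> str:
    # Decode the UTF-8 byte stream to Unicode codepoints first, then
    # emit numeric character references for everything above 0x7F.
    text = raw.decode("utf-8")
    return "".join(
        ch if ord(ch) <= 0x7F else f"&#{ord(ch)};"
        for ch in text
    )

print(utf8_to_refs("Bünde".encode("utf-8")))  # B&#252;nde
```

Escaping after decoding matters: escaping the raw UTF-8 bytes individually would produce two bogus references per two-byte character instead of one correct one.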
Alternatively, since questionable messages would already be tagged, they could be displayed in a UTF-8 encoded page with ISO-8859-1 encoding emulated; a selector would let users select another emulation encoding—moderators and possibly the original posters (if registered) would be able to apply the selected encoding permanently (as a separate operation), converting the post to UTF-8 or ASCII-with-escapes, obviating the need to translate the message in the future. (The operation would not be reversible unless, instead of converting to UTF-8/ASCII, the selected encoding was stored as the encoding type. This might be allowed for a provisional period until a later sweep made permanent conversions.)

Even better, make the web pages UTF-16 and decode all the source material. This would be more efficient for both the browser and the Mudcat input processing to handle. Messages could still be stored as UTF-8 or escaped ASCII to reduce the storage space needed. And there would be no need to spawn a separate page for input (though that still sounds like a nice option). |
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 13 Feb 11 - 06:01 PM „txt ‚txt |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 14 Feb 11 - 01:23 PM Since we do not seem to have any real "geeks" here (thank goodness), I have done some further research. Here are the results as I understood them, and the resulting arguments, which support my above concept. Please take notice of this, even if you have read most of it before, so that the discussion can move on.
Of course we should not only discuss technology, but also the practical consequences affecting all of us. |
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 14 Feb 11 - 07:38 PM

All of this holds true with a UTF-16 encoding as well, so the superiority of using UTF-8 for the web page itself has still not been demonstrated. In fact, it may still require an extra step to convert the UTF-8 input to UTF-16, which the ColdFusion scripts can handle directly. This is an unnecessary complication.

If input remains handled on the thread display page, the argument for UTF-8 is both stronger and weaker. It would allow legacy messages to be included unconverted (the character size is still 8-bit), even though high-bit characters would largely get blitzed. (Note that arbitrary sequences introduced by high-bit characters violate the encoding scheme—not every byte combination is valid UTF-8—so you're just trusting that browsers will behave benignly.) It would probably still allow users to switch view encoding to see what was originally meant. Sadly, the number of posts with wonky displays would skyrocket unless emulation was used along the lines I suggested. In that case, emulation would also be necessary to view the message with a different source encoding applied, and then using UTF-8 rather than UTF-16 is a distinct liability (two conversions required instead of one—one must go through UTF-16 to produce either the equivalent UTF-8 or character references).

Storage is a separate issue entirely. If (for new posts) everything is converted to character references, it conforms to byte-oriented storage, as is probably used now. But you're much more likely to exceed some length boundaries, particularly for title information. Storing raw UTF-8 would reduce this in most cases, but then you have to know that you stored UTF-8 rather than some legacy 8-bit encoding, and I'm not sure how this would impact DB searching. Even character references have an impact on searches: does the DB understand them?
If not, how can it normalize the different forms they may take (mnemonic or numeric? zero-padded or not?) for comparison? That problem most likely exists at present, but the relative disuse of escapes in comparison to ISO-Latin-1 masks it. In any case, most people search only by ASCII words, where char refs are a non-issue. And if users no longer had to encode text, most char refs encountered in new posts would match, being produced by the system. |
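[Editor's note: the normalization problem raised above (mnemonic vs. numeric vs. zero-padded references) is solved in most scripting environments by resolving all reference forms to literal characters before comparing. A Python sketch using the standard `html` module:]

```python
import html

def normalize(s: str) -> str:
    # Resolve &eacute;, &#233; and &#0233; to the same literal
    # character so search comparisons ignore the escape form used.
    return html.unescape(s)

# All three spellings collapse to the same string:
print(normalize("caf&eacute;"), normalize("caf&#233;"), normalize("caf&#0233;"))
```

A DB that stored normalized text (or normalized both sides of a comparison) would make the escape form a non-issue for search.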
Subject: RE: Tech: Non-ASCII character display problems From: Q (Frank Staplin) Date: 14 Feb 11 - 09:55 PM ḐŲĤ |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 15 Feb 11 - 08:51 AM

Artful Codger, if we want our considerations to have a chance of becoming reality, we should present them as complete concepts, as I did with "Grishka1" and "Grishka2". You may start with "AC1", which I understand to be: keep the thread page as it is, but let it use UTF-16. This will indeed display most western messages correctly, and allow posting in Unicode. The major drawback is that legacy posts using other codepages cannot be viewed by codepage switching in the browser. Your "selector", asking the server script to do it ("emulation"), would be quite extravagant and so slow that experimenting would be discouraged. Also, the size of most HTML files would be nearly doubled.

Posting from UTF-16 pages produces exactly the same byte sequences as from UTF-8 pages, because the HTTP protocol enforces UTF-8 for the transport. I don't know yet how exactly the string is presented to the CFM script (by the ColdFusion software handling input on the server), but I am sure it will be easy to extract the information we need. If asked, I'll find out how. Note that my above concepts do not use UTF-8 to display any postings.

Storing message bodies in the server database is not a big issue. I understand it to be current policy to encourage "ampersand escapes" anyway, so I think the script should produce and store these, simulating an AC-compliant user. A Unicode solution, possibly supported by the CF system, can be discussed as an alternative. How to facilitate full-text searches is another topic – Google does it all right for my taste.

As for thread titles, Mudcat can continue to restrict them to characters up to code number 255 and to store them as single-byte strings, meaning ISO 8859-1. Since they will arrive at the server in Unicode, the script can effectively test them and refuse thread creation if necessary, issuing an error message.
Another option is to accept any title, store it as UTF-8, but transform it to "ampersand" form when writing an HTML page (if the thread was created after time X). Cyrillic titles would then be displayed correctly but would have to be somewhat short, assuming the database field is addressed and sized in terms of bytes. (Well, Вещая takes exactly as much UTF-8 space as Vyeshchaya.)

Converting existing messages in the database: we agree that this needs human interference and therefore time. It should be regarded as a separate project, the only interference with the current one being that it will hopefully one day become a matter of the past. Joe has indicated such a conversion to be desirable; it would of course be much easier if he were helped by other users, supported by the script. AC's marker to recognize problematic messages would be quite useful; if, however, any change of database design is taboo, a mark at the beginning of the Body (like a "BOM") can serve as a makeshift. As soon as the whole database is converted (if ever), UTF-8 can be used for everything.

I am still waiting for an official signal, or questions. If Q (14 Feb 11 - 09:55 PM) sums up the general opinion, I can use my time very well otherwise. |
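[Editor's note: the byte arithmetic behind the title-length remark above checks out, since each Cyrillic letter occupies two bytes in UTF-8. A quick Python verification; the 35-byte figure for the ampersand form is this editor's addition.]

```python
title = "Вещая"  # 5 Cyrillic letters
assert len(title.encode("utf-8")) == 10         # 5 letters x 2 bytes each
assert len("Vyeshchaya".encode("utf-8")) == 10  # same as the transliteration

# The "ampersand escape" form is much longer: 7 bytes per letter.
refs = "".join(f"&#{ord(c)};" for c in title)
print(refs)       # &#1042;&#1077;&#1097;&#1072;&#1103;
print(len(refs))  # 35
```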
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 15 Feb 11 - 09:35 AM

Concept Grishka3: The thread page stays as it is; just the Preview checkbox is removed, and the "Submit Message" button is labeled "Reply". Clicking on it inevitably leads to the preview page, which is completely unchanged, but declared UTF-8. The contents, if any, of the entry box are reproduced, processed as follows: if the script encounters any characters above 127, it tentatively converts them as if from ISO 8859-1 to UTF-8, both in the preview and in the corresponding entry box, and includes a big red warning ("3 characters have been converted, please check ...").

Posters using a codepage similar to ISO 8859-1 can usually ignore the warning; the others have to examine the preview for the transformed characters. If they do not want to do that, they may click "Reply" before entering anything. If they pasted their text from elsewhere, they should repaste it into the new box (Ctrl-A, Ctrl-V). Writers of plain English text and diligent escapists (users of the htmlesc software) will never see the warning.

The extra benefit is that at least one preview is compulsory (though it is still possible to enter B***S*** then). |
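[Editor's note: the Grishka3 conversion-and-warning step might look like the following in outline. Python stands in for the actual ColdFusion script, and the warning wording is illustrative.]

```python
def tentative_preview(raw: bytes):
    # Tentatively assume ISO 8859-1 for the submitted bytes, and count
    # the high-bit characters so the preview can show the red warning.
    text = raw.decode("iso-8859-1")
    n = sum(1 for ch in text if ord(ch) > 0x7F)
    warning = (f"{n} characters have been converted, please check ..."
               if n else "")
    return text, warning

text, warning = tentative_preview(b"fa\xe7ade")
print(text)     # façade
print(warning)  # 1 characters have been converted, please check ...
```

Posters whose codepage really was ISO 8859-1 (or close to it) see a correct preview and can ignore the warning; everyone else is prompted to inspect exactly the characters that were guessed.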
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 15 Feb 11 - 09:57 AM Grishka3b: Same as 3, but direct submitting is allowed. If the script encounters no character above 127, it proceeds as it does now, otherwise as in Grishka3. My own favourite is still Grishka1. BTW, a normal Submit button can also be made to open a new tab, as I found out two minutes ago, so AC's "option" can be considered: <FORM ACTION="ThreadNewMess-Sub.cfm" METHOD="POST" TARGET="_blank"> |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 15 Feb 11 - 05:30 PM

Without questioning the value of this research, I feel the need to ask whether anyone has actually looked at whether the problem is "big enough" to merit the rather drastic changes being proposed. While I've occasionally seen a few broken chars when a few "particular someones" paste from another website, the offense is less serious than the far more common typographical errors that we tolerate from the few fat-fingered members who don't read before submitting. (Several specific persons who frequently post broken chars do so only when copying from a few fairly specific and unique other websites. Even their posts are clean when posting from most sites. I do wonder why.) If the problem were really significant, I'd expect to have seen complaints in the threads where char errors happen, and I don't find any such objections in any threads I've read.

An additional question is whether changes made now will be valid for the eventual (pending) standards for "web fonts." The W3C committee has "standardized" the "@font-face rule" for using CSS to designate a remote font, and the capability is reported (June 2009) as being available in all major browsers, but most such implementations violate proprietary rights of the font makers whose fonts are used, and the type foundries are objecting. If CSS is used, there's still a problem if the machine browsing the site doesn't have the font specified, and at present Mudcat likely would be forced to "buy" the specified font in order to put it on the server.

Microsoft has provided for the use of EOT (Embedded Open Type), in which the (encrypted) font is embedded in a document. They haven't had much success with getting people to use it, although IE can, and in some cases does, use it. Other browsers probably can use EOT by now, but it's unclear how much additional burden that use places on website servers.
"All the browsers but Microsoft's, meanwhile, have embraced a technique called "naked" or "raw" font linking, which means uploading ordinary desktop fonts onto servers." (Technology Review, June 2009) This method clearly, in most cases, violates the foundries' licensing terms, and while they've shown commendable restraint, failure to do "something different" may soon result in lawsuits like the ones over music and videos. A fairly recent change in copyright rules means that unlike a short time ago a font can be copyrighted as software, and most of the decent ones are ©. The "Web Open Font Format" (WOFF) is being pursued actively, particularly by a couple of "startup" companies**, but as proposed now that would require web sites to pay annual fees to lease the use of fonts. That seems to be an unreasonable burden to foist on mudcat when the use, as now, of the fonts on users machines is legal and "already paid for" (if you're using legal copies of the programs that provided them). ** The startup company called Small Batch is offering WOFF as "Typekit" and another startup named Kernest is offering to "broker" leases. Mozilla has signed on and typographers are circulating petitions for a standard. Implementation of WOFF also would require significant changes in all browsers, which really means that it won't do a lot of good until our people who are still using Win98/WinME/WinXP buy new computers compatible with the newest browsers. Clearly defining and describing the particular user practices that result in broken chars now, with the existing mudcat setup, and telling people "how to not do that" would be helpful. The ones who are doing it will ignore the request to not do it, and the rest of us will shrug and go on with our reading and posting. John |
Subject: RE: Tech: Non-ASCII character display problems From: Artful Codger Date: 15 Feb 11 - 06:51 PM

It doesn't matter what format is used for transport—as long as the target encoding (affecting the web page entry area) is UTF-x, the user input will be received and handled unambiguously. The browser and OS take care of that; neither the user nor Mudcat need be concerned.

The web pages are not stored; they are constructed from constituent information. So the page encoding should be whatever makes things easiest from the Mudcat scripting side. Yes, UTF-16 may mean that the pages sent (for display) are nearly twice the size (though probably not, if the transport is always UTF-8), and that may mean that UTF-8 is preferable, although it increases the burden on the Mudcat input handling.

Escaping is only encouraged presently because failing to do so results in inconsistent display. If the encoding is changed to UTF, there is no longer any need for users to escape text (except for &, < and >).

Can the HTML tag dictate whether a new tab, rather than a new window, is opened? I greatly prefer tabs, but since many pop-ups make no sense as tabs (and may resize the viewing window(!)), I must leave my default browser setting to open windows. Managing separate windows, however, is a pain, so I'm strongly opposed to forcing a new tab/window to be opened for input. It also adds to management problems when one is simultaneously researching and composing a response; I have enough tabs and windows to manage as it is!

Changing emulation could be streamlined by popping up a "change encoding" tab/window, like the one you're proposing for input. Then the entire thread would not have to be redisplayed, and the display of other messages would not be affected.
Selecting a new encoding directly from the window would be much simpler and more efficient timewise (even with the round-trip to the server) than having to find the encoding using the browser's interface—most users don't even know how to do this! And the setting I proposed to "fix" the encoding (i.e., apply that encoding for other users thereafter) could be incorporated into that interface. Then, only one user has to go through the pain of finding the right encoding for a message. Since ISO-Latin-1 would be assumed as the emulation default, most legacy messages would display properly from the start (as they do now, to most users), even if they were improperly encoded. Without emulation, these messages would appear blotto in a page with UTF-8 encoding specified. Leaving the display pages with no encoding specified is to mire them in the obsolescent past. The sooner legacy messages are updated to Unicode, the better for Mudcat's future. |
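[Editor's note: the server-side "emulation" both posters describe is at bottom an encode/decode round trip: recover the bytes as the page's assumed encoding read them, then decode them under the encoding the poster actually used. A Python sketch; the codepage names are examples.]

```python
def reinterpret(text: str, assumed: str, actual: str) -> str:
    # Recover the original byte stream (it was read as `assumed`),
    # then decode those same bytes as `actual` instead.
    return text.encode(assumed).decode(actual)

# A Cyrillic post read as ISO-Latin-1 shows accented gibberish:
garbled = "Ïðèâåò"  # what a Latin-1 viewer sees
print(reinterpret(garbled, "iso-8859-1", "cp1251"))  # Привет
```

"Fixing" the encoding permanently would just mean running this once and storing the result as UTF-8 or as character references.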
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 16 Feb 11 - 04:09 AM I think we should do first things first, that is stop garbled characters due to codepage discrepancies; stop (resp. minimize) burdening posters with HTML escaping. (The problem is explained in AC's moderated thread.) In order to increase our chances of a quick and waterproof solution, this goal should be achieved
The "target _blank" code, normally resulting in a new tab being created, is already in use by the links labeled "Printer Friendly"; try them out. (Popups, in contrast, are usually effected by JavaScript, which is not the Mudcat way.) (Neither is specifying any fonts other than "Arial", John.) My models Grishka1 and Grishka3 can alternatively be realised without "target _blank". In this case posters of plain English will not notice any difference at all between Grishka3 and the current model. It is also very easy to offer both behaviours via two buttons – this may be the best idea. |
Subject: RE: Tech: Non-ASCII character display problems From: The Fooles Troupe Date: 16 Feb 11 - 07:03 AM QUOTE Can the HTML tag dictate whether a new tab, rather than a new window, is opened? I greatly prefer tabs, but since many pop-ups make no sense as tabs (and may resize the viewing window(!)), I must leave my default browser setting to open windows. Managing separate windows, however is a pain, so I'm strongly opposed to forcing a new tab/window to be opened for input. UNQUOTE Interestingly enough, there are Firefox plugins that when certain of their settings are enabled, can force all 'open in new window's to 'open in new tab's - eg 'Tab Mix Plus', which is one I use on both Windows (7) and Ubuntu. |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 16 Feb 11 - 04:15 PM Foolestroupe - Since it's the common belief that IE is the most archaic and obsolete browser around, my ability to set whether to "open in same Window," or "open in new Window," or "open in new tab" in IE7 (which even Microsoft says is 2 generations obsolete) would imply that all reasonably current browsers probably have that feature. The USER can set which to do (but a website can "force an override"). I set to "open in same window" and either click with the mouse wheel to open in a new tab, or right click to choose which to do. No plugin required. Some of our members have indicated they still may be using really old IE that does not support tabs. Win98 may not be able to run IE versions since tabs were available, and WinXP has no support for "optional updates" so those users may not have updated to the latest IE version (with tabs) that they could run. John |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 16 Feb 11 - 05:05 PM

Since WinXP, Microsoft has supplied some fonts in the form that they call by the technical name of "Big Fonts." In all of those fonts, all the ASCII/ANSI English characters and all the principal "European language" characters are included. The character numbers "printed" by those fonts are the Unicode numbers. Internal font coding in Office programs since Office 2003 has been UTF-16 for the fonts provided by Microsoft. Anyone typing from a Windows computer with WinXP/Office 2003 or later and using one of the big fonts should not need to code any character they can type.

If a person gets a font from somewhere else, it probably is NOT a "big font," and coding of characters outside the "font range" would be necessary. A font from another source may also force "font page" use, which could cause characters to be differently encoded. Fonts with the same names existed on earlier Windows versions, but installing WinXP/Office 2003 or anything later should have upgraded them to "big."

The Microsoft Office "big fonts" are: Arial, Arial Black, Arial Bold, Arial Narrow, Bookman Old Style, Courier New, Garamond, Impact, Tahoma, Times New Roman, Trebuchet (Central and Eastern European languages only), and Verdana®.

At least since Office 2003, Office program installation has selected Tahoma as the default font, and claims have been made that it was "designed for improved web visibility" (but I've failed to find any real advantage to using it). Nearly all surviving Windows computers should have, or can get, the Microsoft "Arial Unicode" font, which contains all Unicode characters up to hex FFFD, but the chars beyond those in the "Big Fonts" will probably require borrowing a CJK keyboard (and someone who can use it) to enter them without coding.

"Currently in the Microsoft Windows operating systems, the two systems of storing text — code pages and Unicode — coexist.
However, Unicode-based systems are replacing code page–based systems. For example, Microsoft Windows® NT 4.0, Microsoft Windows 2000, Microsoft Windows XP, Microsoft Office 97 and later, Microsoft Internet Explorer 4.0 and later, and Microsoft SQL Server 7.0 and later all support Unicode." Unicode Support in Office 2003 Code pages are used by Office only for (some of) the "little fonts." They might also be used for a "free font" from someplace chosen at random. As an example, using Times New Roman (a Big Font) the "shortcut" allowing US users who don't have a "euro key" to use Alt-Numpad 0128 to enter € returns the correct Unicode Hex character number 20AC if you use the Alt-X toggle in Word to flip it back to the char value. Any character that comes "off the keyboard" should be sent by its Unicode char number. So the broken characters that appear at mudcat are due to people using "little fonts" that still use code paging rather than Unicode, loss of the Unicode char values by the mudcat database, or people using something other than Windows programs. John |
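[Editor's note: John's euro-sign example can be checked directly. Alt-Numpad 0128 enters the byte 0x80, which is the euro sign in the Windows-1252 code page; in Unicode the same character is the code point U+20AC that Word's Alt-X toggle reveals. In Python:]

```python
# Byte 0x80 in the Windows-1252 code page is the euro sign,
# whose Unicode code point is U+20AC.
euro = b"\x80".decode("cp1252")
print(euro)             # €
print(hex(ord(euro)))   # 0x20ac
```

This is exactly the codepage-to-Unicode mapping that "big fonts" and Unicode-aware programs perform behind the scenes.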
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 18 Feb 11 - 04:38 AM Joe Offer, please comment. |
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 18 Feb 11 - 05:06 AM You're losin' me, Grishka. I do pretty well technically, but this is more complicated than I have time for. What I'd like first of all, is some way to copy text with distorted characters, paste it somewhere and convert it to the normal character set, and paste the corrected text back into Mudcat. If there's a simple change that can be added to Mudcat pages to make all text universally readable, that would also be nice. But it has to be simple, and I'm not seeing simple so far. -Joe- |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 18 Feb 11 - 06:11 AM Joe, thanks for reacting; this is exactly the kind of feedback we have been missing. "What I'd like first of all ..." has been described by me in a post to AC's thread (18 Jan 11 - 06:07 PM), now deleted. I shall repost it here, if desired. "If there's a simple change that can be added to Mudcat pages to make all text universally readable, that would also be nice." That is exactly what I am striving to provide, at least for any text posted in the future. My models Grishka1 and Grishka3 are really as simple as you can get them; anything AC is suggesting is much more complex. Please try to read my descriptions (13 Feb 11 - 11:24 AM and 15 Feb 11 - 09:35 AM); if you find them incomprehensible, just ask. Once we agree on a model, i.e. we find it desirable provided it works as I claim, we can focus on the details. "I'm not seeing simple so far." That's the problem with threads like this one: the simplest ideas are least likely to be noticed. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 18 Feb 11 - 12:25 PM Joe, "What I'd like first of all ..." will be your reward if you invest some time to discuss here: a little tool, operating on HTML source code directly. Complete description of the simplest version of Grishka3b (without "target _blank"):
|
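The conversion this thread keeps recommending — replacing every character above 0x7F (plus literal &, < and >) with an HTML numeric character reference — is small enough to sketch. This is an illustrative version in the spirit of Grishka's and Artful Codger's tools, not a reproduction of either:

```java
public class HtmlEscaper {
    // Replace &, <, > and everything outside 7-bit ASCII
    // with decimal numeric character references (&#nnnn;).
    static String escape(String text) {
        StringBuilder out = new StringBuilder();
        int i = 0;
        while (i < text.length()) {
            int cp = text.codePointAt(i);            // handles surrogate pairs
            if (cp == '&' || cp == '<' || cp == '>' || cp > 0x7F) {
                out.append("&#").append(cp).append(';');
            } else {
                out.append((char) cp);
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escape("Fräulein & Co. €"));
        // Fr&#228;ulein &#38; Co. &#8364;
    }
}
```

Text escaped this way survives any code-page confusion, because every byte of the result is plain ASCII and the character identity is carried in the reference itself.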
Subject: RE: Tech: Non-ASCII character display problems From: Jeri Date: 18 Feb 11 - 12:56 PM Joe does a lot around here, but he doesn't write any of the code. That's 100% Max. |
Subject: RE: Tech: Non-ASCII character display problems From: Bill D Date: 18 Feb 11 - 01:24 PM An aside, but since John in Kansas mentioned it... Since it's the common belief that IE is the most archaic and obsolete browser around, my ability to set whether to "open in same window," "open in new window," or "open in new tab" in IE7 (which even Microsoft says is two generations obsolete) would imply that all reasonably current browsers probably have that feature. The USER can set which to do (but a website can "force an override"). I set "open in same window" and either click with the mouse wheel to open in a new tab, or right-click to choose which to do. No plugin required. It is true that all browsers support the ability of a website to force "open in new window" or "open in new tab" (the target="_blank" attribute). But this is not always desirable for the user. I use the web filter Proxomitron, and I found a filter written specifically to remove and override that attribute. It is easy enough TO open a link in a new tab or window, but *I* prefer to have the option. (To explore this, search on **Grypen target blank**.) |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 18 Feb 11 - 04:46 PM Jeri, I am aware of that, and of course I would very much welcome Max to join the discussion, or rather to tell us about his own ideas and considerations (as I wrote before). But he doesn't seem to be around, and Joe seems to speak for Mudcat in terms of policy. The first question, yet unanswered, is: assuming it works, would it be worth the effort for the officials, or how much effort? If the answer is sufficiently positive, we can proceed to the details – and maybe Max doesn't need any help with these. Joe, please keep us informed about your and Max's ideas, if Max is not going to write himself. Would you like your reward (ca. 150 lines of Java code, to be processed like CopyUnicode. Usage: Copy, click on "Capture", click on codepage names until satisfied with the display window, paste back; no browser required.)? |
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 19 Feb 11 - 12:52 AM Grishka, how are you going to make good on that No more ranting by Artful Codger pledge? [grin] -Joe- |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 19 Feb 11 - 02:52 PM Joe, "how are you going to make good on that No more ranting by Artful Codger pledge?" Full money-back-guarantee-no-questions-asked! Of course the asterisk-small-print is part of the contract. ;-) *inasmuch as justified and concerning this topic. (Actually we are playing good cop / bad cop; you must know that game.) Is it a deal? And do you want your sales commission? |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 19 Feb 11 - 09:15 PM Re. forcing new window or new tab: one nearly universal use that I find probably acceptable is that when you click to get a "printable version," nearly all sites open the pv in a new tab or new window. A couple of "government sites" open every click in "new," which is somewhat annoying.1 Additionally annoying is that the pdf documents the gov (e.g. SCOTUS) has thoughtfully provided for download all have the same filename, and if you save a dozen without inserting a different name for each one you end up with only the last overwrite. (Some Microsoft servers have begun to slip into the same "one name for all" habit.) But all that "public information" is supposed to be a secret anyway, and it's only there to give the illusion that you've been informed. 1 While the current HTML standard is "posted" at the W3C site, actually reading it requires you to pretty much click to a new URL for each page (or paragraph) of the document and then to click on to each subparagraph of the paragraph; and I believe I saw a note that the 4.01 Spec is close to 400 pages. Three clicks away from the TOC and I'm pretty much lost as to where I am in the doc, without a whole lot of assembly and reformatting, so I may never read the whole thing (again). John |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 21 Feb 11 - 03:38 AM Joe and Max, please tell us why you deem my summary of 18 Feb 11 - 12:25 PM too "complicated". Is it so incomprehensible that you do not want to reflect about it? Or do you fear it might not do the trick, and find my arguments (further above) incomprehensible or unconvincing? Or do you fear it will be too much work compared to the benefit? Or do you plan an implementation later, provided my claims hold water? Or are even better ideas on their way? If we let the discussion sleep now, it is bound to start again from zero in a couple of months, as it did before. Since we do not really enjoy that, it would be a great relief to have some kind of Mudcat-official preliminary result, as specific as possible. "this is more complicated than I have time for." We want to save time for all of us, and at the same time improve the quality of Mudcat considerably. "What I'd like first of all, is some way to copy text with distorted characters, paste it somewhere and convert it to the normal character set, and paste the corrected text back into Mudcat." For this I wrote a tool for you, finished and tested. It can even correct text you accidentally misconverted. To get started with it you have to invest 15 minutes. It will save you many, many hours, nerves, and errors. If you fear it may not work or cause damage, see that CopyUnicode thread. Silver paper with it? Goldverschnürt sogar? ("Even tied with gold ribbon?") Just ask. Happy P-Day across the pond! Yes, we can. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 23 Feb 11 - 05:22 AM I obviously failed to convince those in charge so far, and we are left puzzled once more – tant pis ("too bad"). The tool being finished, I feel I should post it nevertheless, in case some adepts like Artful Codger or Jon Banjo want to study it and either recommend it to Mudcat or use it themselves when asked once more to convert something (saving the trouble of reconstructing the HTML tags manually). Feel free to adapt it to your needs and taste. Note the total absence of potentially harmful code for file manipulation, internet access, system calls, etc. Farewell geekdom, hello life! import java.awt.*; |
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 23 Feb 11 - 05:35 AM Grishka, can you please contact me by e-mail? -Joe- joe@mudcat.org |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST Date: 23 Feb 11 - 05:40 AM ترجم هذه الصفحة |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,.gargoyle Date: 23 Feb 11 - 05:43 AM ترجم هذه الصفحة ("translate this page") — works fine in Arabic; what is the issue?
|
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 23 Feb 11 - 05:46 AM Those are all ampersand codes, garg. How did you create them? -Joe- |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 23 Feb 11 - 07:10 AM Like this? ت ت ر ر ج ج م م ه ه ذ ذ ه ه   ا ا ل ل ص ص ف ف ح ح ة ة John |
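Each letter in John's post appears twice — presumably he typed both the literal character and its decimal numeric character reference, and the site rendered both identically. Going from a decimal reference back to the character is mechanical; this is an illustrative sketch (the helper name `fromNcr` is made up for the example):

```java
public class NcrDemo {
    // Parse a decimal numeric character reference like "&#1578;" to its character.
    static String fromNcr(String ncr) {
        // Strip the leading "&#" and trailing ";" and convert the code point.
        int cp = Integer.parseInt(ncr.substring(2, ncr.length() - 1));
        return new String(Character.toChars(cp));
    }

    public static void main(String[] args) {
        // 1578 is decimal for U+062A, ARABIC LETTER TEH (ت),
        // the first letter of the phrase posted above.
        System.out.println(fromNcr("&#1578;")); // ت
    }
}
```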
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,.gargoyle Date: 23 Feb 11 - 07:11 AM It helps to shut off the nasty 7S script.
|
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 23 Feb 11 - 07:51 AM Coded ✱ thru ❇: ✱ ✲ ✳ ✴ ✵ ✶ ✷ ✸ ✹ ✺ ✻ ✼ ✽ ✾ ✿ ❀ ❁ ❂ ❃ ❄ ❅ ❆ ❇ Typed in Word using Times New Roman, copied and pasted: ✱ ✲ ✳ ✴ ✵ ✶ ✷ ✸ ✹ ✺ ✻ ✼ ✽ ✾ ✿ ❀ ❁ ❂ ❃ ❄ ❅ ❆ ❇ I see the correct glyphs with both methods, in Word and in Preview, but when I clicked Preview, the pasted glyphs were converted (without my permission!!!) to their decimal codes for ✱ thru ❇. (I've noticed this automatic conversion previously here.) Does anyone else see just two rows of asterisks? John |
Subject: RE: Tech: Non-ASCII character display problems From: JohnInKansas Date: 23 Feb 11 - 08:44 AM Explanation: When I paste glyphs outside the ANSI range from Word into the Reply to Thread box, check the Preview box, and click Submit Message, the glyphs generally appear in the preview window as they were in Word, and as they looked when I pasted them. In the Reply to Thread box, though, the pasted glyphs have been transformed into decimal &-code. The ones I coded in hex &-code remain as I typed them. The transformation to code appears to work only for character numbers below where the Unicode charts "go seriously oriental," but I see no real reason to provide for those characters, at least at this time. The range of characters that appears (today) to be "always" correctly converted to decimal Unicode char values seems to encompass all of the "languages" likely to appear here. Locally regionalized keyboards may "print symbols" for some language-specific chars, and people typing in a language for which their keyboard lacks a "foreign char" that they want may not know the correct code. Those are both "typos" that require no modification of the 'cat. If you misspell a werd it's gonna post misspelled, even if what you type (accidentally?) is 癦 instead of €. This conversion is, so far as I've noticed, a relatively new feature at the 'cat, and may have been "intermittent" while the details are being worked out - or maybe it's been here forever and I just didn't notice it. The ONLY broken characters I see in posts here with any consistency are "curlies" that the posting person's computer pastes as "symbols" from a "pseudo-font" that has no direct conversion to Unicode characters. Those do not particularly affect the intelligibility of what's posted, any more than the occasional typo.
Their only use is to give all the Windows users instant recognition of the one regular who uses a Mac sloppily - and probably the reverse for intelligent enough Mac users (if that's not an oxymoron?). I don't see broken chars that uniquely identify 'nix posters, but it's not been a sufficient concern for me to look for them. The "Practice Threads" elicit some wailing about things that don't work as expected. These are 90% user error and 100% things there's no useful reason to post. For those who can use the escapes (and tags) correctly, they may be cute, and they're not particularly harmful; but it's no real problem if they don't post as intended, since the "cutes" have no essential use in conversation. John |
Subject: RE: Tech: Non-ASCII character display problems From: Bill D Date: 23 Feb 11 - 10:47 AM Having set this browser to Times New Roman, I see all the characters above ↑ ▲ properly. |
Subject: RE: Tech: Non-ASCII character display problems From: GUEST,Grishka Date: 23 Feb 11 - 11:28 AM Joe, I entirely trust you to have good reasons to do whatever you are doing. It is not necessary to explain them to me personally in order to stop me pestering the 'cat in this matter, because I won't any more. What I felt I should contribute, I did above, to the best of my competencies, admittedly quite limited. (Of course I shall continue to answer questions.) To focus the discussion in general, only public statements will do. What you wrote here on 18 Feb 11 - 05:06 AM is much better than nothing, for a start. John (23 Feb 11 - 07:51 AM), it's your browser doing that by itself, unfortunately only with "irrelevant" characters; see my post of 14 Feb 11 - 01:23 PM above, reflecting my experiments. To all: the above tool CodepageTurner is designed to convert the raw bytes of "legacy" messages already stored in the Mudcat database, of unknown codepages. This is an operation usually performed by Joe, whom I understood to desire such software (in his post of 18 Feb 11 - 05:06 AM), or his "clones", or "kind souls" he asks for help. When posting, please use Artful's tool "htmlesc" or anything equivalent. |
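The problem CodepageTurner addresses is that the database holds raw high-bit bytes with no record of which code page they were typed under, so the only recourse is to decode the same bytes under candidate code pages until the text reads sensibly. A minimal sketch of that idea (not Grishka's actual tool; the charset names are standard Java ones):

```java
import java.nio.charset.Charset;

public class CodepageGuess {
    public static void main(String[] args) {
        // One stored byte, three plausible interpretations.
        byte[] stored = { (byte) 0xE9 };
        String[] candidates = { "windows-1252", "ISO-8859-5", "ISO-8859-7" };
        for (String cs : candidates) {
            // A human (or a heuristic) picks the decoding that looks right
            // for the language the message appears to be in.
            System.out.println(cs + " -> " + new String(stored, Charset.forName(cs)));
        }
        // windows-1252 -> é, ISO-8859-5 -> щ, ISO-8859-7 -> ι
    }
}
```

Once the right code page is identified, converting the decoded string to numeric character references fixes the message permanently, since the references are code-page independent.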
Subject: RE: Tech: Non-ASCII character display problems From: Joe Offer Date: 23 Feb 11 - 04:08 PM Hi, Grishka- Don't get me wrong - I'm enjoying this, and I like what you've been doing. However, I did have something I wanted to discuss with you directly. -Joe- joe@mudcat.org |
Subject: RE: Tech: Non-ASCII character display problems From: Jack Campin Date: 04 Nov 16 - 09:22 PM I think that may have been spam, but in this context it's extremely hard to tell. How does Google's new international font set (Noto) affect this? I can't install it on this machine, but I'm unusually trailing-edge. |
Subject: RE: Tech: Non-ASCII character display problems From: Jack Campin Date: 06 Nov 16 - 09:38 PM Some insane mod seems to have deleted a post in which I asked: What effect does Google's new Noto font set have on this? Is it possible or desirable to set things up so that "alien" text is handled by one of those? (For me, it would not be too good, since my machine is too old to recognize Noto fonts as valid).
|