Subject: Tech: Entering special characters (moderated) From: Artful Codger Date: 18 Jan 11 - 03:40 PM ATTENTION: The guide below shows you how to properly enter special characters in your messages—curly quotes, long dashes, symbols, and accented and non-Roman characters (like Cyrillic, Hebrew and Japanese kana). It explains what HTML character references are, why and when people should be using them, how to encode the common ones, and where they can go for more information. Also described are the display problems that occur when people don't encode such characters properly and some tools that help you convert or insert text. This thread is not intended as a discussion thread. You may post corrections, questions and additional material to be incorporated into the guide, but comments on display problems or improvements to Mudcat character handling should be posted instead to the thread Tech: Non-ASCII character display problems. Recent changes:
This is an edited PermaThread® for the description of HTML character references. This thread will be edited by Artful Codger, who will consolidate the information posted here into a technical guide. Feel free to post to this thread, but remember that all messages posted here are subject to editing or deletion. -Joe Offer- |
Subject: Tech: Entering special characters (moderated) From: Artful Codger Date: 19 Jan 11 - 10:13 PM
Entering Special Characters
Contents of this guide
What are character references (escapes)?
Why should I care? Why use them?
How do I encode them?
Mnemonic escapes
Numeric escapes
Escape charts for selected languages
Particularly troublesome are quotes, apostrophes, dashes and ellipses, since word processors tend to replace the ASCII characters you type with "smart" symbols outside the ASCII set. A double quote you type directly in the message box is fine, because it's a "straight" quote (not biased right or left) which is in the ASCII set, but word processors usually turn these into curly quotes, depending on which side of text they lie on—these quotes are not ASCII characters. This applies to the apostrophe as well; typing directly is fine, but pasting from another source is suspect. Dashes other than the short hyphen are non-ASCII. An ellipsis is non-ASCII. All these characters display incorrectly to some segment of users if not encoded. WARNING: Text copied from other web pages is prone to include these troublesome non-ASCII characters. Ampersands and angle brackets should also be encoded when they are used literally instead of as part of an HTML construct. (To be wholly proper, you should encode straight double quotes as well.) Failing to do so creates invalid HTML, but most browsers will treat these lapses benignly. That said, more than one ABC transcription has been corrupted because the posters failed to escape their angle brackets. Because people need these characters to embed HTML directly, Mudcat will never attempt to escape these characters for you, even if it eventually learns to automatically encode or accept all the others.
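As a rough illustration of escaping literal ampersands and angle brackets before posting, here is a minimal Python sketch. It is not part of Mudcat; the function name escape_markup and the sample line are made up for the example, and html.escape comes from the Python standard library.

```python
import html

def escape_markup(text: str, quotes: bool = False) -> str:
    """Replace literal &, < and > with mnemonic escapes.

    escape_markup is a hypothetical helper for this example.
    With quotes=True, straight double quotes become &quot; as well.
    """
    return html.escape(text, quote=quotes)

# An invented line with ABC-style angle brackets and a literal ampersand.
sample = "A>B c<d | Fish & Chips"
print(escape_markup(sample))
print(escape_markup('say "hi"', quotes=True))
```

Run this only on text you mean literally. If you are typing real HTML tags into a message, escaping them would turn the tags into visible text.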
Mnemonic escapes consist of an ampersand, a short, predefined mnemonic name, and a terminating semicolon. A separate escape is required for each character. All mnemonics are case-sensitive.
Mnemonics for accented characters follow a consistent pattern, although they're only defined for certain base letters. The following table summarizes the accented characters for which mnemonic escapes are defined; replace the X in the template with a letter in the following list of valid characters. Accented characters not in this table must be encoded using numeric escapes.
Here are some additional characters for which mnemonic escapes are defined.
Note the reversal of letters between the bottom quote mnemonics: bdquo (bottom double) but sbquo (single bottom).
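If you want to check a mnemonic name before using it, the Python standard library carries a copy of the HTML 4 mnemonic list (html.entities.name2codepoint). The sketch below assumes that list is a fair stand-in for what browsers accept; the helper name mnemonic_escape is invented for the example.

```python
import html
from html.entities import name2codepoint

def mnemonic_escape(name: str) -> str:
    """Build '&name;' for a known HTML 4 mnemonic (hypothetical helper)."""
    if name not in name2codepoint:
        raise KeyError(f"no HTML 4 mnemonic named {name!r}")
    return f"&{name};"

for name in ("eacute", "Eacute", "bdquo", "sbquo"):
    escape = mnemonic_escape(name)
    # Print the escape, its Unicode value, and an ASCII-safe view of the character.
    print(escape, "U+%04X" % name2codepoint[name], ascii(html.unescape(escape)))

# Mnemonics are case-sensitive: the all-caps spelling is not in the list.
print("EACUTE" in name2codepoint)
```

Note that eacute and Eacute are different characters (lowercase and uppercase), which is why case matters.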
Example: The Welsh character long-w (w with circumflex: ŵ) has a Unicode decimal value of 373, so it can be encoded as &#373;. The same value in hexadecimal is 175, so it can also be encoded as &#x175;. Although you will often find values expressed with leading zeros, they aren't required in numeric escapes, and if present, will be ignored. Nowadays, most charts supply character values in hexadecimal. Hexadecimal requires 6 additional digits, represented by the letters A-F (or a-f; HTML ignores case for hexadecimal digits). So now you know the format, but how do you find the Unicode value for a character? Well, you could check the charts at the Unicode Consortium website, viewable/downloadable as PDF files: http://www.unicode.org/charts/ Fortunately, most operating systems provide a simpler way to browse Unicode character maps (and also check whether the character is available in a given font—at least on your system). For instance, on Macs (running OS 10.5, and probably 10.6), you can open the Character Palette and click on likely character subsets in the scroll list. To obtain a hexadecimal value, click on a character and look at the Unicode field displayed in the Character Info section. [Can someone provide similar instructions for Windows? Please specify which release you're talking about, since Microsoft likes to change everything around with each release.] The full set of characters for a particular language may be spread out in multiple subranges of the Unicode tables. So it often helps to have a crib chart just for that language (and I'll attempt to supply some). A little web surfing may turn up a chart for your language of interest. For non-Roman alphabets, using a converter is a better option. Sometimes, a single character may either be composed of multiple components or have a value beyond the 16-bit range, in which case it may need to be expressed as a sequence of codepoints, each codepoint expressed as a separate HTML escape. 
Examples are Chinese pictographs and Cuneiform. There are also many codepoints for applying special formatting, such as to reverse writing direction for Arabic and Semitic script. But a full discussion of Unicode is beyond the scope of this guide. |
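For anyone who would rather compute numeric escapes than look them up, here is a small Python sketch of the decimal and hexadecimal forms described above. The function names are invented for the example; ord() is the built-in that returns a character's Unicode value.

```python
import html

def decimal_escape(ch: str) -> str:
    """Decimal numeric escape for one codepoint, e.g. '&#373;'."""
    return f"&#{ord(ch)};"

def hex_escape(ch: str) -> str:
    """Hexadecimal numeric escape for one codepoint, e.g. '&#x175;'."""
    return f"&#x{ord(ch):X};"

def escape_each_codepoint(text: str) -> str:
    """One hexadecimal escape per codepoint in the text."""
    return "".join(hex_escape(ch) for ch in text)

w_circumflex = "\u0175"                    # Welsh long-w
print(decimal_escape(w_circumflex))        # decimal form
print(hex_escape(w_circumflex))            # hexadecimal form

# A letter built from components (w + combining circumflex) needs two escapes.
print(escape_each_codepoint("w\u0302"))

# In Python a character beyond the 16-bit range is still a single codepoint,
# so it yields a single escape with a larger value (a cuneiform sign here).
print(escape_each_codepoint("\U00012000"))

# Leading zeros are ignored when the escape is decoded.
print(html.unescape("&#0373;") == w_circumflex)
```

The decoding check at the end uses the standard library's html.unescape, which follows HTML parsing rules; a browser should behave the same way for these values.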
Subject: Tech: Special characters by language From: Artful Codger Date: 20 Jan 11 - 10:18 PM Below are some language-specific charts of special characters. If there are important characters I've missed, please let me know.
Acute = čárka, caron = haček, circle = kroužek. For tall letters, haček (caron) often appears as an apostrophe on the right (an indivisible part of the letter). The digraph "ch" is treated as a single letter, lying between H and I alphabetically, and sorting accordingly (for instance, in dictionaries and name lists). But it is spelled with two letters, as in English. Dz and dž are also commonly spelled with letter pairs.
Polish alphabet: |
Subject: Tech: Entering special characters From: Artful Codger Date: 22 Jan 11 - 10:41 PM
Belarusian alphabet: a b c ć č d dz dž e f g h i j k l m n ń o p r s ś š t u v y z ź ž Bulgarian alphabet (Latinized): a b v g d e ž z i j k l m n o p r s t u f x c č š št ã ' ju ja Croatian (and Bosnian) alphabet: a b c č ć d dž đ e f g h i j k l lj m n nj o p r s š t u v z ž Macedonian alphabet (Latinized; no non-English characters are used): a b v g d gj e zh z dz i j k l lj m n nj o p r s t kj u f h c ch dzh sh Serbian, normally written in Cyrillic, is Romanized using the same alphabet as Croatian. Serbian alphabet order: a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š Slovenian alphabet: a (ä) b c č d e f g h i j k l m n o (ö) p r s š t u (ü) v z ž |
Subject: Tech: HTML formatting From: Artful Codger Date: 10 Feb 11 - 01:51 AM If you're curious how to add italics, bolding, links, headings, lists, tables, block quotes and other special formatting to your Mudcat messages, check out these other threads:
|
Subject: RE: Tech: Moderated thread on ampersand escapes From: GUEST,Grishka Date: 10 Feb 11 - 03:34 PM If desired, my friends and I can donate 200 lines of Java code forming a tool which displays all the Unicode characters and copies any of them to the clipboard upon single-clicking. That is: Click on the character (if necessary, scroll in the table to find it), click into the Mudcat entrybox, press Ctrl-V, voilà. Leenia, ArtfulCodger is going to explain that notion so that you understand it. It is required for anyone who wants to post to Mudcat anything beyond plain English text, for example schöne Grüße or 卄开尸尸丫 く卄工丼乇丂乇 丼乇山 丫乇开尺 !. |
Subject: RE: Tech: Moderated thread on ampersand escapes From: Artful Codger Date: 10 Feb 11 - 05:58 PM To Grishka: Yes, please post your Java source in a new thread, and let me know the link. |
Subject: RE: Tech: Moderated thread on ampersand escapes From: JohnInKansas Date: 10 Feb 11 - 09:08 PM In "plain English" an escape character is one designated in the language the computer is using that tells the computer to treat what follows in a special way. The usual and most common usage is to tell the interpreter that what follows immediately after the "escape" is a code of some kind, and not just simple text to be copied and displayed. The & character is designated in HTML as being an "escape" character that means that what follows it has a special meaning. Because of this special usage, in ancient times it was necessary to "use the escape code to code the display of the & character" so we had to type &amp; to "just type" &. At mudcat, instead of "the & character is an escape" the interpreter has been told "the & character is an escape unless it's followed by a blank space" and we can now (usually) just type & if we just want to display & - as long as there's a space before and after it. The HTML standard includes the ability to designate any typographical character by using the character number assigned to the character. Each character is assigned a unique number. A variety of "character definitions" have been used, beginning with DOS and Unix, later merged into the ANSI standard definition for most uses. The current "most complete" definition of character numbers is the Unicode Standard. All of the later sets of character definitions are intended to be able to "include" all earlier ones, but there are cases where this doesn't always work. Artful Codger will explain to us when and why it sometimes fails. If you want to use the character numbers to be sure that the character you intend is sent to the html interpreter, you must type an & to tell the interpreter that "code is coming." To use character numbers, the & must be immediately followed by a # character to tell the interpreter that what follows is a number. 
The decimal number for the character must follow immediately after, and the "code" must be ended with a semicolon ;. If you type &#0169; the "copyright" symbol © should be displayed in your html post, since the decimal number 0169 is assigned to that symbol in all of the various standards. Because of variations in different Operating Systems, and different ways in which information is sent between computers, there are several different ways in which information "in transit" can be "encoded." Failures to get an accurate transmission and interpretation of the characters you attempt to send, when someone else receives and displays them, may result from these differences. Artful Codger will explain these problems, and what you can do about them. Fewer errors may result if the Hexadecimal numbers are used for characters that are assigned "bigger numbers." The Unicode Standard uses only hex numbers to define the characters. As before, to "post" a specific character using the "hex number" assigned to it, you begin with the & "escape" to tell the interpreter that you're using a code. The & must be immediately followed by the # character to say that the code is a number. The # character must be immediately followed by an x (or X) to tell the interpreter that the number is in hexadecimal format. The hex number follows immediately, and the code is ended with a semicolon ;. The decimal number 169 is the same as the hexadecimal number 00A9. Typing &#x00A9; should display the "copyright" symbol © in an html post. The above is the "simple" explanation of the basics of posting characters using the numerical "names" assigned to characters in the standards. Additional information will follow in later posts. John |
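The step-by-step reading of a numeric escape described above (ampersand, number sign, optional x, digits, semicolon) can be sketched in Python roughly as follows. This is only an illustration of the format, not how any browser or the Mudcat software is actually implemented; the names NUMERIC_ESCAPE and decode_numeric_escapes are invented for the example.

```python
import re

# &  then  #  then either x/X + hex digits, or decimal digits, then ;
NUMERIC_ESCAPE = re.compile(r"&#(?:[xX]([0-9A-Fa-f]+)|([0-9]+));")

def decode_numeric_escapes(text: str) -> str:
    """Replace each numeric escape with the character it names."""
    def replace(match: re.Match) -> str:
        hex_digits, dec_digits = match.groups()
        if hex_digits is not None:
            value = int(hex_digits, 16)   # hexadecimal form, case ignored
        else:
            value = int(dec_digits, 10)   # decimal form, leading zeros ignored
        # chr() raises ValueError for numbers beyond the Unicode range.
        return chr(value)
    return NUMERIC_ESCAPE.sub(replace, text)

# All three spellings name the copyright symbol, U+00A9.
decoded = decode_numeric_escapes("&#0169; &#x00A9; &#XA9;")
print(decoded == "\u00a9 \u00a9 \u00a9")
```

Anything that does not match the whole pattern, such as a bare & followed by a space, is left untouched by this sketch.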
Subject: RE: Tech: Moderated thread on ampersand escapes From: JohnInKansas Date: 10 Feb 11 - 11:47 PM Any character defined in the ISO or ANSI character set standard can be posted to html using the decimal character number in the form: &#nnn; where
& is the html "escape" to designate that a code follows
the number symbol # designates a numerical form for the code
nnn is the decimal number assigned to the character
a semicolon ; ends the code.
An important correction: One must use the Unicode values. ISO(-8859-1) and ANSI character values largely coincide with Unicode values, but there are also many differences. -Artful Codger-
|
Subject: RE: Tech: Moderated thread on ampersand escapes From: JohnInKansas Date: 11 Feb 11 - 01:13 AM Characters that might not be on your keyboard can be posted using the ampersand-escape method, but for numerous reasonably common characters it is also possible to use an ampersand escape with the code being a name for the character rather than a character number. To do that you would use the form: &name; The & once again tells the html interpreter that it's a code. Without the # to tell it otherwise, the interpreter reads the code as a character string – i.e. as a "word." The semicolon ; ends the code. The characters for which "name codes" exist are called "named character entities" by the HTML standards, frequently shortened to just "character entities." There are a few character entities defined by various versions of the HTML standard, and several others that are in "common use" and can normally be expected to work. Most "HTML Textbooks" don't distinguish between the "specified" ones and the ones in general use. Artful Codger may explain which are to be considered "useful" at mudcat, or perhaps just which ARE NOT to be considered okay here. Note that in some cases "what you type" may be case sensitive. The following are the named character entities that my rather old handbook shows:
Note that the above "tabulation" uses a <pre> and </pre> tag and should show in a monospace type. If columns don't show, or if someone copies and pastes it, selecting Courier or some other mono font may straighten them out. If incorporated into the "permanent" thread, they probably should be made into a real table(?). These named char entities are the ones commonly used in English/Roman text, and include those named in the html standard. I have NOT checked to see how many are included in the list that aren't "standard" although all of these should be read by most html website interpreters. Other languages may include other entities based on common usage, but they're probably unlikely to be recognized correctly at mudcat. When in doubt, the numerical code should be used. Decimal character numbers from 130 through 150 are shown in my (HTML ver 4) handbook as "sometimes won't work." I don't at present have an answer as to why they're listed if they don't; but maybe we'll get to that later. John |
Subject: RE: Tech: Moderated thread on ampersand escapes From: Artful Codger Date: 11 Feb 11 - 03:32 AM All mnemonic escapes (character entity references) are case-sensitive, according to the standard. The official list of mnemonic escapes is defined here: http://www.w3.org/TR/html401/sgml/entities.html The organization, however, leaves something to be desired, being neither by value nor by coherent subgrouping (acutes together, or accented a's together, or paired characters together). Here, I'm only concerned with which standard mnemonics are (and aren't) understood by Mudcat, though perhaps that may depend on the user's browser more than on Mudcat. With my browser (Firefox), I've found only a handful of oddballs. The numeric escapes should always work, regardless of browser—as long as the defined character is printable and defined in the reasonably standard set of fonts. As it happens, values x80-x9F (decimal 128-159) are control codes in the Unicode map, so it's not surprising if they "sometimes won't work." I should point out that mnemonic escapes are defined for a number of characters with values > 255; John's table is correct as far as it goes, but incomplete. In particular, there are mnemonics for most of the "troublesome" characters I mentioned, as well as for other quote symbols, currency symbols, a bullet, daggers, lone diacriticals, and additional ligatures and mathematical symbols. |
Subject: RE: Tech: Moderated thread on ampersand escapes From: GUEST,Grishka Date: 11 Feb 11 - 04:47 PM Here is CopyUnicode.java, as requested (10 Feb 11 - 05:58 PM). That tool and this thread may help to alleviate the symptoms, but a good cure would be preferable, which I think would also take less effort than what we are doing here. [TDB: Provide links to such threads.] – one tightly moderated thread would suffice, in which the official Mudcat policy is explained and discussed. TIA. ♫ Imagine there's no encoding problem; it isn't hard to do! ♫. Have fun with CopyUnicode anyway. Grishka |
Subject: RE: Tech: Moderated thread on ampersand escapes From: GUEST,Grishka Date: 11 Feb 11 - 04:52 PM Sorry, I was just intoxicated by the prospect. This is the thread, of course. |
Subject: RE: Tech: Entering special characters From: Artful Codger Date: 11 Feb 11 - 10:12 PM Now, how can you tell if you've entered text properly? The short answer is, not easily. That's because what you see (given your system, locale and browser settings) isn't necessarily what other people will, so previewing will often mislead you in this regard. You just have to know which characters are valid "raw" and which aren't, or are suspect. The ASCII set I listed is safe, anything else isn't. The input system here does automatically encode some characters for you, and if you preview your text, you'll see the resulting escapes in the input box (in the preview area, they'll show as you expect them to do). Unfortunately, the characters most likely to be converted are the ones you're least likely to use. To explain this, I have to get a bit nerdy. The input/display system is still centered around codepages. Until recent computing times, memory was relatively limited, so early character sets were limited—specifically to a set of 256 characters, corresponding to the number of values one could represent in a byte. As you can imagine, different areas had need of different sets of characters, and there was also an increasing demand for symbols of all sorts. So a variety of codepages developed—various mappings of these 256 values to different sets of characters—and schemes developed for switching between them, so that a user could intermix characters that resided in different codepages. A user's default codepage would thus receive the lion's share of usage, with minimized forays into other codepages, managed largely under the covers. To add to the confusion, different systems evolved their own sets of codepages; it was only late in the game that standards committees tried to impose some uniformity. 
To make matters worse, most text files had no provision to indicate which codepage was used to create them, so if one was transferred to a different system and opened, it might be interpreted using a different character mapping, and the result could be garbage. That's the behavior we see here. What appears to be happening is this: Your browser typically has a default codepage setting active, appropriate to your locale, and it's used for most of the text you're most likely to peruse. But by default (i.e., when no encoding is explicitly stated for a web page, as currently on Mudcat), HTML pages are supposed to be encoded using only 7-bit ASCII (which is a subset of most other codepages, at least in Western Europe and the Americas). This comprises only the first 128 values. How, then, should the browser handle the remaining 128 values, when they are encountered? It could skip them, or report that the page is invalid, or replace them with some "I don't know" character like the query. But most browsers interpret these values according to your default codepage. The result is that if you enter a character value in the range 128-255, it is rendered differently from one user environment to another. But you're only typing in characters, and the browser shows you those characters, and you can type in characters that are outside your codepage and they'll show up, too! So why doesn't the input system preserve the characters it knows you're entering, encoding them if necessary? Well, it kinda does and it kinda doesn't. For manipulating the characters internally, it probably converts them all to 16-bit or 32-bit Unicode. But when it returns collected input (as from the message box), it converts the text according to the specified encoding for the web page. If no encoding is specified, it returns (byte) values to the Mudcat software according to the poster's default or selected codepage—but without indicating which codepage that was. 
If the character isn't in the poster's codepage, one of two possibilities seems to be occurring (which have the same result): either the browser encodes the text as a character reference and returns that sequence value-by-value to Mudcat, or it hands Mudcat the Unicode value for that character, and Mudcat software encodes the value as character reference (since you can't store a 16-bit value in a single byte). Either way, life is good: the character ends up stored as a character reference, and should display properly on all systems (font problems aside). The bad news is that characters in the 128-255 range according to the poster's effective codepage are left unencoded, and are ambiguously rendered. This means that even though a character like a right double quote has a Unicode value well beyond the problem range, it will likely be handed to Mudcat using one of the problem range values instead of being automatically encoded. So let's consider an example: suppose a user copies some Cyrillic text off the net (the source doesn't matter) and pastes it into a Mudcat message. What will happen? We have to consider two scenarios. If the poster is a Western European, generally using Western European settings, his codepage will not include Cyrillic characters, so they'll all be automatically encoded as escapes and will display properly to virtually all users. Not so for non-ASCII characters like the right double quote, long dashes, or copyright symbol, all of which are included in the poster's codepage. These will remain unencoded, and will be ambiguously displayed to other users. If the poster is a Russian, his codepage might be KOI-8 or ISO-8859-5 or CP-866, all of which map Cyrillic, but to different values in the trouble range. In this case, all the text will remain unencoded, mostly with trouble range values, so that Western European viewers and even Russian viewers with a different default codepage will see garbage. 
That's why it's so important to encode all non-ASCII characters as escapes, and not rely on the input system to do it for you. There have been many suggestions about how to cure this situation, so that what you enter is what others see, and you don't have to do any special encoding (aside from the occasional ampersand and angle bracket). But each solution has drawbacks in regard to old threads, where the text has already been entered improperly and no automated fix is feasible. As I've said at the beginning of this thread, don't post your suggestions for fixing the problem here; post them to another thread (like this one). |
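To make the codepage problem concrete, here is a short Python sketch. It only simulates the situation using Python's own codecs (koi8_r, cp1251, latin_1); it makes no claim about what the Mudcat software or any particular browser does internally, and the function names misread and to_escapes are invented for the example.

```python
def misread(text: str, written_as: str, read_as: str) -> str:
    """Store text as bytes in one codepage, then interpret them in another."""
    return text.encode(written_as).decode(read_as)

def to_escapes(text: str) -> str:
    """Leave ASCII alone; turn every other character into a decimal escape."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

russian = "\u043f\u0440\u0438\u0432\u0435\u0442"   # a Russian word in Cyrillic

# Same bytes, same codepage on both ends: the text survives.
print(misread(russian, "koi8_r", "koi8_r") == russian)

# Same bytes read with a different Cyrillic or Western codepage: garbage.
print(ascii(misread(russian, "koi8_r", "cp1251")))
print(ascii(misread(russian, "koi8_r", "latin_1")))

# Encoded as escapes, the text is pure ASCII and no longer depends on
# anyone's codepage.
print(to_escapes(russian))
print(to_escapes("\u201csmart quotes\u201d"))
```

The last two lines show the safe form: only ASCII characters remain, so there is nothing left for a reader's codepage to misinterpret.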
Subject: RE: Tech: Entering special characters (moderated) From: JohnInKansas Date: 11 Feb 11 - 10:39 PM Re comment added at 10 Feb 11 - 11:47 PM. The Unicode value for the Euro is as stated. There is no official "euro" in the ANSI standard. The decimal number 0128 was unassigned by ANSI, so MICROSOFT decided to use it for Windows computers released in the US and elsewhere outside the Euro nations, where keyboards don't have a key for the euro symbol. The assignment of the euro symbol to decimal number 0128 is essentially a "Windows" font table extension that Microsoft describes as allowing the use of the Alt-NumPad method of entry. With NumLock turned on, you can hold down the Alt key while typing 0128 on the number pad, (the leading zero is required on my machine) and a euro symbol will be inserted into your document. If, however, you put your cursor (in Word) directly following the euro symbol you've inserted, and click Alt-X to toggle it to the Unicode value for the character, you get "20AC" which is the correct hex number (decimal 8364) for the Unicode euro character. I don't know how many other operating systems may have adopted the "0128" shortcut; but it's NOT an ANSI thing, it's just Microsoft trickery. John |
Subject: RE: Tech: Entering special characters (moderated) From: Artful Codger Date: 11 Feb 11 - 10:53 PM That doesn't surprise me. The chart I was referring to says that ANSI defines x80-x9F as control codes, but that ISO-8859-1 defines them for a variety of characters (quotes, dashes, daggers, TM, ...). They may have meant MS's implementation of ISO-8859-1. But the ISO mapping on my browser (Firefox on Mac) does appear to include these characters. |
Subject: RE: Tech: Entering special characters (moderated) From: JohnInKansas Date: 21 Sep 12 - 04:27 PM Artful C - A possible correction to a correction you made a little above is that the Euro is not an ANSI character, or at least wasn't the last time I looked a year or so ago. That decimal character number was, and apparently still is, defined in ANSI as a "reserved number" with no assigned meaning. That allows any individual font designers and/or programmers to assign any character they need to that number. (And Unicode includes a fairly large number of such "unassigned" or "reserved" character numbers.) Microsoft simply elected to use the Alt-Numpad-128 method to allow users without a € key on their keyboard an easier way to enter the new symbol. When you use the Alt-Numpad-128 method in recent Word versions, the character actually printed has the correct Unicode numerical value of Hex 20AC. The transformation in Windows is actually done in the "character code pages" that flip in and out of RAM during use of programs. Essentially, the use of Alt-Numpad-128 for the € is just a keyboard shortcut, added as a default in recent Windows versions, that enters something other than what you type. John |
Subject: RE: Tech: Entering special characters (moderated) From: Artful Codger Date: 22 Sep 12 - 02:16 AM To avoid potential confusion, I've removed the Euro example from my correction. The main point of the correction (that ISO and ANSI values don't always coincide with Unicode in the high-8-bit range, and that only the Unicode values are guaranteed to work in numeric references) stands. Since the Euro is likely to be mapped somewhere in one's native codepage, pasting or typing a Euro symbol directly is likely to leave the character improperly encoded. Best to encode the Euro as &#8364; (or with the mnemonic &euro;). That's all people need to know here— the rest just clouds the issue. |
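For the curious, the Euro discussion above can be checked with a few lines of Python. The sketch uses Python's codec names (cp1252 for the Windows Western codepage, latin_1 for ISO-8859-1, iso8859_15 for ISO-8859-15) as stand-ins for the codepages being discussed; the helper name byte_in_codepage is invented for the example.

```python
import html

EURO = "\u20ac"

def byte_in_codepage(ch: str, codepage: str):
    """Return the single byte value a codepage uses for ch, or None if absent."""
    try:
        return ch.encode(codepage)[0]
    except UnicodeEncodeError:
        return None

# The Unicode value is the one to use in numeric escapes.
print(ord(EURO), hex(ord(EURO)))

# The same character sits at different byte values in different codepages,
# and is missing from ISO-8859-1 altogether.
for codepage in ("cp1252", "iso8859_15", "latin_1"):
    print(codepage, byte_in_codepage(EURO, codepage))

# Both escape forms decode to the Euro regardless of codepage.
print(html.unescape("&#8364;") == EURO, html.unescape("&euro;") == EURO)
```

The byte value 128 belongs to the Windows codepage only, which is why the Unicode value 8364 (hex 20AC) is the safe one to post.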