The Mudcat Café TM
Thread #135056   Message #3078367
Posted By: Artful Codger
19-Jan-11 - 10:13 PM
Thread Name: Tech: Entering special characters (moderated)
Subject: Tech: Entering special characters (moderated)

Entering Special Characters

Contents of this guide

What are character references (escapes)?
Why should I care? Why use them?
How do I encode them?
Mnemonic escapes
Numeric escapes
Escape charts for selected languages

What are character references?

Mudcat messages are really segments of HTML code--the stuff web pages are made of. Character references are sequences of plain text which allow you to embed special characters in HTML in a portable way, independent of people's system, locale or browser settings. In source code (as when entering a Mudcat message), a character reference is a string of text beginning with an ampersand and ending with a semicolon, and denotes a single character to be displayed.

Example: The sequence &copy; in a web page is rendered as the copyright symbol (©).
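If you'd like to see the decoding step for yourself, here is a minimal sketch using Python's standard html module, which resolves character references roughly the way a browser does. The sample string and variable names are just my own illustration.

```python
import html

# A piece of message source containing a mnemonic character reference.
source = "&copy; 2011 Mudcat"

# html.unescape() replaces each character reference with the character it denotes.
rendered = html.unescape(source)
print(rendered)
```

The printed line contains the copyright symbol in place of the six-character escape.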

The HTML specification calls these special sequences "character references", but hardly anyone uses that term outside of formal documents. Instead, people may call them something like "character escapes", "character entities", "ampersand codes", "escape sequences" or just "escapes". They come in two forms--those that use short mnemonic names (formally, character entity references) and those that use numbers (numeric character references). For brevity, I'll refer to these as "mnemonic escapes" and "numeric escapes", respectively.

[I prefer the term "escape" to "reference" because, historically, escapes were character/byte sequences beginning with a special character or key (ESC) that were given special interpretation; by extension, it now refers to other similar sequences. In the programming world, "reference" generally means something quite different.]

Why should I care? Why use them?

Because you want your text to appear to all users in the same way. You don't want some users to see a question mark or random symbol when what you meant them to see was a right double quote, an accented e, a long dash, or a Cyrillic M. Even if it appears correct to you, it won't necessarily appear the same way to others.

Due to vagaries of the current input system, any character you enter which isn't in the 7-bit ASCII character set and which isn't encoded as a character reference may display improperly to other users, particularly those on different systems, in different locales, using different browsers or with different browser settings. You're really creating invalid HTML when you do that. Character escapes resolve most of the display incompatibility issues.

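Which characters can you safely type in "raw"? What's in the usable ASCII set? Only the unaccented letters A-Z and a-z, the digits 0-9, the space, and the punctuation found on a standard US keyboard, such as . , ; : ! ? ' " ( ) [ ] { } / \ - _ + = * # $ % @ ^ | ~ and `. Everything else should be encoded using escapes. That includes other punctuation, symbols, accented letters, and non-English letters.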

Particularly troublesome are quotes, apostrophes, dashes and ellipses, since word processors tend to replace the ASCII characters you type with "smart" symbols outside the ASCII set. A double quote you type directly in the message box is fine, because it's a "straight" quote (not biased right or left) which is in the ASCII set, but word processors usually turn these into curly quotes, depending on which side of text they lie on—these quotes are not ASCII characters. This applies to the apostrophe as well; typing directly is fine, but pasting from another source is suspect. Dashes other than the short hyphen are non-ASCII. An ellipsis is non-ASCII. All these characters display incorrectly to some segment of users if not encoded.

WARNING: Text copied from other web pages is prone to include these troublesome non-ASCII characters.
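One way to check pasted text before posting is to scan it for anything outside the 7-bit ASCII range. This is only a sketch; the function name find_non_ascii is my own invention, not part of any Mudcat tool.

```python
def find_non_ascii(text):
    """Return (position, character, codepoint) for each non-ASCII character."""
    return [
        (i, ch, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if ord(ch) > 127  # 7-bit ASCII stops at 127
    ]

# A sample with a curly apostrophe, curly double quotes and an ellipsis.
sample = "It\u2019s a \u201csmart\u201d quote \u2026"
for pos, ch, code in find_non_ascii(sample):
    print(pos, ch, code)
```

Anything the scan reports is a candidate for an escape or an ASCII replacement.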

Ampersands and angle brackets should also be encoded when they are used literally instead of as part of an HTML construct. (To be wholly proper, you should encode straight double quotes as well.) Failing to do so creates invalid HTML, but most browsers will treat these lapses benignly. That said, more than one ABC transcription has been corrupted because the posters failed to escape their angle brackets. Because people need these characters to embed HTML directly, Mudcat will never attempt to escape these characters for you, even if it eventually learns to automatically encode or accept all the others.
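For text that is meant to be shown literally (an ABC tune, say), the three characters can be escaped mechanically. Here is a small sketch using Python's standard html.escape; the sample line is made up for illustration.

```python
import html

literal = 'if a < b & b > c: say "hi"'

# quote=False escapes only &, < and >.
basic = html.escape(literal, quote=False)
print(basic)

# quote=True (the default) also escapes straight quotes, the "wholly proper" form.
strict = html.escape(literal, quote=True)
print(strict)
```

Do this before adding any real HTML tags, or the tags themselves will be escaped too.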

How do I encode them?

The surest way is to feed your text through a converter program before you apply any formatting (like adding heading, italic or bolding tags). Currently, Mudcat itself doesn't provide such a utility, but I've written a tiny, downloadable, cross-platform program you can put on your desktop. With just a double-click, it will encode text properly while it's on your clipboard—see HtmlEsc.java.
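To give an idea of what such a converter does, here is a rough Python sketch. It is not the HtmlEsc program itself, just an illustration under my own assumptions: it replaces every non-ASCII character with a decimal numeric escape and leaves ASCII (including any HTML tags) untouched. The name encode_non_ascii is hypothetical.

```python
def encode_non_ascii(text):
    """Replace each non-ASCII character with a decimal numeric escape."""
    # The "xmlcharrefreplace" error handler writes &#NNN; for anything
    # that cannot be represented in ASCII.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

# e-acute, an ellipsis and w-circumflex, written with Python escapes.
print(encode_non_ascii("Caf\u00e9 \u2026 \u0175"))
```

The result is pure ASCII and can be pasted into the message box as is.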

If you only need the occasional character, try this nifty utility from guest Grishka: CopyUnicode (see the second message for download links and description of usage).

You can also use web-based converters, such as this one:
http://www.reconn.us/component/option,com_wrapper/Itemid,62/
To use an online converter, you paste your text into one pane of the converter, click their convert button, then copy the converted text into the Mudcat entry box.

Certain characters—single and double quotes, apostrophes, dashes and ellipses—can be replaced with ASCII near-equivalents just by deleting and retyping them once the text has been pasted to the message entry box. But it's easy to miss some characters, so I still recommend using a converter.
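The delete-and-retype approach can also be automated. The sketch below swaps the usual word-processor punctuation for ASCII near-equivalents; the mapping table and the function name are my own choices, and you may prefer different substitutes (for instance, a single hyphen for a long dash).

```python
# Map "smart" punctuation to plain ASCII stand-ins.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # N-dash
    "\u2014": "--",   # M-dash
    "\u2026": "...",  # ellipsis
})

def to_ascii_punctuation(text):
    """Replace smart quotes, dashes and ellipses with ASCII stand-ins."""
    return text.translate(ASCII_EQUIVALENTS)

print(to_ascii_punctuation("\u201cDon\u2019t\u201d \u2014 wait\u2026"))
```

Characters not in the table (accented letters, for example) pass through unchanged, so a converter is still the safer route.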

You can also hand-code the odd exceptions (say, to add a long dash, accented character or copyright symbol, or to encode a pair of smart quotes). The sections below present the most common escapes you're likely to need, and show you how to find or form the rest.

Mnemonic escapes

Mnemonic escapes consist of an ampersand, a short, predefined mnemonic name, and a terminating semicolon. A separate escape is required for each character. All mnemonics are case-sensitive.

Mnemonic escapes exist for only a restricted subset of all the characters which can be displayed in Mudcat messages, but most of those needed for Western European languages are covered, as well as the most commonly needed symbols. If you need to enter a character which isn't in the mnemonic escape subset, you'll have to use a numeric escape instead. Apart from the Greek letters (such as &alpha; and &Omega;), no mnemonic escapes are defined for non-Roman characters (Cyrillic, Japanese kana, Devanagari...).
[TBD: List the unrecognized mnemonic escapes and give numeric alternatives.]
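The "mnemonic if there is one, numeric otherwise" rule is easy to sketch in Python, whose standard html.entities module carries the HTML 4 name table. The function name escape_char is hypothetical, and whether Mudcat recognizes every name in that table is an open question, as the note above says.

```python
from html.entities import codepoint2name

def escape_char(ch):
    """Return ch as-is if ASCII, else a mnemonic escape, else a numeric escape."""
    cp = ord(ch)
    if cp < 128:
        return ch
    name = codepoint2name.get(cp)  # HTML 4 entity name, if one is defined
    return f"&{name};" if name else f"&#{cp};"

print(escape_char("\u00e9"))  # e with acute: has a mnemonic
print(escape_char("\u0175"))  # w with circumflex: no mnemonic, falls back to numeric
```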

Here are the most commonly needed mnemonic escapes for English-language text:

Char  Mnemonic escape  Description
“     &ldquo;          Left double quote
”     &rdquo;          Right double quote
‘     &lsquo;          Left single quote
’     &rsquo;          Right single quote (commonly used for apostrophe)
-     (none)           Hyphen/short dash (just use a hyphen)
–     &ndash;          N-dash (medium; typically used in number ranges)
—     &mdash;          M-dash (long)
…     &hellip;         Ellipsis
&     &amp;            Ampersand
<     &lt;             Left angle bracket (less-than sign)
>     &gt;             Right angle bracket (greater-than sign)
©     &copy;           Copyright symbol
£     &pound;          Pound (currency) sign
€     &euro;           Euro sign
      &nbsp;           Non-breaking space (useful in series to create an indent at the start of a line)

Mnemonics for accented characters follow a consistent pattern, although they're only defined for certain base letters. The following table summarizes the accented characters for which mnemonic escapes are defined; replace the X in the template with a letter in the following list of valid characters. Accented characters not in this table must be encoded using numeric escapes.

Accent            Example  Template   Valid characters
Acute             á        &Xacute;   A E I O U Y a e i o u y
Grave             à        &Xgrave;   A E I O U a e i o u
Circumflex        â        &Xcirc;    A E I O U a e i o u
Diaeresis/umlaut  ä        &Xuml;     A E I O U Y a e i o u y
Tilde             ñ        &Xtilde;   A N O a n o
Cedilla           ç        &Xcedil;   C c
Caron             š        &Xcaron;   S s
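The template idea can be checked mechanically. This sketch builds a candidate mnemonic from a base letter and an accent name, then confirms it against Python's copy of the HTML 4 entity table; accented_escape is a name I made up for the example.

```python
from html.entities import name2codepoint

def accented_escape(letter, accent):
    """Build e.g. &eacute; from ("e", "acute"); return None if no such mnemonic exists."""
    name = f"{letter}{accent}"
    if name in name2codepoint:
        return f"&{name};"
    return None  # not defined: use a numeric escape instead

print(accented_escape("e", "acute"))
print(accented_escape("n", "tilde"))
print(accented_escape("w", "circ"))  # Welsh w-circumflex has no mnemonic
```

A result of None means the combination isn't in the table and needs a numeric escape.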

Here are some additional characters for which mnemonic escapes are defined.

Char  Mnemonic   Description                  Char  Mnemonic   Description
ß     &szlig;    Eszett (s-z ligature)        ƒ     &fnof;     Florin / function
Æ     &AElig;    A-E ligature                 æ     &aelig;    a-e ligature
Œ     &OElig;    O-E ligature                 œ     &oelig;    o-e ligature
Ø     &Oslash;   Slashed O                    ø     &oslash;   Slashed o
Ð     &ETH;      Eth                          ð     &eth;      eth
Þ     &THORN;    Thorn                        þ     &thorn;    thorn
¡     &iexcl;    Inverted exclamation mark    ¿     &iquest;   Inverted question mark
xª    &ordf;     Feminine ordinal             xº    &ordm;     Masculine ordinal
x°    &deg;      Degree                       x¹    &sup1;     Superscript 1
x²    &sup2;     Superscript 2                x³    &sup3;     Superscript 3
«     &laquo;    Left (double) angle quote    »     &raquo;    Right (double) angle quote
‹     &lsaquo;   Left single angle quote      ›     &rsaquo;   Right single angle quote
„     &bdquo;    Bottom (left) double quote   ‚     &sbquo;    Bottom (left) single quote*
·     &middot;   Middle dot                   •     &bull;     List bullet
†     &dagger;   Footnote dagger              ‡     &Dagger;   Double dagger
§     &sect;     Section                      ¶     &para;     Paragraph (pilcrow)
™     &trade;    Trademark symbol             ®     &reg;      Registered symbol
¢     &cent;     Cent                         ¥     &yen;      Yen
¤     &curren;   Currency                     ¦     &brvbar;   Broken vertical bar
±     &plusmn;   Plus/minus                   ≠     &ne;       Not equal
≤     &le;       Less than or equal           ≥     &ge;       Greater than or equal
×     &times;    Times sign                   ÷     &divide;   Division sign
-     &shy;      Soft hyphen*

Note the reversal of letters between the bottom quote mnemonics: bdquo (bottom double) but sbquo (single bottom).

The soft hyphen is used to suggest a possible hyphenation point in words or long strings of unbroken text, to facilitate line wrapping. It is not displayed unless a browser chooses that point to wrap.

Resist the temptation to embed lots of unnecessary special characters in your posts, since it reduces the ability of other people to copy your text and save it in plain-text files that may not support these characters.

Alphabets for languages often include digraphs: a pair of letters which are treated as a single unit, and which may be ligated (joined together). For instance, "ch" is considered a single letter in many languages, including Welsh, Czech and (until recently) Spanish. While Unicode does include some digraphs as single characters (like æ, IJ and dz), most digraphs should just be spelled with simple character pairs.

For a more complete list of the mnemonic escapes, there are help pages online, such as this one:
http://rabbit.eng.miami.edu/info/htmlchars.html
The HTML standard provides a complete (if less friendly) listing here:
http://www.w3.org/TR/html401/sgml/entities.html

Numeric escapes

To express a character using a numeric escape, you first need to find its Unicode value (technically, its codepoint value). I'll describe how to do that after I present the basic notation.

There are two ways the value can be encoded: as a decimal value or as a hexadecimal (base-16) value. The format of a decimal numeric escape is &#nnnn; where nnnn is the decimal value. The hexadecimal format is &#xhhhh; where hhhh is the hexadecimal value. Note the "#" in both formats and the extra "x" for hexadecimal. You can use x or X; case doesn't matter.

Example: The Welsh character long-w (w with circumflex: ŵ) has a Unicode decimal value of 373, so it can be encoded as &#373;. The same value in hexadecimal is 175, so it can also be encoded as &#x175;
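The same example can be worked out in a couple of lines of Python, where ord() gives a character's codepoint value. The two helper names are my own.

```python
def decimal_escape(ch):
    """Numeric escape using the decimal codepoint value."""
    return f"&#{ord(ch)};"

def hex_escape(ch):
    """Numeric escape using the hexadecimal codepoint value."""
    return f"&#x{ord(ch):X};"

w_circumflex = "\u0175"  # the Welsh long-w
print(ord(w_circumflex))             # the decimal value
print(decimal_escape(w_circumflex))
print(hex_escape(w_circumflex))
```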

Although you will often find values expressed with leading zeros, they aren't required in numeric escapes, and if present, will be ignored.

Nowadays, most charts supply character values in hexadecimal. Hexadecimal requires 6 additional digits, represented by the letters A-F (or a-f; HTML ignores case for hexadecimal digits).
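You can confirm both points (leading zeros and letter case don't matter) with Python's html.unescape, which I'm assuming here behaves like a browser for these simple cases.

```python
import html

# Several spellings of the same escape for w-circumflex (U+0175, decimal 373).
variants = ["&#373;", "&#0373;", "&#x175;", "&#x0175;", "&#X175;"]
decoded = [html.unescape(v) for v in variants]
print(decoded)

# Hexadecimal digits may be upper or lower case: both give e-acute (U+00E9).
print(html.unescape("&#xe9;") == html.unescape("&#xE9;"))

# Converting a chart's hexadecimal value to decimal by hand:
print(int("175", 16))
```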

So now you know the format, but how do you find the Unicode value for a character? Well, you could check the charts at the Unicode Consortium website, viewable/downloadable as PDF files:
http://www.unicode.org/charts/

Fortunately, most operating systems provide a simpler way to browse Unicode character maps (and also check whether the character is available in a given font—at least on your system). For instance, on Macs (running OS 10.5, and probably 10.6), you can open the Character Palette and click on likely character subsets in the scroll list. To obtain a hexadecimal value, click on a character and look at the Unicode field displayed in the Character Info section.
[Can someone provide similar instructions for Windows? Please specify which release you're talking about, since Microsoft likes to change everything around with each release.]
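Whatever your operating system, if you have Python installed you can look values up with its standard unicodedata module. This is just one possible approach, and the helper name describe is my own.

```python
import unicodedata

def describe(ch):
    """Return the codepoint (in hexadecimal) and official Unicode name of a character."""
    return f"U+{ord(ch):04X} {unicodedata.name(ch, '<unnamed>')}"

# From a character (pasted or typed) to its value and name:
print(describe("\u0175"))

# From an official Unicode name to the character and its value:
ch = unicodedata.lookup("LATIN SMALL LETTER W WITH CIRCUMFLEX")
print(f"{ord(ch):X}")
```

The hexadecimal value printed is what goes after the "x" in a hexadecimal numeric escape.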

The full set of characters for a particular language may be spread out in multiple subranges of the Unicode tables. So it often helps to have a crib chart just for that language (and I'll attempt to supply some). A little web surfing may turn up a chart for your language of interest. For non-Roman alphabets, using a converter is a better option.

Sometimes a single displayed character is composed of multiple components (for instance, a base letter followed by one or more combining accents), in which case it must be expressed as a sequence of codepoints, each written as a separate HTML escape. A character whose value lies beyond the 16-bit range, such as a Cuneiform sign or one of the rarer Chinese ideographs, still takes only one escape: just write out the full value, which will have five or six hexadecimal digits. There are also many codepoints for applying special formatting, such as to reverse writing direction for Arabic and other Semitic scripts. But a full discussion of Unicode is beyond the scope of this guide.
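To illustrate the multiple-component case, here is a small sketch that splits a character into a base letter plus a combining mark and writes each non-ASCII codepoint as its own escape. The function name escape_codepoints is hypothetical, and the example covers only this one simple case.

```python
import unicodedata

def escape_codepoints(text):
    """Write every non-ASCII codepoint as a separate hexadecimal escape."""
    return "".join(
        ch if ord(ch) < 128 else f"&#x{ord(ch):X};"
        for ch in text
    )

# w-circumflex as one precomposed codepoint...
print(escape_codepoints("\u0175"))

# ...and as a plain w followed by a combining circumflex (U+0302).
decomposed = unicodedata.normalize("NFD", "\u0175")
print(len(decomposed))
print(escape_codepoints(decomposed))
```

Both forms should display as the same letter, though the single-codepoint form is the more dependable choice when one exists.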