The Mudcat Cafe



Tech: Entering special characters (moderated)

Related threads:
Mudcat HTML Guide PermaThread (64)
tech: HTML Ampersand Codes (33)
Tech: Non-ASCII character display problems (47) (closed)
Tech: HtmlEsc.java: Convert special chars (6)
Tech: CopyUnicode: Create any char (17)
Tech - ALTKEY Codes on Laptop (28)
HTML Stuff II (126)
Tech: htmlesc.py: Mac script to escape text (12)
HTML Tables (19)
Clickable Links (14)
HTML Beginners Study Guide (3)



Subject: Tech: Entering special characters (moderated)
From: Artful Codger
Date: 18 Jan 11 - 03:40 PM

ATTENTION: The guide below shows you how to properly enter special characters in your messages—curly quotes, long dashes, symbols, and accented and non-Roman characters (like Cyrillic, Hebrew and Japanese kana). It explains what HTML character references are, why and when people should be using them, how to encode the common ones, and where they can go for more information. Also described are the display problems that occur when people don't encode such characters properly and some tools that help you convert or insert text.

This thread is not intended as a discussion thread. You may post corrections, questions and additional material to be incorporated into the guide, but comments on display problems or improvements to Mudcat character handling should be posted instead to the thread Tech: Non-ASCII character display problems.
Recent changes:
  • 14 Feb 2011: Charts added for Czech, Slovak and Polish.
  • 13 Feb 2011: Charts added for Esperanto, Irish, Gaelic and Welsh.
  • 12 Feb 2011: How to Encode: Reference to CopyUnicode utility added.
  • 12 Feb 2011: How to Encode: Description of HtmlEsc utility updated. It is now available as a downloadable JAR, easily used by anyone.
  • 12 Feb 2011: Mnemonic escapes: Table of some additional characters added.
  • 11 Feb 2011: Message added describing the display problem in more detail.


This is an edited PermaThread® for the description of HTML character references. This thread will be edited by Artful Codger, who will consolidate the information posted here into a technical guide. Feel free to post to this thread, but remember that all messages posted here are subject to editing or deletion.

-Joe Offer-



Subject: Tech: Entering special characters (moderated)
From: Artful Codger
Date: 19 Jan 11 - 10:13 PM

Entering Special Characters

Contents of this guide

What are character references (escapes)?
Why should I care? Why use them?
How do I encode them?
Mnemonic escapes
Numeric escapes
Escape charts for selected languages

What are character references?

Mudcat messages are really segments of HTML code--the stuff web pages are made of. Character references are sequences of plain text which allow you to embed special characters in HTML in a portable way, independent of people's system, locale or browser settings. In source code (as when entering a Mudcat message), a character reference is a string of text beginning with an ampersand and ending with a semicolon, and denotes a single character to be displayed.

Example: The sequence &copy; in a web page is rendered as the copyright symbol (©).

The HTML specification calls these special sequences "character references", but hardly anyone uses that term outside of formal documents. Instead, people may call them something like "character escapes", "character entities", "ampersand codes", "escape sequences" or just "escapes". They come in two forms--those that use short mnemonic names (formally, character entity references) and those that use numbers (numeric character references). For brevity, I'll refer to these as "mnemonic escapes" and "numeric escapes", respectively.

[I prefer the term "escape" to "reference" because, historically, escapes were character/byte sequences beginning with a special character or key (ESC) that were given special interpretation; by extension, it now refers to other similar sequences. In the programming world, "reference" generally means something quite different.]
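As a quick check of the idea, here is a minimal Python sketch (Python is only an illustration language here; Mudcat itself is not assumed to use it). It uses the standard library's html.unescape to decode the mnemonic, decimal and hexadecimal forms of the copyright escape and shows that all three denote the same single character.

```python
import html

# Three spellings of the same character reference.
mnemonic = html.unescape("&copy;")      # mnemonic escape
decimal = html.unescape("&#169;")       # numeric escape, decimal
hexadecimal = html.unescape("&#xA9;")   # numeric escape, hexadecimal

# Each one decodes to the single character U+00A9.
print(mnemonic, decimal, hexadecimal)
print(len(mnemonic), hex(ord(mnemonic)))
```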

Why should I care? Why use them?

Because you want your text to appear to all users in the same way. You don't want some users to see a question mark or random symbol when what you meant them to see was a right double quote, an accented e, a long dash, or a Cyrillic M. Even if it appears correct to you, it won't necessarily appear the same way to others.

Due to vagaries of the current input system, any character you enter which isn't in the 7-bit ASCII character set and which isn't encoded as a character reference may display improperly to other users, particularly those on different systems, in different locales, using different browsers or with different browser settings. You're really creating invalid HTML when you do that. Character escapes resolve most of the display incompatibility issues.

Which characters can you safely type in "raw"? What's in the usable ASCII set? Only:
  • the English alphabet A-Z, a-z, with no accented or composite characters
  • numbers: 0-9
  • the space, tab, newline (and carriage return), as well as some other non-printing characters you probably shouldn't use.
  • certain punctuation characters: ! " # $ % ' ( ) * + , - . / : ; = ? @ [ ] \ ^ _ ` { } | ~
  • three other punctuation characters having special meaning to HTML: & < > (But see below)
Everything else should be encoded using escapes. That includes other punctuation, symbols, accented letters, and non-English letters.
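If you want to check a piece of text before posting, the rule above (anything outside 7-bit ASCII needs an escape) is easy to test mechanically. The following is a small Python sketch; the function name needs_escape is made up for this example.

```python
def needs_escape(text):
    """Return the distinct non-ASCII characters in text, in order of first appearance."""
    found = []
    for ch in text:
        # 7-bit ASCII covers code points 0 through 127.
        if ord(ch) > 127 and ch not in found:
            found.append(ch)
    return found


sample = "\u201cSmart quotes\u201d \u2014 and an ellipsis\u2026 but plain 'ASCII' is fine."
for ch in needs_escape(sample):
    print(f"U+{ord(ch):04X} should be written as &#x{ord(ch):X};")
```

An empty result means the text contains only characters that are safe to type in raw.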

Particularly troublesome are quotes, apostrophes, dashes and ellipses, since word processors tend to replace the ASCII characters you type with "smart" symbols outside the ASCII set. A double quote you type directly in the message box is fine, because it's a "straight" quote (not biased right or left) which is in the ASCII set, but word processors usually turn these into curly quotes, depending on which side of text they lie on—these quotes are not ASCII characters. This applies to the apostrophe as well; typing directly is fine, but pasting from another source is suspect. Dashes other than the short hyphen are non-ASCII. An ellipsis is non-ASCII. All these characters display incorrectly to some segment of users if not encoded.

WARNING: Text copied from other web pages is prone to include these troublesome non-ASCII characters.

Ampersands and angle brackets should also be encoded when they are used literally instead of as part of an HTML construct. (To be wholly proper, you should encode straight double quotes as well.) Failing to do so creates invalid HTML, but most browsers will treat these lapses benignly. That said, more than one ABC transcription has been corrupted because the posters failed to escape their angle brackets. Because people need these characters to embed HTML directly, Mudcat will never attempt to escape these characters for you, even if it eventually learns to automatically encode or accept all the others.
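For text that is meant to be shown literally (an ABC transcription, say), the three HTML-special characters can be escaped in one pass. A minimal Python sketch using the standard library's html.escape follows; the wrapper name escape_literal is hypothetical. Only run this on text that contains no HTML tags you want to keep, since it escapes every angle bracket.

```python
import html


def escape_literal(text, quotes=False):
    """Escape &, < and > (and optionally quote characters) in literal text."""
    # html.escape replaces & first, so existing text is never double-mangled
    # within a single call.
    return html.escape(text, quote=quotes)


print(escape_literal("A < B & C > D"))
print(escape_literal('say "hi"', quotes=True))
```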

How do I encode them?

The surest way is to feed your text through a converter program before you apply any formatting (like adding heading, italic or bolding tags). Currently, Mudcat itself doesn't provide such a utility, but I've written a tiny, downloadable, cross-platform program you can put on your desktop. With just a double-click, it will encode text properly while it's on your clipboard—see HtmlEsc.java.
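To give a feel for what such a converter does, here is a minimal Python sketch. It is not the HtmlEsc program, and the function name escape_non_ascii is invented for this example. It relies on Python's built-in "xmlcharrefreplace" error handler, which turns every character that cannot be stored as ASCII into a decimal numeric escape and leaves everything else (including any tags, ampersands and angle brackets) untouched.

```python
def escape_non_ascii(text):
    """Replace every non-ASCII character with a decimal numeric escape."""
    # Encoding to ASCII fails for non-ASCII characters; the error handler
    # substitutes &#NNN; for each one instead of raising an exception.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(escape_non_ascii("caf\u00e9 \u2014 \u0175"))
print(escape_non_ascii("<b>plain</b>"))
```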

If you only need the occasional character, try this nifty utility from guest Grishka: CopyUnicode (see the second message for download links and description of usage).

You can also use web-based converters, such as this one:
http://www.reconn.us/component/option,com_wrapper/Itemid,62/
To use an online converter, you paste your text into one pane of the converter, click their convert button, then copy the converted text into the Mudcat entry box.

Certain characters—single and double quotes, apostrophes, dashes and ellipses—can be replaced with ASCII near-equivalents just by deleting and retyping them once the text has been pasted to the message entry box. But it's easy to miss some characters, so I still recommend using a converter.
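The retyping described above can also be done mechanically. This Python sketch swaps the common "smart" punctuation for ASCII near-equivalents; the table and the function name to_ascii_punctuation are assumptions for illustration, and the choice of "--" for a long dash is just one convention.

```python
# Map each smart character to an ASCII stand-in.
ASCII_EQUIVALENTS = str.maketrans({
    "\u2018": "'",    # left single quote
    "\u2019": "'",    # right single quote / apostrophe
    "\u201c": '"',    # left double quote
    "\u201d": '"',    # right double quote
    "\u2013": "-",    # n-dash
    "\u2014": "--",   # m-dash
    "\u2026": "...",  # ellipsis
})


def to_ascii_punctuation(text):
    """Replace curly quotes, long dashes and ellipses with plain ASCII."""
    return text.translate(ASCII_EQUIVALENTS)


print(to_ascii_punctuation("\u201cIt\u2019s here\u2026\u201d"))
```

Any other non-ASCII characters pass through unchanged, so a converter is still the safer choice for mixed text.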

You can also hand-code the odd exceptions (say, to add a long dash, accented character or copyright symbol, or to encode a pair of smart quotes). The sections below present the most common escapes you're likely to need, and show you how to find or form the rest.

Mnemonic escapes

Mnemonic escapes consist of an ampersand, a short, predefined mnemonic name, and a terminating semicolon. A separate escape is required for each character. All mnemonics are case-sensitive.

Mnemonic escapes exist for only a restricted subset of all the characters which can be displayed in Mudcat messages, but most of those needed for Western European languages are covered, as well as the most commonly needed symbols. If you need to enter a character which isn't in the mnemonic escape subset, you'll have to use a numeric escape instead. Apart from the basic Greek letters (such as &alpha; and &Omega;), no mnemonic escapes are defined for non-Roman characters (Cyrillic, Japanese kana, Devanagari...).
[TBD: List the unrecognized mnemonic escapes and give numeric alternatives.]

Here are the most commonly needed mnemonic escapes for English-language text:

Char  Mnemonic escape  Description
“     &ldquo;          Left double quote
”     &rdquo;          Right double quote
‘     &lsquo;          Left single quote
’     &rsquo;          Right single quote (commonly used for apostrophe)
-     (none)           Hyphen/short dash (just use a hyphen)
–     &ndash;          N-dash (medium; typically used in number ranges)
—     &mdash;          M-dash (long)
…     &hellip;         Ellipsis
&     &amp;            Ampersand
<     &lt;             Left angle bracket (less-than sign)
>     &gt;             Right angle bracket (greater-than sign)
©     &copy;           Copyright symbol
£     &pound;          Pound (currency) sign
€     &euro;           Euro sign
      &nbsp;           Non-breaking space (useful in series to create an indent at the start of a line)

Mnemonics for accented characters follow a consistent pattern, although they're only defined for certain base letters. The following table summarizes the accented characters for which mnemonic escapes are defined; replace the X in the template with a letter in the following list of valid characters. Accented characters not in this table must be encoded using numeric escapes.

Accent            Example  Template   Valid characters
Acute             á        &Xacute;   A E I O U Y a e i o u y
Grave             à        &Xgrave;   A E I O U a e i o u
Circumflex        â        &Xcirc;    A E I O U a e i o u
Diaeresis/umlaut  ä        &Xuml;     A E I O U Y a e i o u y
Tilde             ñ        &Xtilde;   A N O a n o
Cedilla           ç        &Xcedil;   C c
Caron             š        &Xcaron;   S s
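The "mnemonic if one exists, otherwise numeric" rule can be sketched in Python using the standard library's html.entities.codepoint2name table, which lists the HTML 4 mnemonic names. The function names escape_char and escape_text are invented for this example, and whether Mudcat recognizes every name in that table is not something this sketch can confirm.

```python
from html.entities import codepoint2name


def escape_char(ch):
    """Return ch unchanged if ASCII, else a mnemonic or hexadecimal escape."""
    cp = ord(ch)
    if cp < 128:
        return ch
    name = codepoint2name.get(cp)  # HTML 4 mnemonic name, if one is defined
    return f"&{name};" if name else f"&#x{cp:X};"


def escape_text(text):
    """Escape every non-ASCII character in text."""
    return "".join(escape_char(ch) for ch in text)


print(escape_text("Dvo\u0159\u00e1k"))
```

Here the a-acute gets its mnemonic, while r-caron, which has no mnemonic, falls back to a numeric escape.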

Here are some additional characters for which mnemonic escapes are defined.

Char  Mnemonic  Description                 Char  Mnemonic  Description
ß     &szlig;   Eszett (s-z ligature)       ƒ     &fnof;    Florin / function
Æ     &AElig;   A-E ligature                æ     &aelig;   a-e ligature
Œ     &OElig;   O-E ligature                œ     &oelig;   o-e ligature
Ø     &Oslash;  Slashed O                   ø     &oslash;  Slashed o
Ð     &ETH;     Eth                         ð     &eth;     eth
Þ     &THORN;   Thorn                       þ     &thorn;   thorn
¡     &iexcl;   Inverted exclamation mark   ¿     &iquest;  Inverted question mark
xª    &ordf;    Feminine ordinal            xº    &ordm;    Masculine ordinal
x°    &deg;     Degree                      x¹    &sup1;    Superscript 1
x²    &sup2;    Superscript 2               x³    &sup3;    Superscript 3
«     &laquo;   Left (double) angle quote   »     &raquo;   Right (double) angle quote
‹     &lsaquo;  Left single angle quote     ›     &rsaquo;  Right single angle quote
„     &bdquo;   Bottom (left) double quote  ‚     &sbquo;   Bottom (left) single quote*
·     &middot;  Middle dot                  •     &bull;    List bullet
†     &dagger;  Footnote dagger             ‡     &Dagger;  Double dagger
§     &sect;    Section                     ¶     &para;    Paragraph (pilcrow)
™     &trade;   Trademark symbol            ®     &reg;     Registered symbol
¢     &cent;    Cent                        ¥     &yen;     Yen
¤     &curren;  Currency                    ¦     &brvbar;  Broken vertical bar
±     &plusmn;  Plus/minus                  ≠     &ne;      Not equal
≤     &le;      Less than or equal          ≥     &ge;      Greater than or equal
×     &times;   Times sign                  ÷     &divide;  Division sign
-     &shy;     Soft hyphen*

Note the reversal of letters between the bottom quote mnemonics: bdquo (bottom double) but sbquo (single bottom).

The soft hyphen is used to suggest a possible hyphenation point in words or long strings of unbroken text, to facilitate line wrapping. It is not displayed unless a browser chooses that point to wrap.

Resist the temptation to embed lots of unnecessary special characters in your posts, since it reduces the ability of other people to copy your text and save it in plain-text files that may not support these characters.

Alphabets for languages often include digraphs: a pair of letters which are treated as a single unit, and which may be ligated (joined together). For instance, "ch" is considered a single letter in many languages, including Welsh, Czech and (until recently) Spanish. While Unicode does include some digraphs as single characters (like æ, IJ and dz), most digraphs should just be spelled with simple character pairs.

For a more complete list of the mnemonic escapes, there are help pages online, such as this one:
http://rabbit.eng.miami.edu/info/htmlchars.html
The HTML standard provides a complete (if less friendly) listing here:
http://www.w3.org/TR/html401/sgml/entities.html

Numeric escapes

To express a character using a numeric escape, you first need to find its Unicode value (technically, its codepoint value). I'll describe how to do that after I present the basic notation.

There are two ways the value can be encoded: as a decimal value or as a hexadecimal (base-16) value. The format of a decimal numeric escape is
    &#value;
The hexadecimal format is
    &#xvalue;
Note the "#" in both formats and the extra "x" for hexadecimal. You can use x or X; case doesn't matter.

Example: The Welsh character long-w (w with circumflex: ŵ) has a Unicode decimal value of 373, so it can be encoded as &#373;. The same value in hexadecimal is 175, so it can also be encoded as &#x175;

Although you will often find values expressed with leading zeros, they aren't required in numeric escapes, and if present, will be ignored.

Nowadays, most charts supply character values in hexadecimal. Hexadecimal requires 6 additional digits, represented by the letters A-F (or a-f; HTML ignores case for hexadecimal digits).
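Both notations are easy to produce from a character's codepoint value. In this Python sketch the helper names decimal_escape and hex_escape are made up for illustration; html.unescape from the standard library is used only to confirm that leading zeros and the case of the "x" make no difference when the escape is decoded.

```python
import html


def decimal_escape(ch):
    """Build the decimal form: ampersand, #, decimal value, semicolon."""
    return f"&#{ord(ch)};"


def hex_escape(ch):
    """Build the hexadecimal form: ampersand, #x, hex value, semicolon."""
    return f"&#x{ord(ch):X};"


w_circumflex = "\u0175"  # Welsh long-w
print(decimal_escape(w_circumflex), hex_escape(w_circumflex))

# Leading zeros and an upper-case X decode to the same character.
print(html.unescape("&#x0175;") == html.unescape("&#X175;"))
```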

So now you know the format, but how do you find the Unicode value for a character? Well, you could check the charts at the Unicode Consortium website, viewable/downloadable as PDF files:
http://www.unicode.org/charts/

Fortunately, most operating systems provide a simpler way to browse Unicode character maps (and also check whether the character is available in a given font—at least on your system). For instance, on Macs (running OS 10.5, and probably 10.6), you can open the Character Palette and click on likely character subsets in the scroll list. To obtain a hexadecimal value, click on a character and look at the Unicode field displayed in the Character Info section.
[Can someone provide similar instructions for Windows? Please specify which release you're talking about, since Microsoft likes to change everything around with each release.]

The full set of characters for a particular language may be spread out in multiple subranges of the Unicode tables. So it often helps to have a crib chart just for that language (and I'll attempt to supply some). A little web surfing may turn up a chart for your language of interest. For non-Roman alphabets, using a converter is a better option.
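If you already have the characters themselves (pasted from a document, say), a crib chart can be generated rather than looked up. This Python sketch uses the standard unicodedata module to list each character's hexadecimal value, escape and official Unicode name; the function name crib_chart and the row layout are assumptions for this example.

```python
import unicodedata


def crib_chart(chars):
    """Return (char, hex value, escape, Unicode name) for each character."""
    rows = []
    for ch in chars:
        value = f"{ord(ch):X}"
        # The second argument to name() is a fallback for unnamed codepoints.
        rows.append((ch, value, f"&#x{value};", unicodedata.name(ch, "UNKNOWN")))
    return rows


for row in crib_chart("\u0174\u0175\u0176\u0177"):
    print(*row)
```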

Sometimes what looks like a single character is composed of multiple components (for example, a base letter followed by a combining accent), in which case it must be expressed as a sequence of codepoints, each codepoint expressed as a separate HTML escape. Characters with values beyond the 16-bit range, such as rarer Chinese ideographs and Cuneiform, are still written as a single numeric escape using the full codepoint value (for example, &#x12000;). There are also many codepoints for applying special formatting, such as to reverse writing direction for Arabic and Hebrew script. But a full discussion of Unicode is beyond the scope of this guide.
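A short Python sketch may make the codepoint idea concrete. The function name codepoint_escapes is hypothetical. The first example is a letter built from a base character plus a combining accent, which takes two escapes; the second is a Cuneiform sign whose value lies above the 16-bit range, which Python treats as one codepoint and so yields one escape carrying the full value.

```python
import html
import unicodedata


def codepoint_escapes(text):
    """Return one hexadecimal escape per codepoint in text."""
    return [f"&#x{ord(ch):X};" for ch in text]


combined = "e\u0301"          # e followed by a combining acute accent
print(codepoint_escapes(combined))
# Normalizing to NFC folds the pair into the single precomposed letter.
print(codepoint_escapes(unicodedata.normalize("NFC", combined)))

cuneiform = chr(0x12000)      # a codepoint above FFFF
print(codepoint_escapes(cuneiform))
print(html.unescape("&#x12000;") == cuneiform)
```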



Subject: Tech: Special characters by language
From: Artful Codger
Date: 20 Jan 11 - 10:18 PM

Below are some language-specific charts of special characters. If there are important characters I've missed, please let me know.

Esperanto
Capital Small
Description Char Hex Escape Char Hex Escape
C-caret Ĉ 108 &#x108; ĉ 109 &#x109;
G-caret Ĝ 11C &#x11C; ĝ 11D &#x11D;
H-caret Ĥ 124 &#x124; ĥ 125 &#x125;
J-caret Ĵ 134 &#x134; ĵ 135 &#x135;
S-caret Ŝ 15C &#x15C; ŝ 15D &#x15D;
U-breve Ŭ 16C &#x16C; ŭ 16D &#x16D;


Irish, Gaelic and Welsh
Capital Small
Description Char Hex Escape Char Hex Escape
A-acute Á C1 &#xC1; / &Aacute; á E1 &#xE1; / &aacute;
E-acute É C9 &#xC9; / &Eacute; é E9 &#xE9; / &eacute;
I-acute Í CD &#xCD; / &Iacute; í ED &#xED; / &iacute;
O-acute Ó D3 &#xD3; / &Oacute; ó F3 &#xF3; / &oacute;
U-acute Ú DA &#xDA; / &Uacute; ú FA &#xFA; / &uacute;
A-grave À C0 &#xC0; / &Agrave; à E0 &#xE0; / &agrave;
E-grave È C8 &#xC8; / &Egrave; è E8 &#xE8; / &egrave;
I-grave Ì CC &#xCC; / &Igrave; ì EC &#xEC; / &igrave;
O-grave Ò D2 &#xD2; / &Ograve; ò F2 &#xF2; / &ograve;
U-grave Ù D9 &#xD9; / &Ugrave; ù F9 &#xF9; / &ugrave;
W-grave Ẁ 1E80 &#x1E80; ẁ 1E81 &#x1E81;
Y-grave Ỳ 1EF2 &#x1EF2; ỳ 1EF3 &#x1EF3;
A-circumflex Â C2 &#xC2; / &Acirc; â E2 &#xE2; / &acirc;
E-circumflex Ê CA &#xCA; / &Ecirc; ê EA &#xEA; / &ecirc;
I-circumflex Î CE &#xCE; / &Icirc; î EE &#xEE; / &icirc;
O-circumflex Ô D4 &#xD4; / &Ocirc; ô F4 &#xF4; / &ocirc;
U-circumflex Û DB &#xDB; / &Ucirc; û FB &#xFB; / &ucirc;
W-circumflex Ŵ 174 &#x174; ŵ 175 &#x175;
Y-circumflex Ŷ 176 &#x176; ŷ 177 &#x177;
B-dot Ḃ 1E02 &#x1E02; ḃ 1E03 &#x1E03;
C-dot Ċ 10A &#x10A; ċ 10B &#x10B;
D-dot Ḋ 1E0A &#x1E0A; ḋ 1E0B &#x1E0B;
F-dot Ḟ 1E1E &#x1E1E; ḟ 1E1F &#x1E1F;
G-dot Ġ 120 &#x120; ġ 121 &#x121;
M-dot Ṁ 1E40 &#x1E40; ṁ 1E41 &#x1E41;
P-dot Ṗ 1E56 &#x1E56; ṗ 1E57 &#x1E57;
S-dot Ṡ 1E60 &#x1E60; ṡ 1E61 &#x1E61;
T-dot Ṫ 1E6A &#x1E6A; ṫ 1E6B &#x1E6B;
long r -- -- -- ɼ 27C &#x27C;
long s -- -- -- ſ 17F &#x17F;
long s-dot -- -- -- ẛ 1E9B &#x1E9B;
Tironian et (7-agus) -- -- -- ⁊ 204A &#x204A;


Czech and Slovak
Capital Small
Description Char Escape Char Escape
A-acute Á &#xC1; / &Aacute; á &#xE1; / &aacute;
A-umlaut Ä &#xC4; / &Auml; ä &#xE4; / &auml;
C-caron Č &#x10C; č &#x10D;
D-caron Ď &#x10E; ď &#x10F;
Dz Ǳ / ǲ &#x1F1; / &#x1F2; ǳ &#x1F3;
Dz-caron Ǆ / ǅ &#x1C4; / &#x1C5; ǆ &#x1C6;
E-acute É &#xC9; / &Eacute; é &#xE9; / &eacute;
E-caron Ě &#x11A; ě &#x11B;
I-acute Í &#xCD; / &Iacute; í &#xED; / &iacute;
L-acute Ĺ &#x139; ĺ &#x13A;
L-caron Ľ &#x13D; ľ &#x13E;
N-caron Ň &#x147; ň &#x148;
O-acute Ó &#xD3; / &Oacute; ó &#xF3; / &oacute;
O-circumflex Ô &#xD4; / &Ocirc; ô &#xF4; / &ocirc;
R-acute Ŕ &#x154; ŕ &#x155;
R-caron Ř &#x158; ř &#x159;
S-caron Š &#x160; / &Scaron; š &#x161; / &scaron;
T-caron Ť &#x164; ť &#x165;
U-acute Ú &#xDA; / &Uacute; ú &#xFA; / &uacute;
U-circle Ů &#x16E; ů &#x16F;
Y-acute Ý &#xDD; / &Yacute; ý &#xFD; / &yacute;
Z-caron Ž &#x17D; ž &#x17E;

Acute = čárka, caron = háček, circle = kroužek. For tall letters, the háček (caron) often appears as an apostrophe on the right (an indivisible part of the letter).

The digraph "ch" is treated as a single letter, lying between H and I alphabetically, and sorting accordingly (for instance, in dictionaries and name lists). But it is spelled with two letters, as in English. Dz and dž are also commonly spelled with letter pairs.

Polish
Capital Small
Description Char Escape Char Escape
A-ogonek (tail) Ą &#x104; ą &#x105;
C-kreska (acute) Ć &#x106; ć &#x107;
E-ogonek Ę &#x118; ę &#x119;
L-slash Ł &#x141; ł &#x142;
N-kreska Ń &#x143; ń &#x144;
O-kreska Ó &#xD3; / &Oacute; ó &#xF3; / &oacute;
S-kreska Ś &#x15A; ś &#x15B;
Z-kreska Ź &#x179; ź &#x17A;
Z-kropka (dot) Ż &#x17B; ż &#x17C;

Polish alphabet:
a ą b c ć d e ę f g h i j k l ł m n ń o ó p r s ś t u w y z ź ż ch ci cz dz dzi dź dż ni rz si sz szcz zi



Subject: Tech: Entering special characters
From: Artful Codger
Date: 22 Jan 11 - 10:41 PM

South and East Slavic Languages Using Latin Script
  Capital Small
Description Char Escape Char Escape
A-tilde à &#xC3; / &Atilde; ã &#xE3; / &atilde;
C-caron Č &#x10C; č &#x10D;
C-acute Ć &#x106; ć &#x107;
Dz Ǳ / ǲ &#x1F1; / &#x1F2; ǳ &#x1F3;
Dz-caron Ǆ / ǅ &#x1C4; / &#x1C5; ǆ &#x1C6;
D-stroke Đ &#x110; đ &#x111;
Lj Ǉ / ǈ &#x1C7; / &#x1C8; ǉ &#x1C9;
N-acute Ń &#x143; ń &#x144;
Nj Ǌ / ǋ &#x1CA; / &#x1CB; ǌ &#x1CC;
S-caron Š &#x160; / &Scaron; š &#x161; / &scaron;
S-acute Ś &#x15A; ś &#x15B;
U-breve Ŭ &#x16C; ŭ &#x16D;
Z-caron Ž &#x17D; / (&Zcaron;) ž &#x17E; / (&zcaron;)
Z-acute Ź &#x179; ź &#x17A;
The mnemonic escape for Z-caron is defined, but support may be iffy.

Belarusian alphabet:
a b c ć č d dz dž e f g h i j k l m n ń o p r s ś š t u v y z ź ž

Bulgarian alphabet (Latinized):
a b v g d e ž z i j k l m n o p r s t u f x c č š št ã ' ju ja

Croatian (and Bosnian) alphabet:
a b c č ć d dž đ e f g h i j k l lj m n nj o p r s š t u v z ž

Macedonian alphabet (Latinized; no non-English characters are used):
a b v g d gj e zh z dz i j k l lj m n nj o p r s t kj u f h c ch dzh sh

Serbian, normally written in Cyrillic, is Romanized using the same alphabet as Croatian.
Serbian alphabet order:
a b v g d đ e ž z i j k l lj m n nj o p r s t ć u f h c č dž š

Slovenian alphabet:
a (ä) b c č d e f g h i j k l m n o (ö) p r s š t u (ü) v z ž



Subject: Tech: HTML formatting
From: Artful Codger
Date: 10 Feb 11 - 01:51 AM

If you're curious how to add italics, bolding, links, headings, lists, tables, block quotes and other special formatting to your Mudcat messages, check out these other threads: (If you notice some clear expositions elsewhere on the web of how to code HTML formatting, post or PM a link, and I'll consider it for inclusion here.)



Subject: RE: Tech: Moderated thread on ampersand escapes
From: GUEST,Grishka
Date: 10 Feb 11 - 03:34 PM

If desired, my friends and I can donate 200 lines of Java code forming a tool which displays all the Unicode characters and copies any of them to the clipboard upon single-clicking. That is: Click on the character (if necessary, scroll in the table to find it), click into the Mudcat entrybox, press Ctrl-V, voilà.

Leenia, ArtfulCodger is going to explain that notion so that you understand it. It is required for anyone who wants to post to Mudcat anything beyond plain English text, for example schöne Grüße or 卄开尸尸丫  く卄工丼乇丂乇  丼乇山  丫乇开尺 !.



Subject: RE: Tech: Moderated thread on ampersand escapes
From: Artful Codger
Date: 10 Feb 11 - 05:58 PM

To Grishka: Yes, please post your Java source in a new thread, and let me know the link.



Subject: RE: Tech: Moderated thread on ampersand escapes
From: JohnInKansas
Date: 10 Feb 11 - 09:08 PM

In "plain English" an escape character is one designated in the language the computer is using that tells the computer to treat what follows in a special way.

The usual and most common usage is to tell the interpreter that what follows immediately after the "escape" is a code of some kind, and not just simple text to be copied and displayed.

The & character is designated in HTML as being an "escape" character that means that what follows it has a special meaning. Because of this special usage, in ancient times it was necessary to "use the escape code to code the display of the & character" so we had to type &amp; to "just type" &.

At mudcat, instead of "the & character is an escape" the interpreter has been told "the & character is an escape unless it's followed by a blank space" and we can now (usually) just type & if we just want to display & - as long as there's a space before and after it.

The HTML standard includes the ability to designate any typographical character by using the character number assigned to the character. Each character is assigned a unique number.

A variety of "character definitions" have been used, beginning with DOS and Unix, later merged into the ANSI standard definition for most uses. The current "most complete" definition of character numbers is the Unicode Standard.

All of the later sets of character definitions are intended to be able to "include" all earlier ones, but there are cases where this doesn't always work. Artful Codger will explain to us when and why it sometimes fails.

If you want to use the character numbers to be sure that the character you intend is sent to the html interpreter, you must type an & to tell the interpreter that "code is coming."

To use character numbers, the & must be immediately followed by a # character to tell the interpreter that what follows is a number.

The decimal number for the character must follow immediately after, and the "code" must be ended with a semicolon ;.

If you type &#0169; the "copyright" symbol © should be displayed in your html post, since the decimal number 0169 is assigned to that symbol in all of the various standards.

Because of variations in different Operating Systems, and different ways in which information is sent between computers, there are several different ways in which information "in transit" can be "encoded." Failures to get an accurate transmission and interpretation of the characters you attempt to send, when someone else receives and displays them, may result from these differences.

Artful Codger will explain these problems, and what you can do about them.

Fewer errors may result if the Hexadecimal numbers are used for characters that are assigned "bigger numbers." The Unicode Standard uses only hex numbers to define the characters.

As before, to "post" a specific character using the "hex number" assigned to it, you begin with the & "escape" to tell the interpreter that you're using a code.

The & must be immediately followed by the # character to say that the code is a number.

The # character must be immediately followed by an x (or X) to tell the interpreter that the number is in hexadecimal format.

The hex number follows immediately, and the code is ended with a semicolon ;.

The decimal number 169 is the same as the hexadecimal number 00A9.

Typing &#x00A9; should display the "copyright" symbol © in an html post.

The above is the "simple" explanation of the basics of posting characters using the numerical "names" assigned to characters in the standards.

Additional information will follow in later posts.

John



Subject: RE: Tech: Moderated thread on ampersand escapes
From: JohnInKansas
Date: 10 Feb 11 - 11:47 PM

Any character defined in the ISO or ANSI character set standard can be posted to html using the decimal character number in the form:

&#nnn;

where & is the html "escape" to designate that a code follows
the number symbol # designates a numerical form for the code
nnn is the decimal number assigned to the character
a semicolon ; ends the code.


An important correction: One must use the Unicode values. ISO(-8859-1) and ANSI character values largely coincide with Unicode values, but there are also many differences.
-Artful Codger-
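The difference can be seen with one byte value. This Python sketch assumes "ANSI" means the Windows-1252 code page (the usual meaning on Western Windows systems). Byte 0x93 is a left curly quote in Windows-1252, but the same number in ISO-8859-1 (and as a Unicode codepoint) is an invisible control character, while a value such as 0xE9 (e-acute) agrees in all three.

```python
# One byte, decoded under two different character sets.
byte = bytes([0x93])
as_windows = byte.decode("cp1252")    # Windows-1252 ("ANSI")
as_latin1 = byte.decode("latin-1")    # ISO-8859-1

print(f"Windows-1252 0x93 is U+{ord(as_windows):04X}")
print(f"ISO-8859-1   0x93 is U+{ord(as_latin1):04X}")

# Values in the range A0-FF coincide across the three standards.
shared = bytes([0xE9])
print(shared.decode("cp1252") == shared.decode("latin-1"))
```

So the correct numeric escape for the left curly quote is the Unicode value, &#x201C; (decimal &#8220;), not the Windows-1252 byte value.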


Earlier html standards do not specifically support the use of decimal numbers larger than "three digits," although they often work okay up to four.

Any character defined by the Unicode standard can be posted to html using the hexadecimal number assigned to that character by the Unicode Standard, in the form:

&#Xhhhh;

where & is again the html "escape" to designate that a code follows
# designates that the code is a number
X says that the number is in hexadecimal format
hhhh is the hexadecimal number

It is entirely possible, and perfectly correct, to type:

&#x004D;&#x0061;&#x0072;&#x0079;&#x0020; &#x0068;&#x0061;&#x0064;&#x0020; &#x0061;&#x0020;
&#x006C;&#x0069;&#x0074;&#x0074;&#x006C;&#x0065;&#x0020; &#x006C;&#x0061;&#x006D;&#x0062;

And if I actually do that, you will see:

Mary had a little lamb

Just in case somebody doesn't see it, I posted (in hex char code) "Mary had a little lamb."
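A string like the one above need not be typed by hand. This Python sketch (the function name to_hex_escapes is invented here) writes every character of a text as a four-digit hexadecimal escape and then decodes it again with the standard library's html.unescape to confirm the round trip.

```python
import html


def to_hex_escapes(text):
    """Encode every character, ASCII included, as a zero-padded hex escape."""
    return "".join(f"&#x{ord(ch):04X};" for ch in text)


encoded = to_hex_escapes("Mary had a little lamb")
print(encoded)
print(html.unescape(encoded))
```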

Obviously it's not necessary to "code" characters you can just type on your keyboard in most cases.

Artful Codger will explain when it doesn't work to just type them (?).

John



Subject: RE: Tech: Moderated thread on ampersand escapes
From: JohnInKansas
Date: 11 Feb 11 - 01:13 AM

Characters that might not be on your keyboard can be posted using the ampersand-escape method, but for numerous reasonably common characters it is also possible to use an ampersand escape with the code being a name for the character rather than a character number. To do that you would use the form:

        &name;

The & once again tells the html interpreter that it's a code.

Without the # to tell it otherwise, the interpreter reads the code as a character string – i.e. as a "word."

The semicolon ; ends the code.

The characters for which "name codes" exist are called "named character entities" by the HTML standards, frequently shortened to just "character entities." There are a few character entities defined by various versions of the HTML standard, and several others that are in "common use" and can normally be expected to work. Most "HTML Textbooks" don't distinguish between the "specified" ones and the ones in general use.

Artful Codger may explain which are to be considered "useful" at mudcat, or perhaps just which ARE NOT to be considered okay here.

Note that in some cases "what you type" may be case sensitive. The following are the named character entities that my rather old handbook shows:

Instead of   &#034;   you can type   &quot;    "   Quotation mark
Instead of   &#038;   you can type   &amp;      &   Ampersand
Instead of   &#060;   you can type   &lt;       <   Less than
Instead of   &#062;   you can type   &gt;       >   Greater than
Instead of   &#160;   you can type   &nbsp;        Nonbreaking space
Instead of   &#161;   you can type   &iexcl;    ¡   Inverted exclamation point
Instead of   &#162;   you can type   &cent;    ¢   Cent sign
Instead of   &#163;   you can type   &pound;    £   Pound sign
Instead of   &#164;   you can type   &curren;   ¤   Gen currency sign
Instead of   &#165;   you can type   &yen;      ¥   Yen sign
Instead of   &#166;   you can type   &brvbar;   ¦   Broken vertical bar
Instead of   &#167;   you can type   &sect;    §   Section sign
Instead of   &#168;   you can type   &uml;      ¨   Umlaut
Instead of   &#169;   you can type   &copy;    ©   Copyright
Instead of   &#170;   you can type   &ordf;    ª   Feminine ordinal
Instead of   &#171;   you can type   &laquo;    «   Left angle quote
Instead of   &#172;   you can type   &not;      ¬   Not sign
Instead of   &#173;   you can type   &shy;      ­   Soft hyphen
Instead of   &#174;   you can type   &reg;      ®   Registered trademark
Instead of   &#175;   you can type   &macr;    ¯   Macron accent
Instead of   &#176;   you can type   &deg;      °   Degree sign
Instead of   &#177;   you can type   &plusmn;   ±   Plus or minus
Instead of   &#178;   you can type   &sup2;    ²   Superscript 2
Instead of   &#179;   you can type   &sup3;    ³   Superscript 3
Instead of   &#180;   you can type   &acute;    ´   Acute accent
Instead of   &#181;   you can type   &micro;    µ   Micro sign (Greek mu)
Instead of   &#182;   you can type   &para;    ¶   Paragraph sign
Instead of   &#183;   you can type   &middot;   ·   Middle dot
Instead of   &#184;   you can type   &cedil;    ¸   Cedilla
Instead of   &#185;   you can type   &sup1;    ¹   Superscript 1
Instead of   &#186;   you can type   &ordm;    º   Masculine ordinal
Instead of   &#187;   you can type   &raquo;    »   Right angle quote
Instead of   &#188;   you can type   &frac14;   ¼   Fraction one-fourth
Instead of   &#189;   you can type   &frac12;   ½   Fraction one-half
Instead of   &#190;   you can type   &frac34;   ¾   Fraction three-fourths
Instead of   &#191;   you can type   &iquest;   ¿   Inverted question mark
Instead of   &#192;   you can type   &Agrave;   À   Capital A, grave accent
Instead of   &#193;   you can type   &Aacute;   Á   Capital A, acute accent
Instead of   &#194;   you can type   &Acirc;    Â   Capital A, circumflex accent
Instead of   &#195;   you can type   &Atilde;   Ã   Capital A, tilde
Instead of   &#196;   you can type   &Auml;    Ä   Capital A, umlaut
Instead of   &#197;   you can type   &Aring;    Å   Capital A, ring
Instead of   &#198;   you can type   &AElig;    Æ   Capital AE ligature
Instead of   &#199;   you can type   &Ccedil;   Ç   Capital C, cedilla
Instead of   &#200;   you can type   &Egrave;   È   Capital E, grave accent
Instead of   &#201;   you can type   &Eacute;   É   Capital E, acute accent
Instead of   &#202;   you can type   &Ecirc;    Ê   Capital E, circumflex accent
Instead of   &#203;   you can type   &Euml;    Ë   Capital E, umlaut
Instead of   &#204;   you can type   &Igrave;   Ì   Capital I, grave accent
Instead of   &#205;   you can type   &Iacute;   Í   Capital I, acute accent
Instead of   &#206;   you can type   &Icirc;    Î   Capital I, circumflex accent
Instead of   &#207;   you can type   &Iuml;    Ï   Capital I, umlaut
Instead of   &#208;   you can type   &ETH;      Ð   Capital eth, Icelandic
Instead of   &#209;   you can type   &Ntilde;   Ñ   Capital N, tilde
Instead of   &#210;   you can type   &Ograve;   Ò   Capital O, grave accent
Instead of   &#211;   you can type   &Oacute;   Ó   Capital O, acute accent
Instead of   &#212;   you can type   &Ocirc;    Ô   Capital O, circumflex accent
Instead of   &#213;   you can type   &Otilde;   Õ   Capital O, tilde
Instead of   &#214;   you can type   &Ouml;    Ö   Capital O, umlaut
Instead of   &#215;   you can type   &times;    ×   Multiply sign
Instead of   &#216;   you can type   &Oslash;   Ø   Capital O, slash
Instead of   &#217;   you can type   &Ugrave;   Ù   Capital U, grave accent
Instead of   &#218;   you can type   &Uacute;   Ú   Capital U, acute accent
Instead of   &#219;   you can type   &Ucirc;    Û   Capital U, circumflex accent
Instead of   &#220;   you can type   &Uuml;    Ü   Capital U, umlaut
Instead of   &#221;   you can type   &Yacute;   Ý   Capital Y, acute accent
Instead of   &#222;   you can type   &THORN;    Þ   Capital thorn, Icelandic
Instead of   &#223;   you can type   &szlig;    ß   Small sz ligature, German
Instead of   &#224;   you can type   &agrave;   à   Small a, grave accent
Instead of   &#225;   you can type   &aacute;   á   Small a, acute accent
Instead of   &#226;   you can type   &acirc;    â   Small a, circumflex accent
Instead of   &#227;   you can type   &atilde;   ã   Small a, tilde
Instead of   &#228;   you can type   &auml;    ä   Small a, umlaut
Instead of   &#229;   you can type   &aring;    å   Small a, ring
Instead of   &#230;   you can type   &aelig;    æ   Small ae ligature
Instead of   &#231;   you can type   &ccedil;   ç   Small c, cedilla
Instead of   &#232;   you can type   &egrave;   è   Small e, grave accent
Instead of   &#233;   you can type   &eacute;   é   Small e, acute accent
Instead of   &#234;   you can type   &ecirc;    ê   Small e, circumflex accent
Instead of   &#235;   you can type   &euml;    ë   Small e, umlaut
Instead of   &#236;   you can type   &igrave;   ì   Small i, grave accent
Instead of   &#237;   you can type   &iacute;   í   Small i, acute accent
Instead of   &#238;   you can type   &icirc;    î   Small i, circumflex accent
Instead of   &#239;   you can type   &iuml;    ï   Small i, umlaut
Instead of   &#240;   you can type   &eth;      ð   Small eth, Icelandic
Instead of   &#241;   you can type   &ntilde;   ñ   Small n, tilde
Instead of   &#242;   you can type   &ograve;   ò   Small o, grave accent
Instead of   &#243;   you can type   &oacute;   ó   Small o, acute accent
Instead of   &#244;   you can type   &ocirc;    ô   Small o, circumflex accent
Instead of   &#245;   you can type   &otilde;   õ   Small o, tilde
Instead of   &#246;   you can type   &ouml;    ö   Small o, umlaut
Instead of   &#247;   you can type   &divide;   ÷   Division sign
Instead of   &#248;   you can type   &oslash;   ø   Small o, slash
Instead of   &#249;   you can type   &ugrave;   ù   Small u, grave accent
Instead of   &#250;   you can type   &uacute;   ú   Small u, acute accent
Instead of   &#251;   you can type   &ucirc;    û   Small u, circumflex accent
Instead of   &#252;   you can type   &uuml;    ü   Small u, umlaut
Instead of   &#253;   you can type   &yacute;   ý   Small y, acute accent
Instead of   &#254;   you can type   &thorn;    þ   Small thorn, Icelandic
Instead of   &#255;   you can type   &yuml;    ÿ   Small y, umlaut


Note that the above "tabulation" uses a <pre> and </pre> tag and should show in a monospace type. If columns don't show, or if someone copies and pastes it, selecting Courier or some other mono font may straighten them out. If incorporated into the "permanent" thread, they probably should be made into a real table(?).

These named char entities are the ones commonly used in English/Roman text, and include those named in the html standard. I have NOT checked to see how many are included in the list that aren't "standard" although all of these should be read by most html website interpreters.

Other languages may include other entities based on common usage, but they're probably unlikely to be recognized correctly at mudcat. When in doubt, the numerical code should be used.
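For anyone who would rather not look each character up by hand, here is a minimal Python sketch of the conversion the table describes. It is only an illustration: the function name escape_non_ascii and its prefer_names switch are made up for this example, and the name lookup relies on the standard library's html.entities table rather than on anything Mudcat provides. It leaves plain ASCII alone, so ampersands and angle brackets still need separate handling.

```python
from html.entities import codepoint2name


def escape_non_ascii(text, prefer_names=True):
    """Replace every non-ASCII character with an HTML character reference.

    Hypothetical helper for illustration only.
    """
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # plain ASCII passes through untouched
        elif prefer_names and cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")  # named form, e.g. &eacute;
        else:
            out.append("&#" + str(cp) + ";")  # decimal numeric form, e.g. &#233;
    return "".join(out)


sample = "caf\u00e9 \u00bd"  # "café ½" written with escapes so the file itself stays ASCII
print(escape_non_ascii(sample))         # caf&eacute; &frac12;
print(escape_non_ascii(sample, False))  # caf&#233; &#189;
```

Passing False for prefer_names gives the numeric form throughout, which is the safer choice when in doubt.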

Decimal character numbers from 130 through 150 are shown in my (HTML ver 4) handbook as "sometimes won't work." I don't at present have an answer as to why they're listed if they don't; but maybe we'll get to that later.

John



Subject: RE: Tech: Moderated thread on ampersand escapes
From: Artful Codger
Date: 11 Feb 11 - 03:32 AM

All mnemonic escapes (character entity references) are case-sensitive, according to the standard. The official list of mnemonic escapes is defined here:
http://www.w3.org/TR/html401/sgml/entities.html
The organization, however, leaves something to be desired, being neither by value nor by coherent subgrouping (acutes together, or accented a's together, or paired characters together).

Here, I'm only concerned with which standard mnemonics are (and aren't) understood by Mudcat, though perhaps that may depend on the user's browser more than on Mudcat. With my browser (Firefox), I've found only a handful of oddballs.

The numeric escapes should always work, regardless of browser—as long as the defined character is printable and defined in the reasonably standard set of fonts. As it happens, values x80-x9F (decimal 128-159) are control codes in the Unicode map, so it's not surprising if they "sometimes won't work."
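A small Python sketch may make the 128-159 problem concrete. It assumes the two codepages in play are Windows-1252 ("cp1252" in Python) and ISO-8859-1 ("latin-1"); other pairs behave similarly but with different characters. The same three byte values come out as curly quotes and a long dash under one mapping and as control codes under the other, which is why only the Unicode values are dependable in numeric references.

```python
raw = bytes([0x93, 0x94, 0x97])  # three byte values in the 128-159 range

as_cp1252 = raw.decode("cp1252")   # Windows Western European codepage
as_latin1 = raw.decode("latin-1")  # ISO-8859-1: bytes map straight to U+0080..U+00FF

print([hex(ord(c)) for c in as_cp1252])  # ['0x201c', '0x201d', '0x2014']
print([hex(ord(c)) for c in as_latin1])  # ['0x93', '0x94', '0x97']

# Numeric references built from the Unicode values are unambiguous.
references = "".join("&#" + str(ord(c)) + ";" for c in as_cp1252)
print(references)  # &#8220;&#8221;&#8212;
```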

I should point out that mnemonic escapes are defined for a number of characters with values > 255; John's table is correct as far as it goes, but incomplete. In particular, there are mnemonics for most of the "troublesome" characters I mentioned, as well as for other quote symbols, currency symbols, a bullet, daggers, lone diacriticals, and additional ligatures and mathematical symbols.



Subject: RE: Tech: Moderated thread on ampersand escapes
From: GUEST,Grishka
Date: 11 Feb 11 - 04:47 PM

Here is CopyUnicode.java, as requested (10 Feb 11 - 05:58 PM).

That tool and this thread may help to alleviate the symptoms, but a good cure would be preferable, which I think would also take less effort than what we are doing here.
[TDB: Provide links to such threads.]
– one tightly moderated thread would suffice, in which the official Mudcat policy is explained and discussed. TIA. ♫ Imagine there's no encoding problem; it isn't hard to do! ♫. Have fun with CopyUnicode anyway.

Grishka



Subject: RE: Tech: Moderated thread on ampersand escapes
From: GUEST,Grishka
Date: 11 Feb 11 - 04:52 PM

Sorry, I was just intoxicated by the prospect. This is the thread, of course.



Subject: RE: Tech: Entering special characters
From: Artful Codger
Date: 11 Feb 11 - 10:12 PM

Now, how can you tell if you've entered text properly? The short answer is, not easily. That's because what you see (given your system, locale and browser settings) isn't necessarily what other people will, so previewing will often mislead you in this regard. You just have to know which characters are valid "raw" and which aren't, or are suspect. The ASCII set I listed is safe; anything else isn't.

The input system here does automatically encode some characters for you, and if you preview your text, you'll see the resulting escapes in the input box (in the preview area, they'll show as you expect them to do). Unfortunately, the characters most likely to be converted are the ones you're least likely to use. To explain this, I have to get a bit nerdy.

The input/display system is still centered around codepages. Until recent computing times, memory was relatively limited, so early character sets were limited—specifically to a set of 256 characters, corresponding to the number of values one could represent in a byte. As you can imagine, different areas had need of different sets of characters, and there was also an increasing demand for symbols of all sorts. So a variety of codepages developed—various mappings of these 256 values to different sets of characters—and schemes developed for switching between them, so that a user could intermix characters that resided in different codepages. A user's default codepage would thus receive the lion's share of usage, with minimized forays into other codepages, managed largely under the covers.

To add to the confusion, different systems evolved their own sets of codepages; it was only late in the game that standards committees tried to impose some uniformity. To make matters worse, most text files had no provision to indicate which codepage was used to create them, so if one was transferred to a different system and opened, it might be interpreted using a different character mapping, and the result could be garbage.

That's the behavior we see here. What appears to be happening is this: Your browser typically has a default codepage setting active, appropriate to your locale, and it's used for most of the text you're most likely to peruse. But by default (i.e., when no encoding is explicitly stated for a web page, as currently on Mudcat), HTML pages are supposed to be encoded using only 7-bit ASCII (which is a subset of most other codepages, at least in Western Europe and the Americas). This comprises only the first 128 values. How, then, should the browser handle the remaining 128 values, when they are encountered? It could skip them, or report that the page is invalid, or replace them with some "I don't know" character like the query. But most browsers interpret these values according to your default codepage. The result is that if you enter a character value in the range 128-255, it is rendered differently from one user environment to another.

But you're only typing in characters, and the browser shows you those characters, and you can type in characters that are outside your codepage and they'll show up, too! So why doesn't the input system preserve the characters it knows you're entering, encoding them if necessary? Well, it kinda does and it kinda doesn't. For manipulating the characters internally, it probably converts them all to 16-bit or 32-bit Unicode. But when it returns collected input (as from the message box), it converts the text according to the specified encoding for the web page. If no encoding is specified, it returns (byte) values to the Mudcat software according to the poster's default or selected codepage—but without indicating which codepage that was. If the character isn't in the poster's codepage, one of two possibilities seems to be occurring (which have the same result): either the browser encodes the text as a character reference and returns that sequence value-by-value to Mudcat, or it hands Mudcat the Unicode value for that character, and Mudcat software encodes the value as character reference (since you can't store a 16-bit value in a single byte). Either way, life is good: the character ends up stored as a character reference, and should display properly on all systems (font problems aside).

The bad news is that characters in the 128-255 range according to the poster's effective codepage are left unencoded, and are ambiguously rendered. This means that even though a character like a right double quote has a Unicode value well beyond the problem range, it will likely be handed to Mudcat using one of the problem range values instead of being automatically encoded.

So let's consider an example: suppose a user copies some Cyrillic text off the net (the source doesn't matter) and pastes it into a Mudcat message. What will happen? We have to consider two scenarios.

If the poster is a Western European, generally using Western European settings, his codepage will not include Cyrillic characters, so they'll all be automatically encoded as escapes and will display properly to virtually all users. Not so for non-ASCII characters like the right double quote, long dashes, or copyright symbol, all of which are included in the poster's codepage. These will remain unencoded, and will be ambiguously displayed to other users.

If the poster is a Russian, his codepage might be KOI-8 or ISO-8859-5 or CP-866, all of which map Cyrillic, but to different values in the trouble range. In this case, all the text will remain unencoded, mostly with trouble range values, so that Western European viewers and even Russian viewers with a different default codepage will see garbage.

That's why it's so important to encode all non-ASCII characters as escapes, and not rely on the input system to do it for you.
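The Russian scenario can be reproduced in a few lines of Python. This is a sketch under stated assumptions: the poster's codepage is taken to be KOI8-R, the viewers' codepages are taken to be Windows-1252 and CP-866, and the sample word is arbitrary. It does not claim to show what the Mudcat software actually does internally, only what happens when the same bytes are read with a different mapping.

```python
text = "\u041f\u0440\u0438\u0432\u0435\u0442"  # the Russian word "Privet" in Cyrillic

# Bytes as they might be stored if the poster's codepage is KOI8-R (assumption).
stored = text.encode("koi8_r")

# The same bytes read by viewers whose default codepage differs.
seen_western = stored.decode("cp1252")  # Western European viewer
seen_cp866 = stored.decode("cp866")     # Russian viewer with a different codepage
print(seen_western)        # accented Latin letters, not Cyrillic
print(seen_cp866 == text)  # False

# Encoding everything outside ASCII as numeric references avoids the problem.
escaped = text.encode("ascii", "xmlcharrefreplace").decode("ascii")
print(escaped)  # &#1055;&#1088;&#1080;&#1074;&#1077;&#1090;
```

The escaped string contains only ASCII, so it survives whatever codepage the viewer happens to have active.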

There have been many suggestions about how to cure this situation, so that what you enter is what others see, and you don't have to do any special encoding (aside from the occasional ampersand and angle bracket). But each solution has drawbacks in regard to old threads, where the text has already been entered improperly and no automated fix is feasible. As I've said at the beginning of this thread, don't post your suggestions for fixing the problem here; post them to another thread (like this one).



Subject: RE: Tech: Entering special characters (moderated)
From: JohnInKansas
Date: 11 Feb 11 - 10:39 PM

Re comment added at 10 Feb 11 - 11:47 PM.

The Unicode value for the Euro is as stated.

There is no official "euro" in the ANSI standard. The decimal number 0128 was unassigned by ANSI, so MICROSOFT decided to use it for Windows computers released in the US and elsewhere outside the Euro nations, where keyboards don't have a key for the euro symbol.

The assignment of the euro symbol to decimal number 0128 is essentially a "Windows" font table extension that Microsoft describes as allowing the use of the Alt-NumPad method of entry. With NumLock turned on, you can hold down the Alt key while typing 0128 on the number pad, (the leading zero is required on my machine) and a euro symbol will be inserted into your document.

If, however, you put your cursor (in Word) directly following the euro symbol you've inserted, and click Alt-X to toggle it to the Unicode value for the character, you get "20AC" which is the correct hex number (decimal 8364) for the Unicode euro character.

I don't know how many other operating systems may have adopted the "0128" shortcut; but it's NOT an ANSI thing, it's just Microsoft trickery.
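The two numbers can be checked with a short Python sketch. It assumes the Windows codepage in question is Windows-1252 (what Windows loosely calls "ANSI"), and it uses the standard library's html.entities table for the named reference.

```python
from html.entities import name2codepoint

# Byte value 128 read through the Windows-1252 codepage (assumption: this is
# the "Windows" mapping described above).
euro = bytes([128]).decode("cp1252")
print(hex(ord(euro)), ord(euro))  # 0x20ac 8364

# The same byte read as ISO-8859-1 is just the control code U+0080.
c1 = bytes([128]).decode("latin-1")
print(hex(ord(c1)))  # 0x80

# The named reference &euro; resolves to the Unicode value, not to 128.
print(name2codepoint["euro"])  # 8364

reference = "&#" + str(ord(euro)) + ";"
print(reference)  # &#8364;
```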

John



Subject: RE: Tech: Entering special characters (moderated)
From: Artful Codger
Date: 11 Feb 11 - 10:53 PM

That doesn't surprise me. The chart I was referring to says that ANSI defines x80-x9F as control codes, but that ISO-8859-1 defines them for a variety of characters (quotes, dashes, daggers, TM, ...). They may have meant MS's implementation of ISO-8859-1. But the ISO mapping on my browser (Firefox on Mac) does appear to include these characters.



Subject: RE: Tech: Entering special characters (moderated)
From: JohnInKansas
Date: 21 Sep 12 - 04:27 PM

Artful C -

A possible correction to a correction you made a little above is that the Euro is not an ANSI character, or at least wasn't the last time I looked a year or so ago.

That decimal character number was, and apparently still is, defined in ANSI as a "reserved number" with no assigned meaning. That allows any individual font designers and/or programmers to assign any character they need to that number. (And Unicode includes a fairly large number of such "unassigned" or "reserved" character numbers.)

Microsoft simply elected to use the Alt-Numpad-128 method to allow users without a € key on their keyboard an easier way to enter the new symbol.

When you use the Alt-Numpad-128 method in recent Word versions, the character actually printed has the correct Unicode numerical value of Hex 20AC. The transformation in Windows is actually done in the "character code pages" that flip in and out of RAM during use of programs.

Essentially, the use of Alt-Numpad-128 for the &#x20AC; is just a keyboard shortcut, added as a default in recent Windows versions, that enters something other than what you type.

John



Subject: RE: Tech: Entering special characters (moderated)
From: Artful Codger
Date: 22 Sep 12 - 02:16 AM

To avoid potential confusion, I've removed the Euro example from my correction. The main point of the correction (that ISO and ANSI values don't always coincide with Unicode in the high-8-bit range, and that only the Unicode values are guaranteed to work in numeric references) stands.

Since the Euro is likely to be mapped somewhere in one's native codepage, pasting or typing a Euro symbol directly is likely to leave the character improperly encoded. Best to encode the Euro as &euro;. That's all people need to know here— the rest just clouds the issue.


All original material is copyright © 2022 by the Mudcat Café Music Foundation. All photos, music, images, etc. are copyright © by their rightful owners. Every effort is taken to attribute appropriate copyright to images, content, music, etc. We are not a copyright resource.