The Mudcat Café TM
Thread #119339   Message #2647042
Posted By: Artful Codger
02-Jun-09 - 09:30 PM
Thread Name: Tech: html from a word document
Subject: RE: Tech: html from a word document
Character encoding doesn't matter if you use the numeric or named escapes ("character references"). HTML uses Unicode values; any name or encoded value maps to one and only one Unicode character. Fonts typically only support a subset of all the Unicode characters, but virtually every modern font will support all of these quote characters, because they occur so frequently. A user's locale-specific quoting convention (angle bracket quotes, or bottom-aligned opening quotes) has no effect on the quotes as encoded.
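
As a quick check of the "one and only one Unicode character" point, here is a minimal Python sketch using the standard library's html.unescape. The variable names are just for illustration. It decodes three spellings of the left double quote and confirms they all land on the same code point.

```python
import html

# Three different spellings of the left double quotation mark (U+201C):
# a named reference, a decimal reference, and a hexadecimal reference.
named = html.unescape("&ldquo;")
decimal = html.unescape("&#8220;")
hexadecimal = html.unescape("&#x201C;")

# All three decode to the same single Unicode character,
# whatever encoding the file containing them was saved in.
assert named == decimal == hexadecimal == "\u201c"
print(hex(ord(named)))  # prints 0x201c
```

Since the references themselves are plain ASCII, they survive whatever encoding the page ends up in.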

Unless the file encoding is specified at the start of an HTML file (using a special directive, such as a meta tag that declares the charset), you can only rely on the ASCII characters in the 7-bit range (0 to 127). The apostrophe and the straight double quote are the only quote characters which fall within this range, and only these quotes can be used to delimit string values within HTML tags (i.e. within angle brackets). Whenever you want a literal quote, you're supposed to encode it (as "&#39;" or "&quot;"), but HTML processors will seldom complain about bare quote characters outside of tags.
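
To make the literal-quote rule concrete, here is a small Python sketch, assuming you are building a tag by hand. The attr helper is hypothetical, made up for this example. html.escape is the real standard library function, and with quote=True it escapes both straight quote characters as well as &, < and >.

```python
import html

def attr(name, value):
    """Hypothetical helper: build name="value" with the value escaped."""
    # quote=True also escapes " and ', so the value cannot end the
    # double-quoted attribute early.
    return '{}="{}"'.format(name, html.escape(value, quote=True))

print(attr("title", 'Say "hi" & it\'s done'))
# prints title="Say &quot;hi&quot; &amp; it&#x27;s done"
```

Note that html.escape writes the apostrophe as the hexadecimal reference "&#x27;", which is the same character as "&#39;".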

The other quote symbols don't fall within the 7-bit ASCII range, so they must be escaped (like "&ldquo;") to ensure they will be properly handled. This also applies to any other character whose Unicode value is beyond 127: accented characters, most symbols, non-Roman characters, language-specific characters...
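
Here is a rough Python sketch of that escaping step. It is not the htmlesc program, just an illustration, and both function names are invented for the example. The first function leans on Python's built-in "xmlcharrefreplace" error handler. The second prefers named references where the standard html.entities table has one. Neither touches &, < or >, which would need separate handling.

```python
from html.entities import codepoint2name

def escape_numeric(text):
    """Replace every character above 127 with a decimal reference."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")

def escape_named(text):
    """Like escape_numeric, but use a named reference when one exists."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)  # plain ASCII passes through untouched
        elif cp in codepoint2name:
            out.append("&{};".format(codepoint2name[cp]))
        else:
            out.append("&#{};".format(cp))
    return "".join(out)

sample = "\u201ccaf\u00e9\u201d"  # curly-quoted "cafe" with an accented e
print(escape_numeric(sample))  # prints &#8220;caf&#233;&#8221;
print(escape_named(sample))    # prints &ldquo;caf&eacute;&rdquo;
```

Either output is pure 7-bit ASCII, so it can be pasted anywhere without worrying about the page's encoding.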

You don't have any control over the encoding of Mudcat messages, so here you always have to encode special characters, including word processor quotes. A lot of folks just copy-n-paste from their word processors without previewing, then they wonder why their messages end up with a bunch of question marks, or why their foreign language text is wonky when other people try to read it. Basically, it's because of these encoding issues. I've provided Java and Python programs which you can use to properly escape raw text on the clipboard before pasting it; search for "htmlesc".