Tech: htmlesc.py: Mac script to escape text
Subject: Tech: htmlesc.py: Mac script to escape text From: Artful Codger Date: 24 Jun 08 - 04:08 AM Below is a Python script for Macs which converts text into a format suitable for pasting into messages here. For a cross-platform Java version, see this thread: HtmlEsc.java Why would you need this? Because:
For the complete scoop on why you need a tool like this, see the guide "Entering special characters." The bad news:
Here's the one-time setup:
And now you're ready to use it:
The first time you run the script, the Python interpreter will create a p-code file with the same name as your script file, but with a .pyc extension. Keep it around; if you delete it, the interpreter will just have to recreate it the next time you run the script. |
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: Artful Codger Date: 24 Jun 08 - 04:13 AM Script updated 12 Feb 2011 to remove several mnemonics which aren't well-supported: hibar, Zcaron, zcaron and bdquote. The script supplies numeric escapes for these characters instead. -Artful Codger- |
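For readers who want a feel for the mnemonic-versus-numeric choice mentioned in this update, here is a minimal Python 3 sketch. It is not the htmlesc.py script itself: `MNEMONICS` and `html_escape` are hypothetical names, and the table holds only a few illustrative entries. A character with a listed mnemonic gets a named escape; any other non-ASCII character falls back to a decimal numeric escape.

```python
# Hypothetical, deliberately tiny mnemonic table. Assumption: the real
# script's table is much larger and is not reproduced here.
MNEMONICS = {
    "&": "amp",
    "<": "lt",
    ">": "gt",
    "\u00e9": "eacute",  # e with acute accent
    "\u00fc": "uuml",    # u with diaeresis
}


def html_escape(text):
    """Return text with markup characters and non-ASCII characters escaped."""
    out = []
    for ch in text:
        if ch in MNEMONICS:
            # Well-supported mnemonic available: use the named escape.
            out.append("&" + MNEMONICS[ch] + ";")
        elif ord(ch) > 127:
            # No mnemonic listed: fall back to a decimal numeric escape.
            out.append("&#%d;" % ord(ch))
        else:
            # Plain ASCII passes through unchanged.
            out.append(ch)
    return "".join(out)


if __name__ == "__main__":
    print(html_escape("chlap by\u0165 & caf\u00e9"))
```

Dropping an entry from the table is all it takes to switch a character from a mnemonic to a numeric escape, which is the kind of change this update describes.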
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: treewind Date: 24 Jun 08 - 05:27 AM Sweet. Should work in Linux too. It assumes you're using UTF-8 encoding for input, doesn't it? But that's the best choice anyway. Anahata Sorry, but the clipboard interface is specific to Macs. It should be simple to create a Linux or Windows version, if one knows the appropriate code for grabbing and putting clipboard text on those platforms. In the meantime, however, I've created a Java version which is cross-platform; see this thread: HtmlEsc.java -Artful Codger- |
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: JohnInKansas Date: 24 Jun 08 - 06:26 AM Encode away, and be pleased with your posts, but what a viewer of your posts will see is mostly limited to the character glyphs in the font chosen as the default in the viewer's browser. The viewer chooses what characters will be shown, not the one who posts a bunch of gibberish. Characters not included in the font used by someone reading your posts may have a more consistent representation of "illegibles" but your method will do nothing particularly helpful for most of us here. It would seem much simpler just to choose a language (and keyboard, where needed) appropriate to the post, with a font containing the required glyphs, and simply add at the beginning of the post the name of a font (and language used, if necessary) that everyone can use to read your post. Windows (recent versions) permits selecting more than 80 different languages, with or without choosing a keyboard layout appropriate to each. I was not aware that Macs lacked at least some ability to do the same, with sufficient languages to include the principal characters even if the choice is not as broad as for current Windows. It does nothing particularly beneficial to post, correctly coded or otherwise, things that the majority of viewers will be unable to read without special effort; and extreme delusional pomposity to think that a post that requires the effort attending the "method" is so important that it will induce others to make the adjustments needed to read your "wisdom." (Most of us, I think, just ignore the frequent thunking errors in the cut-n-paste postings Amos so frequently puts up in "his thread," since we know it's "just a Mac thing.") But try it out, and we'll see how it works if you wish. John |
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: Artful Codger Date: 24 Jun 08 - 05:45 PM treewind: No, it will not work on Linux, because the PyObjC bridge and the clipboard interface are particular to Mac. The HtmlEscaper class within the script should work on any platform, but you'd have to write your own stuff to exchange text with your platform's clipboard. If there's a Linux command-line utility for interacting with the clipboard (similar to Mac's pbcopy/pbpaste), I might be able to write a version based around that, since my original scripts piped input to/from the Mac command-line utilities. However, they don't handle Unicode well--I could push data using a simple RTF wrapper, but for input I was having to parse RTF files, and although I did get rather far along with this approach, it was quite messy and prone to error. (It would, however, have allowed me to emit HTML formatting, so one day I might return to it.) John: Not so. The text types on the clipboard are mostly Unicode-based, which means they're essentially codepage-independent from a user's standpoint. When the system KNOWS it's dealing with Unicode, it can figure out which codepages to switch to for the various encoded characters. As long as the fonts used support the characters used, you're good. The standard fonts nowadays support most of the common non-pictograph languages, including Cyrillic, Hebrew, Arabic and even Devanagari and Tamil script (though not some of the dotted Celtic consonants, so you're still safer using h's.) So even if your source file is 8-bit encoded for a particular codepage, when you copy the text to the clipboard, your editor should translate the text to Unicode, the lingua franca for data exchange. If it doesn't, the text type my script expects will not be available and no translation will occur. Likewise, most modern browsers handle Unicode quite elegantly, and by default. 
The characters are converted as Unicode, the default character encoding for HTML, XML etc. Only older browsers should have trouble viewing any encoded text in any common language aside from the pictograph languages (e.g. Chinese), and even there, you can usually download support language files for your browser. The problem comes when people upload text encoded for older, 8-bit codepages--and even then, most browsers support the common codepages, you just have to select the right one. So your claim that "most people" would have trouble is erroneous, unless you mean those without computers. For instance, here's some encoded text: First a Slovak mountain holar: Ej, musel by to chlap byť 1. Ej, musel by to chlap byť, čo by ma chcel nabiť, vyberaný. [:Valaška pri boce, len sa tak ligoce, opasok vybíjaný.:] 2. Ej, čo vás bude sedem, sedemdesiatsedem, nebojím sa. [:Valaška pri boce, len sa tak ligoce, ej, veru ubránim sa.:] And the start of a Russian song: Эта песня интересна (Частушки) Эта песня интересна / ты послушай дорогой да В этой песне всё известно / как гуляли мы с тобой. And now the Esperanto alphabet: a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s t u ŭ v z You should be able to see all three properly regardless of the codepage your system uses by default when dealing with plain text. The procedure I used is just as I described, and in fact, I grabbed the text from three different source file types and three different editors. Doing a Mudcat message preview, I got the whole shebang poifectly with no muss or fuss. (I use the Firefox browser, in case you're interested.) By the way, those apostrophes after the t's in the Czech sample are actually part of the character--how Czech represents t-with-hacek. 
Now, when you copy and save the text, if you save it to a plain text file, you'd have problems, because no 8-bit encoding contains all these characters, and the encoding your system uses by default is unlikely to contain ANY of the special characters, aside from the accented vowels. But the characters which ARE in your native codepage would be handled properly. And if you paste into a word processor, saving with a format like RTF, PDF, Word or Pages, you're okay, since they use Unicode under the covers (or a combination of native codepages and Unicode). |
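The point about 8-bit encodings can be checked with a short Python 3 sketch. The helper name `encodable` and the sample strings are illustrative choices, and the codecs listed are just a few common ones, not an exhaustive survey.

```python
def encodable(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# Short excerpts in the three scripts used in the post above.
slovak = "chlap byť, čo by ma chcel nabiť"
russian = "Эта песня"
esperanto = "ĉ ĝ ĥ ĵ ŭ"
mixed = slovak + " / " + russian + " / " + esperanto

if __name__ == "__main__":
    # Each 8-bit codepage covers at most one of the samples;
    # a Unicode encoding covers all of them.
    for codec in ("cp1252", "mac_roman", "cp1250", "cp1251", "utf-16"):
        print(codec, encodable(mixed, codec))
```

A single codepage can still handle the text written for it (cp1250 for the Slovak, cp1251 for the Russian), which matches the remark that characters in your native codepage are handled properly.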
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: Artful Codger Date: 24 Jun 08 - 06:49 PM Also, if you want my feedback on any questions or comments about the script, please send me a private message, since I won't necessarily be monitoring this thread. treewind: Actually, the encoding is UTF-16, not UTF-8. UTF-8 is an 8-bit encoding which uses 1 to 4 bytes per Unicode code point. (I speak of code points because a single character may actually be a composite of several code points, and there are code points which specify relative positioning, display order and the like.) Translating between UTF-8 byte sequences and Unicode code point values involves a lot of bit-twiddling, whereas with UTF-16 you can just combine byte pairs into an integer value (shifting one or the other up 8 bits) and you have the Unicode code point. Because of this, it's quite easy to examine a hexadecimal dump of a UTF-16 file and determine exactly what the encoded characters are. For characters with values above 0xFFFF you get into UTF-16 surrogate pairs, which are not as straightforward, but that is how characters need to be encoded for Mudcat, since you don't have the ability to change the page encoding for just the text you're supplying. UTF-8 is the standard for certain byte-stream protocols, like e-mail, and the Mac filesystem uses oddly-normalized UTF-8 for its internal representation. That may be where you got the impression UTF-8 is best. But for text files, UTF-16 is preferable to UTF-8, because (bear with me here)... UTF-16 files are either stored in the native byte order of the operating system, or with a byte ordering specified by an explicit code point at the start of the file, called a Byte Order Mark (BOM). When an editor encounters a UTF-16 BOM, it knows immediately it's dealing with a UTF-16 file, and whether the bytes are stored in big-endian or little-endian order. It also silently strips the BOM from the start of the text. 
There are very few programs that don't know how to deal with UTF-16 BOMs. There is a similar BOM for UTF-8--it's actually the same code point as for UTF-16 (U+FEFF) that, encoded as UTF-8, comes out as the distinctive three-byte sequence EF BB BF. HOWEVER, it is not common for programs to prepend the BOM for UTF-8 files, since UTF-8 is touted as "byte-order independent". It was designed to serialize Unicode through byte streams, where the byte order was invariant. Consequently, when most programs--including most editors--do encounter a UTF-8 BOM, they usually don't recognize it as such, and instead treat it as raw text! This can cause programs (like compilers) to report syntax errors and reject your file. More information than you want, probably, but there it is... |
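The byte-level details in this post can be inspected with a few lines of Python 3. This is only a sketch: `utf16be_units` and `join_surrogates` are hypothetical helper names, and U+1D11E (a musical clef symbol) is an arbitrary example of a character above 0xFFFF.

```python
import codecs


def utf16be_units(text):
    """Split the big-endian UTF-16 bytes of text into 16-bit code units."""
    raw = text.encode("utf-16-be")
    # Combine each byte pair: high byte shifted up 8 bits, plus low byte.
    return [(raw[i] << 8) | raw[i + 1] for i in range(0, len(raw), 2)]


def join_surrogates(high, low):
    """Combine a UTF-16 surrogate pair into a single code point."""
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


if __name__ == "__main__":
    # Below 0x10000 the byte pair is the code point itself.
    print([hex(u) for u in utf16be_units("ĉ")])

    # Above 0xFFFF the character becomes a surrogate pair.
    high, low = utf16be_units("\U0001D11E")
    print(hex(high), hex(low), hex(join_surrogates(high, low)))

    # The BOM is code point U+FEFF; only its byte form differs per encoding.
    print(codecs.BOM_UTF16_BE.hex(), codecs.BOM_UTF16_LE.hex(),
          codecs.BOM_UTF8.hex())
    print("\ufeff".encode("utf-8") == codecs.BOM_UTF8)
```

The constants in the standard `codecs` module give the BOM bytes directly, so nothing here depends on the script under discussion.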
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: Artful Codger Date: 24 Jun 08 - 07:17 PM Jon: Cocoa is an object-oriented application support layer for Mac OS X (contrasted with Carbon, which is generally lower-level and procedural). Cocoa has an Objective-C interface. Objective-C is a dynamically-typed object-oriented C-derivative language with Smalltalk-like object messaging. This is why my Python script requires the PyObjC bridge: to interact with the Cocoa layer, and through it, the pasteboard service. Quite nifty, but sadly non-portable. |
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: JohnInKansas Date: 25 Jun 08 - 10:26 AM It's really ^#$@%! irritating when someone points out progress that I didn't notice, and makes me go back and study up. Maybe if I hadn't been so busy trying to make Office 2007 work I'd have noticed. (cheap excuse, obviously). I found only 1,048 characters in the first 3,070 Unicode char numbers (  thru జ) that failed to display (in Preview) in my browser when coded in a post preview. That's a remarkable improvement over a few years ago when I tried out the first few thousand Unicode chars here. I think I was using Win98SE and IE5 way back then? (Fortunately, for the reader, the preview function has been added since then, so I didn't need to actually post all of them to try them out. Maybe that thread was the reason they decided to add the preview?) Since the charDict given contains only 141 characters, and these should all be "printable" ones, the objection to the possibility of a posted result being unreadable by significant numbers of people may be disregarded. (But you were gonna do that anyway.) While the browsers have quite obviously been updated to handle a broader range of characters, I'm not sure whether our Win95/Win98 users can benefit from the improvement; but for the limited character set involved it's quite possible that they'll see truth and beauty in posts using the converter. Surprisingly, when I copied the nonprinting characters from the mudcat preview and pasted them (unformatted) into Word to make my list of unprintables, a few (<100?) of them did display apparently good glyphs in Word. By accident I found that there appeared to be a very few that displayed in my browser, but not in Word, although I didn't search specifically for this effect, and the few that may have shown this behavior(u)r may have been just sloppiness on my part. 
Word 2007 is so slow that the ones that didn't display may have just been because it was still looking for the glyphs. There also appeared to be one or two characters that might have displayed a different glyph in Word than in the browser. Since this was just a quick test, it will take some additional checking to make sure both of the above weren't just operator hastiness effects. While it's been pointed out before, probably few remember that in (PC) Word (someone with Mac Word might want to check this for the Mac version) if you place your cursor immediately to the right of an unknown character and hit Alt-X the Unicode char value will replace the character. It generally is not necessary that the glyph be displayed correctly to get the hex char value from it. You can also type the Unicode char number (hex of course), and with the cursor immediately to the right of the last character typed, hit Alt-X to transform it to the appropriate glyph – if it's one your computer knows about. Unfortunately this only works one char at a time. I don't know how far back into old versions this worked, but it was good in Office 2002 and later for the versions we've had. I don't think Win98 knew about Unicode, so possibly it may not work in Word95/97 and other versions around that era. Maybe one of our ancient ones can try it and let us know. Back to the drawin' board for some more study. John |
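For anyone without Word to hand, the same hex-to-character lookup that Alt-X performs can be approximated in Python 3. This is a rough stand-in, not a description of how Word does it; the function names `hex_to_char` and `char_to_hex` are made up for the example.

```python
def hex_to_char(hex_digits):
    """Turn a hexadecimal Unicode value such as '0109' into its character."""
    return chr(int(hex_digits, 16))


def char_to_hex(ch):
    """Turn a single character into its hexadecimal Unicode value."""
    # Pad to at least four digits, the way code points are usually written.
    return format(ord(ch), "04X")


if __name__ == "__main__":
    print(hex_to_char("0109"))
    print(char_to_hex("Э"))
    # Unlike Alt-X, a loop handles a whole string at once.
    print([char_to_hex(c) for c in "песня"])
```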
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: treewind Date: 25 Jun 08 - 10:42 AM AC : OK - I understood most of that. But I see the lack of need for a BOM as an advantage of UTF-8. Also UTF-8: (a) makes smaller files in most cases (b) is compatible with ASCII I suppose if I was commonly using non-Latin based alphabets I'd see it differently. Does the Mac OS use UTF-16 as standard for all text files now? Anahata |
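Points (a) and (b) are easy to measure. The Python 3 sketch below compares encoded sizes for a few arbitrary sample strings; `sizes` is a hypothetical helper, and "utf-16-le" is used so that no BOM is counted in the totals.

```python
def sizes(text):
    """Return (UTF-8 byte count, UTF-16 byte count) for text, without BOMs."""
    return len(text.encode("utf-8")), len(text.encode("utf-16-le"))


# Arbitrary samples; real files will vary with their mix of characters.
samples = {
    "English": "Hello, world",
    "Slovak": "vybíjaný",
    "Russian": "песня",
    "Chinese": "中文",
}

if __name__ == "__main__":
    for name, text in samples.items():
        utf8_len, utf16_len = sizes(text)
        print(name, "UTF-8:", utf8_len, "UTF-16:", utf16_len)

    # ASCII compatibility: pure ASCII text has identical bytes in UTF-8.
    print("Hello".encode("utf-8") == "Hello".encode("ascii"))
```

Latin-based text comes out smaller in UTF-8, Cyrillic roughly even, and the Chinese sample larger, so "most cases" depends on which scripts you write in.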
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: GUEST,Jon Date: 26 Jun 08 - 04:22 AM I've just had a quick look for Python Linux clipboard functions without success. Some say Linux doesn't have a clipboard (which in itself is fair enough) and suggest looking to KDE/Qt/Gnome/Gtk, but my apps, e.g. Firefox, have working clipboards regardless of whether I'm using Gnome or KDE, and I guess they work with other desktops too. I guess there must be something at a lower (X?) level? Oh well, water butts are the order of the day... I'll try to have another look some time... |
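One possible route on Linux is to pipe text through a command-line clipboard utility, much as pbcopy/pbpaste work on the Mac. The Python 3 sketch below assumes the `xclip` utility is installed and an X session is running, and it assumes the utilities exchange UTF-8, which depends on locale settings. The names `CLIPBOARD_COMMANDS`, `read_from`, `write_to`, `get_clipboard` and `set_clipboard` are invented for the example.

```python
import subprocess
import sys

# Assumed command lines: pbpaste/pbcopy on the Mac, xclip on Linux.
# Other platforms, or systems without xclip, are not covered.
CLIPBOARD_COMMANDS = {
    "darwin": (["pbpaste"], ["pbcopy"]),
    "linux": (["xclip", "-selection", "clipboard", "-o"],
              ["xclip", "-selection", "clipboard", "-i"]),
}


def read_from(command, encoding="utf-8"):
    """Run command and return whatever it writes to stdout, decoded."""
    result = subprocess.run(command, stdout=subprocess.PIPE, check=True)
    return result.stdout.decode(encoding)


def write_to(command, text, encoding="utf-8"):
    """Run command and feed it text on stdin."""
    # stdout is discarded rather than captured, so a utility that stays
    # alive in the background cannot keep this call waiting on a pipe.
    subprocess.run(command, input=text.encode(encoding),
                   stdout=subprocess.DEVNULL, check=True)


def get_clipboard():
    """Return the clipboard text using the platform's paste command."""
    paste_cmd, _ = CLIPBOARD_COMMANDS[sys.platform]
    return read_from(paste_cmd)


def set_clipboard(text):
    """Replace the clipboard text using the platform's copy command."""
    _, copy_cmd = CLIPBOARD_COMMANDS[sys.platform]
    write_to(copy_cmd, text)
```

An escaping step would then sit between the two calls, for example `set_clipboard(escape(get_clipboard()))`, where `escape` stands for whatever conversion function you supply.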
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: Artful Codger Date: 26 Jun 08 - 06:43 AM Lack of a BOM is a decided disadvantage. Without it, UTF-8 is indistinguishable from any of the myriad legacy 8-bit character sets. And, sadly, with it, most programs will do the wrong thing (unlike with the UTF-16 BOM). That single BOM character not only tells a program it's dealing with a Unicode file with a particular byte ordering, it identifies whether the encoding is UTF-7, UTF-8, UTF-16 or UTF-32! That's quite a lot of information for such a little tag. UTF-8 makes files smaller only for the Western languages. For the Asian languages, where the need for Unicode is immensely greater, UTF-8 is more bloated than UTF-16. UTF-8 must also be converted into UTF-16 or UTF-32 in order to process it at all efficiently, and the conversion involves a lot of bit twiddling. Anyway, in these days of multi-megabyte software updates pushed over the Internet, plain text file size is a complete non-issue. Yes, 7-bit ASCII is a subset of the UTF-8 encoding, but so what? In terms of code point values, every Unicode encoding has ASCII as a subset. And the whole idea of Unicode is to transcend the woeful limitations of ASCII and the mess produced by incompatible legacy character sets. No, Mac OS X hasn't made any Unicode encoding the standard for plain-text files. The default in most editors, including TextEdit, is to interpret text files using a locale-dependent character set like Mac-Roman, no doubt for backwards-compatibility. Oddly, TextEdit won't let you change the encoding to be Unicode, nor will it let you save a new file as plain text, only as RTF. Mac config files are mostly XML; I'm assuming they're actually UTF-8 encoded, but of course mine look like plain ASCII, so I can't tell. |
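To illustrate how much a BOM tells a program, here is a minimal Python 3 sketch of BOM sniffing. It uses the BOM constants from the standard `codecs` module and leaves out UTF-7. The names `sniff_bom` and `decode_with_bom` are hypothetical, and the Mac-Roman fallback is an assumption based on the default described above.

```python
import codecs

# Longest marks first: the UTF-32-LE BOM begins with the UTF-16-LE BOM,
# so checking UTF-16 first would misidentify UTF-32-LE data.
BOMS = [
    (codecs.BOM_UTF32_BE, "utf-32-be"),
    (codecs.BOM_UTF32_LE, "utf-32-le"),
    (codecs.BOM_UTF8, "utf-8"),
    (codecs.BOM_UTF16_BE, "utf-16-be"),
    (codecs.BOM_UTF16_LE, "utf-16-le"),
]


def sniff_bom(data):
    """Return (encoding, bom_length), or (None, 0) if no BOM is present."""
    for bom, name in BOMS:
        if data.startswith(bom):
            return name, len(bom)
    return None, 0


def decode_with_bom(data, fallback="mac_roman"):
    """Decode bytes using the BOM if present, else a legacy fallback."""
    encoding, skip = sniff_bom(data)
    # Strip the BOM so it does not show up as raw text.
    return data[skip:].decode(encoding or fallback)


if __name__ == "__main__":
    utf16_file = codecs.BOM_UTF16_LE + "ĉ ĝ".encode("utf-16-le")
    print(sniff_bom(utf16_file))
    print(decode_with_bom(utf16_file))
    print(sniff_bom(b"plain bytes"))
```

Without a BOM the function can only guess, which is the situation described for UTF-8 files saved without one.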
Subject: RE: Tech: htmlesc.py: Mac script to escape text From: Artful Codger Date: 03 Jul 08 - 10:18 PM For a cross-platform Java version of this script, see this thread. |