Mudcat Café message #2373565

The Mudcat Café™
Thread #112265 Message #2373565
Posted By: Artful Codger
24-Jun-08 - 05:45 PM
Thread Name: Tech: htmlesc.py: Mac script to escape text
Subject: RE: Tech: htmlesc.py: Mac script to escape text

treewind: No, it will not work on Linux, because the PyObjC bridge and the clipboard interface are particular to Mac. The HtmlEscaper class within the script should work on any platform, but you'd have to write your own stuff to exchange text with your platform's clipboard.
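To give an idea of what the platform-independent part amounts to, here is a minimal Python 3 sketch. It is not the HtmlEscaper class from the script (that code isn't shown in this thread); the function name escape_for_html and its exact behaviour are assumptions made for illustration. It escapes the HTML markup characters, then turns every non-ASCII character into a numeric character reference.

```python
import html


def escape_for_html(text):
    """Hypothetical stand-in for the escaping step; not the real HtmlEscaper."""
    # Escape &, < and > first so the text cannot be mistaken for markup.
    escaped = html.escape(text, quote=False)
    # Replace every non-ASCII character with a decimal reference like &#357;.
    return escaped.encode("ascii", "xmlcharrefreplace").decode("ascii")
```

Since the result is pure ASCII, it survives any 8-bit codepage along the way. Getting text to and from the clipboard remains a separate, platform-specific job.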

If there's a Linux command-line utility for interacting with the clipboard (similar to Mac's pbcopy/pbpaste), I might be able to write a version based around that, since my original scripts piped input to/from the Mac command-line utilities. However, pbcopy and pbpaste don't handle Unicode well--I could push data using a simple RTF wrapper, but for input I had to parse RTF files, and although I got rather far along with this approach, it was quite messy and error-prone. (It would, however, have allowed me to emit HTML formatting, so one day I might return to it.)
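A piped version might look roughly like the Python 3 sketch below. Treat it as an assumption-laden outline rather than a tested port: xclip is one Linux utility that can play the pbcopy/pbpaste role, but it has to be installed separately and needs a running X session, and the function names here are made up for illustration. The sketch also assumes the utilities exchange UTF-8, which depends on the locale settings and may not hold everywhere, given the Unicode trouble described above.

```python
import subprocess
import sys


def clipboard_commands(platform=sys.platform):
    """Return (paste_command, copy_command) for the given platform."""
    if platform == "darwin":
        return ["pbpaste"], ["pbcopy"]
    # Assumption: xclip is installed and an X session is available.
    return (
        ["xclip", "-selection", "clipboard", "-o"],
        ["xclip", "-selection", "clipboard", "-i"],
    )


def read_clipboard():
    """Read the clipboard as text, assuming the utility emits UTF-8."""
    paste_cmd, _ = clipboard_commands()
    result = subprocess.run(paste_cmd, capture_output=True, check=True)
    return result.stdout.decode("utf-8")


def write_clipboard(text):
    """Replace the clipboard contents, assuming the utility accepts UTF-8."""
    _, copy_cmd = clipboard_commands()
    subprocess.run(copy_cmd, input=text.encode("utf-8"), check=True)
```

Nothing touches the clipboard until read_clipboard() or write_clipboard() is called, so the command selection can be checked on its own.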

John: Not so. The text types on the clipboard are mostly Unicode-based, which means they're essentially codepage-independent from a user's standpoint. When the system KNOWS it's dealing with Unicode, it can figure out which codepages to switch to for the various encoded characters. As long as the fonts used support the characters used, you're good. The standard fonts nowadays support most of the common non-pictograph languages, including Cyrillic, Hebrew, Arabic and even Devanagari and Tamil script (though not some of the dotted Celtic consonants, so you're still safer using h's).

So even if your source file is 8-bit encoded for a particular codepage, when you copy the text to the clipboard, your editor should translate the text to Unicode, the lingua franca for data exchange. If it doesn't, the text type my script expects will not be available and no translation will occur.
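The translation step can be illustrated in a few lines of Python 3. This is only a sketch of the idea, not what any particular editor does internally; the helper name decode_8bit and the sample byte string are chosen for illustration. It shows why the codepage has to be known: the same byte decodes to different characters under different codepages.

```python
def decode_8bit(raw, codepage):
    """Translate 8-bit encoded bytes into a Unicode string."""
    return raw.decode(codepage)


# The Slovak word "čo" as it would be stored in a Windows-1250 file.
raw = b"\xe8o"

as_central_european = decode_8bit(raw, "cp1250")  # "čo" (U+010D, then "o")
as_western_european = decode_8bit(raw, "cp1252")  # "èo" (U+00E8, then "o")
```

Once the text is Unicode, the codepage it started in no longer matters to whatever receives it.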

Likewise, most modern browsers handle Unicode quite elegantly, and by default. Characters are interpreted as Unicode, the document character set for HTML, XML etc. Only older browsers should have trouble displaying text in any common language aside from the pictograph languages (e.g. Chinese), and even there, you can usually download language support files for your browser. The problem comes when people upload text encoded for older, 8-bit codepages--and even then, most browsers support the common codepages; you just have to select the right one. So your claim that "most people" would have trouble is erroneous, unless you mean those without computers.

For instance, here's some encoded text:
First, a Slovak mountain holler:

Ej, musel by to chlap byť

1. Ej, musel by to chlap byť, čo by ma chcel nabiť, vyberaný.
[:Valaška pri boce, len sa tak ligoce, opasok vybíjaný.:]

2. Ej, čo vás bude sedem, sedemdesiatsedem, nebojím sa.
[:Valaška pri boce, len sa tak ligoce, ej, veru ubránim sa.:]

And the start of a Russian song:

Эта песня интересна
(Частушки)

Эта песня интересна / ты послушай дорогой да
В этой песне всё известно / как гуляли мы с тобой.

And now the Esperanto alphabet:
a b c ĉ d e f g ĝ h ĥ i j ĵ k l m n o p r s t u ŭ v z

You should be able to see all three properly regardless of the codepage your system uses by default when dealing with plain text. The procedure I used is just as I described, and in fact, I grabbed the text from three different source file types and three different editors. Doing a Mudcat message preview, I got the whole shebang poifectly with no muss or fuss. (I use the Firefox browser, in case you're interested.) By the way, those apostrophes after the t's in the Slovak sample are actually part of the character--it's how Slovak (like Czech) represents t-with-háček.

Now, when you copy and save the text, if you save it to a plain text file, you'll have problems, because no 8-bit encoding contains all these characters, and the encoding your system uses by default is unlikely to contain ANY of the special characters, aside from the accented vowels. But the characters which ARE in your native codepage will be handled properly. And if you paste into a word processor, saving with a format like RTF, PDF, Word or Pages, you're okay, since they use Unicode under the covers (or a combination of native codepages and Unicode).
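If you want to check the "no 8-bit encoding contains all these characters" claim yourself, a short Python 3 sketch will do it. The helper name can_encode and the three codepages tried (Windows-1250 for Central European, KOI8-R for Russian, ISO 8859-3 for Esperanto) are picked for illustration; other 8-bit codepages behave the same way.

```python
def can_encode(text, codec):
    """Return True if every character of text exists in the given codec."""
    try:
        text.encode(codec)
        return True
    except UnicodeEncodeError:
        return False


# One word from each sample above: Slovak, Russian and Esperanto.
samples = {"slovak": "byť", "russian": "песня", "esperanto": "ĉ"}
mixed = " ".join(samples.values())

# Each 8-bit codepage covers its own language but not the mixture.
coverage = {
    codec: can_encode(mixed, codec)
    for codec in ("cp1250", "koi8_r", "iso8859_3", "utf-8")
}
```

Here coverage comes out False for the three 8-bit codepages and True only for UTF-8, which is why a Unicode-aware format (or escaped HTML) is the safe way to save such text.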