mudcat.org: Tech: Misinterpreted Characters

Joe Offer	24 Nov 09 - 01:31 PM
treewind	24 Nov 09 - 02:24 PM
Jack Campin	24 Nov 09 - 03:26 PM
Simon G	24 Nov 09 - 05:06 PM
Susan of DT	24 Nov 09 - 06:24 PM
Artful Codger	24 Nov 09 - 07:23 PM
Genie	24 Nov 09 - 09:31 PM
JohnInKansas	25 Nov 09 - 02:38 AM
treewind	25 Nov 09 - 03:50 AM
Susan of DT	25 Nov 09 - 07:42 AM
Artful Codger	25 Nov 09 - 09:09 AM
Simon G	25 Nov 09 - 11:28 AM
Mysha	25 Nov 09 - 11:28 AM
Mysha	25 Nov 09 - 11:47 AM
treewind	25 Nov 09 - 04:26 PM
Susan of DT	25 Nov 09 - 06:18 PM
Artful Codger	25 Nov 09 - 06:42 PM
dick greenhaus	25 Nov 09 - 08:37 PM
treewind	26 Nov 09 - 04:01 AM
Mysha	26 Nov 09 - 05:47 AM
Simon G	26 Nov 09 - 08:59 AM
Simon G	26 Nov 09 - 09:01 AM
Bill D	26 Nov 09 - 09:22 AM
Mysha	26 Nov 09 - 11:46 AM
dick greenhaus	26 Nov 09 - 12:20 PM
Simon G	26 Nov 09 - 12:39 PM
dick greenhaus	26 Nov 09 - 05:00 PM
Artful Codger	26 Nov 09 - 08:47 PM
dick greenhaus	26 Nov 09 - 10:31 PM
Artful Codger	27 Nov 09 - 12:09 AM
Joe Offer	12 Jan 18 - 01:57 AM
michaelr	13 Jan 18 - 01:46 PM
Sandra in Sydney	13 Jan 18 - 07:52 PM
Mr Red	14 Jan 18 - 06:14 AM

Subject: Tech: Misinterpreted Characters
From: Joe Offer
Date: 24 Nov 09 - 01:31 PM

We have a problem with some of the text that is posted here - I'm assuming it's mostly text that is written in word processors and copy-pasted here, or it's special characters like umlauts. Some post OK here at Mudcat, but certain song texts get all scrambled up when collected for the Digital Tradition, which is in ASCII text.

The main things that don't look right at Mudcat are the curly quotation marks that you find in word processors like Microsoft Word - is there an HTML tag we can set at the beginning of a post or thread so those quotation marks will look right?

Another thing we have trouble with is Russian characters - can I add a tag at the beginning of a Russian post that will tell browsers to read the Cyrillic character set?

And as for the Digital Tradition, is there an efficient way to copy Mudcat posts into a text file and not get all those characters confused? I've done it, but I haven't been able to get consistent results. I think the best results I've had came from pasting the Mudcat text into Notepad, and then saving the file and reopening it.

Oh, another problem we've had is that the double carriage returns separating verses in a song sometimes disappear when copy-pasted into a text file.

-Joe-

Subject: RE: Tech: Misinterpreted Characters
From: treewind
Date: 24 Nov 09 - 02:24 PM

You can put a character set declaration in the HTML header, but on a system like Mudcat the problem is that different people will post messages using different character sets - one will post with ISO-8859-1 ("Latin-1") and another will use UTF-8 and there's nothing you can do.

The best way, when posting, is for everybody to use proper HTML/SGML character-entity codes for anything not ASCII.
Here's the official list for HTML 4.

&ldquo; and &rdquo; give you “left and right double quotes” for example
&pound; is for £
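
As a rough illustration of the entity approach described above, here is a minimal Python sketch that replaces each non-ASCII character with its HTML 4 named entity where one exists, and with a numeric character reference otherwise. The function name `to_entities` is made up for this example and is not any particular tool. It leaves ASCII alone, so it does not escape `&`, `<` or `>`.

```python
from html.entities import codepoint2name  # HTML 4 entity names keyed by code point


def to_entities(text):
    """Replace non-ASCII characters with HTML entities (illustrative helper)."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                              # plain ASCII passes through
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")  # named entity, e.g. &ldquo;
        else:
            out.append("&#%d;" % cp)                    # numeric reference fallback
    return "".join(out)


print(to_entities("\u201cleft and right\u201d"))  # curly double quotes
print(to_entities("\u00a35"))                     # pound sign
```

Characters with no HTML 4 name, such as Cyrillic letters, come out as numeric references, which browsers treat the same way.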

In the mudcat server code, you could attempt to guess what character set posters are using and translate, but the guess won't always be right.

Anahata

Subject: RE: Tech: Misinterpreted Characters
From: Jack Campin
Date: 24 Nov 09 - 03:26 PM

With left and right quotes, the simple fix is to edit them all into straight quotes. Chances are that no human thought went into putting them in the original text.
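
The straight-quote fix above can be sketched in a few lines of Python. This is only an illustration: the table covers the four common curly quote characters, and the name `straighten_quotes` is invented for the example.

```python
# Map the four common "smart" quote characters to their ASCII equivalents.
SMART_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (also used as apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text):
    """Return text with curly quotes replaced by straight ASCII quotes."""
    return text.translate(SMART_TO_STRAIGHT)


print(straighten_quotes("\u201cDon\u2019t\u201d"))
```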

Subject: RE: Tech: Misinterpreted Characters
From: Simon G
Date: 24 Nov 09 - 05:06 PM

The root of the problem is that mudcat pages aren't properly formed, no doc type, meta tags, etc. These would give you better control over how the text is processed in the browser and what you get back in a POST from the form.

Check a page here

http://validator.w3.org/

will give some hints at what is needed.

The user has no control over the character set that is being input, the page has full control, if it doesn't use it then the browser usually defaults to UTF-8.

Once you have input in a known character set your server side environment should provide functions to convert as necessary.

Your problems with cyrillic output must also be rooted in not declaring the character set for the page.

Subject: RE: Tech: Misinterpreted Characters
From: Susan of DT
Date: 24 Nov 09 - 06:24 PM

My harvesting process is:
1) copy from thread and paste into Word
2) adjust the format, check for carriage returns between verses (a very common problem), assign a filename, keywords, etc.
3) save as a txt file
4) import txt file into askSam database (an old DOS program that is still the best we have seen)

Characters that looked fine in the text file turn funny in askSam. This thread came about because I asked Joe what to do about some of the German songs he had posted. The umlauts and double S characters went weird and I asked for reasonable all normal English characters to use instead. I also had problems in this batch (still hoping for a 2009 version, but it may not have the tunes straightened out) with French e acutes and some regular single and double quotes - not always consistent within a single song with some working and some not.

I have not harvested any of the Cyrillic songs, since I cannot even see them properly on mudcat on my computer, nor would I have the foggiest notion of whether it was right.

Subject: RE: Tech: Misinterpreted Characters
From: Artful Codger
Date: 24 Nov 09 - 07:23 PM

Users should not have full control--that opens the door to phishing and other malicious practices. Furthermore, users would experience fewer problems if all text were properly posted in Unicode.

When posting text pasted from a formatted source, always preview, since what you see in the message box will look right even if the result, interpreted as HTML, is improper.

I wrote two scripts, called "htmlesc", that make it easy to post text with the proper HTML entities. One is in Python, specific to the Mac; the other is in Java and should operate on all platforms (though some Linux users have run into problems). Both scripts have been posted here, so do a search (I get tired of repeating this info and creating links). Using these scripts, I've posted songs in a variety of languages including Russian and Ukrainian; they also take care of most word-processor special characters like quotes, dashes and copyright symbols, so even if you're only posting English lyrics from a formatted source, they come in handy.

Subject: RE: Tech: Misinterpreted Characters
From: Genie
Date: 24 Nov 09 - 09:31 PM

Subject: RE: Tech: Misinterpreted Characters
From: JohnInKansas
Date: 25 Nov 09 - 02:38 AM

The HTML standards in general use when many web pages still around were created DO NOT AUTOMATICALLY accept "curly quotes" in html code, so if you prepare stuff in a word processor and expect to code anything in your posts you should turn them off.

In Word, a document that contains them can be "cleaned" by unchecking the "use curly quotes" and then doing an Edit|Replace| Replace " with " | Replace all. (You also need to replace curly ' with straight ' for complete "cleansing.") The process can be reversed by turning curlys back on and doing the same global replacements.

Office provides a couple of fonts that include "extended Unicode" character sets, and using one of these may improve legibility of "funny chars" that appear in other people's posts.

Maximum character conversion should occur if you use a font containing a "full Unicode" character set. Microsoft information identifies:

Arial Unicode MS font is a full Unicode font. It contains all of the characters, ideographs, and symbols defined in the Unicode 2.1 standard.

(Unicode: A character encoding standard developed by the Unicode Consortium. By using more than one byte to represent each character, Unicode enables almost all of the written languages in the world to be represented by using a single character set.)

This is obsolete information, since technically it applies only to Office 2003 and WinXP. And, technically, it's wrong, since the font cited doesn't include large parts of the Unicode sets. (Notably omitted are many chars for the "top down" and "right to left" language sets - which you can't type anyway unless you have a "regionalized" OS and keyboard.)

There are, I believe, at least two other "full Unicode" fonts (in the Microsoft sense) commonly supplied with more recent Office, but Microsoft hasn't added anything helpful to their information data bases or to Help files since Vista came out, so it's more difficult to find font information than I'm willing to face up to at present.

For those encountering "strange characters," sometimes you can copy text from a web page and paste it into Word. In recent versions of Word, if you place your cursor (insert point) immediately to the right of a character and hit Alt-X the "Unicode Hex Character Number" should be displayed. The glyph (character picture) displayed on your screen depends on the html page setup and the font you have selected (in your browser and in Word), but in the document file it is represented by its "number." This means that sometimes Word can tell you what the intended character is, by showing you the HEX NUMBER, even if your current setup can't display it. If you know the hex code number, you can browse through Character Map (Start|Programs|Accessories|System Tools in Vista) for a font on your machine that contains a glyph for that char number. Alternatively, you can go to the Unicode Standard to look up what the char should look like.

This "utility" breaks down in some cases because both hardware and programs "regionalized" to specific languages sometimes use "character maps" that replace Roman chars with something else. Rather than making the keyboard produce the proper Unicode char numbers for the language of the region, the number produced by the key is mapped (in the OS) to produce the glyph for a different char.
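
For anyone without Word to hand, the same hex-number lookup described above can be approximated in Python. This is a small sketch using the standard `unicodedata` module; the helper name `describe` is made up for the example.

```python
import unicodedata


def describe(ch):
    """Return the Unicode hex number and official name of one character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch, "<unnamed>"))


# A few of the characters that cause trouble in this thread.
for ch in ["\u00e9", "\u00df", "\u201c"]:
    print(describe(ch))
```

The hex number printed is the one to look for in Character Map or in the Unicode code charts.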

Problems with "reading the char no" can also arise from differences in operating systems. A 16 bit OS stores a 16 bit character as one byte. A 32 bit OS stores two 16 bit characters, or one 32 bit one, per byte. A 64 bit OS stores four 16 bit, two 32 bit, or one 64 bit number per byte. The process of splitting up long bytes to short ones, or reassembling the bits to bytes (technically called "thunking" by Microsoft back when someone there thought about technical stuff), sometimes results in "breakage" so char numbers may be lost. A similar "lossage" occurs fairly consistently for a few characters when the little-endian/big-endian flop is done in going from Mac (and some others) to/from Windows.

Legibility depends largely on the reader's font set so the person posting has little control over what the reader will see if anything other than "common characters" are used. If there's a question, naming the font used (by the one posting) might permit the reader to load a better font. The reader can use an "extended" or "complete" font to somewhat improve things, in some cases; but the variety of methods used in regionalizing of hardware and software precludes there being a complete solution for all users at present.

Full-time use of the "full Unicode font(s)" available since Office 2000 will slow down your computer, because the font itself is

HUGE

and some older OSs and programs may not be able to load them.

Embedding (or calling) the full fonts on a website probably would also be prohibitive, both from the standpoint of server space and traffic load, and from the inconvenience to users of slow services (dial-up?) who might need to download a >30MB(?) font every time they load a page.

John

Subject: RE: Tech: Misinterpreted Characters
From: treewind
Date: 25 Nov 09 - 03:50 AM

Simon G is right that the page does not specify what character set is used, but it wouldn't help if it did, for discussion forum pages where anyone can post.
You can put
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1">
in the headers, but it's a promise that can't be kept in the forum pages, because you can't stop people from posting in UTF-8 or some other flavour of ISO-8859.

As for Susan's problem: when you save a DOC file as text, it asks what character set convention to use, so if you select Latin-1 (a.k.a. West European or ISO-8859-1) at least you know what you are using (it's probably the default anyway). You then have two options:

1. pasting the text into your web page, and making sure you have a charset definition as above, that matches what you chose when exporting to text. This ought to work unless somebody's browser overrides it.

2. Running it through a utility like txt2html (there's one on this Linux system, don't know if there's a windows version but there must be something similar). Apart from formatting line breaks etc., this will by default also translate all non-ASCII characters to their equivalent character entity names (like &ldquo;). Those are guaranteed to be displayed by any browser if the characters are available in any font on the system.

To be sure of the maximum potential coverage of foreign characters, use UTF-8 instead of ISO-8859-1. That's one character set for everything, ASCII compatible, but a lot of software still doesn't use it, or not by default. Windows should support it properly because it uses UTF internally.

There's a UTF-8 to HTML converter *HERE* (an online converter) : Google will probably find others for you.

The only other problem is the database. I have no idea what it does to text outside the ASCII set. I do know that MySQL goes to enormous lengths to define character sets for data to make sure that sorting and collation are done correctly. If your database only handles 7 bit text you should still be OK if you store HTML entity names in it, i.e. convert before storing in the database. They'll look funny in anything other than a web browser, but they will work where they are needed to work.
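
A minimal Python sketch of the "convert before storing" idea above, assuming the database can only hold 7-bit ASCII. It uses numeric character references rather than entity names (both are understood by browsers), and the two function names are invented for the illustration. Note that `html.unescape` also decodes things like `&amp;`, so text containing literal ampersands would need extra care.

```python
import html


def to_ascii_storage(text):
    """Encode text as pure ASCII, turning other characters into &#NNN; references."""
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


def from_ascii_storage(stored):
    """Turn stored character references back into real characters."""
    return html.unescape(stored)


stored = to_ascii_storage("Zo\u00e9")
print(stored)                      # ASCII only, safe for a 7-bit field
print(from_ascii_storage(stored))  # original text restored
```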

Anahata

Subject: RE: Tech: Misinterpreted Characters
From: Susan of DT
Date: 25 Nov 09 - 07:42 AM

Anahata - I'm not trying to get things to html. I am preparing the Digital Tradition, one version of which, several steps later, comes back to html. I want the DT to show legible characters so that all of you can see them straight in whatever form you view the DT. I will take a closer look at that screen that comes up when I convert to text - I have been just saying OK to what comes up.

John - I'll take a look at the quote thing on my Word template

Would I be better off working in Notepad than Word?

I'll be getting a new computer soon and thus switching from XP to Win7 (and DOSBOX, I guess), which may give me a different set of issues.

Subject: RE: Tech: Misinterpreted Characters
From: Artful Codger
Date: 25 Nov 09 - 09:09 AM

I think you guys are missing the mark here. First, OSs, browsers and software outside the techie Dark Ages all support Unicode pretty well. For the languages commonly seen on this forum, there are no longer display problems when the most common fonts are used. This includes the various Cyrillic languages and even Hebrew and Arabic (where right-to-left rendering is handled properly if the Unicode strings include right-to-left code point instructions). If you're concerned about the rendering of Chinese, Hindi, Runic, Glagolitic or other spottily-supported languages, you're addressing a very niche audience in this forum. For properly encoded text, the only display problem you're likely to encounter here among the commonly used languages is that most fonts don't include all Irish consonants or the "agus" symbol.

And as I've already pointed out, when you paste text into Mudcat, you're already pasting raw Unicode rather than 8-bit code-page-dependent text, regardless of your source. So the real question is how to enter HTML-encoded text rather than raw Unicode for characters outside the common ASCII set. Simplest would be if Mudcat automatically converted raw Unicode for you. Unless/until Max implements that, you have to use some other alternative--like my scripts or the utilities Anahata pointed to.

The advantage of my scripts is that they work with any program--in the same way--and they only require one copy and one paste, since they operate directly on the clipboard. Using a utility like Anahata suggested requires two copy/paste operations: first from your source to the web utility, then from the utility's output window to Mudcat. But the advantage is that you don't have to know squat about command windows and configuring script environments, so this may be a better route for the more technically challenged.

I recommend against copying text from "HTML page source" views provided by word processors or browsers, because of the likelihood of importing unbalanced tags and tag/style attributes which aren't defined/understood in the Mudcat page context. Similarly, conversion utilities built into word processors may generate HTML entities which aren't (yet) widely supported, or may embed proprietary tags and other junk; caveat emptor. They also tend to operate in-place, so you may have to create a separate, disposable work file to keep from munging your original source.

Susan: For the Digital Tradition, storing as Unicode is better for collation and searching, and is what you get automatically if you manually cut/paste from Mudcat screens. I'd be surprised if your database program didn't accept/store Unicode for text fields. If you read the HTML source files directly you'll get HTML escapes (and some raw Unicode where you shouldn't) and may have to reconstitute into Unicode for proper collation and searching.

Choosing between Notepad and Word is like choosing between Scylla and Charybdis, but you'll still be far better off with Word, since Notepad probably isn't Unicode aware--it was designed for plain-text 8-bit files, like .bat files.

Jack is being alarmist: you really don't have to muck about with your quotes as long as you encode them when generating HTML--see my earlier comments. You're going to have to encode non-ASCII Unicode anyway, so if you have a generic conversion script or utility, they'll get handled properly without having to force them into the straight-quote straightjacket. One warning: certain characters (like straight quotes but not smart quotes) may have special significance to your database program, and may need to be escaped when they're embedded in text to be stored. That's a separate issue from HTML/Unicode encoding.

Subject: RE: Tech: Misinterpreted Characters
From: Simon G
Date: 25 Nov 09 - 11:28 AM

Anahata - it's the browser's responsibility to convert the text into the correct character set as specified by the page or the default setting before shipping it back to the server. Any glitches as a result of copy/paste should be visible to the user in the text they have pasted in. What they see in the text area the server will get.

As long as the server consistently stays in a character set and deals with any escaping required there won't be a problem. Problems on the server usually result from conversions to ASCII which lose information.

John - you're getting your bytes and words mixed up. A byte is always 8 bits; a 32-bit OS has a 32-bit word (4 bytes) and a 64-bit OS has a 64-bit word (8 bytes). As for data disappearing in conversions between 16, 32 and 64 bit, this would only happen in very poorly constructed software. Not something for us to worry about. The advent of full Unicode fonts means operating systems load characters or pages of characters on demand, so there is no extra load other than disk space. Correct me if I'm wrong, but HTML never downloads a font.

As for DT, the time is long past that it should be in an ISO character set. The tools to fully internationalise are ubiquitous; maybe you don't need to support Mandarin or Arabic, but you will probably get it all for free. The archaic link in Susan's process is the ancient copy of askSam - perhaps time for an upgrade to a newer version.

Simon

Subject: RE: Tech: Misinterpreted Characters
From: Mysha
Date: 25 Nov 09 - 11:28 AM

Hi,

Susan, there are several character set problems associated with moving between programs and between formats. However, ultimately, the one problem that concerns you is that your source format in your final step is ".txt". This, the simple/flat text format, will always use the standard character encoding. For current Windows systems, that encoding is usually set to "utf8", which uses more than one byte if it needs to represent less common characters. As your askSam is an old DOS program, however, it will most likely assume the standard character-set always uses one byte per character, as that was true at the time the program was created. Obviously, this difference results in a misinterpretation of the .txt files that are being read in in askSam.
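
The misinterpretation described above is easy to reproduce. The sketch below encodes text as utf8 and then reads the bytes back as if every byte were one character. Which one-byte code page askSam actually assumes is not known here; Latin-1 and the old DOS code page 437 are used purely as examples, and `misread` is a made-up name.

```python
def misread(text, assumed="latin-1"):
    """Encode as UTF-8, then decode with a one-byte-per-character code page."""
    return text.encode("utf-8").decode(assumed)


umlaut = "\u00fc"  # u with umlaut
print(repr(misread(umlaut)))            # two wrong characters instead of one
print(repr(misread(umlaut, "cp437")))   # a different pair under a DOS code page
```

Each accented letter turns into two unrelated characters, which matches the kind of "funny" text seen after import.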

There are several ways to put in extra effort to avoid or correct such problems, but as askSam is still in development, I suggest you expend the $89,95 that will update you to version 7. Unicode has, after all, been around for quite a while now, so, it should be a reasonable assumption that this issue will have been dealt with by now. (If you want to be absolutely certain, you can ask them beforehand.)

Joe, what you want on the input side is that all html pages specify that they are utf8. This should solve most problems with different character sets beforehand, provided Max's backend can handle it, and where things go wrong they will be visible immediately, rather than at harvesting time. (This may come at the cost of Max having to do once an automated pass through the entire base to convert the existing messages, though, system depending.)

As others noted above, however, the curly quote marks are usually caused by automated conversion in a Microsoft program. Ideally, those should simply be converted back unless the users indicate they really mean it. (And, yes, there are ways to determine whether that's true.)

So, how many songs do we know on the subject of computer problems, then?

Bye,
Mysha

Subject: RE: Tech: Misinterpreted Characters
From: Mysha
Date: 25 Nov 09 - 11:47 AM

Hi,

I sometimes forget little connecting remarks about things that are obvious to me, but not to others:

- "Unicode" is intended as a character set of all characters of human language.

- "utf8" is a way to store texts in Unicode in a file - and html pages are text files too. (Basically, the characters are encoded in such a way that the characters of the English latin alphabet, plus a number of other often-used characters, will each take up one byte, while other characters will each use several bytes. It should probably be considered today's standard encoding, but not all programs support that standard, yet.)

- askSam are at http://asksam.com.
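
To make the utf8 point above concrete, here is a small Python sketch that counts how many bytes individual characters take up in utf8. The helper name `utf8_length` is invented for the example.

```python
def utf8_length(ch):
    """Number of bytes the character occupies when encoded as UTF-8."""
    return len(ch.encode("utf-8"))


samples = [
    "e",        # plain English letter
    "\u00e9",   # e with acute accent
    "\u00df",   # German sharp s
    "\u041f",   # Cyrillic capital Pe
    "\u201c",   # left curly double quote
]
for ch in samples:
    print(repr(ch), utf8_length(ch))
```

The plain letter takes one byte, the accented and Cyrillic letters two, and the curly quote three.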

Bye,
Mysha

Subject: RE: Tech: Misinterpreted Characters
From: treewind
Date: 25 Nov 09 - 04:26 PM

What Mysha said.
I concur that UTF-8 is the way forward. Good news if AskSam can be upgraded to a UTF-8 version.
And a UTF-8 declaration on the http headers from the Mudcat servers too - that is The Right Way To Do It and if anything then doesn't display correctly it should be fixed by changing it to UTF-8.

Unfortunately I think we'll find that a lot of users still post in Latin-1 by default and will complain.

Never mind, I live in hope that one day we'll look back and laugh at the Tower of Babel that code pages used to give us.

Anahata

Subject: RE: Tech: Misinterpreted Characters
From: Susan of DT
Date: 25 Nov 09 - 06:18 PM

The windows version of askSam is not nearly as talented as the DOS version, per Dick. Of course, it was some years ago that he looked at it.

Subject: RE: Tech: Misinterpreted Characters
From: Artful Codger
Date: 25 Nov 09 - 06:42 PM

Simon: You're both correct and incorrect. While the text you paste in the Mudcat box consists of the (Unicoded) characters you see, the current character set for Mudcat web pages is the default ASCII. So any characters outside the 7-bit range which are not encoded as HTML entities constitute improper HTML and get munged when the message is displayed. What you see in the message box is not necessarily what will be displayed, even to the poster.

It might be possible for Mudcat to fix the problem by declaring a Unicode encoding for its web pages--at least any new ones generated. With any declared web page encoding, you can still embed HTML entities in place of raw Unicode, so there would be no compatibility problems with older text that was properly encoded in the first place, or with text that was HTMLized before submission.

Anahata: Actually, UTF-8 (as opposed to UTF-16 or UTF-32) is an outmoded and error-prone way to go. Although for Roman text it is more compact than the other Unicode encodings, it is less efficient for text handling, particularly collation and sorting. Software has to unpack the character information into a more expanded form, and this involves a lot of bit twiddling.

Furthermore, unlike UTF-16 files etc., UTF-8 files generally lack signal bytes at the start that unambiguously indicate the UTF encoding type and byte ordering. So not only do they end up as garbage when transferred to a machine with different byte ordering, but they can be easily mistaken for plain-text files, and thus improperly interpreted according to a native character set. UTF-8 is fine for email transmission and such because the email protocols define the encoding and byte ordering unambiguously. Otherwise, it's a bad choice.
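
For reference, the byte layouts and "signal bytes" (byte-order marks) under discussion can be inspected directly. This is only a sketch of what Python's standard codecs produce; it makes no claim about what any particular editor or database writes, and `byte_layouts` is a made-up name.

```python
import codecs


def byte_layouts(ch):
    """Show the raw bytes for one character in three Unicode encodings."""
    return {name: ch.encode(name) for name in ("utf-8", "utf-16-le", "utf-16-be")}


print(byte_layouts("\u00e9"))          # e with acute accent
print("UTF-8 BOM:    ", codecs.BOM_UTF8)
print("UTF-16 LE BOM:", codecs.BOM_UTF16_LE)
print("UTF-16 BE BOM:", codecs.BOM_UTF16_BE)
```

The two UTF-16 forms are byte-swapped versions of each other, while the UTF-8 form is a single fixed byte sequence; a UTF-8 file may optionally start with its own three-byte mark.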

Subject: RE: Tech: Misinterpreted Characters
From: dick greenhaus
Date: 25 Nov 09 - 08:37 PM

The question of using AskSam is really an irrelevancy. The purpose of the DT is to enable searches, not just to accurately list songs that use different character sets; if users can't type in what they're looking for, there's no point to it.

Subject: RE: Tech: Misinterpreted Characters
From: treewind
Date: 26 Nov 09 - 04:01 AM

AC - I think you underestimate the importance of ASCII compatibility in UTF-8. And I know Wikipedia is not the absolute fount of all knowledge, but: "Sorting of UTF-8 strings as arrays of unsigned bytes will produce the same results as sorting them based on Unicode code points."
Also it doesn't need a byte order mark. Some software relies on a UTF-8 marker to distinguish it from ASCII text, but that's not the best way to do it. A lot of software designed for ASCII will work with UTF-8, but it won't work with UTF-16.

As for bit twiddling, the world is full of compression algorithms that do far more of that, but nobody complains about that.
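
The two properties claimed above, sorting by bytes and ASCII compatibility, can be checked with a short Python sketch. Python compares strings by code point, so the comparison is direct. The word list and the function name `utf8_sort_matches` are just examples.

```python
def utf8_sort_matches(words):
    """True if sorting by UTF-8 bytes gives the same order as sorting by code point."""
    by_code_point = sorted(words)                              # str compares by code point
    by_utf8_bytes = sorted(words, key=lambda w: w.encode("utf-8"))
    return by_code_point == by_utf8_bytes


words = ["zo\u00e9", "zoe", "Zoe", "\u041f\u0440\u0430\u0432\u0434\u0430", "splendour"]
print(utf8_sort_matches(words))

# Pure ASCII text is byte-for-byte identical in ASCII and UTF-8,
# whereas UTF-16 inserts zero bytes that ASCII-only software will trip over.
print("splendour".encode("utf-8") == "splendour".encode("ascii"))
print(b"\x00" in "splendour".encode("utf-16-le"))
```

Note that this is code-point order, not dictionary order for any particular language; proper collation is a separate step.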

Anahata

Subject: RE: Tech: Misinterpreted Characters
From: Mysha
Date: 26 Nov 09 - 05:47 AM

Hi,

Dick, can you elaborate a bit? Are you telling us you're not married to askSam (or not to the DOS version), or that you're afraid people with US keyboards will be unable to find cyrillic texts?

Bye,
Mysha

Subject: RE: Tech: Misinterpreted Characters
From: Simon G
Date: 26 Nov 09 - 08:59 AM

I think Dick means DT should find zoé (with e acute) when I search for zoe. One possible answer is to use google or similar to do the searching, as their system does all this stuff already.

I guess searching for pravda should also produce Правда, which I don't think google can manage. To do that you would need an anglicised text for songs.
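
The accent-insensitive matching described above (zoe finding zoé) can be sketched with Python's standard `unicodedata` module. The names `fold` and `search` are invented for the example, and this is not how the DT or Google actually search. It does not transliterate Cyrillic, so pravda would still not find Правда.

```python
import unicodedata


def fold(text):
    """Strip accents and case so that 'Zoé' and 'zoe' compare equal."""
    decomposed = unicodedata.normalize("NFKD", text)   # split letters from their accents
    stripped = "".join(c for c in decomposed if not unicodedata.combining(c))
    return stripped.casefold()


def search(query, titles):
    """Return the titles that contain the query, ignoring accents and case."""
    key = fold(query)
    return [t for t in titles if key in fold(t)]


print(search("zoe", ["Zo\u00e9", "Zoe", "Chloe"]))
```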

Simon

Subject: RE: Tech: Misinterpreted Characters
From: Simon G
Date: 26 Nov 09 - 09:01 AM

BTW on my browser the cyrillic in my last message shows correctly. Does it on yours?

Subject: RE: Tech: Misinterpreted Characters
From: Bill D
Date: 26 Nov 09 - 09:22 AM

It shows fine in my Firefox.... haven't looked in Opera.

Subject: RE: Tech: Misinterpreted Characters
From: Mysha
Date: 26 Nov 09 - 11:46 AM

Hi,

But I would not want to find Zoé if I searched for Zoe. If I had wanted to find Zoé, I would have searched for Zoé. That's exactly what's wrong with Google: They give you lots of hits for a search you didn't want them to perform.

Yes, the Cyrillic shows up fine. What happens in the background, though, is that our browsers take a look at the page, can't find the character encoding, and hazard a guess.
* My Firefox guesses it's Latin 1, and then adds extra characters that are encoded as (Unicode) character entities (codes between & and ; to represent characters that can't be represented in HTML, otherwise.)
* The W3C HTML validator guessed Utf-8, rejected that, and then tried Windows-1252 (that's the character set with the non-standard curly quotes), and from there took the same path.
It's differences like those that we should be able to avoid by having the pages specify the character set.
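
The guess-and-fall-back behaviour described above can be imitated in Python. This is a simplified sketch, not what Firefox or the W3C validator actually do: it tries strict Utf-8 first and falls back to Windows-1252 if the bytes are not valid Utf-8. The function name `guess_decode` is made up.

```python
def guess_decode(raw):
    """Decode bytes as UTF-8 if valid, otherwise fall back to Windows-1252.

    Returns a (text, encoding_name) pair.
    """
    try:
        return raw.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Bytes undefined in cp1252 are replaced rather than raising an error.
        return raw.decode("cp1252", errors="replace"), "cp1252"


print(guess_decode("Zo\u00e9".encode("utf-8")))
print(guess_decode("Zo\u00e9".encode("cp1252")))
print(guess_decode(b"\x93curly\x94"))   # Windows-1252 curly quotes
```

A guess like this can still be wrong, which is why a declared character set is the more reliable route.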

(There were also a few other errors I noticed before, that the validator now protested about. If Max needs help to clean them up, he should feel free to pm me.)

I agree that entering Cyrillic on a Latin keyboard is rather difficult, though, but transliteration would have its drawbacks too. Either way, my guess would be that people who want to search in texts that use other character sets, are likely to also have means to enter such searches.

So how many songs are there about computers?
Bye
Mysha

Subject: RE: Tech: Misinterpreted Characters
From: dick greenhaus
Date: 26 Nov 09 - 12:20 PM

Let me clarify. I'm not committed to AskSam--it's just that I haven't found any database that does the same job, and we(Susan and I) find it very useful in dealing with submissions that are really duplicates (or very-close relatives) of existing entries.

My main concern, though, with Cyrillic Characters, or accents grave, or any of the non-ASCII characters, is that the person searching for a song doesn't have a simple way of entering, say, Zoé.

I seem to recall that postal services have established ASCII equivalents to these special character sets---if so it would behoove us to use them. I'm sorry if I'm being Anglocentric, but, in reality, that's what I am---in the company of a vast number of people who search the DT.

Subject: RE: Tech: Misinterpreted Characters
From: Simon G
Date: 26 Nov 09 - 12:39 PM

Dick

You don't have to venture outside English to have search issues. If I search for "splendour" do I mean "splendor" as well?

Personally I like the relaxed, all-inclusive google way of counting e and é as the same, and splendour and splendor as the same. The two issues are different, though: one is just a slightly different spelling, but as Mysha points out, e and é are two entirely different letters. They aren't special characters just because they are rarely used in American English.

Would there be any purpose in recognising translations, so that if I look up Silent Night I also get Stille Nacht, with translations? For this song and many others, this would represent the tradition more effectively.

Simon

Subject: RE: Tech: Misinterpreted Characters
From: dick greenhaus
Date: 26 Nov 09 - 05:00 PM

One thing I like about AskSam is that you can use wildcards in a search. So if you look for splend* it covers both spellings.
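
For illustration only, here is a minimal Python sketch of that kind of wildcard match. It is not how AskSam works internally; the function name `wildcard_search` and the sample word list are made up for the example, and the matching relies on the standard library's shell-style patterns.

```python
from fnmatch import fnmatchcase


def wildcard_search(pattern, words):
    """Return the words matching a shell-style wildcard pattern, ignoring case."""
    pattern = pattern.lower()
    return [w for w in words if fnmatchcase(w.lower(), pattern)]


# Hypothetical sample data, not taken from the Digital Tradition.
titles = ["splendor", "Splendour", "splendid", "spend"]
matches = wildcard_search("splend*", titles)
print(matches)
```

Note that a pattern like splend* catches both spellings but also unrelated words such as "splendid", so a wildcard widens the search without knowing which variants are really the same word.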

Subject: RE: Tech: Misinterpreted Characters
From: Artful Codger
Date: 26 Nov 09 - 08:47 PM

Entering Cyrillic from my Mac's Latin keyboard is a piece of cake. I just select a "Russian - Phonetic" keyboard mapping and type pretty much as I'd expect to type Romanized Russian. Setting up the selection menu initially (through the International preferences) was a snap, too.

When I was running Windows, things weren't quite so easy, since Microsoft only provided native Cyrillic keyboard layouts--I had to create my own QWERTY-style layouts using a keyboard construction utility. After that, it was easy. (I offered to upload them, if someone would provide a repository, but no one took me up on the offer--your loss.)

Why not provide both exact and character-equivalent forms of searching? There is a need for each. Note also that you may have to handle case equivalence and composed-sequence equivalents (like o-overstrike-acute for ó or c-h for the single glyph ch)--Unicode collation software handles this by "normalizing" to standard forms before comparison. There are other instruction and formatting code points which might be interspersed with text, which custom searching software would have to ignore (or possibly consider as word breaks).
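
As a rough sketch of the two search modes, assuming Python and its standard `unicodedata` module: the function names below are hypothetical, and this is only one way to do it, not a description of any existing Mudcat or DT code. The "exact" mode normalizes composed and decomposed sequences to one form and drops formatting code points; the "folded" mode additionally strips accents and ignores case. Digraph handling (c-h as one letter) is language-specific and is not attempted here.

```python
import unicodedata


def normalize_exact(text):
    """Drop format code points (category Cf) and compose to NFC,
    so 'o' + combining acute compares equal to the single code point for ó."""
    cleaned = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return unicodedata.normalize("NFC", cleaned)


def normalize_folded(text):
    """Character-equivalent form: also strip combining marks and ignore case."""
    decomposed = unicodedata.normalize("NFD", normalize_exact(text))
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


def matches(query, text, exact=True):
    """Substring search in either exact or character-equivalent mode."""
    norm = normalize_exact if exact else normalize_folded
    return norm(query) in norm(text)


print(matches("cancio\u0301n", "La canci\u00f3n"))        # composed vs decomposed
print(matches("cancion", "La Canci\u00f3n", exact=False))  # accent and case folded
```

Keeping the two normalizers separate means the exact search still distinguishes e from é, while the folded search treats them alike.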

Subject: RE: Tech: Misinterpreted Characters
From: dick greenhaus
Date: 26 Nov 09 - 10:31 PM

It's a good idea, but makes things a bit complicated. Most users haven't been able to master searches beyond song titles.

Subject: RE: Tech: Misinterpreted Characters
From: Artful Codger
Date: 27 Nov 09 - 12:09 AM

Dick: I hope I'm responding to what you meant (that wildcards or exact/folded options make things a bit more complicated?)

Search options benefit the more experienced users--those who typically make heavy use of searches and bother to read documentation on how to make those searches more efficient. Experienced users should not be deprived of useful options just because lazy users never learn how to use them (preferring to waste their time wading through garbage results), or because options are implemented inconsistently from engine to engine or site to site. Nor should we have to call up special "advanced search" screens if we can just use a few wildcards or special query sequences in the regular search field.

Subject: RE: Tech: Misinterpreted Characters
From: Joe Offer
Date: 12 Jan 18 - 01:57 AM

Many of the special characters are being interpreted properly - umlauts, graves, and tildes seem to be working, but I'm not completely sure. But apostrophes and quotation marks are still giving us trouble. I'm not going to worry about it for most things, but I am especially concerned about lyrics posted with incorrect characters. If you post lyrics and they don't come out right, let me know by email or personal message, and I'll do my best to fix 'em.
-Joe-
joe@mudcat.org

Subject: RE: Tech: Misinterpreted Characters
From: michaelr
Date: 13 Jan 18 - 01:46 PM

I've been seeing lots of question marks in posts here lately, apparently in place of apostrophes.

Subject: RE: Tech: Misinterpreted Characters
From: Sandra in Sydney
Date: 13 Jan 18 - 07:52 PM

me, too, but it might be my Mac

Subject: RE: Tech: Misinterpreted Characters
From: Mr Red
Date: 14 Jan 18 - 06:14 AM

Most of those question marks are from copy & paste from who knows where. Apple does things differently; even in e-mails I see the hand of Steve Jobs & his successors.

Just a thought, but the problemo may start at the keyboard. Not being a Mac evangelist, I don't know, but I suspect it may be a factor. When cutting and pasting (from a Mac), try over-writing apostrophes and quotation marks after pasting and before submitting. See if that improves things. If not, there is little you can do short of techie things like using hash codes.
e.g. typing:
&#39; for ' - apostrophe
&#34; for " - either quotation mark
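
To make the mechanics concrete, here is a small Python sketch. It assumes the question marks come from curly ("smart") quotes being forced into a character set that lacks them, which is a guess about the cause rather than a confirmed diagnosis of Mudcat's software. The helper names `straighten` and `to_numeric_refs` are invented for the example.

```python
# Map "smart" punctuation to plain ASCII equivalents.
SMART_TO_ASCII = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark (curly apostrophe)
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten(text):
    """Replace curly quotes and apostrophes with straight ASCII ones."""
    return text.translate(SMART_TO_ASCII)


def to_numeric_refs(text):
    """Replace every non-ASCII character with an HTML numeric reference."""
    return text.encode("ascii", errors="xmlcharrefreplace").decode("ascii")


pasted = "Don\u2019t say \u201cgoodbye\u201d"

# Forcing the text into ASCII turns each curly quote into a question mark.
lossy = pasted.encode("ascii", errors="replace").decode("ascii")
print(lossy)                    # Don?t say ?goodbye?
print(straighten(pasted))       # Don't say "goodbye"
print(to_numeric_refs(pasted))  # Don&#8217;t say &#8220;goodbye&#8221;
```

Over-typing the quotes by hand has the same effect as `straighten`; the numeric references keep the curly shapes but only display correctly where the page renders them as HTML.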

Most other problems would be European accented characters, and that is universally a problem on the 'Cat (at the moment).

All original material is copyright © 2022 by the Mudcat Café Music Foundation. All photos, music, images, etc. are copyright © by their rightful owners. Every effort is taken to attribute appropriate copyright to images, content, music, etc. We are not a copyright resource.