The Mudcat Café TM
Thread #125269   Message #2773433
Posted By: Mysha
25-Nov-09 - 11:28 AM
Thread Name: Tech: Misinterpreted Characters
Subject: RE: Tech: Misinterpreted Characters
Hi,

Susan, there are several character set problems associated with moving between programs and between formats. Ultimately, however, the one problem that concerns you is that the source format in your final step is ".txt". This simple, flat text format carries no indication of its character encoding, so every program reading it falls back on whatever encoding it assumes to be the standard. Current Windows programs often write such files as UTF-8, which uses more than one byte when it needs to represent anything beyond the basic ASCII characters. As your askSam is an old DOS program, however, it will most likely assume that the standard character set always uses one byte per character, as that was true at the time the program was created. This difference results in a misinterpretation of the .txt files being read into askSam: each multi-byte character shows up as two or three wrong ones.
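
To make that misinterpretation concrete, here is a minimal Python sketch. It assumes the old program reads with the DOS code page 437; that is my guess for illustration, not something I know about askSam, and the sample word is made up.

```python
# Sketch: UTF-8 text read by a program that assumes one byte per character.
# Assumption: the reader uses DOS code page 437 (cp437); askSam may differ.

original = "café"

# UTF-8 needs two bytes for the accented letter, so 4 characters become 5 bytes.
utf8_bytes = original.encode("utf-8")

# A one-byte-per-character reader turns every byte into its own character.
misread = utf8_bytes.decode("cp437")

print(len(original), len(utf8_bytes), len(misread))
print(misread)

# Nothing is lost, only mislabelled: reversing the wrong step restores the text.
repaired = misread.encode("cp437").decode("utf-8")
print(repaired == original)
```

The round trip at the end only works because code page 437 assigns a character to every byte value; with other one-byte code pages some bytes may be undefined and the damage harder to undo.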

There are several ways to put in extra effort to avoid or correct such problems, but as askSam is still in development, I suggest you spend the $89.95 that will upgrade you to version 7. Unicode has, after all, been around for quite a while now, so it should be a reasonable assumption that this issue has been dealt with by now. (If you want to be absolutely certain, you can ask them beforehand.)


Joe, what you want on the input side is for all HTML pages to declare that they are UTF-8. That should prevent most problems with differing character sets, provided Max's backend can handle it, and where things do go wrong they will be visible immediately rather than at harvesting time. (Depending on the system, this may come at the cost of Max having to run a one-time automated pass through the entire database to convert the existing messages.)
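
A one-time conversion pass of that kind could look roughly like the Python sketch below. I do not know how Max stores the messages, so this assumes each message is available as raw bytes and that the old ones are in Windows-1252; the function name to_utf8 is hypothetical.

```python
# Sketch of a one-time conversion pass, under two assumptions:
#   1. each stored message is available as raw bytes;
#   2. messages that are not valid UTF-8 are in Windows-1252 (cp1252).
# The name to_utf8 is made up for this example.

def to_utf8(raw: bytes, legacy: str = "cp1252") -> bytes:
    """Return the message as UTF-8 bytes, converting from the legacy encoding if needed."""
    try:
        # Already valid UTF-8 (plain ASCII included): leave it untouched.
        raw.decode("utf-8")
        return raw
    except UnicodeDecodeError:
        # Not UTF-8: treat it as the legacy encoding. Bytes that cp1252
        # leaves undefined become U+FFFD instead of aborting the pass.
        return raw.decode(legacy, errors="replace").encode("utf-8")


old_message = "café".encode("cp1252")   # as an older browser might have sent it
new_message = "café".encode("utf-8")    # already in the target encoding

print(to_utf8(old_message) == new_message)
print(to_utf8(new_message) == new_message)
```

Trying UTF-8 first means the pass can be run again without harm, since messages that were already converted are left alone.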

As others noted above, however, the curly quote marks are usually caused by automated conversion in a Microsoft program. Ideally, those should simply be converted back unless the users indicate they really mean it. (And, yes, there are ways to determine whether that's true.)
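
Converting the quotes back is the easy half. A minimal Python sketch follows; the function name straighten_quotes is made up, and it does not attempt the harder half of working out whether the user really meant the curly ones.

```python
# Sketch: turn typographic ("curly") quotes back into plain ASCII quotes.
# The name straighten_quotes is hypothetical, and no attempt is made to
# detect users who typed curly quotes on purpose.

CURLY_TO_STRAIGHT = str.maketrans({
    "\u2018": "'",   # left single quotation mark
    "\u2019": "'",   # right single quotation mark, also used as apostrophe
    "\u201c": '"',   # left double quotation mark
    "\u201d": '"',   # right double quotation mark
})


def straighten_quotes(text: str) -> str:
    """Replace the four common curly quote characters with ASCII quotes."""
    return text.translate(CURLY_TO_STRAIGHT)


print(straighten_quotes("\u201cIt\u2019s a long way\u201d"))
```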


So, how many songs do we know on the subject of computer problems, then?

Bye,
                                                                Mysha