The Mudcat Café TM
Thread #118075   Message #2549950
Posted By: JohnInKansas
27-Jan-09 - 01:36 AM
Thread Name: Tech: OCR Tips - Optical Character Recognition
Subject: RE: Tech: OCR Tips - Optical Character Recognition
Joe - I went straight from Office 2002 to Word 2007 on WinXP, when the XP updates crippled 2002, so I guess I missed the Tools in Office 2003. I'll have to take a look at LiK's setup and see if she can be trained to use it instead of passing it all to me to OCR.

In lots of OCR conversions you get bunches of "soft breaks" that look like a bar with a hook on the end. In Word, you can often use "replace ^011" and "replace with nothing" (delete everything in the "replace with" box), then hit "Replace All."

The same thing often works in pastes from some web pages.

If that doesn't work, you can try:

"Replace ^l" for "line breaks" (that's a lower case L, not a numeral 1). Replace with nothing, a space, or with ^p if you want a "real" paragraph end.

"Replace ^n" for "column breaks"
"Replace ^-" for "soft hyphens" if any show up in the paste

If you get stuff from the same source often, it's usually worth figuring out which "breaks" crop up frequently, so that you can use a global replace to clean them up all at once. Word's "replace" function is very versatile, which should make you very happy to use it, but it's also very powerful, which can elicit an occasional "OH SH*T!" if you're not careful. (Undo works even on large blunders, if you catch them in time.)
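For anybody who'd rather do the cleanup outside Word, here's a rough Python sketch of the same idea: count which odd "break" characters show up in a paste, then zap them all in one pass. The character codes in the table are my assumption about what the Word search codes usually correspond to once the text is out of Word (vertical tab for the line break, char 14 for the column break, U+00AD for the soft hyphen), and the function names are made up for this example. Check the counts against your own pastes before trusting the table.

```python
from collections import Counter
import unicodedata

# ASSUMED mapping: what Word's breaks usually turn into in copied text.
# Adjust after looking at the counts for your own source.
BREAK_REPLACEMENTS = {
    "\x0b": " ",    # manual line break (what ^l and ^011 find in Word)
    "\x0e": "\n",   # column break (what ^n finds in Word)
    "\u00ad": "",   # soft (optional) hyphen
}


def count_odd_characters(text):
    """Count control (Cc) and format (Cf) characters in the text.

    Ordinary newlines, carriage returns, and tabs are ignored.
    """
    ordinary = {"\n", "\r", "\t"}
    return Counter(
        ch
        for ch in text
        if ch not in ordinary and unicodedata.category(ch) in ("Cc", "Cf")
    )


def clean_breaks(text, replacements=BREAK_REPLACEMENTS):
    """Apply every replacement in the table, like one big Replace All."""
    for old, new in replacements.items():
        text = text.replace(old, new)
    return text


if __name__ == "__main__":
    sample = "first line\x0bsecond line with a hy\u00adphen\x0bthird line"
    print(count_odd_characters(sample))  # shows which "breaks" are present
    print(clean_breaks(sample))
```

Same caution as with Word's Replace All applies: keep a copy of the original text, because there's no Undo here.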

It's usually easiest, after OCR, to just scan visually for the recognition errors, and some OCR programs mark the "questionables" quite prominently; but the Word spell checker often can help pick up the ones that don't stand out on visual scan and the ones the OCR thought it got but didn't. Decide for yourself whether to spell check first or last.

Amos pastes a lot of funny characters (or did before I quit reading them all), 'cause either the editors he's copying from use Macs with the setting* to "replace ??? with 'real glyphs,'" or it's on his own WP setup. In both Win Word and Mac Word, when you do this, the "glyph" is taken from a local "symbol" char map that may have arbitrary char numbers substituted for the "real" Unicode char nums, so it's impossible to generalize on proper replacements. Win Word users seem to have mostly learned to not do that (or never learned how), but it still pops up frequently from Mac users (especially some web editors). OCR usually will be unable to "read" these.
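You can't generalize the replacements, but you can at least find the offenders. Here's a minimal Python sketch that lists every character in a paste that falls in Unicode's "private use" category, which is where, as far as I know, Word tends to park symbol-font glyphs. That's an assumption on my part and it won't catch every funny character. The function name is made up for the example.

```python
import unicodedata


def find_private_use(text):
    """Return (position, code point) for each private-use character.

    Private-use characters (Unicode category "Co") have no standard
    meaning, so what they display depends on the font on each machine.
    """
    return [
        (i, f"U+{ord(ch):04X}")
        for i, ch in enumerate(text)
        if unicodedata.category(ch) == "Co"
    ]


if __name__ == "__main__":
    # \uf0b7 is a hypothetical example of a symbol-font bullet in a paste.
    print(find_private_use("item one \uf0b7 item two"))
```

Once you know which code points a given source keeps sending, you can decide by eye what each one was supposed to be and add it to your own replace list.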

* This "setting description" is a pseudonym for "really f**k up your page layout people and p*ss off your printers," but it's still pushed by Microsoft, and if you're using a "short font" you'll probably never realize you're doing it.

John