The Mudcat Cafe

Tech: OCR Tips - Optical Character Recognition

Subject: Tech: OCR Tips - Optical Character Recognition
From: Joe Offer
Date: 26 Jan 09 - 07:47 PM

Somebody has a nice songbook in PDF format, and wanted to post a number of songs from the book here at Mudcat. Since the book was PDF, the member said that it couldn't be copy-pasted here.
Well, it can't be posted here as an image, but it's relatively easy to OCR just about anything, and post it here.

The tools I like best for OCR are Microsoft Office Document Imaging and MS Office Document Scanning. They come with Word and Office, but they are not installed by a normal installation. Go to Control Panel/Programs and select "uninstall a program." A list of programs installed on your computer will appear on your screen. Highlight Microsoft Office, and an option to "change" the program will appear. Select "Office Tools" and choose to run MS Office Document Imaging from your computer. Office will then install the program, and you'll find it under "Office Tools" on your list of programs.

MS Office Document Scanning operates your scanner and takes an image of the document. Then you can highlight the portion you want to copy, right-click, and select "copy" to get OCR text that you can paste into a Mudcat box or word processing document.

Here's a way to OCR PDF files - Adobe Reader allows you to highlight and copy an image from a PDF document. You can take that copied image and paste it into Microsoft Office Document Imaging (mspview.exe) - go to page/paste page. That program does a pretty good job of OCR, and then you can just paste the text into a Mudcat message box (or Word document) and edit it. Sometimes, it comes out with no OCR errors at all (but that's rare).

-Joe-



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Jack Campin
Date: 26 Jan 09 - 08:39 PM

Some PDFs encapsulate images and need to be scanned to recover the text - others are effectively styled text with the fonts included, and for those the full version of Acrobat will extract the text (unless it's been password-protected somehow). If you can do the latter it'll be a lot quicker and more accurate.



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 26 Jan 09 - 11:19 PM

A PDF that was made from a text document usually preserves the letter characters so you can copy text from the PDF to another kind of document or other file.

If you can select the text in a PDF, the simplest method is to select and copy the text and paste into Word. Pasting into Word first is recommended, since editing and cleanup is easier there.

If you need line-to-line alignment - as for chord symbols over text - put the text into a monospaced font like Courier, adjust the spacing, then put a <pre> at the beginning and </pre> at the end, and when you copy from Word and paste to the 'cat the spacing should be preserved. A <pre> tag should always display in the reader's browser in a monospaced font, but not necessarily the same one you use to set it up. (But always preview before posting, please.)
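
The <pre> trick can also be sketched in a few lines of Python. This is only an illustration, not anything Mudcat provides: the name `wrap_in_pre` is made up, and it assumes the chords have already been lined up over the words with spaces. Tabs are expanded because different programs show them at different widths, and &, < and > are escaped so they can't be mistaken for HTML.

```python
import html


def wrap_in_pre(text: str, tab_width: int = 8) -> str:
    """Wrap chord-over-lyric text in <pre> tags so the spacing survives.

    Hypothetical helper, for illustration only.
    """
    lines = []
    for line in text.splitlines():
        # Tabs render at different widths in different places; use spaces.
        line = line.expandtabs(tab_width)
        # Trailing spaces are invisible and only add noise.
        line = line.rstrip()
        # Escape &, < and > so they display as literal characters.
        lines.append(html.escape(line, quote=False))
    return "<pre>\n" + "\n".join(lines) + "\n</pre>"


if __name__ == "__main__":
    song = "G        C\nAmazing grace, how sweet"
    print(wrap_in_pre(song))
```

Previewing before posting is still the real test, since this only prepares the text and can't know how a given browser will show it.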

A PDF can be set by the creator (if said creator has the high-dollar version from Adobe) to prevent copying. There reportedly are methods to "work around" the copy lockout, but they've varied with the version of the PDF maker, and aren't consistently reliable. If you run into one of these, go to the next step.

Nearly all PDFs can be printed. (There's a lockout for that too, but it's seldom used.) A print can, as Joe points out, be scanned to get an image file, and the image file then run through OCR to get text. (The method Joe cites appears to take a "screen scan" of the PDF page, if it's an image and not individual characters.)

Some PDFs are just bundles of images, usually with one page-image per PDF page. Photoshop Elements will extract images (but not text) from a PDF, so if you run into one of these you can extract an image of each page, and run those images through OCR. (You can also "do it the hard way" and just select one page at a time as a picture and paste each page into an image editor to make page-images, if the PDF hasn't been blocked for copying.)

Note that the Office Tools Joe refers to are, so far as I know, only available in Office 2007. Office 2007 is the standard for Vista, and will run on WinXP SP2 or later. I haven't seen any blather about it being usable on anything earlier than WinXP, but it might run on Win2K.

The features cited are based on a new "universal document" format (called Open XML) that came into being with Office 2007 and is claimed to be at least partly "interoperable with PDF."

Those who don't have Office 2007 might still find an OCR program quite useful. The leading commercial products that I've heard of are OmniPage and TextBridge. Both have "lite" versions available for a little under $100 (or did the last time I looked), but a full-blown program will run closer to $500 if you want truly impressive performance. There are freeware programs available, but the reviews have mostly said that they lag the "useful state of the art" by five or ten years.

For OCR, using separate programs I've looked at, it often is best to avoid having the "best possible images." For ordinary text, a scan at 150 dpi may give better results than a "photo scan" at 300 dpi or higher, and monochrome usually works a little better than color. Results generally are so variable that the only way to know is to try out a program on a variety of source images.
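
To put rough numbers on the resolution choice, here is some back-of-envelope arithmetic in Python. It assumes a US Letter page (8.5 x 11 inches) and uncompressed images, and the function name is invented for the example. It says nothing about OCR accuracy, only about how much data the program has to chew through.

```python
def scan_size(dpi: int, bits_per_pixel: int,
              width_in: float = 8.5, height_in: float = 11.0):
    """Return (pixels wide, pixels high, uncompressed bytes) for one page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    total_bits = width_px * height_px * bits_per_pixel
    # Whole bytes only; real file formats add headers and row padding.
    return width_px, height_px, total_bits // 8


if __name__ == "__main__":
    for label, dpi, bpp in [("150 dpi monochrome", 150, 1),
                            ("300 dpi monochrome", 300, 1),
                            ("300 dpi 24-bit color", 300, 24)]:
        w, h, size = scan_size(dpi, bpp)
        print(f"{label}: {w} x {h} pixels, about {size / 1_000_000:.1f} MB")
```

Doubling the resolution quadruples the pixel count, and 24-bit color multiplies the data by 24 over 1-bit monochrome, so the "lesser" scan is also much quicker to handle.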

Expect to proofread carefully, and correct the typos before posting, for any OCR-produced text.

John



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Joe Offer
Date: 27 Jan 09 - 12:06 AM

Microsoft Office Document Imaging and MS Office Document Scanning were part of Office 2003, as well as Office 2007. I don't know about earlier versions. They're very simple tools, but they work very easily and very well.

Most online copies of old songbooks seem to be in PDF format - scans of full pages, and not taken from text. Therefore, you have to use OCR to extract the text.

-Joe-



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Amos
Date: 27 Jan 09 - 12:42 AM

In the full version of Adobe Acrobat, under the Document menu, choose "convert using OCR" and select the option for the whole document.

You will still need to copy (in some PDFs, one page at a time) and paste as John describes above. The need for careful proofing is because OCR is always only 98-99% accurate at best. Even if you get lucky there will be issues of line breaks and peculiar treatment of hyphens.
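
To see why 98-99% still means real proofreading, a quick Python calculation helps. The page size of 1,500 characters is an assumption picked for the example (a song sheet may well be shorter), and `expected_errors` is just a name made up here.

```python
def expected_errors(characters: int, accuracy: float) -> float:
    """Expected number of wrong characters at a given per-character accuracy."""
    if not 0.0 <= accuracy <= 1.0:
        raise ValueError("accuracy must be between 0 and 1")
    return characters * (1.0 - accuracy)


if __name__ == "__main__":
    page = 1500  # assumed number of characters on one page
    for accuracy in (0.98, 0.99):
        errors = expected_errors(page, accuracy)
        print(f"{accuracy:.0%} accurate: about {errors:.0f} errors per page")
```

At those rates a 1,500-character page would carry somewhere around 15 to 30 mistakes, which is plenty to justify reading every line.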

A



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 27 Jan 09 - 01:36 AM

Joe - I went straight from Office 2002 to Word 2007 on WinXP, when the XP updates crippled 2002, so I guess I missed the Tools in Office 2003. I'll have to take a look at LiK's setup and see if she can be trained to use it instead of passing it all to me to OCR.

In lots of OCR conversions you get bunches of "soft breaks" that look like a bar with a hook on the end. In Word, you can often use "replace ^011" and "replace with" nothing (delete everything in the "replace with" box), then "Replace All."

The same thing often works in pastes from some web pages.

If that doesn't work, you can try:

"Replace ^l" for "line breaks" (that's a lower case L, not a numeral 1). Replace with a blank, a space, or with ^p if you want a "real" paragraph end.

"Replace ^n" for "column breaks"
"Replace ^-" for "soft hyphens" if any show up in the paste

Usually stuff that you get from a source that you use often is worth figuring out which "breaks" crop up frequently, so that you can use a global replace to clean them up all at once. Word's "replace" function is very versatile which should make you very happy to use it, but is also very powerful which can elicit an occasional "OH SH*T!" if you're not careful. (Undo works even on large blunders, if you catch them in time.)
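
The same global-replace idea works outside Word. Here is a minimal Python sketch, assuming the pasted text carries the manual line break as character 11 (vertical tab), the column break as character 14, and soft hyphens as Unicode U+00AD. Text from other sources may use different characters, so count what you actually have before replacing anything. The function names are invented for the example.

```python
# Characters assumed to stand in for the breaks discussed above.
LINE_BREAK = "\x0b"      # vertical tab, character 11
COLUMN_BREAK = "\x0e"    # character 14
SOFT_HYPHEN = "\u00ad"   # Unicode soft hyphen


def count_breaks(text: str) -> dict:
    """Report how many of each break are present, before changing anything."""
    return {
        "line breaks": text.count(LINE_BREAK),
        "column breaks": text.count(COLUMN_BREAK),
        "soft hyphens": text.count(SOFT_HYPHEN),
    }


def clean_breaks(text: str, line_break_as: str = " ") -> str:
    """Drop soft hyphens and turn the other breaks into ordinary text."""
    text = text.replace(SOFT_HYPHEN, "")
    text = text.replace(COLUMN_BREAK, "\n")
    # Pass "\n" as line_break_as to keep line ends instead of joining lines.
    return text.replace(LINE_BREAK, line_break_as)


if __name__ == "__main__":
    sample = "soft\u00adware\x0bnext line\x0enext column"
    print(count_breaks(sample))
    print(clean_breaks(sample))
```

Because the function returns a new string and leaves the original alone, there is nothing to undo if the result looks wrong.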

It's usually easiest, after OCR, to just scan visually for the recognition errors, and some OCR programs mark the "questionables" quite prominently; but the Word spell checker often can help pick up the ones that don't stand out on visual scan and the ones the OCR thought it got but didn't. Decide for yourself whether to spell check first or last.
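
As a rough stand-in for part of that visual scan, here is a small Python sketch that flags words mixing letters and digits (such as "c1ear"), a typical OCR slip. The rule and the names are assumptions invented for this example. It won't catch a real word substituted for another real word, and it will also flag legitimate things like "2nd", so it doesn't replace proofreading or a spell checker.

```python
import re

# Runs of letters, digits and apostrophes count as one word.
_WORD = re.compile(r"[A-Za-z0-9']+")


def suspicious_words(text: str) -> list:
    """Return words that contain both a letter and a digit, in order found."""
    flagged = []
    for word in _WORD.findall(text):
        has_letter = any(ch.isalpha() for ch in word)
        has_digit = any(ch.isdigit() for ch in word)
        if has_letter and has_digit:
            flagged.append(word)
    return flagged


if __name__ == "__main__":
    print(suspicious_words("The c1ear 0ld road in 1909"))
```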

Amos pastes a lot of funny characters (or did before I quit reading them all), 'cause either the editors he's copying from use Macs with the setting* to "replace ??? with "real glyphs,"" or it's on his own WP setup. In both Win Word and Mac Word, when you do this, the "glyph" is taken from a local "symbol" char map that may have arbitrary char numbers substituted for the "real" Unicode char nums, so it's impossible to generalize on proper replacements. Win Word users seem to have mostly learned to not do that (or never learned how), but it still pops up frequently from Mac users (especially some web editors). An OCR usually will be unable to "read" these.

* This "setting description" is a pseudonym for "really f**k up your page layout people and p*ss off your printers," but it's still pushed by Microsoft, and if you're using a "short font" you'll probably never realize you're doing it.

John



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: The Villan
Date: 27 Jan 09 - 04:20 AM

The best program I ever bought for this sort of job was ScanSoft TextBridge Pro Millennium. It is one of the best. I still use it today on Windows Vista.
It was worth the money. The program that came with my scanner was absolutely crap.



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Richard Bridge
Date: 27 Jan 09 - 06:10 AM

I always liked Omnipage best.

Office 2000 does not have OCR.



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 27 Jan 09 - 06:18 AM

TVil -

I've gotten "OmniPage Lite" as bundled software on a couple of computers, but found TextBridge more accurate "way back when." They were two different companies back then.

Vista doesn't really seem to agree with my old TextBridge 9, so I've considered an update, but it seems as though someone called "Nuance" has bought up both TextBridge and OmniPage.

I'm not finding much about just who this "nuance" is, or what their full line of stuff can do. Their web sites look more like a "marketing org" than like a product maker.

It seems like just a few weeks ago that I was getting all kinds of ads from the original TextBridge builders, but thinking back it seems my version probably is about ? or ?? years old. (My .exe file shows a "created 1999," so I must have gotten it when I was running Win98(?); but as far as I can tell there's only been one newer version released.)

The newer TextBridge Pro 11 is still a little under $100, but the OmniPage Pro 16 is $500 - apparently from the same company. And I don't think I've seen a legitimate "product comparison" review on OCR programs since just before I bought my current one.

John



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: The Villan
Date: 27 Jan 09 - 06:42 AM

John
Have a look at this review.

It would seem that the newest version of Textbridge is not as good as the older ones.

http://www.pcpro.co.uk/reviews/32724/scansoft-textbridge-pro-11.html

Les



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Richard Bridge
Date: 27 Jan 09 - 07:41 AM

That's a 2002 review!



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: The Villan
Date: 27 Jan 09 - 09:59 AM

Oh sorry. So what is the latest TextBridge version then?



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 27 Jan 09 - 03:19 PM

As noted, that's a rather old review. I found "November" but no year, so I'll take RB's 2002. It does not show compatibility with Vista, which is apparently my problem with TextBridge 9.

ScanSoft is no longer ScanSoft, and is calling itself Nuance.

So far as I've been able to find, TextBridge Pro 11 is the current version, and is $89 (US). This would indicate that there has been no update to TextBridge since at least 2002(?). The advertising I can find at the Nuance website does not indicate, so far as I've found, what OS versions it's compatible with.

Nuance is also offering OmniPage Pro 16 at $499 (US), but in the modern advertising mode it says "it's wonderful" - "you'll love it" - "you should buy it now" - and "your teeth will rot and your hair will fall out if you don't get it immediately" - but it neglects to actually say what it does in any meaningful way.

ABBYY FineReader 9.0 is shown as $399 (US) in a May 2008 review at PC Magazine. (This article appeared in today's Google results, but wasn't in results with the same search term two days ago.)

The link goes to the review header for OmniPage, but the "Next" button goes to the header for ABBYY FineReader. A sidebar on the left should take you to the start of the "23 utilities" full article.

John



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: The Villan
Date: 27 Jan 09 - 03:25 PM

Well, I am using TextBridge Pro Millennium with Vista. Works great.



Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 01 Feb 09 - 08:58 PM

TextBridge Pro Millennium apparently was an intermediate issue between the Pro 9 that I have, which "works sort of" with Vista but not as well as with XP, and the current Pro 11 that's apparently still available but not well regarded by users. It appears that Pro 11 is about the only thing that can be acquired new now unless you resort to eBay.

Having looked a little at Microsoft Office's OCR, it appears to be at least as accurate as my old Textbridge, and perhaps even a bit better; but it definitely lacks some ease of use features.

Joe O may help me out here, but the Office version appears to require that you load pages or images one at a time, so a long document could take quite a bit of fumbling about to convert. Textbridge (like most other OCR programs I've seen) allows you to load a sequence of scans/images and convert them all at once. Textbridge (my old version) often makes a horrible mess of patching them together, but it gives you all the text in one document, all in one shot, to do the cleanup.
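
If you end up with one text file per page, stitching them together in page order is easy to script. This Python sketch assumes hypothetical file names like page1.txt, page2.txt ... page10.txt sitting in one folder; the naming pattern and the function names are inventions for the example, not anything the Office tools produce on their own. The numeric sort matters because a plain alphabetical sort would put page10 ahead of page2.

```python
import re
from pathlib import Path


def page_number(path: Path) -> int:
    """Pull the first run of digits out of a file name: page10.txt -> 10."""
    match = re.search(r"\d+", path.stem)
    return int(match.group()) if match else 0


def merge_pages(folder: str, pattern: str = "page*.txt") -> str:
    """Join per-page text files into one string, in numeric page order."""
    pages = sorted(Path(folder).glob(pattern), key=page_number)
    texts = [p.read_text(encoding="utf-8").strip() for p in pages]
    # A blank line between pages makes the joins easy to spot in cleanup.
    return "\n\n".join(texts) + "\n"


if __name__ == "__main__":
    print(merge_pages("."))
```

The cleanup still has to be done by hand afterwards, but at least it is all in one document.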

The Office OCR seems to work fairly well with scanned-in images, but if you want to OCR an existing image file it apparently accepts only TIFF files. I've avoided using TIFs for archiving due to the inconsistent results with them when you have multiple programs and sources (there are at least a half dozen different fairly common TIF formats, and several popular programs use proprietary "extensions" that can't be read by other programs: e.g. old Adobe Illustrator .tif files that I have can't be opened in several other Adobe graphics programs).

With Photoshop Elements, I can fairly quickly convert a folder of images to .tif using the batch processor, and the Office OCR accepts the tifs; but that means having - at least temporarily - another set of duplicates of the same images.

As noted, this is from a pretty quick look at it, but the Office package is definitely "usable" if you don't have something better; and if you have Office it's a no-cost option.

Also to be noted: there is NOTHING in Vista or Office 2007 "Help" files that I can find that even suggests that this utility exists, much less tells you what it's for - but then there's not much of anything else helpful in either of those places as well.

John



All original material is copyright © 1998 by the Mudcat Café Music Foundation, Inc. All photos, music, images, etc. are copyright © by their rightful owners. Every effort is taken to attribute appropriate copyright to images, content, music, etc. We are not a copyright resource.