mudcat.org: Tech: OCR Tips - Optical Character Recognition

Subject: Tech: OCR Tips - Optical Character Recognition
From: Joe Offer
Date: 26 Jan 09 - 07:47 PM

Somebody has a nice songbook in PDF format, and wanted to post a number of songs from the book here at Mudcat. Since the book was PDF, the member said that it couldn't be copy-pasted here.
Well, it can't be posted here as an image, but it's relatively easy to OCR just about anything, and post it here.

The tools I like best for OCR are Microsoft Office Document Imaging and MS Office Document Scanning. They come with Word and Office, but they are not installed by a default installation. Go to Control Panel/Programs and select "uninstall a program." A list of programs installed on your computer will appear on your screen. Highlight Microsoft Office, and an option to "change" the program will appear. Select "Office Tools" and choose to run MS Office Document Imaging from your computer. Office will then install the program, and you'll find it under "Office Tools" on your list of programs.

MS Office Document Scanning operates your scanner and takes an image of the document. Then you can highlight the portion you want to copy, right-click, and select "copy" to get OCR text that you can paste into a Mudcat box or word processing document.

Here's a way to OCR PDF files - Adobe Reader allows you to highlight and copy an image from a PDF document. You can take that copied image and paste it into Microsoft Office Document Imaging (mspview.exe) - go to page/paste page. That program does a pretty good job of OCR, and then you can just paste the text into a Mudcat message box (or Word document) and edit it. Sometimes, it comes out with no OCR errors at all (but that's rare).

-Joe-

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Jack Campin
Date: 26 Jan 09 - 08:39 PM

Some PDFs encapsulate images and need to be scanned to recover the text - others are effectively styled text with the fonts included, and for those the full version of Acrobat will extract the text (unless it's been password-protected somehow). If you can do the latter it'll be a lot quicker and more accurate.

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 26 Jan 09 - 11:19 PM

A PDF that was made from a text document usually preserves the letter characters so you can copy text from the PDF to another kind of document or other file.

If you can select the text in a PDF, the simplest method is to select and copy the text and paste into Word. Pasting into Word first is recommended, since editing and cleanup is easier there.

If you need line-to-line alignment – as for chord symbols over text – put the text into a monospaced font like Courier, adjust the spacing, then put a <pre> at the beginning and </pre> at the end, and when you copy from Word and paste to the 'cat the spacing should be preserved. A <pre> tag should always display in the reader's browser in a monospaced font, but not necessarily the same one you use to set it up. (But always preview before posting, please.)
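
For anyone who prefers to script the wrapping step, here is a minimal Python sketch. The function name `wrap_pre` and the 8-column tab width are assumptions for illustration, not anything Mudcat or Word provides. It turns tabs into spaces (tabs rarely line up the same way twice), escapes characters that HTML would misread, and adds the <pre> tags.

```python
import html

def wrap_pre(text: str, tab_width: int = 8) -> str:
    """Wrap chord-over-lyric text in <pre> tags so column alignment survives posting."""
    # Tabs render at unpredictable widths, so convert them to spaces first.
    lines = [line.expandtabs(tab_width).rstrip() for line in text.splitlines()]
    # Escape &, < and > so the song text itself is not read as HTML.
    body = html.escape("\n".join(lines), quote=False)
    return "<pre>\n" + body + "\n</pre>"

if __name__ == "__main__":
    song = "G       C       G\nAmazing grace, how sweet the sound"
    print(wrap_pre(song))
```

The advice about previewing before posting still applies, since the browser chooses the monospaced font.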

A PDF can be set by the creator (if said creator has the high-dollar version from Adobe) to prevent copying. There reportedly are methods to "work around" the copy lockout, but they've varied with the version of the PDF maker, and aren't consistently reliable. If you run into one of these, go to the next step.

Nearly all PDFs can be printed. (There's a lockout for that too, but it's seldom used.) A print can, as Joe points out, be scanned to get an image file, and the image file then run through OCR to get text. (The method Joe cites appears to take a "screen scan" of the PDF page, if it's an image and not individual characters.)

Some PDFs are just bundles of images, usually with one page-image per PDF page. Photoshop Elements will extract images (but not text) from a PDF, so if you run into one of these you can extract an image of each page, and run those images through OCR. (You can also "do it the hard way" and just select one page at a time as a picture and paste each page into an image editor to make page-images, if the PDF hasn't been blocked for copying.)

Note that the Office Tools Joe refers to are, so far as I know, only available in Office 2007. Office 2007 is the standard for Vista, and will run on WinXP SP2 or later. I haven't seen any blather about it being usable on anything earlier than WinXP, but it might run on Win2K.

The features cited are based on a new "universal document" format (called Open XML) that came into being with Office 2007 and is claimed to be at least partly "interoperable with PDF."

Those who don't have Office 2007 might still find an OCR program quite useful. The leading commercial products that I've heard of are OmniPage and TextBridge. Both have "lite" versions available for a little under $100 (or did the last time I looked), but a full-blown program will run closer to $500 if you want truly impressive performance. There are freeware programs available, but the reviews have mostly said that they lag the "useful state of the art" by five or ten years.

For OCR, using the separate programs I've looked at, it often is best to avoid having the "best possible images." For ordinary text, a scan at 150 dpi may give better results than a "photo scan" at 300 dpi or higher, and monochrome usually works a little better than color. Results generally are so variable that the only way to know is to try out a program on a variety of source images.
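
To put rough numbers on those scan settings, here is a small Python sketch. It only shows how much data each setting produces, not how accurate the OCR will be. The function name `scan_size` and the US letter page size are assumptions for illustration, and the byte counts are approximate because real image files add headers and compression.

```python
def scan_size(width_in: float, height_in: float, dpi: int, bits_per_pixel: int):
    """Return (pixels_wide, pixels_high, approximate_uncompressed_bytes) for a scan."""
    w = round(width_in * dpi)
    h = round(height_in * dpi)
    return w, h, w * h * bits_per_pixel // 8

# US letter page: 1-bit monochrome at 150 dpi versus 24-bit color at 300 dpi
print(scan_size(8.5, 11, 150, 1))
print(scan_size(8.5, 11, 300, 24))
```

The color "photo scan" carries roughly a hundred times more raw data than the 150 dpi monochrome scan of the same page.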

Expect to proofread carefully, and correct the typos before posting, for any OCR-produced text.

John

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Joe Offer
Date: 27 Jan 09 - 12:06 AM

Microsoft Office Document Imaging and MS Office Document Scanning were part of Office 2003, as well as Office 2007. I don't know about earlier versions. They're very simple tools, but they work very easily and very well.

Most online copies of old songbooks seem to be in PDF format - scans of full pages, and not taken from text. Therefore, you have to use OCR to extract the text.

-Joe-

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Amos
Date: 27 Jan 09 - 12:42 AM

In the full version of Adobe Acrobat, under the Document menu, choose "convert using OCR" and select the option for the whole document.

You will still need to copy (in some PDFs one page at a time) and paste as John describes above. The need for careful proofing is because OCR is only 98-99% accurate at best. Even if you get lucky there will be issues of line breaks and peculiar treatment of hyphens.
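
A quick bit of arithmetic shows why 98-99% still means real proofreading. This is a sketch only: the function name `expected_errors` and the 2,000-character page are assumed figures for illustration.

```python
def expected_errors(char_count: int, accuracy: float) -> int:
    """Estimate how many characters come out wrong at a given per-character accuracy."""
    return round(char_count * (1.0 - accuracy))

# 2,000 characters is an assumed size for one page of song lyrics
for acc in (0.98, 0.99):
    print(f"{acc:.0%} accurate: about {expected_errors(2000, acc)} wrong characters")
```

Even at the top of that range, a page of that size would have around twenty characters to find and fix.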

A

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 27 Jan 09 - 01:36 AM

Joe - I went straight from Office 2002 to Word 2007 on WinXP, when the XP updates crippled 2002, so I guess I missed the Tools in Office 2003. I'll have to take a look at LiK's setup and see if she can be trained to use it instead of passing it all to me to OCR.

In lots of OCR conversions you get bunches of "soft breaks" that look like a bar with a hook on the end. In Word, you can often use "replace ^011" and "replace with nothing" (delete everything in the "replace with" box), then "Replace All."

The same thing often works in pastes from some web pages.

If that doesn't work, you can try:

"Replace ^l" for "line breaks" (that's a lower case L, not a numeral 1). Replace with a blank, a space, or with ^p if you want a "real" paragraph end.

"Replace ^n" for "column breaks"
"Replace ^-" for "soft hyphens" (Word calls them optional hyphens) if any show up in the paste

Usually stuff that you get from a source that you use often is worth figuring out which "breaks" crop up frequently, so that you can use a global replace to clean them up all at once. Word's "replace" function is very versatile which should make you very happy to use it, but is also very powerful which can elicit an occasional "OH SH*T!" if you're not careful. (Undo works even on large blunders, if you catch them in time.)
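
For text that never goes through Word, the same global cleanup can be sketched in Python. The function name `clean_breaks` is made up for illustration, and the mapping of Word's ^l and ^n codes to the control characters below is an assumption based on how Word usually stores them; check a sample of your own pasted text before trusting it.

```python
import re

# Characters that can survive a paste from Word, a PDF, or a web page.
MANUAL_LINE_BREAK = "\x0b"  # vertical tab; assumed to match Word's ^l / ^011
COLUMN_BREAK = "\x0e"       # assumed to match Word's ^n
SOFT_HYPHEN = "\u00ad"      # Unicode soft hyphen, invisible in most editors

def clean_breaks(text: str, join_with: str = " ") -> str:
    """Remove soft hyphens and replace soft breaks left by OCR or a web paste."""
    text = text.replace(SOFT_HYPHEN, "")
    text = text.replace(MANUAL_LINE_BREAK, join_with)
    text = text.replace(COLUMN_BREAK, join_with)
    # Collapse any runs of spaces the replacements created.
    return re.sub(r" {2,}", " ", text)

if __name__ == "__main__":
    pasted = "Ama\u00adzing\x0bgrace  how\x0esweet"
    print(clean_breaks(pasted))
```

For song lyrics, passing `join_with="\n"` keeps each soft break as a real line end instead of running the lines together.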

It's usually easiest, after OCR, to just scan visually for the recognition errors, and some OCR programs mark the "questionables" quite prominently; but the Word spell checker often can help pick up the ones that don't stand out on visual scan and the ones the OCR thought it got but didn't. Decide for yourself whether to spell check first or last.
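
The spell-check idea can be sketched in a few lines of Python. This is only an illustration: the function name `flag_suspects` is made up, and the word list is something you would have to supply yourself (a real dictionary file, for instance). It flags words that are not in the list, plus words with digits mixed into the letters, which is a common OCR slip.

```python
import re

def flag_suspects(text: str, known_words: set[str]) -> list[str]:
    """Return tokens worth a second look: unknown words, or letters mixed with digits."""
    suspects = []
    for token in re.findall(r"[A-Za-z0-9']+", text):
        has_digit = any(c.isdigit() for c in token)
        has_alpha = any(c.isalpha() for c in token)
        if (has_digit and has_alpha) or token.lower() not in known_words:
            suspects.append(token)
    return suspects

if __name__ == "__main__":
    known = {"the", "water", "is", "wide"}
    print(flag_suspects("Tbe water is w1de", known))
```

A check like this cannot catch a misread that happens to produce a real word, so it supplements the visual scan rather than replacing it.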

Amos pastes a lot of funny characters (or did before I quit reading them all), 'cause either the editors he's copying from use Macs with the setting^* to "replace ??? with "real glyphs,"" or it's on his own WP setup. In both Win Word and Mac Word, when you do this, the "glyph" is taken from a local "symbol" char map that may have arbitrary char numbers substituted for the "real" Unicode char nums, so it's impossible to generalize on proper replacements. Win Word users seem to have mostly learned to not do that (or never learned how), but it still pops up frequently from Mac users (especially some web editors). An OCR usually will be unable to "read" these.

^* This "setting description" is a pseudonym for "really f**k up your page layout people and p*ss off your printers," but it's still pushed by Microsoft, and if you're using a "short font" you'll probably never realize you're doing it.

John

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Rasener
Date: 27 Jan 09 - 04:20 AM

The best program I ever bought for this sort of job was Scansoft - Textbridge Pro Millenium. It is one of the best. I still use it today on Windows Vista.
It was worth the money. The program that came with my scanner was absolutely crap.

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Richard Bridge
Date: 27 Jan 09 - 06:10 AM

I always liked Omnipage best.

Office 2000 does not have OCR.

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 27 Jan 09 - 06:18 AM

TVil -

I've gotten "OmniPage Lite" as bundled software on a couple of computers, but found TextBridge more accurate "way back when." They were two different companies back then.

Vista doesn't really seem to agree with my old TextBridge 9, so I've considered an update, but it seems as though someone called "Nuance" has bought up both TextBridge and OmniPage.

I'm not finding much about just who this "nuance" is, or what their full line of stuff can do. Their web sites look more like a "marketing org" than like a product maker.

It seems like just a few weeks ago that I was getting all kinds of ads from the original TextBridge builders, but thinking back it seems my version probably is about ? or ?? years old. (My .exe file shows a "created 1999," so I must have gotten it when I was running Win98(?); but as far as I can tell there's only been one newer version released.)

The newer TextBridge Pro 11 is still a little under $100, but the OmniPage Pro 16 is $500 - apparently from the same company. And I don't think I've seen a legitimate "product comparison" review on OCR programs since just before I bought my current one.

John

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Rasener
Date: 27 Jan 09 - 06:42 AM

John
Have a look at this review.

It would seem that the newest version of Textbridge is not as good as the older ones.

http://www.pcpro.co.uk/reviews/32724/scansoft-textbridge-pro-11.html

Les

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Richard Bridge
Date: 27 Jan 09 - 07:41 AM

That's a 2002 review!

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Rasener
Date: 27 Jan 09 - 09:59 AM

Oh sorry. So what is the latest textbridge version then?

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 27 Jan 09 - 03:19 PM

As noted, that's a rather old review. I found "November" but no year, so I'll take RB's 2002. It does not show compatibility with Vista, which is apparently my problem with TextBridge 9.

ScanSoft is no longer ScanSoft, and is calling itself Nuance.

So far as I've been able to find, TextBridge Pro 11 is the current version, and is $89 (US). This would indicate that there has been no update to TextBridge since at least 2002(?). The advertising I can find at the Nuance website does not indicate, so far as I've found, what OS versions it's compatible with.

Nuance is also offering OmniPage Pro 16 at $499 (US), but in the modern advertising mode it says "it's wonderful" - "you'll love it" - "you should buy it now" - and "your teeth will rot and your hair will fall out if you don't get it immediately" - but it neglects to actually say what it does in any meaningful way.

ABBYY FineReader 9.0 is shown as $399 (US) in a May 2008 review at PC Magazine. (This article appeared in today's Google results, but wasn't in results with the same search term two days ago.)

The link goes to the review header for OmniPage, but the "Next" button goes to the header for ABBYY FineReader. A sidebar on the left should take you to the start of the "23 utilities" full article.

John

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Rasener
Date: 27 Jan 09 - 03:25 PM

Well I am using Textbridge Pro Millenium with Vista. Works great

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: JohnInKansas
Date: 01 Feb 09 - 08:58 PM

Textbridge Pro Millenium apparently was an intermediate issue between the Pro 9 that I have, which "works sort of" with Vista but not as well as with XP, and the current Pro 11 that's apparently still available but not well regarded by users. It appears that Pro 11 is about the only thing that can be acquired new now unless you resort to eBay.

Having looked a little at Microsoft Office's OCR, it appears to be at least as accurate as my old Textbridge, and perhaps even a bit better; but it definitely lacks some ease of use features.

Joe O may help me out here, but the Office version appears to require that you load pages or images one at a time, so a long document could take quite a bit of fumbling about to convert. Textbridge (like most other OCR programs I've seen) allows you to load a sequence of scans/images and convert them all at once. Textbridge (my old version) often makes a horrible mess of patching them together, but it gives you all the text in one document, all in one shot, to do the cleanup.

The Office OCR seems to work fairly well with scanned-in images, but if you want to OCR an existing image file it apparently accepts only TIFF files. I've avoided using TIFs for archiving due to the inconsistent results with them when you have multiple programs and sources (there are at least a half dozen different fairly common TIF formats, and several popular programs use proprietary "extensions" that can't be read by other programs: e.g. old Adobe Illustrator .tif files that I have can't be opened in several other Adobe graphics programs).

With Photoshop Elements, I can fairly quickly convert a folder of images to .tif using the batch processor, and the Office OCR accepts the tifs; but that means having - at least temporarily - another set of duplicates of the same images.

As noted, this is from a pretty quick look at it, but the Office package is definitely "usable" if you don't have something better; and if you have Office it's a no-cost option.

Also to be noted: there is NOTHING in Vista or Office 2007 "Help" files that I can find that even suggests that this utility exists, much less tells you what it's for - but then there's not much of anything else helpful in either of those places as well.

John

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Joe Offer
Date: 18 Jan 22 - 03:08 PM

I started this thread in 2009. I don't know what version of Windows I was using then, but now I'm on Windows 11. In 2009, my favorite OCR software was Microsoft Office Document Scanning. It was simple and quick. But then, like all good software, it disappeared. I used something called FreeOCR on my last computer, and it wasn't bad. But it wasn't installed on my newest computer, so I decided to go looking and found that Microsoft OneNote does OCR. I've had OneNote on my computer for a long time, and never found anything I wanted to do with it. But I decided to try it. I took a screenshot of a page from Google Books, and then saved it as a JPG and cropped it down to just the part I wanted. Then I pasted that into OneNote, right-clicked, and selected to "extract text" from the image. Then I pasted the extracted text into a Mudcat message window, corrected the text, and came out with good results.
This technique has interesting possibilities.
More later as I refine this process.

-Joe-

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Stilly River Sage
Date: 18 Jan 22 - 03:29 PM

I missed this thread the first time around. At the university library I was working with scanned historic documents on pages showing their age (browned, foxing, spots, etc.) and old-style fonts. Oftentimes you can see text bleeding through somewhat from the other side. Accuracy is important for transcription and research, so we used a couple of commercial programs. The one we were using when I retired, which is also the one I have here (I haven't used it for a while myself - I need to install it on this computer), is OmniPage. There is a learning curve, but it allows you to bring up the text line by line and type corrections over any mistaken or blurred text that OmniPage got wrong.

In this setting, you are pretty much reading the whole document, or at least doing a slower-than-usual skim, and where you find the program has flagged issues, reading line by line to fix them. (I read a lot of speeches given for or against the 1846 Wilmot Proviso - arguing that new states shouldn't have slavery, and a lot of correspondence from during the US-Mexico War.)

With modern fonts and good quality paper, it sounds like the technique you've developed worked. I tested the transcriptions by searching for various words from smudged parts of the document to see if the transcription was correct. (In these files the scanned text is overlaid so a search takes you to that part of the document, and what you see is the original document.)

I have a copy of it because I have lots of newsprint clippings from my grandfather that I still need to scan. They fit the description of the historic documents I listed above.

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: cnd
Date: 18 Jan 22 - 03:59 PM

I've had mixed success using www.onlineocr.net -- it's free and technically has a cap on how many images it can do per hour, but I've only hit that limit once, when trying a new method where I broke the OCR into dozens of smaller pictures.

In general, it does a passable job, but can be really hit-or-miss. Anything with marginal patterns or pictures in the background is a no-go, and like I hinted at in the beginning, longer pages tend to lower its accuracy. If the surface isn't glossy, it works much better on pictures taken using flash. I like to get a setup with an overhead light, a lamp, and natural light if possible, and have two or three (or more) light sources to help illuminate well and minimize shadows and glare.

When I say it can be hit or miss, I mean that it does a fine job 80% of the time, but sometimes it will get a text totally wrong -- as in, not a single legible word. This can be the case even when every other image from the set renders fine. But this isn't a very common issue and I think tends to be more likely if the image I'm trying to OCR isn't ideal.

As a side note, I've discovered recently that Google Drive auto-OCRs images and non-searchable text documents (i.e. scanned PDFs). This is great for finding things, but you have to format your stuff correctly. It will only tell you the document the word came up in, but not which page -- as a result, every document I wanted searchable I broke up into single-page TIFFs or PDFs, and then I just have to read the one page rather than a large document.

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Bonzo3legs
Date: 18 Jan 22 - 04:31 PM

ABBYY FineReader 15 is very good; I use it with a 2006 Epson SX printer/scanner to unscramble bank statements into Excel.

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Felipa
Date: 18 Jan 22 - 04:57 PM

Serendipity: I've been looking at digital texts from the National Library of Scotland. Viewers can see both images of the actual book or document pages, and optical recognition texts. There are quite a few errors in the latter, so I would advise anyone copying the texts to refer back to the originals and make corrections. I've had to do this for some lyrics I've shared on Mudcat.
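
Once a corrected copy exists, the differences from the OCR text can be listed with Python's standard `difflib` module. This is a minimal sketch; the function name `ocr_differences` and the sample lines are made up for illustration. It compares word by word and reports only the places where the two versions disagree.

```python
import difflib

def ocr_differences(ocr_text: str, corrected_text: str) -> list[tuple[str, str]]:
    """List (ocr_words, corrected_words) pairs where the two versions disagree."""
    a, b = ocr_text.split(), corrected_text.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return diffs

if __name__ == "__main__":
    for wrong, right in ocr_differences("Tbe water is wlde", "The water is wide"):
        print(f"{wrong!r} -> {right!r}")
```

A list like this also shows which misreadings keep recurring, which helps when deciding what to search for in the next text from the same source.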

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Jon Freeman
Date: 20 Jan 22 - 11:06 AM

It's not something I've really needed, but Tesseract, which has Google as the main developer, seems to be one of the main open source ones.

Following on from the comments at the bottom of the Wikipedia article, The Internet Archive now seem to be using it in preference to proprietary offerings.
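
For anyone who wants to try it from a script, here is a minimal Python sketch that calls the tesseract command-line program once per page image and joins the results into one text, which also covers the earlier point about converting a sequence of scans in one shot. It assumes tesseract is installed and on the PATH. The function names `tesseract_ocr` and `ocr_pages` are made up for illustration.

```python
import subprocess
from typing import Callable, Iterable

def tesseract_ocr(image_path: str) -> str:
    """OCR one image by calling the tesseract command-line program."""
    # Passing "stdout" as the output base makes tesseract print the text
    # instead of writing a .txt file.
    result = subprocess.run(
        ["tesseract", image_path, "stdout"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout

def ocr_pages(image_paths: Iterable[str],
              ocr: Callable[[str], str] = tesseract_ocr) -> str:
    """OCR a sequence of page images and return one combined text."""
    pages = [ocr(path).strip() for path in image_paths]
    # A blank line between pages makes the joins easy to find when proofreading.
    return "\n\n".join(pages)

# Example call (hypothetical file names, substitute your own scans):
#   text = ocr_pages(["page1.png", "page2.png"])
```

The `ocr` parameter is there so a different engine can be swapped in without changing the page-joining step. The output still needs the same careful proofreading as any other OCR text.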

Subject: RE: Tech: OCR Tips - Optical Character Recognition
From: Stilly River Sage
Date: 20 Jan 22 - 12:10 PM

ABBYY FineReader and OmniPage are the two commercial OCR programs we compared at the university library, and we settled on OmniPage. The condition of scanned texts that Felipa describes is WHY we used the high-end software and WHY a human had to go through and correct the text. You can let Adobe make an image of a document searchable, but what you find there is entirely dependent on how new and clear the text is in the TIF (we scanned TIF images, then saved the result into searchable PDFs).

Historic documents need more attention, and the thing about OmniPage is that it learns along the way on a long document and gets more of it right the more you use it on a particular text/font/document.

All original material is copyright © 2022 by the Mudcat Café Music Foundation. All photos, music, images, etc. are copyright © by their rightful owners. Every effort is taken to attribute appropriate copyright to images, content, music, etc. We are not a copyright resource.