The Mudcat Café



BS: Digitizing text: a distributed project







Subject: BS: Digitizing text: a distributed project
From: Desert Dancer
Date: 18 Aug 08 - 04:26 PM

Some folks really do have bright ideas.

~ Becky in Tucson

http://sciencenow.sciencemag.org/cgi/content/full/2008/814/1

Digitizing Old Text and Fighting Spam, Too

By Phil Berardelli
ScienceNOW Daily News
12 August 2008

The next time a Web site asks you to read a string of crooked letters as a security precaution, don't grimace. You could be helping to digitize a deteriorating historical document. A team of computer scientists has taken a common Internet tool for screening out spam and adapted it to help convert text from old books and manuscripts into electronic files. The effort might not put professional transcribers out of business, but it could cut the cost of creating digital libraries.

In the battle between Web security designers and spammers, programs called Completely Automated Public Turing test to tell Computers and Humans Apart (CAPTCHA) have proven an effective foil. The programs require online users to read a distorted word or line of text and retype it in a designated box--something that few optical scanners or digital-text readers can do. Insidious programs deployed by spammers can penetrate sites such as Gmail and lift their e-mail address lists. CAPTCHAs block the attempt by requiring an extra step before providing access. They are used online about 200 million times every day.

Computer scientist Luis von Ahn of Carnegie Mellon University in Pittsburgh, Pennsylvania, and colleagues thought all that effort could be put to another use, too. "Since each [CAPTCHA] takes about 10 seconds of human time," von Ahn says, "we figured humanity as a whole was wasting about 500,000 hours every day typing." And that much time constituted a valuable resource in efforts to digitize old books with deteriorating pages and faded text.
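
As a rough check of that estimate, here is a minimal sketch that uses only the two figures quoted above (about 200 million CAPTCHAs a day, about 10 seconds each). The function and constant names are made up for illustration.

```python
# Back-of-the-envelope check using only the figures quoted in the article.
CAPTCHAS_PER_DAY = 200_000_000   # "about 200 million times every day"
SECONDS_PER_CAPTCHA = 10         # "about 10 seconds of human time"


def hours_spent_per_day(captchas: int, seconds_each: int) -> float:
    """Total human time spent per day, converted from seconds to hours."""
    return captchas * seconds_each / 3600


hours = hours_spent_per_day(CAPTCHAS_PER_DAY, SECONDS_PER_CAPTCHA)
print(f"{hours:,.0f} hours per day")  # prints "555,556 hours per day"
```

That comes out a little above the "about 500,000 hours" in the quote, which fits with both inputs being rounded estimates.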

The team developed a new program, called reCAPTCHA, which collects words flagged as unreadable by optical scanners as they digitize texts. Those words, in the form of computer optical scans, are then sent to cooperating Web sites and used in place of random CAPTCHAs. The software presents one optically unreadable word and one "control" CAPTCHA word. Getting the control word right identifies the user as a human, and the program records his or her response to the unreadable word and adds it to a database. To improve accuracy even further, reCAPTCHA sends the most difficult words to multiple users and selects the consensus response as correct. This process can peg more than 99% of words accurately, the team reports online today in Science.
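
The control-word check and the consensus step described above might be sketched roughly as follows. This is only an illustration, not the reCAPTCHA implementation: the article does not say how consensus is computed, so the vote thresholds, the case-insensitive matching, and every name here are assumptions.

```python
from collections import Counter


def is_human(control_answer: str, typed_control: str) -> bool:
    """The user passes if the known control word is typed correctly.
    Case-insensitive matching is an assumption, not from the article."""
    return typed_control.strip().lower() == control_answer.strip().lower()


def record_response(votes: dict, word_id: str, control_answer: str,
                    typed_control: str, typed_unknown: str) -> bool:
    """Store the guess for the unreadable word only if the control word was right."""
    if not is_human(control_answer, typed_control):
        return False
    votes.setdefault(word_id, Counter())[typed_unknown.strip().lower()] += 1
    return True


def consensus(votes: dict, word_id: str,
              min_votes: int = 3, min_share: float = 0.6):
    """Return the agreed reading, or None if there is no consensus yet.
    min_votes and min_share are hypothetical thresholds."""
    counts = votes.get(word_id)
    if not counts:
        return None
    total = sum(counts.values())
    answer, top = counts.most_common(1)[0]
    if total >= min_votes and top / total >= min_share:
        return answer
    return None


votes = {}
record_response(votes, "scan-001", "morning", "morning", "upon")
record_response(votes, "scan-001", "morning", "morning", "upon")
record_response(votes, "scan-001", "morning", "mornin", "up0n")   # fails the control word, ignored
record_response(votes, "scan-001", "morning", "Morning", "upon")
print(consensus(votes, "scan-001"))  # prints "upon"
```

The point of the design is that the control word does the security work, so a guess at the unreadable word is only trusted once the same user has shown they can read a word whose answer is already known.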

The reCAPTCHA system now automatically collects about 4 million responses every day from 40,000 Web sites, the equivalent of 1500 people working full-time and transcribing 60 words per minute, von Ahn says. The service, available at www.recaptcha.net, is free to any Web site that requests it. After a year of operation, reCAPTCHA has helped resolve about 440 million words for client users that are digitizing newspaper and document archives; von Ahn says his team just completed the entire 1908 archive from The New York Times, for example.

Information scientist Paul Kantor of Rutgers University in New Brunswick calls reCAPTCHA "an ingenious idea." It creates the opportunity to sell the labor of third parties to interested customers, he says, all at the cost of "at most, a fraction of a second more" to generate the benefit--"wow!" And cryptographer Josh Benaloh of Microsoft Research in Redmond, Washington, says the approach "is simple, brilliant, and makes people who hear about the idea smile while asking themselves, 'Why didn't I think of this?' "



Subject: RE: BS: Digitizing text: a distributed project
From: Desert Dancer
Date: 18 Aug 08 - 04:38 PM

Here's the reCAPTCHA web site: www.recaptcha.net

They say: "Currently, we are helping to digitize books from the Internet Archive and old editions of the New York Times."

~ Becky in Tucson



Subject: RE: BS: Digitizing text: a distributed project
From: Bill D
Date: 18 Aug 08 - 06:36 PM

Wow....and I say again...wow.

That is beyond ingenious. It is..ummmmm....clever!

So a million monkeys CAN eventually create Shakespeare...and Balzac, too!



Subject: RE: BS: Digitizing text: a distributed project
From: Desert Dancer
Date: 18 Aug 08 - 07:26 PM

Or at least bits of them!

~ B in T



Subject: RE: BS: Digitizing text: a distributed project
From: Desert Dancer
Date: 19 Aug 08 - 04:38 PM

justanotherday



Subject: RE: BS: Digitizing text: a distributed project
From: Stilly River Sage
Date: 20 Aug 08 - 10:13 AM

This is great! Thank you! I'll pass it along in my library circles. . .



Subject: RE: BS: Digitizing text: a distributed project
From: katlaughing
Date: 20 Aug 08 - 10:19 AM

How ingenious! Thanks!



Subject: RE: BS: Digitizing text: a distributed project
From: GUEST,petr
Date: 20 Aug 08 - 11:47 AM

A great idea, except that the security bit against bots also makes it impossible for blind people to use that site.

There is a similar attempt at www.gwap.com (games with a purpose). The idea is that there are a lot of photos and, say, bits of music on the internet that are not easily searchable, since there is no other info attached to the pics or sound files.
So in gwap you compete with an unknown partner to describe a picture, and all the words that are in common are then associated with the picture.
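
The matching step can be pictured as a set intersection. A minimal sketch, assuming each player's guesses arrive as a list of words and that matching ignores case and stray spaces; the function names are hypothetical and this is not gwap's actual code.

```python
def normalize(words: list[str]) -> set[str]:
    """Lower-case and trim each word, dropping empty entries."""
    return {w.strip().lower() for w in words if w.strip()}


def agreed_labels(player_a: list[str], player_b: list[str]) -> set[str]:
    """Words that both players typed for the same picture."""
    return normalize(player_a) & normalize(player_b)


labels = agreed_labels(["Dog", "grass", "ball"], ["dog", "park", "ball "])
print(sorted(labels))  # prints ['ball', 'dog']
```

Only the words both players typed independently get attached to the picture, which is what filters out one person's odd or careless guesses.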



Subject: RE: BS: Digitizing text: a distributed project
From: Desert Dancer
Date: 20 Aug 08 - 02:07 PM

Web accessibility sure is an important issue, GUEST,petr. Fortunately, I've noticed that some sites that use CAPTCHA have an alternative process available for the visually impaired; I hope most or all of them do.

~ Becky in Tucson



Subject: RE: BS: Digitizing text: a distributed project
From: BK Lick
Date: 12 Sep 09 - 03:05 AM

The reCAPTCHA folks have now added a nifty Mailhide service. It hides my email address thusly:
curm...@rudegnu.com (click on the embedded ...).



All original material is copyright © 2022 by the Mudcat Café Music Foundation. All photos, music, images, etc. are copyright © by their rightful owners. Every effort is taken to attribute appropriate copyright to images, content, music, etc. We are not a copyright resource.