Monday, March 28, 2011

Captchas' Real Purpose!

"Back in the day," to attend an event you had to make phone calls, write letters, or show up at the venue and hope for tickets at the box office. Now it is as easy as making a few clicks and entering your credit card information online. But before ticket sellers take your money, they might first present you with two sets of distorted, wavy letters and ask you to type what you see. These are called Captchas, and they are designed so that humans can read them but computer programs cannot. Their job is to keep automated programs (bots) from abusing secure Web sites. What most users do not know, though, is that they have been enlisted in a project to transform old books, magazines, and newspapers into accurate, easily sortable, and searchable computer text files.

One of these wavy, distorted words probably came from a digitized image of an old text. The original page has already been scanned into an online database, but the scanning programs made a lot of mistakes, so the users who type in these letters are correcting them. In other words: buy a ticket to an event, and help preserve history! The set of software tools that accomplishes this is called reCaptcha. Its original project was to clean up the digitized archive of the New York Times, but now it is the main method Google uses to verify text in Google Books, its vast project to digitize rare and out-of-print texts and make them available on the Internet.
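The post does not spell out how the two words work together. As the scheme is commonly described, one word is a "control" whose answer is already known, and the other is the word the scanning software could not read. A rough Python sketch of that idea follows. The function names, the sample words, and the three-vote threshold are all made up for illustration and are not taken from reCaptcha itself.

```python
from collections import Counter


def check_response(control_answer, typed_control, typed_unknown, votes):
    """Pass the user if the control word matches; if so, count their
    guess for the unknown word as one vote (hypothetical helper)."""
    passed = typed_control.strip().lower() == control_answer.lower()
    if passed:
        votes[typed_unknown.strip().lower()] += 1
    return passed


def accepted_transcription(votes, min_votes=3):
    """Return the most common guess once it has at least min_votes
    (an assumed threshold), otherwise None."""
    if not votes:
        return None
    word, count = votes.most_common(1)[0]
    return word if count >= min_votes else None


# Three users get the control word right and agree on the unknown word;
# a fourth fails the control word, so their guess is ignored.
votes = Counter()
results = [
    check_response("morning", "morning", "upon", votes),
    check_response("morning", "Morning", "upon", votes),
    check_response("morning", "morning ", "Upon", votes),
    check_response("morning", "mourning", "open", votes),
]
final_word = accepted_transcription(votes)
```

Only people who prove they are human by getting the control word right have their reading of the mystery word counted, which is how a crowd of ticket buyers can end up correcting the scanner.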

Digitization is usually a three-stage process: create a photographic image of the text (a bitmap), encode the text in a compact, searchable form using optical character recognition (O.C.R.) software, and correct the mistakes. The first two steps are fairly straightforward. The third is harder, because O.C.R. programs often misread a large share of the words in old or damaged texts, and only humans can fix those errors. reCaptcha was developed to get around this obstacle. It has been estimated that people around the world decode about 200 million Captchas a day, at about 10 seconds per Captcha. reCaptcha is now used by 70 to 90 percent of the Web sites that use Captchas, including Ticketmaster, banks and Facebook. Although it has been extremely helpful, reCaptcha has its limits; for example, it has trouble with cursive handwriting. Despite these limits, reCaptcha achieves an accuracy rate of about 99 percent.
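To get a feel for how much human effort that estimate represents, here is the arithmetic as a short Python snippet. It uses only the two figures quoted above (200 million Captchas a day, 10 seconds each); the variable names are just for illustration.

```python
CAPTCHAS_PER_DAY = 200_000_000   # estimate quoted in the post
SECONDS_PER_CAPTCHA = 10         # estimate quoted in the post

# Total time people spend typing Captchas in one day.
total_seconds = CAPTCHAS_PER_DAY * SECONDS_PER_CAPTCHA
total_hours = total_seconds / 3600

print(f"{total_hours:,.0f} hours of human effort per day")
# prints "555,556 hours of human effort per day"
```

That is more than half a million hours of reading and typing every day, which is the time reCaptcha puts to work on correcting old texts.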

Usually, turning to the public would be a bad idea when trying to accomplish a major goal like this one. But in this instance, it was a great idea. It's amazing how well the results have turned out and how accurate they are. I would have expected a program like reCaptcha to run into more problems by relying on the public, but it has been remarkably successful. I have had to type in these Captchas before, and I never knew that this was what they were for: correcting words for a greater purpose.

Article Name: Deciphering Old Texts, One Woozy, Curvy Word At A Time
by Guy Gugliotta
