There’s a nice, short post up at Villanova’s Library blog about the proofreading work that goes into Project Gutenberg’s free ebooks. Digitizing books isn’t just a matter of snapping pictures of pages and gluing them into a PDF.
I mean, it can be. At the Internet Archive you can see pictures of the pages of Dickens’s “A Christmas Carol” as it was published in 1877. That’s pretty cool. And a 2.1 megabyte download isn’t huge, although the book is only 64 pages long. But a scan like this isn’t searchable, adaptable to different screens, or especially lightweight the way an EPUB file is.
To get an image to become searchable text, you need to use some kind of optical character recognition (OCR) software. The software does its best to guess which marks are letters and which are noise or old age, but it’s not perfect by a long shot.
People are much better at this than software. The Distributed Proofreaders project takes advantage of that, letting participants compare the OCR text to the photo of the page. By going one page at a time, and reviewing the corrections others have made, you get reliable editions online, for free.
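If you’re curious what a proofreader’s corrections look like from the computer’s side, here is a minimal sketch in Python. It is not how Distributed Proofreaders’ own tools work; it only uses the standard library’s `difflib` to list the words that differ between a raw OCR line and a human-corrected one. The function name `word_differences` and the misread words in the sample line are made up for illustration.

```python
import difflib


def word_differences(ocr_text, corrected_text):
    """Return (ocr_words, corrected_words) pairs wherever the two texts disagree."""
    ocr_words = ocr_text.split()
    corrected_words = corrected_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=corrected_words, autojunk=False)
    differences = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Keep the disagreeing words from each side, joined back into text.
            differences.append(
                (" ".join(ocr_words[i1:i2]), " ".join(corrected_words[j1:j2]))
            )
    return differences


# Hypothetical OCR misreads ("h" read as "b"); the corrected line is Dickens's.
ocr_line = "Marley was dead: to begin with. Tbere is no doubt wbatever about that."
proofread_line = "Marley was dead: to begin with. There is no doubt whatever about that."

for wrong, right in word_differences(ocr_line, proofread_line):
    print(f"{wrong!r} -> {right!r}")
```

Running it prints one line per correction, which is roughly the kind of list a second reviewer would want to look over.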
I helped out with the Distributed Proofreaders project several years ago. At the time, at least, you could choose which text(s) you wanted to work on and how much you wanted to do at once. Even nudging the ball forward makes progress when lots of people contribute. You’ve heard of crowdsourcing? This is it. And Villanova is working with this project to improve the digitized books in their collection.