Wednesday, February 07, 2007

Now this is cool

There is an extraordinary thing called Project Gutenberg. It is an amazing, huge, non-profit project which aims to put as much of the world's library online, in searchable form, as possible. Here's what they say about their origins:

In 1971, Michael Hart was given $100,000,000 worth of computer time on a mainframe of the era. Trying to figure out how to put these very expensive hours to good use, he envisaged a time when there would be millions of connected computers, and typed in the Declaration of Independence (all in upper case--there was no lower case available!). His idea was that everybody who had access to a computer could have a copy of the text. Now, 31 years later, his copy of the Declaration of Independence (with lower-case added!) is still available to everyone on the Internet.

Hart essentially invented the e-book at that time, long before anyone but defense researchers at DARPA had heard of the Internet. As of now, Project Gutenberg has about 20,000 books online, available to all, totally free. They range from ancient classics to history to Sherlock Holmes to early 20th-century writings. The only requirement is that they be public domain. They have online writings in Chinese, Dutch, English, Finnish, French, German, Italian, Portuguese, Spanish, and Tagalog. Those are just the major languages, the ones with more than 50 titles each. The minor ones include writings in Afrikaans, Catalan, Frisian, Mayan, Napoletano-Calabrese, Romanian, Sanskrit, and Welsh, to name but a few.

They take the best available copy of each title and scan it. They also work with Google, which is engaged in a similar and highly publicized project: scanning the total contents of many major libraries and making at least a sample of them available online. By sticking to public domain materials, Project Gutenberg is avoiding the nasty copyright fights that Google is already getting drawn into. I'm afraid the re-thinking and re-definition of copyright is going to be THE biggest and nastiest issue of the developing online information world.

Anyway. The original scan really just produces a photo of each source page, which is neither machine readable nor searchable, meaning search engines like Google can't find the text in it. So they use the best Optical Character Recognition software available and have it make a transcript, one that can be edited like an ordinary text file and can be accessed and indexed by search engines. But even the best OCR software is not perfect, especially when dealing with odd typefaces or old microfilm. Its output has to be reviewed by an intelligent mind that can recognize letters obscured by smudges and piece things together from context.

That's where we come in.

There is an affiliated project, called Distributed Proofreaders. It's a clearinghouse for volunteer proofreaders who want to help Project Gutenberg prepare texts for final release on the Web, and I'm helping them out. Every text goes through several stages. First is basic proofreading, which is all I've done so far. A split screen shows you the image of the original scan and the best OCR transcription. You compare them and correct the transcript. Once you've got it done as best you can, you click a button, the page is saved and sent on to the next stage, and they send you another page. There are later stages of further proofreading, then formatting, and eventually HTML versions are created for the Net.
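For the programmers in the audience, here is a minimal sketch of what that compare-and-correct step amounts to, written in Python. To be clear, this is my own illustration and not how Distributed Proofreaders' software actually works: the function name `proofreading_changes` and the sample sentence with its OCR mistakes are made up. It uses the standard library's `difflib` to list the words that differ between a raw OCR transcript and the proofread version.

```python
import difflib


def proofreading_changes(ocr_text, corrected_text):
    """Return (ocr_words, corrected_words) pairs for each spot where the texts differ.

    Hypothetical helper for illustration only; it compares word by word.
    """
    ocr_words = ocr_text.split()
    fixed_words = corrected_text.split()
    matcher = difflib.SequenceMatcher(a=ocr_words, b=fixed_words, autojunk=False)
    changes = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            # Record what the OCR produced and what the proofreader changed it to.
            changes.append((" ".join(ocr_words[i1:i2]), " ".join(fixed_words[j1:j2])))
    return changes


# Invented sample page: "Shyl0ck" and "tbe" stand in for typical OCR slips.
ocr_page = "Mr. Kean's Shyl0ck was tbe talk of London"
proofed_page = "Mr. Kean's Shylock was the talk of London"

changes = proofreading_changes(ocr_page, proofed_page)
for before, after in changes:
    print(f"{before!r} -> {after!r}")
```

The software can tell you what changed, but deciding what the smudged word really was is still the human's job.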

I just LOVE this. It's not just a service to civilization, by my standards; it's also a hell of a lot of fun. It's similar to the pleasure I took in dealing firsthand with collections materials at the museum. It's so cool to know I may be the first person in over a hundred years to really read this stuff. Things like:

  • A literary and social journal published in London around 1838, containing a memoir of the legendary actor Edmund Kean playing Shylock in The Merchant of Venice, as well as a review of the latest from Mr. Thackeray. There were also some gossip notes from a young American lady visiting Paris. She wrote of her visit to an elderly exiled Polish prince and his young wife, who made exquisite embroidery and sold it at an annual bazaar to benefit other Polish exiles. She also wrote about visiting an archive where she examined plans of French fortified towns, including the one where Napoleon's nephew was imprisoned.
  • An essay from a scientific journal of the 1880s, inquiring into the precise definition and determination of death by exploring suspended animation, namely how close animals could be taken to true death and then revived. Remember that this was a major issue at a time when you could not scan for brain activity to settle the matter, and many people were terrified of falling ill, becoming unconscious, being pronounced dead, and then being buried, only to wake up imprisoned in a coffin and die of suffocation in terror. This was important.
  • A collection of speeches by Bertie, the Prince of Wales, in the 1880s and 90s, before his mother Queen Victoria died and he became King Edward VII. A lot of routine speeches -- the anniversary dinner of the Royal Geographical Society, opening a hospital -- but intriguing at this distance in time, at least to me.
  • An issue of Stars and Stripes, the US Army newspaper, published in France in 1918, during World War I. Lots of letters and articles complaining bitterly, and from experience, of the incompetence of the people who designed and manufactured the uniforms they had to wear and the weapons they had to fight with. Also ads about how to wire their wages back home to their families via Wells Fargo.

Is this cool or what? I'm having a ball doing this. It can be disjointed; it's not like reading a book or an article straight through. You get presented with pages out of sequence, because other proofreaders are working on the same project while you're offline. But when the final version is published on Project Gutenberg, you can read it straight through if you wish.

I think this is great fun, and a great thing to do for a civilization that needs all the help it can get. And if I were offering advice to somebody like -- oh, I don't know, a nephew or someone who might need to find a good senior project in a few years -- I'd say he could do a lot worse than to check this out.

1 comment:

Don said...

Very cool. I've read a handful of 19th Cent. novels from there and had to stifle my proofreading impulse because I "knew" my corrections would benefit no one. Maybe now I can try again. Speaking of nephews, your eldest's senior project will be to learn and perform an Italian aria. Not yet selected. The other one wants to invent a perpetual motion machine. Go figure. Don't bet against him either.