The Project Gutenberg FAQ - VV-BC

Ben Crowder

I've been a book lover ever since the day I learned to read. Several years ago I discovered Project Gutenberg while surfing the net and was delighted to find so many good books freely available. I downloaded all the etexts I was interested in and read quite a few of them. After a few years, I decided to get more involved, so I started proofing with Distributed Proofreaders. I liked that a lot -- I was a newspaper editor in high school for two years -- but I felt an itch to try to produce etexts on my own. I didn't have a scanner, however, so the only solution I could see at the time was to find a book and start typing it in by hand. I'm a relatively fast typist and I figured it wouldn't take that long.

So I went to my university library, found a pre-1923 edition of G.K. Chesterton's The Ball and the Cross (Chesterton is one of my favorite writers), and began typing. It took much longer than I expected -- certainly over 30 hours, perhaps even close to 50. When I finished, I came across a page on the PG site that mentioned there should be two spaces between sentences. I looked at the etext I'd just typed in and realized in horror that I'd used single spaces the whole way through. :) [1] I had been *sure* that PG used single spaces, convinced that I'd read it in one of the PG docs; typing that way had taken a little while to get used to, since I normally use two spaces. But all the PG etexts I checked had two spaces between sentences, so I began the monotonous task of adding an extra space after each sentence (being very careful not to add spaces where they shouldn't be). Several hours later the book was finally done. I'd gotten copyright clearance before I started, so I soon submitted it, and within a few days I saw those lovely words in my inbox: "Posted (#5265, Chesterton)".

[1] Ben was right both times: people have posted advocating both one space and two. Either would have been accepted!--jt
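
The space-doubling pass described above could also be roughed out with a script rather than done by hand. The following is only a minimal sketch in Python, not the method actually used: the function name double_space, the pattern, and the short abbreviation list are all illustrative assumptions, and a heuristic like this will still misfire on initials such as "G.K.", so the result needs the same careful review.

```python
import re

# Illustrative, far-from-complete list of abbreviations that end in a
# period but do not end a sentence (an assumption for this sketch).
ABBREVIATIONS = {"Mr.", "Mrs.", "Dr.", "St."}

# A word ending in . ! or ? (optionally followed by a closing quote or
# bracket), then exactly one space, then what looks like the start of a
# new sentence: an optional opening quote and a capital letter.
GAP = re.compile(r'(\S*[.!?]["\')\]]?) (?=["\'(]?[A-Z])')

def double_space(text):
    """Add a second space after sentence ends that have only one."""
    def fix(match):
        word = match.group(1)
        if word in ABBREVIATIONS:
            return match.group(0)  # leave "Mr. Smith" alone
        return word + "  "
    return GAP.sub(fix, text)

print(double_space("It was done. Then I slept."))
```

Text that already has two spaces is left unchanged, because the pattern only matches a single space followed directly by the next sentence.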

Since then, I've been addicted to producing etexts. Languages interest me greatly, so I found an Old Icelandic primer that someone had scanned in and OCRed the images using DocMorph (it didn't take as long as I thought it would, and the output was decent enough to work with). I then realized I would have a problem entering the foreign characters (o's with hooks underneath, etc.). Thank heavens for Unicode. Vim (my editor of choice) has fairly good Unicode support, and it didn't take long to make a list of the Unicode code points for the Icelandic characters.
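
A list of code points like the one mentioned above can be generated rather than looked up by hand. This is a small Python sketch; the choice of characters is an assumption (the text only mentions o's with hooks underneath, which Unicode calls "o with ogonek"), and the describe helper is a hypothetical name.

```python
import unicodedata

# Assumed sample of characters an Old Icelandic text might need:
# o with ogonek, thorn, eth, and ae (written as escapes so the script
# itself stays plain ASCII).
SAMPLE = "\u01eb\u00fe\u00f0\u00e6"

def describe(ch):
    """Return the code point and official Unicode name of a character."""
    return "U+%04X %s" % (ord(ch), unicodedata.name(ch))

for ch in SAMPLE:
    print(describe(ch))
```

In Vim, a code point from such a list can be typed in insert mode as Ctrl-V, then u, then the four hex digits.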

As noted, I use Vim for all my editing. I can rewrap lines to 65 characters by typing "gq", I can use regular expressions for search-and-replace (*very* handy), I can edit in Unicode when I need to, and I can speed things up greatly by making keyboard mappings for repetitive tasks. (On one text I was working on, I had to add a blank line between each pair of paragraphs. Each paragraph was numbered, but the blank lines had somehow been taken out before I got the text, so I started going through and adding them in by hand. The file was 30,000 lines long, however, and I quickly realized it would take a *long* time. I then noted which keys I was pressing to add the blank line before each paragraph, mapped them to <F9>, and held the key down while Vim zipped through the rest of the file. That sped the job up by a factor of over a hundred.)
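
The blank-line repair in the parenthetical above was done with a Vim key mapping; the same idea can be sketched as a short Python script. This is an illustration under stated assumptions, not the mapping actually used: it assumes each paragraph starts with a number and a period at the beginning of a line (the text does not say how the paragraphs were numbered), and separate_paragraphs is a hypothetical name.

```python
import re

# Assumed numbering style: digits, a period, then whitespace, at the
# very start of the line (for example "12. Text of the paragraph").
PARAGRAPH_START = re.compile(r"^\d+\.\s")

def separate_paragraphs(lines):
    """Insert a blank line before each numbered paragraph that lacks one."""
    out = []
    for line in lines:
        starts_paragraph = PARAGRAPH_START.match(line) is not None
        # Skip the very first line and any paragraph that already has
        # a blank line in front of it.
        if starts_paragraph and out and out[-1] != "":
            out.append("")
        out.append(line)
    return out

sample = ["1. First paragraph", "continues here", "2. Second", "3. Third"]
print("\n".join(separate_paragraphs(sample)))
```

Because paragraphs that already follow a blank line are skipped, running the function twice gives the same result as running it once.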

My university library is well-stocked and has lots of old books, so I usually rely on it when I need to get TP&V's (title page and verso) for texts I'm not typing in myself. I still don't have a scanner, so I either find already-existing texts on the Internet and reformat them for Project Gutenberg (after getting permission, of course), or find page images on the net and OCR them myself, or type the books in by hand. Typing by hand takes a long time, so I prefer the first two methods.

Volunteering with Project Gutenberg has been extremely satisfying. The people are wonderful to work with, the work is fun, and it feels very good to know that one is making a difference in the world.
