As stated back in the Basics section, all you need to do is:
Borrow or buy an eligible book.
Send us a copy of the front and back of the title page.
Turn the book into electronic text.
Send it to us.
That's all you actually need to know in order to be a producer. But if you're interested in the details of how other people actually do this, and want to know what else happens behind the scenes, here's a full, blow-by-blow account.
Volunteers find eligible books [V.18] in all sorts of ways. Some lucky people have them in their bookshelves, or their attic. A lot of people have a good library nearby, where they can find books, or request them on interlibrary loan. Some people are big eBay fans; others like to hunt for bargains on specialist booksites. And of course lots of volunteers enjoy rummaging through actual used bookstores, or local markets, or yard sales.
Even if you're not going to take on a book yourself right now, search for some on the Net and find out about how to get a copy. Next time you pass an antiquarian bookstore, or a book market, drop in and browse. Ask your local library about interlibrary loans. Eligible books aren't hard to find once you know where to look.
New volunteers sometimes find it hard to understand why copyright clearance is so important, and why, in particular, Project Gutenberg is so careful about it. At base, it's simple: by keeping a filed copy of the TP&V [V.25] of every book we produce, we can at any time protect our publications against claims from publishers that they "own" the work, and thus we can keep them available to the public.
The copyright laws can be difficult to understand, and sometimes it may take serious research to prove that a particular edition is actually in the public domain. If you're not legally inclined and you're in the U.S.A., just keep repeating "Pre-'23 is free" and stick to books published before 1923. If you do want to delve deeper, read our Copyright Rules page at <http://www.gutenberg.net/howtos/copyright-howto> and then go on to read the Library of Congress Copyright Office official papers at <http://www.copyright.gov/>. If you're in another country, find out about your own copyright laws.
Volunteers send in the TP&V from the book for us to inspect. This not only gives us the proof to file, it also lets us know that someone is really working on the text so that we can list it as being In Progress for the information of others who might be interested.
3. Scanning, typing, proofing and editing
This makes up the bulk of PG's effort, and is discussed at great length elsewhere in this FAQ. There are many, many ways to create an etext from a paper book, and different people use different methods, but it all boils down to making a text file. For a typical book, it will probably take 40 hours of a volunteer's time. All that happens here is that somebody makes the effort to transcribe one paper book into a file that can be shared around the world and for all time.
Since "The Slashdot Incident", when Distributed Proofreaders was featured on Slashdot, a popular technical news site, most of PG's production has gone through DP. The production steps there are more formalized, but it's still the same basic formula of scan, OCR, proofread. The big difference is that no one person has to do it all.
[Note: this information is quite specific to the process we go through now. It is quite likely to change as we improve the automation of the tasks.]
Posting is done by the Posting Team. The basic job is to receive the text from the producer, check that it has been copyright cleared, check that it conforms to Project Gutenberg standards, check it for correctness (which can be anything from XML validity to simple spelling), add the Project Gutenberg header and copy the text to the two PG servers.
In a simple case, where everything goes right, this can take as little as fifteen minutes. In a complicated case, where we have to convert formats, or there are a lot of errors in the text, or there are problems with the copyright clearance, it can take hours or even days while we wait for responses, or do a lot of editing, or find conversion tools.
Michael Hart used to do this work entirely alone, but in September 2001, he created the Posting Team to handle the load. (The Posting Team are nicknamed the "Whitewashers" in honor of Tom Sawyer's victims. :-)
You send the text to us [V.46] either by Web, by FTP (with a username and password that any of the Posting Team can give you privately), or by e-mail.
If you're FTPing, you should e-mail one or more of us as well, to let us know what you've uploaded.
One problem is files that don't transfer correctly. Especially by e-mail, some files get damaged on the way. It's better to ZIP the file before sending, if possible, to prevent some common problems with text files. The use of compression formats other than Zip can also create problems. Members of the Posting Team work on multiple platforms--DOS, Windows, Linux, Solaris--and zipping and unzipping programs are commonly available for all of these. Other compression methods, like Stuffit or bzip2, are not so readily available, and may give us trouble.
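If you'd rather script the zipping step, here is a minimal Python sketch using the standard zipfile module. The function name zip_for_upload and the file name in the usage note are just illustrations, not anything PG requires; any ordinary Zip program does the same job.

```python
import zipfile
from pathlib import Path


def zip_for_upload(text_path):
    """Wrap one text file in a standard Zip archive next to the original.

    Hypothetical helper for illustration: the archive gets the same base
    name as the text file, with a .zip extension.
    """
    src = Path(text_path)
    dest = src.with_suffix(".zip")

    # ZIP_DEFLATED is the ordinary Zip compression that common unzip
    # programs on DOS, Windows, Linux and Solaris all understand.
    with zipfile.ZipFile(dest, "w", compression=zipfile.ZIP_DEFLATED) as zf:
        zf.write(src, arcname=src.name)

    # Re-open the archive and verify the stored checksums before sending.
    with zipfile.ZipFile(dest) as zf:
        bad_member = zf.testzip()
        if bad_member is not None:
            raise RuntimeError(f"corrupt member in archive: {bad_member}")

    return dest
```

Calling zip_for_upload("mybook.txt") would leave mybook.zip beside the original. Because the file travels as binary data inside the archive, its line endings and any 8-bit characters arrive exactly as you saved them.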
We log in via ssh to pglaf.org, the Unix system on which we work when posting and the same one that you uploaded the file to. There we unzip the file and glance at the top of it.
We then check it for copyright clearance. The one and only absolute rule that we never bend, no matter what, is that we will not post a file that doesn't have a clearance. If it ain't in the clearance files, it don't get posted.
Most regulars know that they should include their clearance line in the web form or e-mail submitting the text, but not everybody does, and not everybody remembers every time. This can be frustrating, when clearance is not included and not obvious.
When you received your clearance on a book, you got what we call a "Clearance Line", something like this:
The Works Of Homer [Iliad/Odyssey] Tr. George Chapman Jim Tinsley 06/14/01 ok
These are saved in files that we posters can access. We regard this information as private, so we don't publish the details of who has cleared what.
When we get the text, we check whether the submitter has cleared it. If there is a clearance line in the e-mail notifying us about the text, there's no problem. If we can find the title of the text under the submitter's name in the clearance files, there's no problem. Unfortunately, sometimes we can't find it. There are two usual reasons: either the text submitted is part of the work cleared (for example, submitting one play from a collection), or the text hasn't been cleared yet. If the clearance isn't straightforward, we can go back and forth and round and round in e-mails for a while.
This is why it's important to paste the clearance line into the web form that you use to upload your etext, or into your e-mail, if for some reason you couldn't use the web form.
If the title of the text you're sending isn't the same as the title of the text cleared, BE SURE to paste in the clearance line AND explain that the text you're sending is PART of the cleared book. Please also list the titles of the other parts; it really does cause confusion and delay when this is not clear.
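To show why a mismatched title causes trouble, here is a rough Python sketch of the kind of lookup described above. It is not the Posting Team's actual tooling: the function find_clearance is hypothetical, and it assumes only that each clearance line mentions the title and the submitter's name and ends in "ok", as in the sample line.

```python
def find_clearance(lines, submitter, title):
    """Return the clearance lines that mention both submitter and title.

    Assumption for illustration: a usable clearance line contains the
    title and the submitter's name as plain text and ends with "ok".
    """
    wanted_name = submitter.casefold()
    wanted_title = title.casefold()
    hits = []
    for line in lines:
        folded = line.casefold()
        ends_ok = folded.split()[-1:] == ["ok"]
        if wanted_name in folded and wanted_title in folded and ends_ok:
            hits.append(line)
    return hits


clearances = [
    "The Works Of Homer [Iliad/Odyssey] Tr. George Chapman Jim Tinsley 06/14/01 ok",
]

# The cleared title is found straight away.
whole = find_clearance(clearances, "Jim Tinsley", "The Works of Homer")

# A part submitted under its own title is not found, which is when the
# e-mails start going back and forth.
part = find_clearance(clearances, "Jim Tinsley", "The Iliad")
```

Here whole contains the one clearance line and part is empty, even though the Iliad is covered by the clearance. A human poster can usually work that out, but only after some digging, which is why pasting in the line and explaining the relationship saves time.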
Sometimes, people send in a book in a non-text format like Word Perfect or Microsoft Word, or send a text with unwrapped lines. In that case, we try to get the submitter to fix them, but if they can't, we have to convert the file to straight text before starting.
Some producers, particularly inexperienced ones, want to add non-standard annotations and mark-up and symbols to the text. This can get ticklish; we don't want to discourage them, but we need to keep texts reasonably standard. Usually, we can work something out. Maybe the book should be added in both text and HTML, for example.
Assuming that it's a plain text file, we next run gutcheck and a quick spellcheck on the file. This will tell us immediately whether it adheres to PG standards and whether there is any serious problem with it.
If the file looks clean, we may skim it, looking for potential problems or formatting issues. For clean texts, the only things we usually need to change are unindented quotations or inconsistent chapter headings (a lot of people seem to mix "CHAPTER III" with "Chapter 14" and have irregular numbers of blank lines) or spacing and a few 8-bit characters. Occasionally, we have to rewrap a text. We also look out for included publishers' trademarks, which we normally prefer to remove (trademarks are NOT subject to copyright expiration: Macmillan(TM), the publishing house, is still around and trading), unnecessary or downright odd indentation or centering, stray page numbers, and prefaces or introductions or appendices that may not be in the public domain. If the file has lots of 8-bit characters, we probably need to make a separate 7-bit version, and post both.
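If you want to catch a few of these things yourself before submitting, a short script can help. The Python sketch below is not gutcheck and doesn't try to imitate it; it only looks for 8-bit (non-ASCII) characters, over-long lines, and mixed chapter-heading styles. The function name, the report layout and the 80-character default are assumptions made for this example.

```python
import re

# Matches headings such as "CHAPTER III" or "Chapter 14" at the start of a
# line. This is a heuristic: a line of ordinary prose beginning with
# "Chapter" followed by a short word made of roman-numeral letters would
# also match.
CHAPTER_RE = re.compile(r"^\s*(chapter)\s+([ivxlcdm]+|\d+)\b", re.IGNORECASE)


def quick_checks(text, max_len=80):
    """Collect line numbers and heading styles worth a second look."""
    report = {"non_ascii_lines": [], "long_lines": [], "heading_styles": set()}
    for lineno, line in enumerate(text.splitlines(), start=1):
        if any(ord(ch) > 127 for ch in line):
            report["non_ascii_lines"].append(lineno)
        if len(line) > max_len:
            report["long_lines"].append(lineno)
        match = CHAPTER_RE.match(line)
        if match:
            word, number = match.groups()
            case = "upper" if word.isupper() else "mixed"
            kind = "arabic" if number.isdigit() else "roman"
            report["heading_styles"].add((case, kind))
    return report
```

More than one entry in heading_styles suggests that the book mixes styles like "CHAPTER III" and "Chapter 14". Any entries in non_ascii_lines mean the text has 8-bit characters, so a separate 7-bit version may be needed.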
If the gutcheck and spellcheck don't look clean, or if conversion is required, we may spend a lot more than 15 minutes on it. In a bad case, we may have to get the file re-proofed.
If you are conscious that you're doing something non-standard, and really mean it to stay, say so in your e-mail. (For example, I recently posted a text containing a family-tree representation that had lines over 80 characters. Now, I would have left that one alone anyway, but it helped that the submitter drew my attention to it in the e-mail.) If it's too non-standard, the poster may not allow it to stay, but at least you can discuss it. When a text needs a lot of non-standard formatting or markup, you really need to ask yourself whether you shouldn't be submitting it in HTML, with all the bells and whistles, and settle for something more normal in the text variant.
Mostly, errors are obvious, and there are at least some obvious errors in most texts. When errors are completely obvious, we just fix them without sending feedback to you, the producer, unless you have specifically asked for feedback in your e-mail.
We're getting more HTML formats now, which is great, but incoming HTML often needs a lot of work, because people who are not experienced with HTML often make mistakes. The W3C validator at <http://validator.w3.org> is the official check for valid HTML, but, for the average volunteer, it's awkward to use. However, if you're submitting an HTML format, please use Tidy, which you can get from <http://tidy.sourceforge.net>, to check your text before sending it. If you're using CSS along with your HTML, you need to check that separately at the W3C CSS Validator <http://jigsaw.w3.org/css-validator/> as well.
We add the PG header and footer. If there is a header and footer already there, we strip them off first, since recent changes in the header mean that a lot of people send files with headers that are out of date. We have written programs to help with this.
We get the number for the text from a program called "ticket" that Brett Fishburne wrote, that dispenses the next number. That way, if two or three of us are posting at the same time, we won't all grab the same number. Given the number, we know the filenames, according to the rules for post-10K texts listed at [R.35], and finally zip up the file.
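We can't show you the "ticket" program itself here, but the idea of a number dispenser that never hands the same number to two people is easy to sketch. The Python below is one possible approach on a Unix system, using a counter file and an exclusive lock; the function name and the counter file are hypothetical and are not a description of how "ticket" really works.

```python
import fcntl
import os


def next_ticket(counter_path):
    """Return the next number from a counter file, safely under concurrency.

    Sketch only: the file holds the last number issued as plain text.
    An exclusive flock makes a second caller wait until the first has
    written the new value, so no two callers can get the same number.
    """
    fd = os.open(counter_path, os.O_RDWR | os.O_CREAT, 0o644)
    with os.fdopen(fd, "r+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)   # blocks while another poster holds it
        raw = f.read().strip()
        number = int(raw) + 1 if raw else 1
        f.seek(0)
        f.truncate()
        f.write(str(number))
        f.flush()
        os.fsync(f.fileno())
        # Closing the file at the end of the "with" block releases the lock.
    return number
```

The point of the design is that reading the old number and writing the new one happen while the lock is held. Without that, two posters working at the same moment could both read the same value.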
We now transfer the posted files to two servers: ftp.ibiblio.org (which also serves as gutenberg.net) and ftp.archive.org. (This is usually the point at which we realize that we forgot to make a change we noticed while checking. Aaaargh!)
Currently, we usually do this by uploading the files in one big zip to pglaf.org, where a timed job automatically puts the files onto both servers. At the moment, this happens 20 minutes past the hour, every hour.
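If you're wondering when a file uploaded just now will reach the servers, the arithmetic is simple. This small Python sketch assumes the job fires at exactly 20 minutes past each hour, as described above; the function name is made up for the example, and the schedule itself may change.

```python
from datetime import datetime, timedelta


def next_push(now, minute=20):
    """Return the first scheduled run strictly after the given time."""
    candidate = now.replace(minute=minute, second=0, microsecond=0)
    if candidate <= now:
        # This hour's run has already happened; wait for the next hour.
        candidate += timedelta(hours=1)
    return candidate
```

So an upload finished at 06:21 just misses the 06:20 run and waits for 07:20, while one finished at 06:05 goes out at 06:20.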
At this point, the book is posted, but nobody knows about it! We need to do something about that. . . .
We compose an e-mail to the "posted" e-mail list, cc: the producer, with the line that is to go into GUTINDEX.ALL, the master list of PG files.
The "posted" list has only a few subscribers. These are the people who index and create links to PG texts, and include both PG volunteers and the maintainers of other sites that link to PG texts.
They also commonly download the texts to get more information for their indexes, and tell us if there is anything wrong with the files.
This e-mail is simply the official notification to all these people and the producer that the file has been posted. Here's a sample of such an e-mail:
To: "Posted Etexts for Project Gutenberg" <posted@listserv.unc.edu>
Subject: [posted] Posted (#5301, Duncan) !
From: "Jim Tinsley" <jtinsley@pobox.com>
Date: Tue, 25 Jun 2002 06:21:27 -0400 (EDT)
Cc: you@example.com

Mar 2004 The Imperialist, by Sara Jeannette Duncan [SJD#4][mprlsxxx.xxx]5301
There may also be some remarks, if the text is in any way non-standard, or if files other than plain text were posted with it.
From this e-mail, you can, if you want to see any corrections made, immediately download the posted file and compare it to your version. Since the notification is made after the file has been copied to the servers, it should be there waiting for you.
To find out how to download a book that has just been posted, see the FAQ "R.3. How can I download a PG text without using the web catalog?" [R.3]
From the "posted" list, the posting line is added to GUTINDEX.ALL. A skeleton index entry, containing title, subtitle, author, etext number, language and character set, is then added to the website database by a daily automatic job, and our indexers begin the much more thorough cataloging process for the website. This includes work like finding authors' dates of birth and death, getting the Library of Congress classification, and gathering the other information that makes up the website's searchable index. That process takes extra time, which is why the website's searchable catalog must always lag behind the actual titles posted.
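To give a feel for how a skeleton entry can be pulled out of a posting line, here is a rough Python sketch. It is based only on the one sample line shown in the e-mail above, so the pattern, the field names and the function parse_posting_line are all assumptions; the real GUTINDEX.ALL layout and the real indexing job may well differ.

```python
import re

# Assumed layout, taken from the single sample posting line:
#   <Mon> <year> <title>, by <author> [tag][tag]<etext number>
POSTING_LINE = re.compile(
    r"^(?P<month>[A-Z][a-z]{2})\s+(?P<year>\d{4})\s+"
    r"(?P<title>.+?), by (?P<author>[^\[]+?)\s*"
    r"(?P<tags>(?:\[[^\]]*\])*)\s*(?P<number>\d+)$"
)


def parse_posting_line(line):
    """Split a posting line into fields, or return None if it doesn't fit."""
    match = POSTING_LINE.match(line.strip())
    if match is None:
        return None
    return {
        "month": match["month"],
        "year": int(match["year"]),
        "title": match["title"],
        "author": match["author"],
        # The bracketed items (author code, file name) are kept as raw tags.
        "tags": re.findall(r"\[([^\]]*)\]", match["tags"]),
        "etext_number": int(match["number"]),
    }


entry = parse_posting_line(
    "Mar 2004 The Imperialist, by Sara Jeannette Duncan [SJD#4][mprlsxxx.xxx]5301"
)
```

For the sample line, entry holds the title "The Imperialist", the author "Sara Jeannette Duncan" and the etext number 5301. Lines that don't follow the assumed shape, such as one with no ", by " in it, come back as None instead of being guessed at.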
It's remarkable how many people who went over and over the text to the point of hating it suddenly see problems with it when they download it a couple of days after it's posted! Something psychological there, I expect. Anyhow, if you do download your text and see problems with it, don't worry, just e-mail whoever posted it, or any other member of the Posting Team. No, you're not stupid, or if you are, you're in good company, because we've all done it! There's no big deal about replacing the posted file with a corrected copy immediately.
Over time, other readers may submit corrections. If you find an error in a PG etext, see the FAQ "I've found some obvious typos in a Project Gutenberg text. How should I report them?" [R.26]
When the corrections are small, as most are, we will just make the change to the existing text. We never make a new edition when we get corrections immediately after posting; we just update the file.