The Project Gutenberg FAQ - S-21

S.21. Will PG store scanned page images of my book?

Yes. As of July 2004, we are beginning to offer archive space for page images of books posted to PG.

While page images cannot be searched, or converted to other text-based formats for reading, they do have some value -- for checking possible errors in the transcription, for holding images that might not have been preserved in HTML, for checking cited page numbers, for re-printing, and just generally for anyone who wants in-depth information on the source paper. This is not our core purpose, and the page images must be seen as an adjunct to the text rather than a main feature. However, disk space and bandwidth are now plentiful enough that it is practical to preserve these, if only for the relatively few people who might make use of them.

We do have to be careful in our use of space and bandwidth, though. To use 40 KB per page is reasonable, given today's resources; to use 140 KB per page is not. Thus, we insist on maximally-compressed black and white page images only, for normal pages, and the best size-to-quality ratio we can get for pictures.
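
As a rough illustration of why the per-page figure matters, the sketch below multiplies it out for a whole book. The 300-page count is an assumption chosen for the example, not a PG figure, and the function name is hypothetical.

```python
def archive_size_mb(pages: int, kb_per_page: float) -> float:
    """Approximate total size in megabytes (1 MB = 1024 KB) for a set of page images."""
    return pages * kb_per_page / 1024


# Hypothetical 300-page book at the two per-page sizes mentioned above.
for kb in (40, 140):
    print(f"{kb} KB/page -> {archive_size_mb(300, kb):.1f} MB")
# 40 KB/page -> 11.7 MB
# 140 KB/page -> 41.0 MB
```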

Our current guidelines on the submission of page images are:

1. PG is now accepting page images of books posted. Page images will be posted only as an addition to an etext posted in the normal way -- we will not post page images without plain text.
2. Page images are an option; they are not and will not be required for the posting of a text.
3. All page images should be good enough to work reasonably well with OCR packages, up to 600 dpi, and should be stored as black-and-white TIFFs with CCITT-4 (aka ITU-G4 or Fax Group 4) compression. This is important, so that we keep the overall file size down to a sustainable level. With this compression, a typical 600 dpi page can be stored in about 40 KB. Our ability to post these images depends on the file sizes staying fairly reasonable. Pages such as color pictures or greyscale photos that cannot reasonably be stored as black-and-white only should be stored as TIFF or JPEG with the best compression you can get for that image.

(Note: IrfanView for Windows does this nicely, for individual files or in batch. With ImageMagick v6.x, use: convert myimage.png -compress group4 myimage.tif) [P.1]
4. Each page image should be a separate file and named with the page number within the set; e.g. 001.tif, 002.tif, etc. Separate, non-page images, such as covers or color images scanned separately from the pages, should have suitable names, such as "cover.jpg" or "072-image.tif". All page images for the book will be zipped into one file, to be called FILENUMBER-page-images, e.g. for etext #12345, and stored in the main directory for that etext. It will unzip to a subdirectory ./page-images, but we will not post separate page images in that directory, since that would double the space used, and we believe that people who want to consult the images will probably want them all. So, for now at least, if you want the images, you must download the ZIP file.
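
The naming and packaging rules above can be sketched in a few lines of Python. This is only an illustration of how a submitter might prepare a set, not PG's own tooling. The function names are hypothetical, the .zip extension on the archive name is an assumption (the guideline gives only the base name FILENUMBER-page-images), and the conversion command is the ImageMagick 6.x one quoted in the note above.

```python
import zipfile
from pathlib import Path


def page_name(page_number: int, ext: str = "tif") -> str:
    """Zero-padded page file name, e.g. 001.tif."""
    return f"{page_number:03d}.{ext}"


def g4_command(src: Path, dst: Path) -> list[str]:
    """The ImageMagick 6.x command from the note above, as an argument list."""
    return ["convert", str(src), "-compress", "group4", str(dst)]


def pack_page_images(image_dir: Path, etext_number: int, out_dir: Path) -> Path:
    """Zip every file in image_dir so that it unpacks into ./page-images/."""
    # Assumption: the archive carries a .zip extension.
    archive = out_dir / f"{etext_number}-page-images.zip"
    # Default ZIP_STORED is used; the images are already compressed.
    with zipfile.ZipFile(archive, "w") as zf:
        for path in sorted(image_dir.iterdir()):
            if path.is_file():
                zf.write(path, arcname=f"page-images/{path.name}")
    return archive
```

Passing each g4_command(...) result to subprocess.run(cmd, check=True) would perform the conversion, provided ImageMagick is installed. Keep out_dir separate from image_dir, so that the archive is not picked up as one of its own members.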

Page images submitted to Distributed Proofreaders [B.2] are automatically saved, and, while not publicly available today, will probably become so in the future.

For storing higher resolution page images or pictures than we can reasonably post today, you might consider the Internet Archive. To find out more, go to