The Project Gutenberg FAQ - S-7

S.7. How do I scan a book?

This depends on whether you have cut the pages out, or whether you are working with an intact book.

If you have cut the pages out, and you have an automatic document feeder (ADF), then you will obviously feed them through that.

If you don't have an ADF, there usually isn't much point in cutting the pages. Most modern OCR will recognize a "dual-page" or "two-up" scan, and, if yours does, then that's normally the best option. Scanning the uncut book, open and flat, is the most common scanning method used in PG.

Take the book and place it open, flat on the scanner glass. To fit both pages on the glass, you may need to position it lengthways, at 90 degrees to its natural orientation. Most OCR software will recognize that the image has been rotated through a right angle, and will correct it when it reads the text.
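If your OCR package does not rotate or split a two-up scan for you, the idea is simple enough to sketch. The Python below is only an illustration: it uses a nested list of pixel values as a stand-in for a real image, and the function names are made up for this example. Whether you need a clockwise or counter-clockwise turn depends on which way round you laid the book on the glass.

```python
from typing import List, Tuple

# Rows of grayscale pixel values; a stand-in for real image data.
Image = List[List[int]]


def rotate_clockwise(image: Image) -> Image:
    """Rotate the image 90 degrees clockwise."""
    # Reverse the row order, then transpose: the left column,
    # read bottom to top, becomes the new top row.
    return [list(row) for row in zip(*image[::-1])]


def split_two_up(image: Image) -> Tuple[Image, Image]:
    """Split a two-page spread down the middle into left and right pages."""
    # Assumes the spine sits at the horizontal center of the image.
    middle = len(image[0]) // 2
    left = [row[:middle] for row in image]
    right = [row[middle:] for row in image]
    return left, right


# A sideways scan: 4 rows by 2 columns.
sideways = [[1, 2], [3, 4], [5, 6], [7, 8]]
upright = rotate_clockwise(sideways)      # now 2 rows by 4 columns
left_page, right_page = split_two_up(upright)
```

In practice the spine is rarely dead center, so a real tool would look for the gutter rather than cut at the exact midpoint.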

A common problem with scanning an opened book is "guttering", which happens when the spine of the book is not pressed flat enough, and the inside of each page, where it meets the spine, curves away from the glass. There's more about this, and an example, scan3, in the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". To avoid guttering, make sure that the spine is held down throughout the scan. (Some people put a weight on the spine to hold it down for each scan; others just press their hand against it.)

Another common problem is light scattering, which happens when too much light gets into the scanner. The scanner head detects light, and you want the only light source to be the scanner itself, not ambient room light or sunlight. Scanners have covers that are intended to be closed while scanning, to give a controlled light level, but when you're scanning a book held open and flat, you can't close the cover fully. In a bad case, the scan can come out looking like overexposed film; you can see an example in scan4 of the FAQ [S.17] "Why am I getting a lot of mistakes in my OCRed text?". If this happens, just make sure that your room is dim while you scan. Don't have a ray of bright sunlight bouncing around the inside of the scanner!

Occasionally, when scanning cut pages with very thin paper, you may get a shadow of the text on the other side showing through. If this happens, you can try covering the inside of the scanner lid, which is normally white, with a piece of black paper.

Many modern OCR packages will control the scanner automatically, and you may be able to set your OCR so that it does an automatic timed scan every, say, 30 seconds. This is a great timesaver, since you don't have to go back and forth between the scanner and the screen. Just set your timer, hold down the book for the scan, take the book up, turn the page, put it down again, and wait for the next scan to start. Set the timer for whatever interval you are comfortable with. Highly recommended, if your OCR or scanning package can do it.
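To make the idea of a timed scan concrete, here is a rough Python sketch of the loop such a package might run. It is not taken from any real OCR or scanner software: scan_page is a hypothetical callback standing in for whatever actually triggers your scanner, and the time estimate assumes each scan cycle takes exactly one interval.

```python
import math
import time
from typing import Callable, List


def timed_scans(scan_page: Callable[[], str],
                scan_count: int,
                interval_seconds: float) -> List[str]:
    """Call scan_page once per interval and collect the results.

    scan_page is a hypothetical stand-in for the real scanner trigger.
    """
    results = []
    for index in range(scan_count):
        results.append(scan_page())
        if index < scan_count - 1:
            # Time for you to lift the book, turn the page,
            # and press it flat again before the next scan.
            time.sleep(interval_seconds)
    return results


def estimated_minutes(page_count: int,
                      interval_seconds: float,
                      pages_per_scan: int = 2) -> float:
    """Rough total time, assuming one scan per interval."""
    scans = math.ceil(page_count / pages_per_scan)
    return scans * interval_seconds / 60
```

For example, a 300-page book scanned two-up needs 150 scans, so at 30 seconds each, estimated_minutes(300, 30) comes to 75 minutes.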

By default, most scanners scan the entire area of the flatbed, but usually, your book will occupy only about half of it. Look for a setting in your OCR or scanning package which allows you to reduce the area that the head scans. Just scan enough to get the image of your pages. This makes each scan, and the OCR recognition that follows, shorter, and in a really good case can cut your total scanning and OCR time in half.
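As a worked example of the arithmetic, the Python sketch below works out what fraction of the flatbed a reduced scan area covers, and how big the resulting image is in pixels. The bed and book dimensions are made-up illustrative figures, and treating scan time as proportional to area is a simplifying assumption: on a real scanner the time depends mostly on how far the head has to travel.

```python
from typing import Tuple


def scan_area_fraction(bed_width_in: float, bed_height_in: float,
                       scan_width_in: float, scan_height_in: float) -> float:
    """Fraction of the flatbed covered by the reduced scan region."""
    if scan_width_in > bed_width_in or scan_height_in > bed_height_in:
        raise ValueError("scan region is larger than the flatbed")
    return (scan_width_in * scan_height_in) / (bed_width_in * bed_height_in)


def region_in_pixels(width_in: float, height_in: float,
                     dpi: int) -> Tuple[int, int]:
    """Size of the scanned image in pixels at the given resolution."""
    return round(width_in * dpi), round(height_in * dpi)


# Illustrative figures only: an 8.5 x 12 inch bed, and an open book
# that needs 8.5 x 6 inches of it.
fraction = scan_area_fraction(8.5, 12, 8.5, 6)   # 0.5, i.e. half the bed
pixels = region_in_pixels(8.5, 6, 300)           # (2550, 1800) at 300 dpi
```

With these figures the reduced region is half the bed, which is the "really good case" where the head and the OCR each have about half as much to get through.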

Scanning all pages together is usually fastest, but you may prefer to scan each double-page, then correct it in your OCR package's editor, then scan the next. This is a more leisurely approach favored by some volunteers.