The Project Gutenberg FAQ - P-1

P.1. What useful programs are available for Project Gutenberg work?

These suggestions come largely from a poll of volunteers taken in June 2002. The list summarizes the programs we actually use. Many other programs can do the same jobs, so don't limit your search to these.


1. OCR

  Abbyy <>
  OmniPage <>
  TextBridge <>

These are the three main commercial packages that volunteers bought specifically for the purpose. In a few cases, people had got older versions of these bundled with their scanners.

  Clara OCR <>
  Gocr <>

These are Free Software packages. Some people who responded to the survey had tried them, but nobody had actually used them to produce a text.


  DocMorph -- a free, web-based OCR <>

This one is interesting--you can just submit your image through a web page, and the service will return OCRed text. However, the process of submission, waiting for your text, and then cutting and pasting into your document is slow.

Other volunteers use various OCR software that came bundled with their scanner.


2. Editing

The main answers, given by more than one person, were:

  AbiWord    <>
  Microsoft Word
  Windows WordPad
  WordPerfect

Other editors mentioned included:

  Crisp for Windows <>
  Editpad for Windows <>
  Editplus for Windows <>
  Foxpro 2.6 for DOS
  Metapad <>
  Windows Notepad

Programs recommended by Apple Macintosh users included:

  BBEdit Lite <>
  Microsoft Word
  Nisus Writer <>
  Text-Edit Plus <>
  TextSpresso <>
  Add/Strip <>


3. Checking and proofing

For spelling, most people just use the spellchecker built into their editor or word-processor. The *nix users running emacs or vi tended to use variants of the standard Unix spell command, such as ispell or aspell. Mac users have the free spelling checker Excalibur, available from <>.

Gutcheck <> was used for format checking, and a few people had written some checking procedures of their own.
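
As an illustration of the kind of home-made checking procedure mentioned above, here is a minimal sketch in Python. It is not Gutcheck and does not reproduce Gutcheck's rules; the function name check_text, the 75-character limit, and the particular checks are all assumptions chosen for the example.

```python
import re

# Assumed limit for this sketch; not a value taken from the survey.
MAX_LINE_LENGTH = 75


def check_text(text, max_len=MAX_LINE_LENGTH):
    """Return a list of (line_number, message) tuples for suspicious lines."""
    problems = []
    for number, line in enumerate(text.splitlines(), start=1):
        if len(line) > max_len:
            problems.append((number, "line longer than %d characters" % max_len))
        if line != line.rstrip():
            problems.append((number, "trailing whitespace"))
        if "\t" in line:
            problems.append((number, "tab character"))
        # A space directly before punctuation is a common OCR artifact.
        if re.search(r" [,;:!?]", line):
            problems.append((number, "space before punctuation"))
    return problems


if __name__ == "__main__":
    import sys

    # Usage: python check_text.py mybook.txt  (the file name is hypothetical)
    with open(sys.argv[1], encoding="utf-8", errors="replace") as handle:
        for number, message in check_text(handle.read()):
            print("%6d: %s" % (number, message))
```

A script like this only flags lines for a human to look at; it does not change the text.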


4. Working with HTML

In the survey, most volunteers preferred to handcraft their HTML using their normal editor. Those using a word processor edited the HTML as text, rather than composing a word processor file and then Saving As HTML. There was remarkable unanimity on this.

Specific HTML editors that were mentioned for occasional use were:

  Adobe PageMill (no longer available)
  Mozilla Composer <>
  HTMLKit <>
  HTMLPad <>

However, not all HTML work is about editing, and the following packages were honorably mentioned for other functions. Especially important is Tidy, which is pretty much necessary for all but the most experienced people for quick HTML checking. <> has the original, and links to versions of Tidy for Windows (Tidy-GUI) and just about all other platforms.

Converts Project Gutenberg texts to HTML and TeX.

HTMSTRIP by Bruce Guthrie:
MS-DOS. Converts HTML to text

Lynx (lynx --dump):
Converts HTML to text

Dave Raggett's HTML Tidy:
Checks HTML for correctness, reformats and fixes

W3C html2txt (web-based):
Converts HTML to plain text.

W3C Validator (web-based):
The Last Word on the correctness of HTML.

A very neat utility for getting web pages
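
Several of the tools above convert HTML to plain text. For readers who want to see roughly what such a conversion involves, here is a minimal sketch using only Python's standard library. It is not how Lynx, HTMSTRIP, or the W3C service work internally; the class name TextExtractor, the function html_to_text, and the choice of which tags start a new line are assumptions made for the example.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping the bodies of <script> and <style>."""

    # Tags assumed to start a new line in the output (a simplification).
    BLOCK_TAGS = {"p", "br", "div", "h1", "h2", "h3", "h4", "h5", "h6", "li", "tr"}
    SKIP_TAGS = {"script", "style"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP_TAGS:
            self.skip_depth += 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_endtag(self, tag):
        if tag in self.SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag in self.BLOCK_TAGS:
            self.parts.append("\n")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.parts.append(data)


def html_to_text(html):
    """Return the visible text of an HTML string, one block per line."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Collapse runs of whitespace and drop empty lines.
    lines = [" ".join(line.split()) for line in "".join(parser.parts).splitlines()]
    return "\n".join(line for line in lines if line)
```

The result is much cruder than what a real browser dump produces (no tables, links, or indentation), but it is enough for a quick look at the text inside an HTML file.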


5. Working with images

Images have two main uses in PG: as content within texts, such as illustrations in HTML, and as page images that have to be managed during scanning. Volunteers use the packages below for both purposes, and the typical use within PG is noted under each. "Advanced image processing" packages will let you edit and restore damaged images, but for PG work we mostly just need to manage, convert, resize and crop them.

ACDSee for Windows
For image reviewing

Adobe Photoshop
For advanced image processing

ImageMagick for *nix, Mac and Windows
Resizing and format conversion

IrfanView for Windows
Image viewing, conversion, cropping and resizing

The GIMP
For advanced image processing

Picture Publisher
For advanced image processing

VuePrint Pro
For viewing images
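
Resizing and format conversion are easy to script when there are many page images. The following is a minimal sketch that drives ImageMagick from Python. It assumes ImageMagick is installed and reachable under the command name "convert" (newer installations may use "magick" instead); the function names and the file names in the usage note are hypothetical.

```python
import subprocess


def resize_command(source, target, max_width, max_height, program="convert"):
    """Build an ImageMagick command that fits an image inside a bounding box.

    The trailing ">" in the geometry tells ImageMagick to shrink only images
    that are larger than the box. The output format follows the extension of
    the target file name.
    """
    geometry = "%dx%d>" % (max_width, max_height)
    return [program, source, "-resize", geometry, target]


def resize_image(source, target, max_width, max_height, program="convert"):
    """Run the command; raises CalledProcessError if ImageMagick fails."""
    # Passing a list (no shell) keeps the ">" from being treated as redirection.
    subprocess.run(
        resize_command(source, target, max_width, max_height, program),
        check=True,
    )
```

For example, resize_image("page001.tif", "page001.png", 800, 600) would convert a TIFF page image to PNG, scaled down to fit within 800 by 600 pixels while keeping its proportions.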