The Project Gutenberg FAQ - S-16

S.16. What types of mistakes do OCR packages typically make?

Each text has its own peculiarities, but there are a number of well-known scanning errors you will be dealing with all the time.

Punctuation is always a problem. Periods, commas and semi-colons are often confused, as are colons and semi-colons. The e-text usually also contains a number of extra or missing spaces.
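
Spacing errors around punctuation are among the easier ones to hunt mechanically. What follows is only a minimal sketch, not a Project Gutenberg tool: the pattern list and the function name find_spacing_suspects are assumptions made for illustration, and the patterns will need tuning for each text (spaced-out ellipses, for instance, would be flagged too).

```python
import re

# Hypothetical patterns for this sketch; adjust them for each text.
SPACING_PATTERNS = [
    ("space before punctuation", re.compile(r"\s+[.,;:!?]")),
    ("missing space after punctuation", re.compile(r"[,;:][A-Za-z]")),
]

def find_spacing_suspects(line):
    """Return (description, column, matched text) for each suspicious spot."""
    suspects = []
    for description, pattern in SPACING_PATTERNS:
        for match in pattern.finditer(line):
            suspects.append((description, match.start(), match.group()))
    return suspects
```

The function only reports positions; whether the punctuation mark itself is the right one still has to be checked against the page image.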

The problem of quotes can assume nightmarish proportions in a text which contains a lot of dialog, particularly when single and double quotes are nested.
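
One rough first pass on quotes is to count the double quotes in each paragraph and look at any paragraph where the count is odd. This is a sketch under stated assumptions: paragraphs are separated by a blank line, the function name is made up for this example, and single quotes are ignored because apostrophes make them much harder to count. A speech that legitimately runs over several paragraphs will be flagged as well, so treat the result as a list of places to look, not a list of errors.

```python
def paragraphs_with_odd_double_quotes(text):
    """Return (paragraph number, paragraph) pairs with an odd count of '"'."""
    flagged = []
    # Assumption: paragraphs are separated by one blank line.
    paragraphs = [p for p in text.split("\n\n") if p.strip()]
    for number, paragraph in enumerate(paragraphs, start=1):
        if paragraph.count('"') % 2 == 1:
            flagged.append((number, paragraph))
    return flagged
```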

The numeral 1, the lower-case letter l, the exclamation mark ! and the capital I are routinely confused, and single or double quotes are often mistaken for one of them.
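
Many of these confusions leave a visible trace: a digit 1 touching a letter, a capital I in the middle of a lower-case word, or an exclamation mark between two letters. The sketch below flags such tokens. The regular expression and the function name are assumptions chosen for illustration, and the heuristic produces false positives on legitimate forms such as "1st" or "McIntyre".

```python
import re

# Assumed heuristic: a token is suspect if the digit 1 touches a letter,
# a capital I directly follows a lower-case letter, or "!" sits between letters.
CONFUSION = re.compile(r"[A-Za-z]1|1[A-Za-z]|[a-z]I|[A-Za-z]![A-Za-z]")

def find_1_l_I_suspects(line):
    """Return the whitespace-separated tokens that match the heuristic."""
    return [token for token in line.split() if CONFUSION.search(token)]
```

It cannot catch a lone I that should have been 1, or an l that happens to form another real word; those still need a human eye.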

Lower-case m is often mistaken for rn or ni.
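
When a word is not in your dictionary and contains rn or ni, it is worth testing whether putting an m there yields a known word. This is a minimal sketch: known_words stands for whatever lower-case word list you have to hand, and both it and the function name are assumptions, not part of any particular spell-checker.

```python
def m_confusion_candidates(word, known_words):
    """Suggest known words made by replacing one 'rn' or 'ni' with 'm'."""
    lower = word.lower()
    if lower in known_words:
        return []  # already a known word, so nothing to suggest
    candidates = set()
    for pair in ("rn", "ni"):
        start = lower.find(pair)
        while start != -1:
            # Replace just this one occurrence of the pair with "m".
            candidate = lower[:start] + "m" + lower[start + 2:]
            if candidate in known_words:
                candidates.add(candidate)
            start = lower.find(pair, start + 1)
    return sorted(candidates)
```

Given a word list containing "jimmy", the misread "jirnmy" produces the suggestion "jimmy". Words that are already in the list are skipped, so a genuine "corner" is left alone even if "comer" is also listed.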

The letter pairs h/b and e/c are commonly misread, and these are probably the hardest errors of all to catch, since ear/car, eat/cat, he/be, hear/bear and heard/beard are all common words that no spell-checker will flag as problems.
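
Since a spell-checker passes these words, one workaround is to list every occurrence of a known confusable word and check each one in context. The sketch below does that with the pairs named above. The pair list, the lookup table and the function name are illustrative assumptions, and the list should be extended for each text.

```python
import re

# Pairs taken from the examples above; extend the list for each text.
CONFUSABLE_PAIRS = [("ear", "car"), ("eat", "cat"), ("he", "be"),
                    ("hear", "bear"), ("heard", "beard")]

# Map each word to the word it is commonly confused with.
ALTERNATIVE = {}
for first, second in CONFUSABLE_PAIRS:
    ALTERNATIVE[first] = second
    ALTERNATIVE[second] = first

def flag_confusable_words(line):
    """Return (word, alternative) for each word a human should double-check."""
    words = re.findall(r"[A-Za-z]+", line)
    return [(w, ALTERNATIVE[w.lower()]) for w in words if w.lower() in ALTERNATIVE]
```

The function does not decide which reading is right. It flags every "he" and "be" in the book, correct or not, so it is best used on one pair at a time.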

For example:


  " Hello1' caIled jirnmy breczily.  11Anyone home ? "

  There seemed to he no-oneabout. Only tbe eat beard him."

should read:


  "Hello!" called Jimmy breezily, "Anyone home?"

  There seemed to be no-one about. Only the cat heard him.
