The Project Gutenberg FAQ - S-17

S.17. Why am I getting a lot of mistakes in my OCRed text?

If you're new to OCR, you may have arrived with the idea that OCR is almost perfect, and just makes a few mistakes now and then. No. It's slightly amazing that OCR works at all, and when it does, it isn't perfect.

You might reasonably expect to average anything up to 10 errors per page for typical PG work; if you're seeing more, then there is a problem with:

a) your printed book,
b) your scan, or
c) your OCR package.
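
If you want to put a number on your own results, one rough approach is to compare a stretch of raw OCR output against the same stretch after proofreading, and count the single-character changes needed to turn one into the other. The Python sketch below does this with a plain edit-distance calculation. The function name, and the idea of treating each changed character as one "error", are assumptions made for illustration, not a PG standard; the sample words are taken from the gocr output for Scan 1 further down.

```python
def edit_distance(ocr_text: str, correct_text: str) -> int:
    """Levenshtein distance: the minimum number of single-character
    insertions, deletions and substitutions that turn ocr_text
    into correct_text."""
    # previous[j] holds the distance between the characters of ocr_text
    # processed so far and the first j characters of correct_text.
    previous = list(range(len(correct_text) + 1))
    for i, ocr_char in enumerate(ocr_text, start=1):
        current = [i]
        for j, correct_char in enumerate(correct_text, start=1):
            cost = 0 if ocr_char == correct_char else 1
            current.append(min(previous[j] + 1,          # drop a character
                               current[j - 1] + 1,       # add a character
                               previous[j - 1] + cost))  # change a character
        previous = current
    return previous[-1]


# Illustrative pairs: (raw OCR word, proofread word).
pairs = [("Champtain", "Champlain"), ("_orth", "North"), ("brieRy", "briefly")]
for ocr_word, correct_word in pairs:
    print(ocr_word, "->", correct_word, ":", edit_distance(ocr_word, correct_word))
```

Run over a whole page, the total gives a figure you can set against the 10-errors-per-page rule of thumb, bearing in mind that one badly garbled word will count several times.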

Problems with the printed book fall into three categories: bad printing, age, and unusual fonts. Bad printing consists of problems like too much or too little ink on the press at the time the book was printed, and irregularities in the print where the metal type was damaged. Age causes yellowing--even browning--of the paper, and faded print. Unusual fonts may be hard for OCR to recognize, and very tightly-spaced print may make adjacent letters seem to touch, which confuses OCR software.

There are many ways for you to have problems with your scan. Obviously, if your scanner is defective or the glass is dirty, you will notice it immediately, but there are many mistakes you can make that will result in a poor-quality image, and cause later problems for your OCR.

You may not be able to control the quality of the paper you have to work with, but there is a lot you can do about the quality of your scan.

The two mistakes that people inexperienced with scanners most commonly make are not holding the spine down firmly enough to get a flat image of the page, and getting the brightness wrong, either by setting it incorrectly or by letting too much outside light get in. In your early scans, watch out for these problems.

First, if you haven't already, read the FAQ "How do I scan a book?" [S.7] and check that you're following the basic recommendations there.

Now let's look at some samples, and see the kinds of problems you might encounter.

A disclaimer about these samples: specific OCR packages are named, but you should not take this as a fair and comprehensive comparative review of the software. The object of this exercise is to show typical scanning conditions and problems, and the resulting OCR output. A single OCR package can vary quite a bit in its results: it may work better on some texts than others, and may improve with "training" or different settings. I have even seen the same OCR package produce different text from the same image with the same settings! Further, since OCR quality is improving rapidly, and packages leapfrog each other in quality, the next version of a particular brand may be vastly better than any of the software mentioned here. Of particular interest in this context is the leap in quality between OmniPage 10 and OmniPage 11.


 

Scan 1--A Perfect Scan

Scan1 is as near to a perfect scan as you can expect in PG work. It comes from "The Founder of New France" by Charles W. Colby. It is only a 300dpi image, but given the quality of the print and of the scan, 300dpi is all we need. Ironically, it comes from Gardner Buchanan, who complains about the age and infirmity of his scanner in his description of how he produces a text. The moral is that you don't have to have the latest equipment to get good results!

The actual scan is in the image file scan1-3.tif

It doesn't really need any comment, and all of the packages except gocr rendered it perfectly. Note the fake "space" before the semicolon--if you look closely at the image, you will see why the OCR packages mistook it for a full space, as discussed in the FAQ [V.104] "My book leaves a space before punctuation like semicolons, question marks, exclamation marks and quotes. Should I do the same?"

  Champlain was now definitely committed to
  the task of gaining for France a foothold in
  North America. This was to be his steady
  purpose, whether fortune frowned or smiled.
  At times circumstances seemed favourable ;
  at other times they were most disheartening.
  Hence, if we are to understand his life and
  character, we must consider, however briefly,
  the conditions under which he worked.


gocr 0.3.6 converted this as:

  Champtain was now definitely committed to
  the task of gaining for France a foothotd in
  _orth America.  This was to be his steady
  purpose, whether fortune frowned or smiled.
  At times circumstances seemed favourable .,
  at other times they were most disheartening.
  _ence, if we are to understand his life and
  character, we must consider, however brieRy,
  the conditions under which he worked.
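
If you decide to close up the fake spaces before punctuation noted above (see [V.104] on whether you should), a regular-expression substitution is one way to do it in bulk. This is only a sketch in Python: the function name and the choice of punctuation marks are assumptions for illustration, and it will also close up any such spaces that were intended, so check the result.

```python
import re

# One or more spaces or tabs sitting directly before ; : ! or ?
SPACE_BEFORE_PUNCTUATION = re.compile(r"[ \t]+([;:!?])")


def close_up_punctuation(text: str) -> str:
    """Remove spaces and tabs that come directly before ; : ! or ?"""
    return SPACE_BEFORE_PUNCTUATION.sub(r"\1", text)


line = "At times circumstances seemed favourable ;"
print(close_up_punctuation(line))
```

Line breaks are left alone, since the pattern only matches spaces and tabs.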



 

Scan 2--A Typical Scan

Scan2 is a paragraph from Baroness Orczy's "Castles in the Air". Notice the ink-splotch above the capital "I" in the first line, which will give our OCR some problems. The page is also unevenly inked elsewhere, and I have scanned it with the brightness level a bit too high.

I have made two separate scans, one at 300dpi and one at 400dpi, both Black and White, named scan2-3.tif and scan2-4.tif respectively. The page was cleanly cut, and carefully placed straight onto the scanner glass with the cover down. The original print is somewhere between the size of Times New Roman 10 and 11, with capital letters about 2.2 millimeters high, but better and more clearly spaced. These scans are fairly typical for PG work. Because of the relatively large letters, and the reasonable scan, there isn't much difference between the text produced from the 300dpi scan and the 400dpi scan.

I actually cut this book to get the pages out so that I could feed them through my ADF (automatic document feeder), but the paper is so thick and textured that the pages stick together and jam when feeding through. The thick, absorbent paper, combined with the uneven inking, means that, no matter how good the scan, any OCR has to contend with the irregular edges of letters, which are clearly visible even at 300dpi.

Here is the output for these scans from some OCR software packages. I changed just one thing: Abbyy recognized the em-dashes as such, and output them as a special character in Codepage 1252 for em-dashes, which isn't available in ASCII, so I converted that to the PG standard 2 dashes.
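
The em-dash conversion just described can be scripted if your OCR package saves its text in Codepage 1252. Here is a minimal Python sketch, assuming the text really is in that codepage (where the em-dash is the single byte 0x97); the function name is made up for this example.

```python
def cp1252_to_pg_ascii(raw: bytes) -> str:
    """Decode Codepage 1252 bytes and replace each em-dash with two hyphens."""
    text = raw.decode("cp1252")          # byte 0x97 becomes U+2014 (em-dash)
    text = text.replace("\u2014", "--")  # the PG two-dash convention
    # Raises UnicodeEncodeError if any other non-ASCII character is left,
    # so that nothing slips through unnoticed.
    text.encode("ascii")
    return text


sample = b"francs\x97a goodly sum in those days, Sir\x97was practically"
print(cp1252_to_pg_ascii(sample))
```

The ASCII check is deliberately strict: other Codepage 1252 characters, such as curly quotes, would need their own replacements before the text passes.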

Abbyy FineReader 6:

  Yes, indeed, I was on the track of M. Aristide Fournier,
  and of one of the most important hauls of enemy goods
  which had ever been made in France. Not only that. I
  had also before me one of the most brutish criminals it
  had ever been my misfortune to come across. A bully, a
  fiend of cruelty. In very truth my fertile brain %vas
  seething with plans for eventually laying that abominable
  ruffian by the heels: hanging would be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand
  francs--a goodly sum in those days, Sir--was practically
  assured me. But over and above mere lucre there was
  the certainty that in a few days' time I should see the
  light of gratitude shining out of a pair of lustrous blue
  eyes, and a winning smile chasing away the look of
  fear and of sorrow from the sweetest face I had seen for
  many a day.

  Yes, indeed, Twas on the track of M. Aristide Fournier,
  and of one of the most important hauls of enemy goods
  which had ever been made in France. Not only that. I
  had also before me one of the most brutish criminals it
  had ever been my misfortune to come across. A bully, a
  fiend of cruelty. In very truth my fertile brain was
  seething with plans for eventually laying that abominable
  ruffian by the heels: hanging would be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand
  francs--a goodly sum in those days, Sir--was practically
  assured me. But over and above mere lucre there was
  the certainty that in a few days' time I should see the
  light of gratitude shining out of a pair of lustrous blue
  eyes, and a winning smile chasing away the look of
  fear and of sorrow from the sweetest face I had seen for
  many a day.


gocr 0.3.6:

  __e_, indeed, f___as on_the track of h_. hristide Fournier,
  3nd of one of the most im__ant hau1s of enem)_ goods
  ___hich had e__er been made in France.  h?ot onl3_ that.  I
  had a1so before me one of the most brUtish crimnat_s it
  h__4 e___er been m31 misfortune to co_me acro__3.  A bu113_, a
  tiend o cruelt__.  In very truth m3_ fertiIe brain ___as
  s_e_1_::_g __-ith planS for e__entua113_ _ay:ng that abominab1e
  ru_iin b.__ t1_e hee1s . hanginig __ou1d be a n_erciful pun-
  i;__,i__gnt or such a miscreanf.  yes, in_i__ee3, fj_1e thou3and
  franc-a b_ood13_ sum in those days, _ir-_vas practica1l3_

  a3_ured me.  _ut o___er and above n_ere lucre there was
  the certaint_v that in a few_ da3_s' ti_e I shou1d see the
  lib_ht of gratitude shininb_ out of a pair _f _usLtrous btue
  e3_e3_, and a ___inning smi1e chasing a__ay the Ioo_ of
  _ear and of sorrow from the s__eetest iace T had Seen fof
  man)_ a day.

  Yes, indeed, f___as on the track of h__. Ariseide Fournier,
  and of one of the most important hau1s _f enemy goods
  ___hich had ever been made in France.  NoEUR on1y that.  I
  had also before me one of the most brutish crimina1s it
  h_ad ever been my misfo__tune to come acros__.  A bu11y, a
  fiend of crue1ty.  _n very truth my fertib brain _vas
  see3_:i_g __ith plans for e__entua11p 1aying _at abom_in_ ab1e
  ru_an by the heels. hanging _____ou1d _ a merciful pun-
  i_h_ment for such a miscreant.  Yes, indeed, five thou__and
  f_ancs-a b_ood1y sum in those days, _ir-_vas practica1ly
  a3ured me.  But over and above mere _ucre th.ere was
  th_e certainty that in a few days' ti_e _ shou1d see the
  1i__t of gratjtude shining out of a pair o_, _userous b1ue
   b                                .
  e__es, and a __inning smi1e chasing away the l_k of
  _,ear and of sorrow from the s___,eetest face _ _ad _.een _o_
  many a day.           .             .


Recognita Standard 3.2.7AK:

  ~'es, indeed, ~w-as on the track of ltT. Aristide Fournier,
  and of one of the most important hauls of enemy goods
  "=hich had ever been made in France. ~Tot only that. I
  ha~i also before me one of the most brutish criminals it
  had ever been my misfortune to come across. A bully-, a
  fiend of cruelty. In very truth my fertiIe brain was
  s; ething w-ith plans for eventually iaying that abominable
  ruffian by the heels : hanging ~-ould be a merciful pun-
  ishment for such a miscreant. ires, indeed, five thousand
  franes-a goodly sum in those days, Sir-was practically
  as~ured me. But over and above mere lucre there was
  thP certainty that in a few days' time I should see the
  light of gratitude shining out of a pair of lustrous btue
  eyes, and a winning smile chasing away the hk of
  fear and of sorrow from the sweetest face I had seen for
  many a day.

  Yes, indeed, l~was on the track of h~i. Aristide Fournier,
  and of one of the most important hauls of enemy goods
  w~hich had ever been made in France. lVot only that. I
  had also before mP one of the most brutish criminals it
  had ever been my misfortune to come acrass. A bully, a
  fiend of cruelty. In very truth my fertile brain was
  seething with plans for ez~entually laying that abomin_ able
  ruffian by the heels : hanging ~~.-ould be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand
  f:ancs-a goodly sum in those days, Sir-was practically
  assured me. But over and above mere lucre there was
  the certainty that in a few days' time I should~ see the
  Iight of gratitude shining out of a pair of iEustrous blue
  eyes, and a w inning smile chasing away the Iook of
  fear and of sorrow from the s"-eetest face ~ had seen ~'or
  rr~any a day.


OmniPage Pro 10:

      Yes, indeed, twas on the track of 11T. Aristide Fournier,
  and of one of the most important hauls of enemy goods
  which had ever been made in France. Not only that. I
  ha(i also before me one of the most brutish criminals it
  had ever been my misfortune to come across. A bully, a
  fiend of cruelty. In very truth my fertile brain was
  seething with plans for eventually laying that abominable 
  ruffian by the heels: hanging would be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand
  francs-a goodly sum in those days, Sir-was practically
  assured me. But over and above mere lucre there was
  the certainty that in a few days' time I should see the 
  light of gratitude shining out of a pair of lustrous blue
  eyes, and a winning smile chasing away the look of
  fear and of sorrow from the sweetest face I had seen for 
  many a day.

      Yes, indeed, fwas on the track of h-I. Aristide Fournier,
  and of one of the most important hauls of enemy goods
  which had ever been made in France. Not only that. I
  had also before me one of the most brutish criminals it
  had ever been my misfortune to come across. A bully, a
  fiend of cruelty. In very truth my fertile brain was
  seething with plans for eventually laying that abominable
  ruffian by the heels: hanging would be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand
  francs-a goodly sum in those days, Sir-was practically
  assured me. But over and above mere lucre there was
  the certainty that in a few days' time I should see the
  light of gratitude shining out of a pair of lustrous blue
  eyes, and a winning smile chasing away the look of
  fear and of sorrow from the sweetest face I had seen for
  many a day.


OmniPage Pro 11:

  Yes, indeed, twas on the track of AT. Aristide Fournier, 
  and of one of the most important hauls of enemy goods 
  which had ever been made in France. Not only that. I 
  had also before me one of the most brutish criminals it 
  had ever been my misfortune to come across. A bully, a 
  fiend of cruelty. In very truth my fertile brain was 
  seething with plans for eventually laying that abominable 
  ruffian by the heels: hanging would be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand 
  francs-a goodly sum in those days, Sir-was practically 
  assured me. But over and above mere lucre there was 
  the certainty that in a few days' time I should see the 
  light of gratitude shining out of a pair of lustrous blue 
  eyes, and a winning smile chasing away the look of 
  fear and of sorrow from the sweetest face I had seen for 
  many a day.

  Yes, indeed, fwas on the track of h-I. Aristide Fournier, 
  and of one of the most important hauls of enemy goods 
  which had ever been made in France. Not only that. I 
  had also before me one of the most brutish criminals it 
  had ever been my misfortune to come across. A bully, a 
  fiend of cruelty. In very truth my fertile brain was 
  seething with plans for eventually laying that abominable 
  ruffian by the heels: hanging would be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand 
  francs-a goodly sum in those days, Sir-was practically 
  assured me. But over and above mere lucre there was 
  the certainty that in a few days' time I should see the 
  light of gratitude shining out of a pair of lustrous blue 
  eyes, and a winning smile chasing away the look of 
  fear and of sorrow from the sweetest face I had seen for 
  many a day.


TextBridge Millennium Pro:

  Yes, indeed, rwas on the track of M. Aristide Fournier,
  and of one of the most important hauls of enemy goods
  which had ever been made in France. Not only that. I
  hail also before me one of the most brutish criminals it
  had ever been my misfortune to come across. A bully, a
  fiend of cruelty. In very truth my fertile brain was
  seething with plans for eventually laying that abominable
  ruffian by the heels: hanging would be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand
  francs-a goodly sum in those days, Sir-was practically
  assured me. But over and above mere lucre there was
  the certainty that in a few days' time I should see the
  light of gratitude shining out of a pair of lustrous blue
  eyes, and a winning smile chasing away the look of
  fear and of sorrow from the sweetest face I had seen for
  many a day.                   -  - -

   Yes, indeed, f was on the track of M. Aristide Fournier,
  and of one of the most important hauls of enemy goods
  which had ever been made in France. Not only that. I
  had also before me one of the most brutish criminals it
  had ever been my misfortune to come across. A bully, a
  fiend of cruelty. In very truth my fertile brain was
  seething with plans for eventually laying that abominable
  ruffian by the heels: hanging would be a merciful pun-
  ishment for such a miscreant. Yes, indeed, five thousand
  francs-a goodly sum in those days, Sir-was practically
  assured me. But over and above mere lucre there was
  the certainty that in a few days' time I should see the
  light of gratitude shining out of a pair of lustrous blue
  eyes, and a winning smile chasing away the look of
  fear and of sorrow from the sweetest face I had seen for
  manyaday.                          -



 

Scan 3--Guttering and Smaller Print

Scan3 is a paragraph from "The Egoist" by George Meredith. It was scanned in a dim room, with the scanner cover open and the book held open, flat against the scanner glass. However, the spine was not pressed firmly enough against the glass, and as a result you can see that the words on the left-hand edge (which were near the spine) appear to be slanted, a bit distorted, and not well lit. This problem is familiar to people who scan for PG--everybody gets distracted sometimes, and fails to keep enough pressure on the spine. As you see from the results below, it caused problems for all of the OCR packages on the words affected. If you find this kind of "guttering" regularly in your own scans, where the characters near the spine are not being recognized correctly by your OCR, you need to make sure that your book is down as flat as possible before making a scan. Because of the smaller size and the guttering problem, the 400dpi scan made for better quality text in this case.

Here's the output from the sample OCR:

Abbyy FineReader 6:

  NEITHER Clara nor Vernon appeared at the mid-day table,
  n Middleton talked with Miss Dale on classical matters,
  like a good-natured giant giving a child the jump from
  stone to stone across a brawling mountain ford, so that an
  uncdified audience might really suppose, upon seeing her
  over the difficulty, she had done something for herself. Sir
  \Villoughby was proud of her, and therefore anxious to
  soltlo her business while he was in the humour to lose her.
  He hoped to finish it by shooting a word or two at Vernon
  before dinner. Clara's petition to be set free, released from
  him, had vaguely frightened even more than it offended hia
  nrido.

  NEITHER Clara nor Vernon appeared at the mid-day table.
  Dr. Middleton talked with Miss Bale on classical matters,
  like a good-natured giant giving a child the jump from
  stone to stone across a brawling mountain ford, so that an
  unedified audience might really suppose, upon seeing her
  over the difficulty, she had done something for herself. Sir
  "VVilloughby was proud of her, and therefore anxious to
  settle her business while he was in the humour to lose her.
  He hoped to finish it by shooting a word or two at Vernon
  before dinner. Clara's petition to be set free, released from
  him, had vaguely frightened even more than it offended his
  pride.


gocr 0.3.6:

  __,,,____,_ Cl,_I._c nor Vernon a__e_Ped _t tl_le _id_da_ tab1e_
  _, _ii_(__etoiI f,,_lk(;cl with _MiSs _ale _U_1d_ abS8iG_l I_i_t_t_l.__
  i,_i,;,_ .,, _(_u_-i,L_t_ii.e(l 6iiLIblt 6'7_V. ill_ _ C 'll .  tf e__Ul__b rU_l
  gt(),ii_, tu _fj(),I(, ,_uruSS.,__ T__ Illl_ g UlOUUt_lU  o_ _ 8O .t _' t_ail
  u,,_,_ifj(;il ;,_i((ic,IGG l_i_' lt re_ y 8UE)_OB_'_ U_Oll 8eelll6  lttr
  _,__i. t_ic (li__icu1ty, SIIe t1_d iluI_e 8ol_eth_ng_ fo_ be_.Self.  _i__
  _ji___()_i___lIl)y w,,s prui_il of heT_ and k__eTefope an_iouS  to
  _(_(.__u l___i. i)i__, ii,ess wIlile he Wa8 in the hU_ouT to luse Iier_
  j__ l_()_)(_(l t() tiiIish it b_ ShOOtiltg a WOTd o__ t_O &t Verno_
  _o__(),__ (li,_iIci._  Cl__T_'S _eti_tio_ tO be Set fTee_.Te1ea8ecl fro_
  )ii))),, lIL_Ll v_b__uely f_.ighteUe  eVen _OTe kba_ lt OfEe_ded hi_
  pi_i..(l_u- .  _  ,  ,  --.___ _ _,- - -__-


   ________ Cl__i.a nop Vernon appeared &t t'h_e _id_day t__le_
  D_. _id(lle_oi_ t_lked with Miss _ale ,on _ _Ssi__l __i tt_r_'_
  iij_e _ 6ood-n___tLi_.ed 6iai_t 6_i_ing & Ghild the ___np _'_.on_
  _tune to _tone aGro_S a braWlin( __ inOU__tai _foPd_ So t2_at a__
  u__p,(_ified ___idiei_Ge _ni62it real y 8uppO.8e_ upon _seeii_6 l_e_
  o______ the difhculty_ she had done _o_neth_n6 fop ber_elf_  _i_
  _viljoli____k)y w__s proud of heT, and the_efo_e an_iouS to
  ___.tle li__i. i)u__inesS Whike he W_S _ the hum'ou_ to_ lose her_
  __e l_op(_d to finish it by 8hooting a wopd o_ tWo ak Verno__ _
  _eforR_ _(in_icr_  Clara's petition to _ Set _free, releaSed fro_
  )ii__, h_d va6uely frigbte_ed eve_ _ore tban it o_e_ded hiD
  pi.icle.  -.  -  -   -  -  - '


Recognita Standard 3.2.7AK:

  ~rFr~rrmx Clara nor Vernon apneared at the mid-da~'table.
  Dr. bLidrlleton talkc;d wi.th Miss Dale vn elassieal matters,
  like a ~n~a-mZtured giant gi.ving a child th jucnp frvm
  stonc to stone across a brawling mounta,in ford, so that au
  uiicilificd .ruciicucc mil;ht really suppasc, upon seeixig hor
  n~er thc ciillicul.ty, she had clouo something for herself. Sir
  ~Villcm;;lrlry wvs proua of her, and therefors angiaus to
  sct.tla lrur tn~sincss while he was in the humoar to lose her.
  lle lu,hcot to iinish it by shooting a word ar two at Vernon
  bol'ore ~linncr. Clara's petition to bo set froe, released rom
  JGGnt., hvd vagucly frighteued even more than it offended hia
  ri~le.
  p

  NEITfi~R Clara nor Vernon appeareci at the xnid-day table.
  Dr. Middleton talked with Miss Dalo on classics,l rnatters',
  like a good-natured giant giving a child the jtimp from
  stone to stone across a brawling mountain ford, so that an
  unedified audience might really suppose, upon ~ seeing her
  over the difficulty, she had done something for herself. Sir
  yillon ;hby was proud of her, and therefore anxiotis to
  scttle luer business while he w~as in the hurxiour to lose her:
  He hoped to finish it by shooting a word or two at Vernon
  before dinner. Clara's petition to be set free, released from
  jcLm, had vaguely frighteued even more than it offended his
  pride.


OmniPage Pro 10:

      NF r~rn,Px Clara nor Vernon appeared at the mid-dap table.
  Dr. Middleton talked with Miss Dale on classical matter,
  like .t good-natured giant giving a child the jump from
  stone to stone across a brawling mountain ford, so that an
  uneVified audience might really suppose, upon seeing her
  over the difficulty, she had done something for herself. Sir
  jV;llo,r;;lrl>y was proud of her, and therefore anxious to
  set.tlo lror Uusiness while he was in the humour to lose her.
  Ile. lropcol to finish it by shooting a word or two at Vernon
  bol'ore dinner. Clara's petition to beset free, released from
  )zinc, had vaguely frightened even more than it offended his
  pride.

      NEITHER Clara nor Vernon appeared at the mid-day table.
  Dr. Middleton talked with Miss Bale on classical matters',
  like a good-natured giant giving a child the jump from
  stone to stone across a brawling mountain ford, so that an
  unedified audience might really suppose, upon ~ seeing her
  over the difficulty, she had done something for herself. Sir
  yillou ;hby was proud of her, and therefore anxious to
  settle her business while he was in the humour to lose her.
  He hoped to finish it by shooting a word or two at Vernon
  before dinner. Clam's petition to be set free, released from
  him, had vaguely frightened even more than it offended his
  pride.


OmniPage Pro 11:

  NF f,rnMR Clara nor Vernon appeared at the mid-day table. 
  Dr. Middleton talked with Miss Dale on classical matters, 
  like .t good-natared giant giving a child the jump from 
  stone to stone across a brawling mountain ford, so that an 
  une(lifie(l audience might really suppose, upon seeing her 
  over the difficulty, she had done something for herself. Sir 
  jVillon;hl)y was proud of her, and therefore anxious to 
  setale leer business while he was in the humour to lose her. 
  lle hoped to finish it by shooting a word or two at Vernon
  bofore dinner. Clara's petition to beset free, released from 
  )lint, had vaguely frightened even more than it offended his 
  pride.
  -.2 ..1_ - ____

  NEITHER Clara nor Vernon appeared at the mid-day table. 
  Dr. Middleton talked with Miss Dale on classical matters', 
  like a good-natured giant giving a child the jump from 
  stone to stone across a brawling mountain ford, so that an 
  unedified audience might really suppose, upon,seeing her 
  over the difficulty, she had done something for herself. Sir 
  Willoughby was proud of her, and therefore anxious to 
  settle her business while he was in the huniour to lose her. 
  Il"e hoped to finish it by shooting a word or two at Vernon 
  before dinner. Clara's petition to be set free, released from 
  hint, had vaguely frightened even more than it offended his 
  pride. - -


TextBridge Millennium Pro:

  NErr'!'~~ Clara nor Vernon appeared at the mid.day table.
  pr. ~1id(lIeto11 talked with Miss Dale on classical matters,
  like a good-natured giant giving a child the jump from
  stone to stone across a brawling mountain ford, so that au
  ~1edifi~ tLU(llCIlCC might really suppose, upon seeing h er
  over the (hjiheulty, she had done something for herself. Sir
  wiflouighby was proud of her, and therefore anxious to
  settle her business while he was in the humour to lose her.
  lie ho1)ed to finish it by shooting a word or two at Vernon
  before dinner. Clara's petition to be set free, released from
  him, had vaguely frightened even more than it offended his
  pr~t~.

   NEITHER Clara nor Vernon appeared at the mid-day table.
  Pr. Middleton talked with Miss Dale on classical matters,
  like a good-natured giant giving a child the jump from
  stone to stone across a brawling mountain ford, so that an
  une(lified audience might really suppose, upon - seeing her
  over the difficulty, she had done something for herself. Sir
  Willoughby was proud of her, and therefore anxious to
  settle hier l)uSifleSS while he was in the humour to lose her.
  lie hoped to finish it by shooting a word or two at Vernon
  before dinner. Clara's petition to be set free, released from
  hirn~, had vaguely frightened even more than it offended his
  pri(le.
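

Where you have two OCR passes of the same page, say from two resolutions or two packages, the places where they disagree are good candidates for a closer look. A rough Python sketch using the standard difflib module follows; the function name is hypothetical, and the sample lines are taken from the two Abbyy results above.

```python
import difflib


def disagreements(first: str, second: str) -> list[tuple[str, str]]:
    """Return (words from first, words from second) for every stretch
    where two OCR passes of the same text differ."""
    a, b = first.split(), second.split()
    matcher = difflib.SequenceMatcher(a=a, b=b, autojunk=False)
    return [(" ".join(a[i1:i2]), " ".join(b[j1:j2]))
            for tag, i1, i2, j1, j2 in matcher.get_opcodes()
            if tag != "equal"]


pass_one = "soltlo her business while he was in the humour to lose her."
pass_two = "settle her business while he was in the humour to lose her."
for left, right in disagreements(pass_one, pass_two):
    print(repr(left), "vs", repr(right))
```

A disagreement only tells you that at least one pass is wrong, not which one, and words that both passes get wrong in the same way will not show up at all.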



 

Scan 4--A Really Bad Case!

Scan4 is a paragraph from Pope's translation of Homer's "Odyssey". This is a very, very tough one. It was obviously a cheap printing to begin with, using thin, poor-quality paper in a page size of 6" by 4.5", with capital letters about 1.5 mm high, a little bigger than Times New Roman size 8. Text this small really needs a higher-resolution scan. The book was falling apart when I got it, the ink was fading and flaking, and there was no point in even thinking about trying to scan it flat, so I cut the pages. To add an extra challenge, I scanned the sample with the cover open in a medium-lit room for the 300 and 400dpi scans, but closed the cover for the 600dpi to show the best quality I could possibly get. (I was pleased to note that Abbyy, while recognizing the page in the 300dpi and 400dpi images, flashed up a suggestion that I should lower the brightness of the scan.)
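
To see why small print needs more resolution, it helps to work out how many pixels tall a capital letter ends up in the image. The Python sketch below does the arithmetic for the letter heights quoted for Scan 2 (2.2 mm) and Scan 4 (1.5 mm); the function name is an assumption for illustration, and the calculation ignores everything except size and resolution.

```python
MM_PER_INCH = 25.4


def height_in_pixels(height_mm: float, dpi: int) -> float:
    """Height in pixels of a feature height_mm tall when scanned at dpi."""
    return height_mm / MM_PER_INCH * dpi


for label, capital_mm in (("Scan 2", 2.2), ("Scan 4", 1.5)):
    for dpi in (300, 400, 600):
        pixels = height_in_pixels(capital_mm, dpi)
        print(f"{label}: {capital_mm} mm capitals at {dpi}dpi = about {pixels:.0f} pixels")
```

By this arithmetic, the 1.5 mm capitals of Scan 4 at 600dpi come out at about 35 pixels, roughly the same as the 2.2 mm capitals of Scan 2 at 400dpi, while at 300dpi they are only about 18 pixels tall.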

This particular book was one I sporadically tried to produce, without success, with an older scanner and a bundled OCR program over a period of two years, back in 1998/99. Eventually, in 2000, it was the first book processed through Charles Franks' Distributed Proofreaders site. The initial text produced by the OCR was very poor, but the human volunteers made up for it! Thanks, guys! Today, just two years later, with a better scanner and better OCR, I could have done it myself, as you will see from the best of the results of the 600dpi scans. That's how much things have improved recently.

A separate point to note here is that you can see the "three-quarter space" effect before the exclamation mark and semicolon that was discussed in [V.104].

The results of the OCR are:

Abbyy FineReader 6:

  " Ah me ! on what inhospitable coast,
  On Tvh.it new region is Ulysses toss'd ;
  Possess'd by wild barbarians fierce in arms ;
  Or men. whose bosom tender pity warms ?
  What sounds are these that gather from the shores ?
  The voice of nymphs that haunt the sylvan bowers,
  The fair-hair'd Pryads of the shady wood ;
  Or azure daughters of the silver flood ;
  Or human voir-e? but issuing1 from the shades,
  AVhv cease I straight to learn what sound invades?"

  " Ah me ! on what inhospitable coast,
  On what new region is Ulysses toss'd ;
  Possess'd by wild barbarians fierce in arms ;
  Or men, whose bosom tender pity warms '?
  "What sounds are these that gather from the shores ?
  The voice of nymphs that haunt the sylvan bowers,
  The fair-hair'd Dryads of the shady wood ;
  Or azure daughters of the silver flood ;
  Or human voice? but issuing from the shades,
  Why cease I straight to learn what sound invades?"

  " Ah me ! on what inhospitable coast,
  On what new region is Ulysses toss'd ;
  Possess'd by wild barbarians fierce in arms ;
  Or men, whose bosom tender pity warms ?
  "What sounds are these that gather from the shores ?
  The voice of nymphs that haunt the sylvan bowers,
  The fair-hair'd*Dryads of the slrady wood ;
  Or azure daughters of the silver flood ;
  Or human voice? but issuing from the shades,
  Why cease I straight to learn what sound invades?"


gocr 0.3.6:

[The 300 and 400 dpi scans produced nothing recognizable.
The result of the 600 dpi scan is below.]

    '' _hh i_3e ! o_1 ___l_at_ i__l__sl__ it_nble CoaSt_
  On ___l_,__ _)e_v i_e_io__ i__ ___ _._____ses toss'd ;
  _(3s3gs3_d l3.__ ___iii l3_3__b___i_c_i3_ fie_Ce in il__S- _
  Or i11pn, __-i)c3se l_osonl te_1de_ _it____ __ai_n3__ ?
  ___l_at __o__i1ds Qre tlipse tliat g__tl_p_r fE_oi33 the shoTes ?
  '_ilie __oi__e of i)____ E1)l3l3s tl3nT 1i_n__nt the s__l__inn bo_Ye_5_
  3'l_e fni___i____ir'd _____-ads of' il_e sli__d__ i___oOd _
  Op az(_pe da_____litc__s of _tlie sil __?r t1ood ;
  Or l___i31_nn ___)i___? l3__t i3____ii_6 fi_oi11 tlie __hiade__ _
  __'!3.__ _ea___e _ s_rai__li.t to l_ar_i1- i_--li__t so_nd- in__ad_S___''


Recognita Standard 3.2.7AK:

  .: lh nt"'. on w-hat inlu,;y:t, I,:e co;;~t,
  On ~cli^t ne~- re~ion i.. 1= 1-.-:.:e~ tm:'d ;
  Possea'd 1n- wil~l L;,rba~:c, .~ fierce in arm~ ;
  Or u.~u. w-Ln.e bossum tender pit~- warna'?
  ~l-u:lt .<,:~;;::;3s are tll~ce that ~atl:er from the shnre~ ?
  'I'l.e -;;o'.re :,; nwtthil: tW ,t l:aa;nt the s~-l:c 1llJOR'er5,
  'lhe :a,:~-h ~;r'd~It.wa~i~ ot' tl:e ~Il;;dv vood;
  Or az.lre dau~~l.ts~: oY tl:c :iv-~~r floo;:3 ;
  C?r humnn ~-<:i: e'? l,~:tt i~~; from tl:c ~had~~,
  11-lts- cea~e I ctrai rlit to learn ~s-l:, t socud incades %"


  " ~h me ! ou "-Mat iuMospita~le coast,
  On ~i-lmt ne~c reyion is L 1~-~ses to~s'd ;
  Pos:e;s'd 1"~ w-iMl lrvrbaria:ns fiet~ce in arms ;
  Or m~ n, "-hose hosom tender pit~- warm5 ?
  ~~~hat ~ounds are tlmse tMat ~;atMer from t:he shores ?
  ~t'I~e ~-oi~~e of n~-Inhhs t.hat liaunt the s~-l~~a n howers
  .
  Tlie fair-hnir'd D~ vads ot tl:e shad~- "-ood ;
  Or aznre dau~liters of tMe sil~-~r fiood ;
  Or lmman ~-oi:~e'? but iauin~ frotn the shades, a
  lVly cea.~e I straibht to learn "-Mat souud in~ads?"


  " Ah me ! on what inhospitable coast
  On ~~-hat new r e~ion is L;1 ~-sses toss'd ~
  ,
  Possess'd 1J~- "-ilil I:OII'uai'la ils fierce in arms_ 
  Or men, whose hosom tender pit~l ~varn~s ?
  ~'G'l~at somnds are these tliat ~atl~er from the shores ?
  ~I'Iie v oice of n~-mpl~S that ~munt the sy Ivan bowers,
  Tlie fair -hair'd D~~~-ads of tl~e slmdy wood ;
  Or azure daylltcrs of tlle silver flood ;
  Or lm:nan voice? uut issL~ing from the shades,
  ~~'lm cea~e I strai~ht to Iearn ~~-lmt so~nd inv ades ?"


OmniPage Pro 10:

  On "M.^t new reion is 1=1;-a:e~ to-s'd ;
  P"::e:~'d hw "ild Larba.:an~ fierce in arms ;
  Or inn. "-hnse bo.,om tender pity warms
  What <m-,n ds are thFSe that gather from the shores?
  '1-l.e vo_,e o2 u~vnhit: thm hn,,-,nt The sylvan bowers,
  The is ;r-ha;r'd h.-;-ads of the liz-Ay iNood
  Or azure dau_ht;- of tl:c o=1 cr flooj ;
  Or hnnmn wire? l,11t i--rii:g from the shadP3,
  Al-ly cease I straiAlit to learn what sound invades?"

      'Wh me ! on what inhospitable coast,
  On what new region is L fusses toss'd ;
  Possess'd br wild barbaric ns fierce in arms ;
  Or men, whose bosom tender pith- warms
  AN-hat sounds are these that gather from the shores ?
  The voice of nymphs that Haunt the sylvan bowers,
  The fair-hair'd IWvads of the shady -wood ;
  Or azure daughters of the silver flood ;
  Or human voice? bat iauina from the shades,
  Why cease I straight to learn what sound invades?"

      " Ah me! on what inhospitable coast,
  On what new region is Ll ysses toss'd ;
  Possess'd bv -wild barbarians fierce in arms ;
  Or men, whose bosom tender pity warnis ?
  AVlia sounds are these that gatller from the shores
  The voice of nYI11pliS that haunt the -sylvan bowers,
  The fair -hair'd D.-yads of the shady wood ;
  Or azure daughters of the silver flood ;
  Or human voice? lout issuing from the shades,
  Why cease I straight to learn what sound invades?"


OmniPage Pro 11:

  .` lh in-' on what inhospital,le co-st, 
  On xclznt near region is t 1:-sse~ toss'(: ; 
  Possess'd bY Mild barbarians fierce in aims ; 
  Or inn. whose boson tender pity warms
  What <m-,n ds are tlipse that gather from the shores ? 
  '_I-I.e 1-o=,- of nv:npii? that haunt the sylvan bowers, 
  She ra;r-ha;r'd 1):, ads of the shad- wood ;
  Or az.ire dau_lit~- of tl:e silo-:-r flood ;
  Or human voice? l,,tt i?snina from the shadpq, 
  Al-lry cease I straiAit to learn shat sound invades?"


  ''' :Ah me ! on what inhospitable coast, 
  On iyhat new region is Ulysses toss'd ; 
  Possess'd br wild barbarimis fierce in arms ; 
  Or men, whose bosom tender pity warms 
  AN-hat sounds are tliese that gather from the shores ? 
  The voice of nymphs that haunt the sylvan bowers, 
  The fair-hair'd D~ yads of the shady -wood
  ;
  Or azure dau.L-hters of the silver flood ;
  Or human voice? but issuing from the shades, 
  Why cease I straight to learn what sound invades?"


  " Ah me! on what inhospitable coast, 
  On what new region is Ulysses toss'd ; 
  Possess'd by -wild barbarians fierce in arms ; 
  Or n1en, whose bosom tender pity warnis ? 
  AVliat sounds are these that gather from the shores 
  The voice of nyniplis that haunt the sylvan bowers, 
  The fair-hair'd Dryads of the shady Wood ;
  Or azure daughters of the silver flood ;
  Or human voice? but issuing from the shades, 
  Why cease I straight to learn what sound invades?"


TextBridge Millennium Pro:

      no on what inhe~ptaEie coast,
  On what new realun is hivs,e' to5sd
  ,s~s -~d liv wild lie il)~m.ihI fir see in al-rn~
  Or u~,-n. w'linse bo,uuiu tender pity warnls
  Wl at ~ are t1ie~e that ~atler from the shores ?
  'n.e a oro of imvntpirs tint he~nt the sad van bowers,
  'flie tah'-ha~r'd D~vahs ct the shady wood
  1)1' az Ire dauul~t ~ of tl,e shvr flood
  Or liunian vi i 'I ? h'tt is- eng from the shades,
  \VIiv cea-~e I straight to learn w hat sound invades 1"


    Ah me on what inhospitable coast,
  On what new region is U vases toss'd
  Possess'd by wild barbarians fierce in arms
  Or men, whose bosom tender pity warms ~
  What sounds are these that gather from the shores?
  The voi'e of nymphs that haunt the sylvan bowers,
  The fair-baird Prvads of tl~e shady wood
  Or azure daughters of the silver flood
  Or human vuiae? but issuing fi'om the shades,
  Why cease I straigl~t to learn what sound invades?"


    Ah me on what inhospitable coast,
  On what new region is Ulysses toss'd
  Possess'd by wild barbarians fierce in arms
  Or men, whose bosom tender pity warms?
  What sounds are these that gather from the shores?
  rfhe voice of nymphs that haunt the sylvan bowers,
  The fair-hair'd Dtyads of the shady wood;
  Or azure daughters of 'the silver flood
  Or human voice? but issuing from the shades,
  Why cease I straigl~t to learn what sOund invades?"


What can we conclude from this?

Small mistakes in scanning, like letting too much light in, getting your scanner settings wrong for the page, or not pressing the paper flat enough, can make a major difference to the final quality of the text that you will have to correct.

Sometimes, no matter what you do with your scanner, problems with the paper or the print will make it difficult for your OCR package to give good output.

Generally, higher resolution is better within the range 300-600 dpi, but you only need the upper end of that range for more difficult material.
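
To get a feel for what the resolution setting means in practice, the following rough Python sketch works out the pixel dimensions of a scanned page. The function name and the 5 x 8 inch page size are assumptions made up for this illustration; substitute your own book's measurements.

```python
def scan_pixels(width_in: float, height_in: float, dpi: int) -> tuple[int, int]:
    """Return (width, height) in pixels for a page scanned at the given dpi."""
    return round(width_in * dpi), round(height_in * dpi)


# Hypothetical 5 x 8 inch page, not a measurement from any real book.
for dpi in (300, 400, 600):
    w, h = scan_pixels(5, 8, dpi)
    print(f"{dpi} dpi: {w} x {h} pixels ({w * h / 1e6:.1f} million)")
```

Doubling the resolution from 300 dpi to 600 dpi quadruples the number of pixels in the image, which is one reason to keep the higher settings for material that really needs them.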

Different OCR packages will produce widely differing texts from the same images. Given a really good image, most OCR software will work acceptably, but when you have lower quality material to work with, the gap between OCR packages shows clearly.
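
One way to put a number on that gap is to count the character-level differences between each OCR output and a proofread reference. The sketch below is only an illustration in Python, not a tool used for this FAQ: the function names are made up for the example, the reference line is assumed to be the correct reading of the printed text, and the two OCR lines are copied from the OmniPage Pro 10 and OmniPage Pro 11 samples above.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance: the fewest single-character insertions,
    deletions, and substitutions needed to turn a into b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def error_rate(reference: str, ocr: str) -> float:
    """Edit distance divided by the length of the reference text."""
    if not reference:
        raise ValueError("reference text must not be empty")
    return edit_distance(reference, ocr) / len(reference)


# Assumed correct reading of one line of the sample page.
reference = "Or azure daughters of the silver flood ;"

# OCR output for the same line, copied from the samples above.
samples = {
    "OmniPage Pro 10, first sample": "Or azure dau_ht;- of tl:c o=1 cr flooj ;",
    "OmniPage Pro 11, second sample": "Or azure dau.L-hters of the silver flood ;",
}

for label, text in samples.items():
    print(f"{label}: {edit_distance(reference, text)} edits, "
          f"error rate {error_rate(reference, text):.1%}")
```

Run over a whole proofread page rather than a single line, the same calculation gives a rough errors-per-page figure for comparing scans or OCR packages.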
