Originals for OCR

When you look at printed text, your eyes perceive a picture: a pattern of black and white areas on the page. Your brain then identifies these patterns as characters and symbols that carry meaning. A scanner and a computer running OCR software work in a similar way (OCR stands for Optical Character Recognition). The scanner acts like the human eye, providing a picture of the printed page, while the computer acts like the human brain, interpreting that picture. The scanned picture cannot be edited in a word processor because it is an image, not text. The OCR software identifies the characters and symbols in the picture and sends the recognized text to a word processor.

In terms of quality, the requirements for OCR originals are the same as for other scannable items. A few additional requirements will improve the quality of recognition:

  • Text on the originals should not be rotated: lines should run strictly horizontally, parallel to the top edge of the paper. If this requirement is not met, we will have to straighten the scanned images ourselves, which may involve additional desktop publishing (DTP) time.
  • Text on the original should not be distorted. This problem is impossible to fix, and it has an extremely strong effect on OCR quality.
  • Textured or colored paper may cause quality problems, as may text printed in color.
  • Avoid submitting copies of the original documents for OCR. If you do, we won't be able to fine-tune the contrast during scanning, which may lead to a substantial decrease in quality.
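To illustrate the first requirement, the tilt of a text line can be expressed as an angle. The following is a minimal Python sketch, not the procedure actually used during scanning. It assumes you have measured two points on the baseline of one text line in pixel coordinates (x to the right, y downward). The function names and the 0.5 degree tolerance are hypothetical choices made for illustration.

```python
import math


def skew_angle_degrees(x1, y1, x2, y2):
    """Angle of the line through (x1, y1) and (x2, y2), in degrees.

    With y growing downward, a positive result means the text line
    drops toward the right; 0 means it is perfectly horizontal.
    """
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def needs_straightening(x1, y1, x2, y2, tolerance=0.5):
    """True if the line is tilted by more than `tolerance` degrees.

    The default tolerance is an assumed value, not a published limit.
    """
    return abs(skew_angle_degrees(x1, y1, x2, y2)) > tolerance
```

For example, a baseline that drops 5 pixels over a width of 100 pixels is tilted by a little under 3 degrees, so `needs_straightening(0, 0, 100, 5)` reports that the page should be straightened.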

The most common problems that affect OCR quality are shown in the pictures below:

a) Text is too dark, b) text is too light, c) too much noise

a) Text is slanted, b) text is distorted
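The "too dark" and "too light" cases can be roughly approximated by counting how much of the page is dark. The sketch below is only an illustration under stated assumptions: the scan is given as a grid of grayscale values (0 = black, 255 = white), and the threshold of 128 and the 2% and 30% limits are hypothetical values, not settings taken from any scanner or OCR package.

```python
def dark_fraction(pixels, threshold=128):
    """Share of pixels darker than `threshold` in a grayscale grid."""
    total = sum(len(row) for row in pixels)
    if total == 0:
        raise ValueError("image has no pixels")
    dark = sum(1 for row in pixels for value in row if value < threshold)
    return dark / total


def classify_scan(pixels, low=0.02, high=0.30):
    """Label a scan by its dark-pixel share (limits are assumed values)."""
    share = dark_fraction(pixels)
    if share < low:
        return "too light"   # almost no ink was captured
    if share > high:
        return "too dark"    # background is bleeding into the text
    return "ok"
```

A check like this says nothing about noise, slant, or distortion; it only flags scans whose overall contrast is clearly off.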

The most important thing is to be aware of the limitations of OCR technology. One cannot expect the recognized text to be free of minor mistakes or to preserve the original layout exactly. Slight imperfections in the text on the original may, for example, turn an "a" into "@" or an "&" into "8".
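Mistakes of this kind can sometimes be spotted automatically when proofreading the recognized text. The following Python sketch is one possible heuristic, not a feature of any particular OCR product: the function name and the rules (an "@" outside an e-mail-shaped word, or letters mixed with digits) are assumptions chosen for illustration.

```python
import re

# Rough shape of an e-mail address; such words are left alone.
EMAIL_LIKE = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def suspicious_tokens(text):
    """Return words that look like OCR character confusions."""
    flagged = []
    for token in text.split():
        word = token.strip(".,;:!?\"'()")
        if not word or EMAIL_LIKE.match(word):
            continue
        has_letter = any(c.isalpha() for c in word)
        has_digit = any(c.isdigit() for c in word)
        # "@" inside an ordinary word, or letters mixed with digits.
        if "@" in word or (has_letter and has_digit):
            flagged.append(word)
    return flagged
```

A heuristic like this only points a proofreader at likely trouble spots. It cannot catch every case: a standalone "8" that replaced an "&" looks like an ordinary number and passes unnoticed.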

The quality of recognition depends on the quality of the original document and the size of the characters. Cleaner originals with larger type are recognized more easily and produce fewer mistakes. For OCR, paper originals should preferably be free of handwritten marks, lines, corrections, etc. Any such marks or handwriting will be misinterpreted by the OCR software.

OCR software can recognize only basic text attributes and layout types. The following text attributes are supported:

  • Character size
  • Font type (serif/sans serif/monospaced)
  • Type style (regular/bold/italic/bold italic)
  • Size of page margins
  • Left and right paragraph indents
  • Columns

Though font size and type style are supported, they are often recognized inaccurately. Note that OCR software does not recognize any symbols other than characters and numerals. This means that you cannot OCR complex layouts such as legal forms (see picture), multicolumn layouts with pictures, etc.

Layouts that are impossible to preserve in the OCR process

Simple tables can be preserved in the final layout, but this may require additional desktop publishing (DTP) time and charges on top of the regular OCR price.

© Alex & Sandy Tayts (graphics, text, design, HTML coding, scripts), 2003    © Copy Station Inc., 2003