a) Text is too dark, b) text is too light, c) too much noise
a) Text is slanted, b) text is distorted

The most important aspect is to be aware of the limitations of OCR technology. One cannot expect the recognized text to be free of minor mistakes or to preserve the original layout exactly. Slight imperfections in the text of the original can cause misrecognition, for example turning an "a" into "@" or an "&" into "8".

The quality of recognition depends on the quality of the original document and the size of the characters. Cleaner originals with larger type are recognized more easily and produce fewer mistakes. Paper originals are best suited to OCR when they have no handwritten marks, lines, corrections, etc. Any such marks or handwriting will be misinterpreted by the OCR software.

OCR software can recognize only basic text attributes and layout types. The following text attributes are supported:
Though font size and type style are supported, they are often recognized inaccurately. Note that OCR software does not recognize any symbols other than characters and numerals. This means that you cannot OCR complex layouts such as legal forms (see picture), multicolumn layouts with pictures, etc.
Layouts that are impossible to preserve in OCR process

Simple tables can be preserved in the final layout, but this may require additional desktop publishing (DTP) time and charges on top of the regular OCR price.
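
Because recognized text is rarely free of minor mistakes, a proofreading pass usually follows OCR. The sketch below is only an illustration of how the two confusions mentioned earlier (an "a" read as "@", an "&" read as "8") might be flagged for manual review. The function name `flag_suspects` and both heuristics are assumptions made for this example; they are not part of any particular OCR product and will not catch every error.

```python
import re

# Assumed heuristic: a token that looks like a full e-mail address is
# legitimate, so an "@" inside it is not treated as an OCR mistake.
EMAIL = re.compile(r"^[\w.+-]+@[\w-]+(\.[\w-]+)+$")


def flag_suspects(text):
    """Return (position, token, reason) tuples for tokens worth a manual look."""
    tokens = text.split()
    suspects = []
    for i, tok in enumerate(tokens):
        # Ignore surrounding punctuation when inspecting the token.
        word = tok.strip(".,;:!?\"'()")
        if "@" in word and not EMAIL.match(word):
            suspects.append((i, tok, 'possible "a" read as "@"'))
        elif (
            word == "8"
            and 0 < i < len(tokens) - 1
            and tokens[i - 1][:1].isupper()
            and tokens[i + 1][:1].isupper()
        ):
            # A lone "8" between two capitalized words, as in "Smith 8 Sons".
            suspects.append((i, tok, 'possible "&" read as "8"'))
    return suspects


if __name__ == "__main__":
    sample = "The c@t sat at Smith 8 Sons, mail info@example.com"
    for position, token, reason in flag_suspects(sample):
        print(position, token, reason)
```

A check like this only narrows down where to look; the flagged tokens still have to be compared against the original document by a person.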