Google indexing scanned documents with OCR

Google has started using OCR to index scanned documents stored in Adobe's PDF format. “While we've indexed documents saved as PDFs for some time now, scanned documents are a lot more difficult for a computer to read. Scanning is the reverse of printing. Printing turns digital words into text on paper, while scanning makes a digital picture of the physical paper (and text) so you can store and view it on a computer. The scanned picture of the text is not quite the same as the original digital words, however -- it is a picture of the printed words. Often you can see telltale signs: the ring of a coffee cup, ink smudges, or even fold creases in the pages,” writes Evin Levey.

To see the new system at work, click on the following search query. Note the document excerpt in the search results, along with the full text presented after the 'View as HTML' link: repairing aluminum wiring.