How Text Works in OCR’d and Scanned PDF files
I recently asked our support team which PDF-related concepts they regularly find themselves explaining to customers. It turns out one of the most common discussions revolves around scanned and OCR’d PDF files, and in particular how text content works in each. Let’s take a look.
What is a scanned PDF file?
This is not a trick question. A scanned PDF is simply a PDF created by optically scanning a paper document and turning it into a digitized replica of the original. The obvious benefit is that once digitized, it can be copied, shared and stored in multiple places.
The text in a simple scanned PDF, however, is not searchable or editable (with a PDF editor like Nitro Pro) the way you might expect, because the text is still part of the digitized image created during the scan.
What is OCR and what is an OCR’d PDF?
Optical character recognition (OCR) is the process of looking through one or more images, identifying patterns, and deducing from them where text-based content appears to be. The most common content to be OCR’d is paper-based material (documents, publications, books, etc.) that has been digitally scanned. You can also OCR electronic documents that already contain text, so that the text inside their embedded images becomes searchable too.
Once a PDF has been OCR’d, all (or almost all) of the text that was originally part of an image is ‘recognized’ and it becomes searchable and (at least somewhat) editable. For PDF files that have been OCR’d, there are two types of files you can encounter:
- Searchable images. In these PDF files, the appearance of the source file (normally a scanned file) is retained and looks the same (albeit a little grainier than the original, as it’s a low-quality scan), but an extra layer of text content has been added to the PDF. In the screenshot below, the grainy original image is still visible, yet the text can be searched (see how ‘download’ is highlighted). What’s really going on is that a transparent layer of text has been placed over the scanned image, positioned to line up with the text in the image.
- Converted and formatted text. The other type of OCR’d PDF you might encounter is one that has discarded the original scanned image altogether. Instead, all text that was recognized is converted to real text and formatted to look as close to the source file as possible. In the example below, you can see how much crisper and neater the text is. Editing text in this kind of OCR’d PDF is much easier than in the type above. The major trade-off you must accept is that it’s virtually impossible to get the content looking exactly the same as the source file, because matching fonts, layout, etc. is challenging. A searchable image, by contrast, retains the scanned image layer, so less information from the original is lost.
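To make the "transparent layer" in a searchable image a little more concrete, here is a minimal Python sketch of what the page content of such a file can look like. It is an illustration under stated assumptions, not how any particular OCR product writes its output: the resource names `/Im0` and `/F1`, the function names, and the word positions are all made up for the example. The one PDF detail it relies on is text rendering mode 3 (`3 Tr`), which draws text with neither fill nor stroke, so it is invisible on screen but still present for search and selection.

```python
def pdf_literal(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    escaped = text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")
    return f"({escaped})"


def searchable_image_stream(width, height, words, font_size=10):
    """Build page content: the scanned image first, then invisible text on top.

    words is an iterable of (text, x, y) tuples in PDF points, measured
    from the bottom-left corner of the page. /Im0 and /F1 are hypothetical
    resource names that the page's resource dictionary would have to define.
    """
    lines = [
        "q",                               # save graphics state
        f"{width} 0 0 {height} 0 0 cm",    # scale the unit image to the page
        "/Im0 Do",                         # paint the scanned image
        "Q",                               # restore graphics state
        "BT",                              # begin text object
        "3 Tr",                            # rendering mode 3: invisible text
        f"/F1 {font_size} Tf",             # select font and size
    ]
    for text, x, y in words:
        # Tm positions each recognized word over its place in the image.
        lines.append(f"1 0 0 1 {x} {y} Tm {pdf_literal(text)} Tj")
    lines.append("ET")                     # end text object
    return "\n".join(lines)


if __name__ == "__main__":
    print(searchable_image_stream(612, 792, [("download", 72, 700)]))
```

A "converted and formatted text" file would, roughly speaking, drop the image-painting lines and draw the text visibly instead, which is why it edits more easily but no longer looks exactly like the scan.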
How to tell if a PDF is a scan (and just an image) or has been OCR’d
OCR’d files are never 100% accurate, but these days they’re usually pretty close. This means that searching for a keyword you can see in your PDF is the quickest way to tell whether the file is just a scan or has been OCR’d.
In Adobe Reader/Acrobat or Nitro PDF Professional, press Ctrl+F, enter a keyword, and look for it to be highlighted on the page. If your search finds no words, it most probably indicates that the file is a scan that has not been OCR’d.
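If you have many files to check, the same test can be automated. The sketch below is one possible approach, not a definitive one: it assumes the third-party `pypdf` package is installed for the extraction step, and the function names, the `min_chars` threshold and the result labels are invented for illustration. Note that it can only tell you whether a page has a text layer. Like the Ctrl+F test, it cannot distinguish an OCR’d scan from a PDF that was created digitally with real text.

```python
from typing import Iterable, List


def extract_page_texts(path: str) -> List[str]:
    """Return the extractable text of each page (requires pypdf)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


def classify_pdf(page_texts: Iterable[str], min_chars: int = 20) -> str:
    """Classify a document from the text found on each page.

    min_chars is an arbitrary threshold so that a stray character or a
    page number does not count as a real text layer.
    """
    texts = list(page_texts)
    with_text = sum(1 for t in texts if len(t.strip()) >= min_chars)
    if with_text == 0:
        return "image-only"   # probably a plain scan with no OCR
    if with_text == len(texts):
        return "has-text"     # OCR'd, or created with real text
    return "mixed"            # some pages searchable, some not
```

For a file on disk you would call something like `classify_pdf(extract_page_texts("scan.pdf"))`, where `scan.pdf` is a placeholder name. Keeping the classification separate from the extraction means you can swap in a different PDF library without touching the decision logic.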
The photo above was taken by npslibrarian and shared under a Creative Commons license.