A scanned PDF is just images. Search returns nothing, screen readers see blank pages, and copy-paste produces empty strings. OCR adds a transparent text layer over each page so the document becomes searchable while looking identical to the original scan.
OCR Engine Comparison
| Engine | Type | Strengths | Best For |
|---|---|---|---|
| Tesseract 5 | Open source, local | 100+ languages, LSTM-based | Self-hosted batch pipelines |
| ABBYY FineReader | Commercial desktop | Layout retention, tables | Mixed-content business docs |
| Google Cloud Vision | Cloud API | Robust on noisy scans | Diverse, low-quality inputs |
| Azure Document Intelligence | Cloud API | Form & receipt extraction | Structured data capture |
| Apple Live Text | OS-level, local | Free, fast on Apple silicon | Casual macOS/iOS use |
Pre-Processing Matters More Than the Engine
A clean 300 DPI scan often beats the most expensive engine on a noisy 150 DPI image. Standard pre-processing steps:
- Deskew: rotate pages so text baselines are horizontal.
- Denoise: remove specks and JPEG artifacts.
- Binarize: convert to black-on-white using adaptive thresholding for uneven lighting.
- Despeckle and remove borders: drop scanner edges and dust.
- Set the correct DPI: Tesseract expects accurate DPI metadata; mismatched DPI lowers accuracy.
Building a Searchable PDF
- Scan or import at 300 DPI for body text, 600 DPI for small print or signatures.
- Pre-process: deskew, denoise, threshold.
- Run OCR with the correct language pack(s). Multi-language docs need each language enabled.
- Save as searchable PDF — image layer on top, invisible text layer below.
- Spot-check accuracy by searching for known words and copying paragraphs into a text editor.
- For archives, export as PDF/A-2u so the text layer is mandated and Unicode-mapped.
Common Failure Modes
- Wrong language: French text OCR'd in English produces garbage. Always set language explicitly.
- Two-column layouts read as one: use layout-aware OCR or split pages first.
- Handwriting: classical OCR collapses; use ICR (intelligent character recognition) or cloud APIs trained on handwriting.
- Tiny stamps and signatures: increase DPI for these regions or accept them as image-only.
Convert PDFs Back to Images
Export PDF pages as high-quality PNG or JPEG entirely in your browser.
PDF to Image →