How to OCR a PDF: Convert Scanned PDF to Searchable Text (2026)

June 2026 | 6 min read | OCR & Scanned PDFs

You scan a 20-page contract, a stack of old invoices, or a handwritten form — and the resulting PDF is essentially a photograph. You can look at it, but you can't search it, copy text from it, or translate it. Ctrl+F finds nothing. Copy-paste produces gibberish or nothing at all. You're locked out of the information in your own document.

OCR — Optical Character Recognition — is the technology that fixes this. It analyses the image of each page, recognises the characters, and adds a searchable text layer to the PDF. After OCR, you can search the document, copy sentences, use it with screen readers, translate it, or extract its tables. This guide shows you exactly how to do it for free.

What Is OCR and How Does It Work?

Optical Character Recognition is a technology that converts images of text into machine-readable text. When you scan a document, you capture a photograph — the characters are shapes drawn in ink on paper, not digital text. OCR software analyses those shapes, compares them against known character patterns, and outputs the text equivalent.

Modern OCR uses deep learning models trained on millions of document images in dozens of scripts and languages. PDFTash uses Tesseract OCR, one of the most widely used open-source OCR engines (originally developed at Hewlett-Packard, later sponsored by Google, and now maintained as a community open-source project), combined with image pre-processing that improves accuracy on low-quality scans, rotated pages, and low-contrast documents.

The output is a PDF with an invisible text layer behind the image. The page still looks like the original scan, but now the text is fully selectable, searchable, and copyable. This is called a "searchable PDF" or "PDF with OCR layer."
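To make the "invisible text layer" concrete: the PDF format lets text be drawn with text rendering mode 3, which neither fills nor strokes the glyphs, so the words exist for selection and search but nothing is painted over the scan. The sketch below builds such a content-stream fragment from recognised words and their positions. The function names, the F1 font resource name, and the word tuple layout are illustrative assumptions, not PDFTash's actual implementation.

```python
def escape_pdf_string(text: str) -> str:
    """Escape characters that are special inside a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def invisible_text_ops(words, font_name="F1"):
    """Build PDF content-stream operators that place words invisibly.

    `words` is a list of (text, x, y, height) tuples in PDF points, with
    (x, y) the baseline origin of each word. `font_name` must match a font
    resource declared on the page (assumed here, not created).
    """
    lines = ["BT", "3 Tr"]  # begin text; rendering mode 3 = invisible
    for text, x, y, height in words:
        lines.append(f"/{font_name} {height:g} Tf")      # font size ~ word height
        lines.append(f"1 0 0 1 {x:g} {y:g} Tm")          # absolute position
        lines.append(f"({escape_pdf_string(text)}) Tj")  # show the text
    lines.append("ET")  # end text
    return "\n".join(lines)


ops = invisible_text_ops([("Invoice", 72, 720, 12), ("(copy)", 130, 720, 12)])
print(ops)
```

This only covers text that fits in a simple literal string. Scripts such as Bengali or Arabic need a suitably encoded font, which a real OCR pipeline handles through a PDF library rather than by hand.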

When You Need OCR

Scanned physical documents: Contracts, certificates, medical records, bank statements, and academic transcripts that were physically scanned.
Image-only PDFs from older software: Files generated by software that saved pages as images rather than as text data (common in some government and legal systems).
Photographed documents: PDFs created from phone photos of documents.
Fax-converted PDFs: Documents received as fax often have no text layer.
Archives and research: Old books, newspapers, and historical documents digitised from microfilm or photography.

How to tell if your PDF needs OCR: Try to click and drag to select text on the page. If you can highlight individual words, the PDF already has a text layer and doesn't need OCR. If clicking selects the entire page like a photo, or if nothing highlights, OCR is needed.
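The same check can be automated when you have many files. The sketch below assumes you have already extracted the text of each page as a string, for example with a PDF library such as pypdf (`page.extract_text()`), and it simply asks whether enough pages contain real text. The function name and both thresholds are illustrative assumptions, not values used by PDFTash.

```python
def needs_ocr(page_texts, min_chars_per_page=20, min_text_ratio=0.5):
    """Guess whether a PDF needs OCR from the text extracted per page.

    `page_texts` holds one string per page (empty when nothing could be
    extracted). A page counts as "has text" when it contains at least
    `min_chars_per_page` non-whitespace characters. Both thresholds are
    arbitrary starting points.
    """
    if not page_texts:
        return True  # nothing extractable at all
    pages_with_text = sum(
        1 for text in page_texts
        if len("".join(text.split())) >= min_chars_per_page
    )
    return pages_with_text / len(page_texts) < min_text_ratio


print(needs_ocr(["", "", ""]))
print(needs_ocr(["This agreement is made between the parties named below."]))
```

A ratio is used rather than an all-or-nothing test because some files mix born-digital pages with scanned ones, such as a typed contract with a scanned signature page.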

Step-by-Step with PDFTash

Go to pdftash.com/ocr-pdf.
Upload your scanned PDF. Drag and drop or click to browse. The free plan supports files up to 10 MB.
Select the language of the text in your document from the dropdown. Choosing the correct language significantly improves accuracy.
Click Run OCR. PDFTash pre-processes each page (deskewing, denoising, contrast enhancement) before running character recognition. Processing time is typically 5–30 seconds depending on the number of pages and scan quality.
Review the output in the preview panel. The text layer is now active — you can see and select text on each page.
Download your OCR PDF. The file now has full searchable text, looks identical to the original scan, and is compatible with all PDF readers.

PDFTash leaves the original page images untouched. The only change is the added invisible text layer, which makes the document searchable without altering its appearance.
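For a sense of what contrast enhancement means in practice, here is a minimal sketch on a tiny grayscale image stored as nested lists. It is not PDFTash's pipeline, which the steps above name but do not describe. It shows only a linear contrast stretch followed by a fixed-threshold binarisation, and it leaves out deskewing and denoising. The function names and the threshold of 128 are assumptions.

```python
def stretch_contrast(pixels):
    """Linearly rescale grayscale values (0-255) to span the full range."""
    lo = min(min(row) for row in pixels)
    hi = max(max(row) for row in pixels)
    if hi == lo:
        return [row[:] for row in pixels]  # flat image: nothing to stretch
    scale = 255 / (hi - lo)
    return [[round((v - lo) * scale) for v in row] for row in pixels]


def binarize(pixels, threshold=128):
    """Map each pixel to black (0) or white (255) around a fixed threshold."""
    return [[255 if v >= threshold else 0 for v in row] for row in pixels]


# A faded 2x3 "scan": dark ink is only 100, paper is only 160.
faded = [[100, 160, 160],
         [160, 100, 160]]
stretched = stretch_contrast(faded)
print(stretched)            # ink becomes 0, paper becomes 255
print(binarize(stretched))
```

Real scans usually need an adaptive threshold, because lighting varies across the page.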

Supported Languages

PDFTash OCR supports the following languages with high accuracy:

Latin script: English, French, German, Spanish, Portuguese, Italian, Dutch, Polish, Romanian, Swedish, Norwegian, Danish
South Asian: Bengali (Bangla), Hindi (Devanagari), Tamil, Telugu, Gujarati, Punjabi, Urdu
Middle Eastern: Arabic, Persian (Farsi), Turkish
East Asian: Chinese (Simplified and Traditional), Japanese, Korean
Cyrillic: Russian, Ukrainian, Bulgarian, Serbian

Bengali OCR note: PDFTash has specific optimisation for Bengali script, which is challenging for OCR because vowel signs attach to consonants and consonants merge into conjunct forms. If you're working with Bangla scanned documents, PDFTash delivers superior accuracy compared to generic OCR tools. See the dedicated Bengali PDF OCR page for details.
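Because the engine named earlier is Tesseract, a language choice ultimately corresponds to one of Tesseract's language codes, and Tesseract accepts several codes joined with "+" for mixed-language documents. The sketch below maps a subset of the languages listed above to those codes. How PDFTash's dropdown maps to them internally is an assumption, and the dictionary and function names are illustrative.

```python
# Display name -> Tesseract language code (a subset of the languages above).
TESSERACT_CODES = {
    "English": "eng",
    "French": "fra",
    "German": "deu",
    "Spanish": "spa",
    "Bengali": "ben",
    "Hindi": "hin",
    "Arabic": "ara",
    "Chinese (Simplified)": "chi_sim",
    "Chinese (Traditional)": "chi_tra",
    "Japanese": "jpn",
    "Korean": "kor",
    "Russian": "rus",
}


def tesseract_lang_arg(languages):
    """Build the value for Tesseract's `-l` option from display names."""
    unknown = [name for name in languages if name not in TESSERACT_CODES]
    if unknown:
        raise ValueError(f"No code known for: {', '.join(unknown)}")
    return "+".join(TESSERACT_CODES[name] for name in languages)


print(tesseract_lang_arg(["Bengali", "English"]))  # ben+eng
```

Raising an error on an unknown name, rather than silently falling back to English, reflects the point made in this guide: the wrong language setting is a major source of poor results.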

Accuracy Tips

OCR accuracy depends heavily on scan quality. Here's how to get the best results:

Scan at 300 DPI minimum. At lower resolutions, each character is made up of too few pixels for reliable recognition. 300–600 DPI is the sweet spot for text documents.
Use black and white or grayscale scanning for text documents. Colour scanning increases file size without improving text recognition accuracy.
Ensure the document is flat and straight. Curved or angled pages reduce accuracy. PDFTash applies deskewing correction automatically, but severe angles (more than 15°) can still cause issues.
Good lighting, no shadows. If photographing a document with a phone, ensure even lighting with no shadow falling across the text.
Select the correct language. The single biggest accuracy improvement for non-English documents is selecting the correct source language in PDFTash.
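If you are unsure what resolution an existing scan has, you can work it out from the image's pixel dimensions and the page size. PDF page sizes are measured in points, with 72 points to an inch. The helper names below are illustrative, and the 300 DPI floor is the recommendation from the list above.

```python
def effective_dpi(width_px, height_px, page_width_pt, page_height_pt):
    """Return the lower of the horizontal and vertical scan resolutions.

    PDF page sizes are measured in points, with 72 points per inch.
    """
    dpi_x = width_px / (page_width_pt / 72)
    dpi_y = height_px / (page_height_pt / 72)
    return min(dpi_x, dpi_y)


def meets_minimum(dpi, minimum=300):
    """True when the scan reaches the recommended minimum resolution."""
    return dpi >= minimum


# US Letter page (612 x 792 pt = 8.5 x 11 in) scanned to 2550 x 3300 px.
dpi = effective_dpi(2550, 3300, 612, 792)
print(round(dpi), meets_minimum(dpi))
```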

What to Do After OCR

Once your PDF has an OCR text layer, a whole set of workflows becomes available:

Search: Use Ctrl+F (or Cmd+F) in any PDF reader to find words and phrases across all pages.
Copy text: Select and copy paragraphs to paste into Word, a spreadsheet, or an email.
Translate: Use PDFTash Translate to convert the entire document to another language — this only works on PDFs with a text layer.
Extract tables: Use PDFTash Table Extractor to pull data tables into CSV or Excel.
Summarise: Use PDFTash AI Summarise to get a concise summary of a long OCR-processed document.
Screen readers: Accessibility software can now read the document aloud to users with visual impairments.

Frequently Asked Questions

How accurate is PDFTash OCR?

For good-quality scans (300 DPI+, clean background, printed text), PDFTash achieves 97–99% character accuracy on English and major European languages. For handwritten text, heavily stylised fonts, or very low-quality scans, accuracy drops to 70–90%. Bengali, Arabic, and other complex scripts achieve 94–97% on well-scanned documents.
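The article does not say how these percentages are measured. One common definition of character accuracy compares the OCR output against a hand-checked reference text: count the minimum number of character insertions, deletions, and substitutions needed to turn one into the other (the Levenshtein distance), then divide by the reference length. The sketch below assumes that definition. The sample strings are made up for illustration.

```python
def edit_distance(reference: str, hypothesis: str) -> int:
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    previous = list(range(len(hypothesis) + 1))
    for i, ref_char in enumerate(reference, start=1):
        current = [i]
        for j, hyp_char in enumerate(hypothesis, start=1):
            cost = 0 if ref_char == hyp_char else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def character_accuracy(reference: str, hypothesis: str) -> float:
    """1 - (edit distance / reference length), floored at 0."""
    if not reference:
        return 1.0 if not hypothesis else 0.0
    errors = edit_distance(reference, hypothesis)
    return max(0.0, 1 - errors / len(reference))


truth = "Total amount due: 1,250.00"
ocr_output = "Total arnount due: 1,25O.00"
print(f"{character_accuracy(truth, ocr_output):.1%}")
```

The sample shows two classic OCR confusions: "m" read as "rn", and the digit zero read as the letter "O". At 97% character accuracy, a page of 2,000 characters would still contain around 60 errors, so proofread anything where exact figures matter.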

What languages does PDFTash OCR support?

Over 30 languages across Latin, Devanagari, Bengali, Arabic, Cyrillic, Chinese, Japanese, Korean, and other scripts. English, Bengali, Arabic, Hindi, Spanish, French, German, Russian, Chinese (Simplified and Traditional), Japanese, and Korean all receive specific model optimisation.

What is the difference between a scanned PDF and a digital PDF?

A digital PDF (sometimes called a "born-digital" PDF) is created directly from software — like exporting a Word document or saving a web page. Text is stored as actual character data. A scanned PDF is a photograph of a physical document; it contains image data only. You can tell the difference by trying to select text — digital PDFs allow text selection, scanned PDFs do not.

After OCR, can I edit the text in the PDF?

OCR adds a text layer that is searchable and copyable, but not directly editable within the PDF itself (that would require a full PDF editor). You can copy the recognised text and paste it into a word processor for editing. If you need an editable document, use PDFTash to extract the text layer first, then edit it in your preferred text editor.

Does PDFTash support Bengali OCR specifically?

Yes. PDFTash has specific Bengali (Bangla) OCR support with a model tuned for the Bangla script's characteristics: conjunct consonants, dependent vowel signs, and the distinctive top line (মাত্রা, matra) that many generic OCR tools mishandle. For dedicated Bengali OCR, visit pdftash.com/ocr-pdf-bengali.

