@d0paminedriven/pdfdown-ocr
v0.9.7
Rust powered PDF extraction for Node with OCR fallback (requires system tesseract).
Rust-powered PDF extraction for Node.js with Tesseract OCR fallback for image-only pages. A superset of @d0paminedriven/pdfdown -- includes all base extraction APIs (text, images, annotations, structured text, metadata) plus OCR.
System requirement: Tesseract 5.x must be installed on the host.
Install
npm install @d0paminedriven/pdfdown-ocr
Tesseract setup
# Ubuntu/Debian (22.04 ships a pre-5.x release -- use the PPA for 5.x)
sudo add-apt-repository ppa:alex-p/tesseract-ocr5
sudo apt update
sudo apt install tesseract-ocr tesseract-ocr-eng -y
# Optional: all language packs
# sudo apt install tesseract-ocr-all
# macOS
brew install tesseract
# Arch
sudo pacman -S tesseract tesseract-data-eng
Verify with tesseract --version -- you should see 5.x.
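The version check can also be automated at application startup. The sketch below is a hypothetical helper, not part of this package: `parseTesseractMajor` and `isSupportedTesseract` are illustrative names, and the regex assumes the first line of the output looks like `tesseract 5.3.0` (optionally with a leading `v` before the number, as some builds print).

```typescript
// Hypothetical helpers (not exported by @d0paminedriven/pdfdown-ocr).
// Assumption: `tesseract --version` prints a line such as "tesseract 5.3.0".

// Extract the major version number, or null if no version line is found.
function parseTesseractMajor(versionOutput: string): number | null {
  const match = /^tesseract\s+v?(\d+)\./m.exec(versionOutput);
  return match ? Number(match[1]) : null;
}

// The README requires Tesseract 5.x, so only major version 5 is accepted here.
function isSupportedTesseract(versionOutput: string): boolean {
  return parseTesseractMajor(versionOutput) === 5;
}
```

To use it, pass in the captured output of `tesseract --version`, for example from `execFileSync('tesseract', ['--version'], { encoding: 'utf8' })` in `node:child_process`. Some older builds may print the version to stderr instead of stdout, so capturing both is the safer choice.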
Tessdata auto-detection
The package automatically detects the tessdata directory at runtime by parsing the output of tesseract --list-langs. The detected path is cached for the lifetime of the process using a OnceLock<Option<String>> -- no global environment mutation, fully thread-safe.
Resolution order:
- TESSDATA_PREFIX environment variable (if set, used as-is -- no auto-detection runs)
- Auto-detection via tesseract --list-langs (parses the path from List of available languages in "/path/to/tessdata/")
- Tesseract's compiled-in default (if neither of the above yields a path)
Most users will not need to set TESSDATA_PREFIX at all. The auto-detection handles standard installations on Ubuntu (/usr/share/tesseract-ocr/5/tessdata/), macOS Homebrew (/opt/homebrew/share/tessdata/), Arch, and any other layout where tesseract is on PATH.
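The resolution order can be illustrated with a small TypeScript sketch. This is not the package's code (the real logic lives in the Rust layer); `parseTessdataPath` and `resolveTessdata` are hypothetical names, and treating an empty `TESSDATA_PREFIX` as unset is an assumption made here for the example.

```typescript
// Illustrative only: mirrors the documented resolution order, not the Rust source.

// Pull the quoted directory out of the `tesseract --list-langs` header line,
// e.g. List of available languages in "/path/to/tessdata/" ...
function parseTessdataPath(listLangsOutput: string): string | null {
  const match = /List of available languages in "([^"]+)"/.exec(listLangsOutput);
  return match?.[1] ?? null;
}

// Returns the tessdata directory to use, or null to signal
// "fall back to Tesseract's compiled-in default".
function resolveTessdata(
  envPrefix: string | undefined,
  listLangsOutput: string | null,
): string | null {
  // First: an explicit TESSDATA_PREFIX is used as-is (assumption: "" counts as unset).
  if (envPrefix !== undefined && envPrefix !== '') return envPrefix;
  // Second: the path reported by `tesseract --list-langs`, if it could be run and parsed.
  if (listLangsOutput !== null) {
    const detected = parseTessdataPath(listLangsOutput);
    if (detected !== null) return detected;
  }
  // Third: nothing found, so Tesseract's own default applies.
  return null;
}
```

Keeping the parser separate from the resolver makes the override case easy to reason about: when the environment variable is set, the `--list-langs` output is never consulted.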
Set TESSDATA_PREFIX explicitly only if:
- Tesseract is not on PATH but the tessdata directory exists elsewhere
- You want to override the detected path (e.g., pointing to a custom-trained data directory)
# Override example (not usually needed)
export TESSDATA_PREFIX="/opt/custom/tessdata"
API
This package exports everything from @d0paminedriven/pdfdown (text, images, annotations, structured text, metadata -- both sync and async), plus the OCR-specific APIs below. See the base package docs for the full base API.
OCR standalone functions
// Per-page OCR text extraction
export declare function extractTextWithOcrPerPage(
  buffer: Buffer,
  opts?: OcrOptions,
): Array<OcrPageText>
export declare function extractTextWithOcrPerPageAsync(
  buffer: Buffer,
  opts?: OcrOptions,
): Promise<Array<OcrPageText>>
// Full document extraction with OCR text fallback
export declare function pdfDocumentOcr(
  buffer: Buffer,
  opts?: OcrOptions,
): PdfDocumentOcr
export declare function pdfDocumentOcrAsync(
  buffer: Buffer,
  opts?: OcrOptions,
): Promise<PdfDocumentOcr>
PdfDown class (includes OCR methods)
export declare class PdfDown {
  constructor(buffer: Buffer)
  // ── Base methods ──
  textPerPage(): Array<PageText>
  textPerPageAsync(): Promise<Array<PageText>>
  imagesPerPage(): Array<PageImage>
  imagesPerPageAsync(): Promise<Array<PageImage>>
  annotationsPerPage(): Array<PageAnnotation>
  annotationsPerPageAsync(): Promise<Array<PageAnnotation>>
  structuredText(): Array<StructuredPageText>
  structuredTextAsync(): Promise<Array<StructuredPageText>>
  metadata(): PdfMeta
  metadataAsync(): Promise<PdfMeta>
  document(): PdfDocument
  documentAsync(): Promise<PdfDocument>
  // ── OCR methods ──
  textWithOcrPerPage(opts?: OcrOptions): Array<OcrPageText>
  textWithOcrPerPageAsync(opts?: OcrOptions): Promise<Array<OcrPageText>>
  documentOcr(opts?: OcrOptions): PdfDocumentOcr
  documentOcrAsync(opts?: OcrOptions): Promise<PdfDocumentOcr>
}
Types
export const enum TextSource {
  Native = 'Native',
  Ocr = 'Ocr',
}
export interface OcrPageText {
  page: number
  text: string
  source: TextSource
}
export interface OcrStructuredPageText {
  page: number
  header: string
  body: string
  footer: string
  source: TextSource
}
export interface OcrOptions {
  lang?: string // Tesseract language code, default "eng"
  minTextLength?: number // non-whitespace char threshold before OCR fallback, default 1
  maxThreads?: number // cap on Rayon threads for OCR parallelism, default 4, clamped to [1, available CPUs]
}
export interface PdfDocumentOcr {
  version: string
  isLinearized: boolean
  pageCount: number
  creator?: string
  producer?: string
  creationDate?: string
  modificationDate?: string
  totalImages: number
  totalAnnotations: number
  imagePages: Array<number>
  annotationPages: Array<number>
  text: Array<OcrPageText>
  structuredText: Array<OcrStructuredPageText>
  images: Array<PageImage>
  annotations: Array<PageAnnotation>
}
Usage
Use the async API for OCR. The sync variants block the Node.js event loop for the duration of OCR processing, which can be significant for multi-page scanned documents.
Standalone
import { readFile } from 'fs/promises'
import { extractTextWithOcrPerPageAsync } from '@d0paminedriven/pdfdown-ocr'
const pdf = await readFile('scanned-document.pdf')
const pages = await extractTextWithOcrPerPageAsync(pdf, { lang: 'eng', minTextLength: 10 })
for (const { page, text, source } of pages) {
  console.log(`Page ${page} [${source}]: ${text.slice(0, 100)}...`)
}
Class-based (parse once, extract many)
import { readFile } from 'fs/promises'
import { PdfDown } from '@d0paminedriven/pdfdown-ocr'
const pdf = new PdfDown(await readFile('scanned-document.pdf'))
// OCR text extraction
const pages = await pdf.textWithOcrPerPageAsync({ lang: 'eng', minTextLength: 10 })
// All base methods work too
const images = await pdf.imagesPerPageAsync()
const meta = pdf.metadata()
Extract everything with OCR in one call
import { readFile } from 'fs/promises'
import { PdfDown } from '@d0paminedriven/pdfdown-ocr'
const pdf = new PdfDown(await readFile('scanned-document.pdf'))
const result = await pdf.documentOcrAsync({ minTextLength: 10 })
// result.text — OcrPageText[] (page, text, source per page)
// result.structuredText — OcrStructuredPageText[] (header/body/footer + source per page)
// result.images — PageImage[] (decoded PNGs with dimensions and color space)
// result.annotations — PageAnnotation[] (links, destinations, rects)
// result.pageCount, result.version, result.creator, ...
Combined: OCR text + images for multimodal pipelines
import { readFile } from 'fs/promises'
import { PdfDown } from '@d0paminedriven/pdfdown-ocr'
const pdf = new PdfDown(await readFile('scanned-document.pdf'))
const [ocrText, images] = await Promise.all([
  pdf.textWithOcrPerPageAsync({ minTextLength: 10 }),
  pdf.imagesPerPageAsync(),
])
// Map.groupBy is an ES2024 addition and requires Node.js 21 or newer
const imagesByPage = Map.groupBy(images, (img) => img.page)
for (const { page, text, source } of ocrText) {
  const pageImages = (imagesByPage.get(page) ?? []).map((img) => ({
    dataUrl: `data:image/png;base64,${img.data.toString('base64')}`,
    width: img.width,
    height: img.height,
  }))
  // Send { page, text, source, images: pageImages } to your embedding pipeline
}
document() vs documentOcr()
Both methods extract everything from a PDF in a single call. The difference is how text is extracted:
| Method | Text extraction | Return type | Use when |
|--------|----------------|-------------|----------|
| document() / documentAsync() | Native PDF text only | PdfDocument | PDF has selectable text |
| documentOcr() / documentOcrAsync() | Native with OCR fallback | PdfDocumentOcr | PDF may contain scanned/image-only pages |
PdfDocumentOcr uses OcrPageText (with source: 'Native' | 'Ocr') and OcrStructuredPageText (with header/body/footer split plus source) instead of the base PageText and StructuredPageText types. Images, annotations, and metadata are identical in both.
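Because every page carries a source tag, it is straightforward to report how much of a document needed OCR. The following is a minimal sketch with locally declared types that mirror the shapes above (the string union stands in for the package's TextSource enum values); `summarizeSources` is a hypothetical helper, not a package export.

```typescript
// Local mirrors of the documented shapes, so the sketch is self-contained.
type TextSource = 'Native' | 'Ocr';
interface OcrPageText {
  page: number;
  text: string;
  source: TextSource;
}

// Hypothetical helper: count pages by extraction source and list the OCR'd pages.
function summarizeSources(pages: OcrPageText[]): {
  native: number;
  ocr: number;
  ocrPages: number[];
} {
  const ocrPages = pages.filter((p) => p.source === 'Ocr').map((p) => p.page);
  return {
    native: pages.length - ocrPages.length,
    ocr: ocrPages.length,
    ocrPages,
  };
}
```

A summary like this can be computed from `result.text` of a documentOcr() call, for instance to log which pages were scanned images or to flag documents whose text is mostly OCR-derived for downstream quality checks.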
How it works
- Text extraction: Each page is first attempted with native PDF text extraction. If a page yields fewer non-whitespace characters than minTextLength, its embedded images are decoded and fed to Tesseract for OCR. Each result is tagged with source: 'Native' or source: 'Ocr'.
- Structured text: After text extraction, repeated header/footer lines are detected across pages using frequency analysis (requires 3+ pages). Each page's text is split into header, body, and footer sections. For OCR results, the source tag is preserved so you know whether each page's content came from native extraction or OCR.
- Parallelism: OCR runs on a dedicated capped Rayon thread pool (default 4 threads, configurable via maxThreads) to prevent CPU oversubscription. Text extraction, image extraction, and annotation extraction run concurrently via rayon::join when using documentOcr / documentOcrAsync.
- Tessdata discovery: On first OCR invocation, the tessdata path is resolved once and cached in a OnceLock. The TESSDATA_PREFIX environment variable is checked first; if unset, tesseract --list-langs is executed and its output is parsed to extract the path. No environment variables are mutated -- the path is passed directly to Tesseract's init function.
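The two numeric options, minTextLength and maxThreads, can be read as simple rules. The sketch below restates them in TypeScript under stated assumptions: the function names are hypothetical, the real checks run in the Rust layer, and counting characters by Unicode code point is a guess at how the threshold is measured.

```typescript
// Illustrative restatement of the documented option semantics, not the package's code.

// Count non-whitespace characters (assumption: counted per Unicode code point).
function nonWhitespaceLength(text: string): number {
  return Array.from(text.replace(/\s/gu, '')).length;
}

// A page falls back to OCR when its native text has fewer non-whitespace
// characters than minTextLength (documented default: 1).
function needsOcr(nativeText: string, minTextLength = 1): boolean {
  return nonWhitespaceLength(nativeText) < minTextLength;
}

// maxThreads defaults to 4 and is clamped to [1, available CPUs].
function effectiveThreads(availableCpus: number, maxThreads = 4): number {
  const upper = Math.max(1, Math.floor(availableCpus));
  const requested = Math.max(1, Math.floor(maxThreads));
  return Math.min(requested, upper);
}
```

With the default threshold of 1, only pages with no extractable text at all are sent to OCR. Raising minTextLength (the usage examples pass 10) also catches pages whose native text layer holds just a stray page number or a few artifacts.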
Supported platforms
Prebuilt binaries are provided for:
- macOS (x64, ARM64)
- Linux glibc (x64, ARM64)
Relationship to @d0paminedriven/pdfdown
Same Rust codebase, compiled with the ocr Cargo feature flag enabled. This package is a strict superset -- you can use it as a drop-in replacement for the base package if you need OCR capabilities.
License
MIT
