firecrawl-pdf-inspector
v0.3.6
Published
Fast PDF classification and text extraction. Detect text-based vs scanned PDFs, extract text by region with quality checks. Native Rust performance via napi-rs.
Downloads
1,306
Maintainers
Readme
firecrawl-pdf-inspector
Fast PDF classification and region-based text extraction for Node.js/Bun. Native Rust performance via napi-rs.
Built by Firecrawl for hybrid OCR pipelines — extract text from PDF structure where possible, fall back to OCR only when needed.
Install
npm install firecrawl-pdf-inspector
# or
bun add firecrawl-pdf-inspectorPrebuilt binaries included for linux-x64 and macOS ARM64. No Rust toolchain needed.
API
classifyPdf(buffer: Buffer): PdfClassification
Classify a PDF as TextBased, Scanned, Mixed, or ImageBased (~10-50ms). Returns which pages need OCR.
import { classifyPdf } from 'firecrawl-pdf-inspector'
import { readFileSync } from 'fs'
const pdf = readFileSync('document.pdf')
const result = classifyPdf(pdf)
console.log(result.pdfType) // "TextBased" | "Scanned" | "Mixed" | "ImageBased"
console.log(result.pageCount) // 42
console.log(result.pagesNeedingOcr) // [5, 12, 15] (0-indexed)
console.log(result.confidence) // 0.875extractTextInRegions(buffer: Buffer, pageRegions: PageRegions[]): PageRegionTexts[]
Extract text within bounding-box regions from a PDF. Designed for hybrid OCR pipelines where a layout model detects regions in rendered page images, and this function extracts text from the PDF structure for text-based pages — skipping GPU OCR.
Each region result includes a needsOcr flag that signals unreliable extraction (empty text, GID-encoded fonts, garbage text, encoding issues).
import { extractTextInRegions } from 'firecrawl-pdf-inspector'
const result = extractTextInRegions(pdf, [
{
page: 0, // 0-indexed
regions: [
[0, 0, 300, 400], // [x1, y1, x2, y2] in PDF points, top-left origin
[300, 0, 612, 400],
]
}
])
for (const region of result[0].regions) {
if (region.needsOcr) {
// Unreliable text — send this region to OCR instead
} else {
console.log(region.text) // Extracted text in reading order
}
}Types
interface PdfClassification {
pdfType: string // "TextBased" | "Scanned" | "Mixed" | "ImageBased"
pageCount: number
pagesNeedingOcr: number[] // 0-indexed page numbers
confidence: number // 0.0 - 1.0
}
interface PageRegions {
page: number // 0-indexed
regions: number[][] // [[x1, y1, x2, y2], ...] in PDF points, top-left origin
}
interface PageRegionTexts {
page: number
regions: RegionText[]
}
interface RegionText {
text: string
needsOcr: boolean // true when text is unreliable
}Platforms
| Platform | Architecture | Supported | |----------|-------------|-----------| | Linux | x64 | Yes | | macOS | ARM64 | Yes |
License
MIT
