@opendataloader/pdf

v1.12.0

Published

16 hours ago

A Node.js wrapper for the opendataloader-pdf Java CLI.

hnc-leebd

pdf markdown html convert pdf-convert pdf-parser pdf-parsing pdf-to-json pdf-to-markdown pdf-to-html

OpenDataLoader PDF

PDF Parsing for RAG — Convert to Markdown & JSON, Fast, Local, No GPU

Convert PDFs into LLM-ready Markdown and JSON with accurate reading order, table extraction, and bounding boxes — all running locally on your machine.

Why developers choose OpenDataLoader:

Deterministic — Same input always produces same output (no LLM hallucinations)
Fast — Process 100+ pages per second on CPU
Private — 100% local, zero data transmission
Accurate — Bounding boxes for every element, correct multi-column reading order

pip install -U opendataloader-pdf

import opendataloader_pdf

# PDF to Markdown for RAG
opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="markdown,json"
)

Why OpenDataLoader?

Building RAG pipelines? You've probably hit these problems:

| Problem | How We Solve It |
|---------|-----------------|
| Multi-column text reads left-to-right incorrectly | XY-Cut++ algorithm preserves correct reading order |
| Tables lose structure | Border + cluster detection keeps rows/columns intact |
| Headers/footers pollute context | Auto-filtered before output |
| No coordinates for citations | Bounding box for every element |
| Cloud APIs = privacy concerns | 100% local, no data leaves your machine |
| GPU required | Pure CPU, rule-based, runs anywhere |

Key Features

For RAG & LLM Pipelines

Structured Output — JSON with semantic types (heading, paragraph, table, list, caption)
Bounding Boxes — Every element includes [x1, y1, x2, y2] coordinates for citations
Reading Order — XY-Cut++ algorithm handles multi-column layouts correctly
Noise Filtering — Headers, footers, hidden text, watermarks auto-removed
LangChain Integration — Official document loader
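To give a feel for how whitespace-based reading-order recovery works, here is a minimal sketch of the classic recursive XY-cut idea with a widest-gap-first split. It is not XY-Cut++ and not OpenDataLoader's implementation; the function names (`widest_gap`, `xy_cut`) and the sample boxes are made up for illustration. Boxes are assumed to follow the [left, bottom, right, top] convention with y growing upward.

```python
def widest_gap(intervals):
    """Return (gap_width, cut_position) for the widest empty gap, or None."""
    ordered = sorted(intervals)
    best = None
    reach = ordered[0][1]  # furthest coordinate covered so far
    for low, high in ordered[1:]:
        if low > reach and (best is None or low - reach > best[0]):
            best = (low - reach, (low + reach) / 2)
        reach = max(reach, high)
    return best


def xy_cut(elements):
    """Order elements by recursively splitting at the widest whitespace gap."""
    if len(elements) <= 1:
        return list(elements)
    boxes = [e["bounding box"] for e in elements]
    x_gap = widest_gap([(b[0], b[2]) for b in boxes])
    y_gap = widest_gap([(b[1], b[3]) for b in boxes])
    if x_gap is None and y_gap is None:
        # Overlapping boxes: fall back to top-to-bottom, then left-to-right.
        return sorted(
            elements,
            key=lambda e: (-e["bounding box"][3], e["bounding box"][0]),
        )
    if y_gap is None or (x_gap is not None and x_gap[0] >= y_gap[0]):
        cut = x_gap[1]  # vertical cut: the left part is read first
        first = [e for e in elements if e["bounding box"][2] < cut]
        second = [e for e in elements if e["bounding box"][0] > cut]
    else:
        cut = y_gap[1]  # horizontal cut: the upper part is read first
        first = [e for e in elements if e["bounding box"][1] > cut]
        second = [e for e in elements if e["bounding box"][3] < cut]
    return xy_cut(first) + xy_cut(second)


# Hypothetical page: a full-width title above two columns of two lines each.
page = [
    {"content": "R2", "bounding box": [322, 636, 540, 656]},
    {"content": "L1", "bounding box": [72, 660, 290, 680]},
    {"content": "Title", "bounding box": [72, 700, 540, 730]},
    {"content": "R1", "bounding box": [322, 660, 540, 680]},
    {"content": "L2", "bounding box": [72, 636, 290, 656]},
]
reading_order = [e["content"] for e in xy_cut(page)]
print(reading_order)  # ['Title', 'L1', 'L2', 'R1', 'R2']
```

The column gutter is wider than the line spacing, so the body is split into columns before it is split into lines. A naive top-to-bottom sort would interleave the two columns instead.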

Performance & Privacy

No GPU — Fast, rule-based heuristics
Local-First — Your documents never leave your machine
High Throughput — Process thousands of PDFs efficiently
Multi-Language SDK — Python, Node.js, Java

Document Understanding

Tables — Detects borders, handles merged cells
Lists — Numbered, bulleted, nested
Headings — Auto-detects hierarchy levels
Images — Extracts with captions linked
Tagged PDF Support — Uses native PDF structure when available
AI Safety — Auto-filters prompt injection content

Which Mode Should I Use?

| Your Document | Mode | Setup |
|---------------|------|-------|
| Standard digital PDF | Fast (default) | pip install opendataloader-pdf |
| Complex or nested tables | Hybrid | + start hybrid server |
| Scanned / image-based PDF | Hybrid + OCR | + --force-ocr on server |
| Charts / figures needing text description | Hybrid + picture description | + --enrich-picture-description on server |
| Mathematical formulas (LaTeX) | Hybrid + formula | + --enrich-formula on server |

Output Formats

| Format | Use Case |
|--------|----------|
| JSON | Structured data with bounding boxes, semantic types |
| Markdown | Clean text for LLM context, RAG chunks |
| HTML | Web display with styling |
| Annotated PDF | Visual debugging: see detected structures (sample) |

JSON Output Example

{
  "type": "heading",
  "id": 42,
  "level": "Title",
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "font": "Helvetica-Bold",
  "font size": 24.0,
  "text color": "[0.0]",
  "content": "Introduction"
}

| Field | Description |
|-------|-------------|
| type | Element type: heading, paragraph, table, list, image, caption |
| id | Unique identifier for cross-referencing |
| page number | 1-indexed page reference |
| bounding box | [left, bottom, right, top] in PDF points |
| heading level | Heading depth (1+) |
| font, font size | Typography info |
| content | Extracted text |
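When highlighting a citation on a rendered page image, the bounding box usually has to be converted from PDF points to pixels with a top-left origin. The sketch below assumes the box uses a bottom-left origin (as the [left, bottom, right, top] order suggests) and that you know the page height in points. The helper name `bbox_to_pixels`, the 792 pt page height (US Letter), and the 144 DPI default are illustrative assumptions, not part of the OpenDataLoader API.

```python
def bbox_to_pixels(bbox, page_height_pt, dpi=144):
    """Convert [left, bottom, right, top] in PDF points to a pixel rectangle.

    The result uses a top-left origin, as most image libraries do.
    """
    left, bottom, right, top = bbox
    scale = dpi / 72.0  # one PDF point is 1/72 inch
    return {
        "x": round(left * scale),
        "y": round((page_height_pt - top) * scale),  # flip the y axis
        "width": round((right - left) * scale),
        "height": round((top - bottom) * scale),
    }


# Element shaped like the JSON output example; the page height is an assumption.
element = {
    "type": "heading",
    "page number": 1,
    "bounding box": [72.0, 700.0, 540.0, 730.0],
    "content": "Introduction",
}
rect = bbox_to_pixels(element["bounding box"], page_height_pt=792.0)
print(rect)  # {'x': 144, 'y': 124, 'width': 936, 'height': 60}
```

Keeping the conversion in one small function makes it easy to swap the DPI to match whatever renderer produces the page images.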

Full JSON Schema →

Quick Start

Advanced Options

opendataloader_pdf.convert(
    input_path="document.pdf",
    output_dir="output/",
    format="json,markdown,pdf",

    # Image output mode: "off", "embedded" (Base64), or "external" (default)
    image_output="embedded",

    # Image format: "png" or "jpeg"
    image_format="jpeg",

    # Tagged PDF
    use_struct_tree=True,            # Use native PDF structure
)

Full CLI Options Reference →

AI Safety

PDFs can contain hidden prompt injection attacks. OpenDataLoader automatically filters:

Hidden text (transparent, zero-size)
Off-page content
Suspicious invisible layers

This is enabled by default. Learn more →
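As a rough illustration of the kinds of checks listed above, the sketch below flags elements that are zero-size, fully off the page, or fully transparent. It is only a conceptual example and does not describe how OpenDataLoader implements its filter. The `is_suspicious` helper, the `opacity` field, and the 612 x 792 pt page box are hypothetical.

```python
def is_suspicious(element, page_box=(0.0, 0.0, 612.0, 792.0)):
    """Return True if an element would be invisible to a human reader."""
    left, bottom, right, top = element["bounding box"]
    page_left, page_bottom, page_right, page_top = page_box

    zero_size = right - left <= 0 or top - bottom <= 0
    off_page = (
        right <= page_left
        or left >= page_right
        or top <= page_bottom
        or bottom >= page_top
    )
    # "opacity" is a hypothetical field used only for this illustration.
    transparent = element.get("opacity", 1.0) == 0.0
    return zero_size or off_page or transparent


# Made-up elements: one visible paragraph and three hidden payloads.
elements = [
    {"content": "Visible paragraph", "bounding box": [72, 700, 540, 730]},
    {"content": "Zero-size payload", "bounding box": [72, 600, 72, 600]},
    {"content": "Off-page payload", "bounding box": [700, 100, 900, 120]},
    {"content": "Transparent payload", "bounding box": [72, 500, 540, 520], "opacity": 0.0},
]
kept = [e["content"] for e in elements if not is_suspicious(e)]
print(kept)  # ['Visible paragraph']
```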

Tagged PDF Support

Why it matters: The European Accessibility Act (EAA) took effect June 28, 2025, requiring accessible digital documents across the EU. This means more PDFs will be properly tagged with semantic structure.

OpenDataLoader leverages this:

When a PDF has structure tags, we extract the exact layout the author intended
Headings, lists, tables, reading order — all preserved from the source
No guessing, no heuristics needed — pixel-perfect semantic extraction

opendataloader_pdf.convert(
    input_path="accessible_document.pdf",
    use_struct_tree=True  # Use native PDF structure tags
)

Most PDF parsers ignore structure tags entirely. We're one of the few that fully support them.

Learn more about Tagged PDF →

Hybrid Mode

For documents with complex tables or OCR needs, enable hybrid mode to route challenging pages to an AI backend while keeping simple pages fast and local.

Results: In the benchmark below, table accuracy rises from 0.49 to 0.93 (+90%), while per-page time grows from 0.05 s to 0.43 s.

pip install -U "opendataloader-pdf[hybrid]"

Terminal 1: Start the backend server

opendataloader-pdf-hybrid --port 5002

Terminal 2: Process PDFs with hybrid mode

opendataloader-pdf --hybrid docling-fast input.pdf

Or use in Python:

opendataloader_pdf.convert(
    input_path="complex_tables.pdf",
    output_dir="output/",
    hybrid="docling-fast"  # Routes complex pages to AI backend
)

Local-first: Simple pages processed locally, complex pages routed to backend
Fallback: If backend unavailable, gracefully falls back to local processing
Privacy: Run the backend locally to keep processing 100% on-premises
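The local-first routing and fallback behavior described above can be pictured with a small sketch. This is not the library's code: `convert_page`, the parser callables, and the `has_complex_table` triage flag are all hypothetical stand-ins, and how OpenDataLoader actually decides which pages are complex is not shown here.

```python
def convert_page(page, local_parse, backend_parse, needs_backend):
    """Return (route, result) for one page, preferring local processing."""
    if not needs_backend(page):
        return "local", local_parse(page)
    try:
        return "backend", backend_parse(page)
    except ConnectionError:
        # Backend unreachable: degrade to the local parser instead of failing.
        return "local-fallback", local_parse(page)


# Hypothetical stand-ins for the real parsers and the triage rule.
def local_parse(page):
    return f"local result for page {page['number']}"


def online_backend(page):
    return f"backend result for page {page['number']}"


def offline_backend(page):
    raise ConnectionError("backend not running")


def needs_backend(page):
    return page["has_complex_table"]


pages = [
    {"number": 1, "has_complex_table": False},
    {"number": 2, "has_complex_table": True},
]
routes_online = [convert_page(p, local_parse, online_backend, needs_backend)[0] for p in pages]
routes_offline = [convert_page(p, local_parse, offline_backend, needs_backend)[0] for p in pages]
print(routes_online)   # ['local', 'backend']
print(routes_offline)  # ['local', 'local-fallback']
```

Passing the parsers in as callables keeps the routing decision separate from the parsing itself, which makes the fallback path easy to test without a running server.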

Formula Extraction (LaTeX)

For PDFs containing mathematical formulas, enable formula enrichment to extract LaTeX representations:

# Start backend with formula enrichment
opendataloader-pdf-hybrid --enrich-formula

# Process with full backend mode (required for formula extraction)
opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf

Output in JSON:

{
  "type": "formula",
  "page number": 1,
  "bounding box": [226.2, 144.7, 377.1, 168.7],
  "content": "\\frac{f(x+h) - f(x)}{h}"
}

Output in Markdown:

$$
\frac{f(x+h) - f(x)}{h}
$$

Output in HTML (MathJax/KaTeX compatible):

<div class="math-display">\[\frac{f(x+h) - f(x)}{h}\]</div>

Note: Formula extraction requires --hybrid-mode full to route all pages to the backend where the formula enrichment model runs.
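If you consume the JSON output yourself, the mapping from a formula element to the Markdown and HTML forms shown above is straightforward. The following is a minimal sketch under that assumption; `formula_to_markdown` and `formula_to_html` are hypothetical helper names, not functions provided by OpenDataLoader.

```python
import html


def formula_to_markdown(element):
    """Wrap the LaTeX content in a display-math block."""
    return "$$\n" + element["content"] + "\n$$"


def formula_to_html(element):
    """Wrap the LaTeX content in a MathJax/KaTeX-style display container."""
    # Escape &, < and > so formulas such as "a < b" stay valid HTML.
    latex = html.escape(element["content"], quote=False)
    return '<div class="math-display">\\[' + latex + "\\]</div>"


# Element copied from the JSON output example above.
formula = {
    "type": "formula",
    "page number": 1,
    "bounding box": [226.2, 144.7, 377.1, 168.7],
    "content": "\\frac{f(x+h) - f(x)}{h}",
}
print(formula_to_markdown(formula))
print(formula_to_html(formula))
```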

Scanned PDFs (OCR)

For image-based or scanned PDFs that contain no selectable text, enable OCR on the hybrid backend:

# Start backend with OCR enabled
opendataloader-pdf-hybrid --port 5002 --force-ocr

# Process scanned PDF
opendataloader-pdf --hybrid docling-fast input-scanned.pdf

For non-English documents, specify the OCR language:

opendataloader-pdf-hybrid --port 5002 --force-ocr --ocr-lang "ko,en"

Note: Standard digital PDFs do not need --force-ocr. Use it only for scanned or image-based PDFs.

Timeout: OCR is CPU-intensive. For large scanned documents, increase the timeout:

opendataloader-pdf --hybrid docling-fast --hybrid-timeout 120000 input-scanned.pdf

Picture / Chart Description (Alt Text)

Generate AI-powered descriptions for images and charts in your PDFs. Useful for accessibility (alt text) and making visual content searchable in RAG pipelines.

# Start backend with picture description
opendataloader-pdf-hybrid --enrich-picture-description

# Process with full backend mode (required for picture description)
opendataloader-pdf --hybrid docling-fast --hybrid-mode full input.pdf

Output in JSON:

{
  "type": "picture",
  "page number": 1,
  "bounding box": [72.0, 400.0, 540.0, 650.0],
  "description": "A bar chart showing waste generation by region from 2016 to 2030..."
}

Output in Markdown:

![image 1](document_images/imageFile1.png)

*A bar chart showing waste generation by region from 2016 to 2030...*

Output in HTML:

<figure>
<img src="document_images/imageFile1.png" alt="figure1">
<figcaption>A bar chart showing waste generation by region from 2016 to 2030...</figcaption>
</figure>

You can also customize the prompt for better results with specific document types:

opendataloader-pdf-hybrid --enrich-picture-description \
  --picture-description-prompt "Describe this scientific figure in detail."

Note: Picture description uses SmolVLM (256M), a lightweight vision model. Results are suitable for general context but may not capture precise data values from complex charts.

Hybrid Mode Guide →

LangChain Integration

OpenDataLoader PDF has an official LangChain integration for seamless RAG pipeline development.

pip install -U langchain-opendataloader-pdf

from langchain_opendataloader_pdf import OpenDataLoaderPDFLoader

loader = OpenDataLoaderPDFLoader(
    file_path=["document.pdf"],
    format="text"
)
documents = loader.load()

# Use with any LangChain pipeline
for doc in documents:
    print(doc.page_content[:100])

Benchmarks

We continuously benchmark against real-world documents.

View full benchmark results →

Quick Comparison

| Engine | Overall | Reading Order | Table | Heading | Speed (s/page) |
|-----------------------------|----------|---------------|----------|----------|----------------|
| opendataloader | 0.72 | 0.91 | 0.49 | 0.76 | 0.05 |
| opendataloader [hybrid] | **0.90** | **0.94** | **0.93** | **0.83** | 0.43 |
| docling | 0.86 | 0.90 | 0.89 | 0.80 | 0.73 |
| marker | 0.83 | 0.89 | 0.81 | 0.80 | 53.93 |
| mineru | 0.82 | 0.86 | 0.87 | 0.74 | 5.96 |
| pymupdf4llm | 0.57 | 0.89 | 0.40 | 0.41 | 0.09 |
| markitdown | 0.29 | 0.88 | 0.00 | 0.00 | **0.04** |

Scores are normalized to [0, 1]. Higher is better for accuracy metrics; lower is better for speed. Bold indicates best performance.
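To make the accuracy and speed trade-off concrete, the short calculation below derives a few ratios from the table. The numbers are copied from the rows above; the variable names are only for illustration.

```python
# Table score and speed (seconds per page) copied from the comparison table.
benchmarks = {
    "opendataloader": {"table": 0.49, "speed": 0.05},
    "opendataloader [hybrid]": {"table": 0.93, "speed": 0.43},
    "docling": {"table": 0.89, "speed": 0.73},
}

fast = benchmarks["opendataloader"]
hybrid = benchmarks["opendataloader [hybrid]"]
docling = benchmarks["docling"]

# Relative table-accuracy gain of hybrid mode over fast mode, in percent.
table_gain_pct = round((hybrid["table"] - fast["table"]) / fast["table"] * 100)

# How much slower hybrid mode is per page than fast mode.
slowdown = round(hybrid["speed"] / fast["speed"], 1)

# How much faster hybrid mode is per page than docling.
speedup_vs_docling = round(docling["speed"] / hybrid["speed"], 1)

print(table_gain_pct, slowdown, speedup_vs_docling)  # 90 8.6 1.7
```

In other words, hybrid mode costs roughly 8.6x the per-page time of fast mode for a 90% relative gain in table score, and it is still about 1.7x faster per page than docling in this benchmark.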

Visual Comparison

Roadmap

See our upcoming features and priorities →

Documentation

Frequently Asked Questions

What is the best PDF parser for RAG?

For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. OpenDataLoader is designed specifically for this use case — it outputs structured JSON with bounding boxes, handles multi-column layouts correctly with XY-Cut++, and runs locally without GPU requirements.
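As a rough sketch of how element coordinates can feed citations, the example below turns parsed elements into RAG chunks that carry their section heading and source location. It assumes a flat list of elements shaped like the JSON output example in this README; the real output may be nested, and `to_chunks` plus the sample paragraph are hypothetical.

```python
def to_chunks(elements):
    """Build one chunk per paragraph, tagged with its heading and location."""
    chunks = []
    current_heading = None
    for element in elements:
        if element["type"] == "heading":
            current_heading = element["content"]
        elif element["type"] == "paragraph":
            chunks.append({
                "text": element["content"],
                "section": current_heading,
                "citation": {
                    "id": element["id"],
                    "page": element["page number"],
                    "bbox": element["bounding box"],
                },
            })
    return chunks


# Sample elements: the heading mirrors the README example, the paragraph is made up.
elements = [
    {"type": "heading", "id": 42, "page number": 1,
     "bounding box": [72.0, 700.0, 540.0, 730.0], "content": "Introduction"},
    {"type": "paragraph", "id": 43, "page number": 1,
     "bounding box": [72.0, 640.0, 540.0, 690.0],
     "content": "PDF parsing is the first step of a RAG pipeline."},
]
chunks = to_chunks(elements)
print(chunks[0]["section"], chunks[0]["citation"]["page"])  # Introduction 1
```

Storing the page number and bounding box next to each chunk lets a retrieval UI jump to, and highlight, the exact region an answer came from.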

How do I extract tables from PDF for LLM?

OpenDataLoader detects tables using both border analysis and text clustering, preserving row/column structure in the output. Tables are exported as structured data in JSON or as formatted Markdown tables, ready for LLM consumption.

Can I use this without sending data to the cloud?

Yes. OpenDataLoader runs 100% locally on your machine. No API calls, no data transmission — your documents never leave your environment. This makes it ideal for sensitive documents in legal, healthcare, and financial industries.

What makes OpenDataLoader unique?

OpenDataLoader takes a different approach from many PDF parsers:

Rule-based extraction — Deterministic output without GPU requirements
Bounding boxes for all elements — Essential for citation systems
XY-Cut++ reading order — Handles multi-column layouts correctly
Built-in AI safety filters — Protects against prompt injection
Native Tagged PDF support — Leverages accessibility metadata

This means: consistent output (same input = same output), no GPU required, faster processing, and no model hallucinations.

How do I get better accuracy for complex tables?

Use hybrid mode. Install it with pip install -U "opendataloader-pdf[hybrid]", start the backend server, and pass --hybrid docling-fast when converting. Pages with complex tables are routed to an AI backend (like docling-serve) while simple pages stay fast and local. In the benchmark, table accuracy improves from 0.49 to 0.93, the highest table score among the parsers compared (docling scores 0.89), at 0.43 s/page versus 0.73 s/page for docling.

Does it work with scanned PDFs?

Yes, via hybrid mode with OCR. Start the backend server with --force-ocr:

Terminal 1: Start backend with OCR enabled

opendataloader-pdf-hybrid --port 5002 --force-ocr

Terminal 2: Process scanned PDF

opendataloader-pdf --hybrid docling-fast input-scanned.pdf