edgeparse-wasm

v0.2.5

Published

a month ago

EdgeParse PDF parser — WebAssembly build for browsers


raphael.mansuy

pdf parser wasm webassembly browser extraction markdown

EdgeParse

Fastest PDF extraction engine. Rust-native. Zero GPU, zero JVM, zero OCR models.

Extract Markdown, JSON (with bounding boxes), and HTML from any born-digital PDF — deterministically, in milliseconds, on CPU.

How accurate is it? — 0.787 overall on the latest opendataloader.org PDF-to-Markdown benchmark, with the best score in every reported metric: reading order, tables, headings, paragraphs, text quality, table detection, and speed. Benchmark details
How fast? — 0.064 s/doc on the 200-document benchmark corpus on Apple M4 Max. Faster than OpenDataLoader, Docling, PyMuPDF4LLM, MarkItDown, and LiteParse.
Does it need a GPU or Java? — No. No JVM, no GPU, no OCR models, no Python runtime for the CLI. Single ~15 MB binary.
RAG / LLM pipelines? — Yes. Outputs structured Markdown for chunking, JSON with bounding boxes for citations, preserves reading order across multi-column layouts. See integration examples

Available as a Rust library, CLI binary, Python package, Node.js package, and WebAssembly module for in-browser PDF parsing.

Get Started in 30 Seconds

Python (Python 3.9+):

pip install edgeparse

import edgeparse

# Markdown — ready for LLM context or RAG chunking
md = edgeparse.convert("report.pdf", format="markdown")

# JSON with bounding boxes — for citations and element-level control
data = edgeparse.convert("report.pdf", format="json")

CLI (Rust 1.85+):

cargo install edgeparse-cli
edgeparse report.pdf --format markdown --output-dir output/

Node.js (Node 18+):

npm install edgeparse

import { convert } from 'edgeparse';
const md = convert('report.pdf', { format: 'markdown' });

Release Channels

Tagged releases publish every supported distribution target through GitHub Actions:

| Channel | Artifact | Install / Pull |
|---------|----------|----------------|
| Rust crates | pdf-cos, edgeparse-core, edgeparse-cli | cargo install edgeparse-cli |
| Python SDK | edgeparse wheels + sdist | pip install edgeparse |
| Node.js SDK | edgeparse + 5 platform addons | npm install edgeparse |
| WebAssembly SDK | edgeparse-wasm | npm install edgeparse-wasm |
| CLI binaries | GitHub Release archives for macOS, Linux, Windows | GitHub Releases |
| Homebrew | raphaelmansuy/edgeparse tap | brew tap raphaelmansuy/edgeparse && brew install edgeparse |
| Containers | GHCR + Docker Hub multi-arch images | docker pull ghcr.io/raphaelmansuy/edgeparse:0.2.1 |

Release automation and registry details: docs/07-cicd-publishing.md

What Problems Does This Solve?

| Problem | EdgeParse Solution | Status |
|---------|--------------------|--------|
| PDF text loses reading order in multi-column layouts | XY-Cut++ algorithm preserves correct reading sequence across columns, sidebars, and mixed layouts | ✅ Shipped |
| Table extraction is broken (merged cells, borderless tables) | Ruling-line table detection + borderless cluster method; --table-method cluster for complex cases | ✅ Shipped |
| OCR/ML tools add 500 MB+ of dependencies to a simple PDF pipeline | Zero GPU, zero OCR models, zero JVM; single 15 MB binary, pure Rust | ✅ Shipped |
| Heading hierarchy is lost (all text looks the same) | Font-metric + geometry-based heading classifier; MHS score 0.553 on the current benchmark | ✅ Shipped |
| PDFs can carry hidden prompt injection payloads | AI safety filters: hidden text, off-page content, tiny-text, invisible OCG layers detected and stripped | ✅ Shipped |
| Need bounding boxes to cite sources in RAG answers | Every element (paragraph, heading, table, image) has [left, bottom, right, top] coordinates in PDF points | ✅ Shipped |
| In-browser PDF parsing uploads data to a server | WebAssembly build: full Rust engine in the browser, PDF data never leaves the device | ✅ Shipped |

Benchmark

Evaluated on 200 real-world PDFs — academic papers, financial reports, multi-column layouts, complex tables, mixed-language documents — running on Apple M4 Max.

Current comparison set

| Engine | NID ↑ | TEDS ↑ | MHS ↑ | PBF ↑ | TQS ↑ | TD F1 ↑ | Speed ↓ | Overall ↑ |
|--------|------:|-------:|------:|------:|------:|--------:|--------:|----------:|
| EdgeParse | 0.889 | 0.596 | 0.553 | 0.559 | 0.920 | 0.901 | 0.064 s/doc | 0.787 |
| OpenDataLoader | 0.873 | 0.326 | 0.442 | 0.544 | 0.916 | 0.636 | 0.094 s/doc | 0.733 |
| Docling | 0.867 | 0.540 | 0.438 | 0.530 | 0.908 | 0.891 | 0.768 s/doc | 0.745 |
| PyMuPDF4LLM | 0.852 | 0.323 | 0.407 | 0.538 | 0.888 | 0.744 | 0.439 s/doc | 0.710 |
| EdgeParse (pre-frontier baseline) | 0.859 | 0.493 | 0.500 | 0.482 | 0.891 | 0.849 | 0.232 s/doc | 0.751 |
| MarkItDown | 0.808 | 0.193 | 0.001 | 0.362 | 0.861 | 0.558 | 0.149 s/doc | 0.564 |
| LiteParse | 0.815 | 0.000 | 0.001 | 0.383 | 0.887 | N/A | 0.196 s/doc | 0.564 |

EdgeParse now leads the entire comparison set on every reported benchmark metric, including speed. Relative to the previous EdgeParse baseline, the current pipeline increases reading-order accuracy, table structure similarity, paragraph boundaries, text quality, table-detection F1, and overall score while cutting latency from 0.232 to 0.064 s/doc.

When to choose what:

| Use case | Recommendation |
|----------|---------------|
| Born-digital PDFs, latency matters, production deployment | EdgeParse: best accuracy/speed, zero dependencies |
| Complex scanned tables, GPU available, batch offline | Consider Docling or MinerU |
| Scanned documents requiring full OCR | Dedicated OCR pipeline |

Metrics

| Metric | What it measures |
|--------|-----------------|
| NID | Reading order accuracy (normalised index distance) |
| TEDS | Table structure accuracy (tree-edit distance vs. ground truth) |
| MHS | Heading hierarchy accuracy |
| PBF | Paragraph boundary F1 |
| TQS | Text quality score |
| TD F1 | Table detection F1 |
| Overall | Normalized aggregate benchmark score |
| Speed | Wall-clock seconds per document (full pipeline, 200 docs, parallel) |
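
For orientation, the two metric families named above have conventional textbook forms. The block below is a sketch that assumes the benchmark follows those conventions; the exact matching rules and normalisation used by the evaluators are not stated here. $P$ and $R$ denote precision and recall, and $T_a$, $T_b$ denote the predicted and ground-truth table trees.

```latex
% F1 (used by PBF and TD F1): harmonic mean of precision P and recall R
F_1 = \frac{2 P R}{P + R}

% TEDS: tree-edit distance normalised by the size of the larger tree
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

Under these definitions both scores lie in [0, 1], and a TEDS of 1 means the predicted table tree matches the reference exactly.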

Running the benchmark

cargo build --release
cd benchmark
uv sync
uv run python run.py          # EdgeParse on all 200 docs
uv run python compare_all.py  # Compare against 9 engines

Results → benchmark/prediction/edgeparse/ · HTML reports → benchmark/reports/

Regression thresholds

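benchmark/thresholds.json defines the CI regression gates: minimum acceptable scores for the accuracy metrics and a maximum for elapsed_per_doc (seconds per document):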

{
  "nid": 0.85,
  "teds": 0.40,
  "mhs": 0.55,
  "table_detection_f1": 0.55,
  "elapsed_per_doc": 2.0
}
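
A gate like this can be checked with a few lines of Python. The sketch below is not the project's CI script: the function name `find_regressions` is hypothetical, and it assumes every key is a lower bound except `elapsed_per_doc`, which it treats as an upper bound in seconds.

```python
import json

# Copied from the thresholds.json example above.
THRESHOLDS = json.loads("""
{
  "nid": 0.85,
  "teds": 0.40,
  "mhs": 0.55,
  "table_detection_f1": 0.55,
  "elapsed_per_doc": 2.0
}
""")

# Assumption: only the latency key is an upper bound; all others are lower bounds.
UPPER_BOUND_KEYS = {"elapsed_per_doc"}


def find_regressions(scores: dict, thresholds: dict = THRESHOLDS) -> list[str]:
    """Return one human-readable message per threshold that is violated."""
    failures = []
    for key, limit in thresholds.items():
        value = scores.get(key)
        if value is None:
            failures.append(f"{key}: missing from results")
        elif key in UPPER_BOUND_KEYS and value > limit:
            failures.append(f"{key}: {value} exceeds maximum {limit}")
        elif key not in UPPER_BOUND_KEYS and value < limit:
            failures.append(f"{key}: {value} is below minimum {limit}")
    return failures


# Scores taken from the EdgeParse row of the comparison table.
current = {
    "nid": 0.889,
    "teds": 0.596,
    "mhs": 0.553,
    "table_detection_f1": 0.901,
    "elapsed_per_doc": 0.064,
}
print(find_regressions(current))  # prints []
```

With the published scores the list is empty; note that MHS (0.553) clears its 0.55 threshold by a narrow margin.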

Capability Matrix

| Capability | Available | Notes |
|-----------|-----------|-------|
| Extraction | | |
| Text with correct reading order | ✅ | XY-Cut++ across columns and sidebars |
| Bounding boxes for every element | ✅ | [left, bottom, right, top] in PDF points |
| Table extraction: ruling-line borders | ✅ | Default mode |
| Table extraction: borderless/cluster | ✅ | --table-method cluster |
| Heading hierarchy detection | ✅ | Numbered + unnumbered, all levels |
| List detection (numbered, bulleted, nested) | ✅ | |
| Image extraction with coordinates | ✅ | PNG or JPEG, embedded or external |
| Header / footer / watermark filtering | ✅ | |
| Multi-column layout support | ✅ | |
| CMap / ToUnicode font decoding | ✅ | Handles non-standard encodings |
| Tagged PDF structure tree | ✅ | --use-struct-tree preserves author intent |
| Safety | | |
| Hidden text filtering | ✅ | Prompt injection protection |
| Off-page content filtering | ✅ | |
| Tiny-text / invisible OCG layer filtering | ✅ | |
| PII sanitization | ✅ | --sanitize flag |
| Output | | |
| Markdown (GFM tables) | ✅ | |
| JSON with bounding boxes | ✅ | |
| HTML5 | ✅ | |
| Plain text (reading order preserved) | ✅ | |
| SDKs | | |
| Python (PyO3 native extension) | ✅ | Python 3.9+, pre-built wheels |
| Node.js (NAPI-RS native addon) | ✅ | Node 18+, pre-built addons |
| WebAssembly | ✅ | In-browser, no server required |
| Rust library | ✅ | edgeparse-core crate |
| CLI binary | ✅ | edgeparse-cli on crates.io |
| Runtime | | |
| GPU required | ❌ No | CPU only |
| JVM required | ❌ No | Pure Rust |
| OCR models required | ❌ No | Born-digital PDFs only |
| Parallel processing | ✅ | Rayon per-page parallelism |
| Deterministic / reproducible | ✅ | Same input → same output, always |

Installation

Python

pip install edgeparse

Python 3.9+. Pre-built wheels for macOS (arm64, x64), Linux (x64, arm64), Windows (x64).

CLI

cargo install edgeparse-cli

Requires Rust 1.85+.

Rust library

[dependencies]
edgeparse-core = "0.2.1"

Docs: docs.rs/edgeparse-core · docs.rs/edgeparse-cli

Node.js

npm install edgeparse

Node 18+. Pre-built native addons for macOS (arm64, x64), Linux (x64, arm64), Windows (x64).

Build from source

git clone https://github.com/raphaelmansuy/edgeparse.git
cd edgeparse
cargo build --release
# Binary: target/release/edgeparse

System requirements

macOS 12+, Linux (glibc 2.17+), or Windows 10+
~15 MB binary (stripped release build)
No Java, no Python (for the CLI), no GPU

Python SDK

Package: edgeparse · Requires: Python 3.9+ · Source: sdks/python/

`edgeparse.convert()`

def convert(
    input_path: str | Path,
    *,
    format: str = "markdown",       # "markdown", "json", "html", "text"
    pages: str | None = None,        # e.g. "1,3,5-7"
    password: str | None = None,
    reading_order: str = "xycut",   # "xycut" or "off"
    table_method: str = "default",  # "default" or "cluster"
    image_output: str = "off",      # "off", "embedded", "external"
) -> str: ...

`edgeparse.convert_file()`

def convert_file(
    input_path: str | Path,
    output_dir: str | Path = "output",
    *,
    format: str = "markdown",
    pages: str | None = None,
    password: str | None = None,
) -> str: ...  # returns output file path

Examples

import edgeparse

# Basic Markdown extraction
md = edgeparse.convert("report.pdf", format="markdown")

# JSON with bounding boxes
json_str = edgeparse.convert("report.pdf", format="json")

# Password-protected, specific pages, borderless tables
md = edgeparse.convert(
    "secure.pdf",
    format="markdown",
    pages="1,3,5-7",
    password="secret",
    reading_order="xycut",
    table_method="cluster",
)

# Write output file  
out_path = edgeparse.convert_file("report.pdf", output_dir="output/", format="markdown")
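
The `pages` string accepts comma-separated pages and ranges such as "1,3,5-7". As an illustration of that syntax only, the sketch below expands such a string into explicit page numbers. `expand_pages` is a hypothetical helper, not part of the edgeparse API, and it assumes pages are 1-based and ranges are inclusive.

```python
def expand_pages(spec: str) -> list[int]:
    """Expand a page string such as "1,3,5-7" into sorted 1-based page numbers."""
    pages = set()
    for part in spec.split(","):
        part = part.strip()
        if not part:
            continue  # tolerate stray commas
        if "-" in part:
            start_text, end_text = part.split("-", 1)
            start, end = int(start_text), int(end_text)
            if start > end:
                raise ValueError(f"descending range: {part!r}")
            pages.update(range(start, end + 1))  # ranges are inclusive
        else:
            pages.add(int(part))
    if any(p < 1 for p in pages):
        raise ValueError("page numbers are 1-based")
    return sorted(pages)


print(expand_pages("1,3,5-7"))  # prints [1, 3, 5, 6, 7]
```

A helper like this is useful for validating user input before passing the original string to `convert()`.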

CLI entry point (Python package)

edgeparse report.pdf -f markdown -o output/
edgeparse *.pdf --format json --output-dir out/ --pages "1-3"

Node.js SDK

Package: edgeparse · Requires: Node.js 18+ · Source: sdks/node/

`convert()`

import { convert } from 'edgeparse';

// Markdown
const md = convert('report.pdf', { format: 'markdown' });

// JSON with bounding boxes
const json = convert('report.pdf', { format: 'json' });

// With options
const result = convert('report.pdf', {
  format: 'markdown',
  pages: '1-5',
  readingOrder: 'xycut',
  tableMethod: 'cluster',
});

`ConvertOptions`

interface ConvertOptions {
  format?: 'markdown' | 'json' | 'html' | 'text';  // default: "markdown"
  pages?: string;         // e.g. "1,3,5-7"
  password?: string;
  readingOrder?: 'xycut' | 'off';        // default: "xycut"
  tableMethod?: 'default' | 'cluster';   // default: "default"
  imageOutput?: 'off' | 'embedded' | 'external';  // default: "off"
}

CLI (npm)

npx edgeparse report.pdf -f markdown -o output.md
npx edgeparse report.pdf --format json --pages "1-5"

WebAssembly SDK

EdgeParse compiles to WebAssembly — client-side PDF extraction in any modern browser with no server, no uploads, and no backend infrastructure.

Same Rust engine, identical output to CLI/Python/Node
PDF data never leaves the user's device (privacy by design)
Works offline after initial WASM load (~4 MB cached)
Zero infrastructure cost — static hosting only

Quick start

import init, { convert_to_string } from 'edgeparse-wasm';

await init();  // load WASM binary once

const bytes = new Uint8Array(await file.arrayBuffer());

const markdown = convert_to_string(bytes, 'markdown');
const json     = convert_to_string(bytes, 'json');
const html     = convert_to_string(bytes, 'html');

API

| Function | Returns | Description |
|----------|---------|-------------|
| convert(bytes, format?, pages?, readingOrder?, tableMethod?) | JS object | Structured PdfDocument with pages, elements, bounding boxes |
| convert_to_string(bytes, format?, pages?, readingOrder?, tableMethod?) | string | Formatted output (Markdown, JSON, HTML, or text) |
| version() | string | EdgeParse version |

Live demo

edgeparse.com/demo/ — drag-and-drop any PDF, all processing runs locally in your browser.

Build from source

curl https://rustwasm.github.io/wasm-pack/installer/init.sh -sSf | sh
cd crates/edgeparse-wasm
wasm-pack build --target web --release
# Output: crates/edgeparse-wasm/pkg/

Full documentation: docs/09-wasm-sdk.md

CLI Reference

edgeparse [OPTIONS] <PDF_FILE>...

Core options

| Flag | Default | Description |
|------|---------|-------------|
| -o, --output-dir <DIR> | none | Write output files to this directory |
| -f, --format <FMT> | json | Output format(s), comma-separated |
| -p, --password <PW> | none | Password for encrypted PDFs |
| --pages <RANGE> | none | Page range e.g. "1,3,5-7" |
| -q, --quiet | false | Suppress log output |

Output format values

| Value | Description |
|-------|-------------|
| json | Structured JSON with bounding boxes and element types (default) |
| markdown | Standard Markdown with GFM tables |
| markdown-with-html | Markdown with HTML table fallback for complex tables |
| markdown-with-images | Markdown with embedded or linked images |
| html | Full HTML5 document with semantic elements |
| text | Plain UTF-8 text, reading order preserved |

Multiple formats: --format markdown,json

Layout & extraction options

| Flag | Default | Description |
|------|---------|-------------|
| --reading-order <ALGO> | xycut | Reading order: xycut or off |
| --table-method <METHOD> | default | Table detection: default (ruling lines) or cluster (borderless) |
| --keep-line-breaks | false | Preserve original line breaks within paragraphs |
| --use-struct-tree | false | Use tagged PDF structure tree when available |
| --include-header-footer | false | Include headers and footers in output |
| --sanitize | false | Enable PII sanitization |
| --replace-invalid-chars <CH> | " " | Replacement for invalid Unicode characters |

Image options

| Flag | Default | Description |
|------|---------|-------------|
| --image-output <MODE> | external | off, embedded (base64), or external (files) |
| --image-format <FMT> | png | png or jpeg |
| --image-dir <DIR> | none | Directory for extracted image files |

Separator options

| Flag | Description |
|------|-------------|
| --markdown-page-separator <STR> | String inserted between Markdown pages |
| --text-page-separator <STR> | String inserted between plain-text pages |
| --html-page-separator <STR> | String inserted between HTML pages |

Content safety options

| Flag | Default | Description |
|------|---------|-------------|
| --content-safety-off <FLAGS> | none | Disable safety filters: all, hidden-text, off-page, tiny, hidden-ocg |

Hybrid backend options

| Flag | Default | Description |
|------|---------|-------------|
| --hybrid <BACKEND> | off | Hybrid backend: off or docling-fast |
| --hybrid-mode <MODE> | auto | Triage mode: auto or full |
| --hybrid-url <URL> | none | Hybrid backend service URL |
| --hybrid-timeout <MS> | 30000 | Timeout in milliseconds |
| --hybrid-fallback | false | Fall back to local extraction on hybrid error |

Output Formats

| Format | Flag value | Best for |
|--------|-----------|----------|
| JSON with bounding boxes | json | RAG citations, element-level processing, source highlighting |
| Markdown (GFM) | markdown | LLM context windows, chunking pipelines, readable output |
| Markdown + HTML tables | markdown-with-html | Complex tables that don't render well in pure Markdown |
| Markdown + images | markdown-with-images | Documents where figures matter |
| HTML5 | html | Web display, accessibility pipelines |
| Plain text | text | Keyword search, simple NLP, legacy pipelines |

Combine formats: --format markdown,json

JSON output example

{
  "type": "heading",
  "id": 42,
  "level": "Title",
  "page number": 1,
  "bounding box": [72.0, 700.0, 540.0, 730.0],
  "heading level": 1,
  "font": "Helvetica-Bold",
  "font size": 24.0,
  "content": "Introduction"
}

| Field | Description |
|-------|-------------|
| type | Element type: heading, paragraph, table, list, image, caption |
| id | Unique identifier for cross-referencing |
| page number | 1-indexed page reference |
| bounding box | [left, bottom, right, top] in PDF points (72 pt = 1 inch) |
| heading level | Heading depth (1+) |
| content | Extracted text |
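
Because the bounding box uses PDF points with the origin at the bottom-left of the page, drawing a highlight on a rendered page image usually requires flipping the vertical axis and scaling by the render resolution. The sketch below shows one way to do that. `bbox_to_pixels` is a hypothetical helper, and the page height (792 pt, US Letter) is an assumed value, since the JSON example above does not include page dimensions.

```python
def bbox_to_pixels(bbox, page_height_pt, dpi=144):
    """Convert [left, bottom, right, top] in PDF points to
    (x, y, width, height) in pixels with a top-left origin."""
    left, bottom, right, top = bbox
    scale = dpi / 72.0                   # 72 pt = 1 inch
    x = left * scale
    y = (page_height_pt - top) * scale   # flip the vertical axis
    width = (right - left) * scale
    height = (top - bottom) * scale
    return (x, y, width, height)


# Bounding box from the JSON example; page height is an assumption (US Letter).
rect = bbox_to_pixels([72.0, 700.0, 540.0, 730.0], page_height_pt=792.0)
print(rect)  # prints (144.0, 124.0, 936.0, 60.0)
```

At 144 DPI each point maps to two pixels, so the 468 pt wide heading becomes a 936 px wide rectangle.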

RAG / LLM Integration

EdgeParse is designed for AI pipelines. Every element has a bounding box and page number, so you can cite exact sources in answers.

Extract Markdown for chunking

import edgeparse

md = edgeparse.convert("report.pdf", format="markdown")

# Feed directly into an LLM (`llm` is a placeholder for your own client, e.g. a LangChain chat model)
response = llm.invoke(f"Summarize this document:\n\n{md}")

Extract JSON for source citations

import json, edgeparse

data = json.loads(edgeparse.convert("report.pdf", format="json"))

for element in data["kids"]:
    if element["type"] == "paragraph":
        # element["bounding box"] → highlight location in original PDF
        # element["page number"] → link back to source page
        print(f"p.{element['page number']}: {element['content'][:80]}")

LangChain integration

EdgeParse has no official LangChain loader yet, but integrates trivially:

from langchain_core.documents import Document
import edgeparse, json

def load_pdf(path: str) -> list[Document]:
    data = json.loads(edgeparse.convert(path, format="json"))
    docs = []
    for el in data["kids"]:
        if el["type"] in ("paragraph", "heading"):
            docs.append(Document(
                page_content=el["content"],
                metadata={
                    "source": path,
                    "page": el["page number"],
                    "bbox": el["bounding box"],
                    "type": el["type"],
                }
            ))
    return docs

LlamaIndex integration

from llama_index.core import Document
import edgeparse, json

def edgeparse_reader(path: str) -> list[Document]:
    data = json.loads(edgeparse.convert(path, format="json"))
    return [
        Document(
            text=el["content"],
            metadata={"page": el["page number"], "source": path}
        )
        for el in data["kids"]
        if el.get("content")
    ]

Which output format for which use case?

| Use case | Recommended format | Why |
|----------|--------------------|-----|
| Feed PDF to LLM | markdown | Clean structure, fits in context window |
| RAG with source citations | json | Bounding boxes enable "click-to-source" UX |
| Semantic chunking by section | markdown | Headings make natural chunk boundaries |
| Element-level filtering | json | Filter by type, page number, heading level |
| Web display | html | Styled output with semantic elements |

Agent Skill

EdgeParse ships as a Claude agent skill — a structured description that teaches Claude (and any compatible AI agent) how to extract PDF content on behalf of users.

# Add the EdgeParse skill to your agent environment
npx skills add raphaelmansuy/edgeparse --skill edgeparse

# Install the Python package
pip install edgeparse

The npx skills add command registers the skill in skills-lock.json:

{
  "version": 1,
  "skills": {
    "edgeparse": {
      "source": "raphaelmansuy/edgeparse",
      "sourceType": "github"
    }
  }
}

Once installed, the agent reads skills/edgeparse/SKILL.md and knows when to call edgeparse.convert(), which format to use for different tasks, and how to handle edge cases like encrypted PDFs, borderless tables, and multi-column layouts.

See docs/08-agent-skill.md for the full integration guide (LangChain, LlamaIndex, MCP, CrewAI patterns).

Architecture

Crate structure

edgeparse/
├── crates/
│   ├── pdf-cos/            # Low-level PDF object model (fork of lopdf 0.39)
│   ├── edgeparse-core/     # Core extraction engine (~90 source files)
│   │   └── src/
│   │       ├── api/        # ProcessingConfig, FilterConfig, BatchResult
│   │       ├── pdf/        # Loader, content stream parser, font/CMap decoding
│   │       ├── models/     # ContentElement, BoundingBox, TextChunk, PdfDocument
│   │       ├── pipeline/   # 20-stage orchestrator + Rayon parallel helpers
│   │       ├── output/     # Renderers: JSON, Markdown, HTML, text, CSV
│   │       ├── tagged/     # Tagged-PDF structure tree → McidMap
│   │       └── utils/      # XY-Cut++ algorithm, sanitizer, layout analysis
│   ├── edgeparse-cli/      # CLI binary (clap derive, 25+ flags)
│   ├── edgeparse-wasm/     # WebAssembly build for browsers (wasm-bindgen)
│   ├── edgeparse-python/   # PyO3 native extension
│   └── edgeparse-node/     # NAPI-RS native addon
└── sdks/
    ├── python/             # Python packaging (maturin, pyproject.toml)
    └── node/               # npm packaging (TypeScript wrapper, tsup)

Processing pipeline (20 stages)

PDF file
    │
    ▼
pdf-cos                   ← xref parsing, object graph, encrypted streams
    │
    ▼
edgeparse-core::pdf       ← page tree, content stream operators, font decoding,
    │                       CMap/ToUnicode, image extraction, tagged PDF
    ▼
edgeparse-core::pipeline  ← 20 sequential/parallel stages:
    │  [page range] → [watermark] → [filter] → [table borders] →
    │  [cell assignment] → [boxed headings] → [column detection] →
    │  [TextLine assembly] → [TextBlock grouping] → [table clustering] →
    │  [header/footer] → [list detection] → [paragraph] → [figure] →
    │  [heading classification] → [XY-Cut++ reading order] →
    │  [list pass 2] → [caption/footnote/TOC] → [cross-page tables] →
    │  [element nesting] → [final reading order] → [sanitize]
    ▼
edgeparse-core::output    ← render to JSON / Markdown / HTML / text

Stages marked par_map_pages run in parallel via Rayon; cross-page stages run sequentially.

FAQ

What is the best PDF parser for RAG?

For RAG pipelines, you need a parser that preserves document structure, maintains correct reading order, and provides element coordinates for citations. EdgeParse outputs structured JSON with bounding boxes for every element, handles multi-column layouts with XY-Cut++, and runs locally on CPU without a GPU or JVM. On the current 200-document benchmark it leads the comparison set in both overall score (0.787) and latency (0.064 s/doc). See RAG integration examples.

How do I cite PDF sources in RAG answers?

Every element in JSON output includes a bounding box ([left, bottom, right, top] in PDF points, 72 pt = 1 inch) and page number. Map the source chunk back to its bounding box to highlight the exact location in the original PDF — enabling "click-to-source" UX. No other non-OCR open-source parser provides bounding boxes for every element by default.
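
Mapping an answer snippet back to its source can be as simple as a substring search over the extracted elements. The sketch below is a minimal illustration: `find_source` is a hypothetical helper, the sample elements are made up, and only the field names ("content", "page number", "bounding box") come from the JSON output described above.

```python
def find_source(elements, snippet):
    """Return (page, bbox) for the first element whose content contains
    the snippet (case- and whitespace-insensitive), or None."""
    needle = " ".join(snippet.split()).lower()
    for el in elements:
        text = " ".join(el.get("content", "").split()).lower()
        if needle and needle in text:
            return el["page number"], el["bounding box"]
    return None


# Made-up elements that mimic the documented JSON fields.
sample = [
    {"type": "heading", "page number": 1,
     "bounding box": [72.0, 700.0, 540.0, 730.0], "content": "Introduction"},
    {"type": "paragraph", "page number": 2,
     "bounding box": [72.0, 500.0, 540.0, 560.0],
     "content": "Revenue grew 12% year over year."},
]
print(find_source(sample, "revenue grew"))  # prints (2, [72.0, 500.0, 540.0, 560.0])
```

A production pipeline would more likely store the element id alongside each chunk at indexing time, which avoids the search entirely.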

How do I extract tables from PDF?

EdgeParse detects tables using border (ruling-line) analysis by default. For complex or borderless tables, add --table-method cluster (CLI) or table_method="cluster" (Python). This uses a text-clustering algorithm to detect table structure without visible borders. On the current benchmark it reaches 0.596 TEDS and 0.901 table-detection F1, both best in the published comparison set.

Does it work without sending data to the cloud?

Yes. EdgeParse runs 100% locally. No API calls, no data transmission — your documents never leave your environment. The WebAssembly SDK also runs entirely in the browser: the PDF is processed client-side and never uploaded to any server. Ideal for legal, healthcare, and financial documents.

Does it handle multi-column layouts?

Yes. XY-Cut++ reading order analysis correctly sequences text across multi-column pages, sidebars, and mixed layouts. This works without any configuration change; it is the default reading order algorithm.
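
To give a feel for how cut-based reading order works, here is a sketch of the classic recursive XY-cut idea, not EdgeParse's XY-Cut++ implementation. It repeatedly splits a set of boxes along the widest whitespace gap and reads the upper or left part first. The function names, the `min_gap` parameter, and the sample layout are all assumptions made for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (left, bottom, right, top) in PDF points


def _largest_gap(intervals):
    """Return (gap_size, split_position) for the widest empty gap
    in the 1-D projection of the given (start, end) intervals."""
    ordered = sorted(intervals)
    best = (0.0, None)
    reach = ordered[0][1]  # furthest end seen so far
    for start, end in ordered[1:]:
        if start - reach > best[0]:
            best = (start - reach, (start + reach) / 2)
        reach = max(reach, end)
    return best


def xy_cut(boxes: List[Box], min_gap: float = 5.0) -> List[Box]:
    """Order boxes by recursively cutting along the widest whitespace gap."""
    if len(boxes) <= 1:
        return list(boxes)
    x_gap, x_split = _largest_gap([(b[0], b[2]) for b in boxes])
    y_gap, y_split = _largest_gap([(b[1], b[3]) for b in boxes])
    if max(x_gap, y_gap) < min_gap:
        # No clear whitespace: fall back to top-to-bottom, then left-to-right.
        return sorted(boxes, key=lambda b: (-b[3], b[0]))
    if x_gap > y_gap:
        # Vertical cut: read the left column before the right column.
        left = [b for b in boxes if b[2] <= x_split]
        right = [b for b in boxes if b[0] >= x_split]
        return xy_cut(left, min_gap) + xy_cut(right, min_gap)
    # Horizontal cut: read the upper block before the lower block.
    upper = [b for b in boxes if b[1] >= y_split]
    lower = [b for b in boxes if b[3] <= y_split]
    return xy_cut(upper, min_gap) + xy_cut(lower, min_gap)


# Hypothetical page: a full-width title above two columns of two paragraphs each.
title = (72.0, 700.0, 540.0, 730.0)
a, c = (72.0, 400.0, 290.0, 680.0), (72.0, 100.0, 290.0, 380.0)    # left column
b, d = (320.0, 450.0, 540.0, 680.0), (320.0, 100.0, 540.0, 430.0)  # right column

print(xy_cut([d, b, title, c, a]) == [title, a, c, b, d])  # prints True
```

A plain sort by vertical position would interleave the two columns (title, a, b, d, c); cutting along the column gutter first keeps each column together.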

Does it need a GPU or Java?

No. EdgeParse is a pure Rust implementation. It requires no JVM, no GPU, no OCR models, and no Python runtime for the CLI binary. The CLI binary is ~15 MB stripped. On Apple M4 Max, it processes the 200-document benchmark corpus in about 12.7 seconds total.

How does it compare to Docling, MinerU, and Marker?

| vs. | EdgeParse advantage | Tradeoff |
|-----|---------------------|----------|
| OpenDataLoader | Faster (0.064 vs 0.094 s/doc) with stronger table structure and heading recovery | OpenDataLoader remains close on text quality and paragraph boundaries |
| IBM Docling | Faster (0.064 vs 0.768 s/doc) with better TEDS and overall score in the current benchmark snapshot | Docling remains a viable OCR-heavy fallback for scanned documents |
| Marker | Faster and materially better on every published metric in this benchmark family | Marker supports scanned PDFs via Surya OCR |
| PyMuPDF4LLM | Faster (0.064 vs 0.439 s/doc) with stronger tables, headings, and reading order | PyMuPDF4LLM is simpler if you only need lightweight text extraction |

Does it support scanned PDFs?

Not directly. EdgeParse is built for born-digital PDFs with embedded fonts and does not include an OCR engine. For scanned documents, use the hybrid backend option (--hybrid docling-fast), which routes complex pages to a Docling-Fast backend running locally.

What does "deterministic" mean?

Same input PDF → same output, every time. No stochastic models, no floating-point non-determinism from ML inference. This makes EdgeParse safe to use in CI pipelines, regression tests, and compliance workflows where reproducibility is required.
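
One way to verify this property in a regression test is to hash the output of repeated runs and compare digests. The sketch below assumes nothing about EdgeParse internals: `is_deterministic` is a hypothetical helper that accepts any zero-argument callable returning a string, and the example uses a stand-in extractor so it runs without a PDF.

```python
import hashlib
from typing import Callable


def is_deterministic(extract: Callable[[], str], runs: int = 3) -> bool:
    """Call extract() several times and check that every output hashes the same."""
    digests = {
        hashlib.sha256(extract().encode("utf-8")).hexdigest()
        for _ in range(runs)
    }
    return len(digests) == 1


# Stand-in extractor. In practice you might pass something like:
#   lambda: edgeparse.convert("report.pdf", format="markdown")
print(is_deterministic(lambda: "# Introduction\n\nHello"))  # prints True
```

Storing the digest of a known-good run also lets CI detect output changes across releases, not only within a single run.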

How do I chunk PDFs for semantic search?

Use format="markdown". EdgeParse preserves heading hierarchy and table structure in Markdown output — headings make natural chunk boundaries for RecursiveCharacterTextSplitter (LangChain) or heading-based splitters. For element-level control, use format="json" and split on heading level boundaries or page number changes.
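
A heading-based splitter needs very little code. The sketch below is a minimal illustration rather than a replacement for a library splitter: `split_by_heading` and its `max_level` parameter are hypothetical, it recognises only ATX headings (`#` to `######`), and it does not skip `#` lines inside fenced code blocks.

```python
import re

HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str, max_level: int = 2) -> list[dict]:
    """Split Markdown into chunks that start at headings of depth <= max_level.
    Deeper headings stay inside the chunk of their parent section."""
    chunks = []
    current = {"title": None, "lines": []}

    def flush():
        # Keep a chunk if it has a title or any non-blank text.
        if current["title"] is not None or any(l.strip() for l in current["lines"]):
            chunks.append(current)

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match and len(match.group(1)) <= max_level:
            flush()
            current = {"title": match.group(2).strip(), "lines": [line]}
        else:
            current["lines"].append(line)
    flush()
    return [{"title": c["title"], "text": "\n".join(c["lines"]).strip()} for c in chunks]


doc = (
    "# Report\n\nIntro text.\n\n"
    "## Methods\n\nDetails.\n\n### Setup\n\nMore.\n\n"
    "## Results\n\nNumbers."
)
print([chunk["title"] for chunk in split_by_heading(doc)])
# prints ['Report', 'Methods', 'Results']
```

Raising `max_level` produces smaller chunks; keeping the heading line inside each chunk preserves section context for embedding.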

Does the Python SDK run on Windows?

Yes. Pre-built wheels are available for Windows (x64). The Python package installs from PyPI with no compilation needed:

pip install edgeparse

Tutorials

Step-by-step guides with working examples live in tutorials/:

| Tutorial | Description |
|----------|-------------|
| tutorials/01-cli.md | All CLI flags with working examples and output samples |
| tutorials/02-python-sdk.md | pip install edgeparse: full API, batch processing, JSON parsing |
| tutorials/03-nodejs-sdk.md | npm install edgeparse: TypeScript, CJS, and worker threads |
| tutorials/04-rust-library.md | edgeparse-core in your Rust project: config, models, Rayon |
| tutorials/05-output-formats.md | JSON schema, bounding boxes, Markdown variants, HTML, plain text |

Documentation

Technical documentation lives in docs/:

| Document | Description |
|----------|-------------|
| docs/00-overview.md | Project overview, goals, and design philosophy |
| docs/01-architecture.md | Crate structure, module map, data-flow diagram |
| docs/02-pipeline.md | All 20 pipeline stages with ASCII diagrams |
| docs/03-data-model.md | Type hierarchy: ContentElement, BoundingBox, PdfDocument |
| docs/04-pdf-extraction.md | PDF loader, chunk parser, font/CMap decoding |
| docs/05-output-formats.md | JSON schema, Markdown renderer, HTML/text/CSV output |
| docs/06-sdk-integration.md | CLI flag reference, Python SDK API, Node.js SDK API, Batch API |
| docs/07-cicd-publishing.md | CI/CD publishing pipeline: how it works and how to configure it |
| docs/08-agent-skill.md | EdgeParse agent skill: npx skills add, SKILL.md structure, SDK patterns |
| docs/09-wasm-sdk.md | WebAssembly SDK: objectives, API, use cases, build instructions |

Project Layout

edgeparse/
├── LICENSE
├── CONTRIBUTING.md
├── README.md
├── Cargo.toml               # Rust workspace (5 members)
├── Cargo.lock
│
├── crates/
│   ├── pdf-cos/             # lopdf 0.39 fork — low-level PDF object model
│   ├── edgeparse-core/      # Core extraction engine (~90 source files)
│   ├── edgeparse-cli/       # CLI binary (clap, 25+ flags)
│   ├── edgeparse-wasm/      # WebAssembly build for browsers (wasm-bindgen)
│   ├── edgeparse-python/    # PyO3 native Python extension
│   └── edgeparse-node/      # NAPI-RS native Node.js addon
│
├── sdks/
│   ├── python/              # Python wheel packaging (maturin + pyproject.toml)
│   │   └── edgeparse/       # Pure-Python wrapper + CLI entry point
│   └── node/                # npm packaging (TypeScript + tsup + vitest)
│       └── src/             # index.ts, types.ts, cli.ts
│
├── benchmark/               # Evaluation suite
│   ├── run.py               # Benchmark runner (EdgeParse)
│   ├── compare_all.py       # Multi-engine comparison (9 engines)
│   ├── pyproject.toml
│   ├── thresholds.json      # Regression thresholds
│   ├── pdfs/                # Benchmark PDFs (200 docs)
│   ├── ground-truth/        # Reference Markdown and JSON annotations
│   ├── prediction/          # Per-engine output directories
│   ├── reports/             # HTML benchmark reports
│   └── src/                 # Python evaluators and engine adapters
│
├── docs/                    # Technical documentation (Markdown)
│
├── demo/                    # Interactive WASM demo (Vite + TypeScript)
│   └── src/                 # Demo application source
│
├── examples/
│   └── pdf/                 # Sample PDFs for quick testing
│       ├── lorem.pdf
│       ├── 1901.03003.pdf   # Academic paper (multi-column)
│       ├── 2408.02509v1.pdf # Academic paper
│       └── chinese_scan.pdf # CJK + scan example
│
├── benches/                 # Rust micro-benchmarks (criterion)
├── docker/                  # Dockerfile and Dockerfile.dev
├── scripts/                 # bench.sh, publish-crates.sh
└── tests/
    └── fixtures/            # Rust integration test fixtures

Contributing

See CONTRIBUTING.md. In short:

Fork, branch from main
cargo fmt && cargo clippy -- -D warnings
Run the benchmark to check for regressions: cd benchmark && uv run python run.py --engine edgeparse
Open a PR

License

EdgeParse is licensed under the Apache License 2.0. See LICENSE for the full text.

The crates/pdf-cos/ directory is a fork of lopdf (MIT/Apache-2.0 dual-licensed).
Benchmark PDF documents (benchmark/pdfs/) are sourced from publicly available documents and are used solely for evaluation purposes.
