edgeparse

v0.2.3

Published

3 months ago

High-performance PDF extraction — Rust engine, Node.js interface

0High
0Medium
0Low

raphael.mansuy

pdf extraction markdown json rag llm rust

edgeparse

High-performance PDF extraction for Node.js — Rust engine, JavaScript/TypeScript interface.

EdgeParse converts PDF documents to Markdown, JSON, HTML, or plain text. It is powered by a native Rust engine (via N-API) with pre-built binaries — no compilation required.

Install

npm install edgeparse
# or
pnpm add edgeparse
# or
yarn add edgeparse

Pre-built binaries are available for:

| Platform | Architecture | |---|---| | macOS | x64, arm64 (Apple Silicon) | | Linux | x64-gnu, arm64-gnu | | Windows | x64-msvc |

Quick Start

import { convert } from 'edgeparse';

// Convert a PDF to Markdown
const markdown = convert('report.pdf');
console.log(markdown);

// Convert to JSON
const json = convert('report.pdf', { format: 'json' });

// Convert specific pages to HTML
const html = convert('report.pdf', {
  format: 'html',
  pages: [0, 1, 2],   // pages 1–3 (0-indexed)
});

// Password-protected PDF
const text = convert('secure.pdf', {
  format: 'markdown',
  password: 'secret',
});

API

`convert(inputPath, options?): string`

Converts a PDF file and returns the content as a string.

| Parameter | Type | Description | |---|---|---| | inputPath | string | Absolute or relative path to the PDF file | | options.format | 'markdown' \| 'json' \| 'html' \| 'text' | Output format (default: 'markdown') | | options.pages | number[] | Zero-indexed page numbers to extract (default: all) | | options.password | string | Password for encrypted PDFs | | options.readingOrder | 'xycut' \| 'default' | Reading order algorithm (default: 'xycut') | | options.tableMethod | 'border' \| 'cluster' | Table detection method (default: 'border') | | options.imageOutput | 'embedded' \| 'external' \| 'none' | Image handling (default: 'none') |

`version(): string`

Returns the edgeparse engine version string.

import { version } from 'edgeparse';
console.log(version()); // e.g. "0.2.2"

CLI

The package also ships an edgeparse CLI binary:

npx edgeparse document.pdf
npx edgeparse document.pdf --format json
npx edgeparse document.pdf --format html --output output/

TypeScript

Full TypeScript support is included — no @types package needed.

import { convert, version } from 'edgeparse';
import type { ConvertOptions } from 'edgeparse';

Performance

EdgeParse consistently processes 40+ pages/second on a modern machine and achieves 88%+ extraction accuracy on diverse real-world PDFs — dramatically faster than Python-based alternatives.

License

Apache-2.0 — see LICENSE.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

edgeparse

Install

Quick Start

API

convert(inputPath, options?): string

version(): string

CLI

TypeScript

Performance

Links

License

`convert(inputPath, options?): string`

`version(): string`