@rs-pdf/core

v0.1.25

Published

2 months ago

High-performance PDF to HTML converter — MuPDF via Rust/napi-rs, Node.js bindings


pdf html pdf-to-html converter mupdf rust napi native svg render pdf-render text-extraction seo streaming high-performance bindings addon

Why

MuPDF is one of the fastest PDF renderers available - a lightweight engine developed by Artifex
Pixel-perfect SVG output - text rendered as vector paths, no rasterization artifacts
Optional SEO text layer - transparent HTML text overlay, crawlable by search engines and copy-pasteable by users
Zero runtime dependencies - MuPDF is statically linked into the .node binary
Non-blocking - all rendering runs on Tokio's blocking thread pool, never blocking the Node.js event loop

Installation

npm install @rs-pdf/core
# or
pnpm add @rs-pdf/core

The correct native binary for your platform is installed automatically via optionalDependencies.

Supported platforms: macOS (arm64, x64) · Linux (x64, arm64 glibc) · Windows (x64)

Usage

All functions accept a single input object with either path (local file) or url (remote file). When url is given, the PDF is downloaded to a temporary location and cleaned up automatically.

Convert entire PDF

import { pdfToHtml } from '@rs-pdf/core';

// from local file
const fromFile = await pdfToHtml({ path: '/path/to/file.pdf' });

// from URL
const fromUrl = await pdfToHtml({ url: 'https://example.com/document.pdf' });

console.log(fromFile.pageCount); // total pages
console.log(fromFile.pagesConverted); // pages actually converted
console.log(fromFile.html); // self-contained HTML document

Page range & DPI

const result = await pdfToHtml({
  path: '/path/to/file.pdf',
  startPage: 0, // 0-based, default: 0
  endPage: 9,   // 0-based inclusive, default: last page
  dpi: 200,     // render quality, default: 150
});
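
Because both page indices are 0-based and endPage is inclusive, range arithmetic is easy to get off by one. The sketch below is a minimal illustration, not part of the package: `pagesInRange` and `pointsToPixels` are hypothetical helper names, and the assumption that `dpi` scales PDF points (1/72 inch) to pixels in the usual way has not been confirmed for this library.

```typescript
// Hypothetical helper: number of pages covered by a 0-based, inclusive range.
// An omitted endPage defaults to the last page, as described above.
function pagesInRange(pageCount: number, startPage = 0, endPage = pageCount - 1): number {
  if (!Number.isInteger(startPage) || !Number.isInteger(endPage)) {
    throw new RangeError('Page indices must be integers.');
  }
  if (startPage < 0 || endPage >= pageCount || startPage > endPage) {
    throw new RangeError(`Invalid range ${startPage}..${endPage} for ${pageCount} pages.`);
  }
  return endPage - startPage + 1;
}

// PDF user space uses 72 points per inch, so a length in points maps to
// pixels as points * dpi / 72. Whether this package applies dpi exactly
// this way is an assumption.
function pointsToPixels(points: number, dpi = 150): number {
  return Math.round((points * dpi) / 72);
}
```

With these definitions, `pagesInRange(42, 0, 9)` covers 10 pages, and a US Letter page width of 612 points comes out at 1700 pixels for `dpi: 200`. Throwing on an out-of-range index is a choice made for this sketch; the library itself may clamp or report the error differently.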

SEO text layer

Adds a transparent HTML text overlay on top of the SVG - invisible to users, but indexed by search engine crawlers and copy-pasteable.

const result = await pdfToHtml({ path: '/path/to/file.pdf', seoTextLayer: true });
// result.html contains: SVG visual layer + <div class="tl"> text overlay

DRM-protected PDFs

const result = await pdfToHtml({ path: '/path/to/protected.pdf', password: 'secret' });

Stream page by page

Yields pages as they are converted - useful for large PDFs or when you want to process/save pages without waiting for the entire document.

import { pdfToHtmlStream } from '@rs-pdf/core';

for await (const page of pdfToHtmlStream({ path: '/large.pdf' })) {
  console.log(`Page ${page.pageIndex + 1}/${page.pageCount}`);
  await saveToDatabase(page.html);
}

Use concurrency to prefetch multiple pages in parallel:

for await (const page of pdfToHtmlStream({ url: 'https://example.com/doc.pdf', concurrency: 4 })) {
  handlePage(page); // handlePage is a placeholder for your own callback
}
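
The streaming loops above can be wrapped in a small reusable helper. This is a sketch under stated assumptions: `PageLike`, `drainPages`, and the `sink` callback are hypothetical names, and the page shape is limited to the three fields used in the examples above.

```typescript
// Minimal page shape, limited to the fields used in the examples above.
interface PageLike {
  pageIndex: number;
  pageCount: number;
  html: string;
}

// Hypothetical helper: forwards each page to `sink` as soon as it arrives
// and returns how many pages were handled.
async function drainPages(
  pages: AsyncIterable<PageLike>,
  sink: (page: PageLike) => Promise<void>,
): Promise<number> {
  let handled = 0;
  for await (const page of pages) {
    // The loop pulls the next page only after the sink resolves.
    await sink(page);
    handled += 1;
  }
  return handled;
}
```

A call such as `await drainPages(pdfToHtmlStream({ path: '/large.pdf' }), (page) => saveToDatabase(page.html))` would then hold only the page currently being saved on the consumer side, assuming the stream yields objects of this shape.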

Single page

import { pdfPageToHtml } from '@rs-pdf/core';

const page = await pdfPageToHtml({ path: '/path/to/file.pdf', pageIndex: 3 });
// page.html is a fragment - no DOCTYPE/html/head/body
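
Since the fragment carries no DOCTYPE/html/head/body, it needs a wrapper before it can be opened as a standalone file. The following is a minimal sketch; `wrapFragment` and `escapeHtml` are hypothetical helpers, and any styles that the full-document output places in its head are not reproduced here.

```typescript
// Escapes text for safe use inside HTML element content or attributes.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Hypothetical helper: wraps an HTML fragment in a bare document shell.
function wrapFragment(fragment: string, title = 'Page'): string {
  return [
    '<!DOCTYPE html>',
    '<html>',
    '  <head>',
    '    <meta charset="utf-8">',
    `    <title>${escapeHtml(title)}</title>`,
    '  </head>',
    '  <body>',
    fragment,
    '  </body>',
    '</html>',
  ].join('\n');
}
```

For example, `wrapFragment(page.html, 'Page 4')` yields a string that can be written to disk and opened in a browser.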

Metadata only

import { pdfInfo } from '@rs-pdf/core';

const info = await pdfInfo({ path: '/path/to/file.pdf' });
// or: await pdfInfo({ url: 'https://example.com/doc.pdf' })
// { pageCount, isDrmProtected, title, author, subject, creator }
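
One use of the metadata call is planning work before any rendering starts. The sketch below takes a `pageCount` and splits it into consecutive 0-based, inclusive ranges that match the `startPage`/`endPage` convention; `splitIntoRanges` and `PageRange` are hypothetical names introduced for illustration.

```typescript
interface PageRange {
  startPage: number; // 0-based
  endPage: number;   // 0-based, inclusive
}

// Hypothetical helper: splits `pageCount` pages into consecutive chunks
// of at most `chunkSize` pages each.
function splitIntoRanges(pageCount: number, chunkSize: number): PageRange[] {
  if (!Number.isInteger(chunkSize) || chunkSize < 1) {
    throw new RangeError('chunkSize must be a positive integer');
  }
  const ranges: PageRange[] = [];
  for (let start = 0; start < pageCount; start += chunkSize) {
    ranges.push({
      startPage: start,
      endPage: Math.min(start + chunkSize, pageCount) - 1,
    });
  }
  return ranges;
}
```

A 25-page document with a chunk size of 10 produces the ranges 0..9, 10..19, and 20..24, each of which could be passed to a separate `pdfToHtml` call.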

Worker pool

Limit concurrent PDF conversions when processing large batches:

import { PdfWorkerPool } from '@rs-pdf/core';

const pool = new PdfWorkerPool({ concurrency: 4 });

const results = await Promise.all(
  pdfPaths.map((p) => pool.convert({ path: p, dpi: 150 }))
);

// stream via pool
for await (const page of pool.stream({ url: 'https://example.com/large.pdf' })) {
  handlePage(page); // handlePage is a placeholder for your own callback
}

pool.destroy();
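
To illustrate what limiting concurrency means in practice, here is a generic limiter sketch. It is not how PdfWorkerPool is implemented; `createLimiter` is a hypothetical name, and the code only shows the general technique of capping in-flight tasks with a FIFO wait queue.

```typescript
// Generic concurrency limiter: at most `limit` tasks run at once,
// the rest wait in FIFO order. Illustration only.
function createLimiter(limit: number) {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  let active = 0; // number of slots currently held
  const waiting: Array<() => void> = [];

  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active < limit) {
      active += 1;
    } else {
      // Wait until a finishing task hands its slot over.
      await new Promise<void>((resolve) => waiting.push(resolve));
    }
    try {
      return await task();
    } finally {
      const next = waiting.shift();
      if (next) {
        next(); // pass the slot directly to the next waiter
      } else {
        active -= 1; // nobody waiting: free the slot
      }
    }
  };
}
```

Handing the slot directly to the next waiter, rather than decrementing and re-incrementing the counter, avoids a window in which a newly arriving task could start alongside a waiter that has just been woken and push the total past the limit.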

API

All functions accept a single input object. Provide either path or url — not both.
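
The either/or rule can be checked in application code before a call is made. This is a minimal sketch: `SourceInput`, `ResolvedSource`, and `resolveSource` are hypothetical names for illustration, and the package's own types and error messages may differ.

```typescript
// Hypothetical input shape for illustration; the package's real types may differ.
interface SourceInput {
  path?: string;
  url?: string;
}

type ResolvedSource =
  | { kind: 'path'; value: string }
  | { kind: 'url'; value: string };

// Returns which source was supplied, or throws when neither or both are set.
function resolveSource(input: SourceInput): ResolvedSource {
  const hasPath = typeof input.path === 'string' && input.path.length > 0;
  const hasUrl = typeof input.url === 'string' && input.url.length > 0;
  if (hasPath === hasUrl) {
    throw new Error('Provide exactly one of `path` or `url`.');
  }
  return hasPath
    ? { kind: 'path', value: input.path as string }
    : { kind: 'url', value: input.url as string };
}
```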

`pdfToHtml(input): Promise<PdfConvertResult>`

Converts all (or a range of) pages to a self-contained HTML document.

`pdfPageToHtml(input): Promise<PdfPageResult>`

Converts a single page to an HTML fragment (no DOCTYPE/html/head/body).

`pdfToHtmlStream(input): AsyncGenerator<PdfPageResult>`

Yields pages one by one as they are converted.

`pdfInfo(input): Promise<PdfInfo>`

Returns document metadata without rendering. Safe to call on DRM-protected PDFs.

`PdfWorkerPool`

Concurrency-limited pool. See Worker pool above.

Input fields

| Field | Type | Default | Applies to | Description |
| --- | --- | --- | --- | --- |
| path | string | - | all | Local file path (mutually exclusive with url) |
| url | string | - | all | Remote URL, downloaded automatically |
| pageIndex | number | - | pdfPageToHtml | 0-based page index (required) |
| startPage | number | 0 | all except pdfInfo | First page to convert (0-based) |
| endPage | number | last page | all except pdfInfo | Last page to convert (0-based, inclusive) |
| password | string | - | all | Password for DRM-protected PDFs |
| dpi | number | 150 | all except pdfInfo | Render quality (higher = larger output) |
| seoTextLayer | boolean | false | all except pdfInfo | Add transparent HTML text overlay for SEO |
| concurrency | number | 1 | pdfToHtmlStream | Pages to prefetch in parallel |

HTML output structure

<!-- Full document (pdfToHtml) -->
<!DOCTYPE html>
<html>
  <head>
    ...
  </head>
  <body>
    <div class="page" id="page-1" data-page="1" data-total="42">
      <div style="position:relative; width:...px; height:...px">
        <!-- Visual layer: pixel-perfect SVG (text as vector paths) -->
        <svg>...</svg>

        <!-- SEO text layer (only when seoTextLayer: true) -->
        <!-- Invisible to users, readable by crawlers, copy-pasteable -->
        <div class="tl" style="color:transparent; ...">
          <p><span>Actual text content from PDF</span></p>
        </div>
      </div>
    </div>
  </body>
</html>
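
The page wrappers in this structure make it possible to locate pages in the output without rendering it. The sketch below is a hypothetical helper (`findPageMarkers`) that relies on the wrapper appearing exactly as shown above, with the same attribute order and quoting; for anything beyond a quick check, a real HTML parser would be the safer choice.

```typescript
interface PageMarker {
  id: string;
  page: number;
  total: number;
}

// Finds page wrappers of the form shown above:
// <div class="page" id="page-1" data-page="1" data-total="42">
// Assumes the attributes appear in exactly this order and quoting style.
function findPageMarkers(html: string): PageMarker[] {
  const pattern = /<div class="page" id="(page-\d+)" data-page="(\d+)" data-total="(\d+)">/g;
  const markers: PageMarker[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(html)) !== null) {
    markers.push({
      id: match[1],
      page: Number(match[2]),
      total: Number(match[3]),
    });
  }
  return markers;
}
```

The returned ids can be used as in-document anchors, for example to build a list of `#page-N` links.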

Development

# Install dependencies
pnpm install

# Build native addon (Rust → .node)
pnpm build:native

# Build TypeScript
pnpm build:ts

# Run tests
pnpm test

# Build everything
pnpm build

Requirements: Rust stable, Node.js 18+, pnpm

License

MIT
