extract-pdf

v0.1.0

Published

17 days ago

Convert a PDF (URL or ArrayBuffer) into clean HTML with structural tagging — headings, lists, footnotes, code blocks, bold/italic. Works in Node.js, Cloudflare Workers, and browser environments.

vtempest

pdf html pdf-parser pdfjs pdf-to-text document-extraction

pdf-to-html

Converts a PDF (URL or ArrayBuffer) into clean HTML with structural tagging — headings, lists, footnotes, code blocks, bold/italic, and Table of Contents entries. Works in Node.js, Cloudflare Workers, and browser environments via pdfjs-serverless.

Usage

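import { convertPDFToHTML } from "./pdf-to-html";

// Convert directly from a URL
const { html, title, author } = await convertPDFToHTML(
  "https://example.com/paper.pdf",
);

// Or pass an ArrayBuffer (e.g. from fetch or fs.readFile)
const response = await fetch("https://example.com/paper.pdf");
const buffer = await response.arrayBuffer();
const { html: pagedHtml } = await convertPDFToHTML(buffer, { addPageNumbers: true });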

Options

| Option | Default | Description |
| ---------------- | ------- | ------------------------------------------------------------------------------------------ |
| addPageNumbers | false | Inserts [n] markers at each page boundary |
| addCitation | true | Reads PDF metadata and first-page heading to populate title/author in the return value |

Return value

{ html: string, title?: string, author?: string, format: "pdf" }
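
Because title and author are optional, callers need a fallback when the PDF carries no usable metadata. The following is a minimal sketch of one way to consume the result; `ConvertResult`, `escapeHtml`, and `wrapAsDocument` are hypothetical names used for illustration and are not part of the package.

```typescript
// Hypothetical type name mirroring the documented return shape.
interface ConvertResult {
  html: string;
  title?: string;
  author?: string;
  format: "pdf";
}

// Escape text before placing it in HTML element content or attribute values.
function escapeHtml(text: string): string {
  return text
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;");
}

// Wrap the converted fragment in a full page, falling back when metadata is missing.
function wrapAsDocument(result: ConvertResult): string {
  const title = escapeHtml(result.title ?? "Untitled");
  const authorMeta = result.author
    ? `<meta name="author" content="${escapeHtml(result.author)}">`
    : "";
  return (
    `<!doctype html><html><head><title>${title}</title>${authorMeta}</head>` +
    `<body>${result.html}</body></html>`
  );
}
```

The html field is inserted as-is, since it is already markup; only the metadata strings are escaped.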

Pipeline

The conversion runs a sequential chain of transformations on a ParseResult (pages → items):

Raw pdfjs text spans
  → CalculateGlobalStats   — font heights, distances, format map
  → CompactLines           — merge spans on the same y-line into LineItems
  → RemoveRepetitiveElements — strip recurring page headers/footers
  → VerticalToHorizontal   — rotate vertical character runs
  → DetectTOC              — identify Table of Contents pages, link headings
  → DetectHeaders          — classify items as H1–H6 by font height
  → DetectListItems        — detect bullet/numbered list items
  → GatherBlocks           — group adjacent same-type lines into blocks
  → DetectCodeQuoteBlocks  — mark indented blocks as CODE
  → DetectListLevels       — add indentation for nested list levels
  → ToTextBlocks           — flatten blocks to { category, text } pairs
  → ToHTML                 — render pairs as <p>, <h1>–<h6>, <ul>, <code>
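
To make two of these stages concrete, here is a toy sketch of the ideas behind CompactLines and DetectHeaders. It is not the package's code: the `Span` and `Line` shapes, the y tolerance, and the "most common height is body text" rule are all assumptions chosen for illustration, and the real heuristics are likely more involved.

```typescript
// Simplified, hypothetical shapes; the package's TextItem and LineItem carry more fields.
interface Span { text: string; x: number; y: number; height: number; }
interface Line { text: string; y: number; height: number; }

// CompactLines idea: spans whose y differs by at most `tolerance` share one line.
function compactLines(spans: Span[], tolerance = 1): Line[] {
  const groups: { y: number; spans: Span[] }[] = [];
  for (const span of spans) {
    const group = groups.find((g) => Math.abs(g.y - span.y) <= tolerance);
    if (group) group.spans.push(span);
    else groups.push({ y: span.y, spans: [span] });
  }
  return groups
    .sort((a, b) => b.y - a.y) // PDF y grows upward, so larger y is higher on the page
    .map((g) => {
      const ordered = [...g.spans].sort((a, b) => a.x - b.x); // left to right
      return {
        text: ordered.map((s) => s.text).join(" "),
        y: g.y,
        height: Math.max(...ordered.map((s) => s.height)),
      };
    });
}

// DetectHeaders idea: treat the most common height as body text and
// rank the taller heights as h1, h2, ... (at most six levels).
function detectHeaders(lines: Line[]): Map<number, string> {
  const counts = new Map<number, number>();
  for (const l of lines) counts.set(l.height, (counts.get(l.height) ?? 0) + 1);
  const bodyHeight =
    [...counts.entries()].sort((a, b) => b[1] - a[1])[0]?.[0] ?? 0;
  const headingHeights = [...counts.keys()]
    .filter((h) => h > bodyHeight)
    .sort((a, b) => b - a)
    .slice(0, 6);
  return new Map(
    headingHeights.map((h, i) => [h, `h${i + 1}`] as [number, string]),
  );
}
```

Given spans for a 24-unit title and two 10-unit body lines, `compactLines` yields three lines and `detectHeaders` maps height 24 to "h1" while leaving height 10 unclassified.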

Folder structure

pdf-to-html/
  pdf-to-html.ts          — main entry point (convertPDFToHTML)
  models/                 — data classes: Page, ParseResult, TextItem,
  │                         LineItem, LineItemBlock, Word, BlockType, …
  transformations/
  │  Transformation.ts              — abstract base
  │  CalculateGlobalStats.ts
  │  ToTextBlocks.ts
  │  ToHTML.ts
  │  line-item/           — per-line-item transformations
  │  line-item-block/     — per-block transformations
  util/
     string-functions.ts
     page-item-functions.ts
     page-number-functions.ts
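
The layout suggests that every stage extends a shared abstract base and that the entry point folds a ParseResult through the stages in order. A minimal sketch of that pattern follows; the `transform` signature, the reduced `ParseResult` shape, and the two example stages (`TrimItems`, `DropEmptyItems`) are assumptions for illustration, not the package's actual classes.

```typescript
// Hypothetical, reduced ParseResult; the real class in models/ holds richer page data.
interface ParseResult {
  pages: { items: string[] }[];
  messages: string[]; // names of the stages that have run so far
}

// Abstract base in the spirit of transformations/Transformation.ts (signature assumed).
abstract class Transformation {
  abstract readonly name: string;
  abstract transform(input: ParseResult): ParseResult;
}

class TrimItems extends Transformation {
  readonly name = "TrimItems";
  transform(input: ParseResult): ParseResult {
    return {
      pages: input.pages.map((p) => ({ items: p.items.map((i) => i.trim()) })),
      messages: [...input.messages, this.name],
    };
  }
}

class DropEmptyItems extends Transformation {
  readonly name = "DropEmptyItems";
  transform(input: ParseResult): ParseResult {
    return {
      pages: input.pages.map((p) => ({
        items: p.items.filter((i) => i.length > 0),
      })),
      messages: [...input.messages, this.name],
    };
  }
}

// Run the stages sequentially; each stage receives the previous stage's output.
function runPipeline(initial: ParseResult, steps: Transformation[]): ParseResult {
  return steps.reduce((result, step) => step.transform(result), initial);
}
```

Returning a new ParseResult from each stage, rather than mutating the input, keeps stages easy to test in isolation and makes their order the only coupling between them.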
