pdf2html-client

v0.1.4

Advanced client-side PDF to HTML converter with WASM parsing, OCR support, and intelligent text layout reconstruction. Perfect for document management systems and web applications.

Key Features

This library was built primarily to support high-fidelity PDF/DOCX imports in Venmail Drive. Most PDF-to-HTML pipelines pick one tradeoff: either pixel-perfect output that is hard to edit, or “flow” output that drifts and overlaps. The goal here is a single, simple workflow for document imports that also works in offline-first applications.

pdf2html-client is built around a multi-mode text layout engine:

  • High fidelity when you need it (absolute/smart positioned text)
  • Editability when you want it (flow/outline-flow)
  • Semantic structure with layout awareness (semantic regions + flexbox)
  • Overlap-aware fallbacks for sensitive areas where reflow would break readability

All of this runs in the browser (via pdfium or unpdf).

Core capabilities

  • WASM PDF parsing
    • Primary: PDFium (WebAssembly)
    • Fallback/alternative: unpdf
    • Select via parserStrategy: 'auto' | 'pdfium' | 'unpdf'
  • Multiple text layout modes (see below)
  • Optional OCR for scanned PDFs
    • Uses onnxruntime-web + OpenCV.js
    • Automatically detects scanned PDFs and runs OCR only on those
  • Font detection + mapping
    • Detects fonts from extracted text
    • Maps fonts using an internal font catalog
  • Output formats
    • html, css, plus metadata (processing time, page count, OCR used, font mappings, image stats)

Installation

pnpm add pdf2html-client

Bundling & externals

  • Bundled by default: PDFium (primary) and unpdf (fallback) so core parsing works out-of-the-box.
  • External (you must provide):
    • pdfjs-dist (used as an additional parser path)
    • onnxruntime-web and @techstark/opencv-js (used only when OCR is enabled)

For bundlers, mark these packages as externals and treat them like peer dependencies. For UMD/CDN usage, make sure their scripts are available globally before loading the library.

OCR models (optional)

If you enable OCR, you should download the lightweight OCR models ahead of time:

pnpm run download-models

This downloads models into models/.

Getting Started

Choose your use case

I want to edit documents and extract content

import { PDF2HTML } from 'pdf2html-client';
const result = await PDF2HTML.convertForEditing(pdfFile);

I want pixel-perfect document display

import { PDF2HTML } from 'pdf2html-client';
const result = await PDF2HTML.convertForFidelity(pdfFile);

I want responsive web documents

import { PDF2HTML } from 'pdf2html-client';
const result = await PDF2HTML.convertForWeb(pdfFile);

Let the library choose the best option

import { PDF2HTML } from 'pdf2html-client';
const { result, presetUsed, reason } = await PDF2HTML.convertAuto(pdfFile);
console.log(`Used ${presetUsed} preset: ${reason}`);

What you get

console.log(result.html);  // HTML markup
console.log(result.css);   // CSS styles
console.log(result.text);  // Extracted text (if enabled)
console.log(result.metadata); // Processing info

That's it! The library handles everything else automatically.

Usage with externals

ESM / bundlers

Install peer dependencies and mark them as externals in your bundler config:

pnpm add pdfjs-dist onnxruntime-web @techstark/opencv-js

// Vite example
export default {
  build: {
    rollupOptions: {
      external: ['pdfjs-dist', 'onnxruntime-web', '@techstark/opencv-js']
    }
  }
}

import { PDF2HTML } from 'pdf2html-client';

const converter = new PDF2HTML({
  parserStrategy: 'auto', // or 'pdfjs' to explicitly use pdfjs-dist
  enableOCR: true // requires onnxruntime-web and @techstark/opencv-js
});

UMD / CDN

Load the external scripts before the library:

<!-- PDF.js (replace VERSION with the release you use) -->
<script src="https://cdn.jsdelivr.net/npm/pdfjs-dist@VERSION/build/pdf.min.js"></script>
<script>pdfjsLib.GlobalWorkerOptions.workerSrc = 'https://cdn.jsdelivr.net/npm/pdfjs-dist@VERSION/build/pdf.worker.min.js';</script>
<!-- OCR deps (only needed if enableOCR=true) -->
<script src="https://cdn.jsdelivr.net/npm/onnxruntime-web@VERSION/dist/ort.min.js"></script>
<script src="https://cdn.jsdelivr.net/npm/@techstark/opencv-js@VERSION/opencv.js"></script>
<!-- Then load pdf2html-client -->
<script src="./pdf2html-client.umd.js"></script>
<script>
  const { PDF2HTML } = window.PDF2HTML;
  const converter = new PDF2HTML({ enableOCR: true });
</script>

Notes:

  • PDF.js is only required when using parserStrategy: 'pdfjs' or when auto chooses it. If missing, the library falls back to bundled PDFium/unpdf.
  • OCR dependencies are only required when enableOCR: true. They are lazy-loaded on first OCR use.
  • Ensure OCR models are available at models/ or provide custom URLs via ocrConfig.
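
For UMD/CDN pages, it can help to confirm the external globals exist before turning OCR on. The sketch below is not part of pdf2html-client: `missingGlobals` is a hypothetical helper, and the global names `ort` and `cv` are the ones onnxruntime-web and OpenCV.js conventionally expose. This README does not confirm which globals the library looks up, so treat the names as assumptions.

```typescript
// Hypothetical helper (not part of pdf2html-client): report which
// expected globals are absent from a scope such as `window`.
function missingGlobals(scope: Record<string, unknown>, names: string[]): string[] {
  return names.filter((name) => scope[name] === undefined);
}

// Assumed global names: `ort` (onnxruntime-web) and `cv` (OpenCV.js).
const OCR_GLOBALS = ['ort', 'cv'];

// Stand-in scope where only onnxruntime-web has loaded.
const fakeWindow: Record<string, unknown> = { ort: {} };
const missing = missingGlobals(fakeWindow, OCR_GLOBALS);

// Only enable OCR when nothing is missing.
const enableOCR = missing.length === 0;
console.log(missing, enableOCR);
```

In a real page you would pass `window as unknown as Record<string, unknown>` instead of the stand-in scope and feed the resulting flag into `new PDF2HTML({ enableOCR })`.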

Quick start

Simple one-liner conversion

import { PDF2HTML } from 'pdf2html-client';

// For document editing (most common use case)
const editable = await PDF2HTML.convertForEditing(pdfFile);

// For high-fidelity display
const faithful = await PDF2HTML.convertForFidelity(pdfFile);

// For web-optimized responsive output
const responsive = await PDF2HTML.convertForWeb(pdfFile);

Using factory methods

import { PDF2HTML } from 'pdf2html-client';

// Create converter for specific use case
const converter = PDF2HTML.forEditing();
const result = await converter.convert(pdfFile);
converter.dispose();
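
Since every converter instance is followed by dispose(), one option is to wrap the pair so dispose() also runs when the conversion throws. This is a minimal sketch, not library code: `ConverterLike` and `convertAndDispose` are hypothetical names, and the stand-in converter exists only so the example runs without the package installed.

```typescript
// Structural type for this sketch; the real class is PDF2HTML.
interface ConverterLike<R> {
  convert(file: unknown): Promise<R>;
  dispose(): void;
}

// Hypothetical helper: run one conversion and always dispose the converter,
// whether the conversion resolves or rejects.
async function convertAndDispose<R>(converter: ConverterLike<R>, file: unknown): Promise<R> {
  try {
    return await converter.convert(file);
  } finally {
    converter.dispose();
  }
}

// Stand-in converter so the sketch runs without the library.
let disposed = false;
const fakeConverter: ConverterLike<{ html: string }> = {
  convert: async () => ({ html: '<p>ok</p>' }),
  dispose: () => { disposed = true; },
};

convertAndDispose(fakeConverter, null).then((out) => console.log(out.html, disposed));
```

With the real library, the call would look like `convertAndDispose(PDF2HTML.forEditing(), pdfFile)`, assuming the factory returns an object with the same two methods shown above.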

Chainable configuration

import { PDF2HTML } from 'pdf2html-client';

// Build configuration step by step
const converter = new PDF2HTML()
  .enableOCR(true)
  .enableFontMapping(true)
  .setTextLayout('semantic')
  .setPreserveLayout(true)
  .setResponsive(true)
  .setDarkMode(false)
  .setImageFormat('base64')
  .includeExtractedText(true)
  .setMaxConcurrentPages(2);

const result = await converter.convert(pdfFile);
converter.dispose();

Apply presets and customize

import { PDF2HTML } from 'pdf2html-client';

// Start with a preset and customize
const converter = new PDF2HTML()
  .applyPreset('editing')
  .setDarkMode(true)  // Override preset setting
  .setImageFormat('url');  // Override preset setting

const result = await converter.convert(pdfFile);
converter.dispose();

Advanced configuration

import { PDF2HTML } from 'pdf2html-client';

const converter = new PDF2HTML({
  enableOCR: false,
  enableFontMapping: false,
  parserStrategy: 'auto',
  htmlOptions: {
    format: 'html+inline-css',
    preserveLayout: true,
    responsive: false,
    darkMode: false,
    imageFormat: 'base64',
    textLayout: 'semantic', // Default mode - flow semantic with layout awareness
    textLayoutPasses: 2,
    textPipeline: 'v2',
    includeExtractedText: true
  }
});

const out = await converter.convert(pdfFile, (p) => {
  console.log(`${p.stage}: ${p.progress}%`);
});

console.log(out.html);
console.log(out.css);
console.log(out.metadata);

converter.dispose();
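
The progress callback may fire often on large documents. The sketch below shows one way to thin out updates before logging them. It assumes the payload carries a string `stage` and a numeric `progress`, which is all the example above reveals. `throttleProgress` and the stage names `parse` and `render` are invented for illustration.

```typescript
// Assumed shape of the progress payload: only `stage` and `progress`
// appear in the README; their types are a guess.
interface ProgressLike {
  stage: string;
  progress: number;
}

// Hypothetical helper: forward an update only when the stage changes or
// the rounded percentage has moved by at least `step` points.
function throttleProgress(
  report: (line: string) => void,
  step = 10
): (p: ProgressLike) => void {
  let lastStage = '';
  let lastPercent = -Infinity;
  return (p) => {
    const percent = Math.round(p.progress);
    if (p.stage !== lastStage || percent - lastPercent >= step) {
      lastStage = p.stage;
      lastPercent = percent;
      report(`${p.stage}: ${percent}%`);
    }
  };
}

// Stand-in updates so the sketch runs without the library.
const lines: string[] = [];
const onProgress = throttleProgress((line) => lines.push(line));
[
  { stage: 'parse', progress: 0 },
  { stage: 'parse', progress: 4 },
  { stage: 'parse', progress: 12 },
  { stage: 'render', progress: 13 },
].forEach((p) => onProgress(p));
console.log(lines); // [ 'parse: 0%', 'parse: 12%', 'render: 13%' ]
```

The returned function can be passed as the second argument to `converter.convert(pdfFile, ...)` in place of the inline callback.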

Common use cases

1. Document editing and content extraction

Perfect for document management systems where users need to edit and extract content from PDFs:

import { PDF2HTML } from 'pdf2html-client';

const converter = new PDF2HTML({
  enableOCR: true, // Handle scanned documents
  enableFontMapping: true, // Better font fidelity
  htmlOptions: {
    textLayout: 'flow', // Maximum editability
    preserveLayout: false, // Semantic HTML structure
    format: 'html+inline-css',
    responsive: true,
    includeExtractedText: true, // Easy copy-paste
    imageFormat: 'base64'
  }
});

const result = await converter.convert(pdfFile);
// result.html contains clean, editable semantic HTML
// result.text contains extracted text for search/indexing

2. High-fidelity document display

Ideal for document viewers and archival systems where visual accuracy is paramount:

import { PDF2HTML } from 'pdf2html-client';

const converter = new PDF2HTML({
  enableOCR: false, // Skip OCR for text PDFs
  enableFontMapping: true,
  htmlOptions: {
    textLayout: 'absolute', // Pixel-perfect positioning
    preserveLayout: true,
    format: 'html+inline-css',
    responsive: false,
    darkMode: false,
    imageFormat: 'base64',
    textLayoutPasses: 1, // Faster processing
    textPipeline: 'legacy' // Proven stability
  }
});

const result = await converter.convert(pdfFile);
// result.html maintains exact PDF visual layout

3. Web-optimized responsive documents

Best for web applications that need responsive, accessible documents:

import { PDF2HTML } from 'pdf2html-client';

const converter = new PDF2HTML({
  enableOCR: true,
  enableFontMapping: false, // Faster loading
  htmlOptions: {
    textLayout: 'semantic', // Best of both worlds
    preserveLayout: true,
    format: 'html+css', // Separate CSS for caching
    responsive: true,
    darkMode: true, // Support dark theme
    imageFormat: 'url', // Better performance
    useFlexboxLayout: true, // Modern layout
    semanticLayout: {
      blockGapFactor: 1.2,
      headingThreshold: 0.8
    }
  }
});

const result = await converter.convert(pdfFile);
// Responsive HTML that adapts to screen sizes
// Semantic structure for accessibility

Using configuration presets

For convenience, you can use pre-configured presets:

import { PDF2HTML, ConfigPresets } from 'pdf2html-client';

// Document editing preset
const editingConverter = new PDF2HTML(ConfigPresets.editing);

// High-fidelity display preset  
const fidelityConverter = new PDF2HTML(ConfigPresets.fidelity);

// Web-optimized preset
const webConverter = new PDF2HTML(ConfigPresets.web);

// You can also customize presets
const customConverter = new PDF2HTML({
  ...ConfigPresets.editing,
  htmlOptions: {
    ...ConfigPresets.editing.htmlOptions,
    darkMode: true // Override preset setting
  }
});
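
The nested spread above matters because a plain `{ ...preset, htmlOptions: { darkMode: true } }` would replace the preset's htmlOptions wholesale. A small helper can make that one-level merge reusable. This is a sketch with hypothetical names (`ConfigLike`, `withHtmlOptions`), and the stand-in preset values are illustrative rather than the actual contents of ConfigPresets.editing.

```typescript
// Minimal config shape for this sketch; the real type is PDF2HTMLConfig.
interface ConfigLike {
  enableOCR?: boolean;
  enableFontMapping?: boolean;
  htmlOptions?: Record<string, unknown>;
}

// Hypothetical helper: copy a preset and merge htmlOptions one level deep,
// so an override does not discard the preset's other HTML settings.
function withHtmlOptions(preset: ConfigLike, overrides: Record<string, unknown>): ConfigLike {
  return { ...preset, htmlOptions: { ...preset.htmlOptions, ...overrides } };
}

// Stand-in preset; the values are illustrative, not ConfigPresets.editing.
const examplePreset: ConfigLike = {
  enableOCR: true,
  htmlOptions: { textLayout: 'flow', darkMode: false },
};

const custom = withHtmlOptions(examplePreset, { darkMode: true });
console.log(custom.htmlOptions); // { textLayout: 'flow', darkMode: true }
```

The helper returns a new object and leaves the preset untouched, so the same preset can be reused for several converters.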

Output

convert() returns an HTMLOutput:

  • html: Generated markup
  • css: Generated styles
  • metadata: Page count, processing time, OCR usage, font mapping count, scan detection, and image stats
  • fonts: Font families referenced by output
  • text (optional): Extracted text (when htmlOptions.includeExtractedText is enabled)
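
When html and css come back as separate strings, they need to be combined before display. The sketch below wraps them in a full document, for example for an iframe's `srcdoc`. It assumes `html` is a body fragment and `css` is plain stylesheet text, which this README does not state. `OutputLike` and `toStandaloneDocument` are hypothetical names.

```typescript
// Fields taken from the list above; the real type is HTMLOutput.
interface OutputLike {
  html: string;
  css: string;
  text?: string;
}

// Hypothetical helper: wrap the output in a full HTML document.
// Assumes `html` is a body fragment and `css` is plain stylesheet text.
function toStandaloneDocument(out: OutputLike): string {
  return [
    '<!doctype html>',
    '<html>',
    `<head><meta charset="utf-8"><style>${out.css}</style></head>`,
    `<body>${out.html}</body>`,
    '</html>',
  ].join('\n');
}

// Stand-in output so the sketch runs without the library.
const doc = toStandaloneDocument({ html: '<p>Hello</p>', css: 'p { margin: 0; }' });
console.log(doc);
```

If the chosen format already inlines styles (`html+inline-css`), the extra style element may be redundant, so check the actual output before relying on this wrapper.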

Text layout modes

Set htmlOptions.textLayout:

Default: semantic, a flow-oriented semantic mode with layout awareness that aims for the best balance of editability and visual fidelity.

absolute

Best for maximum positional fidelity. Produces positioned text elements for precise placement.

smart

Positioned output with additional grouping/merging heuristics to reduce fragmentation while maintaining fidelity.

flow

Two behaviors depending on htmlOptions.preserveLayout:

  • preserveLayout: true
    • Produces "outline-flow" HTML that aims to be editable while still matching layout constraints.
  • preserveLayout: false
    • Produces semantic HTML (paragraphs/headings/lists) for maximum reflow/editability.

semantic (default)

Produces semantic regions/lines designed for editing while still anchored to the original PDF layout.

When preserveLayout: true, semantic mode renders positioned regions and then uses:

  • Flexbox line layout (when safe)
  • Automatic fallback to absolute positioning when overlap risk or sensitive geometry is detected

This mode is designed to prevent vertical text overlaps without losing fidelity.

textRenderMode: 'svg'

For special cases, you can render text through an SVG text layer when preserveLayout is enabled.
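
The mode descriptions in this section can be condensed into a small lookup from intent to options. This is only a sketch: the goal names are invented, `layoutFor` is a hypothetical helper, and pairing `smart` with preserveLayout: true is an assumption, since the README only describes preserveLayout for the flow and semantic modes and uses it with absolute in the fidelity example.

```typescript
type TextLayout = 'absolute' | 'smart' | 'flow' | 'semantic';
type Goal =
  | 'exact-placement'
  | 'less-fragmentation'
  | 'editable-with-layout'
  | 'max-reflow'
  | 'balanced';

interface LayoutChoice {
  textLayout: TextLayout;
  preserveLayout: boolean;
}

// Hypothetical lookup: the goal names are invented for this sketch.
const LAYOUT_BY_GOAL: Record<Goal, LayoutChoice> = {
  'exact-placement': { textLayout: 'absolute', preserveLayout: true },
  'less-fragmentation': { textLayout: 'smart', preserveLayout: true }, // assumption
  'editable-with-layout': { textLayout: 'flow', preserveLayout: true }, // "outline-flow"
  'max-reflow': { textLayout: 'flow', preserveLayout: false }, // semantic HTML
  balanced: { textLayout: 'semantic', preserveLayout: true }, // default mode
};

// Return a copy so callers can tweak the result without touching the table.
function layoutFor(goal: Goal): LayoutChoice {
  return { ...LAYOUT_BY_GOAL[goal] };
}

console.log(layoutFor('max-reflow')); // { textLayout: 'flow', preserveLayout: false }
```

The returned object can be spread into htmlOptions alongside the other settings.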

Configuration reference

The top-level constructor takes PDF2HTMLConfig.

OCR

  • enableOCR: boolean
  • ocrConfig?: { confidenceThreshold: number; language?: string; preprocess?: boolean; autoRotate?: boolean }
  • ocrProcessorOptions?: { batchSize?: number; maxConcurrent?: number; timeout?: number }

OCR only runs when the document is detected as scanned.

Font mapping

  • enableFontMapping: boolean
  • fontMappingOptions?: { strategy: 'exact' | 'similar' | 'fallback'; similarityThreshold: number; cacheEnabled: boolean }

Parser

  • parserStrategy?: 'auto' | 'pdfium' | 'unpdf'
  • parserOptions?: { extractText: boolean; extractImages: boolean; extractGraphics: boolean; extractForms: boolean; extractAnnotations: boolean }

HTML generation

htmlOptions?: HTMLGenerationOptions (high-level knobs):

  • format: 'html' | 'html+css' | 'html+inline-css'
  • preserveLayout: boolean
  • responsive: boolean
  • darkMode: boolean
  • imageFormat: 'base64' | 'url'
  • textLayout?: 'absolute' | 'smart' | 'flow' | 'semantic'
  • textLayoutPasses?: 1 | 2
  • textRenderMode?: 'html' | 'svg'
  • textPipeline?: 'legacy' | 'v2'
  • includeExtractedText?: boolean
  • textClassifierProfile?: string
  • semanticLayout?: { blockGapFactor?: number; headingThreshold?: number; maxHeadingLength?: number }
  • useFlexboxLayout?: boolean

Performance

  • maxConcurrentPages?: number (default: 4)
  • cacheEnabled?: boolean
  • wasmMemoryLimit?: number
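
One way to choose maxConcurrentPages is to derive it from the number of logical cores, leaving one free for the UI thread and never exceeding the documented default of 4. The heuristic and the helper name `pickMaxConcurrentPages` are assumptions made for this sketch, not guidance from the library.

```typescript
// Hypothetical helper: pick maxConcurrentPages from the core count,
// leaving one core free and never exceeding `cap`.
function pickMaxConcurrentPages(cores: number | undefined, cap = 4): number {
  // Unknown, invalid, or single-core environments: process one page at a time.
  if (!cores || cores < 2) return 1;
  return Math.min(cap, cores - 1);
}

// Stand-in core counts so the sketch runs outside a browser.
const perfConfig = { maxConcurrentPages: pickMaxConcurrentPages(8) };
console.log(perfConfig); // { maxConcurrentPages: 4 }
console.log(pickMaxConcurrentPages(3)); // 2
```

In a browser, `navigator.hardwareConcurrency` can supply the core count.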

Demo

pnpm run demo

Testing

pnpm test
pnpm run test:browser

Browser tests are designed to catch layout regressions, especially text overlaps in semantic layouts.
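
To illustrate the kind of check an overlap regression test could make, the sketch below finds pairs of boxes that share area. It is not the project's test code. `Box`, `boxesOverlap`, and `findOverlaps` are hypothetical names, and the sample coordinates are made up.

```typescript
// Same edge names as a DOMRect, so getBoundingClientRect() results fit.
interface Box {
  top: number;
  bottom: number;
  left: number;
  right: number;
}

// True when two boxes share area; boxes that only touch at an edge do not count.
function boxesOverlap(a: Box, b: Box): boolean {
  return a.left < b.right && b.left < a.right && a.top < b.bottom && b.top < a.bottom;
}

// Return index pairs [i, j] with i < j for every overlapping pair.
// Quadratic in the number of boxes, which is fine for a page of text lines.
function findOverlaps(boxes: Box[]): Array<[number, number]> {
  const pairs: Array<[number, number]> = [];
  boxes.forEach((a, i) => {
    boxes.slice(i + 1).forEach((b, k) => {
      if (boxesOverlap(a, b)) pairs.push([i, i + 1 + k]);
    });
  });
  return pairs;
}

// Three made-up text lines: the second starts before the first one ends.
const sampleLines: Box[] = [
  { top: 0, bottom: 12, left: 0, right: 100 },
  { top: 10, bottom: 22, left: 0, right: 100 },
  { top: 24, bottom: 36, left: 0, right: 100 },
];
console.log(findOverlaps(sampleLines)); // [ [ 0, 1 ] ]
```

A browser test could collect boxes from the rendered text elements with getBoundingClientRect() and assert that the returned list is empty.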

Project structure

src/
  core/      PDF parsing + layout analysis
  html/      HTML/CSS generation + layout engines
  fonts/     Font detection + mapping
  ocr/       OCR engine + processing
  types/     Public types
demo/        React demo app
tests/       Unit + browser tests

Roadmap

  • Finish the PDF.js fallback
  • Add more font mappings
  • Better tables (structure + export)
  • Richer forms/annotations rendering
  • Expanded vector graphics support
  • More layout profiles and tuning presets

License

MIT