pdfvision

v0.7.1

Extract text, metadata, and page images from PDF files. Designed for AI agents.

🔍 pdfvision turns any PDF into AI-friendly output — text, metadata, structured layout, and rendered page images — in a single CLI / library built for agents.

Mission: any PDF, read accurately by an AI agent. No silent gaps, no "looks fine but the body was an image" failures.

💡 Why pdfvision

PDF tooling has historically been built for humans copying text into a document. Agents need different things: to know whether the extraction actually captured the content, to hand visual pages to a vision model in the same step, and to receive raw structural signals rather than one pre-formatted answer they can't second-guess.

pdfvision is built around that gap. The goal is to deliver every signal a PDF carries, in a form the agent can act on, and never silently hide that the extraction came up short.

  • Silent-failure visibility. Every page reports charCount, imageCount, and textCoverage, so an agent can tell at a glance that "this slide is an image, not text" — and decide to re-run with --render or --ocr instead of trusting an empty string.
  • Multimodal handoff in one step. --render writes PNG paths the agent can pass straight to a vision model — no second tool, no temp-file plumbing.
  • Raw structural signals. --layout returns blocks tagged with a role (for example role: 'heading'), a repeated: true flag on running headers and footers, and multi-column reading order. --image-boxes reports where each raster draw lands. The agent picks which signals matter; pdfvision doesn't bake in a single answer.
  • OCR when text alone isn't enough. --ocr runs tesseract.js on each page and attaches pages[].ocr (text + confidence + lang) alongside the native pdfjs text — agents diff the two to detect scanned / image-flattened pages without losing the primary signal.
  • Compatibility codepoints handled. Japanese and scientific PDFs full of Unicode compatibility characters collapse to canonical forms by default (NFKC normalization; disable with --no-normalize). The pre-normalisation text stays available in rawText when a diff matters.
  • Cache-first. Same PDF, second read takes ~30 ms. Agents that revisit a PDF dozens of times across a session pay the parsing cost once.
  • URLs are first-class. --remote https://… downloads, caches, and extracts in one flag.
  • Tag-shaped output too. The xml format carries the same data as json but as <page> / <text> tags, which some LLMs locate more reliably than nested object keys.
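
The silent-failure check from the first bullet can be sketched as a small triage function. This is a minimal sketch, not pdfvision's own logic: the PageSignals shape, the assumption that textCoverage is a 0 to 1 ratio, and both thresholds are hypothetical, so check the DocumentResult schema for where these fields actually live.

```typescript
// Hypothetical shape: the README names these three per-page signals, but
// their exact location in PageResult and the 0..1 range are assumptions.
interface PageSignals {
  page: number;
  charCount: number;
  imageCount: number;
  textCoverage: number; // assumed: fraction of the page covered by text
}

type NextStep = 'trust-text' | 'rerun-with-render-or-ocr';

// Flag pages that draw images but yield little or no extractable text.
// The thresholds are illustrative defaults, not values from pdfvision.
function triagePage(
  p: PageSignals,
  minChars = 20,
  minCoverage = 0.05,
): NextStep {
  const looksImageOnly =
    p.imageCount > 0 &&
    (p.charCount < minChars || p.textCoverage < minCoverage);
  return looksImageOnly ? 'rerun-with-render-or-ocr' : 'trust-text';
}

// Made-up sample data: one text page, one page that is a single image.
const samplePages: PageSignals[] = [
  { page: 1, charCount: 1840, imageCount: 0, textCoverage: 0.42 },
  { page: 2, charCount: 0, imageCount: 1, textCoverage: 0 },
];

for (const p of samplePages) {
  console.log(`page ${p.page}: ${triagePage(p)}`);
}
```

The function only reports what it saw; whether to re-run with --render or --ocr stays with the caller, in keeping with the design principle stated next.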

The design principle is agent decides; pdfvision delivers raw signals. No auto-detect heuristics that decide for the agent and hide what the PDF actually contained.

🚀 Quick Start

# Try without installing
npx pdfvision document.pdf

# Pull from a URL
npx pdfvision --remote https://raw.githubusercontent.com/mozilla/pdf.js-sample-files/master/tracemonkey.pdf -f json

# Or install globally
npm install -g pdfvision
pdfvision document.pdf

📖 Usage

pdfvision <file.pdf> [options]
pdfvision --remote <url> [options]
pdfvision --clear-cache

Options:
  -p, --pages <range>     Page range (e.g. "1-5", "3", "1,3,5")
  -f, --format <type>     Output format: markdown (default), json, xml
  -r, --render            Render pages as PNG images
      --render-output <dir>
                          Directory for rendered PNGs (requires --render)
      --geometry          Emit per-text-item bbox + font size in pages[].spans (json/xml)
      --layout            Reconstruct lines + blocks (with role / repeated flags) in pages[].layout
      --image-boxes       Emit per-image bbox in pages[].imageBoxes
      --ocr               Run tesseract.js OCR; attach pages[].ocr (text/confidence/lang)
      --ocr-lang <lang>   Tesseract lang(s), plus-separated (e.g. eng+jpn). Default: eng
      --remote <url>      Download an http(s) PDF into the cache, then extract
      --no-cache          Skip the on-disk cache
      --no-normalize      Disable Unicode NFKC normalization (default: on)
      --clear-cache       Wipe every cached extraction, render, and remote download, then exit
  -v, --version           Show version
  -h, --help              Show this help
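
For context on --no-normalize, the snippet below shows what Unicode NFKC normalization does, using the standard String.prototype.normalize built into JavaScript. It illustrates the normalization form only and makes no claim about pdfvision's internals. The sample characters and the normalizeText helper are chosen for illustration.

```typescript
// Each sample is a Unicode compatibility codepoint, written as an escape so
// the snippet survives tools that mangle such characters.
const samples: Array<[string, string]> = [
  ['LATIN SMALL LIGATURE FI', '\uFB01'],
  ['CIRCLED DIGIT ONE', '\u2460'],
  ['SQUARE KG', '\u338F'],
  ['HALFWIDTH KATAKANA LETTER KA', '\uFF76'],
];

// NFKC folds compatibility forms into their canonical equivalents.
// (Hypothetical helper name; it only wraps the built-in method.)
function normalizeText(raw: string): string {
  return raw.normalize('NFKC');
}

for (const [label, raw] of samples) {
  // Keep both forms side by side, in the spirit of rawText next to text.
  console.log(label, JSON.stringify({ rawText: raw, text: normalizeText(raw) }));
}
```

With these inputs the ligature becomes the two letters "fi", the circled digit becomes "1", the square unit becomes "kg", and the halfwidth katakana becomes its fullwidth form.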

Output formats

  • markdown (default) — per-page sections, density Overview table, image links inline. For LLM context windows.
  • json — full DocumentResult schema. For programmatic consumers.
  • xml — same data as JSON but tag-shaped. For LLMs that locate <page> / <text> tags more reliably than nested object keys.

Examples

# Specific pages as JSON
pdfvision document.pdf -p 1-3 -f json

# Render PNGs into ./images for a multimodal LLM
pdfvision document.pdf -r --render-output ./images

# Layout + image bboxes — agent reconstructs reading order itself
pdfvision document.pdf --layout --image-boxes -f json

# Per-text-item geometry (bbox + fontSize per glyph run)
pdfvision document.pdf -f json --geometry

# OCR a scanned PDF (multi-language)
pdfvision scan.pdf --ocr --ocr-lang eng+jpn -f json
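
Once --ocr output is available, the native-versus-OCR comparison mentioned earlier could look roughly like this. It is a sketch under stated assumptions: the PageLike shape is hypothetical apart from the text / confidence / lang field names listed for pages[].ocr, the 0 to 100 confidence scale is assumed, and the thresholds are arbitrary.

```typescript
// Field names text / confidence / lang follow the README; everything else
// about these shapes is an assumption made for illustration.
interface PageOcrLike {
  text: string;
  confidence: number; // assumed to be on a 0..100 scale
  lang: string;
}

interface PageLike {
  page: number;
  text: string; // native text layer
  ocr?: PageOcrLike;
}

// Count non-whitespace characters so layout differences do not skew the diff.
const visibleChars = (s: string): number => s.replace(/\s+/g, '').length;

// A page looks scanned when confident OCR finds far more text than the
// native text layer did. Both thresholds are illustrative.
function looksScanned(p: PageLike, minConfidence = 60, ratio = 3): boolean {
  if (!p.ocr || p.ocr.confidence < minConfidence) return false;
  const native = visibleChars(p.text);
  const recognised = visibleChars(p.ocr.text);
  return recognised >= ratio * Math.max(native, 1);
}

// Made-up sample: an empty text layer but a confident OCR result.
const scannedPage: PageLike = {
  page: 1,
  text: '',
  ocr: { text: 'Quarterly report 2024', confidence: 91, lang: 'eng' },
};
console.log(looksScanned(scannedPage));
```

Keeping the native text untouched and treating OCR as a second opinion matches the intent of attaching pages[].ocr alongside the primary signal.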

Coordinates use a top-down origin (0,0 at the top-left, y grows downward) in PDF user-space points so callers can overlay spans / image bboxes directly on the rendered PNG. Multiply by image.width / page.width to map onto pixels.
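
The point-to-pixel mapping described above amounts to one uniform scale factor. A minimal sketch, assuming a hypothetical Box shape (the actual bbox layout in spans / imageBoxes may differ) and a render that preserves the page's aspect ratio:

```typescript
// Hypothetical bbox shape in PDF user-space points, top-left origin.
interface Box {
  x: number;
  y: number;
  width: number;
  height: number;
}

// Scale a box from page points to rendered-image pixels.
// The factor is image.width / page.width, as the paragraph above describes.
function toPixels(box: Box, pageWidthPt: number, imageWidthPx: number): Box {
  const scale = imageWidthPx / pageWidthPt;
  return {
    x: box.x * scale,
    y: box.y * scale,
    width: box.width * scale,
    height: box.height * scale,
  };
}

// Example: a 612 pt wide page (US Letter) rendered 1224 px wide gives scale 2.
const span: Box = { x: 72, y: 100, width: 200, height: 12 };
console.log(toPixels(span, 612, 1224));
// → { x: 144, y: 200, width: 400, height: 24 }
```

Because both coordinate systems share a top-left origin, no y-axis flip is needed before drawing the box on the PNG.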

📚 Library API

import { processDocument } from 'pdfvision';

const result = await processDocument('./document.pdf', { pages: '1-3', render: true });

console.log(result.totalPages);          // number
console.log(result.metadata.title);      // string | null
for (const page of result.pages) {
  console.log(page.page, page.text);     // typed access, no JSON.parse
  if (page.image) console.log(page.image); // PNG path on disk when render: true
}

processFile() returns the same string output the CLI prints (markdown / json / xml).

Exports: processDocument, processFile, parsePageRange, plus full type definitions for DocumentResult / PageResult / PageOverview / PageQuality / DocumentMetadata / ProcessDocumentOptions / ProcessOptions / OutputFormat / TextSpan / LayoutBlock / LayoutLine / PageLayout / ImageBox / PageOcr.

🤖 Agent Skill

pdfvision ships a bundled agent skill at skills/pdfvision/ (a SKILL.md plus a small references/ set) so a Claude Code, Codex, or Cursor session knows when to reach for the CLI and how to pick flags. Install it with npx skills:

# Project install (default) — drops the skill into <cwd>/.claude/skills/pdfvision/
npx skills add yamadashy/pdfvision

# Global install — drops it into ~/.claude/skills/pdfvision/ instead
npx skills add yamadashy/pdfvision -g

The skill covers the daily extraction flow and the density-Overview-based silent-failure detection. It points at references/structured-output.md (full DocumentResult schema for programmatic consumers) and references/ocr.md (multi-language OCR, traineddata, troubleshooting) only when those specific cases apply.

💾 Caching

Results land under <os-tmp>/pdfvision/<sha256-prefix>/, keyed by file content. Cache entries are created with POSIX 0700 / 0600 permissions and include symlink/TOCTOU defences. Override the location with PDFVISION_CACHE_DIR=/path or wipe everything with pdfvision --clear-cache.
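
To make content-keyed caching concrete, here is a sketch of how such a directory name could be derived with Node's built-in modules. It is not pdfvision's implementation: the cacheDirFor helper, the 16-character prefix length, and the way the PDFVISION_CACHE_DIR override replaces the root are all assumptions.

```typescript
import { createHash } from 'node:crypto';
import { tmpdir } from 'node:os';
import { join } from 'node:path';

// Hypothetical helper: derive a cache directory from the file's bytes.
// The prefix length and override semantics are assumptions for illustration.
function cacheDirFor(
  content: Uint8Array,
  prefixLength = 16,
  env: Record<string, string | undefined> = process.env,
): string {
  const root = env.PDFVISION_CACHE_DIR ?? join(tmpdir(), 'pdfvision');
  const digest = createHash('sha256').update(content).digest('hex');
  return join(root, digest.slice(0, prefixLength));
}

// Identical bytes map to the same directory regardless of file name or path.
const bytes = new TextEncoder().encode('abc');
console.log(cacheDirFor(bytes));
```

Keying on content rather than path means a renamed or re-downloaded copy of the same PDF reuses the existing entry, while any byte-level change produces a new one.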

🛠️ Requirements

  • Node.js >= 22.13.0
  • @napi-rs/canvas (installed automatically; ships prebuilt binaries for common platforms)
  • tesseract.js is installed as an optional dependency and only loaded when --ocr is requested. Skip it with npm install --omit=optional if you don't need OCR.

📜 License

MIT © yamadashy