@okrapdf/pdfdom

v0.1.0, published 3 months ago

jQuery for PDFs - query document entities with CSS-like selectors

Vulnerabilities: 0 High, 0 Medium, 0 Low

steventsao713

pdf query jquery document ocr extraction dom selector

pdfquery

jQuery for PDFs. CSS selectors and vision models on any PDF.

npm install pdfquery @okrapdf/pdfdom-plugins

Headless PDF DOM runtime

The local runtime takes a source plus an explicit parser schedule and writes boring DOM-compatible markup. That keeps the first milestone compatible with existing tools instead of inventing a query shell.

pdfdom create ./invoice.pdf --parser text-layer -o html | htmlq --text 'page'
pdfdom create ./invoice.pdf --parser text-layer -o html | htmlq --attribute data-page 'page'
pdfdom create ./invoice.pdf --parser text-layer -o html | xidel -s - -e "css('page')"

import { openPdfDomRuntime, textLayerNodeParser } from '@okrapdf/pdfdom';

const runtime = openPdfDomRuntime({
  source: { text: 'Invoice total: $42' },
  parsers: [textLayerNodeParser],
});
await runtime.ready;

console.log(runtime.serialize());
// <document ...>Invoice total: $42</document>

Parsers are node parsers: they declare what input they need, what capabilities they provide, and any host/network requirements. Generic parsers can stay fully local; specialized parsers can be explicit. For example, an arxiv-paper parser can declare network: { mode: 'required', allowedHosts: ['arxiv.org'] }, read source.metadata.sourceUrl, fetch the arXiv PDF/HTML variants, and still mount the result as the same <document><article>...</article></document> tree.
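To make the declaration idea concrete, here is a minimal sketch of what a parser declaration and a host check could look like. The `ParserDeclaration` and `NetworkPolicy` types, the `needs`/`provides` field names, the `mode: 'none'` variant, and the `isFetchAllowed` helper are all hypothetical and are not taken from the package's API; only the `network: { mode: 'required', allowedHosts: ['arxiv.org'] }` shape comes from the paragraph above.

```typescript
// Hypothetical shapes for illustration; not the actual @okrapdf/pdfdom types.
type NetworkPolicy =
  | { mode: 'none' } // assumed variant for fully local parsers
  | { mode: 'required'; allowedHosts: string[] };

interface ParserDeclaration {
  name: string;
  needs: string[];    // inputs the parser requires
  provides: string[]; // capabilities it contributes
  network: NetworkPolicy;
}

const arxivPaperParser: ParserDeclaration = {
  name: 'arxiv-paper',
  needs: ['source.metadata.sourceUrl'],
  provides: ['article'],
  network: { mode: 'required', allowedHosts: ['arxiv.org'] },
};

// Returns true only if the declaration permits fetching the given URL.
// Assumption: a host matches when it equals an allowed host or is a subdomain of it.
function isFetchAllowed(decl: ParserDeclaration, url: string): boolean {
  const policy = decl.network;
  if (policy.mode !== 'required') return false;
  const host = new URL(url).hostname;
  return policy.allowedHosts.some(
    (allowed) => host === allowed || host.endsWith('.' + allowed),
  );
}
```

A host scheduler could run a check like this before handing a fetch capability to the parser, so a parser that declares no network access never gets one.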

The runtime owns scheduling, parser lifecycle, DOM mounting, events, snapshots, and serialization. The DOM is the in-memory runtime object; -o html materializes it as streamable markup for tools like htmlq, xidel, Cheerio, jsdom, browser querySelectorAll, and LLM CSS selectors. Use -o json when you want the runtime snapshot/object instead.
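As a rough illustration of what "materializing" an in-memory tree as markup involves, the sketch below serializes a tiny tree to a string. `SketchNode`, `escapeHtml`, and `serializeNode` are hypothetical names; the runtime's real node type and serializer are not shown in this README.

```typescript
// Hypothetical minimal tree node; not the runtime's actual node type.
interface SketchNode {
  tag: string;
  attrs?: Record<string, string | number>;
  text?: string;
  children?: SketchNode[];
}

// Escape characters that would otherwise break the markup.
function escapeHtml(value: string): string {
  return value
    .replace(/&/g, '&amp;')
    .replace(/</g, '&lt;')
    .replace(/>/g, '&gt;')
    .replace(/"/g, '&quot;');
}

// Depth-first serialization: open tag, own text, children, close tag.
function serializeNode(node: SketchNode): string {
  const attrs = Object.entries(node.attrs ?? {})
    .map(([key, value]) => ` ${key}="${escapeHtml(String(value))}"`)
    .join('');
  const text = node.text ? escapeHtml(node.text) : '';
  const children = (node.children ?? []).map(serializeNode).join('');
  return `<${node.tag}${attrs}>${text}${children}</${node.tag}>`;
}

const doc: SketchNode = {
  tag: 'document',
  children: [{ tag: 'page', attrs: { 'data-page': 1 }, text: 'Invoice total: $42' }],
};

console.log(serializeNode(doc));
// <document><page data-page="1">Invoice total: $42</page></document>
```

Because the output is ordinary markup, any selector engine that accepts custom element names can query it without knowing about PDFs.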

Tree-first query (no session)

The document tree is the source of truth. pdfquery is the query/mutation lens over that tree.

import pdfquery from '@okrapdf/pdfdom';

const $ = pdfquery(tree, {
  // Optional: persist patches to your durable tree
  onPatches: (patches) => fetch('/parties/document/my-room/tree', {
    method: 'PATCH',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ patches, reason: 'ui-mutation' }),
  }),
});

Selectors, text search, and aggregation run locally against the tree, with no network round trip:

$('table').count();                        // 12 tables found locally
$('ocr').contains('revenue').texts();      // text search across all pages
$('[confidence>0.9]').count();             // filter by OCR confidence
$('*').onPage(1).countByType();            // Map { ocr: 45, table: 2, heading: 3 }
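The queries above boil down to filtering and grouping tree nodes. The sketch below shows one way an attribute comparison such as `[confidence>0.9]`, a page filter, a text search, and a per-type count could be evaluated over a flat list. The `Entity` shape, the sample data, and the `confidenceFilter` and `countByType` helpers are hypothetical, and the case-insensitive text match is an assumption; none of this is the library's actual implementation.

```typescript
// Hypothetical flat entity list; the real tree shape is not documented here.
interface Entity {
  type: string;
  page: number;
  confidence: number;
  text: string;
}

const entities: Entity[] = [
  { type: 'ocr', page: 1, confidence: 0.97, text: 'Total revenue' },
  { type: 'ocr', page: 1, confidence: 0.62, text: 'Revneue' },
  { type: 'table', page: 1, confidence: 0.93, text: '| Q1 | Q2 |' },
  { type: 'heading', page: 2, confidence: 0.99, text: 'Revenue by segment' },
];

// Parses an attribute comparison such as '[confidence>0.9]' into a predicate.
function confidenceFilter(selector: string): (e: Entity) => boolean {
  const match = /^\[confidence(>=|<=|>|<)([0-9.]+)\]$/.exec(selector);
  if (!match) throw new Error(`Unsupported selector: ${selector}`);
  const op = match[1];
  const threshold = Number(match[2]);
  return (e) => {
    switch (op) {
      case '>': return e.confidence > threshold;
      case '>=': return e.confidence >= threshold;
      case '<': return e.confidence < threshold;
      default: return e.confidence <= threshold;
    }
  };
}

// Groups entities by type and counts each group.
function countByType(items: Entity[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const item of items) {
    counts.set(item.type, (counts.get(item.type) ?? 0) + 1);
  }
  return counts;
}

const confident = entities.filter(confidenceFilter('[confidence>0.9]'));
const onPageOne = entities.filter((e) => e.page === 1);
// Assumption: text search is case-insensitive.
const mentionsRevenue = entities.filter((e) => e.text.toLowerCase().includes('revenue'));

console.log(confident.length);        // 3
console.log(countByType(onPageOne));  // Map(2) { 'ocr' => 2, 'table' => 1 }
console.log(mentionsRevenue.length);  // 2
```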

Need rich markdown for a specific page? .markdown() delegates to your tree dispatch:

const md = await $('table').onPage(6).markdown();

Need visual understanding? .vlm() delegates to your tree dispatch:

await $('table').onPage(6).vlm('what are the column headers?');
await $('figure').eq(0).css({ margin: 20 }).vlm('describe this chart');
await $('page:first').vlm('summarize this page in 2 sentences');

One query interface over a canonical tree.

PDF Browser runtime

PdfRuntime is the live layer above the tree. It accepts parser/facet candidates, reconciles the current visible winner per node, emits MutationObserver-style records, and exposes snapshot/replay plus a surface binding for viewers.

import { PdfMutationObserver, openPdfRuntime } from '@okrapdf/pdfdom';

const runtime = openPdfRuntime({
  documentId: 'doc_123',
  pages: [{ pageNumber: 1, width: 1000, height: 1200 }],
});
await runtime.ready;

const observer = new PdfMutationObserver((records) => {
  for (const record of records) {
    if (record.type === 'text' || record.type === 'bbox') {
      // repaint only the affected node/region
    }
  }
});
observer.observe(runtime, {
  childList: true,
  characterData: true,
  bbox: true,
  class: true,
  facetWinners: true,
});

await runtime.applyFacetBatch({
  facet: 'text',
  source: 'native-text',
  candidates: [{
    id: 'native-1',
    nodeId: 'p1',
    facet: 'text',
    source: 'native-text',
    pageNumber: 1,
    nodeType: 'paragraph',
    confidence: 0.5,
    stage: 'draft',
    text: 'Revneue',
    bbox: { x: 0.1, y: 0.1, width: 0.4, height: 0.03 },
  }],
});

const surface = runtime.surface();
const unsubscribe = surface.subscribe((snapshot) => {
  renderPdfSurface(snapshot);
});

runtime.snapshot() captures document state, facet candidates, winners, provenance, invalidated paint regions, and the replay event log. Use replayPdfRuntimeSnapshot() or replayPdfRuntimeEvents() for reconnects and debugging.
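The phrase "reconciles the current visible winner per node" can be pictured with a small sketch. It assumes the simplest possible rule, where the highest-confidence candidate wins for each node and facet; the README does not state the actual rule, so `reconcileWinners`, the trimmed `FacetCandidate` shape, and the second `ocr-1` candidate are hypothetical.

```typescript
// Hypothetical candidate shape, trimmed from the applyFacetBatch example.
interface FacetCandidate {
  id: string;
  nodeId: string;
  facet: string;
  source: string;
  confidence: number;
  text?: string;
}

// Assumed rule: the highest-confidence candidate wins per (nodeId, facet);
// on a tie the earlier candidate is kept.
function reconcileWinners(candidates: FacetCandidate[]): Map<string, FacetCandidate> {
  const winners = new Map<string, FacetCandidate>();
  for (const candidate of candidates) {
    const key = `${candidate.nodeId}:${candidate.facet}`;
    const current = winners.get(key);
    if (!current || candidate.confidence > current.confidence) {
      winners.set(key, candidate);
    }
  }
  return winners;
}

const winners = reconcileWinners([
  { id: 'native-1', nodeId: 'p1', facet: 'text', source: 'native-text', confidence: 0.5, text: 'Revneue' },
  { id: 'ocr-1', nodeId: 'p1', facet: 'text', source: 'ocr', confidence: 0.92, text: 'Revenue' },
]);

console.log(winners.get('p1:text')?.text); // Revenue
```

Under this assumed rule, a later, more confident candidate replaces the draft text for the same node, which is the kind of change an observer would then receive as a `text` record.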

Ingestion (separate concern)

| Plugin | What it does | API key |
|--------|-------------|---------|
| pymupdf | Local text + table extraction, page rasterization | No |
| llamaParse | LlamaIndex Cloud extraction (eager or defer: true) | LLAMAINDEX_API_KEY |
| vlmOpenRouter | .vlm() on any element via OpenRouter | OPENROUTER_API_KEY |
| vlmBboxDetect | VLM visual entity detection (tables/figures the OCR missed) | Uses vlmOpenRouter |
| doclingServe | IBM Docling (self-hosted) | No |
| googleOcr | Google Document AI | GCP credentials |

Use your preferred ingestion pipeline (OCR, layout, VLM, etc.) to populate the tree. pdfquery itself does not own or store state.

Quick start (no API key)

npx tsx examples/basic.ts

import { loadFixture } from '@okrapdf/pdfdom';

const $ = loadFixture('financial-report');

$('table').count();            // 8
$('table').texts();            // markdown content of each table
$('[confidence>0.95]').count(); // 81
$('*').countByType();          // Map { table: 8, header: 15, ... }

License

MIT
