@okrapdf/pdfdom
v0.1.0
jQuery for PDFs - query document entities with CSS-like selectors
pdfquery
jQuery for PDFs. CSS selectors and vision models on any PDF.
npm install pdfquery @okrapdf/pdfdom-plugins
Headless PDF DOM runtime
The local runtime takes a source plus an explicit parser schedule and writes boring DOM-compatible markup. That keeps the first milestone compatible with existing tools instead of inventing a query shell.
pdfdom create ./invoice.pdf --parser text-layer -o html | htmlq --text 'page'
pdfdom create ./invoice.pdf --parser text-layer -o html | htmlq --attribute data-page 'page'
pdfdom create ./invoice.pdf --parser text-layer -o html | xidel -s - -e "css('page')"
import { openPdfDomRuntime, textLayerNodeParser } from '@okrapdf/pdfdom';
const runtime = openPdfDomRuntime({
  source: { text: 'Invoice total: $42' },
  parsers: [textLayerNodeParser],
});
await runtime.ready;
console.log(runtime.serialize());
// <document ...>Invoice total: $42</document>
Parsers are node parsers: they declare what input they need, what capabilities
they provide, and any host/network requirements. Generic parsers can stay fully
local; specialized parsers can state their requirements explicitly. For example, an arxiv-paper
parser can declare network: { mode: 'required', allowedHosts: ['arxiv.org'] },
read source.metadata.sourceUrl, fetch the arXiv PDF/HTML variants, and still
mount the result as the same <document><article>...</article></document> tree.
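The declaration described above can be sketched as a plain data structure plus a host-side check. This is a minimal illustration under stated assumptions: `ParserManifest`, `NetworkPolicy`, the `needs`/`provides` field names, and `isFetchAllowed` are hypothetical names invented here, not the package's actual API. Only the `network: { mode: 'required', allowedHosts: ['arxiv.org'] }` shape comes from the text.

```typescript
// Hypothetical shapes for illustration; not the @okrapdf/pdfdom API.
type NetworkPolicy =
  | { mode: 'none' }
  | { mode: 'required'; allowedHosts: string[] };

interface ParserManifest {
  name: string;
  needs: string[];    // inputs the parser reads
  provides: string[]; // capabilities it contributes to the tree
  network: NetworkPolicy;
}

const arxivPaper: ParserManifest = {
  name: 'arxiv-paper',
  needs: ['source.metadata.sourceUrl'],
  provides: ['article'],
  network: { mode: 'required', allowedHosts: ['arxiv.org'] },
};

// A host could check each outgoing request against the declared policy.
function isFetchAllowed(manifest: ParserManifest, url: string): boolean {
  const net = manifest.network;
  if (net.mode !== 'required') return false; // local-only parsers never fetch
  const host = new URL(url).hostname;
  return net.allowedHosts.includes(host);
}

console.log(isFetchAllowed(arxivPaper, 'https://arxiv.org/abs/0000.00000')); // true
console.log(isFetchAllowed(arxivPaper, 'https://example.com/paper.pdf'));    // false
```

Keeping the policy declarative means a scheduler could reject a parser before running it, rather than discovering a blocked request midway through.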
The runtime owns scheduling, parser lifecycle, DOM mounting, events, snapshots,
and serialization. The DOM is the in-memory runtime object; -o html
materializes it as streamable markup for tools like htmlq, xidel, Cheerio,
jsdom, browser querySelectorAll, and LLM CSS selectors. Use -o json when
you want the runtime snapshot/object instead.
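To make the split between the in-memory DOM and its `-o html` materialization concrete, here is a minimal sketch of serializing a small tree to markup. The `DomNode` shape and the `serialize` helper are assumptions made for illustration; the real runtime's objects and escaping rules may differ.

```typescript
// Hypothetical node shape; the real runtime's DOM objects may differ.
interface DomNode {
  tag: string;
  attrs?: Record<string, string | number>;
  children?: Array<DomNode | string>;
}

function escapeText(s: string): string {
  return s.replace(/&/g, '&amp;').replace(/</g, '&lt;').replace(/>/g, '&gt;');
}

function escapeAttr(s: string): string {
  return escapeText(s).replace(/"/g, '&quot;');
}

// Walk the tree depth-first and emit one element per node.
function serialize(node: DomNode | string): string {
  if (typeof node === 'string') return escapeText(node);
  const attrs = Object.entries(node.attrs ?? {})
    .map(([key, value]) => ` ${key}="${escapeAttr(String(value))}"`)
    .join('');
  const inner = (node.children ?? []).map((child) => serialize(child)).join('');
  return `<${node.tag}${attrs}>${inner}</${node.tag}>`;
}

const tree: DomNode = {
  tag: 'document',
  children: [
    { tag: 'page', attrs: { 'data-page': 1 }, children: ['Invoice total: $42'] },
  ],
};

console.log(serialize(tree));
// <document><page data-page="1">Invoice total: $42</page></document>
```

Because the output is ordinary markup, any selector-capable tool can query it without knowing about the runtime.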
Tree-first query (no session)
The document tree is the source of truth. pdfquery is the query/mutation lens over that tree.
import pdfquery from '@okrapdf/pdfdom';
const $ = pdfquery(tree, {
  // Optional: persist patches to your durable tree
  onPatches: (patches) => fetch('/parties/document/my-room/tree', {
    method: 'PATCH',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ patches, reason: 'ui-mutation' }),
  }),
});
Selectors, text search, and aggregation work instantly:
$('table').count(); // 12 tables found locally
$('ocr').contains('revenue').texts(); // text search across all pages
$('[confidence>0.9]').count(); // filter by OCR confidence
$('*').onPage(1).countByType(); // Map { ocr: 45, table: 2, heading: 3 }
Need rich markdown for a specific page? .markdown() delegates to your tree dispatch:
const md = await $('table').onPage(6).markdown();
Need visual understanding? .vlm() delegates to your tree dispatch:
await $('table').onPage(6).vlm('what are the column headers?');
await $('figure').eq(0).css({ margin: 20 }).vlm('describe this chart');
await $('page:first').vlm('summarize this page in 2 sentences');
One query interface over a canonical tree.
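The local queries above (confidence filter, text search, per-type counts) amount to simple filtering and aggregation over the tree's entities. The sketch below shows the idea over a flat array. The `Entity` shape and the sample data are hypothetical, and case-insensitive matching for the text search is an assumption, not documented behavior.

```typescript
// Hypothetical flat entity shape and sample data, for illustration only.
interface Entity {
  type: string;
  page: number;
  confidence: number;
  text: string;
}

const entities: Entity[] = [
  { type: 'ocr', page: 1, confidence: 0.97, text: 'Total revenue' },
  { type: 'ocr', page: 1, confidence: 0.62, text: 'smudged' },
  { type: 'table', page: 1, confidence: 0.93, text: '| Q1 | Q2 |' },
  { type: 'heading', page: 2, confidence: 0.99, text: 'Notes' },
];

// In the spirit of $('[confidence>0.9]')
const confident = entities.filter((e) => e.confidence > 0.9);

// In the spirit of $('ocr').contains('revenue').texts()
// (case-insensitive matching is an assumption)
const revenueTexts = entities
  .filter((e) => e.type === 'ocr' && e.text.toLowerCase().includes('revenue'))
  .map((e) => e.text);

// In the spirit of $('*').onPage(1).countByType()
function countByType(nodes: Entity[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const node of nodes) {
    counts.set(node.type, (counts.get(node.type) ?? 0) + 1);
  }
  return counts;
}

const pageOne = countByType(entities.filter((e) => e.page === 1));

console.log(confident.length);   // 3
console.log(revenueTexts);       // [ 'Total revenue' ]
console.log(pageOne.get('ocr')); // 2
```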
PDF Browser runtime
PdfRuntime is the live layer above the tree. It accepts parser/facet
candidates, reconciles the current visible winner per node, emits
MutationObserver-style records, and exposes snapshot/replay plus a surface
binding for viewers.
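Reconciling a visible winner per node can be pictured as a small reduction over candidates. The sketch below is only one plausible rule, assumed for illustration: a later stage beats an earlier one, and confidence breaks ties. The `'final'` stage, the `'ocr'` source, and the `beats`/`reconcile` helpers are hypothetical; the README does not specify how the runtime actually ranks candidates.

```typescript
// Hypothetical candidate shape, loosely modeled on the facet batch fields.
interface Candidate {
  id: string;
  nodeId: string;
  facet: string;
  source: string;
  stage: 'draft' | 'final'; // 'final' is an assumed stage name
  confidence: number;
  text: string;
}

const stageRank: Record<Candidate['stage'], number> = { draft: 0, final: 1 };

// Assumed rule: a later stage wins; equal stages fall back to confidence.
function beats(a: Candidate, b: Candidate): boolean {
  if (stageRank[a.stage] !== stageRank[b.stage]) {
    return stageRank[a.stage] > stageRank[b.stage];
  }
  return a.confidence > b.confidence;
}

// Keep one winner per (node, facet) pair.
function reconcile(candidates: Candidate[]): Map<string, Candidate> {
  const winners = new Map<string, Candidate>();
  for (const candidate of candidates) {
    const key = `${candidate.nodeId}:${candidate.facet}`;
    const current = winners.get(key);
    if (!current || beats(candidate, current)) winners.set(key, candidate);
  }
  return winners;
}

const winners = reconcile([
  { id: 'native-1', nodeId: 'p1', facet: 'text', source: 'native-text',
    stage: 'draft', confidence: 0.5, text: 'Revneue' },
  { id: 'ocr-1', nodeId: 'p1', facet: 'text', source: 'ocr',
    stage: 'final', confidence: 0.9, text: 'Revenue' },
]);

console.log(winners.get('p1:text')?.text); // Revenue
```

Under this rule a low-confidence draft stays visible only until a better candidate for the same node and facet arrives.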
import { PdfMutationObserver, openPdfRuntime } from '@okrapdf/pdfdom';
const runtime = openPdfRuntime({
  documentId: 'doc_123',
  pages: [{ pageNumber: 1, width: 1000, height: 1200 }],
});
await runtime.ready;
const observer = new PdfMutationObserver((records) => {
  for (const record of records) {
    if (record.type === 'text' || record.type === 'bbox') {
      // repaint only the affected node/region
    }
  }
});
observer.observe(runtime, {
  childList: true,
  characterData: true,
  bbox: true,
  class: true,
  facetWinners: true,
});
await runtime.applyFacetBatch({
  facet: 'text',
  source: 'native-text',
  candidates: [{
    id: 'native-1',
    nodeId: 'p1',
    facet: 'text',
    source: 'native-text',
    pageNumber: 1,
    nodeType: 'paragraph',
    confidence: 0.5,
    stage: 'draft',
    text: 'Revneue',
    bbox: { x: 0.1, y: 0.1, width: 0.4, height: 0.03 },
  }],
});
const surface = runtime.surface();
const unsubscribe = surface.subscribe((snapshot) => {
  renderPdfSurface(snapshot);
});
runtime.snapshot() captures document state, facet candidates, winners,
provenance, invalidated paint regions, and the replay event log. Use
replayPdfRuntimeSnapshot() or replayPdfRuntimeEvents() for reconnects and
debugging.
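The reconnect case can be illustrated with a generic event-log replay: a client restores a snapshot and then applies only the events it missed. The event types, the `State` map, and the `applyEvent`/`replay` helpers below are hypothetical stand-ins; the actual event log format and the signatures of the replay functions are not shown in this README.

```typescript
// Hypothetical event and state shapes; the real event log format may differ.
type RuntimeEvent =
  | { type: 'node-added'; nodeId: string; text: string }
  | { type: 'text-changed'; nodeId: string; text: string }
  | { type: 'node-removed'; nodeId: string };

type State = Map<string, string>; // nodeId -> visible text

// Pure reducer: returns a new state and never mutates its input.
function applyEvent(state: State, event: RuntimeEvent): State {
  const next = new Map(state);
  switch (event.type) {
    case 'node-added':
    case 'text-changed':
      next.set(event.nodeId, event.text);
      break;
    case 'node-removed':
      next.delete(event.nodeId);
      break;
  }
  return next;
}

// Fold the events over a starting state (empty by default).
function replay(events: RuntimeEvent[], from: State = new Map()): State {
  return events.reduce(applyEvent, from);
}

const log: RuntimeEvent[] = [
  { type: 'node-added', nodeId: 'p1', text: 'Revneue' },
  { type: 'node-added', nodeId: 'p2', text: 'Q1 results' },
  { type: 'text-changed', nodeId: 'p1', text: 'Revenue' },
];

const full = replay(log);                       // replay from scratch
const snapshot = replay(log.slice(0, 2));       // state captured before a disconnect
const resumed = replay(log.slice(2), snapshot); // apply only the missed tail

console.log(full.get('p1'), resumed.get('p1')); // Revenue Revenue
```

Because the reducer is pure, replaying the full log and resuming from a snapshot produce the same state, which is also what makes the log useful for debugging.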
Ingestion (separate concern)
| Plugin | What it does | API key |
|--------|-------------|---------|
| pymupdf | Local text + table extraction, page rasterization | No |
| llamaParse | LlamaIndex Cloud extraction (eager or defer: true) | LLAMAINDEX_API_KEY |
| vlmOpenRouter | .vlm() on any element via OpenRouter | OPENROUTER_API_KEY |
| vlmBboxDetect | VLM visual entity detection (tables/figures the OCR missed) | Uses vlmOpenRouter |
| doclingServe | IBM Docling (self-hosted) | No |
| googleOcr | Google Document AI | GCP credentials |
Use your preferred ingestion pipeline (OCR, layout, VLM, etc.) to populate the tree. pdfquery itself does not own or store state.
Quick start (no API key)
npx tsx examples/basic.ts
import { loadFixture } from '@okrapdf/pdfdom';
const $ = loadFixture('financial-report');
$('table').count(); // 8
$('table').texts(); // markdown content of each table
$('[confidence>0.95]').count(); // 81
$('*').countByType(); // Map { table: 8, header: 15, ... }
License
MIT
