# paperkit
Cross-runtime document processing toolkit. Take a photo of a document → get a clean image, OCR text, structured document tree, searchable PDF, or JSON. Same API on web, React Native (Expo), and Node.js.
```ts
import { parseDocument, toMarkdown, backend } from "paperkit";

const doc = await parseDocument(photo, backend, {
  layout: { model: { url: "/models/doclayout.onnx" }, classNames: DEFAULT_DOCLAYOUT_CLASS_NAMES },
  text: { model: { url: "/models/ppocr-rec.onnx" }, charset: [...ppocrKeys, " "] },
  formula: formulaRecognizer, // optional — math → LaTeX
  table: tableRecognizer,     // optional — tables → HTML
});

console.log(toMarkdown(doc));
```

Or use one-shot OCR when you don't need the layout tree:

```ts
import { ocr, backend } from "paperkit";

const result = await ocr(photo, backend, {
  detection: { model: { url: "/models/ppocr-det.onnx" } },
  recognition: { model: { url: "/models/ppocr-rec.onnx" }, charset: [...ppocrKeys, " "] },
});
```

## Why paperkit
- One API, three runtimes. Import from `"paperkit"` anywhere — the bundler picks the right build via conditional exports. No separate packages to keep in sync.
- Bring your own model. We never bundle weights. Point `model.url` / `model.path` at ONNX files you host. Recommended models and direct download URLs are listed below and in each per-feature doc.
- Pay only for what you use. ONNX Runtime is an optional peer dep. If you only use classical features (page detection, binarization, perspective, deskew, blur, keyword classification, script detection, rule KIE), your bundle stays tiny.
- Tree-shakeable. Importing `denoise` doesn't pull in OCR code.
- TypeScript-first. Full types, discriminated unions, no `any`.
- 100% statement / line coverage on the library core across 257 tests.
## Install
```bash
npm install paperkit

# + whichever ONNX runtime you need (optional peer dep)
npm install onnxruntime-web               # browsers
npx expo install onnxruntime-react-native # Expo / React Native
npm install onnxruntime-node              # Node.js / Electron main

# Optional peer deps for specific features:
npm install sharp                      # Node image I/O (decode / encode)
npm install pdfjs-dist @napi-rs/canvas # rasterizePDF on Node
npm install pdfjs-dist                 # rasterizePDF on the web
```

### Expo specifics

`onnxruntime-react-native` is a native module — you cannot use it in Expo Go. Use a development build or EAS Build:

```bash
npx expo install expo-dev-client
npx expo prebuild
```

## What's in the box
Every feature below ships in this release. Features marked "No" in the Model? column run with zero external downloads — pure TypeScript.
### Geometry — camera-photo clean-up
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| applyExifRotation / exifOrientationFromBytes / rotateByExif | Apply JPEG EXIF Orientation | No | geometry.md |
| detectPage | Find page corners (Otsu + largest connected component + convex hull + diagonal-extremes) | No | geometry.md |
| correctPerspective | Warp a 4-corner quad to a rectangle (DLT homography + bilinear) | No | geometry.md |
| deskew / estimateSkewAngle | Remove rotational skew via projection-profile variance | No | geometry.md |
| dewarp / createDewarper | Flatten curved / folded pages via UV-grid model | Yes — UVDoc | geometry.md |
### Appearance — pixel clean-up
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| denoise / denoiseRaw / createDenoiser | Tiled ML denoising over any same-shape ONNX (Restormer, NAFNet, Swin2SR, …) | Yes — NAFNet / others | appearance.md |
| binarize | Adaptive threshold (Gaussian-mean or Sauvola) via integral image | No | appearance.md |
| removeShadow | Divide-by-blurred-background illumination correction | No | appearance.md |
### OCR — printed text, handwriting, formulas, tables
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| detectText / createDetector | DB-based text-region detection with unclip postprocess | Yes — PP-OCRv4 det | ocr.md |
| recognizeText / createRecognizer / ctcGreedyDecode | CRNN recognition with auto pre/post-softmax CTC decode | Yes — PP-OCRv4 rec | ocr.md |
| ocr / createOcrPipeline | Full image → text with reading-order sort + per-region progress | Yes — PP-OCRv4 | ocr.md |
| recognizeHandwriting / createHandwritingRecognizer | English handwriting via TrOCR vision-encoder-decoder | Yes — TrOCR | ocr.md |
| recognizeFormula / createFormulaRecognizer | Math → LaTeX via TexTeller vision-encoder-decoder | Yes — TexTeller | ocr.md |
| recognizeTable / createTableRecognizer | Tables → HTML with cell text + colspan / rowspan | Yes — SLANet-plus | ocr.md |
| recognizeVisionEncoderDecoder / createVisionEncoderDecoderRecognizer | Generic VisionEncoderDecoder runner — any encoder-decoder ONNX | Yes — user-supplied | ocr.md |
| createTokenDecoder / createUnigramMetaspaceDecoder / createByteBpeDecoder | Tokenizer decoders for HF tokenizer.json; auto-dispatch on model.type | No | ocr.md |
### Layout — typed-region dispatcher
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| analyzeLayout / createLayoutAnalyzer | Typed regions (title, text, list, table, formula, figure) with bboxes + confidence | Yes — DocLayout-YOLO | layout.md |
| parseDocument / createDocumentPipeline | Full document parse — layout + per-region recognition + reading-order sort | Yes — layout + recognizers | layout.md |
| DEFAULT_DOCLAYOUT_CLASS_NAMES / DEFAULT_DOCLAYOUT_MAPPING | 10 raw labels + their canonical mapping | — | layout.md |
### Input — PDF rasterization
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| rasterizePDF / rasterizePdfWith / installRasterizePDF | Render each PDF page to RawImage via pdfjs-dist + canvas | No (peer deps: pdfjs-dist + @napi-rs/canvas on Node) | input.md |
React Native isn't supported — pdfjs has DOM / worker assumptions. Use `react-native-pdf` or a native PDF lib to rasterize off-JS, then feed the resulting `RawImage[]` into the rest of paperkit, as sketched below.
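For instance, a minimal sketch of that hand-off — `rasterizeWithNativeLib` is a hypothetical helper around whatever native renderer you pick, and the `{ data, width, height }` RGBA shape is an assumption about `RawImage` (check the exported types):

```ts
import { createOcrPipeline, backend } from "paperkit";

// Hypothetical helper: rasterize pages with a native module
// (react-native-pdf, a PDFKit wrapper, …) — not part of paperkit.
declare function rasterizeWithNativeLib(
  path: string,
): Promise<{ data: Uint8Array; width: number; height: number }[]>;

const pages = await rasterizeWithNativeLib("scan.pdf");

const pipe = await createOcrPipeline(backend, { detection, recognition });
for (const page of pages) {
  // Pre-rasterized pages drop straight into the raw-image entry point.
  const result = await pipe.runRaw(page);
  console.log(result);
}
await pipe.dispose();
```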
### Export — turn results into files
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| toSearchablePDF | Multi-page (or single-page) searchable PDF with invisible text overlay | No | export.md |
| toMarkdown | DocumentResult → structured Markdown (headings, lists, $$math$$, inline <table>, figure captions) or flat line-per-region from OcrResult | No | export.md |
| toJSON | OcrResult → stable persistence-ready schema | No | export.md |
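`toMarkdown` is shown at the top of this README; for `toJSON`, a minimal persistence sketch — this assumes the returned schema is a plain JSON-serializable object and that `result` is an `OcrResult` (see export.md for the exact shape):

```ts
import { toJSON } from "paperkit";
import { promises as fs } from "node:fs";

// `result` is an OcrResult from ocr() / createOcrPipeline().
await fs.writeFile("scan.json", JSON.stringify(toJSON(result), null, 2), "utf8");
```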
### Quality — pre-flight checks
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| estimateBlur | Variance-of-Laplacian focus score | No | quality.md |
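Because the score is a plain variance-of-Laplacian number, it works well as a cheap gate before the expensive ML steps. A sketch — the synchronous call shape is an assumption (see quality.md), and the threshold is application-specific, not a library constant:

```ts
import { estimateBlur } from "paperkit";

const score = estimateBlur(image); // higher = sharper (variance of Laplacian)
if (score < 50) {                  // 50 is an arbitrary example threshold
  throw new Error("Photo looks out of focus — ask for a retake before running OCR");
}
```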
### Classification — what kind of document is this?
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| classifyByKeywords | Zero-model OCR-text classifier (8 default categories, customizable) | No | classify.md |
| classifyDocument / createDocumentClassifier | Image classifier over any ONNX [1, 3, H, W] → [1, C] model | Yes — DiT-RVLCDIP | classify.md |
| DEFAULT_RVLCDIP_LABELS | 16-class RVL-CDIP taxonomy | — | classify.md |
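A sketch of the zero-model path — the call shape here is an assumption (`classifyByKeywords` operating on OCR text; the exact signature and the customization hook are in classify.md):

```ts
import { classifyByKeywords } from "paperkit";

// `ocrText` is the concatenated text of an OcrResult.
const category = classifyByKeywords(ocrText); // one of the 8 default categories
```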
### Text analysis
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| detectScript | Dominant Unicode script (11 scripts: latin, han, hiragana, katakana, hangul, cyrillic, arabic, hebrew, thai, devanagari, greek) | No | text.md |
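A sketch — assuming `detectScript` takes a string and returns the dominant script label (see text.md for the exact return shape):

```ts
import { detectScript } from "paperkit";

const script = detectScript("Привет, мир"); // → "cyrillic" (dominant script)
```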
### KIE — key-information extraction
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| extractByRules / createRuleBasedExtractor | Pattern + keyword extractor with word-boundary matching and typed coercion | No | kie.md |
| INVOICE_SCHEMA / RECEIPT_SCHEMA / ID_CARD_SCHEMA | Ready-to-use schemas for common document types | — | kie.md |
| KieExtractor interface | Contract that rule-based, VLM, and future LayoutLM backends all satisfy | — (user-supplied when VLM) | kie.md |
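A sketch of the rule path — the call shape (`extractByRules` over OCR text plus a schema) is an assumption to check against kie.md, and the output fields shown are purely illustrative:

```ts
import { extractByRules, INVOICE_SCHEMA } from "paperkit";

// `ocrText` is the recognized text of an invoice photo.
const fields = extractByRules(ocrText, INVOICE_SCHEMA);
// → e.g. { invoiceNumber: "…", date: "…", total: … } — illustrative, not the real schema
```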
### Batch / progress / workers
| API | Does what | Model? | Docs |
|---|---|:---:|---|
| batchMap | Generic concurrency-limited mapper (order-preserving, fail-fast) | No | batch.md |
| onProgress | Accepted by denoise, ocr, parseDocument, recognizeTable, rasterizePDF, batchMap | No | batch.md |
| Worker patterns (browser + Node) | Result types are plain-data POD; all handles stay inside the worker that created them | No — documented patterns, no RPC wrapper | workers.md |
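A browser-flavored sketch of the documented pattern — the file name and message shape are illustrative, not a paperkit API; the point is that the pipeline handle never leaves the worker and only plain-data results cross the boundary:

```ts
// ocr.worker.ts — the pipeline stays inside this worker.
import { createOcrPipeline, backend } from "paperkit";

const pipePromise = createOcrPipeline(backend, { detection, recognition });

self.onmessage = async (e: MessageEvent<{ id: number; bytes: Uint8Array }>) => {
  const pipe = await pipePromise;
  const result = await pipe.run(e.data.bytes); // OcrResult is plain data — postable
  self.postMessage({ id: e.data.id, result });
};
```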
## Recommended models
Every feature that needs ML takes a `model.url` (web) or `model.path` (Node / RN). Here are the tested ONNX weights that paperkit validates against — point at these to get the same behavior as the smoke scripts.

Models are never bundled with paperkit. You download them once from the canonical source (usually HuggingFace or ModelScope), host or ship them with your app, and pass the URL / path to the relevant factory.
### PP-OCRv4 — printed-text OCR
```bash
mkdir -p models/ppocr

# Text detection (multilingual, ~4.7 MB)
curl -L -o models/ppocr/det.onnx \
  "https://huggingface.co/SWHL/RapidOCR/resolve/main/PP-OCRv4/ch_PP-OCRv4_det_infer.onnx"

# Text recognition (Chinese + English, ~10 MB)
curl -L -o models/ppocr/rec.onnx \
  "https://huggingface.co/SWHL/RapidOCR/resolve/main/PP-OCRv4/ch_PP-OCRv4_rec_infer.onnx"

# Character dictionary (~6,600 chars; paperkit appends a trailing space)
curl -L -o models/ppocr/keys.txt \
  "https://raw.githubusercontent.com/PaddlePaddle/PaddleOCR/release/2.7/ppocr/utils/ppocr_keys_v1.txt"
```

Usage:

```ts
import { ocr, backend } from "paperkit";
import { promises as fs } from "node:fs";

const keys = (await fs.readFile("models/ppocr/keys.txt", "utf8")).split("\n").filter(Boolean);
const charset = [...keys, " "]; // IMPORTANT — the trailing space isn't in the downloaded file

const result = await ocr(photoBytes, backend, {
  detection: { model: { path: "models/ppocr/det.onnx" } },
  recognition: { model: { path: "models/ppocr/rec.onnx" }, charset },
});
```

Higher-accuracy server variants and language-specific alternatives (English, Korean, Arabic, Hindi, etc.) are listed in docs/features/ocr.md.
### DocLayout-YOLO — layout (75 MB)
```bash
mkdir -p models/layout
curl -L -o models/layout/doclayout.onnx \
  "https://huggingface.co/wybxc/DocLayout-YOLO-DocStructBench-onnx/resolve/main/doclayout_yolo_docstructbench_imgsz1024.onnx"
```

```ts
import { parseDocument, DEFAULT_DOCLAYOUT_CLASS_NAMES, backend } from "paperkit";

const doc = await parseDocument(photoBytes, backend, {
  layout: { model: { path: "models/layout/doclayout.onnx" }, classNames: DEFAULT_DOCLAYOUT_CLASS_NAMES },
  text: { model: { path: "models/ppocr/rec.onnx" }, charset },
});
```

### TrOCR — handwriting (~64 MB, int8)
```bash
mkdir -p models/handwriting
BASE="https://huggingface.co/Xenova/trocr-small-handwritten/resolve/main"
curl -L -o models/handwriting/encoder.onnx "$BASE/onnx/encoder_model_quantized.onnx" # ~22 MB
curl -L -o models/handwriting/decoder.onnx "$BASE/onnx/decoder_model_quantized.onnx" # ~38 MB
curl -L -o models/handwriting/tokenizer.json "$BASE/tokenizer.json"                  # ~4 MB
```

```ts
import { createHandwritingRecognizer, createTokenDecoder, backend } from "paperkit";
import { promises as fs } from "node:fs";

const tokenizerJson = JSON.parse(await fs.readFile("models/handwriting/tokenizer.json", "utf8"));
const handwriting = await createHandwritingRecognizer(backend, {
  encoder: { path: "models/handwriting/encoder.onnx" },
  decoder: { path: "models/handwriting/decoder.onnx" },
  decodeTokens: createTokenDecoder(tokenizerJson),
  maxLength: 64,
});

// Plug into parseDocument:
await parseDocument(bytes, backend, { layout, text: handwriting });
```

English only. Larger fp16 / fp32 variants (higher quality, bigger files) are listed in docs/features/ocr.md.

### TexTeller — formula recognition (~303 MB int8)
```bash
mkdir -p models/formula
BASE="https://huggingface.co/onnx-community/TexTeller-ONNX/resolve/main"
curl -L -o models/formula/encoder.onnx "$BASE/onnx/encoder_model_int8.onnx" # ~84 MB
curl -L -o models/formula/decoder.onnx "$BASE/onnx/decoder_model_int8.onnx" # ~218 MB
curl -L -o models/formula/tokenizer.json "$BASE/tokenizer.json"             # ~1.3 MB
```

```ts
import { createFormulaRecognizer, createTokenDecoder, backend } from "paperkit";
import { promises as fs } from "node:fs";

const tokenizerJson = JSON.parse(await fs.readFile("models/formula/tokenizer.json", "utf8"));
const formulaRec = await createFormulaRecognizer(backend, {
  encoder: { path: "models/formula/encoder.onnx" },
  decoder: { path: "models/formula/decoder.onnx" },
  decodeTokens: createTokenDecoder(tokenizerJson),
});

await parseDocument(bytes, backend, { layout, text, formula: formulaRec });
```

Smaller q4f16 (~200 MB) and larger fp16 / fp32 variants in docs/features/ocr.md.

### SLANet-plus — table recognition (~7.4 MB)
```bash
mkdir -p models/table
curl -L -o models/table/slanet-plus.onnx \
  "https://www.modelscope.cn/models/RapidAI/RapidTable/resolve/v2.0.0/slanet-plus.onnx"
```

```ts
import { createTableRecognizer, createRecognizer, backend } from "paperkit";

const cellTextRecognizer = await createRecognizer(backend, {
  model: { path: "models/ppocr/rec.onnx" },
  charset,
});
const tableRec = await createTableRecognizer(backend, {
  model: { path: "models/table/slanet-plus.onnx" },
  cellTextRecognizer,
});

await parseDocument(bytes, backend, { layout, text: cellTextRecognizer, table: tableRec });
```

### UVDoc — dewarping (~30 MB)
```bash
mkdir -p models/dewarp
BASE="https://huggingface.co/fredcallagan/uvdoc-grid-onnx/resolve/main"
# Both files required — the .onnx references the .onnx.data externally.
curl -L -o models/dewarp/UVDoc_grid.onnx "$BASE/UVDoc_grid.onnx"           # 237 KB
curl -L -o models/dewarp/UVDoc_grid.onnx.data "$BASE/UVDoc_grid.onnx.data" # 30 MB
```

```ts
import { dewarp, backend } from "paperkit";

const flat = await dewarp(photoImage, backend, {
  model: { path: "models/dewarp/UVDoc_grid.onnx" },
});
```

### NAFNet — denoise / deblur (~91 MB)
```bash
mkdir -p models/denoise
curl -L -o models/denoise/nafnet.onnx \
  "https://huggingface.co/opencv/deblurring_nafnet/resolve/main/deblurring_nafnet_2025may.onnx"
```

```ts
import { denoise, backend } from "paperkit";

const clean = await denoise(photoBytes, backend, {
  model: { path: "models/denoise/nafnet.onnx" },
  inputName: "lq",      // NAFNet input tensor
  outputName: "output",
  tileSize: 512,        // NAFNet SCA module requires ≥ 384 per side
  overlap: 64,
  normalize: { scale: 1 / 255 },
});
```

Other denoisers (Restormer, Swin2SR, NAFNet SIDD) and PyTorch → ONNX export instructions in docs/features/appearance.md.

### DiT — document classification (~83 MB int8)
```bash
mkdir -p models/classify
curl -L -o models/classify/dit-rvlcdip.onnx \
  "https://huggingface.co/Xenova/dit-base-finetuned-rvlcdip/resolve/main/onnx/model_quantized.onnx"
```

```ts
import { classifyDocument, DEFAULT_RVLCDIP_LABELS, backend } from "paperkit";

const { category, confidence } = await classifyDocument(image, backend, {
  model: { path: "models/classify/dit-rvlcdip.onnx" },
  labels: DEFAULT_RVLCDIP_LABELS,
  topK: 3,
});
```

Size / precision variants (fp16, q4f16, fp32) in docs/features/classify.md.
## Usage patterns
### One-shot helpers
Easiest for small scripts — the model loads and disposes per call:

```ts
import { denoise, backend } from "paperkit";

const clean = await denoise(file, backend, { model: { url: "/models/nafnet.onnx" } });
```

### Reusable pipelines (keep models loaded)
Preferred in apps where you process many images:

```ts
import { createDenoiser, createOcrPipeline, backend } from "paperkit";

const denoiser = await createDenoiser(backend, { model: { url: "/models/nafnet.onnx" } });
const ocrPipe = await createOcrPipeline(backend, { detection, recognition });

for (const photo of photos) {
  const clean = await denoiser.denoise(photo);
  const result = await ocrPipe.runRaw(clean);
}

await denoiser.dispose();
await ocrPipe.dispose();
```

### Full phone-photo pipeline — no ML required for the classical part
```ts
import {
  applyExifRotation, detectPage, correctPerspective, deskew,
  binarize, removeShadow,
  createOcrPipeline, toSearchablePDF,
  backend,
} from "paperkit";

const raw = await applyExifRotation(photo, backend);
const quad = detectPage(raw);
const flat = quad ? correctPerspective(raw, quad) : raw;
const upright = deskew(flat);
const lit = removeShadow(upright);
const bw = binarize(lit);

const pipe = await createOcrPipeline(backend, { detection, recognition });
const result = await pipe.runRaw(bw);

const jpeg = await backend.encodeImage(bw, "jpeg", 85);
const pdfBytes = await toSearchablePDF({
  imageBytes: jpeg, imageWidth: bw.width, imageHeight: bw.height, ocr: result,
});
```

See examples/node/scan.ts for the runnable version of this pipeline (zero models needed).
### Full document parse (layout + per-region recognition)
```ts
import {
  parseDocument, toMarkdown,
  createFormulaRecognizer, createTableRecognizer,
  createHandwritingRecognizer, createTokenDecoder,
  DEFAULT_DOCLAYOUT_CLASS_NAMES,
  backend,
} from "paperkit";

const formulaRec = await createFormulaRecognizer(backend, { encoder, decoder, decodeTokens: createTokenDecoder(formulaTokenizer) });
const tableRec = await createTableRecognizer(backend, { model: { path: "models/table/slanet-plus.onnx" }, cellTextRecognizer });

const doc = await parseDocument(photoBytes, backend, {
  layout: { model: { path: "models/layout/doclayout.onnx" }, classNames: DEFAULT_DOCLAYOUT_CLASS_NAMES },
  text: { model: { path: "models/ppocr/rec.onnx" }, charset: [...keys, " "] },
  formula: formulaRec,
  table: tableRec,
});

console.log(toMarkdown(doc));
// # Title
//
// Paragraph body text.
//
// $$ \psi_0(M) = \int … $$
//
// <table><tr><td colspan="2">A</td>…</table>
```

### Batch over many images
```ts
import { batchMap, createOcrPipeline, backend } from "paperkit";

const pipe = await createOcrPipeline(backend, { detection, recognition });
const results = await batchMap(
  photos,
  (bytes) => pipe.run(bytes),
  { concurrency: 2, onProgress: (e) => console.log(`${e.current}/${e.total}`) },
);
await pipe.dispose();
```

## Runtime-specific notes
Web:

```ts
import { denoise, backend } from "paperkit";

async function handleFile(file: File) {
  const clean = await denoise(file, backend, { model: { url: "/models/nafnet.onnx" } });
  const bytes = await backend.encodeImage(clean, "png");
  return new Blob([bytes], { type: "image/png" });
}
```

Node:

```ts
import { promises as fs } from "node:fs";
import { denoise, backend } from "paperkit";

const clean = await denoise(await fs.readFile("photo.jpg"), backend, {
  model: { path: "./models/nafnet.onnx" },
});
```

React Native (Expo development build):

```ts
import * as FileSystem from "expo-file-system";
import { denoiseRaw, backend } from "paperkit";

// Decode the image in your app — the native adapter doesn't bundle an image codec.
const raw = /* your decode-to-RGBA helper */;

const modelPath = `${FileSystem.documentDirectory}models/nafnet.onnx`;
// Download the ONNX model once on first launch; cache locally.
const clean = await denoiseRaw(raw, backend, { model: { path: modelPath } });
```

Consumers typically use `expo-image-manipulator` plus a small RGBA decoder for input, and `expo-file-system` to manage model downloads.
## Architecture
```
your app ── imports "paperkit" ──► paperkit entry file (web / native / node)
                                         │
                                         │ wires core + runtime adapter
                                         ▼
                                   paperkit core
                                         │
    ┌────────────┬──────────────┬────────┴─────────┬───────────┬──────────┐
    ▼            ▼              ▼                  ▼           ▼          ▼
 geometry    appearance        ocr              layout       input     export
(classical) (ML + classical)   (ML)         (ML dispatcher)  (peer)    (pure)
    │            │              │                  │           │          │
 quality     classify         text              batch         kie     workers
(classical)   (both)       (classical)       (pure code)  (pure+VLM) (patterns)
```

## Extending paperkit
Add a new feature module:

- Create `src/modules/<area>/<feature>.ts`. Export functions that take a `Backend` and any options.
- If the feature needs ML, call `backend.loadModel(...)` and `session.run(...)`. Core tensor helpers (`imageToTensor`, `tensorToImage`, tiling, homography) live in `src/core/`.
- Add an index file for the module and re-export from `src/entries/shared.ts`.
- Done — your feature works on every runtime automatically.

Add a new runtime (Deno, Bun, Electron renderer, …):

- Implement the `Backend` interface in `src/adapters/<runtime>.ts`.
- Create `src/entries/<runtime>.ts` (follow the pattern of existing entries).
- Add the entry to `tsup.config.ts` and the `exports` map in `package.json`.
Add an alternate recognizer:
Implement the `Recognizer` interface from `src/modules/ocr/types.ts` — `recognize(image) → { text, confidence }` plus `dispose()`. Pass your `Recognizer` instance directly as `options.recognition` (for `createOcrPipeline`) or as `options.text` / `options.formula` / `options.table` (for `parseDocument`). No pipeline changes needed.
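A minimal sketch of such a recognizer — whether `Recognizer` and `RawImage` are re-exported from `"paperkit"` is an assumption here (the canonical home is `src/modules/ocr/types.ts`), and `callMyCloudOcr` is a hypothetical client for a remote OCR service:

```ts
import type { Recognizer, RawImage } from "paperkit"; // assumed re-export — see src/modules/ocr/types.ts

// Hypothetical remote OCR client — not part of paperkit.
declare function callMyCloudOcr(image: RawImage): Promise<string>;

const cloudRecognizer: Recognizer = {
  async recognize(image: RawImage) {
    const text = await callMyCloudOcr(image);
    return { text, confidence: 1 }; // confidence semantics are up to your backend
  },
  async dispose() {
    // nothing to release for a remote backend
  },
};

// Use it anywhere a recognizer is accepted, e.g.:
// await parseDocument(bytes, backend, { layout, text: cloudRecognizer });
```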
## Validation
Every ML feature has been smoke-tested end-to-end against the recommended weights:
| Script | Covers |
|---|---|
| scripts/smoke-ocr.ts | PP-OCRv4 detection + recognition on a mixed-CH/EN book page |
| scripts/smoke-denoise.ts | OpenCV NAFNet deblurring on a blurred document |
| scripts/smoke-handwriting.ts | TrOCR small (int8) on an IAM handwriting line |
| scripts/smoke-formula.ts | TexTeller (int8) on a display equation |
| scripts/smoke-table.ts | SLANet-plus + PP-OCRv4 cell text on a multi-row table with colspan="4" |
| scripts/smoke-dewarp.ts | UVDoc grid-sample on a scanned book page |
| scripts/smoke-layout.ts | DocLayout-YOLO on a two-column paper with display formulas |
| scripts/smoke-classify.ts | DiT RVL-CDIP (int8) on scientific paper / form / book |
| scripts/smoke-pdf-roundtrip.ts | rasterizePDF → OCR → toSearchablePDF multi-page roundtrip |
Run any smoke script with `npx tsx scripts/<name>.ts`. Each expects the weights to live under `models/<feature>/` (see the model download section).
Unit tests: 257 passing, 100% statement / line / function coverage on `src/core/**` and `src/modules/**`.
## Documentation
Per-feature guides in `docs/features/`:
| Module | Purpose | Models |
|---|---|---|
| geometry.md | EXIF, page detection, perspective, deskew, dewarp | UVDoc (dewarp only) |
| appearance.md | Denoise, binarize, shadow removal | NAFNet / Restormer / Swin2SR (denoise only) |
| ocr.md | Text detection + recognition + handwriting + formula + table | PP-OCR / TrOCR / TexTeller / SLANet |
| layout.md | Typed-region dispatcher + parseDocument | DocLayout-YOLO + any recognizer |
| input.md | PDF rasterization | None (peer deps: pdfjs-dist + canvas) |
| export.md | Searchable PDF + Markdown + JSON | None |
| quality.md | Blur detection | None |
| classify.md | Keyword + image classification | DiT (image path only) |
| text.md | Script detection | None |
| kie.md | Rule-based + VLM integration pattern | None (rule path) |
| batch.md | onProgress + batchMap | None |
| workers.md | Browser + Node worker patterns | None |
Runnable examples live under `examples/`:

- `examples/node/scan.ts` — phone photo → clean image (no ML)
- `examples/node/denoise.ts` — denoise a single image with any ONNX model
- `examples/node/ocr.ts` — OCR a single image with PP-OCR
- `examples/node-worker/` — full OCR pipeline running in `worker_threads`
## License
MIT
