docparser-core

v2.2.1

Enterprise Document Intelligence Engine — Core Package

docparser-core

Enterprise Document Intelligence Engine — parse, clean, chunk, enrich, and securely format documents for RAG pipelines.

Published package name on npm: docparser-core

npm install docparser-core

Why DocParser

DocParser is a document intelligence pipeline (not “just a splitter”). Compared to typical alternatives (LangChain splitters, Unstructured.io, LlamaIndex loaders, Docling-style extractors), DocParser emphasizes:

  • Security-first ingestion: file validation and threat scanning happen before parsing.
  • Receipts and quality metrics: every run produces a content-safe receipt and a quantitative quality report.
  • Structured intermediate model (UDM): parsing produces a Unified Document Model (headings, paragraphs, tables, images, etc.), which enables better chunking than raw string splitting.
  • Multiple chunking strategies + auto-selection: choose the right strategy for the document (or let hybrid_auto choose).
  • Built-in enrichment hooks: keyword/entity extraction, importance scoring, plus optional LLM and embeddings via plugins.
  • Production ergonomics: batch + streaming APIs, CLI, cloud adapter interface, and serverless handlers.

Quick Start

Package name on npm: docparser-core

Install:

npm install docparser-core

If you plan to use the optional OCR runtime as well, install both packages:

npm install docparser-core docparser-ocr

Beginner checklist:

  1. Create a parser with a preset like general.
  2. Pass either a string or a Buffer to parser.process(...).
  3. Always provide a filename so format detection can do the right thing.
  4. Read parsed chunks from result.chunks and audit metadata from result.receipt.

Smallest working example:

import { DocParser } from 'docparser-core';

const parser = new DocParser({ preset: 'general' });
const result = await parser.process('Hello world. This is DocParser.', 'hello.txt');

console.log(result.chunks.length);
console.log(result.receipt.detectedFormat);

Processing a file from disk:

import { readFile } from 'node:fs/promises';
import { DocParser } from 'docparser-core';

const parser = new DocParser({ preset: 'general' });
const report = await readFile('./examples/inputs/report.md');
const result = await parser.process(report, 'report.md');

console.log(result.documentId);
console.log(result.chunks[0]?.content);
console.log(result.receipt.detectedFormat);

What you get back:

  • result.chunks: retrieval-ready chunks and metadata
  • result.receipt: content-safe processing receipt
  • result.qualityReport: chunk quality and coverage diagnostics
  • result.securityReport: threat and PII scan results

If you only want to get started without plugins, stay with docparser-core. Add docparser-ocr later when you need OCR/image preprocessing.

Recent parser additions in core:

  • raw .png, .jpg/.jpeg, .tiff, and .webp files now parse directly into UDM image elements
  • image-only PDF pages now rasterize into OCR-ready PNG data URLs instead of returning an empty UDM

First 15 Minutes

For the fastest path to a productive development setup, do this:

  1. Install dependencies at the repo root: npm install
  2. Build the core package: npm run build -w packages/core
  3. Run the core test suite once: npm test -w packages/core
  4. Process one checked-in sample from ./examples/inputs
  5. If you work on medical flows, validate medical_summary_json before and after every heuristic change

Fast local commands:

# from repo root
npm run build -w packages/core
npm test -w packages/core

# inspect a sample file with the CLI
npx docparser inspect ./packages/core/examples/inputs/report.md

# emit a structured medical projection
npx docparser process ./packages/core/examples/inputs/Sample_Opd_Advice_1.pdf --format medical_summary_json

What to use when:

  • Use process(...) when you already have a string or Buffer in memory.
  • Use processFile(...) when you want the file-backed path, MIME detection, and easier local testing.
  • Use structured_json when building a custom projector or downstream data model.
  • Use medical_summary_json when you want document-family routing, review queues, and clinically oriented tables.

Developer Playbook

If you are building on top of docparser-core, these are the main things you can do and where to start:

| Goal | Use this surface | Start here |
| --- | --- | --- |
| Parse one document into retrieval-ready chunks | new DocParser(...).process(...) | API Reference |
| Parse raw text, HTML, Markdown, Office docs, PDFs, or images | Built-in parsers | Supported Formats |
| OCR standalone images and scanned PDFs | createOCRPlugin() or providers.ocr | Plugins and CLI |
| Use local Tesseract, native Tesseract, Ollama, Claude, OpenRouter, or a gateway | providers.ocr / providers.llm | Built-In Providers |
| Preflight provider health before production runs | docparser health or checkConfiguredProvidersHealth(...) | CLI and Live Provider Contract Tests |
| Scaffold a commented config for new developers | docparser init / docparser setup | CLI |
| Inspect how a file parses before chunking | docparser inspect | CLI |
| Batch a directory or a rerun manifest | BatchProcessor or CLI batch flow | Batch Processing |
| Stream progress and chunk-ready events | StreamProcessor or CLI telemetry flags | Streaming and CLI |
| Project structured_json into DB-ready domain tables | projectStructuredOutputByName(...) / OutputRouter.formatProjectedByName(...) | API Reference and Output formats |
| Add custom parsers, enrichers, chunkers, or outputs | Plugin system | Plugins and Contributing |
| Export JSON, JSONL, vector-store payloads, and reports | OutputRouter and CLI report flags | API Reference and CLI |
| Deploy in Docker, serverless, or cloud-storage pipelines | Deployment helpers | Docker, Serverless, and Cloud Storage |

Fastest workflow picks:

  • If you want a code-first start, begin with DocParser in API Reference.
  • If you want a no-code or low-code start, run npx docparser init and then use the generated config with docparser process.
  • If you want provider-backed OCR or LLM enrichment quickly, use one of the checked-in configs under ./examples/configs and one of the checked-in inputs under ./examples/inputs.
  • If you want fully local OCR, combine docparser-core with docparser-ocr and choose tesseract or native_tesseract in providers.ocr.
  • If you want hosted or gateway-managed inference, use openrouter, claude, ollama, or organization_gateway in providers.ocr / providers.llm and preflight with docparser health.

Supported Formats

This table reflects built-in parser implementations (what ParserRegistry will actually parse today). Format detection is performed by the security gate (magic bytes + extension + content sniffing).

Note on OCR handoff: createOCRPlugin() can only OCR image elements whose imageRef is a data URL. The built-in parsers now provide that handoff for raw PNG/JPEG/TIFF/WebP files, embedded OOXML images, and rasterized image-only PDF pages.

| Format | Typical extension(s) | Detected MIME type(s) | Parser | Key features |
| --- | --- | --- | --- | --- |
| Plain text | .txt | text/plain | PlainTextParser | Emits a paragraph with full text. Also used as fallback when a structured parse yields no elements. |
| Markdown | .md, .markdown | text/markdown, text/x-markdown | MarkdownParser | Headings → headings, lists → lists, code blocks → code elements, tables (GFM) → table elements, blockquotes → callouts. |
| HTML/XHTML | .html, .htm, .xhtml | text/html, application/xhtml+xml | HTMLParser | Strips script/style/noscript, extracts text from <body> (or root if no body). |
| XML | .xml | application/xml | XMLParser | Validates XML and emits an XML code element (plus a short root-element summary). |
| JSON | .json | application/json | JSONParser | Parses JSON into a table when possible (array/object), otherwise emits a JSON code element. |
| CSV | .csv | text/csv, application/csv | CSVParser | Builds a table element (headers + rows). Simple comma-splitting (no quoted-field parsing). |
| Images | .png, .jpg, .jpeg, .tiff, .webp | image/png, image/jpeg, image/tiff, image/webp | ImageParser | Emits OCR-ready UDM image elements as data URLs, preserving the file hash and MIME type for downstream OCR routing. |
| DOCX | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | DOCXParser | Extracts WordprocessingML text directly from the OOXML archive, emits paragraph text, and collects parser warnings. |
| PPTX | .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | PPTXParser | Slide headings, paragraphs, lists, tables, image references, speaker notes; page breaks between slides; core properties metadata. |
| XLSX | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | XLSXParser | Multi-sheet extraction, shared strings, basic type inference, table + structured data + natural language summary per sheet. |
| PDF | .pdf | application/pdf | PDFParser | Extracts embedded text with pdf.js and, for image-only pages, rasterizes them into OCR-ready PNG image elements. |
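The CSV row above notes that the built-in parser splits on commas and does not parse quoted fields. The sketch below is not the library's code. It contrasts a naive split with a small quote-aware splitter so you can see which inputs may need pre-processing before they reach the parser. Both helper names (`naiveSplit`, `quoteAwareSplit`) are hypothetical.

```typescript
// Illustrative only: these helpers are hypothetical and not part of docparser-core.

// Naive splitting: every comma is treated as a field separator.
function naiveSplit(line: string): string[] {
  return line.split(',');
}

// Quote-aware splitting: commas inside double quotes stay in the field,
// and a doubled quote ("") inside a quoted field becomes a literal quote.
function quoteAwareSplit(line: string): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

const sampleRow = 'ACME,"Smith, Jane",42';
console.log(naiveSplit(sampleRow).length);      // 4: the quoted name is split in two
console.log(quoteAwareSplit(sampleRow).length); // 3
```

If your CSV files contain quoted commas, one option is to normalize them with a splitter like this and pass the result on as a different delimiter or as JSON.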

Format detection edge cases

  • If a file is named .txt but the content looks like HTML/XML/JSON, DocParser will prefer the sniffed type (e.g. HTML) over plain text.
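To make the precedence concrete, here is a minimal sketch of layered detection. It is not the security gate's actual implementation. The order (magic bytes, then content sniffing, then extension), the signature list, and the names `detectFormat`, `sniff`, and `EXTENSION_TYPES` are all assumptions made for illustration.

```typescript
// Illustrative sketch only: the real security gate is internal to docparser-core.
// Assumed order: magic bytes, then content sniffing, then file extension.

const EXTENSION_TYPES = new Map<string, string>([
  ['txt', 'text/plain'],
  ['md', 'text/markdown'],
  ['html', 'text/html'],
  ['xml', 'application/xml'],
  ['json', 'application/json'],
  ['pdf', 'application/pdf'],
]);

// True when the input starts with the given byte signature.
function hasMagic(bytes: Uint8Array, magic: number[]): boolean {
  return bytes.length >= magic.length && magic.every((b, i) => bytes[i] === b);
}

// Decode bytes as single-byte characters; enough for ASCII-level sniffing.
function toText(input: Uint8Array | string): string {
  if (typeof input === 'string') return input;
  let text = '';
  input.forEach((b) => {
    text += String.fromCharCode(b);
  });
  return text;
}

// Guess a type from the leading characters of the content.
function sniff(text: string): string | undefined {
  const head = text.replace(/^\s+/, '').slice(0, 256).toLowerCase();
  if (head.indexOf('<!doctype html') === 0 || head.indexOf('<html') === 0) return 'text/html';
  if (head.indexOf('<?xml') === 0) return 'application/xml';
  if (head.charAt(0) === '{' || head.charAt(0) === '[') {
    try {
      JSON.parse(text);
      return 'application/json';
    } catch {
      return undefined; // looked like JSON but did not parse
    }
  }
  return undefined;
}

function detectFormat(input: Uint8Array | string, filename: string): string {
  if (typeof input !== 'string') {
    if (hasMagic(input, [0x25, 0x50, 0x44, 0x46])) return 'application/pdf'; // "%PDF"
    if (hasMagic(input, [0x89, 0x50, 0x4e, 0x47])) return 'image/png'; // start of the PNG signature
  }
  const sniffed = sniff(toText(input));
  if (sniffed !== undefined) return sniffed; // sniffed type wins over the extension
  const ext = filename.toLowerCase().split('.').pop() ?? '';
  return EXTENSION_TYPES.get(ext) ?? 'text/plain';
}

console.log(detectFormat('<html><body>Hi</body></html>', 'notes.txt')); // text/html
console.log(detectFormat('Just some words.', 'notes.txt')); // text/plain
```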

Installation

npm

npm install docparser-core

yarn

yarn add docparser-core

pnpm

pnpm add docparser-core

Optional OCR package

To keep the default docparser-core install smaller and avoid native/image OCR dependencies unless needed, OCR runtime components are in a separate optional package:

npm install docparser-ocr

Install this package only when you use createOCRPlugin() or direct OCR/image preprocessing classes.

Published OCR package name: docparser-ocr

Runtime requirements

  • Node.js: >=20.19.0
  • This package is ESM ("type": "module").

API Reference

All examples below import from the package root:

import {
  DocParser,
  BatchProcessor,
  StreamProcessor,
  CloudProcessor,
  createLLMPlugin,
  createEmbeddingPlugin,
  createOCRPlugin,
  OutputRouter,
  DEFAULT_CONFIG,
} from 'docparser-core';

import { OCRPipeline, TesseractProvider } from 'docparser-ocr';

new DocParser(config?)

import type { DocumentParserConfig } from 'docparser-core';
import { DocParser } from 'docparser-core';

const config: DocumentParserConfig = {
  preset: 'financial',
  chunking: { strategy: 'element_aware', maxChunkTokens: 1024 },
  security: { pii: { enabled: true, mode: 'mask' } },
  output: { format: 'jsonl' },
};

const parser = new DocParser(config);

Key constructor behaviors:

  • Config is resolved by ConfigManager (defaults → preset overrides → user overrides).
  • The pipeline is stage-based (security → parse → clean → chunk → enrich → scan → receipt/report).
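The resolution order can be pictured as a layered deep merge in which later layers win. The sketch below illustrates that idea under stated assumptions and is not ConfigManager's code: the helper `mergeLayers` is hypothetical, and the preset and default values shown are placeholders rather than the package's real preset contents.

```typescript
// Illustrative sketch of layered config resolution; ConfigManager's real logic may differ.
type Plain = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Plain {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Later layers win. Nested objects merge key by key; arrays and scalars are replaced.
function mergeLayers(...layers: Plain[]): Plain {
  const out: Plain = {};
  for (const layer of layers) {
    for (const key of Object.keys(layer)) {
      const existing = out[key];
      const incoming = layer[key];
      out[key] =
        isPlainObject(existing) && isPlainObject(incoming)
          ? mergeLayers(existing, incoming)
          : incoming;
    }
  }
  return out;
}

// Placeholder values chosen for the example only.
const defaultLayer = { chunking: { strategy: 'hybrid_auto', maxChunkTokens: 512 }, output: { format: 'json' } };
const presetLayer = { chunking: { strategy: 'element_aware' } };
const userLayer = { chunking: { maxChunkTokens: 1024 } };

console.log(JSON.stringify(mergeLayers(defaultLayer, presetLayer, userLayer)));
// {"chunking":{"strategy":"element_aware","maxChunkTokens":1024},"output":{"format":"json"}}
```

In practice, call parser.getConfig() to see what the library actually resolved rather than relying on a mental model like this one.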

Config discovery helpers

const resolved = parser.getConfig();
const configHash = parser.getConfigHash();

await parser.process(input, filename)

import { DocParser } from 'docparser-core';

const parser = new DocParser({ preset: 'general' });
const result = await parser.process(Buffer.from('Hello'), 'hello.txt');

console.log(result.documentId);
console.log(result.chunks);
console.log(result.receipt);
console.log(result.qualityReport);
console.log(result.securityReport);
console.log(result.explainabilityReport);

Input types

  • input: Buffer | string
  • filename: string (used for format detection)

Full result structure

import type { ProcessingResult } from 'docparser-core';

function handle(result: ProcessingResult) {
  // retrieval-ready chunks
  result.chunks;

  // content-safe audit trail (no document text)
  result.receipt;

  // quality metrics + recommendations
  result.qualityReport;

  // security validation + PII/threat reporting
  result.securityReport;

  // stage-by-stage rationale and transformation evidence
  result.explainabilityReport;
}

Runtime progress and chunk hooks

Use hooks when you need live pipeline telemetry during parser.process(...).

import { DocParser } from 'docparser-core';
import type { AuditLogEntry, ProgressDetails, ProcessingStage } from 'docparser-core';

const parser = new DocParser({
  hooks: {
    onProgress: (stage: ProcessingStage, percentage: number, details?: ProgressDetails) => {
      console.log(stage, percentage, details?.documentId, details?.totalChunks);
    },
    onChunkReady: (chunk) => {
      console.log('chunk ready:', (chunk as { chunkId: string }).chunkId);
    },
    onDocumentComplete: (receipt) => {
      console.log('done:', (receipt as { documentId: string }).documentId);
    },
    onGovernanceEvent: (event: AuditLogEntry) => {
      console.log(event.action, event.details);
    },
  },
});

await parser.process('Progress-aware processing example.', 'progress.txt');

onProgress now emits structured stage payloads for these built-in stages:

  • started
  • security
  • parsing
  • table_nl
  • cleaning
  • chunking
  • enrichment
  • pii_scan
  • security_scan
  • coverage
  • quality_report
  • complete

The details payload may include fields like filename, inputBytes, mimeType, fileHash, documentId, durationMs, itemsProcessed, totalChunks, strategyUsed, coveragePercentage, qualityScore, warningsCount, errorsCount, skipped, cached, and cacheScope.

Governance hooks

Use hooks.onGovernanceEvent with security.classification when you want chunk-level and document-level governance decisions during processing.

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  security: {
    classification: {
      enabled: true,
      defaultLevel: 'internal',
      rules: [{ pattern: 'secret', level: 'restricted' }],
    },
  },
  hooks: {
    onGovernanceEvent: (event) => {
      console.log(event.action, event.details);
    },
  },
});

Governance events currently emit chunk_classified for each final chunk and document_classified for the final document decision. Those same events also back securityReport.auditEventsCount, and the resolved governance decision is written to chunk.securityClassification and receipt.securitySummary.classification.
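The README does not spell out how rules are evaluated, so the following is only one plausible model of chunk-level and document-level decisions. Assumptions: levels are ordered from least to most restrictive (the names `public` and `confidential` are invented for the example), rule patterns are treated as case-insensitive regular expressions, and the most restrictive match wins.

```typescript
// Illustrative sketch only; none of this is confirmed docparser-core behavior.
const LEVELS = ['public', 'internal', 'confidential', 'restricted'] as const;
type Level = (typeof LEVELS)[number];

interface ClassificationRule {
  pattern: string;
  level: Level;
}

// Position in LEVELS doubles as a restrictiveness rank.
const rank = (level: Level): number => LEVELS.indexOf(level);

// Chunk decision: start at the default and raise it for every matching rule.
function classifyChunk(text: string, rules: ClassificationRule[], defaultLevel: Level): Level {
  let result = defaultLevel;
  for (const rule of rules) {
    const matches = new RegExp(rule.pattern, 'i').test(text);
    if (matches && rank(rule.level) > rank(result)) result = rule.level;
  }
  return result;
}

// Document decision: the most restrictive level seen across its chunks.
function classifyDocument(chunkLevels: Level[], defaultLevel: Level): Level {
  return chunkLevels.reduce((max, level) => (rank(level) > rank(max) ? level : max), defaultLevel);
}

const exampleRules: ClassificationRule[] = [{ pattern: 'secret', level: 'restricted' }];
const exampleLevels = ['Quarterly summary.', 'Top secret roadmap.'].map((text) =>
  classifyChunk(text, exampleRules, 'internal'),
);
console.log(exampleLevels); // [ 'internal', 'restricted' ]
console.log(classifyDocument(exampleLevels, 'internal')); // restricted
```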

Runtime pipeline presets and stage toggles

Use pipeline.runtimePreset when you want to change processing behavior at runtime without redefining the whole config object.

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  preset: 'financial',
  pipeline: {
    runtimePreset: 'fast',
    stages: {
      enrichment: true,
    },
  },
});

const result = await parser.process('Contact user@example.com for the report.', 'runtime.txt');
console.log(result.receipt.chunkingStrategy);

Built-in runtime pipeline presets:

  • standard: default balanced pipeline behavior.
  • fast: switches chunking to sliding_window and skips table natural-language generation, enrichment, PII scan, and prompt-injection scan unless you explicitly re-enable a stage.
  • quality: favors more thorough chunking and keeps all optional stages enabled.
  • llm_light: keeps the pipeline on but avoids heavier LLM-style enrichment outputs such as summaries and generated questions.

Available stage toggles under pipeline.stages:

  • tableNl
  • cleaning
  • enrichment
  • piiScan
  • promptInjection

When a stage is disabled, onProgress still emits that stage with details.skipped = true, so progress consumers can distinguish a skipped stage from a missing event.
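A progress consumer can use those flags to label each stage. The sketch below works on locally defined stand-in types that model only the documented `skipped` and `cached` fields. `StageEvent` and `summarizeStages` are hypothetical names, and the events would normally be collected inside an onProgress hook.

```typescript
// Local stand-ins for the documented hook arguments; only the fields used here are modeled.
interface StageDetails {
  skipped?: boolean;
  cached?: boolean;
}

interface StageEvent {
  stage: string;
  percentage: number;
  details?: StageDetails;
}

type StageStatus = 'completed' | 'skipped' | 'cached';

// Reduce a list of progress events to one status per stage (last event wins).
function summarizeStages(events: StageEvent[]): Map<string, StageStatus> {
  const summary = new Map<string, StageStatus>();
  for (const event of events) {
    const stageStatus: StageStatus = event.details?.skipped
      ? 'skipped'
      : event.details?.cached
        ? 'cached'
        : 'completed';
    summary.set(event.stage, stageStatus);
  }
  return summary;
}

// Hand-written sample events standing in for what a hook might collect.
const collected: StageEvent[] = [
  { stage: 'parsing', percentage: 20 },
  { stage: 'enrichment', percentage: 60, details: { skipped: true } },
  { stage: 'chunking', percentage: 50, details: { cached: true } },
];
console.log(Array.from(summarizeStages(collected).entries()));
// [ [ 'parsing', 'completed' ], [ 'enrichment', 'skipped' ], [ 'chunking', 'cached' ] ]
```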

Incremental reprocessing cache

Use performance.cache to reuse parse and chunk results when the same document is processed again with the same configuration.

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  performance: {
    cache: {
      enabled: true,
      maxItems: 50,
      ttl: 60_000,
      cacheParsing: true,
      cacheChunking: true,
    },
  },
});

await parser.process('Cache me once.', 'cached.txt');
await parser.process('Cache me once.', 'cached.txt');

console.log(parser.getCacheStats());
parser.clearCache();

Behavior:

  • Parse cache is keyed by the validated file hash.
  • Chunk cache is keyed by file hash plus config hash.
  • Progress events for cached stages include details.cached = true and details.cacheScope (parse or chunk).
  • The current runtime implementation supports only the in-memory cache backend. If another backend is configured, incremental reprocessing cache is disabled instead of attempting a partial implementation.
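The two key schemes explain why a config change invalidates chunks but not the parse. Here is a toy model of that behavior, assuming SHA-256 hashes and a JSON-serialized config as the config hash. The `ReprocessingCache` class is hypothetical, and the "parse" and "chunk" steps are trivial stand-ins for the real pipeline.

```typescript
import { createHash } from 'node:crypto';

// Illustrative only: a toy two-level cache, not the library's implementation.
const sha256 = (data: string): string => createHash('sha256').update(data).digest('hex');

class ReprocessingCache {
  private readonly parsed = new Map<string, string[]>(); // key: file hash
  private readonly chunked = new Map<string, string[]>(); // key: file hash + config hash
  readonly hits = { parse: 0, chunk: 0 };

  process(content: string, config: { maxWords: number }): string[] {
    const fileHash = sha256(content);
    // JSON.stringify is key-order sensitive; a real config hash would need stable ordering.
    const chunkKey = `${fileHash}:${sha256(JSON.stringify(config))}`;

    const cachedChunks = this.chunked.get(chunkKey);
    if (cachedChunks !== undefined) {
      this.hits.chunk++;
      return cachedChunks;
    }

    // "Parsing" here is just splitting into words.
    let words = this.parsed.get(fileHash);
    if (words !== undefined) {
      this.hits.parse++;
    } else {
      words = content.split(/\s+/).filter((w) => w.length > 0);
      this.parsed.set(fileHash, words);
    }

    // "Chunking" here is grouping words into fixed-size runs.
    const chunks: string[] = [];
    for (let i = 0; i < words.length; i += config.maxWords) {
      chunks.push(words.slice(i, i + config.maxWords).join(' '));
    }
    this.chunked.set(chunkKey, chunks);
    return chunks;
  }
}

const demoCache = new ReprocessingCache();
demoCache.process('Cache me once.', { maxWords: 2 });
demoCache.process('Cache me once.', { maxWords: 2 }); // same file, same config: chunk cache hit
demoCache.process('Cache me once.', { maxWords: 1 }); // same file, new config: parse cache hit only
console.log(demoCache.hits); // { parse: 1, chunk: 1 }
```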

Explainability report

Use result.explainabilityReport when you need a compact audit trail of what changed during processing and why.

import { DocParser } from 'docparser-core';

const parser = new DocParser();
const result = await parser.process('  The docu-\nment is “important”.  ', 'explain.txt');

console.log(result.explainabilityReport.summary);
console.log(result.explainabilityReport.stages);
console.log(result.explainabilityReport.evidence[0]);

The explainability report currently includes:

  • summary: aggregate counts for cleaning transformations, warnings, errors, skipped stages, and cached stages.
  • stages: per-stage timing plus completed, skipped, or cached status for the measured pipeline stages.
  • evidence: sampled before/after cleaning evidence, parser or pipeline notes, and coverage-gap previews when coverage is incomplete.

This makes it easier to answer questions like "why did this text change?", "which stages were skipped?", and "did cached results influence this run?" without reconstructing the full pipeline manually.

parser.formatOutput(result)

formatOutput routes through OutputRouter using config.output.

JSON / JSONL

import { DocParser } from 'docparser-core';

const parser = new DocParser({ output: { format: 'json' } });
const result = await parser.process('A test document.', 'test.txt');

const json = parser.formatOutput(result);
console.log(typeof json); // string

import { DocParser } from 'docparser-core';

const parser = new DocParser({ output: { format: 'jsonl' } });
const result = await parser.process('A test document.', 'test.txt');

const jsonl = parser.formatOutput(result);
console.log(typeof jsonl); // string

Vector DB payloads (metadata + IDs)

DocParser formatters do not generate embeddings. If you generate embeddings (via a plugin or your own pipeline), store vectors in your DB and use DocParser’s output as metadata/payload.

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  output: {
    format: 'pinecone',
    vectorDb: { namespace: 'docs-prod' },
  },
});

const result = await parser.process('Vector payload demo.', 'demo.txt');
const payload = parser.formatOutput(result);

import { DocParser, OutputRouter } from 'docparser-core';

const parser = new DocParser();
const result = await parser.process('Vector payload demo.', 'demo.txt');

const router = new OutputRouter({ format: 'qdrant', vectorDb: { collection: 'chunks' } });
const qdrantPoints = router.format(result.chunks);
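Because the formatters above do not generate embeddings, you pair vectors with chunk payloads yourself. The sketch below shows one way to do that join. The `ChunkLike` and `VectorRecord` shapes are assumptions for illustration (only `chunkId` and `content` appear elsewhere in this README), and the record layout is generic rather than any specific vector store's schema.

```typescript
// Illustrative only: generic shapes, not a specific vector database's upsert format.
interface ChunkLike {
  chunkId: string;
  content: string;
}

interface VectorRecord {
  id: string;
  values: number[];
  metadata: { content: string };
}

// Pair each chunk with the vector at the same index; fail loudly on a mismatch.
function toVectorRecords(chunks: ChunkLike[], vectors: number[][]): VectorRecord[] {
  if (chunks.length !== vectors.length) {
    throw new Error(`Expected ${chunks.length} vectors, got ${vectors.length}`);
  }
  return chunks.map((chunk, i) => {
    const values = vectors[i];
    if (values === undefined) throw new Error(`Missing vector for chunk ${chunk.chunkId}`);
    return { id: chunk.chunkId, values, metadata: { content: chunk.content } };
  });
}

const sampleChunks: ChunkLike[] = [
  { chunkId: 'c1', content: 'Vector payload demo.' },
  { chunkId: 'c2', content: 'Second chunk.' },
];
const sampleVectors = [
  [0.1, 0.2, 0.3],
  [0.4, 0.5, 0.6],
];
console.log(toVectorRecords(sampleChunks, sampleVectors).map((r) => r.id)); // [ 'c1', 'c2' ]
```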

Supported output formats (implemented)

  • json, jsonl, text, markdown, csv, structured_json, prescription_json, medical_summary_json, pinecone, chroma, weaviate, qdrant, langchain, llamaindex, custom

Structured relational output and named projectors

Use structured_json when you want a normalized intermediate schema, then project that schema into domain-specific tables or DTOs.

import { DocParser } from 'docparser-core';

const parser = new DocParser();
const result = await parser.process('Prescription ID: RX-2026-0415', 'prescription.txt');

console.log(parser.listStructuredProjectors());
const prescriptionTables = parser.projectStructuredOutputByName(result, 'prescription');

You can also register your own projector once and reuse it by name:

import { DocParser } from 'docparser-core';

const parser = new DocParser();
parser.registerStructuredProjector({
  projectionName: 'summary_projection',
  project: (structured) => ({
    projection: 'summary_projection',
    sourceSchema: structured.schema_version,
    contentItems: structured.summary.content_item_count,
  }),
});

const result = await parser.process('Custom structured projection.', 'projection.txt');
const projected = parser.projectStructuredOutputByName(result, 'summary_projection');

Built-in projector notes:

  • structured_json emits the normalized structured_json.v1 schema.
  • prescription_json is the built-in prescription-table projection over that schema.
  • medical_summary_json classifies documents into medical families (for example prescription, OP advice, lab report, discharge summary, referral letter, insurance form), emits high-signal extracted fields, and includes confidence-aware review_findings rows for ambiguous or noisy template captures. It also emits advanced clinical tables for medication_safety, normalized lab_observations, computed lab_trends, chronological timeline_events, explicit referrals, and normalized insurance_coverages rows.
  • projectStructuredOutputs(...) and projectStructuredOutputsByName(...) let you fan out one normalized intermediate into multiple domain outputs without recomputing the base projection.
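The fan-out idea in the last bullet can be sketched as a small registry that runs several named projectors over one already-built intermediate. This models the concept only and is not the library's implementation. `ProjectorRegistry` and `buildStructured` are hypothetical, and `StructuredDoc` covers only the two fields used in the projector example earlier in this section.

```typescript
// Illustrative only: a toy projector registry, not docparser-core internals.
interface StructuredDoc {
  schema_version: string;
  summary: { content_item_count: number };
}

interface Projector {
  projectionName: string;
  project: (structured: StructuredDoc) => unknown;
}

class ProjectorRegistry {
  private readonly projectors = new Map<string, Projector>();

  register(projector: Projector): void {
    this.projectors.set(projector.projectionName, projector);
  }

  list(): string[] {
    return Array.from(this.projectors.keys());
  }

  // Run each named projector against the same intermediate object.
  projectByNames(structured: StructuredDoc, names: string[]): Record<string, unknown> {
    const out: Record<string, unknown> = {};
    for (const projectorName of names) {
      const projector = this.projectors.get(projectorName);
      if (projector === undefined) throw new Error(`Unknown projector: ${projectorName}`);
      out[projectorName] = projector.project(structured);
    }
    return out;
  }
}

// Count how often the (stand-in) base projection is built.
let baseBuilds = 0;
function buildStructured(itemCount: number): StructuredDoc {
  baseBuilds++;
  return { schema_version: 'structured_json.v1', summary: { content_item_count: itemCount } };
}

const projectorRegistry = new ProjectorRegistry();
projectorRegistry.register({
  projectionName: 'summary_projection',
  project: (s) => ({ contentItems: s.summary.content_item_count }),
});
projectorRegistry.register({ projectionName: 'schema_only', project: (s) => s.schema_version });

const baseDoc = buildStructured(3);
console.log(projectorRegistry.projectByNames(baseDoc, ['summary_projection', 'schema_only']));
console.log(baseBuilds); // 1: both outputs came from a single base projection
```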

Medical summary workflow

Use medical_summary_json when you need more than plain chunk output and want document-level routing plus reviewable structured tables.

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  preset: 'medical',
  output: { format: 'medical_summary_json' },
});

const result = await parser.process(
  [
    'OPD Date Time | 31 Oct 2025 04:47 PM',
    'Patient: Jane Doe',
    'Doctor: Dr. Meera Nair',
    'Diagnosis: Viral fever',
    'Follow Up / Cross - Referral',
    '07 Nov 2025 Date',
  ].join('\n'),
  'op-advice.txt',
);

const output = parser.formatOutput(result);
console.log(output.tables.medical_documents[0]);
console.log(output.tables.review_findings);
console.log(output.tables.timeline_events);

The main medical_summary_json tables are:

  • medical_documents: one row per source document with document type, core demographics, encounter date, and validation flags
  • review_findings: confidence-aware routing rows for missing, noisy, or low-confidence extractions
  • medication_safety: high-risk medications, duplicate therapy hints, and interaction signals
  • lab_observations: normalized test rows with units, ranges, and abnormal flags
  • lab_trends: grouped numeric trends for repeat lab observations
  • timeline_events: encounter, birth, lab observation, and follow-up dates
  • referrals: referral specialty and referred-provider rows when referral signals are strong enough
  • insurance_coverages: payer/policy/member/claim identifiers when insurance signals are present

Recommended developer loop for medical heuristics:

  1. Run one or more real samples from ./examples/inputs
  2. Inspect review_findings first instead of trusting extracted fields blindly
  3. Tighten heuristics against the smallest failing slice
  4. Re-run focused tests and then real sample PDFs

This is especially important for OP advice and second-opinion documents, where OCR/template noise can be high and review routing is safer than overconfident extraction.

await parser.use(plugin)

Register plugins to extend or enrich processing.

import { DocParser, createEmbeddingPlugin } from 'docparser-core';
import type { EmbeddingProviderAdapter } from 'docparser-core';

const mockEmbedder: EmbeddingProviderAdapter = {
  name: 'mock-embedder',
  dimensions: 3,
  async embed(texts) {
    return texts.map(() => [0.1, 0.2, 0.3]);
  },
  async embedSingle(text) {
    return [0.1, 0.2, 0.3];
  },
};

const parser = new DocParser();
await parser.use(
  createEmbeddingPlugin(mockEmbedder, {
    batchSize: 100,
    maxConcurrentBatches: 2,
  }),
);
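The README does not describe how batchSize and maxConcurrentBatches are applied internally. One plausible reading is that texts are grouped into batches and at most N batches are in flight at once. The sketch below implements that reading with a hypothetical `embedInBatches` helper and a fake embedder. It is not the plugin's code.

```typescript
// Illustrative only: one plausible interpretation of the two options.
type EmbedBatch = (texts: string[]) => Promise<number[][]>;

async function embedInBatches(
  texts: string[],
  embed: EmbedBatch,
  batchSize: number,
  maxConcurrentBatches: number,
): Promise<number[][]> {
  if (batchSize < 1) throw new Error('batchSize must be at least 1');

  // Group texts into consecutive batches.
  const batches: string[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }

  // Each worker pulls the next unclaimed batch until none remain.
  const results: number[][][] = batches.map(() => []);
  let next = 0;
  const worker = async (): Promise<void> => {
    for (;;) {
      const index = next++;
      const batch = batches[index];
      if (batch === undefined) return; // no batches left
      results[index] = await embed(batch);
    }
  };

  const workerCount = Math.max(1, Math.min(maxConcurrentBatches, batches.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));

  // Flatten while preserving the original text order.
  return results.reduce<number[][]>((all, part) => all.concat(part), []);
}

// Fake embedder: one number per text (its length).
const fakeEmbed: EmbedBatch = async (texts) => texts.map((t) => [t.length]);

embedInBatches(['a', 'bb', 'ccc', 'dddd', 'eeeee'], fakeEmbed, 2, 2).then((vectors) => {
  console.log(vectors); // [ [ 1 ], [ 2 ], [ 3 ], [ 4 ], [ 5 ] ]
});
```

Writing results by batch index keeps the output aligned with the input order even when batches finish out of order.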

await parser.loadPlugin(pluginSource)

Load a plugin dynamically from a module specifier, file URL, async factory, or direct config-style registration.

import { DocParser } from 'docparser-core';

const parser = new DocParser();
await parser.loadPlugin({
  module: './plugins/myChunkPlugin.js',
  exportName: 'chunkPlugin',
});

Relative module paths are resolved from process.cwd(). Package specifiers and file: URLs are also supported.
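As a rough model of those three cases, the hypothetical helper below turns a relative path into a file URL anchored at the working directory and passes package specifiers and file: URLs through unchanged. It uses only Node's standard `node:path` and `node:url` modules and is not the library's resolver.

```typescript
import * as path from 'node:path';
import { pathToFileURL } from 'node:url';

// Illustrative only: resolvePluginSpecifier is a hypothetical helper.
function resolvePluginSpecifier(specifier: string, cwd: string = process.cwd()): string {
  // file: URLs are already absolute and importable.
  if (specifier.indexOf('file:') === 0) return specifier;
  // Relative paths are anchored at the working directory, not at the importing file.
  if (specifier.indexOf('./') === 0 || specifier.indexOf('../') === 0) {
    return pathToFileURL(path.resolve(cwd, specifier)).href;
  }
  // Anything else is treated as a bare package specifier.
  return specifier;
}

console.log(resolvePluginSpecifier('./plugins/myChunkPlugin.js'));
console.log(resolvePluginSpecifier('some-plugin-package'));
```

The resulting string can be handed to a dynamic import(). Anchoring at process.cwd() means the same relative path can resolve differently depending on where the process was started.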


Presets

Presets are config override bundles applied on top of defaults.

| Preset | What it optimizes for |
| --- | --- |
| general | Balanced defaults (baseline) |
| financial | Tables/charts emphasis, currency/date normalization, PII masking |
| legal | Hierarchical chunking, larger chunk sizes + overlap, conservative normalization |
| technical | Code + images handled as first-class chunks |
| medical | Aggressive PII policy (redaction) and broader scan targets |
| conversational | Semantic chunking for topic shifts (chat/transcripts) |
| research | Structured chunking with keywords/entities and cross-references |


Chunking strategies

DocParser currently implements these strategy names:

| Strategy | When to use it | Notes |
| --- | --- | --- |
| hybrid_auto | You don’t know upfront what’s best | Auto-selects an internal strategy based on the UDM. |
| hierarchical | Headings/sections matter | Builds chunks aligned to section structure and splits oversized single text elements by structural boundaries before falling back to hard word limits. |
| sliding_window | Long unstructured text | Token-window with overlap; avoids mid-sentence splits where possible. |
| element_aware | Mixed content | Treats atomic elements like tables/code/images as indivisible and auto-splits oversized page-level text elements by blank lines, lines, sentences, then word boundaries. |
| semantic_similarity | Topic shift grouping | Lexical similarity-based boundaries (no embeddings required). |

Note: speaker_turn is also implemented, although it is not listed in the table above. Some additional strategy names in ChunkingStrategyName currently behave as aliases to existing built-in strategies.
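For intuition about the sliding_window row, here is a minimal token window with overlap. It is a simplified sketch, not the library's chunker. It uses whitespace-separated words as a stand-in for tokens and does not attempt the sentence-boundary avoidance mentioned in the table. `slidingWindow` is a hypothetical name.

```typescript
// Illustrative only: words stand in for tokens, and sentence boundaries are ignored.
function slidingWindow(text: string, windowTokens: number, overlapTokens: number): string[] {
  if (windowTokens < 1) throw new Error('windowTokens must be at least 1');
  if (overlapTokens < 0 || overlapTokens >= windowTokens) {
    throw new Error('overlapTokens must be between 0 and windowTokens - 1');
  }

  const tokens = text.split(/\s+/).filter((t) => t.length > 0);
  const step = windowTokens - overlapTokens; // how far each window advances
  const chunks: string[] = [];

  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + windowTokens).join(' '));
    // Stop once a window has reached the end, so no chunk is a pure repeat of the tail.
    if (start + windowTokens >= tokens.length) break;
  }
  return chunks;
}

console.log(slidingWindow('a b c d e f g', 4, 2));
// [ 'a b c d', 'c d e f', 'e f g' ]
```

Each chunk repeats the last two words of the previous one, which is the context-preserving effect that overlap is meant to provide.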

Adaptive chunking

Enable chunking.adaptive.enabled when you want DocParser to keep using hybrid_auto strategy selection but also retune chunk sizes per document shape.

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  chunking: {
    strategy: 'hybrid_auto',
    adaptive: { enabled: true },
  },
});

Current adaptive profiles:

  • Structured documents with headings: larger hierarchical chunks and small-chunk merging.
  • Visual or code-heavy documents: element-aware chunks with tighter token ceilings.
  • Long unstructured text: denser sliding windows with more overlap to preserve context.

Oversized page-level text fallback

Some parsers, especially PDF extraction on form-like documents, can emit a single very large text element for a page. hierarchical and element_aware now break those oversized elements into smaller chunks automatically using this fallback order:

  1. Blank-line sections
  2. Line boundaries
  3. Sentence boundaries
  4. Hard word boundaries

This keeps atomic elements intact while preventing one-page text blobs from bypassing maxChunkTokens.
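The fallback order can be sketched as a recursive splitter that tries each boundary type in turn and only descends when a piece is still too large. This is a simplified model, not the library's code: it counts whitespace-separated words instead of tokens, uses a basic punctuation rule for sentences, and does not re-merge small neighboring pieces the way a production chunker probably would. `splitOversized` is a hypothetical name.

```typescript
// Illustrative only: words stand in for tokens.
const countWords = (text: string): number => text.split(/\s+/).filter((t) => t.length > 0).length;

// Splitters in fallback order; the hard word split is handled separately below.
const SPLITTERS: Array<(text: string) => string[]> = [
  (t) => t.split(/\n\s*\n/), // 1. blank-line sections
  (t) => t.split(/\n/), // 2. line boundaries
  (t) => t.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [t], // 3. sentence boundaries
];

function splitOversized(text: string, maxTokens: number, level: number = 0): string[] {
  const trimmed = text.trim();
  if (trimmed.length === 0) return [];
  if (countWords(trimmed) <= maxTokens) return [trimmed];

  const split = SPLITTERS[level];
  if (split === undefined) {
    // 4. hard word boundaries: cut every maxTokens words.
    const words = trimmed.split(/\s+/);
    const out: string[] = [];
    for (let i = 0; i < words.length; i += maxTokens) {
      out.push(words.slice(i, i + maxTokens).join(' '));
    }
    return out;
  }

  // Split at this level, then push any still-oversized piece down to the next level.
  return split(trimmed)
    .map((piece) => splitOversized(piece, maxTokens, level + 1))
    .reduce<string[]>((all, part) => all.concat(part), []);
}

console.log(splitOversized('one two three four five six.\n\nseven eight', 3));
// [ 'one two three', 'four five six.', 'seven eight' ]
```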


Output formats

Examples below use parser.formatOutput(result); all outputs are derived from result.chunks.

Pinecone

import { DocParser } from 'docparser-core';

const parser = new DocParser({ output: { format: 'pinecone', vectorDb: { namespace: 'docs' } } });
const result = await parser.process('Hello Pinecone', 'pinecone.txt');
const payload = parser.formatOutput(result);

Chroma

import { DocParser } from 'docparser-core';

const parser = new DocParser({ output: { format: 'chroma' } });
const result = await parser.process('Hello Chroma', 'chroma.txt');
const payload = parser.formatOutput(result);

Weaviate

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  output: { format: 'weaviate', vectorDb: { collection: 'DocumentChunk' } },
});
const result = await parser.process('Hello Weaviate', 'weaviate.txt');
const objects = parser.formatOutput(result);

Qdrant

import { DocParser } from 'docparser-core';

const parser = new DocParser({ output: { format: 'qdrant', vectorDb: { collection: 'chunks' } } });
const result = await parser.process('Hello Qdrant', 'qdrant.txt');
const points = parser.formatOutput(result);

LangChain

import { DocParser } from 'docparser-core';

const parser = new DocParser({ output: { format: 'langchain' } });
const result = await parser.process('Hello LangChain', 'langchain.txt');
const docs = parser.formatOutput(result);

Custom formatter

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  output: {
    format: 'custom',
    customFormatter: (chunks) => ({ count: chunks.length }),
  },
});

const result = await parser.process('Custom output', 'custom.txt');
const out = parser.formatOutput(result);

Configuration

DocParser accepts a single configuration object (DocumentParserConfig). All fields are optional; defaults are applied.

import type { DocumentParserConfig } from 'docparser-core';

export const config: DocumentParserConfig = {
  // High-level: choose a preset override bundle
  preset: 'general',

  // Input constraints (NOTE: some input limits are currently enforced via security gate)
  input: {
    allowedFormats: ['*'], // MIME list or '*' (wildcard)
    maxFileSize: 100 * 1024 * 1024, // bytes
    maxPages: 5000,
    maxElements: 50000,
    password: undefined, // reserved for password-protected inputs
    urlFetchTimeout: 30_000,
    urlFetchHeaders: {},
    encodingOverride: undefined,
  },

  // Per-format parsing knobs
  parsing: {
    pdf: {
      extractImages: true,
      extractTables: true,
      ocrScannedPages: true,
      detectMultiColumn: true,
      detectHeadersFooters: true,
      imageDpi: 300,
    },
    docx: {
      extractImages: true,
      extractTables: true,
      followStyles: true,
      includeComments: false,
      includeTrackChanges: false,
    },
    html: {
      sanitize: true,
      removeScripts: true,
      removeStyles: false,
      removeBoilerplate: true,
      followLinks: false,
      maxDepth: 1,
    },
    xlsx: {
      includeFormulas: false,
      includeCharts: true,
      sheetSelection: 'all',
      emptyCellHandling: 'skip',
    },
    pptx: {
      includeNotes: true,
      includeHiddenSlides: false,
    },
    ocr: {
      engine: 'tesseract',
      languages: ['eng'],
      confidenceThreshold: 60,
      preprocess: true,
      dpi: 300,
    },
    images: {
      classifyType: true,
      extractTextOcr: true,
      generateDescriptions: false,
      skipDecorative: true,
      minSize: { width: 50, height: 50 },
    },
    charts: {
      extractData: true,
      generateSummary: true,
      convertToTable: true,
    },
  },

  // Cleaning: normalize and remove noise before chunking
  cleaning: {
    normalizeUnicode: true,
    fixHyphenation: true,
    mergeBrokenParagraphs: true,
    removeWatermarks: true,
    removeHeadersFooters: true,
    removePageNumbers: true,
    normalizeWhitespace: true,
    normalizeDates: false,
    normalizeCurrencies: false,
    buildGlossary: true,
    customPatternsToRemove: [], // regex strings
    customPatternsToFlag: [], // regex strings
    preserveFormattingIn: ['code', 'quotes', 'tables'],
  },

  // Chunking: turn UDM elements into retrieval-ready chunks
  chunking: {
    strategy: 'hybrid_auto',
    customStrategy: undefined, // reserved
    tokenCounter: 'approximate',
    customTokenCounter: undefined,
    minChunkTokens: 64,
    maxChunkTokens: 512,
    targetChunkTokens: 256,
    overlap: {
      enabled: true,
      tokens: 50,
      strategy: 'sentence_boundary',
    },
    headingContext: {
      includeParentHeadings: true,
      maxHeadingDepth: 3,
      separator: ' > ',
    },
    tableHandling: 'own_chunk',
    chartHandling: 'own_chunk',
    imageHandling: 'skip_decorative',
    codeHandling: 'own_chunk',
    mergeSmallChunks: false,
    neverSplit: ['table', 'code', 'chart', 'image'],
    semanticThreshold: 0.3,
    semanticWindowSize: 3,
  },

  // Enrichment: keywords/entities/importance and optional extras
  enrichment: {
    extractKeywords: true,
    keywordMethod: 'tfidf',
    maxKeywords: 10,
    extractEntities: true,
    entityTypes: ['PERSON', 'ORGANIZATION', 'DATE', 'MONEY', 'LOCATION'],
    detectTopics: false,
    generateSummary: false,
    summaryMaxTokens: 50,
    generateQuestions: false,
    maxQuestions: 3,
    computeImportance: true,
    resolveCrossReferences: true,
    linkChunks: true,
    feedback: {
      enabled: false,
      preferredTerms: [],
      deprioritizedTerms: [],
      prioritizedEntityTypes: [],
      preferredTermBoost: 0.08,
      deprioritizedTermPenalty: 0.1,
      entityTypeBoost: 0.05,
    },
    semanticDedup: {
      enabled: false,
      similarityThreshold: 0.6,
      minSharedTerms: 2,
      maxRelatedChunks: 3,
    },
    customEnrichers: [],
  },

  // Security gate + in-pipeline security scanning
  security: {
    maxFileSize: 100 * 1024 * 1024,
    allowedFormats: ['*'],
    blockMacros: true,
    blockJavascriptInPdf: true,
    blockXxe: true,
    blockPolyglots: true,
    virusScanHook: undefined,
    sandboxParsing: true,
    maxMemoryPerDoc: 512 * 1024 * 1024,
    maxProcessingTime: 300_000,
    maxTempDisk: 1024 * 1024 * 1024,
    pii: {
      enabled: true,
      provider: undefined, // PIIProviderAdapter
      mode: 'detect', // detect | mask | redact | hash
      scanTargets: ['text_content', 'table_cells', 'metadata_fields'],
      maskFormat: '[{{TYPE}}]',
      allowlist: [],
      customPatterns: [],
      encryptionKey: undefined,
    },
    promptInjection: {
      enabled: true,
      action: 'flag',
      customPatterns: [],
      sensitivity: 'medium',
    },
    dataLifecycle: {
      encryptTempFiles: true,
      secureDelete: true,
      clearMemoryAfter: true,
      maxCacheTtl: 3_600_000,
    },
    audit: {
      enabled: true,
      logDestination: 'memory',
      customLogger: undefined,
      includeDocumentName: true,
      anonymizeDocumentName: false,
    },
    classification: {
      enabled: false,
      defaultLevel: 'internal',
      rules: [],
    },
  },

  // Output formatting
  output: {
    format: 'json',
    customFormatter: undefined,
    includeFields: [],
    excludeFields: [],
    includeQualityReport: true,
    includeProcessingReceipt: true,
    flattenMetadata: false,
    vectorDb: {
      namespace: undefined,
      collection: undefined,
      index: undefined,
      batchSize: 100,
    },
  },

  // Performance/caching (some fields are reserved for future concurrency engines)
  performance: {
    mode: 'balanced',
    workerThreads: 1,
    maxConcurrentDocs: 4,
    streaming: false,
    streamingThreshold: 50 * 1024 * 1024,
    cache: {
      enabled: true,
      backend: 'memory',
      maxSize: 100 * 1024 * 1024,
      maxItems: 100,
      ttl: 3_600_000,
      cacheParsing: true,
      cacheChunking: true,
      customBackend: undefined,
    },
    memoryBudget: 1024 * 1024 * 1024,
  },

  // Runtime pipeline behavior presets + optional stage toggles
  pipeline: {
    runtimePreset: 'standard',
    stages: {
      tableNl: true,
      cleaning: true,
      enrichment: true,
      piiScan: true,
      promptInjection: true,
    },
  },

  // Plugins can be pre-registered from objects, async factories, module specifiers, or file URLs.
  plugins: [],

  // Hooks
  hooks: {
    // Stage name + percentage + structured payload for runtime telemetry
    onProgress: undefined,
    onWarning: undefined,
    onError: undefined,
    // Fired once per final chunk after enrichment/security scans complete
    onChunkReady: undefined,
    onGovernanceEvent: undefined,
    onDocumentComplete: undefined,
  },

  // Debug
  debug: {
    enabled: false,
    visualOutput: false,
    visualOutputPath: undefined,
    profilePerformance: false,
    verboseLogging: false,
    dumpUdm: false,
    dumpUdmPath: undefined,
  },
};
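
To make the security.pii.maskFormat template in the config above concrete, here is a minimal sketch of how a '[{{TYPE}}]' template could be expanded in mask mode. The renderMask and maskEmails helpers and the email regex are assumptions for illustration; the package's own detectors and PII provider adapters are not shown in this README.

```typescript
// Hypothetical helper: expand a mask template such as '[{{TYPE}}]'.
function renderMask(maskFormat: string, piiType: string): string {
  return maskFormat.split('{{TYPE}}').join(piiType);
}

// Deliberately simple email pattern, for illustration only.
const EMAIL_PATTERN = /[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}/g;

// Hypothetical 'mask' mode: replace every match with the rendered mask.
function maskEmails(text: string, maskFormat: string): string {
  return text.replace(EMAIL_PATTERN, renderMask(maskFormat, 'EMAIL'));
}

const masked = maskEmails('Contact jane.doe@example.com for access.', '[{{TYPE}}]');
console.log(masked); // Contact [EMAIL] for access.
```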

Plugins

DocParser plugins can be registered via await parser.use(plugin), await parser.loadPlugin(pluginSource), or upfront through new DocParser({ plugins: [...] }).

Example of config-driven dynamic loading:

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  plugins: [
    {
      module: './plugins/myEnricher.js',
      exportName: 'enricherPlugin',
    },
    async () => ({
      name: 'inline-enricher',
      version: '1.3.0',
      type: 'enrichment',
      hooks: {
        enrichAll: async (chunks) => chunks,
      },
    }),
  ],
});

createLLMPlugin() (with a mock provider)

import { DocParser, createLLMPlugin } from 'docparser-core';
import type { LLMProviderAdapter, LLMRequest, LLMResponse } from 'docparser-core';

const mockLLM: LLMProviderAdapter = {
  name: 'mock-llm',
  async isAvailable() {
    return true;
  },
  async complete(req: LLMRequest): Promise<LLMResponse> {
    return {
      text: 'Mock summary.\n1. What is this?\n2. Why does it matter?',
      tokensUsed: 10,
      model: req.model ?? 'mock',
    };
  },
  // Present for interface completeness; not used by createLLMPlugin
  async embed(texts: string[]) {
    return texts.map(() => [0.01, 0.02, 0.03]);
  },
};

const parser = new DocParser();
await parser.use(
  createLLMPlugin(mockLLM, { generateSummary: true, generateQuestions: true, maxChunks: 25 }),
);

const result = await parser.process('This is a chunk that will be summarized.', 'llm.txt');
console.log(result.chunks[0]?.summary);

createEmbeddingPlugin()

import { DocParser, createEmbeddingPlugin } from 'docparser-core';
import type { EmbeddingProviderAdapter } from 'docparser-core';

const embedder: EmbeddingProviderAdapter = {
  name: 'mock-embeddings',
  dimensions: 3,
  async embed(texts) {
    return texts.map(() => [0.1, 0.2, 0.3]);
  },
  async embedSingle(text) {
    return [0.1, 0.2, 0.3];
  },
};

const parser = new DocParser();
await parser.use(
  createEmbeddingPlugin(embedder, {
    batchSize: 100,
    maxConcurrentBatches: 2,
  }),
);

const result = await parser.process('Embedding demo', 'embed.txt');
console.log(result.chunks[0]?.metadata.embedding);

The legacy positional signature createEmbeddingPlugin(embedder, batchSize, maxConcurrentBatches) is still supported.

createOCRPlugin()

Install the optional OCR runtime package first:

npm install docparser-ocr

import { DocParser, createOCRPlugin } from 'docparser-core';

const parser = new DocParser();
await parser.use(
  createOCRPlugin({
    tesseract: { languages: ['eng'] },
    pipeline: { qualityLevel: 'thorough', minConfidence: 0.3 },
    routing: {
      preferDocumentLanguage: true,
      forceImageTypes: ['screenshot', 'form', 'handwritten'],
    },
  }),
);

// OCR triggers when parsed content includes image elements whose imageRef is a data URL.
// Built-in parsers now provide that for raw PNG/JPEG/TIFF/WebP files,
// OOXML embedded images, and rasterized image-only PDF pages.

Smart OCR routing prefers the document language when the provider supports it and skips low-value image types such as logos by default. It still routes high-signal images (screenshots, forms, and handwritten snippets) even when the parser did not explicitly mark containsText = true.

High-signal OCR routes also bypass the lightweight non-text image precheck before recognition. That matters for prescriptions, scanned forms, and handwriting, where a cheap contrast heuristic may miss real text even though OCR should still run.

High-signal OCR routes also evaluate the built-in OCR retry strategies instead of stopping at the first barely acceptable pass. By default, the OCR runtime retries low-confidence results with scanned-document, low-contrast, and handwriting-oriented recognition strategies, including alternate Tesseract page segmentation modes such as single_block and sparse_text.

Use pipeline.qualityLevel to control how much OCR work the runtime should spend per image:

  • fast: minimal OCR work and no built-in retries
  • balanced: current default, with practical recovery passes for common scans
  • thorough: more preprocessing, more retry profiles, and more full-profile evaluation
  • extreme: computation-heavy OCR that tries the broadest built-in set of segmentation and engine strategies

If you provide pipeline.retryProfiles, they replace the built-in retry set by default so existing custom OCR flows stay stable. Set pipeline.useBuiltInRetryProfiles = true when you want your custom profiles appended after the built-in quality-level retries.
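
The replace-versus-append rule can be pictured as follows. This is a sketch of the documented behavior, not the runtime's code: resolveRetryProfiles and the RetryProfile shape are hypothetical, and the built-in profile names are placeholders borrowed from the page segmentation modes mentioned above.

```typescript
// Hypothetical profile shape; the real retry profile type may carry more fields.
interface RetryProfile {
  name: string;
}

// Sketch of the documented rule:
// - retryProfiles not provided: the built-in set applies
// - retryProfiles provided: they replace the built-in set
// - retryProfiles provided + useBuiltInRetryProfiles: built-ins first, then custom
function resolveRetryProfiles(
  builtIn: RetryProfile[],
  custom: RetryProfile[] | undefined,
  useBuiltInRetryProfiles = false,
): RetryProfile[] {
  if (custom === undefined) return builtIn.slice();
  return useBuiltInRetryProfiles ? builtIn.concat(custom) : custom.slice();
}

const builtIn: RetryProfile[] = [{ name: 'single_block' }, { name: 'sparse_text' }];
const custom: RetryProfile[] = [{ name: 'my-invoice-profile' }];

const replaced = resolveRetryProfiles(builtIn, custom).map((p) => p.name);
const appended = resolveRetryProfiles(builtIn, custom, true).map((p) => p.name);
console.log(replaced, appended);
```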

When OCR runs through the optional runtime package, the plugin now records per-image OCR diagnostics in udm.processingNotes and in the processing receipt. Those diagnostics include geometry counts, detected script, detected orientation, and any OSD-driven rotation that was applied before retry profiles ran.

If docparser-ocr is not installed, createOCRPlugin() throws an explicit runtime error with the install command.

Writing a custom plugin

DocParser’s runtime plugin interface is DocParserPlugin. In practice, hooks are invoked with a single argument (UDM or chunks), and your hook returns the modified value.

import { DocParser } from 'docparser-core';
import type { DocParserPlugin, DocumentUDM } from 'docparser-core';

const taggerPlugin: DocParserPlugin = {
  name: 'tagger',
  version: '1.3.0',
  type: 'parser',
  hooks: {
    afterParse: async (udm: DocumentUDM) => {
      return {
        ...udm,
        processingNotes: [
          ...udm.processingNotes,
          { severity: 'info', stage: 'plugin', message: 'Tagged by taggerPlugin' },
        ],
      };
    },
  },
};

const parser = new DocParser();
await parser.use(taggerPlugin);

Feedback-driven enrichment

Use enrichment.feedback when you want enrichment outputs to reflect reviewer or product feedback about which signals matter more.

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  enrichment: {
    feedback: {
      enabled: true,
      preferredTerms: ['invoice', 'payment'],
      deprioritizedTerms: ['lunch', 'social'],
      prioritizedEntityTypes: ['MONEY', 'DATE'],
    },
  },
});

When enabled, preferred terms are promoted in chunk keywords, deprioritized terms reduce ranking weight, and prioritized entity types increase chunk importance. The applied matches are written to chunk.metadata.feedbackSignals for downstream inspection.
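
As a rough mental model of how the three feedback weights from the Configuration section (preferredTermBoost 0.08, deprioritizedTermPenalty 0.1, entityTypeBoost 0.05) might combine, consider the sketch below. The additive formula, the clamp to [0, 1], and every name in the code are assumptions; this README does not document the actual scoring.

```typescript
interface FeedbackWeights {
  preferredTermBoost: number;
  deprioritizedTermPenalty: number;
  entityTypeBoost: number;
}

interface FeedbackMatches {
  preferredTerms: number; // preferred terms found in the chunk
  deprioritizedTerms: number; // deprioritized terms found in the chunk
  prioritizedEntities: number; // entities whose type is prioritized
}

// Hypothetical additive adjustment, clamped to the range [0, 1].
function adjustImportance(base: number, m: FeedbackMatches, w: FeedbackWeights): number {
  const adjusted =
    base +
    m.preferredTerms * w.preferredTermBoost -
    m.deprioritizedTerms * w.deprioritizedTermPenalty +
    m.prioritizedEntities * w.entityTypeBoost;
  return Math.min(1, Math.max(0, adjusted));
}

const weights: FeedbackWeights = {
  preferredTermBoost: 0.08,
  deprioritizedTermPenalty: 0.1,
  entityTypeBoost: 0.05,
};

// Two preferred terms, one deprioritized term, one prioritized entity.
const score = adjustImportance(
  0.5,
  { preferredTerms: 2, deprioritizedTerms: 1, prioritizedEntities: 1 },
  weights,
);
console.log(score.toFixed(2));
```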

Semantic dedup enrichment

Use enrichment.semanticDedup when you want near-duplicate chunks linked for downstream review without removing either chunk from the final result.

import { DocParser } from 'docparser-core';

const parser = new DocParser({
  enrichment: {
    semanticDedup: {
      enabled: true,
      similarityThreshold: 0.6,
      minSharedTerms: 2,
      maxRelatedChunks: 3,
    },
  },
});

When enabled, semantically similar chunks receive related relationships and a chunk.metadata.semanticDedup.relatedChunks summary containing the related chunk IDs, similarity score, and shared terms.
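
The two thresholds interact: a pair of chunks has to clear both the similarity score and the shared-term count. The sketch below uses Jaccard similarity over lowercase word sets as a stand-in; the package's real similarity measure is not documented here, so treat the tokenizer, the metric, and the function names as assumptions.

```typescript
// Hypothetical tokenizer: lowercase runs of three or more letters or digits.
function terms(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().match(/[a-z0-9]{3,}/g) ?? []);
}

interface Relatedness {
  similarity: number;
  sharedTerms: string[];
  related: boolean;
}

// Jaccard similarity as a stand-in metric; both thresholds must be met.
function compareChunks(
  a: string,
  b: string,
  similarityThreshold = 0.6,
  minSharedTerms = 2,
): Relatedness {
  const ta = terms(a);
  const tb = terms(b);
  const sharedTerms = Array.from(ta).filter((t) => tb.has(t)).sort();
  const unionSize = new Set<string>(Array.from(ta).concat(Array.from(tb))).size;
  const similarity = unionSize === 0 ? 0 : sharedTerms.length / unionSize;
  return {
    similarity,
    sharedTerms,
    related: similarity >= similarityThreshold && sharedTerms.length >= minSharedTerms,
  };
}

const pair = compareChunks(
  'Invoice payment is due within 30 days.',
  'Payment of the invoice is due within 30 days.',
);
console.log(pair);
```

Raising similarityThreshold or minSharedTerms makes linking stricter; in this sketch neither chunk is removed, which matches the link-only behavior described above.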


Batch Processing

Use BatchProcessor to process multiple documents concurrently, with per-file isolation and progress reporting.

import { BatchProcessor, createLogger } from 'docparser-core';

const logger = createLogger({ level: 'warn' });
const batch = new BatchProcessor(logger, { preset: 'general', output: { format: 'jsonl' } });

const result = await batch.process(
  [
    { buffer: Buffer.from('Doc 1'), filename: 'a.txt' },
    { buffer: Buffer.from('# Title\nHello'), filename: 'b.md' },
  ],
  {
    concurrency: 3,
    onProgress: (done, total, filename) => {
      console.log(`[${done}/${total}] ${filename}`);
    },
  },
);

console.log(result.stats);

Streaming

Stream bytes in and emit chunks as JSONL (or objects) using StreamProcessor.

import { StreamProcessor } from 'docparser-core';
import { createReadStream, createWriteStream } from 'node:fs';

const processor = new StreamProcessor();

createReadStream('report.docx')
  .pipe(processor.createChunkStream('report.docx'))
  .pipe(createWriteStream('chunks.jsonl'));

createChunkStream() now forwards each chunk as soon as the parser marks it ready, so downstream consumers can start ingesting chunk output before final receipt and quality reporting complete.

When performance.streaming is enabled and the incoming stream crosses performance.streamingThreshold, both processStream() and createChunkStream() spill the input to a temp file and reuse DocParser.processFile() so file-backed parser paths can run without buffering the full stream in memory.

If you want object-mode chunk objects instead of JSONL strings:

import { StreamProcessor } from 'docparser-core';
import { createReadStream } from 'node:fs';

const processor = new StreamProcessor();

createReadStream('report.docx')
  .pipe(processor.createChunkStream('report.docx', { objectMode: true }))
  .on('data', (chunk) => {
    console.log(chunk.chunkId, chunk.content.length);
  });

Or buffer the stream and get a normal ProcessingResult:

import { StreamProcessor } from 'docparser-core';
import { createReadStream } from 'node:fs';

const processor = new StreamProcessor();
const result = await processor.processStream(createReadStream('big.md'), 'big.md');
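
On the consuming side, JSONL output is one JSON object per line, so it can be parsed line by line. A minimal sketch, assuming each line carries at least the chunkId and content fields read in the object-mode example above; parseJsonl and ChunkLine are hypothetical and not package exports.

```typescript
// Hypothetical minimal shape of one JSONL line.
interface ChunkLine {
  chunkId: string;
  content: string;
}

// Hypothetical helper: parse JSONL text, skipping blank lines.
function parseJsonl(text: string): ChunkLine[] {
  return text
    .split('\n')
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as ChunkLine);
}

// Stand-in for the contents of a chunks.jsonl file.
const sample = [
  JSON.stringify({ chunkId: 'c1', content: 'First chunk' }),
  JSON.stringify({ chunkId: 'c2', content: 'Second chunk' }),
  '',
].join('\n');

const parsed = parseJsonl(sample);
console.log(parsed.map((c) => `${c.chunkId}:${c.content.length}`));
```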

CLI

The package ships a docparser CLI (built from src/cli/cli.ts).

Help

npm install docparser-core
npx docparser --help

# Interactive config wizard
npx docparser init
# Alias
npx docparser setup

# Or run without installing:
npx --package docparser-core docparser --help

init and setup run the same interactive wizard. The wizard can generate:

  • a starter config for common workflows
  • a full annotated template that covers the current DocumentParserConfig surface
  • either docparser.config.mjs or docparser.config.jsonc
  • an optional .env.local file for secret placeholders

The generated config is validated before the wizard exits, so invalid enum values or impossible numeric combinations fail immediately with clear errors.
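
As an illustration of what an impossible numeric combination could look like, the sketch below checks the chunking token limits from the Configuration section. The specific rules and the validateChunking name are assumptions; the wizard's actual validation rules are not listed in this README.

```typescript
interface ChunkingLimits {
  minChunkTokens: number;
  targetChunkTokens: number;
  maxChunkTokens: number;
  overlapTokens: number;
}

// Hypothetical checks; returns a list of human-readable problems.
function validateChunking(c: ChunkingLimits): string[] {
  const errors: string[] = [];
  if (c.minChunkTokens > c.targetChunkTokens) {
    errors.push('minChunkTokens must not exceed targetChunkTokens');
  }
  if (c.targetChunkTokens > c.maxChunkTokens) {
    errors.push('targetChunkTokens must not exceed maxChunkTokens');
  }
  if (c.overlapTokens >= c.minChunkTokens) {
    errors.push('overlap tokens must be smaller than minChunkTokens');
  }
  return errors;
}

// The documented defaults pass; an inverted min/target pair does not.
const defaultsCheck = validateChunking({
  minChunkTokens: 64,
  targetChunkTokens: 256,
  maxChunkTokens: 512,
  overlapTokens: 50,
});
const invertedCheck = validateChunking({
  minChunkTokens: 600,
  targetChunkTokens: 256,
  maxChunkTokens: 512,
  overlapTokens: 50,
});
console.log(defaultsCheck, invertedCheck);
```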

docparser init

npx docparser init

# You can also pre-fill parts of the wizard
npx docparser init --format mjs --template full --output ./docparser.config.mjs

What the wizard does:

  • asks for a config format: mjs or jsonc
  • asks for a starter or full annotated template
  • guides you through common defaults like preset, runtime preset, output format, PII mode, and OCR defaults
  • can scaffold built-in providers.ocr and providers.llm sections for Tesseract, native Tesseract, Ollama, Claude, OpenRouter, and organization-managed gateways
  • can generate a sample plugin stub and a sample governance hook stub
  • can generate .env.local and wire secrets through environment-variable references instead of hardcoding them into the config file

Use mjs when you want comments, hooks, imports, or executable logic in the config. Use jsonc when you want a purely declarative config file with comments.

docparser health

npx docparser health --config ./docparser.config.mjs

# Widen the sweep beyond startup-only providers and write JSON to a file
npx docparser health \
  --config ./docparser.config.mjs \
  --all \
  --output ./out/provider-health.json

health loads the same config formats as process and inspect, resolves .env and .env.local, and runs provider health checks against the configured built-in or custom runtime-backed providers that are active in the current runtime configuration. An LLM registration counts as active when summary or question generation is enabled, either on the provider registration itself or through inherited enrichment defaults.

Health flags:

  • --config: required. Load a parser config from a JSON, JSONC, or ESM module file.
  • --all: check all active configured runtime-backed providers instead of only those marked with healthCheck.validateOnStartup: true.
  • --output: write the JSON health report to a file instead of stdout.

Behavior:

  • exits non-zero when any checked provider is unhealthy
  • prints a JSON summary with provider-level health results
  • uses the same provider retry and redaction rules as the runtime adapters
  • defaults to startup-scoped checks, so it mirrors the fail-fast provider validation path used during parser initialization

Built-In Providers

DocumentParserConfig now supports a top-level providers section for built-in OCR and LLM runtime registration. That means you can turn on OCR and LLM enrichment directly from config files without writing custom plugin factories first.

export default {
  parsing: {
    ocr: {
      languages: ['eng'],
      confidenceThreshold: 60,
      dpi: 300,
    },
  },
  providers: {
    ocr: {
      provider: {
        kind: 'native_tesseract',
        executablePath: process.env.DOCPARSER_NATIVE_TESSERACT_PATH,
        languages: ['eng'],
      },
    },
    llm: {
      provider: {
        kind: 'openrouter',
        apiKey: process.env.DOCPARSER_OPENROUTER_API_KEY,
        model: 'anthropic/claude-3.5-sonnet',
      },
      generateSummary: true,
      generateQuestions: false,
      summaryMaxTokens: 96,
    },
  },
};

Supported built-in provider kinds:

  • OCR: tesseract, native_tesseract, ollama, claude, openrouter, organization_gateway
  • LLM: ollama, claude, openrouter, organization_gateway

organization_gateway stays generic on purpose. Set protocol to openai, anthropic, or ollama and point baseUrl at your internal proxy or gateway.

HTTP-backed built-in providers share the same hardening controls:

  • provider.retry: bounded retry/backoff for transient network and upstream failures
  • providers.ocr.healthCheck and providers.llm.healthCheck: optional startup validation so the parser can fail fast before processing documents
  • surfaced provider failures redact common bearer tokens, API keys, secrets, and passwords before the error leaves the adapter layer
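
A sketch of what redacting secrets from an error message can involve. The patterns and the redactSecrets name are assumptions for illustration and are likely narrower than whatever the adapter layer actually applies.

```typescript
// Hypothetical patterns: bearer tokens and key/value style credentials.
const BEARER_PATTERN = /\bBearer\s+[A-Za-z0-9._~+\/-]+=*/gi;
const KEY_VALUE_PATTERN =
  /\b(api[_-]?key|secret|password|token)(["']?\s*[:=]\s*["']?)[^\s"',&]+/gi;

// Replace the secret value but keep the key name so the error stays readable.
function redactSecrets(message: string): string {
  return message
    .replace(BEARER_PATTERN, 'Bearer [REDACTED]')
    .replace(KEY_VALUE_PATTERN, '$1$2[REDACTED]');
}

const raw =
  'Upstream 401: Authorization: Bearer sk-abc123.def, apiKey=sk-live-42 password: "hunter2"';
const redacted = redactSecrets(raw);
console.log(redacted);
```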

You can preflight those providers yourself before processing documents:

import { checkConfiguredProvidersHealth } from 'docparser-core';

const results = await checkConfiguredProvidersHealth(config, { startupOnly: true });
console.log(results);

Production Templates

The examples below are intended as production-oriented starting points. Tune parsing.ocr, chunking, and enrichment settings to your document mix, but keep the provider retry and startup health sections unless your environment already handles those concerns upstream.

Checked-in equivalents of these templates now live in ./examples/configs/:

  • ./examples/configs/native-tesseract-openrouter.mjs
  • ./examples/configs/tesseract-js-openrouter.mjs
  • ./examples/configs/ollama-local.mjs
  • ./examples/configs/claude-direct.mjs
  • ./examples/configs/openrouter-hosted.mjs
  • ./examples/configs/organization-gateway-openai.mjs
  • ./examples/configs/provider-example.env

Runnable sample inputs that pair with these configs live in ./examples/inputs/:

  • ./examples/inputs/report.md
  • ./examples/inputs/page.html
  • ./examples/inputs/renewal-form.png
  • ./examples/inputs/renewal-scan.pdf
  • ./examples/inputs/README.md

Native Tesseract OCR with OpenRouter LLM:

export default {
  parsing: {
    ocr: {
      languages: ['eng'],
      confidenceThreshold: 60,
      dpi: 300,
    },
  },
  providers: {
    ocr: {
      provider: {
        kind: 'native_tesseract',
        executablePath: process.env.DOCPARSER_NATIVE_TESSERACT_PATH,
        languages: ['eng'],
        userDefinedDpi: 300,
      },
      processEmbeddedImages: true,
      processImagePDFs: true,
      pipeline: {
        qualityLevel: 'thorough',
        useBuiltInRetryProfiles: true,
        earlyExitConfidence: 0.92,
      },
    },
    llm: {
      provider: {
        kind: 'openrouter',
        apiKey: process.env.DOCPARSER_OPENROUTER_API_KEY,
        model: 'anthropic/claude-3.5-sonnet',
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      generateSummary: true,
      generateQuestions: false,
      summaryMaxTokens: 128,
      concurrency: 2,
    },
  },
};

If you prefer the optional JavaScript OCR runtime instead of a native binary, switch kind: 'native_tesseract' to kind: 'tesseract' and keep the same OCR registration shape.
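
To see what the retry block in the template above implies in wall-clock terms, the sketch below computes a delay schedule under the common exponential-backoff reading: the first retry waits initialDelayMs, each further retry multiplies the wait by backoffMultiplier, and no wait exceeds maxDelayMs. Whether the adapters add jitter or count maxAttempts differently is not stated here, so treat the formula and the backoffDelays name as assumptions.

```typescript
interface RetryPolicy {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

// Delays before each retry. Assumes maxAttempts includes the first try,
// so there are at most maxAttempts - 1 waits.
function backoffDelays(policy: RetryPolicy): number[] {
  const delays: number[] = [];
  for (let retry = 0; retry < policy.maxAttempts - 1; retry++) {
    const delay = policy.initialDelayMs * Math.pow(policy.backoffMultiplier, retry);
    delays.push(Math.min(delay, policy.maxDelayMs));
  }
  return delays;
}

const delays = backoffDelays({
  maxAttempts: 3,
  initialDelayMs: 250,
  maxDelayMs: 2000,
  backoffMultiplier: 2,
});
console.log(delays); // [ 250, 500 ]
```

Under these assumptions the template's policy adds at most 750 ms of waiting per request, on top of up to three 30-second timeouts.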

Fully local Ollama OCR and LLM:

export default {
  providers: {
    ocr: {
      provider: {
        kind: 'ollama',
        baseUrl: process.env.DOCPARSER_OLLAMA_BASE_URL,
        model: 'llava:13b',
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      processEmbeddedImages: true,
      processImagePDFs: true,
      pipeline: { qualityLevel: 'balanced' },
    },
    llm: {
      provider: {
        kind: 'ollama',
        baseUrl: process.env.DOCPARSER_OLLAMA_BASE_URL,
        model: 'qwen2.5:7b-instruct',
        embeddingModel: 'nomic-embed-text',
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      generateSummary: true,
      generateQuestions: true,
      maxQuestions: 4,
      summaryMaxTokens: 128,
    },
  },
};

Direct Claude OCR and LLM:

export default {
  providers: {
    ocr: {
      provider: {
        kind: 'claude',
        apiKey: process.env.DOCPARSER_CLAUDE_API_KEY,
        model: 'claude-3-7-sonnet-latest',
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      processEmbeddedImages: true,
      processImagePDFs: true,
      pipeline: { qualityLevel: 'balanced' },
    },
    llm: {
      provider: {
        kind: 'claude',
        apiKey: process.env.DOCPARSER_CLAUDE_API_KEY,
        model: 'claude-3-7-sonnet-latest',
        maxTokens: 1024,
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      generateSummary: true,
      generateQuestions: false,
      summaryMaxTokens: 128,
    },
  },
};

OpenRouter OCR and LLM:

export default {
  providers: {
    ocr: {
      provider: {
        kind: 'openrouter',
        apiKey: process.env.DOCPARSER_OPENROUTER_API_KEY,
        model: 'google/gemini-2.0-flash-001',
        referer: process.env.DOCPARSER_OPENROUTER_REFERER,
        appName: process.env.DOCPARSER_OPENROUTER_APP_NAME,
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      processEmbeddedImages: true,
      processImagePDFs: true,
      pipeline: { qualityLevel: 'balanced' },
    },
    llm: {
      provider: {
        kind: 'openrouter',
        apiKey: process.env.DOCPARSER_OPENROUTER_API_KEY,
        model: 'anthropic/claude-3.5-sonnet',
        referer: process.env.DOCPARSER_OPENROUTER_REFERER,
        appName: process.env.DOCPARSER_OPENROUTER_APP_NAME,
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      generateSummary: true,
      generateQuestions: true,
      maxQuestions: 3,
      summaryMaxTokens: 128,
    },
  },
};

Organization-managed OCR and LLM gateway:

export default {
  providers: {
    ocr: {
      provider: {
        kind: 'organization_gateway',
        protocol: 'openai',
        baseUrl: process.env.DOCPARSER_GATEWAY_BASE_URL,
        apiKey: process.env.DOCPARSER_GATEWAY_API_KEY,
        model: 'gpt-4.1-mini',
        healthPath: '/readyz',
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      processEmbeddedImages: true,
      processImagePDFs: true,
      pipeline: { qualityLevel: 'balanced' },
    },
    llm: {
      provider: {
        kind: 'organization_gateway',
        protocol: 'openai',
        baseUrl: process.env.DOCPARSER_GATEWAY_BASE_URL,
        apiKey: process.env.DOCPARSER_GATEWAY_API_KEY,
        model: 'gpt-4.1-mini',
        healthPath: '/readyz',
        timeoutMs: 30000,
        retry: {
          maxAttempts: 3,
          initialDelayMs: 250,
          maxDelayMs: 2000,
          backoffMultiplier: 2,
        },
      },
      healthCheck: {
        validateOnStartup: true,
        failOnUnhealthy: true,
      },
      generateSummary: true,
      generateQuestions: false,
      summaryMaxTokens: 128,
    },
  },
};

If your gateway speaks Anthropic or Ollama semantics instead, keep the same shape and switch protocol to anthropic or ollama.

docparser process <file>

npx docparser process ./examples/inputs/report.md \
  --config ./docparser.config.mjs \
  --plugin ./plugins/custom-chunker.mjs \
  --progress jsonl \
  --stream-events \
  --format jsonl \
  --preset general \
  --strategy hybrid_auto \
  --max-tokens 512 \
  --min-tokens 64 \
  --overlap 50 \
  --pii detect \
  --receipt ./out/receipt.json \
  --quality ./out/quality.json \
  --security ./out/securit