docparser-core
v2.2.1
Enterprise Document Intelligence Engine — Core Package
docparser-core
Enterprise Document Intelligence Engine — parse, clean, chunk, enrich, and securely format documents for RAG pipelines.
Published package name on npm: docparser-core
npm install docparser-core
Table of contents
- Why DocParser
- Quick Start
- First 15 Minutes
- Developer Playbook
- Supported Formats
- Installation
- API Reference
- Plugins
- Batch Processing
- Streaming
- CLI
- Live Provider Contract Tests
- Docker
- Serverless
- Cloud Storage
- Security
- Processing Receipt
- Quality Report
- Explainability Report
- Architecture
- Comparison with alternatives
- Contributing
- Publishing
- License
Why DocParser
DocParser is a document intelligence pipeline (not “just a splitter”). Compared to typical alternatives (LangChain splitters, Unstructured.io, LlamaIndex loaders, Docling-style extractors), DocParser emphasizes:
- Security-first ingestion: file validation and threat scanning happen before parsing.
- Receipts and quality metrics: every run produces a content-safe receipt and a quantitative quality report.
- Structured intermediate model (UDM): parsing produces a Unified Document Model (headings, paragraphs, tables, images, etc.), which enables better chunking than raw string splitting.
- Multiple chunking strategies + auto-selection: choose the right strategy for the document (or let hybrid_auto choose).
- Built-in enrichment hooks: keyword/entity extraction, importance scoring, plus optional LLM and embeddings via plugins.
- Production ergonomics: batch + streaming APIs, CLI, cloud adapter interface, and serverless handlers.
Quick Start
Package name on npm: docparser-core
Install:
npm install docparser-core
If you plan to use the optional OCR runtime as well, install both packages:
npm install docparser-core docparser-ocr
Beginner checklist:
- Create a parser with a preset like general.
- Pass either a string or a Buffer to parser.process(...).
- Always provide a filename so format detection can do the right thing.
- Read parsed chunks from result.chunks and audit metadata from result.receipt.
Smallest working example:
import { DocParser } from 'docparser-core';
const parser = new DocParser({ preset: 'general' });
const result = await parser.process('Hello world. This is DocParser.', 'hello.txt');
console.log(result.chunks.length);
console.log(result.receipt.detectedFormat);
Processing a file from disk:
import { readFile } from 'node:fs/promises';
import { DocParser } from 'docparser-core';
const parser = new DocParser({ preset: 'general' });
const report = await readFile('./examples/inputs/report.md');
const result = await parser.process(report, 'report.md');
console.log(result.documentId);
console.log(result.chunks[0]?.content);
console.log(result.receipt.detectedFormat);
What you get back:
- result.chunks: retrieval-ready chunks and metadata
- result.receipt: content-safe processing receipt
- result.qualityReport: chunk quality and coverage diagnostics
- result.securityReport: threat and PII scan results
If you only want to get started without plugins, stay with docparser-core. Add docparser-ocr later when you need OCR/image preprocessing.
Recent parser additions in core:
- raw .png, .jpg/.jpeg, .tiff, and .webp files now parse directly into UDM image elements
- image-only PDF pages now rasterize into OCR-ready PNG data URLs instead of returning an empty UDM
First 15 Minutes
If you want the fastest productive path as a developer, do this:
- Install dependencies at the repo root: npm install
- Build the core package: npm run build -w packages/core
- Run the core test suite once: npm test -w packages/core
- Process one checked-in sample from ./examples/inputs
- If you work on medical flows, validate medical_summary_json before and after every heuristic change
Fast local commands:
# from repo root
npm run build -w packages/core
npm test -w packages/core
# inspect a sample file with the CLI
npx docparser inspect ./packages/core/examples/inputs/report.md
# emit a structured medical projection
npx docparser process ./packages/core/examples/inputs/Sample_Opd_Advice_1.pdf --format medical_summary_json
What to use when:
- Use process(...) when you already have a string or Buffer in memory.
- Use processFile(...) when you want the file-backed path, MIME detection, and easier local testing.
- Use structured_json when building a custom projector or downstream data model.
- Use medical_summary_json when you want document-family routing, review queues, and clinically oriented tables.
Developer Playbook
If you are building on top of docparser-core, these are the main things you can do and where to start:
| Goal | Use this surface | Start here |
| --- | --- | --- |
| Parse one document into retrieval-ready chunks | new DocParser(...).process(...) | API Reference |
| Parse raw text, HTML, Markdown, Office docs, PDFs, or images | Built-in parsers | Supported Formats |
| OCR standalone images and scanned PDFs | createOCRPlugin() or providers.ocr | Plugins and CLI |
| Use local Tesseract, native Tesseract, Ollama, Claude, OpenRouter, or a gateway | providers.ocr / providers.llm | Built-In Providers |
| Preflight provider health before production runs | docparser health or checkConfiguredProvidersHealth(...) | CLI and Live Provider Contract Tests |
| Scaffold a commented config for new developers | docparser init / docparser setup | CLI |
| Inspect how a file parses before chunking | docparser inspect | CLI |
| Batch a directory or a rerun manifest | BatchProcessor or CLI batch flow | Batch Processing |
| Stream progress and chunk-ready events | StreamProcessor or CLI telemetry flags | Streaming and CLI |
| Project structured_json into DB-ready domain tables | projectStructuredOutputByName(...) / OutputRouter.formatProjectedByName(...) | API Reference and Output formats |
| Add custom parsers, enrichers, chunkers, or outputs | Plugin system | Plugins and Contributing |
| Export JSON, JSONL, vector-store payloads, and reports | OutputRouter and CLI report flags | API Reference and CLI |
| Deploy in Docker, serverless, or cloud-storage pipelines | Deployment helpers | Docker, Serverless, and Cloud Storage |
Fastest workflow picks:
- If you want a code-first start, begin with DocParser in API Reference.
- If you want a no-code or low-code start, run npx docparser init and then use the generated config with docparser process.
- If you want provider-backed OCR or LLM enrichment quickly, use one of the checked-in configs under ./examples/configs and one of the checked-in inputs under ./examples/inputs.
- If you want fully local OCR, combine docparser-core with docparser-ocr and choose tesseract or native_tesseract in providers.ocr.
- If you want hosted or gateway-managed inference, use openrouter, claude, ollama, or organization_gateway in providers.ocr / providers.llm and preflight with docparser health.
Supported Formats
This table reflects built-in parser implementations (what ParserRegistry will actually parse today). Format detection is performed by the security gate (magic bytes + extension + content sniffing).
Note on OCR handoff: createOCRPlugin() can only OCR image elements whose imageRef is a data URL. The built-in parsers now provide that handoff for raw PNG/JPEG/TIFF/WebP files, embedded OOXML images, and rasterized image-only PDF pages.
| Format | Typical extension(s) | Detected MIME type(s) | Parser | Key features |
| ---------- | ----------------------------------------- | --------------------------------------------------------------------------- | ----------------- | --------------------------------------------------------------------------------------------------------------------------------- |
| Plain text | .txt | text/plain | PlainTextParser | Emits a paragraph with full text. Also used as fallback when a structured parse yields no elements. |
| Markdown | .md, .markdown | text/markdown, text/x-markdown | MarkdownParser | Headings → headings, lists → lists, code blocks → code elements, tables (GFM) → table elements, blockquotes → callouts. |
| HTML/XHTML | .html, .htm, .xhtml | text/html, application/xhtml+xml | HTMLParser | Strips script/style/noscript, extracts text from <body> (or root if no body). |
| XML | .xml | application/xml | XMLParser | Validates XML and emits an XML code element (plus a short root-element summary). |
| JSON | .json | application/json | JSONParser | Parses JSON into a table when possible (array/object), otherwise emits a JSON code element. |
| CSV | .csv | text/csv, application/csv | CSVParser | Builds a table element (headers + rows). Simple comma-splitting (no quoted-field parsing). |
| Images | .png, .jpg, .jpeg, .tiff, .webp | image/png, image/jpeg, image/tiff, image/webp | ImageParser | Emits OCR-ready UDM image elements as data URLs, preserving the file hash and MIME type for downstream OCR routing. |
| DOCX | .docx | application/vnd.openxmlformats-officedocument.wordprocessingml.document | DOCXParser | Extracts WordprocessingML text directly from the OOXML archive, emits paragraph text, and collects parser warnings. |
| PPTX | .pptx | application/vnd.openxmlformats-officedocument.presentationml.presentation | PPTXParser | Slide headings, paragraphs, lists, tables, image references, speaker notes; page breaks between slides; core properties metadata. |
| XLSX | .xlsx | application/vnd.openxmlformats-officedocument.spreadsheetml.sheet | XLSXParser | Multi-sheet extraction, shared strings, basic type inference, table + structured data + natural language summary per sheet. |
| PDF | .pdf | application/pdf | PDFParser | Extracts embedded text with pdf.js and, for image-only pages, rasterizes them into OCR-ready PNG image elements. |
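The CSV row above notes that the built-in parser uses simple comma-splitting with no quoted-field parsing. The following is a minimal sketch of what that limitation can mean for a row that contains a quoted comma. Both naiveSplit and quoteAwareSplit are hypothetical helpers written for this illustration; neither is part of docparser-core, and the real CSVParser may differ in detail.

```typescript
// Hypothetical helpers for illustration; neither is part of docparser-core.

// Naive approach: split on every comma, ignoring quotes.
function naiveSplit(line: string): string[] {
  return line.split(',');
}

// Quote-aware approach: commas inside double quotes stay in the field,
// and a doubled quote ("") inside a quoted field is an escaped quote.
function quoteAwareSplit(line: string): string[] {
  const fields: string[] = [];
  let current = '';
  let inQuotes = false;
  for (let i = 0; i < line.length; i++) {
    const ch = line.charAt(i);
    if (inQuotes) {
      if (ch === '"' && line.charAt(i + 1) === '"') {
        current += '"';
        i++; // skip the second quote of the escaped pair
      } else if (ch === '"') {
        inQuotes = false;
      } else {
        current += ch;
      }
    } else if (ch === '"') {
      inQuotes = true;
    } else if (ch === ',') {
      fields.push(current);
      current = '';
    } else {
      current += ch;
    }
  }
  fields.push(current);
  return fields;
}

const row = 'ACME,"Revenue, net",1200';
console.log(naiveSplit(row).length); // 4
console.log(quoteAwareSplit(row).length); // 3
```

If your CSV inputs contain quoted fields, pre-normalizing them or registering a custom parser through the plugin system is likely the safer route.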
Format detection edge cases
- If a file is named .txt but the content looks like HTML/XML/JSON, DocParser prefers the sniffed type (e.g. HTML) over plain text.
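The precedence described above (sniffed content wins over the file extension) can be sketched as follows. This is an illustration only: sniffContent, detectMime, and the EXTENSION_MIME table are hypothetical names, the heuristics are deliberately simplistic, and the real security gate also checks magic bytes and is likely stricter.

```typescript
// Hypothetical sketch of "sniffed type wins over extension".
// The real security gate's heuristics are not reproduced here.
const EXTENSION_MIME: Record<string, string> = {
  txt: 'text/plain',
  html: 'text/html',
  json: 'application/json',
  xml: 'application/xml',
};

function sniffContent(text: string): string | undefined {
  const head = text.trimStart().slice(0, 512).toLowerCase();
  if (head.startsWith('<!doctype html') || head.startsWith('<html')) return 'text/html';
  if (head.startsWith('<?xml')) return 'application/xml';
  if (head.startsWith('{') || head.startsWith('[')) {
    try {
      JSON.parse(text);
      return 'application/json';
    } catch {
      return undefined; // looked like JSON but did not parse
    }
  }
  return undefined;
}

function detectMime(text: string, filename: string): string {
  const ext = filename.split('.').pop()?.toLowerCase() ?? '';
  const fromExtension = EXTENSION_MIME[ext] ?? 'text/plain';
  // Prefer the sniffed type when the content clearly disagrees with the name.
  return sniffContent(text) ?? fromExtension;
}

console.log(detectMime('<html><body>Hi</body></html>', 'notes.txt')); // text/html
console.log(detectMime('Just some prose.', 'notes.txt')); // text/plain
```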
Installation
npm
npm install docparser-core
yarn
yarn add docparser-core
pnpm
pnpm add docparser-core
Optional OCR package
To keep the default docparser-core install smaller and avoid native/image OCR dependencies unless needed, OCR runtime components are in a separate optional package:
npm install docparser-ocr
Install this package only when you use createOCRPlugin() or direct OCR/image preprocessing classes.
Published OCR package name: docparser-ocr
Runtime requirements
- Node.js: >=20.19.0
- This package is ESM ("type": "module").
API Reference
All examples below import from the package root:
import {
DocParser,
BatchProcessor,
StreamProcessor,
CloudProcessor,
createLLMPlugin,
createEmbeddingPlugin,
createOCRPlugin,
OutputRouter,
DEFAULT_CONFIG,
} from 'docparser-core';
import { OCRPipeline, TesseractProvider } from 'docparser-ocr';
new DocParser(config?)
import type { DocumentParserConfig } from 'docparser-core';
import { DocParser } from 'docparser-core';
const config: DocumentParserConfig = {
preset: 'financial',
chunking: { strategy: 'element_aware', maxChunkTokens: 1024 },
security: { pii: { enabled: true, mode: 'mask' } },
output: { format: 'jsonl' },
};
const parser = new DocParser(config);
Key constructor behaviors:
- Config is resolved by ConfigManager (defaults → preset overrides → user overrides).
- The pipeline is stage-based (security → parse → clean → chunk → enrich → scan → receipt/report).
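The resolution order above (defaults, then preset overrides, then user overrides) amounts to a layered merge in which later layers win. The sketch below illustrates that idea with a hypothetical deepMerge helper. It is not ConfigManager's actual code; its real merge rules (for example, how arrays are combined) may differ, and the preset values shown are illustrative rather than the real preset contents.

```typescript
// Hypothetical sketch of layered config resolution; not ConfigManager itself.
type Plain = { [key: string]: unknown };

function isPlainObject(value: unknown): value is Plain {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Later layers win; nested objects merge key by key, other values replace.
function deepMerge(base: Plain, override: Plain): Plain {
  const out: Plain = { ...base };
  for (const [key, value] of Object.entries(override)) {
    if (value === undefined) continue; // an unset override keeps the lower layer
    const existing = out[key];
    out[key] =
      isPlainObject(existing) && isPlainObject(value) ? deepMerge(existing, value) : value;
  }
  return out;
}

// Illustrative values only; not the actual defaults or preset contents.
const defaults: Plain = { chunking: { strategy: 'hybrid_auto', maxChunkTokens: 512 } };
const presetOverrides: Plain = { chunking: { strategy: 'hierarchical' } };
const userOverrides: Plain = { chunking: { maxChunkTokens: 1024 } };

const resolvedConfig = [presetOverrides, userOverrides].reduce(deepMerge, defaults);
console.log(resolvedConfig);
// { chunking: { strategy: 'hierarchical', maxChunkTokens: 1024 } }
```

The point of the layering is that a user override of one nested field does not discard sibling fields that a preset has set.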
Config discovery helpers
const resolved = parser.getConfig();
const configHash = parser.getConfigHash();
await parser.process(input, filename)
import { DocParser } from 'docparser-core';
const parser = new DocParser({ preset: 'general' });
const result = await parser.process(Buffer.from('Hello'), 'hello.txt');
console.log(result.documentId);
console.log(result.chunks);
console.log(result.receipt);
console.log(result.qualityReport);
console.log(result.securityReport);
console.log(result.explainabilityReport);
Input types
- input: Buffer | string
- filename: string (used for format detection)
Full result structure
import type { ProcessingResult } from 'docparser-core';
function handle(result: ProcessingResult) {
// retrieval-ready chunks
result.chunks;
// content-safe audit trail (no document text)
result.receipt;
// quality metrics + recommendations
result.qualityReport;
// security validation + PII/threat reporting
result.securityReport;
// stage-by-stage rationale and transformation evidence
result.explainabilityReport;
}
Runtime progress and chunk hooks
Use hooks when you need live pipeline telemetry during parser.process(...).
import { DocParser } from 'docparser-core';
import type { AuditLogEntry, ProgressDetails, ProcessingStage } from 'docparser-core';
const parser = new DocParser({
hooks: {
onProgress: (stage: ProcessingStage, percentage: number, details?: ProgressDetails) => {
console.log(stage, percentage, details?.documentId, details?.totalChunks);
},
onChunkReady: (chunk) => {
console.log('chunk ready:', (chunk as { chunkId: string }).chunkId);
},
onDocumentComplete: (receipt) => {
console.log('done:', (receipt as { documentId: string }).documentId);
},
onGovernanceEvent: (event: AuditLogEntry) => {
console.log(event.action, event.details);
},
},
});
await parser.process('Progress-aware processing example.', 'progress.txt');
onProgress now emits structured stage payloads for these built-in stages:
- started
- security
- parsing
- table_nl
- cleaning
- chunking
- enrichment
- pii_scan
- security_scan
- coverage
- quality_report
- complete
The details payload may include fields like filename, inputBytes, mimeType, fileHash, documentId, durationMs, itemsProcessed, totalChunks, strategyUsed, coveragePercentage, qualityScore, warningsCount, errorsCount, skipped, cached, and cacheScope.
Governance hooks
Use hooks.onGovernanceEvent with security.classification when you want chunk-level and document-level governance decisions during processing.
import { DocParser } from 'docparser-core';
const parser = new DocParser({
security: {
classification: {
enabled: true,
defaultLevel: 'internal',
rules: [{ pattern: 'secret', level: 'restricted' }],
},
},
hooks: {
onGovernanceEvent: (event) => {
console.log(event.action, event.details);
},
},
});
Governance events currently emit chunk_classified for each final chunk and document_classified for the final document decision. Those same events also back securityReport.auditEventsCount, and the resolved governance decision is written to chunk.securityClassification and receipt.securitySummary.classification.
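To make the event flow concrete, here is a minimal sketch of rule-based classification that emits one chunk_classified record per chunk and a final document_classified record. Only the two event names, the rules shape, and defaultLevel come from the text above. The level ordering, the case-insensitive substring match, and the "most restrictive level wins" roll-up are assumptions made for illustration, and classifyChunk and classifyDocument are hypothetical names.

```typescript
// Hypothetical sketch; the level ordering, matching rule, and roll-up are assumptions.
type Level = 'public' | 'internal' | 'confidential' | 'restricted';
interface Rule { pattern: string; level: Level }
interface GovernanceEvent { action: 'chunk_classified' | 'document_classified'; level: Level }

// Assumed ordering, from least to most restrictive.
const ORDER: Level[] = ['public', 'internal', 'confidential', 'restricted'];

function moreRestrictive(a: Level, b: Level): Level {
  return ORDER.indexOf(a) >= ORDER.indexOf(b) ? a : b;
}

// Start from the default level and raise it for every matching rule.
function classifyChunk(text: string, rules: Rule[], defaultLevel: Level): Level {
  const lower = text.toLowerCase();
  return rules
    .filter((rule) => lower.includes(rule.pattern.toLowerCase()))
    .reduce((level, rule) => moreRestrictive(level, rule.level), defaultLevel);
}

function classifyDocument(chunks: string[], rules: Rule[], defaultLevel: Level): GovernanceEvent[] {
  const emitted: GovernanceEvent[] = [];
  let documentLevel = defaultLevel;
  for (const chunk of chunks) {
    const level = classifyChunk(chunk, rules, defaultLevel);
    emitted.push({ action: 'chunk_classified', level });
    documentLevel = moreRestrictive(documentLevel, level);
  }
  emitted.push({ action: 'document_classified', level: documentLevel });
  return emitted;
}

const governanceEvents = classifyDocument(
  ['Quarterly summary.', 'Contains a secret key.'],
  [{ pattern: 'secret', level: 'restricted' }],
  'internal',
);
console.log(governanceEvents.map((e) => `${e.action}:${e.level}`));
// [ 'chunk_classified:internal', 'chunk_classified:restricted', 'document_classified:restricted' ]
```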
Runtime pipeline presets and stage toggles
Use pipeline.runtimePreset when you want to change processing behavior at runtime without redefining the whole config object.
import { DocParser } from 'docparser-core';
const parser = new DocParser({
preset: 'financial',
pipeline: {
runtimePreset: 'fast',
stages: {
enrichment: true,
},
},
});
const result = await parser.process('Contact [email protected] for the report.', 'runtime.txt');
console.log(result.receipt.chunkingStrategy);
Built-in runtime pipeline presets:
- standard: default balanced pipeline behavior.
- fast: switches chunking to sliding_window and skips table natural-language generation, enrichment, PII scan, and prompt-injection scan unless you explicitly re-enable a stage.
- quality: favors more thorough chunking and keeps all optional stages enabled.
- llm_light: keeps the pipeline on but avoids heavier LLM-style enrichment outputs such as summaries and generated questions.
Available stage toggles under pipeline.stages:
- tableNl
- cleaning
- enrichment
- piiScan
- promptInjection
When a stage is disabled, onProgress still emits that stage with details.skipped = true, so progress consumers can distinguish a skipped stage from a missing event.
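The interaction between a runtime preset and explicit stage toggles can be sketched as a two-layer merge, followed by one progress record per stage so that skipped stages remain visible. This is a hedged illustration: resolveStages and progressEvents are hypothetical names, only the standard and fast presets are modeled, and the sketch reports toggle names (tableNl, piiScan) where the real onProgress payload uses stage names such as table_nl and pii_scan.

```typescript
// Hypothetical sketch of preset-plus-toggle resolution. The "fast" defaults
// follow the preset description above; everything else is assumed.
type StageName = 'tableNl' | 'cleaning' | 'enrichment' | 'piiScan' | 'promptInjection';
type StageToggles = Record<StageName, boolean>;

const ALL_ON: StageToggles = {
  tableNl: true,
  cleaning: true,
  enrichment: true,
  piiScan: true,
  promptInjection: true,
};

const PRESET_STAGES: Record<'standard' | 'fast', StageToggles> = {
  standard: ALL_ON,
  fast: { ...ALL_ON, tableNl: false, enrichment: false, piiScan: false, promptInjection: false },
};

// Explicit toggles are applied last, so they can re-enable a stage the preset skips.
function resolveStages(
  preset: 'standard' | 'fast',
  explicit: Partial<StageToggles> = {},
): StageToggles {
  return { ...PRESET_STAGES[preset], ...explicit };
}

// Emit one record per stage so a skipped stage is distinguishable from a missing event.
function progressEvents(stages: StageToggles): { stage: StageName; skipped: boolean }[] {
  return (Object.keys(stages) as StageName[]).map((stage) => ({
    stage,
    skipped: !stages[stage],
  }));
}

const fastWithEnrichment = resolveStages('fast', { enrichment: true });
console.log(progressEvents(fastWithEnrichment).filter((e) => e.skipped).map((e) => e.stage));
// [ 'tableNl', 'piiScan', 'promptInjection' ]
```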
Incremental reprocessing cache
Use performance.cache to reuse parse and chunk results when the same document is processed again with the same configuration.
import { DocParser } from 'docparser-core';
const parser = new DocParser({
performance: {
cache: {
enabled: true,
maxItems: 50,
ttl: 60_000,
cacheParsing: true,
cacheChunking: true,
},
},
});
await parser.process('Cache me once.', 'cached.txt');
await parser.process('Cache me once.', 'cached.txt');
console.log(parser.getCacheStats());
parser.clearCache();
Behavior:
- Parse cache is keyed by the validated file hash.
- Chunk cache is keyed by file hash plus config hash.
- Progress events for cached stages include details.cached = true and details.cacheScope (parse or chunk).
- The current runtime implementation supports only the in-memory cache backend. If another backend is configured, incremental reprocessing cache is disabled instead of attempting a partial implementation.
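The keying scheme above can be sketched with a small in-memory cache bounded by item count and TTL. This is an illustration rather than the library's cache: TtlCache, parseKey, and chunkKey are hypothetical names, hashes are passed in as plain strings, and evicting the oldest entry first is an assumption.

```typescript
// Hypothetical in-memory cache sketch; eviction policy (oldest first) is assumed.
interface Entry<T> { value: T; expiresAt: number }

class TtlCache<T> {
  private readonly entries = new Map<string, Entry<T>>();
  private readonly maxItems: number;
  private readonly ttlMs: number;

  constructor(maxItems: number, ttlMs: number) {
    this.maxItems = maxItems;
    this.ttlMs = ttlMs;
  }

  get(key: string, now: number = Date.now()): T | undefined {
    const entry = this.entries.get(key);
    if (!entry) return undefined;
    if (entry.expiresAt <= now) {
      this.entries.delete(key); // expired entries are dropped lazily on read
      return undefined;
    }
    return entry.value;
  }

  set(key: string, value: T, now: number = Date.now()): void {
    if (!this.entries.has(key) && this.entries.size >= this.maxItems) {
      // Map preserves insertion order, so the first key is the oldest.
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
    this.entries.set(key, { value, expiresAt: now + this.ttlMs });
  }
}

// Parse results depend only on the file; chunk results also depend on the config.
const parseKey = (fileHash: string): string => `parse:${fileHash}`;
const chunkKey = (fileHash: string, configHash: string): string =>
  `chunk:${fileHash}:${configHash}`;

const demoCache = new TtlCache<string>(50, 60_000);
demoCache.set(parseKey('abc'), 'parsed-udm', 0);
demoCache.set(chunkKey('abc', 'cfg1'), 'chunks-for-cfg1', 0);

console.log(demoCache.get(parseKey('abc'), 1_000)); // parsed-udm
console.log(demoCache.get(chunkKey('abc', 'cfg2'), 1_000)); // undefined (config changed)
console.log(demoCache.get(parseKey('abc'), 60_000)); // undefined (expired)
```

Separating the two keys means a config change can invalidate chunk results while the more expensive parse result for the same file is still reused.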
Explainability report
Use result.explainabilityReport when you need a compact audit trail of what changed during processing and why.
import { DocParser } from 'docparser-core';
const parser = new DocParser();
const result = await parser.process(' The docu-\nment is “important”. ', 'explain.txt');
console.log(result.explainabilityReport.summary);
console.log(result.explainabilityReport.stages);
console.log(result.explainabilityReport.evidence[0]);
The explainability report currently includes:
- summary: aggregate counts for cleaning transformations, warnings, errors, skipped stages, and cached stages.
- stages: per-stage timing plus completed, skipped, or cached status for the measured pipeline stages.
- evidence: sampled before/after cleaning evidence, parser or pipeline notes, and coverage-gap previews when coverage is incomplete.
This makes it easier to answer questions like "why did this text change?", "which stages were skipped?", and "did cached results influence this run?" without reconstructing the full pipeline manually.
parser.formatOutput(result)
formatOutput routes through OutputRouter using config.output.
JSON / JSONL
import { DocParser } from 'docparser-core';
const parser = new DocParser({ output: { format: 'json' } });
const result = await parser.process('A test document.', 'test.txt');
const json = parser.formatOutput(result);
console.log(typeof json); // string
import { DocParser } from 'docparser-core';
const parser = new DocParser({ output: { format: 'jsonl' } });
const result = await parser.process('A test document.', 'test.txt');
const jsonl = parser.formatOutput(result);
console.log(typeof jsonl); // string
Vector DB payloads (metadata + IDs)
DocParser formatters do not generate embeddings. If you generate embeddings (via a plugin or your own pipeline), store vectors in your DB and use DocParser’s output as metadata/payload.
import { DocParser } from 'docparser-core';
const parser = new DocParser({
output: {
format: 'pinecone',
vectorDb: { namespace: 'docs-prod' },
},
});
const result = await parser.process('Vector payload demo.', 'demo.txt');
const payload = parser.formatOutput(result);
import { DocParser, OutputRouter } from 'docparser-core';
const parser = new DocParser();
const result = await parser.process('Vector payload demo.', 'demo.txt');
const router = new OutputRouter({ format: 'qdrant', vectorDb: { collection: 'chunks' } });
const qdrantPoints = router.format(result.chunks);
Supported output formats (implemented)
json, jsonl, text, markdown, csv, structured_json, prescription_json, medical_summary_json, pinecone, chroma, weaviate, qdrant, langchain, llamaindex, custom
Structured relational output and named projectors
Use structured_json when you want a normalized intermediate schema, then project that schema into domain-specific tables or DTOs.
import { DocParser } from 'docparser-core';
const parser = new DocParser();
const result = await parser.process('Prescription ID: RX-2026-0415', 'prescription.txt');
console.log(parser.listStructuredProjectors());
const prescriptionTables = parser.projectStructuredOutputByName(result, 'prescription');
You can also register your own projector once and reuse it by name:
import { DocParser } from 'docparser-core';
const parser = new DocParser();
parser.registerStructuredProjector({
projectionName: 'summary_projection',
project: (structured) => ({
projection: 'summary_projection',
sourceSchema: structured.schema_version,
contentItems: structured.summary.content_item_count,
}),
});
const result = await parser.process('Custom structured projection.', 'projection.txt');
const projected = parser.projectStructuredOutputByName(result, 'summary_projection');
Built-in projector notes:
- structured_json emits the normalized structured_json.v1 schema.
- prescription_json is the built-in prescription-table projection over that schema.
- medical_summary_json classifies documents into medical families (for example prescription, OP advice, lab report, discharge summary, referral letter, insurance form), emits high-signal extracted fields, and includes confidence-aware review_findings rows for ambiguous or noisy template captures. It also emits advanced clinical tables for medication_safety, normalized lab_observations, computed lab_trends, chronological timeline_events, explicit referrals, and normalized insurance_coverages rows.
- projectStructuredOutputs(...) and projectStructuredOutputsByName(...) let you fan out one normalized intermediate into multiple domain outputs without recomputing the base projection.
Medical summary workflow
Use medical_summary_json when you need more than plain chunk output and want document-level routing plus reviewable structured tables.
import { DocParser } from 'docparser-core';
const parser = new DocParser({
preset: 'medical',
output: { format: 'medical_summary_json' },
});
const result = await parser.process(
[
'OPD Date Time | 31 Oct 2025 04:47 PM',
'Patient: Jane Doe',
'Doctor: Dr. Meera Nair',
'Diagnosis: Viral fever',
'Follow Up / Cross - Referral',
'07 Nov 2025 Date',
].join('\n'),
'op-advice.txt',
);
const output = parser.formatOutput(result);
console.log(output.tables.medical_documents[0]);
console.log(output.tables.review_findings);
console.log(output.tables.timeline_events);
The main medical_summary_json tables are:
- medical_documents: one row per source document with document type, core demographics, encounter date, and validation flags
- review_findings: confidence-aware routing rows for missing, noisy, or low-confidence extractions
- medication_safety: high-risk medications, duplicate therapy hints, and interaction signals
- lab_observations: normalized test rows with units, ranges, and abnormal flags
- lab_trends: grouped numeric trends for repeat lab observations
- timeline_events: encounter, birth, lab observation, and follow-up dates
- referrals: referral specialty and referred-provider rows when referral signals are strong enough
- insurance_coverages: payer/policy/member/claim identifiers when insurance signals are present
Recommended developer loop for medical heuristics:
- Run one or more real samples from ./examples/inputs
- Inspect review_findings first instead of trusting extracted fields blindly
- Tighten heuristics against the smallest failing slice
- Re-run focused tests and then real sample PDFs
This is especially important for OP advice and second-opinion documents, where OCR/template noise can be high and review routing is safer than overconfident extraction.
await parser.use(plugin)
Register plugins to extend or enrich processing.
import { DocParser, createEmbeddingPlugin } from 'docparser-core';
import type { EmbeddingProviderAdapter } from 'docparser-core';
const mockEmbedder: EmbeddingProviderAdapter = {
name: 'mock-embedder',
dimensions: 3,
async embed(texts) {
return texts.map(() => [0.1, 0.2, 0.3]);
},
async embedSingle(text) {
return [0.1, 0.2, 0.3];
},
};
const parser = new DocParser();
await parser.use(
createEmbeddingPlugin(mockEmbedder, {
batchSize: 100,
maxConcurrentBatches: 2,
}),
);
await parser.loadPlugin(pluginSource)
Load a plugin dynamically from a module specifier, file URL, async factory, or direct config-style registration.
import { DocParser } from 'docparser-core';
const parser = new DocParser();
await parser.loadPlugin({
module: './plugins/myChunkPlugin.js',
exportName: 'chunkPlugin',
});
Relative module paths are resolved from process.cwd(). Package specifiers and file: URLs are also supported.
Presets
Presets are config override bundles applied on top of defaults.
| Preset | What it optimizes for |
| ---------------- | ------------------------------------------------------------------------------- |
| general | Balanced defaults (baseline) |
| financial | Tables/charts emphasis, currency/date normalization, PII masking |
| legal | Hierarchical chunking, larger chunk sizes + overlap, conservative normalization |
| technical | Code + images handled as first-class chunks |
| medical | Aggressive PII policy (redaction) and broader scan targets |
| conversational | Semantic chunking for topic shifts (chat/transcripts) |
| research | Structured chunking with keywords/entities and cross-references |
Chunking strategies
DocParser currently implements these strategy names:
| Strategy | When to use it | Notes |
| --------------------- | ---------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------------ |
| hybrid_auto | You don’t know upfront what’s best | Auto-selects an internal strategy based on the UDM. |
| hierarchical | Headings/sections matter | Builds chunks aligned to section structure and splits oversized single text elements by structural boundaries before falling back to hard word limits. |
| sliding_window | Long unstructured text | Token-window with overlap; avoids mid-sentence splits where possible. |
| element_aware | Mixed content | Treats atomic elements like tables/code/images as indivisible and auto-splits oversized page-level text elements by blank lines, lines, sentences, then word boundaries. |
| semantic_similarity | Topic shift grouping | Lexical similarity-based boundaries (no embeddings required). |
Note: speaker_turn is implemented. Some additional strategy names in ChunkingStrategyName currently behave as aliases to existing built-in strategies.
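As an illustration of the sliding_window row in the table above (a token window with overlap that avoids mid-sentence splits), here is a minimal sketch. It is not the library's chunker: slidingWindow is a hypothetical function, tokens are approximated by whitespace-separated words, and sentence detection uses a simple punctuation regex.

```typescript
// Hypothetical sketch of a sentence-aligned sliding window; tokens ~ words.
const countTokens = (text: string): number => text.split(/\s+/).filter(Boolean).length;

function slidingWindow(text: string, maxTokens: number, overlapTokens: number): string[] {
  const sentences = (text.match(/[^.!?]+[.!?]*\s*/g) ?? [])
    .map((s) => s.trim())
    .filter(Boolean);
  const size = (i: number): number => countTokens(sentences[i] ?? '');
  const chunks: string[] = [];
  let start = 0;
  while (start < sentences.length) {
    let end = start;
    let tokens = 0;
    // Always take at least one sentence (even if oversized), then add more while they fit.
    while (end < sentences.length && (end === start || tokens + size(end) <= maxTokens)) {
      tokens += size(end);
      end++;
    }
    chunks.push(sentences.slice(start, end).join(' '));
    if (end >= sentences.length) break;
    // Step back over trailing sentences that fit in the overlap budget,
    // but never back to `start`, so every iteration makes progress.
    let next = end;
    let overlap = 0;
    while (next - 1 > start && overlap + size(next - 1) <= overlapTokens) {
      overlap += size(next - 1);
      next--;
    }
    start = next;
  }
  return chunks;
}

const windows = slidingWindow(
  'One two three. Four five six. Seven eight nine. Ten.',
  6, // max tokens per chunk
  3, // overlap budget in tokens
);
console.log(windows);
```

With these inputs each window repeats the last sentence of the previous one, which is the context-preserving effect the overlap setting is meant to provide.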
Adaptive chunking
Enable chunking.adaptive.enabled when you want DocParser to keep using hybrid_auto strategy selection but also retune chunk sizes per document shape.
import { DocParser } from 'docparser-core';
const parser = new DocParser({
chunking: {
strategy: 'hybrid_auto',
adaptive: { enabled: true },
},
});
Current adaptive profiles:
- Structured documents with headings: larger hierarchical chunks and small-chunk merging.
- Visual or code-heavy documents: element-aware chunks with tighter token ceilings.
- Long unstructured text: denser sliding windows with more overlap to preserve context.
Oversized page-level text fallback
Some parsers, especially PDF extraction on form-like documents, can emit a single very large text element for a page. hierarchical and element_aware now break those oversized elements into smaller chunks automatically using this fallback order:
- Blank-line sections
- Line boundaries
- Sentence boundaries
- Hard word boundaries
This keeps atomic elements intact while preventing one-page text blobs from bypassing maxChunkTokens.
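The fallback order above can be sketched as a recursive splitter that tries the coarsest boundary first and only moves to a finer one for pieces that are still too large. This is a sketch under stated assumptions, not the library's implementation: splitOversized and hardWordSplit are hypothetical names, token counting is approximated by whitespace-separated words, and small pieces are not merged back together afterwards.

```typescript
// Hypothetical sketch of the fallback order; tokens ~ whitespace-separated words.
const approxTokens = (text: string): number => text.split(/\s+/).filter(Boolean).length;

// Ordered from coarsest to finest boundary.
const SPLITTERS: ((text: string) => string[])[] = [
  (t) => t.split(/\n\s*\n/), // blank-line sections
  (t) => t.split(/\n/), // line boundaries
  (t) => t.match(/[^.!?]+[.!?]*\s*/g) ?? [t], // sentence boundaries
];

// Last resort: cut every maxTokens words, regardless of structure.
function hardWordSplit(text: string, maxTokens: number): string[] {
  if (maxTokens < 1) throw new RangeError('maxTokens must be at least 1');
  const words = text.split(/\s+/).filter(Boolean);
  const out: string[] = [];
  for (let i = 0; i < words.length; i += maxTokens) {
    out.push(words.slice(i, i + maxTokens).join(' '));
  }
  return out;
}

function splitOversized(text: string, maxTokens: number, level = 0): string[] {
  const trimmed = text.trim();
  if (trimmed === '') return [];
  if (approxTokens(trimmed) <= maxTokens) return [trimmed];
  const splitter = SPLITTERS[level];
  if (!splitter) return hardWordSplit(trimmed, maxTokens);
  // Each piece is retried at the next, finer level only if it is still too large.
  return splitter(trimmed).flatMap((piece) => splitOversized(piece, maxTokens, level + 1));
}

const pageText = 'Name: Jane\nAge: 40\n\nNotes: one two three four five six. Seven eight.';
console.log(splitOversized(pageText, 4));
// [ 'Name: Jane\nAge: 40', 'Notes: one two three', 'four five six.', 'Seven eight.' ]
```

In this example the first section fits as a whole, while the long sentence in the second section falls all the way through to the hard word split.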
Output formats
Examples below use parser.formatOutput(result); all outputs are derived from result.chunks.
Pinecone
import { DocParser } from 'docparser-core';
const parser = new DocParser({ output: { format: 'pinecone', vectorDb: { namespace: 'docs' } } });
const result = await parser.process('Hello Pinecone', 'pinecone.txt');
const payload = parser.formatOutput(result);
Chroma
import { DocParser } from 'docparser-core';
const parser = new DocParser({ output: { format: 'chroma' } });
const result = await parser.process('Hello Chroma', 'chroma.txt');
const payload = parser.formatOutput(result);
Weaviate
import { DocParser } from 'docparser-core';
const parser = new DocParser({
output: { format: 'weaviate', vectorDb: { collection: 'DocumentChunk' } },
});
const result = await parser.process('Hello Weaviate', 'weaviate.txt');
const objects = parser.formatOutput(result);
Qdrant
import { DocParser } from 'docparser-core';
const parser = new DocParser({ output: { format: 'qdrant', vectorDb: { collection: 'chunks' } } });
const result = await parser.process('Hello Qdrant', 'qdrant.txt');
const points = parser.formatOutput(result);
LangChain
import { DocParser } from 'docparser-core';
const parser = new DocParser({ output: { format: 'langchain' } });
const result = await parser.process('Hello LangChain', 'langchain.txt');
const docs = parser.formatOutput(result);
Custom formatter
import { DocParser } from 'docparser-core';
const parser = new DocParser({
output: {
format: 'custom',
customFormatter: (chunks) => ({ count: chunks.length }),
},
});
const result = await parser.process('Custom output', 'custom.txt');
const out = parser.formatOutput(result);
Configuration
DocParser accepts a single configuration object (DocumentParserConfig). All fields are optional; defaults are applied.
import type { DocumentParserConfig } from 'docparser-core';
export const config: DocumentParserConfig = {
// High-level: choose a preset override bundle
preset: 'general',
// Input constraints (NOTE: some input limits are currently enforced via security gate)
input: {
allowedFormats: ['*'], // MIME list or '*' (wildcard)
maxFileSize: 100 * 1024 * 1024, // bytes
maxPages: 5000,
maxElements: 50000,
password: undefined, // reserved for password-protected inputs
urlFetchTimeout: 30_000,
urlFetchHeaders: {},
encodingOverride: undefined,
},
// Per-format parsing knobs
parsing: {
pdf: {
extractImages: true,
extractTables: true,
ocrScannedPages: true,
detectMultiColumn: true,
detectHeadersFooters: true,
imageDpi: 300,
},
docx: {
extractImages: true,
extractTables: true,
followStyles: true,
includeComments: false,
includeTrackChanges: false,
},
html: {
sanitize: true,
removeScripts: true,
removeStyles: false,
removeBoilerplate: true,
followLinks: false,
maxDepth: 1,
},
xlsx: {
includeFormulas: false,
includeCharts: true,
sheetSelection: 'all',
emptyCellHandling: 'skip',
},
pptx: {
includeNotes: true,
includeHiddenSlides: false,
},
ocr: {
engine: 'tesseract',
languages: ['eng'],
confidenceThreshold: 60,
preprocess: true,
dpi: 300,
},
images: {
classifyType: true,
extractTextOcr: true,
generateDescriptions: false,
skipDecorative: true,
minSize: { width: 50, height: 50 },
},
charts: {
extractData: true,
generateSummary: true,
convertToTable: true,
},
},
// Cleaning: normalize and remove noise before chunking
cleaning: {
normalizeUnicode: true,
fixHyphenation: true,
mergeBrokenParagraphs: true,
removeWatermarks: true,
removeHeadersFooters: true,
removePageNumbers: true,
normalizeWhitespace: true,
normalizeDates: false,
normalizeCurrencies: false,
buildGlossary: true,
customPatternsToRemove: [], // regex strings
customPatternsToFlag: [], // regex strings
preserveFormattingIn: ['code', 'quotes', 'tables'],
},
// Chunking: turn UDM elements into retrieval-ready chunks
chunking: {
strategy: 'hybrid_auto',
customStrategy: undefined, // reserved
tokenCounter: 'approximate',
customTokenCounter: undefined,
minChunkTokens: 64,
maxChunkTokens: 512,
targetChunkTokens: 256,
overlap: {
enabled: true,
tokens: 50,
strategy: 'sentence_boundary',
},
headingContext: {
includeParentHeadings: true,
maxHeadingDepth: 3,
separator: ' > ',
},
tableHandling: 'own_chunk',
chartHandling: 'own_chunk',
imageHandling: 'skip_decorative',
codeHandling: 'own_chunk',
mergeSmallChunks: false,
neverSplit: ['table', 'code', 'chart', 'image'],
semanticThreshold: 0.3,
semanticWindowSize: 3,
},
// Enrichment: keywords/entities/importance and optional extras
enrichment: {
extractKeywords: true,
keywordMethod: 'tfidf',
maxKeywords: 10,
extractEntities: true,
entityTypes: ['PERSON', 'ORGANIZATION', 'DATE', 'MONEY', 'LOCATION'],
detectTopics: false,
generateSummary: false,
summaryMaxTokens: 50,
generateQuestions: false,
maxQuestions: 3,
computeImportance: true,
resolveCrossReferences: true,
linkChunks: true,
feedback: {
enabled: false,
preferredTerms: [],
deprioritizedTerms: [],
prioritizedEntityTypes: [],
preferredTermBoost: 0.08,
deprioritizedTermPenalty: 0.1,
entityTypeBoost: 0.05,
},
semanticDedup: {
enabled: false,
similarityThreshold: 0.6,
minSharedTerms: 2,
maxRelatedChunks: 3,
},
customEnrichers: [],
},
// Security gate + in-pipeline security scanning
security: {
maxFileSize: 100 * 1024 * 1024,
allowedFormats: ['*'],
blockMacros: true,
blockJavascriptInPdf: true,
blockXxe: true,
blockPolyglots: true,
virusScanHook: undefined,
sandboxParsing: true,
maxMemoryPerDoc: 512 * 1024 * 1024,
maxProcessingTime: 300_000,
maxTempDisk: 1024 * 1024 * 1024,
pii: {
enabled: true,
provider: undefined, // PIIProviderAdapter
mode: 'detect', // detect | mask | redact | hash
scanTargets: ['text_content', 'table_cells', 'metadata_fields'],
maskFormat: '[{{TYPE}}]',
allowlist: [],
customPatterns: [],
encryptionKey: undefined,
},
promptInjection: {
enabled: true,
action: 'flag',
customPatterns: [],
sensitivity: 'medium',
},
dataLifecycle: {
encryptTempFiles: true,
secureDelete: true,
clearMemoryAfter: true,
maxCacheTtl: 3_600_000,
},
audit: {
enabled: true,
logDestination: 'memory',
customLogger: undefined,
includeDocumentName: true,
anonymizeDocumentName: false,
},
classification: {
enabled: false,
defaultLevel: 'internal',
rules: [],
},
},
// Output formatting
output: {
format: 'json',
customFormatter: undefined,
includeFields: [],
excludeFields: [],
includeQualityReport: true,
includeProcessingReceipt: true,
flattenMetadata: false,
vectorDb: {
namespace: undefined,
collection: undefined,
index: undefined,
batchSize: 100,
},
},
// Performance/caching (some fields are reserved for future concurrency engines)
performance: {
mode: 'balanced',
workerThreads: 1,
maxConcurrentDocs: 4,
streaming: false,
streamingThreshold: 50 * 1024 * 1024,
cache: {
enabled: true,
backend: 'memory',
maxSize: 100 * 1024 * 1024,
maxItems: 100,
ttl: 3_600_000,
cacheParsing: true,
cacheChunking: true,
customBackend: undefined,
},
memoryBudget: 1024 * 1024 * 1024,
},
// Runtime pipeline behavior presets + optional stage toggles
pipeline: {
runtimePreset: 'standard',
stages: {
tableNl: true,
cleaning: true,
enrichment: true,
piiScan: true,
promptInjection: true,
},
},
// Plugins can be pre-registered from objects, async factories, module specifiers, or file URLs.
plugins: [],
// Hooks
hooks: {
// Stage name + percentage + structured payload for runtime telemetry
onProgress: undefined,
onWarning: undefined,
onError: undefined,
// Fired once per final chunk after enrichment/security scans complete
onChunkReady: undefined,
onGovernanceEvent: undefined,
onDocumentComplete: undefined,
},
// Debug
debug: {
enabled: false,
visualOutput: false,
visualOutputPath: undefined,
profilePerformance: false,
verboseLogging: false,
dumpUdm: false,
dumpUdmPath: undefined,
},
};
Plugins
DocParser plugins can be registered via await parser.use(plugin), await parser.loadPlugin(pluginSource), or upfront through new DocParser({ plugins: [...] }).
Example config-driven dynamic loading:
import { DocParser } from 'docparser-core';
const parser = new DocParser({
plugins: [
{
module: './plugins/myEnricher.js',
exportName: 'enricherPlugin',
},
async () => ({
name: 'inline-enricher',
version: '1.3.0',
type: 'enrichment',
hooks: {
enrichAll: async (chunks) => chunks,
},
}),
],
});
createLLMPlugin() (with a mock provider)
import { DocParser, createLLMPlugin } from 'docparser-core';
import type { LLMProviderAdapter, LLMRequest, LLMResponse } from 'docparser-core';
const mockLLM: LLMProviderAdapter = {
name: 'mock-llm',
async isAvailable() {
return true;
},
async complete(req: LLMRequest): Promise<LLMResponse> {
return {
text: 'Mock summary.\n1. What is this?\n2. Why does it matter?',
tokensUsed: 10,
model: req.model ?? 'mock',
};
},
// Present for interface completeness; not used by createLLMPlugin
async embed(texts: string[]) {
return texts.map(() => [0.01, 0.02, 0.03]);
},
};
const parser = new DocParser();
await parser.use(
createLLMPlugin(mockLLM, { generateSummary: true, generateQuestions: true, maxChunks: 25 }),
);
const result = await parser.process('This is a chunk that will be summarized.', 'llm.txt');
console.log(result.chunks[0]?.summary);
createEmbeddingPlugin()
import { DocParser, createEmbeddingPlugin } from 'docparser-core';
import type { EmbeddingProviderAdapter } from 'docparser-core';
const embedder: EmbeddingProviderAdapter = {
name: 'mock-embeddings',
dimensions: 3,
async embed(texts) {
return texts.map(() => [0.1, 0.2, 0.3]);
},
async embedSingle(text) {
return [0.1, 0.2, 0.3];
},
};
const parser = new DocParser();
await parser.use(
createEmbeddingPlugin(embedder, {
batchSize: 100,
maxConcurrentBatches: 2,
}),
);
const result = await parser.process('Embedding demo', 'embed.txt');
console.log(result.chunks[0]?.metadata.embedding);
The legacy positional signature createEmbeddingPlugin(embedder, batchSize, maxConcurrentBatches) is still supported.
createOCRPlugin()
Install the optional OCR runtime package first:
npm install docparser-ocr
import { DocParser, createOCRPlugin } from 'docparser-core';
const parser = new DocParser();
await parser.use(
createOCRPlugin({
tesseract: { languages: ['eng'] },
pipeline: { qualityLevel: 'thorough', minConfidence: 0.3 },
routing: {
preferDocumentLanguage: true,
forceImageTypes: ['screenshot', 'form', 'handwritten'],
},
}),
);
// OCR triggers when parsed content includes image elements whose imageRef is a data URL.
// Built-in parsers now provide that for raw PNG/JPEG/TIFF/WebP files,
// OOXML embedded images, and rasterized image-only PDF pages.
Smart OCR routing now prefers document language when the provider supports it, skips low-value image types like logos by default, and still routes high-signal images such as screenshots, forms, and handwritten snippets even when the parser did not explicitly mark containsText = true.
High-signal OCR routes now also bypass the lightweight non-text image precheck before recognition. That matters for prescriptions, scanned forms, and handwriting, where a cheap contrast heuristic may miss real text even though OCR should still run.
High-signal OCR routes now also evaluate the built-in OCR retry strategies instead of stopping at the first barely acceptable pass. By default, the OCR runtime now retries low-confidence results with scanned-document, low-contrast, and handwriting-oriented recognition strategies, including alternate Tesseract page segmentation modes such as single_block and sparse_text.
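The routing rules described above can be sketched as a small decision function. This is a minimal illustration, not the plugin's actual implementation: the `ImageCandidate` shape, the `skipImageTypes` option, the `passesTextPrecheck` flag, and the fallback rule for ordinary images are all assumptions made for the example. Only `forceImageTypes` and `containsText` appear in the text above.

```typescript
// Hypothetical shapes for illustration; not the actual docparser-core types.
interface ImageCandidate {
  imageType: string; // e.g. 'logo', 'screenshot', 'form', 'handwritten'
  containsText?: boolean; // set when the parser already knows text is present
  passesTextPrecheck: boolean; // result of a cheap contrast-style heuristic (assumed)
}

interface RoutingOptions {
  forceImageTypes: string[]; // high-signal types that always go to OCR
  skipImageTypes: string[]; // low-value types skipped by default (assumed option name)
}

type RoutingDecision = 'ocr' | 'skip';

function routeImage(image: ImageCandidate, routing: RoutingOptions): RoutingDecision {
  // High-signal types go straight to OCR and bypass the lightweight precheck.
  if (routing.forceImageTypes.includes(image.imageType)) return 'ocr';
  // Low-value types such as logos are skipped.
  if (routing.skipImageTypes.includes(image.imageType)) return 'skip';
  // Assumed fallback: run OCR when the parser flagged text or the precheck passed.
  return image.containsText === true || image.passesTextPrecheck ? 'ocr' : 'skip';
}

const routing: RoutingOptions = {
  forceImageTypes: ['screenshot', 'form', 'handwritten'],
  skipImageTypes: ['logo'],
};

// A handwritten note that fails the contrast precheck is still routed to OCR.
console.log(routeImage({ imageType: 'handwritten', passesTextPrecheck: false }, routing)); // prints "ocr"
```

Checking the forced types first is what lets prescriptions and scanned forms reach recognition even when the cheap heuristic would have rejected them.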
Use pipeline.qualityLevel to control how much OCR work the runtime should spend per image:
- fast: minimal OCR work and no built-in retries
- balanced: current default, with practical recovery passes for common scans
- thorough: more preprocessing, more retry profiles, and more full-profile evaluation
- extreme: computation-heavy OCR that tries the broadest built-in set of segmentation and engine strategies
If you provide pipeline.retryProfiles, they replace the built-in retry set by default so existing custom OCR flows stay stable. Set pipeline.useBuiltInRetryProfiles = true when you want your custom profiles appended after the built-in quality-level retries.
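The replace-versus-append rule can be sketched as follows. This is a minimal sketch under stated assumptions: `RetryProfile` is a hypothetical shape, the profile names are illustrative, and treating an empty `retryProfiles` array as "provided" is a guess rather than documented behavior.

```typescript
// Hypothetical profile shape; the real runtime's profile type may differ.
interface RetryProfile {
  name: string;
}

function resolveRetryProfiles(
  builtIn: RetryProfile[],
  custom: RetryProfile[] | undefined,
  useBuiltInRetryProfiles = false,
): RetryProfile[] {
  // No custom profiles configured: keep the built-in set for the quality level.
  if (custom === undefined) return builtIn;
  // Custom profiles replace the built-ins unless the flag asks for both,
  // in which case the custom profiles run after the built-in ones.
  return useBuiltInRetryProfiles ? [...builtIn, ...custom] : custom;
}

// Illustrative names only.
const builtInProfiles: RetryProfile[] = [{ name: 'scanned-document' }, { name: 'low-contrast' }];
const customProfiles: RetryProfile[] = [{ name: 'my-invoice-profile' }];

console.log(resolveRetryProfiles(builtInProfiles, customProfiles).map((p) => p.name));
// prints [ 'my-invoice-profile' ]
console.log(resolveRetryProfiles(builtInProfiles, customProfiles, true).map((p) => p.name));
// prints [ 'scanned-document', 'low-contrast', 'my-invoice-profile' ]
```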
When OCR runs through the optional runtime package, the plugin now records per-image OCR diagnostics in udm.processingNotes and in the processing receipt. Those diagnostics include geometry counts, detected script, detected orientation, and any OSD-driven rotation that was applied before retry profiles ran.
If docparser-ocr is not installed, createOCRPlugin() throws an explicit runtime error with the install command.
Writing a custom plugin
DocParser’s runtime plugin interface is DocParserPlugin. In practice, hooks are invoked with a single argument (UDM or chunks), and your hook returns the modified value.
import { DocParser } from 'docparser-core';
import type { DocParserPlugin } from 'docparser-core';
import type { DocumentUDM } from 'docparser-core';
const taggerPlugin: DocParserPlugin = {
name: 'tagger',
version: '1.3.0',
type: 'parser',
hooks: {
afterParse: async (udm: DocumentUDM) => {
return {
...udm,
processingNotes: [
...udm.processingNotes,
{ severity: 'info', stage: 'plugin', message: 'Tagged by taggerPlugin' },
],
};
},
},
};
const parser = new DocParser();
await parser.use(taggerPlugin);
Feedback-driven enrichment
Use enrichment.feedback when you want enrichment outputs to reflect reviewer or product feedback about which signals matter more.
import { DocParser } from 'docparser-core';
const parser = new DocParser({
enrichment: {
feedback: {
enabled: true,
preferredTerms: ['invoice', 'payment'],
deprioritizedTerms: ['lunch', 'social'],
prioritizedEntityTypes: ['MONEY', 'DATE'],
},
},
});
When enabled, preferred terms are promoted in chunk keywords, deprioritized terms reduce ranking weight, and prioritized entity types increase chunk importance. The applied matches are written to chunk.metadata.feedbackSignals for downstream inspection.
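To make the effect of the boost and penalty values concrete, here is a minimal sketch of how such signals could adjust a single importance score. The additive formula, the clamping to [0, 1], the `ChunkSignals` shape, and the signal label format are assumptions for illustration; the library's actual scoring is not documented here. The numeric defaults (0.08, 0.1, 0.05) come from the config reference.

```typescript
interface FeedbackOptions {
  preferredTerms: string[];
  deprioritizedTerms: string[];
  prioritizedEntityTypes: string[];
  preferredTermBoost: number;
  deprioritizedTermPenalty: number;
  entityTypeBoost: number;
}

// Hypothetical view of a chunk; not the actual chunk type.
interface ChunkSignals {
  keywords: string[];
  entityTypes: string[];
  importance: number;
}

interface FeedbackResult {
  importance: number;
  feedbackSignals: string[];
}

function applyFeedback(chunk: ChunkSignals, fb: FeedbackOptions): FeedbackResult {
  const keywords = chunk.keywords.map((k) => k.toLowerCase());
  const hasKeyword = (term: string) => keywords.includes(term.toLowerCase());

  const preferred = fb.preferredTerms.filter(hasKeyword);
  const deprioritized = fb.deprioritizedTerms.filter(hasKeyword);
  const entityTypes = fb.prioritizedEntityTypes.filter((t) => chunk.entityTypes.includes(t));

  // Assumed additive model: one boost or penalty per matched signal.
  const raw =
    chunk.importance +
    preferred.length * fb.preferredTermBoost -
    deprioritized.length * fb.deprioritizedTermPenalty +
    entityTypes.length * fb.entityTypeBoost;

  return {
    importance: Math.min(1, Math.max(0, raw)), // keep the score in [0, 1]
    feedbackSignals: [
      ...preferred.map((t) => `preferred:${t}`),
      ...deprioritized.map((t) => `deprioritized:${t}`),
      ...entityTypes.map((t) => `entityType:${t}`),
    ],
  };
}
```

With the defaults, a chunk at importance 0.5 that matches one preferred term, one deprioritized term, and one prioritized entity type would land near 0.53 under this model.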
Semantic dedup enrichment
Use enrichment.semanticDedup when you want near-duplicate chunks linked for downstream review without removing either chunk from the final result.
import { DocParser } from 'docparser-core';
const parser = new DocParser({
enrichment: {
semanticDedup: {
enabled: true,
similarityThreshold: 0.6,
minSharedTerms: 2,
maxRelatedChunks: 3,
},
},
});
When enabled, semantically similar chunks receive related relationships and a chunk.metadata.semanticDedup.relatedChunks summary containing the related chunk IDs, similarity score, and shared terms.
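The three options interact roughly as in the sketch below. It uses Jaccard similarity over term sets as a stand-in metric; that choice, along with the `ChunkTerms` and `RelatedChunk` shapes, is an assumption for illustration, and the library may measure similarity differently.

```typescript
interface ChunkTerms {
  chunkId: string;
  terms: string[];
}

interface RelatedChunk {
  chunkId: string;
  similarity: number;
  sharedTerms: string[];
}

interface DedupOptions {
  similarityThreshold: number;
  minSharedTerms: number;
  maxRelatedChunks: number;
}

function sharedTermsOf(a: string[], b: string[]): string[] {
  const inB = new Set(b);
  return Array.from(new Set(a)).filter((t) => inB.has(t));
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| (assumed metric for this sketch).
function jaccard(a: string[], b: string[]): number {
  const union = new Set(a.concat(b));
  return union.size === 0 ? 0 : sharedTermsOf(a, b).length / union.size;
}

function findRelated(target: ChunkTerms, others: ChunkTerms[], opts: DedupOptions): RelatedChunk[] {
  return others
    .filter((o) => o.chunkId !== target.chunkId)
    .map((o) => ({
      chunkId: o.chunkId,
      similarity: jaccard(target.terms, o.terms),
      sharedTerms: sharedTermsOf(target.terms, o.terms),
    }))
    // Both gates must pass: enough overlap in score and in absolute shared terms.
    .filter((r) => r.similarity >= opts.similarityThreshold && r.sharedTerms.length >= opts.minSharedTerms)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, opts.maxRelatedChunks);
}
```

Neither chunk is dropped in this model; the function only produces the link summary, which matches the "linked for downstream review" behavior described above.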
Batch Processing
Use BatchProcessor for concurrent processing with per-file isolation and progress reporting.
import { BatchProcessor, createLogger } from 'docparser-core';
const logger = createLogger({ level: 'warn' });
const batch = new BatchProcessor(logger, { preset: 'general', output: { format: 'jsonl' } });
const result = await batch.process(
[
{ buffer: Buffer.from('Doc 1'), filename: 'a.txt' },
{ buffer: Buffer.from('# Title\nHello'), filename: 'b.md' },
],
{
concurrency: 3,
onProgress: (done, total, filename) => {
console.log(`[${done}/${total}] ${filename}`);
},
},
);
console.log(result.stats);
Streaming
Stream bytes in and emit chunks as JSONL (or objects) using StreamProcessor.
import { StreamProcessor } from 'docparser-core';
import { createReadStream, createWriteStream } from 'node:fs';
const processor = new StreamProcessor();
createReadStream('report.docx')
.pipe(processor.createChunkStream('report.docx'))
.pipe(createWriteStream('chunks.jsonl'));
createChunkStream() now forwards each chunk as soon as the parser marks it ready, so downstream consumers can start ingesting chunk output before final receipt and quality reporting complete.
When performance.streaming is enabled and the incoming stream crosses performance.streamingThreshold, both processStream() and createChunkStream() spill the input to a temp file and reuse DocParser.processFile() so file-backed parser paths can run without buffering the full stream in memory.
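The spill condition can be pictured with a small helper that walks incoming chunk sizes and reports where the threshold is crossed. This is a sketch only: `findSpillPoint` is a hypothetical function, and whether "crosses" means strictly greater than the threshold is an assumption.

```typescript
interface StreamingOptions {
  streaming: boolean; // mirrors performance.streaming
  streamingThreshold: number; // mirrors performance.streamingThreshold, in bytes
}

// Returns the index of the incoming chunk at which the cumulative byte count
// first exceeds the threshold, or -1 if the input can stay in memory.
function findSpillPoint(chunkSizes: number[], opts: StreamingOptions): number {
  if (!opts.streaming) return -1; // spilling only applies when streaming is enabled
  let bytesSeen = 0;
  for (let i = 0; i < chunkSizes.length; i++) {
    bytesSeen += chunkSizes[i] ?? 0;
    if (bytesSeen > opts.streamingThreshold) return i;
  }
  return -1;
}

const MB = 1024 * 1024;
// Three 20 MB reads against a 50 MB threshold: the third read crosses it.
console.log(findSpillPoint([20 * MB, 20 * MB, 20 * MB], { streaming: true, streamingThreshold: 50 * MB })); // prints 2
```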
If you want object-mode chunk objects instead of JSONL strings:
import { StreamProcessor } from 'docparser-core';
import { createReadStream } from 'node:fs';
const processor = new StreamProcessor();
createReadStream('report.docx')
.pipe(processor.createChunkStream('report.docx', { objectMode: true }))
.on('data', (chunk) => {
console.log(chunk.chunkId, chunk.content.length);
});
Or buffer the stream and get a normal ProcessingResult:
import { StreamProcessor } from 'docparser-core';
import { createReadStream } from 'node:fs';
const processor = new StreamProcessor();
const result = await processor.processStream(createReadStream('big.md'), 'big.md');
CLI
The package ships a docparser CLI (built from src/cli/cli.ts).
Help
npm install docparser-core
npx docparser --help
# Interactive config wizard
npx docparser init
# Alias
npx docparser setup
# Or run without installing:
npx --package docparser-core docparser --help
init and setup run the same interactive wizard. The wizard can generate:
- a starter config for common workflows
- a full annotated template that covers the current DocumentParserConfig surface
- either docparser.config.mjs or docparser.config.jsonc
- an optional .env.local file for secret placeholders
The generated config is validated before the wizard exits, so invalid enum values or impossible numeric combinations fail immediately with clear errors.
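As an illustration of what "invalid enum values or impossible numeric combinations" could mean, here is a minimal validation sketch over a few fields from the config reference. The specific rules (min ≤ target ≤ max, overlap smaller than the maximum chunk size) and the `ConfigSubset` shape are assumptions; the wizard's real checks and error messages may differ.

```typescript
// PII modes listed in the config reference comment: detect | mask | redact | hash
const PII_MODES: readonly string[] = ['detect', 'mask', 'redact', 'hash'];

// Hypothetical flattened subset of the config, for illustration only.
interface ConfigSubset {
  piiMode: string;
  minChunkTokens: number;
  targetChunkTokens: number;
  maxChunkTokens: number;
  overlapTokens: number;
}

function validateConfigSubset(config: ConfigSubset): string[] {
  const errors: string[] = [];
  if (!PII_MODES.includes(config.piiMode)) {
    errors.push(`security.pii.mode must be one of ${PII_MODES.join(', ')}`);
  }
  if (config.minChunkTokens > config.maxChunkTokens) {
    errors.push('chunking.minChunkTokens must not exceed chunking.maxChunkTokens');
  }
  if (config.targetChunkTokens < config.minChunkTokens || config.targetChunkTokens > config.maxChunkTokens) {
    errors.push('chunking.targetChunkTokens must lie between the min and max chunk sizes');
  }
  if (config.overlapTokens >= config.maxChunkTokens) {
    errors.push('chunking.overlap.tokens must be smaller than chunking.maxChunkTokens');
  }
  return errors;
}

// The documented defaults pass every rule in this sketch.
console.log(
  validateConfigSubset({
    piiMode: 'detect',
    minChunkTokens: 64,
    targetChunkTokens: 256,
    maxChunkTokens: 512,
    overlapTokens: 50,
  }).length,
); // prints 0
```

Collecting every error before failing, rather than throwing on the first one, lets a wizard report all problems in a single pass.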
docparser init
npx docparser init
# You can also pre-fill parts of the wizard
npx docparser init --format mjs --template full --output ./docparser.config.mjs
What the wizard does:
- asks for a config format: mjs or jsonc
- asks for a starter or full annotated template
- guides you through common defaults like preset, runtime preset, output format, PII mode, and OCR defaults
- can scaffold built-in providers.ocr and providers.llm sections for Tesseract, native Tesseract, Ollama, Claude, OpenRouter, and organization-managed gateways
- can generate a sample plugin stub and a sample governance hook stub
- can generate .env.local and wire secrets through environment-variable references instead of hardcoding them into the config file
Use mjs when you want comments, hooks, imports, or executable logic in the config. Use jsonc when you want a purely declarative config file with comments.
docparser health
npx docparser health --config ./docparser.config.mjs
# Widen the sweep beyond startup-only providers and write JSON to a file
npx docparser health \
--config ./docparser.config.mjs \
--all \
--output ./out/provider-health.json
health loads the same config formats as process and inspect, resolves .env and .env.local, and runs health checks against the configured built-in or custom runtime-backed providers that are active in the current runtime configuration. An LLM registration counts as active when summary or question generation is enabled, either on the provider registration itself or through inherited enrichment defaults.
Health flags:
- --config: required. Load a parser config from a JSON, JSONC, or ESM module file.
- --all: check all active configured runtime-backed providers instead of only those marked with healthCheck.validateOnStartup: true.
- --output: write the JSON health report to a file instead of stdout.
Behavior:
- exits non-zero when any checked provider is unhealthy
- prints a JSON summary with provider-level health results
- uses the same provider retry and redaction rules as the runtime adapters
- defaults to startup-scoped checks, so it mirrors the fail-fast provider validation path used during parser initialization
Built-In Providers
DocumentParserConfig now supports a top-level providers section for built-in OCR and LLM runtime registration. That means you can turn on OCR and LLM enrichment directly from config files without writing custom plugin factories first.
export default {
parsing: {
ocr: {
languages: ['eng'],
confidenceThreshold: 60,
dpi: 300,
},
},
providers: {
ocr: {
provider: {
kind: 'native_tesseract',
executablePath: process.env.DOCPARSER_NATIVE_TESSERACT_PATH,
languages: ['eng'],
},
},
llm: {
provider: {
kind: 'openrouter',
apiKey: process.env.DOCPARSER_OPENROUTER_API_KEY,
model: 'anthropic/claude-3.5-sonnet',
},
generateSummary: true,
generateQuestions: false,
summaryMaxTokens: 96,
},
},
};
Supported built-in provider kinds:
- OCR: tesseract, native_tesseract, ollama, claude, openrouter, organization_gateway
- LLM: ollama, claude, openrouter, organization_gateway
organization_gateway stays generic on purpose. Set protocol to openai, anthropic, or ollama and point baseUrl at your internal proxy or gateway.
HTTP-backed built-in providers share the same hardening controls:
- provider.retry: bounded retry/backoff for transient network and upstream failures
- providers.ocr.healthCheck and providers.llm.healthCheck: optional startup validation so the parser can fail fast before processing documents
- surfaced provider failures redact common bearer tokens, API keys, secrets, and passwords before the error leaves the adapter layer
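To show what the retry fields used throughout the templates imply, the sketch below computes a wait schedule and redacts a bearer token from an error message. Both functions are illustrative assumptions: the formula (capped exponential growth, no jitter) and the redaction pattern are not taken from the library, and only the four retry field names come from the configs in this document.

```typescript
interface RetryOptions {
  maxAttempts: number;
  initialDelayMs: number;
  maxDelayMs: number;
  backoffMultiplier: number;
}

// Delays between attempts: N attempts means at most N - 1 waits.
// Assumed formula: min(maxDelayMs, initialDelayMs * backoffMultiplier^n), without jitter.
function backoffDelays(retry: RetryOptions): number[] {
  const delays: number[] = [];
  for (let attempt = 1; attempt < retry.maxAttempts; attempt++) {
    const raw = retry.initialDelayMs * Math.pow(retry.backoffMultiplier, attempt - 1);
    delays.push(Math.min(raw, retry.maxDelayMs));
  }
  return delays;
}

// Illustrative redaction of bearer tokens only; a real adapter would cover more patterns.
function redactBearerTokens(message: string): string {
  return message.replace(/Bearer\s+[A-Za-z0-9._~+\/-]+=*/g, 'Bearer [REDACTED]');
}

console.log(backoffDelays({ maxAttempts: 3, initialDelayMs: 250, maxDelayMs: 2000, backoffMultiplier: 2 }));
// prints [ 250, 500 ]
console.log(redactBearerTokens('401 from upstream (Authorization: Bearer abc.def-123)'));
// prints "401 from upstream (Authorization: Bearer [REDACTED])"
```

Under these assumptions, the template values (3 attempts, 250 ms initial delay, multiplier 2) never reach the 2000 ms cap; the cap only matters once maxAttempts is raised to 5 or more.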
You can preflight those providers yourself before processing documents:
import { checkConfiguredProvidersHealth } from 'docparser-core';
const results = await checkConfiguredProvidersHealth(config, { startupOnly: true });
console.log(results);
Production Templates
The examples below are intended as production-oriented starting points. Tune parsing.ocr, chunking, and enrichment settings to your document mix, but keep the provider retry and startup health sections unless your environment already handles those concerns upstream.
Checked-in equivalents of these templates now live in ./examples/configs/:
- ./examples/configs/native-tesseract-openrouter.mjs
- ./examples/configs/tesseract-js-openrouter.mjs
- ./examples/configs/ollama-local.mjs
- ./examples/configs/claude-direct.mjs
- ./examples/configs/openrouter-hosted.mjs
- ./examples/configs/organization-gateway-openai.mjs
- ./examples/configs/provider-example.env
Runnable sample inputs that pair with these configs live in ./examples/inputs/:
- ./examples/inputs/report.md
- ./examples/inputs/page.html
- ./examples/inputs/renewal-form.png
- ./examples/inputs/renewal-scan.pdf
- ./examples/inputs/README.md
Native Tesseract OCR with OpenRouter LLM:
export default {
parsing: {
ocr: {
languages: ['eng'],
confidenceThreshold: 60,
dpi: 300,
},
},
providers: {
ocr: {
provider: {
kind: 'native_tesseract',
executablePath: process.env.DOCPARSER_NATIVE_TESSERACT_PATH,
languages: ['eng'],
userDefinedDpi: 300,
},
processEmbeddedImages: true,
processImagePDFs: true,
pipeline: {
qualityLevel: 'thorough',
useBuiltInRetryProfiles: true,
earlyExitConfidence: 0.92,
},
},
llm: {
provider: {
kind: 'openrouter',
apiKey: process.env.DOCPARSER_OPENROUTER_API_KEY,
model: 'anthropic/claude-3.5-sonnet',
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
generateSummary: true,
generateQuestions: false,
summaryMaxTokens: 128,
concurrency: 2,
},
},
};
If you prefer the optional JavaScript OCR runtime instead of a native binary, switch kind: 'native_tesseract' to kind: 'tesseract' and keep the same OCR registration shape.
Fully local Ollama OCR and LLM:
export default {
providers: {
ocr: {
provider: {
kind: 'ollama',
baseUrl: process.env.DOCPARSER_OLLAMA_BASE_URL,
model: 'llava:13b',
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
processEmbeddedImages: true,
processImagePDFs: true,
pipeline: { qualityLevel: 'balanced' },
},
llm: {
provider: {
kind: 'ollama',
baseUrl: process.env.DOCPARSER_OLLAMA_BASE_URL,
model: 'qwen2.5:7b-instruct',
embeddingModel: 'nomic-embed-text',
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
generateSummary: true,
generateQuestions: true,
maxQuestions: 4,
summaryMaxTokens: 128,
},
},
};
Direct Claude OCR and LLM:
export default {
providers: {
ocr: {
provider: {
kind: 'claude',
apiKey: process.env.DOCPARSER_CLAUDE_API_KEY,
model: 'claude-3-7-sonnet-latest',
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
processEmbeddedImages: true,
processImagePDFs: true,
pipeline: { qualityLevel: 'balanced' },
},
llm: {
provider: {
kind: 'claude',
apiKey: process.env.DOCPARSER_CLAUDE_API_KEY,
model: 'claude-3-7-sonnet-latest',
maxTokens: 1024,
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
generateSummary: true,
generateQuestions: false,
summaryMaxTokens: 128,
},
},
};
OpenRouter OCR and LLM:
export default {
providers: {
ocr: {
provider: {
kind: 'openrouter',
apiKey: process.env.DOCPARSER_OPENROUTER_API_KEY,
model: 'google/gemini-2.0-flash-001',
referer: process.env.DOCPARSER_OPENROUTER_REFERER,
appName: process.env.DOCPARSER_OPENROUTER_APP_NAME,
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
processEmbeddedImages: true,
processImagePDFs: true,
pipeline: { qualityLevel: 'balanced' },
},
llm: {
provider: {
kind: 'openrouter',
apiKey: process.env.DOCPARSER_OPENROUTER_API_KEY,
model: 'anthropic/claude-3.5-sonnet',
referer: process.env.DOCPARSER_OPENROUTER_REFERER,
appName: process.env.DOCPARSER_OPENROUTER_APP_NAME,
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
generateSummary: true,
generateQuestions: true,
maxQuestions: 3,
summaryMaxTokens: 128,
},
},
};
Organization-managed OCR and LLM gateway:
export default {
providers: {
ocr: {
provider: {
kind: 'organization_gateway',
protocol: 'openai',
baseUrl: process.env.DOCPARSER_GATEWAY_BASE_URL,
apiKey: process.env.DOCPARSER_GATEWAY_API_KEY,
model: 'gpt-4.1-mini',
healthPath: '/readyz',
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
processEmbeddedImages: true,
processImagePDFs: true,
pipeline: { qualityLevel: 'balanced' },
},
llm: {
provider: {
kind: 'organization_gateway',
protocol: 'openai',
baseUrl: process.env.DOCPARSER_GATEWAY_BASE_URL,
apiKey: process.env.DOCPARSER_GATEWAY_API_KEY,
model: 'gpt-4.1-mini',
healthPath: '/readyz',
timeoutMs: 30000,
retry: {
maxAttempts: 3,
initialDelayMs: 250,
maxDelayMs: 2000,
backoffMultiplier: 2,
},
},
healthCheck: {
validateOnStartup: true,
failOnUnhealthy: true,
},
generateSummary: true,
generateQuestions: false,
summaryMaxTokens: 128,
},
},
};
If your gateway speaks Anthropic or Ollama semantics instead, keep the same shape and switch protocol to anthropic or ollama.
docparser process <file>
npx docparser process ./examples/inputs/report.md \
--config ./docparser.config.mjs \
--plugin ./plugins/custom-chunker.mjs \
--progress jsonl \
--stream-events \
--format jsonl \
--preset general \
--strategy hybrid_auto \
--max-tokens 512 \
--min-tokens 64 \
--overlap 50 \
--pii detect \
--receipt ./out/receipt.json \
--quality ./out/quality.json \
--security ./out/securit