@maya-ai/document-pipeline
v0.1.0
Document Pipeline
Reusable component. Lives at server/src/document-pipeline/. Self-contained: zero imports from outside the directory except external npm deps and Node built-ins. This document travels with the component.
App-agnostic capability that ingests files (PDF, images, Office docs, ZIP archives, plain text) and produces a normalised analysis package: per-file records, per-unit text and rendered images, document segmentation, three-tier classification with field extraction, and an optional continuation handler that produces task-specific output.
Public API
import { runDocumentPipeline } from "./document-pipeline/pipeline.js";
import type {
  DocumentPipelineRequest,
  DocumentPipelineResult,
  DocumentPipelineDeps,
} from "./document-pipeline/pipeline.js";

const result: DocumentPipelineResult = await runDocumentPipeline(request, deps);

DocumentPipelineDeps (all optional):
| Field | Type | Purpose |
|-------|------|---------|
| llmProvider | PipelineLlmAdapter | Used by Tier B classification, segmentation, continuation. See ./llm.ts. |
| tierCAdapter | PipelineLlmAdapter | Multimodal Tier-C-specific provider. Falls back to llmProvider when omitted. |
| ocrPlugin | OCRPlugin | Injected into the extraction stage. |
| policyRegistry | DocumentPolicyRegistry | Resolves request.policyName to a DocumentPolicy. |
| extraExtractors | ExtractorPlugin[] | Merged with the built-in extractors. |
| continuationHandler | ContinuationHandler | Called after assembly when request.taskIntent is present. |
The pipeline depends on a narrow LLM adapter (PipelineLlmAdapter in ./llm.ts), not on any concrete provider:
export interface PipelineLlmAdapter {
  readonly enabled: boolean;
  generate(request: PipelineLlmRequest): Promise<PipelineLlmResponse | undefined>;
}

Apps wrap their concrete LlmProvider into this shape via a small shim (in this app: server/src/llm/pipeline-adapter.ts).
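A minimal sketch of such a shim follows. It assumes a hypothetical app-side provider (`AppLlmProvider`) with a single `complete(prompt)` method, and the `PipelineLlmRequest` / `PipelineLlmResponse` shapes are placeholders, since the real definitions live in ./llm.ts and may differ. Only `enabled` and `generate` are taken from the interface above.

```typescript
// Placeholder shapes: the real definitions live in ./llm.ts and may differ.
interface PipelineLlmRequest {
  systemPrompt: string;
  userPrompt: string;
}
interface PipelineLlmResponse {
  text: string;
}

// Copied from the interface shown above.
interface PipelineLlmAdapter {
  readonly enabled: boolean;
  generate(request: PipelineLlmRequest): Promise<PipelineLlmResponse | undefined>;
}

// Hypothetical app-side provider; stands in for whatever client your app uses.
interface AppLlmProvider {
  complete(prompt: string): Promise<string>;
}

// Wrap the app provider. Passing no provider yields a disabled adapter
// whose generate() resolves to undefined.
function toPipelineAdapter(provider?: AppLlmProvider): PipelineLlmAdapter {
  return {
    enabled: provider !== undefined,
    async generate(request) {
      if (!provider) return undefined;
      const prompt = `${request.systemPrompt}\n\n${request.userPrompt}`;
      const text = await provider.complete(prompt);
      return { text };
    },
  };
}
```

Calling `toPipelineAdapter()` with no argument gives the kind of disabled stub that tests can inject in place of a live LLM.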
Stage Flow
| # | Stage | Function | Failure behaviour |
|---|-------|----------|-------------------|
| 1 | Ingest | ingestFile | Unrecoverable — returns failed package immediately |
| 2 | Decompose | decomposeFile (or expandZipSubmission for archives) | Falls through with empty units |
| 3 | Extract | extractUnit per unit | Per-unit failures tracked; partial results kept |
| 4 | Segment | segmentDocuments | Fallback: all units in one group |
| 5 | Classify | classifyGroups | Falls through with empty classifiedGroups |
| 6 | Assemble | assembleAnalysisPackage | Always succeeds; warns about inconsistencies in the processing log |
| 7 | Continue | continuationHandler | Skipped unless deps.continuationHandler and request.taskIntent are both present |
Three-Tier Classification Cascade
Each segmented document group is classified by a cascade. Tiers A and B each have a confidence threshold (tierAThreshold, tierBThreshold); a result that reaches its tier's threshold short-circuits the remaining tiers.
| Tier | Mechanism | Cost | Runs when |
|------|-----------|------|-----------|
| A — Deterministic | MRZ regex (TD3 passport / TD2 / TD1 ID), barcode anchor (CODE128/PDF417/QR/etc.), heuristic field parsing | None | Always (unless minTier=tier-b or tier-c) |
| B — LLM text | Policy-injected text prompt with extracted text snippets | One LLM call | Tier A confidence below tierAThreshold AND executionMode != deterministic-first |
| C — Multimodal LLM | Same prompt + actual rendered page images attached as ImageContent parts | One+ LLM calls with image bytes | Tier B confidence below tierBThreshold AND group has rendered images AND executionMode is hybrid or full-multimodal |
Execution modes
| Mode | Tier A | Tier B | Tier C | Notes |
|------|--------|--------|--------|-------|
| deterministic-first | ✅ | ❌ | ❌ | Structural rules only, no LLM |
| hybrid | ✅ | ✅ when A insufficient | ✅ when B insufficient and group has images | LLM text-only segmentation |
| full-multimodal | ✅ | ✅ when A insufficient | ✅ when B insufficient and group has images | Up to 12 page images attached to segmentation LLM call |
For classification, hybrid and full-multimodal behave identically — both run the full cascade. The two modes diverge at segmentation: only full-multimodal attaches page images to the boundary-detection LLM call.
minTier semantics
minTier controls which tiers are attempted. Tiers below minTier are skipped entirely; the classifier does not silently salvage by running a lower tier.
| Combination | Behaviour |
|-------------|-----------|
| minTier=tier-a, executionMode=hybrid | Full A → B → C cascade (default) |
| minTier=tier-b, executionMode=hybrid | Skip A; LLM-only |
| minTier=tier-c, executionMode=hybrid | Skip A and B; multimodal only. Group must have images, otherwise → unknown |
| minTier=tier-b, executionMode=deterministic-first | Contradictory: Tier A blocked by minTier, Tier B/C blocked by mode → unknown |
| minTier=tier-c, no images on group | Tier C cannot run, lower tiers blocked → unknown |
The "unknown" fallback group has classificationTier: <minTier> so the result records what was attempted, and uncertaintyFlags: ["no-classifier-result"].
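One way to read the cascade table, the execution-mode table, and the minTier rules together is sketched below. This is not the component's classifier. The per-tier runners are hypothetical and synchronous for brevity. The inclusive `>=` comparison, keeping the highest-confidence result when no tier reaches its threshold, and a confidence of 0 on the unknown fallback are all assumptions.

```typescript
type Tier = "tier-a" | "tier-b" | "tier-c";
type ExecutionMode = "deterministic-first" | "hybrid" | "full-multimodal";

interface TierResult {
  label: string;
  confidence: number;
}

interface CascadeOptions {
  executionMode: ExecutionMode;
  minTier: Tier;
  tierAThreshold: number;
  tierBThreshold: number;
  hasImages: boolean; // whether the group has rendered page images
}

interface CascadeOutcome extends TierResult {
  classificationTier: Tier;
  uncertaintyFlags: string[];
  attempted: Tier[];
}

// Hypothetical per-tier classifiers; the real LLM tiers would be async.
type TierRunners = Record<Tier, () => TierResult | undefined>;

const TIER_ORDER: Tier[] = ["tier-a", "tier-b", "tier-c"];

function classifyGroup(opts: CascadeOptions, run: TierRunners): CascadeOutcome {
  const minRank = TIER_ORDER.indexOf(opts.minTier);
  const llmAllowed = opts.executionMode !== "deterministic-first";
  // Eligibility is decided before any confidence is known.
  const eligible: Record<Tier, boolean> = {
    "tier-a": minRank === 0,
    "tier-b": minRank <= 1 && llmAllowed,
    "tier-c": llmAllowed && opts.hasImages,
  };
  const thresholds: Record<Tier, number> = {
    "tier-a": opts.tierAThreshold,
    "tier-b": opts.tierBThreshold,
    "tier-c": 0, // last tier: nothing left to short-circuit
  };

  const attempted: Tier[] = [];
  let best: CascadeOutcome | undefined;
  for (const tier of TIER_ORDER) {
    if (!eligible[tier]) continue;
    attempted.push(tier);
    const result = run[tier]();
    if (result && (!best || result.confidence > best.confidence)) {
      best = { ...result, classificationTier: tier, uncertaintyFlags: [], attempted };
    }
    // Assumption: reaching the threshold (inclusive) stops the cascade.
    if (result && result.confidence >= thresholds[tier]) break;
  }

  // No tier produced a result: record minTier as the tier that was attempted.
  return best ?? {
    label: "unknown",
    confidence: 0, // assumption: the source does not state a value
    classificationTier: opts.minTier,
    uncertaintyFlags: ["no-classifier-result"],
    attempted,
  };
}
```

With `minTier: "tier-b"` and `executionMode: "deterministic-first"`, no tier is eligible, so the function returns the unknown fallback without calling any runner, matching the contradictory row in the table.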
Configuration
Per-request fields (on DocumentPipelineRequest)
| Field | Type | Purpose |
|-------|------|---------|
| executionMode | enum | deterministic-first \| hybrid \| full-multimodal |
| tierAThreshold | [0,1] | Min Tier-A confidence to short-circuit Tier B/C |
| tierBThreshold | [0,1] | Min Tier-B confidence to skip Tier C |
| minTier | enum | tier-a \| tier-b \| tier-c — lowest tier the classifier may attempt |
| policyName | string | Named policy in the registry (open-vocabulary if absent) |
| taskIntent | string | Continuation goal — gates whether the continuation handler runs |
| language | BCP-47 | Hint, e.g. en, ar-EG |
| country | ISO 3166-1 alpha-2 | Hint, e.g. US, DE |
| zipParallelism | enum | sequential \| parallel. Sequential is deterministic; parallel is faster on archives with many small entries. |
| limits | DocumentPipelineLimits | Override per-request budgets (max units, max pixels, max bytes, etc.). |
request.context.submissionIntent (separate from taskIntent) is a hint about why the submission was made; it flows into the segmentation LLM prompt and the continuation handler's appContext.
Server-wide defaults via env (parsed by pipeline-config.ts)
When the consumer app calls loadPipelineDefaults(), the following variables are read:
| Variable | Default | Purpose |
|----------|---------|---------|
| LLM_DEFAULT_EXECUTION_MODE | hybrid | Default applied when the request omits executionMode. |
| LLM_DEFAULT_TIER_A_THRESHOLD | 0.8 | Default applied when the request omits tierAThreshold. |
| LLM_DEFAULT_TIER_B_THRESHOLD | 0.6 | Default applied when the request omits tierBThreshold. |
| LLM_DEFAULT_MIN_TIER | tier-a | Default applied when the request omits minTier. |
Invalid env values (unknown enum, non-numeric threshold, out-of-range threshold) produce a console.warn at boot and the field falls back to the built-in default. The pipeline never refuses to start because of a bad env var.
Priority order
- Per-request field on DocumentPipelineRequest
- Matching LLM_DEFAULT_* env var (read once at boot)
- Built-in default in pipeline-config.ts
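The priority order and the invalid-value fallback can be sketched for a single threshold as follows. The function names (`loadThresholdDefault`, `resolveThreshold`) and the `Env` type are hypothetical and are not the exports of pipeline-config.ts. Warnings are collected into an array here rather than sent to console.warn so the behaviour is easy to inspect, and treating an empty string as unset is an assumption.

```typescript
type Env = Record<string, string | undefined>;

interface LoadedDefault {
  value: number;
  warnings: string[];
}

// Boot-time step: read one threshold variable, falling back to the built-in
// default (and recording a warning) when the value is non-numeric or
// outside [0, 1].
function loadThresholdDefault(env: Env, name: string, builtIn: number): LoadedDefault {
  const raw = env[name];
  // Assumption: an unset or empty variable is simply "not provided".
  if (raw === undefined || raw.trim() === "") {
    return { value: builtIn, warnings: [] };
  }
  const parsed = Number(raw);
  if (!Number.isFinite(parsed) || parsed < 0 || parsed > 1) {
    return {
      value: builtIn,
      warnings: [`${name}: invalid value "${raw}", using ${builtIn}`],
    };
  }
  return { value: parsed, warnings: [] };
}

// Request-time step: a per-request field always wins over the loaded default.
function resolveThreshold(requestValue: number | undefined, loaded: LoadedDefault): number {
  return requestValue ?? loaded.value;
}
```

For example, with `LLM_DEFAULT_TIER_A_THRESHOLD=0.9` in the environment, a request that sets `tierAThreshold: 0.5` resolves to 0.5, while a request that omits it resolves to 0.9.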
ZIP Archives
ZIP support is in zip.ts. When fileRecord.mimeType === "application/zip":
- Each entry is run through Stage 1 ingest (MIME validation, size, SHA-256).
- Each entry is run through Stage 2 decompose.
- Unit indices are renumbered into a single submission-wide sequence.
- Each unit is stamped with sourceFileIndex pointing into fileRecords.
- The archive container itself is not in fileRecords; only the data files inside are. The container's hash and metadata are preserved in the processing log.
- Per-entry failures (empty / oversize / unsupported MIME / nested ZIP / corrupt deflate) become warnings; sibling entries continue.
- Nested ZIPs (an entry whose detected MIME is application/zip) are rejected with a warning. Recursion is one level deep only.
- request.zipParallelism selects sequential vs parallel processing. Both modes produce identical fileRecords and units (same archive order); only timing and processing-log ordering differ.
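The renumbering and stamping rules above can be illustrated with an in-memory sketch that never touches a real archive. The `ZipEntry` shape, the `unitIndex` field name, and `expandEntries` are hypothetical. Each entry is assumed to arrive with its detected MIME type and the number of units that decomposition would yield.

```typescript
// Hypothetical pre-processed entry: MIME already detected, unit count known.
interface ZipEntry {
  name: string;
  mimeType: string;
  unitCount: number;
}

interface FileRecord {
  name: string;
  mimeType: string;
}

interface Unit {
  unitIndex: number; // submission-wide index (field name is an assumption)
  sourceFileIndex: number; // index into fileRecords
}

interface Expanded {
  fileRecords: FileRecord[];
  units: Unit[];
  warnings: string[];
}

function expandEntries(entries: ZipEntry[]): Expanded {
  const out: Expanded = { fileRecords: [], units: [], warnings: [] };
  for (const entry of entries) {
    // Nested archives are rejected with a warning; siblings continue.
    if (entry.mimeType === "application/zip") {
      out.warnings.push(`${entry.name}: nested ZIP rejected`);
      continue;
    }
    const sourceFileIndex = out.fileRecords.length;
    out.fileRecords.push({ name: entry.name, mimeType: entry.mimeType });
    for (let i = 0; i < entry.unitCount; i++) {
      // Units are numbered in one sequence across all accepted entries.
      out.units.push({ unitIndex: out.units.length, sourceFileIndex });
    }
  }
  return out;
}
```

Because entries are visited in archive order and indices derive only from what has already been accepted, the output is the same regardless of how the per-entry work is scheduled, which is the property the two zipParallelism modes are described as sharing.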
Document Policies
Policies are natural-language rule sets (spec §6) injected into Tier-B and Tier-C prompts. Implementation lives in policy.ts:
- DocumentPolicy: { name, version, contentHash, policyText } (SHA-256 hash computed at registration).
- DocumentPolicyRegistry: in-memory store with register(name, text, version?), get(name), list(), remove(name), size.
- buildClassificationPrompt(policy, group, units, extractions, context?): produces { systemPrompt, userPrompt }.
- buildFieldExtractionPrompt(fieldName, documentLabel, policy?): focused single-field prompt for Tier-C refinement.
When no policy is provided, the prompts run in open-vocabulary mode — the model classifies freely, marking everything as unsupported: false.
For loading policy bodies from external files (manifest + Markdown bodies), see document-policies — that's a separate reusable component dedicated to policy loading.
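The registry surface listed above can be approximated with a small in-memory class. This is a sketch and not the code in policy.ts. The class name `SketchPolicyRegistry`, the default version `"1"`, the hex encoding of the hash, and the return types of `register` and `remove` are all assumptions.

```typescript
import { createHash } from "node:crypto";

interface DocumentPolicy {
  name: string;
  version: string;
  contentHash: string;
  policyText: string;
}

class SketchPolicyRegistry {
  private readonly policies = new Map<string, DocumentPolicy>();

  // The SHA-256 hash is computed once, at registration time.
  register(name: string, text: string, version = "1"): DocumentPolicy {
    const contentHash = createHash("sha256").update(text, "utf8").digest("hex");
    const policy: DocumentPolicy = { name, version, contentHash, policyText: text };
    this.policies.set(name, policy);
    return policy;
  }

  get(name: string): DocumentPolicy | undefined {
    return this.policies.get(name);
  }

  list(): DocumentPolicy[] {
    return Array.from(this.policies.values());
  }

  remove(name: string): boolean {
    return this.policies.delete(name);
  }

  get size(): number {
    return this.policies.size;
  }
}
```

Storing the hash alongside the text lets a consumer record exactly which policy body a classification ran against, even if the same name is later re-registered with different text.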
Multimodal (Tier C)
Real image bytes are transmitted, not text hints. The pipeline's PipelineLlmRequest.images?: PipelineImage[] carries { data: <base64>, mimeType } parts. Adapters that don't support multimodal (e.g. a disabled stub) ignore the field.
Stage 4 LLM-assisted segmentation in full-multimodal mode attaches up to 12 page images (first image per unit, capped at MAX_SEGMENT_IMAGES). When images are attached, the prompt enumerates which page indices the model is seeing.
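The image selection described above (first image per unit, capped at MAX_SEGMENT_IMAGES) might look roughly like this. The `UnitWithImages` shape and the `selectSegmentImages` name are hypothetical. The `{ data, mimeType }` image shape and the cap of 12 come from the text.

```typescript
const MAX_SEGMENT_IMAGES = 12;

interface PipelineImage {
  data: string; // base64-encoded image bytes
  mimeType: string;
}

// Hypothetical unit shape: an index plus zero or more rendered images.
interface UnitWithImages {
  unitIndex: number;
  images: PipelineImage[];
}

interface SegmentImageSelection {
  images: PipelineImage[];
  pageIndices: number[]; // which units the model is being shown
}

function selectSegmentImages(units: UnitWithImages[]): SegmentImageSelection {
  const images: PipelineImage[] = [];
  const pageIndices: number[] = [];
  for (const unit of units) {
    if (images.length >= MAX_SEGMENT_IMAGES) break;
    // Only the first rendered image of each unit is considered.
    const first = unit.images[0];
    if (!first) continue; // units without images are skipped, not counted
    images.push(first);
    pageIndices.push(unit.unitIndex);
  }
  return { images, pageIndices };
}
```

Returning `pageIndices` alongside the images gives the prompt builder what it needs to tell the model which pages it is seeing.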
HTTP Integration Sketch
The pipeline doesn't include HTTP wiring — that's the consumer app's job. In this app, server/src/routes/documents.ts exposes:
POST /api/documents/analyze
GET /api/documents/policies

with raw-body uploads (Content-Type: <file mime type>, up to 50 MB) and query params for the per-request fields above. Errors return { error, details? } with an appropriate HTTP status.
Tests
In this app, tests live at server/test/document-pipeline/:
| File | Purpose |
|------|---------|
| ingest.test.ts | MIME detection, ingest validation |
| decompose.test.ts | Per-format decomposition |
| ocr.test.ts | OCR plugin contract, coordinate normalisation |
| policy.test.ts | Registry + prompt builders |
| assemble.test.ts | Package assembly + log validation |
| segment.test.ts | Structural + LLM-assisted segmentation, multimodal segmentation |
| classify.test.ts | Three-tier cascade, threshold semantics, minTier, Tier-C image attachment |
| pdf-extractor.test.ts | A2 regression — pdfjs render API |
| continue.test.ts | LLM and passthrough continuation handlers |
| pipeline.test.ts | End-to-end orchestration |
| route.test.ts | HTTP integration via supertest |
| zip.test.ts | ZIP support and per-entry failure modes |
| pipeline-config.test.ts | Env-default parsing |
Run via npm run test:document-pipeline. All tests use disabled or recording mocks — no live LLM, no OCR provider.
Extracting to Another App
- Copy or package server/src/document-pipeline/.
- Keep runtime deps: pdfjs-dist@4, xlsx, file-type@19, [email protected], @napi-rs/canvas.
- Provide a concrete PipelineLlmAdapter (wrap your app's LLM provider; see server/src/llm/pipeline-adapter.ts for a one-screen example).
- Optionally provide an OCRPlugin for image/PDF text recovery.
- Optionally provide a DocumentPolicyRegistry (see document-policies for a reusable manifest-based loader).
- Wire HTTP routes (raw-body upload, query schema validation) and call runDocumentPipeline(request, deps).
The pipeline has zero opinion about authentication, persistence, multi-tenancy, or job queues.
