@heripo/document-processor

v0.1.2

Document processor with LLM-based analysis for heripo engine

@heripo/document-processor

LLM-based document structure analysis and processing library

Note: Please check the root README first for project overview, installation instructions, and roadmap.

@heripo/document-processor is a library that transforms DoclingDocument into ProcessedDocument, optimized for LLM analysis.

Key Features

  • TOC Extraction: Automatic TOC recognition with rule-based + LLM fallback
  • Hierarchical Structure: Automatic generation of chapter/section/subsection hierarchy
  • Page Mapping: Actual page number mapping using Vision LLM
  • Caption Parsing: Automatic parsing of image and table captions
  • LLM Flexibility: Support for various LLMs including OpenAI, Anthropic, Google
  • Fallback Retry: Automatic retry with fallback model on failure

Installation

# Install with npm
npm install @heripo/document-processor @heripo/model

# Install with pnpm
pnpm add @heripo/document-processor @heripo/model

# Install with yarn
yarn add @heripo/document-processor @heripo/model

Additionally, LLM provider SDKs are required:

# Vercel AI SDK and provider packages
npm install ai @ai-sdk/openai @ai-sdk/anthropic @ai-sdk/google

Usage

Basic Usage

import { anthropic } from '@ai-sdk/anthropic';
import { DocumentProcessor } from '@heripo/document-processor';
import { Logger } from '@heripo/logger';

const logger = Logger(...);

// Basic usage - specify fallback model only
const processor = new DocumentProcessor({
  logger,
  fallbackModel: anthropic('claude-opus-4-5'),
  textCleanerBatchSize: 10,
  captionParserBatchSize: 5,
  captionValidatorBatchSize: 5,
});

// Process document
const processedDoc = await processor.process(
  doclingDocument, // PDF parser output
  'report-001', // Report ID
  outputPath, // Directory containing images/pages
);

// Use results
console.log('TOC:', processedDoc.chapters);
console.log('Images:', processedDoc.images);
console.log('Tables:', processedDoc.tables);

Advanced Usage - Per-Component Model Specification

import { anthropic } from '@ai-sdk/anthropic';
import { openai } from '@ai-sdk/openai';

const processor = new DocumentProcessor({
  logger,
  // Fallback model (for retry on failure)
  fallbackModel: anthropic('claude-opus-4-5'),

  // Per-component model specification
  pageRangeParserModel: openai('gpt-5.1'), // Vision required
  tocExtractorModel: openai('gpt-5.1'), // Structured output
  validatorModel: openai('gpt-5.2'), // Simple validation
  visionTocExtractorModel: openai('gpt-5.1'), // Vision required
  captionParserModel: openai('gpt-5-mini'), // Caption parsing

  // Batch size settings
  textCleanerBatchSize: 20, // Synchronous processing (can be large)
  captionParserBatchSize: 10, // LLM calls (medium)
  captionValidatorBatchSize: 10, // LLM calls (medium)

  // Retry settings
  maxRetries: 3,
  enableFallbackRetry: true, // Automatic retry with fallback model
});

const processedDoc = await processor.process(
  doclingDocument,
  'report-001',
  outputPath,
);

Processing Pipeline

DocumentProcessor processes documents through a 5-stage pipeline:

1. Text Cleaning (TextCleaner)

  • Unicode normalization (NFC)
  • Whitespace cleanup
  • Invalid text filtering (numbers-only text, empty text)
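The three cleaning steps above can be sketched in a few lines. This is a minimal illustration, not the library's actual TextCleaner API; `cleanText` is a hypothetical name.

```typescript
// Hypothetical sketch of the text-cleaning stage: NFC normalization,
// whitespace cleanup, and filtering of empty or numbers-only text.
function cleanText(raw: string): string | null {
  // Unicode normalization (NFC) so visually identical strings compare equal
  const normalized = raw.normalize('NFC');
  // Collapse whitespace runs and trim the ends
  const collapsed = normalized.replace(/\s+/g, ' ').trim();
  // Filter invalid text: empty strings and numbers-only fragments
  // (e.g. stray page numbers left over from PDF extraction)
  if (collapsed.length === 0 || /^\d+$/.test(collapsed)) return null;
  return collapsed;
}
```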

2. Page Range Mapping (PageRangeParser - Vision LLM)

  • Extract actual page numbers from page images
  • PDF page to document logical page mapping
  • Handle page number mismatches due to scanning errors
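The mapping idea can be sketched as follows: the Vision LLM reads a logical page number off each page image, and pages it cannot read are filled in from a mapped neighbor. The `PageReading` type and `buildPageMap` name are assumptions for illustration, not the package's real interfaces.

```typescript
// Illustrative sketch of PDF-page -> logical-page mapping with gap filling
// for pages whose number could not be read (e.g. scanning errors).
type PageReading = { pdfPage: number; logicalPage: number | null };

function buildPageMap(readings: PageReading[]): Map<number, number> {
  const map = new Map<number, number>();
  // First pass: record every page the Vision LLM could read directly
  for (const r of readings) {
    if (r.logicalPage !== null) map.set(r.pdfPage, r.logicalPage);
  }
  // Second pass: fill unreadable pages from the nearest mapped neighbor,
  // assuming consecutive PDF pages have consecutive logical numbers
  for (const r of readings) {
    if (map.has(r.pdfPage)) continue;
    const prev = map.get(r.pdfPage - 1);
    const next = map.get(r.pdfPage + 1);
    if (prev !== undefined) map.set(r.pdfPage, prev + 1);
    else if (next !== undefined) map.set(r.pdfPage, next - 1);
  }
  return map;
}
```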

3. TOC Extraction (5-Stage Pipeline)

Stage 1: TocFinder (Rule-Based)

  • Keyword search (Table of Contents, Contents, etc.)
  • Structure analysis (lists/tables with page number patterns)
  • Multi-page TOC detection with continuation markers
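A heuristic of the keyword-plus-structure kind can be sketched like this. The keyword list and regex are illustrative assumptions; the real TocFinder also analyzes lists and tables structurally.

```typescript
// Minimal sketch of rule-based TOC detection: a TOC keyword plus several
// "Title ....... 12"-style lines ending in a page number.
const TOC_KEYWORDS = /^(table of contents|contents)$/i;
const TOC_LINE = /^\S.*\s\d{1,4}$/; // title, whitespace, trailing page number

function looksLikeTocPage(lines: string[]): boolean {
  const hasKeyword = lines.some((l) => TOC_KEYWORDS.test(l.trim()));
  const numberedLines = lines.filter((l) => TOC_LINE.test(l.trim())).length;
  // Heuristic threshold: a keyword and at least three page-numbered lines
  return hasKeyword && numberedLines >= 3;
}
```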

Stage 2: MarkdownConverter

  • Group → Indented list format
  • Table → Markdown table format
  • Preserve hierarchy for LLM processing
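The group-to-indented-list conversion can be sketched as below: each nesting level becomes two spaces of indentation, so the LLM in later stages sees the hierarchy the source document encoded visually. The `Group` shape is an assumption for illustration.

```typescript
// Sketch of Group -> indented Markdown list conversion, preserving hierarchy.
type Group = { text: string; children?: Group[] };

function toIndentedList(groups: Group[], depth = 0): string {
  return groups
    .map((g) => {
      const line = `${'  '.repeat(depth)}- ${g.text}`;
      // Recurse into children one indentation level deeper
      const kids = g.children ? '\n' + toIndentedList(g.children, depth + 1) : '';
      return line + kids;
    })
    .join('\n');
}
```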

Stage 3: TocContentValidator (LLM Validation)

  • Verify if extracted content is actual TOC
  • Return confidence score and reason

Stage 4: VisionTocExtractor (Vision LLM Fallback)

  • Used when rule-based extraction or validation fails
  • Extract TOC directly from page images

Stage 5: TocExtractor (LLM Structuring)

  • Extract hierarchical TocEntry[] (title, level, pageNo)
  • Recursive children structure for nested sections
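The structuring step's output shape can be illustrated with a classic stack-based tree builder that turns flat (title, level, pageNo) rows into the recursive children structure described above. `TocEntry`'s exact fields here are inferred from this README, not taken from the package's type definitions.

```typescript
// Sketch: build a recursive TocEntry tree from flat (title, level, pageNo) rows.
type TocEntry = { title: string; level: number; pageNo: number; children: TocEntry[] };

function buildTocTree(flat: Omit<TocEntry, 'children'>[]): TocEntry[] {
  const roots: TocEntry[] = [];
  const stack: TocEntry[] = []; // current ancestor chain
  for (const row of flat) {
    const node: TocEntry = { ...row, children: [] };
    // Pop until the stack top is this node's parent (strictly smaller level)
    while (stack.length > 0 && stack[stack.length - 1].level >= node.level) stack.pop();
    if (stack.length === 0) roots.push(node);
    else stack[stack.length - 1].children.push(node);
    stack.push(node);
  }
  return roots;
}
```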

4. Resource Transformation

  • Images: Caption extraction and parsing with CaptionParser
  • Tables: Grid data transformation and caption parsing
  • Caption Validation: Parsing result validation with CaptionValidator

5. Chapter Conversion (ChapterConverter)

  • Build chapter tree based on TOC
  • Create Chapter hierarchy
  • Link text blocks to chapters by page range
  • Connect image/table IDs to appropriate chapters
  • Fallback: Create single "Document" chapter when TOC is empty
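The page-range linking step can be sketched as: a text block belongs to the last chapter whose start page is at or before the block's page. The `Block` and `Chapter` shapes here are illustrative, not the package's real types.

```typescript
// Sketch of linking text blocks to chapters by page range.
type Block = { id: string; pageNo: number };
type Chapter = { title: string; startPage: number; blockIds: string[] };

function linkBlocks(chapters: Chapter[], blocks: Block[]): void {
  // Assumes chapters are sorted by startPage ascending
  for (const block of blocks) {
    let target: Chapter | undefined;
    for (const ch of chapters) {
      if (ch.startPage <= block.pageNo) target = ch;
      else break; // later chapters start after this block's page
    }
    target?.blockIds.push(block.id);
  }
}
```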

API Documentation

DocumentProcessor Class

Constructor Options

interface DocumentProcessorOptions {
  logger: Logger; // Logger instance (required)

  // LLM model settings
  fallbackModel: LanguageModel; // Fallback model (required)
  pageRangeParserModel?: LanguageModel; // For page range parser
  tocExtractorModel?: LanguageModel; // For TOC extraction
  validatorModel?: LanguageModel; // For validation
  visionTocExtractorModel?: LanguageModel; // For Vision TOC extraction
  captionParserModel?: LanguageModel; // For caption parser

  // Batch processing settings
  textCleanerBatchSize?: number; // Text cleaning (default: 10)
  captionParserBatchSize?: number; // Caption parsing (default: 5)
  captionValidatorBatchSize?: number; // Caption validation (default: 5)

  // Retry settings
  maxRetries?: number; // LLM API retry count (default: 3)
  enableFallbackRetry?: boolean; // Enable fallback retry (default: true)
}

Methods

process(doclingDoc, reportId, outputPath): Promise<ProcessedDocument>

Transforms DoclingDocument into ProcessedDocument.

Parameters:

  • doclingDoc (DoclingDocument): PDF parser output
  • reportId (string): Report ID
  • outputPath (string): Output directory containing images/pages

Returns:

  • Promise<ProcessedDocument>: Processed document

Fallback Retry Mechanism

When enableFallbackRetry: true is set, LLM components automatically retry with fallbackModel on failure:

const processor = new DocumentProcessor({
  logger,
  fallbackModel: anthropic('claude-opus-4-5'), // For retry
  pageRangeParserModel: openai('gpt-5.2'), // First attempt
  enableFallbackRetry: true, // Use fallback on failure
});

// If pageRangeParserModel fails, automatically retries with fallbackModel
const result = await processor.process(doc, 'id', 'path');
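The behavior generalizes to a small wrapper, sketched below: try the primary model up to maxRetries times, then make one last attempt with the fallback. `withFallbackRetry` and the `Model` type are stand-ins for illustration, not the package's internals.

```typescript
// Generic sketch of retry-with-fallback: exhaust the primary model's
// attempts, then try the fallback once before giving up.
type Model = { name: string };

async function withFallbackRetry<T>(
  primary: Model,
  fallback: Model,
  maxRetries: number,
  call: (model: Model) => Promise<T>,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxRetries; attempt++) {
    try {
      return await call(primary);
    } catch (err) {
      lastError = err; // remember why the primary failed
    }
  }
  try {
    return await call(fallback); // last resort: the fallback model
  } catch {
    throw lastError; // surface the original primary-model failure
  }
}
```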

Batch Size Parameters

  • textCleanerBatchSize: Batch size for synchronous text normalization and filtering. It can be large, since cleaning runs locally without LLM calls
  • captionParserBatchSize: Batch size for LLM-based caption parsing. Keep it small to bound API request concurrency and cost
  • captionValidatorBatchSize: Batch size for LLM-based caption validation. Keep it small to limit concurrent validation requests
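The batching these options control can be sketched as: split work into fixed-size chunks, run one chunk concurrently, and wait for it before starting the next. `chunk` and `processInBatches` are illustrative names, not the package's API.

```typescript
// Sketch of batched processing: concurrency is bounded by the batch size.
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = [];
  for (const batch of chunk(items, batchSize)) {
    // Items within a batch run concurrently; batches run sequentially
    results.push(...(await Promise.all(batch.map(fn))));
  }
  return results;
}
```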

Error Handling

TocExtractError

Errors thrown when TOC extraction fails:

  • TocNotFoundError: TOC not found in document
  • TocParseError: LLM response parsing failed
  • TocValidationError: TOC validation failed

try {
  const result = await processor.process(doc, 'id', 'path');
} catch (error) {
  if (error instanceof TocNotFoundError) {
    console.log('TOC not found. Processing as single chapter.');
  } else if (error instanceof TocParseError) {
    console.error('TOC parsing failed:', error.message);
  }
}

PageRangeParseError

Page range parsing failure:

import { PageRangeParseError } from '@heripo/document-processor';

CaptionParseError & CaptionValidationError

Caption parsing/validation failure:

import {
  CaptionParseError,
  CaptionValidationError,
} from '@heripo/document-processor';

Token Usage Tracking

Major LLM components return token usage:

// PageRangeParser
const { pageRangeMap, tokenUsage } = await pageRangeParser.parse(doc);
console.log('Token usage:', tokenUsage);

// TocExtractor
const { entries, tokenUsage } = await tocExtractor.extract(markdown);
console.log('Token usage:', tokenUsage);

License

This package is distributed under the Apache License 2.0.

Contributing

Contributions are always welcome! Please see the Contributing Guide.

Issues and Support

Project-Wide Information

For project-wide information not covered in this package, see the root README:

  • Citation and Attribution: Academic citation (BibTeX) and attribution methods
  • Contributing Guidelines: Development guidelines, commit rules, PR procedures
  • Community: Issue tracker, discussions, security policy
  • Roadmap: Project development plans

heripo lab | GitHub | heripo engine