@heripo/model

v0.1.17

Document models and type definitions for heripo engine

Note: Please check the root README first for the project overview, installation instructions, and roadmap.

@heripo/model provides data models and TypeScript type definitions used in heripo engine.

Overview

heripo engine's data processing pipeline:

DoclingDocument (Docling SDK raw output)
    ↓
ProcessedDocument (LLM-optimized intermediate model)
    ↓
(Various models to be added per roadmap)

@heripo/model defines the data models currently used in the PDF parsing and document structure extraction stages. Domain-specific models for archaeological data analysis, standardization, and semantic modeling are planned for later releases.

Installation

# Install with npm
npm install @heripo/model

# Install with pnpm
pnpm add @heripo/model

# Install with yarn
yarn add @heripo/model

Data Models

DoclingDocument

Raw output format from Docling SDK.

import type { DoclingDocument } from '@heripo/model';

Key Fields:

  • type: Document type (e.g., "pdf")
  • item_index: Item index
  • json_content: Document content (JSON object)

ProcessedDocument

Intermediate data model optimized for LLM analysis.

import type { ProcessedDocument } from '@heripo/model';

interface ProcessedDocument {
  reportId: string; // Report ID
  pageRangeMap: Record<number, PageRange>; // PDF page → document page mapping
  chapters: Chapter[]; // Hierarchical chapter structure
  images: ProcessedImage[]; // Extracted image metadata
  tables: ProcessedTable[]; // Extracted table data
  footnotes: ProcessedFootnote[]; // Extracted footnotes
}

Chapter

Hierarchical section structure of the document.

import type { Chapter } from '@heripo/model';

interface Chapter {
  id: string; // Chapter ID
  title: string; // Chapter title
  originTitle: string; // Original title from source
  level: number; // Hierarchy level (1, 2, 3, ...)
  pageNo: number; // Start page number
  textBlocks: TextBlock[]; // Text blocks
  imageIds: string[]; // Image ID references
  tableIds: string[]; // Table ID references
  footnoteIds: string[]; // Footnote ID references
  children?: Chapter[]; // Sub-chapters (optional)
}
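
Because chapters nest through children, reading a whole section usually means walking the tree. The following is a minimal sketch, not part of @heripo/model: collectChapterText is a hypothetical helper, and the interfaces are local copies of the shapes documented above so the snippet runs without the package installed.

```typescript
// Local copies of the documented shapes (in real code, import them from '@heripo/model').
interface TextBlock {
  text: string;
  pdfPageNo: number;
}

interface Chapter {
  id: string;
  title: string;
  originTitle: string;
  level: number;
  pageNo: number;
  textBlocks: TextBlock[];
  imageIds: string[];
  tableIds: string[];
  footnoteIds: string[];
  children?: Chapter[];
}

// Hypothetical helper: returns the text of a chapter followed by the text
// of all of its descendants, in depth-first order.
function collectChapterText(chapter: Chapter): string[] {
  const texts = chapter.textBlocks.map((block) => block.text);
  for (const child of chapter.children ?? []) {
    texts.push(...collectChapterText(child));
  }
  return texts;
}
```

Joining the returned array with a newline gives one string per top-level chapter, which may be convenient when passing a section to an LLM.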

TextBlock

Atomic text unit.

import type { TextBlock } from '@heripo/model';

interface TextBlock {
  text: string; // Text content
  pdfPageNo: number; // PDF page number
}

ProcessedImage

Image metadata and reference information.

import type { ProcessedImage } from '@heripo/model';

interface ProcessedImage {
  id: string; // Image ID
  caption?: Caption; // Caption (optional)
  pdfPageNo: number; // PDF page number
  path: string; // Image file path
}

ProcessedTable

Table structure and data.

import type { ProcessedTable } from '@heripo/model';

interface ProcessedTable {
  id: string; // Table ID
  caption?: Caption; // Caption (optional)
  pdfPageNo: number; // PDF page number
  grid: ProcessedTableCell[][]; // 2D grid data
  numRows: number; // Row count
  numCols: number; // Column count
}

ProcessedTableCell

Table cell metadata.

import type { ProcessedTableCell } from '@heripo/model';

interface ProcessedTableCell {
  text: string; // Cell text
  rowSpan: number; // Row span
  colSpan: number; // Column span
  isHeader: boolean; // Is header cell
}
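
As an illustration of how grid and the cell fields fit together, the sketch below renders a table as tab-separated text and lists its header cells. Both helpers (tableToTsv, headerTexts) are hypothetical and not part of the package. The sketch assumes each row of grid holds one entry per column; how spanned cells are actually represented in grid is not documented here.

```typescript
// Local copies of the documented shapes (in real code, import them from '@heripo/model').
interface Caption {
  num?: string;
  fullText: string;
}

interface ProcessedTableCell {
  text: string;
  rowSpan: number;
  colSpan: number;
  isHeader: boolean;
}

interface ProcessedTable {
  id: string;
  caption?: Caption;
  pdfPageNo: number;
  grid: ProcessedTableCell[][];
  numRows: number;
  numCols: number;
}

// Hypothetical helper: one line per grid row, cells separated by tabs.
// Whitespace inside a cell is collapsed so it cannot break the layout.
function tableToTsv(table: ProcessedTable): string {
  return table.grid
    .map((row) => row.map((cell) => cell.text.replace(/\s+/g, ' ').trim()).join('\t'))
    .join('\n');
}

// Hypothetical helper: texts of all cells flagged as headers, row by row.
function headerTexts(table: ProcessedTable): string[] {
  const headers: string[] = [];
  for (const row of table.grid) {
    for (const cell of row) {
      if (cell.isHeader) headers.push(cell.text);
    }
  }
  return headers;
}
```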

Caption

Image and table captions.

import type { Caption } from '@heripo/model';

interface Caption {
  num?: string; // Caption number (e.g., "1" in "Figure 1")
  fullText: string; // Full caption text
}

PageRange

PDF page to document page mapping.

import type { PageRange } from '@heripo/model';

interface PageRange {
  startPageNo: number; // Start page number
  endPageNo: number; // End page number
}
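
A short sketch of how pageRangeMap might be consulted. It assumes the map is keyed by PDF page number and that each PageRange holds the printed document page numbers appearing on that PDF page (for example, a two-page spread scanned as one PDF page); that reading is an interpretation of the field comments, not something confirmed above. formatDocumentPages is a hypothetical helper.

```typescript
// Local copy of the documented shape (in real code, import it from '@heripo/model').
interface PageRange {
  startPageNo: number;
  endPageNo: number;
}

// Hypothetical helper: formats the document page(s) for a given PDF page,
// or returns undefined when the PDF page has no entry in the map.
function formatDocumentPages(
  pageRangeMap: Record<number, PageRange>,
  pdfPageNo: number,
): string | undefined {
  const range = pageRangeMap[pdfPageNo];
  if (range === undefined) {
    return undefined;
  }
  return range.startPageNo === range.endPageNo
    ? `p. ${range.startPageNo}`
    : `pp. ${range.startPageNo}-${range.endPageNo}`;
}
```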

ProcessedFootnote

Footnote extracted from the document.

import type { ProcessedFootnote } from '@heripo/model';

interface ProcessedFootnote {
  id: string; // Footnote ID
  text: string; // Footnote text
  pdfPageNo: number; // PDF page number
}
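
Chapters refer to footnotes by ID, so a lookup step is needed to get the footnote text. The sketch below assumes that the entries in Chapter.footnoteIds match ProcessedFootnote.id values; resolveFootnotes is a hypothetical helper, not part of the package.

```typescript
// Local copy of the documented shape (in real code, import it from '@heripo/model').
interface ProcessedFootnote {
  id: string;
  text: string;
  pdfPageNo: number;
}

// Hypothetical helper: resolves footnote IDs to footnote objects, keeping the
// order of the IDs and silently skipping IDs that have no matching footnote.
function resolveFootnotes(
  footnoteIds: string[],
  footnotes: ProcessedFootnote[],
): ProcessedFootnote[] {
  const byId = new Map<string, ProcessedFootnote>();
  for (const footnote of footnotes) {
    byId.set(footnote.id, footnote);
  }

  const resolved: ProcessedFootnote[] = [];
  for (const id of footnoteIds) {
    const footnote = byId.get(id);
    if (footnote !== undefined) {
      resolved.push(footnote);
    }
  }
  return resolved;
}
```

The same pattern applies to imageIds and tableIds with doc.images and doc.tables.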

DocumentProcessResult

Result of document processing, including the processed document and token usage report.

import type { DocumentProcessResult } from '@heripo/model';

interface DocumentProcessResult {
  document: ProcessedDocument; // Processed document
  usage: TokenUsageReport; // Token usage report
}

OcrStrategy

OCR strategy selection result.

import type { OcrStrategy } from '@heripo/model';

interface OcrStrategy {
  method: 'ocrmac' | 'vlm'; // OCR method
  ocrLanguages?: string[]; // OCR languages
  detectedLanguages?: Bcp47LanguageTag[]; // Detected BCP-47 language tags
  reason: string; // Reason for strategy selection
  sampledPages: number; // Number of sampled pages
  totalPages: number; // Total pages in document
  koreanHanjaMixPages?: number[]; // Pages with Korean-Hanja mixed script
}
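
For logging or debugging, an OcrStrategy can be condensed into one line. This is a sketch under stated assumptions: summarizeOcrStrategy is a hypothetical helper, and Bcp47LanguageTag is replaced by a plain string alias so the snippet is self-contained.

```typescript
// Simplified stand-in for the package's Bcp47LanguageTag type.
type Bcp47LanguageTag = string;

// Local copy of the documented shape (in real code, import it from '@heripo/model').
interface OcrStrategy {
  method: 'ocrmac' | 'vlm';
  ocrLanguages?: string[];
  detectedLanguages?: Bcp47LanguageTag[];
  reason: string;
  sampledPages: number;
  totalPages: number;
  koreanHanjaMixPages?: number[];
}

// Hypothetical helper: one-line summary of the selected strategy.
// Optional fields fall back to "none detected" and 0 respectively.
function summarizeOcrStrategy(strategy: OcrStrategy): string {
  const detected = strategy.detectedLanguages ?? [];
  const languages = detected.length > 0 ? detected.join(', ') : 'none detected';
  const mixedPages = strategy.koreanHanjaMixPages?.length ?? 0;
  return (
    `${strategy.method}: sampled ${strategy.sampledPages}/${strategy.totalPages} pages, ` +
    `languages: ${languages}, Korean-Hanja mixed pages: ${mixedPages} (${strategy.reason})`
  );
}
```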

Token Usage Types

Types for tracking LLM token usage across processing phases.

import type {
  ComponentUsageReport,
  ModelUsageDetail,
  PhaseUsageReport,
  TokenUsageReport,
  TokenUsageSummary,
} from '@heripo/model';

interface TokenUsageReport {
  components: ComponentUsageReport[]; // Usage per component
  total: TokenUsageSummary; // Total usage summary
}

interface ComponentUsageReport {
  component: string; // Component name
  phases: PhaseUsageReport[]; // Usage per phase
  total: TokenUsageSummary; // Component total
}

interface PhaseUsageReport {
  phase: string; // Phase name
  primary?: ModelUsageDetail; // Primary model usage
  fallback?: ModelUsageDetail; // Fallback model usage
  total: TokenUsageSummary; // Phase total
}

interface ModelUsageDetail {
  modelName: string; // Model name
  inputTokens: number; // Input token count
  outputTokens: number; // Output token count
  totalTokens: number; // Total token count
}

interface TokenUsageSummary {
  inputTokens: number; // Input token count
  outputTokens: number; // Output token count
  totalTokens: number; // Total token count
}
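
The report types nest as report → component → phase → model. The sketch below shows one way the per-phase numbers could be recomputed from the model details, for example to sanity-check a report. It assumes a phase total is the sum of its primary and fallback usage, which is not stated above; sumUsage and phaseTotalFromModels are hypothetical helpers, and the model names in the usage example are placeholders.

```typescript
// Local copies of the documented shapes (in real code, import them from '@heripo/model').
interface TokenUsageSummary {
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

interface ModelUsageDetail {
  modelName: string;
  inputTokens: number;
  outputTokens: number;
  totalTokens: number;
}

interface PhaseUsageReport {
  phase: string;
  primary?: ModelUsageDetail;
  fallback?: ModelUsageDetail;
  total: TokenUsageSummary;
}

// Hypothetical helper: field-by-field sum of any number of usage records.
function sumUsage(items: TokenUsageSummary[]): TokenUsageSummary {
  const sum: TokenUsageSummary = { inputTokens: 0, outputTokens: 0, totalTokens: 0 };
  for (const item of items) {
    sum.inputTokens += item.inputTokens;
    sum.outputTokens += item.outputTokens;
    sum.totalTokens += item.totalTokens;
  }
  return sum;
}

// Hypothetical helper: recomputes a phase total from whichever models were used.
function phaseTotalFromModels(phase: PhaseUsageReport): TokenUsageSummary {
  const used: ModelUsageDetail[] = [];
  if (phase.primary !== undefined) used.push(phase.primary);
  if (phase.fallback !== undefined) used.push(phase.fallback);
  return sumUsage(used);
}

// Usage with placeholder model names
const examplePhase: PhaseUsageReport = {
  phase: 'example-phase',
  primary: { modelName: 'model-a', inputTokens: 100, outputTokens: 20, totalTokens: 120 },
  fallback: { modelName: 'model-b', inputTokens: 50, outputTokens: 10, totalTokens: 60 },
  total: { inputTokens: 150, outputTokens: 30, totalTokens: 180 },
};
const recomputed = phaseTotalFromModels(examplePhase);
console.log(recomputed.totalTokens === examplePhase.total.totalTokens);
```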

BCP-47 Language Tag Utilities

Utilities for working with BCP-47 language tags.

import {
  type Bcp47LanguageTag,
  BCP47_LANGUAGE_TAGS,
  BCP47_LANGUAGE_TAG_SET,
  isValidBcp47Tag,
  normalizeToBcp47,
} from '@heripo/model';

// Bcp47LanguageTag - Union type of supported BCP-47 language tags
type Bcp47LanguageTag = 'ko' | 'en' | 'ja' | 'zh' | /* ... */ string;

// BCP47_LANGUAGE_TAGS - Const array of 30 supported tags
const BCP47_LANGUAGE_TAGS: readonly Bcp47LanguageTag[];

// BCP47_LANGUAGE_TAG_SET - ReadonlySet for O(1) lookup
const BCP47_LANGUAGE_TAG_SET: ReadonlySet<string>;

// isValidBcp47Tag - Check if a string is a valid BCP-47 tag
function isValidBcp47Tag(tag: string): tag is Bcp47LanguageTag;

// normalizeToBcp47 - Normalize a language string to BCP-47 format
function normalizeToBcp47(tag: string): Bcp47LanguageTag | undefined;
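
The combination of a const array, a ReadonlySet, and a type guard is a common TypeScript pattern. The sketch below reproduces that pattern with a four-tag sample so the mechanics are visible. It is not the package's implementation: SAMPLE_TAGS, SampleTag, isSampleTag, and keepSampleTags are hypothetical names, and the real tag list and normalization rules are not reproduced here.

```typescript
// A small sample list; the package's own list is documented as having 30 tags.
const SAMPLE_TAGS = ['ko', 'en', 'ja', 'zh'] as const;

// Union type derived from the array, so the list and the type cannot drift apart.
type SampleTag = (typeof SAMPLE_TAGS)[number];

// Set typed over string so arbitrary input can be checked in O(1).
const SAMPLE_TAG_SET: ReadonlySet<string> = new Set<string>(SAMPLE_TAGS);

// Type guard: narrows a plain string to SampleTag when it is in the set.
function isSampleTag(tag: string): tag is SampleTag {
  return SAMPLE_TAG_SET.has(tag);
}

// The guard also narrows array element types when passed to filter.
function keepSampleTags(candidates: string[]): SampleTag[] {
  return candidates.filter(isSampleTag);
}
```

With the real package, the equivalent check would go through isValidBcp47Tag and BCP47_LANGUAGE_TAG_SET as declared above.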

Usage

Reading ProcessedDocument

import type { ProcessedDocument } from '@heripo/model';

function analyzeDocument(doc: ProcessedDocument) {
  console.log('Report ID:', doc.reportId);

  // Iterate chapters
  doc.chapters.forEach((chapter) => {
    console.log(`Chapter: ${chapter.title} (level ${chapter.level})`);
    console.log(`  Text blocks: ${chapter.textBlocks.length}`);
    console.log(`  Images: ${chapter.imageIds.length}`);
    console.log(`  Tables: ${chapter.tableIds.length}`);
    console.log(`  Sub-chapters: ${chapter.children?.length ?? 0}`);
  });

  // Check images
  doc.images.forEach((image) => {
    console.log(`Image ${image.id}:`);
    if (image.caption) {
      console.log(`  Caption: ${image.caption.fullText}`);
    }
    console.log(`  Path: ${image.path}`);
  });

  // Check tables
  doc.tables.forEach((table) => {
    console.log(`Table ${table.id}:`);
    console.log(`  Size: ${table.numRows} x ${table.numCols}`);
    if (table.caption) {
      console.log(`  Caption: ${table.caption.fullText}`);
    }
  });
}

Recursive Chapter Traversal

import type { Chapter } from '@heripo/model';

function traverseChapters(chapter: Chapter, depth: number = 0) {
  const indent = '  '.repeat(depth);
  console.log(`${indent}- ${chapter.title}`);

  // Recursively traverse sub-chapters
  chapter.children?.forEach((child) => {
    traverseChapters(child, depth + 1);
  });
}

// Usage (assumes `doc` is a ProcessedDocument already in scope)
doc.chapters.forEach((chapter) => traverseChapters(chapter));

Type Guards

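import type { Caption, ProcessedImage, ProcessedTable } from '@heripo/model';

// Narrows the resource so that `caption` is known to be present.
// (Returning the unchanged union type would narrow nothing.)
function hasCaption(
  resource: ProcessedImage | ProcessedTable,
): resource is (ProcessedImage | ProcessedTable) & { caption: Caption } {
  return resource.caption !== undefined;
}

// Usage (assumes `doc` is a ProcessedDocument already in scope)
const resourcesWithCaptions = [...doc.images, ...doc.tables].filter(hasCaption);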

License

This package is distributed under the Apache License 2.0.

Contributing

Contributions are always welcome! Please see the Contributing Guide.

Project-Wide Information

For project-wide information not covered in this package, see the root README:

  • Citation and Attribution: Academic citation (BibTeX) and attribution methods
  • Contributing Guidelines: Development guidelines, commit rules, PR procedures
  • Community: Issue tracker, discussions, security policy
  • Roadmap: Project development plans

heripo lab | GitHub | heripo engine