@verydia/loaders

v0.1.0

Published

3 months ago

Core document loader framework for Verydia ingestion pipelines

verydia

verydia loaders document ingestion rag

Overview

@verydia/loaders provides the foundational types and abstractions for loading documents into Verydia from various sources. This package is:

Environment-agnostic - Works in Node.js, browsers, and edge runtimes
Zero dependencies - No external dependencies for maximum portability
Strongly typed - Full TypeScript support with comprehensive JSDoc
Extensible - Easy to create custom loaders for any data source

Installation

pnpm add @verydia/loaders

Quick Start

Unified `loadDocuments()` Facade

The easiest way to load documents from any source:

import { loadDocuments } from "@verydia/loaders";

// Load from filesystem
const fsDocs = await loadDocuments({
  kind: "fs",
  options: { path: "./documents", recursive: true },
});

// Load a ZIP archive
const zipDocs = await loadDocuments({
  kind: "zip",
  path: "./evidence-bundle.zip",
});

// Load from Notion
const notionDocs = await loadDocuments({
  kind: "notion",
  options: { auth: process.env.NOTION_TOKEN!, databaseId: "abc123" },
});

// Load from Slack
const slackDocs = await loadDocuments({
  kind: "slack",
  options: { token: process.env.SLACK_TOKEN!, channelIds: ["C123"] },
});

// Load from Google Drive
// `authClient` is assumed to be an already-authenticated Google auth client created elsewhere
const driveDocs = await loadDocuments({
  kind: "gdrive",
  options: { authClient, query: "mimeType='application/vnd.google-apps.document'" },
});

Simple PDF Loading (FsLoader)

For general-purpose PDF loading with minimal configuration:

import { FsLoader } from "@verydia/loaders";

// Load a single PDF
const loader = new FsLoader({ path: "./document.pdf" });
const docs = await loader.load();
// Returns 1 document with full text

// Load all PDFs in a directory
const dirLoader = new FsLoader({
  path: "./pdfs",
  recursive: true,
  includeExtensions: [".pdf"],
});
const pdfDocs = await dirLoader.load();

Advanced PDF Loading (Legal/High-Fidelity Use Cases)

For legal workflows requiring per-page documents or layout preservation:

import { PdfAdvancedLoader } from "@verydia/loaders";

// Per-page mode (one document per page)
const loader = new PdfAdvancedLoader({
  path: "./cases/brief.pdf",
  mode: "perPage",
});
const docs = await loader.load();
// docs[0].metadata.page === 1
// docs[1].metadata.page === 2
// ...

// Docling layout-preserving mode
const doclingLoader = new PdfAdvancedLoader({
  path: "./cases/contract.pdf",
  mode: "docling",
  docling: {
    endpoint: process.env.DOCLING_ENDPOINT!,
    apiKeyHeaderName: "x-api-key",
    apiKey: process.env.DOCLING_API_KEY,
  },
});
const layoutDocs = await doclingLoader.load();
// layoutDocs[i].metadata.layoutMarkup contains HTML/Markdown
// Preserves headings, tables, footnotes, etc.
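
Per-page documents carry the page number in metadata.page, which is what makes page-specific citations possible. The following is a minimal sketch of that idea, not part of the package: formatPageCitation is a hypothetical helper, and the VerydiaDocument type is redeclared locally (trimmed to the fields used) so the snippet stands on its own.

```typescript
// Local, trimmed copy of the document shape so this sketch compiles on its own.
interface VerydiaDocument {
  id: string;
  text: string;
  metadata: {
    source: string;
    uri?: string;
    title?: string;
    page?: number;
    [key: string]: unknown;
  };
}

// Hypothetical helper (not part of @verydia/loaders): builds a short citation label.
// Prefers the title, then the URI, then the document ID.
function formatPageCitation(doc: VerydiaDocument): string {
  const label = doc.metadata.title ?? doc.metadata.uri ?? doc.id;
  return doc.metadata.page === undefined
    ? label
    : `${label}, p. ${doc.metadata.page}`;
}

const pageTwo: VerydiaDocument = {
  id: "brief-2",
  text: "Text of the second page",
  metadata: { source: "fs", uri: "./cases/brief.pdf", page: 2 },
};

const citation = formatPageCitation(pageTwo); // "./cases/brief.pdf, p. 2"
```

Documents loaded without per-page mode have no page field, so the helper falls back to the bare label.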

Advanced DOCX Loading (Contracts/Legal Documents)

For contract and legal document workflows requiring section-based analysis:

import { DocxAdvancedLoader } from "@verydia/loaders";

// Single mode (same as basic loader but explicit)
const loader = new DocxAdvancedLoader({
  path: "./contracts/service-agreement.docx",
});
const docs = await loader.load();

// Per-section mode by headings (H1/H2)
const sectionLoader = new DocxAdvancedLoader({
  path: "./briefs/motion-to-dismiss.docx",
  mode: "perSection",
  headingLevels: [1, 2],
});
const sections = await sectionLoader.load();
// sections[i].metadata.sectionHeading, sectionHtml, etc.

// HTML layout mode
const htmlLoader = new DocxAdvancedLoader({
  path: "./policies/employee-handbook.docx",
  mode: "htmlLayout",
});
const [doc] = await htmlLoader.load();
// doc.metadata.layoutHtml contains HTML for downstream rendering

When to use which loader:

| Use Case | Loader | Mode |
|----------|--------|------|
| General PDF/DOCX ingestion | FsLoader | N/A |
| Mixed file types (TXT, JSON, CSV, HTML, PDF, DOCX) | FsLoader | N/A |
| Legal citations (page-specific) | PdfAdvancedLoader | perPage |
| PDF layout preservation (tables, headings) | PdfAdvancedLoader | docling |
| Contract section analysis | DocxAdvancedLoader | perSection |
| DOCX with HTML formatting | DocxAdvancedLoader | htmlLayout |
| Simple single-document PDF/DOCX | PdfAdvancedLoader / DocxAdvancedLoader | single |

Core Concepts

VerydiaDocument

The canonical document model used throughout Verydia for:

Data ingestion (loaders)
Document splitting and chunking
RAG (retrieval-augmented generation)
Telemetry and observability

interface VerydiaDocument {
  id: string;                    // Unique identifier
  text: string;                  // Document content
  metadata: {
    source: string;              // Source system (e.g., 'fs', 'notion', 'slack')
    uri?: string;                // Path, URL, or external ID
    title?: string;              // Human-readable title
    mimeType?: string;           // MIME type
    page?: number;               // Page number (1-indexed)
    createdAt?: string;          // ISO 8601 timestamp
    updatedAt?: string;          // ISO 8601 timestamp
    [key: string]: unknown;      // Domain-specific fields
  };
}

VerydiaLoader

Interface that all loaders must implement:

interface VerydiaLoader {
  load(): Promise<VerydiaDocument[]>;
}
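
Because the interface has a single method, any object with a matching load() satisfies it. As a rough illustration, here is a sketch of a loader that wraps strings already in memory. InMemoryLoader is a hypothetical name, not something the package exports, and the types are redeclared locally so the snippet is self-contained.

```typescript
// Local copies of the types shown above, so the sketch compiles on its own.
interface VerydiaDocument {
  id: string;
  text: string;
  metadata: { source: string; uri?: string; [key: string]: unknown };
}

interface VerydiaLoader {
  load(): Promise<VerydiaDocument[]>;
}

// Hypothetical loader (not part of @verydia/loaders): wraps in-memory strings.
class InMemoryLoader implements VerydiaLoader {
  private readonly texts: string[];

  constructor(texts: string[]) {
    this.texts = texts;
  }

  async load(): Promise<VerydiaDocument[]> {
    // One document per input string, with a positional ID.
    return this.texts.map((text, idx) => ({
      id: `memory-${idx}`,
      text,
      metadata: { source: "memory" },
    }));
  }
}
```

A loader like this can be handy in unit tests, where hitting a real filesystem or API is undesirable.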

BaseLoader

Abstract base class providing:

Required load() method for subclasses to implement
Built-in loadAndSplit() method for loading + splitting in one call
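
To make the relationship between the two methods concrete, the sketch below shows one way such a base class could be written. It is an assumption about the shape, not the package's actual source: SketchBaseLoader and TwoDocLoader are illustrative names, and the pass-through behavior when no splitter is given is inferred from the optional splitter parameter in the API reference.

```typescript
// Local copies of the types, so the sketch compiles on its own.
interface VerydiaDocument {
  id: string;
  text: string;
  metadata: { source: string; [key: string]: unknown };
}

type DocumentSplitter = (docs: VerydiaDocument[]) => Promise<VerydiaDocument[]>;

// Illustrative stand-in for BaseLoader; the real class may differ.
abstract class SketchBaseLoader {
  abstract load(): Promise<VerydiaDocument[]>;

  async loadAndSplit(splitter?: DocumentSplitter): Promise<VerydiaDocument[]> {
    const docs = await this.load();
    // Assumption: with no splitter, the loaded documents are returned unchanged.
    return splitter ? splitter(docs) : docs;
  }
}

// A trivial subclass only has to implement load().
class TwoDocLoader extends SketchBaseLoader {
  async load(): Promise<VerydiaDocument[]> {
    return [
      { id: "a", text: "first", metadata: { source: "sketch" } },
      { id: "b", text: "second", metadata: { source: "sketch" } },
    ];
  }
}
```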

Usage

Creating a Custom Loader

import { BaseLoader, type VerydiaDocument } from "@verydia/loaders";

class MyCustomLoader extends BaseLoader {
  constructor(private apiKey: string) {
    super();
  }

  async load(): Promise<VerydiaDocument[]> {
    // Fetch documents from your source
    const response = await fetch("https://api.example.com/documents", {
      headers: { Authorization: `Bearer ${this.apiKey}` },
    });

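    if (!response.ok) {
      throw new Error(`Failed to fetch documents: HTTP ${response.status}`);
    }

    const data = await response.json();

    return data.documents.map((doc: any, idx: number) => ({
      // Fall back to a positional ID when the API omits one
      id: doc.id || `doc-${idx}`,
      text: doc.content,
      metadata: {
        source: "my-api",
        // Only build a URI when the API supplied an ID
        uri: doc.id ? `https://api.example.com/documents/${doc.id}` : undefined,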
        title: doc.title,
        createdAt: doc.created_at,
        updatedAt: doc.updated_at,
        // Custom metadata
        author: doc.author,
        tags: doc.tags,
      },
    }));
  }
}

// Usage
const loader = new MyCustomLoader("your-api-key");
const docs = await loader.load();
console.log(`Loaded ${docs.length} documents`);

Loading and Splitting

The loadAndSplit() method allows you to load and chunk documents in one call:

import { BaseLoader, type VerydiaDocument, type DocumentSplitter } from "@verydia/loaders";

// Custom splitter that chunks by character count
const chunkByChars: DocumentSplitter = async (docs) => {
  const chunkSize = 500;
  const chunks: VerydiaDocument[] = [];

  for (const doc of docs) {
    for (let i = 0; i < doc.text.length; i += chunkSize) {
      chunks.push({
        id: `${doc.id}-chunk-${Math.floor(i / chunkSize)}`,
        text: doc.text.slice(i, i + chunkSize),
        metadata: {
          ...doc.metadata,
          // Keep the original page (if any) and record the chunk position separately
          chunkIndex: Math.floor(i / chunkSize),
          isChunk: true,
          originalDocId: doc.id,
        },
      });
    }
  }

  return chunks;
};

// Load and split
const loader = new MyCustomLoader("your-api-key");
const chunks = await loader.loadAndSplit(chunkByChars);
console.log(`Created ${chunks.length} chunks`);

Legal/Enterprise Metadata

The VerydiaDocument metadata supports domain-specific fields for legal and enterprise use cases:

const legalDoc: VerydiaDocument = {
  id: "contract-123",
  text: "This Agreement is entered into...",
  metadata: {
    source: "sharepoint",
    uri: "https://company.sharepoint.com/contracts/123",
    title: "Service Agreement - Acme Corp",
    mimeType: "application/pdf",
    createdAt: "2024-01-15T10:30:00Z",
    updatedAt: "2024-01-20T14:45:00Z",
    
    // Legal-specific metadata
    jurisdiction: "CA",
    documentType: "contract",
    author: "Legal Department",
    department: "Legal",
    confidentiality: "internal",
    tags: ["contracts", "services", "2024"],
    expirationDate: "2025-01-15",
    parties: ["Acme Corp", "Our Company"],
  },
};
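
Domain-specific fields are typed as unknown through the index signature, so code that reads them needs to narrow the type first. The sketch below shows one way to filter documents by jurisdiction and expiration date. The helper names (hasJurisdiction, isExpired, activeIn) are hypothetical, and treating a missing or unparseable expirationDate as "not expired" is an assumption made for this example.

```typescript
// Local copy of the document shape, so the sketch compiles on its own.
interface VerydiaDocument {
  id: string;
  text: string;
  metadata: { source: string; [key: string]: unknown };
}

// Hypothetical helpers (not part of @verydia/loaders).
function hasJurisdiction(doc: VerydiaDocument, jurisdiction: string): boolean {
  return doc.metadata.jurisdiction === jurisdiction;
}

function isExpired(doc: VerydiaDocument, now: Date): boolean {
  const raw = doc.metadata.expirationDate;
  // Assumption: a missing or non-string date means the document never expires.
  if (typeof raw !== "string") return false;
  const expiresAt = Date.parse(raw);
  return !Number.isNaN(expiresAt) && expiresAt < now.getTime();
}

// Keep documents in the given jurisdiction that have not yet expired.
function activeIn(
  docs: VerydiaDocument[],
  jurisdiction: string,
  now: Date,
): VerydiaDocument[] {
  return docs.filter((d) => hasJurisdiction(d, jurisdiction) && !isExpired(d, now));
}
```

Passing the current time in as a parameter, rather than calling new Date() inside the helper, keeps the filter deterministic and easy to test.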

Environment-Agnostic Design

This package contains zero environment-specific APIs. It works in:

✅ Node.js
✅ Browsers
✅ Edge runtimes (Cloudflare Workers, Vercel Edge, etc.)
✅ React Native
✅ Electron

Specific loaders (filesystem, Notion, Slack, etc.) will be in separate packages:

@verydia/loaders-fs - Filesystem loader (Node.js only)
@verydia/loaders-web - Web scraping loader
@verydia/loaders-notion - Notion API loader
@verydia/loaders-slack - Slack API loader
And more...

API Reference

Types

`VerydiaDocument`

The canonical document model for all Verydia ingestion.

Fields:

id: string - Unique identifier (should be stable across re-ingestion)
text: string - Document content
metadata: object - Metadata about source, structure, and domain
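
One way to keep IDs stable across re-ingestion is to derive them from values that do not change between runs, such as the source and URI, instead of from array positions or timestamps. The sketch below is one possible approach under that assumption, not the package's own scheme: stableId and fnv1a32 are hypothetical helpers, and the hash is an FNV-1a-style 32-bit hash computed over UTF-16 code units, chosen here only because it needs no runtime-specific APIs.

```typescript
// FNV-1a-style 32-bit hash over UTF-16 code units, returned as 8 hex characters.
// Not cryptographic; collisions are possible for large corpora.
function fnv1a32(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // FNV prime, 32-bit multiply
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Hypothetical helper (not part of @verydia/loaders): same inputs, same ID.
function stableId(source: string, uri: string): string {
  return `${source}-${fnv1a32(`${source}:${uri}`)}`;
}

const firstRun = stableId("fs", "./documents/report.pdf");
const secondRun = stableId("fs", "./documents/report.pdf");
// firstRun === secondRun, so re-ingesting the same file reuses the same ID.
```

For very large collections, a longer hash (for example, one based on SHA-256 where the runtime provides it) would reduce the chance of collisions.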

`VerydiaLoader`

Interface for all document loaders.

Methods:

load(): Promise<VerydiaDocument[]> - Load documents from source

`DocumentSplitter`

Type for document splitting functions.

type DocumentSplitter = (docs: VerydiaDocument[]) => Promise<VerydiaDocument[]>;
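
Since a splitter is just an async function from documents to documents, splitters can be written and chained like any other function. The sketch below shows a paragraph splitter and a small combinator that runs splitters in sequence. Both splitByParagraph and pipeSplitters are hypothetical names for this example, and the originalDocId and paragraphIndex metadata keys are illustrative choices.

```typescript
// Local copies of the types, so the sketch compiles on its own.
interface VerydiaDocument {
  id: string;
  text: string;
  metadata: { source: string; [key: string]: unknown };
}

type DocumentSplitter = (docs: VerydiaDocument[]) => Promise<VerydiaDocument[]>;

// Splits each document on blank lines, dropping empty paragraphs.
const splitByParagraph: DocumentSplitter = async (docs) =>
  docs.flatMap((doc) =>
    doc.text
      .split(/\n\s*\n/)
      .map((paragraph) => paragraph.trim())
      .filter((paragraph) => paragraph.length > 0)
      .map((text, idx) => ({
        id: `${doc.id}-para-${idx}`,
        text,
        metadata: { ...doc.metadata, originalDocId: doc.id, paragraphIndex: idx },
      })),
  );

// Runs splitters left to right, feeding each one the previous output.
function pipeSplitters(...splitters: DocumentSplitter[]): DocumentSplitter {
  return async (docs) => {
    let current = docs;
    for (const split of splitters) {
      current = await split(current);
    }
    return current;
  };
}
```

A combined splitter built with pipeSplitters has the same DocumentSplitter type, so it can be passed anywhere a single splitter is accepted.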

Classes

`BaseLoader`

Abstract base class for all loaders.

Methods:

abstract load(): Promise<VerydiaDocument[]> - Implement to load documents
loadAndSplit(splitter?: DocumentSplitter): Promise<VerydiaDocument[]> - Load and optionally split

Testing

pnpm test

Building

pnpm build

Outputs:

ESM: dist/index.js
CJS: dist/index.cjs
TypeScript declarations: dist/index.d.ts

License

MIT
