@verydia/loaders
v0.1.0
Published
Core document loader framework for Verydia ingestion pipelines
Overview
@verydia/loaders provides the foundational types and abstractions for loading documents into Verydia from various sources. This package is:
- Environment-agnostic - Works in Node.js, browsers, and edge runtimes
- Zero dependencies - No external dependencies for maximum portability
- Strongly typed - Full TypeScript support with comprehensive JSDoc
- Extensible - Easy to create custom loaders for any data source
Installation
pnpm add @verydia/loaders
Quick Start
Unified loadDocuments() Facade
The easiest way to load documents from any source:
import { loadDocuments } from "@verydia/loaders";

// Load from filesystem
const fsDocs = await loadDocuments({
  kind: "fs",
  options: { path: "./documents", recursive: true },
});

// Load a ZIP archive
const zipDocs = await loadDocuments({
  kind: "zip",
  path: "./evidence-bundle.zip",
});

// Load from Notion
const notionDocs = await loadDocuments({
  kind: "notion",
  options: { auth: process.env.NOTION_TOKEN!, databaseId: "abc123" },
});

// Load from Slack
const slackDocs = await loadDocuments({
  kind: "slack",
  options: { token: process.env.SLACK_TOKEN!, channelIds: ["C123"] },
});

// Load from Google Drive (authClient is assumed to be created elsewhere)
const driveDocs = await loadDocuments({
  kind: "gdrive",
  options: { authClient, query: "mimeType='application/vnd.google-apps.document'" },
});

Simple PDF Loading (FsLoader)
For general-purpose PDF loading with minimal configuration:
import { FsLoader } from "@verydia/loaders";

// Load a single PDF
const singleLoader = new FsLoader({ path: "./document.pdf" });
const singleDocs = await singleLoader.load();
// Returns 1 document with full text

// Load all PDFs in a directory
const dirLoader = new FsLoader({
  path: "./pdfs",
  recursive: true,
  includeExtensions: [".pdf"],
});
const dirDocs = await dirLoader.load();

Advanced PDF Loading (Legal/High-Fidelity Use Cases)
For legal workflows requiring per-page documents or layout preservation:
import { PdfAdvancedLoader } from "@verydia/loaders";

// Per-page mode (one document per page)
const perPageLoader = new PdfAdvancedLoader({
  path: "./cases/brief.pdf",
  mode: "perPage",
});
const pages = await perPageLoader.load();
// pages[0].metadata.page === 1
// pages[1].metadata.page === 2
// ...

// Docling layout-preserving mode
const doclingLoader = new PdfAdvancedLoader({
  path: "./cases/contract.pdf",
  mode: "docling",
  docling: {
    endpoint: process.env.DOCLING_ENDPOINT!,
    apiKeyHeaderName: "x-api-key",
    apiKey: process.env.DOCLING_API_KEY,
  },
});
const layoutDocs = await doclingLoader.load();
// layoutDocs[i].metadata.layoutMarkup contains HTML/Markdown
// Preserves headings, tables, footnotes, etc.

Advanced DOCX Loading (Contracts/Legal Documents)
For contract and legal document workflows requiring section-based analysis:
import { DocxAdvancedLoader } from "@verydia/loaders";

// Single mode (same as basic loader but explicit)
const singleLoader = new DocxAdvancedLoader({
  path: "./contracts/service-agreement.docx",
});
const docs = await singleLoader.load();

// Per-section mode by headings (H1/H2)
const sectionLoader = new DocxAdvancedLoader({
  path: "./briefs/motion-to-dismiss.docx",
  mode: "perSection",
  headingLevels: [1, 2],
});
const sections = await sectionLoader.load();
// sections[i].metadata.sectionHeading, sectionHtml, etc.

// HTML layout mode
const htmlLoader = new DocxAdvancedLoader({
  path: "./policies/employee-handbook.docx",
  mode: "htmlLayout",
});
const [doc] = await htmlLoader.load();
// doc.metadata.layoutHtml contains HTML for downstream rendering

When to use which loader:
| Use Case | Loader | Mode |
|----------|--------|------|
| General PDF/DOCX ingestion | FsLoader | N/A |
| Mixed file types (TXT, JSON, CSV, HTML, PDF, DOCX) | FsLoader | N/A |
| Legal citations (page-specific) | PdfAdvancedLoader | perPage |
| PDF layout preservation (tables, headings) | PdfAdvancedLoader | docling |
| Contract section analysis | DocxAdvancedLoader | perSection |
| DOCX with HTML formatting | DocxAdvancedLoader | htmlLayout |
| Simple single-document PDF/DOCX | PdfAdvancedLoader / DocxAdvancedLoader | single |
Core Concepts
VerydiaDocument
The canonical document model used throughout Verydia for:
- Data ingestion (loaders)
- Document splitting and chunking
- RAG (retrieval-augmented generation)
- Telemetry and observability
interface VerydiaDocument {
  id: string; // Unique identifier
  text: string; // Document content
  metadata: {
    source: string; // Source system (e.g., 'fs', 'notion', 'slack')
    uri?: string; // Path, URL, or external ID
    title?: string; // Human-readable title
    mimeType?: string; // MIME type
    page?: number; // Page number (1-indexed)
    createdAt?: string; // ISO 8601 timestamp
    updatedAt?: string; // ISO 8601 timestamp
    [key: string]: unknown; // Domain-specific fields
  };
}

VerydiaLoader
Interface that all loaders must implement:
interface VerydiaLoader {
  load(): Promise<VerydiaDocument[]>;
}

BaseLoader
Abstract base class providing:
- Required load() method for subclasses to implement
- Built-in loadAndSplit() method for loading + splitting in one call
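The relationship between load() and loadAndSplit() can be sketched as follows. This is a minimal illustration, not the package's source: SketchBaseLoader and ArrayLoader are hypothetical names, the types are local copies so the block stands alone, and the pass-through behaviour when no splitter is supplied is an assumption based on the optional splitter parameter.

```typescript
// Local, simplified copies of the README's types so the sketch is self-contained.
interface VerydiaDocument {
  id: string;
  text: string;
  metadata: { source: string; [key: string]: unknown };
}
type DocumentSplitter = (docs: VerydiaDocument[]) => Promise<VerydiaDocument[]>;

// Hypothetical stand-in for BaseLoader; the real class may differ.
abstract class SketchBaseLoader {
  // Subclasses decide where documents come from.
  abstract load(): Promise<VerydiaDocument[]>;

  // Assumption: without a splitter, the loaded documents are returned unchanged.
  async loadAndSplit(splitter?: DocumentSplitter): Promise<VerydiaDocument[]> {
    const docs = await this.load();
    return splitter ? splitter(docs) : docs;
  }
}

// Hypothetical subclass that serves documents already held in memory.
class ArrayLoader extends SketchBaseLoader {
  private readonly docs: VerydiaDocument[];

  constructor(docs: VerydiaDocument[]) {
    super();
    this.docs = docs;
  }

  async load(): Promise<VerydiaDocument[]> {
    return this.docs;
  }
}
```

With this shape, a subclass only has to implement load(); splitting stays a separate, swappable step.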
Usage
Creating a Custom Loader
import { BaseLoader, type VerydiaDocument } from "@verydia/loaders";

class MyCustomLoader extends BaseLoader {
  constructor(private apiKey: string) {
    super();
  }

  async load(): Promise<VerydiaDocument[]> {
    // Fetch documents from your source
    const response = await fetch("https://api.example.com/documents", {
      headers: { Authorization: `Bearer ${this.apiKey}` },
    });
    // Fail early on HTTP errors instead of parsing an error body
    if (!response.ok) {
      throw new Error(`Request failed with status ${response.status}`);
    }
    const data = await response.json();
    return data.documents.map((doc: any, idx: number) => ({
      id: doc.id || `doc-${idx}`,
      text: doc.content,
      metadata: {
        source: "my-api",
        uri: `https://api.example.com/documents/${doc.id}`,
        title: doc.title,
        createdAt: doc.created_at,
        updatedAt: doc.updated_at,
        // Custom metadata
        author: doc.author,
        tags: doc.tags,
      },
    }));
  }
}

// Usage
const loader = new MyCustomLoader("your-api-key");
const docs = await loader.load();
console.log(`Loaded ${docs.length} documents`);

Loading and Splitting
The loadAndSplit() method allows you to load and chunk documents in one call:
import { BaseLoader, type VerydiaDocument, type DocumentSplitter } from "@verydia/loaders";

// Custom splitter that chunks by character count
const chunkByChars: DocumentSplitter = async (docs) => {
  const chunkSize = 500;
  const chunks: VerydiaDocument[] = [];
  for (const doc of docs) {
    for (let i = 0; i < doc.text.length; i += chunkSize) {
      chunks.push({
        id: `${doc.id}-chunk-${Math.floor(i / chunkSize)}`,
        text: doc.text.slice(i, i + chunkSize),
        metadata: {
          ...doc.metadata,
          // Chunk position within the source document (0-based);
          // kept separate from `page` so a real page number is not overwritten
          chunkIndex: Math.floor(i / chunkSize),
          isChunk: true,
          originalDocId: doc.id,
        },
      });
    }
  }
  return chunks;
};

// Load and split (MyCustomLoader is the class defined in the previous example)
const loader = new MyCustomLoader("your-api-key");
const chunks = await loader.loadAndSplit(chunkByChars);
console.log(`Created ${chunks.length} chunks`);

Legal/Enterprise Metadata
The VerydiaDocument metadata supports domain-specific fields for legal and enterprise use cases:
const legalDoc: VerydiaDocument = {
  id: "contract-123",
  text: "This Agreement is entered into...",
  metadata: {
    source: "sharepoint",
    uri: "https://company.sharepoint.com/contracts/123",
    title: "Service Agreement - Acme Corp",
    mimeType: "application/pdf",
    createdAt: "2024-01-15T10:30:00Z",
    updatedAt: "2024-01-20T14:45:00Z",
    // Legal-specific metadata
    jurisdiction: "CA",
    documentType: "contract",
    author: "Legal Department",
    department: "Legal",
    confidentiality: "internal",
    tags: ["contracts", "services", "2024"],
    expirationDate: "2025-01-15",
    parties: ["Acme Corp", "Our Company"],
  },
};

Environment-Agnostic Design
This package contains zero environment-specific APIs. It works in:
- ✅ Node.js
- ✅ Browsers
- ✅ Edge runtimes (Cloudflare Workers, Vercel Edge, etc.)
- ✅ React Native
- ✅ Electron
Specific loaders (filesystem, Notion, Slack, etc.) will be in separate packages:
- @verydia/loaders-fs - Filesystem loader (Node.js only)
- @verydia/loaders-web - Web scraping loader
- @verydia/loaders-notion - Notion API loader
- @verydia/loaders-slack - Slack API loader
- And more...
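One way a loader can stay free of environment-specific APIs is to have the host application inject the I/O. The sketch below illustrates that idea only; InjectedReaderLoader and TextReader are hypothetical names that do not come from the package, and the VerydiaDocument type is a simplified local copy.

```typescript
// Simplified local copy of the document type so the sketch is self-contained.
interface VerydiaDocument {
  id: string;
  text: string;
  metadata: { source: string; uri?: string; [key: string]: unknown };
}

// Hypothetical: the host supplies a function that turns a URI into text.
type TextReader = (uri: string) => Promise<string>;

// Hypothetical loader that never touches fs, fetch, or any other runtime API itself.
class InjectedReaderLoader {
  private readonly uris: string[];
  private readonly read: TextReader;

  constructor(uris: string[], read: TextReader) {
    this.uris = uris;
    this.read = read;
  }

  async load(): Promise<VerydiaDocument[]> {
    // Read every URI through the injected function, preserving input order.
    const texts = await Promise.all(this.uris.map((uri) => this.read(uri)));
    return texts.map((text, i) => ({
      id: this.uris[i],
      text,
      metadata: { source: "injected", uri: this.uris[i] },
    }));
  }
}
```

In Node.js the reader could wrap a file read, while in a browser or edge runtime it could wrap fetch; the loader code stays the same in both cases.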
API Reference
Types
VerydiaDocument
The canonical document model for all Verydia ingestion.
Fields:
- id: string - Unique identifier (should be stable across re-ingestion)
- text: string - Document content
- metadata: object - Metadata about source, structure, and domain
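An id stays stable across re-ingestion if it is derived only from values that do not change between runs. The sketch below assumes that source and uri (plus an optional page) identify a document; stableDocumentId and fnv1a are hypothetical helpers, not part of the package. It uses a 32-bit FNV-1a hash over UTF-16 code units, which keeps it dependency-free but means collisions are possible in large corpora.

```typescript
// 32-bit FNV-1a over UTF-16 code units, returned as 8 hex characters.
function fnv1a(input: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < input.length; i++) {
    hash ^= input.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193); // multiply by the FNV prime, wrapping to 32 bits
  }
  return (hash >>> 0).toString(16).padStart(8, "0");
}

// Hypothetical helper: builds an id from fields that do not change between runs.
function stableDocumentId(source: string, uri: string, page?: number): string {
  const key = page === undefined ? `${source}:${uri}` : `${source}:${uri}#${page}`;
  return `${source}-${fnv1a(key)}`;
}

// The same inputs always yield the same id, so re-ingesting a file
// can update the existing record instead of creating a duplicate.
const firstRun = stableDocumentId("fs", "./cases/brief.pdf", 1);
const secondRun = stableDocumentId("fs", "./cases/brief.pdf", 1);
console.log(firstRun === secondRun); // true
```

Ids based on array position or load time would change whenever the source is reordered or reloaded, which is why they are avoided here.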
VerydiaLoader
Interface for all document loaders.
Methods:
- load(): Promise<VerydiaDocument[]> - Load documents from source
DocumentSplitter
Type for document splitting functions.
type DocumentSplitter = (docs: VerydiaDocument[]) => Promise<VerydiaDocument[]>;

Classes
BaseLoader
Abstract base class for all loaders.
Methods:
- abstract load(): Promise<VerydiaDocument[]> - Implement to load documents
- loadAndSplit(splitter?: DocumentSplitter): Promise<VerydiaDocument[]> - Load and optionally split
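Because a DocumentSplitter both takes and returns VerydiaDocument[], several splitters can be chained and handed to loadAndSplit() as one. The following is a sketch under that assumption; composeSplitters, byParagraph, and dropEmpty are hypothetical helpers that the package does not provide, and the types are simplified local copies.

```typescript
// Simplified local copies of the README's types so the sketch is self-contained.
interface VerydiaDocument {
  id: string;
  text: string;
  metadata: { source: string; [key: string]: unknown };
}
type DocumentSplitter = (docs: VerydiaDocument[]) => Promise<VerydiaDocument[]>;

// Hypothetical helper: runs splitters left to right, feeding each the previous output.
function composeSplitters(...splitters: DocumentSplitter[]): DocumentSplitter {
  return async (docs) => {
    let current = docs;
    for (const split of splitters) {
      current = await split(current);
    }
    return current;
  };
}

// Splits each document on blank lines, one output document per paragraph.
const byParagraph: DocumentSplitter = async (docs) =>
  docs.flatMap((doc) =>
    doc.text.split(/\n{2,}/).map((text, i) => ({
      id: `${doc.id}-p${i}`,
      text,
      metadata: { ...doc.metadata, originalDocId: doc.id },
    })),
  );

// Drops documents whose text is empty or whitespace only.
const dropEmpty: DocumentSplitter = async (docs) =>
  docs.filter((doc) => doc.text.trim().length > 0);

// A single splitter that can be passed wherever a DocumentSplitter is expected.
const paragraphPipeline = composeSplitters(byParagraph, dropEmpty);
```

Keeping each step small makes it easy to reorder or replace one stage without touching the loader.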
Testing
pnpm test
Building
pnpm build
Outputs:
- ESM: dist/index.js
- CJS: dist/index.cjs
- TypeScript declarations: dist/index.d.ts
License
MIT
