@vectorize-io/iris
v0.1.3
Simple text extraction from files using Vectorize Iris
Vectorize Iris Node.js SDK
Document text extraction for Node.js & TypeScript
Extract text, tables, and structured data from PDFs, images, and documents with a single async function. Built on Vectorize Iris, the industry-leading AI extraction service.
Why Iris?
Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to deliver:
- ✨ High accuracy - Even with poor quality or complex documents
- 📊 Structure preservation - Maintains tables, lists, and formatting
- 🎯 Smart chunking - Semantic splitting perfect for RAG pipelines
- 🔍 Metadata extraction - Extract specific fields using natural language
- 🚀 TypeScript native - Full type safety with built-in types
- ⚡ Async-first - Promise-based API for modern Node.js
Quick Start
Installation
npm install @vectorize-io/iris
Authentication
Set your credentials (get them at vectorize.io):
export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"
Basic Usage
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('document.pdf');
console.log(result.text);
That's it! Iris handles file upload, extraction, and polling automatically.
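Both variables from the Authentication step need to be set before the first call. A minimal sketch of a startup check follows. It assumes nothing about the SDK beyond those two variable names, and the helper `missingCredentials` is hypothetical, not part of the package:

```typescript
// Variable names taken from the Authentication step above.
const REQUIRED_VARS = ['VECTORIZE_TOKEN', 'VECTORIZE_ORG_ID'] as const;

// Hypothetical helper: returns the names of required variables
// that are unset, empty, or whitespace-only.
function missingCredentials(env: Record<string, string | undefined>): string[] {
  return REQUIRED_VARS.filter((name) => !env[name]?.trim());
}

const missing = missingCredentials(process.env);
if (missing.length > 0) {
  console.error(`Missing environment variables: ${missing.join(', ')}`);
}
```

Running a check like this at startup reports a configuration problem before any file is uploaded.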
Features
Basic Text Extraction
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('document.pdf');
console.log(result.text);
Output:
This is the extracted text from your PDF document.
All formatting and structure is preserved.
Tables, lists, and other elements are properly extracted.
Extract from Buffer
import { extractText } from '@vectorize-io/iris';
import * as fs from 'fs';
const fileBuffer = fs.readFileSync('document.pdf');
const result = await extractText(fileBuffer, 'document.pdf');
console.log(`Extracted ${result.text.length} characters`);
Output:
Extracted 5536 characters
Chunking for RAG
import { extractTextFromFile } from '@vectorize-io/iris';
import type { ExtractionOptions } from '@vectorize-io/iris';
const options: ExtractionOptions = {
  chunkSize: 512
};
const result = await extractTextFromFile('long-document.pdf', options);
result.chunks?.forEach((chunk, i) => {
  console.log(`Chunk ${i + 1}: ${chunk.substring(0, 100)}...`);
});
Output:
Chunk 1: # Introduction
This document covers the basics of machine learning...
Chunk 2: ## Neural Networks
Neural networks are computational models inspired by...
Chunk 3: ### Training Process
The training process involves adjusting weights...
Custom Parsing Instructions
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('report.pdf', {
  parsingInstructions: 'Extract only tables and numerical data, ignore narrative text'
});
console.log(result.text);
Output:
Q1 2024 Revenue: $1,250,000
Q2 2024 Revenue: $1,450,000
Q3 2024 Revenue: $1,680,000
Region | Sales | Growth
----------|--------|-------
North | $500K | +12%
South | $380K | +8%
East | $420K | +15%
West | $380K | +10%
Inferred Metadata Schema
import { extractTextFromFile } from '@vectorize-io/iris';
const result = await extractTextFromFile('invoice.pdf', {
  inferMetadataSchema: true
});
const metadata = JSON.parse(result.metadata!);
console.log(JSON.stringify(metadata, null, 2));
Output:
{
  "document_type": "invoice",
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": 1250.00,
  "currency": "USD",
  "vendor": "Acme Corp"
}
Express.js Integration
import express from 'express';
import multer from 'multer';
import { extractText } from '@vectorize-io/iris';
import * as fs from 'fs';
const app = express();
const upload = multer({ dest: 'uploads/' });
app.post('/extract', upload.single('file'), async (req, res) => {
  try {
    const fileBuffer = fs.readFileSync(req.file!.path);
    const result = await extractText(fileBuffer, req.file!.originalname);
    res.json({
      success: true,
      text: result.text,
      charCount: result.text?.length || 0
    });
  } catch (error) {
    // Under strict TypeScript, `error` is `unknown`; narrow it before reading `.message`
    const message = error instanceof Error ? error.message : String(error);
    res.status(500).json({
      success: false,
      error: message
    });
  }
});
app.listen(3000, () => {
  console.log('Server running on port 3000');
});
Request:
curl -F "file=@document.pdf" http://localhost:3000/extract
Response:
{
  "success": true,
  "text": "This is the extracted text...",
  "charCount": 5536
}
Batch Processing
import { extractTextFromFile } from '@vectorize-io/iris';
import * as fs from 'fs/promises';
import * as path from 'path';
async function processDirectory(dirPath: string) {
  const files = await fs.readdir(dirPath);
  const pdfFiles = files.filter(f => f.endsWith('.pdf'));
  for (const file of pdfFiles) {
    const filePath = path.join(dirPath, file);
    console.log(`Processing ${file}...`);
    const result = await extractTextFromFile(filePath);
    // Anchor the match to the end so only the extension is replaced
    const outputPath = filePath.replace(/\.pdf$/, '.txt');
    await fs.writeFile(outputPath, result.text ?? '');
    console.log(` ✓ Saved to ${path.basename(outputPath)}`);
  }
}
processDirectory('./documents');
Output:
Processing report-q1.pdf...
✓ Saved to report-q1.txt
Processing report-q2.pdf...
✓ Saved to report-q2.txt
Processing report-q3.pdf...
✓ Saved to report-q3.txt
Parallel Processing
import { extractTextFromFile } from '@vectorize-io/iris';
const files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];
const results = await Promise.all(
  files.map(file => extractTextFromFile(file))
);
results.forEach((result, i) => {
  console.log(`${files[i]}: ${result.text?.length || 0} chars`);
});
Output:
doc1.pdf: 3421 chars
doc2.pdf: 5892 chars
doc3.pdf: 2156 chars
Error Handling
import { extractTextFromFile, VectorizeIrisError } from '@vectorize-io/iris';
try {
  const result = await extractTextFromFile('document.pdf');
  console.log(result.text);
} catch (error) {
  if (error instanceof VectorizeIrisError) {
    console.error('Extraction failed:', error.message);
  } else {
    console.error('Unexpected error:', error);
  }
}
Output:
Extraction failed: File not found: document.pdf
TypeScript Types
import { extractTextFromFile } from '@vectorize-io/iris';
import type {
  ExtractionOptions,
  ExtractionResultData,
  MetadataExtractionStrategySchema
} from '@vectorize-io/iris';
// Type-safe options with structured schema (OpenAPI spec format)
const options: ExtractionOptions = {
  chunkSize: 512,
  parsingInstructions: 'Extract code blocks',
  metadataSchemas: [{
    id: 'doc-meta',
    schema: {
      title: 'string',
      author: 'string',
      date: 'string'
    }
  }],
  pollInterval: 2000,
  timeout: 300000
};
// Type-safe result
const result: ExtractionResultData = await extractTextFromFile('doc.pdf', options);
if (result.success) {
  console.log('Text:', result.text);
  console.log('Chunks:', result.chunks?.length);
  console.log('Metadata:', result.metadata);
}
API Reference
extractTextFromFile(filePath, options?)
Extract text from a file.
Parameters:
- filePath (string): Path to the file
- options (ExtractionOptions, optional): Extraction options
Returns: Promise<ExtractionResultData>
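Because each call returns its own promise, starting a large batch with Promise.all sends every file at once. The sketch below shows one way to cap the number of calls in flight. `mapWithLimit` is a hypothetical helper, not part of the SDK, and the limit of 2 in the usage comment is an arbitrary choice rather than a documented rate limit:

```typescript
// Hypothetical helper: runs `task` over `items` with at most `limit` calls
// in flight, returning results in the same order as the input.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next item to claim

  // Each worker claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await task(items[index]);
    }
  }

  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Possible usage with the SDK (assumes extractTextFromFile is imported):
// const results = await mapWithLimit(files, 2, (f) => extractTextFromFile(f));
```

As with Promise.all, the first rejected call rejects the whole batch, so wrap `task` in a try/catch if partial results are acceptable.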
extractText(fileBuffer, fileName, options?)
Extract text from a buffer.
Parameters:
- fileBuffer (Buffer): File content
- fileName (string): File name
- options (ExtractionOptions, optional): Extraction options
Returns: Promise<ExtractionResultData>
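Both functions resolve only once extraction has finished, which the SDK handles by polling. For readers unfamiliar with the pattern, here is a generic poll-until-done loop. It is an illustration only and not the SDK's internal code. `pollUntil` and `check` are hypothetical names, and the default values simply mirror the documented pollInterval and timeout defaults:

```typescript
// Generic polling loop: calls `check` until it yields a value,
// waiting `pollInterval` ms between attempts and giving up after `timeout` ms.
async function pollUntil<T>(
  check: () => Promise<T | undefined>,
  pollInterval = 2000,
  timeout = 300000
): Promise<T> {
  const deadline = Date.now() + timeout;
  while (true) {
    const value = await check();
    if (value !== undefined) return value;
    // Stop if the next attempt would start after the deadline.
    if (Date.now() + pollInterval > deadline) {
      throw new Error(`Timed out after ${timeout} ms`);
    }
    await new Promise<void>((resolve) => setTimeout(resolve, pollInterval));
  }
}
```

A shorter interval notices completion sooner at the cost of more status requests; a longer timeout suits large documents.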
ExtractionOptions
interface ExtractionOptions {
  apiToken?: string;              // Override env var
  orgId?: string;                 // Override env var
  pollInterval?: number;          // ms between checks (default: 2000)
  timeout?: number;               // max ms to wait (default: 300000)
  type?: 'iris';                  // Extraction type
  chunkSize?: number;             // Chunk size (default: 256)
  metadataSchemas?: Array<{       // Metadata schemas
    id: string;
    schema: string;
  }>;
  inferMetadataSchema?: boolean;  // Auto-detect metadata
  parsingInstructions?: string;   // Custom instructions
}
ExtractionResultData
interface ExtractionResultData {
  success: boolean;
  text?: string;                     // Extracted text
  chunks?: string[];                 // Text chunks
  metadata?: string;                 // JSON metadata
  metadataSchema?: string;           // Schema ID
  chunksMetadata?: (string|null)[];  // Per-chunk metadata
  chunksSchema?: (string|null)[];    // Per-chunk schemas
  error?: string;                    // Error message
}
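Since most result fields are optional and metadata arrives as a JSON string, callers may want a single place to normalize a result. A minimal sketch follows, assuming a result with success set to false can be returned rather than thrown. `ResultLike` is a local copy of a subset of the fields above so the snippet compiles on its own, and `unpackResult` is a hypothetical helper, not part of the SDK:

```typescript
// Local subset of the documented ExtractionResultData fields.
interface ResultLike {
  success: boolean;
  text?: string;
  chunks?: string[];
  metadata?: string; // JSON-encoded string
  error?: string;
}

interface Unpacked {
  text: string;
  chunks: string[];
  metadata: unknown; // parsed JSON, or null when absent or invalid
}

// Hypothetical helper: throws on failure, otherwise fills in defaults
// and parses the metadata string.
function unpackResult(result: ResultLike): Unpacked {
  if (!result.success) {
    throw new Error(result.error ?? 'Extraction failed without an error message');
  }
  let metadata: unknown = null;
  if (result.metadata) {
    try {
      metadata = JSON.parse(result.metadata);
    } catch {
      metadata = null; // leave as null if the string is not valid JSON
    }
  }
  return { text: result.text ?? '', chunks: result.chunks ?? [], metadata };
}
```

This avoids the non-null assertions (`!`) used in some of the shorter examples.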