@vectorize-io/iris

v0.1.3

Simple text extraction from files using Vectorize Iris

Vectorize Iris Node.js SDK

Document text extraction for Node.js & TypeScript

Extract text, tables, and structured data from PDFs, images, and documents with a single async function. Built on Vectorize Iris, an AI-based extraction service.

License: MIT

Why Iris?

Traditional OCR tools struggle with complex layouts, poor scans, and structured data. Iris uses advanced AI to deliver:

  • High accuracy - Even with poor quality or complex documents
  • 📊 Structure preservation - Maintains tables, lists, and formatting
  • 🎯 Smart chunking - Semantic splitting perfect for RAG pipelines
  • 🔍 Metadata extraction - Extract specific fields using natural language
  • 🚀 TypeScript native - Full type safety with built-in types
  • Async-first - Promise-based API for modern Node.js

Quick Start

Installation

npm install @vectorize-io/iris

Authentication

Set your credentials (get them at vectorize.io):

export VECTORIZE_TOKEN="your-token"
export VECTORIZE_ORG_ID="your-org-id"
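
It can help to fail fast when either variable is missing, rather than discovering it on the first extraction call. The sketch below is not part of the SDK: `readIrisCredentials` and `IrisCredentials` are hypothetical names, and the helper simply takes an environment-like object (for example `process.env`) and returns values shaped like the `apiToken` and `orgId` options listed under ExtractionOptions.

```typescript
// Hypothetical helper, not part of @vectorize-io/iris.
type Env = Record<string, string | undefined>;

interface IrisCredentials {
  apiToken: string;
  orgId: string;
}

function readIrisCredentials(env: Env): IrisCredentials {
  const apiToken = env['VECTORIZE_TOKEN']?.trim();
  const orgId = env['VECTORIZE_ORG_ID']?.trim();

  // Collect every missing name so the error reports all of them at once.
  const missing: string[] = [];
  if (!apiToken) missing.push('VECTORIZE_TOKEN');
  if (!orgId) missing.push('VECTORIZE_ORG_ID');
  if (missing.length > 0) {
    throw new Error(`Missing environment variable(s): ${missing.join(', ')}`);
  }
  return { apiToken: apiToken as string, orgId: orgId as string };
}

// Usage in Node.js: const creds = readIrisCredentials(process.env);
```

The returned object could then be spread into the options argument if you prefer passing credentials explicitly over relying on the environment.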

Basic Usage

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('document.pdf');
console.log(result.text);

That's it! Iris handles file upload, extraction, and polling automatically.

Features

Basic Text Extraction

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('document.pdf');
console.log(result.text);

Output:

This is the extracted text from your PDF document.
All formatting and structure is preserved.

Tables, lists, and other elements are properly extracted.

Extract from Buffer

import { extractText } from '@vectorize-io/iris';
import * as fs from 'fs';

const fileBuffer = fs.readFileSync('document.pdf');
const result = await extractText(fileBuffer, 'document.pdf');

console.log(`Extracted ${result.text.length} characters`);

Output:

Extracted 5536 characters

Chunking for RAG

import { extractTextFromFile } from '@vectorize-io/iris';
import type { ExtractionOptions } from '@vectorize-io/iris';

const options: ExtractionOptions = {
  chunkSize: 512
};

const result = await extractTextFromFile('long-document.pdf', options);

result.chunks?.forEach((chunk, i) => {
  console.log(`Chunk ${i+1}: ${chunk.substring(0, 100)}...`);
});

Output:

Chunk 1: # Introduction
This document covers the basics of machine learning...

Chunk 2: ## Neural Networks
Neural networks are computational models inspired by...

Chunk 3: ### Training Process
The training process involves adjusting weights...
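
Before chunks go into a vector store they usually need an identifier and a position. The following is a minimal sketch of that step, assuming only that `chunks` is an optional array of strings as documented under ExtractionResultData. `ChunkRecord` and `toChunkRecords` are hypothetical names, not SDK exports.

```typescript
// Hypothetical helper: shape extracted chunks into records ready for embedding.
interface ChunkRecord {
  id: string;     // "<source>#<index>"
  source: string; // file the chunk came from
  index: number;  // zero-based position in the original chunk list
  text: string;
}

function toChunkRecords(source: string, chunks: string[] | undefined): ChunkRecord[] {
  return (chunks ?? [])
    .map((text, index) => ({ id: `${source}#${index}`, source, index, text: text.trim() }))
    .filter(record => record.text.length > 0); // drop whitespace-only chunks
}

// Usage: const records = toChunkRecords('long-document.pdf', result.chunks);
```

Because the index is assigned before filtering, each record keeps its position in the original chunk list even when empty chunks are dropped.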

Custom Parsing Instructions

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('report.pdf', {
  parsingInstructions: 'Extract only tables and numerical data, ignore narrative text'
});

console.log(result.text);

Output:

Q1 2024 Revenue: $1,250,000
Q2 2024 Revenue: $1,450,000
Q3 2024 Revenue: $1,680,000

Region    | Sales  | Growth
----------|--------|-------
North     | $500K  | +12%
South     | $380K  | +8%
East      | $420K  | +15%
West      | $380K  | +10%

Inferred Metadata Schema

import { extractTextFromFile } from '@vectorize-io/iris';

const result = await extractTextFromFile('invoice.pdf', {
  inferMetadataSchema: true
});

const metadata = JSON.parse(result.metadata!);
console.log(JSON.stringify(metadata, null, 2));

Output:

{
  "document_type": "invoice",
  "invoice_number": "INV-2024-001",
  "date": "2024-01-15",
  "total_amount": 1250.00,
  "currency": "USD",
  "vendor": "Acme Corp"
}
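
The example above uses a non-null assertion on `result.metadata`, which is declared optional. A more defensive variant might look like the sketch below. `parseMetadata` is a hypothetical helper, and treating anything other than a JSON object as "no metadata" is an assumption made for this example, not documented SDK behavior.

```typescript
// Hypothetical helper: parse the optional `metadata` JSON string without throwing.
function parseMetadata(metadata: string | undefined): Record<string, unknown> | null {
  if (!metadata) return null;
  try {
    const parsed: unknown = JSON.parse(metadata);
    // Accept only plain JSON objects; arrays and primitives count as "no metadata".
    if (typeof parsed === 'object' && parsed !== null && !Array.isArray(parsed)) {
      return parsed as Record<string, unknown>;
    }
    return null;
  } catch {
    return null; // malformed JSON
  }
}

// Usage: const metadata = parseMetadata(result.metadata);
```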

Express.js Integration

import express from 'express';
import multer from 'multer';
import { extractText } from '@vectorize-io/iris';
import * as fs from 'fs';

const app = express();
const upload = multer({ dest: 'uploads/' });

app.post('/extract', upload.single('file'), async (req, res) => {
  try {
    const fileBuffer = fs.readFileSync(req.file!.path);
    const result = await extractText(fileBuffer, req.file!.originalname);

    res.json({
      success: true,
      text: result.text,
      charCount: result.text?.length || 0
    });
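  } catch (error) {
    // `error` is `unknown` under strict TypeScript, so narrow it before reading `.message`
    const message = error instanceof Error ? error.message : String(error);
    res.status(500).json({
      success: false,
      error: message
    });
  }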
});

app.listen(3000, () => {
  console.log('Server running on port 3000');
});

Request:

curl -F "file=@document.pdf" http://localhost:3000/extract

Response:

{
  "success": true,
  "text": "This is the extracted text...",
  "charCount": 5536
}

Batch Processing

import { extractTextFromFile } from '@vectorize-io/iris';
import * as fs from 'fs/promises';
import * as path from 'path';

async function processDirectory(dirPath: string) {
  const files = await fs.readdir(dirPath);
  const pdfFiles = files.filter(f => f.endsWith('.pdf'));

  for (const file of pdfFiles) {
    const filePath = path.join(dirPath, file);
    console.log(`Processing ${file}...`);

    const result = await extractTextFromFile(filePath);
    const outputPath = filePath.replace(/\.pdf$/, '.txt'); // swap only the trailing extension

    await fs.writeFile(outputPath, result.text!);
    console.log(`  ✓ Saved to ${path.basename(outputPath)}`);
  }
}

processDirectory('./documents').catch(console.error);

Output:

Processing report-q1.pdf...
  ✓ Saved to report-q1.txt
Processing report-q2.pdf...
  ✓ Saved to report-q2.txt
Processing report-q3.pdf...
  ✓ Saved to report-q3.txt

Parallel Processing

import { extractTextFromFile } from '@vectorize-io/iris';

const files = ['doc1.pdf', 'doc2.pdf', 'doc3.pdf'];

const results = await Promise.all(
  files.map(file => extractTextFromFile(file))
);

results.forEach((result, i) => {
  console.log(`${files[i]}: ${result.text?.length || 0} chars`);
});

Output:

doc1.pdf: 3421 chars
doc2.pdf: 5892 chars
doc3.pdf: 2156 chars
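
`Promise.all` starts every extraction at once, which may be more than you want for a large batch. The source does not state whether the service enforces rate limits, so treat the following as an optional sketch: `mapWithConcurrency` is a hypothetical helper that caps the number of in-flight tasks while keeping results in input order.

```typescript
// Hypothetical helper: run `worker` over `items` with at most `limit` tasks in flight.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError('limit must be a positive integer');
  }
  const results = new Array<R>(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, before any await
      results[index] = await worker(items[index], index);
    }
  }

  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results; // same order as `items`
}

// Usage with the SDK (assumes the import shown above):
// const results = await mapWithConcurrency(files, 2, file => extractTextFromFile(file));
```

As with `Promise.all`, the returned promise rejects as soon as any task rejects. Wrap the worker in a try/catch if one bad file should not fail the whole batch.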

Error Handling

import { extractTextFromFile, VectorizeIrisError } from '@vectorize-io/iris';

try {
  const result = await extractTextFromFile('document.pdf');
  console.log(result.text);
} catch (error) {
  if (error instanceof VectorizeIrisError) {
    console.error('Extraction failed:', error.message);
  } else {
    console.error('Unexpected error:', error);
  }
}

Output:

Extraction failed: File not found: document.pdf

TypeScript Types

import { extractTextFromFile } from '@vectorize-io/iris';
import type {
  ExtractionOptions,
  ExtractionResultData,
  MetadataExtractionStrategySchema
} from '@vectorize-io/iris';

// Type-safe options with structured schema (OpenAPI spec format)
const options: ExtractionOptions = {
  chunkSize: 512,
  parsingInstructions: 'Extract code blocks',
  metadataSchemas: [{
    id: 'doc-meta',
    // ExtractionOptions declares `schema` as a string, so serialize the object
    schema: JSON.stringify({
      title: 'string',
      author: 'string',
      date: 'string'
    })
  }],
  pollInterval: 2000,
  timeout: 300000
};

// Type-safe result
const result: ExtractionResultData = await extractTextFromFile('doc.pdf', options);

if (result.success) {
  console.log('Text:', result.text);
  console.log('Chunks:', result.chunks?.length);
  console.log('Metadata:', result.metadata);
}

API Reference

extractTextFromFile(filePath, options?)

Extract text from a file.

Parameters:

  • filePath (string): Path to the file
  • options (ExtractionOptions, optional): Extraction options

Returns: Promise<ExtractionResultData>

extractText(fileBuffer, fileName, options?)

Extract text from a buffer.

Parameters:

  • fileBuffer (Buffer): File content
  • fileName (string): File name
  • options (ExtractionOptions, optional): Extraction options

Returns: Promise<ExtractionResultData>

ExtractionOptions

interface ExtractionOptions {
  apiToken?: string;              // Override env var
  orgId?: string;                 // Override env var
  pollInterval?: number;          // ms between checks (default: 2000)
  timeout?: number;               // max ms to wait (default: 300000)
  type?: 'iris';                  // Extraction type
  chunkSize?: number;             // Chunk size (default: 256)
  metadataSchemas?: Array<{       // Metadata schemas
    id: string;
    schema: string;
  }>;
  inferMetadataSchema?: boolean;  // Auto-detect metadata
  parsingInstructions?: string;   // Custom instructions
}
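
The `pollInterval` and `timeout` options control how long a call waits for extraction to finish. The SDK's internal polling is not shown in this README, so the following is only a generic sketch of how those two settings typically interact. `pollUntil`, `PollOptions`, and `check` are hypothetical names.

```typescript
// Generic polling sketch; NOT the SDK's internal implementation.
interface PollOptions {
  pollInterval: number; // ms between checks
  timeout: number;      // max ms to wait overall
}

async function pollUntil<T>(
  check: () => Promise<T | undefined>, // resolves to a value when ready, undefined otherwise
  { pollInterval, timeout }: PollOptions
): Promise<T> {
  const deadline = Date.now() + timeout;
  for (;;) {
    const value = await check();
    if (value !== undefined) return value;
    // Give up if the next check would start after the deadline.
    if (Date.now() + pollInterval > deadline) {
      throw new Error(`Timed out after ${timeout} ms`);
    }
    await new Promise<void>(resolve => setTimeout(resolve, pollInterval));
  }
}
```

In a loop like this, a shorter `pollInterval` notices completion sooner at the cost of more status requests, while `timeout` bounds the total wait regardless of how many checks run.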

ExtractionResultData

interface ExtractionResultData {
  success: boolean;
  text?: string;                  // Extracted text
  chunks?: string[];              // Text chunks
  metadata?: string;              // JSON metadata
  metadataSchema?: string;        // Schema ID
  chunksMetadata?: (string|null)[]; // Per-chunk metadata
  chunksSchema?: (string|null)[];   // Per-chunk schemas
  error?: string;                 // Error message
}
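
Since `text` and `error` are both optional, callers may want a single place that turns a result into guaranteed text. This README does not say when a call resolves with `success: false` rather than throwing, so the sketch below handles both an unsuccessful result and a missing `text`. `requireText` and `ResultLike` are hypothetical names.

```typescript
// Local mirror of the documented fields used here; in real code, import
// ExtractionResultData from '@vectorize-io/iris' instead.
interface ResultLike {
  success: boolean;
  text?: string;
  error?: string;
}

// Hypothetical helper: return the extracted text or throw with the reported error.
function requireText(result: ResultLike): string {
  if (!result.success) {
    throw new Error(result.error ?? 'Extraction failed without an error message');
  }
  if (typeof result.text !== 'string') {
    throw new Error('Extraction succeeded but returned no text');
  }
  return result.text;
}

// Usage: const text = requireText(await extractTextFromFile('document.pdf'));
```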
