npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@virstack/doc-ingest

v1.1.1

Published

A high-performance, parallelized document ingestion and vectorization pipeline.

Downloads

460

Readme

🚀 Virstack Doc Ingest

Virstack Doc Ingest is a high-performance, parallelized document ingestion and vectorization pipeline designed for scalable Retrieval-Augmented Generation (RAG) applications.

Powered by LangGraph for resilient orchestration, OpenRouter / Gemini for advanced vision/text extraction, and natively supporting Upstash Vector (with easily injectable custom adapters), this library acts as a universal bridge between your raw documents and your AI applications.


✨ Key Features

  • Universal Multi-Format Support: Natively processes PDF, DOCX, XLSX, PPTX, CSV, TXT, HTML, EPUB, and Images (JPG, JPEG, PNG, GIF, WEBP, SVG).
  • Dual-Tier Parallelism: Concurrently processes multiple files while simultaneously splitting and routing large PDFs into parallel Vision-API execution nodes.
  • Smart Type Routing: Automatically identifies MIME types and dynamically routes files to the most optimal, parser-specific extraction graph.
  • Provider Agnostic Architecture: Built entirely on Dependency Injection. Easily swap out LLMs, Embeddings, and Vector Databases (Pinecone, Qdrant, etc.) to fit your specific stack.
  • Gorgeous TUI (Text User Interface): Features a beautiful, interactive command-line interface with interactive menus and live, non-tearing spinners.

🛠️ System Prerequisites

IMPORTANT: For parsing complex Office documents (e.g., .docx, .pptx, .xlsx, .epub), the pipeline relies on LibreOffice for high-fidelity conversion.

If you are only parsing PDFs, TXT, or CSV files, LibreOffice is not required.

Installing LibreOffice:

  • macOS: brew install --cask libreoffice
  • Ubuntu/Debian: sudo apt-get install libreoffice
  • Windows: Download the installer from libreoffice.org

📦 Installation

You can install Virstack Doc Ingest globally to use as a standalone CLI tool, or locally to utilize its powerful API in your custom Node.js applications.

Global Install (CLI Usage)

npm install -g virstack-doc-ingest

Local Install (Library Usage)

npm install virstack-doc-ingest

💻 Usage Mode 1: Interactive CLI

The CLI offers a completely interactive, wizard-based experience.

1. Environment Configuration

Create a .env file in the directory where you plan to run the command:

OPENROUTER_API_KEY=sk-or-v1-...
UPSTASH_VECTOR_URL=https://...
UPSTASH_VECTOR_TOKEN=...
LLM_MODEL=google/gemini-2.0-flash-001
EMBEDDING_MODEL=google/gemini-embedding-2-preview
MAX_CONCURRENT_FILES=3
MAX_CONCURRENT_API_CALLS=15

Note on LangGraph Studio: If you plan to visualize and run this pipeline directly within LangGraph Studio, you must also include your LANGSMITH_API_KEY (or LANGCHAIN_API_KEY) in your .env file to enable pipeline auto-initialization.

2. Running the Tool

To launch the interactive wizard (which allows you to select files, folders, or paste raw text):

virstack-doc-ingest

To bypass the wizard and directly ingest a specific file or directory:

# Process a single contract
virstack-doc-ingest ./documents/contract.pdf

# Process all documents in a directory
virstack-doc-ingest ./company-knowledge-base/

# Run with verbose, node-level diagnostics
virstack-doc-ingest ./documents/ --verbose

Example Output

 __     __  _                _                    _        ____                     ___                                _
 \ \   / / (_)  _ __   ___  | |_    __ _    ___  | | __   |  _ \    ___     ___    |_ _|  _ __     __ _    ___   ___  | |_
  \ \ / /  | | | '__| / __| | __|  / _` |  / __| | |/ /   | | | |  / _ \   / __|    | |  | '_ \   / _` |  / _ \ / __| | __|
   \ V /   | | | |    \__ \ | |_  | (_| | | (__  |   <    | |_| | | (_) | | (__     | |  | | | | | (_| | |  __/ \__ \ | |_
    \_/    |_| |_|    |___/  \__|  \__,_|  \___| |_|\_\   |____/   \___/   \___|   |___| |_| |_|  \__, |  \___| |___/  \__|
                                                                                                  |___/
┌   Welcome to Virstack Doc Ingest
│
◇  What file or directory would you like to process?
│  ./docs
│
◇  Found 2 file(s). Ready to process?
│  Yes, start ingestion
│
◇  ✔ Processing complete in 41.8s!
│
◇  Final Results: 2 succeeded, 0 failed
│
│    ✔ PRES1 CIS 6006-Updated Assessment.p │   28 chunks │   28 vectors │ 41.7s
│
│    ✔ VAI-020-021-Webhook-Implementation. │   12 chunks │   12 vectors │ 27.9s
│
└   Pipeline Finished Successfully!

🛠️ Usage Mode 2: Node.js Library (100% Provider Agnostic)

Virstack Doc Ingest is designed to be fully embedded into your own SaaS backends or ETL pipelines. It is rigidly decoupled from concrete implementations.

Validating Supported File Types

You can import the list of natively supported file extensions directly from the library to validate user uploads before sending them to the ingestion pipeline.

import { SUPPORTED_FILE_EXTENSIONS, batchGraph } from "virstack-doc-ingest";

const fileExt = ".jpg"; // e.g. path.extname(file)

if (!SUPPORTED_FILE_EXTENSIONS.includes(fileExt.toLowerCase())) {
  console.error(`Unsupported file type: ${fileExt}`);
  // Return a 400 Bad Request to the user
}

Default Built-In Adapters

The package exports fully functional adapters for typical stacks:

  • OpenRouterLlmAdapter
  • OpenRouterEmbeddingAdapter
  • UpstashAdapter

Custom Adapter Example (Pinecone & Local LLM)

Here is how you inject your own custom logic into the LangGraph pipeline:

import {
  initializeConfig,
  batchGraph,
  type VectorStoreAdapter,
  type LlmAdapter,
  type EmbeddingAdapter,
  OpenRouterEmbeddingAdapter,
} from "virstack-doc-ingest";
import { Pinecone } from "@pinecone-database/pinecone";

// 1. Define your own Vector Store connection
class CustomPineconeAdapter implements VectorStoreAdapter {
  async upsert(records: any[]) {
    /* ... */
  }
}

// 2. Define a custom Local AI processor (e.g. Ollama)
class LocalLLMAdapter implements LlmAdapter {
  async extractText(image: Buffer, mime: string) {
    return "extracted text";
  }
}

// 3. Mount the adapters to the global configuration
initializeConfig({
  llm: new LocalLLMAdapter(),
  embedder: new OpenRouterEmbeddingAdapter(
    process.env.OPENROUTER_API_KEY!,
    "google/gemini-embedding-2-preview",
  ),
  vectorStore: new CustomPineconeAdapter(),
  maxConcurrentFiles: 5,
});

// 4. Invoke the ingestion orchestrator
async function processData() {
  const files = ["./uploads/report_2024.pdf", "./uploads/financials.xlsx"];

  console.log("Orchestrating batch ingestion...");
  const result = await batchGraph.invoke({ files });
  console.log("Success! Extracted documents:", result.results.length);
}

processData();

⚙️ Configuration Reference

When invoking initializeConfig(options), the VirstackDocIngestConfig interface accepts the following properties:

| Property | Type | Default | Description | | :-------------------- | :------------------- | :----------- | :-------------------------------------------------------------------------------- | | llm | LlmAdapter | Required | Provider for extracting text (especially from PDF images via Vision APIs). | | embedder | EmbeddingAdapter | Required | Provider for transforming text chunks into vector arrays. | | vectorStore | VectorStoreAdapter | Required | Provider targeting your target vector database for final persistence. | | openRouterApiKey | string | undefined | Required if utilizing the built-in OpenRouter adapters. | | maxConcurrentFiles | number | 3 | Maximum files mapped into the parallel processing queue simultaneously. | | maxConcurrentApi | number | 15 | Global connection limit to prevent 429 Rate Limit errors across all active nodes. | | maxTokens | number | 16384 | Maximum allowable context window for the Vision LLM extraction. | | embeddingDimensions | number | 1536 | Target dimensions for the output vectors. | | chunkSize | number | 1000 | Target character length for Markdown recursive section chunking. | | chunkOverlap | number | 100 | Overlapping character padding between contiguous chunk segments. | | pdfPagesPerChunk | number | 10 | Number of PDF pages grouped together before a parallel Vision evaluation. | | systemPrompt | string | (default) | Injection of custom instructions overriding the default extraction constraints. |