npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

pageindex

v1.0.1

Published

Vectorless, reasoning-based RAG for document understanding. Multi-runtime (Node.js, Bun, Deno). PDF and Markdown support with OCR.

Downloads

211

Readme

bun-pageindex

Bun-native vectorless, reasoning-based RAG for document understanding. A TypeScript port of PageIndex optimized for the Bun runtime.

Features

  • Vectorless RAG: Uses LLM reasoning to build hierarchical document indices without vector databases
  • PDF Support: Extract structure and content from PDF documents
  • OCR Mode: Process scanned PDFs using GLM-OCR vision model (not in original PageIndex!)
  • Markdown Support: Convert markdown documents to tree structures
  • LLM Agnostic: Works with OpenAI, LM Studio, Ollama, or any OpenAI-compatible API
  • Bun Native: Optimized for Bun runtime with minimal dependencies
  • CLI & API: Use as a library or command-line tool

Installation

bun add bun-pageindex

For OCR Mode (Scanned PDFs)

OCR mode requires Poppler to be installed on your system:

# macOS
brew install poppler

# Ubuntu/Debian
sudo apt-get install poppler-utils

# Windows
# Download from https://github.com/oschwartz10612/poppler-windows/releases

Quick Start

As a Library

import { PageIndex, indexPdf, mdToTree } from "bun-pageindex";

// Process a PDF with OpenAI
const result = await indexPdf("document.pdf", {
  apiKey: process.env.OPENAI_API_KEY,
  model: "gpt-4o",
});

console.log(result.structure);

// Or use the PageIndex class for more control
const pageIndex = new PageIndex({
  model: "gpt-4o",
  addNodeSummary: true,
  addDocDescription: true,
});

const pdfResult = await pageIndex.fromPdf("document.pdf");

// Process markdown
const mdResult = await mdToTree("document.md", {
  addNodeSummary: true,
  thinning: true,
  thinningThreshold: 5000,
});

Using LM Studio (Local LLMs)

import { PageIndex } from "bun-pageindex";

const pageIndex = new PageIndex({
  model: "local-model", // Your LM Studio model name
}).useLMStudio();

const result = await pageIndex.fromPdf("document.pdf");

Using Ollama

import { PageIndex } from "bun-pageindex";

const pageIndex = new PageIndex({
  model: "llama3",
}).useOllama();

const result = await pageIndex.fromPdf("document.pdf");

OCR Mode for Scanned PDFs

OCR mode converts PDF pages to images and uses a vision model (like GLM-OCR) to extract text, then processes with a reasoning model.

import { PageIndex, indexPdfWithOcr, indexPdfWithLMStudioOcr } from "bun-pageindex";

// Using OpenAI
const result = await indexPdfWithOcr("scanned-document.pdf", {
  apiKey: process.env.OPENAI_API_KEY,
  reasoningModel: "gpt-4o",
  ocrModel: "gpt-4o", // OpenAI vision model
});

// Using LM Studio with local models
const result = await indexPdfWithLMStudioOcr(
  "scanned-document.pdf",
  "qwen/qwen3-vl-30b",           // Reasoning model
  "mlx-community/GLM-OCR-bf16"   // OCR vision model
);

// Or use the PageIndex class directly
const pageIndex = new PageIndex({
  model: "qwen/qwen3-vl-30b",
  extractionMode: "ocr",
  ocrModel: "mlx-community/GLM-OCR-bf16",
  imageDpi: 150,
}).useLMStudio();

const result = await pageIndex.fromPdf("scanned-document.pdf");

CLI Usage

# Process a PDF
bun-pageindex --pdf document.pdf

# Process with LM Studio
bun-pageindex --pdf document.pdf --lmstudio --model llama3

# Process scanned PDF with OCR
bun-pageindex --pdf scanned.pdf --ocr --lmstudio --model qwen/qwen3-vl-30b

# Process markdown with options
bun-pageindex --md README.md --add-doc-description --thinning

# See all options
bun-pageindex --help

API Reference

PageIndex Class

const pageIndex = new PageIndex(options);

Options:

  • model: LLM model to use (default: "gpt-4o-2024-11-20")
  • apiKey: OpenAI API key (default: from OPENAI_API_KEY env var)
  • baseUrl: Custom API base URL (for LM Studio, Ollama, etc.)
  • tocCheckPageNum: Pages to check for TOC (default: 20)
  • maxPageNumEachNode: Max pages per node (default: 10)
  • maxTokenNumEachNode: Max tokens per node (default: 20000)
  • addNodeId: Add node IDs (default: true)
  • addNodeSummary: Generate summaries (default: true)
  • addDocDescription: Add document description (default: false)
  • addNodeText: Include raw text (default: false)

OCR Options:

  • extractionMode: "text" (default) or "ocr" for scanned PDFs
  • ocrModel: Vision model for OCR (default: "mlx-community/GLM-OCR-bf16")
  • ocrPromptType: "text", "formula", or "table" (default: "text")
  • imageDpi: DPI for PDF to image conversion (default: 150)
  • imageFormat: "png" or "jpeg" (default: "png")
  • ocrConcurrency: Concurrent OCR requests (default: 3)

Methods:

  • fromPdf(input): Process a PDF file or buffer
  • useLMStudio(): Configure for LM Studio
  • useOllama(): Configure for Ollama
  • useOcrMode(ocrModel?): Enable OCR mode
  • setBaseUrl(url): Set custom API base URL

mdToTree Function

const result = await mdToTree(path, options);

Additional Options:

  • thinning: Apply tree thinning (default: false)
  • thinningThreshold: Min tokens for thinning (default: 5000)
  • summaryTokenThreshold: Token threshold for summaries (default: 200)

Result Structure

interface PageIndexResult {
  docName: string;
  docDescription?: string;
  structure: TreeNode[];
}

interface TreeNode {
  title: string;
  nodeId?: string;
  startIndex?: number;
  endIndex?: number;
  summary?: string;
  prefixSummary?: string;
  text?: string;
  lineNum?: number;
  nodes?: TreeNode[];
}

Benchmarks

Run benchmarks comparing Bun vs Python implementations:

# Requires LM Studio running on localhost:1234
bun run benchmark

Development

# Install dependencies
bun install

# Run tests
bun test

# Build
bun run build

How It Works

PageIndex uses LLM reasoning to:

  1. Detect Table of Contents: Scans initial pages for TOC
  2. Extract Structure: Parses TOC or generates structure from content
  3. Map Page Numbers: Associates logical page numbers with physical pages
  4. Build Tree: Creates hierarchical tree structure
  5. Generate Summaries: Creates summaries for each node (optional)

This approach provides human-like document understanding without the limitations of vector-based retrieval.

OCR Mode (New in bun-pageindex)

For scanned PDFs, OCR mode adds an additional step:

  1. Convert PDF to Images: Uses Poppler to render each page as an image
  2. OCR Extraction: Uses a vision model (GLM-OCR) to extract text from images
  3. Standard Processing: Continues with the same reasoning-based indexing

This enables processing of scanned documents that the original Python PageIndex cannot handle.

Credits

This is a Bun/TypeScript port of PageIndex by VectifyAI.

License

MIT

Author

Antonio Oliveira [email protected] (oakoliver.com)