react-native-pageindex
v0.1.3
Published
Vectorless, reasoning-based RAG — builds a hierarchical tree index from PDF, DOCX, CSV, XLSX or Markdown using any LLM. React Native compatible.
Maintainers
Readme
react-native-pageindex
Vectorless, reasoning-based RAG — builds a hierarchical tree index from any document using any LLM provider. Works in React Native, Node.js, and the browser.
No vector database required. Instead of embeddings, the library uses the LLM to reason about document structure, producing a navigable tree that lets your AI answer questions with precise source attribution.
Demo
A fully interactive React demo app is included in the demo/ directory. It runs in the browser and showcases both index modes against two built-in datasets — no backend required.
Keyword mode — instant, no API key
CSV dataset (100 farmers · 14 columns) indexed in 0.0 s using TF-IDF scoring.

LLM mode — semantic tree via any LLM
32-page farming PDF with a TOC parsed into 31 nodes and 250 indexed terms in ~214 s using
gpt-4o-mini.

Chat mode — conversational AI over any indexed document
Ask natural-language questions and get cited answers backed by the reverse index. Multi-turn conversation with collapsible source references per reply.

How the demo is built
The demo is a Vite + React + TypeScript single-page app that wires react-native-pageindex directly in the browser. Below is a walkthrough of every layer.
1. Data sources (ConfigPanel.tsx)
Three mutually exclusive source modes are offered:
| Mode | What it loads |
|---|---|
| Sample CSV | farmer_dataset.csv — 100 rows, 14 columns (crop, state, soil type, risk…) |
| Sample PDF | crop_production_guide.pdf — 32-page farming guide generated with pdfkit, complete with a dot-leader TOC and 7 chapters |
| Upload | Any .pdf, .html, .md, .csv, or .txt file drag-dropped or file-picked by the user |
2. Index modes (App.tsx)
| Mode | Description | API key needed? |
|---|---|---|
| Keyword | Calls extractCsvPages → buildReverseIndex({ mode: 'keyword' }). Pure TF-IDF, zero LLM calls. | ❌ No |
| Full LLM | Full pipeline: extract → pageIndex / pageIndexMd → buildReverseIndex. LLM reasons about structure, generates summaries, and builds a semantic tree. | ✅ Yes |
3. The build pipeline (App.tsx — handleBuild)
// ── CSV / Keyword mode ────────────────────────────────────────────────────
import { extractCsvPages, buildReverseIndex } from 'react-native-pageindex';
const pages = await extractCsvPages(csvText, { rowsPerPage: 10 });
const result = { // flat PageIndexResult (no LLM)
doc_name: fileName,
structure: { children: pages.map((p, i) => ({ title: `Rows ${i*10+1}–${(i+1)*10}`, node_id: `g${i}`, start_index: i, end_index: i })) },
};
const index = await buildReverseIndex({ result, pages, options: { mode: 'keyword' } });
// ── PDF / LLM mode ────────────────────────────────────────────────────────
import { pageIndex, buildReverseIndex } from 'react-native-pageindex';
import { extractPdfPagesFromBuffer } from './demoExtractors'; // pdfjs-dist wrapper
const pages = await extractPdfPagesFromBuffer(arrayBuffer); // uses pdfjs-dist v5
const result = await pageIndex({
pages,
docName: 'Crop Production Guide',
llm, // passed from LLM config panel
options: {
onProgress: ({ step, percent, detail }) => setProgress({ step, percent, detail }),
},
});
const index = await buildReverseIndex({ result, pages, llm, options: { mode: 'keyword' } });
// ── HTML / Markdown / TXT / Upload mode ──────────────────────────────────
import { pageIndexMd, buildReverseIndex } from 'react-native-pageindex';
import { htmlToMarkdown } from './demoExtractors';
const markdown = fileType === 'html' ? htmlToMarkdown(rawText) : rawText;
const result = await pageIndexMd({
content: markdown,
docName: fileName,
llm,
options: { onProgress: setProgress },
});
const index = await buildReverseIndex({ result, llm, options: { mode: 'keyword' } });4. PDF extraction in the browser (demoExtractors.ts)
pdfjs-dist requires a Web Worker. In a Vite app the worker URL is resolved at build time using the ?url import suffix:
// demoExtractors.ts
import pdfjsWorkerSrc from 'pdfjs-dist/build/pdf.worker.min.mjs?url';
export async function extractPdfPagesFromBuffer(
buffer: ArrayBuffer,
): Promise<PageData[]> {
const pdfjsLib = await import('pdfjs-dist');
pdfjsLib.GlobalWorkerOptions.workerSrc = pdfjsWorkerSrc; // local, not CDN
const pdf = await pdfjsLib.getDocument({ data: new Uint8Array(buffer) }).promise;
const pages: PageData[] = [];
for (let i = 1; i <= pdf.numPages; i++) {
const pg = await pdf.getPage(i);
const content = await pg.getTextContent();
const text = content.items.map((item: any) => item.str).join(' ');
pages.push({ text, tokenCount: Math.ceil(text.length / 4) });
}
return pages;
}Why
?urland not a CDN link? Pointing pdfjs at an external CDN URL fails if the CDN is unreachable or CORS-blocked. The?urlimport makes Vite serve the worker file locally from the same dev-server / bundle.
5. LLM provider wiring (llm.ts)
The demo supports OpenAI and Anthropic out of the box. It bridges each provider's SDK into the LLMProvider interface that react-native-pageindex expects:
// llm.ts — OpenAI adapter (simplified)
import type { LLMProvider } from 'react-native-pageindex';
export function makeOpenAIProvider(apiKey: string, model: string): LLMProvider {
return async (prompt, opts) => {
const res = await fetch(`${OPENAI_BASE}/v1/chat/completions`, {
method: 'POST',
headers: { 'Content-Type': 'application/json', Authorization: `Bearer ${apiKey}` },
body: JSON.stringify({
model,
messages: [...(opts?.chatHistory ?? []), { role: 'user', content: prompt }],
}),
});
const data = await res.json();
if (!res.ok) throw new Error(`OpenAI ${res.status}: ${data.error?.message}`);
return { content: data.choices[0].message.content, finishReason: data.choices[0].finish_reason };
};
}OpenAI calls in the browser are routed through a Vite dev-server proxy (/llm-proxy/openai → https://api.openai.com) to avoid CORS errors. Anthropic supports the anthropic-dangerous-direct-browser-access: true header so no proxy is needed.
6. Progress tracking (ProgressDisplay.tsx)
pageIndex and pageIndexMd both fire onProgress callbacks after every named step. The demo displays a live progress bar and step label:
options: {
onProgress: ({ step, percent, detail }) => {
setProgress({ step, percent, detail }); // drives the progress bar in ProgressDisplay.tsx
},
}The PDF pipeline emits 13 named steps (Initializing → Extracting PDF pages → Scanning for TOC → … → Done); the Markdown pipeline emits 8 steps.
7. Reverse index & search (SearchPanel.tsx)
After the index is built, the demo calls buildReverseIndex then passes the result to searchReverseIndex on every keystroke:
import { searchReverseIndex } from 'react-native-pageindex';
const hits = searchReverseIndex(reverseIndex, query, 20);
// hits[0] = { nodeTitle, nodeId, score, matchedTerm, totalScore, pageRange, ... }Results are ranked by totalScore and each card shows the matched term, score, confidence level (High / Medium / Low), and the page range covered by that tree node.
8. Chat mode (ChatPanel.tsx)
After the index is built a 💬 Chat tab appears alongside Tree View, Search and Raw Pages. It implements a full multi-turn conversational loop over the indexed document:
- Retrieve —
searchReverseIndex(reverseIndex, question, 5)fetches the top-5 relevant tree nodes. - Build context — actual page text (or node summaries as fallback) from those nodes is injected into the prompt, capped to 1 200 chars per node.
- Chat history — the last 10 conversation turns are passed as
chatHistoryfor multi-turn continuity. - Stream answer — the configured
LLMProviderreturns the answer, which is displayed with collapsible citations (node title, node ID, page range, relevance score).
import { searchReverseIndex } from 'react-native-pageindex';
import type { LLMProvider, ReverseIndex, PageIndexResult, PageData } from 'react-native-pageindex';
async function chat(
question: string,
reverseIndex: ReverseIndex,
result: PageIndexResult,
pages: PageData[],
llm: LLMProvider,
chatHistory: { role: 'user' | 'assistant'; content: string }[] = [],
) {
// 1. Retrieve relevant nodes
const hits = searchReverseIndex(reverseIndex, question, 5);
// 2. Build grounded context from page text / node summaries
const contextParts = hits.map(hit => {
const pageRange = `${hit.startIndex ?? '?'}–${hit.endIndex ?? '?'}`;
const body = pages
.slice((hit.startIndex ?? 1) - 1, hit.endIndex ?? 1)
.map(p => p.text)
.join('\n')
.slice(0, 1200);
return `[${hit.nodeTitle} | pages ${pageRange}]\n${body || hit.summary}`;
});
const systemPrompt =
`You are a helpful assistant for "${result.doc_name}". ` +
`Use the provided sections as your primary source. ` +
`Always cite which section your answer comes from.`;
const userTurn =
`Relevant sections:\n\n${contextParts.join('\n\n---\n\n')}` +
`\n\nQuestion: ${question}`;
// 3. Call LLM with chat history
const response = await llm(userTurn, {
chatHistory: [
{ role: 'system' as any, content: systemPrompt },
...chatHistory,
],
});
// 4. Return answer + citation metadata
return {
answer: response.content,
citations: hits.map(h => ({
title: h.nodeTitle,
nodeId: h.nodeId,
pages: `${h.startIndex}–${h.endIndex}`,
score: Math.round(h.totalScore * 100),
})),
};
}Chat requires an LLM provider. In Keyword mode the Chat tab is visible but the
llmref isnull— configure an API key in the sidebar and switch to Full LLM mode before building the index to enable chat.
9. Running the demo locally
git clone https://github.com/subham11/react-native-pageindex.git
cd react-native-pageindex/demo
npm install
npm run dev # → http://localhost:5173Select Sample CSV → Keyword for an instant zero-API-key demo, or select Sample PDF → Full LLM, enter an OpenAI or Anthropic key, and click Build LLM Index to see the full semantic-tree pipeline in action. Once the index is built, switch to the 💬 Chat tab to start a conversation with the document.
Features
| Feature | Detail |
|---|---|
| Multi-format | PDF, Word (.docx), CSV, Spreadsheet (.xlsx/.xls), Markdown |
| Forward index | Hierarchical tree: chapters → sections → subsections |
| Reverse index | Inverted index: term → node locations for fast lookup |
| Conversational chat | Multi-turn Q&A with cited answers, backed by the reverse index |
| Provider-agnostic | Pass any LLM (OpenAI, Anthropic, Ollama, Gemini…) |
| Progress tracking | Fine-grained per-step callbacks (13 PDF steps, 8 MD steps) |
| Fully typed | 100% TypeScript, .d.ts declarations included |
| Optional deps | pdfjs-dist / mammoth / xlsx are opt-in; CSV & MD have zero deps |
Installation
npm install react-native-pageindexOptional format dependencies
Install only what you need:
# PDF support
npm install pdfjs-dist
# Word .docx support
npm install mammoth
# Excel / spreadsheet support
npm install xlsxQuick Start
1. Wire up your LLM provider
import OpenAI from 'openai';
import { LLMProvider } from 'react-native-pageindex';
const openai = new OpenAI({ apiKey: process.env.OPENAI_API_KEY });
const llm: LLMProvider = async (prompt, opts) => {
const res = await openai.chat.completions.create({
model: 'gpt-4o',
messages: [
...(opts?.chatHistory ?? []),
{ role: 'user', content: prompt },
],
});
return {
content: res.choices[0].message.content ?? '',
finishReason: res.choices[0].finish_reason ?? 'stop',
};
};2. Index any document
import { pageIndexDocument } from 'react-native-pageindex';
import { readFileSync } from 'fs';
// Works with PDF, DOCX, XLSX, CSV, or Markdown
const data = readFileSync('report.pdf');
const result = await pageIndexDocument({
data,
fileName: 'report.pdf', // used to auto-detect format
docName: 'Annual Report 2024',
llm,
options: {
onProgress: ({ step, percent, detail }) =>
console.log(`[${percent}%] ${step}${detail ? ` — ${detail}` : ''}`),
},
});
console.log(result.structure); // hierarchical tree3. Build a reverse index for fast search
import { buildReverseIndex, searchReverseIndex } from 'react-native-pageindex';
const reverseIndex = await buildReverseIndex({
result, // forward index from pageIndexDocument()
options: {
mode: 'keyword', // 'keyword' (fast, no LLM) | 'llm' (semantic)
},
});
const hits = searchReverseIndex(reverseIndex, 'revenue growth', 5);
// hits[0] = { nodeTitle, nodeId, score, matchedTerm, totalScore, ... }API
pageIndexDocument(input) — Unified entrypoint
Accepts any supported file format and returns a hierarchical PageIndexResult.
interface PageIndexDocumentInput {
data?: ArrayBuffer | Uint8Array | string; // binary for PDF/DOCX/XLSX; string for CSV/MD
text?: string; // convenience alias for Markdown / CSV
fileType?: 'pdf' | 'docx' | 'csv' | 'xlsx' | 'md'; // inferred from fileName if omitted
fileName?: string;
docName?: string;
llm: LLMProvider;
options?: PageIndexDocumentOptions;
}PageIndexDocumentOptions:
| Option | Type | Default | Description |
|---|---|---|---|
| onProgress | ProgressCallback | — | Per-step progress updates |
| pdfOptions | PageIndexOptions | — | Forwarded to the PDF pipeline |
| mdOptions | MdPageIndexOptions | — | Forwarded to the Markdown pipeline |
| csvOptions | CsvParseOptions | — | CSV row-grouping & delimiter options |
| xlsxOptions | XlsxParseOptions | — | XLSX sheet selection & row-grouping |
| tokenCounter | TokenCounter | ~4 chars/token | Custom tokeniser |
pageIndex(input) — PDF pipeline (direct)
Use when you already have extracted pages or want PDF-specific options.
import { pageIndex, extractPdfPages } from 'react-native-pageindex';
const pages = await extractPdfPages(pdfBuffer); // requires pdfjs-dist
const result = await pageIndex({ pages, llm, docName: 'Report' });PageIndexOptions:
| Option | Default | Description |
|---|---|---|
| tocCheckPageNum | 20 | Pages to scan for table of contents |
| maxPageNumEachNode | 10 | Max pages per tree node |
| maxTokenNumEachNode | 20000 | Max tokens per tree node |
| ifAddNodeId | true | Attach unique IDs to each node |
| ifAddNodeSummary | true | LLM-generated summary per node |
| ifAddDocDescription | false | Generate overall document description |
| ifAddNodeText | false | Attach raw page text to nodes |
pageIndexMd(input) — Markdown pipeline (direct)
import { pageIndexMd } from 'react-native-pageindex';
const result = await pageIndexMd({
content: markdownString,
docName: 'README',
llm,
options: { ifThinning: true, minTokenThreshold: 3000 },
});MdPageIndexOptions:
| Option | Default | Description |
|---|---|---|
| ifThinning | false | Merge small sections below threshold |
| minTokenThreshold | 5000 | Min tokens before thinning kicks in |
| ifAddNodeSummary | true | LLM-generated summary per node |
| summaryTokenThreshold | 200 | Only summarise nodes above this size |
| ifAddDocDescription | false | Generate overall document description |
| ifAddNodeText | false | Attach raw section text to nodes |
buildReverseIndex(input) — Inverted index
const reverseIndex = await buildReverseIndex({
result, // PageIndexResult
pages?, // original PageData[] (optional enrichment)
llm?, // required only for mode: 'llm'
options?: {
mode: 'keyword' | 'llm', // default: 'keyword'
minTermLength: number, // default: 3
maxTermsPerNode: number, // default: 10
onProgress: ProgressCallback,
},
});searchReverseIndex(index, query, topK?) — Query the index
const results = searchReverseIndex(reverseIndex, 'machine learning', 10);
// SearchResult[]
results.forEach(r => {
console.log(r.nodeTitle, r.totalScore, r.matchedTerm);
});Format parsers (lower-level)
import {
extractPdfPages, // requires pdfjs-dist
extractDocxPages, // requires mammoth
extractCsvPages, // no deps
extractXlsxPages, // requires xlsx
} from 'react-native-pageindex';
// All return: Promise<PageData[]>
// PageData = { text: string; tokenCount: number }Key Types
// LLM provider — wire up any AI
type LLMProvider = (
prompt: string,
options?: { chatHistory?: LLMMessage[] }
) => Promise<{ content: string; finishReason: string }>;
// Progress tracking
type ProgressCallback = (info: {
step: string;
percent: number;
detail?: string;
}) => void;
// Forward index result
interface PageIndexResult {
structure: TreeNode; // root of the hierarchy
doc_name: string;
description?: string;
}
// Tree node
interface TreeNode {
title?: string;
node_id?: string;
summary?: string;
start_index?: number;
end_index?: number;
children?: TreeNode[];
[key: string]: unknown;
}
// Reverse index search result
interface SearchResult extends ReverseIndexEntry {
matchedTerm: string;
totalScore: number;
}Progress Tracking
Both pipelines emit fine-grained progress events:
options: {
onProgress: ({ step, percent, detail }) => {
// PDF pipeline steps (0–100%):
// Initializing → Extracting PDF pages → Scanning for table of contents
// → Transforming TOC → Mapping page numbers → Building tree
// → Verifying TOC → Fixing inaccuracies → Resolving large sections
// → Attaching page text → Generating node summaries
// → Generating document description → Done
// Markdown pipeline steps:
// Initializing → Parsing headings → Extracting section text
// → Optimizing tree → Building tree → Generating summaries
// → Generating description → Done
updateProgressBar(percent);
setStatusText(`${step}${detail ? ': ' + detail : ''}`);
},
}LLM Provider Examples
Anthropic Claude
import Anthropic from '@anthropic-ai/sdk';
const client = new Anthropic();
const llm: LLMProvider = async (prompt) => {
const msg = await client.messages.create({
model: 'claude-opus-4-5',
max_tokens: 4096,
messages: [{ role: 'user', content: prompt }],
});
const block = msg.content[0];
return {
content: block.type === 'text' ? block.text : '',
finishReason: msg.stop_reason ?? 'stop',
};
};Ollama (local)
const llm: LLMProvider = async (prompt) => {
const res = await fetch('http://localhost:11434/api/generate', {
method: 'POST',
headers: { 'Content-Type': 'application/json' },
body: JSON.stringify({ model: 'llama3', prompt, stream: false }),
});
const data = await res.json();
return { content: data.response, finishReason: 'stop' };
};Google Gemini
import { GoogleGenerativeAI } from '@google/generative-ai';
const genAI = new GoogleGenerativeAI(process.env.GEMINI_API_KEY!);
const model = genAI.getGenerativeModel({ model: 'gemini-1.5-pro' });
const llm: LLMProvider = async (prompt) => {
const result = await model.generateContent(prompt);
return {
content: result.response.text(),
finishReason: 'stop',
};
};Conversational Chat Mode
Once you have a PageIndexResult and a ReverseIndex you can add a full multi-turn chat interface to your app. The pattern is:
User question
→ searchReverseIndex() ← retrieves the most relevant tree nodes
→ build grounded context ← page text / node summaries (no embeddings)
→ LLMProvider() ← any provider, with chat history
→ cited answer ← response + source metadataMinimal example
import {
pageIndex,
buildReverseIndex,
searchReverseIndex,
} from 'react-native-pageindex';
// 1. Build the forward index (once per document)
const result = await pageIndex({ pages, llm, docName: 'My Docs' });
// 2. Build the reverse index (once per document)
const reverseIndex = await buildReverseIndex({ result, pages, options: { mode: 'keyword' } });
// 3. Chat loop
const history: { role: 'user' | 'assistant'; content: string }[] = [];
async function ask(question: string) {
// Retrieve top-5 relevant nodes
const hits = searchReverseIndex(reverseIndex, question, 5);
// Build grounded context
const context = hits
.map(h => `[${h.nodeTitle}]\n${h.summary ?? ''}`)
.join('\n\n---\n\n');
const userTurn = `Context:\n${context}\n\nQuestion: ${question}`;
// Call LLM with running history
const { content } = await llm(userTurn, { chatHistory: history });
// Update history for next turn
history.push({ role: 'user', content: question });
history.push({ role: 'assistant', content });
return {
answer: content,
sources: hits.map(h => ({ title: h.nodeTitle, pages: `${h.startIndex}–${h.endIndex}` })),
};
}
// Usage
const { answer, sources } = await ask('Best season to grow paddy in Odisha?');
console.log(answer);
// → "According to the 'Rice (Paddy) Cultivation' section, rice is primarily
// a kharif crop … the best season is during the kharif / monsoon period."
console.log(sources);
// → [{ title: 'Rice (Paddy) Cultivation', pages: '12–15' }, ...]Chat in the browser demo
The demo app's 💬 Chat tab is a fully-featured implementation built on top of the pattern above:
- Multi-turn — up to 10 previous messages are sent as
chatHistory, preserving conversational context. - Cited answers — each response includes expandable source cards with node title, node ID, page range and relevance score (0–100).
- Grounded context — actual page text is preferred over summaries; each node's contribution is capped at 1 200 chars to stay within token budgets.
- Keyboard shortcuts — Enter to send, Shift+Enter for a newline.
- LLM providers — OpenAI (via Vite dev-server proxy), Anthropic (direct), or Ollama (local) — configured in the sidebar before building the index.
Tip: For best chat quality, build the index in Full LLM mode (not Keyword mode) so each node has a rich LLM-generated summary the chat can draw on when no page text is available.
React Native Usage
// Use RNFS or fetch to get file bytes
import RNFS from 'react-native-fs';
import { pageIndexDocument } from 'react-native-pageindex';
const base64 = await RNFS.readFile(filePath, 'base64');
const bytes = Uint8Array.from(atob(base64), c => c.charCodeAt(0));
const result = await pageIndexDocument({
data: bytes,
fileName: 'document.pdf',
llm,
options: { onProgress: setProgress },
});Note: pdfjs-dist has a web worker that may need special Metro configuration. Alternatively, pass pre-extracted
pages: PageData[]directly topageIndex()to skip pdfjs entirely.
Versioning
This package follows Semantic Versioning:
- Patch (
0.1.x) — bug fixes, no API changes - Minor (
0.x.0) — new features, backward compatible - Major (
x.0.0) — breaking changes to the public API
License
MIT © subham11
