redhop

v0.4.0

Published

2 days ago

Reasoning-aware context runtime for RAG — chunk, retrieve, and allocate document context, with citations and a Decision Report. No vector DB, in-process.

vysakh0

rag retrieval llm context nlp embeddings

redhop (Node.js)

Reasoning-aware context runtime for RAG: hand it a document and a question, get back the context the model should actually see, with citations and a Decision Report. No vector database, no LLM, in-process. A native addon (napi-rs) over the RedHop Rust core. The embedding engine and document parsers are bundled (no extra deps).

Get started in 60 seconds

npm install redhop

const { Document } = require("redhop");

const doc = Document.fromFile("contract.pdf");          // parses + chunks + indexes
const ctx = doc.context("What is the governing law?");  // retrieves + assembles

llm.generate(ctx.text);          // feed any provider — no lock-in
for (const c of ctx.citations) { // where the answer's context came from
  console.log(c.source, c.page, c.heading); // e.g. contract.pdf 3 null
}
console.log(ctx.report.rendered); // the Decision Report — what it kept, and why

That's it.

How it compares

Measured on identical documents, budgets, and BM25 retrieval, RedHop beats both frameworks on multi-hop evidence retention (80% vs LangChain 71%, LlamaIndex 72%) and beats LangChain on contracts (82% vs 73%). On CUAD's raw-template query, LlamaIndex leads by 4 points (86% vs RedHop 82% at ≥0.8 retention).

Honest fair-preprocessing result (bench/compare.py, n=300, 2026-06-08): applying Stripper(boilerplate) to every system's query lifts everyone: LlamaIndex 86% → 94%, RedHop 82% → 88%, LangChain 73% → 79%. LlamaIndex actually benefits more from the same Stripper than RedHop does. RedHop reaches 90.7% by additionally layering a hand-authored 34-key clause-name Vocabulary on top, but that recipe was not applied to LlamaIndex. The +4.7 framing previously reported here compares RedHop-with-recipe against LlamaIndex-default, so it is not a like-for-like comparison.

RedHop's clearer architectural lead is multi-hop retention, replicated on two datasets at n=300. HotpotQA: ≥0.8 retention 80% vs LlamaIndex 72%, LangChain 71% (+8). MuSiQue: ≥0.8 retention 22% vs LlamaIndex 17%, LangChain 19% (+3 to +5). Compositional multi-hop is harder, so the magnitude shrinks, but the lead holds at the ≥0.8 threshold. raw_topk matches reasoning_preserving on both, so the edge comes from RedHop's chunking + BM25 defaults rather than the assembly strategy.
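
The "≥0.8 retention" figures can be read as a share of queries. The sketch below assumes (the definition is not spelled out in this section) that the metric is the fraction of queries whose retained-evidence ratio is at least the threshold. `shareAtOrAbove` is a hypothetical helper, not part of the redhop API, and the ratios are made up.

```javascript
// Hypothetical helper: share of queries whose retained-evidence ratio
// meets a threshold. This is an assumed reading of ">=0.8 retention",
// not RedHop's benchmark code.
function shareAtOrAbove(ratios, threshold) {
  if (ratios.length === 0) return 0;
  const hits = ratios.filter((r) => r >= threshold).length;
  return hits / ratios.length;
}

// Made-up per-query ratios, for illustration only.
const sampleRatios = [1.0, 0.8, 0.5, 0.9, 0.0];
console.log(shareAtOrAbove(sampleRatios, 0.8)); // 0.6
console.log(shareAtOrAbove(sampleRatios, 0.5)); // 0.8
```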

Push multi-hop further with { retrieval: "hybrid" }: measured +12 ≥0.8 on HotpotQA (71% → 83%) and +8 ≥0.5 on MuSiQue (66% → 74%) at n=100, at ~90-120× per-query latency (3ms → 250-400ms). Stripper and candidate_k tuning don't help on multi-hop: only dense rerank pierces the lexical-vs-semantic gap on bridge passages.

Apples-to-apples hybrid vs LangChain/LlamaIndex (same bge-small, n=100, post pure-rerank fix): HotpotQA: RedHop hybrid wins (81% ≥0.8 vs LangChain 77%, LlamaIndex 67%). MuSiQue: LangChain leads narrowly (39% vs RedHop 34%, LlamaIndex 31%). The 0.3.1 audit traced the MuSiQue gap to RedHop's RRF fusion burying bridge passages with low BM25 + high dense rank. This release switches the default to pure rerank. Net: HotpotQA −2, MuSiQue +8 (close to predicted +10). Latency profile (2-5× slower than competitors' hybrid) is a separate open item. See MULTIHOP_HYBRID_COMPETITORS.md and MULTIHOP_CONSTANT_CHUNKING.md.
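
To see why rank fusion can bury a bridge passage, here is a small sketch comparing reciprocal rank fusion with pure rerank on made-up ranks. It is an illustration of the general technique, not RedHop's code; the constant k = 60 is a commonly used RRF value and is an assumption here.

```javascript
// Illustrative only: made-up ranks. k = 60 is a common RRF constant,
// assumed here rather than taken from RedHop.
const candidates = [
  { id: "a", bm25Rank: 1, denseRank: 3 },
  { id: "b", bm25Rank: 2, denseRank: 2 },
  { id: "bridge", bm25Rank: 40, denseRank: 1 }, // weak lexical, strong dense
];

// Reciprocal rank fusion: sum of 1 / (k + rank) over both rankings.
function rrfOrder(items, k = 60) {
  const score = (c) => 1 / (k + c.bm25Rank) + 1 / (k + c.denseRank);
  return [...items].sort((x, y) => score(y) - score(x)).map((c) => c.id);
}

// Pure rerank: BM25 only selects the candidate pool; dense rank orders it.
function pureRerankOrder(items) {
  return [...items].sort((x, y) => x.denseRank - y.denseRank).map((c) => c.id);
}

console.log(rrfOrder(candidates));        // [ 'a', 'b', 'bridge' ]
console.log(pureRerankOrder(candidates)); // [ 'bridge', 'b', 'a' ]
```

Under fusion the bridge passage's poor BM25 rank drags it to the bottom even though the dense model ranks it first; under pure rerank it surfaces at the top.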

What RedHop's CUAD recipe offers is a reproducible, in-process, audited path from 82% → 87.7% → 90.7% using Stripper + Vocabulary with a Decision Report. The primitives are reusable on any templated workload. See CUAD_CLAUSE_EXPANSION.md, MUSIQUE_MULTIHOP.md, and MULTIHOP_HYBRID.md.

Methodology + raw runs: FRAMEWORK_COMPARISON.md · framework_comparison_2026-06-06.txt.

How it works

Five stages: you bring documents and a query, RedHop owns parsing, chunking, retrieval, and context allocation, and you get a BuiltContext with the assembled prompt, citations, and a Decision Report. Each stage has an evidence-backed default that traces to a finding in docs/findings/.

The Decision Report

ctx.report.rendered carries the human-readable Decision Report text. Individual fields (autoDecision, totalTokens, retainedEvidenceRatio, etc.) are on ctx.report directly. Document.analyze(query) returns the same Report shape without paying assembly cost. When retrieval looks weak, ctx.report.diagnosis lists the query terms that appear nowhere in the corpus and fires bounded hints (vocab mismatch, templated boilerplate, polysemy) with a link to the measured finding behind each one. Example: 12_diagnosis.cjs.

Already running retrieval somewhere else?

Point the same diagnostics at your existing LangChain.js / LlamaIndex.ts / hand-rolled pipeline without migrating. Three calls, no behavior change:

const { Chunk, analyzeContext, summarizeDiagnoses } = require("redhop");

// Hand the candidates your retriever returned to RedHop for diagnosis.
const texts = await yourRetrieve(query);
const chunks = texts.map((t, i) => new Chunk(t, { id: String(i), source: "external" }));
const report = analyzeContext(query, chunks);

// Aggregate across a workload: one focus recommendation, citing the finding.
const reports = await Promise.all(productionQueries.map(async (q) =>
  analyzeContext(q, (await yourRetrieve(q)).map((t, i) => new Chunk(t, { id: String(i), source: "external" })))
));
const summary = summarizeDiagnoses(reports);
console.log(summary.rendered);

Walk-through: docs/DIAGNOSE_YOUR_PIPELINE.md. Example: 13_workload_audit.cjs. OTel / Langfuse attribute mapping snippet is on the docs page.

Show your work: query rewrites with an audit trail

Every transformation between the raw query and what BM25 actually saw is recorded on the same Decision Report. Compile a Stripper (boilerplate removal), a Vocabulary (workload-curated synonyms), or both, run them as a chain via doc.contextWithRewrites(...), and the per-stage records land on ctx.report.queryRewrites:

const stripper = new Stripper(["highlight", "the", "parts", "of", "this", "contract"]);
const vocab    = new Vocabulary({ "change of control": ["merger", "successor", "acquisition"] });

const ctx = doc.contextWithRewrites(query, [stripper, vocab]);

for (const rec of ctx.report.queryRewrites) {
  console.log(rec.stage, "matched=", rec.matched, "added=", rec.added);
}
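
For intuition about what the two rewrite stages do to a query, here is a dependency-free sketch of a token-level strip and a synonym expansion. It is not RedHop's implementation: `tokenizeQuery`, `stripTokens`, and `expandTokens` are hypothetical names, and the matching rule (a key matches when its tokens appear contiguously) is an assumption.

```javascript
// Illustrative sketch, not RedHop's implementation.
const tokenizeQuery = (text) =>
  text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Token-level strip: drops whole tokens only, so stripping "of"
// leaves "office" untouched.
function stripTokens(query, boilerplate) {
  const drop = new Set(boilerplate.map((w) => w.toLowerCase()));
  return tokenizeQuery(query).filter((t) => !drop.has(t));
}

// Append synonyms when all tokens of a key appear contiguously (assumed rule).
function expandTokens(tokens, vocab) {
  const padded = ` ${tokens.join(" ")} `;
  const added = [];
  for (const [key, synonyms] of Object.entries(vocab)) {
    if (padded.includes(` ${tokenizeQuery(key).join(" ")} `)) added.push(...synonyms);
  }
  return { tokens: [...tokens, ...added], added };
}

const strippedTokens = stripTokens(
  "Highlight the parts of this contract related to the office lease",
  ["highlight", "the", "parts", "of", "this", "contract"]
);
console.log(strippedTokens); // [ 'related', 'to', 'office', 'lease' ]

const expanded = expandTokens(tokenizeQuery("change of control provisions"), {
  "change of control": ["merger", "successor", "acquisition"],
});
console.log(expanded.added); // [ 'merger', 'successor', 'acquisition' ]
```

In this sketch, stripping "of" before expansion would break the "change of control" key. How the stages interact inside RedHop is not covered here, so inspect ctx.report.queryRewrites to confirm what matched.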

The same Vocabulary works chunk-side at ingest via vocab.enrich(chunkText). It lifts retrieval +0.19 mean recall on schema-style corpora (SPIDER_ENRICH), and is measured to hurt (−2.0pt) on long prose chunks (CUAD_ENRICH_DEFINITIONS_NULL). A/B with redhop.evaluate(...) to confirm before adopting.

Score the change: deterministic, or LLM-judged when you need it

Two modes. Use deterministic in CI on every PR. Opt into a judge when you want faithfulness / relevancy / correctness against generated answers.

Deterministic: no API calls, ~ms per query. Returns contextRecall / contextPrecision / answerTokenRecall / faithfulnessLexical / relevancyLexical / correctnessLexical + a composite overall. Same primitives the Decision Report uses.

const { evaluate } = require("redhop");
const ctxA = doc.context(userQuery);
const ctxB = doc.contextWithRewrites(userQuery, [stripper, vocab]);
const evalA = evaluate(userQuery, ctxA, { goldChunks });
const evalB = evaluate(userQuery, ctxB, { goldChunks });
console.log("lift on overall:", evalB.overall - evalA.overall);
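
As a rough mental model for the recall and precision numbers in an A/B like this, the sketch below uses the textbook set-based definitions over chunk ids. RedHop's contextRecall / contextPrecision may be computed differently; `recallPrecision` and the ids are hypothetical.

```javascript
// Textbook set-based definitions over chunk ids. Assumption: RedHop's
// contextRecall / contextPrecision may differ in detail.
function recallPrecision(selectedIds, goldIds) {
  const gold = new Set(goldIds);
  const selected = new Set(selectedIds);
  let hits = 0;
  for (const id of selected) if (gold.has(id)) hits += 1;
  return {
    recall: gold.size === 0 ? 0 : hits / gold.size,
    precision: selected.size === 0 ? 0 : hits / selected.size,
  };
}

// Made-up selections for a baseline (A) and a rewritten query (B).
const scoreA = recallPrecision(["c1", "c4", "c7", "c9"], ["c1", "c2"]);
const scoreB = recallPrecision(["c1", "c2", "c7", "c9"], ["c1", "c2"]);
console.log(scoreA); // { recall: 0.5, precision: 0.25 }
console.log(scoreB); // { recall: 1, precision: 0.5 }
console.log("lift on recall:", scoreB.recall - scoreA.recall); // lift on recall: 0.5
```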

LLM-judged: via the async evaluateWithJudge. Supply your own LLM caller (OpenAI, Anthropic, OpenRouter, local). Adds faithfulnessJudged / relevancyJudged / correctnessJudged. Claim-decomposed faithfulness (decomposeFaithfulness: true) is substantively equivalent to Ragas: r=+0.664, MAE=0.151 on n=200 HotpotQA, see COMPARISON_RAGAS. TP/FP/FN F₁ via decomposeCorrectness: true.

const { Judge, evaluateWithJudge, critique } = require("redhop");
const judge = Judge.fromCallable(async (err, prompt, system) => {
  // Your LLM SDK call — return a number or { score: number }
  return await myLlm({ prompt, system });
}, "openai-mini").cached();
const report = await evaluateWithJudge(userQuery, ctx, judge, {
  answer: "The refund window is thirty days.",
  goldAnswer: "thirty days",
  decomposeFaithfulness: true,
  decomposeCorrectness: true,
});

For user-defined aspects (harmfulness, conciseness, brand voice…), critique(answer, aspects, judge) runs one judge call per aspect with polarity-corrected scores. Aggregate a test set via summarize(reports): same shape as Python's redhop.summarize, returns means + medians + per-metric subset counters.
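
The aggregation idea (means, medians, and per-metric subset counts) can be sketched without the library. The output names meanOverall, meanFaithfulnessJudged, and nWithFaithfulnessJudged follow names mentioned in this README; medianOverall, the helper names, and the null-for-empty convention are assumptions, and this is not the real summarize.

```javascript
// Sketch of test-set aggregation, not redhop's summarize.
const meanOf = (xs) =>
  xs.length ? xs.reduce((s, x) => s + x, 0) / xs.length : null;

function medianOf(xs) {
  if (xs.length === 0) return null;
  const s = [...xs].sort((p, q) => p - q);
  const mid = Math.floor(s.length / 2);
  return s.length % 2 ? s[mid] : (s[mid - 1] + s[mid]) / 2;
}

function summarizeSketch(reports) {
  const overall = reports.map((r) => r.overall);
  // Judged metrics exist only on reports that went through a judge,
  // so the mean is taken over that subset and the subset size is reported.
  const judged = reports
    .map((r) => r.faithfulnessJudged)
    .filter((v) => typeof v === "number");
  return {
    meanOverall: meanOf(overall),
    medianOverall: medianOf(overall), // hypothetical field name
    meanFaithfulnessJudged: meanOf(judged),
    nWithFaithfulnessJudged: judged.length,
  };
}

// Made-up reports: only two of three carry a judged score.
const summarySketch = summarizeSketch([
  { overall: 0.5, faithfulnessJudged: 1 },
  { overall: 0.75 },
  { overall: 1, faithfulnessJudged: 0.5 },
]);
console.log(summarySketch);
```

Reporting the subset count next to each judged mean keeps a mean over two reports from being read as a mean over the whole test set.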

Full API + field list: ANSWER_QUALITY_EVAL.

Loaders

Document.fromText(text, options?)
Document.fromChunks([new Chunk(text, { source, id, metadata }), ...], options?)
Document.fromFile(path, options?)                 // PDF/DOCX/PPTX/XLSX + text/code
Document.fromBytes(buffer, "key.pdf", options?)   // S3 / GCS / Azure / HTTP / DB blobs
Document.fromFolder(path, folderOptions?)         // one combined index over a dir

fromFolder honors .gitignore and accepts extra ignore globs:

Document.fromFolder("./repo", { recursive: true, gitignore: true,
  ignore: ["*.lock", "tests/**"], options: { retrieval: "hybrid", model: "bge-small" } });

Retrieval: start with the default

Start at the lexical default (it handles most document QA because the words in the question are usually the words in the answer) and climb only when the failure shape calls for it.

// Default — most docs (code, API refs, runbooks, financial reports, handbooks)
Document.fromFile("contract.pdf").context("What is the governing law?");

// Structured docs with parallel clauses (regional overrides, per-region sub-sections):
Document.fromFile("msa.pdf", { retrieval: "hybrid", model: "bge-small" })
  .context("What law applies in the UK?", undefined, 1, true);  // neighbors=1, includeHeading=true

// Synonym-mismatch corpora (HR FAQs, support tickets where users phrase
// things very differently from the docs). Cross-encoder adds 5–10× latency
// — verify it helps on your corpus before enabling.
Document.fromFile("support.md",
  { retrieval: "hybrid", model: "bge-small", rerank: "cross-encoder" });

options.retrieval is "lexical" (default), "hybrid" (BM25 → dense rerank), or "semantic" (dense over every chunk). Dense tiers download a small model named by options.model ("bge-small" / "bge-base"). The 60-second decision guide: CHOOSING_A_CONFIG.

Non-English content

Default is a minimal analyzer (tokenize + lowercase + ASCII fold, no stemmer), measured to beat English Snowball on every English workload we tested (RAW_ANALYZER). Swap with options.language: "english" for code search / inflection-heavy English content, or any of the 18 Snowball Porter2 languages (arabic, danish, dutch, english, finnish, french, german, greek, hungarian, italian, norwegian, portuguese, romanian, russian, spanish, swedish, tamil, turkish):

const doc = Document.fromText(germanText, { language: "german" });
// Now `Buch` finds chunks containing `Bücher` (and vice versa)

One analyzer drives both BM25 retrieval AND the grounding scorer, so they can't drift on what "the same term" means. Unknown names throw (we don't silently fall back to English). See the language guide for the full breakdown and a calibration disclaimer.
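
A minimal sketch of what a "tokenize + lowercase + ASCII fold, no stemmer" analyzer does, and why a stemmer is needed for the Buch / Bücher case. This is not RedHop's analyzer: `minimalAnalyze` is a hypothetical name, the folding rule (drop combining marks after NFKD normalization) is an assumption, and the sketch only handles Latin-script text.

```javascript
// Sketch of a minimal analyzer: tokenize + lowercase + ASCII fold.
// Assumed folding rule: NFKD-normalize, then drop combining diacritics.
// Latin-script only; non-Latin text would be discarded by the split.
function minimalAnalyze(text) {
  return text
    .normalize("NFKD")
    .replace(/[\u0300-\u036f]/g, "") // strip combining diacritics
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);
}

console.log(minimalAnalyze("Café Bücher, RÉSUMÉ!")); // [ 'cafe', 'bucher', 'resume' ]

// Without a stemmer, "Buch" and "Bücher" remain distinct terms,
// which is the gap that language: "german" is described as closing.
console.log(minimalAnalyze("Buch")[0] === minimalAnalyze("Bücher")[0]); // false
```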

The result

context(query, budget?, neighbors?, includeHeading?) returns:

text: the assembled prompt string
chunks: the selected chunk texts, in order
citations: { source, page, heading, line, text } per chunk (null/absent fields where a format doesn't provide them)
report: the Decision Report, with the same field surface as Python's ctx.report. Read strategy / requestedStrategy for the resolved vs requested allocation, autoDecision for the Auto gate's verdict ("passthrough" | "prune" | "not_auto"), inputTokens / tokenBudget / totalTokens / tokenUtilization for budget accounting, nInputChunks / nSelected / nExpanded for chunk counts, inputDistractorRatio / retainedEvidenceRatio / evidenceDensity / distractorRatio / estimatedWasteTokens for context economics, secondHopRescues (or its longer alias secondHopRescueCount) and reasoningPreservationDelta for the reasoning-preserving accounting, lowConfidenceRetrieval / lowConfidenceThreshold for the "did anything actually match" signal, and rendered for the human-readable Decision Report string. The full shape is in index.d.ts.

neighbors / includeHeading turn on structural context expansion (adjacent chunks / section headings, in document order).
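
Neighbor expansion can be pictured as widening each selected chunk index by a fixed radius and then restoring document order. The helper below is a hypothetical sketch of that idea under those assumptions, not RedHop's expansion logic.

```javascript
// Sketch: expand selected chunk indices by `neighbors` on each side,
// clamp to the document bounds, dedupe, and return in document order.
function expandNeighbors(selected, neighbors, totalChunks) {
  const out = new Set();
  for (const i of selected) {
    const lo = Math.max(0, i - neighbors);
    const hi = Math.min(totalChunks - 1, i + neighbors);
    for (let j = lo; j <= hi; j += 1) out.add(j);
  }
  return [...out].sort((p, q) => p - q);
}

console.log(expandNeighbors([7, 3], 1, 9)); // [ 2, 3, 4, 6, 7, 8 ]
console.log(expandNeighbors([0, 1], 1, 3)); // [ 0, 1, 2 ]
```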

analyze(query) is the same retrieve + score pass without assembling the prompt, useful for auditing what RedHop would do before paying assembly cost. Returns just the report.

Standalone observability primitives (the same scoring the strategies use internally, exposed so external code never has to reimplement and drift):

const { groundingScore, linkStrength } = require("redhop");

groundingScore("refund window", chunkText);   // → number in [0, 1]
linkStrength(chunkA, chunkB);                  // → number in [0, 1]

Both use the default English analyzer. Non-English content reaches the configured analyzer through Document.context(...).report instead.

fromFolder exposes two more getters: doc.nFiles (count of indexed files) and doc.skippedFiles ({ source, reason }[], files that couldn't be parsed: unsupported formats, unreadable bytes, scanned PDFs without OCR, etc.). Single-source constructors default to nFiles=1 and skippedFiles=[].

Templated workloads: the +9 retention lift (BM25, no model needed)

If every query in your workload follows a fixed template, such as legal QA ("Highlight the parts (if any) of this contract related to X. Details: …"), support-ticket triage ("Help me with X, my account is Y, the error is Z"), or form-filled queries from a structured UI, BM25 weights every query term by corpus IDF, not by how often the term repeats across your query set. The boilerplate words dilute the real signal words, and retention suffers. This is the mechanism behind the 4-point CUAD gap on the head-to-head. Closing it doesn't need a vector DB or a different retriever: it needs two small preprocessing helpers on the query side.

Measured on the CUAD framework comparison (n=300, BM25, budget 2,000 tok):

| step | helper | retention | Δ |
| ---- | ------ | ---------:| -:|
| raw 24-word template | - | 81.3% | - |
| + strip the wrapper | Stripper | 87.7% | +6.4 |
| + add workload synonyms | Vocabulary | 90.7% | +3.0 |

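RedHop with the full workflow reaches 90.7% at native BM25 latency (~2.5ms/query). That is 4.7 points above LlamaIndex's default configuration (86%), but it is not a like-for-like result: with the same Stripper applied, LlamaIndex reaches 94%. Mechanism + worked clause dict: CUAD_CLAUSE_EXPANSION.md.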

Recommended workflow: detect → strip → (optional) expand → A/B. The rewrite chain runs inside doc.contextWithRewrites(...) so each stage's audit trail lands on ctx.report.queryRewrites automatically.

const redhop = require("redhop");

// 1 — Detect. Hand a representative sample of your queries to the analyzer.
const report = redhop.analyzeQuerySet(myQueries.slice(0, 300));
// report.isTemplated            → true / false
// report.templateWordShare      → e.g. 0.66 on CUAD
// report.boilerplateTerms       → ["highlight", "contract", "lawyer", …]
// report.estimatedDilutionCost  → "high" | "medium" | "low" | "none"

if (report.isTemplated) {
  // 2 — Compile the rewrite chain.
  const stripper = new redhop.Stripper(report.boilerplateTerms);

  // 3 — (optional) Vocabulary. If your workload has known topic synonyms
  //     (clause types, error codes), compile them once.
  const vocab = new redhop.Vocabulary({
    // YOUR keys → synonyms; CUAD example in CUAD_CLAUSE_EXPANSION.md
    "change of control": ["merger", "successor", "acquisition"],
  });

  // 4 — Run the chain through retrieval; audit lands on report.queryRewrites.
  const doc = redhop.Document.fromFile("contract.pdf");  // synchronous, as in the quick start
  const ctxA = doc.context(userQuery);                              // baseline
  const ctxB = doc.contextWithRewrites(userQuery, [stripper, vocab]);
  const evalA = redhop.evaluate(userQuery, ctxA, { goldChunks });
  const evalB = redhop.evaluate(userQuery, ctxB, { goldChunks });
  console.log(evalB.overall - evalA.overall);  // the lift, deterministically
}

Only matters if your queries are templated. analyzeQuerySet is conservative by design: HotpotQA and MuSiQue both register quiet (isTemplated: false) in the cross-workload probe, while CUAD fires. If yours doesn't fire, skip this section.
The analyzer measures the shape of your query set, not your retention. It says "this looks like a templated workload" with the boilerplate terms it found. It does not promise a specific lift. Always A/B on your gold-evidence sample before committing.
For single-doc extraction workloads also set strategy: "raw_topk". auto routes large contexts to reasoning_preserving, which solves a multi-hop problem contract extraction doesn't have. RawTopK beats it by ~4 points at every chunk size on CUAD.
We deliberately don't ship a CUAD-specific stripTemplate() helper. Templates are workload-specific. Baking one in would make the wrong call for the next workload. new Stripper(...) and new Vocabulary({...}) take your boilerplate / synonym dict so the call stays on your side.
Or take the one-knob alternative: { retrieval: "hybrid" }. Dense retrieval reads chunks as semantic content rather than counting tokens, so the boilerplate ratio stops mattering. It substitutes for stripping by a different mechanism (+5.3 on raw CUAD at ~10ms/query). On CUAD specifically, BM25 + strip + vocabulary still wins: 90.7% / 2.5ms vs hybrid+CE 89.0% / 683ms. The two paths are substitutes, not complements, so pick one. See CUAD_HYBRID_RERANK.md.
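
To make the detect step concrete, here is one simple heuristic for spotting boilerplate in a query set: treat a term as boilerplate when it appears in most queries. This is a hypothetical stand-in, not how analyzeQuerySet works internally; the 0.8 cutoff is an assumption, and the output field names only mirror the report fields shown earlier.

```javascript
// Hypothetical heuristic, not RedHop's analyzeQuerySet: a term counts as
// boilerplate when it appears in at least `minShare` of the queries.
function findBoilerplate(queries, minShare = 0.8) {
  const tokenized = queries.map((q) =>
    q.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean)
  );
  // Count, per term, how many queries contain it at least once.
  const queryFreq = new Map();
  for (const tokens of tokenized) {
    for (const t of new Set(tokens)) queryFreq.set(t, (queryFreq.get(t) || 0) + 1);
  }
  const boilerplate = new Set(
    [...queryFreq].filter(([, n]) => n / queries.length >= minShare).map(([t]) => t)
  );
  // Share of all query tokens that are boilerplate.
  const total = tokenized.reduce((s, ts) => s + ts.length, 0);
  const templated = tokenized.reduce(
    (s, ts) => s + ts.filter((t) => boilerplate.has(t)).length,
    0
  );
  return {
    boilerplateTerms: [...boilerplate].sort(),
    templateWordShare: total === 0 ? 0 : templated / total,
  };
}

// Made-up templated queries: 8 wrapper words + 2 signal words each.
const probe = findBoilerplate([
  "Highlight the parts of this contract related to governing law",
  "Highlight the parts of this contract related to audit rights",
  "Highlight the parts of this contract related to renewal term",
]);
console.log(probe.boilerplateTerms);
console.log(probe.templateWordShare); // 0.8
```

The terms this flags are exactly the ones corpus IDF cannot discount on its own, because they repeat across the query set rather than across the documents.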

| helper | what it does | finding |
| ------ | ------------ | ------- |
| analyzeQuerySet(queries) | Inspects your queries and flags whether they're templated and which terms are doing the dilution | QUERY_SET_ANALYZER |
| new Stripper(boilerplate) | Compiled token-level boilerplate strip, word-boundary safe (an "of" strip does not erase "of" inside "office"). Plugs into the rewrite chain so the audit trail is captured | CUAD_RECALL_GAP · MULTILINGUAL_ANALYZER |
| new Vocabulary({key: [synonyms]}) | Compiled workload-curated equivalence classes: appends high-IDF synonyms when the token-level key matches. Vocabulary.bidirectional({...}) for symmetric maps (PTO ↔ paid time off). Opposite mechanism to PRF (falsified) | CUAD_CLAUSE_EXPANSION |
| vocab.enrich(chunkText) | Chunk-side mirror. Measured to lift retrieval +0.19 mean recall on Spider-shape schemas. Use it when your retrieval units are short and opaque (schema columns, error codes, API symbols, defined contract terms). Measured to hurt (−2.0pt) on long prose chunks, so don't use it there. A/B with redhop.evaluate(...) against your gold before adopting | SPIDER_ENRICH + VOCABULARY_ENRICH + CUAD_ENRICH_DEFINITIONS_NULL |
| doc.contextWithRewrites(query, [stripper, vocab]) | Runs the chain through retrieval. Per-stage audit lands on report.queryRewrites | (same finding as above) |
| evaluate(query, ctx, { goldChunks, goldAnswer }) · evaluateWithJudge(query, ctx, judge, { answer, goldAnswer, decomposeFaithfulness, decomposeCorrectness }) | A/B scoring against gold. Sync evaluate is deterministic-only (no LLM). Async evaluateWithJudge opts into LLM-judged faithfulness/relevancy/correctness, with claim-decomposition and TP/FP/FN modes. Same primitives the Decision Report uses | ANSWER_QUALITY_EVAL · COMPARISON_RAGAS |
| critique(answer, aspects, judge) | LLM-judged scoring for user-defined dimensions (harmfulness, conciseness, brand voice…). One judge call per aspect, polarity-corrected so high = good | ANSWER_QUALITY_EVAL |
| summarize(reports) | Test-set aggregation: means + medians + per-metric subset counts (meanOverall, meanFaithfulnessJudged + nWithFaithfulnessJudged, …). Same shape as Python's redhop.summarize and Rust's redhop::summarize(&[…]) | ANSWER_QUALITY_EVAL |
| ctx.report.diagnosis | Query-level facts (queryTerms, zeroMatchTerms, termStats, scoreSpread) plus a closed registry of bounded hints (vocab mismatch, polysemy, templated boilerplate). Every hint cites the measured finding behind it. Always computed, observation only | REPORT_DIAGNOSIS · CHOOSING_A_CONFIG |
| summarizeDiagnoses(reports) | Workload-level aggregation: hint histogram, failure rates, top vocabulary gaps, and at most one focus recommendation citing the finding behind it. Six focus codes (vocab_mismatch, templated_queries, underdetermined_queries, weak_retrieval, healthy, sample_too_small) | WORKLOAD_AUDIT |
| analyzeContext(query, chunks) | Observe what an external retriever returned without modifying it. Returns a Decision Report with Layer-1 diagnosis. Pair with the OTel snippet on the docs page to instrument a LangChain.js / LlamaIndex.ts pipeline | DIAGNOSE_YOUR_PIPELINE |

Decision rule + the recipe on the docs site: Choosing a configuration → "Templated queries with heavy boilerplate".

Build from source

npm install        # gets @napi-rs/cli
npm run build      # builds the native .node (release)
npm test