redhop
v0.4.0
Reasoning-aware context runtime for RAG — chunk, retrieve, and allocate document context, with citations and a Decision Report. No vector DB, in-process.
redhop (Node.js)
Reasoning-aware context runtime for RAG: hand it a document and a question, get back the context the model should actually see, with citations and a Decision Report. No vector database, no LLM, in-process. A native addon (napi-rs) over the RedHop Rust core. The embedding engine and document parsers are bundled (no extra deps).
Get started in 60 seconds
npm install redhop
const { Document } = require("redhop");
const doc = Document.fromFile("contract.pdf"); // parses + chunks + indexes
const ctx = doc.context("What is the governing law?"); // retrieves + assembles
llm.generate(ctx.text); // feed any provider — no lock-in
for (const c of ctx.citations) { // where the answer's context came from
console.log(c.source, c.page, c.heading); // e.g. contract.pdf 3 null
}
console.log(ctx.report.rendered); // the Decision Report: what it kept, and why
That's it.
How it compares
Measured on identical documents + budgets + BM25 retrieval, RedHop beats both frameworks on multi-hop evidence retention (80% vs LangChain 71%, LlamaIndex 72%) and beats LangChain on contracts (82% vs 73%). On CUAD's raw-template query LlamaIndex leads by 4 (LlamaIndex 86% vs RedHop 82% ≥0.8 retention).
Honest fair-preprocessing result (bench/compare.py, n=300, 2026-06-08):
applying Stripper(boilerplate) to every system's query lifts everyone:
LlamaIndex 86% → 94%, RedHop 82% → 88%, LangChain 73% → 79%. LlamaIndex
actually benefits more from the same Stripper than RedHop does. RedHop
reaches 90.7% by additionally layering a hand-authored 34-key
clause-name Vocabulary on top, but that recipe was not applied to
LlamaIndex, and the +4.7 framing previously reported here is RedHop-with-
recipe vs LlamaIndex-default, not a like-for-like comparison.
RedHop's clearer architectural lead is multi-hop retention, replicated on
two datasets at n=300. HotpotQA: ≥0.8 retention 80% vs LlamaIndex 72%,
LangChain 71% (+8 to +9). MuSiQue: ≥0.8 retention 22% vs LlamaIndex 17%,
LangChain 19% (+3 to +5). Compositional multi-hop is harder, so the magnitude
shrinks, but the lead holds at the ≥0.8 threshold. raw_topk matches
reasoning_preserving on both, so the edge comes from RedHop's chunking + BM25
defaults rather than the assembly strategy.
Push multi-hop further with retrieval="hybrid": measured +12 ≥0.8 on
HotpotQA (71% → 83%) and +8 ≥0.5 on MuSiQue (66% → 74%) at n=100, at
~90-120× per-query latency (3ms → 250-400ms). Stripper and candidate_k tuning
don't help on multi-hop: only dense rerank pierces the lexical-vs-semantic
gap on bridge passages.
Apples-to-apples hybrid vs LangChain/LlamaIndex (same bge-small, n=100, post pure-rerank fix): on HotpotQA, RedHop hybrid wins (81% ≥0.8 vs LangChain 77%, LlamaIndex 67%). On MuSiQue, LangChain leads narrowly (39% vs RedHop 34%, LlamaIndex 31%). The 0.3.1 audit traced the MuSiQue gap to RedHop's RRF fusion burying bridge passages with low BM25 + high dense rank. This release switches the default to pure rerank. Net: HotpotQA −2, MuSiQue +8 (close to the predicted +10). The latency profile (2-5× slower than competitors' hybrid) is a separate open item. See MULTIHOP_HYBRID_COMPETITORS.md.
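To make the "fusion buries bridge passages" failure concrete, here is a minimal sketch. It is not RedHop's implementation: it assumes the standard reciprocal-rank-fusion formula (score = sum of 1 / (k + rank), with the conventional k = 60), and the candidate ids and ranks are made up for illustration.

```javascript
// Hypothetical candidates; rank 1 is the best position in each list.
const candidates = [
  { id: "bridge",   bm25Rank: 40, denseRank: 1 },  // weak lexical match, strong semantic match
  { id: "lexicalA", bm25Rank: 1,  denseRank: 12 },
  { id: "lexicalB", bm25Rank: 2,  denseRank: 15 },
  { id: "mixed",    bm25Rank: 5,  denseRank: 6 },
];

// Reciprocal rank fusion: every list contributes 1 / (k + rank).
const rrfScore = (c, k = 60) => 1 / (k + c.bm25Rank) + 1 / (k + c.denseRank);

// Fusion ordering: highest fused score first.
const byRrf = [...candidates]
  .sort((a, b) => rrfScore(b) - rrfScore(a))
  .map((c) => c.id);

// Pure rerank: BM25 only selects the candidate set, dense rank alone orders it.
const byDense = [...candidates]
  .sort((a, b) => a.denseRank - b.denseRank)
  .map((c) => c.id);

console.log(byRrf);   // [ 'mixed', 'lexicalA', 'lexicalB', 'bridge' ]
console.log(byDense); // [ 'bridge', 'mixed', 'lexicalA', 'lexicalB' ]
```

Under fusion the bridge passage lands last despite the best dense rank, and under pure rerank it lands first. With a tight token budget, that difference decides whether it reaches the context at all.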
What RedHop's CUAD recipe offers is a reproducible, in-process, audited path
from 82% → 87.7% → 90.7% using Stripper + Vocabulary with a Decision
Report. The primitives are reusable on any templated workload. See
CUAD_CLAUSE_EXPANSION.md,
MUSIQUE_MULTIHOP.md,
and MULTIHOP_HYBRID.md.
Methodology + raw runs: FRAMEWORK_COMPARISON.md · framework_comparison_2026-06-06.txt.
How it works
Five stages: you bring documents and a query, RedHop owns parsing, chunking,
retrieval, and context allocation, and you get a BuiltContext with the
assembled prompt, citations, and a Decision Report. Each stage has an
evidence-backed default that traces to a finding in
docs/findings/.
The Decision Report
ctx.report.rendered carries the human-readable text above. Individual fields
(autoDecision, totalTokens, retainedEvidenceRatio, etc.) are on
ctx.report directly. Document.analyze(query) returns the same Report
shape without paying assembly cost. When retrieval looks weak,
ctx.report.diagnosis lists the query terms that appear nowhere in the
corpus and fires bounded hints (vocab mismatch, templated boilerplate,
polysemy) with a link to the measured finding behind each one. Example:
12_diagnosis.cjs.
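As a rough illustration of the "query terms that appear nowhere in the corpus" check, the sketch below computes zero-match terms in plain JavaScript. The tokenizer (lowercase, split on non-alphanumerics, no stemming) and the zeroMatchTerms helper are assumptions made for this example, not the library's code.

```javascript
// Simplified tokenizer (an assumption): lowercase, split on non-alphanumerics.
const tokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Return each distinct query term, in query order, that no chunk contains.
function zeroMatchTerms(query, chunks) {
  const corpusTerms = new Set(chunks.flatMap((chunk) => tokenize(chunk)));
  return [...new Set(tokenize(query))].filter((term) => !corpusTerms.has(term));
}

const chunks = [
  "Employees accrue paid time off monthly.",
  "Unused leave carries over to the next year.",
];

console.log(zeroMatchTerms("employees PTO carry over", chunks)); // [ 'pto', 'carry' ]
```

Both misses point at a vocabulary problem rather than a retrieval bug: "pto" is an abbreviation the corpus never spells that way, and "carry" fails only because nothing here stems "carries".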
Already running retrieval somewhere else?
Point the same diagnostics at your existing LangChain.js / LlamaIndex.ts / hand-rolled pipeline without migrating. Three calls, no behavior change:
const { Chunk, analyzeContext, summarizeDiagnoses } = require("redhop");
// Hand the candidates your retriever returned to RedHop for diagnosis.
const texts = await yourRetrieve(query);
const chunks = texts.map((t, i) => new Chunk(t, { id: String(i), source: "external" }));
const report = analyzeContext(query, chunks);
// Aggregate across a workload: one focus recommendation, citing the finding.
const reports = await Promise.all(productionQueries.map(async (q) =>
analyzeContext(q, (await yourRetrieve(q)).map((t, i) => new Chunk(t, { id: String(i), source: "external" })))
));
const summary = summarizeDiagnoses(reports);
console.log(summary.rendered);
Walk-through: docs/DIAGNOSE_YOUR_PIPELINE.md.
Example: 13_workload_audit.cjs.
OTel / Langfuse attribute mapping snippet is on the docs page.
Show your work: query rewrites with an audit trail
Every transformation between the raw query and what BM25 actually saw is
recorded on the same Decision Report. Compile a Stripper (boilerplate
removal), a Vocabulary (workload-curated synonyms), or both, run them as
a chain via doc.contextWithRewrites(...), and the per-stage records land
on ctx.report.queryRewrites:
const { Stripper, Vocabulary } = require("redhop");
const stripper = new Stripper(["highlight", "the", "parts", "of", "this", "contract"]);
const vocab = new Vocabulary({ "change of control": ["merger", "successor", "acquisition"] });
const ctx = doc.contextWithRewrites(query, [stripper, vocab]);
for (const rec of ctx.report.queryRewrites) {
console.log(rec.stage, "matched=", rec.matched, "added=", rec.added);
}
The same Vocabulary works chunk-side at ingest via vocab.enrich(chunkText).
It lifts retrieval +0.19 mean recall on schema-style corpora
(SPIDER_ENRICH),
and is measured to hurt (−2.0pt) on long prose chunks
(CUAD_ENRICH_DEFINITIONS_NULL).
A/B with redhop.evaluate(...) to confirm before adopting.
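The idea of a rewrite chain that records what each stage did can be sketched without the library. Everything below is hypothetical (stripStage, vocabStage, runChain, and the record shape merely mimic the stage / matched / added fields shown above); it illustrates the mechanism, not RedHop's internals.

```javascript
const tokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

// Stage 1: drop boilerplate tokens. Matching is per whole token,
// so stripping "of" never touches "office".
function stripStage(boilerplate) {
  const drop = new Set(boilerplate.map((t) => t.toLowerCase()));
  return (tokens) => ({
    tokens: tokens.filter((t) => !drop.has(t)),
    record: { stage: "strip", matched: tokens.filter((t) => drop.has(t)), added: [] },
  });
}

// Stage 2: append synonyms when a key appears as a whole-token run.
function vocabStage(synonyms) {
  return (tokens) => {
    const joined = ` ${tokens.join(" ")} `;
    const matched = Object.keys(synonyms).filter((key) => joined.includes(` ${key} `));
    const added = matched.flatMap((key) => synonyms[key]);
    return { tokens: [...tokens, ...added], record: { stage: "vocabulary", matched, added } };
  };
}

// Run the stages in order and keep one audit record per stage.
function runChain(query, stages) {
  let tokens = tokenize(query);
  const records = [];
  for (const stage of stages) {
    const out = stage(tokens);
    tokens = out.tokens;
    records.push(out.record);
  }
  return { query: tokens.join(" "), records };
}

const result = runChain(
  "Highlight the parts of this contract related to office lease termination",
  [
    stripStage(["highlight", "the", "parts", "of", "this", "contract"]),
    vocabStage({ termination: ["cancellation", "expiry"] }),
  ],
);
console.log(result.query); // related to office lease termination cancellation expiry
console.log(result.records);
```

In this sketch the stages run strictly in sequence, so a stripped token such as "of" would stop a multi-word key like "change of control" from matching afterwards. That is a property of the sketch, not a statement about how RedHop orders or matches its stages.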
Score the change: deterministic, or LLM-judged when you need it
Two modes. Use deterministic in CI on every PR. Opt into a judge when you want faithfulness / relevancy / correctness against generated answers.
Deterministic: no API calls, ~ms per query. Returns
contextRecall / contextPrecision / answerTokenRecall /
faithfulnessLexical / relevancyLexical / correctnessLexical +
a composite overall. Same primitives the Decision Report uses.
const { evaluate } = require("redhop");
const ctxA = doc.context(userQuery);
const ctxB = doc.contextWithRewrites(userQuery, [stripper, vocab]);
const evalA = evaluate(userQuery, ctxA, { goldChunks });
const evalB = evaluate(userQuery, ctxB, { goldChunks });
console.log("lift on overall:", evalB.overall - evalA.overall);
LLM-judged: via the async evaluateWithJudge. Supply your own
LLM caller (OpenAI, Anthropic, OpenRouter, local). Adds
faithfulnessJudged / relevancyJudged / correctnessJudged.
Claim-decomposed faithfulness (decomposeFaithfulness: true) is
substantively equivalent to Ragas: r=+0.664, MAE=0.151 on n=200
HotpotQA, see COMPARISON_RAGAS.
TP/FP/FN F₁ via decomposeCorrectness: true.
const { Judge, evaluateWithJudge, critique } = require("redhop");
const judge = Judge.fromCallable(async (err, prompt, system) => {
// Your LLM SDK call — return a number or { score: number }
return await myLlm({ prompt, system });
}, "openai-mini").cached();
const report = await evaluateWithJudge(userQuery, ctx, judge, {
answer: "The refund window is thirty days.",
goldAnswer: "thirty days",
decomposeFaithfulness: true,
decomposeCorrectness: true,
});
For user-defined aspects (harmfulness, conciseness, brand voice…),
critique(answer, aspects, judge) runs one judge call per aspect
with polarity-corrected scores. Aggregate a test set via
summarize(reports): same shape as Python's redhop.summarize,
returns means + medians + per-metric subset counters.
Full API + field list: ANSWER_QUALITY_EVAL.
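A sketch of what "means + medians + per-metric subset counters" can look like, for readers who want to sanity-check aggregate numbers by hand. The aggregateMetric helper and its output shape are assumptions for illustration. Only the field names overall and faithfulnessJudged come from the text above.

```javascript
const mean = (xs) => xs.reduce((a, b) => a + b, 0) / xs.length;

const median = (xs) => {
  const sorted = [...xs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 ? sorted[mid] : (sorted[mid - 1] + sorted[mid]) / 2;
};

// Aggregate one metric over only the reports that actually carry it,
// and report how many did (the "subset counter").
function aggregateMetric(reports, field) {
  const values = reports.map((r) => r[field]).filter((v) => typeof v === "number");
  if (values.length === 0) return { n: 0, mean: null, median: null };
  return { n: values.length, mean: mean(values), median: median(values) };
}

const reports = [
  { overall: 0.5, faithfulnessJudged: 0.75 },
  { overall: 0.75 }, // no judge was run for this query
  { overall: 1.0, faithfulnessJudged: 0.25 },
];

const overall = aggregateMetric(reports, "overall");
const faithfulness = aggregateMetric(reports, "faithfulnessJudged");
console.log(overall);      // { n: 3, mean: 0.75, median: 0.75 }
console.log(faithfulness); // { n: 2, mean: 0.5, median: 0.5 }
```

Keeping the count next to each mean matters when only part of a test set was judged: a judged-metric mean over 2 of 3 queries should not be read as covering all 3.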
Loaders
Document.fromText(text, options?)
Document.fromChunks([new Chunk(text, { source, id, metadata }), ...], options?)
Document.fromFile(path, options?) // PDF/DOCX/PPTX/XLSX + text/code
Document.fromBytes(buffer, "key.pdf", options?) // S3 / GCS / Azure / HTTP / DB blobs
Document.fromFolder(path, folderOptions?) // one combined index over a dir
fromFolder honors .gitignore and accepts extra ignore globs:
Document.fromFolder("./repo", { recursive: true, gitignore: true,
ignore: ["*.lock", "tests/**"], options: { retrieval: "hybrid", model: "bge-small" } });
Retrieval: start with the default
Start at the lexical default (it handles most document QA because the words in the question are usually the words in the answer) and climb only when the failure shape calls for it.
// Default — most docs (code, API refs, runbooks, financial reports, handbooks)
Document.fromFile("contract.pdf").context("What is the governing law?");
// Structured docs with parallel clauses (regional overrides, per-region sub-sections):
Document.fromFile("msa.pdf", { retrieval: "hybrid", model: "bge-small" })
.context("What law applies in the UK?", undefined, 1, true); // neighbors=1, includeHeading=true
// Synonym-mismatch corpora (HR FAQs, support tickets where users phrase
// things very differently from the docs). Cross-encoder adds 5–10× latency
// — verify it helps on your corpus before enabling.
Document.fromFile("support.md",
{ retrieval: "hybrid", model: "bge-small", rerank: "cross-encoder" });
options.retrieval is "lexical" (default), "hybrid" (BM25 → dense rerank),
or "semantic" (dense over every chunk). Dense tiers download a small model
named by options.model ("bge-small" / "bge-base"). The 60-second
decision guide:
CHOOSING_A_CONFIG.
Non-English content
Default is a minimal analyzer (tokenize + lowercase + ASCII fold, no
stemmer), measured to beat English Snowball on every English workload
we tested (RAW_ANALYZER).
Swap with options.language: "english" for code search /
inflection-heavy English content, or any of the 18 Snowball Porter2
languages (arabic, danish, dutch, english, finnish, french, german,
greek, hungarian, italian, norwegian, portuguese, romanian, russian,
spanish, swedish, tamil, turkish):
const doc = Document.fromText(germanText, { language: "german" });
// Now `Buch` finds chunks containing `Bücher` (and vice versa)
One analyzer drives both BM25 retrieval AND the grounding scorer, so they can't drift on what "the same term" means. Unknown names throw (we don't silently fall back to English). See the language guide for the full breakdown and a calibration disclaimer.
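For intuition about what "tokenize + lowercase + ASCII fold, no stemmer" means in practice, here is a small approximation in plain JavaScript. The function name minimalAnalyze and the exact folding rules (Unicode NFD decomposition, then dropping combining marks) are assumptions. The bundled analyzer may treat edge cases differently.

```javascript
// Approximation of a minimal analyzer: fold accents, lowercase, split, no stemming.
function minimalAnalyze(text) {
  return text
    .normalize("NFD")          // split accented letters into base letter + combining mark
    .replace(/\p{M}/gu, "")    // drop the combining marks (the folding step)
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter(Boolean);
}

console.log(minimalAnalyze("Die Bücher im Café")); // [ 'die', 'bucher', 'im', 'cafe' ]
console.log(minimalAnalyze("Buch"));               // [ 'buch' ]
```

Folding alone leaves "buch" and "bucher" as different terms, which is the gap a language-specific stemmer such as language: "german" is meant to close.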
The result
context(query, budget?, neighbors?, includeHeading?) returns:
- text: the assembled prompt string
- chunks: the selected chunk texts, in order
- citations: { source, page, heading, line, text } per chunk (null/absent fields where a format doesn't provide them)
- report: the Decision Report, with the same field surface as Python's ctx.report. Read strategy / requestedStrategy for the resolved vs requested allocation, autoDecision for the Auto gate's verdict ("passthrough" | "prune" | "not_auto"), inputTokens / tokenBudget / totalTokens / tokenUtilization for budget accounting, nInputChunks / nSelected / nExpanded for chunk counts, inputDistractorRatio / retainedEvidenceRatio / evidenceDensity / distractorRatio / estimatedWasteTokens for context economics, secondHopRescues (or its longer alias secondHopRescueCount) and reasoningPreservationDelta for the reasoning-preserving accounting, lowConfidenceRetrieval / lowConfidenceThreshold for the "did anything actually match" signal, and rendered for the human-readable Decision Report string. The full shape is in index.d.ts.
neighbors / includeHeading turn on structural context expansion (adjacent
chunks / section headings, in document order).
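A minimal sketch of adjacent-chunk expansion, assuming chunks are addressed by their position in the document. The helper expandNeighbors is hypothetical, and whether RedHop deduplicates or budgets expanded chunks the same way is not stated here.

```javascript
// selected: positions of the chunks retrieval picked; total: chunk count in the document.
function expandNeighbors(selected, neighbors, total) {
  const keep = new Set();
  for (const i of selected) {
    const lo = Math.max(0, i - neighbors);          // clamp at the first chunk
    const hi = Math.min(total - 1, i + neighbors);  // clamp at the last chunk
    for (let j = lo; j <= hi; j++) keep.add(j);
  }
  // Overlapping windows collapse in the Set; sorting restores document order.
  return [...keep].sort((a, b) => a - b);
}

console.log(expandNeighbors([7, 2, 3], 1, 8)); // [ 1, 2, 3, 4, 6, 7 ]
```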
analyze(query) is the same retrieve + score pass without assembling the
prompt, useful for auditing what RedHop would do before paying assembly
cost. Returns just the report.
Standalone observability primitives (the same scoring the strategies use internally, exposed so external code never has to reimplement and drift):
const { groundingScore, linkStrength } = require("redhop");
groundingScore("refund window", chunkText); // → number in [0, 1]
linkStrength(chunkA, chunkB); // → number in [0, 1]
Both use the default English analyzer. Non-English content reaches the
configured analyzer through Document.context(...).report instead.
fromFolder exposes two more getters: doc.nFiles (count of indexed
files) and doc.skippedFiles ({ source, reason }[], files that
couldn't be parsed: unsupported formats, unreadable bytes, scanned PDFs
without OCR, etc.). Single-source constructors default to nFiles=1
and skippedFiles=[].
Templated workloads: the +9 retention lift (BM25, no model needed)
If every query in your workload follows a fixed template, such as legal QA
("Highlight the parts (if any) of this contract related to X. Details: …"),
support-ticket triage ("Help me with X, my account is Y, the error is Z"),
or form-filled queries from a structured UI, BM25 weights every query term
by corpus IDF, not by how often the term repeats across your query set.
The boilerplate words dilute the real signal words, and retention suffers.
This is the mechanism behind the 4-point CUAD gap on the head-to-head.
Closing it doesn't need a vector DB or a different retriever: it needs two
small preprocessing helpers on the query side.
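The dilution effect can be reproduced on a toy corpus. The sketch below is a deliberate simplification, not BM25 proper and not RedHop's scorer: it sums a BM25-style IDF (the Lucene variant) over the distinct query terms a chunk contains, and omits term frequency and length normalization. The chunks and queries are invented.

```javascript
const tokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

const chunks = [
  "Governing law: this agreement is governed by the laws of Delaware.", // the relevant chunk
  "A lawyer should review the highlighted parts before signing.",
  "Payment is due within thirty days of invoice.",
  "Either party may terminate with notice.",
];
const chunkTerms = chunks.map((chunk) => new Set(tokenize(chunk)));

// IDF depends only on how many corpus chunks contain the term,
// not on how often the term repeats across the query set.
function idf(term) {
  const df = chunkTerms.filter((terms) => terms.has(term)).length;
  return Math.log(1 + (chunkTerms.length - df + 0.5) / (df + 0.5));
}

// Index of the chunk with the highest summed IDF over matched query terms.
function topChunk(query) {
  const terms = [...new Set(tokenize(query))];
  const scores = chunkTerms.map((set) =>
    terms.filter((t) => set.has(t)).reduce((sum, t) => sum + idf(t), 0));
  return scores.indexOf(Math.max(...scores));
}

const templated =
  "Highlight the parts of this contract related to governing law that a lawyer should review";
console.log(topChunk(templated));       // 1: wrapper words pull in the wrong chunk
console.log(topChunk("governing law")); // 0: the stripped query finds the clause
```

The wrapper words "lawyer", "should", "review", and "parts" are rare in this corpus, so each carries as much IDF weight as the two real signal words and together they outvote them.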
Measured on the CUAD framework comparison (n=300, BM25, budget 2,000 tok):
| step | helper | retention | Δ |
| ---- | ------ | ---------:| -:|
| raw 24-word template | — | 81.3% | — |
| + strip the wrapper | Stripper | 87.7% | +6.4 |
| + add workload synonyms | Vocabulary | 90.7% | +3.0 |
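RedHop with the full workflow reaches 90.7% at native BM25 latency (~2.5ms/query), 4.7 points above LlamaIndex's default 86%. That is not a like-for-like comparison: the recipe was not applied to LlamaIndex, which reaches 94% with the same Stripper alone. Mechanism + worked clause dict: CUAD_CLAUSE_EXPANSION.md.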
Recommended workflow: detect → strip → (optional) expand → A/B. The
rewrite chain runs inside doc.contextWithRewrites(...) so each stage's
audit trail lands on ctx.report.queryRewrites automatically.
const redhop = require("redhop");
// 1 — Detect. Hand a representative sample of your queries to the analyzer.
const report = redhop.analyzeQuerySet(myQueries.slice(0, 300));
// report.isTemplated → true / false
// report.templateWordShare → e.g. 0.66 on CUAD
// report.boilerplateTerms → ["highlight", "contract", "lawyer", …]
// report.estimatedDilutionCost → "high" | "medium" | "low" | "none"
if (report.isTemplated) {
// 2 — Compile the rewrite chain.
const stripper = new redhop.Stripper(report.boilerplateTerms);
// 3 — (optional) Vocabulary. If your workload has known topic synonyms
// (clause types, error codes), compile them once.
const vocab = new redhop.Vocabulary({
// YOUR keys → synonyms; CUAD example in CUAD_CLAUSE_EXPANSION.md
"change of control": ["merger", "successor", "acquisition"],
});
// 4 — Run the chain through retrieval; audit lands on report.queryRewrites.
const doc = redhop.Document.fromFile("contract.pdf"); // synchronous, as in the quick start
const ctxA = doc.context(userQuery); // baseline
const ctxB = doc.contextWithRewrites(userQuery, [stripper, vocab]);
const evalA = redhop.evaluate(userQuery, ctxA, { goldChunks });
const evalB = redhop.evaluate(userQuery, ctxB, { goldChunks });
console.log(evalB.overall - evalA.overall); // the lift, deterministically
}
- Only matters if your queries are templated. analyzeQuerySet is conservative by design: HotpotQA and MuSiQue both register quiet (isTemplated: false) in the cross-workload probe, while CUAD fires. If yours doesn't fire, skip this section.
- The analyzer measures the shape of your query set, not your retention. It says "this looks like a templated workload" with the boilerplate terms it found. It does not promise a specific lift. Always A/B on your gold-evidence sample before committing.
- For single-doc extraction workloads also set strategy: "raw_topk". auto routes large contexts to reasoning_preserving, which solves a multi-hop problem contract extraction doesn't have. RawTopK beats it by ~4 points at every chunk size on CUAD.
- We deliberately don't ship a CUAD-specific stripTemplate() helper. Templates are workload-specific. Baking one in would make the wrong call for the next workload. new Stripper(...) and new Vocabulary({...}) take your boilerplate / synonym dict so the call stays on your side.
- Or take the one-knob alternative: retrieval="hybrid". Dense reads chunks as semantic content rather than counting tokens, so the boilerplate ratio stops mattering. Substitutes for stripping by a different mechanism (+5.3 on raw CUAD at ~10ms/query). On CUAD specifically, BM25 + strip + vocabulary still wins: 90.7% / 2.5ms vs hybrid+CE 89.0% / 683ms. The two paths are substitutes, not complements, so pick one. See CUAD_HYBRID_RERANK.md.
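To show what "measuring the shape of a query set" could involve, here is one possible heuristic. It is an assumption, not the logic behind analyzeQuerySet, which this README does not describe: a term counts as boilerplate when it appears in at least 90% of queries, and the template word share is the fraction of all query tokens that are boilerplate. The helper name detectTemplate and the threshold are hypothetical. Only the output field names echo the ones above.

```javascript
const tokenize = (text) => text.toLowerCase().split(/[^a-z0-9]+/).filter(Boolean);

function detectTemplate(queries, minQueryShare = 0.9) {
  const tokenized = queries.map((q) => tokenize(q));

  // Count, per term, how many queries contain it at least once.
  const queryCount = new Map();
  for (const tokens of tokenized) {
    for (const term of new Set(tokens)) {
      queryCount.set(term, (queryCount.get(term) || 0) + 1);
    }
  }

  // Boilerplate = terms present in (nearly) every query.
  const boilerplate = new Set(
    [...queryCount]
      .filter(([, n]) => n / queries.length >= minQueryShare)
      .map(([term]) => term),
  );

  // Share of all query tokens that are boilerplate.
  const all = tokenized.flat();
  const templateWordShare = all.filter((t) => boilerplate.has(t)).length / all.length;
  return { boilerplateTerms: [...boilerplate], templateWordShare };
}

const sample = [
  "Highlight the parts of this contract related to governing law",
  "Highlight the parts of this contract related to termination",
  "Highlight the parts of this contract related to change of control",
];
const probe = detectTemplate(sample);
console.log(probe.boilerplateTerms);
console.log(probe.templateWordShare.toFixed(2)); // 0.83
```

A frequency-only heuristic like this one also flags the "of" inside "change of control", which is one reason to review the detected terms before compiling them into a Stripper.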
| helper | what it does | finding |
| ------ | ------------ | ------- |
| analyzeQuerySet(queries) | Inspects your queries and flags whether they're templated and which terms are doing the dilution | QUERY_SET_ANALYZER |
| new Stripper(boilerplate) | Compiled token-level boilerplate strip, word-boundary safe (an "of" strip does not erase "of" inside "office"). Plugs into the rewrite chain so the audit trail is captured | CUAD_RECALL_GAP · MULTILINGUAL_ANALYZER |
| new Vocabulary({key: [synonyms]}) | Compiled workload-curated equivalence classes: appends high-IDF synonyms when the token-level key matches. Vocabulary.bidirectional({...}) for symmetric maps (PTO ↔ paid time off). Opposite mechanism to PRF (falsified) | CUAD_CLAUSE_EXPANSION |
| vocab.enrich(chunkText) | Chunk-side mirror. Measured to lift retrieval +0.19 mean recall on Spider-shape schemas. Use it when your retrieval units are short and opaque (schema columns, error codes, API symbols, defined contract terms). Measured to hurt (−2.0pt) on long prose chunks, so don't use it there. A/B with redhop.evaluate(...) against your gold before adopting | SPIDER_ENRICH + VOCABULARY_ENRICH + CUAD_ENRICH_DEFINITIONS_NULL |
| doc.contextWithRewrites(query, [stripper, vocab]) | Runs the chain through retrieval. Per-stage audit lands on report.queryRewrites | (same finding as above) |
| evaluate(query, ctx, { goldChunks, goldAnswer }) · evaluateWithJudge(query, ctx, judge, { answer, goldAnswer, decomposeFaithfulness, decomposeCorrectness }) | A/B scoring against gold. Sync evaluate is deterministic-only (no LLM). Async evaluateWithJudge opts into LLM-judged faithfulness/relevancy/correctness, with claim-decomposition and TP/FP/FN modes. Same primitives the Decision Report uses | ANSWER_QUALITY_EVAL · COMPARISON_RAGAS |
| critique(answer, aspects, judge) | LLM-judged scoring for user-defined dimensions (harmfulness, conciseness, brand voice…). One judge call per aspect, polarity-corrected so high = good | ANSWER_QUALITY_EVAL |
| summarize(reports) | Test-set aggregation: means + medians + per-metric subset counts (meanOverall, meanFaithfulnessJudged + nWithFaithfulnessJudged, …). Same shape as Python's redhop.summarize and Rust's redhop::summarize(&[…]) | ANSWER_QUALITY_EVAL |
| ctx.report.diagnosis | Query-level facts (queryTerms, zeroMatchTerms, termStats, scoreSpread) plus a closed registry of bounded hints (vocab mismatch, polysemy, templated boilerplate). Every hint cites the measured finding behind it. Always computed, observation only | REPORT_DIAGNOSIS · CHOOSING_A_CONFIG |
| summarizeDiagnoses(reports) | Workload-level aggregation: hint histogram, failure rates, top vocabulary gaps, and at most one focus recommendation citing the finding behind it. Six focus codes (vocab_mismatch, templated_queries, underdetermined_queries, weak_retrieval, healthy, sample_too_small) | WORKLOAD_AUDIT |
| analyzeContext(query, chunks) | Observe what an external retriever returned without modifying it. Returns a Decision Report with Layer-1 diagnosis. Pair with the OTel snippet on the docs page to instrument a LangChain.js / LlamaIndex.ts pipeline | DIAGNOSE_YOUR_PIPELINE |
Decision rule + the recipe on the docs site: Choosing a configuration → "Templated queries with heavy boilerplate".
Build from source
npm install # gets @napi-rs/cli
npm run build # builds the native .node (release)
npm test