

@tangle-network/agent-eval

A library for deciding whether an LLM-driven generator did its job.

You hand it the thing the generator produced — a code scaffold, a patch, a tweet, a JSON config — and you get back a structured verdict: pass/fail, dimension scores, plain-English rationale. Built to catch the LLM failure modes that LLM-as-judge alone misses.

import { BuilderSession, SubprocessSandboxDriver, InMemoryTraceStore } from '@tangle-network/agent-eval'

const session = new BuilderSession(new InMemoryTraceStore(), { projectId: 'my-app' }, new SubprocessSandboxDriver())
await session.startChat()
const ship = await session.ship({
  harness: { setupCommand: 'pnpm install', testCommand: 'pnpm exec tsc --noEmit', cwd: scaffoldDir, timeoutMs: 180_000 },
})
console.log(ship.result.passed, ship.result.score)

Who this is for

  • You ship a code generator (scaffolder, patcher, refactor agent) and need to gate on whether its output actually works.
  • You ship a content generator and need quality signal beyond "the LLM said it's good".
  • You want a release gate that fails on regressions you can name, not vibes.

If that's you, start with docs/concepts.md — 5-minute mental model — then use docs/feature-guide.md to choose the right primitive.

Quickstart

From any language: HTTP or RPC

The fastest path. agent-eval ships a CLI that runs as either an HTTP server or a stdio RPC binary. Drive it from Python, Rust, Go, anything.

npm i -g @tangle-network/agent-eval

# HTTP — long-running
agent-eval serve --port 5005

# stdio RPC — one-shot, batch
echo '{"rubricName":"anti-slop","content":"…"}' | agent-eval rpc judge

Python:

pip install tangle-agent-eval

from tangle_agent_eval import Client
c = Client()
r = c.judge(content="our scaffold ships zero-copy IO", rubric_name="anti-slop")
print(r.composite, r.failure_modes)

See docs/wire-protocol.md for the full surface.

From TypeScript: import directly

In-process; no wire round-trip. Use this when your eval lives in the same Node process as your generator.

pnpm add @tangle-network/agent-eval

The recipe for a code-generator eval is in SKILL.md §Minimal working path.
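
As a concrete sketch of the in-process path, here is the same BuilderSession flow from the quickstart wrapped in a vitest test, so the eval gates your generator in the same Node process. The vitest wrapping and generateScaffold are illustrative stand-ins for your own harness and generator; see SKILL.md for the documented recipe.

import { describe, it, expect } from 'vitest'
import { BuilderSession, SubprocessSandboxDriver, InMemoryTraceStore } from '@tangle-network/agent-eval'
import { generateScaffold } from './my-generator' // hypothetical: your own generator

describe('scaffold generator', () => {
  it('produces output that installs and typechecks', async () => {
    const scaffoldDir = await generateScaffold()

    const session = new BuilderSession(new InMemoryTraceStore(), { projectId: 'my-app' }, new SubprocessSandboxDriver())
    await session.startChat()

    // Same harness shape as the quickstart: fail the test if the generated
    // project doesn't install and typecheck inside the sandbox.
    const ship = await session.ship({
      harness: { setupCommand: 'pnpm install', testCommand: 'pnpm exec tsc --noEmit', cwd: scaffoldDir, timeoutMs: 180_000 },
    })

    expect(ship.result.passed).toBe(true)
  }, 240_000)
})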

Two ways to read this repo

  • You're a human onboarding — read docs/concepts.md for the mental model, then docs/wire-protocol.md if you'll call from another language, or SKILL.md if you'll embed in TS.
  • You're deciding what to integrate — read docs/feature-guide.md for the layman explanation, use cases, feature map, and guardrails.
  • You're an LLM agent writing integration code — read SKILL.md. Every directive there encodes a shipped bug; skipping one reintroduces the bug class.

What's in the box

| Module | What it does | Doc |
|---|---|---|
| BuilderSession | Three-layer eval orchestrator (builder → app-build → app-runtime) for code generators. | concepts.md §three-layer eval |
| MultiLayerVerifier | Pipeline of layers (install → typecheck → build → semantic). Skip-on-fail, weighted aggregate. | concepts.md §verifiers |
| judges, createCustomJudge, createAntiSlopJudge | LLM and deterministic judges. | SKILL.md |
| Wire protocol (agent-eval serve / rpc) | HTTP and stdio RPC interface for cross-language clients. | wire-protocol.md |
| clients/python/ | First-party Python client (tangle-agent-eval on PyPI). Version-locked to npm. | clients/python/README.md |
| BenchmarkRunner, executeScenario, ConvergenceTracker | Multi-turn scenario execution + cross-run tracking. | SKILL.md |
| runAgentControlLoop | Policy-based runtime for agentic tasks: observe typed state, validate, decide, act, repeat with budgets, tracing, and stuck-loop guards. | control-runtime.md |
| FeedbackTrajectory, InMemoryFeedbackTrajectoryStore, FileSystemFeedbackTrajectoryStore | Human/environment feedback loops: capture approvals, rejections, choices, revisions, metrics, and policy blocks as train/dev/test/holdout examples. | feedback-trajectories.md |
| evaluateActionPolicy | Generic action preflight for approval, budget, expected-outcome, and kill-criteria checks. | feature-guide.md |
| ExperimentTracker, steering optimizers, bisector | A/B prompts, optimize steering, bisect regressions. | SKILL.md |
| runMultiShotOptimization, trialTraceFromMultiShotTrial | GEPA-style optimization for variable-length agent trajectories with ASI, paired seeds, and optional held-out promotion gating. | multi-shot-optimization.md |
| evaluateReleaseConfidence, assertReleaseConfidence | Release scorecard that composes corpus coverage, search/holdout run evidence, ASI diagnostics, overfit checks, and cost/latency budgets. | §Release confidence |
| runPromptEvolution, createCompositeMutator, createSandboxPool, createSandboxCodeMutator, MutationTelemetry, LineageRecorder, CostLedger, JsonlTrialCache | Prompt + code evolution loops with bounded sandbox pools, durable JSONL telemetry, plateau-detecting composite mutators, crash-resumable trial cache. | §Evolution loop |
| reflective-mutation (buildReflectionPrompt, parseReflectionResponse, DEFAULT_MUTATION_PRIMITIVES) | Trace-conditioned LLM mutator that reasons over top/bottom trials instead of blind rewrites. | inline JSDoc |
| correlationStudy, OutcomeStore, ProductRegistry | Meta-eval: do our scores predict deployment outcomes (revenue, retention)? | inline JSDoc |
| Telemetry (telemetry/, telemetry/file) | OTLP export, trace replay, file sinks. | inline JSDoc |

Release confidence

Use evaluateReleaseConfidence at the release boundary for every consuming agent surface. It fails closed unless the release has a versioned corpus, search and holdout run evidence, score/pass-rate evidence, ASI for failures, and budget/overfit checks. Single-shot and multi-shot apps use the same path: single-shot traces are just trace evidence with turnCount: 1.

import {
  evaluateReleaseConfidence,
  releaseTraceEvidenceFromMultiShotTrials,
} from '@tangle-network/agent-eval'

const scorecard = evaluateReleaseConfidence({
  target: 'blueprint-agent/autoresearch',
  candidateId: 'candidate-v3',
  baselineId: 'baseline',
  dataset: await dataset.manifest(),
  runs: [...candidateRuns, ...baselineRuns],
  traces: releaseTraceEvidenceFromMultiShotTrials(result.evolution.generations.flatMap((g) => g.trials)),
  gateDecision: result.gate?.decision,
  thresholds: {
    minScenarioCount: 50,
    minSearchRuns: 50,
    minHoldoutRuns: 20,
    minPassRate: 0.9,
    minMeanScore: 0.8,
    maxOverfitGap: 0.1,
    maxMeanCostUsd: 0.05,
    maxP95WallMs: 120_000,
  },
})

if (!scorecard.promote) throw new Error(scorecard.summary)

Evolution loop

For agent tasks that run across many chat turns or tool calls, start with runMultiShotOptimization. It runs the same prompt-evolution core over full trajectories, carries actionable side information into reflection, and separates the search winner from the variant that actually passes held-out promotion.

Closing the loop on a prompt or codebase is two adapters + a config. Compose runPromptEvolution with createCompositeMutator (plateau policy) and you get prompt-only optimization until improvement stalls, then an automatic switch to code-channel mutations from a coding agent inside a SandboxPool.

import {
  createSandboxPool,
  createSandboxCodeMutator,
  createCompositeMutator,
  buildReflectionPrompt,
  parseReflectionResponse,
  runPromptEvolution,
  MutationTelemetry,
  LineageRecorder,
  CostLedger,
  JsonlTrialCache,
} from '@tangle-network/agent-eval'

// 1. Prompt mutator — reflective-mutation reasons over top/bottom trials
const promptMutator = {
  async mutate({ parent, topTrials, bottomTrials, childCount }) {
    const ctx = { target: 'forge-prompt', parentPayload: parent.payload, topTrials, bottomTrials, childCount }
    const reflection = buildReflectionPrompt(ctx)
    const raw = await yourLlm(reflection)
    return parseReflectionResponse(raw, childCount).map((p, i) => ({
      id: `${parent.id}.g${parent.generation + 1}.prompt.${i}`,
      payload: p.payload,
      generation: parent.generation + 1,
      parentId: parent.id,
      label: p.label,
      rationale: p.rationale,
    }))
  },
}

// 2. Code mutator — runs a coding agent in a sandbox slot, captures the diff
const pool = createSandboxPool({
  size: 4,
  factory: {
    async create(id) { return await yourSandboxClient.create({ name: id }) },
    async reset(slot) { await slot.resource.exec('git reset --hard origin/main && git clean -fd') },
    async destroy(slot) { await slot.resource.delete() },
  },
})
const codeMutator = createSandboxCodeMutator({
  pool,
  runner: async ({ slot, parent, topTrials, bottomTrials }) => {
    const result = await slot.resource.task(`Improve the prompt at /repo/forge-prompt.ts...`)
    return [{ ok: true, latencyMs: result.durationMs, costUsd: result.costUsd, artifact: { diff: result.diff } }]
  },
  toVariantPayload: (outcome, parent) => ({ ...parent.payload, codeMutation: outcome.artifact }),
})

// 3. Compose — plateau policy auto-switches when prompt evolution stalls
const composite = createCompositeMutator({
  primary: promptMutator,
  secondary: codeMutator,
  policy: 'plateau',
  plateauThreshold: 0.02,
  plateauPatience: 2,
})

// 4. Run — durable telemetry to disk, crash-resumable
const result = await runPromptEvolution({
  runId: `forge_${Date.now()}`,
  target: 'forge-prompt',
  seedVariants: [{ id: 'v0', payload: { text: currentPrompt }, generation: 0, label: 'baseline' }],
  scenarioIds: referenceCorpus.map(s => s.id),
  reps: 3,
  generations: 5,
  populationSize: 4,
  scoreAdapter: { /* runs your eval against (variant, scenario, rep) */ },
  mutateAdapter: composite,
  cache: new JsonlTrialCache('.evolve/cache.jsonl'),
  objectives: [
    { name: 'score', direction: 'maximize', value: a => a.meanScore },
    { name: 'cost', direction: 'minimize', value: a => a.meanCost },
  ],
})

MutationTelemetry, LineageRecorder, and CostLedger are passed into the code mutator (and any consumer that wants them). They emit append-only JSONL of every attempt (success and failure, with reason) and a snapshot lineage tree, so a finished run leaves a forensically complete trail under one directory.
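
As a sketch of consuming that trail after a run (the .evolve/ directory matches the example above, but the exact JSONL file name is an assumption; check your telemetry configuration):

import { readFileSync } from 'node:fs'

// Hypothetical file name under the run directory. Each line is one JSON record:
// failed attempts carry ok: false plus a reason, successes carry the artifact.
const attempts = readFileSync('.evolve/mutations.jsonl', 'utf8')
  .split('\n')
  .filter(Boolean)
  .map((line) => JSON.parse(line))

const failed = attempts.filter((a) => a.ok === false)
console.log(`${failed.length}/${attempts.length} mutation attempts failed`)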

For the full primitive surface and rationale, read each module's JSDoc — prompt-evolution.ts, composite-mutator.ts, sandbox-pool.ts, code-mutator.ts, reflective-mutation.ts, evolution-telemetry.ts.

Feedback trajectory loop

When normal agent usage should generate training/eval signal, use feedback trajectories. They turn approvals, rejections, option choices, edits, metrics, and policy blocks into reusable examples.

import {
  createFeedbackTrajectory,
  summarizePreferenceMemory,
  feedbackTrajectoriesToDatasetScenarios,
  feedbackTrajectoriesToOptimizerRows,
} from '@tangle-network/agent-eval'

const trajectory = createFeedbackTrajectory({
  projectId: 'research-agent',
  scenarioId: 'brief-review',
  task: { intent: 'Revise a research brief until it is specific and sourced.' },
  attempts: [{
    id: 'draft-1',
    stepIndex: 0,
    artifactType: 'research',
    artifact: { summary: 'Initial brief with weak sourcing.' },
    createdAt: new Date().toISOString(),
  }],
  labels: [{
    source: 'user',
    kind: 'revision_request',
    value: 'needs stronger evidence',
    reason: 'add primary sources and remove unsupported claims',
    severity: 'error',
    createdAt: new Date().toISOString(),
  }],
})

const memory = summarizePreferenceMemory([trajectory])
const scenarios = feedbackTrajectoriesToDatasetScenarios([trajectory])
const optimizerRows = feedbackTrajectoriesToOptimizerRows([trajectory])

This is the bridge between feedback and optimization: review signals become immediate memory, replayable eval scenarios, and prompt/signature/code optimizer input. See docs/feedback-trajectories.md.
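
One way to close that loop, sketched under the assumption that dataset scenarios expose an id (as the referenceCorpus.map(s => s.id) call in the Evolution loop example implies): point the next evolution run's corpus at the scenarios distilled from real review sessions.

// reviewedTrajectories: whichever stored trajectories you choose to replay
const scenarios = feedbackTrajectoriesToDatasetScenarios(reviewedTrajectories)

const result = await runPromptEvolution({
  // …same runId/target/seeds/adapters as the Evolution loop example above…
  scenarioIds: scenarios.map((s) => s.id),
  reps: 3,
  generations: 5,
  populationSize: 4,
  scoreAdapter,
  mutateAdapter: composite,
})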

v0.16 highlights — production-rigor primitives

These are the primitives any team running prompt-optimization in production needs, regardless of whether they're writing a paper. v0.15 shipped them under "paper-grade" naming; v0.16 corrects that — they're production-first, paper-grade as a side effect.

  • HeldOutGate — held-out paired-delta gate with few_runs / negative_delta / overfit_gap rejection codes and a full evidence block on every decision. Sits alongside the existing bootstrap-CI promotion-gate.ts: that one asks "is this real or noise?", this one asks "is this a real win on held-out and not overfit?". Use both.
  • RunRecord — typed run schema with mandatory snapshot-pinned model, promptHash, configHash, commitSha, costUsd, splitTag. Runtime validator throws on missing fields. Reproducibility falls out for free. A rough field sketch follows this list.
  • pairedBootstrap, pairedWilcoxon, bhAdjust — statistical primitives every rigorous A/B test needs. Where these already existed, they are re-exported under paper-style aliases.
  • runCanaries — silent judge-fallback, calibration drift (KS test), distribution shift (chi-square). Catches the failure mode where your judge silently degrades to a constant-0.30 confidence and you ship configs graded by a stub.
  • summaryTable, paretoChart, gainHistogram — A/B reporting helpers. summaryTable emits markdown with means + 95% bootstrap CIs + paired Wilcoxon p (BH-adjusted) + Cohen's d. Useful for both internal status reports and paper Table 1s.
  • Researcher — stable interface for an external agent that drives the meta-loop (inspectFailures → proposeChange → applyChange → evaluateChange). Ship a NoopResearcher as a placeholder; real implementations live downstream.
  • benchmarks/routing — synthetic 16-task router benchmark we own. Ships in the package. Reference wrappers for GSM8K and SWE-Bench Lite live under examples/benchmarks/ — read, copy, adapt. All three implement one BenchmarkAdapter shape with deterministic splits and fail-loud env-var configuration.
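
A rough sketch of a RunRecord-shaped object, using only the fields named above. Treat it as illustrative rather than the schema: the exported RunRecord type and its runtime validator define the real field set and shapes.

// Illustrative only: field names taken from the RunRecord bullet above; the real
// type may require more fields (scores, scenario ids, timestamps) and stricter shapes.
const run = {
  model: 'provider/model-2025-01-01', // snapshot-pinned model id, never a floating alias
  promptHash: 'sha256:…',
  configHash: 'sha256:…',
  commitSha: 'abc1234',
  costUsd: 0.012,
  splitTag: 'holdout',
}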

v0.16 changes from v0.15

  • Renamed paperTable → summaryTable, paretoFigure → paretoChart, gainDistributionFigure → gainHistogram. Underlying semantics unchanged. Type names follow (SummaryTable, SummaryTableOptions, SummaryTableRow).
  • File: src/paper-report.ts → src/summary-report.ts.
  • Drop the "paper-grade" framing — the primitives are production-first.

See CHANGELOG.md for the full list. .claude/skills/agent-eval/SKILL.md covers usage directives and pitfalls.

Tech stack

  • TypeScript strict, no semicolons, single quotes, 2-space indent
  • tsup for bundling, vitest for tests
  • @tangle-network/tcloud for LLM calls (judges, driver)
  • hono + @asteasolutions/zod-to-openapi for the wire protocol

Develop

pnpm install
pnpm typecheck
pnpm test
pnpm build
pnpm openapi             # write dist/openapi.json from the wire schemas

# Run the server locally
node dist/cli.js serve --port 5005

# Python client tests (require pnpm build first)
cd clients/python && pip install -e ".[dev]" && pytest

Release

@tangle-network/agent-eval (npm) and tangle-agent-eval (PyPI) ship from the same git tag in the same CI workflow. If either fails to publish, neither does. Versions are locked.

Related

License

MIT