@tangle-network/agent-eval

v0.93.0

Evaluate and improve AI agents from runs, traces, judges, and feedback. Compare candidates, cluster failures, measure lift, and gate releases.

Evaluate and improve AI agents from the runs they already produce.

agent-eval turns agent outputs, traces, judge scores, and production feedback into a decision packet: did this change help, what failed, what should ship, and what needs more data?

Use it when you need to:

  • compare a candidate agent/prompt/model against a baseline,
  • turn production traces or human feedback into eval results,
  • run a gated self-improvement loop,
  • explain failures by cluster, cost, judge disagreement, and release risk.

It is a library, not a SaaS requirement. TypeScript is first-class; Python can call the same wire protocol through agent-eval-rpc.


Install

pnpm add @tangle-network/agent-eval

Python clients can use the RPC package:

pip install agent-eval-rpc

Quick start

1. Analyze runs you already have

Start here if you already have production logs, benchmark rows, human ratings, or agent run records.

import { analyzeRuns } from '@tangle-network/agent-eval/contract'

const report = await analyzeRuns({
  runs, // RunRecord[]
  baselineRuns,
})

console.log(report.recommendations)
console.log(report.lift)
console.log(report.failureClusters)

The output includes score distributions, lift confidence intervals, failure modes, cost-quality tradeoffs, judge agreement, contamination checks, and release recommendations when the input supports them.
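To make "lift confidence intervals" concrete, here is a minimal, self-contained sketch of a paired bootstrap over per-scenario score differences. It is an illustration under stated assumptions, not agent-eval's implementation: `pairedBootstrapLift`, `LiftEstimate`, the seeded generator, and the sample scores are all hypothetical names and values introduced here.

```typescript
// Hypothetical sketch; not the library's code. Illustrates a paired bootstrap
// confidence interval for lift (candidate score minus baseline score).

// Small seeded PRNG (mulberry32) so resampling is reproducible.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

interface LiftEstimate {
  meanLift: number; // observed mean of paired differences
  lower: number;    // 2.5th percentile of resampled means
  upper: number;    // 97.5th percentile of resampled means
}

function pairedBootstrapLift(
  baseline: number[],
  candidate: number[],
  resamples = 2000,
  seed = 42,
): LiftEstimate {
  if (baseline.length !== candidate.length || baseline.length === 0) {
    throw new Error("expected two non-empty score arrays of equal length");
  }
  // Pair scores by scenario index, then resample the differences with replacement.
  const diffs = candidate.map((score, i) => score - baseline[i]);
  const rand = mulberry32(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let total = 0;
    for (let i = 0; i < diffs.length; i++) {
      total += diffs[Math.floor(rand() * diffs.length)];
    }
    means.push(total / diffs.length);
  }
  means.sort((x, y) => x - y);
  const at = (q: number) =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))];
  const meanLift = diffs.reduce((sum, d) => sum + d, 0) / diffs.length;
  return { meanLift, lower: at(0.025), upper: at(0.975) };
}

// Made-up scores for six scenarios, baseline vs. candidate.
const liftEstimate = pairedBootstrapLift(
  [0.52, 0.61, 0.48, 0.70, 0.55, 0.63],
  [0.60, 0.66, 0.59, 0.74, 0.61, 0.72],
);
console.log(liftEstimate);
```

When the lower bound stays above zero, as it must here because every paired difference is positive, the candidate's improvement is unlikely to be resampling noise.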

2. Run a gated improvement loop

Use this when you have scenarios, a runnable agent, and judges.

import { selfImprove } from '@tangle-network/agent-eval/contract'

const result = await selfImprove({
  scenarios,
  dispatch: async ({ scenario }) => myAgent.run(scenario),
  judges: [myJudge],
  baselineSurface: { systemPrompt: currentPrompt },
})

console.log(result.gateDecision)
console.log(result.winnerSurface)
console.log(result.insight.recommendations)

selfImprove() evaluates candidates on held-out scenarios before recommending a winner.
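One common way to keep a held-out set stable across runs is to assign each scenario to a split by hashing its ID. The sketch below shows that idea only; how selfImprove() actually partitions scenarios is not described here, and `assignSplit`, `splitScenarios`, and the 30% default are assumptions made for illustration.

```typescript
// Hypothetical sketch; not the library's splitting logic.

// FNV-1a-style 32-bit hash over the string's UTF-16 code units.
function fnv1a(text: string): number {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash >>> 0;
}

type Split = "train" | "heldOut";

// The same ID always lands in the same split, regardless of input order.
function assignSplit(scenarioId: string, heldOutFraction = 0.3): Split {
  return fnv1a(scenarioId) / 0x100000000 < heldOutFraction ? "heldOut" : "train";
}

function splitScenarios<T extends { id: string }>(
  scenarios: T[],
  heldOutFraction = 0.3,
): { train: T[]; heldOut: T[] } {
  const train: T[] = [];
  const heldOut: T[] = [];
  for (const scenario of scenarios) {
    const bucket = assignSplit(scenario.id, heldOutFraction) === "heldOut" ? heldOut : train;
    bucket.push(scenario);
  }
  return { train, heldOut };
}

// Made-up scenario IDs.
const sampleScenarios = ["refund-01", "refund-02", "billing-07", "login-11", "export-03"].map(
  (id) => ({ id }),
);
const sampleSplit = splitScenarios(sampleScenarios);
console.log(sampleSplit.train.length, sampleSplit.heldOut.length);
```

Because the assignment depends only on the ID, adding new scenarios later never moves an existing scenario from the held-out set into the training set.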

3. Adapt existing data

import { analyzeRuns, fromFeedbackTable, fromOtelSpans } from '@tangle-network/agent-eval/contract'

const { runs, raterScores } = fromFeedbackTable({
  ratings: parseYourFeedbackTable(),
})

const traceRuns = fromOtelSpans({ spans: yourOtelSpans })

await analyzeRuns({ runs: [...runs, ...traceRuns], raterScores })

Core concepts

  • RunRecord: the durable row for one agent run, recording the model, prompt/config hashes, split, cost, tokens, and outcome.
  • Scenario: one task or case the agent attempts.
  • Judge: a scoring function, rule-based or model-based.
  • InsightReport: the decision packet returned by analyzeRuns() and embedded in selfImprove().
  • Gate: the policy that decides ship, hold, or need_more_data.
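A rough sketch of how these concepts fit together. The shapes below are deliberately simplified and hypothetical (`SketchScenario`, `SketchJudge`, `sketchGate`, and the thresholds are not the package's exported types or defaults); the real Scenario, JudgeConfig, and Gate types live in the /contract subpath and may look different.

```typescript
// Hypothetical, simplified shapes for illustration only.
interface SketchScenario {
  id: string;
  prompt: string;
  expected: string;
}

// A judge maps (scenario, agent output) to a score in [0, 1].
type SketchJudge = (scenario: SketchScenario, output: string) => number;

type SketchDecision = "ship" | "hold" | "need_more_data";

// Rule-based judge: full credit when the output contains the expected text.
const containsExpected: SketchJudge = (scenario, output) =>
  output.toLowerCase().includes(scenario.expected.toLowerCase()) ? 1 : 0;

// Toy gate policy: ask for more data below a sample floor; otherwise ship only
// when the mean score improves by at least minLift.
function sketchGate(
  baselineScores: number[],
  candidateScores: number[],
  minSamples = 5,
  minLift = 0.05,
): SketchDecision {
  if (baselineScores.length < minSamples || candidateScores.length < minSamples) {
    return "need_more_data";
  }
  const mean = (xs: number[]) => xs.reduce((sum, x) => sum + x, 0) / xs.length;
  return mean(candidateScores) - mean(baselineScores) >= minLift ? "ship" : "hold";
}

const capital: SketchScenario = { id: "s1", prompt: "Capital of France?", expected: "Paris" };
console.log(containsExpected(capital, "It is paris."));
console.log(sketchGate([0, 1, 0, 1, 0], [1, 1, 0, 1, 1]));
```

A production gate would typically also weigh statistical confidence and cost; this toy version only shows the three-way decision.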

Examples

| Journey | Example | Who it's for |
|---|---|---|
| Closed loop: improve a prompt under statistical confidence | examples/selfimprove-quickstart/ | Teams with scenarios + judges + agent in hand |
| Multi-rater feedback corpus: turn Obsidian/Sheets/CSV ratings into actionable insights | examples/customer-feedback-loop/ | Teams reviewing AI outputs by hand who want to compress that taste into per-member LLM judges + close the loop |
| Production OTel traces: analyze logs you already have, no closed loop required | examples/customer-otel-traces/ | Teams running agents in prod with observability, no eval discipline yet |

Each example consists of a README.md and a single index.ts that is runnable via pnpm tsx and prints the resulting InsightReport to stdout.


Subpath entry points

| Subpath | What it gives you |
|---|---|
| …/contract | The headline, frozen surface; new code starts here. selfImprove, analyzeRuns, runEval, runCampaign, runImprovementLoop, diffRuns; intake adapters (fromFeedbackTable, fromOtelSpans); drivers (gepaDriver, evolutionaryDriver); gates (defaultProductionGate, heldOutGate, paretoSignificanceGate, composeGate); the deployment-outcome store; storage; and the five core types Scenario / Dispatch / JudgeConfig / Mutator / Gate. |
| …/hosted | createHostedClient / hostedClientFromEnv + the wire types to ship eval-run events + trace spans to a hosted orchestrator (ours or your own implementation of the spec) |
| …/adapters/otel | createOtelBridge: forwards OpenTelemetry-shape spans into the hosted-tier ingest, no @opentelemetry/* dependency |
| …/adapters/langchain | Wrap any LangChain Runnable as a Dispatch (or JudgeConfig), no @langchain/core peer dep |
| …/adapters/http | httpDispatch + runDispatchServer: run a campaign's worker on another machine (multi-region, driver-as-a-service) |
| …/campaign | The measurement + improvement engine (@experimental): runProfileMatrix, compareDrivers, every driver (gepaDriver, haloDriver, skillOptDriver, aceDriver, memoryCurationDriver, …), the gates, storage backends, and loop provenance. /contract re-exports the stable subset. |
| …/rl | RL bridge from eval artifacts to training signal: verifiable rewards, preferences, OPE, PRM, tournaments, contamination, compute curves, plus the durable corpus + buildRlDataset / datasheet bundle |
| …/reporting | Release-decision statistics: pairedBootstrap, benjaminiHochberg, anytime-valid sequential e-values, evaluateReleaseConfidence, and the report renderers |
| …/analyst | The trace-analyst surface: AnalystRegistry + buildDefaultAnalystRegistry (run the failure-clustering panel), FindingsStore, and the LLM chat transports |
| …/traces | Trace stores + emitters, OTLP-JSONL deterministic replay, analyzeTraces, and the traceAnalystOnRunComplete hook |
| …/control | Agent control loop: runAgentControlLoop (observe → validate → decide → act), action policy, propose/review |
| …/matrix | runAgentMatrix: an N-axis cartesian over caller-supplied substrate values, per-axis pass/score/cost/duration |
| …/multishot | N-shot persona × shot matrix runner (runMultishot / runMultishotMatrix) |
| …/wire | The cross-language HTTP/RPC server + Zod schemas (the source-of-truth protocol the Python client speaks) + the built-in rubric registry |
| …/benchmarks | BenchmarkAdapter contract + deterministicSplit + the bundled routing reference benchmark |

Specialized surfaces (subpath-only): …/prm (process-reward grading + best-of-N), …/meta-eval (judge calibration + the deployment-outcome store), …/pipelines (trace-diagnostic views: budget breach, failure cluster, stuck loop, …), …/governance (EU AI Act / NIST AI RMF / SOC2 reports), …/knowledge (knowledge-readiness gating before a run), …/builder-eval (code-generator three-layer eval), …/storyboard (trace → watchable replay), …/authenticity (anti-Goodhart "real or convincing BS" scorer over produced files), …/workflow (workflow-trace eval + partner export), …/telemetry (Workers-safe telemetry client).

The root export remains available for backward compatibility; new code should prefer the focused subpaths above — /contract first.
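The …/reporting entry in the table above lists benjaminiHochberg among its release-decision statistics. For readers unfamiliar with the procedure, here is a standalone sketch of the textbook Benjamini-Hochberg step-up rule for controlling the false discovery rate across several comparisons. The function name `benjaminiHochbergSketch` and its signature are hypothetical; the package's own export may take different arguments and return a different shape.

```typescript
// Hypothetical sketch of the textbook Benjamini-Hochberg procedure;
// not the package's benjaminiHochberg export.
function benjaminiHochbergSketch(pValues: number[], fdr = 0.05): boolean[] {
  const m = pValues.length;
  // Sort p-values ascending while remembering their original positions.
  const order = pValues
    .map((p, index) => ({ p, index }))
    .sort((a, b) => a.p - b.p);

  // Find the largest 1-based rank k with p_(k) <= (k / m) * fdr.
  let cutoffRank = 0;
  order.forEach(({ p }, i) => {
    if (p <= ((i + 1) / m) * fdr) cutoffRank = i + 1;
  });

  // Reject every hypothesis ranked at or below the cutoff (step-up rule).
  const rejected = new Array<boolean>(m).fill(false);
  for (let i = 0; i < cutoffRank; i++) {
    rejected[order[i].index] = true;
  }
  return rejected;
}

// Made-up p-values for four candidate-vs-baseline comparisons.
const rejections = benjaminiHochbergSketch([0.03, 0.5, 0.02, 0.035]);
console.log(rejections);
```

In this example the two smallest p-values miss their own thresholds, but the third-ranked one (0.035 against 0.0375) passes, so the step-up rule rejects all three.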


Composition with the stack

agent-eval is the bottom of the layering: consumers depend on it, it depends on none of them.

agent-runtime    Runs agents (chat turns, one-shot tasks, multi-attempt loops), captures every
                 run as a trace, and calls optimizePrompt / runImprovementLoop. Produces the
                 RunRecords + traces agent-eval scores. Depends on agent-eval.

agent-eval       selfImprove, analyzeRuns, runCampaign + drivers (gepaDriver, …), the gates
   (this repo)   (heldOutGate, defaultProductionGate, paretoSignificanceGate), the InsightReport
                 decision packet, the RL bridge, the wire protocol. Depends on neither consumer.

agent-knowledge  proposeKnowledgeWrites / applyKnowledgeWriteBlocks. agent-eval's analyst findings
                 feed it; the knowledge gate consumes them. Depends on agent-eval.

sandbox          AgentProfile, Sandbox.create, streamPrompt. The execution surface the runtime's
                 loops run on; agent-eval scores what comes back.

The rule: agent-eval has zero upward dependencies on a consumer. A concept that makes sense without a running agent loop — a verdict, a run record, a scenario, a judge score — is substrate and lives here; a runtime-shaped one (a sandbox profile, a validation context with an abort signal) lives in agent-runtime. When in doubt, lean substrate.


Concepts + design

The .claude/skills/agent-eval/SKILL.md skill ships embedded directives so LLM agents writing integration code don't reintroduce historical bug classes.


Hosted tier

Wire your loop to a hosted orchestrator (ours, or your own implementation of the spec) with one config:

await selfImprove({
  scenarios, dispatch, judges, baselineSurface,
  hostedTenant: {
    endpoint: 'https://intelligence.tangle.tools',
    apiKey: process.env.TANGLE_API_KEY!,
    tenantId: 'your-tenant',
  },
})

The substrate runs the loop in your process. Only the eval-run events + (optional) trace spans go to the orchestrator. Your scenarios, your judges, your raw data — never sent. Spec at docs/hosted-ingest-spec.md; reference receiver at examples/hosted-ingest-server/.


Development

Run an example:

pnpm tsx examples/selfimprove-quickstart/index.ts
pnpm tsx examples/customer-feedback-loop/index.ts
pnpm tsx examples/customer-otel-traces/index.ts

Run the test suite:

pnpm install
pnpm build
pnpm test

Stability + versioning

The /contract surface is the stability contract: its barrel freezes the API — a 0.x minor only adds; nothing there changes shape or disappears. Depend on /contract (and the documented subpaths) rather than the root barrel.

In the deeper subpaths, @stable / @experimental JSDoc markers (visible in IDE hover + .d.ts) call out what may still move — most granularly in /rl (tagged per export) and /campaign (whole barrel @experimental, since /contract re-exports only its settled subset).

| Tag | Meaning |
|---|---|
| @stable | API frozen at this major. Breaking changes require a major bump. |
| @experimental | Interface may evolve before becoming @stable. Pin the patch version if you depend on it. |
| @internal | Not part of the public contract. Use the documented subpath instead. |
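As a rough illustration of how such markers appear in source, the sketch below tags two made-up functions. `clampScore` and `weightedScore` are hypothetical and not part of the package; only the tag names come from the table above. Editors backed by the TypeScript language service generally show JSDoc tags like these in hover tooltips.

```typescript
// Hypothetical functions, used only to show where the markers go.

/**
 * Clamp a raw judge score into [0, 1].
 * @stable
 */
function clampScore(raw: number): number {
  return Math.min(1, Math.max(0, raw));
}

/**
 * Weighted mean of clamped scores; weights that sum to zero yield 0.
 * @experimental The weighting scheme may change before this is marked stable.
 */
function weightedScore(scores: number[], weights: number[]): number {
  if (scores.length !== weights.length) {
    throw new Error("scores and weights must have the same length");
  }
  const totalWeight = weights.reduce((sum, w) => sum + w, 0);
  if (totalWeight === 0) return 0;
  const weightedSum = scores.reduce((sum, s, i) => sum + clampScore(s) * weights[i], 0);
  return weightedSum / totalWeight;
}

console.log(clampScore(1.4), weightedScore([1, 0], [3, 1]));
```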

CHANGELOG.md tracks every release with what's new / additive / breaking.


License

MIT. See LICENSE.