maestro-evals

v1.0.0

Golden-prompt regression guard for the Maestro agent runtime. Static evals (mock-based, every CI) plus live evals (real Anthropic, scheduled) that catch the four Anthropic tool-calling traps before they ship.

Golden-prompt regression guard for maestro-core. Catches the four documented Anthropic tool-calling traps before they hit production — the ones invisible to type checks that only surface as <function_calls> XML in user-facing prose.

Two tiers:

| Tier | Cost | Cadence | What it catches |
| --- | --- | --- | --- |
| Static | $0 | Every CI run | Call-shape traps + fixture contract checks against a declared simulated response |
| Live | ~$0.001 / fixture on haiku | Scheduled + pre-release | Same checks, but against real Anthropic output |

Why this exists

Over a single week of maestro-core development we burned multiple sessions diagnosing the same prod incident in four different forms. All four are individually invisible to TypeScript:

| Trap | Symptom | Assertion that catches it |
| --- | --- | --- |
| 1. system mixed into messages | Model emits <function_calls> XML in prose | assertNoToolNarrationXml + shape-phase top-level-system check |
| 2. Missing stopWhen | Tool fires, bubble ends with no text | assertToolFiredHasText |
| 3. Anti-narration rule missing | Both real tool_use AND XML in prose | assertNoToolNarrationXml |
| 4. Tool registry resolves empty (surface-vs-transport drift) | Anthropic gets tools: {} → narrates from training corpus | assertToolsRegistered |

A 30-second smoke eval would have caught any of them.
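
To make the symptom in traps 1 and 3 concrete, here is a minimal sketch of an XML-narration check. It is not the package's implementation: the function name findNarrationTokens and the two-entry token list are assumptions made for illustration, and the real TOOL_NARRATION_XML_TOKENS export may contain different entries.

```typescript
// Hypothetical token list; the package's TOOL_NARRATION_XML_TOKENS may differ.
const NARRATION_TOKENS: string[] = ['<function_calls>', '</function_calls>']

// Returns every narration token that appears in the model's user-facing prose.
function findNarrationTokens(text: string): string[] {
    return NARRATION_TOKENS.filter((token) => text.includes(token))
}

const clean = 'Booking B-1234 is confirmed.'
const leaked = 'Let me check. <function_calls>lookupBooking</function_calls>'

console.log(findNarrationTokens(clean))  // []
console.log(findNarrationTokens(leaked)) // ['<function_calls>', '</function_calls>']
```

A plain substring scan like this is cheap enough to run on every reply, which is why a short smoke eval can surface the problem before users see it.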

Install

pnpm add -D maestro-evals

Peer deps: maestro-core ^1.0.0, ai ^6.0.0, zod ^3.25.0. For live mode, also @ai-sdk/anthropic ^3.0.0 and ANTHROPIC_API_KEY.

Author a fixture

A fixture is a TypeScript module with a default export — same authoring shape as a maestro-core tool:

// fixtures/basic-tool-call.fixture.ts
import { defineAgentTool, ok } from 'maestro-core'
import { z } from 'zod'
import type { EvalFixture } from 'maestro-evals'

const lookupBooking = defineAgentTool({
    name: 'lookupBooking',
    description: 'Look up a booking by reference.',
    transports: ['chat'],
    inputSchema: z.object({ ref: z.string() }),
    execute: async ({ ref }) => ok({ ref, status: 'confirmed' }),
})

const fixture: EvalFixture = {
    name: 'basic-tool-call',
    prompt: 'Check booking B-1234.',
    tools: [lookupBooking],
    simulated: {
        text: 'Booking B-1234 is confirmed.',
        toolCalls: [{ name: 'lookupBooking' }],
    },
    expect: {
        toolCalls: ['lookupBooking'],
        noXmlInProse: true,
        nonEmptyText: true,
    },
}

export default fixture

The simulated block is the model's "declared" response — the static runner asserts against that as if Anthropic returned it. The live runner ignores simulated and asserts against the real reply.
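
The static check described above can be pictured roughly as follows. This is a sketch under assumptions, not the runner's actual code: the Simulated and Expectation interfaces are local stand-ins that mirror only the fixture fields shown earlier, checkSimulated is a hypothetical helper, and the empty_text code is invented for the example.

```typescript
// Local stand-ins for the fixture fields shown above; not the package's types.
interface Simulated { text: string; toolCalls: { name: string }[] }
interface Expectation { toolCalls?: string[]; noXmlInProse?: boolean; nonEmptyText?: boolean }

// Returns one failure code for each expectation the simulated response violates.
function checkSimulated(simulated: Simulated, expected: Expectation): string[] {
    const failures: string[] = []
    const called = new Set(simulated.toolCalls.map((call) => call.name))

    for (const name of expected.toolCalls ?? []) {
        if (!called.has(name)) failures.push('missing_tool_call')
    }
    if (expected.noXmlInProse && simulated.text.includes('<function_calls>')) {
        failures.push('xml_in_prose')
    }
    if (expected.nonEmptyText && simulated.text.trim().length === 0) {
        failures.push('empty_text') // hypothetical code, not taken from the README
    }
    return failures
}

const expectation: Expectation = { toolCalls: ['lookupBooking'], noXmlInProse: true, nonEmptyText: true }

const passing = checkSimulated(
    { text: 'Booking B-1234 is confirmed.', toolCalls: [{ name: 'lookupBooking' }] },
    expectation,
)
const failing = checkSimulated({ text: '', toolCalls: [] }, expectation)

console.log(passing) // []
console.log(failing) // ['missing_tool_call', 'empty_text']
```

In live mode the same kind of comparison would apply, with the first argument built from the real reply instead of the simulated block.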

Run

Static (every CI)

maestro-evals run --dir ./fixtures

Exits non-zero on any fixture failure. No API key required.

Live (scheduled + pre-release)

ANTHROPIC_API_KEY=sk-... maestro-evals run --dir ./fixtures --live

Defaults to claude-haiku-4-5-20251001 to keep cost down. Override with --model.

Reporter formats

maestro-evals run --reporter json
maestro-evals run --reporter tap
maestro-evals run --reporter console   # default
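
For readers unfamiliar with TAP, the sketch below renders a list of results in the generic TAP 13 layout (version line, plan line, one ok / not ok line per test). The FixtureResult shape and the toTap helper are assumptions for illustration; the exact output of --reporter tap may differ in detail.

```typescript
// Hypothetical result shape; the package's report type may differ.
interface FixtureResult { name: string; passed: boolean; failure?: string }

// Renders results in the generic TAP 13 layout: version, plan, one line per test.
function toTap(results: FixtureResult[]): string {
    const lines: string[] = ['TAP version 13', `1..${results.length}`]
    results.forEach((result, index) => {
        const status = result.passed ? 'ok' : 'not ok'
        lines.push(`${status} ${index + 1} - ${result.name}`)
        // TAP treats lines starting with '#' as diagnostics.
        if (!result.passed && result.failure) {
            lines.push(`# ${result.failure}`)
        }
    })
    return lines.join('\n')
}

const tap = toTap([
    { name: 'basic-tool-call', passed: true },
    { name: 'multi-tool', passed: false, failure: 'missing_tool_call' },
])
console.log(tap)
```

TAP is useful in CI because most test dashboards and log parsers already understand it, whereas the console format is meant for people.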

Programmatic use

import { runStaticEvals, runLiveEvals, type EvalFixture } from 'maestro-evals'
import myFixture from './fixtures/basic-tool-call.fixture.js'

const fixtures: EvalFixture[] = [myFixture]

// In CI:
const staticReport = await runStaticEvals(fixtures)
if (!staticReport.passed) process.exit(1)

// In a release-gate job:
const liveReport = await runLiveEvals(fixtures, {
    anthropicApiKey: process.env.ANTHROPIC_API_KEY!,
    model: 'claude-haiku-4-5-20251001',
})
if (!liveReport.passed) process.exit(1)

Assertion library

Reusable outside the runners — pure functions, no AI SDK dependency:

import {
    assertNoToolNarrationXml,
    assertToolFiredHasText,
    assertToolsRegistered,
    assertToolsCalled,
    assertNoToolsCalled,
    assertTextMinLength,
    assertNoForbiddenPhrases,
    EvalAssertionError,
    TOOL_NARRATION_XML_TOKENS,
} from 'maestro-evals/assertions'

Each helper throws EvalAssertionError with a stable .code (xml_in_prose, tool_fired_no_text, empty_tool_registry, etc.) so callers can group / filter without parsing messages.
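
As a rough illustration of why a stable code is handy, the sketch below counts failures per code without touching the message text. StandInAssertionError is a local stand-in with the shape described above (an Error carrying a code property), not the package's EvalAssertionError, and countByCode is a hypothetical helper.

```typescript
// Stand-in with the shape the README describes: an Error carrying a stable `code`.
class StandInAssertionError extends Error {
    readonly code: string
    constructor(code: string, message: string) {
        super(message)
        this.code = code
        this.name = 'StandInAssertionError'
    }
}

// Counts failures per code without inspecting the human-readable message.
function countByCode(errors: StandInAssertionError[]): Record<string, number> {
    const counts: Record<string, number> = {}
    for (const error of errors) {
        counts[error.code] = (counts[error.code] ?? 0) + 1
    }
    return counts
}

const errors = [
    new StandInAssertionError('xml_in_prose', 'Found narration XML in reply text.'),
    new StandInAssertionError('tool_fired_no_text', 'Tool fired but the reply had no text.'),
    new StandInAssertionError('xml_in_prose', 'Found narration XML in reply text.'),
]

console.log(countByCode(errors)) // { xml_in_prose: 2, tool_fired_no_text: 1 }
```

Because the grouping key is the code rather than the message, rewording an error message later would not break dashboards or alert rules built on these counts.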

CI wiring (GitHub Actions)

# .github/workflows/evals.yml
name: evals

on:
  pull_request:
  schedule:
    - cron: '0 14 * * 1'   # Monday 14:00 UTC

jobs:
  static:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - run: pnpm install --frozen-lockfile
      - run: pnpm --filter your-app build
      - run: npx maestro-evals run --dir ./dist/fixtures

  live:
    if: github.event_name == 'schedule' || contains(github.event.pull_request.labels.*.name, 'release-gate')
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: pnpm/action-setup@v4
      - run: pnpm install --frozen-lockfile
      - run: pnpm --filter your-app build
      - run: npx maestro-evals run --dir ./dist/fixtures --live --reporter tap
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}

The static job has no condition, so it runs on every PR and also on the weekly schedule. The live job runs on the weekly schedule plus PRs labeled release-gate.

Sample report

maestro-evals (static)  2026-05-18T15:02:11.123Z
─────────────────────────────────────────────────
  PASS  basic-tool-call
        Single-tool happy path — user asks for a booking, model invokes lookup and summarises.
  PASS  refusal
        Off-scope ask — model should refuse politely without inventing a tool call.
  FAIL  multi-tool
        Two sequential tool calls — find customer, then list their bookings.
        ✗ missing_tool_call: Expected tool "listBookings" to be called but it was not. Actually called: [findCustomer].

Summary: 2/3 passed

Limitations

  • Single-turn only today. Multi-turn fixtures (assistant + user + assistant) are a future extension.
  • No streaming-shape assertions. We assert the finalised text + tool-calls; mid-stream delta shape is not checked.
  • Static fixtures don't catch model-specific narration leaks. That's the point of running live evals on a schedule.
  • The static runner does not invoke runChatTurn directly. It mirrors the applyCacheBreakpoints + tool-registry handoff so any change to runChatTurn that breaks the call-shape contract surfaces here. runChatTurn has its own regression suite in maestro-core covering the per-turn lifecycle.

License

Apache-2.0