@plaited/agent-eval-harness

CLI tool for capturing agent trajectories from headless CLI agents. Execute prompts, capture full trajectories (tools, thoughts, plans), and output structured JSONL for downstream scoring. Available both as a CLI tool and as installable skills for AI coding agents.

CLI Tool

Run the harness directly via bunx - no installation needed:

# Using built-in headless adapter (recommended - no extra install needed)
export ANTHROPIC_API_KEY=sk-...
bunx @plaited/agent-eval-harness capture prompts.jsonl \
  --schema ./schemas/claude-headless.json \
  -o results.jsonl

The harness works with any CLI agent that supports JSON output - just provide a schema describing how to interact with it. The only prerequisite is an API key for the agent you're driving:

export ANTHROPIC_API_KEY=sk-...   # For Claude
export GEMINI_API_KEY=...         # For Gemini

Pre-built schemas are available in .plaited/skills/headless-adapters/schemas/ for Claude and Gemini.

Core Commands

| Command | Description |
|---------|-------------|
| capture <prompts> --schema <path> | Trajectory capture (full JSONL) |
| trials <prompts> --schema <path> | Multi-run with pass@k metrics |
| summarize <results> | Derive compact views from results |
| calibrate <results> | Sample failures for review |
| validate-refs <prompts> | Check reference solutions |
| balance <prompts> | Analyze test set coverage |
| schemas [name] | Export JSON schemas |
| headless --schema <path> | Schema-driven adapter for any CLI agent |

Pipeline Commands (Unix-style)

| Command | Description |
|---------|-------------|
| run <prompts> --schema <path> | Execute prompts, output raw results |
| extract <raw> --schema <path> | Parse raw output into trajectories |
| grade <results> --grader <path> | Apply grader to extracted results |
| format <results> --style <style> | Convert to markdown, csv, or jsonl |
| compare <run1> <run2>... | Compare runs (aggregate report) |

Examples

# Capture trajectories using headless adapter (recommended)
bunx @plaited/agent-eval-harness capture prompts.jsonl \
  --schema ./schemas/claude-headless.json \
  -o results.jsonl

# Run trials for pass@k analysis with debug mode
bunx @plaited/agent-eval-harness trials prompts.jsonl \
  --schema ./schemas/claude-headless.json \
  -k 5 --grader ./grader.ts --debug

# Summarize results
bunx @plaited/agent-eval-harness summarize results.jsonl -o summary.jsonl

# Export schemas
bunx @plaited/agent-eval-harness schemas CaptureResult --json

# Pipeline workflow (Unix-style composition)
cat prompts.jsonl | \
  bunx @plaited/agent-eval-harness run -s ./schemas/claude-headless.json | \
  bunx @plaited/agent-eval-harness extract -s ./schemas/claude-headless.json | \
  bunx @plaited/agent-eval-harness grade -g ./grader.ts | \
  bunx @plaited/agent-eval-harness format -f markdown > report.md

# Compare runs (built-in strategies: weighted, statistical, custom)
bunx @plaited/agent-eval-harness compare run1.jsonl run2.jsonl -o comparison.json

Skills for AI Agents

Install skills for use with AI coding agents:

curl -fsSL https://raw.githubusercontent.com/plaited/skills-installer/main/install.sh | bash -s -- --agents <agent-name> --project agent-eval-harness

Replace <agent-name> with your agent: claude, cursor, copilot, opencode, amp, goose, factory

Available Skills

Agent Eval Harness

CLI tool for capturing agent trajectories, optimized for TypeScript/JavaScript projects using Bun.

Core Commands:

| Command | Description |
|---------|-------------|
| capture | Execute prompts and capture full trajectories |
| trials | Multi-run trials with pass@k/pass^k metrics |
| summarize | Derive compact views from trajectory results |
| calibrate | Sample failures for grader calibration |
| validate-refs | Validate reference solutions against graders |
| balance | Analyze test set coverage distribution |
| schemas | Export Zod schemas as JSON Schema |

Pipeline Commands (Unix-style):

| Command | Description |
|---------|-------------|
| run | Execute prompts, output raw results |
| extract | Parse raw output into trajectories |
| grade | Apply grader to extracted results |
| format | Convert to markdown, csv, or jsonl |
| compare | Compare runs (aggregate report) |

Use cases:

  • Capturing trajectories for downstream evaluation (Braintrust, custom scorers)
  • Generating training data (SFT/DPO) with full context (see the sketch after this list)
  • Building regression test fixtures for agent behavior
  • Comparing agent responses across configurations
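
Since results are plain JSONL with documented fields, turning them into training pairs is a short script. A minimal sketch (a hypothetical sft-pairs.ts, not part of the harness) that uses only the id, input, and output fields documented under Output Format below:

// sft-pairs.ts - hypothetical helper, not shipped with the harness.
// Reads CaptureResult JSONL on stdin, writes {id, prompt, completion} JSONL on stdout.
const lines = (await Bun.stdin.text()).split('\n').filter((l) => l.trim())
for (const line of lines) {
  const result = JSON.parse(line)
  // input is a string for single prompts, string[] for multi-turn conversations
  const prompt = Array.isArray(result.input) ? result.input.join('\n') : result.input
  console.log(JSON.stringify({ id: result.id, prompt, completion: result.output }))
}

cat results.jsonl | bun sft-pairs.ts > sft.jsonl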

Headless Adapters

Schema-driven adapters for headless CLI agent integration.

Commands:

| Command | Description |
|---------|-------------|
| headless | Schema-driven adapter for any CLI agent |

Use cases:

  • Wrapping headless CLI agents with schema-driven adapter
  • Finding existing adapters for your agent
  • Creating new schemas for CLI agents

Input Format

{"id":"test-001","input":"Create a primary button","hint":"should contain <button>","metadata":{"category":"ui"}}
{"id":"test-002","input":["Create a component","Now add tests"],"metadata":{"category":"multi-turn"}}

| Field | Required | Description |
|-------|----------|-------------|
| id | Yes | Unique identifier |
| input | Yes | Single prompt (string) or conversation turns (string[]) |
| hint | No | Grader context - what to look for |
| reference | No | Reference solution (for validate-refs) |
| metadata | No | Tags, category, difficulty for filtering |
| timeout | No | Override default timeout for this prompt (ms) |
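
Prompts files can be written by hand or generated. A minimal sketch of programmatic generation (a hypothetical make-prompts.ts; the task list is illustrative), using only the fields in the table above:

// make-prompts.ts - hypothetical generator for a prompts.jsonl file.
const tasks = [
  { id: 'test-001', input: 'Create a primary button', hint: 'should contain <button>', metadata: { category: 'ui' } },
  { id: 'test-002', input: ['Create a component', 'Now add tests'], metadata: { category: 'multi-turn' } },
]
// JSONL: one JSON object per line
await Bun.write('prompts.jsonl', tasks.map((t) => JSON.stringify(t)).join('\n') + '\n')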

Output Format

The harness outputs full trajectory JSONL (CaptureResult schema):

{
  "id": "test-001",
  "input": "Create a primary button",
  "output": "Here's a button component...",
  "hint": "should contain <button>",
  "trajectory": [...],
  "metadata": {"category": "ui", "trajectoryRichness": "full", "turnCount": 1},
  "timing": {"start": 1234567890, "end": 1234567900, "total": 10},
  "toolErrors": false,
  "exitInfo": {"exitCode": 0},
  "score": {"pass": true, "score": 1.0, "reasoning": "Contains hint"}
}

Key fields:

  • toolErrors: Boolean indicating if any tool calls failed
  • score: Grader result (only if --grader provided)
  • trajectory: Full execution trace (thoughts, messages, tool calls, plans)
  • metadata.trajectoryRichness: "full" | "messages-only" | "minimal"
  • exitInfo: Process exit information (exitCode, signal, timedOut)
  • timing.total: End-to-end duration (ms)
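
For typed post-processing, an approximate TypeScript shape can be inferred from the example above. This is a reading of the documented example, not the canonical schema - export that with schemas CaptureResult --json:

// Approximate shape inferred from the documented example - not the canonical schema.
interface CaptureResult {
  id: string
  input: string | string[]  // single prompt or conversation turns
  output: string
  hint?: string
  trajectory: unknown[]     // thoughts, messages, tool calls, plans
  metadata: {
    trajectoryRichness: 'full' | 'messages-only' | 'minimal'
    turnCount: number
    [key: string]: unknown
  }
  timing: { start: number; end: number; total: number }  // epoch ms; total is duration in ms
  toolErrors: boolean       // true if any tool call failed
  exitInfo: { exitCode: number | null; signal?: string; timedOut?: boolean }
  score?: { pass: boolean; score: number; reasoning: string }  // only when --grader is provided
}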

Graders

Graders score agent outputs. The harness supports two grader types - TypeScript modules and standalone executables speaking a stdin/stdout JSON protocol - and two grading approaches:

Git-Based Outcome Grading (Recommended for Coding Agents)

Grade outcomes, not paths. Use git to detect actual environmental changes:

import type { Grader } from '@plaited/agent-eval-harness/schemas'
import { resolve } from 'node:path'

export const grade: Grader = async ({ output, hint, cwd }) => {
  // Validate cwd to prevent command injection
  const isValidPath = (path: string): boolean => {
    const dangerousChars = /[;&|`$(){}[\]<>'"\\]/
    if (dangerousChars.test(path)) return false
    if (path.includes('..') || path.startsWith('-')) return false
    return true
  }

  if (!cwd || !isValidPath(cwd)) {
    return { 
      pass: false, 
      score: 0, 
      reasoning: 'Invalid working directory path' 
    }
  }
  
  const safeCwd = resolve(cwd)
  
  // Detect file changes using git
  const status = await Bun.$`git -C ${safeCwd} status --porcelain`.text()
  const filesCreated = status
    .split('\n')
    .filter(line => line.startsWith('??'))
    .map(line => line.slice(3).trim())
  
  // Run tests to verify outcome
  const testResult = await Bun.$`cd ${safeCwd} && bun test`.nothrow()
  const testsPassed = testResult.exitCode === 0
  
  return {
    pass: filesCreated.length > 0 && testsPassed,
    score: testsPassed ? 1.0 : 0.0,
    reasoning: `Files created: ${filesCreated.join(', ')}. Tests: ${testsPassed ? 'pass' : 'fail'}`,
    outcome: {  // Optional: structured data for analysis
      filesCreated,
      testsPassed,
      type: 'file_creation_with_tests'
    }
  }
}

Benefits:

  • Detects actual file changes, test results, build success
  • Works universally in any git repo, any language
  • Returns structured outcome data for downstream analysis
  • Zero configuration required

Output-Based Grading (General Purpose)

For non-coding tasks or when git is unavailable:

import type { Grader } from '@plaited/agent-eval-harness/schemas'

export const grade: Grader = async ({ input, output, hint, trajectory }) => {
  const pass = output.toLowerCase().includes(hint?.toLowerCase() ?? '')
  return {
    pass,
    score: pass ? 1.0 : 0.0,
    reasoning: pass ? 'Contains hint content' : 'Missing hint content'
  }
}

Run it with:

agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.ts

Polyglot Graders (Python, etc.)

Any executable script using stdin/stdout JSON protocol:

#!/usr/bin/env python3
import json
import sys
import subprocess
import re
import os

data = json.load(sys.stdin)
output = data["output"].lower()
hint = (data.get("hint") or "").lower()
cwd = data.get("cwd")

# Validate cwd to prevent command injection
def is_valid_path(path):
    if not path:
        return False
    # Reject shell metacharacters
    if re.search(r'[;&|`$(){}\[\]<>\'"\\]', path):
        return False
    # Reject directory traversal and option injection
    if '..' in path or path.startswith('-'):
        return False
    return True

# Git-based grading if cwd is provided
if cwd:
    if not is_valid_path(cwd):
        print(json.dumps({
            "pass": False,
            "score": 0.0,
            "reasoning": "Invalid working directory path"
        }))
        sys.exit(0)
    
    safe_cwd = os.path.abspath(cwd)
    
    try:
        result = subprocess.run(
            ["git", "-C", safe_cwd, "status", "--porcelain"],
            capture_output=True, text=True, check=True
        )
        files_created = [
            line[3:].strip() 
            for line in result.stdout.split('\n') 
            if line.startswith('??')
        ]
        has_changes = len(files_created) > 0
        print(json.dumps({
            "pass": has_changes,
            "score": 1.0 if has_changes else 0.0,
            "reasoning": f"Files created: {', '.join(files_created)}",
            "outcome": {"filesCreated": files_created, "type": "git_check"}
        }))
        sys.exit(0)
    except subprocess.CalledProcessError:
        # Fall back to output-based grading
        pass

# Output-based grading fallback
pass_result = hint in output if hint else True
print(json.dumps({
    "pass": pass_result,
    "score": 1.0 if pass_result else 0.0,
    "reasoning": "Contains hint" if pass_result else "Missing hint"
}))

Make the script executable, then pass it as the grader:

chmod +x grader.py
agent-eval-harness capture prompts.jsonl --schema ./claude.json --grader ./grader.py

Protocol:

  • Input (stdin): {"input": "...", "output": "...", "hint": "...", "trajectory": [...], "cwd": "/path/to/dir"}
  • Output (stdout): {"pass": true, "score": 1.0, "reasoning": "...", "outcome": {...}}
  • cwd and outcome are optional fields
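
The protocol is language-agnostic, so a standalone grader can equally be written in TypeScript. A minimal sketch of the same hint check speaking the protocol above:

#!/usr/bin/env bun
// grader-proto.ts - hypothetical standalone grader using the stdin/stdout JSON protocol.
const data = JSON.parse(await Bun.stdin.text())
const output: string = (data.output ?? '').toLowerCase()
const hint: string = (data.hint ?? '').toLowerCase()
const pass = hint ? output.includes(hint) : true
// A single JSON result object on stdout
console.log(JSON.stringify({
  pass,
  score: pass ? 1.0 : 0.0,
  reasoning: pass ? 'Contains hint' : 'Missing hint',
}))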

Downstream Integration

The harness outputs standard JSONL. When graders return the optional outcome field, it's merged onto results for powerful downstream analysis:

# Filter failures
cat results.jsonl | jq 'select(.score.pass == false)'

# Extract tool usage patterns
cat results.jsonl | jq '.trajectory[] | select(.type == "tool_call") | .name'

# Analyze outcomes from git-based graders
cat results.jsonl | jq 'select(.outcome.type == "test_execution")'
cat results.jsonl | jq -s 'map(select(.outcome.testsPassed)) | length'
cat results.jsonl | jq 'select(.outcome.touchedCriticalFiles == true)'

# Use with your scoring pipeline
cat results.jsonl | your-scoring-script.ts
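
A scoring script in that last pipe only needs to read JSONL from stdin. A minimal sketch of a pass-rate summary (a hypothetical pass-rate.ts, using only documented fields):

#!/usr/bin/env bun
// pass-rate.ts - hypothetical downstream consumer; pipe results.jsonl into it.
const results = (await Bun.stdin.text())
  .split('\n')
  .filter((l) => l.trim())
  .map((l) => JSON.parse(l))
const graded = results.filter((r) => r.score)  // score only present when --grader was used
const passed = graded.filter((r) => r.score.pass).length
console.log(`${passed}/${graded.length} graded prompts passed`)
const withToolErrors = results.filter((r) => r.toolErrors).length
console.log(`tool errors in ${withToolErrors}/${results.length} runs`)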

Outcome Field

Git-based graders can return structured outcome data:

{
  "id": "fix-tests",
  "input": "Fix the failing authentication tests",
  "output": "I fixed the auth tests by...",
  "score": {"pass": true, "score": 1.0, "reasoning": "Tests pass"},
  "outcome": {
    "testsPassed": true,
    "filesModified": ["src/auth.ts", "src/auth.spec.ts"],
    "exitCode": 0,
    "type": "test_execution"
  }
}

This enables rich analysis across evaluations without re-parsing trajectories.

Development

bun install               # Install dependencies
bun run check             # Type check + lint + format
bun test                  # Run unit tests
bun run test:integration  # Run integration tests (requires API keys)

# Alternative: Run integration tests in Docker
ANTHROPIC_API_KEY=sk-... GEMINI_API_KEY=... \
  docker compose -f docker-compose.test.yml run --rm test

Requirements

  • Runtime: Bun >= 1.2.9
  • Schema: JSON schema describing CLI agent interaction (see .plaited/skills/headless-adapters/schemas/)
  • API Key: ANTHROPIC_API_KEY for Claude, GEMINI_API_KEY for Gemini

License

ISC © Plaited Labs