agent-duelist


Pit LLM providers against each other on agent tasks — Duel your models.

View the landing page →

agent-duelist is a TypeScript-first benchmarking framework that runs the same tasks against multiple LLM providers and gives you structured, reproducible results: correctness, latency, tokens, cost, and more.

npx duelist init   # scaffold a config
npx duelist run    # see who wins

What you get

Console output — box-drawing tables with medals, color-ranked metrics, sparklines, and per-task winners:

[Screenshot: Agent Duelist console output]

HTML report — a self-contained, shareable single-file report with sortable tables, progress bars, tab navigation, and summary cards:

[Screenshot: Agent Duelist HTML report]


Why agent-duelist?

| | |
|---|---|
| Provider-agnostic | One config, many providers. Swap models and gateways without rewriting tasks. |
| Agent-focused | Built for agent workflows and tool use, not just single-turn prompts. |
| Task packs | Built-in benchmark suites — run with --pack structured-output, zero config needed. |
| Quality-first ranking | Medals decided by output quality. Speed and cost only break ties — a fast model that gets nothing right won't medal. |
| 7 built-in scorers | Correctness, latency, cost, schema validation, fuzzy similarity, LLM-as-judge, and tool usage. |
| Fair benchmarking | Tasks run sequentially while providers race in parallel — fair latency comparison with no queue-induced penalties. |
| TypeScript-native | Strongly typed APIs, Zod schemas for structured outputs, and a simple defineArena() entrypoint. |
| CI-ready | Regression detection with confidence intervals, cost budgets, PR comments, and a prebuilt GitHub Action. |
| CLI-first | npx duelist init → npx duelist run gets you from zero to results in minutes. |


Installation

npm install agent-duelist
# or
pnpm add agent-duelist
# or
yarn add agent-duelist

Set API keys for the providers you want to benchmark:

export OPENAI_API_KEY=sk-...
export AZURE_OPENAI_API_KEY=...
export ANTHROPIC_API_KEY=...
export GOOGLE_API_KEY=...

Quickstart

Initialize a config:

npx duelist init

This creates arena.config.ts in your project:

// arena.config.ts
import { defineArena, openai, azureOpenai } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-5-mini'),
    azureOpenai('gpt-5-mini', { deployment: 'my-azure-deployment' }),
  ],
  tasks: [
    {
      name: 'simple-qa',
      prompt: 'In one sentence, explain what a monorepo is.',
      expected:
        'A monorepo is a single repository that contains code for multiple projects.',
    },
    {
      name: 'structured-extraction',
      prompt: 'Extract the company name and year from: "Acme was founded in 2024."',
      expected: { company: 'Acme', year: 2024 },
      schema: z.object({
        company: z.string(),
        year: z.number(),
      }),
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity'],
  runs: 3,
})

Run the benchmark:

npx duelist run

You'll see a results matrix:

  • Rows: tasks (simple-qa, structured-extraction)
  • Columns: providers (openai/gpt-5-mini, azure/gpt-5-mini)
  • Cells: correctness score, latency, tokens, and estimated cost

Export the results in different formats:

# JSON for CI pipelines and dashboards
npx duelist run --reporter json > results.json

# Self-contained HTML report you can share or host
npx duelist run --reporter html --output report.html

Task packs

Task packs are built-in benchmark suites — curated sets of tasks with recommended scorers. Use them to benchmark providers without writing any tasks yourself.

# List available packs
npx duelist run --pack list

# Run a pack with your providers
npx duelist run --pack structured-output --config arena.config.ts

Your config only needs to define providers — the pack supplies tasks and scorers:

// arena.config.ts
import { defineArena, openai, anthropic } from 'agent-duelist'

export default defineArena({
  providers: [
    openai('gpt-5-mini'),
    anthropic('claude-sonnet-4.6'),
  ],
  tasks: [],     // ignored when --pack is used
  scorers: [],   // pack supplies its own scorers
})

Available packs

| Pack | Tasks | Scorers | Description |
|------|-------|---------|-------------|
| structured-output | 6 | correctness, schema-correctness, latency, cost | Zod schema stress test — flat objects, nesting, arrays, enums, empty arrays, and adversarial input |
| tool-calling | 4 | tool-usage, latency, cost | Function invocation accuracy — single calls, complex params, tool selection, and parallel calls |
| reasoning | 5 | correctness, latency, cost | Logic, math, and multi-step thinking — arithmetic, deduction, data interpretation, critical path, and business rules |

Packs work with both run and ci commands:

# CI with a task pack
npx duelist ci --pack structured-output --threshold correctness=0.1 --budget 1.00

You can also combine multiple packs:

npx duelist run --pack structured-output,tool-calling

Core concepts

Providers

Providers are factory functions that return plain objects implementing a shared ArenaProvider interface. This lets you swap providers without changing tasks, wrap or extend providers in your own code, and mock providers in tests.

import {
  openai,
  azureOpenai,
  anthropic,
  gemini,
  openaiCompatible,
} from 'agent-duelist'

// OpenAI
const oai = openai('gpt-5-mini')

// Azure OpenAI
const azure = azureOpenai('gpt-5-mini', {
  deployment: 'my-deployment',
})

// Anthropic
const claude = anthropic('claude-sonnet-4.6')

// Google Gemini
const gem = gemini('gemini-3-flash-preview')

// Any OpenAI-compatible gateway (Ollama, LiteLLM, vLLM, etc.)
const local = openaiCompatible({
  id: 'local/llama',
  name: 'Local Ollama',
  baseURL: 'http://localhost:11434/v1',
  model: 'llama3.3',
  apiKeyEnv: 'LOCAL_LLM_API_KEY',
  free: true,           // registers zero-cost pricing
})

// Reasoning models that emit <think> blocks (DeepSeek-R1, MiniMax M2.5, etc.)
const deepseek = openaiCompatible({
  id: 'deepseek/r1',
  name: 'DeepSeek R1',
  baseURL: 'https://api.deepseek.com/v1',
  model: 'deepseek-reasoner',
  apiKeyEnv: 'DEEPSEEK_API_KEY',
  stripThinking: true,  // strips <think>...</think> from output
})

At minimum, a provider implements:

interface ArenaProvider {
  id: string        // e.g. 'openai/gpt-5-mini'
  name: string      // e.g. 'OpenAI'
  model: string
  run(input: TaskInput): Promise<TaskResult>
}
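
Because providers are plain objects, mocking one in tests is straightforward. A minimal sketch (assuming ArenaProvider, TaskInput, and TaskResult are exported types and that TaskResult carries at least an output field; check the actual type definitions before relying on this):

import type { ArenaProvider, TaskInput, TaskResult } from 'agent-duelist'

// Hypothetical mock provider for unit tests: replies instantly with canned text.
const mockProvider: ArenaProvider = {
  id: 'mock/echo',
  name: 'Mock Echo',
  model: 'echo-1',
  async run(input: TaskInput): Promise<TaskResult> {
    // Assumed TaskResult shape; the real type also reports token usage, etc.
    return { output: 'canned answer' } as TaskResult
  },
}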

Tasks

Tasks describe what you want the model to do:

interface ArenaTask {
  name: string
  prompt: string
  expected?: unknown         // used by correctness scorers
  schema?: ZodSchema<any>    // used by schema-based scorers
  tools?: ToolDefinition[]   // used by tool-calling scorers
}

Examples:

const tasks: ArenaTask[] = [
  {
    name: 'classify-sentiment',
    prompt: 'Classify the sentiment of: "I love this product".',
    expected: 'positive',
  },
  {
    name: 'extract-structured-data',
    prompt: 'Extract { company, year } from: "Acme was founded in 2024."',
    expected: { company: 'Acme', year: 2024 },
    schema: z.object({
      company: z.string(),
      year: z.number(),
    }),
  },
]

Scorers

Scorers turn raw model outputs into numeric scores (0–1) with optional details. Seven built-in scorers ship out of the box:

| Scorer | What it measures |
|--------|------------------|
| latency | Wall-clock response time in milliseconds |
| cost | Estimated USD cost from token usage and a bundled pricing catalog |
| correctness | Exact match against expected (deep-equal, key-order independent for objects) |
| schema-correctness | Validates output against the task's Zod schema via safeParse() |
| fuzzy-similarity | Jaccard token-overlap similarity between output and expected |
| tool-usage | Tool calling accuracy — checks tool selection and argument correctness (1.0 exact match, 0.5 right tool / wrong args, 0.0 wrong tool) |
| llm-judge-correctness | LLM-as-judge — calls a judge model to score accuracy, completeness, and conciseness |
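
For intuition, the fuzzy-similarity score boils down to Jaccard overlap between token sets, roughly like this sketch (the library's actual tokenization and normalization may differ):

// Illustrative Jaccard token-overlap similarity, not the library's exact code.
function jaccardSimilarity(a: string, b: string): number {
  const tokens = (s: string) => new Set(s.toLowerCase().split(/\s+/).filter(Boolean))
  const setA = tokens(a)
  const setB = tokens(b)
  if (setA.size === 0 && setB.size === 0) return 1
  let intersection = 0
  for (const t of setA) if (setB.has(t)) intersection++
  const union = setA.size + setB.size - intersection
  return union === 0 ? 0 : intersection / union
}

// jaccardSimilarity('the cat sat', 'the cat stood') -> 2 / 4 = 0.5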

Configure them in your arena:

scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity']

The llm-judge-correctness scorer evaluates outputs on three criteria (accuracy, completeness, conciseness) and returns a composite decimal score. Configure the judge model directly in your arena config:

defineArena({
  // ...
  scorers: ['latency', 'cost', 'correctness', 'llm-judge-correctness'],
  judgeModel: 'gemini-3.1-pro-preview', // or any OpenAI/Azure/Gemini model
})

The judge model defaults to gpt-5-mini. It can also be set via the DUELIST_JUDGE_MODEL env var. The judge backend is auto-detected from the model name — gemini-* models use Google's API, otherwise it falls back to OpenAI or Azure OpenAI.
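
That detection amounts to something like the following sketch (the gemini-* routing is documented above; the OpenAI-vs-Azure fallback order is an assumption):

// Illustrative judge-backend resolution, not the library's exact logic.
function resolveJudgeBackend(model: string): 'google' | 'azure' | 'openai' {
  if (model.startsWith('gemini-')) return 'google' // gemini-* routes to Google's API
  // Assumed: Azure wins when its key is configured, otherwise OpenAI.
  return process.env.AZURE_OPENAI_API_KEY ? 'azure' : 'openai'
}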


Arena options

defineArena() accepts these top-level options alongside providers, tasks, and scorers:

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| runs | number | 1 | Number of runs per provider × task combination. Higher values improve statistical confidence for CI regression detection. |
| judgeModel | string | 'gpt-5-mini' | Model used by the llm-judge-correctness scorer. Also settable via DUELIST_JUDGE_MODEL env var. Gemini models auto-route to Google's API. |
| timeout | number | 60000 | Per-request timeout in milliseconds. Requests exceeding this are marked as failures. Prevents hanging on unresponsive APIs. |
| sparklines | boolean | true | Show sparkline bars next to percentage scores in the console reporter. Disable with false if your terminal doesn't render Unicode block characters well. |

Example with all options:

export default defineArena({
  providers: [openai('gpt-5-mini'), gemini('gemini-3-flash-preview')],
  tasks: [/* ... */],
  scorers: ['latency', 'cost', 'correctness', 'llm-judge-correctness'],
  runs: 3,
  judgeModel: 'gemini-3.1-pro-preview',
  timeout: 30_000,   // 30s — fail fast on slow APIs
  sparklines: false,  // plain percentages, no Unicode bars
})

Reporters

agent-duelist includes four output formats, each suited to a different workflow:

| Reporter | Flag | Use case |
|----------|------|----------|
| Console | --reporter console (default) | Interactive development — box-drawing tables with medals, sparklines, color-ranked metrics, and per-task winners |
| JSON | --reporter json | CI pipelines, dashboards, and downstream tooling |
| HTML | --reporter html | Shareable single-file reports with sortable tables, animated backgrounds, tab navigation, CSS progress bars, medal rankings, and summary cards |
| Markdown | --comment (CI mode) | Auto-posted PR comment with comparison table, cost summary, and pass/fail verdict |

Generate an HTML report:

npx duelist run --reporter html --output report.html

The HTML report is a single self-contained file — no external dependencies, no build step. Open it in any browser or host it as a static page.


Cost & pricing

Cost estimation is intentionally transparent and conservative:

  1. Token counts — Providers return token usage (prompt and completion tokens) in each TaskResult. These are the source of truth.

  2. Pricing catalog — agent-duelist ships with a locally bundled catalog of per-token prices for many models, derived from OpenRouter's public pricing pages.

    • The catalog maps (provider, model) to { inputPerM, outputPerM } in USD per 1M tokens.
    • Azure OpenAI models resolve back to their base OpenAI models (e.g. azure/gpt-5-mini → openai/gpt-5-mini).
    • Cross-provider fallback: models hosted on Groq, Together, Fireworks, etc. resolve to the original provider's pricing.
  3. Estimated USD — The cost scorer computes:

    estimatedUsd = (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
  4. Unknown models — If a model is not in the catalog, tokens are still reported and cost is marked as unknown (no fake numbers).

  5. Custom pricing — Register pricing for models not in the catalog:

    import { registerPricing } from 'agent-duelist'
    
    registerPricing('custom/my-model', {
      inputPerToken: 0.000003,
      outputPerToken: 0.000015,
    })

You can update the bundled catalog with npm run update:pricing, which re-scrapes OpenRouter's public pricing page.
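
As a worked example of the formula above, with hypothetical catalog prices (not real catalog values for any model):

// Hypothetical pricing: $0.25 per 1M input tokens, $2.00 per 1M output tokens.
const inputPerM = 0.25
const outputPerM = 2.0
const promptTokens = 1_200
const completionTokens = 400

const estimatedUsd =
  (promptTokens * inputPerM + completionTokens * outputPerM) / 1_000_000
// (1200 * 0.25 + 400 * 2.0) / 1e6 = (300 + 800) / 1e6 = $0.0011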


CLI reference

duelist init

Scaffold a new arena.config.ts in the current directory.

npx duelist init
npx duelist init --force  # overwrite an existing config

| Option | Description |
|--------|-------------|
| --force | Overwrite existing config file |

duelist run

Run benchmarks defined in your arena config.

# Default config, console output
npx duelist run

# Custom config
npx duelist run --config path/to/arena.config.ts

# Run a built-in task pack
npx duelist run --pack structured-output

# List available packs
npx duelist run --pack list

# JSON for piping
npx duelist run --reporter json > results.json

# HTML report
npx duelist run --reporter html --output report.html

# Quiet mode
npx duelist run --quiet

| Option | Description |
|--------|-------------|
| -c, --config <path> | Path to config file (default: arena.config.ts) |
| --pack <names> | Run built-in task pack(s) instead of config tasks. Comma-separated for multiple packs. Use list to show available packs. |
| --reporter <type> | Output format: console (default), json, or html |
| --output <path> | Output file path for HTML reporter (default: duelist-report.html) |
| -q, --quiet | Suppress per-result progress |

duelist ci

Run benchmarks, compare against a baseline, and enforce quality gates. Exits non-zero if regressions are detected or cost exceeds the budget.

# First run — establishes the baseline
npx duelist ci --update-baseline

# Subsequent runs — compare against baseline
npx duelist ci --threshold correctness=0.1 --budget 1.00

# Run CI with a task pack
npx duelist ci --pack structured-output --threshold correctness=0.1

# Post comparison table as a PR comment (GitHub Actions)
npx duelist ci --threshold correctness=0.1 --comment

| Option | Description |
|--------|-------------|
| -c, --config <path> | Path to config file (default: arena.config.ts) |
| --pack <names> | Run built-in task pack(s) instead of config tasks |
| --baseline <path> | Baseline JSON file (default: .duelist/baseline.json) |
| --budget <dollars> | Max total cost in USD — fails if exceeded |
| --threshold <scorer=delta> | Regression threshold (repeatable, e.g. --threshold correctness=0.1 --threshold cost=0.002) |
| --update-baseline | Save results as new baseline after passing |
| --comment | Post markdown comparison table as a GitHub PR comment |
| -q, --quiet | Suppress per-result progress |

How regression detection works:

  • With runs > 1, the CI uses 95% confidence intervals (t-distribution) — a scorer only regresses if the confidence intervals don't overlap beyond the threshold. This avoids false positives from noisy LLM outputs.
  • With runs === 1, it uses a simple delta comparison.
  • Without --threshold flags, regression detection is skipped entirely — only --budget is enforced.
  • Results with high variance (CV > 0.3) are flagged as flaky with a warning.
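
In code, the multi-run comparison amounts to roughly the following sketch (illustrative only; the t critical value is hardcoded for three runs, and the library's exact overlap rule may differ):

// Illustrative CI-based regression check for one scorer.
function regressed(baseline: number[], current: number[], threshold: number): boolean {
  const mean = (xs: number[]) => xs.reduce((a, b) => a + b, 0) / xs.length
  const sd = (xs: number[]) => {
    const m = mean(xs)
    return Math.sqrt(xs.reduce((a, x) => a + (x - m) ** 2, 0) / (xs.length - 1))
  }
  // 95% half-width; 4.303 is the two-sided t critical value for n = 3 (df = 2).
  const halfWidth = (xs: number[]) => (4.303 * sd(xs)) / Math.sqrt(xs.length)
  const drop = mean(baseline) - mean(current)
  // Flag a regression only when the score drop exceeds the threshold
  // by more than the combined uncertainty of both intervals.
  return drop - (halfWidth(baseline) + halfWidth(current)) > threshold
}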

The CLI loads TypeScript configs directly using a lightweight runtime loader so you don't need to precompile your config.


Tool-calling agent example

agent-duelist supports tool-calling tasks — define tools with Zod-typed parameters and handlers, and the provider will execute them during the benchmark:

import { defineArena, openai } from 'agent-duelist'
import { z } from 'zod'

const weatherTool = {
  name: 'getCurrentWeather',
  description: 'Get the current weather in a given city',
  parameters: z.object({ city: z.string() }),
  handler: async ({ city }: { city: string }) => ({
    city,
    tempC: 20,
  }),
}

export default defineArena({
  providers: [openai('gpt-5-mini')],
  tasks: [
    {
      name: 'weather-tool-call',
      prompt: 'What is the current temperature in Amsterdam? Use the tool.',
      expected: { city: 'Amsterdam' },
      tools: [weatherTool],
    },
  ],
  scorers: ['latency', 'cost', 'tool-usage'],
  runs: 1,
})

The model calls getCurrentWeather, the handler returns a stub result, and the tool-usage scorer reports whether the expected tool was invoked. Tool calls and their results are included in the JSON output for inspection.
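
The rubric from the scorer table (1.0 exact match, 0.5 right tool / wrong args, 0.0 wrong tool) reduces to something like this sketch (the real scorer likely uses key-order-independent deep equality rather than JSON string comparison):

// Illustrative tool-usage scoring, not the library's implementation.
function scoreToolCall(
  actual: { name: string; args: unknown },
  expected: { name: string; args: unknown },
): number {
  if (actual.name !== expected.name) return 0 // wrong tool
  // Crude argument check; a real deep-equal would ignore key order.
  const argsMatch = JSON.stringify(actual.args) === JSON.stringify(expected.args)
  return argsMatch ? 1 : 0.5 // right tool, exact or inexact args
}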


Example: multi-provider benchmark

A richer example comparing multiple providers across tasks:

// arena.config.ts
import { defineArena, azureOpenai, openai, gemini } from 'agent-duelist'
import { z } from 'zod'

export default defineArena({
  providers: [
    openai('gpt-5-mini'),
    azureOpenai('gpt-5-nano'),
    gemini('gemini-3-flash-preview'),
  ],
  tasks: [
    {
      name: 'extract-company',
      prompt:
        'Extract the company name and role as JSON from: "I work at Acme Corp as a senior engineer." Return {"company": "...", "role": "..."}',
      expected: { company: 'Acme Corp', role: 'senior engineer' },
      schema: z.object({ company: z.string(), role: z.string() }),
    },
    {
      name: 'summarize',
      prompt:
        'Summarize in one sentence: TypeScript is a strongly typed programming language that builds on JavaScript, giving you better tooling at any scale.',
      expected:
        'TypeScript is a typed superset of JavaScript that improves tooling for projects of any size.',
    },
    {
      name: 'classify-sentiment',
      prompt:
        'Classify the sentiment of this review as "positive", "negative", or "neutral". Return only the classification word.\n\nReview: "The product arrived on time and works exactly as described. Very happy with my purchase!"',
      expected: 'positive',
    },
  ],
  scorers: ['latency', 'cost', 'correctness', 'schema-correctness', 'fuzzy-similarity'],
  runs: 3,
})
Run it:

npx duelist run

How scoring works:

  • Providers are compared head-to-head within each task — all providers receive the same prompt at the same time.
  • Medals are awarded only when a provider is the sole leader in a metric column. Ties don't award medals, keeping rankings meaningful.
  • Quality-first ranking: medals are decided by quality scorer wins (correctness, schema-correctness, etc.). Efficiency metrics (latency, cost) only break ties among quality-equal providers.
  • Quality gate: providers that score 0% on all quality scorers are ineligible for medals entirely — being fast or cheap doesn't compensate for getting nothing right.
  • The overall winner is the provider with the highest average correctness score.
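
Put together, the ranking rules above amount to roughly this sketch (illustrative; the field names and exact tie-break order inside the library are assumptions):

// Illustrative quality-first medal ranking.
interface ProviderSummary {
  id: string
  qualityWins: number    // per-task wins on quality scorers (correctness, etc.)
  avgLatencyMs: number
  totalCostUsd: number
  hasAnyQuality: boolean // scored above 0% on at least one quality scorer
}

function rankForMedals(summaries: ProviderSummary[]): ProviderSummary[] {
  return summaries
    .filter((s) => s.hasAnyQuality) // quality gate: 0% quality means no medal
    .sort(
      (a, b) =>
        b.qualityWins - a.qualityWins ||   // quality decides the ranking
        a.avgLatencyMs - b.avgLatencyMs || // latency breaks quality ties
        a.totalCostUsd - b.totalCostUsd,   // then cost
    )
}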

CI / GitHub Actions

duelist ci is designed to run as a quality gate in your CI pipeline. It compares benchmark results against a saved baseline and fails the build if quality regresses or costs exceed a budget.

GitHub Action

The easiest way to add eval quality gates to your PR workflow:

# .github/workflows/eval.yml
name: LLM Eval
on: [pull_request]

jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: DataGobes/agent-duelist/.github/actions/duelist-ci@main
        with:
          budget: '1.00'
          thresholds: 'correctness=0.1'
        env:
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

The action handles Node.js setup, runs duelist ci, posts a comparison table as a PR comment, and optionally commits an updated baseline.

| Input | Default | Description |
|-------|---------|-------------|
| config | arena.config.ts | Path to arena config file |
| baseline | .duelist/baseline.json | Path to baseline JSON file |
| budget | — | Max total cost in USD |
| thresholds | — | Space-separated scorer=delta pairs |
| update-baseline | false | Save results as new baseline after passing |
| comment | true | Post results as PR comment |
| node-version | 20 | Node.js version to use |

PR comment output

When --comment is enabled, the CI posts (or updates) a markdown table on the PR:

| Provider | Task | Scorer | Baseline | Current | Delta | Status |
|----------|------|--------|----------|---------|-------|--------|
| openai/gpt-5-mini | extract | correctness | 0.900 | 0.850 | -0.050 | unchanged |
| openai/gpt-5-mini | extract | latency | 0.920 ± 0.030 | 0.890 ± 0.025 | -0.030 | unchanged |

With cost summary, flakiness warnings, and pass/fail verdict.


Roadmap

Shipped:

  • 5 provider types: OpenAI, Azure OpenAI, Anthropic, Google Gemini, and any OpenAI-compatible gateway
  • 7 built-in scorers including LLM-as-judge, tool-usage, schema validation, and fuzzy similarity
  • Tool-calling support with local handlers for agent task benchmarking
  • Task packs: built-in benchmark suites (structured-output, tool-calling, reasoning) — run with --pack, no config writing needed
  • Quality-first medal ranking: output quality decides medals, efficiency only breaks ties
  • Fair head-to-head benchmarking with parallel provider execution
  • 4 reporters: console (tables + medals + sparklines), JSON, HTML (sortable, self-contained), and Markdown (PR comments)
  • duelist ci with regression detection (confidence intervals), cost budgets, and flakiness warnings
  • GitHub Action for CI/CD integration
  • Pricing catalog with cross-provider fallback and registerPricing() for custom models
  • openaiCompatible with stripThinking for reasoning models and free flag for local models
  • Configurable per-request timeout

Planned (subject to community feedback):

  • More task packs — summarization, multi-turn conversation, and code generation packs
  • Agent workflows — multi-step tool chains, multi-hop reasoning, and agent traces
  • More export formats — CSV
  • Plugin system — first-class support for user-defined providers and scorers
  • Embedding-based scoring — semantic similarity via embedding distance
  • More providers — OpenRouter-native and additional OpenAI-compatible gateways

Have a use case in mind? Open an issue — community feedback shapes what gets built first.


Contributing

Contributions, issues, and feature requests are welcome.

  • Bug reports / ideas: open a GitHub issue.
  • Code changes:
    1. Fork the repo.
    2. Create a branch.
    3. Run tests: npm test.
    4. Run build: npm run build.
    5. Open a PR with a clear description.

Please keep PRs narrowly focused (single provider, one new scorer, etc.) so they're easy to review.


License

MIT