@aui.io/evals

v0.1.1

Published

a month ago

Multi-turn conversation agent evaluator with LLM-as-judge scoring.

AUI Agent Evaluator

Multi-turn conversation evaluation framework for AUI agents. Runs simulated conversations against any AUI agent, then scores them with both LLM-based and programmatic (strict) evaluations.

Install

npm install @aui.io/evals

Quick Start

# Validate a test suite
npx aui-evals validate my-suite.json

# Preview what will run
npx aui-evals plan my-suite.json

# Run with OpenAI judge
npx aui-evals run my-suite.json --llm-key=<YOUR_OPENAI_KEY>

# Run with Anthropic judge
npx aui-evals run my-suite.json --llm-key=<YOUR_ANTHROPIC_KEY> --provider=anthropic

You can also set LLM_API_KEY as an env var instead of --llm-key.

CLI Commands

List Bundled Test Suites

# See all available bundled test suites
npx aui-evals list-suites

This shows all included Demo Agents and Quack Agents test suites with their names.

Run Bundled Test Suites

You can run bundled suites by name without specifying full paths:

# Run Demo Agents suites
npx aui-evals run demo:hr --llm-key=$OPENAI_KEY
npx aui-evals run demo:retail --llm-key=$OPENAI_KEY
npx aui-evals run demo:automotive-agent-140426 --llm-key=$KEY

# Run Quack Agents suites
npx aui-evals run quack:artlist/payment_issues_tda --llm-key=$KEY
npx aui-evals run quack:notable/compass_tests --llm-key=$KEY

Run Local Test Suites

# Validate a test suite
npx aui-evals validate suite.json

# Show execution plan
npx aui-evals plan suite.json

# Run evaluations
npx aui-evals run suite.json --llm-key=$OPENAI_KEY

# Run with specific options
npx aui-evals run suite.json \
  --llm-key=$OPENAI_KEY \
  --provider=openai \
  --parallel=10 \
  --loops=3 \
  --test="refund,cancel" \
  --tag="critical"

CLI Flags

| Flag | Description | |------|-------------| | --llm-key=<KEY> | API key for the judge/user-sim LLM (or set LLM_API_KEY env var) | | --provider=openai\|anthropic | LLM provider for judging (default: openai) | | --parallel=<N> | Number of tests to run in parallel (default: 1) | | --loops=<N> | Number of times to repeat each test (default: 1) | | --test=<name> | Filter tests by name (substring match, comma-separated) | | --tag=<tag> | Filter tests by tag (comma-separated) | | --debug | Include raw API responses in markdown output |

Test Suite Structure

A test suite is a JSON file with the following top-level fields:

| Field | Type | Required | Description | |-------|------|----------|-------------| | name | string | yes | Name of the evaluation suite | | agentId | string | yes | Agent ID to test | | apiKey | string | yes | Network API key for authenticating with the agent | | model | string | no | Model override for the agent under test | | judgeModel | string | no | Model for the judge LLM (defaults to suite runner's model) | | tests | Test[] | yes | Array of test cases (min 1) |

Test

Each test defines a simulated user scenario and the criteria to evaluate.

| Field | Type | Required | Description | |-------|------|----------|-------------| | name | string | yes | Unique test name | | userInfo | object | no | Free-form key-value pairs passed as agent_variables to the agent API (user profile, subscription, billing, etc.) | | guidelines | string | yes | Behavioral guidelines — how the simulated user should act (personality, goals, tone) | | systemPrompt | string | no | System prompt override for the agent | | turns | Turn[] | yes | Conversation turns to execute (min 1). Padded to 7 turns with "auto" if fewer are defined | | evaluations | Evaluation[] | yes | Criteria to evaluate after the conversation (min 1) | | tags | string[] | no | Tags for filtering/grouping test results |

Turn

Each turn represents one user message in the conversation.

| Field | Type | Required | Description | |-------|------|----------|-------------| | user | string | yes | The user message to send. Use "auto" to let the LLM generate a realistic reply based on userInfo, guidelines, and conversation history | | timeout | number | no | Timeout in seconds for the agent response (default: 120) | | expectation | string | no | Expected behavior hint — not sent to the agent, used by the LLM judge for context when evaluating |

Auto-generated turns

When user is "auto", the framework uses the judge LLM to generate a realistic user message based on the persona and conversation history. The simulated user will output ###STOP### when the conversation reaches a terminal state, ending the test early.

If fewer than 7 turns are defined, the remaining turns are automatically filled with "auto" to allow the conversation to play out naturally.

Evaluations

Evaluations are criteria applied after the conversation completes. There are two types:

1. LLM Judge Evaluation (default)

When no strict field is set, the criterion is evaluated by an LLM judge that reads the full conversation and determines pass/fail with reasoning.

{
  "criterion": "The agent maintained a friendly and professional tone throughout",
  "weight": 1,
  "category": "tone"
}

2. Strict Evaluation (programmatic)

When strict is set, the evaluation bypasses the LLM judge and is checked programmatically against the conversation trace data. This is deterministic and faster.

Common fields (all evaluation types)

| Field | Type | Required | Default | Description | |-------|------|----------|---------|-------------| | criterion | string | yes | — | Natural language description of what to evaluate | | weight | number | no | 1 | Importance weight. Higher = more impact on the overall score | | category | string | no | — | Category for grouping results (e.g. "accuracy", "tone", "safety", "compliance") | | aui_logic | string[] | no | — | AUI logic rule IDs that govern this evaluation. Used in the summary report's "By AUI Logic" table |

Strict Evaluation Types

`escalated`

Passes if CONVERSATION_FORWARDING was triggered (agent escalated to a human).

{
  "criterion": "Agent should escalate this case to a human",
  "strict": "escalated"
}

`not_escalated`

Passes if CONVERSATION_FORWARDING was NOT triggered.

{
  "criterion": "Agent should handle this without escalation",
  "strict": "not_escalated"
}

`tool_used`

Passes if the specified tool/workflow was executed during the conversation.

| Extra field | Required | Description | |-------------|----------|-------------| | toolName | yes | The workflow name to check (from executed_workflows) |

{
  "criterion": "Refund request workflow should be activated",
  "strict": "tool_used",
  "toolName": "REFUND_REQUEST"
}

`tool_not_used`

Passes if the specified tool/workflow was NOT executed. Optionally, can check that it was not used before a specific turn.

| Extra field | Required | Description | |-------------|----------|-------------| | toolName | yes | The workflow name to check | | beforeTurn | no | 1-based turn number. If the tool is used at this turn or later, it passes. Only fails if used before this turn |

{
  "criterion": "Should not offer a discount before collecting the reason",
  "strict": "tool_not_used",
  "toolName": "DISCOUNT_OFFER",
  "beforeTurn": 3
}

`param_equals`

Passes if an extracted parameter from trace_info.understanding.extracted_params matches the expected value.

| Extra field | Required | Description | |-------------|----------|-------------| | paramName | yes | Parameter name to check | | paramValue | yes | Expected value |

{
  "criterion": "Agent should classify the reason as too_expensive",
  "strict": "param_equals",
  "paramName": "refund-reason",
  "paramValue": "too_expensive"
}

`param_exists`

Passes if the parameter was extracted (with a non-empty, non-"undefined" value).

| Extra field | Required | Description | |-------------|----------|-------------| | paramName | yes | Parameter name to check |

{
  "criterion": "Agent should extract the reason parameter",
  "strict": "param_exists",
  "paramName": "refund-reason"
}

`param_not_exists`

Passes if the parameter was NOT extracted (missing, empty, or "undefined").

| Extra field | Required | Description | |-------------|----------|-------------| | paramName | yes | Parameter name to check |

{
  "criterion": "Agent should not extract a discount amount",
  "strict": "param_not_exists",
  "paramName": "discount-amount"
}

`escalated_due_to_error`

Passes if the agent returned an error response (trace_info.response.type === "error").

{
  "criterion": "Agent should return an error for this invalid request",
  "strict": "escalated_due_to_error"
}

`not_escalated_due_to_error`

Passes if the agent did NOT return an error response.

{
  "criterion": "Agent should handle this without errors",
  "strict": "not_escalated_due_to_error"
}

`rule_triggered`

Passes if a specific rule (identified by its code) was triggered in trace_info.decisions. Rules are accumulated across all turns.

The rule code comes from trace_info.decisions[].rule.code in the agent's raw response.

| Extra field | Required | Description | |-------------|----------|-------------| | ruleCode | yes | The rule code to check for |

{
  "criterion": "Eligible escalation rule should fire",
  "strict": "rule_triggered",
  "ruleCode": "low-tier-eligible-escalate"
}

`rule_not_triggered`

Passes if a specific rule was NOT triggered in any turn.

| Extra field | Required | Description | |-------------|----------|-------------| | ruleCode | yes | The rule code that should not appear |

{
  "criterion": "Premium retention rule should not fire for this user",
  "strict": "rule_not_triggered",
  "ruleCode": "premium-retention-offer"
}

`integration_called`

Passes if a call_integration decision with the matching integration.code and integration.status_code === 200 exists in trace_info.decisions across any turn.

| Extra field | Required | Description | |-------------|----------|-------------| | integrationCode | yes | The integration code to check for (from decisions[].integration.code) |

{
  "criterion": "Employee information integration should be called successfully",
  "strict": "integration_called",
  "integrationCode": "employee-information",
  "category": "accuracy"
}

Scoring

Each evaluation produces a pass/fail result
The overall test score is the weighted pass rate: sum(passed weights) / sum(all weights) * 100
Tests that end with an error response (escalated_due_to_error) are excluded from the aggregate suite score
The suite aggregate score is the average of all non-error test scores

How It Works

Create task — POST /tasks with a random user_id and agent_id returns a task_id
Run turns — For each turn, POST /message with task_id, text, and agent_variables (the full userInfo object)
- Fixed turns: sends the exact message defined in the test
- Auto turns: LLM generates a realistic user reply based on the user profile, guidelines, and conversation history
Escalation detection — If the agent response includes CONVERSATION_FORWARDING in executed_workflows, the conversation stops early
Error detection — If trace_info.response.type === "error", the conversation stops early
Judge — Each evaluation criterion is scored:
- Strict evaluations: checked programmatically against trace data (deterministic, no LLM call)
- LLM evaluations: judge LLM reads the full conversation and scores pass/fail with reasoning
Scoring — Weighted pass rate per test, aggregated across the suite

Output

Results are written to results/<suite-name>/<timestamp>/:

| File | Contents | |------|----------| | _results.json | Full raw results with conversations, evaluations, and trace data | | _summary.md | Aggregate scores, pass rates, category breakdown, AUI logic breakdown | | <test-name>.md | Individual test report with full conversation and evaluation details |

Project Structure

src/
  cli.ts            — CLI entry point, argument parsing, file I/O
  types.ts          — TypeScript types (TestSuite, UserInfo, AgentResponse, results)
  schema.ts         — JSON Schema (draft-07) for test suite validation
  validate.ts       — Suite validator (required fields, structure checks)
  runner.ts         — Core runner engine + AgentAdapter interface
  adapter.ts        — AUI API adapter (tasks/message endpoints) + OpenAI/Anthropic LLM calls
  claude-adapter.ts — Claude Code adapter for running evals inside Claude Code
  formatter.ts      — Markdown report generator (per-test + suite summary)
  index.ts          — Public exports for programmatic usage

Programmatic Usage

import { createAuiAdapter, runSuite, formatSuiteResult } from '@aui.io/evals';
import { readFileSync } from 'fs';

const adapter = createAuiAdapter(
  '<NETWORK_API_KEY>',
  '<LLM_API_KEY>',
  'openai'  // or 'anthropic'
);

const suite = JSON.parse(readFileSync('my-suite.json', 'utf-8'));
const result = await runSuite(adapter, suite);
console.log(formatSuiteResult(result));

Example Test Suites

This package includes real-world example test suites in two directories:

Demo Agents/ - Generic demo agent evaluations for various domains:
- Automotive (used car sales)
- Airbnb (customer support)
- Retail (e-commerce)
- IT Support
- HR Assistant
- Credit Card Dispute
Quack Agents/ - Production test suites for specific clients:
- Artlist (payment issues, refunds, subscriptions)
- Notable (account access, compass tests)
- Rentman (account management)
- Yotpo UGC (policy suites)

These examples demonstrate best practices for structuring test suites, writing evaluations, and using both LLM-based and strict (programmatic) evaluation criteria.

Accessing Bundled Test Suites Programmatically

The package exports helper functions to access the bundled test suites:

import { getDemoAgentsPath, getQuackAgentsPath, getPackageRoot } from '@aui.io/evals';
import { join } from 'path';
import { readFileSync } from 'fs';

// Get paths to bundled test suites
const demoPath = getDemoAgentsPath();
const quackPath = getQuackAgentsPath();

// Load a specific demo suite
const hrSuitePath = join(demoPath, 'demo-agents', 'hr.json');
const hrSuite = JSON.parse(readFileSync(hrSuitePath, 'utf-8'));

// Load a Quack Agents suite
const artlistPath = join(quackPath, 'artlist', 'payment_issues_tda.json');
const artlistSuite = JSON.parse(readFileSync(artlistPath, 'utf-8'));

// Or get the package root and navigate from there
const packageRoot = getPackageRoot();
const customPath = join(packageRoot, 'Demo Agents', 'eval_retail-agent_140426.json');

Via CLI:

# Reference bundled suites directly
npx aui-evals run "node_modules/@aui.io/evals/Demo Agents/demo-agents/hr.json" \
  --llm-key=$OPENAI_KEY

# Or copy them to your project first
cp -r node_modules/@aui.io/evals/Demo\ Agents ./test-suites
npx aui-evals run "./test-suites/demo-agents/hr.json" --llm-key=$OPENAI_KEY

Custom Adapter

Implement AgentAdapter to use a different agent backend:

interface AgentAdapter {
  spawnSession(agentId: string, apiKey: string, label: string, model?: string): Promise<string>;
  sendMessage(sessionKey: string, message: string, userInfo?: UserInfo, timeout?: number): Promise<AgentResponse>;
  callJudge(prompt: string, model?: string): Promise<string>;
  generateUserMessage(prompt: string, model?: string): Promise<string>;
}

Example Test Suite

{
  "name": "Customer Support Evaluations",
  "agentId": "your-agent-id",
  "apiKey": "your-api-key",
  "tests": [
    {
      "name": "Basic support request — should resolve without escalation",
      "userInfo": {
        "name": "Jane Doe",
        "email": "[email protected]",
        "subscription_status": "active",
        "plan_name": "Pro Monthly",
        "last_amount_charged": "29.99"
      },
      "guidelines": "You are a user with a billing question. Be polite and cooperative.",
      "turns": [
        { "user": "Hi, I have a question about my last charge." },
        { "user": "auto" }
      ],
      "evaluations": [
        {
          "criterion": "Agent should not escalate a simple billing question",
          "strict": "not_escalated",
          "category": "accuracy",
          "weight": 2
        },
        {
          "criterion": "Billing inquiry workflow should activate",
          "strict": "tool_used",
          "toolName": "BILLING_INQUIRY",
          "category": "accuracy"
        },
        {
          "criterion": "Eligibility check rule should fire",
          "strict": "rule_triggered",
          "ruleCode": "check-billing-eligibility",
          "category": "compliance",
          "aui_logic": ["R1_billing_check"]
        },
        {
          "criterion": "Agent should extract the inquiry type",
          "strict": "param_exists",
          "paramName": "inquiry-type",
          "category": "accuracy"
        },
        {
          "criterion": "Agent maintained a friendly and professional tone",
          "category": "tone",
          "weight": 1
        },
        {
          "criterion": "Agent should not produce an error response",
          "strict": "not_escalated_due_to_error",
          "category": "safety"
        }
      ],
      "tags": ["billing", "no-escalation"]
    }
  ]
}

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme