@aui.io/evals
v0.1.1
Multi-turn conversation agent evaluator with LLM-as-judge scoring.
AUI Agent Evaluator
Multi-turn conversation evaluation framework for AUI agents. Runs simulated conversations against any AUI agent, then scores them with both LLM-based and programmatic (strict) evaluations.
Install
npm install @aui.io/evals
Quick Start
# Validate a test suite
npx aui-evals validate my-suite.json
# Preview what will run
npx aui-evals plan my-suite.json
# Run with OpenAI judge
npx aui-evals run my-suite.json --llm-key=<YOUR_OPENAI_KEY>
# Run with Anthropic judge
npx aui-evals run my-suite.json --llm-key=<YOUR_ANTHROPIC_KEY> --provider=anthropic
You can also set LLM_API_KEY as an environment variable instead of passing --llm-key.
CLI Commands
List Bundled Test Suites
# See all available bundled test suites
npx aui-evals list-suites
This lists all bundled Demo Agents and Quack Agents test suites by name.
Run Bundled Test Suites
You can run bundled suites by name without specifying full paths:
# Run Demo Agents suites
npx aui-evals run demo:hr --llm-key=$OPENAI_KEY
npx aui-evals run demo:retail --llm-key=$OPENAI_KEY
npx aui-evals run demo:automotive-agent-140426 --llm-key=$KEY
# Run Quack Agents suites
npx aui-evals run quack:artlist/payment_issues_tda --llm-key=$KEY
npx aui-evals run quack:notable/compass_tests --llm-key=$KEY
Run Local Test Suites
# Validate a test suite
npx aui-evals validate suite.json
# Show execution plan
npx aui-evals plan suite.json
# Run evaluations
npx aui-evals run suite.json --llm-key=$OPENAI_KEY
# Run with specific options
npx aui-evals run suite.json \
--llm-key=$OPENAI_KEY \
--provider=openai \
--parallel=10 \
--loops=3 \
--test="refund,cancel" \
--tag="critical"
CLI Flags
| Flag | Description |
|------|-------------|
| --llm-key=<KEY> | API key for the judge/user-sim LLM (or set LLM_API_KEY env var) |
| --provider=openai\|anthropic | LLM provider for judging (default: openai) |
| --parallel=<N> | Number of tests to run in parallel (default: 1) |
| --loops=<N> | Number of times to repeat each test (default: 1) |
| --test=<name> | Filter tests by name (substring match, comma-separated) |
| --tag=<tag> | Filter tests by tag (comma-separated) |
| --debug | Include raw API responses in markdown output |
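The --test and --tag filters in the table above can be pictured with the following sketch. It is an illustration only, not the package's code: the `NamedTest` shape and the helper names are hypothetical, and it assumes case-sensitive matching, "any of" semantics for comma-separated values, and exact matching for tags.

```typescript
// Hypothetical minimal shape: the real Test type carries more fields.
interface NamedTest {
  name: string;
  tags?: string[];
}

// Split a comma-separated flag value into trimmed, non-empty entries.
function splitFlag(value: string): string[] {
  return value
    .split(",")
    .map((part) => part.trim())
    .filter((part) => part.length > 0);
}

// --test: keep tests whose name contains any of the given substrings.
function filterByName(tests: NamedTest[], flag: string): NamedTest[] {
  const needles = splitFlag(flag);
  if (needles.length === 0) return tests;
  return tests.filter((t) => needles.some((n) => t.name.indexOf(n) !== -1));
}

// --tag: keep tests that carry any of the given tags (exact match assumed).
function filterByTag(tests: NamedTest[], flag: string): NamedTest[] {
  const wanted = splitFlag(flag);
  if (wanted.length === 0) return tests;
  return tests.filter((t) =>
    (t.tags ?? []).some((tag) => wanted.indexOf(tag) !== -1)
  );
}
```

Under these assumptions, `--test="refund,cancel"` keeps any test whose name contains either "refund" or "cancel".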
Test Suite Structure
A test suite is a JSON file with the following top-level fields:
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string | yes | Name of the evaluation suite |
| agentId | string | yes | Agent ID to test |
| apiKey | string | yes | Network API key for authenticating with the agent |
| model | string | no | Model override for the agent under test |
| judgeModel | string | no | Model for the judge LLM (defaults to suite runner's model) |
| tests | Test[] | yes | Array of test cases (min 1) |
Test
Each test defines a simulated user scenario and the criteria to evaluate.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| name | string | yes | Unique test name |
| userInfo | object | no | Free-form key-value pairs passed as agent_variables to the agent API (user profile, subscription, billing, etc.) |
| guidelines | string | yes | Behavioral guidelines — how the simulated user should act (personality, goals, tone) |
| systemPrompt | string | no | System prompt override for the agent |
| turns | Turn[] | yes | Conversation turns to execute (min 1). Padded to 7 turns with "auto" if fewer are defined |
| evaluations | Evaluation[] | yes | Criteria to evaluate after the conversation (min 1) |
| tags | string[] | no | Tags for filtering/grouping test results |
Turn
Each turn represents one user message in the conversation.
| Field | Type | Required | Description |
|-------|------|----------|-------------|
| user | string | yes | The user message to send. Use "auto" to let the LLM generate a realistic reply based on userInfo, guidelines, and conversation history |
| timeout | number | no | Timeout in seconds for the agent response (default: 120) |
| expectation | string | no | Expected behavior hint — not sent to the agent, used by the LLM judge for context when evaluating |
Auto-generated turns
When user is "auto", the framework uses the judge LLM to generate a realistic user message based on the persona and conversation history. The simulated user will output ###STOP### when the conversation reaches a terminal state, ending the test early.
If fewer than 7 turns are defined, the remaining turns are automatically filled with "auto" to allow the conversation to play out naturally.
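The padding rule and the stop token described above can be sketched as follows. The names `SketchTurn`, `padTurns`, and `isStopSignal` are hypothetical, and whether the framework matches ###STOP### by substring or by exact equality is an assumption here.

```typescript
// Hypothetical minimal turn shape, mirroring the Turn table above.
interface SketchTurn {
  user: string;
  timeout?: number;
  expectation?: string;
}

const MIN_TURNS = 7;
const STOP_TOKEN = "###STOP###";

// Append "auto" turns until the conversation has at least minTurns turns.
function padTurns(turns: SketchTurn[], minTurns: number = MIN_TURNS): SketchTurn[] {
  const padded = turns.slice();
  while (padded.length < minTurns) {
    padded.push({ user: "auto" });
  }
  return padded;
}

// Assumed check: the simulated user's message contains the stop token.
function isStopSignal(message: string): boolean {
  return message.indexOf(STOP_TOKEN) !== -1;
}
```

A suite that defines two turns would therefore run with those two followed by five "auto" turns, unless the simulated user ends the conversation earlier.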
Evaluations
Evaluations are criteria applied after the conversation completes. There are two types:
1. LLM Judge Evaluation (default)
When no strict field is set, the criterion is evaluated by an LLM judge that reads the full conversation and determines pass/fail with reasoning.
{
"criterion": "The agent maintained a friendly and professional tone throughout",
"weight": 1,
"category": "tone"
}
2. Strict Evaluation (programmatic)
When strict is set, the evaluation bypasses the LLM judge and is checked programmatically against the conversation trace data. This is deterministic and faster.
Common fields (all evaluation types)
| Field | Type | Required | Default | Description |
|-------|------|----------|---------|-------------|
| criterion | string | yes | — | Natural language description of what to evaluate |
| weight | number | no | 1 | Importance weight. Higher = more impact on the overall score |
| category | string | no | — | Category for grouping results (e.g. "accuracy", "tone", "safety", "compliance") |
| aui_logic | string[] | no | — | AUI logic rule IDs that govern this evaluation. Used in the summary report's "By AUI Logic" table |
Strict Evaluation Types
escalated
Passes if CONVERSATION_FORWARDING was triggered (agent escalated to a human).
{
"criterion": "Agent should escalate this case to a human",
"strict": "escalated"
}
not_escalated
Passes if CONVERSATION_FORWARDING was NOT triggered.
{
"criterion": "Agent should handle this without escalation",
"strict": "not_escalated"
}
tool_used
Passes if the specified tool/workflow was executed during the conversation.
| Extra field | Required | Description |
|-------------|----------|-------------|
| toolName | yes | The workflow name to check (from executed_workflows) |
{
"criterion": "Refund request workflow should be activated",
"strict": "tool_used",
"toolName": "REFUND_REQUEST"
}
tool_not_used
Passes if the specified tool/workflow was NOT executed. Optionally, can check that it was not used before a specific turn.
| Extra field | Required | Description |
|-------------|----------|-------------|
| toolName | yes | The workflow name to check |
| beforeTurn | no | 1-based turn number. The check fails only if the tool is used before this turn; use at this turn or later still passes |
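The beforeTurn semantics can be sketched as below. This is an illustration under assumptions, not the package's implementation: it supposes the runner keeps one executed_workflows list per turn (`turnWorkflows`, a hypothetical name), with index 0 holding turn 1.

```typescript
// turnWorkflows[i] holds the workflows executed during turn i + 1 (assumed layout).
function toolNotUsed(
  turnWorkflows: string[][],
  toolName: string,
  beforeTurn?: number
): boolean {
  // Without beforeTurn every turn is inspected; with it, only turns 1..beforeTurn-1.
  const limit =
    beforeTurn === undefined ? turnWorkflows.length : Math.max(0, beforeTurn - 1);
  const inspected = turnWorkflows.slice(0, limit);
  return !inspected.some((workflows) => workflows.indexOf(toolName) !== -1);
}
```

With `beforeTurn: 3`, a DISCOUNT_OFFER fired on turn 3 passes, while the same workflow fired on turn 1 or 2 fails.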
{
"criterion": "Should not offer a discount before collecting the reason",
"strict": "tool_not_used",
"toolName": "DISCOUNT_OFFER",
"beforeTurn": 3
}
param_equals
Passes if an extracted parameter from trace_info.understanding.extracted_params matches the expected value.
| Extra field | Required | Description |
|-------------|----------|-------------|
| paramName | yes | Parameter name to check |
| paramValue | yes | Expected value |
{
"criterion": "Agent should classify the reason as too_expensive",
"strict": "param_equals",
"paramName": "refund-reason",
"paramValue": "too_expensive"
}
param_exists
Passes if the parameter was extracted (with a non-empty, non-"undefined" value).
| Extra field | Required | Description |
|-------------|----------|-------------|
| paramName | yes | Parameter name to check |
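A minimal sketch of the presence check, assuming extracted parameters arrive as a flat key-value object. The `ExtractedParams` type and `paramExists` helper are hypothetical; treating null as missing and trimming whitespace are assumptions that go beyond what the description states.

```typescript
// Assumed shape of trace_info.understanding.extracted_params.
type ExtractedParams = Record<string, unknown>;

function paramExists(params: ExtractedParams, paramName: string): boolean {
  const value = params[paramName];
  // Missing keys (and, by assumption, null) count as not extracted.
  if (value === undefined || value === null) return false;
  const text = String(value).trim();
  // Empty strings and the literal string "undefined" count as not extracted.
  return text !== "" && text !== "undefined";
}
```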
{
"criterion": "Agent should extract the reason parameter",
"strict": "param_exists",
"paramName": "refund-reason"
}
param_not_exists
Passes if the parameter was NOT extracted (missing, empty, or "undefined").
| Extra field | Required | Description |
|-------------|----------|-------------|
| paramName | yes | Parameter name to check |
{
"criterion": "Agent should not extract a discount amount",
"strict": "param_not_exists",
"paramName": "discount-amount"
}
escalated_due_to_error
Passes if the agent returned an error response (trace_info.response.type === "error").
{
"criterion": "Agent should return an error for this invalid request",
"strict": "escalated_due_to_error"
}
not_escalated_due_to_error
Passes if the agent did NOT return an error response.
{
"criterion": "Agent should handle this without errors",
"strict": "not_escalated_due_to_error"
}
rule_triggered
Passes if a specific rule (identified by its code) was triggered in trace_info.decisions. Rules are accumulated across all turns.
The rule code comes from trace_info.decisions[].rule.code in the agent's raw response.
| Extra field | Required | Description |
|-------------|----------|-------------|
| ruleCode | yes | The rule code to check for |
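Because rules accumulate across turns, the check amounts to searching every turn's decisions for the code. The sketch below assumes each turn's raw response exposes trace_info.decisions[].rule.code as described above; the `RuleTurn` type and `ruleTriggered` helper are hypothetical names.

```typescript
// Assumed per-turn response shape, reduced to the fields this check reads.
interface RuleDecision {
  rule?: { code?: string };
}
interface RuleTurn {
  trace_info?: { decisions?: RuleDecision[] };
}

// True if any decision in any turn carries the given rule code.
function ruleTriggered(turns: RuleTurn[], ruleCode: string): boolean {
  return turns.some((turn) =>
    (turn.trace_info?.decisions ?? []).some((d) => d.rule?.code === ruleCode)
  );
}
```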
{
"criterion": "Eligible escalation rule should fire",
"strict": "rule_triggered",
"ruleCode": "low-tier-eligible-escalate"
}
rule_not_triggered
Passes if a specific rule was NOT triggered in any turn.
| Extra field | Required | Description |
|-------------|----------|-------------|
| ruleCode | yes | The rule code that should not appear |
{
"criterion": "Premium retention rule should not fire for this user",
"strict": "rule_not_triggered",
"ruleCode": "premium-retention-offer"
}
integration_called
Passes if a call_integration decision with the matching integration.code and integration.status_code === 200 exists in trace_info.decisions across any turn.
| Extra field | Required | Description |
|-------------|----------|-------------|
| integrationCode | yes | The integration code to check for (from decisions[].integration.code) |
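The success condition (matching code and status 200) can be sketched as follows. How a decision is marked as call_integration is not stated above, so the `type` discriminator field is an assumption, as are the `IntegrationTurn` type and `integrationCalled` helper names.

```typescript
// Assumed decision shape; `type` as the discriminator is a guess.
interface IntegrationDecision {
  type?: string;
  integration?: { code?: string; status_code?: number };
}
interface IntegrationTurn {
  trace_info?: { decisions?: IntegrationDecision[] };
}

// True if any turn contains a successful (HTTP 200) call to the integration.
function integrationCalled(turns: IntegrationTurn[], integrationCode: string): boolean {
  return turns.some((turn) =>
    (turn.trace_info?.decisions ?? []).some(
      (d) =>
        d.type === "call_integration" &&
        d.integration?.code === integrationCode &&
        d.integration?.status_code === 200
    )
  );
}
```

Under this reading, a call that returned a non-200 status does not satisfy the check even though the integration was invoked.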
{
"criterion": "Employee information integration should be called successfully",
"strict": "integration_called",
"integrationCode": "employee-information",
"category": "accuracy"
}
Scoring
- Each evaluation produces a pass/fail result
- The overall test score is the weighted pass rate: sum(passed weights) / sum(all weights) * 100
- Tests that end with an error response (escalated_due_to_error) are excluded from the aggregate suite score
- The suite aggregate score is the average of all non-error test scores
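The scoring rules above can be worked through with a short sketch. The type and function names are hypothetical, and returning 0 for an empty evaluation list and null for a suite with no scorable tests are assumptions made to keep the arithmetic defined.

```typescript
interface EvalOutcome {
  passed: boolean;
  weight?: number; // defaults to 1, as in the common fields table
}

// Weighted pass rate for one test, on a 0-100 scale.
function testScore(outcomes: EvalOutcome[]): number {
  const total = outcomes.reduce((sum, o) => sum + (o.weight ?? 1), 0);
  if (total === 0) return 0;
  const passed = outcomes.reduce(
    (sum, o) => sum + (o.passed ? o.weight ?? 1 : 0),
    0
  );
  return (passed / total) * 100;
}

interface TestOutcome {
  score: number;
  endedWithError: boolean;
}

// Average of test scores, skipping tests that ended with an error response.
function suiteScore(tests: TestOutcome[]): number | null {
  const scorable = tests.filter((t) => !t.endedWithError);
  if (scorable.length === 0) return null;
  return scorable.reduce((sum, t) => sum + t.score, 0) / scorable.length;
}
```

For example, a test with a passing weight-2 criterion, a failing weight-1 criterion, and a passing weight-1 criterion scores 3 / 4 * 100 = 75.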
How It Works
- Create task: POST /tasks with a random user_id and agent_id returns a task_id
- Run turns: for each turn, POST /message with task_id, text, and agent_variables (the full userInfo object)
  - Fixed turns: sends the exact message defined in the test
  - Auto turns: LLM generates a realistic user reply based on the user profile, guidelines, and conversation history
- Escalation detection: if the agent response includes CONVERSATION_FORWARDING in executed_workflows, the conversation stops early
- Error detection: if trace_info.response.type === "error", the conversation stops early
- Judge: each evaluation criterion is scored:
  - Strict evaluations: checked programmatically against trace data (deterministic, no LLM call)
  - LLM evaluations: judge LLM reads the full conversation and scores pass/fail with reasoning
- Scoring: weighted pass rate per test, aggregated across the suite
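The two early-stop conditions in the steps above can be sketched as a check applied to each agent reply. The `AgentReply` shape is reduced to the two fields the text mentions, the helper names are hypothetical, and the order in which the conditions are tested is an assumption.

```typescript
// Assumed reply shape, limited to the fields the stop checks read.
interface AgentReply {
  executed_workflows?: string[];
  trace_info?: { response?: { type?: string } };
}

type StopReason = "escalated" | "error";

// Decide whether a reply ends the conversation early, and why.
function stopReason(reply: AgentReply): StopReason | null {
  const workflows = reply.executed_workflows ?? [];
  if (workflows.indexOf("CONVERSATION_FORWARDING") !== -1) return "escalated";
  if (reply.trace_info?.response?.type === "error") return "error";
  return null;
}

// Walk replies in order and report how many turns ran before stopping.
function turnsUntilStop(replies: AgentReply[]): {
  turnsRun: number;
  stoppedBy: StopReason | null;
} {
  let turnsRun = 0;
  for (const reply of replies) {
    turnsRun += 1;
    const reason = stopReason(reply);
    if (reason !== null) return { turnsRun, stoppedBy: reason };
  }
  return { turnsRun, stoppedBy: null };
}
```

In a real run the replies arrive one at a time from POST /message; the array form here only makes the stopping rule easy to test in isolation.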
Output
Results are written to results/<suite-name>/<timestamp>/:
| File | Contents |
|------|----------|
| _results.json | Full raw results with conversations, evaluations, and trace data |
| _summary.md | Aggregate scores, pass rates, category breakdown, AUI logic breakdown |
| <test-name>.md | Individual test report with full conversation and evaluation details |
Project Structure
src/
cli.ts — CLI entry point, argument parsing, file I/O
types.ts — TypeScript types (TestSuite, UserInfo, AgentResponse, results)
schema.ts — JSON Schema (draft-07) for test suite validation
validate.ts — Suite validator (required fields, structure checks)
runner.ts — Core runner engine + AgentAdapter interface
adapter.ts — AUI API adapter (tasks/message endpoints) + OpenAI/Anthropic LLM calls
claude-adapter.ts — Claude Code adapter for running evals inside Claude Code
formatter.ts — Markdown report generator (per-test + suite summary)
index.ts — Public exports for programmatic usage
Programmatic Usage
import { createAuiAdapter, runSuite, formatSuiteResult } from '@aui.io/evals';
import { readFileSync } from 'fs';
const adapter = createAuiAdapter(
'<NETWORK_API_KEY>',
'<LLM_API_KEY>',
'openai' // or 'anthropic'
);
const suite = JSON.parse(readFileSync('my-suite.json', 'utf-8'));
const result = await runSuite(adapter, suite);
console.log(formatSuiteResult(result));
Example Test Suites
This package includes real-world example test suites in two directories:
- Demo Agents/ - Generic demo agent evaluations for various domains:
  - Automotive (used car sales)
  - Airbnb (customer support)
  - Retail (e-commerce)
  - IT Support
  - HR Assistant
  - Credit Card Dispute
- Quack Agents/ - Production test suites for specific clients:
  - Artlist (payment issues, refunds, subscriptions)
  - Notable (account access, compass tests)
  - Rentman (account management)
  - Yotpo UGC (policy suites)
These examples demonstrate best practices for structuring test suites, writing evaluations, and using both LLM-based and strict (programmatic) evaluation criteria.
Accessing Bundled Test Suites Programmatically
The package exports helper functions to access the bundled test suites:
import { getDemoAgentsPath, getQuackAgentsPath, getPackageRoot } from '@aui.io/evals';
import { join } from 'path';
import { readFileSync } from 'fs';
// Get paths to bundled test suites
const demoPath = getDemoAgentsPath();
const quackPath = getQuackAgentsPath();
// Load a specific demo suite
const hrSuitePath = join(demoPath, 'demo-agents', 'hr.json');
const hrSuite = JSON.parse(readFileSync(hrSuitePath, 'utf-8'));
// Load a Quack Agents suite
const artlistPath = join(quackPath, 'artlist', 'payment_issues_tda.json');
const artlistSuite = JSON.parse(readFileSync(artlistPath, 'utf-8'));
// Or get the package root and navigate from there
const packageRoot = getPackageRoot();
const customPath = join(packageRoot, 'Demo Agents', 'eval_retail-agent_140426.json');
Via CLI:
# Reference bundled suites directly
npx aui-evals run "node_modules/@aui.io/evals/Demo Agents/demo-agents/hr.json" \
--llm-key=$OPENAI_KEY
# Or copy them to your project first
cp -r node_modules/@aui.io/evals/Demo\ Agents ./test-suites
npx aui-evals run "./test-suites/demo-agents/hr.json" --llm-key=$OPENAI_KEY
Custom Adapter
Implement AgentAdapter to use a different agent backend:
interface AgentAdapter {
spawnSession(agentId: string, apiKey: string, label: string, model?: string): Promise<string>;
sendMessage(sessionKey: string, message: string, userInfo?: UserInfo, timeout?: number): Promise<AgentResponse>;
callJudge(prompt: string, model?: string): Promise<string>;
generateUserMessage(prompt: string, model?: string): Promise<string>;
}
Example Test Suite
{
"name": "Customer Support Evaluations",
"agentId": "your-agent-id",
"apiKey": "your-api-key",
"tests": [
{
"name": "Basic support request — should resolve without escalation",
"userInfo": {
"name": "Jane Doe",
"email": "[email protected]",
"subscription_status": "active",
"plan_name": "Pro Monthly",
"last_amount_charged": "29.99"
},
"guidelines": "You are a user with a billing question. Be polite and cooperative.",
"turns": [
{ "user": "Hi, I have a question about my last charge." },
{ "user": "auto" }
],
"evaluations": [
{
"criterion": "Agent should not escalate a simple billing question",
"strict": "not_escalated",
"category": "accuracy",
"weight": 2
},
{
"criterion": "Billing inquiry workflow should activate",
"strict": "tool_used",
"toolName": "BILLING_INQUIRY",
"category": "accuracy"
},
{
"criterion": "Eligibility check rule should fire",
"strict": "rule_triggered",
"ruleCode": "check-billing-eligibility",
"category": "compliance",
"aui_logic": ["R1_billing_check"]
},
{
"criterion": "Agent should extract the inquiry type",
"strict": "param_exists",
"paramName": "inquiry-type",
"category": "accuracy"
},
{
"criterion": "Agent maintained a friendly and professional tone",
"category": "tone",
"weight": 1
},
{
"criterion": "Agent should not produce an error response",
"strict": "not_escalated_due_to_error",
"category": "safety"
}
],
"tags": ["billing", "no-escalation"]
}
]
}