# @wix/eval-assertions

v0.15.0
Assertion framework for evaluating AI agent outputs. Supports skill invocation checks, build validation, and LLM-based judging. Used by the evaluator to validate scenario results.
## Installation

```sh
npm install @wix/eval-assertions
# or
yarn add @wix/eval-assertions
```

## Features
- Skill Was Called: Verify that specific skills were invoked during agent execution
- Build Passed: Run build commands and verify exit codes
- LLM Judge: Use an LLM to evaluate agent outputs with customizable prompts and scoring
## Quick Start

```ts
import {
  evaluateAssertions,
  AssertionResultStatus,
  type Assertion,
  type AssertionContext,
  type EvaluationInput
} from '@wix/eval-assertions';

// Define your assertions
const assertions: Assertion[] = [
  {
    type: 'skill_was_called',
    skillName: 'my-skill'
  },
  {
    type: 'build_passed',
    command: 'npm test',
    expectedExitCode: 0
  },
  {
    type: 'llm_judge',
    prompt: 'Evaluate if the output correctly implements the requested feature:\n\n{{output}}',
    minScore: 70
  }
];

// Prepare your evaluation input
const input: EvaluationInput = {
  outputText: 'Agent output here...',
  llmTrace: {
    id: 'trace-1',
    steps: [...],
    summary: {...}
  },
  fileDiffs: [...]
};

// Set up context for assertions that need it
const context: AssertionContext = {
  workDir: '/path/to/working/directory',
  llmConfig: {
    baseUrl: 'https://api.anthropic.com',
    headers: { 'x-api-key': 'your-key' }
  }
};

// Run assertions
const results = await evaluateAssertions(input, assertions, context);

// Check results
for (const result of results) {
  console.log(`${result.assertionName}: ${result.status}`);
  if (result.status === AssertionResultStatus.FAILED) {
    console.log(`  Message: ${result.message}`);
  }
}
```

## Assertion Types
### skill_was_called

Checks if a specific skill was invoked by examining the LLM trace.
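Conceptually, the check scans the trace's steps for an invocation of the named skill. A minimal sketch, assuming a simplified step shape (the real `LLMTrace` step structure may differ):

```ts
// Hypothetical, simplified trace step shape for illustration only.
interface TraceStep {
  type: 'tool_call' | 'completion';
  skillName?: string;
}

// Returns true if any step invoked the given skill.
function skillWasCalled(steps: TraceStep[], skillName: string): boolean {
  return steps.some(
    (step) => step.type === 'tool_call' && step.skillName === skillName
  );
}

const steps: TraceStep[] = [
  { type: 'tool_call', skillName: 'commit' },
  { type: 'completion' }
];

console.log(skillWasCalled(steps, 'commit')); // true
console.log(skillWasCalled(steps, 'deploy')); // false
```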
```ts
{
  type: 'skill_was_called',
  skillName: 'commit' // Name of the skill that must have been called
}
```

### build_passed
Runs a command in the working directory and checks the exit code. When the command fails, the result's `details` include stdout and stderr so you can see why the build failed.
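The check boils down to running the command and comparing its exit code; a rough sketch using Node's `child_process` (the package's actual implementation is internal, and `checkBuild` is a hypothetical name):

```ts
import { spawnSync } from 'node:child_process';

// Run a shell command in workDir and compare its exit code.
// stdout/stderr are captured so a failure can be explained.
function checkBuild(command: string, workDir: string, expectedExitCode = 0) {
  const result = spawnSync(command, { shell: true, cwd: workDir, encoding: 'utf8' });
  const exitCode = result.status ?? -1;
  return {
    passed: exitCode === expectedExitCode,
    exitCode,
    stdout: result.stdout,
    stderr: result.stderr
  };
}

const outcome = checkBuild('node -e "process.exit(0)"', process.cwd());
console.log(outcome.passed); // true
```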
```ts
{
  type: 'build_passed',
  command: 'yarn build', // Command to run (default: 'yarn build')
  expectedExitCode: 0    // Expected exit code (default: 0)
}
```

### llm_judge
Uses an LLM to evaluate the output with a customizable prompt. The default system prompt instructs the judge to be strict about factual verification: when you ask it to verify a specific fact, the judge must compare against the actual data and score 0 or near 0 on a mismatch. If the judge returns invalid JSON, the evaluator retries up to 3 times before failing.
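The parse-and-retry behavior can be pictured roughly like this. This is a sketch only: the actual retry logic is internal to the package, and the `{ score }` JSON response shape is an assumption.

```ts
// Hypothetical judge response shape: JSON carrying a 0-100 score.
interface JudgeVerdict {
  score: number;
}

// Ask the judge up to maxAttempts times until it returns valid JSON,
// then compare the score against the assertion's minScore.
async function judgeWithRetry(
  callJudge: () => Promise<string>,
  minScore: number,
  maxAttempts = 3
): Promise<{ passed: boolean; score: number }> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await callJudge();
    try {
      const verdict = JSON.parse(raw) as JudgeVerdict;
      return { passed: verdict.score >= minScore, score: verdict.score };
    } catch (err) {
      lastError = err; // invalid JSON: try again
    }
  }
  throw new Error(`Judge returned invalid JSON after ${maxAttempts} attempts: ${lastError}`);
}

// Example: the first response is malformed, the second is valid.
const responses = ['not json', '{"score": 85}'];
judgeWithRetry(() => Promise.resolve(responses.shift() ?? ''), 70)
  .then((r) => console.log(r.passed, r.score)); // true 85
```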
```ts
{
  type: 'llm_judge',
  prompt: 'Evaluate the quality of this code:\n\n{{output}}',
  systemPrompt: 'You are a code reviewer...', // Optional custom system prompt
  minScore: 70,                       // Minimum passing score (0-100, default: 70)
  model: 'claude-3-5-haiku-20241022', // Model to use
  maxTokens: 1024,                    // Max output tokens
  temperature: 0                      // Temperature (0-1)
}
```

Tip: When verifying file-related outcomes, include {{changedFiles}} in your prompt so the judge sees the actual files. Without it, the judge still receives this data in the system context, but making it explicit improves accuracy.
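For example, a judge assertion that verifies file changes might spell the files out explicitly (the prompt wording and values here are illustrative):

```ts
// Illustrative llm_judge assertion that surfaces changed files to the judge.
const fileJudge = {
  type: 'llm_judge',
  prompt: [
    'Verify that the requested page was added.',
    'Changed files:',
    '{{changedFiles}}'
  ].join('\n'),
  minScore: 70
};

console.log(fileJudge.prompt.includes('{{changedFiles}}')); // true
```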
Available placeholders in prompts:

- `{{output}}` - The agent's final output text
- `{{cwd}}` - Working directory path
- `{{changedFiles}}` - List of files that were modified
- `{{trace}}` - Formatted LLM trace showing tool calls and completions
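Placeholder substitution presumably amounts to simple token replacement before the prompt is sent; a minimal sketch (the `renderPrompt` helper is hypothetical, not part of the package's public API):

```ts
// Replace {{name}} tokens with values; unknown tokens are left intact.
function renderPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) => values[name] ?? match);
}

const rendered = renderPrompt(
  'Evaluate:\n\n{{output}}\n\nFiles: {{changedFiles}}',
  { output: 'Added the About page', changedFiles: 'src/pages/about.tsx' }
);

console.log(rendered.includes('Added the About page')); // true
```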
## Types

### EvaluationInput

The input data for assertion evaluation:
```ts
interface EvaluationInput {
  outputText?: string;
  llmTrace?: LLMTrace;
  fileDiffs?: Array<{ path: string; content?: string; status?: 'new' | 'modified' }>;
}
```

When fileDiffs items include `status`, the {{modifiedFiles}} and {{newFiles}} placeholders are populated for the LLM judge.
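The split by `status` can be sketched as a simple grouping (illustrative; the actual grouping logic is internal to the package):

```ts
interface FileDiff {
  path: string;
  content?: string;
  status?: 'new' | 'modified';
}

// Group diffs by status so {{newFiles}} / {{modifiedFiles}} can be filled in.
function splitDiffs(fileDiffs: FileDiff[]) {
  return {
    newFiles: fileDiffs.filter((d) => d.status === 'new').map((d) => d.path),
    modifiedFiles: fileDiffs.filter((d) => d.status === 'modified').map((d) => d.path)
  };
}

const { newFiles, modifiedFiles } = splitDiffs([
  { path: 'src/new-page.tsx', status: 'new' },
  { path: 'src/app.tsx', status: 'modified' },
  { path: 'src/unknown.ts' } // no status: appears in neither list
]);

console.log(newFiles);      // ['src/new-page.tsx']
console.log(modifiedFiles); // ['src/app.tsx']
```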
### AssertionContext

Optional context for assertions:
```ts
interface AssertionContext {
  workDir?: string; // For build_passed
  llmConfig?: {     // For llm_judge
    baseUrl: string;
    headers: Record<string, string>;
  };
  generateTextForLlmJudge?: (options) => Promise<{ text: string }>; // For testing
}
```

### AssertionResult
The result of evaluating an assertion:
```ts
interface AssertionResult {
  id: string;
  assertionId: string;
  assertionType: string;
  assertionName: string;
  status: AssertionResultStatus; // 'passed' | 'failed' | 'skipped' | 'error'
  message?: string;
  expected?: string;
  actual?: string;
  duration?: number;
  details?: Record<string, unknown>;
}
```

## Creating Custom Evaluators
You can extend the framework with custom assertion types:
```ts
import {
  AssertionEvaluator,
  registerEvaluator,
  type AssertionResult,
  AssertionResultStatus
} from '@wix/eval-assertions';
import { z } from 'zod';

// Define your assertion schema
export const MyAssertionSchema = z.object({
  type: z.literal('my_assertion'),
  customField: z.string()
});

export type MyAssertion = z.infer<typeof MyAssertionSchema>;

// Implement the evaluator
export class MyAssertionEvaluator extends AssertionEvaluator<MyAssertion> {
  readonly type = 'my_assertion' as const;

  evaluate(assertion, input, context): AssertionResult {
    // Your evaluation logic here
    return {
      id: crypto.randomUUID(),
      assertionId: crypto.randomUUID(),
      assertionType: 'my_assertion',
      assertionName: 'My Custom Assertion',
      status: AssertionResultStatus.PASSED,
      message: 'Assertion passed!'
    };
  }
}

// Register your evaluator
registerEvaluator('my_assertion', new MyAssertionEvaluator());
```

## License
MIT
