# @wix/eval-assertions

v0.15.0
Assertion framework for evaluating AI agent outputs. Supports skill invocation checks, build validation, and LLM-based judging. Used by the evaluator to validate scenario results.
## Installation

```sh
npm install @wix/eval-assertions
# or
yarn add @wix/eval-assertions
```

## Features
- Skill Was Called: Verify that specific skills were invoked during agent execution
- Build Passed: Run build commands and verify exit codes
- LLM Judge: Use an LLM to evaluate agent outputs with customizable prompts and scoring
## Quick Start

```ts
import {
  evaluateAssertions,
  AssertionResultStatus,
  type Assertion,
  type AssertionContext,
  type EvaluationInput
} from '@wix/eval-assertions';

// Define your assertions
const assertions: Assertion[] = [
  {
    type: 'skill_was_called',
    skillName: 'my-skill'
  },
  {
    type: 'build_passed',
    command: 'npm test',
    expectedExitCode: 0
  },
  {
    type: 'llm_judge',
    prompt: 'Evaluate if the output correctly implements the requested feature:\n\n{{output}}',
    minScore: 70
  }
];

// Prepare your evaluation input
const input: EvaluationInput = {
  outputText: 'Agent output here...',
  llmTrace: {
    id: 'trace-1',
    steps: [...],
    summary: {...}
  },
  fileDiffs: [...]
};

// Set up context for assertions that need it
const context: AssertionContext = {
  workDir: '/path/to/working/directory',
  llmConfig: {
    baseUrl: 'https://api.anthropic.com',
    headers: { 'x-api-key': 'your-key' }
  }
};

// Run assertions
const results = await evaluateAssertions(input, assertions, context);

// Check results
for (const result of results) {
  console.log(`${result.assertionName}: ${result.status}`);
  if (result.status === AssertionResultStatus.FAILED) {
    console.log(`  Message: ${result.message}`);
  }
}
```

## Assertion Types
### skill_was_called

Checks if a specific skill was invoked by examining the LLM trace.
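Conceptually, the check scans the trace's steps for an invocation of the named skill. A minimal sketch, assuming a simplified step shape (the real `LLMTrace` step structure may differ):

```ts
// Hypothetical, simplified trace step shape for illustration only.
interface TraceStep {
  type: 'tool_call' | 'completion';
  skillName?: string;
}

// Returns true if any step invoked the given skill.
function skillWasCalled(steps: TraceStep[], skillName: string): boolean {
  return steps.some(
    (step) => step.type === 'tool_call' && step.skillName === skillName
  );
}

const steps: TraceStep[] = [
  { type: 'tool_call', skillName: 'commit' },
  { type: 'completion' }
];

console.log(skillWasCalled(steps, 'commit')); // true
console.log(skillWasCalled(steps, 'deploy')); // false
```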
```ts
{
  type: 'skill_was_called',
  skillName: 'commit' // Name of the skill that must have been called
}
```

### build_passed
Runs a command in the working directory and checks the exit code. When the command fails, the result's `details` include stdout and stderr so you can see why the build failed.
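The check boils down to running the command and comparing its exit code; a rough sketch using Node's `child_process` (the package's actual implementation is internal, and `checkBuild` is a hypothetical name):

```ts
import { spawnSync } from 'node:child_process';

// Run a shell command in workDir and compare its exit code.
// stdout/stderr are captured so a failure can be explained.
function checkBuild(command: string, workDir: string, expectedExitCode = 0) {
  const result = spawnSync(command, { shell: true, cwd: workDir, encoding: 'utf8' });
  const exitCode = result.status ?? -1;
  return {
    passed: exitCode === expectedExitCode,
    exitCode,
    stdout: result.stdout,
    stderr: result.stderr
  };
}

const outcome = checkBuild('node -e "process.exit(0)"', process.cwd());
console.log(outcome.passed); // true
```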
```ts
{
  type: 'build_passed',
  command: 'yarn build', // Command to run (default: 'yarn build')
  expectedExitCode: 0    // Expected exit code (default: 0)
}
```

### llm_judge
Uses an LLM to evaluate the output with a customizable prompt. The default system prompt instructs the judge to be strict about factual verification: when you ask it to verify a specific fact, the judge must compare against the actual data and score 0 or near 0 on a mismatch. If the judge returns invalid JSON, the evaluator retries up to 3 times before failing.
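The parse-and-retry behavior can be pictured roughly like this. This is a sketch only: the actual retry logic is internal to the package, and the `{ score }` JSON response shape is an assumption.

```ts
// Hypothetical judge response shape: JSON carrying a 0-100 score.
interface JudgeVerdict {
  score: number;
}

// Ask the judge up to maxAttempts times until it returns valid JSON,
// then compare the score against the assertion's minScore.
async function judgeWithRetry(
  callJudge: () => Promise<string>,
  minScore: number,
  maxAttempts = 3
): Promise<{ passed: boolean; score: number }> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const raw = await callJudge();
    try {
      const verdict = JSON.parse(raw) as JudgeVerdict;
      return { passed: verdict.score >= minScore, score: verdict.score };
    } catch (err) {
      lastError = err; // invalid JSON: try again
    }
  }
  throw new Error(`Judge returned invalid JSON after ${maxAttempts} attempts: ${lastError}`);
}

// Example: the first response is malformed, the second is valid.
const responses = ['not json', '{"score": 85}'];
judgeWithRetry(() => Promise.resolve(responses.shift() ?? ''), 70)
  .then((r) => console.log(r.passed, r.score)); // true 85
```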
```ts
{
  type: 'llm_judge',
  prompt: 'Evaluate the quality of this code:\n\n{{output}}',
  systemPrompt: 'You are a code reviewer...', // Optional custom system prompt
  minScore: 70,                       // Minimum passing score (0-100, default: 70)
  model: 'claude-3-5-haiku-20241022', // Model to use
  maxTokens: 1024,                    // Max output tokens
  temperature: 0                      // Temperature (0-1)
}
```

Tip: When verifying file-related outcomes, include {{changedFiles}} in your prompt so the judge sees the actual files. Without it, the judge still receives this data in the system context, but making it explicit improves accuracy.
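For example, a judge assertion that verifies file changes might spell the files out explicitly (the prompt wording and values here are illustrative):

```ts
// Illustrative llm_judge assertion that surfaces changed files to the judge.
const fileJudge = {
  type: 'llm_judge',
  prompt: [
    'Verify that the requested page was added.',
    'Changed files:',
    '{{changedFiles}}'
  ].join('\n'),
  minScore: 70
};

console.log(fileJudge.prompt.includes('{{changedFiles}}')); // true
```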
Available placeholders in prompts:

- `{{output}}` - The agent's final output text
- `{{cwd}}` - Working directory path
- `{{changedFiles}}` - List of files that were modified
- `{{trace}}` - Formatted LLM trace showing tool calls and completions
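Placeholder substitution presumably amounts to simple token replacement before the prompt is sent; a minimal sketch (the `renderPrompt` helper is hypothetical, not part of the package's public API):

```ts
// Replace {{name}} tokens with values; unknown tokens are left intact.
function renderPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) => values[name] ?? match);
}

const rendered = renderPrompt(
  'Evaluate:\n\n{{output}}\n\nFiles: {{changedFiles}}',
  { output: 'Added the About page', changedFiles: 'src/pages/about.tsx' }
);

console.log(rendered.includes('Added the About page')); // true
```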
## Types

### EvaluationInput

The input data for assertion evaluation:
```ts
interface EvaluationInput {
  outputText?: string;
  llmTrace?: LLMTrace;
  fileDiffs?: Array<{ path: string; content?: string; status?: 'new' | 'modified' }>;
}
```

When fileDiffs items include `status`, the {{modifiedFiles}} and {{newFiles}} placeholders are populated for the LLM judge.
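The split by `status` can be sketched as a simple grouping (illustrative; the actual grouping logic is internal to the package):

```ts
interface FileDiff {
  path: string;
  content?: string;
  status?: 'new' | 'modified';
}

// Group diffs by status so {{newFiles}} / {{modifiedFiles}} can be filled in.
function splitDiffs(fileDiffs: FileDiff[]) {
  return {
    newFiles: fileDiffs.filter((d) => d.status === 'new').map((d) => d.path),
    modifiedFiles: fileDiffs.filter((d) => d.status === 'modified').map((d) => d.path)
  };
}

const { newFiles, modifiedFiles } = splitDiffs([
  { path: 'src/new-page.tsx', status: 'new' },
  { path: 'src/app.tsx', status: 'modified' },
  { path: 'src/unknown.ts' } // no status: appears in neither list
]);

console.log(newFiles);      // ['src/new-page.tsx']
console.log(modifiedFiles); // ['src/app.tsx']
```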
### AssertionContext

Optional context for assertions:
```ts
interface AssertionContext {
  workDir?: string; // For build_passed
  llmConfig?: {     // For llm_judge
    baseUrl: string;
    headers: Record<string, string>;
  };
  generateTextForLlmJudge?: (options) => Promise<{ text: string }>; // For testing
}
```

### AssertionResult
The result of evaluating an assertion:
```ts
interface AssertionResult {
  id: string;
  assertionId: string;
  assertionType: string;
  assertionName: string;
  status: AssertionResultStatus; // 'passed' | 'failed' | 'skipped' | 'error'
  message?: string;
  expected?: string;
  actual?: string;
  duration?: number;
  details?: Record<string, unknown>;
}
```

## Creating Custom Evaluators
You can extend the framework with custom assertion types:
```ts
import {
  AssertionEvaluator,
  registerEvaluator,
  type AssertionResult,
  AssertionResultStatus
} from '@wix/eval-assertions';
import { z } from 'zod';

// Define your assertion schema
export const MyAssertionSchema = z.object({
  type: z.literal('my_assertion'),
  customField: z.string()
});

export type MyAssertion = z.infer<typeof MyAssertionSchema>;

// Implement the evaluator
export class MyAssertionEvaluator extends AssertionEvaluator<MyAssertion> {
  readonly type = 'my_assertion' as const;

  evaluate(assertion, input, context): AssertionResult {
    // Your evaluation logic here
    return {
      id: crypto.randomUUID(),
      assertionId: crypto.randomUUID(),
      assertionType: 'my_assertion',
      assertionName: 'My Custom Assertion',
      status: AssertionResultStatus.PASSED,
      message: 'Assertion passed!'
    };
  }
}

// Register your evaluator
registerEvaluator('my_assertion', new MyAssertionEvaluator());
```

## License
MIT
