
@wix/eval-assertions

v0.15.0


Assertion framework for AI agent evaluations - supports skill invocation checks, build validation, and LLM-based judging


@wix/eval-assertions

Assertion framework for evaluating AI agent outputs. Supports skill invocation checks, build validation, and LLM-based judging. Used by the evaluator to validate scenario results.

Installation

npm install @wix/eval-assertions
# or
yarn add @wix/eval-assertions

Features

  • Skill Was Called: Verify that specific skills were invoked during agent execution
  • Build Passed: Run build commands and verify exit codes
  • LLM Judge: Use an LLM to evaluate agent outputs with customizable prompts and scoring

Quick Start

import {
  evaluateAssertions,
  AssertionResultStatus,
  type Assertion,
  type AssertionContext,
  type EvaluationInput
} from '@wix/eval-assertions';

// Define your assertions
const assertions: Assertion[] = [
  {
    type: 'skill_was_called',
    skillName: 'my-skill'
  },
  {
    type: 'build_passed',
    command: 'npm test',
    expectedExitCode: 0
  },
  {
    type: 'llm_judge',
    prompt: 'Evaluate if the output correctly implements the requested feature:\n\n{{output}}',
    minScore: 70
  }
];

// Prepare your evaluation input
const input: EvaluationInput = {
  outputText: 'Agent output here...',
  llmTrace: {
    id: 'trace-1',
    steps: [...],
    summary: {...}
  },
  fileDiffs: [...]
};

// Set up context for assertions that need it
const context: AssertionContext = {
  workDir: '/path/to/working/directory',
  llmConfig: {
    baseUrl: 'https://api.anthropic.com',
    headers: { 'x-api-key': 'your-key' }
  }
};

// Run assertions
const results = await evaluateAssertions(input, assertions, context);

// Check results
for (const result of results) {
  console.log(`${result.assertionName}: ${result.status}`);
  if (result.status === AssertionResultStatus.FAILED) {
    console.log(`  Message: ${result.message}`);
  }
}

Assertion Types

skill_was_called

Checks if a specific skill was invoked by examining the LLM trace.

{
  type: 'skill_was_called',
  skillName: 'commit'  // Name of the skill that must have been called
}
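The trace scan behind this check can be pictured as follows. This is an illustrative sketch only: the `TraceStep` shape and the `skillWasCalled` helper are assumptions for demonstration, not the library's actual internal types.

```typescript
// Assumed step shape, for illustration only.
interface TraceStep {
  type: 'tool_call' | 'completion';
  toolName?: string;
}

function skillWasCalled(steps: TraceStep[], skillName: string): boolean {
  // A skill counts as called if any tool-call step references its name.
  return steps.some(s => s.type === 'tool_call' && s.toolName === skillName);
}

const steps: TraceStep[] = [
  { type: 'tool_call', toolName: 'commit' },
  { type: 'completion' }
];

console.log(skillWasCalled(steps, 'commit'));  // true
console.log(skillWasCalled(steps, 'deploy'));  // false
```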

build_passed

Runs a command in the working directory and checks the exit code. When the command fails, the result's details include stdout and stderr so you can see why the build failed.

{
  type: 'build_passed',
  command: 'yarn build',     // Command to run (default: 'yarn build')
  expectedExitCode: 0        // Expected exit code (default: 0)
}
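Conceptually, the check amounts to running the command and comparing exit codes, as in this sketch. The `buildPassed` helper here is a stand-in for illustration, assuming the evaluator shells out to the command; the real implementation details may differ.

```typescript
import { spawnSync } from 'node:child_process';

// Sketch (assumed behavior): run the command, compare the exit code,
// and capture stdout/stderr so failures can be diagnosed.
function buildPassed(command: string, expectedExitCode = 0, workDir?: string) {
  const result = spawnSync(command, { shell: true, cwd: workDir, encoding: 'utf8' });
  return {
    passed: result.status === expectedExitCode,
    exitCode: result.status,
    stdout: result.stdout,   // surfaced in details when the build fails
    stderr: result.stderr
  };
}

console.log(buildPassed('exit 0').passed);  // true
console.log(buildPassed('exit 1').passed);  // false
```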

llm_judge

Uses an LLM to evaluate the output with a customizable prompt. The default system prompt instructs the judge to be strict about factual verification: when you ask it to verify a specific fact, the judge must compare against the actual data and give a score of 0 or near 0 on a mismatch. If the judge returns invalid JSON, the evaluator retries up to 3 times before failing.

{
  type: 'llm_judge',
  prompt: 'Evaluate the quality of this code:\n\n{{output}}',
  systemPrompt: 'You are a code reviewer...',  // Optional custom system prompt
  minScore: 70,                                 // Minimum passing score (0-100, default: 70)
  model: 'claude-3-5-haiku-20241022',          // Model to use
  maxTokens: 1024,                             // Max output tokens
  temperature: 0                               // Temperature (0-1)
}

Tip: When verifying file-related outcomes, include {{changedFiles}} in your prompt so the judge sees the actual files. Without it, the judge still receives this data in the system context, but making it explicit improves accuracy.

Available placeholders in prompts:

  • {{output}} - The agent's final output text
  • {{cwd}} - Working directory path
  • {{changedFiles}} - List of files that were modified
  • {{trace}} - Formatted LLM trace showing tool calls and completions
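Substitution presumably works by replacing each `{{name}}` token with the corresponding input value. The `fillPrompt` helper below is a sketch of that idea, not the library's actual implementation:

```typescript
// Replace each {{name}} token with its value; unknown tokens are left intact.
function fillPrompt(template: string, values: Record<string, string>): string {
  return template.replace(/\{\{(\w+)\}\}/g, (match, name) =>
    name in values ? values[name] : match);
}

const prompt = 'Evaluate this output:\n\n{{output}}\n\nFiles: {{changedFiles}}';
const filled = fillPrompt(prompt, {
  output: 'Refactored the parser.',
  changedFiles: 'src/parser.ts'
});
console.log(filled);
```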

Types

EvaluationInput

The input data for assertion evaluation:

interface EvaluationInput {
  outputText?: string;
  llmTrace?: LLMTrace;
  fileDiffs?: Array<{ path: string; content?: string; status?: 'new' | 'modified' }>;
}

When fileDiffs items include status, the {{modifiedFiles}} and {{newFiles}} placeholders are populated for the LLM judge.
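As an illustration of that behavior (the grouping logic shown here is an assumption, not the library's code), a `status` field lets the diffs be split into the two groups those placeholders expose:

```typescript
type FileDiff = { path: string; content?: string; status?: 'new' | 'modified' };

const fileDiffs: FileDiff[] = [
  { path: 'src/index.ts', status: 'modified' },
  { path: 'src/new-module.ts', status: 'new' }
];

// Split by status: these lists would back {{modifiedFiles}} and {{newFiles}}.
const modifiedFiles = fileDiffs.filter(d => d.status === 'modified').map(d => d.path);
const newFiles = fileDiffs.filter(d => d.status === 'new').map(d => d.path);

console.log(modifiedFiles);  // ['src/index.ts']
console.log(newFiles);       // ['src/new-module.ts']
```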

AssertionContext

Optional context for assertions:

interface AssertionContext {
  workDir?: string;                           // For build_passed
  llmConfig?: {                               // For llm_judge
    baseUrl: string;
    headers: Record<string, string>;
  };
  generateTextForLlmJudge?: (options) => Promise<{ text: string }>;  // For testing
}

AssertionResult

The result of evaluating an assertion:

interface AssertionResult {
  id: string;
  assertionId: string;
  assertionType: string;
  assertionName: string;
  status: AssertionResultStatus;  // 'passed' | 'failed' | 'skipped' | 'error'
  message?: string;
  expected?: string;
  actual?: string;
  duration?: number;
  details?: Record<string, unknown>;
}

Creating Custom Evaluators

You can extend the framework with custom assertion types:

import {
  AssertionEvaluator,
  AssertionResultStatus,
  registerEvaluator,
  type AssertionContext,
  type AssertionResult,
  type EvaluationInput
} from '@wix/eval-assertions';
import { z } from 'zod';

// Define your assertion schema
export const MyAssertionSchema = z.object({
  type: z.literal('my_assertion'),
  customField: z.string()
});

export type MyAssertion = z.infer<typeof MyAssertionSchema>;

// Implement the evaluator
export class MyAssertionEvaluator extends AssertionEvaluator<MyAssertion> {
  readonly type = 'my_assertion' as const;

  evaluate(assertion: MyAssertion, input: EvaluationInput, context?: AssertionContext): AssertionResult {
    // Your evaluation logic here
    return {
      id: crypto.randomUUID(),
      assertionId: crypto.randomUUID(),
      assertionType: 'my_assertion',
      assertionName: 'My Custom Assertion',
      status: AssertionResultStatus.PASSED,
      message: 'Assertion passed!'
    };
  }
}

// Register your evaluator
registerEvaluator('my_assertion', new MyAssertionEvaluator());

License

MIT