@wix/evalforge-evaluator

v0.226.0

Published

2 days ago

EvalForge Evaluator


@wix/evalforge-evaluator

CLI tool that executes AI agent evaluations. It fetches an eval run configuration from the backend, runs each scenario against a Claude Code agent, streams trace events, runs assertions, and reports results.

How It Works

evaluator <project-id> <eval-run-id>

1. Load configuration from environment variables (server URL, AI Gateway credentials, etc.)
2. Fetch evaluation data from the backend API: eval run, scenarios, agent config, skills, MCPs, sub-agents, rules, and templates
3. For each scenario:
   - Prepare a working directory (download and extract template)
   - Write skills to .claude/skills/<name>/SKILL.md
   - Write MCPs to .mcp.json
   - Write sub-agents to .claude/agents/<name>.md
   - Write rules to CLAUDE.md, AGENTS.md, or .cursor/rules/<name>.md based on rule type
   - Launch the Claude Code agent with the scenario's trigger prompt via @anthropic-ai/claude-agent-sdk
   - Stream trace events back to the backend
   - Run assertions on the agent's output
   - Report the scenario result
4. Finalize: set the eval run status to COMPLETED or FAILED
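
The file layout written for each scenario can be sketched as a pure function that maps scenario assets to relative paths. This is an illustrative sketch, not the evaluator's actual code: the `ScenarioAssets` shape, the `RuleType` identifiers, and the function names are hypothetical, and only the target paths come from the list above.

```typescript
// Hypothetical rule-type identifiers; the README names only the target files.
type RuleType = "claude" | "agents" | "cursor";

// Hypothetical input shape for illustration.
interface ScenarioAssets {
  skills: string[]; // skill names
  subAgents: string[]; // sub-agent names
  rules: { name: string; type: RuleType }[];
  hasMcps: boolean;
}

// Maps one rule to the file it is written to, based on its type.
function rulePath(rule: { name: string; type: RuleType }): string {
  switch (rule.type) {
    case "claude":
      return "CLAUDE.md";
    case "agents":
      return "AGENTS.md";
    case "cursor":
      return `.cursor/rules/${rule.name}.md`;
  }
}

// Returns the relative paths that would be written inside the working directory.
function planWorkspaceFiles(assets: ScenarioAssets): string[] {
  const files: string[] = [];
  for (const name of assets.skills) {
    files.push(`.claude/skills/${name}/SKILL.md`);
  }
  if (assets.hasMcps) {
    files.push(".mcp.json");
  }
  for (const name of assets.subAgents) {
    files.push(`.claude/agents/${name}.md`);
  }
  for (const rule of assets.rules) {
    files.push(rulePath(rule));
  }
  return files;
}
```

Keeping the path planning separate from the file writes makes the layout easy to unit-test without touching the disk.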

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| EVAL_SERVER_URL | Yes | Backend server URL for fetching data and reporting results |
| AI_GATEWAY_URL | Yes | AI Gateway base URL for LLM calls |
| AI_GATEWAY_HEADERS | No | Custom headers for AI Gateway (newline-separated key:value pairs) |
| EVAL_API_PREFIX | No | API path prefix (e.g., /api/v1) |
| EVALUATIONS_DIR | No | Directory for evaluation working directories |
| TRACE_PUSH_URL | No | Enables remote trace push when set (remote job execution). Events are pushed via the gRPC PushTraceEvent RPC; the URL value itself is legacy |
| EVAL_ROUTE_HEADER | No | x-wix-route header for deploy preview routing |
| EVAL_AUTH_TOKEN | No | Bearer token for the remaining legacy REST public endpoints |
| EVAL_GRPC_AUTH_TOKEN | No | S2S-signed token for ambassador/gRPC calls (absent in local dev, where calls go out unauthenticated) |
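
As a rough illustration of how these variables might be read, the sketch below validates the two required values and parses AI_GATEWAY_HEADERS. The `EvaluatorConfig` shape and both function names are hypothetical. Splitting on the first colon and trimming whitespace are assumptions about the header format, which the table describes only as newline-separated key:value pairs.

```typescript
// Hypothetical config shape; field names are illustrative.
interface EvaluatorConfig {
  serverUrl: string;
  aiGatewayUrl: string;
  aiGatewayHeaders: Record<string, string>;
  apiPrefix?: string;
  evaluationsDir?: string;
}

// Parses newline-separated "key:value" pairs.
// Assumption: split on the FIRST colon so values such as URLs stay intact.
function parseGatewayHeaders(raw: string | undefined): Record<string, string> {
  const headers: Record<string, string> = {};
  for (const line of (raw ?? "").split(/\r?\n/)) {
    const idx = line.indexOf(":");
    if (idx <= 0) continue; // skip blank lines and lines without a key
    const key = line.slice(0, idx).trim();
    const value = line.slice(idx + 1).trim();
    if (key) headers[key] = value;
  }
  return headers;
}

// Builds the config from an env-like map, failing fast on missing required values.
function loadConfig(env: Record<string, string | undefined>): EvaluatorConfig {
  const required = (name: string): string => {
    const value = env[name];
    if (!value) throw new Error(`Missing required environment variable: ${name}`);
    return value;
  };
  return {
    serverUrl: required("EVAL_SERVER_URL"),
    aiGatewayUrl: required("AI_GATEWAY_URL"),
    aiGatewayHeaders: parseGatewayHeaders(env["AI_GATEWAY_HEADERS"]),
    apiPrefix: env["EVAL_API_PREFIX"],
    evaluationsDir: env["EVALUATIONS_DIR"],
  };
}
```

Passing the environment in as a plain map (rather than reading `process.env` directly) keeps the sketch testable; a real entry point would presumably call `loadConfig(process.env)`.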

For OpenCode runs, the evaluator sets lsp: false in OPENCODE_CONFIG_CONTENT and sets OPENCODE_DISABLE_LSP_DOWNLOAD and OPENCODE_DISABLE_FILETIME_CHECK in the process environment (the same settings as ditto codegen). This avoids LSP hangs after edit tools and spurious "file modified since last read" failures in automated evals.
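
A minimal sketch of how such an environment could be assembled is shown below. The function name is hypothetical, and two details are assumptions rather than facts from this README: that OPENCODE_CONFIG_CONTENT carries a JSON string, and that the two disable flags are set to the string "true".

```typescript
// Builds the child-process environment for an OpenCode run (illustrative sketch).
function buildOpenCodeEnv(
  baseEnv: Record<string, string | undefined>,
  baseConfig: Record<string, unknown> = {},
): Record<string, string | undefined> {
  return {
    ...baseEnv,
    // Assumption: the config content is serialized as JSON; lsp is forced off.
    OPENCODE_CONFIG_CONTENT: JSON.stringify({ ...baseConfig, lsp: false }),
    // Assumption: "true" is the value used to enable each flag.
    OPENCODE_DISABLE_LSP_DOWNLOAD: "true",
    OPENCODE_DISABLE_FILETIME_CHECK: "true",
  };
}
```

Spreading `baseConfig` before `lsp: false` means the override wins even if the incoming config enables LSP.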

OpenCode cost comes from the gateway, not OpenCode. OpenCode prices the Wix AI Gateway as a free custom provider, so its self-reported step_finish.cost is ~$0. Instead, the evaluator runs a localhost pass-through (gateway-cost-interceptor.ts) between OpenCode and the gateway: it forwards each request untouched, streams the response straight back, and reads the real total_cost_usd the gateway already injects into every response. Those per-request costs map to OpenCode turns in the LLM trace; if a request's cost can't be read, that turn falls back to OpenCode's reported cost (logged). No pricing tables to maintain — the number is whatever the gateway billed.
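
The fallback rule described above can be sketched as follows. This is not the code in gateway-cost-interceptor.ts: the function and type names are hypothetical, and reading total_cost_usd as a top-level numeric field of the parsed response body is an assumption, since the README does not say where in the response the gateway places it.

```typescript
interface TurnCost {
  costUsd: number;
  source: "gateway" | "opencode";
}

// Assumption: total_cost_usd is a top-level numeric field of the parsed body.
function readGatewayCost(body: unknown): number | undefined {
  if (typeof body !== "object" || body === null) return undefined;
  const value = (body as Record<string, unknown>)["total_cost_usd"];
  return typeof value === "number" && Number.isFinite(value) ? value : undefined;
}

// Prefers the gateway-billed cost; falls back to OpenCode's reported cost and logs it.
function resolveTurnCost(
  gatewayBody: unknown,
  reportedCostUsd: number,
  log: (message: string) => void = console.warn,
): TurnCost {
  const gatewayCost = readGatewayCost(gatewayBody);
  if (gatewayCost !== undefined) {
    return { costUsd: gatewayCost, source: "gateway" };
  }
  log(`Gateway cost unavailable; falling back to reported cost ${reportedCostUsd}`);
  return { costUsd: reportedCostUsd, source: "opencode" };
}
```

Recording the `source` alongside the number makes it visible in the trace which turns used the fallback.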

The evaluator is typically launched by the backend (locally or on a remote Dev Machine) with these variables pre-configured.

Backend API access

Backend calls go through the evalforge ambassador packages (gRPC via @wix/http-client): all reads, plus addResult (AddEvalRunResult), clearResults (ClearEvalRunResults), and trace-event push (PushTraceEvent). The only call still on the legacy REST surface is updateEvalRun — the gRPC UpdateEvalRun handler only forwards user-editable fields, not the system state transitions (status/completedAt/jobError/jobStatus) the evaluator writes.
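
For the remaining legacy REST call, the request headers and payload might look like the sketch below. The four state fields come from the paragraph above and the header names from the Environment Variables table; the field types, the `SystemStateUpdate` interface, and the function name are assumptions, and no endpoint path is shown because the README does not give one.

```typescript
// System state fields the evaluator writes; the types are assumed for illustration.
interface SystemStateUpdate {
  status?: string; // e.g. "COMPLETED" or "FAILED"
  completedAt?: string;
  jobError?: string;
  jobStatus?: string;
}

// Builds headers for the legacy REST updateEvalRun call (hypothetical helper).
function buildLegacyRestHeaders(
  env: Record<string, string | undefined>,
): Record<string, string> {
  const headers: Record<string, string> = { "content-type": "application/json" };
  const token = env["EVAL_AUTH_TOKEN"];
  if (token) headers["authorization"] = `Bearer ${token}`;
  const route = env["EVAL_ROUTE_HEADER"];
  if (route) headers["x-wix-route"] = route; // deploy preview routing
  return headers;
}

// Example payload for marking a run as finished.
const exampleUpdate: SystemStateUpdate = {
  status: "COMPLETED",
  completedAt: new Date(0).toISOString(),
};
```

Both optional headers are omitted when their variables are unset, which matches the table marking them as not required.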

Live trace during environment setup

For templated runs, the evaluator emits PROGRESS trace events during environment setup — "Setting up environment", "Fetching template files", "Installing dependencies", "Environment ready" — via the shared emitTraceEvent helper. Because emitTraceEvent writes to stdout (captured by the backend for local runs) and also calls the pushEvent callback (used for remote jobs via tracePushUrl), these events appear in the live trace in both local and remote runs. Without them, the trace panel stays blank during the often multi-minute setup phase before the agent starts.
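
The dual-sink behaviour can be sketched as below. This is a simplified stand-in for the shared emitTraceEvent helper, not its real signature: the `TraceEvent` shape, the JSON-line output format, and the injectable `write` parameter are all assumptions made so the example is self-contained.

```typescript
// Hypothetical event shape for illustration.
interface TraceEvent {
  type: "PROGRESS";
  message: string;
}

type PushEvent = (event: TraceEvent) => void;

// Writes the event to a line-oriented sink (stdout by default) and, when a
// push callback is provided (remote jobs), forwards the same event to it.
function emitTraceEventSketch(
  event: TraceEvent,
  pushEvent?: PushEvent,
  write: (line: string) => void = console.log,
): void {
  write(JSON.stringify(event)); // captured by the backend for local runs
  pushEvent?.(event); // used for remote jobs
}

// The setup-phase messages named above, emitted in order.
function emitSetupProgress(pushEvent?: PushEvent, write?: (line: string) => void): void {
  const messages = [
    "Setting up environment",
    "Fetching template files",
    "Installing dependencies",
    "Environment ready",
  ];
  for (const message of messages) {
    emitTraceEventSketch({ type: "PROGRESS", message }, pushEvent, write);
  }
}
```

Because every event goes through one function, local and remote runs see the same sequence without separate code paths.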

Scripts

yarn build       # Build CJS + ESM + type declarations
yarn test        # Run tests
yarn lint        # Run ESLint
yarn clean       # Remove build artifacts

Dependencies

@wix/evalforge-types — shared type definitions
@wix/eval-assertions — assertion evaluation framework
@wix/evalforge-github-client — GitHub API client for fetching skill files
@anthropic-ai/claude-agent-sdk — Claude Code agent SDK
