@poofnew/vibe-check

v0.1.1

Published

4 months ago

AI agent evaluation framework for Claude and beyond

prpatel05

ai agent evaluation testing claude anthropic llm

Background

Building reliable AI agents is hard. Unlike traditional software, where the same input produces the same output, AI agents are non-deterministic, and small changes can have unexpected, far-reaching consequences. A minor prompt edit, a small adjustment to system instructions, or a change in model parameters can ripple through the entire system and cause subtle failures that are difficult to detect and diagnose.

Testing AI agents presents challenges that traditional testing frameworks do not address well. How do you validate that an agent correctly interprets user intent? How do you ensure tool invocations are appropriate and executed in the right sequence? How do you catch regressions when a prompt change breaks edge cases you didn't anticipate? These questions get considerably harder with multi-turn conversations, code generation, and complex routing decisions.

We built vibe-check internally to rigorously test and validate poof.new. As we iterated on prompts, refined agent behaviors, and added new capabilities, we needed a systematic way to ensure our changes didn't break existing functionality and to catch issues before they reached production. Traditional testing approaches fell short, so we created a framework specifically designed for AI agent evaluation.

After using vibe-check extensively in our own development process, we're now open-sourcing it to help the broader AI agent development community. We believe that robust testing and evaluation frameworks are essential for building production-ready AI systems, and we hope vibe-check will help others navigate the complexities of agent development with more confidence.

Why vibe-check?

Building reliable AI agents is hard. Traditional testing approaches fall short when evaluating LLM behavior, tool usage, and multi-turn interactions. vibe-check provides a comprehensive framework specifically designed for AI agent evaluation:

Agent-Native Testing: Evaluate tool calls, code generation, routing decisions, and conversational flows
Learning from Failures: Built-in learning system analyzes failures and suggests prompt improvements
Production-Ready: Parallel execution, retries, isolated workspaces, and detailed reporting
Framework Agnostic: Works with Claude SDK (TypeScript & Python), custom agents, or any LLM-powered system
Developer-First: TypeScript-native with full type safety and intuitive APIs

Real-World Use Cases

🤖 Agent Development: Validate your AI agent meets requirements before shipping
📊 Regression Testing: Catch regressions when updating prompts or models
🔄 A/B Testing: Compare agent performance across different configurations
📈 Continuous Improvement: Use the learning system to systematically improve prompts
🎯 Benchmarking: Measure and track agent performance over time
🔍 Pre-deployment Validation: Gate production deployments on eval results

Comparison

| Feature | vibe-check | Manual Testing | Unit Tests |
| ------------------------- | ---------- | -------------- | ---------- |
| Agent-specific evaluation | ✅ | ❌ | ❌ |
| Tool call validation | ✅ | ⚠️ Difficult | ❌ |
| Multi-turn conversations | ✅ | ⚠️ Manual | ❌ |
| Learning from failures | ✅ | ❌ | ❌ |
| Isolated workspaces | ✅ | ❌ | ⚠️ Manual |
| Parallel execution | ✅ | ❌ | ✅ |
| Framework agnostic | ✅ | ✅ | ⚠️ Limited |
| TypeScript-native | ✅ | N/A | ✅ |

Features

5 Eval Categories: Tool usage, code generation, routing, multi-turn conversations, and basic evaluations
Built-in Judges: Code-based judges for file existence, tool invocation, pattern matching, syntax validation, and skill invocation, plus 4 LLM-based judges with rubric support
Automatic Tool Extraction: For claude-code agents, tool calls are automatically extracted from JSONL logs
Extensible Judge System: Create custom judges for specialized validation
Parallel Execution: Run evaluations concurrently with configurable concurrency
Retry Logic: Automatic retries with exponential backoff for flaky tests
Flaky Test Detection: Automatically identifies tests that pass on retry
Isolated Workspaces: Each eval runs in its own temporary directory
Multi-trial Support: Run multiple trials per eval with pass thresholds
Per-Turn Judges: Evaluate each turn independently in multi-turn conversations
Learning System: Analyze failures and generate improvement rules
TypeScript First: Full type safety with comprehensive type exports

Installation

# Using bun (recommended)
bun add @poofnew/vibe-check

# Using npm
npm install @poofnew/vibe-check

# Using pnpm
pnpm add @poofnew/vibe-check

Quick Start

1. Initialize your project

bunx vibe-check init

This creates:

vibe-check.config.ts - Configuration file with agent function stub
__evals__/example.eval.json - Example evaluation case

2. Configure your agent

Edit vibe-check.config.ts to implement your agent function:

import { defineConfig } from "@poofnew/vibe-check";

export default defineConfig({
  testDir: "./__evals__",

  agent: async (prompt, context) => {
    // Your agent implementation here
    const result = await yourAgent.run(prompt, {
      cwd: context.workingDirectory,
    });

    return {
      output: result.text,
      success: result.success,
      toolCalls: result.tools,
    };
  },
});

3. Create eval cases

Create JSON files in the __evals__/ directory:

{
  "id": "create-hello-file",
  "name": "Create Hello File",
  "description": "Test that agent can create a simple file",
  "category": "code-gen",
  "prompt": "Create a file called hello.ts that exports a greet function",
  "targetFiles": ["hello.ts"],
  "expectedPatterns": [
    {
      "file": "hello.ts",
      "patterns": ["export", "function greet"]
    }
  ],
  "judges": ["file-existence", "pattern-match"]
}

4. Run evaluations

bunx vibe-check run

Configuration

Configuration File

Create vibe-check.config.ts in your project root:

import { defineConfig } from "@poofnew/vibe-check";

export default defineConfig({
  // Required: Your agent function
  agent: async (prompt, context) => {
    return { output: "", success: true };
  },

  // Optional settings with defaults shown
  agentType: "generic", // 'claude-code' | 'generic' - use 'claude-code' for automatic JSONL tool extraction
  testDir: "./__evals__", // Directory containing eval cases
  rubricsDir: "./__evals__/rubrics", // Directory for LLM judge rubrics
  testMatch: ["**/*.eval.json"], // Glob patterns for eval files
  parallel: true, // Run evals in parallel
  maxConcurrency: 3, // Max concurrent evals
  timeout: 300000, // Timeout per eval (ms)
  maxRetries: 2, // Retry failed evals
  retryDelayMs: 1000, // Initial retry delay
  retryBackoffMultiplier: 2, // Exponential backoff multiplier
  trials: 1, // Number of trials per eval
  trialPassThreshold: 0.5, // Required pass rate for trials
  verbose: false, // Verbose output
  preserveWorkspaces: false, // Keep workspace dirs after eval (for debugging)
  outputDir: "./__evals__/results",

  // Custom judges
  judges: [],

  // Workspace hooks (customize workspace creation)
  createWorkspace: async () => {
    return { id: "my-workspace", path: "/path/to/workspace" };
  },
  cleanupWorkspace: async (workspace) => {
    // Custom cleanup logic
  },

  // Lifecycle hooks
  setup: async () => {},
  teardown: async () => {},
  beforeEach: async (evalCase) => {},
  afterEach: async (result) => {},

  // Learning system config
  learning: {
    enabled: false,
    ruleOutputDir: "./prompts",
    minFailuresForPattern: 2,
    autoApprove: false,
  },
});

Agent Function

The agent function receives a prompt and context, and must return an AgentResult:

interface AgentContext {
  workingDirectory: string; // Isolated temp directory for this eval
  evalId: string; // Unique eval case ID
  evalName: string; // Eval case name
  sessionId?: string; // For multi-turn sessions
  timeout?: number; // Eval timeout in ms
}

interface AgentResult {
  output: string; // Agent's text output
  success: boolean; // Whether agent completed successfully
  toolCalls?: ToolCall[]; // Record of tool invocations
  sessionId?: string; // Session ID for multi-turn
  error?: Error; // Error if failed
  duration?: number; // Execution time in ms
  usage?: {
    inputTokens: number;
    outputTokens: number;
    totalCostUsd?: number;
  };
}

interface ToolCall {
  toolName: string;
  input: unknown;
  output?: unknown;
  isError?: boolean;
  timestamp?: number; // When the tool was called
  duration?: number; // How long the call took (ms)
}
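
As a rough illustration of the contract above, the sketch below adapts a plain text-generating function to the agent signature. The `AgentContextSketch` and `AgentResultSketch` types are local stand-ins that mirror only a subset of the documented fields, and `wrapTextAgent` is a hypothetical helper, not part of the vibe-check API. It assumes failures are best reported through `success` and `error` rather than by throwing; the source does not say how thrown errors are handled.

```typescript
// Local stand-ins that mirror a subset of the documented interfaces.
interface AgentContextSketch {
  workingDirectory: string;
  evalId: string;
  evalName: string;
}

interface AgentResultSketch {
  output: string;
  success: boolean;
  error?: Error;
  duration?: number;
}

// Hypothetical helper: adapts a text-only generator to the agent signature.
function wrapTextAgent(
  generate: (prompt: string) => Promise<string>,
): (prompt: string, context: AgentContextSketch) => Promise<AgentResultSketch> {
  return async (prompt, _context) => {
    const startedAt = Date.now();
    try {
      const output = await generate(prompt);
      return { output, success: true, duration: Date.now() - startedAt };
    } catch (err) {
      // Report the failure in the result instead of letting it propagate.
      const error = err instanceof Error ? err : new Error(String(err));
      return { output: "", success: false, error, duration: Date.now() - startedAt };
    }
  };
}
```

A function returned by `wrapTextAgent` could then be passed as the `agent` option, with the generator replaced by a call to whichever LLM client you use.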

Eval Case Categories

Basic (`basic`)

Simple prompt-response evaluations.

{
  "id": "basic-greeting",
  "name": "Basic Greeting",
  "description": "Test basic response",
  "category": "basic",
  "prompt": "Say hello",
  "expectedBehavior": "Should respond with a greeting",
  "judges": []
}

Tool (`tool`)

Validates that specific tools are invoked correctly.

{
  "id": "read-file-test",
  "name": "Read File Test",
  "description": "Test file reading capability",
  "category": "tool",
  "prompt": "Read the contents of package.json",
  "expectedToolCalls": [
    {
      "toolName": "Read",
      "minCalls": 1,
      "maxCalls": 3
    }
  ],
  "judges": ["tool-invocation"]
}

Code Generation (`code-gen`)

Validates file creation and content patterns.

{
  "id": "create-component",
  "name": "Create React Component",
  "description": "Test component generation",
  "category": "code-gen",
  "prompt": "Create a React button component in src/Button.tsx",
  "targetFiles": ["src/Button.tsx"],
  "expectedPatterns": [
    {
      "file": "src/Button.tsx",
      "patterns": [
        "export.*Button",
        "React",
        "onClick"
      ]
    }
  ],
  "syntaxValidation": true,
  "buildVerification": false,
  "judges": ["file-existence", "pattern-match"]
}

Routing (`routing`)

Validates request routing to appropriate agents.

{
  "id": "route-to-coding",
  "name": "Route to Coding Agent",
  "description": "Test routing for code tasks",
  "category": "routing",
  "prompt": "Write a sorting algorithm",
  "expectedAgent": "coding",
  "shouldNotRoute": ["research", "conversational"],
  "judges": []
}

Multi-Turn (`multi-turn`)

Validates multi-turn conversational flows with optional per-turn evaluation.

{
  "id": "iterative-refinement",
  "name": "Iterative Code Refinement",
  "description": "Test multi-turn improvements",
  "category": "multi-turn",
  "turns": [
    {
      "prompt": "Create a basic add function",
      "expectedBehavior": "Creates initial function",
      "judges": ["file-existence"]
    },
    {
      "prompt": "Add input validation",
      "expectedBehavior": "Adds type checking",
      "judges": ["pattern-match"]
    },
    {
      "prompt": "Add JSDoc comments",
      "expectedBehavior": "Documents the function"
    }
  ],
  "judges": ["syntax-validation"],
  "sessionPersistence": true
}
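
To make the session handling more concrete, here is a minimal sketch of how turns might be run in order while carrying `sessionId` from one result into the next context. The types and the `runTurns` function are hypothetical and written for illustration only; vibe-check's actual multi-turn runner may work differently.

```typescript
interface TurnSketch {
  prompt: string;
}

interface TurnResultSketch {
  output: string;
  success: boolean;
  sessionId?: string;
}

// Simplified agent shape: only the session part of the context is modeled.
type SessionAgent = (
  prompt: string,
  context: { sessionId: string | undefined },
) => Promise<TurnResultSketch>;

async function runTurns(
  agent: SessionAgent,
  turns: TurnSketch[],
): Promise<TurnResultSketch[]> {
  const results: TurnResultSketch[] = [];
  let sessionId: string | undefined;

  for (const turn of turns) {
    const result = await agent(turn.prompt, { sessionId });
    results.push(result);
    // Reuse the session returned by the agent so later turns see earlier context.
    sessionId = result.sessionId ?? sessionId;
  }
  return results;
}
```

In this sketch the first turn receives no session ID, and every later turn receives whatever ID the agent last returned.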

Common Fields

All eval cases support these fields:

| Field | Type | Description |
| ------------- | -------- | -------------------------------------------------- |
| id | string | Unique identifier |
| name | string | Display name |
| description | string | Description of what's being tested |
| category | string | One of: basic, tool, code-gen, routing, multi-turn |
| tags | string[] | Optional tags for filtering |
| enabled | boolean | Enable/disable (default: true) |
| timeout | number | Override default timeout (ms) |
| trials | object | { count: number, passThreshold: number } |

Built-in Judges

file-existence

Validates that expected files were created.

Checks all targetFiles exist in workspace
Passes if ≥80% of files exist
Returns detailed list of missing files

tool-invocation

Validates tool call counts.

Checks expectedToolCalls against actual calls
Supports minCalls and maxCalls constraints
Defaults minCalls to 1 if not specified
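
The counting rules above can be sketched roughly as follows. This is an illustrative reimplementation under the stated rules (minCalls defaults to 1, maxCalls is optional), not the judge's actual source; `checkToolCalls` and both interfaces are hypothetical names.

```typescript
interface ExpectedToolCallSketch {
  toolName: string;
  minCalls?: number;
  maxCalls?: number;
}

interface ToolCallSketch {
  toolName: string;
}

// Returns one message per violated constraint; an empty array means a pass.
function checkToolCalls(
  expected: ExpectedToolCallSketch[],
  actual: ToolCallSketch[],
): string[] {
  const counts = new Map<string, number>();
  for (const call of actual) {
    counts.set(call.toolName, (counts.get(call.toolName) ?? 0) + 1);
  }

  const violations: string[] = [];
  for (const rule of expected) {
    const count = counts.get(rule.toolName) ?? 0;
    const min = rule.minCalls ?? 1; // documented default
    if (count < min) {
      violations.push(`${rule.toolName}: expected at least ${min} call(s), saw ${count}`);
    }
    if (rule.maxCalls !== undefined && count > rule.maxCalls) {
      violations.push(`${rule.toolName}: expected at most ${rule.maxCalls} call(s), saw ${count}`);
    }
  }
  return violations;
}
```

With the Read File Test case above (minCalls 1, maxCalls 3), two `Read` calls produce no violations, while zero or four calls each produce one.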

pattern-match

Validates file content matches expected patterns.

Uses regex patterns from expectedPatterns
Supports multiline matching
Passes if ≥80% of patterns found
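
As an illustration of the 80% rule, the sketch below checks a list of regex patterns against one file's content. It is not the judge's real implementation: `checkPatterns` is a hypothetical name, the `m` flag is this sketch's reading of "multiline matching", and treating an empty pattern list as a pass is an assumption.

```typescript
interface PatternCheckResult {
  passed: boolean;
  score: number; // 0-100, share of patterns found
  unmatched: string[];
}

function checkPatterns(
  content: string,
  patterns: string[],
  passRatio = 0.8,
): PatternCheckResult {
  // Each pattern is compiled as a regular expression and tested on its own.
  const unmatched = patterns.filter(
    (pattern) => !new RegExp(pattern, "m").test(content),
  );
  const ratio =
    patterns.length === 0
      ? 1 // assumption: nothing to check counts as a pass
      : (patterns.length - unmatched.length) / patterns.length;

  return { passed: ratio >= passRatio, score: Math.round(ratio * 100), unmatched };
}
```

Under this rule an eval with three patterns needs all three to match, because two of three is about 67%. With five patterns, one miss is tolerated.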

syntax-validation

Validates generated code has valid syntax.

Supports TypeScript, JavaScript, JSX, TSX
Uses Babel parser for accurate syntax checking
Reports specific syntax errors

skill-invocation

Validates that specific skills were invoked.

Checks expectedSkills against actual skill calls
Supports minCalls constraints

LLM Judges (Rubric-based)

Evaluate outputs using an LLM with custom rubrics:

llm-code-quality - Evaluate code against code-quality.md rubric
llm-response-quality - Evaluate responses against response-quality.md rubric
llm-routing-quality - Evaluate routing decisions
llm-conversation-quality - Evaluate conversation quality

Configure rubrics directory in your config:

export default defineConfig({
  rubricsDir: "./__evals__/rubrics",
  // ...
});

Create rubric files (e.g., code-quality.md) with evaluation criteria.

Reference Solutions

Use referenceSolution in eval cases for pairwise comparison:

{
  "id": "create-function",
  "category": "code-gen",
  "prompt": "Create an add function",
  "referenceSolution": {
    "description": "A properly typed add function",
    "code": "function add(a: number, b: number): number {\n  return a + b;\n}"
  },
  "judges": ["llm-code-quality"]
}

LLM judges will compare the agent's output against the reference solution.

Custom Judges

Create custom judges by extending BaseJudge:

import {
  BaseJudge,
  defineConfig,
  getJudgeRegistry,
  type JudgeContext,
  type JudgeResult,
  type JudgeType,
} from "@poofnew/vibe-check";

class ResponseLengthJudge extends BaseJudge {
  id = "response-length";
  name = "Response Length Judge";
  type: JudgeType = "code";

  constructor(
    private minLength: number = 10,
    private maxLength: number = 1000,
  ) {
    super();
  }

  async evaluate(context: JudgeContext): Promise<JudgeResult> {
    const length = context.executionResult.output.length;

    if (length < this.minLength) {
      return this.createResult({
        passed: false,
        score: 0,
        reasoning: `Response too short: ${length} chars`,
      });
    }

    if (length > this.maxLength) {
      return this.createResult({
        passed: false,
        score: 50,
        reasoning: `Response too long: ${length} chars`,
      });
    }

    return this.createResult({
      passed: true,
      score: 100,
      reasoning: `Response length ${length} is acceptable`,
    });
  }
}

// Register globally
const registry = getJudgeRegistry();
registry.register(new ResponseLengthJudge(20, 500));

// Or add to config
export default defineConfig({
  judges: [new ResponseLengthJudge(20, 500)],
  // ...
});

JudgeContext

interface JudgeContext {
  evalCase: EvalCase; // The eval case being judged
  executionResult: ExecutionResult;
  workingDirectory: string; // Workspace path
  turnIndex?: number; // For multi-turn evals
}

interface ExecutionResult {
  success: boolean;
  output: string;
  error?: Error;
  toolCalls: ToolCallRecord[];
  duration: number;
  numTurns?: number;
  sessionId?: string;
  workingDirectory?: string;
  transcript?: Transcript; // Full conversation transcript
  progressUpdates?: ProgressRecord[]; // Progress tracking
  usage?: {
    inputTokens: number;
    outputTokens: number;
    totalCostUsd?: number;
  };
}

Learning System

The learning system automatically analyzes test failures and generates prompt improvements to enhance your agent's performance over time.

How It Works

Collect Failures: Gathers failed evals from test runs or JSONL logs
Generate Explanations: Uses an LLM to analyze why each failure occurred
Detect Patterns: Groups similar failures into patterns
Propose Rules: Generates actionable prompt rules to fix systemic issues
Human Review: Allows manual approval before integrating rules
Iterate: Re-run evals to validate improvements
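
The "Detect Patterns" step and the `minFailuresForPattern` setting can be pictured with the simplified sketch below. The real system relies on LLM-generated explanations to group failures; here a normalized reason string stands in for that grouping, and every name (`FailureSketch`, `PatternSketch`, `detectPatterns`) is hypothetical.

```typescript
interface FailureSketch {
  evalId: string;
  reason: string; // stand-in for an LLM-generated explanation
}

interface PatternSketch {
  key: string;
  failureIds: string[];
}

function detectPatterns(
  failures: FailureSketch[],
  minFailuresForPattern = 2,
): PatternSketch[] {
  // Group failure IDs under a normalized version of their reason.
  const groups = new Map<string, string[]>();
  for (const failure of failures) {
    const key = failure.reason.trim().toLowerCase();
    const ids = groups.get(key) ?? [];
    ids.push(failure.evalId);
    groups.set(key, ids);
  }

  // Keep only groups large enough to count as a recurring pattern.
  return Array.from(groups.entries())
    .filter(([, ids]) => ids.length >= minFailuresForPattern)
    .map(([key, failureIds]) => ({ key, failureIds }));
}
```

With the default of 2, a reason that appears once is ignored, and a reason shared by two or more failed evals becomes a candidate pattern for rule generation.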

Configuration

Enable learning in your config:

export default defineConfig({
  learning: {
    enabled: true,
    ruleOutputDir: "./prompts", // Where to save rules
    minFailuresForPattern: 2, // Min failures to form a pattern
    autoApprove: false, // Require manual review
  },
  // ...
});

Usage

# Run full learning iteration
vibe-check learn run --source eval

# Analyze failures without generating rules
vibe-check learn analyze --source both

# Review pending rules
vibe-check learn review

# Show learning statistics
vibe-check learn stats

# Auto-approve high-confidence rules (use with caution)
vibe-check learn run --auto-approve

Data Sources

eval: Analyze failures from recent eval runs
jsonl: Load failures from JSONL files
both: Combine both sources

Output Structure

prompts/
├── learned-rules.json          # Approved rules
├── pending-rules.json          # Rules awaiting review
├── history.json                # Learning history
└── iterations/
    └── iteration-{timestamp}.json

Example Rule Output

{
  "ruleId": "rule-1234",
  "ruleContent": "When creating files, always verify the parent directory exists before writing",
  "targetSection": "file-operations",
  "rationale": "Multiple failures showed agents attempting to write to non-existent directories",
  "addressesPatterns": ["pattern-dir-not-found"],
  "expectedImpact": {
    "failureIds": ["eval-123", "eval-456"],
    "confidenceScore": 0.89
  },
  "status": "approved"
}

Best Practices

Start with autoApprove: false to review all rules manually
Run learning iterations after accumulating 10+ failures
Set minFailuresForPattern: 2 to catch recurring issues
Review rules before integration to avoid over-fitting
Use JSONL source for production failure logs

CLI Commands

vibe-check run

Run the evaluation suite.

vibe-check run [options]

Options:
  -c, --config <path>        Path to config file
  --category <categories...> Filter by category (tool, code-gen, routing, multi-turn, basic)
  --tag <tags...>            Filter by tag
  --id <ids...>              Filter by eval ID
  -v, --verbose              Verbose output

Examples:

# Run all evals
vibe-check run

# Run only code-gen evals
vibe-check run --category code-gen

# Run evals with specific tags
vibe-check run --tag critical --tag regression

# Run specific evals by ID
vibe-check run --id create-file --id read-file

# Verbose output
vibe-check run -v

vibe-check list

List available eval cases.

vibe-check list [options]

Options:
  -c, --config <path>        Path to config file
  --category <categories...> Filter by category
  --tag <tags...>            Filter by tag
  --json                     Output as JSON

vibe-check init

Initialize vibe-check in a project.

vibe-check init [options]

Options:
  --typescript  Create TypeScript config (default)

vibe-check learn

Learning system commands for analyzing failures and generating rules.

# Run full learning iteration
vibe-check learn run [options]
  --source <source>   Data source (eval, jsonl, both)
  --auto-approve      Auto-approve high-confidence rules
  --save-pending      Save rules for later review

# Analyze failures without generating rules
vibe-check learn analyze [options]
  --source <source>   Data source (eval, jsonl, both)

# Review pending rules
vibe-check learn review

# Show learning statistics
vibe-check learn stats

Programmatic API

Use vibe-check programmatically in your code:

import {
  defineConfig,
  EvalRunner,
  loadConfig,
  loadEvalCases,
} from "@poofnew/vibe-check";

// Load and run
const config = await loadConfig("./vibe-check.config.ts");
const runner = new EvalRunner(config);

const result = await runner.run({
  categories: ["code-gen"],
  tags: ["critical"],
});

console.log(`Pass rate: ${(result.passRate * 100).toFixed(1)}%`);
console.log(`Duration: ${result.duration}ms`);

// Access individual results
for (const evalResult of result.results) {
  if (!evalResult.success) {
    console.log(`Failed: ${evalResult.evalCase.name}`);
    for (const judge of evalResult.judgeResults) {
      if (!judge.passed) {
        console.log(`  - ${judge.judgeId}: ${judge.reasoning}`);
      }
    }
  }
}

Exports

// Configuration
export { defaultConfig, defineConfig, loadConfig };
export type {
  AgentContext,
  AgentFunction,
  AgentResult,
  ResolvedConfig,
  VibeCheckConfig,
};

// Schemas
export {
  isBasicEval,
  isCodeGenEval,
  isMultiTurnEval,
  isRoutingEval,
  isToolEval,
  parseEvalCase,
};
export type { CodeGenEvalCase, EvalCase, EvalCategory, ToolEvalCase /* ... */ };

// Runner
export { EvalRunner };
export type { EvalSuiteResult, RunnerOptions };

// Judges
export { BaseJudge, getJudgeRegistry, JudgeRegistry, resetJudgeRegistry };
export type { ExecutionResult, Judge, JudgeContext, JudgeResult, JudgeType };

// Harness
export { TestHarness };
export type { EvalWorkspace, HarnessOptions };

// Utils
export { groupByCategory, loadEvalCase, loadEvalCases };

// Adapters (for multi-language support)
export { PythonAgentAdapter } from "@poofnew/vibe-check/adapters";
export type {
  AgentRequest,
  AgentResponse,
  PythonAdapterOptions,
} from "@poofnew/vibe-check/adapters";

Examples

Explore complete working examples in the examples/ directory:

🎯 Basic Example

Simple agent integration with minimal configuration:

cd examples/basic
bun install
bun run vibe-check run

Use case: Quick start template, testing custom agents

🤖 Claude Agent SDK Integration

Full-featured Claude SDK integration with tool tracking (TypeScript):

cd examples/claude-agent-sdk
bun install
export ANTHROPIC_API_KEY=your_key
bun run vibe-check run

Use case: Production Claude agents, comprehensive testing

🐍 Python Agent SDK Integration

Python SDK integration using the process-based adapter:

cd examples/python-agent
bun install
./setup.sh  # Creates Python venv and installs claude-agent-sdk
export ANTHROPIC_API_KEY=your_key
bun run vibe-check run

Use case: Python-based Claude agents, multi-language support

The Python adapter uses a JSON protocol over stdin/stdout to communicate with Python agent scripts:

import { defineConfig } from "@poofnew/vibe-check";
import { PythonAgentAdapter } from "@poofnew/vibe-check/adapters";

const adapter = new PythonAgentAdapter({
  scriptPath: "./agent.py",
  pythonPath: "./.venv/bin/python",
  env: { ANTHROPIC_API_KEY: process.env.ANTHROPIC_API_KEY },
});

export default defineConfig({
  agent: adapter.createAgent(),
  agentType: "claude-code",
});

Eval Examples by Category and Judge

| Eval File | Category | Judges Used |
| ------------------------------------- | ---------- | ------------------------------------------------ |
| basic.eval.json | basic | llm-code-quality |
| code-gen.eval.json | code-gen | file-existence, pattern-match, syntax-validation |
| tool-usage.eval.json | tool | tool-invocation |
| multi-turn.eval.json | multi-turn | - |
| route-to-coding.eval.json | routing | agent-routing |
| route-to-research.eval.json | routing | agent-routing |
| route-to-reviewer.eval.json | routing | agent-routing |
| route-intent-classification.eval.json | tool | tool-invocation |
| tool-chain-explore-modify.eval.json | tool | tool-invocation |
| tool-chain-search-replace.eval.json | tool | tool-invocation |
| tool-chain-bash.eval.json | tool | tool-invocation |
| tool-chain-analysis.eval.json | tool | tool-invocation |
| multi-file-feature.eval.json | code-gen | file-existence, pattern-match, syntax-validation |
| skill-invocation.eval.json | tool | tool-invocation, skill-invocation |
| code-review.eval.json | basic | llm-code-quality |
| debug-workflow.eval.json | code-gen | file-existence, pattern-match, syntax-validation |

🎨 Custom Judges

Advanced custom validation logic:

cd examples/custom-judges
bun install
bun run vibe-check run

Use case: Domain-specific validation, custom metrics

🔄 Multi-Turn

Multi-turn conversation testing with session persistence:

cd examples/multi-turn
bun install
bun run vibe-check run

Use case: Conversational agents, iterative refinement flows

📚 Learning System

Demonstrates the learning system with a mock agent that has deliberate flaws:

cd examples/learning
bun install
bun run vibe-check run           # Runs evals (some will fail by design)
bun run vibe-check learn stats   # Shows learning system status
bun run vibe-check learn analyze # Analyzes failures (requires ANTHROPIC_API_KEY)

Use case: Understanding the learning system, testing failure analysis pipeline

The example includes:

A mock agent with predictable flaws (uses Read instead of Write, refuses delete operations, etc.)
Pre-configured eval cases designed to fail
Pre-generated results so learning commands work immediately

Performance Tips

Optimize your eval suite for speed and reliability:

Parallel Execution

export default defineConfig({
  parallel: true,
  maxConcurrency: 5, // Balance between speed and resource usage
});

Tip: Higher concurrency = faster but more memory/API usage. Start with 3-5.
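
For intuition about what a concurrency cap does, here is a generic worker-pool sketch: a fixed number of workers pull tasks from a shared queue, so no more than `maxConcurrency` tasks are in flight at once. This is a common pattern rather than vibe-check's confirmed implementation, and `runWithConcurrency` is a hypothetical name.

```typescript
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  maxConcurrency: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0; // index of the next task to claim

  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++;
      const task = tasks[index];
      if (task === undefined) continue;
      // Each worker finishes its current task before claiming another.
      results[index] = await task();
    }
  }

  const workerCount = Math.max(1, Math.min(maxConcurrency, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results; // same order as the input tasks
}
```

Because each worker holds one task at a time, memory use and API request rate scale with the number of workers, which is why a lower cap tends to be steadier on constrained machines.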

Selective Test Runs

# Run only critical tests during development
vibe-check run --tag critical

# Run specific categories
vibe-check run --category tool code-gen

# Run single test for debugging
vibe-check run --id my-test-id

Optimize Timeouts

export default defineConfig({
  timeout: 60000,  // Default for all tests
});

// Override per eval case
{
  "id": "quick-test",
  "timeout": 10000,  // Fast tests
  // ...
}

{
  "id": "complex-generation",
  "timeout": 300000,  // Longer timeout for complex tasks
  // ...
}

Retry Strategy

export default defineConfig({
  maxRetries: 2, // Retry failed tests
  retryDelayMs: 1000, // Initial delay
  retryBackoffMultiplier: 2, // Exponential backoff
});

Tip: Enable retries for flaky network/API tests, disable for deterministic tests.
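
To see how these three settings interact, the sketch below computes a delay schedule under the usual exponential backoff formula, delay = retryDelayMs × retryBackoffMultiplier^attempt. The source does not spell out the exact formula vibe-check uses, so treat this as an assumption; `retryDelays` is a hypothetical helper.

```typescript
// Delay before each retry, assuming delay = retryDelayMs * multiplier^attempt.
function retryDelays(
  maxRetries: number,
  retryDelayMs: number,
  retryBackoffMultiplier: number,
): number[] {
  return Array.from(
    { length: maxRetries },
    (_, attempt) => retryDelayMs * Math.pow(retryBackoffMultiplier, attempt),
  );
}
```

Under that assumption, `retryDelays(2, 1000, 2)` gives `[1000, 2000]`, so an eval that fails every attempt adds about 3 seconds of waiting on top of its run time.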

Trial Optimization

export default defineConfig({
  trials: 3,              // Run each test 3 times
  trialPassThreshold: 0.66,  // Pass if 2/3 succeed (2/3 ≈ 0.667, which is below 0.67)
});

// Or per eval
{
  "id": "flaky-test",
  "trials": { "count": 5, "passThreshold": 0.8 },
  // ...
}

Tip: Use trials for non-deterministic agent behavior, but avoid over-reliance.
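
The threshold arithmetic is worth a quick check. The sketch below assumes the pass rate is compared to the threshold with `>=`, which the source does not confirm; `trialsPass` is a hypothetical helper, not a vibe-check export.

```typescript
// True when the share of passing trials meets the threshold (inclusive).
function trialsPass(outcomes: boolean[], passThreshold: number): boolean {
  if (outcomes.length === 0) return false;
  const passRate = outcomes.filter(Boolean).length / outcomes.length;
  return passRate >= passThreshold;
}
```

Under that assumption, 4 of 5 passing trials meets a threshold of 0.8 exactly. Thresholds written as rounded decimals need more care: 2 of 3 is about 0.667, which meets 0.66 but falls just short of 0.67.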

Workspace Management

By default, vibe-check creates temporary workspaces and cleans them up after each eval. Use preserveWorkspaces: true for debugging:

export default defineConfig({
  preserveWorkspaces: true, // Keep workspaces for inspection
  // ...
});

Workspace Hooks

For full control over workspace lifecycle, use createWorkspace and cleanupWorkspace hooks:

import { defineConfig, type EvalWorkspace } from "@poofnew/vibe-check";
import * as fs from "fs/promises";
import * as path from "path";
import { execFile } from "child_process";
import { promisify } from "util";

const execFileAsync = promisify(execFile);

export default defineConfig({
  createWorkspace: async (): Promise<EvalWorkspace> => {
    const id = `ws-${Date.now()}-${Math.random().toString(36).slice(2)}`;
    const wsPath = path.join(process.cwd(), "__evals__/results/workspaces", id);

    // Copy your template (including node_modules for fast setup)
    await fs.cp("./template", wsPath, { recursive: true });

    // Optional: install dependencies if not included in template
    // await execFileAsync('npm', ['install'], { cwd: wsPath });

    return { id, path: wsPath };
  },

  cleanupWorkspace: async (workspace: EvalWorkspace): Promise<void> => {
    await fs.rm(workspace.path, { recursive: true, force: true });
  },
  // ...
});

Benefits of custom workspace hooks:

Use any package manager (npm, yarn, pnpm, bun)
Pre-install dependencies in template for faster workspace setup
Custom setup logic per workspace
Full control over cleanup behavior

Troubleshooting

Common Issues

"Cannot find config file"

# Ensure config exists
ls vibe-check.config.ts

# Or specify path
vibe-check run --config ./path/to/config.ts

"No eval cases found"

// Check testDir and testMatch in config
export default defineConfig({
  testDir: "./__evals__", // Must exist
  testMatch: ["**/*.eval.json"], // Must match file names
});

# List all detected evals
vibe-check list

"Agent timeout exceeded"

// Increase timeout
export default defineConfig({
  timeout: 300000,  // 5 minutes
});

// Or per eval
{
  "timeout": 600000  // 10 minutes for slow tests
}

"Module not found" errors with Claude SDK

# Install peer dependencies
bun add @anthropic-ai/sdk @anthropic-ai/claude-agent-sdk

# Verify installation
bun pm ls | grep anthropic

Test failures in CI/CD

// Reduce concurrency for stability
export default defineConfig({
  parallel: true,
  maxConcurrency: 2, // Lower for CI environments
  maxRetries: 3, // More retries for flaky CI networks
});

Out of memory errors

# Reduce concurrency
vibe-check run --config config-with-lower-concurrency.ts

# Or run tests in batches
vibe-check run --category tool
vibe-check run --category code-gen

Debug Mode

# Enable verbose output
vibe-check run -v

# Preserve workspaces for inspection
vibe-check run --config config-with-preserve-workspaces.ts

Getting Help

Check examples for working configurations
Search GitHub Issues
Join our Discord
Post on X

FAQ

General

Q: What agent frameworks does vibe-check support?

A: Any agent that can be wrapped in an async (prompt, context) => AgentResult function. vibe-check has built-in support for the Claude SDK (TypeScript, and Python via an adapter), and it also works with LangChain, custom agents, or any other LLM framework.

Q: Can I use this with other LLMs (OpenAI, Gemini, etc.)?

A: Yes! The framework is LLM-agnostic. Just implement the agent function to call your preferred LLM.

Q: Do I need Bun or can I use Node/npm?

A: While optimized for Bun, vibe-check works with Node.js 18+ and npm/pnpm. Bun is recommended for best performance.

Q: Can I use Python agents with vibe-check?

A: Yes! Use the PythonAgentAdapter from @poofnew/vibe-check/adapters. It spawns Python scripts as subprocesses and communicates via JSON over stdin/stdout. See the Python Agent SDK Integration example.

Configuration

Q: How do I test multi-file code generation?

A: Use the code-gen category with multiple targetFiles:

{
  "category": "code-gen",
  "targetFiles": ["src/index.ts", "src/utils.ts", "test/index.test.ts"],
  "expectedPatterns": [...],
  "judges": ["file-existence", "pattern-match"]
}

Q: Can I use custom validation logic?

A: Yes! Create custom judges extending BaseJudge. See Custom Judges.

Q: How do I handle authentication/secrets in tests?

A: Use environment variables:

export default defineConfig({
  agent: async (prompt, context) => {
    const apiKey = process.env.ANTHROPIC_API_KEY;
    // ...
  },
});

Learning System

Q: Does the learning system modify my prompts automatically?

A: No! It generates rule suggestions that require human review (unless autoApprove: true). You control what gets integrated.

Q: How many failures do I need to generate useful rules?

A: Start with 10+ failures. The system works best with 20-50 failures showing clear patterns.

Q: Can I use production logs for learning?

A: Yes! Export failures to JSONL format and use --source jsonl.

Performance

Q: How fast does vibe-check run?

A: Depends on agent speed and concurrency. With maxConcurrency: 5 and Claude SDK, expect ~10-20 evals/minute.

Q: Can I run tests in CI/CD?

A: Yes! Use exit codes for CI integration:

vibe-check run || exit 1  # Fails CI if tests fail

Q: How do I speed up slow test suites?

A: See Performance Tips. Key strategies: increase concurrency, use selective runs, optimize timeouts.

Debugging

Q: Where are test artifacts stored?

A: By default in __evals__/results/. Workspaces are temporary unless preserveWorkspaces: true.

Q: How do I debug a single failing test?

A: Run with verbose mode and workspace preservation:

vibe-check run --id failing-test -v

Set preserveWorkspaces: true to inspect the working directory.

Q: Why are my tests flaky?

A: LLMs are non-deterministic. Use trials with pass thresholds:

{ trials: 3, trialPassThreshold: 0.66 } // 2 of 3 trials must pass

Development

Prerequisites

Bun >= 1.0

Setup

# Clone the repository
git clone https://github.com/poofdotnew/vibe-check.git
cd vibe-check

# Install dependencies
bun install

# Run tests
bun test

# Build
bun run build

# Type check
bun run typecheck

Project Structure

src/
├── bin/                  # CLI entry points
│   ├── vibe-check.ts      # Main executable
│   └── cli.ts            # CLI commands
├── config/               # Configuration
│   ├── types.ts          # Type definitions
│   ├── schemas.ts        # Zod validation schemas
│   └── config-loader.ts  # Config file loading
├── harness/              # Test execution
│   ├── test-harness.ts   # Main execution engine
│   └── workspace-manager.ts
├── judges/               # Evaluation judges
│   ├── judge-interface.ts
│   ├── judge-registry.ts
│   └── builtin/          # Built-in judges
├── runner/               # Test orchestration
│   └── eval-runner.ts
├── learning/             # Learning system
└── utils/                # Utilities

CI/CD Integration

GitHub Actions

name: Eval Tests

on: [push, pull_request]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - uses: oven-sh/setup-bun@v1
        with:
          bun-version: latest

      - name: Install dependencies
        run: bun install

      - name: Run vibe-check
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
        run: bun run vibe-check run

GitLab CI

test:
  image: oven/bun:latest
  script:
    - bun install
    - bun run vibe-check run
  variables:
    ANTHROPIC_API_KEY: $ANTHROPIC_API_KEY

CircleCI

version: 2.1

jobs:
  test:
    docker:
      - image: oven/bun:latest
    steps:
      - checkout
      - run: bun install
      - run: bun run vibe-check run

Tips for CI

Use maxConcurrency: 2-3 for stable CI runs
Set appropriate timeouts for CI environment
Cache dependencies for faster runs
Store API keys in secrets/environment variables
Consider running critical tests only on PRs, full suite on main
