probeai
v0.2.2
CLI tool for testing and evaluating AI coding agents
ProbeAI
Test and evaluate AI coding agents with YAML scenarios.
ProbeAI runs your AI agent, captures its output, and scores it using rule-based checks and LLM judges. Get a pass/fail verdict with detailed reports.
Install
npm install -g probeai
Or run directly:
npx probeai run my-scenario.yaml
Quick Start
1. Create a scenario file (test-my-agent.yaml):
id: hello-test
name: "Hello World Test"
description: "Check that my agent can echo hello"
agent:
  type: command
  command: "echo 'Hello from agent'"
steps:
  - action: check_output
    expect: "Hello"
evaluate:
  method: rules
  passThreshold: 100
  rules:
    - type: contains
      target: stdout
      value: "Hello from agent"
      weight: 1
    - type: exit_code
      target: exit
      value: "0"
      weight: 1
2. Run it:
probeai run test-my-agent.yaml
3. See results:
Running 1 scenario(s)...
[hello-test] Hello World Test
PASS score=100/100
All 1 scenario(s) passed!
Usage
# Run one or more scenarios
probeai run scenario.yaml
probeai run tests/*.yaml
# Verbose output (shows stdout, stderr, timing)
probeai run scenario.yaml -v
# Generate markdown report
probeai run scenario.yaml --md
# Custom output directory
probeai run scenario.yaml -o ./my-results
# Validate scenario files without running
probeai validate scenario.yaml
Scenario Format
A scenario is a YAML file with 4 sections:
agent
What to run. Currently only the command type is supported; it runs a shell command.
agent:
  type: command
  command: "my-agent --task 'do something'"
  env:
    API_KEY: "test-key"
steps
Actions to perform during the run.
steps:
  - action: send          # Send input to stdin
    input: "hello"
  - action: wait          # Wait for N ms
    duration: 2000
  - action: check_output  # Mark output for evaluation
    expect: "success"
  - action: check_file    # Mark file for evaluation
    path: "./output.txt"
evaluate
How to score the result. Three methods:
Rules only (deterministic checks):
evaluate:
  method: rules
  passThreshold: 80
  rules:
    - type: contains     # stdout/stderr contains string
      target: stdout
      value: "success"
    - type: regex        # regex match (case-insensitive)
      target: stdout
      value: "(ok|done|success)"
    - type: exit_code    # process exit code
      target: exit
      value: "0"
    - type: file_exists  # file was created
      target: file
      value: "./output.txt"
    - type: json_match   # JSON key-value match
      target: stdout
      value: '{"status":"ok"}'
LLM only (Ollama judges the output):
evaluate:
  method: llm
  model: "qwen2.5-coder:14b"
  rubric: |
    Did the agent complete the task correctly?
    Score 0-100 based on correctness and completeness.
Hybrid (average of the rules score and the LLM score):
evaluate:
  method: hybrid
  passThreshold: 60
  model: "qwen2.5-coder:14b"
  rubric: "Evaluate the output quality."
  rules:
    - type: exit_code
      target: exit
      value: "0"
Other options
id: unique-id            # Required
name: "Human readable"   # Required
description: "What this tests"
timeout: 120             # Seconds (default: 120)
Templates
Starter templates in templates/:
- cli-smoke.yaml — Quick check that a CLI runs without crashing
- agent-task.yaml — Full AI agent task with hybrid evaluation
- api-health.yaml — HTTP endpoint health check
Copy and edit:
cp node_modules/probeai/templates/cli-smoke.yaml my-test.yaml
Reports
ProbeAI generates JSON reports by default. Add --md for markdown.
probeai run tests/*.yaml --md -o ./reports
Reports include:
- Pass/fail per scenario
- Score breakdown (rule results, LLM reasoning)
- Execution duration
- Overall summary
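The README does not spell out how rule results, weights, and passThreshold combine into the score shown in a report. The following is a minimal sketch of one plausible scheme, not ProbeAI's actual implementation. It assumes the rule score is the weighted share of passing rules on a 0-100 scale, that a missing weight counts as 1, and that hybrid mode takes the plain average of the rule score and the LLM score. The function names (checkRule, scoreRules, hybridVerdict) and the shape of the run object are hypothetical.

```javascript
// Hypothetical sketch: not ProbeAI's actual implementation.
// `run` is an assumed shape for a captured agent run: { stdout, stderr, exitCode }.

// Evaluate a single rule against the captured run.
function checkRule(rule, run) {
  const text = rule.target === "stderr" ? run.stderr : run.stdout;
  switch (rule.type) {
    case "contains":
      return text.includes(rule.value);
    case "regex":
      // The README describes regex rules as case-insensitive.
      return new RegExp(rule.value, "i").test(text);
    case "exit_code":
      // Rule values are written as strings in the YAML examples.
      return String(run.exitCode) === String(rule.value);
    default:
      throw new Error(`Unsupported rule type in this sketch: ${rule.type}`);
  }
}

// Weighted share of passing rules, scaled to 0-100 (assumed formula).
function scoreRules(rules, run) {
  let earned = 0;
  let total = 0;
  const results = rules.map((rule) => {
    const weight = rule.weight ?? 1; // assumption: a missing weight counts as 1
    const passed = checkRule(rule, run);
    total += weight;
    if (passed) earned += weight;
    return { type: rule.type, passed, weight };
  });
  const score = total === 0 ? 0 : Math.round((earned / total) * 100);
  return { score, results };
}

// Hybrid mode: plain average of the rule score and the LLM score (assumed).
function hybridVerdict(ruleScore, llmScore, passThreshold) {
  const score = Math.round((ruleScore + llmScore) / 2);
  return { score, passed: score >= passThreshold };
}

// Usage with the rules from the Quick Start scenario.
const run = { stdout: "Hello from agent\n", stderr: "", exitCode: 0 };
const rules = [
  { type: "contains", target: "stdout", value: "Hello from agent", weight: 1 },
  { type: "exit_code", target: "exit", value: "0", weight: 1 },
];
const { score } = scoreRules(rules, run);
console.log(`score=${score}/100`, score >= 100 ? "PASS" : "FAIL");
```

Under these assumptions, a passThreshold of 100 means every rule must pass, while a lower threshold lets heavier rules outvote lighter ones.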
GitHub Actions
Run ProbeAI automatically on every PR and push.
1. Add scenarios to your repo (e.g. tests/probeai/build-check.yaml):
id: build-check
name: "Build Check"
description: "Verify the project compiles"
agent:
  type: command
  command: "npx tsc --noEmit 2>&1"
steps:
  - action: check_output
    expect: ""
evaluate:
  method: rules
  passThreshold: 100
  rules:
    - type: exit_code
      target: exit
      value: "0"
      weight: 1
More scenario ideas:
# Lint check
id: lint-check
name: "Lint Check"
agent:
  type: command
  command: "npx biome check src/ 2>&1"
steps:
  - action: check_output
    expect: ""
evaluate:
  method: rules
  passThreshold: 100
  rules:
    - type: exit_code
      target: exit
      value: "0"
      weight: 1
# Test check
id: test-check
name: "Test Check"
agent:
  type: command
  command: "npm test 2>&1"
steps:
  - action: check_output
    expect: ""
evaluate:
  method: rules
  passThreshold: 100
  rules:
    - type: exit_code
      target: exit
      value: "0"
      weight: 1
2. Create workflow (.github/workflows/probeai.yml):
name: ProbeAI
on:
  pull_request:
  push:
jobs:
  verify:
    name: ProbeAI Verification
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm install
      - name: Run ProbeAI scenarios
        run: npx probeai run tests/probeai/*.yaml -v
3. Push and check: ProbeAI results appear as a GitHub check on your PRs.
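The build, lint, and test scenarios in step 1 differ only in id, name, and command. As a rough sketch, a small script could generate their YAML text from a list instead of maintaining three near-identical files by hand. The helper name commandCheckScenario is hypothetical and not part of ProbeAI; the nesting of the generated YAML follows the scenario examples in this README.

```javascript
// Hypothetical helper: not part of ProbeAI. Builds the YAML text for a
// scenario that passes only when the command exits with code 0.
function commandCheckScenario({ id, name, command }) {
  return [
    `id: ${id}`,
    // JSON.stringify yields a double-quoted string, which YAML also accepts.
    `name: ${JSON.stringify(name)}`,
    "agent:",
    "  type: command",
    `  command: ${JSON.stringify(command)}`,
    "steps:",
    "  - action: check_output",
    '    expect: ""',
    "evaluate:",
    "  method: rules",
    "  passThreshold: 100",
    "  rules:",
    "    - type: exit_code",
    "      target: exit",
    '      value: "0"',
    "      weight: 1",
    "",
  ].join("\n");
}

// The three checks from step 1.
const checks = [
  { id: "build-check", name: "Build Check", command: "npx tsc --noEmit 2>&1" },
  { id: "lint-check", name: "Lint Check", command: "npx biome check src/ 2>&1" },
  { id: "test-check", name: "Test Check", command: "npm test 2>&1" },
];

// Print each scenario preceded by the file it could be saved to.
for (const check of checks) {
  console.log(`# tests/probeai/${check.id}.yaml`);
  console.log(commandCheckScenario(check));
}
```

Running probeai validate on the generated files is a reasonable way to confirm they match what the tool expects.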
Programmatic Usage
Use ProbeAI as a library in your Node.js code:
// Top-level await requires an ES module (.mjs or "type": "module").
import { probe, loadScenarios } from "probeai";
// Load the scenario files, then run them all.
const scenarios = loadScenarios(["tests/build.yaml", "tests/lint.yaml"]);
const results = await probe(scenarios, {
  outputDir: "./results",
  markdown: false,
  verbose: false,
});
// Exit non-zero if any scenario failed, so CI can detect the failure.
const failed = results.filter((r) => !r.evaluation.passed);
if (failed.length > 0) {
  console.log(`${failed.length} scenario(s) failed`);
  process.exit(1);
}
Requirements
- Node.js 20+
- For LLM evaluation: Ollama running locally
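Since llm and hybrid scenarios depend on a local Ollama server, it can help to check for it before a run. The sketch below is one possible preflight check, not something ProbeAI provides. It assumes Ollama listens on its default address (http://localhost:11434) and uses Ollama's /api/tags endpoint, which lists locally available models. The function name ollamaHasModel and its options are hypothetical.

```javascript
// Hypothetical preflight helper: not part of ProbeAI.
// Resolves to true only if the Ollama server answers and lists the model.
async function ollamaHasModel(
  model,
  { baseUrl = "http://localhost:11434", fetchImpl = globalThis.fetch } = {}
) {
  try {
    // GET /api/tags returns { models: [{ name: "..." }, ...] }.
    const res = await fetchImpl(`${baseUrl}/api/tags`);
    if (!res.ok) return false;
    const body = await res.json();
    return (body.models ?? []).some((m) => m.name === model);
  } catch {
    // Server not reachable, or fetch unavailable in this runtime.
    return false;
  }
}

// Usage: check for the model named in the scenario examples.
ollamaHasModel("qwen2.5-coder:14b").then((ok) => {
  console.log(ok ? "Ollama is ready" : "Ollama or the model is unavailable");
});
```

The fetchImpl option exists only so the check can be exercised without a running server.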
License
MIT
