stepproof

v0.5.0

Published

3 months ago

Regression testing for multi-step AI workflows. Not observability — a CI gate.


bilkobibitkov

ai testing regression agents llm cli openai anthropic ci

stepproof

Regression testing for multi-step AI workflows. Not observability.

You upgraded to gpt-4o-mini. Your LangSmith traces look fine. Three days later a customer reports your extraction step stopped working. You found out from a Slack message, not a test.

stepproof is what you run before you deploy.

npm install -g stepproof

30-second quickstart

Write a scenario:

# classify.yaml
name: "Intent classification"
iterations: 10

steps:
  - id: classify
    provider: anthropic
    model: claude-sonnet-4-6
    prompt: "Classify the intent of this message: {{input}}"
    variables:
      input: "I need to cancel my subscription"
    min_pass_rate: 0.90
    assertions:
      - type: contains
        value: "cancel"
      - type: json_schema
        schema: ./schemas/intent.json

  - id: respond
    provider: openai
    model: gpt-4o
    prompt: "Given intent '{{classify.output}}', write a helpful reply to: {{input}}"
    min_pass_rate: 0.80
    assertions:
      - type: llm_judge
        prompt: "Is this response helpful and on-topic? Answer yes/no."
        pass_on: "yes"
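
The `{{input}}` and `{{classify.output}}` placeholders are what chain one step into the next. As a rough illustration of the idea (not stepproof's actual implementation), substitution could look like the sketch below. `renderPrompt`, the `Vars` type, and the `cancel_subscription` classifier output are all hypothetical names chosen for the example.

```typescript
// Hypothetical sketch of placeholder substitution; stepproof's internals may differ.
type Vars = Record<string, string>;

function renderPrompt(template: string, vars: Vars): string {
  // Match {{name}} or {{step.output}}, allowing whitespace inside the braces.
  return template.replace(/\{\{\s*([\w.]+)\s*\}\}/g, (match: string, key: string) => {
    const value = vars[key];
    if (value === undefined) {
      // Fail loudly rather than sending a prompt with an unfilled placeholder.
      throw new Error(`No value for placeholder ${match}`);
    }
    return value;
  });
}

// The second step's prompt, filled with an assumed output from the first step.
const prompt = renderPrompt(
  "Given intent '{{classify.output}}', write a helpful reply to: {{input}}",
  {
    "classify.output": "cancel_subscription", // assumed classifier output
    input: "I need to cancel my subscription",
  },
);
```

Throwing on an unknown placeholder is a design choice in this sketch: a typo in a variable name surfaces as an error instead of a silently malformed prompt.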

Run it:

stepproof run classify.yaml

Output:

stepproof v0.2.0 — running "Intent classification" (10 iterations)

  step: classify
    ✓ 9/10 passed (90.0%) — threshold: 90% ✓

  step: respond
    ✓ 8/10 passed (80.0%) — threshold: 80% ✓

All steps passed. Exit 0.

Now break it: swap in a cheaper model, and the pass rate drops below the threshold. The run fails:

  step: classify
    ✗ 5/10 passed (50.0%) — threshold: 90% ✗

1 step failed. Exit 1.
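
The pass/fail lines above come down to simple arithmetic: passed iterations divided by total iterations, compared against `min_pass_rate`. The sketch below reproduces that logic under one assumption drawn from the sample output, namely that a rate equal to the threshold passes (9/10 against 90% is marked ✓). `StepResult`, `stepPasses`, and `exitCode` are hypothetical names, not part of stepproof's API.

```typescript
// Hypothetical model of the gate; field names are illustrative only.
interface StepResult {
  id: string;
  passed: number;      // iterations whose assertions all held
  iterations: number;  // total iterations run
  minPassRate: number; // threshold from the scenario, e.g. 0.9
}

// A step passes when its observed pass rate meets or exceeds the threshold.
function stepPasses(step: StepResult): boolean {
  return step.passed / step.iterations >= step.minPassRate;
}

// Exit 0 only if every step clears its threshold; otherwise exit 1.
function exitCode(steps: StepResult[]): number {
  return steps.every(stepPasses) ? 0 : 1;
}

// Numbers taken from the two sample runs above.
const healthy: StepResult[] = [
  { id: "classify", passed: 9, iterations: 10, minPassRate: 0.9 },
  { id: "respond", passed: 8, iterations: 10, minPassRate: 0.8 },
];
const regressed: StepResult[] = [
  { id: "classify", passed: 5, iterations: 10, minPassRate: 0.9 },
];
```

Here `exitCode(healthy)` is 0 and `exitCode(regressed)` is 1, matching the two runs shown.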

Commands

`stepproof run <scenario>`

Run a scenario file or directory of scenarios.

stepproof run classify.yaml
stepproof run scenarios/
stepproof run scenarios/ --format sarif --output results.sarif
stepproof run scenarios/ --format junit --output results.xml

Flags:

--format <format> — output format: sarif or junit. Omit the flag for human-readable terminal output (the default).
--output <file> — write output to file instead of stdout

`stepproof init [dir]`

Scaffold a starter scenario in the target directory. Defaults to ./scenarios/.

stepproof init
# Creates: ./scenarios/first-test.yaml

stepproof init my-tests
# Creates: ./my-tests/first-test.yaml

The generated first-test.yaml is a working example you can edit and run immediately.

Environment Variables

| Variable | Required | Purpose |
|----------|----------|---------|
| ANTHROPIC_API_KEY | For Anthropic steps | Authenticates calls to Claude models |
| OPENAI_API_KEY | For OpenAI steps | Authenticates calls to GPT models |

Only the keys for the providers you use in your scenarios are required.

CI integration

# .github/workflows/ai-regression.yml
name: AI regression tests
on: [push, pull_request]

jobs:
  stepproof:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm install -g stepproof
      - run: stepproof run scenarios/classify.yaml
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}

Exit code 1 on regression. PR blocked. Done.

Assertions

| Type | What it checks |
|------|---------------|
| contains | Output includes this string |
| not_contains | Output does not include this string |
| regex | Output matches this pattern |
| json_schema | Output is valid JSON matching this schema |
| llm_judge | A second LLM call evaluates the output (boolean verdict) |
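
To make the three string-based checks in the table concrete, here is a minimal sketch of how they might be evaluated against a step's output. It is an illustration only. It assumes that `not_contains` and `regex` take a `value` field like `contains` does in the quickstart, that matching is case-sensitive, and that an iteration passes only when every assertion holds; none of this is confirmed by the README. `json_schema` and `llm_judge` are left out because they need a schema validator and a model call.

```typescript
// Hypothetical assertion shapes; only `contains` with `value` appears in the quickstart.
type TextAssertion =
  | { type: "contains"; value: string }
  | { type: "not_contains"; value: string }
  | { type: "regex"; value: string };

// Evaluate one assertion against the model output.
function check(output: string, assertion: TextAssertion): boolean {
  switch (assertion.type) {
    case "contains":
      return output.includes(assertion.value);
    case "not_contains":
      return !output.includes(assertion.value);
    case "regex":
      return new RegExp(assertion.value).test(output);
  }
}

// Assumption: an iteration counts as passed only if all assertions hold.
function iterationPasses(output: string, assertions: TextAssertion[]): boolean {
  return assertions.every((a) => check(output, a));
}
```

For example, `iterationPasses('{"intent": "cancel"}', [{ type: "contains", value: "cancel" }, { type: "not_contains", value: "refund" }])` returns `true`.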

Structured reports (v0.2.0)

stepproof outputs machine-readable SARIF 2.1.0 and JUnit XML for CI pipeline integration.

SARIF — GitHub Advanced Security / GitLab / Azure DevOps

# Write SARIF to stdout
stepproof run classify.yaml --format sarif

# Write SARIF to file
stepproof run classify.yaml --format sarif --output results.sarif

Integrate with GitHub Advanced Security:

# .github/workflows/ai-regression.yml
- name: Run stepproof
  run: stepproof run scenarios/ --format sarif --output results.sarif

- name: Upload to GitHub Security tab
  uses: github/codeql-action/upload-sarif@v3
  with:
    sarif_file: results.sarif
  if: always()
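
For readers unfamiliar with SARIF, the sketch below builds a minimal SARIF 2.1.0 log from failed steps. The top-level shape (`version`, `runs`, `tool.driver.name`, `results[].message.text`) follows the SARIF 2.1.0 format, but the mapping from steps to results, the `step/<id>` rule IDs, and the `StepOutcome` type are assumptions made for illustration. The file stepproof actually emits may be structured differently.

```typescript
// Hypothetical input shape; not stepproof's internal type.
interface StepOutcome {
  id: string;
  passed: number;
  iterations: number;
  minPassRate: number;
}

// Build a minimal SARIF 2.1.0 log with one result per step below its threshold.
function toSarif(scenarioFile: string, steps: StepOutcome[]) {
  const failures = steps.filter((s) => s.passed / s.iterations < s.minPassRate);
  return {
    version: "2.1.0",
    runs: [
      {
        tool: { driver: { name: "stepproof" } },
        results: failures.map((s) => ({
          ruleId: `step/${s.id}`, // assumed naming scheme
          level: "error",
          message: {
            text: `${s.passed}/${s.iterations} passed, below threshold ${s.minPassRate}`,
          },
          // Point the finding at the scenario file that defines the step.
          locations: [{ physicalLocation: { artifactLocation: { uri: scenarioFile } } }],
        })),
      },
    ],
  };
}

const sarifLog = toSarif("scenarios/classify.yaml", [
  { id: "classify", passed: 5, iterations: 10, minPassRate: 0.9 },
  { id: "respond", passed: 8, iterations: 10, minPassRate: 0.8 },
]);
```

In this example only the `classify` step produces a result, since `respond` meets its threshold.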

JUnit XML — Jenkins / CircleCI / TeamCity

stepproof run classify.yaml --format junit
stepproof run classify.yaml --format junit --output results.xml

# .github/workflows/ai-regression.yml
- name: Run stepproof
  run: stepproof run scenarios/ --format junit --output test-results.xml

- name: Publish test results
  uses: actions/upload-artifact@v4
  with:
    name: test-results
    path: test-results.xml
  if: always()

Default output (no --format flag) is unchanged — human-readable terminal output.

Migration note (v0.2.x → v0.3.0): --report still works but is deprecated and will print a warning. Switch to --format at your next convenience. --report will be removed at v1.0.0.

How this is different from LangSmith / Braintrust / Langfuse

| | stepproof | LangSmith / Braintrust |
|--|-----------|------------------------|
| When it runs | Before deploy (CI) | After deploy (production) |
| What it answers | "Is my pipeline still correct?" | "What did my pipeline do?" |
| Output | Pass/fail with exit code | Traces and dashboards |
| Use case | Regression testing | Observability |

They tell you what happened. We tell you whether to deploy.

These are different jobs. Use both.

Troubleshooting

`Error: "scenarios/" is a directory`

stepproof run ./scenarios/first-test.yaml   # ← run a specific file
stepproof run ./scenarios/                  # ← or run the whole dir (note trailing slash)

`Error parsing scenario: ...`

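Your YAML has a syntax error. Common culprits: inconsistent indentation, unquoted {{vars}}, or a missing steps: key. Running node -e "require('fs').readFileSync('./your.yaml')" confirms only that the file exists and is readable; it does not parse the YAML, so check the syntax itself with a YAML linter or parser.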

API errors (`401 Unauthorized`, `403 Forbidden`)

Set the API key for whichever provider your scenario uses:

export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...

Only the keys for providers you use in your scenarios are required.

Steps failing when they should pass

Check min_pass_rate in your scenario. The default is not 100%, and a threshold such as min_pass_rate: 0.90 still tolerates 1 failure in 10 iterations. If a step fails intermittently at that setting, lower the threshold or improve your prompt.
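
A small number of iterations makes thresholds noisy. The sketch below estimates how often a step clears its threshold, given its true per-iteration pass rate. It assumes iterations are independent (a plain binomial model) and that a rate equal to the threshold counts as a pass. `binomial` and `chanceOfPassing` are hypothetical helper names, not stepproof features.

```typescript
// Number of ways to choose k items from n, computed iteratively.
function binomial(n: number, k: number): number {
  let result = 1;
  for (let i = 1; i <= k; i++) {
    result = (result * (n - k + i)) / i;
  }
  return result;
}

// Probability that a run of `iterations` independent trials, each passing with
// probability `trueRate`, meets or exceeds `minPassRate`.
function chanceOfPassing(trueRate: number, iterations: number, minPassRate: number): number {
  let total = 0;
  for (let k = 0; k <= iterations; k++) {
    if (k / iterations >= minPassRate) {
      total +=
        binomial(iterations, k) *
        Math.pow(trueRate, k) *
        Math.pow(1 - trueRate, iterations - k);
    }
  }
  return total;
}

// A step that truly passes 95% of the time, run 10 times against a 0.90 threshold.
const odds = chanceOfPassing(0.95, 10, 0.9);
```

Under these assumptions `odds` is about 0.914: a step that is right 95% of the time still fails a 0.90 gate in roughly 1 run out of 12 when only 10 iterations are used.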

`--format must be "sarif" or "junit"`

Only sarif and junit are valid format values. For terminal output, omit the --format flag entirely.

Pro features blocked (SARIF / JUnit output)

SARIF and JUnit formats require a Team license. Set your key:

export PREFLIGHT_LICENSE_KEY=preflight_...
stepproof run scenarios/ --format sarif --output results.sarif

Get a license at the Preflight pricing page.

Scenarios

See /examples for copy-paste ready scenarios:

simple-chain.yaml — basic prompt → response → assertion
tool-calling.yaml — verify tool selection and output
multi-turn.yaml — conversation with memory, verify consistency

Roadmap

v0.2.0 (current): YAML scenarios, N iterations, 5 assertion types, exit code 1 on failure, OpenAI + Anthropic, SARIF 2.1.0 + JUnit XML reporters, stepproof init scaffolding
v0.3.0 (next): Baseline comparison (fail on regression from last run), GitHub Actions native action, provider comparison mode — run the same scenario against two models and diff the results
Cloud dashboard (month 3–6): Persistent history, trend charts, team workspaces — never in the CLI

Contributing

Issues and PRs welcome. See CONTRIBUTING.md for dev setup and guidelines. The tool is and will remain free. Cloud features are the business model, not the CLI.

Part of the Preflight suite

stepproof is one tool in the Preflight AI Agent DevOps suite — local-first CLIs covering the full lifecycle from pre-deploy validation to production observability:

| Tool | Purpose | Install |
|------|---------|---------|
| stepproof | Behavioral regression testing | npm install -g stepproof |
| agent-comply | EU AI Act compliance scanning | npm install -g agent-comply |
| agent-gate | Unified pre-deploy CI gate | npm install -g agent-gate |
| agent-shift | Config versioning + environment promotion | npm install -g agent-shift |
| agent-trace | Local observability — OTel traces in SQLite | npm install -g agent-trace |

Install the full suite:

npm install -g agent-gate stepproof agent-comply agent-shift agent-trace

stepproof — because "I checked manually before the deploy" is not a test.
