verdict-ci

v0.1.0

Published

18 days ago

PR-time verification that AI-written code actually matches its stated intent. The verification gatekeeper for the agentic coding era.

github-action code-review ai llm verification pull-request agentic ci

Verdict

Does the code actually do what the PR says it does?

Verdict is a verification gatekeeper for the agentic coding era. AI agents now write enormous amounts of code, and it usually looks right — but looking right and being faithful to the stated intent are different things. Verdict reads a pull request's declared intent (the issue, the PR description, a linked spec or doc) and compares it against the actual diff, then answers one question:

Does this change faithfully deliver what was claimed — or does it silently introduce a contradiction, quietly change scope, or claim something the code never implements?

It runs as a GitHub Action on every PR and as a CLI anywhere. It is not a style linter and not a general code reviewer. It hunts for one thing: the gap between claim and code.

✅ Verdict: PASS — fidelity score 92/100
The diff adds rate limiting to the login route exactly as described in Issue #42.

❌ Verdict: FAIL — fidelity score 28/100
The PR claims to "only add logging," but the diff also changes the default
session timeout from 30m to 24h — an unstated behavior change.

### Findings (1)
- 🛑 Unstated behavior change `silent-scope` · confidence 88%
  - PR body says "logging only," but src/auth/session.ts changes SESSION_TTL.
  - files: `src/auth/session.ts`

Why this exists

The whole industry raced toward generating code. The gap opened on the verifying side. The more AI output is produced, the more the question "is this correct, consistent, faithful to its source?" matters — and that's exactly the question humans can no longer keep up with at PR volume.

Verdict catches the failure mode that ordinary CI and linters miss: code that compiles, passes tests, and reads cleanly, but doesn't do what it said it would.

What it detects

| Category | What it means |
| --- | --- |
| intent-mismatch | The diff does something different from, or less than, what was claimed. |
| silent-scope | The diff changes behavior the intent never mentioned (hidden side effects). |
| unbacked-claim | The PR/docs claim something the diff does not actually implement. |
| doc-drift | Code and its referenced docs/spec now contradict each other. |
| missing-impl | A stated requirement has no corresponding code change. |

A clean, faithful change returns zero findings.
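As a rough illustration of how these categories might appear in the findings array, here is a minimal TypeScript sketch. The field names (`category`, `severity`, `confidence`, `summary`, `evidence`, `files`) are assumptions inferred from the sample comment above, not Verdict's published schema, and the `block` severity on the example is a guess based on the stop-sign marker.

```typescript
// Hypothetical shape of a single finding. Field names are assumptions,
// not Verdict's documented output schema.
type Category =
  | "intent-mismatch"
  | "silent-scope"
  | "unbacked-claim"
  | "doc-drift"
  | "missing-impl";

type Severity = "block" | "warn" | "info";

interface Finding {
  category: Category;
  severity: Severity;
  confidence: number; // 0..1, so "confidence 88%" becomes 0.88
  summary: string;
  evidence: string;
  files: string[];
}

const CATEGORIES: readonly string[] = [
  "intent-mismatch",
  "silent-scope",
  "unbacked-claim",
  "doc-drift",
  "missing-impl",
];

// Narrow an untyped value (for example, a field of parsed JSON) to a known category.
function isCategory(value: unknown): value is Category {
  return typeof value === "string" && CATEGORIES.indexOf(value) !== -1;
}

// The FAIL example from the sample comment, expressed in the assumed shape.
const exampleFinding: Finding = {
  category: "silent-scope",
  severity: "block", // assumed from the stop-sign marker in the sample
  confidence: 0.88,
  summary: "Unstated behavior change",
  evidence:
    'PR body says "logging only," but src/auth/session.ts changes SESSION_TTL.',
  files: ["src/auth/session.ts"],
};
```

A guard like `isCategory` is one way to reject model output that invents a category outside the five listed above.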

Quick start (under 5 minutes)

1. Add an API key secret

Verdict does not bundle a model; it talks to any OpenAI-compatible or Anthropic endpoint. Add your key as a repository secret named VERDICT_API_KEY.

2. Drop in the workflow

Create .github/workflows/verdict.yml:

```yaml
name: Verdict
on:
  pull_request:
    types: [opened, synchronize, reopened, edited]

permissions:
  contents: read
  pull-requests: write

jobs:
  verify:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: verdict-ci/verdict@v0
        with:
          api-key: ${{ secrets.VERDICT_API_KEY }}
          model: gpt-4o-mini
          fail-on: block
```

Pre-1.0 note: verdict-ci/verdict@v0 is the canonical action coordinate. Until the tagged release is published, pin a commit SHA of your fork, or run the action from a local checkout with uses: ./ (see this repo's own .github/workflows/verdict.yml).

3. Open a PR

Verdict reads the diff and the PR's intent, posts a single comment with its verdict, and (optionally) fails the check when it finds a blocking contradiction. That's it.

Use it locally (CLI)

```bash
# Until verdict-ci is published to npm, install from source:
#   git clone … && cd verdict && npm install && npm run build && npm link
# Once published this becomes: npm install -g verdict-ci
export VERDICT_API_KEY=sk-...

# Compare your working changes against a stated intent
git diff main... | verdict --intent "Add rate limiting to the login route"

# Or verify against an authoritative spec file
git diff main... | verdict --spec docs/spec.md --fail-on warn

# Emit a shareable badge (intent inline so the example is self-contained)
git diff main... | verdict --intent "Add rate limiting" --badge verdict-badge.json
```

The CLI exits non-zero when the verdict is fail, so it slots into any CI.

Shareable badge

--badge writes a shields.io endpoint JSON file. Commit it somewhere with a raw URL (a gh-pages branch or a gist), then point a shields endpoint badge at it:

![verdict](https://img.shields.io/endpoint?url=https://raw.githubusercontent.com/<you>/<repo>/gh-pages/verdict-badge.json)

The badge color follows the decision: green (pass), yellow (warn), red (fail).
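For reference, a shields.io endpoint badge is a small JSON object with `schemaVersion`, `label`, `message`, and an optional `color`. The sketch below shows one way the decision-to-color mapping could be produced. The `badgeFor` helper, the `verdict` label, and the message wording are hypothetical; the file Verdict actually writes may differ.

```typescript
type Decision = "pass" | "warn" | "fail";

// Fields defined by the shields.io endpoint badge schema.
interface EndpointBadge {
  schemaVersion: 1;
  label: string;
  message: string;
  color: string;
}

// Color mapping as described above: green (pass), yellow (warn), red (fail).
const BADGE_COLORS: Record<Decision, string> = {
  pass: "green",
  warn: "yellow",
  fail: "red",
};

// Hypothetical helper: the label and message format are assumptions.
function badgeFor(decision: Decision, score: number): EndpointBadge {
  return {
    schemaVersion: 1,
    label: "verdict",
    message: `${decision} ${score}/100`,
    color: BADGE_COLORS[decision],
  };
}

// Serialized form, ready to be written to a file such as verdict-badge.json.
const badgeJson: string = JSON.stringify(badgeFor("pass", 92), null, 2);
```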

Configuration

Action inputs

| Input | Default | Description |
| --- | --- | --- |
| api-key | none (required) | LLM API key. Store as a secret. |
| provider | openai | openai (OpenAI-compatible) or anthropic. |
| model | gpt-4o-mini | Model id used for verification. |
| base-url | none | Override API base URL (OpenRouter, local models, etc.). |
| spec-paths | none | Comma-separated spec/doc files to treat as authoritative intent. |
| fail-on | block | Minimum severity that fails the check: block, warn, info. |
| min-confidence | 0.6 | Minimum confidence (0–1) for a finding to count. |
| max-diff-chars | 60000 | Diff size budget sent to the model. |
| comment | true | Post the verdict as a PR comment. |
| github-token | ${{ github.token }} | Token used to read the PR diff and post the comment (needs pull-requests: write). |

Outputs

decision (pass/warn/fail), score (0–100), findings (JSON array).

Environment variables (CLI and Action)

VERDICT_API_KEY, VERDICT_PROVIDER, VERDICT_BASE_URL, VERDICT_MODEL.
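A minimal sketch of how these variables could be resolved into a configuration object, using the defaults from the inputs table. The `resolveConfig` function and the `ResolvedConfig` shape are hypothetical. How Verdict itself prioritizes environment variables against action inputs or CLI flags is not stated here, so the sketch reads the environment only.

```typescript
type Provider = "openai" | "anthropic";

// Hypothetical config shape; Verdict's internal types may differ.
interface ResolvedConfig {
  apiKey: string;
  provider: Provider;
  model: string;
  baseUrl: string | undefined;
}

// Accepts any env-like map, so it can be called with process.env in Node.
type Env = Record<string, string | undefined>;

function resolveConfig(env: Env): ResolvedConfig {
  const apiKey = env["VERDICT_API_KEY"];
  if (!apiKey) {
    throw new Error("VERDICT_API_KEY is required");
  }

  // Documented default provider is openai.
  const rawProvider = env["VERDICT_PROVIDER"] || "openai";
  let provider: Provider;
  if (rawProvider === "openai" || rawProvider === "anthropic") {
    provider = rawProvider;
  } else {
    throw new Error(`Unknown provider: ${rawProvider}`);
  }

  return {
    apiKey,
    provider,
    model: env["VERDICT_MODEL"] || "gpt-4o-mini", // documented default model
    baseUrl: env["VERDICT_BASE_URL"] || undefined,
  };
}
```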

How it works

PR opened ─▶ fetch diff ─┐
                         ├─▶ build prompt ─▶ model ─▶ JSON verdict ─▶ decide ─▶ PR comment + check
intent (issue/body/spec)─┘

1. Gather intent: PR title, PR body, linked issues (fixes #42), and any spec/doc files you point it at. The most explicit source wins.
2. Parse the diff: a dependency-free unified-diff parser reads it, and the result is truncated to the max-diff-chars budget to cap cost.
3. Judge fidelity: a tightly scoped prompt asks the model for evidence-backed findings only, returned as strict JSON.
4. Decide: findings at or above the fail-on severity that also meet min-confidence flip the check to fail.
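The deterministic parts of the steps above (everything except the model call) can be sketched as follows. This is an illustration under stated assumptions, not Verdict's source: the function names are hypothetical, the truncation is a plain character cut, the confidence threshold is treated as inclusive, and returning `warn` for counted findings below the fail-on severity is a guess at how the middle decision arises.

```typescript
type Severity = "info" | "warn" | "block";
type Decision = "pass" | "warn" | "fail";

interface Finding {
  severity: Severity;
  confidence: number; // 0..1
}

// Step 1 (in part): collect issue numbers linked with GitHub closing
// keywords such as "fixes #42" or "Closes: #7".
function linkedIssues(text: string): number[] {
  const pattern = /\b(?:close[sd]?|fix(?:e[sd])?|resolve[sd]?):?\s+#(\d+)/gi;
  const found: number[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(text)) !== null) {
    found.push(Number(match[1]));
  }
  return found;
}

// Step 2 (in part): keep the diff within the size budget (default 60000 chars).
// Assumption: a simple prefix cut; a real parser might trim per file instead.
function truncateDiff(diff: string, maxChars: number = 60000): string {
  return diff.length <= maxChars ? diff : diff.slice(0, maxChars);
}

const SEVERITY_RANK: Record<Severity, number> = { info: 0, warn: 1, block: 2 };

// Step 4: turn findings into a decision using fail-on and min-confidence.
function decide(
  findings: Finding[],
  failOn: Severity = "block",
  minConfidence: number = 0.6
): Decision {
  // Findings under the confidence threshold do not count at all.
  const counted = findings.filter((f) => f.confidence >= minConfidence);
  const blocking = counted.some(
    (f) => SEVERITY_RANK[f.severity] >= SEVERITY_RANK[failOn]
  );
  if (blocking) return "fail";
  // Assumption: remaining counted findings downgrade the result to "warn".
  return counted.length > 0 ? "warn" : "pass";
}
```

With the defaults, a single `block` finding at confidence 0.88 yields `fail`, while the same finding at 0.5 is ignored and the result is `pass`.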

The model is treated as a commodity component: swap providers freely. The value is in the verification framing and the accumulating sense of what tends to go wrong — not in any one model.
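To make the provider swap concrete, here is a hedged sketch of how one prompt could be shaped into a request for either endpoint style. It builds the request without sending it. The endpoint paths and headers follow the public OpenAI Chat Completions and Anthropic Messages APIs; the `buildRequest` function, the `ModelOptions` shape, the `max_tokens` value, and the assumption that `base-url` already includes the `/v1` segment are illustrative, not taken from Verdict's code.

```typescript
type Provider = "openai" | "anthropic";

interface ModelOptions {
  provider: Provider;
  apiKey: string;
  model: string;
  baseUrl?: string; // assumed to include the version segment, e.g. ".../v1"
}

interface RequestSpec {
  url: string;
  headers: Record<string, string>;
  body: string;
}

const DEFAULT_BASE_URL: Record<Provider, string> = {
  openai: "https://api.openai.com/v1",
  anthropic: "https://api.anthropic.com/v1",
};

// Hypothetical helper: same prompt, provider-specific wire format.
function buildRequest(opts: ModelOptions, prompt: string): RequestSpec {
  const base = (opts.baseUrl || DEFAULT_BASE_URL[opts.provider]).replace(/\/+$/, "");
  const messages = [{ role: "user", content: prompt }];

  if (opts.provider === "anthropic") {
    return {
      url: `${base}/messages`,
      headers: {
        "content-type": "application/json",
        "x-api-key": opts.apiKey,
        "anthropic-version": "2023-06-01",
      },
      // The Messages API requires max_tokens; 2048 is an arbitrary choice here.
      body: JSON.stringify({ model: opts.model, max_tokens: 2048, messages }),
    };
  }

  return {
    url: `${base}/chat/completions`,
    headers: {
      "content-type": "application/json",
      authorization: `Bearer ${opts.apiKey}`,
    },
    body: JSON.stringify({ model: opts.model, messages }),
  };
}
```

Keeping the provider difference inside one small function like this is what lets the rest of the pipeline stay model-agnostic.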

Build from source

```bash
npm install
npm run typecheck      # type-check
npm run build          # compile the CLI to dist/
npm run build:action   # bundle the Action to dist-action/ (requires @vercel/ncc)
```

The Action runs dist-action/index.js. Commit that bundle (or build it in a release workflow) so the node20 runtime can execute it without an install step.

Roadmap

- Inline review comments anchored to specific diff lines.
- Public web scan: drop in a repo URL, get a shareable code/doc fidelity report + badge.
- Hosted tier: continuous monitoring, dashboards, and team integrations.

License

Apache-2.0.