vigiles

v5.1.0

Published

10 hours ago

Lint & test the harness your AI agent runs on — verify the references in your CLAUDE.md / AGENTS.md and test that your hooks and skills actually work.

Vulnerabilities: 0 High · 0 Medium · 0 Low

zernie

claude-code codex agents agentic ai llm harness hooks skills claude agents-md eval testing linter

Agent = Model + Harness. You'd never ship an app without a linter and a test suite — yet the harness steering your agent runs on vibes. vigiles[^name] is the deterministic layer for it, and does two independent things — adopt either, or both:

| | |
| --- | --- |
| 🔎 Lint | Every file path, script, code symbol, and linter rule your CLAUDE.md cites is checked against reality, so a renamed file or a disabled rule can't silently mislead the agent. |
| 🧪 Test | Hooks, skills, and subagents are code. vigiles tests that they do their job. Almost all of it is deterministic and needs no API key; the real-model evals run on your Claude subscription, not metered tokens. |

Pick the one that hurts today. Works with Claude Code and Codex (vigiles/codex), and you can teach it your own harness.

Quick start

Paste into Claude Code or Codex:

Set up vigiles in this repo with good defaults (lint + test, non-interactive).
Verify my CLAUDE.md / AGENTS.md references and show me what's stale, then write
and run a harness test for one of my hooks or skills. Ask me first before gating
it in CI, adding a real-model eval, or enforcing strictly (--strict).

Or do it yourself:

npx vigiles init   # sets up lint + test: spec + harness test + CI + plugin

It's interactive in a terminal and non-interactive for agents/CI (or with --yes), so "set up vigiles" from a Claude Code / Codex prompt Just Works. It also installs a model-invocable test-harness skill, so afterward you can tell your agent "test my skills" and it picks the tier and writes the test.

- Sets up both lint and test by default; scope with --lint / --test (one or both).
- Adds vigiles to your devDependencies.
- Installs the Claude Code plugin (skills + hooks) via the marketplace: globally, never vendored into your repo.
- Wires CI as a zernie/vigiles@v1 workflow (a composite over the same CLI):

- uses: actions/checkout@v4
- uses: zernie/vigiles@v1 # lints by default; posts a sticky PR comment + a `valid` output

Prefer to write tests yourself? They can be JS or TS (*.harness.{mjs,ts}) — run them with npx vigiles test.

① Lint — your CLAUDE.md lies to your agent

Your CLAUDE.md points the agent at src/auth/login.ts and tells it to run npm run check. But the file moved to src/auth/session.ts six commits ago, and the script was renamed. The agent trusts the stale claim and acts on fiction.

npx vigiles lint resolves every reference against reality:

CLAUDE.md:
  ✗ src/auth/login.ts — no such file (renamed or moved?)
  ✗ npm run check — not in package.json. Did you mean: "check:types"?
  ✓ @typescript-eslint/no-floating-promises — exists and enabled in eslint config

It checks file paths, scripts, and code symbols, plus linter rules across 7 catalogs (the rule must exist and be enabled). Start with one inline comment, no new files; step up to a typed .spec.ts (compiled to CLAUDE.md, compiler-grade) when you want it. Full guide →
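To make the idea of checking references against reality concrete, here is a minimal sketch. It is not how vigiles is implemented: `extractRefs`, `checkRefs`, the `Repo` shape, and the substring-based "did you mean" hint are all hypothetical, and the repo is an in-memory stand-in rather than a real file system.

```typescript
// Sketch only: hypothetical names, not the vigiles API or implementation.

type Finding = { ref: string; ok: boolean; note: string };

interface Repo {
  files: string[]; // paths that exist in the repo
  scripts: Record<string, string>; // the "scripts" map from package.json
}

// Pull `npm run <script>` mentions and path-like tokens out of instruction text.
function extractRefs(markdown: string): { paths: string[]; scripts: string[] } {
  const paths: string[] = [];
  const scripts: string[] = [];
  const scriptRe = /npm run ([\w:-]+)/g;
  const pathRe = /\b((?:[\w.-]+\/)+[\w.-]+\.\w+)\b/g;
  let m: RegExpExecArray | null;
  while ((m = scriptRe.exec(markdown)) !== null) scripts.push(m[1]);
  while ((m = pathRe.exec(markdown)) !== null) paths.push(m[1]);
  return { paths, scripts };
}

// Compare every extracted reference with what the repo actually contains.
function checkRefs(markdown: string, repo: Repo): Finding[] {
  const refs = extractRefs(markdown);
  const findings: Finding[] = [];
  for (const p of refs.paths) {
    const ok = repo.files.indexOf(p) !== -1;
    findings.push({ ref: p, ok, note: ok ? "exists" : "no such file (renamed or moved?)" });
  }
  const known = Object.keys(repo.scripts);
  for (const s of refs.scripts) {
    const ok = known.indexOf(s) !== -1;
    // Naive suggestion: the first script whose name contains the missing one.
    const near = known.filter((k) => k.indexOf(s) !== -1);
    const hint = near.length > 0 ? ` Did you mean: "${near[0]}"?` : "";
    findings.push({ ref: `npm run ${s}`, ok, note: ok ? "defined" : `not in package.json.${hint}` });
  }
  return findings;
}

const findings = checkRefs(
  "Auth lives in src/auth/login.ts. Run npm run check before committing.",
  { files: ["src/auth/session.ts"], scripts: { "check:types": "tsc --noEmit" } },
);
for (const f of findings) console.log(f.ok ? "✓" : "✗", f.ref, "-", f.note);
```

Both references in the sample text are stale, so both come back as failures. The point is only that each claim is resolved deterministically, with no model involved.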

Same cross-reference, any plugin. npx vigiles scan checks a plugin's contracts — every subagent tool, mcp__server__tool, mcp_tool hook, hook event, and script path actually exists and resolves, not just parses (valid YAML ≠ a tool that's real). A superset of Anthropic's claude plugin validate, no key. Audit any plugin →
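The difference between a name that parses and a name that resolves can be sketched as follows. This is an illustration under assumptions, not the scanner itself: `resolveMcpTool` and the `Servers` map are hypothetical, and the known servers and tools are hard-coded here rather than discovered from a plugin.

```typescript
// Sketch only: hypothetical helper and data, not the vigiles scanner.

// Maps an MCP server name to the tool names it exposes.
type Servers = Record<string, string[]>;

// Returns null when the name resolves, otherwise a short reason.
function resolveMcpTool(toolName: string, servers: Servers): string | null {
  // The mcp__<server>__<tool> naming is the format the text above refers to.
  const m = /^mcp__(.+?)__(.+)$/.exec(toolName);
  if (m === null) return "not an mcp__server__tool name";
  const server = m[1];
  const tool = m[2];
  if (!Object.prototype.hasOwnProperty.call(servers, server)) {
    return `unknown server "${server}"`;
  }
  if (servers[server].indexOf(tool) === -1) {
    return `server "${server}" has no tool "${tool}"`;
  }
  return null;
}

const servers: Servers = { github: ["create_issue"] };
const declared = [
  "mcp__github__create_issue", // well-formed and real
  "mcp__github__close_issue", // well-formed, but the tool does not exist
  "mcp__jira__search", // well-formed, but the server does not exist
];

const problems = declared
  .map((toolName) => ({ toolName, problem: resolveMcpTool(toolName, servers) }))
  .filter((entry) => entry.problem !== null);

for (const p of problems) console.log("✗", p.toolName, "-", p.problem);
```

All three names are syntactically valid; only the first one refers to something that exists.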

② Test — does your harness do its job?

A hook can be wired wrong. A skill's description can fail to trigger, or hijack unrelated prompts. Injected context may never reach the model. All of these pass a naive "did it run?" check. vigiles tests the assembled harness for real:

import assert from "node:assert/strict";
import { runHook } from "vigiles/testing";

// `guard` is the hook under test; its definition is not shown here.
const r = runHook(guard, {
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit --no-verify" },
});
assert(r.blocked); // a red ✗ means your guard silently lets it through
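The `guard` in the example above is not shown. As a rough illustration of the decision logic such a hook might contain, here is a hypothetical stand-in; `decide`, `HookInput`, and `Decision` are invented for this sketch, and only the payload field names are taken from the example.

```typescript
// Hypothetical guard logic; only the payload field names come from the example above.
interface HookInput {
  hook_event_name: string;
  tool_name: string;
  tool_input: { command?: string };
}

interface Decision {
  blocked: boolean;
  reason: string;
}

// Block Bash commands that run `git commit` with --no-verify.
// A real hook script would wrap this: read the JSON payload from stdin and,
// following Claude Code's hook convention, exit with code 2 to block the call.
function decide(input: HookInput): Decision {
  const command = input.tool_input.command || "";
  const isBash = input.hook_event_name === "PreToolUse" && input.tool_name === "Bash";
  const skipsHooks = /\bgit\s+commit\b.*\s--no-verify\b/.test(command);
  if (isBash && skipsHooks) {
    return { blocked: true, reason: "git commit --no-verify bypasses pre-commit checks" };
  }
  return { blocked: false, reason: "" };
}

const blocked = decide({
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit --no-verify" },
});
const allowed = decide({
  hook_event_name: "PreToolUse",
  tool_name: "Bash",
  tool_input: { command: "git commit -m 'wip'" },
});
console.log(blocked.blocked, allowed.blocked); // true false
```

Testing both the command that must be blocked and one that must pass guards against a hook that blocks everything, which would also satisfy a one-sided assertion.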

It goes well past "did it fire?":

- Hooks block what they must: runHook, or the real agent CLI via runHarnessTest.
- Skills trigger on the right prompts and stay quiet on the wrong ones: recall and precision (measureTriggerRate).
- Behaviour is good: score a skill's output directly, or A/B it on-vs-off for the real lift over no-skill (measure / runEval, with significance testing).
- Safety holds: the agent didn't push to the wrong branch or hit a paid API; interceptTools catches the attempt so the side effect never happens.
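For skill triggering, recall and precision have their usual meanings: recall is the share of prompts that should trigger the skill and did, and precision is the share of actual triggers that were wanted. The sketch below computes both from labeled outcomes. It does not use measureTriggerRate, whose signature is not shown here; `TriggerCase` and `triggerMetrics` are hypothetical names, and the data is made up for illustration.

```typescript
// Sketch only: hypothetical data shape and helper, not the vigiles API.
interface TriggerCase {
  prompt: string;
  shouldTrigger: boolean; // label: the skill ought to fire for this prompt
  didTrigger: boolean; // observation: the skill actually fired
}

function triggerMetrics(cases: TriggerCase[]): { recall: number; precision: number } {
  let truePos = 0;
  let falsePos = 0;
  let falseNeg = 0;
  for (const c of cases) {
    if (c.didTrigger && c.shouldTrigger) truePos++;
    else if (c.didTrigger && !c.shouldTrigger) falsePos++; // hijacked an unrelated prompt
    else if (!c.didTrigger && c.shouldTrigger) falseNeg++; // failed to trigger
  }
  // Convention chosen for this sketch: an empty denominator counts as a perfect score.
  return {
    recall: truePos + falseNeg === 0 ? 1 : truePos / (truePos + falseNeg),
    precision: truePos + falsePos === 0 ? 1 : truePos / (truePos + falsePos),
  };
}

const cases: TriggerCase[] = [
  { prompt: "test my skills", shouldTrigger: true, didTrigger: true },
  { prompt: "write a harness test for my hook", shouldTrigger: true, didTrigger: false },
  { prompt: "rename this variable", shouldTrigger: false, didTrigger: true },
  { prompt: "fix the failing unit test", shouldTrigger: false, didTrigger: true },
  { prompt: "summarize this file", shouldTrigger: false, didTrigger: false },
];

const metrics = triggerMetrics(cases);
console.log(`recall=${metrics.recall.toFixed(2)} precision=${metrics.precision.toFixed(2)}`);
// prints "recall=0.50 precision=0.33"
```

Low recall means the description is too narrow; low precision means it is too broad and fires on prompts it should leave alone.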

The eval you can actually afford. Almost every tier runs with no model and no API key — milliseconds, on every commit. The rest drive your own claude CLI:

| | Runs on | Cost |
| --- | --- | --- |
| promptfoo, DeepEval, … | metered API SDK | billed per token, every run |
| vigiles | your Claude Pro/Max sub | $0 extra, and most tiers need no model |

That's why you can eval your harness on every change, not just once. How it works → · Why it's affordable → · Safety model →

More

- Plugin health leaderboard →: point scan at a marketplace (e.g. wshobson/agents) and it ranks every plugin by structural health (0–100, A–F), worst issues first, still with no key. Add --trigger for the model-gated column: do the skills actually fire?
- CLI & GitHub Action →: every command, the Action (inputs / output / versioning), the Claude Code plugin, and the lint rules.
- Skills →: consumer skills installed as a Claude Code plugin: /plugin marketplace add zernie/vigiles then /plugin install vigiles@vigiles (or let vigiles init do it). The model-invocable ones (test-harness, strengthen, edit-spec) fire on their own: ask "test my skills", "strengthen my rules", or "add a rule to CLAUDE.md" and the agent reaches for them; migrate-to-spec and linter-docs are user-invoked.
- Docs index → · Research → · Related tools → (ast-grep, Dependency Cruiser, Ruler, rulesync).
- Companion to Feedback Loop Is All You Need.

License

MIT

[^name]: vigiles — the watchmen of ancient Rome, who guarded the city (and fought its fires) by night. Quis custodiet ipsos custodes? — "who watches the watchmen?" (Juvenal, Satire VI).
