@skyswordw/skillmatrix

v0.2.0

Published

17 days ago

Behavioral cross-agent testing for agent skills: run a skill headlessly on Claude Code and Codex, assert deterministic outcomes, and get a pass/fail behavior matrix.

0High
0Medium
0Low

skywalker_s

agent-skills skill claude-code codex testing cross-agent behavior ci cli

skillmatrix

Behavioral cross-agent testing for agent skills. Run a skill for real on Claude Code and Codex, assert deterministic outcomes, and get a pass/fail behavior matrix.

If skillport is the static check ("does this SKILL.md use anything that won't port?"), skillmatrix is the behavioral one: it actually runs the skill headlessly on each agent against an isolated workspace and checks what the skill did.

Static vs behavioral. skillport reads your skill in milliseconds and flags Claude-only syntax. skillmatrix runs it on each agent and verifies the result. Use skillport as the fast pre-check; use skillmatrix to prove behavior. Lint first, then run.

How it works

For every (test × agent) pair, skillmatrix:

Creates a fresh, isolated workspace.
Installs the skill into that agent's discovery path (.claude/skills/ for Claude Code, .agents/skills/ for Codex).
Runs the agent headlessly (claude -p / codex exec) with your prompt.
Evaluates deterministic filesystem assertions against the resulting workspace.

The robust, cross-agent signal is the filesystem effect (did the skill produce the files it promised, with the right content?) plus a clean exit — identical semantics on every agent.

Quick start

A test is a *.smtest.json file:

{
  "name": "make-marker creates the marker file",
  "skill": "./skills/make-marker",
  "prompt": "Use your make-marker skill to create its marker file.",
  "agents": ["claude-code", "codex"],
  "assert": [{ "file": "MARKER.txt", "equals": "skillmatrix-marker-ok" }]
}

Run it (this invokes the real claude / codex CLIs and uses their quota):

npx github:skyswordw/skillmatrix examples/make-marker

make-marker creates the marker file
  claude-code ✓  ·  codex ✓

2 cell(s): 2 ✓ pass · 0 ✗ fail

When a skill behaves differently on one agent, you see exactly where:

make-marker creates the marker file
  claude-code ✓  ·  codex ✗
  ✗ codex:
     - expected file "MARKER.txt" to exist, but it does not

Assertions

Each assertion targets a file (relative to the workspace) with exactly one matcher:

| Matcher | Passes when | |---|---| | "exists": true / false | the file is present / absent | | "equals": "..." | content equals the value (both sides trimmed) | | "contains": "..." | content contains the substring | | "matches": "regex" | content matches the JS regex (raw content) | | "json": { "path": "a.b.0", "equals": ... } | the file parses as JSON and the value at the dot-path deep-equals equals | | "directory": true | file is a directory | | "glob": "data/*.json", "count": 2 | files match the glob (optional exact count; default ≥1) | | "trace": { "path": "is_error", "equals": false } | the agent's JSON stdout: value at the dot-path equals / contains |

File paths are confined to the workspace — absolute paths and .. escapes are rejected (at parse and at evaluation). Content is CRLF-normalized before matching.

Seed input files into the workspace before the run with a top-level "files" map ({ "data/input.txt": "..." }), in addition to (or instead of) a "fixture" directory.

CLI

skillmatrix [path] [options]

  -a, --agent <list>   claude-code,codex,all   (default: all)
      --json           machine-readable output
      --keep           keep run workspaces for debugging
      --work-dir <d>   base dir for workspaces (default: OS temp)
      --timeout <sec>  per-run timeout (default: 240)

Exit code is non-zero if any cell fails — drop it into CI. Note: CI needs the claude / codex CLIs authenticated on the runner.

Status

v0.1 — deterministic filesystem assertions across Claude Code + Codex, proven end-to-end by a feasibility spike (spike/). Roadmap: more assertion kinds (command-was-run via the agents' JSON traces), Cursor once its skill CLI matures, and tighter integration with skillport (lint → run in one pass).

Development

npm install      # project-local; no global installs
npm run check    # typecheck
npm test         # build + run the node:test suite (uses a fake executor — no real agent calls)
npm run build    # emit dist/

The agent runner is injectable (AgentExecutor), so the full orchestration is unit-tested without spawning real agents. The real executor (src/exec.ts) shells out to the CLIs.

Part of a small set of honest-by-default QA tools for AI-assisted development:

skillport — static cross-agent skill linter: does your SKILL.md port across agents?
skillmatrix (this repo) — behavioral cross-agent skill testing
claimcheck — a CI receipt for the claims your PR makes

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme