tex-eval
v0.1.0
A/B harness for things a coding agent uses (MCPs, CLIs, scaffolds, prompts). Drives Claude Code against a fixed corpus and produces comparable reports.
tex
The eval engine behind tex-mex and (eventually)
other audience-specific eval plugins. This repo is both:
- The tex-eval npm package: a Node CLI that spawns Claude Code against a corpus, measures four behavioral metrics per task, runs an LLM-as-judge for completion, and writes comparable reports.
- A Claude Code plugin marketplace at theDakshJaitly/tex. Today it ships one plugin (tex-mex); more land here as audiences are wedged.
If you're trying to use tex to evaluate something, you almost certainly want a plugin, not the engine directly. Skip to For users.
If you're trying to build a new audience-specific plugin or wire tex into CI, read on after that.
What tex caught in the wild
A scaffold whose pitch was less context via curated routing shipped 5 design iterations over 5 days. All held 10/10 completion. Most improved navigation precision 2–2.5×.
But every variant increased tokens_loaded. The best one was
+177% above baseline — 5,622 tokens vs 2,030. The scaffold's whole
pitch was less context. The candidate would have shipped silently
without the harness.
Decision: shelved, not shipped. Full report:
eval-reports/0.3.0-alpha.md.
This is what tex is for. It doesn't decide for you. It surfaces the contradiction loudly enough that "ship it" stops being the default.
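The +177% figure follows directly from the two token counts quoted above. A minimal sketch of the arithmetic, where `percentDelta` is a hypothetical helper name and not part of the tex-eval API:

```typescript
/** Signed percentage change from a baseline value to a candidate value. */
function percentDelta(baseline: number, candidate: number): number {
  if (baseline === 0) {
    throw new Error("baseline must be non-zero");
  }
  return ((candidate - baseline) / baseline) * 100;
}

// Token counts quoted in the story above: baseline 2,030, best variant 5,622.
const delta = percentDelta(2030, 5622);
const sign = delta >= 0 ? "+" : "";
console.log(sign + Math.round(delta) + "%"); // +177%
```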
For users
If you're iterating on a mex-shaped scaffold
You want the tex-mex plugin. Inside Claude Code:
/plugin marketplace add theDakshJaitly/tex
/plugin install tex-mex@tex
Then /tex-mex to run an eval. Two questions, then it does the work.
Full pitch and details: plugins/tex-mex/README.md.
If you're iterating on an agent-facing CLI (e.g., a cli-printing-press output)
A tex-cli plugin is on the roadmap. Until it lands, the engine
supports the CLI workflow directly — see For plugin authors / CI
below and use tex run --subject cli:<path>.
If you're iterating on an MCP server or prompt layer
The engine's subject loaders for mcp and prompt are stubbed for
v1.1. Use --subject none as a control today; the diff against a
fixture that hand-wires your MCP into .claude/settings.json is the
manual workaround.
For plugin authors / CI
The engine ships as the tex-eval npm package. Install:
npm install -g tex-eval
tex --version
tex --help
You also need to be logged into Claude Code (claude /login once,
cached in your keychain). No API key required by default.
Quickstart
# Verify the pipeline
tex smoke
# Scaffold a starter corpus (cli or scaffold templates)
tex init --kind scaffold --var scaffold_name=foo --var scaffold_purpose="..."
# Validate
tex validate corpus
# Run
tex run --label baseline --subject scaffold
# Compare
tex diff results/baseline/report.json results/candidate/report.json
CLI reference
| Command | What it does |
|---|---|
| tex init --kind <cli\|scaffold> --var ... | Scaffold a starter corpus + fixture. mcp and prompt are stubbed for v1.1. |
| tex validate [<dir>] | Load corpus, report errors. Exit non-zero on failure. |
| tex run --label <name> --subject <arg> [--auth oauth\|key] [--task <id>] | Run the corpus; write results/<label>/report.{json,md} |
| tex diff <baseline.json> <candidate.json> | Markdown diff to stdout + eval-reports/ |
| tex smoke [--auth oauth\|key] | One no-op task against the bundled fixture |
| tex detect [<path>] | Classify a directory as mcp / cli / scaffold / etc. |
--subject accepts a shorthand (none, scaffold, cli:<path>,
mcp:<config>, prompt:<path>) or a path to a JSON SubjectConfig.
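One way to read the shorthand rule above is as a small parser: two bare keywords, three `kind:target` prefixes, and anything else treated as a path to a JSON SubjectConfig. The sketch below works under that reading. `ParsedSubject` and `parseSubject` are hypothetical names, and the real SubjectConfig schema is not shown in this README.

```typescript
// Hypothetical result shape for illustration only.
type ParsedSubject =
  | { kind: "none" | "scaffold" }
  | { kind: "cli" | "mcp" | "prompt"; target: string }
  | { kind: "config-file"; path: string };

function parseSubject(arg: string): ParsedSubject {
  if (arg === "none" || arg === "scaffold") {
    return { kind: arg };
  }
  const colon = arg.indexOf(":");
  if (colon > 0) {
    const prefix = arg.slice(0, colon);
    const target = arg.slice(colon + 1);
    if ((prefix === "cli" || prefix === "mcp" || prefix === "prompt") && target.length > 0) {
      return { kind: prefix, target: target };
    }
  }
  // Assumption: anything that is not a known shorthand is a path to a JSON SubjectConfig.
  return { kind: "config-file", path: arg };
}

console.log(parseSubject("cli:./bin/my-cli"));
console.log(parseSubject("subjects/custom.json"));
```

Falling through to the config-file case keeps unknown prefixes (including drive letters in Windows paths) from being misread as a shorthand.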
The four metrics
| Metric | What it measures |
|---|---|
| tokens_loaded | Approximate token count of files the agent actually read (chars/4). |
| navigation.precision | Read-files ∩ expected-files / read-files |
| navigation.recall | Read-files ∩ expected-files / expected-files |
| time_to_first_output_ms | Wall-clock from spawn to first agent text |
| completion.overall_score | LLM-judged pass rate across binary rubric criteria, scaled 0–10 |
Deliberately decoupled. A change can improve one and regress another — the per-task table tells you which, the aggregate table tells you how much, the rubric breakdown tells you why.
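The formulas in the table can be sketched in a few lines. This is an illustration of the definitions only, not the engine's source: the function names, the round-up in the token estimate, and the zero result for an empty denominator are all assumptions.

```typescript
/** chars/4 approximation from the table. Rounding up is an assumption. */
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4);
}

interface NavigationScore {
  precision: number;
  recall: number;
}

/** Precision and recall of the files read against a task's expected_context_files. */
function scoreNavigation(readFiles: string[], expectedFiles: string[]): NavigationScore {
  const read = new Set(readFiles);
  const expected = new Set(expectedFiles);
  const hits = Array.from(read).filter((file) => expected.has(file)).length;
  // Assumption: an empty denominator scores 0 rather than NaN.
  return {
    precision: read.size === 0 ? 0 : hits / read.size,
    recall: expected.size === 0 ? 0 : hits / expected.size,
  };
}

// The agent read four files, two of which were expected.
const score = scoreNavigation(
  ["ROUTER.md", "patterns/foo.md", "src/index.ts", "package.json"],
  ["ROUTER.md", "patterns/foo.md"],
);
console.log(score.precision, score.recall); // 0.5 1
console.log(estimateTokens("hello world!")); // 3
```

The example shows the decoupling in miniature: recall is perfect while precision is only 0.5, because the two extra reads cost tokens without helping the task.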
Auth modes
--auth oauth (default) uses your Claude Code subscription. Spawned
sessions inherit your CLAUDE.md / hooks / plugin context — that's
~18.8k tokens of "pollution" per session, constant across compared
runs, so deltas remain meaningful but absolute scores aren't portable.
--auth key (opt-in) requires ANTHROPIC_API_KEY and prepends --bare
to every spawn, stripping all of that. Real money. Portable scores.
Recommended for CI and published benchmarks.
tex diff warns when comparing reports with mismatched auth modes.
Full semantics: docs/headless-cc-notes.md.
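The difference between the two modes can be pictured as a small argument-building step. The sketch below is an assumption about shape, not the engine's driver code: `buildClaudeArgs` is a hypothetical name, and `<task-args>` is a placeholder for whatever arguments the engine actually passes.

```typescript
type AuthMode = "oauth" | "key";

/** Builds the argument list for one spawned session under the given auth mode. */
function buildClaudeArgs(
  auth: AuthMode,
  baseArgs: string[],
  env: Record<string, string | undefined>,
): string[] {
  if (auth === "oauth") {
    // Subscription login: the session inherits CLAUDE.md, hooks, and plugin context.
    return baseArgs.slice();
  }
  if (!env["ANTHROPIC_API_KEY"]) {
    throw new Error("--auth key requires ANTHROPIC_API_KEY");
  }
  // Key mode: --bare goes first, as described above.
  return ["--bare"].concat(baseArgs);
}

console.log(buildClaudeArgs("oauth", ["<task-args>"], {}));
console.log(buildClaudeArgs("key", ["<task-args>"], { ANTHROPIC_API_KEY: "placeholder" }));
```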
Authoring a corpus
See docs/getting-started.md for the end-to-end
walkthrough. Short version:
id: my-task
fixture: code-repo
prompt: |
  Specific, evidence-checkable task the agent should perform.
expected_context_files:
  - ROUTER.md
  - patterns/foo.md
success_criteria:
  - id: stable-id
    text: A binary, evidence-checkable claim.
  # ... at least 3 criteria
# Optional per-task overrides:
driver_model: haiku
budget_usd: 0.10
The describe-a-fail discipline: for every criterion, you should be able
to state in one sentence what failure looks like. If you can't, the
criterion is too vague and the eval will be noise. The tex-mex skill
enforces this; the bare CLI trusts you.
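Since the bare CLI trusts you, a small pre-flight lint over a parsed task can catch the mechanical problems before a run. The sketch below mirrors the field names in the task file above. The checks themselves (a minimum of three criteria, unique ids, non-empty text) are illustrative assumptions, not what tex validate actually does.

```typescript
interface Criterion {
  id: string;
  text: string;
}

interface Task {
  id: string;
  fixture: string;
  prompt: string;
  expected_context_files: string[];
  success_criteria: Criterion[];
  driver_model?: string;
  budget_usd?: number;
}

/** Returns human-readable problems; an empty array means the task passed this lint. */
function lintTask(task: Task): string[] {
  const problems: string[] = [];
  if (task.prompt.trim().length === 0) {
    problems.push("prompt is empty");
  }
  if (task.success_criteria.length < 3) {
    problems.push("fewer than 3 success criteria");
  }
  const seen = new Set<string>();
  for (const criterion of task.success_criteria) {
    if (seen.has(criterion.id)) {
      problems.push("duplicate criterion id: " + criterion.id);
    }
    seen.add(criterion.id);
    if (criterion.text.trim().length === 0) {
      problems.push("criterion " + criterion.id + " has no text");
    }
  }
  return problems;
}

const draft: Task = {
  id: "my-task",
  fixture: "code-repo",
  prompt: "Specific, evidence-checkable task the agent should perform.",
  expected_context_files: ["ROUTER.md", "patterns/foo.md"],
  success_criteria: [{ id: "stable-id", text: "A binary, evidence-checkable claim." }],
};
console.log(lintTask(draft)); // [ 'fewer than 3 success criteria' ]
```

A lint like this cannot judge whether a criterion is vague; the one-sentence failure description remains a manual check.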
Building an audience plugin
If you want to ship a /plugin install-able experience for a specific
audience (CLI users, MCP authors, prompt-layer iterators), copy the
shape of plugins/tex-mex/:
plugins/<your-plugin>/
  .claude-plugin/plugin.json       # plugin manifest
  skills/<your-plugin>/SKILL.md    # the audience-shaped UX
  bin/tex                          # shim → CLAUDE_PLUGIN_DATA/node_modules/.bin/tex
  hooks/hooks.json                 # SessionStart: npm install tex-eval
  package.json                     # tex-eval as a dependency
  README.md                        # audience-facing pitch
Then add an entry to .claude-plugin/marketplace.json
at this repo's root.
Plugin docs: docs/v1-spec.md for the design
rationale; Claude Code plugin reference
for the manifest schema.
What's in this repo
tex/
├── README.md ← you are here
├── .claude-plugin/
│ └── marketplace.json ← lists plugins for /plugin marketplace add
├── plugins/
│ └── tex-mex/ ← the mex-flavored plugin
├── bin/tex.ts ← engine CLI entry (compiled to dist/bin/tex.js)
├── harness/ ← engine source (typescript)
├── templates/ ← cli + scaffold starter corpora used by tex init
├── corpus/ ← engine's own dogfood corpus (5 mex-shaped tasks)
├── fixtures/code-repo/ ← engine's own dogfood fixture (mex-scaffolded billing svc)
├── docs/
│ ├── v1-spec.md ← engine design doc
│ ├── getting-started.md ← end-to-end CLI walkthrough
│ └── headless-cc-notes.md ← Claude Code driver / auth details
├── eval-reports/ ← committed historical eval artifacts
│ ├── 0.3.0-alpha.md ← the headline story (the +177% regression)
│ └── v1-validation/ ← v1 ship validation artifacts
└── test/ ← engine tests
Known limitations
- Single trial per task. Absolute scores are noisy at N=1; deltas are the primary signal.
- Operator-environment pollution under default oauth auth. Use --auth key for portable absolute scores.
- tokens_loaded is a chars/4 approximation, not a real tokenizer. Relative measurements across runs are stable; absolute counts are off-by-some-percent.
- mcp and prompt subject loaders are stubbed for v1.1. cli, scaffold, and none are real.
- Claude Code is the only agent driver in v1. The interface is clean for plug-in; Cursor / Codex / Aider land in v2.
Roadmap
- v1.1: real mcp and prompt subject loaders; tex sync-fixture for snapshot-based workflows
- v1.x: tex-cli plugin (CPP / agent-facing CLI audience)
- v2: multi-agent drivers (Cursor, Codex, Aider); N>1 trials with variance reporting; CI gates
Origin
Split from mex on 2026-05-07. tex was originally mex's internal eval harness — including the +177% token-regression evaluation that got that work shelved. Carved out so the harness can be useful elsewhere.
