tex-eval

v0.1.0

Published

2 months ago

A/B harness for things a coding agent uses (MCPs, CLIs, scaffolds, prompts). Drives Claude Code against a fixed corpus and produces comparable reports.

daksh_jaitly

tex

The eval engine behind tex-mex and (eventually) other audience-specific eval plugins. This repo is both:

The tex-eval npm package — a Node CLI that spawns Claude Code against a corpus, measures four behavioral metrics per task, runs an LLM-as-judge for completion, and writes comparable reports.
A Claude Code plugin marketplace at theDakshJaitly/tex. Today it ships one plugin (tex-mex); more land here as audiences are wedged.

If you're trying to use tex to evaluate something, you almost certainly want a plugin, not the engine directly. Skip to For users.

If you're trying to build a new audience-specific plugin or wire tex into CI, read on after that.

What tex caught in the wild

A scaffold whose pitch was "less context via curated routing" shipped 5 design iterations over 5 days. All held 10/10 completion. Most improved navigation precision 2–2.5×.

But every variant increased tokens_loaded. The best one was +177% above baseline — 5,622 tokens vs 2,030. The scaffold's whole pitch was less context. The candidate would have shipped silently without the harness.

Decision: shelved, not shipped. Full report: eval-reports/0.3.0-alpha.md.

This is what tex is for. It doesn't decide for you. It surfaces the contradiction loudly enough that "ship it" stops being the default.
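The +177% figure follows directly from the two token counts quoted above. A minimal TypeScript sketch of that arithmetic; `percentChange` is an illustrative helper, not part of the tex API:

```typescript
// Relative change of a candidate value against a baseline, in percent.
// `percentChange` is a hypothetical helper name, not a tex function.
function percentChange(baseline: number, candidate: number): number {
  if (baseline === 0) {
    throw new Error("baseline must be non-zero");
  }
  return ((candidate - baseline) / baseline) * 100;
}

// Token counts quoted in the story above: 2,030 baseline vs 5,622 candidate.
const tokensDelta = percentChange(2030, 5622);
// Math.round(tokensDelta) is 177
```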

For users

If you're iterating on a mex-shaped scaffold

You want the tex-mex plugin. Inside Claude Code:

/plugin marketplace add theDakshJaitly/tex
/plugin install tex-mex@tex

Then run /tex-mex to start an eval. It asks two questions, then does the work.

Full pitch and details: plugins/tex-mex/README.md.

If you're iterating on an agent-facing CLI (e.g., a cli-printing-press output)

A tex-cli plugin is on the roadmap. Until it lands, the engine supports the CLI workflow directly — see For plugin authors / CI below and use tex run --subject cli:<path>.

If you're iterating on an MCP server or prompt layer

The engine's subject loaders for mcp and prompt are stubbed for v1.1. Use --subject none as a control today. As a manual workaround, diff that control run against a run on a fixture that hand-wires your MCP into .claude/settings.json.

For plugin authors / CI

The engine ships as the tex-eval npm package. Install:

npm install -g tex-eval
tex --version
tex --help

You also need to be logged into Claude Code (run claude /login once; the credentials are cached in your keychain). No API key is required by default.

Quickstart

# Verify the pipeline
tex smoke

# Scaffold a starter corpus (cli or scaffold templates)
tex init --kind scaffold --var scaffold_name=foo --var scaffold_purpose="..."

# Validate
tex validate corpus

# Run
tex run --label baseline --subject scaffold

# Compare
tex diff results/baseline/report.json results/candidate/report.json

CLI reference

| Command | What it does |
|---|---|
| tex init --kind <cli\|scaffold> --var ... | Scaffold a starter corpus + fixture. mcp and prompt are stubbed for v1.1. |
| tex validate [<dir>] | Load corpus, report errors. Exit non-zero on failure. |
| tex run --label <name> --subject <arg> [--auth oauth\|key] [--task <id>] | Run the corpus; write results/<label>/report.{json,md} |
| tex diff <baseline.json> <candidate.json> | Markdown diff to stdout + eval-reports/ |
| tex smoke [--auth oauth\|key] | One no-op task against the bundled fixture |
| tex detect [<path>] | Classify a directory as mcp / cli / scaffold / etc. |

--subject accepts a shorthand (none, scaffold, cli:<path>, mcp:<config>, prompt:<path>) or a path to a JSON SubjectConfig.
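To make the shorthand rules concrete, here is a rough sketch of how such an argument could be classified. It is not the engine's parser: the `ParsedSubject` type and `parseSubject` function are hypothetical, and the real SubjectConfig schema is not shown in this README.

```typescript
// Hypothetical result shape; the real SubjectConfig schema may differ.
type ParsedSubject =
  | { kind: "none" | "scaffold" }
  | { kind: "cli" | "mcp" | "prompt"; target: string }
  | { kind: "config-file"; path: string };

function parseSubject(arg: string): ParsedSubject {
  // Bare keywords carry no argument.
  if (arg === "none" || arg === "scaffold") {
    return { kind: arg };
  }
  // Prefixed forms: cli:<path>, mcp:<config>, prompt:<path>.
  const sep = arg.indexOf(":");
  if (sep > 0) {
    const prefix = arg.slice(0, sep);
    const target = arg.slice(sep + 1);
    if ((prefix === "cli" || prefix === "mcp" || prefix === "prompt") && target.length > 0) {
      return { kind: prefix, target };
    }
  }
  // Assumption: anything else is a path to a JSON SubjectConfig file.
  return { kind: "config-file", path: arg };
}

const cliSubject = parseSubject("cli:./bin/my-tool");
const fileSubject = parseSubject("subjects/custom.json");
```

Under these assumptions, `cliSubject` has kind `cli` with target `./bin/my-tool`, and `fileSubject` falls through to the config-file case.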

The four metrics

| Metric | What it measures |
|---|---|
| tokens_loaded | Approximate token count of files the agent actually read (chars/4). |
| navigation.precision | Read-files ∩ expected-files / read-files |
| navigation.recall | Read-files ∩ expected-files / expected-files |
| time_to_first_output_ms | Wall-clock from spawn to first agent text |
| completion.overall_score | LLM-judged pass rate across binary rubric criteria, scaled 0–10 |
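The definitions in the table can be written out as small functions. This is a sketch of the formulas as stated, not the harness source. The function names, the rounding in `approxTokens`, the de-duplication of file lists, and the zero-denominator handling are all assumptions.

```typescript
// Hypothetical helpers illustrating the table's definitions; not tex's actual source.

// tokens_loaded: total characters read, divided by 4.
// Rounding up is an assumption; the table only says "chars/4".
function approxTokens(fileContents: string[]): number {
  const chars = fileContents.reduce((sum, text) => sum + text.length, 0);
  return Math.ceil(chars / 4);
}

// Drop duplicates so a file read twice counts once (an assumption).
function unique(items: string[]): string[] {
  return items.filter((item, index) => items.indexOf(item) === index);
}

// navigation.precision and navigation.recall over file-path sets.
function navigation(read: string[], expected: string[]): { precision: number; recall: number } {
  const readFiles = unique(read);
  const expectedFiles = unique(expected);
  const hits = readFiles.filter((file) => expectedFiles.indexOf(file) !== -1).length;
  return {
    // Returning 0 for an empty denominator is an assumption.
    precision: readFiles.length === 0 ? 0 : hits / readFiles.length,
    recall: expectedFiles.length === 0 ? 0 : hits / expectedFiles.length,
  };
}

// completion.overall_score: fraction of binary criteria passed, scaled to 0-10.
function completionScore(criteriaPassed: boolean[]): number {
  if (criteriaPassed.length === 0) return 0;
  const passed = criteriaPassed.filter((ok) => ok).length;
  return (passed / criteriaPassed.length) * 10;
}

// Illustrative inputs: the agent read four files, two of which were expected.
const nav = navigation(
  ["ROUTER.md", "patterns/foo.md", "src/a.ts", "src/b.ts"],
  ["ROUTER.md", "patterns/foo.md"],
);
const score = completionScore([true, true, false, true]);
```

With these illustrative inputs, precision is 0.5, recall is 1, and the completion score is 7.5.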

The metrics are deliberately decoupled. A change can improve one and regress another: the per-task table tells you which, the aggregate table tells you how much, and the rubric breakdown tells you why.
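A small sketch of why decoupling matters: each metric gets its own verdict, so one regression cannot hide behind another metric's improvement. The `Aggregate` shape and the per-metric directions are assumptions for illustration. Only the token counts (2,030 vs 5,622) and the 10/10 completion come from the story earlier in this README; the other values are made up.

```typescript
type Direction = "lower-is-better" | "higher-is-better";
type Verdict = "improved" | "regressed" | "unchanged";

// Hypothetical aggregate shape; the real report.json schema may differ.
interface Aggregate {
  tokens_loaded: number;
  precision: number;
  recall: number;
  time_to_first_output_ms: number;
  completion: number;
}

// Assumed directions: fewer tokens and less latency are better.
const METRICS: { name: keyof Aggregate; direction: Direction }[] = [
  { name: "tokens_loaded", direction: "lower-is-better" },
  { name: "precision", direction: "higher-is-better" },
  { name: "recall", direction: "higher-is-better" },
  { name: "time_to_first_output_ms", direction: "lower-is-better" },
  { name: "completion", direction: "higher-is-better" },
];

function judge(direction: Direction, baseline: number, candidate: number): Verdict {
  if (candidate === baseline) return "unchanged";
  const wentUp = candidate > baseline;
  const better = direction === "higher-is-better" ? wentUp : !wentUp;
  return better ? "improved" : "regressed";
}

function compare(baseline: Aggregate, candidate: Aggregate): { metric: keyof Aggregate; verdict: Verdict }[] {
  return METRICS.map((m) => ({
    metric: m.name,
    verdict: judge(m.direction, baseline[m.name], candidate[m.name]),
  }));
}

// precision, recall, and latency values below are illustrative only.
const baselineRun: Aggregate = { tokens_loaded: 2030, precision: 0.2, recall: 0.8, time_to_first_output_ms: 4000, completion: 10 };
const candidateRun: Aggregate = { tokens_loaded: 5622, precision: 0.5, recall: 0.8, time_to_first_output_ms: 4000, completion: 10 };
const verdicts = compare(baselineRun, candidateRun);
```

Here `verdicts` reports precision as improved and tokens_loaded as regressed while completion is unchanged, which is the shape of contradiction described above.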

Auth modes

--auth oauth (default) uses your Claude Code subscription. Spawned sessions inherit your CLAUDE.md / hooks / plugin context — that's ~18.8k tokens of "pollution" per session, constant across compared runs, so deltas remain meaningful but absolute scores aren't portable.

--auth key (opt-in) requires ANTHROPIC_API_KEY and prepends --bare to every spawn, stripping all of that inherited context. Real money. Portable scores. Recommended for CI and published benchmarks.

tex diff warns when comparing reports with mismatched auth modes.

Full semantics: docs/headless-cc-notes.md.
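The mismatch warning from tex diff amounts to a guard like the one below. This is a sketch under assumptions: the `ReportMeta` shape, its `auth` field, and the warning text are hypothetical, since the README does not show the report schema or the actual message.

```typescript
type AuthMode = "oauth" | "key";

// Hypothetical slice of a report; the real report.json layout is not shown here.
interface ReportMeta {
  label: string;
  auth: AuthMode;
}

// Returns a warning string when the two reports used different auth modes,
// or null when they are directly comparable on this axis.
function authMismatchWarning(baseline: ReportMeta, candidate: ReportMeta): string | null {
  if (baseline.auth === candidate.auth) {
    return null;
  }
  return (
    `warning: "${baseline.label}" ran with --auth ${baseline.auth} but ` +
    `"${candidate.label}" ran with --auth ${candidate.auth}; ` +
    "deltas may reflect inherited session context rather than the subject"
  );
}

const warning = authMismatchWarning(
  { label: "baseline", auth: "oauth" },
  { label: "candidate", auth: "key" },
);
```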

Authoring a corpus

See docs/getting-started.md for the end-to-end walkthrough. Short version:

id: my-task
fixture: code-repo
prompt: |
  Specific, evidence-checkable task the agent should perform.

expected_context_files:
  - ROUTER.md
  - patterns/foo.md

success_criteria:
  - id: stable-id
    text: A binary, evidence-checkable claim.
  # ... at least 3 criteria

# Optional per-task overrides:
driver_model: haiku
budget_usd: 0.10

The describe-a-fail discipline: for every criterion, you should be able to state in one sentence what failure looks like. If you can't, the criterion is too vague and the eval will be noise. The tex-mex skill enforces this; the bare CLI trusts you.
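Some of the task rules above can be checked mechanically. The sketch below mirrors the YAML fields in a TypeScript type and lints a draft task. It is not the implementation of tex validate: `lintTask` is a hypothetical helper, and the duplicate-id and empty-text checks are assumptions beyond the stated "at least 3 criteria" rule.

```typescript
interface Criterion {
  id: string;
  text: string;
}

// Field names follow the YAML example above.
interface Task {
  id: string;
  fixture: string;
  prompt: string;
  expected_context_files: string[];
  success_criteria: Criterion[];
  driver_model?: string;
  budget_usd?: number;
}

function lintTask(task: Task): string[] {
  const errors: string[] = [];
  if (task.success_criteria.length < 3) {
    errors.push(`${task.id}: needs at least 3 success criteria`);
  }
  // Assumption: criterion ids must be unique so they stay stable across runs.
  const ids = task.success_criteria.map((c) => c.id);
  ids.forEach((id, index) => {
    if (ids.indexOf(id) !== index) {
      errors.push(`${task.id}: duplicate criterion id "${id}"`);
    }
  });
  // Assumption: a criterion with no text cannot be judged.
  for (const criterion of task.success_criteria) {
    if (criterion.text.trim().length === 0) {
      errors.push(`${task.id}: criterion "${criterion.id}" has empty text`);
    }
  }
  return errors;
}

const draft: Task = {
  id: "my-task",
  fixture: "code-repo",
  prompt: "Specific, evidence-checkable task the agent should perform.",
  expected_context_files: ["ROUTER.md", "patterns/foo.md"],
  success_criteria: [
    { id: "stable-id", text: "A binary, evidence-checkable claim." },
    { id: "second-id", text: "" },
  ],
};
const problems = lintTask(draft);
```

For this draft, `problems` holds two entries: too few criteria, and an empty text on `second-id`. Whether a criterion passes the describe-a-fail test still needs a human or the tex-mex skill.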

Building an audience plugin

If you want to ship a /plugin install-able experience for a specific audience (CLI users, MCP authors, prompt-layer iterators), copy the shape of plugins/tex-mex/:

plugins/<your-plugin>/
  .claude-plugin/plugin.json    # plugin manifest
  skills/<your-plugin>/SKILL.md # the audience-shaped UX
  bin/tex                       # shim → CLAUDE_PLUGIN_DATA/node_modules/.bin/tex
  hooks/hooks.json              # SessionStart: npm install tex-eval
  package.json                  # tex-eval as a dependency
  README.md                     # audience-facing pitch

Then add an entry to .claude-plugin/marketplace.json at this repo's root.

Plugin docs: docs/v1-spec.md for the design rationale; Claude Code plugin reference for the manifest schema.

What's in this repo

tex/
├── README.md                       ← you are here
├── .claude-plugin/
│   └── marketplace.json            ← lists plugins for /plugin marketplace add
├── plugins/
│   └── tex-mex/                    ← the mex-flavored plugin
├── bin/tex.ts                      ← engine CLI entry (compiled to dist/bin/tex.js)
├── harness/                        ← engine source (typescript)
├── templates/                      ← cli + scaffold starter corpora used by tex init
├── corpus/                         ← engine's own dogfood corpus (5 mex-shaped tasks)
├── fixtures/code-repo/             ← engine's own dogfood fixture (mex-scaffolded billing svc)
├── docs/
│   ├── v1-spec.md                  ← engine design doc
│   ├── getting-started.md          ← end-to-end CLI walkthrough
│   └── headless-cc-notes.md        ← Claude Code driver / auth details
├── eval-reports/                   ← committed historical eval artifacts
│   ├── 0.3.0-alpha.md              ← the headline story (the +177% regression)
│   └── v1-validation/              ← v1 ship validation artifacts
└── test/                           ← engine tests

Known limitations

Single trial per task. Absolute scores are noisy at N=1; deltas are the primary signal.
Operator-environment pollution under default oauth auth. Use --auth key for portable absolute scores.
tokens_loaded is a chars/4 approximation, not a real tokenizer. Relative measurements across runs are stable; absolute counts can be off by some percentage.
mcp and prompt subject loaders are stubbed for v1.1. cli, scaffold, and none are real.
Claude Code is the only agent driver in v1. The driver interface is built to be pluggable; Cursor / Codex / Aider land in v2.

Roadmap

v1.1 — real mcp and prompt subject loaders; tex sync-fixture for snapshot-based workflows
v1.x — tex-cli plugin (CPP / agent-facing CLI audience)
v2 — multi-agent drivers (Cursor, Codex, Aider); N>1 trials with variance reporting; CI gates

Origin

Split from mex on 2026-05-07. tex was originally mex's internal eval harness — including the +177% token-regression evaluation that got that work shelved. Carved out so the harness can be useful elsewhere.