@hummer98/ctxd-claude-plugin

v0.2.0

Published

8 days ago

Declarative companion to shell state mutations (cd, export, git checkout/switch). Emits structured JSON describing cwd, env, and git state.

hummer98

claude claude-code claude-plugin skill shell cwd git context

ctxd

Declarative CLI commands that pass structured context to AI agents.

ctxd wraps the shell operations that AI agents get lost in — cd, export, git checkout — and returns structured JSON so the agent knows exactly what changed.

# Before: silent, agent has to guess
cd /foo && git checkout main

# After: agent sees the state
ctxd chdir /foo
# {"ok":true,"cmd":"chdir","result":{"cwd":"/foo","git_branch":"main","listing":["src","docs","go.mod"]}}

ctxd git-switch main
# {"ok":true,"cmd":"git-switch","result":{"branch":"main","dirty":false,"ahead":0,"behind":0}}

Why

The silent CLI problem

Unix's "rule of silence" — successful commands produce no output — was designed for human operators who can perceive context implicitly. For AI agents, it is a systematic blind spot.

| Command | What it does silently | What the agent loses |
|---|---|---|
| cd /foo | Changes working directory | New cwd, git branch, file listing |
| export FOO=bar | Sets environment variable | Which variables changed, what their values are |
| git checkout main | Switches branch | Branch, dirty state, divergence from remote |
| kill -STOP <pid> | Pauses a process | Process state |
| umask 022 | Changes file creation mask | Effective permissions for new files |

The agent's only recovery is to emit a follow-up command (pwd, env, git status), which costs extra tokens and an extra round-trip, or to reason from context, which drifts over long conversations.

Transformers can't track sequential state

Mozer et al. "The Topological Trouble With Transformers" (2026) formalizes the issue: feedforward architectures cannot maintain evolving state across depth. The longer an agent session runs, the more it relies on external signals to reconstruct where it is.

This isn't a prompt-engineering problem. It's structural. The fix is structural too: make the state external and machine-readable at the point of mutation.

Infrastructure did this already

Server orchestration faced the same problem in the 2010s — imperative shell scripts drifted, state was implicit, failures were opaque. The industry converged on Terraform, Kubernetes, and GitOps: declare intent, verify postconditions, report structured diffs.

ctxd applies that same pattern to the local shell, scoped to the operations AI agents use most.

How it works

Every ctxd command:

Executes the underlying operation
Observes the resulting state
Returns a structured JSON report

{
  "ok": true,
  "cmd": "chdir",
  "args": ["/foo"],
  "result": {
    "cwd": "/foo",
    "git_branch": "main",
    "listing": ["src", "docs", "go.mod", "README.md"]
  },
  "postcondition": { "passed": true, "checks": [] },
  "elapsed_ms": 4
}

On failure:

{
  "ok": false,
  "cmd": "chdir",
  "args": ["/nonexistent"],
  "error": {
    "code": "not_found",
    "message": "no such file or directory: /nonexistent",
    "retryable": false
  }
}
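A consumer can branch on the `ok` field before touching `result` or `error`. The following is a minimal sketch of that handling, assuming only the fields shown in the two examples above; the `Report` class and the `parse_report` and `describe` helpers are hypothetical names and not part of ctxd.

```python
import json
from dataclasses import dataclass
from typing import Any, Optional


@dataclass
class Report:
    """The subset of a ctxd report used here (hypothetical helper type)."""
    ok: bool
    cmd: str
    result: Optional[dict[str, Any]]
    error: Optional[dict[str, Any]]


def parse_report(raw: str) -> Report:
    """Parse one JSON report; absent keys become None instead of raising."""
    data = json.loads(raw)
    return Report(
        ok=bool(data.get("ok", False)),
        cmd=data.get("cmd", ""),
        result=data.get("result"),
        error=data.get("error"),
    )


def describe(report: Report) -> str:
    """Turn a report into a single line an agent loop could log."""
    if report.ok:
        return f"{report.cmd}: ok, result keys = {sorted(report.result or {})}"
    err = report.error or {}
    retry = "retryable" if err.get("retryable") else "not retryable"
    return f"{report.cmd}: failed with {err.get('code')} ({retry})"


if __name__ == "__main__":
    success = '{"ok": true, "cmd": "chdir", "result": {"cwd": "/foo", "git_branch": "main", "listing": ["src"]}}'
    print(describe(parse_report(success)))
```

Checking `ok` first keeps the success and failure paths separate, so a missing `result` on failure never raises a `KeyError`.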

Postconditions

Declare what you expect the state to be after the command. ctxd verifies it and reports a clear pass/fail:

ctxd git-switch main --expect branch=main --expect dirty=false

{
  "ok": true,
  "postcondition": {
    "passed": true,
    "checks": [
      {"key": "branch", "expected": "main", "actual": "main", "passed": true},
      {"key": "dirty",  "expected": "false","actual": "false","passed": true}
    ]
  }
}
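To make the check semantics concrete, here is a rough Python sketch of how `KEY=VALUE` expectations could be compared against an observed result. It is not ctxd's implementation: the `evaluate` and `as_text` names are hypothetical, and the rule that non-string values are compared in their JSON text form (so `false` matches the boolean `false`) is an assumption inferred from the string-valued `expected` / `actual` fields above.

```python
import json
from typing import Any


def as_text(value: Any) -> str:
    """Render a value as text: strings as-is, everything else as JSON.

    Assumption: a missing key renders as "null".
    """
    return value if isinstance(value, str) else json.dumps(value)


def evaluate(result: dict[str, Any], expectations: list[str]) -> dict[str, Any]:
    """Compare KEY=VALUE expectations against an observed result mapping."""
    checks = []
    for item in expectations:
        key, sep, expected = item.partition("=")  # split on the first "="
        if not sep or not key:
            raise ValueError(f"expected KEY=VALUE, got {item!r}")
        actual = as_text(result.get(key))
        checks.append({
            "key": key,
            "expected": expected,
            "actual": actual,
            "passed": actual == expected,
        })
    # With no expectations, all([]) is True and checks stays empty.
    return {"passed": all(c["passed"] for c in checks), "checks": checks}


if __name__ == "__main__":
    observed = {"branch": "main", "dirty": False, "ahead": 0, "behind": 0}
    print(json.dumps(evaluate(observed, ["branch=main", "dirty=false"]), indent=2))
```

With an empty expectation list the sketch returns `passed: true` and no checks, which is the shape of the `postcondition` field in the success report shown under "How it works".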

Commands (MVP)

| Command | Replaces | Key output fields |
|---|---|---|
| ctxd chdir <path> | cd | cwd, git_branch, listing |
| ctxd git-switch <branch> | git checkout / git switch | branch, dirty, ahead, behind |
| ctxd env-set <KEY=val>… | export | set, diff.added, diff.changed |

The --human flag switches to human-readable output for debugging.

`ctxd chdir`

Resolve a path, list its contents, and report the git branch (if any):

ctxd chdir /path/to/repo

{
  "ok": true,
  "cmd": "chdir",
  "args": ["/path/to/repo"],
  "result": {
    "cwd": "/path/to/repo",
    "git_branch": "main",
    "listing": ["docs", "go.mod", "src"]
  },
  "elapsed_ms": 3
}

git_branch is null when the path is outside a git working tree or HEAD is detached. Errors return ok: false with error.code of not_found (path missing) or not_a_directory (path is a file).

The parent shell's cwd is not modified — pass the resolved cwd to the next command instead.
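A child process cannot change its parent's working directory, so the caller has to carry the reported `cwd` forward itself. One way to do that from Python is sketched below; `run_in_reported_cwd` is a hypothetical helper, and the report is built by hand here rather than obtained from a real `ctxd chdir` call.

```python
import json
import os
import subprocess
import sys
import tempfile
from typing import Sequence


def run_in_reported_cwd(report_json: str, argv: Sequence[str]) -> subprocess.CompletedProcess:
    """Run argv in the directory that a successful chdir report resolved."""
    report = json.loads(report_json)
    if not report.get("ok"):
        raise RuntimeError(f"chdir failed: {report.get('error')}")
    return subprocess.run(
        list(argv),
        cwd=report["result"]["cwd"],  # the child starts in the reported directory
        capture_output=True,
        text=True,
        check=True,
    )


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        # Hand-written stand-in for the JSON that `ctxd chdir` would print.
        fake_report = json.dumps({"ok": True, "cmd": "chdir", "result": {"cwd": tmp}})
        done = run_in_reported_cwd(
            fake_report, [sys.executable, "-c", "import os; print(os.getcwd())"]
        )
        print(os.path.realpath(done.stdout.strip()) == os.path.realpath(tmp))
```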

`ctxd git-switch`

Switch a git branch and report the resulting working tree state:

ctxd git-switch main

{
  "ok": true,
  "cmd": "git-switch",
  "args": ["main"],
  "result": {
    "branch": "main",
    "dirty": false,
    "ahead": 0,
    "behind": 0
  },
  "elapsed_ms": 32
}

branch is null when HEAD is detached. ahead / behind are 0 when no upstream is configured. On failure, error.code is one of not_a_git_repo, branch_not_found, dirty_tree, or git_not_found.

The repository's HEAD is updated (the switch is real), but the parent shell's cwd is not changed.
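Because the failure codes are a small closed set, a caller can map each one to a next step. The sketch below shows one possible dispatch; the `HINTS` table and `summarize_switch` are hypothetical, and the hint wording is illustrative rather than anything ctxd prints.

```python
import json

# Next-step hints keyed by the error codes listed above.
# The hint text is an assumption for illustration; ctxd does not emit it.
HINTS = {
    "not_a_git_repo": "run the command from inside a git working tree",
    "branch_not_found": "list branches and retry with an existing name",
    "dirty_tree": "commit or stash local changes before switching",
    "git_not_found": "install git or fix PATH",
}


def summarize_switch(report_json: str) -> str:
    """Condense a git-switch report into one line for an agent transcript."""
    report = json.loads(report_json)
    if report.get("ok"):
        r = report["result"]
        branch = r["branch"] if r["branch"] is not None else "(detached HEAD)"
        state = "dirty" if r["dirty"] else "clean"
        return f"on {branch}, {state}, ahead {r['ahead']}, behind {r['behind']}"
    code = (report.get("error") or {}).get("code", "unknown")
    return f"switch failed ({code}): {HINTS.get(code, 'inspect error.message')}"
```

Note that `ahead` and `behind` of 0 can mean either "in sync" or "no upstream configured", so a caller that needs to tell the two apart has to check the upstream separately.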

`ctxd env-set`

Set one or more environment variables in the ctxd child process and report the resulting set map and diff (added / changed):

ctxd env-set FOO=bar BAZ=qux

{
  "ok": true,
  "cmd": "env-set",
  "args": ["FOO=bar", "BAZ=qux"],
  "result": {
    "set": {"FOO": "bar", "BAZ": "qux"},
    "diff": {
      "added": ["BAZ"],
      "changed": ["FOO"]
    }
  },
  "elapsed_ms": 1
}

set is the final KEY → value map of this invocation (last-write-wins when the same KEY is repeated). diff.added lists keys that were not previously in the environment; diff.changed lists keys whose value differed from the previous value. Keys whose value is unchanged appear in neither list.

The argument format is KEY=VAL. The first = is the separator, so values may contain = (e.g. URL=http://x?a=b). An empty value (KEY=) is valid. On failure, error.code is invalid_args (missing =, empty KEY, or zero arguments) or exec_failed (the underlying os.Setenv call failed).

The parent shell's environment is not modified — pass the resolved set to the next command, or read the JSON to know which variables the child process saw.
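The parsing and diff rules above can be restated as a short Python sketch. This is a model of the described behaviour, not the Go implementation: `env_set` and `child_env` are hypothetical names, errors are raised as `ValueError` instead of being reported as `invalid_args`, and the ordering of the `added` / `changed` lists (first appearance on the command line) is an assumption.

```python
from typing import Mapping, Sequence


def env_set(previous: Mapping[str, str], assignments: Sequence[str]) -> dict:
    """Compute the set map and diff for KEY=VAL arguments against `previous`."""
    if not assignments:
        raise ValueError("invalid_args: zero arguments")
    new_values: dict[str, str] = {}
    for item in assignments:
        key, sep, value = item.partition("=")  # only the first "=" separates
        if not sep or not key:
            raise ValueError(f"invalid_args: {item!r}")
        new_values[key] = value  # last write wins for a repeated KEY
    added = [k for k in new_values if k not in previous]
    changed = [k for k in new_values if k in previous and previous[k] != new_values[k]]
    return {"set": new_values, "diff": {"added": added, "changed": changed}}


def child_env(previous: Mapping[str, str], result: Mapping[str, dict]) -> dict[str, str]:
    """Environment for the next command: the old one overlaid with `set`."""
    return {**previous, **result["set"]}


if __name__ == "__main__":
    before = {"FOO": "old"}
    report = env_set(before, ["FOO=bar", "BAZ=qux"])
    print(report)
    print(child_env(before, report))
```

Since the parent shell is untouched, `child_env` shows one way to carry the variables forward, for example as the `env=` argument of `subprocess.run`.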

Installation

Work in progress. Binary releases coming soon.

Claude Code Plugin (recommended for Claude Code users):

claude plugins:install @hummer98/ctxd-claude-plugin

Installs both the ctxd binary and the Skill that teaches Claude to use it.

Standalone CLI:

# npm install -g @hummer98/ctxd  # (coming soon — Go binary npm wrapper not yet published)
brew install hummer98/tap/ctxd

Skill bundle

ctxd ships with a SKILL.md compliant with the Anthropic Agent Skills specification.

The skill does not enforce usage. It nudges: when the agent reaches for cd, export, or git checkout, the skill surfaces the ctxd equivalent and explains what context would be gained. Adoption is the agent's choice.

Compatible with Claude Code, OpenCode, Codex, Cursor, Gemini CLI, and any host that supports Agent Skills.

Design principles

Don't shadow existing commands — new command names only, no aliases over cd
JSON by default, human optional — --human for readable output
Postconditions are opt-in — useful when you want them, invisible when you don't
Narrow and deep — top 20–30 commands done well, not full POSIX coverage
Pluggable — users can add their own declarative wrappers

Development

Prerequisites

Go 1.26 or later

Build

go build -o ctxd ./cmd/ctxd

Run

./ctxd --version
./ctxd --help

Test

go test ./...

Architecture decisions

See docs/adr/ for design decisions (CLI framework selection etc.).

Eval harness

evals/ contains a SKILL adherence harness that drives a real claude process inside an isolated cmux workspace and measures whether the agent reaches for ctxd chdir / ctxd git-switch / ctxd env-set when the SKILL says it should.

bash evals/run.sh
# or override the per-scenario trial count
EVAL_N=1 bash evals/run.sh
# or pin a different model (default: claude-opus-4-7)
EVAL_MODEL=claude-sonnet-4-5 bash evals/run.sh

Each trial spins up claude --settings <per-trial>.json so a Stop hook touches a sentinel file when the session ends, and a PostToolUse hook appends each tool_use to session-<id>-<trial>.tools.jsonl. The runner waits on the sentinel instead of scraping the screen, and summarize.py reads tool_use from the hook JSONL first (falling back to the raw Claude session JSONL if the hook output is empty).
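The sentinel approach amounts to polling for a file with a deadline. The real runner is a shell script (evals/run.sh), so the following Python version is only a sketch of the idea; `wait_for_sentinel`, its parameters, and the file name in the demo are hypothetical.

```python
import tempfile
import time
from pathlib import Path


def wait_for_sentinel(path: Path, timeout_s: float, poll_s: float = 0.5) -> bool:
    """Return True once `path` exists, or False if `timeout_s` elapses first."""
    deadline = time.monotonic() + timeout_s
    while True:
        if path.exists():
            return True
        if time.monotonic() >= deadline:
            return False
        time.sleep(poll_s)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        sentinel = Path(tmp) / "session-demo-1.done"  # hypothetical file name
        sentinel.touch()  # stands in for the Stop hook firing
        print(wait_for_sentinel(sentinel, timeout_s=1.0))
```

Using `time.monotonic()` for the deadline keeps the timeout correct even if the system clock is adjusted during a long trial.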

Outputs land in evals/results/<UTC-timestamp>/:

session-<id>-<trial>.jsonl — raw Claude Code session JSONL (one per trial, git-ignored)
session-<id>-<trial>.meta.json — exit_status, wall time, session id (git-ignored)
session-<id>-<trial>.tools.jsonl — PostToolUse hook output, one tool_use per line (git-ignored)
session-<id>-<trial>.done — Stop hook sentinel marking session completion (git-ignored)
session-<id>-<trial>.settings.json — per-trial claude --settings payload wiring the hooks (git-ignored)
summary.md — overall and per-scenario success rate, plus the first failing tool_use quoted for context. Header records plugin version, git SHA, git branch, claude version, and model so each run is uniquely traceable.

Cross-run trend lives in evals/results/index.md and evals/results/index.csv (one row per run). Both are committed; the heavy JSONL / meta files are not — re-running the harness regenerates them.

The plugin version comes from .claude-plugin/plugin.json and acts as the canonical unit for comparing measurements. See CLAUDE.md for the bump policy when SKILL.md changes.

Cost / time budget: each trial costs a few cents of model usage. With the default EVAL_N=3 × 5 scenarios (15 trials), a run costs roughly a few dimes to ~$1 and takes 5–10 minutes wall-clock on the default EVAL_MODEL=claude-opus-4-7. Switching to a faster or cheaper model via EVAL_MODEL changes both cost and time. See evals/scenarios.jsonl for the prompts and expected patterns.

evals/.eval-plugin/ (plugin shim that wires skills/ctxd into the Skills loader) is git-ignored — the harness regenerates the shim on every run, dynamically writing the version from .claude-plugin/plugin.json.

Adherence over plugin versions

This section tracks how often the agent reaches for ctxd when the SKILL says it should, across plugin versions. The figures come from the evals/run.sh harness and are broken down per command family in the table below.

| model | plugin version | N | trials | overall | chdir | git-switch | env-set | notes |
|---|---|---:|---:|---:|---:|---:|---:|---|
| claude-opus-4-7 | 0.1.0 | 3 | 15 | 0.0% | 0/6 | 0/6 | 0/3 | Initial baseline; hook-based harness landed (T013–T015) |
| claude-opus-4-7 | 0.1.1 | 3 | 15 | 6.7% | 0/6 | 0/6 | 1/3 | SKILL.md trigger reinforced (description, ❌→✅ examples) (T016) |
| claude-opus-4-7 | 0.1.2 | 3 | 15 | 53.3% | 5/6 | 1/6 | 2/3 | disambiguation + NEVER phrasing + Precondition section (T017) |
| claude-opus-4-7 | 0.1.3 | 3 | 15 | 100.0% | 6/6 | 6/6 | 3/3 | pattern matcher tightened + scenario setup hooks + plugin author (T018) |
| claude-opus-4-7 | 0.1.3 | 10 | 50 | 98.0% | 20/20 | 19/20 | 10/10 | Variance check at N=10 (T019) |
| claude-opus-4-7 | 0.2.0 | 3 | 15 | 100.0% | 6/6 | 6/6 | 3/3 | SKILL.md Postcondition section rewritten to match T010 final implementation (--expect DSL + Result.Data preserved on failure) (T023) |

N is trials per scenario; trials is N × 5 scenarios. Each cell shows passes / trials for that command family. Model is part of the run identity — switching to a different EVAL_MODEL (e.g. claude-sonnet-4-5) requires re-running the baseline; rows are not directly comparable across models. The latest baseline lives in evals/results/index.md — the table here is updated by hand, see CLAUDE.md.
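The overall column is the pooled pass rate across the three command families, which can be checked with a few lines of Python. `overall_rate` is a hypothetical helper written for this check and is not part of the harness.

```python
def overall_rate(cells: list[str]) -> str:
    """Combine per-family "passes/trials" cells into an overall percentage."""
    passes = trials = 0
    for cell in cells:
        p, t = cell.split("/")
        passes += int(p)
        trials += int(t)
    return f"{100 * passes / trials:.1f}%"


if __name__ == "__main__":
    print(overall_rate(["5/6", "1/6", "2/3"]))       # 0.1.2 row: prints 53.3%
    print(overall_rate(["20/20", "19/20", "10/10"]))  # N=10 row: prints 98.0%
```

For the 0.1.2 row this is (5 + 1 + 2) / (6 + 6 + 3) = 8 / 15, which rounds to 53.3%.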

Efficiency over plugin versions

What did adopting the SKILL cost in tokens, tool-uses, and wall time? Adherence (above) measures whether the agent reaches for ctxd; efficiency measures what it took to get there. Numbers are per-trial averages collected by the same evals/run.sh harness, with EVAL_N=3 × 5 scenarios.

| model | plugin version | avg_tool_uses | avg_input_tokens | avg_output_tokens | avg_wall_ms | notes |
|---|---|---:|---:|---:|---:|---|
| claude-opus-4-7 | ≤ 0.2.0 | – | – | – | – | Pre-T024 runs; raw session JSONL was git-ignored and pruned, so retroactive efficiency numbers are unavailable. |

avg_input_tokens is the sum of fresh input_tokens across all assistant messages in a trial (cache writes and reads are tracked separately in each run's summary.md). avg_wall_ms has integer-second precision (it is the raw wall_seconds from meta.json × 1000). The Stage 1 baseline will be seeded the next time evals/run.sh runs after this commit; until then, the row above documents that nothing was recoverable from past runs.

Status

Early development. The design is settled; the implementation is not.

Contributions, feedback, and use-case reports welcome via issues.

License

MIT