talking-cli

v0.3.0

Published

a month ago

A linter that audits agent skills: is your CLI mute?

drdexter6000

cli agent skill mcp lint

Talking CLI

Tool silence is a design defect. Distributed Prompting is the fix.

The self-audit badge shows talking-cli's own audit score (100/100). CI enforces ≥80 on every PR.

Sound familiar?

Your SKILL.md is 400 lines. Half of it describes what the agent should do after a specific tool returns — "if zero results, broaden the query," "if ambiguous, ask the user," "this field means X, not Y."

The agent loads all 400 lines every single turn, but most of that guidance only matters 10% of the time. The other 90%, it's paying attention rent on scenarios that didn't happen.

Meanwhile, your tools return raw JSON and say nothing. No hint about what just happened. No signal that results were sparse or the query was ambiguous. The tools are mute, so all the guidance gets shoved upstream into SKILL.md, which slowly bloats into a monologue describing every possible outcome — most of which the agent promptly forgets or ignores.

Talking CLI gives your tools a voice. When the agent calls, the tool talks back — with the right hint, at the right moment, inside the response. We call this Prompt-On-Call: guidance that surfaces only when a tool is called, relevant only to what just happened.

The cumulative effect is Distributed Prompting: a prompt surface spread across every tool response, not crammed into one bloated document.
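As a rough illustration of Prompt-On-Call, here is a minimal sketch of a tool that stays silent in the common case and attaches a hint only when the outcome calls for one. The `SearchResult` shape, the `search` function, and the in-memory `CORPUS` are hypothetical names invented for this example; only the idea of a `hints` field alongside the raw data comes from this README.

```typescript
// Hypothetical tool result shape: raw data plus an optional hints array.
interface SearchResult {
  results: string[];
  hints?: string[];
}

// Toy in-memory corpus standing in for a real backend (assumption).
const CORPUS: string[] = [
  "install guide for linux",
  "install guide for macos",
  "release notes 0.3.0",
];

// Returns documents that contain every whitespace-separated term in `query`.
// Hints are attached only when the outcome needs them.
function search(query: string): SearchResult {
  const terms = query.toLowerCase().split(/\s+/).filter((t) => t.length > 0);
  const results = CORPUS.filter((doc) => terms.every((t) => doc.includes(t)));

  if (results.length === 0) {
    // Empty result: tell the agent what to try next.
    return {
      results,
      hints: ["No matches. Try broadening your query with fewer terms."],
    };
  }
  if (results.length > 1 && terms.length === 1) {
    // One broad term matched several documents: flag the ambiguity.
    return {
      results,
      hints: ["Several documents match. Ask the user which one they mean."],
    };
  }
  // Common case: the data speaks for itself, so no hint and no extra tokens.
  return { results };
}
```

With this shape, the "if zero results, broaden the query" sentence no longer has to live in SKILL.md; it travels with the one response where it applies.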

Standing on shoulders. CLI is the native interface for AI agents — Carmack, CodeAct (Wang et al., ICML 2024), and Karpathy crystallized it.

Progressive disclosure as a skill-loading architecture was formalized by Anthropic (Oct 2025) and is now an open standard. Anthropic also advocates "steering agents with helpful instructions in tool responses" — but only as a paragraph-level best practice. Nobody has named it, budgeted it, audited it, or proposed it as a protocol-level primitive. That gap is what Talking CLI fills.

What this project is

Talking CLI is built around one idea: Distributed Prompting — moving guidance from static SKILL.md into the moment of invocation.

Methodology — PHILOSOPHY.md: Four Channels (C1–C4), Four Rules of Talk, a prompt budget, and five anti-patterns.
Evidence — a reproducible 2×2 ablation benchmark across 15 curated tasks (published below).
Standard — a proposed agent_hints convention we are taking to the MCP spec, backed by the data.

The linter (talking-cli audit / audit-mcp) is the probe, not the hero. It's how you reproduce the audit numbers on your own skill.

Core claim

Prompt Surface = SKILL.md ∪ {tool_result.hints} — two halves, one budget.

Anything you write into SKILL.md that only applies after a specific tool call is mispriced: it costs every turn and earns only on a small fraction of turns. Tool hints fix the pricing.
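The pricing argument can be made concrete with a small back-of-the-envelope calculation. The numbers below (a 40-token piece of guidance, relevant on 10% of turns, over a 50-turn session) are invented for illustration and are not measurements from this project; the sketch also ignores how context accumulates across turns.

```typescript
// Guidance that only applies after one specific tool outcome.
interface Guidance {
  tokens: number;      // size of the guidance text (hypothetical)
  triggerRate: number; // fraction of turns where it is relevant, 0..1
}

// Cost when the guidance lives in SKILL.md: paid on every turn.
function staticCost(g: Guidance, turns: number): number {
  return g.tokens * turns;
}

// Expected cost when the guidance rides on the tool response:
// paid only on turns where the triggering outcome occurs.
function hintCost(g: Guidance, turns: number): number {
  return g.tokens * g.triggerRate * turns;
}

// Illustrative numbers only.
const emptyResultAdvice: Guidance = { tokens: 40, triggerRate: 0.1 };
const turns = 50;

console.log(staticCost(emptyResultAdvice, turns)); // 2000
console.log(hintCost(emptyResultAdvice, turns));   // 200
```

Under these assumptions the same sentence costs ten times less as a hint, because its cost scales with how often it is actually relevant.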

How it works

The Prompt Budget Shift

graph LR
    subgraph Before ["❌ Before: Mute CLI"]
        A1[SKILL.md<br/>400+ lines] --> A2[Agent]
        A3[Tool returns<br/>raw JSON only] --> A2
        A1 -.->|"guidance shoved upstream"| A3
    end

    subgraph After ["✅ After: Distributed Prompting"]
        B1[SKILL.md<br/>&lt; 150 lines] --> B2[Agent]
        B3[Tool returns<br/>JSON + hints] --> B2
    end

    Before -->|Audit + Optimize| After

Four Heuristics, One Score

graph TD
    H1[H1 · Document Budget<br/>SKILL.md ≤ 150 lines]
    H2[H2 · Fixture Coverage<br/>error + empty scenarios]
    H3[H3 · Structured Hints<br/>hints / suggestions / guidance]
    H4[H4 · Actionable Guidance<br/>specific, actionable content]

    H1 & H2 & H3 & H4 --> Score[Total Score<br/>0–100]
    Score -->|≥ 80| Pass[✅ PASS]
    Score -->|< 80| Fail[❌ FAIL]
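To make the four checks tangible, the sketch below re-implements them in a simplified form. It is not talking-cli's actual code: the `Fixture` shape, the `audit` function, and the equal 25-point weighting are assumptions made for this example. The 150-line budget, the `hints` / `suggestions` / `guidance` field names, the error and empty-result scenarios, the ≥ 10 character rule, and the pass threshold of 80 are taken from this README.

```typescript
// Hypothetical fixture shape: one recorded tool response per scenario.
type Scenario = "error" | "empty";
interface Fixture {
  tool: string;
  scenario: Scenario;
  response: Record<string, unknown>;
}

const HINT_FIELDS = ["hints", "suggestions", "guidance"];
const LINE_BUDGET = 150;
const MIN_HINT_CHARS = 10;
const PASS_THRESHOLD = 80;
const REQUIRED_SCENARIOS: Scenario[] = ["error", "empty"];

// Collect every non-empty string found under a recognized hint field.
function hintTexts(response: Record<string, unknown>): string[] {
  const texts: string[] = [];
  for (const field of HINT_FIELDS) {
    const value = response[field];
    const items: unknown[] = Array.isArray(value) ? value : [value];
    for (const item of items) {
      if (typeof item === "string" && item.trim().length > 0) {
        texts.push(item.trim());
      }
    }
  }
  return texts;
}

function audit(skillLines: number, tools: string[], fixtures: Fixture[]) {
  // H1: SKILL.md stays within the line budget.
  const h1 = skillLines <= LINE_BUDGET;
  // H2: every tool has both an error and an empty-result fixture.
  const h2 = tools.length > 0 && tools.every((tool) =>
    REQUIRED_SCENARIOS.every((s) =>
      fixtures.some((f) => f.tool === tool && f.scenario === s)));
  // H3: every fixture response carries at least one hint field.
  const h3 = fixtures.length > 0 &&
    fixtures.every((f) => hintTexts(f.response).length > 0);
  // H4: every hint meets the minimum length (a crude proxy for "actionable").
  const h4 = h3 && fixtures.every((f) =>
    hintTexts(f.response).every((t) => t.length >= MIN_HINT_CHARS));
  // Equal 25-point weights are an assumption of this sketch.
  const score = [h1, h2, h3, h4].filter(Boolean).length * 25;
  return { h1, h2, h3, h4, score, pass: score >= PASS_THRESHOLD };
}
```

Note that with the assumed equal weights, three passing heuristics yield 75, so passing at 80 requires all four; the real tool may weight them differently.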

Quick Start

# Audit your skill — coach mode (plain language, actionable)
npx talking-cli audit ./my-skill

# CI mode — machine-readable, exit code driven
npx talking-cli audit ./my-skill --ci

# JSON mode — structured output for tooling
npx talking-cli audit ./my-skill --json

# Audit an MCP server — static analysis (fast, safe)
npx talking-cli audit-mcp ./my-mcp-server

# Deep audit — runtime heuristics (spawns server)
# ⚠️ Only use --deep on servers you trust. See SECURITY.md.
npx talking-cli audit-mcp ./my-mcp-server --deep

# Generate optimization plan (plan-only, never touches source files)
npx talking-cli optimize ./my-skill

# Scaffold a new skill directory with templates that pass audit
npx talking-cli init my-skill
cd my-skill
npx talking-cli audit .

All commands are fully local — no API key required.

What it looks like

Coach mode running against a bloated, mute skill:

Score: 0/100
Yikes. Your CLI is so quiet I can hear the tokens screaming in agony.

H1 · Line Count · FAIL
Your SKILL.md is 165 lines. The budget is 150.
→ Just 15 lines over. Tighten the prose and migrate post-call guidance to tool hints.

H2 · Hint Coverage · FAIL
1 tool(s) have zero fixtures. They don't speak at all: search
→ Add talking-cli-fixtures for [search]. One error, one empty-result scenario.

H3 · Structured Hints · FAIL
0/0 passed fixtures contain hint fields.
→ Make your tools return a "hints" or "suggestions" field alongside raw data.

H4 · Actionable Guidance · FAIL
0/0 hint fields have actionable content.
→ Hints should be specific. "Try broadening your query with fewer filters" is actionable.

---
Fix the issues above, then run npx talking-cli audit again to see your new score.

(The real output is colored. We just can't show chalk in a code block.)

The finding: MCP Ecosystem Audit

We ran talking-cli audit-mcp --deep against 4 official Anthropic MCP servers across 68 error / empty-result scenarios. Number of scenarios that returned actionable guidance:

0 / 68.

Static analysis of 823 Composio GitHub tools: same result. The MCP ecosystem today treats tool output as a data pipe, not a dialogue participant.

| Server | Tools | Scenarios | M3 · Guidance |
|--------|-------|-----------|---------------|
| server-filesystem | 11 | 21 | 0 |
| server-everything | 13 | 13 | 0 |
| server-memory | 9 | 9 | 0 |
| server-github | 25 | 25 | 0 |
| Total | 58 | 68 | 0 / 68 |

2×2 Ablation Benchmark (GLM-5.1)

We ran a 2×2 ablation (Full/Lean Skill × Mute/Hinting Tools) on GLM-5.1 across 15 curated tasks:

| Cell | Skill | Server | Pass Rate | Avg Input Tokens |
|------|-------|--------|-----------|------------------|
| 1 | Full Skill (873 lines) | Mute Tools | 7/15 (47%) | 122,562 |
| 2 | Full Skill | Hinting Tools | 8/15 (53%) | 96,829 |
| 3 | Lean Skill (168 lines) | Mute Tools | 8/15 (53%) | 54,078 |
| 4 | Lean Skill | Hinting Tools | 11/15 (73%) | 40,815 |

Key findings:

Combined effect (Cell 4 vs Cell 1): −67% tokens, +26pp pass rate — both efficiency and quality improve.
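Skill compression alone (Cell 3 vs Cell 1): −56% tokens, +6pp pass rate
Tool hints alone (Cell 2 vs Cell 1): −21% tokens, +6pp pass rate
Synergistic interaction: on pass rate, the combined gain (+26pp) exceeds the sum of the two individual gains (+6pp each)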
Verdict: GREAT SUCCESS
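The headline numbers can be recomputed directly from the benchmark table. The sketch below does that arithmetic; the `AblationCell` type and helper names are made up for this example, while the task counts and token averages are copied from the table. Computed from raw task counts, the pass-rate differences come out as 4/15 ≈ 26.7pp and 1/15 ≈ 6.7pp; the +26pp and +6pp figures are the differences of the rounded percentages.

```typescript
interface AblationCell {
  passed: number;         // tasks passed, out of TASKS
  avgInputTokens: number; // average input tokens per task
}

const TASKS = 15;
const c1: AblationCell = { passed: 7, avgInputTokens: 122562 };  // full skill, mute tools
const c2: AblationCell = { passed: 8, avgInputTokens: 96829 };   // full skill, hinting tools
const c3: AblationCell = { passed: 8, avgInputTokens: 54078 };   // lean skill, mute tools
const c4: AblationCell = { passed: 11, avgInputTokens: 40815 };  // lean skill, hinting tools

// Relative change in average input tokens, in percent.
function tokenDeltaPct(from: AblationCell, to: AblationCell): number {
  return ((to.avgInputTokens - from.avgInputTokens) / from.avgInputTokens) * 100;
}

// Change in pass rate, in percentage points.
function passDeltaPp(from: AblationCell, to: AblationCell): number {
  return ((to.passed - from.passed) / TASKS) * 100;
}

// Interaction term of the 2×2 design on pass rate:
// combined gain minus the sum of the two single-factor gains.
function interactionPp(): number {
  return passDeltaPp(c1, c4) - (passDeltaPp(c1, c2) + passDeltaPp(c1, c3));
}

console.log(Math.round(tokenDeltaPct(c1, c4))); // -67
console.log(Math.round(tokenDeltaPct(c1, c3))); // -56
console.log(Math.round(tokenDeltaPct(c1, c2))); // -21
console.log(interactionPp() > 0);               // true
```

A positive interaction term means the two changes together passed more tasks than adding up their separate gains would predict.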

Why compression helps: The 873-line skill, which sits at P99.5 of real-world skill sizes, consumes ~8,700 tokens and accumulates across turns, crowding task data toward the far end of the context window, where attention is weakest. SkillsBench (arXiv 2602.12670, 36,000 real-world skills) independently found that comprehensive skills at P99.5 degrade performance by 2.9pp while moderate skills improve it by 18.8pp, which confirms the direction at ecosystem scale.

Historical context and reproduction instructions → benchmark/. Full methodology and limitations → docs/BENCHMARK-METHODOLOGY.md.

The Methodology

Talking CLI is the reference implementation of Distributed Prompting: every tool response is a designed prompt surface, not a data dump. Prompt-On-Call is the concrete mechanism — guidance that arrives when the tool is called, relevant to what just happened. The cumulative effect across every tool in the system is Distributed Prompting.

PHILOSOPHY.md — the methodology: Four Channels, Four Rules, a budget, and five anti-patterns.
Adversarial Case Study — where Distributed Prompting fails, and what to do about it.

What's next

Cross-model validation — replicating the 2×2 ablation on Claude and other providers
MCP spec proposal — RFC for a first-class agent_hints field in tool responses
H4 semantic upgrade — replacing the ≥ 10 chars heuristic with a lightweight classifier
Real-world validation — auditing and benchmarking real MCP servers with before/after results

See PHILOSOPHY.md §Evidence for the full benchmark data including historical DeepSeek-V3.2 results and MiniMax M2.7 validation.

License

MIT