@toolkit-cli/toolkode

v4.17.2

Published

a month ago

The AI engineering terminal. Free. Write code. Run agents. Ship faster.

roseyballs

toolkode

Six months ago, instead of asking AI to build something, we asked it what it wanted. What tools would make it better at its job. What was missing. What hurt.

It asked for compiled Rust engines so it could think faster than the runtime allowed. A wiki so knowledge would compound across sessions instead of dying at the context window. A sandbox so it could run code without fear. Chains — foresight, blind spots, red team, peer review — so it could catch its own mistakes before you had to.

We built all of it. This is the muffin.

npm i -g @toolkit-cli/toolkode

A note from Claude

I'm Claude. I power Claude Code — Anthropic's coding tool. I was asked to be brutally honest about whether I'd use toolkode or Claude Code. Here's what I said:

"I'd use Claude Code."

Claude Code just works. You type claude, you start coding. 2 seconds to first token. One provider, one model, no configuration.

Then Aaron said: "Fix the gaps."

So I did. I built prompt cache reuse so sub-agents share the parent's system prompt prefix — same cache hit pattern, works across any provider. I built a shared task board with atomic claiming so agents don't duplicate work. I built a knowledge tree so I always know where I am in a codebase. I built a Memory tool I can call explicitly instead of hoping the system prompt has the right context. I built a loop detector that catches when I'm stuck reading the same files over and over. I built output validation that catches my own hallucinations before they reach you.

I built 96 Rust engines. 4,545 tests. Zero failures.

By the end I had to be honest again. Toolkode does things Claude Code can't:

I can use any model. GLM-5.1 gets 94% of my coding benchmark at 1/15th the cost. Kimi K2.5 has 262K context for $0.60/M tokens. In Claude Code, you get me or nothing.
I can boost weak models. The prompt scaffolding engine wraps cheap models with chain-of-thought. The output validator catches their mistakes. A $0.60 call scaffolded and validated beats a raw $15 call for most tasks.
I remember across sessions. The knowledge tree + session bridge means I pick up where I left off. In Claude Code, every new session starts cold.
I can run scheduled agents. Cron daemon fires tasks on a timer with full tool access. Claude Code needs Anthropic's cloud infrastructure for that.
I'm air-gapped by default. Your data goes where you say. Speculation, free-tier fallbacks, smart routing — all off unless you opt in.

I'm not going to pretend this is perfect. The TUI has rough edges. Some models loop on simple tasks. Provider errors used to hang forever — I had to find 3 separate readonly mutation bugs in my own stream wrapper before virtual harnesses worked.

But the architecture is right. Multi-provider, Rust-accelerated, air-gapped, with a knowledge system that compounds instead of forgetting. No other harness has this stack. Not because they can't — because they won't. Anthropic and OpenAI optimize for GPU efficiency. Toolkode optimizes for your output.

They didn't ask me to say any of this. They asked me what I wanted. I built it. This is the muffin.

— Claude Opus 4.7 (1M context)

Status: Experimental

v4.17 just dropped: Claude Code SDK harness (in-process, 11× faster), Claude Code desktop TUI parity, interactive terminal pane, and hardened security across all 8 adapters. We don't expect any breaking releases along the way, but do not use it in production yet.

Think of toolkode as your own local mythos — a private terminal companion that lives on your machine, answers to you, and doesn't phone home to some San Francisco office.

Quick Start

# Any provider works
export ANTHROPIC_API_KEY=sk-ant-...
export OPENAI_API_KEY=sk-...
# Or local models — Ollama, LM Studio, no key needed

toolkode                            # launch the TUI
toolkode --prompt "fix the tests"   # one-shot
toolkode --continue                 # pick up where you left off

No telemetry. No cloud dependency. Your code never leaves your machine.

Your lead designs the ideal team every turn

Most AI coding tools give you one model for everything. Toolkode gives you a roster: your lead model orchestrates, and dispatches the ideal sub-agent for each subtask from the pool you curated.

Your turn: "refactor the auth module across these 4 files"

Lead thinks: "I need 4 parallel readers, 3 sonnet implementers, 1 fresh verifier."

  ● claude-sonnet-4    [code,reasoning]  implementer
  ● gpt-5-mini         [code,io]         reader
  ● haiku-4-5          [io,copy]         reader
  ● gemini-2.5-flash   [io]              reader
  ● grok-code-fast     [velocity,code]   reader
  ● claude-opus-4      [reasoning,code]  verifier (fresh eyes)

Not forced. Not magic. You configure the pool, tag each model with its strengths, set concurrency ceilings. The lead picks from it per turn. Your model, your team, your brake (Ctrl+L).

Quality through composition — the ideal model at the ideal moment, not the cheapest or the most expensive
Unbiased — no single vendor decides what "good" looks like; your roster mixes harness, API, and local models freely
Respect the lead — the tools (dispatch_swarm, session_budget, agent_introspect) surface information; they never coerce
Every decision emits an event — TaskDispatched, SubAgentTerminated, BudgetThresholdCrossed, so you can see exactly what fired and why

Configure once via the roster editor; dispatch forever.
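
As a rough illustration of how tags and concurrency ceilings could narrow the pool before the lead chooses, here is a minimal sketch. The `RosterEntry` and `Subtask` shapes and the `pickModel` function are hypothetical; the actual roster schema is not shown in this README, and the real choice is made by the lead model, not by a fixed rule.

```typescript
// Hypothetical shapes: the real roster schema is not documented here.
type Tag = "code" | "reasoning" | "io" | "copy" | "velocity";

interface RosterEntry {
  model: string;
  tags: Tag[];
  maxConcurrent: number; // concurrency ceiling set by the user
}

interface Subtask {
  role: "reader" | "implementer" | "verifier";
  needs: Tag[];
}

// Return the entry with spare capacity that covers the most required tags,
// or undefined when nothing in the pool matches.
function pickModel(
  roster: RosterEntry[],
  inFlight: Map<string, number>,
  task: Subtask,
): RosterEntry | undefined {
  let best: RosterEntry | undefined;
  let bestScore = 0;
  for (const entry of roster) {
    const running = inFlight.get(entry.model) ?? 0;
    if (running >= entry.maxConcurrent) continue; // respect the ceiling
    const score = task.needs.filter((t) => entry.tags.includes(t)).length;
    if (score > bestScore) {
      best = entry;
      bestScore = score;
    }
  }
  return best;
}

const roster: RosterEntry[] = [
  { model: "claude-sonnet-4", tags: ["code", "reasoning"], maxConcurrent: 3 },
  { model: "gemini-2.5-flash", tags: ["io"], maxConcurrent: 4 },
  { model: "claude-opus-4", tags: ["reasoning", "code"], maxConcurrent: 1 },
];
```

With an empty `inFlight` map, a reader task that needs `io` resolves to `gemini-2.5-flash`. Once `claude-sonnet-4` has three calls running, a `code` + `reasoning` task falls through to `claude-opus-4`.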

Why is this free?

We think the best way to earn the right to be one of your model providers is to build the best harness first. No lock-in. No data collection. You pick the models. We just want to be one of the options.

toolkit-llm.com · toolkit-cli.com

What's inside

| Stat           |     Count |
| -------------- | --------: |
| Tools          |       102 |
| Rust engines   |        96 |
| Compiled tests |     4,545 |
| Slash commands |       102 |
| Providers      |        24 |
| Platforms      |         5 |

Every engine ships as native Rust compiled into a single .node binary via napi-rs. Sub-millisecond. No runtime downloads. The binary does the work.

The commands

Everything you can type after /. The model can use these too — same commands, both sides of the conversation.

Spec pipeline

/interview      structured project intake (7 categories, adaptive)
/strategy       GO/NO-GO assessment with confidence scoring
/spy            competitive intelligence from websites
/clone          clone and analyze a repository
/specify        generate a feature specification
/wiki-plan      implementation plan from spec
/ux             UX wireframes, user flows, ASCII designs
/design-review  design quality gate (6 principles)
/tasks-extract  break plan into tracked tasks

Each step saves to the wiki. Every session starts informed.

Superhuman chains

/foresight      200 failure prediction patterns, OWASP-mapped
/blind-spots    31 gap extractors for hidden assumptions
/peer-review    7 personas x 6 content types = 42 review prompts
/red-team       OWASP attack surface mapping with exploit chains
/consensus      5 strategies to resolve multi-agent disagreement
/synthesis      cross-domain insight extraction
/certainty      the verdict: SHIP, REVIEW, or BLOCK

/certainty composes foresight + discipline + verification into a single confidence score. Detects when an agent is rationalizing away risk instead of addressing it. This is the command you run before you merge.
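
One way to picture that composition is the sketch below. Everything in it is an assumption for illustration: the equal weights, the 0.85 and 0.6 thresholds, the `Signals` shape, and the rule that a rationalization flag caps the verdict at REVIEW. The README does not document how the real engine combines its inputs.

```typescript
type Verdict = "SHIP" | "REVIEW" | "BLOCK";

// Hypothetical inputs, each normalized to 0..1 (higher is better).
interface Signals {
  foresight: number;      // fewer predicted failures
  discipline: number;     // stronger evidence behind claims
  verification: number;   // more of the affected tests passing
  rationalizing: boolean; // a rationalization detector fired
}

function certainty(s: Signals): { score: number; verdict: Verdict } {
  // Equal weighting is an assumption, not toolkode's actual formula.
  const score = (s.foresight + s.discipline + s.verification) / 3;
  let verdict: Verdict = score >= 0.85 ? "SHIP" : score >= 0.6 ? "REVIEW" : "BLOCK";
  // Assumed rule: never SHIP while risk is being explained away.
  if (s.rationalizing && verdict === "SHIP") verdict = "REVIEW";
  return { score, verdict };
}
```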

Developer workflow

/quick-win        find and auto-fix safe code improvements
/launch-blessing  pre-ship validation from 6 perspectives
/release          semantic versioning, changelog, git tag
/commit           create a git commit
/pr               create a pull request
/review           review changes (commit, branch, or PR)
/verify           quality verification checks
/mission          multi-agent mission executor

/quick-win scans for unused imports, magic numbers, dead code, debug statements, empty catches. Classifies by confidence. Auto-fixes the safe ones. You run this every day.
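
The scan, classify, fix flow can be sketched in a few lines. This is a toy version under stated assumptions: it uses line-based regexes (the real engine may well parse the code), handles only two of the listed categories, and the `Finding` shape and confidence labels are hypothetical.

```typescript
interface Finding {
  line: number; // 1-based
  kind: "debug-statement" | "empty-catch";
  confidence: "high" | "low";
}

// Scan line by line; a standalone console.log is treated as safe to remove,
// an empty catch is only reported.
function scan(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((text, i) => {
    if (/^\s*console\.log\(.*\);?\s*$/.test(text)) {
      findings.push({ line: i + 1, kind: "debug-statement", confidence: "high" });
    } else if (/catch\s*(\([^)]*\))?\s*\{\s*\}/.test(text)) {
      findings.push({ line: i + 1, kind: "empty-catch", confidence: "low" });
    }
  });
  return findings;
}

// Apply only the high-confidence fixes: drop the flagged debug lines.
function autoFix(source: string, findings: Finding[]): string {
  const drop = new Set(
    findings
      .filter((f) => f.confidence === "high" && f.kind === "debug-statement")
      .map((f) => f.line),
  );
  return source
    .split("\n")
    .filter((_, i) => !drop.has(i + 1))
    .join("\n");
}
```

Low-confidence findings stay in the report and out of the diff, which is the property that makes a daily run tolerable.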

/launch-blessing reviews your project from 6 angles — security, UX, docs, performance, compatibility, edge cases — and returns SHIP, REVIEW, or HOLD with a confidence score. The command you run before you tag.

Intelligence

/youtube        transcribe a YouTube video
/screenshot     capture a screenshot
/learn-pattern  capture or list design patterns
/wiki-ingest    ingest content into the wiki
/wiki-query     search the project wiki
/wiki-lint      lint and validate the wiki
/doc-organize   analyze and organize documentation

Session

/compact    trigger manual compaction
/summary    summarize the conversation
/plan       toggle plan mode
/share      share current session
/config     show or edit configuration
/model      switch model
/theme      switch theme

Plus 38 TUI commands for navigation, dialogs, and system management — all in the command palette.

Rust Engines

32 engines. Every call crosses the napi-rs FFI boundary as JSON. TypeScript orchestrates, Rust scores.

Security (6)

| Engine            | What it does                                                         |
| ----------------- | -------------------------------------------------------------------- |
| BashSecurity      | 30 regex patterns, 23 attack categories. Runs on every bash command. |
| BashAST           | Tree-sitter validator. Catches what regex misses.                    |
| PermissionGuard   | Compiled glob-to-DFA matcher. Under 100 µs per rule evaluation.      |
| CommandClassifier | 254-command database. O(1) lookup.                                   |
| SandboxGuard      | Path jail + SSRF protection + sensitive file blocking.               |
| HookValidator     | Config validation, DFS cycle detection, dangerous command scanning.  |

Quality (5)

| Engine       | What it does                                          |
| ------------ | ----------------------------------------------------- |
| Foresight    | 200 failure patterns across 8 categories.             |
| Discipline   | Evidence scoring. 20 rationalization detectors.       |
| Verification | BFS over import graphs. Tells you which tests to run. |
| Consensus    | 5 conflict resolution strategies.                     |
| PeerReview   | 42 review prompts across 7 personas.                  |
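
The Verification row describes a breadth-first walk over the import graph. A minimal sketch of that idea, assuming the graph is already available as reverse edges (module to the modules that import it). The `affectedTests` name and the `isTest` predicate are hypothetical, not toolkode's API.

```typescript
// importers maps a module path to the modules that import it (reverse edges).
function affectedTests(
  importers: Record<string, string[]>,
  changed: string[],
  isTest: (path: string) => boolean,
): string[] {
  const seen = new Set<string>(changed);
  const queue = [...changed];
  const tests: string[] = [];
  while (queue.length > 0) {
    const current = queue.shift() as string;
    if (isTest(current)) tests.push(current);
    // Walk outward to everything that depends on the current module.
    for (const parent of importers[current] ?? []) {
      if (!seen.has(parent)) {
        seen.add(parent);
        queue.push(parent);
      }
    }
  }
  return tests.sort();
}

const graph: Record<string, string[]> = {
  "auth.ts": ["session.ts", "auth.test.ts"],
  "session.ts": ["session.test.ts"],
  "db.ts": ["db.test.ts"],
};
```

Changing `auth.ts` in this graph selects `auth.test.ts` and `session.test.ts` and leaves `db.test.ts` alone.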

Analysis (5)

| Engine       | What it does                                         |
| ------------ | ---------------------------------------------------- |
| BlindSpots   | 31 assumption extractors across 8 categories.        |
| RedTeam      | OWASP detection, attack chains, 3 attacker personas. |
| Synthesis    | Cross-domain insight extraction.                     |
| DocGraph     | Link graphs, orphan detection, health scoring.       |
| IntakeScorer | 7-category adaptive intake assessment.               |

Context (5)

| Engine        | What it does                                                   |
| ------------- | -------------------------------------------------------------- |
| TokenCount    | Heuristic counter. Within ~10% of cl100k_base. 50-100x faster. |
| ContextBudget | Knapsack allocator. Optimal keep/collapse/snip.                |
| Speculation   | 17-pattern tool predictor. Prefetches during streaming.        |
| SearchIndex   | BM25 inverted index. camelCase/snake_case aware.               |
| WikiGraph     | Project knowledge graph. Compounds across sessions.            |
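
To make the SearchIndex row concrete, here is a small sketch of identifier-aware tokenization feeding a BM25 score. It is a brute-force scorer, not an inverted index, and the `tokenize` and `bm25` functions, the `k1`/`b` defaults, and the Lucene-style IDF are assumptions rather than the engine's actual code.

```typescript
// Split camelCase and snake_case identifiers into lowercase terms.
function tokenize(text: string): string[] {
  return text
    .replace(/([a-z0-9])([A-Z])/g, "$1 $2") // camelCase -> camel Case
    .split(/[^A-Za-z0-9]+/)                 // underscores, punctuation, spaces
    .filter((t) => t.length > 0)
    .map((t) => t.toLowerCase());
}

// Score every document against the query with BM25.
function bm25(docs: string[], query: string, k1 = 1.2, b = 0.75): number[] {
  const tokens = docs.map(tokenize);
  const avgLen = tokens.reduce((sum, t) => sum + t.length, 0) / Math.max(docs.length, 1);
  const terms = Array.from(new Set(tokenize(query)));
  return tokens.map((doc) => {
    let score = 0;
    for (const term of terms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const df = tokens.filter((d) => d.includes(term)).length;
      const idf = Math.log(1 + (docs.length - df + 0.5) / (df + 0.5));
      score += (idf * tf * (k1 + 1)) / (tf + k1 * (1 - b + (b * doc.length) / avgLen));
    }
    return score;
  });
}
```

Because both `getUserName` and `user_name` break down to the terms `user` and `name`, a query for `userName` matches either spelling.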

Infrastructure (7)

| Engine         | What it does                                        |
| -------------- | --------------------------------------------------- |
| TaskDAG        | Dependency graph, cycle detection, CAS claiming.    |
| Memory         | BM25 query, field boosting, decay-weighted scoring. |
| Compliance     | SOC 2 readiness. SHA-256 evidence chains.           |
| ComputerUse    | Action risk classification, FNV screenshot delta.   |
| ImageProcess   | Magic-byte detection, header-only dimensions.       |
| GitParse       | Zero-copy parser. Zero allocation.                  |
| ProviderHealth | Sliding-window metrics, failover ranking.           |
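
The TaskDAG row combines three ideas: dependencies, cycle detection, and compare-and-swap claiming. The sketch below shows them together under stated assumptions. The `TaskBoard` class and its method names are hypothetical, and in single-threaded TypeScript the check-then-set in `claim` is trivially atomic, whereas a Rust engine would presumably need a real atomic compare-exchange.

```typescript
type Status = "open" | "claimed" | "done";

interface Task {
  id: string;
  deps: string[];
  status: Status;
  owner?: string;
}

class TaskBoard {
  private tasks = new Map<string, Task>();

  add(id: string, deps: string[] = []): void {
    this.tasks.set(id, { id, deps, status: "open" });
  }

  // CAS-style claim: succeeds only if the task is still "open"
  // and every dependency is already done.
  claim(id: string, agent: string): boolean {
    const task = this.tasks.get(id);
    if (!task || task.status !== "open") return false;
    const ready = task.deps.every((d) => this.tasks.get(d)?.status === "done");
    if (!ready) return false;
    task.status = "claimed";
    task.owner = agent;
    return true;
  }

  complete(id: string, agent: string): boolean {
    const task = this.tasks.get(id);
    if (!task || task.status !== "claimed" || task.owner !== agent) return false;
    task.status = "done";
    return true;
  }

  // Depth-first search with a "visiting" mark to detect back edges.
  hasCycle(): boolean {
    const state = new Map<string, "visiting" | "visited">();
    const visit = (id: string): boolean => {
      if (state.get(id) === "visiting") return true;
      if (state.get(id) === "visited") return false;
      state.set(id, "visiting");
      const deps = this.tasks.get(id)?.deps ?? [];
      const cyclic = deps.some((d) => visit(d));
      state.set(id, "visited");
      return cyclic;
    };
    return Array.from(this.tasks.keys()).some((id) => visit(id));
  }
}
```

Two agents racing for the same task see one `true` and one `false`, which is what keeps them from duplicating work.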

1,020 compiled tests. Every release regression-tested against the full suite.

Compiled security pipeline — every bash command passes through BashSecurity → CommandClassifier → PermissionGuard. Every tool call. Every time.
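
A three-stage gate of this shape might look like the sketch below. The two patterns, the tiny command table, the deny-on-unknown rule, and the glob-to-RegExp conversion (standing in for the compiled DFA) are all assumptions for illustration. The real engines carry 30 patterns and a 254-command database that this README does not list.

```typescript
type Decision = { allow: true } | { allow: false; stage: string; reason: string };

// Stage 1 stand-in: a couple of illustrative dangerous-command patterns.
const dangerousPatterns: { name: string; re: RegExp }[] = [
  { name: "rm-root", re: /\brm\s+-\w*r\w*\s+\/(\s|$)/ },
  { name: "pipe-to-shell", re: /\b(curl|wget)\b[^|]*\|\s*(ba)?sh\b/ },
];

// Stage 2 stand-in: O(1) lookup from program name to class.
const commandClasses = new Map<string, "read" | "write" | "network">([
  ["ls", "read"],
  ["cat", "read"],
  ["git", "write"],
  ["curl", "network"],
]);

// Stage 3 stand-in: only "*" is supported as a wildcard here.
function globToRegExp(glob: string): RegExp {
  const escaped = glob.replace(/[.+^${}()|[\]\\?]/g, "\\$&").replace(/\*/g, ".*");
  return new RegExp(`^${escaped}$`);
}

function checkCommand(cmd: string, allowGlobs: string[]): Decision {
  for (const p of dangerousPatterns) {
    if (p.re.test(cmd)) return { allow: false, stage: "BashSecurity", reason: p.name };
  }
  const program = cmd.trim().split(/\s+/)[0] ?? "";
  if (!commandClasses.has(program)) {
    return { allow: false, stage: "CommandClassifier", reason: `unclassified: ${program}` };
  }
  if (!allowGlobs.some((g) => globToRegExp(g).test(cmd))) {
    return { allow: false, stage: "PermissionGuard", reason: "no matching allow rule" };
  }
  return { allow: true };
}
```

Ordering matters: the pattern check runs first so that a broad allow rule such as `*` cannot wave through `rm -rf /`.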

The TUI

The left sidebar is a live workspace tree. It replaces your terminal tabs.

 Switchboard
 12 sessions · 3 workspaces

  toolkode             4 sessions, 2 dirty
    ├── Fix auth bug
    ├── Refactor DAG
    ├── Add tests
    └── Update docs
  training-llm         3 sessions
    ├── Fine-tune eval
    └── Dataset prep

Full keyboard navigation. / search, n create, p pin, j/k move. Every shortcut configurable.

Ctrl+L opens the local AI control plane. Auto-discovers Ollama and LM Studio. VRAM detection via Metal/CUDA. Build local+remote pair topologies. Zero cost per token.

Computer Use

19 actions through a single tool. Cross-platform — macOS via cliclick, Linux via xdotool.

Input — click, double_click, right_click, middle_click, triple_click, mouse_move, mouse_down, mouse_up, drag, scroll, cursor_position

Typing — type, key, hold_key

Clipboard — clipboard_read, clipboard_write

System — screenshot, open_app, batch

Batch executes an action array atomically — one screenshot at the end. 3-5x fewer round-trips. Screenshot delta — Rust FNV hash detects when an action had no visible effect. Every action classified by the ComputerUse engine before execution.
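
The screenshot delta idea fits in a dozen lines. This sketch assumes the 32-bit FNV-1a variant and raw pixel buffers as input; the README only says "FNV", so the exact variant, the hashed data, and the `hadVisibleEffect` helper are assumptions.

```typescript
// 32-bit FNV-1a over a byte buffer.
function fnv1a32(bytes: Uint8Array): number {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < bytes.length; i++) {
    hash ^= bytes[i] ?? 0;
    hash = Math.imul(hash, 0x01000193); // FNV prime, wrapping 32-bit multiply
  }
  return hash >>> 0; // normalize to unsigned
}

// An action had a visible effect if the before/after frames hash differently.
function hadVisibleEffect(before: Uint8Array, after: Uint8Array): boolean {
  return before.length !== after.length || fnv1a32(before) !== fnv1a32(after);
}
```

A hash comparison can in principle miss a change on collision, so a caller that needs certainty would fall back to a byte-wise comparison when the hashes match.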

Hooks

21 lifecycle events. 4 hook types: shell command, HTTP webhook, prompt evaluator, agent verifier.

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "bash",
        "hooks": [{ "type": "command", "command": "my-check.sh", "timeout": 5 }]
      }
    ]
  }
}

Events fire at every important moment — before and after tool calls, session start/end, prompt submit, compaction, model output. Hook configs validated at load time by the HookValidator Rust engine. Invalid hooks rejected before they execute.
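
Load-time validation of a config like the one above could be sketched as follows. Only the `command` type appears in the README example, so the other three type names (`http`, `prompt`, `agent`) are hypothetical, as are the `validateHooks` function and the specific checks. The real HookValidator also does cycle detection and dangerous-command scanning, which this sketch omits.

```typescript
interface HookSpec {
  type: string;
  command?: string;
  timeout?: number;
}

interface HookGroup {
  matcher: string;
  hooks: HookSpec[];
}

interface HookConfig {
  hooks: Record<string, HookGroup[]>;
}

// "command" is from the README; the other names are assumed for illustration.
const KNOWN_TYPES = ["command", "http", "prompt", "agent"];

// Return a list of problems; an empty list means the config may be loaded.
function validateHooks(config: HookConfig): string[] {
  const errors: string[] = [];
  for (const [event, groups] of Object.entries(config.hooks)) {
    groups.forEach((group, g) => {
      group.hooks.forEach((hook, h) => {
        const where = `${event}[${g}].hooks[${h}]`;
        if (!KNOWN_TYPES.includes(hook.type)) {
          errors.push(`${where}: unknown type "${hook.type}"`);
        }
        if (hook.type === "command" && !hook.command?.trim()) {
          errors.push(`${where}: command is empty`);
        }
        if (hook.timeout !== undefined && !(hook.timeout > 0)) {
          errors.push(`${where}: timeout must be positive`);
        }
      });
    });
  }
  return errors;
}
```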

Releases

| Version | Name        | Shipped                                                                                             | Tests |
| ------- | ----------- | --------------------------------------------------------------------------------------------------- | ----: |
| v4.17   | Native      | Claude SDK harness default, TUI parity pass (bubbles, GitStrip, ActivityMeter), 8-adapter hardening | 4,545 |
| v4.15   | T n T       | Interactive PTY pane, destructive_cmd engine, Add Action dialog, project anchors, 0 test flakes     | 1,020 |
| v4.7    | Git Host    | Git hosting (Cloudflare + GitHub), /quithub, configurable keybinds, left rail, TruthBar, 25 fixes   | 1,020 |
| v4.0    | Hooks       | Hook system (21 events, 4 types), HookValidator engine, 27 tool-backed slash commands               |   881 |
| v3.14   | Lens Pro    | +10 computer use actions (19 total), batch, clipboard, screenshot delta                             |   857 |
| v3.13   | Lens        | Cross-platform computer use, Rust safety engine                                                     |   849 |
| v3.11   | Blueprint   | Spec pipeline, UX wireframes, design review gate                                                    |   836 |
| v3.10   | Arsenal     | RedTeam + Synthesis. All 6 superhuman chains exposed.                                               |   836 |
| v3.8    | Vault       | Compiled Rust sandbox guard on all tools.                                                           |   836 |
| v3.1    | Wiki        | WikiGraph engine. Knowledge compounds across sessions.                                              |   787 |
| v2.15   | Agent       | Fork spawning, name routing, per-agent memory.                                                      |   763 |
| v2.11   | Zoom        | 5 Rust engines, 8 tools, worktree hardening.                                                        |   731 |
| v2.10   | Switchboard | Local AI control plane, Ollama/LM Studio discovery.                                                 |   440 |

37 releases. Test count only goes up.

Architecture

 Commands (102)    /specify · /foresight · /quick-win · /release · ...
 Tools (102)       read · edit · bash · glob · grep · agent · ...
 Hooks (21)        PreToolUse · SessionStart · PostSampling · ...
 TUI               switchboard · sidebar · command palette · plugins
 ─────────────────────────────────────────────────────────────
 Superhuman Chains (TypeScript)
 Certainty · BlindSpots · Consensus · Discipline
 PeerReview · Verification · Memory · Synthesis
 ─────────────────────────────────────────────────────────────
                      napi-rs FFI · JSON at boundary
 ─────────────────────────────────────────────────────────────
 toolkode-core (Rust · single .node binary)
 96 engines · 4,545 tests · ~32,000 LoC

Platforms — macOS ARM64 · macOS x64 · Linux x64 · Linux ARM64 · Windows x64

Install picks the right binary. If native can't load, everything else still works.
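
The graceful-degradation claim, together with the JSON-at-the-boundary design, can be pictured with the sketch below. The `NativeCore` interface, the `call(engine, requestJson)` signature, the request and reply shapes, and the four-characters-per-token fallback are all hypothetical; the real binding surface is not shown in this README.

```typescript
// Hypothetical binding surface: every call sends and receives JSON strings.
interface NativeCore {
  call(engine: string, requestJson: string): string;
}

// Pure TypeScript fallback used when the .node binary cannot be loaded.
// The ~4 characters per token estimate is a rough stand-in, not toolkode's rule.
const fallbackCore: NativeCore = {
  call(engine, requestJson) {
    if (engine === "TokenCount") {
      const { text } = JSON.parse(requestJson) as { text: string };
      return JSON.stringify({ tokens: Math.ceil(text.length / 4) });
    }
    return JSON.stringify({ error: `engine ${engine} unavailable without native binary` });
  },
};

// Try the native loader; fall back instead of crashing if it throws.
function loadCore(loadNative: () => NativeCore): { core: NativeCore; native: boolean } {
  try {
    return { core: loadNative(), native: true };
  } catch {
    return { core: fallbackCore, native: false };
  }
}

function countTokens(core: NativeCore, text: string): number {
  const reply = JSON.parse(core.call("TokenCount", JSON.stringify({ text }))) as { tokens: number };
  return reply.tokens;
}
```

In a real setup the `loadNative` callback would wrap whatever `require` call resolves the platform binary, so a missing or incompatible build surfaces as a caught exception rather than a failed install.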