miii-cli

v2.0.17

The local-first autonomous coding agent. Claude Code UX powered by Ollama. Zero cost, maximum privacy, fully offline.

miii

The AI coding agent that runs on your machine. Not theirs.

npm install -g miii-cli && miii

Every cloud coding agent does the same thing: takes your source code, ships it to a server you don't control, charges you per token, and calls it intelligence.

You have a machine on your desk with 32GB of RAM and a GPU that's been idle since you bought it.

miii is the agent layer that belongs on your hardware. Autonomous. Local. Free. Your code never leaves unless you decide it should.

The thing that makes it work on a local 7B model where other tools fail: Beacon — a goal-directed context engine that keeps the model on task across deep tool chains without ever calling the LLM to summarise.

Why miii

Your code stays on your machine

Local-first by default. Runs on Ollama — free, offline, private. When you hit a hard problem, /cloud escalates one prompt to Claude Opus 4 or o3 and comes back. You choose what leaves and when.

No subscription. No per-token billing for local runs. No vendor lock-in.

Beacon — goal-directed context engine

The context problem isn't that models have bad memory. It's that every other tool fills memory with the wrong things.

Watch what happens to a typical agent after 10 tool calls:

depth  1 │ user: "fix the token expiry bug"
depth  2 │ read_file(auth.ts)       → 800 lines verbatim  ← already acted on, still taking space
depth  3 │ read_file(middleware.ts) → 600 lines verbatim  ← same
depth  4 │ list_files(src/)         → full tree            ← same
depth  5 │ run_command(grep...)     → full stdout          ← same
depth  6 │ update_file(auth.ts)     → patch
depth  7 │ run_tests()              → full test output     ← same
depth  8 │ ░░░░░░░░░░░░░░░░░░░░░░  context full

The model hits the wall. The standard fix: call the LLM again to summarise. More tokens. More latency. On a local model, another queue stall.

Beacon does not call the LLM. It compresses — in microseconds, with zero overhead.

depth  1 │ user: "fix the token expiry bug"
depth  2 │ read_file(auth.ts)       → [147 lines · first 4 shown]
depth  3 │ read_file(middleware.ts) → [89 lines · first 4 shown]
depth  4 │ list_files(src/)         → [23 files · first 8 shown]
depth  5 │ run_command(grep...)     → [first 4 lines + last line]
depth  6 │ update_file(auth.ts)     → patch                        ← recent: untouched
depth  7 │ run_tests()              → [first 10 lines · failures verbatim]
depth  8 │ [Beacon Goal Block] → goal + steps taken + what's left  ← injected live
depth  9 │ model response
...
depth 20 │ ✓ still running

Not all messages age equally. A read_file from 12 steps ago doesn't need to be 800 verbatim lines. It needs to be read_file(auth.ts) → 147 lines. The model acted on it. It doesn't need to re-read it — it needs to know it happened.

Per-tool compression — the right strategy for each tool, not a blunt trim:

| Tool | Without Beacon | With Beacon |
|---|---|---|
| read_file | 800 lines verbatim | filename + line count + first 4 lines |
| list_files | full directory tree | file count + first 8 paths |
| run_command | full stdout | first 4 lines + last line |
| run_tests | full test output | first 10 lines (failures always kept) |
| update_file | patch + old content | patch label only |
| Error messages | verbatim | always verbatim, never touched |

Recent messages stay untouched. Errors are sacred. A shadow store preserves everything pruned — nothing is lost, just condensed.
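The per-tool rules in the table can be sketched as a pure string transform. This is a minimal TypeScript illustration, not miii's source: `ToolResult`, `compressToolResult`, `compressHistory`, the keep-recent cutoff, and the failure-matching regex for test output are all assumptions made for the example.

```typescript
// Illustrative sketch only: type and function names are hypothetical, not miii's API.
type ToolName = "read_file" | "list_files" | "run_command" | "run_tests" | "update_file";

interface ToolResult {
  tool: ToolName;
  label: string;   // e.g. "read_file(auth.ts)"
  output: string;  // full output as originally returned by the tool
  isError: boolean;
}

// Condense one old tool result according to its tool type.
function compressToolResult(r: ToolResult): string {
  if (r.isError) return r.output; // errors are always kept verbatim

  const lines = r.output.split("\n");
  switch (r.tool) {
    case "read_file":
      return [`${r.label} → [${lines.length} lines · first 4 shown]`, ...lines.slice(0, 4)].join("\n");
    case "list_files":
      return [`${r.label} → [${lines.length} files · first 8 shown]`, ...lines.slice(0, 8)].join("\n");
    case "run_command": {
      const kept = lines.length <= 5 ? lines : [...lines.slice(0, 4), "…", lines[lines.length - 1]];
      return [`${r.label} → [first 4 lines + last line]`, ...kept].join("\n");
    }
    case "run_tests": {
      // Assumed heuristic: any later line mentioning FAIL or ✗ counts as a failure.
      const failures = lines.slice(10).filter((l) => /FAIL|✗/.test(l));
      return [`${r.label} → [first 10 lines · failures verbatim]`, ...lines.slice(0, 10), ...failures].join("\n");
    }
    case "update_file":
      return `${r.label} → patch`;
    default:
      return r.output;
  }
}

// Compress everything except the most recent `keepRecent` results.
// Originals go into a shadow store keyed by position, so nothing is lost.
function compressHistory(results: ToolResult[], keepRecent: number, shadow: Map<number, string>): string[] {
  const cutoff = results.length - keepRecent;
  return results.map((r, i) => {
    if (i >= cutoff) return r.output; // recent: untouched
    shadow.set(i, r.output);
    return compressToolResult(r);
  });
}
```

Because every step is `split`, `slice`, and `join`, the cost is linear in the size of the output and needs no model call, which is the property this section relies on.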

Goal injection — the harder problem isn't tokens, it's drift.

After 10 tool calls, models routinely forget the original task and start solving adjacent problems. Beacon prevents this with zero LLM calls: it extracts your goal synchronously at depth 0, then injects a live goal state block just before the last message at every subsequent depth:

╔══════════════════════════════════════════════════════╗
║  Beacon — Goal State                                 ║
║  Goal: fix token expiry bug in auth middleware       ║
║                                                      ║
║  Steps taken:                                        ║
║    • read_file(src/middleware/auth.ts) → 147 lines   ║
║    • update_file(auth.ts) → changed < to <=          ║
║    • run_tests(auth.test.ts) → 12 passed             ║
║                                                      ║
║  Remaining: verify expired-at-boundary edge case     ║
╚══════════════════════════════════════════════════════╝

The model sees this right before every response. It cannot lose the thread because the thread is handed back every single time.
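A rough sketch of the goal block mechanics described above, assuming a chat-style message array. `extractGoal`, `renderGoalBlock`, `injectGoal`, the 120-character cap, and the choice of a `system` role for the injected block are hypothetical; providers that accept only a leading system prompt would need a different role.

```typescript
// Hypothetical sketch of goal-block injection; names are illustrative, not miii's API.
interface Message {
  role: "system" | "user" | "assistant" | "tool";
  content: string;
}

interface GoalState {
  goal: string;        // captured once at depth 0
  steps: string[];     // one-line summaries of completed tool calls
  remaining?: string;  // optional note on what is left
}

// Assumed heuristic: the goal is the first non-empty line of the user's message, capped at 120 chars.
function extractGoal(userText: string): string {
  const first = userText.split("\n").map((l) => l.trim()).find((l) => l.length > 0) ?? "";
  return first.length > 120 ? `${first.slice(0, 117)}...` : first;
}

function renderGoalBlock(s: GoalState): string {
  const lines = ["[Beacon Goal State]", `Goal: ${s.goal}`, "Steps taken:"];
  for (const step of s.steps) lines.push(`  • ${step}`);
  if (s.remaining) lines.push(`Remaining: ${s.remaining}`);
  return lines.join("\n");
}

// Return a new array with the goal block placed just before the last message.
// The input is not mutated, so the stored history never accumulates goal blocks.
function injectGoal(messages: Message[], s: GoalState): Message[] {
  if (messages.length === 0) return messages;
  const block: Message = { role: "system", content: renderGoalBlock(s) };
  return [...messages.slice(0, -1), block, messages[messages.length - 1]];
}
```

Building a fresh array at every depth keeps the goal block out of the persisted transcript, so each response sees exactly one up-to-date copy.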

Dynamic context window detection — no more compacting at 10K when you have 200K.

Most tools hardcode a conservative limit and compact way too early. Beacon detects the actual window at session start and updates it if you switch models:

| Provider | How |
|---|---|
| Anthropic | Model prefix lookup: claude-opus-4-* / claude-sonnet-4-* → 200K |
| OpenAI | Model prefix lookup: o3 / o4-mini → 200K, gpt-4o → 128K |
| Ollama | Live query to /api/show → exact llama.context_length from model metadata |
| OpenAI-compat | Queries /v1/models/{model} (vLLM, LM Studio, llama.cpp), falling back to name patterns |
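The lookup above might look roughly like this. The prefix table copies the values listed; the function names are hypothetical. One assumption worth flagging: as far as we know, Ollama's `/api/show` reports the window under `<architecture>.context_length` (for example `qwen2.context_length` for Qwen models), so this sketch reads `general.architecture` first and only falls back to `llama.context_length`.

```typescript
// Hypothetical sketch of context-window detection; not miii's actual code.
// Values mirror the provider table; the first matching prefix wins.
const PREFIX_WINDOWS: Array<[string, number]> = [
  ["claude-opus-4", 200_000],
  ["claude-sonnet-4", 200_000],
  ["o4-mini", 200_000],
  ["o3", 200_000],
  ["gpt-4o", 128_000],
];

function windowFromModelName(model: string): number | undefined {
  for (const [prefix, tokens] of PREFIX_WINDOWS) {
    if (model.startsWith(prefix)) return tokens;
  }
  return undefined; // unknown model: caller picks a conservative default
}

// Shape of the part of Ollama's /api/show response this sketch reads (assumed).
interface OllamaShow {
  model_info?: Record<string, unknown>;
}

function windowFromOllamaShow(resp: OllamaShow): number | undefined {
  const info = resp.model_info ?? {};
  const arch = info["general.architecture"];
  const key = typeof arch === "string" ? `${arch}.context_length` : "llama.context_length";
  const n = info[key];
  return typeof n === "number" ? n : undefined;
}

// Live query against a local Ollama server (uses Node's global fetch, Node 18+).
async function detectOllamaWindow(baseUrl: string, model: string): Promise<number | undefined> {
  const res = await fetch(`${baseUrl}/api/show`, {
    method: "POST",
    headers: { "Content-Type": "application/json" },
    body: JSON.stringify({ model }),
  });
  if (!res.ok) return undefined;
  return windowFromOllamaShow((await res.json()) as OllamaShow);
}
```

Keeping the JSON parsing separate from the network call means the metadata logic can be checked without a running Ollama server.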

─ context window: 200K tokens → compact threshold 560K chars

Claude users get 14× more headroom before compaction triggers than with a hardcoded 10K-token limit. Ollama users get the exact window their model loaded with, not a guess.
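The numbers in the log line are consistent with two constants: roughly 4 characters per token and compaction at 70% of the window. Both values are inferred from the example, not documented, so treat this as a reconstruction of the arithmetic rather than miii's code.

```typescript
// Assumed constants, inferred from "200K tokens → 560K chars"; not documented by miii.
const CHARS_PER_TOKEN = 4;     // rough average for English text and code
const COMPACT_FRACTION = 0.7;  // compact once history reaches 70% of the window

function compactThresholdChars(windowTokens: number): number {
  return Math.round(windowTokens * CHARS_PER_TOKEN * COMPACT_FRACTION);
}

const claudeThreshold = compactThresholdChars(200_000); // 560_000 chars
const fixedLimit = 10_000 * CHARS_PER_TOKEN;            // a fixed 10K-token limit: 40_000 chars
console.log(claudeThreshold / fixedLimit);              // 14
```

Under the same assumptions, a fixed 10K-token limit allows 40K characters, which would explain the 14× figure.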

When the last-resort LLM compaction does trigger, you see it live instead of a frozen cursor:

⟳ compressing context…  ████░░░░  12s

Zero overhead. Goal extraction, per-tool compression, shadow store, window detection — pure string operations. No embedding model. No summariser LLM. No Ollama queue contention. Microseconds per depth.

The result: tool chains that stalled at depth 8 now run to 20. The model stays on task. A local 7B with Beacon outruns a naive Claude Opus 4 or o3 integration on long tasks — not because it's smarter, but because it's not fighting its own context.

A 7B model that reasons like a 70B

miii builds a static call graph of your entire codebase — every function, class, method, and every edge between them. Pure AST. No model. No network.

/graph build         # 847 symbols, 1203 edges — done in <1s
/graph query "auth"  # auth → verifyToken → decodeJWT → hmacSha256

When a structural question comes up, the agent queries the graph first. Context beats parameters.
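Once the graph exists, a query like `/graph query "auth"` can be answered by walking callee edges. The sketch below assumes a prebuilt `Map` from symbol to callees (building it from the AST is out of scope) and uses hypothetical names; it enumerates chains depth-first and skips symbols already on the current path so recursive calls cannot loop forever.

```typescript
// Hypothetical sketch: the graph is assumed to be already built (symbol → callees).
type CallGraph = Map<string, string[]>;

// Enumerate call chains from every symbol whose name contains `term`.
function queryChains(graph: CallGraph, term: string, maxDepth = 8): string[] {
  const chains: string[] = [];

  const walk = (sym: string, path: string[]): void => {
    // Skip callees already on this path to avoid cycles.
    const callees = (graph.get(sym) ?? []).filter((c) => !path.includes(c));
    if (callees.length === 0 || path.length >= maxDepth) {
      chains.push(path.join(" → "));
      return;
    }
    for (const c of callees) walk(c, [...path, c]);
  };

  const needle = term.toLowerCase();
  for (const sym of graph.keys()) {
    if (sym.toLowerCase().includes(needle)) walk(sym, [sym]);
  }
  return chains;
}
```

For example, with edges `auth → verifyToken → decodeJWT → hmacSha256`, `queryChains(graph, "auth")` returns that single chain as one string.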

miii also reads files like a surgeon. Most agents dump the whole file into context. miii reads a window: imports, a focus region around the edit target, the footer. A 500-line file costs ~480 tokens, not ~2000.
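A windowed read can be approximated with three line ranges. In this hypothetical `windowedRead`, the import detection (single-line `import` statements only), the focus radius, and the footer length are assumptions; miii's actual window sizes are not given here.

```typescript
// Hypothetical sketch of a windowed read: imports + focus region + footer.
function windowedRead(source: string, focusLine: number, focusRadius = 20, footerLines = 5): string {
  const lines = source.split("\n");
  const keep = new Set<number>();

  // 1. Import lines (single-line imports only, for simplicity).
  lines.forEach((l, i) => {
    if (/^import\b/.test(l)) keep.add(i);
  });

  // 2. Focus region around the 1-based target line.
  const center = focusLine - 1;
  const end = Math.min(lines.length - 1, center + focusRadius);
  for (let i = Math.max(0, center - focusRadius); i <= end; i++) keep.add(i);

  // 3. Footer, where exports often live.
  for (let i = Math.max(0, lines.length - footerLines); i < lines.length; i++) keep.add(i);

  // Emit kept lines in order with 1-based numbers; "…" marks skipped spans.
  const out: string[] = [];
  let prev = -1;
  for (const i of [...keep].sort((a, b) => a - b)) {
    if (i > prev + 1) out.push("   …");
    out.push(`${String(i + 1).padStart(4)}  ${lines[i]}`);
    prev = i;
  }
  if (prev < lines.length - 1) out.push("   …");
  return out.join("\n");
}
```

The model still sees where the edit target sits in the file and which symbols are imported and exported, without paying for the lines in between.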

You see everything before it happens

Every tool call is narrated. Every write shows a live diff. You approve or deny before anything changes.

⚡ Edit src/middleware/auth.ts?
   - if (exp < Date.now() / 1000) {
   + if (exp <= Date.now() / 1000) {
❯ Yes  /  Yes, for this session  /  No

When a patch fails, miii doesn't retry the same call. It tracks failure count per (tool, file) pair and injects a decompose nudge on the second failure: re-read the file, find exact lines, one minimal change at a time.
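The per-(tool, file) failure counter can be sketched as a small map keyed on both values. `FailureTracker` and its methods are hypothetical, and clearing the count after a success is an assumption the text does not state.

```typescript
// Hypothetical sketch; the nudge paraphrases the behaviour described above.
class FailureTracker {
  private counts = new Map<string, number>();

  // NUL separator so "a" + "bc" and "ab" + "c" cannot collide.
  private key(tool: string, file: string): string {
    return `${tool}\u0000${file}`;
  }

  // Record a failed call; returns a decompose nudge from the second failure on.
  recordFailure(tool: string, file: string): string | undefined {
    const k = this.key(tool, file);
    const n = (this.counts.get(k) ?? 0) + 1;
    this.counts.set(k, n);
    if (n < 2) return undefined;
    return `${tool} on ${file} has failed ${n} times. Re-read the file, find the exact lines, and make one minimal change at a time.`;
  }

  // Assumed: a success resets the counter for that pair.
  recordSuccess(tool: string, file: string): void {
    this.counts.delete(this.key(tool, file));
  }
}
```

Keying on the pair rather than the tool alone means a failing patch to one file does not trigger the nudge for unrelated edits elsewhere.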

Nothing is lost

Every file is snapshotted before the write. Esc restores everything from that run instantly — across crashes, across sessions.
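Snapshot-and-restore reduces to remembering each file's content before its first write in a run. This sketch keeps snapshots in memory, so unlike the behaviour described above it would not survive a crash; `RunSnapshots` is a hypothetical name, and `null` marks files that did not exist, so restoring deletes them.

```typescript
import { existsSync, mkdtempSync, readFileSync, rmSync, writeFileSync } from "node:fs";
import { tmpdir } from "node:os";
import { join } from "node:path";

// Hypothetical, in-memory only; persisting to disk would be needed to survive crashes.
class RunSnapshots {
  private originals = new Map<string, string | null>(); // null = file did not exist

  // Call before every write; only the first snapshot per path in a run is kept.
  snapshot(path: string): void {
    if (this.originals.has(path)) return;
    this.originals.set(path, existsSync(path) ? readFileSync(path, "utf8") : null);
  }

  // Restore every touched file to its state at the start of the run.
  restoreAll(): void {
    for (const [path, content] of this.originals) {
      if (content === null) rmSync(path, { force: true });
      else writeFileSync(path, content, "utf8");
    }
    this.originals.clear();
  }
}

// Usage: snapshot, write, then roll back.
const dir = mkdtempSync(join(tmpdir(), "snap-"));
const file = join(dir, "auth.ts");
writeFileSync(file, "exp < now");
const snaps = new RunSnapshots();
snaps.snapshot(file);
writeFileSync(file, "exp <= now");
snaps.restoreAll();
console.log(readFileSync(file, "utf8")); // exp < now
```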

Every successful edit auto-commits to a private shadow git repo. Full model edit history, greppable and reversible.

/history    # model edit log
/undo       # revert last edit

Shell commands run inside an OS-level sandbox (sandbox-exec on macOS, bwrap on Linux). Write access bounded to your project.
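On Linux, bounding write access to the project can be expressed with bubblewrap bind mounts. This is a hypothetical argument builder, not miii's sandbox profile: it mounts the host filesystem read-only and then re-binds only the project directory as writable.

```typescript
// Hypothetical sketch of a bwrap invocation; not miii's actual sandbox profile.
// Later mounts override earlier ones, so the project bind wins over the read-only root.
function bwrapArgs(projectDir: string, command: string): string[] {
  return [
    "--ro-bind", "/", "/",             // whole host filesystem, read-only
    "--dev", "/dev",                   // minimal device nodes
    "--proc", "/proc",                 // fresh procfs
    "--bind", projectDir, projectDir,  // writable: the project only
    "--chdir", projectDir,
    "--die-with-parent",               // kill the sandbox if miii exits
    "sh", "-c", command,
  ];
}

// e.g. spawn("bwrap", bwrapArgs(process.cwd(), "npm test"), { stdio: "inherit" })
```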

How it compares

| | miii | Claude Code | Aider | Cursor / Cline |
|---|---|---|---|---|
| Runs locally | ✓ default | ✗ cloud only | ✓ | ✗ |
| Free to use | ✓ (local) | ✗ subscription | ✓ | ✗ |
| Code stays on machine | ✓ | ✗ | ✓ | ✗ |
| Goal-directed context (Beacon) | ✓ | ✗ | ✗ | ✗ |
| Per-tool context compression | ✓ | ✗ | ✗ | ✗ |
| Dynamic context window detection | ✓ auto | ✗ hardcoded | ✗ | ✗ |
| Call graph | ✓ | ✗ | ✗ | ✗ |
| Windowed file reads | ✓ | ✗ | ✗ | ✗ |
| Permission modal + live diff | ✓ | ✓ | ✗ | partial |
| Shadow git (model edit log) | ✓ | ✗ | partial | ✗ |
| OS-level shell sandbox | ✓ | ✗ | ✗ | ✗ |
| Cloud escalation | ✓ /cloud | native | ✗ | native |
| MCP support | ✓ | ✓ | ✗ | ✓ |

Where Claude Code wins: direct access to Anthropic's best models, larger context windows, deeper IDE integration. If you're on a paid Anthropic plan and privacy isn't a constraint, it's excellent.

Where miii wins: your code never leaves by default, no subscription, and Beacon keeps the model focused across long tool chains in a way no other local-first tool does. A 7B local model with miii's call graph and context engine surprises people who expect local to underperform.

Get started

# Local — free, offline, private (recommended)
ollama pull qwen2.5-coder:7b
npm install -g miii-cli
cd your-project && miii

# Anthropic Claude (Opus 4 / Sonnet 4)
ANTHROPIC_API_KEY=sk-ant-... miii
# → /config model claude-opus-4-7

# OpenAI (o3, o4-mini, GPT-4o) or any compatible endpoint
miii   # then run /config provider openai-compat, /config url, and /config key
# → /config model o3

Requirements: 16 GB RAM minimum for a 7B local model. 32 GB or a dedicated GPU recommended.

Config lives at ~/.config/miii/config.json (global) or .miii.json (project). Drop a MIII.md in your repo for project-specific instructions — same mechanic as Claude Code's CLAUDE.md.

Commands

Type / to open the palette. A few highlights:

| Command | What it does |
|---|---|
| /cloud [prompt] | Escalate to Claude Opus 4 / o3 for this turn |
| /plan <goal> | Think before touching code |
| /tdd <feature> | Write failing test → implement → green |
| /graph build | Build the symbol call graph |
| /index build | Build semantic vector index |
| /refactor <goal> | Multi-file refactor with checkpoints |
| /model <name> | Hot-swap model mid-session |
| /undo | Revert last model edit |
| /history | Shadow git log of model changes |
| /compact | Summarise history, extract facts to memory |

Full palette available in-app. Esc aborts and rolls back. Ctrl+C exits.

Build from source

git clone https://github.com/maruakshay/miii-cli
cd miii-cli && npm install && npm run build && npm link
npm run dev    # tsx watch — no build step
npm test       # vitest, 48 tests

Node 20.12+.

Architecture

src/index.ts mounts an Ink TUI. Submissions hit useSubmit.ts → useRunLoop.ts. The loop builds context (system prompt + memory + history), streams via src/llm/stream.ts (Ollama / Anthropic / OpenAI-compat), parses <tool_call> blocks through src/parser/stream-parser.ts, gates writes through a permission modal, snapshots via src/checkpoints/store.ts, and executes tools from src/tools/index.ts.

Beacon (src/beacon/) manages context across tool-call depths. goal.ts extracts the user's goal synchronously and tracks progress from tool outcomes. compress.ts applies per-tool age-based compression to the middle of the message array. index.ts wires both into BeaconContext, which is created at depth 0 and threaded through every recursive runLoop call. src/context/window.ts detects the actual model context window at session start (Anthropic/OpenAI hardcoded, Ollama via /api/show, openai-compat via /v1/models) and updates compactor.ts's threshold dynamically. The emergency LLM summariser in compactor.ts remains as a last-resort fallback and shows a live compression animation during the LLM call. Tool-call depth cap: 20.
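The recursion described above, with one `BeaconContext` threaded through every depth and a hard cap of 20, can be reduced to the following shape. The model and tool executor are injected as plain functions so the loop runs without a provider; `Msg`, `Model`, `RunTool`, and this `runLoop` signature are hypothetical and omit streaming, parsing, permissions, and compression.

```typescript
// Hypothetical, heavily simplified sketch of a recursive agent loop.
const MAX_DEPTH = 20;

interface Msg { role: "user" | "assistant" | "tool"; content: string; }
interface ToolCall { name: string; args: string; }

interface BeaconContext {
  goal: string;     // extracted once at depth 0
  steps: string[];  // appended after each tool call
}

// Stand-ins for the LLM and the tool executor.
type Model = (history: Msg[], beacon: BeaconContext) => Promise<{ text: string; toolCall?: ToolCall }>;
type RunTool = (call: ToolCall) => Promise<string>;

async function runLoop(
  history: Msg[], beacon: BeaconContext, model: Model, runTool: RunTool, depth = 0,
): Promise<Msg[]> {
  if (depth >= MAX_DEPTH) {
    return [...history, { role: "assistant", content: `[stopped: tool-call depth cap ${MAX_DEPTH} reached]` }];
  }
  const reply = await model(history, beacon);
  const next: Msg[] = [...history, { role: "assistant", content: reply.text }];
  if (!reply.toolCall) return next; // no tool call: the turn is finished

  const output = await runTool(reply.toolCall);
  beacon.steps.push(`${reply.toolCall.name}(${reply.toolCall.args})`);
  // The same BeaconContext object is passed into the recursive call.
  return runLoop([...next, { role: "tool", content: output }], beacon, model, runTool, depth + 1);
}
```

Sharing one mutable context object across recursive calls is what lets progress recorded at depth 3 show up in the goal block at depth 15.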

MIT · miii.in · @maruakshay