deepflow
Doing reveals what thinking can't predict — spec-driven iterative development for Claude Code
██████╗ ███████╗ ███████╗ ██████╗ ███████╗ ██╗ ██████╗ ██╗ ██╗
██╔══██╗ ██╔════╝ ██╔════╝ ██╔══██╗ ██╔════╝ ██║ ██╔═══██╗ ██║ ██║
██║ ██║ █████╗ █████╗ ██████╔╝ █████╗ ██║ ██║ ██║ ██║ █╗ ██║
██║ ██║ ██╔══╝ ██╔══╝ ██╔═══╝ ██╔══╝ ██║ ██║ ██║ ██║███╗██║
██████╔╝ ███████╗ ███████╗ ██║ ██║ ███████╗ ╚██████╔╝ ╚███╔███╔╝
╚═════╝ ╚══════╝ ╚══════╝ ╚═╝ ╚═╝ ╚══════╝ ╚═════╝ ╚══╝╚══╝
Why Deepflow
You can't foresee what you don't know to ask. Doing reveals — at every layer.
Most spec-driven frameworks start from a finished spec and execute a static plan. Deepflow treats the entire process as discovery: asking reveals hidden requirements, debating reveals blind spots, spiking reveals technical risks, implementing reveals edge cases. Each step makes the next one sharper.
- Asking reveals what assuming hides — Before any code, Socratic questioning surfaces the requirements you didn't know you had. Four AI perspectives collide to expose tensions in your approach. The spec isn't written from what you think you know — it's written from what the conversation uncovered.
- Spec as living hypothesis — Core intent stays fixed, details refine through implementation. "The spec becomes bulletproof because you built it, not before."
- Parallel probes reveal the best path — Uncertain approaches spawn parallel spikes in isolated worktrees. The machine selects the winner (fewer regressions > better coverage > fewer files changed). Failed approaches stay recorded and never repeat.
- Metrics decide, not opinions — No LLM judges another LLM. Build, tests, typecheck, lint, and invariant checks are the only judges. After an agent commits, the orchestrator runs health checks. Pass = keep. Fail = revert + new hypothesis.
- Browser verification closes the loop — L5 launches headless Chromium via Playwright, captures the accessibility tree, and evaluates structured assertions extracted at plan-time from your spec's acceptance criteria. Deterministic pass/fail — no LLM calls during verification. Screenshots saved as evidence. (A sketch of this check follows the list.)
- The loop is the product — Not "execute a plan" — "evolve the codebase toward the spec's goals through iterative cycles." Each cycle reveals what the previous one couldn't see.
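To make the browser check concrete, here is a minimal sketch of what such a verification could look like with Playwright's accessibility snapshot API. The Assertion shape, the URL handling, and the screenshot filename are illustrative assumptions, not Deepflow's actual schema.

```ts
// Hypothetical L5-style browser check: launch headless Chromium, snapshot the
// accessibility tree, and evaluate plan-time assertions deterministically.
import { chromium } from "playwright";

interface Assertion {
  role: string;   // e.g. "button"
  name: string;   // accessible name to look for
  source: string; // which acceptance criterion this came from
}

async function verify(url: string, assertions: Assertion[]): Promise<boolean> {
  const browser = await chromium.launch();
  const page = await browser.newPage();
  await page.goto(url);

  // Accessibility tree snapshot: deterministic structure, no LLM involved.
  const tree = await page.accessibility.snapshot();
  await page.screenshot({ path: "verify-evidence.png" }); // saved as evidence
  await browser.close();

  // Flatten the tree so assertions can be checked with simple matching.
  const flat: any[] = [];
  (function walk(node: any) {
    if (!node) return;
    flat.push(node);
    (node.children ?? []).forEach(walk);
  })(tree);

  // Pass only if every assertion finds a matching node.
  return assertions.every((a) =>
    flat.some((n) => n.role === a.role && n.name === a.name)
  );
}
```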
What We Learned by Doing
Deepflow started with adversarial selection: one AI evaluated another AI's code in a fresh context. The "doing reveals" philosophy applied to the system itself — we discovered that LLM judging LLM produces gaming: agents that estimated instead of measuring, simulated instead of implementing, presented shortcuts as deliverables.
The fix: eliminate subjective judgment. Only objective metrics decide. Tests created by the agent itself are excluded from the baseline to prevent self-validation. We call this a ratchet — inspired by Karpathy's autoresearch: a mechanism where the metric can only improve, never regress. Each cycle ratchets quality forward.
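A minimal sketch of that ratchet decision, assuming hypothetical commands and using the .deepflow/auto-snapshot.txt baseline listed under File Structure; Deepflow's real implementation will differ, but the shape of the rule is the point: if the objective checks pass, keep the commit, otherwise revert.

```ts
// Minimal ratchet sketch (hypothetical commands and paths, not Deepflow's code).
// Only objective checks decide; the test baseline excludes tests the agent added,
// so an agent cannot validate its own work with tests it just wrote.
import { execSync } from "node:child_process";
import { readFileSync } from "node:fs";

function passes(cmd: string): boolean {
  try {
    execSync(cmd, { stdio: "pipe" });
    return true;
  } catch {
    return false;
  }
}

function ratchet(commitSha: string): "keep" | "revert" {
  // Pre-existing test files recorded before the cycle started.
  const baselineTests = readFileSync(".deepflow/auto-snapshot.txt", "utf8")
    .split("\n")
    .filter(Boolean);

  const healthy =
    passes("npm run build") &&
    passes("npm run typecheck") &&
    passes("npm run lint") &&
    passes(`npm test -- ${baselineTests.join(" ")}`); // baseline tests only

  if (healthy) return "keep"; // the metric only moves forward
  execSync(`git revert --no-edit ${commitSha}`); // fail = revert + new hypothesis
  return "revert";
}
```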
Quick Start
# Install (or update)
npx deepflow
# Uninstall
npx deepflow --uninstall

The installer configures granular permissions so background agents can read, write, run git, and execute health checks (build/test/typecheck/lint) without blocking on approval prompts. All permissions are scoped and cleaned up on uninstall.
Two Modes
Interactive (human-in-the-loop)
You explore the problem, shape the spec, and trigger execution — all inside a Claude Code session.
claude
# 1. Discover — understand the problem before solving it
/df:discover image-upload
# "Why do you need image upload? What exists today?
# What file sizes? What formats? Where are images stored?
# What does 'done' look like? What should this NOT do?"
# 2. Debate — stress-test the approach (optional)
/df:debate upload-strategy
# User Advocate: "Drag-and-drop is table stakes, not a feature"
# Tech Skeptic: "Client-side resize before upload, or you'll hit memory limits"
# Systems Thinker: "What happens when storage goes down mid-upload?"
# LLM Efficiency: "Split this into two specs: upload + processing"
# 3. Spec — now the conversation is rich enough to produce a solid spec
/df:spec image-upload
# 4-5: the AI takes over
/df:execute # Curator orchestrator spawns parallel agents in shared worktree, ratchet validates
/df:verify # Check spec satisfied, merge to main

What requires you: Steps 1-3 (defining the problem and approving the spec — /df:spec's curate phase produces the task list inline). Steps 4-5 run with the orchestrator session as curator; you trigger each one and can intervene.
Two Phases, One Handoff
HUMAN PHASE                           AI PHASE
─────────────────────────────────     ──────────────────────────────────
/df:discover — ask, surface gaps      /df:execute — curator spawns subagents,
/df:debate — stress-test approach       ratchet validates each commit
/df:spec — curate task list           /df:verify — health checks, merge
  ↻ refine until solid                  ↻ retry until converged
─────────────────────────────────     ──────────────────────────────────
specs/*.md (with ## Tasks (curated)) is the handoff point.

Spec lifecycle: feature.md (new) → doing-feature.md (in progress) → done-feature.md (decisions extracted, then deleted)
Commands
Human phase — you drive these to shape intent into a curated spec.
| Command | Purpose |
|---------|---------|
| /df:discover <name> | Explore problem space with Socratic questioning |
| /df:debate <topic> | Multi-perspective analysis (4 agents) |
| /df:spec <name> | Generate spec from conversation; curate phase emits ## Tasks (curated) (LSP-first impact, file-ownership-aware [P]) |
| /df:fix <done-spec> | Derive a follow-up spec from a completed one (regressions, unmet ACs) |
AI phase — the orchestrator session itself acts as curator and drives these.
| Command | Purpose |
|---------|---------|
| /df:execute | Curator orchestrator spawns sub-agents per ## Tasks (curated) wave in a shared worktree, ratchet validates each commit |
| /df:verify | Check specs satisfied (L0–L5), merge to main, rename doing- → done- |
Support
| Command | Purpose |
|---------|---------|
| /df:map | Generate codebase artifacts (sketch / impact / findings / structure / testing / CLAUDE.md) |
| /df:dashboard | View deepflow dashboard (team mode via URL or local mode via CLI server) |
| /df:eval | Run a skill against a benchmark suite, or scaffold a new benchmark dir |
| /df:update | Update or uninstall deepflow, check installed version |
File Structure
your-project/
+-- specs/
| +-- auth.md                # new spec (curated tasks live in this file)
| +-- doing-upload.md        # in progress
| +-- done-upload.md         # completed (decisions extracted)
+-- .deepflow/
    +-- config.yaml            # project settings (build/test commands, parallelism, optimize keys)
    +-- decisions.md           # auto-extracted [APPROACH] / [PROVISIONAL] / [FUTURE] / [UPDATE]
    +-- checkpoint.json        # /df:execute resume state (waves, completed tasks)
    +-- context.json           # statusline-written context % (read by /df:execute for parallelism gate)
    +-- auto-snapshot.txt      # ratchet baseline (pre-existing tests)
    +-- token-history.jsonl    # per-render token usage
    +-- experiments/           # spike results, named {topic}--{hypothesis}--{status}.md
    +-- maps/                  # /df:map artifacts per spec (sketch.md, impact.md, findings.md)
    +-- codebase/              # global artifacts (STRUCTURE.md, TESTING.md, CLAUDE.md)
    +-- results/               # per-task result archives + verify JSON
    +-- worktrees/curator-active/  # single shared execution branch

What Deepflow Rejects
- Predicting everything before doing — You discover what you need by building it. TDD assumes you already know the correct behavior before coding. Deepflow assumes that execution reveals what planning can't anticipate.
- LLM judging LLM — We started with adversarial selection (AI evaluating AI). We discovered gaming. We replaced it with objective metrics. Deepflow's own evolution proved the principle.
- Agents role-playing job titles — Flat orchestrator + model routing. No PM agent, no QA agent, no Scrum Master agent.
- Automated research before understanding — Conversation with you first. AI research comes after you've defined the problem.
- Ceremony — 8 commands, one flow. Markdown, not schemas. No sprint planning, no story points, no retrospectives.
Principles
- Discover before specifying, spike before implementing — Ask, debate, probe — then commit
- You define WHAT, AI figures out HOW — Specs are the contract
- Metrics decide, not opinions — Build/test/typecheck/lint/invariant-check/browser-verify are the only judges
- Confirm before assume — Search the code before marking "missing"
- Complete implementations — No stubs, no placeholders
- Atomic commits — One task = one commit
- Context-aware — Checkpoint before limits, resume seamlessly
Why This Architecture Works
Deepflow's design isn't opinionated — it's a direct response to measured LLM limitations:
Focused tasks > giant context — LLMs lose ~2% effectiveness per 100K additional tokens, even on trivial tasks (Chroma "Context Rot", 2025, 18 models tested). Accuracy drops from 89% at 8K tokens to 25% at 1M tokens (Augment Code, 2025). Deepflow keeps each task's context minimal and focused instead of loading the entire codebase.
Search efficiency > model capability — Coding agents spend 60% of their time searching, not coding (Cognition, 2025). Input tokens dominate cost with up to 10x variance driven entirely by search efficiency, not coding ability. Deepflow's LSP-first search and 3-phase explore protocol (DIVERSIFY/CONVERGE/EARLY STOP) minimize search waste.
The framework matters more than the model — Same model, same tasks, different orchestration: 25.6 percentage point swing on SWE-Bench Lite (GPT-4: 2.7% with naive retrieval vs 28.3% with structured orchestration). On SWE-Bench Pro, three products using the same model scored 17 problems apart on 731 issues — the only difference was how they managed context, search, and edits. Deepflow is that orchestration layer.
Tool use > context stuffing — Information in the middle of context has up to 40% less recall than at the start/end (Lost in the Middle, 2024, Stanford/TACL). LongMemEval (ICLR 2025) found GPT-4o scoring 60-64% at full context vs 87-92% with oracle retrieval. Agents access code on-demand via LSP (findReferences, incomingCalls) and grep — always fresh, no attention dilution.
Fresh context beats long sessions — Agent success rates drop off beyond roughly 35 minutes of equivalent task time; doubling task duration roughly quadruples the failure rate. Deepflow's curator orchestrator spawns each subagent in a fresh context with an inline bundle — checkpoint state, not conversation history.
Input:output ratio matters — Agent token ratio is ~100:1 input to output (Manus, 2025). Deepflow truncates ratchet output (success = zero tokens), context-forks high-ratio skills, and strips prompt sections by effort level to keep the ratio low.
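One way to picture that truncation (function and parameter names are made up): a passing check contributes zero tokens downstream, and a failing one contributes only the tail of its log.

```ts
// Illustrative only: keep the input side of the ratio near zero on success,
// and bounded on failure, by reporting nothing or only the log tail.
function reportForOrchestrator(passed: boolean, log: string, maxLines = 20): string {
  if (passed) return ""; // success costs zero input tokens downstream
  const lines = log.trimEnd().split("\n");
  return lines.slice(-maxLines).join("\n"); // failures: last N lines only
}
```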
Model routing > one-size-fits-all — Mechanical tasks with cheap models (haiku), complex tasks with powerful models (opus). Fewer tokens per task = less degradation = better results. Effort-aware context budgets strip unnecessary sections from prompts for simpler tasks.
Prompt order follows attention — Execute prompts follow the attention U-curve: critical instructions (task definition, failure history, success criteria) at start and end, navigable data (impact analysis, dependency context) in the middle. Distractors eliminated by design.
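A sketch of that ordering with invented section names; the point is only the placement, with critical material at the two ends and bulk reference data in the middle.

```ts
// Hypothetical prompt assembly following the attention U-curve:
// critical instructions at the start and end, navigable bulk data in the middle.
interface PromptSections {
  taskDefinition: string;
  failureHistory: string;
  successCriteria: string;
  impactAnalysis: string;    // bulk, navigable
  dependencyContext: string; // bulk, navigable
}

function assemblePrompt(s: PromptSections): string {
  return [
    s.taskDefinition,     // start: what to do
    s.failureHistory,     // start: what already failed
    s.impactAnalysis,     // middle: reference data
    s.dependencyContext,  // middle: reference data
    s.successCriteria,    // end: how "done" is judged
  ].join("\n\n");
}
```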
LSP-powered impact analysis — Plan-time uses findReferences and incomingCalls to map blast radius precisely. Execute-time runs a freshness check before implementing — catching callers added after planning. Grep as fallback — though embedding-based retrieval has a hard mathematical ceiling (Google DeepMind, 2025) that LSP doesn't share.
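The sketch below illustrates the idea against a minimal assumed JSON-RPC connection; the request names (textDocument/references, textDocument/prepareCallHierarchy, callHierarchy/incomingCalls) are standard LSP methods, everything else is illustrative.

```ts
// Illustrative blast-radius query via standard LSP requests. The Connection
// interface here is an assumption standing in for a real JSON-RPC client.
interface Connection {
  request(method: string, params: unknown): Promise<any>;
}

async function blastRadius(lsp: Connection, uri: string, line: number, character: number) {
  // All places that reference the symbol at this position.
  const references = await lsp.request("textDocument/references", {
    textDocument: { uri },
    position: { line, character },
    context: { includeDeclaration: false },
  });

  // Callers of the enclosing function, via the call-hierarchy protocol.
  const [item] = await lsp.request("textDocument/prepareCallHierarchy", {
    textDocument: { uri },
    position: { line, character },
  });
  const incoming = item
    ? await lsp.request("callHierarchy/incomingCalls", { item })
    : [];

  return { references, incoming }; // files and callers an edit could break
}
```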
Sub-agents
Sub-agents are spawned by the orchestrator with strict input/output contracts (enforced by the df-delegation-contract PreToolUse hook reading agents/DELEGATION.md). The orchestrator routes each task to a sub-agent by subagent_type; the sub-agent returns a structured payload, never freeform prose. A sketch of such a payload follows the table.
| Sub-agent | Role |
|-----------|------|
| df-implement | Executes one curated task — read by path, edit, run tests. No search tools. |
| df-integration | Cross-spec integration — implements tasks that span multiple specs or touch shared interfaces |
| df-spike | Proof-of-concept investigator — validates risky hypotheses, never edits existing files |
| df-test | Writes tests for a given module or feature, runs the suite, reports status |
| df-optimize | Performance/quality optimization — profiles, refactors for efficiency, no behavior changes |
| df-haiku-ops | Fast Bash-only ops (git, shell). No code reading/editing — pure mechanical execution |
| reasoner | Complex analysis — prioritization, debugging, architectural decisions (Opus-based) |
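For illustration, a payload in that spirit might look like the type below; the field names are assumptions, and the authoritative contract is agents/DELEGATION.md.

```ts
// Hypothetical shape of a sub-agent's structured return payload. The actual
// contract lives in agents/DELEGATION.md; these fields are illustrative only.
interface SubagentResult {
  task: string;                         // curated task id this run addressed
  status: "done" | "blocked" | "failed";
  commit?: string;                      // sha of the atomic commit, if any
  filesTouched: string[];
  checks: { build: boolean; tests: boolean; typecheck: boolean; lint: boolean };
  notes: string[];                      // tagged decisions, e.g. "[APPROACH] ..."
}
```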
Skills
Skills are reusable capabilities invoked by sub-agents or commands. Some run in forked context to keep input:output ratios low; others encode protocols (commit format, AC coverage, decision capture). An AC-coverage sketch follows the table.
| Skill | Purpose |
|-------|---------|
| gap-discovery | Surface missing requirements during ideation |
| df-decisions | Decision capture protocol — extracts [APPROACH] / [PROVISIONAL] / [ASSUMPTION] / [FUTURE] / [UPDATE] tags from agent output to .deepflow/decisions.md |
| df-ac-coverage | Acceptance-criteria tagging protocol — tests carry specs/<slug>.md#AC-<n> literals so /df:verify L3 can lint that every AC is exercised |
| atomic-commits | One logical change per commit, conventional format |
| code-completeness | Find TODOs, stubs, placeholders, skipped tests, missing implementations |
| browse-fetch | Fetch external API docs via headless Chromium (replaces context-hub) |
| browse-verify | L5 browser verification — Playwright a11y tree assertions |
| repo-inspect | Structured JSON intelligence for a remote GitHub repo — no local clone |
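As an example of the protocol skills, here is a rough sketch in the spirit of df-ac-coverage: a lint that treats any AC id declared in a spec but never referenced as a specs/<slug>.md#AC-<n> literal in a test file as uncovered. The spec regex and directory layout are assumptions, not the real skill.

```ts
// Rough sketch in the spirit of df-ac-coverage (not the real skill): find AC ids
// declared in the spec that no test references via a specs/<slug>.md#AC-<n> literal.
import { readFileSync, readdirSync } from "node:fs";
import { join } from "node:path";

function uncoveredAcs(specPath: string, testsDir: string): string[] {
  const spec = readFileSync(specPath, "utf8");
  // Assumed spec convention: acceptance criteria are labeled "AC-1", "AC-2", ...
  const acIds = [...new Set([...spec.matchAll(/\bAC-(\d+)\b/g)].map((m) => `AC-${m[1]}`))];

  const testSources = (readdirSync(testsDir, { recursive: true }) as string[])
    .filter((f) => /\.(test|spec)\.[jt]sx?$/.test(f))
    .map((f) => readFileSync(join(testsDir, f), "utf8"))
    .join("\n");

  // An AC counts as covered only if some test carries the full literal.
  return acIds.filter((id) => !testSources.includes(`${specPath}#${id}`));
}

console.log(uncoveredAcs("specs/image-upload.md", "tests"));
```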
Lifecycle hooks
23 hooks fire at specific Claude Code lifecycle events to enforce invariants without requiring orchestrator decisions:
- PreToolUse (Task spawns): df-codebase-inject injects relevant artifacts into agent prompts; df-delegation-contract enforces input/output contracts; df-implement-protocol / df-verify-protocol / df-explore-protocol add tool restrictions; df-bash-worktree-guard blocks cross-worktree mutations; df-worktree-precheck prevents stale-base spawns.
- PostToolUse (after edits): df-artifact-validate checks sketch/impact/findings consistency and emits a JSON Schema-validated drift report; df-spike-validate rejects schema-mismatch results; df-codebase-staleness flags artifact rot; df-experiment-immutable protects spike results; df-validate-tasks-gate audits curated task structure; df-harness-score records benchmark deltas.
- UserPromptSubmit: df-spec-lint validates spec format; df-invariant-check runs project invariants; df-check-update notifies when a new deepflow version is published.
All hooks are zero-dep Node, fail-open by default (warnings to stderr), and idempotent.
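A fail-open hook of that kind can be very small. The sketch below assumes the tool-call payload arrives as JSON on stdin and that the payload field names and blocking exit code follow Claude Code's hook conventions; treat both as assumptions rather than a copy of a deepflow hook.

```ts
#!/usr/bin/env node
// Sketch of a fail-open, zero-dependency hook. Payload field names and the
// blocking exit code are assumptions about Claude Code's hook conventions.
import { readFileSync } from "node:fs";

try {
  const payload = JSON.parse(readFileSync(0, "utf8")); // hook input arrives on stdin
  const command: string = payload?.tool_input?.command ?? "";

  // Placeholder invariant: refuse an obviously destructive command.
  if (/\brm -rf \//.test(command)) {
    console.error("blocked by invariant: destructive command");
    process.exit(2); // assumed blocking exit code
  }
  process.exit(0); // invariant holds: allow the tool call
} catch (err) {
  // Fail-open: an internal hook error warns on stderr but never blocks the agent.
  console.error(`hook warning (fail-open): ${err}`);
  process.exit(0);
}
```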
More
- Concepts — Philosophy and flow in depth
- Configuration — All options, models, parallelism
License
MIT
