@dzhechkov/skills-bto
v1.3.0
Build-Benchmark-Test-Optimize skill pack for Claude Code — deterministic benchmarking, quality gates, witness chain, judge attestation, and optimization
Multi-agent evaluation and iterative optimization pipeline for Claude Code skills, commands, and prompts. Includes deterministic benchmarking with golden sample comparison, test suites, consistency probes, and performance metrics. Part of the Keysarium ecosystem.
Quick Start
# One-command install via npx
npx @dzhechkov/skills-bto
# Or install globally
npm install -g @dzhechkov/skills-bto
skills-bto init
# Install into a project that already has @dzhechkov/keysarium
npx @dzhechkov/skills-bto init
After installation, open Claude Code in your project directory and start using BTO commands.
What You Get
| Component | Count | Description |
|-----------|-------|-------------|
| Skill | 1 | bto — core Build-Benchmark-Test-Optimize skill with 4 modules |
| Commands | 5 | /bto, /bto-build, /bto-benchmark, /bto-test, /bto-optimize |
| Rules | 1 | bto-quality-gates — quality gate enforcement (incl. benchmark gates) |
| Shards | 1 | bto-evaluation — context shard for BTO evaluation pipeline |
| Agent Templates | 2 | bto-judge-panel, bto-optimizer-worker |
| References | 5 | Eval patterns, judge rubrics, optimization methods, quality checklist, golden samples |
| Examples | 2 | Sample evaluation report, sample benchmark report |
Everything is installed into your project's .claude/ directory and works natively with Claude Code.
Commands
npx @dzhechkov/skills-bto # Full install (interactive, same as init)
npx @dzhechkov/skills-bto init # Install all components
npx @dzhechkov/skills-bto init --force # Overwrite existing files
npx @dzhechkov/skills-bto init --dry-run # Preview without making changes
npx @dzhechkov/skills-bto update # Update to latest version
npx @dzhechkov/skills-bto remove # Clean uninstall
npx @dzhechkov/skills-bto list # Show installed components
npx @dzhechkov/skills-bto doctor # Health check
BTO Pipeline
BUILD ──→ BENCHMARK ──→ TEST ──→ OPTIMIZE
│ │ │ │
│ │ │ └── Evolutionary mutation + re-evaluation (3 rounds)
│ │ └── Multi-layer evaluation: Layer 0 → Layer 1 → Layer 2
│ └── Deterministic benchmarking: golden samples, test suite, consistency, metrics
└── Generate skill/command from description
Usage in Claude Code
# Full BTO cycle: build → benchmark → test → optimize
/bto Create a skill for code review automation
# Build only — generate a new skill or command
/bto-build Create a skill that analyzes git commit patterns
# Benchmark only — deterministic benchmarking against golden samples
/bto-benchmark .claude/skills/my-skill/SKILL.md
# Test only — evaluate an existing artifact
/bto-test .claude/skills/my-skill/SKILL.md
# Optimize only — iteratively improve an artifact
/bto-optimize .claude/skills/my-skill/SKILL.md
Evaluation Architecture
Benchmark Layers (deterministic, pre-TEST)
| Layer | Cost | Purpose |
|-------|------|---------|
| B0 | Zero (deterministic) | Golden sample comparison — section coverage, ordering, proportions |
| B1 | Zero (deterministic) | Deterministic test suite — 5 tests per artifact type, PASS/FAIL |
| B2 | Minimal (3× haiku) | Consistency probe — 3 parallel agents, agreement measurement |
| B3 | Zero (deterministic) | Performance metrics — token efficiency, bloat detection, redundancy |
Scoring: BENCHMARK = B0×0.30 + B1×0.35 + B2×0.15 + B3×0.20
Gate: < 0.50 BLOCK | 0.50–0.70 WARN | > 0.70 PASS → proceed to TEST
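The weighted aggregation and gate thresholds above can be sketched in a few lines. This is an illustrative sketch only; the function names and data layout are assumptions, not BTO's actual implementation.

```python
# Benchmark layer weights as documented: B0×0.30 + B1×0.35 + B2×0.15 + B3×0.20
WEIGHTS = {"B0": 0.30, "B1": 0.35, "B2": 0.15, "B3": 0.20}

def benchmark_score(layer_scores: dict) -> float:
    """Weighted sum of per-layer scores, each expected in [0, 1]."""
    return sum(WEIGHTS[layer] * layer_scores[layer] for layer in WEIGHTS)

def gate(score: float) -> str:
    """Map a benchmark score to the documented gate outcome."""
    if score < 0.50:
        return "BLOCK"   # artifact needs rework before TEST
    if score <= 0.70:
        return "WARN"    # proceeds, but flagged
    return "PASS"        # proceed to TEST

total = benchmark_score({"B0": 0.9, "B1": 0.8, "B2": 0.7, "B3": 0.6})
print(round(total, 3), gate(total))  # → 0.775 PASS
```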
TEST Layer Model
| Layer | Agents | Model | Purpose |
|-------|--------|-------|---------|
| Layer 0 | 0 | — | Deterministic pre-checks (structure, completeness, encoding) |
| Layer 1 | 1 | haiku | Fast semantic evaluation across 5 dimensions |
| Layer 2 | 3 | sonnet | Full judge panel: Domain Expert + Critic + Completeness Auditor |
| Meta | 1 | opus | Disagreement resolution (triggered when score delta > 3) |
Judge Panel
- 3 independent judges evaluate each artifact in isolation
- Judges never see each other's scores before submitting
- Standard weights: Domain Expert (0.4) / Critic (0.3) / Completeness Auditor (0.3)
- If `max_score - min_score > 3` → meta-judge escalation
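The weighting and escalation rule above can be expressed compactly. A minimal sketch, assuming judge scores on a 0–10 scale; the identifiers here are illustrative, not BTO's actual code.

```python
# Standard judge weights as documented: 0.4 / 0.3 / 0.3
JUDGE_WEIGHTS = {"domain_expert": 0.4, "critic": 0.3, "completeness_auditor": 0.3}

def panel_score(scores: dict) -> float:
    """Weighted average of the three independent judge scores."""
    return sum(JUDGE_WEIGHTS[judge] * scores[judge] for judge in JUDGE_WEIGHTS)

def needs_meta_judge(scores: dict) -> bool:
    """Escalate to the opus meta-judge when judge disagreement exceeds 3."""
    values = scores.values()
    return max(values) - min(values) > 3

scores = {"domain_expert": 9, "critic": 4, "completeness_auditor": 7}
print(panel_score(scores), needs_meta_judge(scores))  # → 6.9 True
```

Because judges score in isolation, the spread `max - min` is a direct measure of genuine disagreement rather than anchoring.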
Quality Gates
- BENCHMARK must pass (score ≥ 0.50) before TEST begins
- BENCHMARK score < 0.50 → BLOCK (artifact needs rework)
- Layer 0 must pass before Layer 1
- Layer 1 must pass before Layer 2
- Optimization accepted only if `new_score - prev_score > 0.5`
- 3 consecutive iterations with delta ≤ 0.5 → convergence declared
- Score decrease > 1.0 → automatic rollback to previous best
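The acceptance, convergence, and rollback rules above translate directly into three small predicates. This is a hedged sketch under the documented thresholds; function names are assumptions for illustration.

```python
def accept(new_score: float, prev_score: float) -> bool:
    """Accept an optimization step only on a clear improvement (> 0.5)."""
    return new_score - prev_score > 0.5

def converged(deltas: list) -> bool:
    """Declare convergence after 3 consecutive deltas <= 0.5."""
    return len(deltas) >= 3 and all(d <= 0.5 for d in deltas[-3:])

def should_rollback(new_score: float, best_score: float) -> bool:
    """Roll back to the previous best when the score drops by more than 1.0."""
    return best_score - new_score > 1.0

print(accept(7.0, 6.0), converged([0.4, 0.3, 0.2]), should_rollback(5.0, 6.5))
```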
Optimization Process
The optimizer runs up to 3 rounds of evolutionary improvement:
- Round 1 — 5 parallel haiku agents generate mutations, fast-rank variants
- Round 2 — Top variants evaluated by sonnet judge panel
- Round 3 — 3×3 parallel sonnet agents for full Layer 2 evaluation of finalists
Each round selects the best-performing variant and uses it as the base for the next iteration.
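The select-best-then-mutate loop above can be sketched generically. This is an illustrative skeleton of the evolutionary pattern, not BTO's agent orchestration; `mutate` and `evaluate` stand in for the haiku mutation agents and judge-panel scoring respectively.

```python
def optimize(artifact, mutate, evaluate, rounds=3, population=5):
    """Evolutionary loop: generate variants, score them, carry the best forward."""
    best, best_score = artifact, evaluate(artifact)
    for _ in range(rounds):
        # One round: a population of mutations of the current base
        variants = [mutate(best) for _ in range(population)]
        champion = max(variants, key=evaluate)
        # Keep the champion only if it beats the current best
        if evaluate(champion) > best_score:
            best, best_score = champion, evaluate(champion)
    return best, best_score
```

In BTO the three rounds also escalate evaluation cost (haiku fast-rank → sonnet panel → full Layer 2), so the expensive judges only ever see a small set of finalists.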
Integration with Keysarium
BTO works standalone but integrates seamlessly with @dzhechkov/keysarium:
# Install Keysarium first (optional)
npx @dzhechkov/keysarium init
# Then add BTO — it detects Keysarium automatically
npx @dzhechkov/skills-bto init
When installed alongside Keysarium, BTO can evaluate and optimize any skill or command in the Keysarium toolkit.
Requirements
- Claude Code CLI — installed and configured (installation guide)
- Node.js >= 16.0.0 — required for the npm install method
