@dzhechkov/skills-bto
v1.3.0
Build-Benchmark-Test-Optimize skill pack for Claude Code — deterministic benchmarking, quality gates, witness chain, judge attestation, and optimization
Multi-agent evaluation and iterative optimization pipeline for Claude Code skills, commands, and prompts. Includes deterministic benchmarking with golden sample comparison, test suites, consistency probes, and performance metrics. Part of the Keysarium ecosystem.
Quick Start
# One-command install via npx
npx @dzhechkov/skills-bto
# Or install globally
npm install -g @dzhechkov/skills-bto
skills-bto init
# Install into a project that already has @dzhechkov/keysarium
npx @dzhechkov/skills-bto init
After installation, open Claude Code in your project directory and start using BTO commands.
What You Get
| Component | Count | Description |
|-----------|-------|-------------|
| Skill | 1 | bto — core Build-Benchmark-Test-Optimize skill with 4 modules |
| Commands | 5 | /bto, /bto-build, /bto-benchmark, /bto-test, /bto-optimize |
| Rules | 1 | bto-quality-gates — quality gate enforcement (incl. benchmark gates) |
| Shards | 1 | bto-evaluation — context shard for BTO evaluation pipeline |
| Agent Templates | 2 | bto-judge-panel, bto-optimizer-worker |
| References | 5 | Eval patterns, judge rubrics, optimization methods, quality checklist, golden samples |
| Examples | 2 | Sample evaluation report, sample benchmark report |
Everything is installed into your project's .claude/ directory and works natively with Claude Code.
Commands
npx @dzhechkov/skills-bto # Full install (interactive, same as init)
npx @dzhechkov/skills-bto init # Install all components
npx @dzhechkov/skills-bto init --force # Overwrite existing files
npx @dzhechkov/skills-bto init --dry-run # Preview without making changes
npx @dzhechkov/skills-bto update # Update to latest version
npx @dzhechkov/skills-bto remove # Clean uninstall
npx @dzhechkov/skills-bto list # Show installed components
npx @dzhechkov/skills-bto doctor # Health check
BTO Pipeline
BUILD ──→ BENCHMARK ──→ TEST ──→ OPTIMIZE
│ │ │ │
│ │ │ └── Evolutionary mutation + re-evaluation (3 rounds)
│ │ └── Multi-layer evaluation: Layer 0 → Layer 1 → Layer 2
│ └── Deterministic benchmarking: golden samples, test suite, consistency, metrics
└── Generate skill/command from description
Usage in Claude Code
# Full BTO cycle: build → benchmark → test → optimize
/bto Create a skill for code review automation
# Build only — generate a new skill or command
/bto-build Create a skill that analyzes git commit patterns
# Benchmark only — deterministic benchmarking against golden samples
/bto-benchmark .claude/skills/my-skill/SKILL.md
# Test only — evaluate an existing artifact
/bto-test .claude/skills/my-skill/SKILL.md
# Optimize only — iteratively improve an artifact
/bto-optimize .claude/skills/my-skill/SKILL.md
Evaluation Architecture
Benchmark Layers (deterministic, pre-TEST)
| Layer | Cost | Purpose |
|-------|------|---------|
| B0 | Zero (deterministic) | Golden sample comparison — section coverage, ordering, proportions |
| B1 | Zero (deterministic) | Deterministic test suite — 5 tests per artifact type, PASS/FAIL |
| B2 | Minimal (3× haiku) | Consistency probe — 3 parallel agents, agreement measurement |
| B3 | Zero (deterministic) | Performance metrics — token efficiency, bloat detection, redundancy |
Scoring: BENCHMARK = B0×0.30 + B1×0.35 + B2×0.15 + B3×0.20
Gate: < 0.50 BLOCK | 0.50–0.70 WARN | > 0.70 PASS → proceed to TEST
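The weighted aggregation and gate thresholds above can be sketched in a few lines. This is an illustrative sketch only; the function names and data layout are assumptions, not BTO's actual implementation.

```python
# Benchmark layer weights as documented: B0×0.30 + B1×0.35 + B2×0.15 + B3×0.20
WEIGHTS = {"B0": 0.30, "B1": 0.35, "B2": 0.15, "B3": 0.20}

def benchmark_score(layer_scores: dict) -> float:
    """Weighted sum of per-layer scores, each expected in [0, 1]."""
    return sum(WEIGHTS[layer] * layer_scores[layer] for layer in WEIGHTS)

def gate(score: float) -> str:
    """Map a benchmark score to the documented gate outcome."""
    if score < 0.50:
        return "BLOCK"   # artifact needs rework before TEST
    if score <= 0.70:
        return "WARN"    # proceeds, but flagged
    return "PASS"        # proceed to TEST

total = benchmark_score({"B0": 0.9, "B1": 0.8, "B2": 0.7, "B3": 0.6})
print(round(total, 3), gate(total))  # → 0.775 PASS
```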
TEST Layer Model
| Layer | Agents | Model | Purpose |
|-------|--------|-------|---------|
| Layer 0 | 0 | — | Deterministic pre-checks (structure, completeness, encoding) |
| Layer 1 | 1 | haiku | Fast semantic evaluation across 5 dimensions |
| Layer 2 | 3 | sonnet | Full judge panel: Domain Expert + Critic + Completeness Auditor |
| Meta | 1 | opus | Disagreement resolution (triggered when score delta > 3) |
Judge Panel
- 3 independent judges evaluate each artifact in isolation
- Judges never see each other's scores before submitting
- Standard weights: Domain Expert (0.4) / Critic (0.3) / Completeness Auditor (0.3)
- If `max_score - min_score > 3` → meta-judge escalation
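The weighting and escalation rule above can be expressed compactly. A minimal sketch, assuming judge scores on a 0–10 scale; the identifiers here are illustrative, not BTO's actual code.

```python
# Standard judge weights as documented: 0.4 / 0.3 / 0.3
JUDGE_WEIGHTS = {"domain_expert": 0.4, "critic": 0.3, "completeness_auditor": 0.3}

def panel_score(scores: dict) -> float:
    """Weighted average of the three independent judge scores."""
    return sum(JUDGE_WEIGHTS[judge] * scores[judge] for judge in JUDGE_WEIGHTS)

def needs_meta_judge(scores: dict) -> bool:
    """Escalate to the opus meta-judge when judge disagreement exceeds 3."""
    values = scores.values()
    return max(values) - min(values) > 3

scores = {"domain_expert": 9, "critic": 4, "completeness_auditor": 7}
print(panel_score(scores), needs_meta_judge(scores))  # → 6.9 True
```

Because judges score in isolation, the spread `max - min` is a direct measure of genuine disagreement rather than anchoring.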
Quality Gates
- BENCHMARK must pass (score ≥ 0.50) before TEST begins
- BENCHMARK score < 0.50 → BLOCK (artifact needs rework)
- Layer 0 must pass before Layer 1
- Layer 1 must pass before Layer 2
- Optimization accepted only if `new_score - prev_score > 0.5`
- 3 consecutive iterations with delta ≤ 0.5 → convergence declared
- Score decrease > 1.0 → automatic rollback to previous best
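The acceptance, convergence, and rollback rules above translate directly into three small predicates. This is a hedged sketch under the documented thresholds; function names are assumptions for illustration.

```python
def accept(new_score: float, prev_score: float) -> bool:
    """Accept an optimization step only on a clear improvement (> 0.5)."""
    return new_score - prev_score > 0.5

def converged(deltas: list) -> bool:
    """Declare convergence after 3 consecutive deltas <= 0.5."""
    return len(deltas) >= 3 and all(d <= 0.5 for d in deltas[-3:])

def should_rollback(new_score: float, best_score: float) -> bool:
    """Roll back to the previous best when the score drops by more than 1.0."""
    return best_score - new_score > 1.0

print(accept(7.0, 6.0), converged([0.4, 0.3, 0.2]), should_rollback(5.0, 6.5))
```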
Optimization Process
The optimizer runs up to 3 rounds of evolutionary improvement:
- Round 1 — 5 parallel haiku agents generate mutations, fast-rank variants
- Round 2 — Top variants evaluated by sonnet judge panel
- Round 3 — 3×3 parallel sonnet agents for full Layer 2 evaluation of finalists
Each round selects the best-performing variant and uses it as the base for the next iteration.
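The select-best-then-mutate loop above can be sketched generically. This is an illustrative skeleton of the evolutionary pattern, not BTO's agent orchestration; `mutate` and `evaluate` stand in for the haiku mutation agents and judge-panel scoring respectively.

```python
def optimize(artifact, mutate, evaluate, rounds=3, population=5):
    """Evolutionary loop: generate variants, score them, carry the best forward."""
    best, best_score = artifact, evaluate(artifact)
    for _ in range(rounds):
        # One round: a population of mutations of the current base
        variants = [mutate(best) for _ in range(population)]
        champion = max(variants, key=evaluate)
        # Keep the champion only if it beats the current best
        if evaluate(champion) > best_score:
            best, best_score = champion, evaluate(champion)
    return best, best_score
```

In BTO the three rounds also escalate evaluation cost (haiku fast-rank → sonnet panel → full Layer 2), so the expensive judges only ever see a small set of finalists.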
Integration with Keysarium
BTO works standalone but integrates seamlessly with @dzhechkov/keysarium:
# Install Keysarium first (optional)
npx @dzhechkov/keysarium init
# Then add BTO — it detects Keysarium automatically
npx @dzhechkov/skills-bto init
When installed alongside Keysarium, BTO can evaluate and optimize any skill or command in the Keysarium toolkit.
Requirements
- Claude Code CLI — installed and configured (installation guide)
- Node.js >= 16.0.0 — required for the npm install method
