sniffbench
v0.1.1
A benchmark suite for coding agents. Think pytest, but for evaluating AI assistants.
Sniffbench
A custom benchmark suite for coding agents.
What is this?
When you change your AI coding setup—switching models, adjusting prompts, or trying new tools—you're flying blind. Did it actually get better? Worse? Hard to say without data.
Sniffbench gives you that data. It runs your coding agent through evaluation tasks and measures what matters.
Quick Start
# Clone and build
git clone https://github.com/answerlayer/sniffbench.git
cd sniffbench
npm install
npm run build
# Link globally (optional)
npm link
# Check it's working
sniff --help
sniff doctor
What Works Now
Comprehension Interview
Test how well your agent understands a codebase:
sniff interview
This runs your agent through 12 comprehension questions about the codebase architecture. You grade each answer on a 1-10 scale to establish baselines. Future runs compare against your baseline.
╭─ sniff interview ────────────────────────────────────────────────╮
│ Comprehension Interview │
│ │
│ Test how well your agent understands this codebase. │
│ You'll grade each answer on a 1-10 scale to establish baselines. │
╰──────────────────────────────────────────────────────────────────╯
✔ Found 12 comprehension questions
Questions to cover:
○ not graded comp-001: Project Overview
○ not graded comp-002: How to Add New Features
...
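Your grades become the baseline that later runs are compared against. Purely as an illustration of that idea (this is not sniffbench's actual storage format, and the names below are ours), a graded run and a baseline comparison could be modeled like this:
// Hypothetical sketch only -- sniffbench's real schema may differ.
interface InterviewGrade {
  caseId: string;      // e.g. "comp-001"
  title: string;       // e.g. "Project Overview"
  score: number;       // your 1-10 grade for the agent's answer
}
interface InterviewRun {
  agent: string;       // label for the setup being tested
  gradedAt: string;    // ISO timestamp
  grades: InterviewGrade[];
}
// Compare a new run against the stored baseline, question by question.
function compareToBaseline(baseline: InterviewRun, current: InterviewRun) {
  return current.grades.map((grade) => {
    const base = baseline.grades.find((b) => b.caseId === grade.caseId);
    return {
      caseId: grade.caseId,
      delta: base ? grade.score - base.score : null, // null if not in baseline
    };
  });
}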
Case Management
# List all test cases
sniff cases
# Show details of a specific case
sniff cases show comp-001
# List categories
sniff cases categories
System Status
# Check sniffbench configuration
sniff status
# Run diagnostics (Docker, dependencies)
sniff doctor
What We Measure
Sniffbench evaluates agents on behaviors that matter for real-world development:
- Style Adherence - Does the agent follow existing patterns in the repo?
- Targeted Changes - Does it make specific, focused changes without over-engineering?
- Efficient Navigation - Does it research the codebase efficiently?
- Non-Regression - Do existing tests still pass?
We explicitly do NOT measure generic "best practices" divorced from project context. See VALUES.md for our full philosophy.
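To make the four dimensions concrete, a single case's result might be summarized roughly like this. The field names and the example values are ours for illustration, not sniffbench's API or report format:
// Illustrative only -- field names are assumptions, not sniffbench's API.
interface CaseResult {
  caseId: string;
  styleAdherence: number;       // 0-1: follows existing repo patterns
  targetedChanges: number;      // 0-1: focused diff, no over-engineering
  navigationEfficiency: number; // 0-1: how directly the agent researched the codebase
  nonRegression: boolean;       // existing test suite still passes
}
const example: CaseResult = {
  caseId: "bootstrap-003",      // hypothetical case ID
  styleAdherence: 0.9,
  targetedChanges: 0.8,
  navigationEfficiency: 0.7,
  nonRegression: true,
};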
Case Types
| Type | Description | Status |
|------|-------------|--------|
| Comprehension | Questions about codebase architecture | ✅ Ready |
| Bootstrap | Common tasks (fix linting, rename symbols) | 🚧 In Progress |
| Closed Issues | Real issues from your repo's history | 🚧 In Progress |
| Generated | LLM discovers improvement opportunities | 🚧 Planned |
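To picture what distinguishes the case types, here is one hypothetical shape a case definition could take; the fields are illustrative, not the project's actual schema:
// Hypothetical case shape -- for illustration, not sniffbench's real format.
type CaseType = "comprehension" | "bootstrap" | "closed-issue" | "generated";
interface BenchCase {
  id: string;        // e.g. "comp-001"
  type: CaseType;
  category: string;  // grouping of the kind listed by `sniff cases categories`
  prompt: string;    // what the agent is asked to do or explain
  // Comprehension cases are graded by a human (1-10); task-style cases
  // would instead carry an automated check, e.g. a test command to run.
  validation?: { command: string };
}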
Roadmap
We're building in phases:
- ✅ Foundation - CLI, Docker sandboxing, case management
- 🚧 Case Types - Comprehension, bootstrap, closed issues, generated
- ⬜ Agent Integration - Claude Code, Cursor, Aider wrappers
- ⬜ Metrics - Comprehensive scoring and comparison
- ⬜ Multi-Agent - Cross-agent benchmarking
See ROADMAP.md for detailed phases.
Contributing
We welcome contributions! Areas that need work:
- Agent wrappers - Integrate with OpenCode, Cursor, Gemini, or your favourite CLI-based coding agent (see the sketch after this list)
- Bootstrap cases - Detection and validation for common tasks
- Closed issues scanner - Extract cases from git history
- Documentation - Examples, tutorials, case studies
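The real wrapper interface lives in the repo; as a rough sketch of the idea (the names below are ours, not the project's), a wrapper only needs to run a benchmark prompt inside an isolated workspace and hand back what the agent produced:
// Sketch of what an agent wrapper might look like -- not the actual interface.
interface AgentWrapper {
  name: string; // e.g. "cursor", "opencode"
  // Run one benchmark prompt inside an isolated workspace (such as the
  // Docker sandbox) and return the agent's output for grading.
  run(prompt: string, workspaceDir: string): Promise<{
    transcript: string;     // agent's full output
    filesChanged: string[]; // paths the agent modified
  }>;
}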
See CONTRIBUTING.md to get started.
Prior Art
We researched existing solutions (SWE-Bench, CORE-Bench, Aider benchmarks). See existing_work.md for analysis.
License
MIT - see LICENSE
Questions?
Open an issue. We're building this in public and welcome feedback.
