
sniffbench

v0.1.1

A benchmark suite for coding agents. Think pytest, but for evaluating AI assistants.

Readme

Sniffbench

A custom benchmark suite for coding agents.

What is this?

When you change your AI coding setup—switching models, adjusting prompts, or trying new tools—you're flying blind. Did it actually get better? Worse? Hard to say without data.

Sniffbench gives you that data. It runs your coding agent through evaluation tasks and measures what matters.

Quick Start

# Clone and build
git clone https://github.com/answerlayer/sniffbench.git
cd sniffbench
npm install
npm run build

# Link globally (optional)
npm link

# Check it's working
sniff --help
sniff doctor

What Works Now

Comprehension Interview

Test how well your agent understands a codebase:

sniff interview

This runs your agent through 12 comprehension questions about the codebase architecture. You grade each answer on a 1-10 scale to establish baselines. Future runs compare against your baseline.

╭─ sniff interview ────────────────────────────────────────────────╮
│ Comprehension Interview                                          │
│                                                                  │
│ Test how well your agent understands this codebase.              │
│ You'll grade each answer on a 1-10 scale to establish baselines. │
╰──────────────────────────────────────────────────────────────────╯

✔ Found 12 comprehension questions

  Questions to cover:

  ○ not graded  comp-001: Project Overview
  ○ not graded  comp-002: How to Add New Features
  ...
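
The baseline comparison described above works by persisting the grades you assign and diffing later runs against them. As a loose illustration only (hypothetical types and function, not sniffbench's actual storage format or API), the idea looks roughly like this in TypeScript:

// Hypothetical sketch: how graded answers might be tracked and compared.
// Names and shapes are illustrative only, not sniffbench's real data model.
interface GradedAnswer {
  caseId: string; // e.g. "comp-001"
  title: string;  // e.g. "Project Overview"
  grade: number;  // your 1-10 score for the agent's answer
}

// Compare a new run against a stored baseline, question by question.
function compareToBaseline(
  baseline: GradedAnswer[],
  current: GradedAnswer[],
): { caseId: string; delta: number }[] {
  const baselineGrades = new Map<string, number>();
  for (const answer of baseline) baselineGrades.set(answer.caseId, answer.grade);

  return current
    .filter((answer) => baselineGrades.has(answer.caseId))
    .map((answer) => ({
      caseId: answer.caseId,
      delta: answer.grade - baselineGrades.get(answer.caseId)!,
    }));
}

A positive delta means your new setup answered that question better than the baseline run; a negative delta flags a regression worth investigating.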

Case Management

# List all test cases
sniff cases

# Show details of a specific case
sniff cases show comp-001

# List categories
sniff cases categories

System Status

# Check sniffbench configuration
sniff status

# Run diagnostics (Docker, dependencies)
sniff doctor

What We Measure

Sniffbench evaluates agents on behaviors that matter for real-world development:

  1. Style Adherence - Does the agent follow existing patterns in the repo?
  2. Targeted Changes - Does it make specific, focused changes without over-engineering?
  3. Efficient Navigation - Does it research the codebase efficiently?
  4. Non-Regression - Do existing tests still pass?

We explicitly do NOT measure generic "best practices" divorced from project context. See VALUES.md for our full philosophy.
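
To make the four dimensions concrete, here is a rough sketch of how a single per-case result could be recorded. These types are hypothetical, chosen for illustration, and are not sniffbench's actual scoring schema:

// Hypothetical per-case result covering the four dimensions listed above.
// Field names and scales are illustrative; see VALUES.md for the philosophy.
interface CaseResult {
  caseId: string;
  styleAdherence: number;      // follows existing repo patterns (0-1)
  targetedChanges: number;     // focused diff, no over-engineering (0-1)
  efficientNavigation: number; // research effort relative to task size (0-1)
  nonRegression: boolean;      // existing test suite still passes
}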

Case Types

| Type | Description | Status |
|------|-------------|--------|
| Comprehension | Questions about codebase architecture | ✅ Ready |
| Bootstrap | Common tasks (fix linting, rename symbols) | 🚧 In Progress |
| Closed Issues | Real issues from your repo's history | 🚧 In Progress |
| Generated | LLM discovers improvement opportunities | 🚧 Planned |
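
For a sense of what a case definition involves, a comprehension case can be thought of as roughly the following shape. This is an assumption for illustration; the real case files may be structured differently:

// Hypothetical shape of a comprehension case, for illustration only.
interface ComprehensionCase {
  id: string;            // e.g. "comp-001"
  category: string;      // e.g. "architecture"
  question: string;      // what the agent is asked about the codebase
  gradingNotes?: string; // what a strong answer should cover
}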

Roadmap

We're building in phases:

  1. ✅ Foundation - CLI, Docker sandboxing, case management
  2. 🚧 Case Types - Comprehension, bootstrap, closed issues, generated
  3. Agent Integration - Claude Code, Cursor, Aider wrappers
  4. Metrics - Comprehensive scoring and comparison
  5. Multi-Agent - Cross-agent benchmarking

See ROADMAP.md for detailed phases.

Contributing

We welcome contributions! Areas that need work:

  • Agent wrappers - Integrate with OpenCode, Cursor, Gemini, or your favourite CLI-based coding agent
  • Bootstrap cases - Detection and validation for common tasks
  • Closed issues scanner - Extract cases from git history
  • Documentation - Examples, tutorials, case studies

See CONTRIBUTING.md to get started.

Prior Art

We researched existing solutions (SWE-Bench, CORE-Bench, Aider benchmarks). See existing_work.md for analysis.

License

MIT - see LICENSE

Questions?

Open an issue. We're building this in public and welcome feedback.