npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@collabb/hammurabi

v0.2.0

Published

Spec → rubric → fixtures → runner. Trust your outputs without re-reading every diff.

Readme

Hammurabi

Spec → rubric → fixtures → runner. A small, opinionated framework for authoring LLM evaluations you'll actually trust.

npm

Status: alpha (v0.2.0). On-disk loaders, CLI runner, a multi-provider judge panel (Anthropic + OpenAI + Google + DeepSeek) configured in the spec frontmatter, deterministic code-scored criteria, baseline-aware exit semantics (known-fail tier), import-resolution preflight, baseline regression detection, and a repo-wide hammurabi-check CI command are all shipped. Schemas use Zod v4.

Why

Teams ship faster when they can trust outputs without re-reading every diff. Hammurabi is the connective tissue: a single schema for specs, rubrics, and fixtures that any runner (or downstream tool like Knack) can consume.

Install

npm install @collabb/hammurabi

Or run the CLI directly without installing:

npx @collabb/hammurabi hammurabi-run path/to/foo.spec.md

CLI

Hammurabi ships two bins (available after install via npm bin).

hammurabi-run — one bundle:

hammurabi-run path/to/foo.spec.md \
  --aggregator min \
  --filter foo-001,foo-007 \
  --update-baseline

The judge panel resolves from the spec's eval block (see below); CLI flags override it. A sibling foo.baseline.report.json is auto-discovered for regression detection (--no-baseline to skip, --update-baseline to bless a new one). Writes foo.report.json and foo.report.md alongside the spec (or in --out <dir>).

hammurabi-check — every bundle under a directory (the CI entry point):

hammurabi-check evals/

Discovers each *.spec.md, runs it against its committed baseline, and aggregates into one check-report.json + a combined exit code.

Exit codes (both bins) are CI-meaningful and baseline-aware (since 0.1.1):

  • 0 — no NEW failures (vs baseline) and no regressions
  • 1 — any new failure (fixture passed in baseline, fails now) or regression
  • 2 — could not run (bad args, malformed bundle, runner error, unresolved import)

A fixture failing identically in the committed baseline surfaces as a known-fail (baselined) warning and does NOT break the gate — bundles with an honest baseline that isn't 100% green can still sit in CI. A baselined fixture that now passes surfaces as an improvement prompt to re-bless via --update-baseline.

Before scoring, both bins preflight every dynamic import the bundle uses (function-target modules + code-evaluator modules). Unresolvable deps exit 2 with a precise message naming the criterion or target. Skip with --no-preflight.

Run hammurabi-run --help / hammurabi-check --help for the full flag list. A drop-in GitHub Action template lives at templates/eval-gate.yml.

Slash commands

Copy them into any project's .claude/commands/:

cp node_modules/@collabb/hammurabi/commands/*.md .claude/commands/
  • /hammurabi <path-to-spec.md> — 4-step authoring flow: refine spec via Q&A → generate rubric → generate fixtures
  • /hammurabi-run <path-to-spec.md> — invoke the CLI, then summarize the report in chat (failures, regressions, errored fixtures, judge votes)

Schema

Three artifact types, all language-neutral on disk:

| Artifact | Format | TS type | |----------|--------|---------| | Spec | *.spec.md (frontmatter + body) | Spec | | Rubric | *.rubric.json | Rubric | | Fixtures | *.fixtures.jsonl | FixtureSet |

Reports are JSON: Report type.

import type { Spec, Rubric, FixtureSet, Report } from "@collabb/hammurabi/schema";
import { run } from "@collabb/hammurabi/runner";

Authoring guidance

Three patterns that consistently bite first-time bundle authors. Adopt them before you author the rubric. See docs/authoring-guide.md for the long-form walkthrough.

Import the contract — never re-encode it

If your target produces output conforming to a schema, JSON spec, or function signature, your evaluator should import the same definition the target uses. Hand-copying the key list, type shape, or validation rules into the evaluator guarantees drift the first time the target changes. Common form of the bug: a requiredKeys array drifts out of sync with the source *.schema.json, the gate silently rots, and the rubric reports green for non-conformant output.

Mirror your production validator's library and config

If the eval mirrors a production validator (ajv, zod, joi, …), pin the same major version the production code uses and apply the same config (formats, coerceTypes, strict, …). An ajv 8 evaluator gated against ajv 6 production code is not the same gate.

Bundles using external deps must be installed packages

A bundle directory that imports npm packages must contain a package.json with the deps declared. Relying on "the parent project happens to hoist it" breaks the documented hammurabi-check evals/ use case the moment the bundle runs in a CI workspace or fresh checkout. Since 0.1.1, both CLI bins preflight every dynamic import before scoring and emit a precise error pointing at the offending criterion or target — the root-cause fix is still a declared dep.

The eval block — judge panel in the spec frontmatter

How rigorously a spec is judged — the size and reasoning of the panel — lives in the spec's frontmatter eval block, so it's version-controlled and reviewable rather than an ephemeral CLI flag:

---
name: trade-sizer
version: "1.0.0"
description: ...
target: { kind: http, url: https://... }
eval:
  riskTier: critical          # shorthand → a default cross-provider panel
  aggregator: min             # any single judge flagging a problem fails it
  generatorProvider: anthropic # warns if a judge shares this provider
  judges:                     # ...or spell the panel out explicitly
    - { provider: anthropic, model: claude-opus-4-8,   role: primary,    reasoning: high }
    - { provider: openai,    model: gpt-4o,            role: secondary,  reasoning: none }
    - { provider: google,    model: gemini-2.5-pro,    role: tiebreaker, reasoning: medium }
---
  • riskTier (low | medium | high | critical) expands to a default panel via RISK_TIER_PRESETS — higher tier means more judges, more providers, more reasoning, more conservative aggregation. Override any of it with explicit judges / aggregator. The presets use Anthropic/OpenAI/Google only; DeepSeek is opt-in — add it via an explicit judges entry.
  • reasoning (none | low | medium | high | a token budget) maps to each provider's mechanism: Anthropic extended thinking, OpenAI reasoning_effort, Gemini thinking budget, DeepSeek deepseek-reasoner.
  • Cross-provider bias mitigation — no judge should share the output's provider (same-family models rate their own style leniently). Set generatorProvider and the runner warns when a judge collides with it.

Resolution precedence: CLI/RunOptions override > eval.judges > eval.riskTier preset > a single default Haiku judge.

Set the provider keys you use: ANTHROPIC_API_KEY, OPENAI_API_KEY, GEMINI_API_KEY, DEEPSEEK_API_KEY. Base-URL overrides are per provider and each only redirects its own provider's traffic: AI_GATEWAY_URL (Anthropic), OPENAI_BASE_URL (OpenAI), DEEPSEEK_BASE_URL (DeepSeek). There is no single variable that routes every provider through one gateway.

Deterministic (code-scored) criteria

Reserve the judge panel for judgment calls; let code check what code can check exactly (recall, presence, format, latency). A criterion with a code evaluator is scored by importing a function — no judge call, perfectly reproducible:

{
  "id": "semantic-recall",
  "name": "Semantic recall",
  "description": "Fraction of must-include elements present in the top-N.",
  "weight": 0.35,
  "scale": { "kind": "ordinal", "min": 0, "max": 1 },
  "evaluator": { "kind": "code", "module": "./evaluators.ts", "export": "semanticRecall" }
}

The export receives { input, expected, output, fixture } and returns a number or { score, reasoning } in the criterion's scale. LLM and code criteria mix freely in one rubric; a rubric with no LLM criteria makes zero API calls.

Runner

import { run } from "@collabb/hammurabi";

const report = await run({
  spec,      // panel comes from spec.frontmatter.eval unless overridden here
  rubric,
  fixtures,
  judges: [{ provider: "anthropic", model: "claude-haiku-4-5" }], // optional override
  aggregator: "min",
  baseline: previousReport,
});

Judge panel

The runner judges each fixture's output with a configurable panel of LLMs, normally authored in the spec's eval block. These RunOptions override it for a one-off run. Default (no eval block, no override) is a single Haiku 4.5 judge.

| Option | Default | Notes | |---|---|---| | judges | from eval block, else [{ provider: "anthropic", model: "claude-haiku-4-5" }] | Array of judge configs (provider, model, role, reasoning, weight). Panel calls run in parallel per fixture. | | aggregator | "mean" | "mean" \| "median" \| "min" \| "max" or a custom (scores: number[]) => number. Risk-sensitive callers should use "min". | | regressionThreshold | 0.05 | Per-fixture weighted-score delta below which a regression is flagged vs the baseline. | | execute | — | Required for target.kind === "free-form". Custom executor that returns the output for a given input. |

Cost note: panel size multiplies cost linearly. A 2-judge panel costs 2× a 1-judge panel. Wall-clock latency is roughly constant (panel calls run in parallel via Promise.all).

Audit trail

Each CriterionScore in the report preserves the full panel's votes:

{
  criterionId: "is-uppercase",
  score: 0.5,                     // aggregated
  reasoning: "[claude-haiku-4-5] ...\n\n[claude-sonnet-4-6] ...",
  judgeVotes: [
    { model: "claude-haiku-4-5", score: 1, reasoning: "..." },
    { model: "claude-sonnet-4-6", score: 0, reasoning: "..." },
  ],
}

If a judge errors or omits a criterion, its vote still lands in judgeVotes with an error field — but it is excluded from the aggregate, so a transient failure can never fabricate a low score and a false regression. A criterion that no judge could score marks the fixture errored (⚠), never failed (✗), and still trips a non-zero CI exit so it can't pass silently.

Target kinds

Spec.frontmatter.target is a tagged union:

| kind | Required fields | How the runner executes | |---|---|---| | cli | command | Shell-exec; fixture input is piped to stdin as JSON; stdout parsed as JSON (falls back to plain text). | | function | module, export | Dynamic import(module), calls module[export](input). Module path should be importable from the caller's resolution context. | | http | url, optional method (default POST) | fetch with application/json body. | | free-form | description | Calls options.execute(input). Required when kind is free-form. |

Smoke test

npm install
npm run smoke

Requires ANTHROPIC_API_KEY (and optionally AI_GATEWAY_URL for Cloudflare AI Gateway routing).

Panel smoke:

HAMMURABI_SMOKE_JUDGES="claude-haiku-4-5,claude-sonnet-4-6" HAMMURABI_SMOKE_AGGREGATOR=min npm run smoke

Expected cost: ~$0.02 for a single-judge run, ~$0.10 for a 2-judge panel including Sonnet.

Layout

hammurabi/
├── src/
│   ├── schema/       # Spec, Rubric, Fixture, Report types
│   ├── loaders/      # On-disk parsers (*.spec.md, *.rubric.json, *.fixtures.jsonl)
│   ├── runner/       # Execute fixtures + judge panel + score
│   ├── cli/          # hammurabi-run CLI + Report → markdown renderer
│   └── index.ts
├── commands/         # /hammurabi, /hammurabi-run
├── examples/smoke/   # End-to-end smoke (inline + disk variants)
├── tests/            # Unit tests (tsx --test)
└── plans/            # Approved implementation plans