
@vibeatlas/ship-reliability-spec

v1.0.0


SHIP Reliability Specification v1.0

An open standard for measuring AI agent output reliability through outcome-validated scoring.

What This Is

The SHIP Reliability Specification defines a methodology for computing calibrated reliability scores (0-100) for any AI agent's output. A score of 70 means: historically, outputs with similar characteristics succeed approximately 70% of the time.

The specification is:

  • Domain-agnostic: The core methodology applies to any AI agent (code generation, customer service, finance, etc.)
  • Outcome-validated: Scores are calibrated against real-world outcomes, not heuristics or static rules
  • Open: Licensed CC-BY-4.0. Anyone can implement SHIP-compatible scoring.

Files

| File | Description |
|------|-------------|
| SPEC.md | The full specification (~5000 words, RFC 2119 style) |
| SCORE_SCHEMA.json | JSON Schema (draft 2020-12) for the score exchange format |
| examples/code-domain.md | Reference Domain Adapter for code (CI pass/fail outcomes) |
| examples/cs-domain.md | Reference Domain Adapter for customer service (ticket resolution) |
| examples/score-example.json | Example score response conforming to the schema |
| CHANGELOG.md | Version history |
| LICENSE | CC-BY-4.0 |

Quick Start: Implement SHIP-Compatible Scoring

1. Choose a Domain

Pick the domain you want to score (code, customer service, or define a new one following Section 6 of the spec).

2. Collect Ground Truth

Gather at least 500 output-outcome pairs. For the code domain, this means AI-generated commits paired with CI pass/fail results. For customer service, this means AI agent responses paired with ticket resolution outcomes.

3. Extract Features

Implement feature extractors for your domain. At minimum, you need three metadata features. See examples/code-domain.md Section 3 for the code domain feature taxonomy.
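As a sketch, a minimal extractor might look like the following. The feature names and the commit fields used here are hypothetical illustrations, not the normative taxonomy from examples/code-domain.md:

```python
# Hypothetical feature extractor for the code domain. The field names
# ("additions", "deletions", "files") and the three features are assumptions
# for illustration; see examples/code-domain.md Section 3 for the real taxonomy.
def extract_features(commit):
    """Map a commit dict to three metadata features."""
    return {
        "lines_changed": commit["additions"] + commit["deletions"],
        "files_touched": len(commit["files"]),
        "has_tests": any(f.startswith("tests/") for f in commit["files"]),
    }

features = extract_features({
    "additions": 40, "deletions": 12,
    "files": ["src/app.py", "tests/test_app.py"],
})
print(features)
```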

4. Train a Model

Train any supervised classifier (logistic regression, XGBoost, etc.) on your labeled data. The model must achieve AUC-ROC >= 0.70 on a held-out test set.
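The AUC-ROC gate can be checked without any ML library via the rank (Mann-Whitney) formulation; the labels and scores below are illustrative toy data, not from the spec:

```python
def auc_roc(y_true, y_score):
    """AUC-ROC via the rank (Mann-Whitney U) formulation: the probability
    that a random positive is scored above a random negative (ties count 0.5)."""
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy held-out set: 1 = CI pass, 0 = CI fail (illustrative values only)
y_true  = [1, 1, 1, 0, 0, 1, 0, 0]
y_score = [0.9, 0.8, 0.5, 0.65, 0.3, 0.75, 0.2, 0.4]
auc = auc_roc(y_true, y_score)
assert auc >= 0.70, "model does not meet the SHIP acceptance gate"
```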

5. Calibrate

Apply Platt scaling (recommended) or isotonic regression to your model's raw probabilities. Measure ECE and Brier Score.

p_calibrated = sigmoid(A * logit(p_raw) + B)
score = round(p_calibrated * 100)
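The two formulas above translate directly to Python. Note that the coefficients A and B below are made-up values for illustration; in practice they are fit by logistic regression on a held-out calibration split:

```python
import math

def platt_calibrate(p_raw, A, B):
    """Apply Platt scaling to a raw model probability:
    p_cal = sigmoid(A * logit(p_raw) + B)."""
    logit = math.log(p_raw / (1.0 - p_raw))
    return 1.0 / (1.0 + math.exp(-(A * logit + B)))

A, B = 0.85, 0.10           # illustrative coefficients, not from the spec
p_cal = platt_calibrate(0.73, A, B)
score = round(p_cal * 100)  # SHIP score on the 0-100 scale
print(score)
```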

6. Emit Conforming Scores

Return scores in the format defined by SCORE_SCHEMA.json:

{
  "score": 73,
  "grade": "C",
  "confidence": 0.82,
  "domain": "code",
  "spec_version": "1.0.0"
}

7. Validate

Validate your score responses against the JSON Schema:

# Using ajv-cli
npm install -g ajv-cli ajv-formats
ajv validate -s SCORE_SCHEMA.json -d examples/score-example.json --spec=draft2020

# Using Python jsonschema
import json
from jsonschema import Draft202012Validator

with open('SCORE_SCHEMA.json') as f:
    schema = json.load(f)
with open('examples/score-example.json') as f:
    instance = json.load(f)

Draft202012Validator(schema).validate(instance)
print("Valid!")

Adding a New Domain

Follow the six-step Domain Extension Protocol in SPEC.md Section 6:

  1. Define Outcome Signals -- What does success/failure look like?
  2. Define Feature Extractors -- What inputs predict the outcome?
  3. Collect Ground Truth -- Gather 500+ output-outcome pairs
  4. Train and Validate -- AUC >= 0.70 on held-out data
  5. Calibrate -- ECE < 0.10 before production use
  6. Register the Domain -- Document the adapter (see examples/)

Grade Scale

| Grade | Score Range | Meaning |
|-------|-------------|---------|
| A+ | 95-100 | Very high reliability |
| A | 90-94 | High reliability |
| B | 80-89 | Above average |
| C | 65-79 | Moderate; review recommended |
| D | 50-64 | Below average; significant review recommended |
| F | 0-49 | Low reliability; manual review required |
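A straightforward mapping from score to letter grade, following the table above:

```python
def grade(score):
    """Map a 0-100 SHIP score to its letter grade per the grade scale table."""
    for letter, lower_bound in (("A+", 95), ("A", 90), ("B", 80), ("C", 65), ("D", 50)):
        if score >= lower_bound:
            return letter
    return "F"

print(grade(73))  # "C": moderate reliability, review recommended
```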

Key Formulas

Platt Scaling:

p_cal = sigmoid(A * logit(p_raw) + B)

Expected Calibration Error:

ECE = sum(|B_m|/N * |avg_pred(B_m) - avg_obs(B_m)|)  for m = 1..10
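The ECE formula above can be computed directly over ten equal-width probability bins; the four prediction/outcome pairs below are illustrative:

```python
def ece(p_pred, y_true, n_bins=10):
    """Expected Calibration Error: sum over bins of |B_m|/N * |avg_pred - avg_obs|,
    with B_m the m-th equal-width probability bin."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(p_pred, y_true):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 falls in the last bin
        bins[idx].append((p, y))
    n = len(p_pred)
    total = 0.0
    for b in bins:
        if not b:
            continue
        avg_pred = sum(p for p, _ in b) / len(b)
        avg_obs = sum(y for _, y in b) / len(b)
        total += len(b) / n * abs(avg_pred - avg_obs)
    return total

# Illustrative data: each pair lands in its own bin here
print(ece([0.92, 0.81, 0.73, 0.34], [1, 1, 0, 0]))
```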

Confidence:

confidence = 2 * model_certainty * data_sufficiency / (model_certainty + data_sufficiency)
model_certainty = 1 - 4 * p * (1 - p)
data_sufficiency = min(1.0, n_similar / N_threshold)
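The confidence value is the harmonic mean of the two terms above. A direct translation follows; the N_threshold default here is an assumed value for illustration, not the normative one from SPEC.md:

```python
def confidence(p, n_similar, N_threshold=200):
    """Harmonic mean of model certainty and data sufficiency, per the
    formulas above. N_threshold=200 is an assumption for this sketch."""
    model_certainty = 1 - 4 * p * (1 - p)  # 0 at p=0.5, 1 at p=0 or p=1
    data_sufficiency = min(1.0, n_similar / N_threshold)
    if model_certainty + data_sufficiency == 0:
        return 0.0
    return 2 * model_certainty * data_sufficiency / (model_certainty + data_sufficiency)

# A confident prediction (p=0.9) backed by 150 similar historical outputs
print(confidence(0.9, 150))
```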

Bayesian Shrinkage (cross-tool comparison):

adjusted_rate = (n * observed_rate + k * global_rate) / (n + k)    where k = 50
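A direct implementation of the shrinkage formula with k = 50:

```python
def shrink(observed_rate, n, global_rate, k=50):
    """Bayesian shrinkage toward the global rate; with few observations (small n)
    the adjusted rate is pulled strongly toward global_rate."""
    return (n * observed_rate + k * global_rate) / (n + k)

# A tool with a perfect record over only 10 outputs is heavily discounted:
print(shrink(1.0, 10, 0.7))  # 0.75, far below the observed 1.0
```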

Reference Implementation

The SHIP Protocol API at https://ship-protocol.dhruvaapi.workers.dev is the reference implementation of this specification for the code domain. It scores AI-generated code commits against CI pass/fail outcomes using an XGBoost model with Platt scaling calibration.

Contributing

This specification is open for community input. To propose changes:

  1. Open an issue describing the proposed change and its rationale
  2. Reference the specific section(s) affected
  3. Include any empirical evidence supporting the change

Changes to MUST/SHOULD/MAY requirements or grade boundaries require a major version increment.

License

This specification is licensed under CC-BY-4.0. You are free to implement, adapt, and redistribute with attribution.