@vibeatlas/ship-reliability-spec
v1.0.0
# SHIP Reliability Specification v1.0
An open standard for measuring AI agent output reliability through outcome-validated scoring.
## What This Is
The SHIP Reliability Specification defines a methodology for computing calibrated reliability scores (0-100) for any AI agent's output. A score of 70 means: historically, outputs with similar characteristics succeed approximately 70% of the time.
The specification is:
- Domain-agnostic: The core methodology applies to any AI agent (code generation, customer service, finance, etc.)
- Outcome-validated: Scores are calibrated against real-world outcomes, not heuristics or static rules
- Open: Licensed CC-BY-4.0. Anyone can implement SHIP-compatible scoring.
## Files
| File | Description |
|------|-------------|
| SPEC.md | The full specification (~5000 words, RFC 2119 style) |
| SCORE_SCHEMA.json | JSON Schema (draft 2020-12) for score exchange format |
| examples/code-domain.md | Reference Domain Adapter for code (CI pass/fail outcomes) |
| examples/cs-domain.md | Reference Domain Adapter for customer service (ticket resolution) |
| examples/score-example.json | Example score response conforming to the schema |
| CHANGELOG.md | Version history |
| LICENSE | CC-BY-4.0 |
## Quick Start: Implement SHIP-Compatible Scoring

### 1. Choose a Domain
Pick the domain you want to score (code, customer service, or define a new one following Section 6 of the spec).
### 2. Collect Ground Truth
Gather at least 500 output-outcome pairs. For the code domain, this means AI-generated commits paired with CI pass/fail results. For customer service, this means AI agent responses paired with ticket resolution outcomes.
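One way to represent such a pair, shown here for the code domain (field names are illustrative, not mandated by the spec):

```python
# One output-outcome pair for the code domain.
# Field names are illustrative; the spec does not mandate a storage format.
commit_outcome_pair = {
    "output_id": "commit-a1b2c3",   # identifier for the AI-generated output
    "features": {                    # raw inputs for feature extraction (step 3)
        "lines_changed": 42,
        "files_touched": 3,
        "has_tests": True,
    },
    "outcome": 1,                    # 1 = CI passed, 0 = CI failed
}

# A training set is a list of such pairs; the spec requires at least 500.
dataset = [commit_outcome_pair]
assert all(pair["outcome"] in (0, 1) for pair in dataset)
```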
### 3. Extract Features
Implement feature extractors for your domain. At minimum, you need three metadata features. See examples/code-domain.md Section 3 for the code domain feature taxonomy.
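A minimal extractor sketch, assuming three hypothetical code-domain features (the normative taxonomy is in examples/code-domain.md Section 3):

```python
# Illustrative feature extractors for the code domain. These three feature
# names are hypothetical; consult the Domain Adapter for the real taxonomy.
def extract_features(commit: dict) -> dict:
    """Map a raw commit record to the metadata features the model consumes."""
    return {
        "lines_changed": commit.get("lines_changed", 0),
        "files_touched": commit.get("files_touched", 0),
        "has_tests": int(commit.get("has_tests", False)),
    }

features = extract_features(
    {"lines_changed": 42, "files_touched": 3, "has_tests": True}
)
```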
### 4. Train a Model
Train any supervised classifier (logistic regression, XGBoost, etc.) on your labeled data. The model must achieve AUC-ROC >= 0.70 on a held-out test set.
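The AUC-ROC gate can be checked with a small pairwise implementation (fine for modest held-out sets; prefer a library routine in production):

```python
# Pairwise AUC-ROC: the fraction of (positive, negative) pairs the model
# ranks correctly, counting ties as half. O(P*N), adequate for small sets.
def auc_roc(y_true, y_score):
    pos = [s for y, s in zip(y_true, y_score) if y == 1]
    neg = [s for y, s in zip(y_true, y_score) if y == 0]
    if not pos or not neg:
        raise ValueError("need both classes to compute AUC")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Held-out labels and predictions from any supervised classifier:
auc = auc_roc([1, 1, 0, 0], [0.9, 0.6, 0.4, 0.2])
assert auc >= 0.70, "model does not meet the SHIP AUC gate"
```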
### 5. Calibrate

Apply Platt scaling (recommended) or isotonic regression to your model's raw probabilities. Measure ECE and Brier Score.

```
p_calibrated = sigmoid(A * logit(p_raw) + B)
score = round(p_calibrated * 100)
```

### 6. Emit Conforming Scores
Return scores in the format defined by SCORE_SCHEMA.json:

```json
{
  "score": 73,
  "grade": "C",
  "confidence": 0.82,
  "domain": "code",
  "spec_version": "1.0.0"
}
```

### 7. Validate
Validate your score responses against the JSON Schema:
```shell
# Using ajv-cli
npm install -g ajv-cli ajv-formats
ajv validate -s SCORE_SCHEMA.json -d examples/score-example.json --spec=draft2020
```

```python
# Using Python jsonschema
import json
from jsonschema import Draft202012Validator

with open('SCORE_SCHEMA.json') as f:
    schema = json.load(f)
with open('examples/score-example.json') as f:
    instance = json.load(f)

Draft202012Validator(schema).validate(instance)
print("Valid!")
```

## Adding a New Domain
Follow the six-step Domain Extension Protocol in SPEC.md Section 6:
- Define Outcome Signals -- What does success/failure look like?
- Define Feature Extractors -- What inputs predict the outcome?
- Collect Ground Truth -- Gather 500+ output-outcome pairs
- Train and Validate -- AUC >= 0.70 on held-out data
- Calibrate -- ECE < 0.10 before production use
- Register the Domain -- Document the adapter (see examples/)
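Under the assumption that an adapter is registered in code (SPEC.md Section 6 defines the protocol, not an in-code representation), a sketch might look like:

```python
from dataclasses import dataclass
from typing import Callable

# Hypothetical in-code shape for a Domain Adapter; the spec defines the
# six-step protocol and its gates, not this particular representation.
@dataclass
class DomainAdapter:
    name: str
    outcome_signal: str                        # what success/failure means
    feature_extractor: Callable[[dict], dict]  # step 2 of the protocol
    min_ground_truth: int = 500                # required output-outcome pairs
    auc_gate: float = 0.70                     # AUC-ROC floor on held-out data
    ece_gate: float = 0.10                     # ECE ceiling before production

code_adapter = DomainAdapter(
    name="code",
    outcome_signal="CI pass/fail",
    feature_extractor=lambda commit: {
        "lines_changed": commit.get("lines_changed", 0),
    },
)
```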
## Grade Scale

| Grade | Score Range | Meaning |
|-------|-------------|---------|
| A+ | 95-100 | Very high reliability |
| A | 90-94 | High reliability |
| B | 80-89 | Above average |
| C | 65-79 | Moderate; review recommended |
| D | 50-64 | Below average; significant review recommended |
| F | 0-49 | Low reliability; manual review required |
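The grade boundaries above map directly to a lookup function:

```python
# Score-to-grade mapping from the grade scale table.
def score_to_grade(score: int) -> str:
    if not 0 <= score <= 100:
        raise ValueError("score must be in 0-100")
    if score >= 95:
        return "A+"
    if score >= 90:
        return "A"
    if score >= 80:
        return "B"
    if score >= 65:
        return "C"
    if score >= 50:
        return "D"
    return "F"
```

For example, the score of 73 in the schema example above falls in the 65-79 band and receives grade "C".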
## Key Formulas

Platt Scaling:

```
p_cal = sigmoid(A * logit(p_raw) + B)
```

Expected Calibration Error:

```
ECE = sum(|B_m|/N * |avg_pred(B_m) - avg_obs(B_m)|) for m = 1..10
```

Confidence:

```
confidence = 2 * model_certainty * data_sufficiency / (model_certainty + data_sufficiency)
model_certainty = 1 - 4 * p * (1 - p)
data_sufficiency = min(1.0, n_similar / N_threshold)
```

Bayesian Shrinkage (cross-tool comparison):

```
adjusted_rate = (n * observed_rate + k * global_rate) / (n + k)   where k = 50
```
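A sketch of the four formulas in Python; the `n_threshold` default of 100 is an assumption, since this document does not fix N_threshold:

```python
import math

def platt(p_raw: float, a: float, b: float) -> float:
    """Platt scaling: p_cal = sigmoid(A * logit(p_raw) + B)."""
    logit = math.log(p_raw / (1.0 - p_raw))
    return 1.0 / (1.0 + math.exp(-(a * logit + b)))

def ece(preds, outcomes, n_bins=10):
    """Expected Calibration Error over equal-width probability bins."""
    bins = [[] for _ in range(n_bins)]
    for p, y in zip(preds, outcomes):
        bins[min(int(p * n_bins), n_bins - 1)].append((p, y))
    total = len(preds)
    err = 0.0
    for b in bins:
        if b:
            avg_pred = sum(p for p, _ in b) / len(b)
            avg_obs = sum(y for _, y in b) / len(b)
            err += len(b) / total * abs(avg_pred - avg_obs)
    return err

def confidence(p: float, n_similar: int, n_threshold: int = 100) -> float:
    """Harmonic mean of model certainty and data sufficiency."""
    model_certainty = 1.0 - 4.0 * p * (1.0 - p)
    data_sufficiency = min(1.0, n_similar / n_threshold)
    if model_certainty + data_sufficiency == 0.0:
        return 0.0
    return (2 * model_certainty * data_sufficiency
            / (model_certainty + data_sufficiency))

def shrunk_rate(n: int, observed: float, global_rate: float, k: int = 50) -> float:
    """Bayesian shrinkage toward the global rate with prior weight k."""
    return (n * observed + k * global_rate) / (n + k)
```

Note that `confidence` is 0 when p = 0.5 (the model is maximally uncertain) and that `shrunk_rate` returns the global rate when a tool has no observations.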
## Reference Implementation

The SHIP Protocol API at https://ship-protocol.dhruvaapi.workers.dev is the reference implementation of this specification for the code domain. It scores AI-generated code commits against CI pass/fail outcomes using an XGBoost model with Platt scaling calibration.
## Contributing
This specification is open for community input. To propose changes:
- Open an issue describing the proposed change and its rationale
- Reference the specific section(s) affected
- Include any empirical evidence supporting the change
Changes to MUST/SHOULD/MAY requirements or grade boundaries require a major version increment.
## License
This specification is licensed under CC-BY-4.0. You are free to implement, adapt, and redistribute with attribution.
