conkurrence
v1.0.3
One command. Find out if your AI agrees with itself. Statistically validated consensus measurement using multi-model AI raters.
ConKurrence measures whether multiple AI models produce consistent outputs on your evaluation tasks — using the same psychometric methods trusted in clinical research (Fleiss' κ, Kendall's W, bootstrap confidence intervals).
Stop guessing whether your golden dataset is reliable. Know statistically.
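To make the bootstrap confidence intervals mentioned above concrete, here is a minimal percentile-bootstrap sketch. It is not ConKurrence's implementation: the names `bootstrapCI`, `mulberry32`, and `unanimousShare` are hypothetical, and the statistic used (share of items on which all raters agree) is only a stand-in for whatever agreement statistic you care about.

```javascript
// Small seeded PRNG (mulberry32) so that resampling is reproducible.
function mulberry32(seed) {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Percentile bootstrap: resample items with replacement, recompute the
// statistic each time, and read the interval off the sorted results.
function bootstrapCI(items, statistic, { resamples = 2000, level = 0.95, seed = 1 } = {}) {
  const rand = mulberry32(seed);
  const stats = [];
  for (let b = 0; b < resamples; b++) {
    const sample = items.map(() => items[Math.floor(rand() * items.length)]);
    stats.push(statistic(sample));
  }
  stats.sort((x, y) => x - y);
  const alpha = (1 - level) / 2;
  return {
    estimate: statistic(items),
    lower: stats[Math.floor(alpha * (resamples - 1))],
    upper: stats[Math.ceil((1 - alpha) * (resamples - 1))],
  };
}

// Illustrative statistic (an assumption): fraction of items where every rater gave the same label.
function unanimousShare(rows) {
  return rows.filter((row) => row.every((r) => r === row[0])).length / rows.length;
}

// Each row holds the labels that the raters assigned to one item.
const demo = [["pos", "pos"], ["pos", "neg"], ["neg", "neg"], ["neg", "neg"]];
console.log(bootstrapCI(demo, unanimousShare, { resamples: 500, seed: 42 }));
```

A wide interval is a sign that the point estimate rests on too few items to be trusted on its own.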
Install
npm install -g conkurrence
Quick Start
# Initialize from a template
conkurrence init --template classification
# Edit data.json with your items, then run
conkurrence run --schema schema.json --data data.json --config config.json
# See the results
conkurrence report --input results.json
Output:
## Quick Assessment
🟢 sentiment — Strong agreement (κ = 0.847)
🟡 severity — Moderate agreement (κ = 0.523)
🔴 category — Poor agreement (κ = 0.189)
## Validity Assessment
✅ Instrument Calibrated — W = 0.8234 (threshold: 0.7)
You now know sentiment is reliable, severity needs review, and category should be redesigned before trusting it.
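
The W in the validity line is Kendall's coefficient of concordance. How ConKurrence builds the rankings it feeds into W is not described here, so the following is only a textbook sketch. It assumes every rater gives a numeric score to every item with no ties, and the function name `kendallsW` is hypothetical, not part of the package.

```javascript
// scores[j][i] = numeric score that rater j gave to item i (no ties assumed).
function kendallsW(scores) {
  const m = scores.length; // number of raters
  const n = scores[0].length; // number of items

  // Convert each rater's scores to ranks 1..n (1 = lowest score).
  const ranks = scores.map((row) => {
    const order = row.map((v, i) => [v, i]).sort((a, b) => a[0] - b[0]);
    const r = new Array(n);
    order.forEach(([, i], pos) => {
      r[i] = pos + 1;
    });
    return r;
  });

  // Rank sum per item, then the squared deviation from the mean rank sum.
  const sums = Array.from({ length: n }, (_, i) => ranks.reduce((s, r) => s + r[i], 0));
  const mean = (m * (n + 1)) / 2;
  const S = sums.reduce((s, R) => s + (R - mean) ** 2, 0);

  // W = 12 S / (m^2 (n^3 - n)), which ranges from 0 to 1.
  return (12 * S) / (m * m * (n ** 3 - n));
}

// Three raters scoring four items; the third rater swaps the middle two.
console.log(kendallsW([
  [1, 2, 3, 4],
  [1, 2, 3, 4],
  [1, 3, 2, 4],
]));
```

A value of 1 means every rater ordered the items identically, and 0 means the raters share no ordering at all. The sample report compares W against a threshold of 0.7.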
Why ConKurrence
Your golden dataset has a hidden problem: you don't know if it's reproducible. If different models (or the same model on different days) give different answers, your "ground truth" is noise.
ConKurrence sends your evaluation tasks to multiple AI models, collects their independent ratings, and computes agreement statistics that tell you exactly which fields are reliable and which aren't.
Typical run: 30 items, 4 models, under $1 USD via AWS Bedrock.
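
For intuition about the per-field agreement numbers, here is a minimal sketch of Fleiss' κ for one categorical field. It follows the standard formula rather than ConKurrence's internal code; the function name `fleissKappa` and the input shape (one row per item, one label per model) are assumptions made for illustration.

```javascript
// ratings[i][j] = category label that rater j assigned to item i.
// Every item must be rated by the same number of raters (at least 2).
// Note: if every rating falls in a single category, chance agreement is 1
// and the result is NaN (0 / 0).
function fleissKappa(ratings) {
  const nItems = ratings.length;
  const nRaters = ratings[0].length;
  const categories = [...new Set(ratings.flat())];

  // counts[i][c] = number of raters who put item i in category c.
  const counts = ratings.map((row) =>
    categories.map((c) => row.filter((r) => r === c).length)
  );

  // Observed agreement: mean over items of (sum_c n_ic^2 - n) / (n (n - 1)).
  const perItem = counts.map(
    (row) => (row.reduce((s, k) => s + k * k, 0) - nRaters) / (nRaters * (nRaters - 1))
  );
  const pObserved = perItem.reduce((s, p) => s + p, 0) / nItems;

  // Chance agreement: sum over categories of (share of all ratings in that category)^2.
  const pExpected = categories
    .map((_, c) => counts.reduce((s, row) => s + row[c], 0) / (nItems * nRaters))
    .reduce((s, p) => s + p * p, 0);

  return (pObserved - pExpected) / (1 - pExpected);
}

// Four items, each labeled by three models.
console.log(fleissKappa([
  ["pos", "pos", "pos"],
  ["neg", "neg", "neg"],
  ["pos", "neg", "pos"],
  ["neg", "neg", "pos"],
]));
```

κ is 1 when all raters agree on every item, near 0 when agreement is no better than chance, and negative when raters disagree more than chance would predict.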
Templates
Start from a template, customize for your domain:
conkurrence init --template <name>
| Template | Use Case |
|----------|----------|
| classification | Binary or multi-class labeling |
| extraction | Structured field extraction |
| summarization | Summary quality assessment |
| evidence-evaluation | Evidence relevance and limitations |
Commands
conkurrence run       Run convergence analysis
conkurrence report    Generate Markdown report from results
conkurrence estimate  Estimate cost before running (no API calls)
conkurrence compare   Compare two runs (before/after schema changes)
conkurrence finalize  Merge expert decisions into golden dataset
conkurrence suggest   AI-assisted schema improvement suggestions
conkurrence init      Initialize from template
MCP Server
ConKurrence includes a Model Context Protocol server for AI-assisted workflows:
conkurrence-mcp
11 tools for running evaluations, analyzing trends, and getting schema suggestions, all from your AI coding assistant.
Programmatic API
import { runConvergence } from 'conkurrence';
const result = await runConvergence({
  schema: mySchema,
  data: myItems,
  config: { models: ['claude-sonnet', 'llama-3-70b', 'mistral-large'] }
});
console.log(result.fields.sentiment.kappa); // 0.847
Requirements
- Node.js ≥ 20
- AWS credentials with Bedrock access (bedrock:InvokeModel)
- 3 free runs included; no license key needed to evaluate
Pricing
First 3 runs are free. After that, purchase run packs at conkurrence.com.
Documentation
Full docs, interpretation guide, and API reference at conkurrence.com
License
BUSL-1.1: the source is available, and production use beyond 3 runs requires a license. Converts to Apache 2.0 on the Change Date. See LICENSE.md for details.
