conkurrence

v1.0.3

Published

2 months ago

One command. Find out if your AI agrees with itself. Statistically validated consensus measurement using multi-model AI raters.

joe-quiry.ai

ai-evaluation golden-dataset ai-consensus llm-evaluation ai-testing conkurrence fleiss-kappa kendall-w inter-rater-reliability mcp bedrock statistical-validation

ConKurrence measures whether multiple AI models produce consistent outputs on your evaluation tasks — using the same psychometric methods trusted in clinical research (Fleiss' κ, Kendall's W, bootstrap confidence intervals).

Stop guessing whether your golden dataset is reliable. Know statistically.

Install

npm install -g conkurrence

Quick Start

# Initialize from a template
conkurrence init --template classification

# Edit data.json with your items, then run
conkurrence run --schema schema.json --data data.json --config config.json

# See the results
conkurrence report --input results.json

Output:

## Quick Assessment

🟢 sentiment — Strong agreement (κ = 0.847)
🟡 severity — Moderate agreement (κ = 0.523)
🔴 category — Poor agreement (κ = 0.189)

## Validity Assessment

✅ Instrument Calibrated — W = 0.8234 (threshold: 0.7)

You now know sentiment is reliable, severity needs review, and category should be redesigned before trusting it.
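For readers unfamiliar with the W statistic in the validity line above, here is a minimal sketch of the textbook Kendall's W (coefficient of concordance). The source does not say what ConKurrence ranks or how it handles ties, so this is only an illustration: `kendallW` is a hypothetical helper, not part of the package, and it assumes each rater supplies a complete ranking with no ties. Treating the 0.7 threshold from the sample output as an inclusive cutoff is also an assumption.

```typescript
// Textbook Kendall's W for m raters who each rank the same n items.
// ranks[r][i] is the rank (1..n) that rater r gave to item i.
// Assumption: no tied ranks, so no tie correction is applied.
function kendallW(ranks: number[][]): number {
  const m = ranks.length;
  if (m < 2) throw new Error("need at least two raters");
  const n = ranks[0].length;
  if (n < 2) throw new Error("need at least two items");

  // Sum the ranks each item received across all raters.
  const rankSums: number[] = [];
  for (let i = 0; i < n; i++) rankSums.push(0);
  for (const row of ranks) {
    if (row.length !== n) throw new Error("every rater must rank every item");
    for (let i = 0; i < n; i++) rankSums[i] += row[i];
  }

  // Squared deviations of the rank sums from their mean, m(n + 1) / 2.
  const mean = (m * (n + 1)) / 2;
  let s = 0;
  for (const r of rankSums) s += (r - mean) * (r - mean);

  // W = 12 S / (m^2 (n^3 - n)); 1 means identical rankings, 0 means none.
  return (12 * s) / (m * m * (n * n * n - n));
}

// Hypothetical usage: three raters ranking four items.
const w = kendallW([
  [1, 2, 3, 4],
  [1, 3, 2, 4],
  [2, 1, 3, 4],
]);
const THRESHOLD = 0.7; // value taken from the sample output; inclusive comparison is assumed
console.log(w >= THRESHOLD ? "calibrated" : "not calibrated", w.toFixed(4));
```

W near 1 means the raters order the items almost identically; W near 0 means their orderings are unrelated.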

Why ConKurrence

Your golden dataset has a hidden problem: you don't know if it's reproducible. If different models (or the same model on different days) give different answers, your "ground truth" is noise.

ConKurrence sends your evaluation tasks to multiple AI models, collects their independent ratings, and computes agreement statistics that tell you exactly which fields are reliable and which aren't.
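To make "agreement statistics" concrete, the sketch below computes the standard Fleiss' κ for a single categorical field. It is not ConKurrence's implementation, which the source does not show; `fleissKappa` and the sample labels are hypothetical, and the sketch assumes every item is rated by the same number of models.

```typescript
// Standard Fleiss' kappa for one categorical field.
// ratings[i] holds the labels that the raters (here, models) gave to item i.
// Assumption: every item has the same number of ratings.
function fleissKappa(ratings: string[][]): number {
  if (ratings.length === 0) throw new Error("need at least one item");
  const nRaters = ratings[0].length;
  if (nRaters < 2) throw new Error("need at least two raters per item");

  const categoryTotals: Record<string, number> = {};
  let sumItemAgreement = 0;

  for (const labels of ratings) {
    if (labels.length !== nRaters) {
      throw new Error("every item needs the same number of ratings");
    }
    // Count how many raters chose each category for this item.
    const counts: Record<string, number> = {};
    for (const label of labels) {
      counts[label] = (counts[label] || 0) + 1;
      categoryTotals[label] = (categoryTotals[label] || 0) + 1;
    }
    // P_i: share of rater pairs that agree on this item.
    let sumSquares = 0;
    for (const key of Object.keys(counts)) sumSquares += counts[key] * counts[key];
    sumItemAgreement += (sumSquares - nRaters) / (nRaters * (nRaters - 1));
  }

  // Observed agreement: mean of P_i over all items.
  const observed = sumItemAgreement / ratings.length;

  // Chance agreement: sum of squared overall category proportions.
  const total = ratings.length * nRaters;
  let expected = 0;
  for (const key of Object.keys(categoryTotals)) {
    const p = categoryTotals[key] / total;
    expected += p * p;
  }

  if (expected === 1) return 1; // only one category was ever used
  return (observed - expected) / (1 - expected);
}

// Hypothetical usage: three models labelling four items.
console.log(
  fleissKappa([
    ["pos", "pos", "pos"],
    ["neg", "neg", "pos"],
    ["neg", "neg", "neg"],
    ["pos", "pos", "pos"],
  ])
);
```

κ corrects raw agreement for the agreement expected by chance, so a field where the models almost always pick the same dominant label can still score low.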

Typical run: 30 items, 4 models, under $1 USD via AWS Bedrock.

Templates

Start from a template, customize for your domain:

conkurrence init --template <name>

| Template | Use Case |
|----------|----------|
| classification | Binary or multi-class labeling |
| extraction | Structured field extraction |
| summarization | Summary quality assessment |
| evidence-evaluation | Evidence relevance and limitations |

Commands

conkurrence run         Run convergence analysis
conkurrence report      Generate Markdown report from results
conkurrence estimate    Estimate cost before running (no API calls)
conkurrence compare     Compare two runs (before/after schema changes)
conkurrence finalize    Merge expert decisions into golden dataset
conkurrence suggest     AI-assisted schema improvement suggestions
conkurrence init        Initialize from template

MCP Server

ConKurrence includes a Model Context Protocol server for AI-assisted workflows:

conkurrence-mcp

It exposes 11 tools for running evaluations, analyzing trends, and getting schema suggestions, all from your AI coding assistant.

Programmatic API

import { runConvergence } from 'conkurrence';

const result = await runConvergence({
  schema: mySchema,
  data: myItems,
  config: { models: ['claude-sonnet', 'llama-3-70b', 'mistral-large'] }
});

console.log(result.fields.sentiment.kappa); // 0.847
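One way to act on a result is to list the fields whose κ falls below a cutoff you choose. The sketch below assumes only the `result.fields.<name>.kappa` shape visible in the example above; the `ConvergenceLike` type, the `fieldsBelow` helper, and the 0.6 cutoff are hypothetical and not part of the package. The sample values are copied from the Quick Start output.

```typescript
// Assumed result shape: only `fields[name].kappa` is shown in the README.
interface FieldAgreement {
  kappa: number;
}
interface ConvergenceLike {
  fields: Record<string, FieldAgreement>;
}

// Return the names of fields whose kappa is below `minKappa`.
function fieldsBelow(result: ConvergenceLike, minKappa: number): string[] {
  return Object.keys(result.fields).filter(
    (name) => result.fields[name].kappa < minKappa
  );
}

// Stand-in for a real result, using the values from the Quick Start output.
const sample: ConvergenceLike = {
  fields: {
    sentiment: { kappa: 0.847 },
    severity: { kappa: 0.523 },
    category: { kappa: 0.189 },
  },
};

// 0.6 is an arbitrary illustrative cutoff, not a ConKurrence default.
console.log(fieldsBelow(sample, 0.6)); // [ 'severity', 'category' ]
```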

Requirements

Node.js ≥ 20
AWS credentials with Bedrock access (bedrock:InvokeModel)
3 free runs included — no license key needed to evaluate

Pricing

First 3 runs are free. After that, purchase run packs at conkurrence.com.

Documentation

Full docs, interpretation guide, and API reference at conkurrence.com

License

BUSL-1.1: source available; production use beyond 3 runs requires a license. Converts to Apache 2.0 on the Change Date. See LICENSE.md for details.
