agent-orchestra
v0.1.0
Multi-agent AI orchestration with confidence scoring for TypeScript
Why agent-orchestra?
Most AI agent frameworks give you building blocks but leave the hard problems to you: when should an agent's output be trusted? What happens when two agents disagree? How do you prevent a confident-but-wrong agent from causing damage?
agent-orchestra is a TypeScript framework for multi-agent orchestration where confidence scoring is the core primitive, not an afterthought. Every agent returns a calibrated confidence score. The orchestrator uses these scores to decide whether to proceed, cross-validate with another agent, or escalate to a human. The result is an AI system that knows what it doesn't know.
npm install agent-orchestra
Quickstart
import {
Orchestrator,
defineAgent,
ConfidenceThresholds,
} from "agent-orchestra";
// 1. Define specialized agents
const reviewer = defineAgent({
id: "code-reviewer",
description: "Reviews code changes for correctness",
execute: async (task, context) => {
const analysis = await yourLLM.analyze(task.payload);
return {
result: analysis.findings,
confidence: analysis.confidence,
rationale: analysis.reasoning,
};
},
});
const security = defineAgent({
id: "security-scanner",
description: "Checks for security vulnerabilities",
execute: async (task, context) => {
const scan = await yourLLM.scan(task.payload);
return {
result: scan.vulnerabilities,
confidence: scan.confidence,
rationale: scan.reasoning,
};
},
});
// 2. Create the orchestrator
const orchestra = new Orchestrator({
agents: [reviewer, security],
thresholds: ConfidenceThresholds.DEFAULT,
onEscalation: async (result) => {
console.log(`Escalating: ${result.rationale}`);
},
});
// 3. Run
const result = await orchestra.run({
type: "code-review",
payload: { diff: "..." },
});
console.log(result.aggregateConfidence); // 0.87
console.log(result.escalations); // []
That's all it takes to get a working multi-agent system with confidence gating.
Architecture
graph TB
subgraph Orchestrator
TQ[Task Queue] --> DC[Decomposer]
DC --> RT[Router]
RT --> AG[Aggregator]
AG --> CG[Confidence Gate]
CG -->|≥ 0.85| PR[Proceed]
CG -->|0.60 – 0.84| CV[Cross-Validate]
CG -->|< 0.60| ES[Escalate]
end
subgraph Agents
A1[Agent A]
A2[Agent B]
A3[Agent C]
end
subgraph Context
CB[Context Bus]
HS[Score History]
end
RT --> A1 & A2 & A3
A1 & A2 & A3 --> AG
A1 & A2 & A3 -.-> CB
ES -.-> HS
style CG fill:#1a1a2e,color:#fff
style PR fill:#16a34a,color:#fff
style CV fill:#ca8a04,color:#fff
style ES fill:#dc2626,color:#fff
The orchestrator receives a task, decomposes it into sub-tasks, routes each to the appropriate agent, collects results with confidence scores, and makes a decision. It never writes code or produces content itself; its job is coordination and judgment.
Core Concepts
Agents
An agent is a specialized unit that performs one task well. Each agent implements the Agent interface: a single execute method that takes a Task and returns an AgentResult with a confidence score.
import { defineAgent } from "agent-orchestra";
const myAgent = defineAgent({
id: "my-agent",
description: "Does one thing well",
taskTypes: ["analysis"], // optional: restrict to task types
schemaVersion: 1, // optional: reject incompatible tasks
execute: async (task, context) => {
// context.bus — read/write to the shared context bus
// context.history — past results for this task chain
const priorFindings = context.bus.get("upstream-findings");
const result = await doWork(task.payload, priorFindings);
// Write to context bus for downstream agents
context.bus.set("my-findings", result.findings);
return {
result: result.data,
confidence: result.confidence, // 0–1
rationale: "Explanation of confidence level",
evidencePaths: result.filesExamined,
};
},
});
Agents are model-agnostic. Use OpenAI, Anthropic, a local model, or a deterministic function; agent-orchestra doesn't care how you get the result, only that you return a confidence score with it.
Confidence Scoring
Every AgentResult includes a confidence field (0–1). The framework provides rubric helpers to keep scores calibrated:
import { ConfidenceRubric } from "agent-orchestra";
const reviewRubric = new ConfidenceRubric({
high: { range: [0.9, 1.0], criteria: "Full context, established patterns, small diff" },
medium: { range: [0.7, 0.9], criteria: "Well-understood but touches integration boundaries" },
low: { range: [0.5, 0.7], criteria: "Multiple plausible interpretations exist" },
guess: { range: [0.0, 0.5], criteria: "Insufficient context to make a determination" },
});
// Use in your agent
const score = reviewRubric.score("medium", 0.78);
// Returns 0.78, validated against the range
The orchestrator uses thresholds to gate decisions:
| Score | Action | Description |
|-------|--------|-------------|
| ≥ 0.85 | Proceed | Result is trusted. Move to next step. |
| 0.60 – 0.84 | Cross-validate | Route to a different agent for a second opinion. |
| < 0.60 | Escalate | Pause for human review. |
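The gating rule in the table can be sketched as a small pure function. This is a minimal illustration, not the library's implementation: `GateDecision`, `GateThresholds`, and `gateDecision` are hypothetical names, and the sketch assumes the bands are contiguous (anything from `review` up to, but not including, `proceed` is cross-validated).

```typescript
// Hypothetical types for illustration; not agent-orchestra's exported API.
type GateDecision = "proceed" | "cross-validate" | "escalate";

interface GateThresholds {
  proceed: number; // scores at or above this are trusted
  review: number; // scores below this are escalated to a human
}

// Mirrors the default values shown in the table above.
const DEFAULT_GATE_THRESHOLDS: GateThresholds = { proceed: 0.85, review: 0.6 };

function gateDecision(
  confidence: number,
  thresholds: GateThresholds = DEFAULT_GATE_THRESHOLDS,
): GateDecision {
  // Reject values outside the documented 0–1 range instead of guessing.
  if (!Number.isFinite(confidence) || confidence < 0 || confidence > 1) {
    throw new RangeError(`confidence must be in [0, 1], got ${confidence}`);
  }
  if (confidence >= thresholds.proceed) return "proceed";
  if (confidence >= thresholds.review) return "cross-validate";
  return "escalate";
}
```

With the default values, `gateDecision(0.87)` proceeds, `gateDecision(0.7)` asks for a second opinion, and `gateDecision(0.4)` escalates.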
Thresholds are configurable per deployment:
import { ConfidenceThresholds } from "agent-orchestra";
// Built-in presets
ConfidenceThresholds.DEFAULT // { proceed: 0.85, review: 0.60 }
ConfidenceThresholds.STRICT // { proceed: 0.92, review: 0.75 }
ConfidenceThresholds.PERMISSIVE // { proceed: 0.75, review: 0.50 }
// Custom
const custom = { proceed: 0.88, review: 0.65 };
Cross-Agent Chaining
When Agent A's output becomes Agent B's input, confidence propagates:
import { propagateConfidence } from "agent-orchestra";
const scores = [0.85, 0.78, 0.91];
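propagateConfidence(scores); // ≈ 0.70, conservative by design
The formula: max(product(scores), min(scores) * 0.9). Multiplicative attenuation penalizes long uncertain chains while the floor prevents unbounded pessimism. For the scores above, the product is about 0.60 and the floor is 0.78 × 0.9 ≈ 0.70, so the floor determines the result.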
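The stated formula is short enough to write out directly. The sketch below is a re-implementation for illustration under the formula as given, not the library's source; `propagateConfidenceSketch` is a hypothetical name, and returning 0 for an empty chain follows the example in the API reference.

```typescript
// Illustrative re-implementation of max(product(scores), min(scores) * 0.9).
function propagateConfidenceSketch(scores: readonly number[]): number {
  // An empty chain carries no evidence; the API reference documents 0 here.
  if (scores.length === 0) return 0;

  // Multiplicative attenuation: every extra uncertain step lowers the result.
  const product = scores.reduce((acc, score) => acc * score, 1);

  // Floor tied to the weakest link, so long chains are not driven toward zero.
  const floor = Math.min(...scores) * 0.9;

  return Math.max(product, floor);
}
```

For a three-step chain scored 0.9 at each step, the product is about 0.73 while the floor is 0.81, so the floor wins. For `[0.5, 0.9]` both terms are 0.45.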
The orchestrator handles chaining automatically:
const orchestra = new Orchestrator({
agents: [analyzer, reviewer, tester],
chains: [
{ from: "analyzer", to: "reviewer", when: "always" },
{ from: "reviewer", to: "tester", when: "confidence_below", threshold: 0.85 },
],
});
Disagreement Detection
When two agents assess the same artifact and disagree, the orchestrator detects it:
const orchestra = new Orchestrator({
agents: [reviewer, security],
onDisagreement: async (disagreement) => {
// disagreement.agents — ["code-reviewer", "security-scanner"]
// disagreement.summaryA — reviewer's rationale
// disagreement.summaryB — security's rationale
await notifyHuman(disagreement);
},
});
The orchestrator never resolves disagreements by averaging scores or picking the higher one. Disagreement between confident agents is a signal that human judgment is needed.
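The README does not specify how a disagreement is detected, so the following is only one plausible reading: two agents that are both confident yet reach different verdicts. The `verdict` field, the `minConfidence` cutoff, and all type and function names are assumptions made for this sketch; only `agents`, `summaryA`, and `summaryB` come from the handler comments above.

```typescript
// Hypothetical assessment shape; `verdict` is an assumed field, not a documented one.
interface AssessmentSketch {
  agentId: string;
  verdict: "approve" | "reject";
  confidence: number; // 0–1
  rationale: string;
}

// Fields taken from the onDisagreement comments in the example above.
interface DisagreementSketch {
  agents: [string, string];
  summaryA: string;
  summaryB: string;
}

function detectDisagreement(
  a: AssessmentSketch,
  b: AssessmentSketch,
  minConfidence = 0.85, // assumed cutoff for "confident"
): DisagreementSketch | null {
  const bothConfident = a.confidence >= minConfidence && b.confidence >= minConfidence;
  // Agreement, or an unsure agent, is left to the normal confidence gate.
  if (!bothConfident || a.verdict === b.verdict) return null;
  // No averaging and no picking a winner: both rationales go to a human.
  return { agents: [a.agentId, b.agentId], summaryA: a.rationale, summaryB: b.rationale };
}
```

The function returns both rationales untouched rather than a merged score, which matches the stated policy of handing confident disagreements to a person.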
Context Bus
Agents communicate through a shared context bus rather than stuffing full outputs into prompts:
// Agent A writes structured findings
context.bus.set("security-findings", {
vulnerabilities: [...],
scannedFiles: [...],
uncertainties: ["Could not determine if input is sanitized at L42"],
});
// Agent B reads only what it needs
const findings = context.bus.get("security-findings");
This keeps each agent's prompt focused and prevents context pollution.
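To make the idea concrete, a context bus can be as small as a keyed store with `get` and `set`. The class below is a stand-in written for this example, not the library's bus; `ContextBusSketch` and `SecurityFindingsSketch` are hypothetical names, and the sample values are made up.

```typescript
// Minimal in-memory stand-in for a shared context bus (illustrative only).
class ContextBusSketch {
  private readonly entries = new Map<string, unknown>();

  // Store a structured value under a well-known key.
  set<T>(key: string, value: T): void {
    this.entries.set(key, value);
  }

  // Read a value back; the caller states the shape it expects.
  get<T = unknown>(key: string): T | undefined {
    return this.entries.get(key) as T | undefined;
  }
}

// Assumed shape of the findings written by the security agent above.
interface SecurityFindingsSketch {
  vulnerabilities: string[];
  scannedFiles: string[];
  uncertainties: string[];
}

const sharedBus = new ContextBusSketch();

// Agent A writes structured findings.
sharedBus.set<SecurityFindingsSketch>("security-findings", {
  vulnerabilities: ["possible SQL injection"],
  scannedFiles: ["src/db.ts"],
  uncertainties: ["Could not determine if input is sanitized"],
});

// Agent B reads only the field it needs instead of the full upstream output.
const upstreamFindings = sharedBus.get<SecurityFindingsSketch>("security-findings");
const openQuestions = upstreamFindings?.uncertainties ?? [];
```

The type parameter on `get` is an unchecked cast, so in real code the reader would validate the shape before trusting it.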
Circuit Breaker
Built-in protection against runaway agents:
const orchestra = new Orchestrator({
agents: [reviewer],
circuitBreaker: {
maxConsecutiveFailures: 3,
maxActionsPerMinute: 20,
maxTokenBudget: 500_000,
onTrip: async (reason) => {
await alertOps(`Circuit breaker tripped: ${reason}`);
},
},
});
Circuit breakers require manual reset by default. An agent that enters a failure loop should not be allowed to retry on its own.
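The manual-reset behavior can be sketched with a small state holder. This covers only the consecutive-failure limit, not the rate or token budgets, and it assumes the breaker trips when the failure count reaches `maxConsecutiveFailures`. `CircuitBreakerSketch` and its methods are hypothetical names, not the library's API.

```typescript
// Illustrative breaker: trips after N consecutive failures, stays open until reset().
class CircuitBreakerSketch {
  private consecutiveFailures = 0;
  private tripped = false;
  private readonly maxConsecutiveFailures: number;
  private readonly onTrip: (reason: string) => void;

  constructor(maxConsecutiveFailures: number, onTrip: (reason: string) => void = () => {}) {
    this.maxConsecutiveFailures = maxConsecutiveFailures;
    this.onTrip = onTrip;
  }

  // True once the breaker has tripped; callers should stop dispatching work.
  get isOpen(): boolean {
    return this.tripped;
  }

  // A success clears the failure streak, but never closes a tripped breaker.
  recordSuccess(): void {
    if (!this.tripped) this.consecutiveFailures = 0;
  }

  recordFailure(): void {
    if (this.tripped) return;
    this.consecutiveFailures += 1;
    if (this.consecutiveFailures >= this.maxConsecutiveFailures) {
      this.tripped = true;
      this.onTrip(`${this.consecutiveFailures} consecutive failures`);
    }
  }

  // Manual reset: only an operator-initiated call reopens the pipeline.
  reset(): void {
    this.tripped = false;
    this.consecutiveFailures = 0;
  }
}
```

Nothing inside the class ever clears `tripped` on its own, which is what keeps a looping agent from retrying itself back into service.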
API Reference
defineAgent(config)
Creates a new agent instance.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| id | string | Yes | Unique agent identifier |
| description | string | Yes | Human-readable description |
| taskTypes | string[] | No | Task types this agent handles (all if omitted) |
| schemaVersion | number | No | Reject tasks with incompatible schema versions |
| execute | (task, context) => Promise<AgentResult> | Yes | The agent's work function |
Returns: Agent
new Orchestrator(config)
Creates the orchestrator.
| Parameter | Type | Required | Description |
|-----------|------|----------|-------------|
| agents | Agent[] | Yes | Array of agents to coordinate |
| thresholds | ThresholdConfig | No | Confidence thresholds (default: ConfidenceThresholds.DEFAULT) |
| chains | ChainConfig[] | No | Cross-agent chaining rules |
| circuitBreaker | CircuitBreakerConfig | No | Circuit breaker settings |
| maxChainDepth | number | No | Max chaining depth before forced escalation (default: 4) |
| onEscalation | (result) => Promise<void> | No | Called when a result is escalated |
| onDisagreement | (disagreement) => Promise<void> | No | Called when agents disagree |
orchestrator.run(task)
Executes a task through the orchestration pipeline.
| Parameter | Type | Description |
|-----------|------|-------------|
| task | Task | The task to execute |
Returns: Promise<OrchestratorResult>
interface OrchestratorResult {
results: AgentResult[];
escalations: Escalation[];
disagreements: Disagreement[];
aggregateConfidence: number;
tokensUsed: number;
durationMs: number;
}
AgentResult
Returned by every agent execution.
interface AgentResult {
agentId: string;
taskId: string;
result: unknown;
confidence: number; // 0–1
rationale: string; // why this confidence level
evidencePaths?: string[]; // files/resources examined
}
propagateConfidence(scores)
Computes aggregate confidence across a chain.
propagateConfidence([0.9, 0.85]) // 0.765
propagateConfidence([0.5, 0.9]) // 0.45
propagateConfidence([]) // 0
ConfidenceRubric
Helper for anchoring confidence scores to observable criteria.
const rubric = new ConfidenceRubric({ ... });
rubric.score(level, value) // Validates value is in the level's range
rubric.describe() // Returns human-readable rubric description
Comparison
| | agent-orchestra | LangGraph | CrewAI | AutoGen/AG2 |
|---|---|---|---|---|
| Core abstraction | Confidence-gated orchestration | State machine graphs | Role-based crews | Multi-party conversation |
| Confidence scoring | First-class primitive with rubrics, calibration, propagation | Not built-in | Not built-in | Not built-in |
| Disagreement detection | Automatic with configurable resolution | Manual | Manual | Emergent (uncontrolled) |
| Circuit breakers | Built-in | Not built-in | Not built-in | Not built-in |
| Cross-agent chaining | Declarative with confidence gating | Graph edges | Sequential/parallel tasks | Conversation turns |
| Human-in-the-loop | Confidence-triggered escalation | Checkpoint-based | Limited | Manual |
| Language | TypeScript | Python, TypeScript | Python | Python |
| Model lock-in | None | LangChain ecosystem | None | None |
| Framework weight | ~12 KB (zero dependencies) | Heavy (LangChain) | Medium | Medium |
When to use agent-orchestra: You need multiple AI agents to coordinate on tasks where reliability matters more than speed, and you want fine-grained control over when the system trusts its own output.
When to use something else: You need a full application framework (LangGraph), rapid prototyping with role-based agents (CrewAI), or conversational multi-agent research (AutoGen).
Features at a Glance
- Zero runtime dependencies — ships at ~13 KB minified
- Dual CJS/ESM — works in Node.js, Bun, Deno, and bundlers
- Full TypeScript — strict types with exported interfaces for everything
- Model-agnostic — use OpenAI, Anthropic, local models, or deterministic functions
- 38 tests — comprehensive coverage of confidence, agents, circuit breakers, and orchestration
Observability
agent-orchestra emits structured events for every orchestration decision:
const orchestra = new Orchestrator({
agents: [reviewer, security],
logger: {
onAgentStart: (agentId, task) => { /* ... */ },
onAgentComplete: (agentId, result) => { /* ... */ },
onConfidenceGate: (result, decision) => { /* ... */ },
onDisagreement: (disagreement) => { /* ... */ },
onCircuitBreak: (reason) => { /* ... */ },
},
});
Every event includes agentId, taskId, timestamp, tokensUsed, and confidence. Pipe these to your observability stack (Datadog, Grafana, LangSmith, or a plain JSON log) for dashboards, alerting, and calibration monitoring.
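For the plain JSON log option, one approach is to serialize each event as a single line and hand it to a sink. The sketch below assumes an event shape built from the five fields listed above; the `kind` discriminator, the ISO 8601 timestamp format, and the names `OrchestrationEventSketch` and `createJsonLineLogger` are assumptions for illustration.

```typescript
// Assumed event shape: the five documented fields plus a hypothetical `kind` tag.
interface OrchestrationEventSketch {
  kind: "agent-start" | "agent-complete" | "confidence-gate" | "disagreement" | "circuit-break";
  agentId: string;
  taskId: string;
  timestamp: string; // assumed to be ISO 8601
  tokensUsed: number;
  confidence: number;
}

// Returns a logger that writes one JSON document per line to the given sink.
function createJsonLineLogger(
  sink: (line: string) => void,
): (event: OrchestrationEventSketch) => void {
  return (event) => {
    sink(JSON.stringify(event));
  };
}

// Collect lines in memory here; a real sink might append to a file or stdout.
const collectedLines: string[] = [];
const logEvent = createJsonLineLogger((line) => {
  collectedLines.push(line);
});

logEvent({
  kind: "agent-complete",
  agentId: "code-reviewer",
  taskId: "task-1",
  timestamp: new Date().toISOString(),
  tokensUsed: 1200,
  confidence: 0.87,
});
```

Inside the `logger` hooks, each callback would build such an event and pass it to `logEvent`. One line per event keeps the log easy to tail and to load into a dashboard later.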
Contributing
See CONTRIBUTING.md for development setup, testing, and PR guidelines.
