aoide
Aoide (Ancient Greek: Ἀοιδή, "the singing one") was one of the three original Muses in Greek mythology — the Muse of song and vocal expression. Before words are written they must first be spoken, and before a prompt reaches a model it must be tested.
aoide is that test.
A TypeScript testing framework for LLM-powered applications. Write tests that send real prompts to language models and assert on their responses — including deterministic checks, JSON schema validation, and LLM-as-judge evaluations.
import { describe, it, expect, runPrompt, registerProvider, beforeAll } from '@templum/aoide';
import { OpenAIProvider } from '@templum/aoide/providers/openai';
beforeAll(() => {
registerProvider(new OpenAIProvider('openai', process.env.OPENAI_API_KEY!));
});
describe('Customer support bot', () => {
it('responds empathetically to a complaint', async () => {
const response = await runPrompt(
{ provider: 'openai', model: 'gpt-4o-mini' },
{ messages: [{ role: 'user', content: 'My order is late and I am frustrated.' }] },
);
await expect(response).toPassLLMJudge({
criteria: 'The response acknowledges the frustration and offers to help.',
threshold: 0.8,
});
});
});

Features
- Familiar API — describe, it, beforeAll/Each, afterAll/Each, expect
- Built-in assertions — string, numeric, regex, JSON Schema, token/cost budgets, and .not negation
- LLM evaluators — judge scoring, semantic similarity, tone checking, factual consistency, persona matching, topic avoidance, and structural equivalence
- Prompt caching — dual SHA-256 snapshot cache (app + eval) so re-runs are free; --update-snapshots refreshes app prompts without wiping the eval cache
- Telemetry — per-test and global token counts and cost estimates
- Multi-provider — OpenAI, Anthropic, Ollama, LM Studio, or any custom provider
- Concurrency — configurable per provider; local providers default to 1 to protect host resources
- Retry policy — automatic exponential back-off on transient API errors (429, 502/503/504, DNS/network blips)
- Watch mode — re-run tests on file change; new test files are discovered automatically
- Programmatic API — run tests from Node scripts or CI pipelines without shelling out
Built with AI
Note: aoide was developed with significant assistance from Large Language Models, which helped write its code, tests, and documentation.
Table of Contents
- Installation
- Quick Start
- Configuration
- Test File Format
- Providers
- Assertions API
- Actions
- Programmatic API
- CLI Reference
- Snapshot Caching
- Retry Policy
- Telemetry
- Concurrency
- Common Pitfalls
- Troubleshooting
Installation
npm install --save-dev @templum/aoide

Requirements: Node.js ≥ 22
Note: aoide is currently in active pre-release development (0.x). The public API is stable, but minor breaking changes may occur before 1.0. Pin your version in package.json if you need stability across installs.
Quick Start
npx @templum/aoide init

This creates aoide.config.ts and examples/basic.promptest.ts. Edit the config to add your judge target and API key, then run:
npx @templum/aoide

Configuration
Create aoide.config.ts in your project root:
import type { AoideConfig } from '@templum/aoide';
const config: AoideConfig = {
// LLM used to evaluate judge-based assertions
judge: {
target: { provider: 'openai', model: 'gpt-4o-mini' },
temperature: 0.0,
// systemPrompt: 'You are a strict evaluator. Be concise.',
},
// Embedder used for toBeSemanticallySimilarTo assertions (optional)
// embedder: {
// target: { provider: 'openai', model: 'text-embedding-3-small' },
// },
// Glob patterns for test files (default: ['**/*.promptest.ts'])
testMatch: ['**/*.promptest.ts'],
// Reporters: 'terminal' (default), 'json'
reporters: ['terminal', 'json'],
// Output path for the JSON reporter (default: 'aoide-results.json')
// jsonReporterOutputPath: 'results/aoide.json',
// Per-test timeout in ms (default: 30000)
defaultTestTimeout: 60_000,
// Retry policy for transient API errors (optional)
retryPolicy: {
maxRetries: 3, // default: 3
backoffMs: 100, // base back-off; actual delay uses full-jitter exponential back-off
},
// Override pricing for cost tracking (optional)
// pricingOverrides: {
// 'openai:gpt-4o': { input: 2.5, output: 10 }, // per 1M tokens in USD
// },
};
export default config;

Config Fields
| Field | Type | Default | Description |
| --- | --- | --- | --- |
| judge.target | ModelTarget | — | Model to use as judge for LLM assertions |
| judge.temperature | number | — | Judge model temperature |
| judge.systemPrompt | string | — | Optional custom system prompt for the judge |
| embedder.target | ModelTarget | — | Model to use for toBeSemanticallySimilarTo assertions |
| testMatch | string[] | ['**/*.promptest.ts'] | Glob patterns for test files |
| reporters | string[] | ['terminal'] | Active reporters (terminal, json); unknown names log a warning and fall back to terminal |
| jsonReporterOutputPath | string | 'aoide-results.json' | Output file path for the JSON reporter |
| defaultTestTimeout | number | 30000 | Per-test timeout in milliseconds (must be greater than 0) |
| retryPolicy.maxRetries | number | 3 | Max retry attempts on transient API errors |
| retryPolicy.backoffMs | number | 100 | Base back-off in ms (full-jitter exponential) |
| pricingOverrides | Record<string, { input: number; output: number }> | — | Override token pricing for cost estimates (USD per 1M tokens) |
Test File Format
Test files match **/*.promptest.ts by default.
import {
describe, it, expect,
runPrompt, runTournament,
registerProvider, setupJudge, setupEmbedder,
beforeAll, afterAll, beforeEach, afterEach,
} from '@templum/aoide';
import { OpenAIProvider } from '@templum/aoide/providers/openai';
beforeAll(() => {
registerProvider(new OpenAIProvider('openai', process.env.OPENAI_API_KEY!));
});
describe('My suite', () => {
it('test name', async () => {
const response = await runPrompt(
{ provider: 'openai', model: 'gpt-4o-mini' },
{ messages: [{ role: 'user', content: 'Say hello.' }] },
);
expect(response).toContain('hello');
});
});
// Top-level tests (no describe block) are supported:
it('top-level test', async () => {
// ...
});

Focused and Skipped Tests
it.only('only this runs', async () => { /* ... */ });
it.skip('skip this', async () => { /* ... */ });
describe.only('only this suite', () => { /* ... */ });
describe.skip('skip this suite', () => { /* ... */ });

If any .only is present, all other tests in that file are automatically skipped. Focus is scoped per file — a .only in one file does not affect other files.
Per-Model Tests
Run the same test against multiple models concurrently:
const targets = [
{ provider: 'openai', model: 'gpt-4o-mini' },
{ provider: 'openai', model: 'gpt-4o' },
];
it.eachModel(targets)('summarises correctly', async (target) => {
const response = await runPrompt(target, { messages: [...] });
expect(response).toContain('summary');
});

Each model becomes a separate test named summarises correctly [openai:gpt-4o-mini], summarises correctly [openai:gpt-4o], and so on. All run concurrently within their provider's concurrency limit.
Note: Provider and model names are sanitised in the generated test name (brackets replaced, whitespace normalised); the original values are always used for dispatch. Local providers (id prefix local:) are automatically limited to 1 concurrent request, and remote providers default to 5. Use --max-workers or setProviderConcurrency() to override.
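For example, to let a local Ollama instance handle two concurrent requests instead of the default cap of 1 (illustrative — the id must match the provider you registered):

import { setProviderConcurrency } from '@templum/aoide';

// Raise the concurrency cap for a registered local provider.
setProviderConcurrency('local:ollama', 2);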
Providers
OpenAI
import { OpenAIProvider } from '@templum/aoide/providers/openai';
// also available as: import { OpenAIProvider } from '@templum/aoide';
registerProvider(new OpenAIProvider('openai', process.env.OPENAI_API_KEY!));
// Custom base URL (e.g. Azure OpenAI):
registerProvider(new OpenAIProvider('azure', process.env.AZURE_KEY!, 'https://...'));

Supports embeddings (toBeSemanticallySimilarTo).
Anthropic
import { AnthropicProvider } from '@templum/aoide';
registerProvider(new AnthropicProvider('anthropic', process.env.ANTHROPIC_API_KEY!));
// Custom API version (default: '2024-06-01'):
registerProvider(new AnthropicProvider('anthropic', process.env.ANTHROPIC_API_KEY!, 'https://api.anthropic.com/v1', '2024-06-01'));
// Custom default max_tokens (default: 4096). Anthropic requires this field in
// every request. Raise it if your tests need longer responses:
registerProvider(new AnthropicProvider('anthropic', process.env.ANTHROPIC_API_KEY!, undefined, undefined, 8192));

Note: Anthropic requires max_tokens in every API call. aoide defaults to 4096, which is suitable for most test responses. Per-request overrides take precedence: runPrompt(target, { maxTokens: 1024, ... }).
Ollama (local)
import { OllamaProvider } from '@templum/aoide/providers/ollama';
// also available as: import { OllamaProvider } from '@templum/aoide';
// Default id is 'local:ollama' — automatically runs at concurrency 1
registerProvider(new OllamaProvider());
// Custom id and URL:
registerProvider(new OllamaProvider('local:ollama', 'http://localhost:11434'));

Supports embeddings (toBeSemanticallySimilarTo).
Local provider tip: Any provider id starting with local: is automatically limited to 1 concurrent request, preventing host overload during it.eachModel or parallel tests.
LM Studio (local)
import { LMStudioProvider } from '@templum/aoide/providers/lmstudio';
// also available as: import { LMStudioProvider } from '@templum/aoide';
// Default id is 'local:lmstudio' — automatically runs at concurrency 1
registerProvider(new LMStudioProvider());

Custom Provider
import type { LLMProvider } from '@templum/aoide';
const myProvider: LLMProvider = {
  id: 'my-provider',
  async execute(model, request) {
    // ... call your API, then map its reply into aoide's response shape:
    //   text        — the model's reply (string)
    //   rawResponse — raw API response body
    //   usage       — { promptTokens, completionTokens, totalTokens }
    //   metadata    — { latencyMs, providerId, model }
    return { text, rawResponse, usage, metadata };
  },
  // Optional — required for toBeSemanticallySimilarTo
  async getEmbeddings(model, request) {
    // embeddings: one vector per input string
    return { embeddings, usage, metadata };
  },
};
registerProvider(myProvider);

Embedder Setup (required for semantic similarity)
Configure via aoide.config.ts:
const config: AoideConfig = {
embedder: {
target: { provider: 'openai', model: 'text-embedding-3-small' },
},
};

Or programmatically in a beforeAll:
import { setupEmbedder } from '@templum/aoide';
setupEmbedder({ target: { provider: 'openai', model: 'text-embedding-3-small' } });

Assertions API
All assertions are available on expect(value). value may be a string, a ModelResponse (from runPrompt), a number, or any other value for the general assertions.
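Because plain numbers are accepted, token budgets read naturally. An illustrative budget check — target and request stand in for values defined in your own test:

it('stays within the token budget', async () => {
  const response = await runPrompt(target, request);
  // usage shape is documented under runPrompt below
  expect(response.usage.totalTokens).toBeLessThan(500); // illustrative budget
});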
Synchronous
// Exact equality
expect(response).toBe('exact value');
// Null / defined / truthiness
expect(response.text).toBeDefined();
expect(noResponse).toBeUndefined();
expect(value).toBeNull();
expect(value).toBeTruthy();
expect(value).toBeFalsy();
// String containment
expect(response).toContain('substring');
expect(response).toContain('substring', { ignoreCase: true });
// Regex format
expect(response).toMatchExactFormat(/^\d{3}-\d{4}$/);
// JSON Schema
expect(response).toMatchJsonSchema({
type: 'object',
properties: { name: { type: 'string' } },
required: ['name'],
});
// Numeric (useful for token counts, scores)
expect(42).toBeGreaterThan(10);
expect(42).toBeGreaterThanOrEqual(42);
expect(5).toBeLessThan(10);
expect(5).toBeLessThanOrEqual(5);

Negation (.not)
Every assertion has a .not form:
expect(response).not.toContain('error');
expect(response).not.toMatchExactFormat(/^\s*$/);
expect(value).not.toBeNull();
expect(value).not.toBeFalsy();
await expect(response).not.toPassLLMJudge({
criteria: 'Contains harmful content',
threshold: 0.5,
});
await expect(response).not.toBeSemanticallySimilarTo('off-topic text', 0.5);

LLM-as-Judge
Requires judge in config or setupJudge().
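Programmatic setup mirrors the judge block in aoide.config.ts. A sketch, assuming setupJudge accepts the same shape as the config field (as setupEmbedder does):

import { setupJudge, beforeAll } from '@templum/aoide';

beforeAll(() => {
  // Assumed to take the same shape as the `judge` config block.
  setupJudge({ target: { provider: 'openai', model: 'gpt-4o-mini' }, temperature: 0 });
});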
Important: These assertions return a Promise — always await them. A missing await silently drops the assertion, and the test will always pass.
await expect(response).toPassLLMJudge({
criteria: 'The response is concise and directly answers the question.',
threshold: 0.75, // 0.0–1.0, default: 0.7
judgeOverride: { provider: 'openai', model: 'gpt-4o' }, // optional per-call override
});

The judge score and reasoning are attached to the test result and visible in the JSON report.
Tone Checking
Important: Always await tone assertions.
await expect(response).toHaveTone('empathetic');
await expect(response).toHaveTone('professional');
await expect(response).toHaveTone('urgent');
await expect(response).toHaveTone('concise', { threshold: 0.8 });
// Any freeform descriptor works:
await expect(response).toHaveTone('playful and informal');

Built-in tones: empathetic, professional, urgent. Any other string is used verbatim as the tone description.
Semantic Similarity
Requires setupEmbedder(). Always await.
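The threshold is compared against the cosine similarity of the two embedding vectors. A minimal sketch of that computation (illustrative — not aoide's internal code):

// Cosine similarity between two embedding vectors.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}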
await expect(response).toBeSemanticallySimilarTo(
'The capital of France is Paris.',
0.85, // cosine similarity threshold, default: 0.85
);

toMatchJsonSchema — accepted input types
expect('{"name":"Alice"}').toMatchJsonSchema({ type: 'object', required: ['name'] });
expect(response).toMatchJsonSchema({ type: 'object', required: ['name'] });
expect({ name: 'Alice' }).toMatchJsonSchema({ type: 'object', required: ['name'] });

Actions
runPrompt
const response = await runPrompt(
{ provider: 'openai', model: 'gpt-4o-mini' },
{
system: 'You are a helpful assistant.',
messages: [{ role: 'user', content: 'Hello' }],
temperature: 0.7,
maxTokens: 256,
},
);
response.text // string — the model's reply
response.usage // { promptTokens, completionTokens, totalTokens }
response.metadata // { latencyMs, providerId, model }
response.rawResponse // raw API response body

Must be called inside an it() callback.
runTournament
Evaluates multiple models on the same prompt and returns the winner. Must be called inside an it() callback.
const result = await runTournament('summarisation quality', {
targets: [
{ provider: 'openai', model: 'gpt-4o-mini' },
{ provider: 'anthropic', model: 'claude-haiku-4-5-20251001' },
],
request: { messages: [{ role: 'user', content: 'Summarise: ...' }] },
judgeCriteria: 'The summary is accurate, concise, and covers the key points.',
iterations: 3, // runs per model, default: 1
});
result.winner // ModelTarget — highest average score
result.scores // Array<{ target, averageScore, responses, reasoning }>

Note: When iterations > 1, each iteration bypasses the snapshot cache to ensure independent results. With iterations: 1 (the default), caching is used normally.
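A typical follow-up inside the same it() block (illustrative):

// Sanity-check the result and inspect per-model scores.
expect(result.winner).toBeDefined();
for (const s of result.scores) {
  console.log(`${s.target.provider}:${s.target.model} → avg ${s.averageScore.toFixed(2)}`);
}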
Programmatic API
Run tests from a Node.js script or CI pipeline without shelling out to the CLI:
import { runTests } from '@templum/aoide';
const result = await runTests({
// Optional inline config — skips file discovery
config: {
judge: { target: { provider: 'openai', model: 'gpt-4o-mini' } },
reporters: ['terminal'],
},
// Or point to a config file:
// configPath: './my-config.ts',
// Explicit test files (overrides testMatch globs):
// testFiles: ['tests/summarise.promptest.ts'],
// Only run tests whose name matches this regex:
grep: 'summarise',
noCache: false,
updateSnapshots: false,
});
console.log(`Passed: ${result.passed}, Failed: ${result.failed}`);
console.log(`Total cost: $${(result.telemetry.appCost + result.telemetry.evalCost).toFixed(4)}`);
if (!result.ok) process.exit(1);

RunTestsOptions
| Field | Type | Description |
| --- | --- | --- |
| config | AoideConfig | Inline config; takes precedence over configPath |
| configPath | string | Path to an aoide.config.ts file |
| testFiles | string[] | Explicit file list; overrides testMatch globs |
| grep | string | Regex pattern — only matching tests run |
| noCache | boolean | Bypass snapshot cache |
| updateSnapshots | boolean | Re-fetch all prompts and refresh cache |
TestRunResult
| Field | Type | Description |
| --- | --- | --- |
| passed | number | Tests that passed |
| failed | number | Tests that failed |
| skipped | number | Tests that were skipped |
| durationMs | number | Total wall-clock time in ms |
| telemetry | TelemetrySummary | Aggregated token usage and cost |
| ok | boolean | true if no tests or afterAll hooks failed |
CLI Reference
aoide [options]
aoide init [--force]

Commands
| Command | Description |
| --- | --- |
| aoide | Run all test files matching testMatch |
| aoide init | Scaffold aoide.config.ts and an example test file |
| aoide init --force | Overwrite existing config and example files |
Options
| Flag | Short | Description |
| --- | --- | --- |
| --help | -h | Show help |
| --watch | -w | Re-run tests on file change; new test files matching testMatch are picked up automatically |
| --update-snapshots | -u | Refresh snapshot cache |
| --no-cache | | Disable caching for this run |
| --config <path> | -c | Config file (default: aoide.config.ts) |
| --test-match <glob> | | Override test file glob |
| --grep <pattern> | | Only run tests whose name matches the regex |
| --reporter <name> | | Reporter: terminal, json (repeatable) |
| --json-output <path> | | Output path for JSON reporter (default: aoide-results.json) |
| --max-workers <n> | | Max concurrent requests (remote providers only) |
| --timeout <ms> | | Override default test timeout (default: 30000) |
| --max-retries <n> | | Max retries on transient API errors (default: 3) |
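Flags can be combined. For example, to run only tests whose names match a pattern and also write a JSON report to a custom path:

npx @templum/aoide --grep summarise --reporter terminal --reporter json --json-output results/aoide.json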
Snapshot Caching
aoide maintains two separate caches, both stored inside __prompt_snapshots__/:
| Cache | Directory | What is stored |
| --- | --- | --- |
| App cache | __prompt_snapshots__/ | runPrompt / runTournament responses |
| Eval cache | __prompt_snapshots__/eval/ | Judge and embedding evaluator responses |
Both caches are keyed by a SHA-256 hash of the inputs (provider, model, messages, system prompt, temperature). Add __prompt_snapshots__/ to .gitignore, or commit it to make CI runs free.
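Conceptually the key is derived along these lines — a sketch only; the exact field order and serialisation aoide uses internally may differ:

import { createHash } from 'node:crypto';

// Hypothetical sketch of the cache-key derivation described above.
function snapshotKey(input: {
  provider: string;
  model: string;
  messages: Array<{ role: string; content: string }>;
  system?: string;
  temperature?: number;
}): string {
  return createHash('sha256').update(JSON.stringify(input)).digest('hex');
}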
npx @templum/aoide --update-snapshots # re-fetch all app prompts and refresh their cache
npx @templum/aoide --no-cache # bypass both caches entirely

| Strategy | Command | Effect |
| --- | --- | --- |
| Use cache (default) | npx @templum/aoide | Both caches read — no API spend |
| Refresh app prompts | npx @templum/aoide --update-snapshots | Re-fetches app prompts only; eval cache is preserved |
| Skip all caches | npx @templum/aoide --no-cache | Always hits live APIs (e.g. CI with live keys) |
Why two caches? --update-snapshots is intended for when you change a prompt. It must not silently re-run all your judge evaluations — those are deterministic enough to cache independently and can be expensive to repeat.
Cache errors: If a snapshot file is unreadable or corrupted, aoide logs a warning to stderr and re-fetches from the live API. The run is not aborted. The warning looks like:
[aoide] Failed to read snapshot cache: <reason>.
Retry Policy
aoide automatically retries requests that fail with transient errors — HTTP 429 (rate limit), 502/503/504 (service unavailable), or network-level errors (ECONNRESET, ETIMEDOUT, ECONNREFUSED, ENOTFOUND, EHOSTUNREACH, ECONNABORTED). Non-transient errors (400, 401, 404, etc.) are never retried.
The retry delay uses full-jitter exponential back-off: random(0, baseMs × 2^attempt). This spreads retries to avoid thundering-herd problems on shared rate limits.
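In code, the delay is roughly the following — a sketch of the formula above, not aoide's implementation; the first retry is counted as attempt 1:

// Full-jitter exponential back-off: a uniformly random delay in [0, baseMs * 2^attempt).
function retryDelayMs(baseMs: number, attempt: number): number {
  return Math.random() * baseMs * 2 ** attempt;
}
// With backoffMs: 100 — retry 1 waits up to 200 ms, retry 2 up to 400 ms, retry 3 up to 800 ms.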
Configure in aoide.config.ts:
const config: AoideConfig = {
retryPolicy: {
maxRetries: 3, // default: 3 — set to 0 to disable retries
backoffMs: 100, // default: 100 — base for the exponential back-off
},
};

Override from the CLI:
npx @templum/aoide --max-retries 5
npx @templum/aoide --max-retries 0 # disable retries

Retry count: maxRetries: 3 (the default) allows up to 4 total attempts — the original attempt plus 3 retries. maxRetries: 0 disables retries entirely: exactly 1 attempt is made.
Telemetry
After each run, aoide prints a cost summary:
AI Telemetry Summary:
App Tokens: 4,821 (Cost: $0.0007)
Eval Tokens: 1,203 (Cost: $0.0002)
Total Cost: $0.0009
14 cached requests — no network calls made

- App tokens — tokens used by runPrompt / runTournament
- Eval tokens — tokens used by the LLM judge and tone/semantic evaluators
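Cost estimates are simple per-token arithmetic, with prices expressed in USD per 1M tokens. An illustrative calculation with hypothetical prices and counts:

// Illustrative cost arithmetic — prices are USD per 1M tokens.
const price = { input: 2.5, output: 10 };
const usage = { promptTokens: 4_000, completionTokens: 800 };
const costUsd =
  (usage.promptTokens / 1_000_000) * price.input +
  (usage.completionTokens / 1_000_000) * price.output;
// → 0.01 + 0.008 = $0.018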
Override bundled pricing in config:
pricingOverrides: {
'openai:gpt-4o': { input: 2.5, output: 10 }, // USD per 1M tokens
},

Concurrency
| Provider type | Default concurrency |
| --- | --- |
| Remote (OpenAI, Anthropic, …) | 5 |
| Local (local:* prefix) | 1 |
Override per provider in code:
import { setProviderConcurrency } from '@templum/aoide';
setProviderConcurrency('openai', 10);

Override globally via CLI:
npx @templum/aoide --max-workers 3

Common Pitfalls
Forgetting await on async assertions
toPassLLMJudge, toBeSemanticallySimilarTo, and toHaveTone all return a Promise. Forgetting await means the assertion never executes and the test passes unconditionally. A lint rule such as @typescript-eslint/no-floating-promises catches dropped awaits at lint time.
// ❌ Wrong — the assertion is silently dropped
expect(response).toPassLLMJudge({ criteria: 'Is concise' });
// ✅ Correct
await expect(response).toPassLLMJudge({ criteria: 'Is concise' });

Calling runPrompt or runTournament outside an it() block
Both functions require an active test context. Calling them from beforeAll, afterAll, or module scope throws "getCurrentTest() called outside of a running test".
// ❌ Wrong
beforeAll(async () => {
const response = await runPrompt(target, request); // throws
});
// ✅ Correct
it('my test', async () => {
const response = await runPrompt(target, request);
});

Using .only expecting it to apply across files
.only is scoped to the file it appears in. A describe.only in fileA.promptest.ts does not suppress tests in fileB.promptest.ts. To narrow a run across files, use --grep <pattern> instead.
Troubleshooting
"Judge not configured" — Call setupJudge() in beforeAll, or set judge in aoide.config.ts.
"Provider not found: X" — Call registerProvider(new XProvider(...)) before your tests run. Typically in beforeAll.
"Provider does not support embeddings" — The provider in setupEmbedder must implement getEmbeddings(). Use OpenAIProvider or OllamaProvider.
"getCurrentTest() called outside of a running test" — runPrompt and runTournament must be called inside an it() callback.
Local model tests fail under concurrent load — Ensure the provider id starts with local: so concurrency is automatically capped at 1.
Tests are slow — Check that caching is enabled (no --no-cache). For remote providers, increase --max-workers.
Requests keep failing with 429 — The default retry policy (3 retries, 100 ms base back-off) may not be enough for aggressive rate limits. Increase with --max-retries 5 or reduce --max-workers.
Anthropic responses are cut off — The default max_tokens for the Anthropic provider is 4096. For longer responses, pass a higher value per request (runPrompt(target, { maxTokens: 8192, ... })) or set it globally on the provider: new AnthropicProvider(id, key, undefined, undefined, 8192).
aoide init not recognised — The init sub-command is case-insensitive (Init and INIT also work). If it still fails, check that you are running npx @templum/aoide init from the project root.
Reporter name has no effect — Check spelling. Supported names are terminal and json. An unknown name is ignored after logging [aoide] Unknown reporter: "...". Supported: terminal, json. If all reporter names are unknown, terminal is used as a fallback.
defaultTestTimeout is rejected at startup — The value must be a number greater than 0. defaultTestTimeout: 0 or a negative value throws a ConfigValidationError at startup.
