
@templum/aoide v0.1.1 — 24 downloads

A TypeScript testing framework for LLM prompts and AI workflows — LLM-as-judge evaluations, snapshot caching, and cost telemetry.

aoide

Aoide (Ancient Greek: Ἀοιδή, "the singing one") was one of the three original Muses in Greek mythology — the Muse of song and vocal expression. Before words are written they must first be spoken, and before a prompt reaches a model it must be tested. aoide is that test.

A TypeScript testing framework for LLM-powered applications. Write tests that send real prompts to language models and assert on their responses — including deterministic checks, JSON schema validation, and LLM-as-judge evaluations.

import { describe, it, expect, runPrompt, registerProvider, beforeAll } from '@templum/aoide';
import { OpenAIProvider } from '@templum/aoide/providers/openai';

beforeAll(() => {
  registerProvider(new OpenAIProvider('openai', process.env.OPENAI_API_KEY!));
});

describe('Customer support bot', () => {
  it('responds empathetically to a complaint', async () => {
    const response = await runPrompt(
      { provider: 'openai', model: 'gpt-4o-mini' },
      { messages: [{ role: 'user', content: 'My order is late and I am frustrated.' }] },
    );

    await expect(response).toPassLLMJudge({
      criteria: 'The response acknowledges the frustration and offers to help.',
      threshold: 0.8,
    });
  });
});

Features

  • Familiar API — describe, it, beforeAll/Each, afterAll/Each, expect
  • Built-in assertions — string, numeric, regex, JSON Schema, token/cost budgets, and .not negation
  • LLM evaluators — judge scoring, semantic similarity, tone checking, factual consistency, persona matching, topic avoidance, and structural equivalence
  • Prompt caching — dual SHA-256 snapshot cache (app + eval) so re-runs are free; --update-snapshots refreshes app prompts without wiping eval cache
  • Telemetry — per-test and global token counts and cost estimates
  • Multi-provider — OpenAI, Anthropic, Ollama, LM Studio, or any custom provider
  • Concurrency — configurable per-provider; local providers default to 1 to protect host resources
  • Retry policy — automatic exponential back-off on transient API errors (429, 502/503/504, DNS/network blips)
  • Watch mode — re-run tests on file change; new test files are discovered automatically
  • Programmatic API — run tests from Node scripts or CI pipelines without shelling out

Built with AI

Note: aoide was developed with significant assistance from Large Language Models. This library is a product of AI-assisted engineering, leveraging advanced models to help write its code, tests, and documentation.

Installation

npm install --save-dev @templum/aoide

Requirements: Node.js ≥ 22

Note: aoide is currently in active pre-release development (0.x). The public API is stable but minor breaking changes may occur before 1.0. Pin your version in package.json if you need stability across installs.
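If you do pin, an exact-version devDependency entry in package.json looks like this (0.1.1 matches the release shown above):

```json
{
  "devDependencies": {
    "@templum/aoide": "0.1.1"
  }
}
```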


Quick Start

npx @templum/aoide init

This creates aoide.config.ts and examples/basic.promptest.ts. Edit the config to add your judge target and API key, then run:

npx @templum/aoide

Configuration

Create aoide.config.ts in your project root:

import type { AoideConfig } from '@templum/aoide';

const config: AoideConfig = {
  // LLM used to evaluate judge-based assertions
  judge: {
    target: { provider: 'openai', model: 'gpt-4o-mini' },
    temperature: 0.0,
    // systemPrompt: 'You are a strict evaluator. Be concise.',
  },

  // Embedder used for toBeSemanticallySimilarTo assertions (optional)
  // embedder: {
  //   target: { provider: 'openai', model: 'text-embedding-3-small' },
  // },

  // Glob patterns for test files (default: ['**/*.promptest.ts'])
  testMatch: ['**/*.promptest.ts'],

  // Reporters: 'terminal' (default), 'json'
  reporters: ['terminal', 'json'],

  // Output path for the JSON reporter (default: 'aoide-results.json')
  // jsonReporterOutputPath: 'results/aoide.json',

  // Per-test timeout in ms (default: 30 000)
  defaultTestTimeout: 60_000,

  // Retry policy for transient API errors (optional)
  retryPolicy: {
    maxRetries: 3,    // default: 3
    backoffMs: 100,   // base back-off; actual delay uses full-jitter exponential back-off
  },

  // Override pricing for cost tracking (optional)
  // pricingOverrides: {
  //   'openai:gpt-4o': { input: 2.5, output: 10 }, // per 1M tokens in USD
  // },
};

export default config;

Config Fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| judge.target | ModelTarget | — | Model to use as judge for LLM assertions |
| judge.temperature | number | — | Judge model temperature |
| judge.systemPrompt | string | — | Optional custom system prompt for the judge |
| embedder.target | ModelTarget | — | Model to use for toBeSemanticallySimilarTo assertions |
| testMatch | string[] | ['**/*.promptest.ts'] | Glob patterns for test files |
| reporters | string[] | ['terminal'] | Active reporters (terminal, json); unknown names log a warning and fall back to terminal |
| jsonReporterOutputPath | string | 'aoide-results.json' | Output file path for the JSON reporter |
| defaultTestTimeout | number | 30000 | Per-test timeout in milliseconds (must be > 0) |
| retryPolicy.maxRetries | number | 3 | Max retry attempts on transient API errors |
| retryPolicy.backoffMs | number | 100 | Base back-off in ms (full-jitter exponential) |
| pricingOverrides | Record<string, { input: number; output: number }> | — | Override token pricing for cost estimates (USD per 1M tokens) |


Test File Format

Test files match **/*.promptest.ts by default.

import {
  describe, it, expect,
  runPrompt, runTournament,
  registerProvider, setupJudge, setupEmbedder,
  beforeAll, afterAll, beforeEach, afterEach,
} from '@templum/aoide';
import { OpenAIProvider } from '@templum/aoide/providers/openai';

beforeAll(() => {
  registerProvider(new OpenAIProvider('openai', process.env.OPENAI_API_KEY!));
});

describe('My suite', () => {
  it('test name', async () => {
    const response = await runPrompt(
      { provider: 'openai', model: 'gpt-4o-mini' },
      { messages: [{ role: 'user', content: 'Say hello.' }] },
    );
    expect(response).toContain('hello');
  });
});

// Top-level tests (no describe block) are supported:
it('top-level test', async () => {
  // ...
});

Focused and Skipped Tests

it.only('only this runs', async () => { /* ... */ });
it.skip('skip this', async () => { /* ... */ });
describe.only('only this suite', () => { /* ... */ });
describe.skip('skip this suite', () => { /* ... */ });

If any .only is present, all other tests in that file are automatically skipped. Focus is scoped per file — a .only in one file does not affect other files.

Per-Model Tests

Run the same test against multiple models concurrently:

const targets = [
  { provider: 'openai', model: 'gpt-4o-mini' },
  { provider: 'openai', model: 'gpt-4o' },
];

it.eachModel(targets)('summarises correctly', async (target) => {
  const response = await runPrompt(target, { messages: [...] });
  expect(response).toContain('summary');
});

Each model becomes a separate test named summarises correctly [openai:gpt-4o-mini], summarises correctly [openai:gpt-4o], and so on. All run concurrently within their provider's concurrency limit.

Note: Provider and model names are sanitised in the generated test name (brackets replaced, whitespace normalised). The original values are always used for dispatch. Local providers (id prefix local:) are automatically limited to 1 concurrent request. Remote providers default to 5. Use --max-workers or setProviderConcurrency() to override.


Providers

OpenAI

import { OpenAIProvider } from '@templum/aoide/providers/openai';
// also available as: import { OpenAIProvider } from '@templum/aoide';

registerProvider(new OpenAIProvider('openai', process.env.OPENAI_API_KEY!));

// Custom base URL (e.g. Azure OpenAI):
registerProvider(new OpenAIProvider('azure', process.env.AZURE_KEY!, 'https://...'));

Supports embeddings (toBeSemanticallySimilarTo).

Anthropic

import { AnthropicProvider } from '@templum/aoide';

registerProvider(new AnthropicProvider('anthropic', process.env.ANTHROPIC_API_KEY!));

// Custom API version (default: '2024-06-01'):
registerProvider(new AnthropicProvider('anthropic', process.env.ANTHROPIC_API_KEY!, 'https://api.anthropic.com/v1', '2024-06-01'));

// Custom default max_tokens (default: 4096). Anthropic requires this field in
// every request. Raise it if your tests need longer responses:
registerProvider(new AnthropicProvider('anthropic', process.env.ANTHROPIC_API_KEY!, undefined, undefined, 8192));

Note: Anthropic requires max_tokens in every API call. aoide defaults to 4096, which is suitable for most test responses. Per-request overrides take precedence: runPrompt(target, { maxTokens: 1024, ... }).

Ollama (local)

import { OllamaProvider } from '@templum/aoide/providers/ollama';
// also available as: import { OllamaProvider } from '@templum/aoide';

// Default id is 'local:ollama' — automatically runs at concurrency 1
registerProvider(new OllamaProvider());

// Custom id and URL:
registerProvider(new OllamaProvider('local:ollama', 'http://localhost:11434'));

Supports embeddings (toBeSemanticallySimilarTo).

Local provider tip: Any provider id starting with local: is automatically limited to 1 concurrent request, preventing host overload during it.eachModel or parallel tests.

LM Studio (local)

import { LMStudioProvider } from '@templum/aoide/providers/lmstudio';
// also available as: import { LMStudioProvider } from '@templum/aoide';

// Default id is 'local:lmstudio' — automatically runs at concurrency 1
registerProvider(new LMStudioProvider());

Custom Provider

import type { LLMProvider } from '@templum/aoide';

const myProvider: LLMProvider = {
  id: 'my-provider',
  async execute(model, request) {
    // ... call your API
    return { text, rawResponse, usage, metadata };
  },
  // Optional — required for toBeSemanticallySimilarTo
  async getEmbeddings(model, request) {
    return { embeddings, usage, metadata };
  },
};

registerProvider(myProvider);

Embedder Setup (required for semantic similarity)

Configure via aoide.config.ts:

const config: AoideConfig = {
  embedder: {
    target: { provider: 'openai', model: 'text-embedding-3-small' },
  },
};

Or programmatically in a beforeAll:

import { setupEmbedder } from '@templum/aoide';

setupEmbedder({ target: { provider: 'openai', model: 'text-embedding-3-small' } });

Assertions API

All assertions are available on expect(value). value may be a string, a ModelResponse (from runPrompt), a number, or any other value for the general assertions.

Synchronous

// Exact equality
expect(response).toBe('exact value');

// Null / defined / truthiness
expect(response.text).toBeDefined();
expect(noResponse).toBeUndefined();
expect(value).toBeNull();
expect(value).toBeTruthy();
expect(value).toBeFalsy();

// String containment
expect(response).toContain('substring');
expect(response).toContain('substring', { ignoreCase: true });

// Regex format
expect(response).toMatchExactFormat(/^\d{3}-\d{4}$/);

// JSON Schema
expect(response).toMatchJsonSchema({
  type: 'object',
  properties: { name: { type: 'string' } },
  required: ['name'],
});

// Numeric (useful for token counts, scores)
expect(42).toBeGreaterThan(10);
expect(42).toBeGreaterThanOrEqual(42);
expect(5).toBeLessThan(10);
expect(5).toBeLessThanOrEqual(5);

Negation (.not)

Every assertion has a .not form:

expect(response).not.toContain('error');
expect(response).not.toMatchExactFormat(/^\s*$/);
expect(value).not.toBeNull();
expect(value).not.toBeFalsy();
await expect(response).not.toPassLLMJudge({
  criteria: 'Contains harmful content',
  threshold: 0.5,
});
await expect(response).not.toBeSemanticallySimilarTo('off-topic text', 0.5);

LLM-as-Judge

Requires judge in config or setupJudge().

Important: These assertions return a Promise — always await them. A missing await silently drops the assertion and the test will always pass.

await expect(response).toPassLLMJudge({
  criteria: 'The response is concise and directly answers the question.',
  threshold: 0.75,          // 0.0–1.0, default: 0.7
  judgeOverride: { provider: 'openai', model: 'gpt-4o' },  // optional per-call override
});

The judge score and reasoning are attached to the test result and visible in the JSON report.

Tone Checking

Important: Always await tone assertions.

await expect(response).toHaveTone('empathetic');
await expect(response).toHaveTone('professional');
await expect(response).toHaveTone('urgent');
await expect(response).toHaveTone('concise', { threshold: 0.8 });
// Any freeform descriptor works:
await expect(response).toHaveTone('playful and informal');

Built-in tones: empathetic, professional, urgent. Any other string is used verbatim as the tone description.

Semantic Similarity

Requires setupEmbedder(). Always await.

await expect(response).toBeSemanticallySimilarTo(
  'The capital of France is Paris.',
  0.85,  // cosine similarity threshold, default: 0.85
);
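For reference, the comparison uses the standard cosine-similarity formula over the two embedding vectors; a minimal sketch (aoide computes this internally from the embedder output — this is illustration only):

```typescript
// Standard cosine similarity between two embedding vectors — shown for
// reference; not aoide's internal code.
function cosineSimilarity(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

console.log(cosineSimilarity([1, 0], [1, 0])); // 1 — identical direction
console.log(cosineSimilarity([1, 0], [0, 1])); // 0 — orthogonal
```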

toMatchJsonSchema — accepted input types

expect('{"name":"Alice"}').toMatchJsonSchema({ type: 'object', required: ['name'] });
expect(response).toMatchJsonSchema({ type: 'object', required: ['name'] });
expect({ name: 'Alice' }).toMatchJsonSchema({ type: 'object', required: ['name'] });

Actions

runPrompt

const response = await runPrompt(
  { provider: 'openai', model: 'gpt-4o-mini' },
  {
    system: 'You are a helpful assistant.',
    messages: [{ role: 'user', content: 'Hello' }],
    temperature: 0.7,
    maxTokens: 256,
  },
);

response.text          // string — the model's reply
response.usage         // { promptTokens, completionTokens, totalTokens }
response.metadata      // { latencyMs, providerId, model }
response.rawResponse   // raw API response body

Must be called inside an it() callback.

runTournament

Evaluates multiple models on the same prompt and returns the winner. Must be called inside an it() callback.

const result = await runTournament('summarisation quality', {
  targets: [
    { provider: 'openai', model: 'gpt-4o-mini' },
    { provider: 'anthropic', model: 'claude-haiku-4-5-20251001' },
  ],
  request: { messages: [{ role: 'user', content: 'Summarise: ...' }] },
  judgeCriteria: 'The summary is accurate, concise, and covers the key points.',
  iterations: 3,  // runs per model, default: 1
});

result.winner   // ModelTarget — highest average score
result.scores   // Array<{ target, averageScore, responses, reasoning }>

Note: When iterations > 1, each iteration bypasses the snapshot cache to ensure independent results. With iterations: 1 (the default), caching is used normally.


Programmatic API

Run tests from a Node.js script or CI pipeline without shelling out to the CLI:

import { runTests } from '@templum/aoide';

const result = await runTests({
  // Optional inline config — skips file discovery
  config: {
    judge: { target: { provider: 'openai', model: 'gpt-4o-mini' } },
    reporters: ['terminal'],
  },
  // Or point to a config file:
  // configPath: './my-config.ts',

  // Explicit test files (overrides testMatch globs):
  // testFiles: ['tests/summarise.promptest.ts'],

  // Only run tests whose name matches this regex:
  grep: 'summarise',

  noCache: false,
  updateSnapshots: false,
});

console.log(`Passed: ${result.passed}, Failed: ${result.failed}`);
console.log(`Total cost: $${(result.telemetry.appCost + result.telemetry.evalCost).toFixed(4)}`);

if (!result.ok) process.exit(1);

RunTestsOptions

| Field | Type | Description |
| --- | --- | --- |
| config | PromptestConfig | Inline config; takes precedence over configPath |
| configPath | string | Path to an aoide.config.ts file |
| testFiles | string[] | Explicit file list; overrides testMatch globs |
| grep | string | Regex pattern — only matching tests run |
| noCache | boolean | Bypass snapshot cache |
| updateSnapshots | boolean | Re-fetch all prompts and refresh cache |

TestRunResult

| Field | Type | Description |
| --- | --- | --- |
| passed | number | Tests that passed |
| failed | number | Tests that failed |
| skipped | number | Tests that were skipped |
| durationMs | number | Total wall-clock time in ms |
| telemetry | TelemetrySummary | Aggregated token usage and cost |
| ok | boolean | true if no tests or afterAll hooks failed |


CLI Reference

aoide [options]
aoide init [--force]

Commands

| Command | Description |
| --- | --- |
| aoide | Run all test files matching testMatch |
| aoide init | Scaffold aoide.config.ts and an example test file |
| aoide init --force | Overwrite existing config and example files |

Options

| Flag | Short | Description |
| --- | --- | --- |
| --help | -h | Show help |
| --watch | -w | Re-run tests on file change; new test files matching testMatch are picked up automatically |
| --update-snapshots | -u | Refresh snapshot cache |
| --no-cache | | Disable caching for this run |
| --config <path> | -c | Config file (default: aoide.config.ts) |
| --test-match <glob> | | Override test file glob |
| --grep <pattern> | | Only run tests whose name matches the regex |
| --reporter <name> | | Reporter: terminal, json (repeatable) |
| --json-output <path> | | Output path for JSON reporter (default: aoide-results.json) |
| --max-workers <n> | | Max concurrent requests (remote providers only) |
| --timeout <ms> | | Override default test timeout (default: 30000) |
| --max-retries <n> | | Max retries on transient API errors (default: 3) |


Snapshot Caching

aoide maintains two separate caches, both stored inside __prompt_snapshots__/:

| Cache | Directory | What is stored |
| --- | --- | --- |
| App cache | __prompt_snapshots__/ | runPrompt / runTournament responses |
| Eval cache | __prompt_snapshots__/eval/ | Judge and embedding evaluator responses |

Both caches are keyed by a SHA-256 hash of the inputs (provider, model, messages, system prompt, temperature). Add __prompt_snapshots__/ to .gitignore, or commit it to make CI runs free.
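The keying idea can be sketched in a few lines. This is a hypothetical illustration, not aoide's internal code — it only shows how hashing the documented inputs (provider, model, messages, system prompt, temperature) yields a stable, content-addressed key:

```typescript
import { createHash } from 'node:crypto';

// Hypothetical sketch of content-addressed snapshot keying; the field
// names mirror the inputs listed above, but the real derivation is
// internal to aoide.
interface SnapshotInputs {
  provider: string;
  model: string;
  system?: string;
  messages: Array<{ role: string; content: string }>;
  temperature?: number;
}

function snapshotKey(inputs: SnapshotInputs): string {
  // Serialising with a fixed field order keeps the hash deterministic.
  const canonical = JSON.stringify({
    provider: inputs.provider,
    model: inputs.model,
    system: inputs.system ?? null,
    messages: inputs.messages,
    temperature: inputs.temperature ?? null,
  });
  return createHash('sha256').update(canonical).digest('hex');
}

const key = snapshotKey({
  provider: 'openai',
  model: 'gpt-4o-mini',
  messages: [{ role: 'user', content: 'Hello' }],
});
console.log(key.length); // 64 hex characters
```

Because the key covers every request input, changing a prompt (or the model, or the temperature) naturally produces a cache miss and a fresh API call.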

npx @templum/aoide --update-snapshots   # re-fetch all app prompts and refresh their cache
npx @templum/aoide --no-cache           # bypass both caches entirely

| Strategy | Command | Effect |
| --- | --- | --- |
| Use cache (default) | npx @templum/aoide | Both caches read — no API spend |
| Refresh app prompts | npx @templum/aoide --update-snapshots | Re-fetches app prompts only; eval cache is preserved |
| Skip all caches | npx @templum/aoide --no-cache | Always hits live APIs (e.g. CI with live keys) |

Why two caches? --update-snapshots is intended for when you change a prompt. It must not silently re-run all your judge evaluations — those are deterministic enough to cache independently and can be expensive to repeat.

Cache errors: If a snapshot file is unreadable or corrupted, aoide logs a warning to stderr and re-fetches from the live API. The run is not aborted. The warning looks like: [aoide] Failed to read snapshot cache: <reason>.


Retry Policy

aoide automatically retries requests that fail with transient errors — HTTP 429 (rate limit), 502/503/504 (service unavailable), or network-level errors (ECONNRESET, ETIMEDOUT, ECONNREFUSED, ENOTFOUND, EHOSTUNREACH, ECONNABORTED). Non-transient errors (400, 401, 404, etc.) are never retried.

The retry delay uses full-jitter exponential back-off: random(0, baseMs × 2^attempt). This spreads retries to avoid thundering-herd problems on shared rate limits.
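The formula above can be sketched as follows (an illustration of full-jitter back-off, not aoide's internal implementation):

```typescript
// Full-jitter exponential back-off: delay = random(0, baseMs * 2^attempt).
function fullJitterDelay(baseMs: number, attempt: number): number {
  const cap = baseMs * 2 ** attempt; // exponential ceiling for this attempt
  return Math.random() * cap;        // uniform in [0, cap) — the "full jitter"
}

// With the default backoffMs of 100, the ceilings grow 100 → 200 → 400 ms:
for (let attempt = 0; attempt < 3; attempt++) {
  console.log(`attempt ${attempt}: up to ${100 * 2 ** attempt} ms`);
}
```

Sampling uniformly from the whole interval (rather than adding a small jitter on top of a fixed delay) is what desynchronises clients that all hit the same rate limit at once.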

Configure in aoide.config.ts:

const config: AoideConfig = {
  retryPolicy: {
    maxRetries: 3,   // default: 3 — set to 0 to disable retries
    backoffMs: 100,  // default: 100 — base for the exponential back-off
  },
};

Override from the CLI:

npx @templum/aoide --max-retries 5
npx @templum/aoide --max-retries 0   # disable retries

Retry count: maxRetries: 3 (default) allows up to 4 total attempts — the original attempt plus 3 retries. maxRetries: 0 disables retries entirely (exactly one attempt).


Telemetry

After each run, aoide prints a cost summary:

AI Telemetry Summary:
  App Tokens:  4,821   (Cost: $0.0007)
  Eval Tokens: 1,203   (Cost: $0.0002)
  Total Cost:  $0.0009
  14 cached requests — no network calls made
  • App tokens — tokens used by runPrompt / runTournament
  • Eval tokens — tokens used by the LLM judge and tone/semantic evaluators

Override bundled pricing in config:

pricingOverrides: {
  'openai:gpt-4o': { input: 2.5, output: 10 },  // USD per 1M tokens
},
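Cost estimation from token counts is the usual per-million-token arithmetic; an illustrative sketch using the example prices above (the function is hypothetical, shown only to make the convention concrete):

```typescript
// Illustrative cost arithmetic matching the pricingOverrides convention:
// prices are USD per 1M tokens, split by input (prompt) and output
// (completion) tokens. Not aoide's internal code.
function estimateCost(
  promptTokens: number,
  completionTokens: number,
  pricing: { input: number; output: number },
): number {
  return (promptTokens * pricing.input + completionTokens * pricing.output) / 1_000_000;
}

// With $2.5 in / $10 out per 1M tokens:
console.log(estimateCost(1000, 500, { input: 2.5, output: 10 })); // 0.0075
```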

Concurrency

| Provider type | Default concurrency |
| --- | --- |
| Remote (OpenAI, Anthropic, …) | 5 |
| Local (local:* prefix) | 1 |

Override per provider in code:

import { setProviderConcurrency } from '@templum/aoide';
setProviderConcurrency('openai', 10);

Override globally via CLI:

npx @templum/aoide --max-workers 3

Common Pitfalls

Forgetting await on async assertions

toPassLLMJudge, toBeSemanticallySimilarTo, and toHaveTone all return a Promise. Forgetting await means the assertion never executes and the test passes unconditionally.

// ❌ Wrong — the assertion is silently dropped
expect(response).toPassLLMJudge({ criteria: 'Is concise' });

// ✅ Correct
await expect(response).toPassLLMJudge({ criteria: 'Is concise' });

Calling runPrompt or runTournament outside an it() block

Both functions require an active test context. Calling them from beforeAll, afterAll, or module scope throws "getCurrentTest() called outside of a running test".

// ❌ Wrong
beforeAll(async () => {
  const response = await runPrompt(target, request); // throws
});

// ✅ Correct
it('my test', async () => {
  const response = await runPrompt(target, request);
});

Using .only expecting it to apply across files

.only is scoped to the file it appears in. A describe.only in fileA.promptest.ts does not suppress tests in fileB.promptest.ts.


Troubleshooting

"Judge not configured" — Call setupJudge() in beforeAll, or set judge in aoide.config.ts.

"Provider not found: X" — Call registerProvider(new XProvider(...)) before your tests run. Typically in beforeAll.

"Provider does not support embeddings" — The provider in setupEmbedder must implement getEmbeddings(). Use OpenAIProvider or OllamaProvider.

"getCurrentTest() called outside of a running test" — runPrompt and runTournament must be called inside an it() callback.

Local model tests fail under concurrent load — Ensure the provider id starts with local: so concurrency is automatically capped at 1.

Tests are slow — Check that caching is enabled (no --no-cache). For remote providers, increase --max-workers.

Requests keep failing with 429 — The default retry policy (3 retries, 100 ms base back-off) may not be enough for aggressive rate limits. Increase with --max-retries 5 or reduce --max-workers.

Anthropic responses are cut off — The default max_tokens for the Anthropic provider is 4096. For longer responses, pass a higher value per request (runPrompt(target, { maxTokens: 8192, ... })) or set it globally on the provider: new AnthropicProvider(id, key, undefined, undefined, 8192).

aoide init not recognised — The init sub-command is case-insensitive (Init, INIT all work). If it still fails, check that you are running npx @templum/aoide init from the project root.

Reporter name has no effect — Check spelling. Supported names are terminal and json. An unknown name logs a warning ([aoide] Unknown reporter: "...". Supported: terminal, json) and is otherwise ignored. If no reporter name is recognised, terminal is used as a fallback.

defaultTestTimeout is rejected at startup — The value must be a positive number greater than 0. defaultTestTimeout: 0 or a negative value throws a ConfigValidationError at startup.