
@docshield/didactic

v0.1.4


Eval/optimization framework for LLM workflows


Didactic


Eval your LLM workflows by comparing actual outputs against expected results with smart comparators that handle real-world variations. Optimize prompts automatically through iterative self-improvement—the system analyzes its own mistakes and rewrites prompts to boost accuracy.

Use it to test extraction- and classification-based AI workflows, monitor regressions, and improve performance.

Installation

npm install @docshield/didactic

Requires Node.js >= 18.0.0

Quick Start

import {
  didactic,
  within,
  oneOf,
  exact,
  unordered,
  numeric,
} from '@docshield/didactic';

const result = await didactic.eval({
  executor: didactic.endpoint('https://api.example.com/extract'),
  comparators: {
    premium: within({ tolerance: 0.05 }),
    policyType: oneOf(['claims-made', 'occurrence']),
    carrier: exact,
    // Nested comparators for arrays
    coverages: unordered({
      type: exact,
      limit: numeric,
    }),
  },
  testCases: [
    {
      input: { emailId: 'email-123' },
      expected: {
        premium: 12500,
        policyType: 'claims-made',
        carrier: 'Acme Insurance',
        coverages: [
          { type: 'liability', limit: 1000000 },
          { type: 'property', limit: 500000 },
        ],
      },
    },
  ],
});

console.log(
  `${result.passed}/${result.total} passed (${result.accuracy * 100}% field accuracy)`
);

Example

Eval - Invoice Parser

Real-world invoice extraction using Anthropic's Claude with structured outputs. Tests field accuracy across vendor names, line items, and payment terms.

# Set your API key
export ANTHROPIC_API_KEY=your_key_here

# Run the example
npm run example:eval:invoice-parser

Shows how to use numeric, name, exact, unordered(), and llmCompare comparators for financial data extraction with nested comparator structures.

Optimizer - Expense Categorizer

Iteratively feeds eval failures back into an optimization loop so the prompt improves itself. Runs evals until it reaches the target success rate or runs out of budget.

# Set your API key
export ANTHROPIC_API_KEY=your_key_here

# Run the example
npm run example:optimizer:expense-categorizer

Shows how to use Didactic to self-heal failures and improve the prompt's performance across the test set.


Core Concepts

Didactic has three core components:

  1. Executors — Abstraction for running your LLM workflow (local function or HTTP endpoint)
  2. Comparators — Nested structure matching your data shape, with per-field comparison logic and unordered() for arrays
  3. Optimization — Iterative prompt improvement loop to hit a target success rate

How they work together: Your executor runs each test case's input through your LLM workflow, returning output that matches your test case's expected output shape. Comparators then evaluate each field of the output against expected values, using nested structures that mirror your data shape. For arrays, use unordered() to match by similarity rather than index position.

In optimization mode, these results feed into an LLM that analyzes failures and generates improved system prompts—repeating until your target success rate or iteration/cost limit is reached.
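Conceptually, the loop looks roughly like this (an illustrative sketch; evalOnce and improvePrompt are hypothetical stand-ins, not the library's internals):

```typescript
// Illustrative sketch of the optimization loop. evalOnce and improvePrompt
// are hypothetical stand-ins, not part of the library's API.
type EvalOutcome = { successRate: number; cost: number; failures: string[] };

async function optimizeLoop(
  initialPrompt: string,
  evalOnce: (prompt: string) => Promise<EvalOutcome>,
  improvePrompt: (prompt: string, failures: string[]) => Promise<string>,
  targetSuccessRate: number,
  maxIterations = 5,
  maxCost = Infinity,
): Promise<{ success: boolean; finalPrompt: string; totalCost: number }> {
  let prompt = initialPrompt;
  let totalCost = 0;
  for (let i = 0; i < maxIterations; i++) {
    // Run the full eval with the current system prompt
    const outcome = await evalOnce(prompt);
    totalCost += outcome.cost;
    if (outcome.successRate >= targetSuccessRate) {
      return { success: true, finalPrompt: prompt, totalCost };
    }
    if (totalCost >= maxCost) break; // budget exhausted
    // An LLM analyzes the failures and proposes an improved prompt
    prompt = await improvePrompt(prompt, outcome.failures);
  }
  return { success: false, finalPrompt: prompt, totalCost };
}
```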

Eval Flow

Optimize Flow


API

didactic.eval(config)

The main entry point. Runs your executor over test cases and reports field-level pass/fail results. When optimize is provided, it enters optimization mode and iteratively improves the system prompt.

const result = await didactic.eval(config);

EvalConfig

| Property | Type | Kind | Required | Default | Description |
| --- | --- | --- | --- | --- | --- |
| executor | Executor<TInput, TOutput> | Object | Yes | — | Function that executes your LLM workflow. Receives input and optional system prompt, returns structured output. |
| testCases | TestCase<TInput, TOutput>[] | Array | Yes | — | Array of { input, expected } pairs. Each test case runs through the executor and compares output to expected. |
| comparators | ComparatorsConfig | Object/Function | No | exact | Nested comparator structure matching your data shape. Can be a single comparator function (e.g., exact), or a nested object with per-field comparators. Use the unordered() wrapper for arrays that should match by similarity rather than index. |
| comparatorOverride | Comparator<TOutput> | Function | No | — | Custom whole-object comparison function. Use when you need complete control over comparison logic and want to bypass field-level matching. |
| llmConfig | LLMConfig | Object | No | — | Default LLM configuration for LLM-based comparators (e.g., llmCompare). Provides apiKey and optional provider so you don't repeat them in each comparator call. |
| systemPrompt | string | Primitive | No | — | System prompt passed to the executor. Required if using optimization. |
| perTestThreshold | number | Primitive | No | 1.0 | Minimum field pass rate for a test case to pass (0.0–1.0). At the default 1.0, all fields must pass. Set to 0.8 to pass if 80% of fields match. |
| rateLimitBatch | number | Primitive | No | — | Number of test cases to run concurrently. Use with rateLimitPause for rate-limited APIs. |
| rateLimitPause | number | Primitive | No | — | Seconds to wait between batches. Pairs with rateLimitBatch. |
| optimize | OptimizeConfig | Object | No | — | Inline optimization config. When provided, triggers optimization mode instead of a single eval. |
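For example, perTestThreshold gates a test case on its field pass rate. A minimal sketch of that rule (an illustrative re-implementation, not the library's code):

```typescript
// Illustrative sketch: how a perTestThreshold could decide test-case pass/fail.
// Not the library's implementation.
function testCasePasses(
  fieldResults: Record<string, boolean>, // field path -> did the comparator pass?
  perTestThreshold = 1.0,
): boolean {
  const outcomes = Object.values(fieldResults);
  const passRate = outcomes.filter(Boolean).length / outcomes.length;
  return passRate >= perTestThreshold;
}
```

At the default threshold of 1.0, a single failing field fails the whole test case; at 0.8, up to 20% of fields may miss.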


didactic.optimize(evalConfig, optimizeConfig)

Run optimization as a separate call instead of inline.

const result = await didactic.optimize(evalConfig, optimizeConfig);

// Equivalent inline form: pass optimize inside the eval config
const config = {
  ...evalConfig,
  optimize: {
    systemPrompt: 'Extract information from an invoice.',
    targetSuccessRate: 0.9,
    apiKey: 'your-llm-provider-api-key',
    provider: LLMProviders.openai_gpt5,
    maxIterations: 10,
    maxCost: 10,
    storeLogs: true,
    thinking: true,
  },
};

OptimizeConfig

| Property | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| systemPrompt | string | Yes | — | Initial system prompt to optimize. This is the starting point that the optimizer will iteratively improve. |
| targetSuccessRate | number | Yes | — | Target success rate to achieve (0.0–1.0). Optimization stops when this rate is reached. |
| apiKey | string | Yes | — | API key for the LLM provider used by the optimizer (not your workflow's LLM). |
| provider | LLMProviders | Yes | — | LLM provider the optimizer uses to analyze failures and generate improved prompts. |
| maxIterations | number | No | 5 | Maximum optimization iterations before stopping, even if target not reached. |
| maxCost | number | No | — | Maximum cost budget in dollars. Optimization stops if cumulative cost exceeds this. |
| storeLogs | boolean \| string | No | — | Save optimization logs. true uses the default path (./didactic-logs/optimize_<timestamp>/summary.md), or provide a custom summary path. |
| thinking | boolean | No | — | Enable extended thinking mode for deeper analysis (provider must support it). |
| patchSystemPrompt | string | No | DEFAULT_PATCH_SYSTEM_PROMPT | Custom system prompt for patch generation. Completely replaces the default prompt that analyzes failures and suggests improvements. |
| mergeSystemPrompt | string | No | DEFAULT_MERGE_SYSTEM_PROMPT | Custom system prompt for merging patches. Completely replaces the default prompt that combines multiple patches into a coherent system prompt. |


Executors

Executors abstract your LLM workflow from the evaluation harness. Whether your workflow runs locally, calls a remote API, or orchestrates Temporal activities, executors provide a consistent interface: take input plus an optional system prompt, and return output in the expected shape.

This separation enables:

  • Swap execution strategies — Switch between local/remote without changing tests
  • Dynamic prompt injection — System prompts flow through for optimization
  • Cost tracking — Aggregate execution costs across test runs

didactic provides two built-in executors:

  • endpoint for calling a remote API
  • fn for calling a local function

For each of these, you will want to provide a mapResponse function to transform the raw response into the output shape you want compared against expected. You will also want to provide a mapCost function to extract the execution cost from the response. You may want to provide a mapAdditionalContext function to extract metadata from the response for debugging.

Note: If you do not provide a mapResponse function, the executor will assume the raw response is the output you want to compare against expected.

endpoint(url, config?)

Create an executor that calls an HTTP endpoint. The executor sends input + systemPrompt as the request body and expects structured JSON back.

import { endpoint } from '@docshield/didactic';

const executor = endpoint('https://api.example.com/workflow', {
  headers: { Authorization: 'Bearer token' },
  timeout: 60000,
  mapResponse: (response) => response.data.result,
  mapCost: (response) => response.cost,
  mapAdditionalContext: (response) => response.metadata,
});

EndpointConfig

| Property | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| method | 'POST' \| 'GET' | No | 'POST' | HTTP method for the request. |
| headers | Record<string, string> | No | {} | Headers to include (auth tokens, content-type overrides, etc). |
| mapResponse | (response: any) => TOutput | No | — | Transform the raw response to your expected output shape. Use when your API wraps results. |
| mapAdditionalContext | (response: any) => unknown | No | — | Extract metadata (logs, debug info) from response for inspection. |
| mapCost | (response: any) => number | No | — | Extract execution cost from response (e.g., token counts in headers). |
| timeout | number | No | 30000 | Request timeout in milliseconds. |


fn(config)

Create an executor from a local async function. Use this to write a custom executor for your LLM workflow.

import { fn } from '@docshield/didactic';

const executor = fn({
  fn: async (input, systemPrompt) => {
    return await myLLMCall(input, systemPrompt);
  },
  mapResponse: (result) => result.output,
  mapCost: (result) =>
    result.usage.input_tokens * 0.000003 +
    result.usage.output_tokens * 0.000015,
  mapAdditionalContext: (result) => ({
    model: result.model,
    finishReason: result.stop_reason,
  }),
});

FnConfig

| Property | Type | Required | Default | Description |
| --- | --- | --- | --- | --- |
| fn | (input: TInput, systemPrompt?: string) => Promise<TRaw> | Yes | — | Async function that executes your workflow. Receives test input and optional system prompt. |
| mapResponse | (result: TRaw) => TOutput | No | — | Transform the raw result from fn into the expected output shape to compare. Without this, the raw result is used directly. |
| mapAdditionalContext | (result: TRaw) => unknown | No | — | Map additional context about the run to pass to the optimizer prompt. |
| mapCost | (result: TRaw) => number | No | — | Extract cost from the result (if your function tracks it). Used to track the total cost of the runs. |


Response Mapping

Executors support optional mapping functions to extract and transform data from responses:

mapResponse

Transform the raw response into the output shape you want compared against expected.

// For endpoint: API returns { data: { result: {...} }, metadata: {...} }
const executor = endpoint('https://api.example.com/extract', {
  mapResponse: (response) => response.data.result,
});

// For fn: Workflow returns full response, we only want specific fields
const executor = fn({
  fn: async (input, systemPrompt) => {
    return await startWorkflow({ ... });  // Returns { documentType, cost, confidence, ... }
  },
  mapResponse: (result) => ({ documentType: result.documentType }),
  mapCost: (result) => result.cost,
  mapAdditionalContext: (result) => ({ confidence: result.confidence }),
});

Without mapResponse:

  • endpoint: uses the raw JSON response as output
  • fn: uses the function's return value directly as output

mapAdditionalContext

Extract additional context from the response to be passed to the optimizer prompt. Use this to include information about the run that helps the optimizer understand the failure and generate a better prompt.

// For endpoint: receives the raw API response
const executor = endpoint('https://api.example.com/extract', {
  mapAdditionalContext: (response) => ({
    fileNames: response.fileNames,
    parsedFiles: response.parsedFiles,
  }),
});

// For fn: receives the function's return value
const executor = fn({
  fn: async (input, systemPrompt) => {
    const result = await myLLMCall(input, systemPrompt);
    return result;
  },
  mapAdditionalContext: (result) => ({
    tokensUsed: result.usage?.total_tokens,
    finishReason: result.finish_reason,
  }),
});

mapCost

Extract execution cost from responses for budget tracking. Returns a number representing cost (typically in dollars). Aggregated in EvalResult.cost and OptimizeResult.totalCost.

// For endpoint: extract from response body or calculate from token counts
const executor = endpoint('https://api.example.com/extract', {
  mapCost: (response) => {
    const tokens = response.usage?.total_tokens ?? 0;
    return tokens * 0.00001;  // assuming $0.01 per 1000 tokens
  },
});

// For fn: calculate from the result
const executor = fn({
  fn: async (input, systemPrompt) => {
    const result = await anthropic.messages.create({ ... });
    return result;
  },
  mapCost: (result) => {
    const inputCost = result.usage.input_tokens * (3 / 1_000_000);   // Sonnet input
    const outputCost = result.usage.output_tokens * (15 / 1_000_000); // Sonnet output
    return inputCost + outputCost;
  },
});

Comparators

Comparators bridge the gap between messy LLM output and semantic correctness. Rather than requiring exact string matches, comparators handle real-world data variations—currency formatting, date formats, name suffixes, numeric tolerance—while maintaining semantic accuracy.

Nested structure: Comparators mirror your data shape. Use objects to define per-field comparators, and unordered() to wrap arrays that should match by similarity rather than index position.

Each comparator returns a passed boolean and a similarity score (0.0–1.0). The pass/fail determines test results, while similarity enables Hungarian matching for unordered() arrays.
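A custom comparator honoring this contract might look like the following sketch (a within-style percentage check written from scratch for illustration; it is not the library's implementation):

```typescript
// Illustrative comparator with the { passed, similarity } contract.
// A percentage-tolerance check similar in spirit to within();
// not the library's code.
type ComparatorResult = { passed: boolean; similarity: number };

function withinPercent(tolerance: number) {
  return (expected: number, actual: number): ComparatorResult => {
    if (expected === 0) {
      const passed = actual === 0;
      return { passed, similarity: passed ? 1 : 0 };
    }
    const relativeError = Math.abs(actual - expected) / Math.abs(expected);
    return {
      passed: relativeError <= tolerance,
      // Similarity decays with relative error; useful for unordered() matching
      similarity: Math.max(0, 1 - relativeError),
    };
  };
}
```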

comparators vs comparatorOverride

Use comparators for standard comparison. It accepts:

1. A single comparator function — Applied uniformly across the output:

// Clean syntax for primitives or arrays
const result = await didactic.eval({
  executor: myNumberExtractor,
  comparators: exact, // Single comparator for root-level output
  testCases: [
    { input: 'twenty-three', expected: 23 },
    { input: 'one hundred', expected: 100 },
  ],
});

// For unordered arrays, use the unordered() wrapper
const result = await didactic.eval({
  executor: myListExtractor,
  comparators: unordered(exact), // Match by similarity, not index
  testCases: [{ input: 'numbers', expected: [1, 2, 3, 4] }],
});

2. A nested object structure — Mirrors your data shape with per-field comparators:

const result = await didactic.eval({
  executor: myExecutor,
  comparators: {
    premium: within({ tolerance: 0.05 }), // 5% tolerance for numbers
    carrier: exact, // Exact string match
    effectiveDate: date, // Flexible date parsing
    // Use unordered() for arrays that can be in any order
    lineItems: unordered({
      description: name,
      amount: numeric,
    }),
  },
  testCases: [
    {
      input: { emailId: 'email-123' },
      expected: {
        premium: 12500,
        carrier: 'Acme Insurance',
        effectiveDate: '2024-01-15',
        lineItems: [
          { description: 'Service Fee', amount: 100 },
          { description: 'Tax', amount: 25 },
        ],
      },
    },
  ],
});

3. Optional (defaults to exact) — If omitted, uses exact for entire output:

// No comparators needed for simple exact matching
const result = await didactic.eval({
  executor: myExecutor,
  testCases: [{ input: 'hello', expected: 'hello' }],
});
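The idea behind similarity-based array matching can be sketched with a simplified greedy pairing (the library uses the Hungarian algorithm for optimal assignment; this stand-alone version only illustrates matching by similarity instead of index):

```typescript
// Simplified sketch of similarity-based array matching (greedy best-match).
// The library uses the Hungarian algorithm for optimal assignment; this
// greedy version only illustrates the idea, not the actual implementation.
function greedyMatch<T>(
  expected: T[],
  actual: T[],
  similarity: (a: T, b: T) => number,
): Array<[number, number]> {
  const pairs: Array<[number, number]> = [];
  const usedActual = new Set<number>();
  expected.forEach((e, i) => {
    let best = -1;
    let bestScore = -1;
    actual.forEach((a, j) => {
      if (usedActual.has(j)) return; // each actual item matches at most once
      const s = similarity(e, a);
      if (s > bestScore) {
        bestScore = s;
        best = j;
      }
    });
    if (best >= 0) {
      usedActual.add(best);
      pairs.push([i, best]);
    }
  });
  return pairs;
}
```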

Use comparatorOverride when you need:

  • Complete control over comparison logic
  • Custom cross-field validation
  • Whole-object semantic comparison that doesn't map to individual fields

// Custom whole-object comparison
const result = await didactic.eval({
  executor: myExecutor,
  comparatorOverride: (expected, actual) => {
    // Custom logic that considers multiple fields together
    const statusMatch = expected.status === actual.status;
    const idMatch = expected.id === actual.id;
    const scoreClose = Math.abs(expected.score - actual.score) < 10;
    return { passed: statusMatch && idMatch && scoreClose };
  },
  testCases: [...],
});

Built-in Comparators

| Comparator | Usage | Description |
| --- | --- | --- |
| exact | exact | Deep equality with cycle detection. Default when no comparator specified. |
| within | within({ tolerance, mode? }) | Numeric tolerance. mode: 'percentage' (default) or 'absolute'. |
| oneOf | oneOf(allowedValues) | Enum validation. Passes if actual equals expected AND both are in the allowed set. |
| contains | contains(substring) | String contains check. Passes if actual includes the substring. |
| presence | presence | Existence check. Passes if expected is absent, or if actual has any value when expected does. |
| numeric | numeric | Numeric comparison after stripping currency symbols, commas, accounting notation. |
| numeric.nullable | numeric.nullable | Same as numeric, but treats null/undefined/empty as 0. |
| date | date | Date comparison after normalizing formats (ISO, US MM/DD, EU DD/MM, written). |
| name | name | Name comparison with case normalization, suffix removal (Inc, LLC), fuzzy matching. |
| unordered | unordered(comparator) or unordered({ fields }) | Wrapper for arrays that should match by similarity (Hungarian algorithm) rather than index. Pass a comparator for primitives or a nested config for objects. |
| llmCompare | llmCompare({ systemPrompt?, apiKey?, provider? }) | LLM-based semantic comparison. Uses llmConfig from the eval config if apiKey not provided. Returns rationale and tracks cost. |
| custom | custom({ compare }) | User-defined logic. compare(expected, actual, context?) => boolean. Context provides access to parent objects for cross-field logic. |
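To illustrate the kind of normalization a comparator like numeric performs, here is a sketch (the exact stripping rules below are assumptions for illustration, not the library's implementation):

```typescript
// Illustrative numeric normalization: strip currency symbols, commas,
// and accounting-style parentheses before comparing.
// The library's actual rules may differ.
function normalizeNumeric(value: string | number): number {
  if (typeof value === 'number') return value;
  let s = value.trim();
  // Accounting notation: (1,234.50) means -1234.50
  const negative = s.startsWith('(') && s.endsWith(')');
  if (negative) s = s.slice(1, -1);
  s = s.replace(/[$€£,\s]/g, ''); // drop currency symbols, commas, spaces
  const n = parseFloat(s);
  return negative ? -n : n;
}

function numericEquals(expected: string | number, actual: string | number): boolean {
  return normalizeNumeric(expected) === normalizeNumeric(actual);
}
```

This is why an expected value of 12500 can match an actual value like "$12,500" without a brittle string comparison.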

Examples

import {
  didactic,
  within,
  oneOf,
  exact,
  contains,
  presence,
  numeric,
  date,
  name,
  unordered,
  llmCompare,
  custom,
  LLMProviders,
} from '@docshield/didactic';

const result = await didactic.eval({
  executor: myInvoiceParser,
  testCases: [...],
  // LLM config for all llmCompare calls (no need to repeat apiKey)
  llmConfig: {
    apiKey: process.env.ANTHROPIC_API_KEY,
    provider: LLMProviders.anthropic_claude_haiku,
  },
  comparators: {
    premium: within({ tolerance: 0.05 }), // 5% tolerance
    deductible: within({ tolerance: 100, mode: 'absolute' }), // $100 tolerance
    policyType: oneOf(['claims-made', 'occurrence', 'entity']),
    carrier: exact,
    notes: contains('approved'),
    entityName: name,
    effectiveDate: date,
    amount: numeric,
    optionalField: presence,

    // Unordered array of objects with nested comparators
    lineItems: unordered({
      description: llmCompare({
        // Uses llmConfig.apiKey from above!
        systemPrompt: 'Compare line item descriptions semantically.',
      }),
      quantity: exact,
      price: numeric,
    }),

    // LLM-based comparison for flexible semantic matching
    companyName: llmCompare({
      systemPrompt:
        'Compare company names considering abbreviations and legal suffixes.',
    }),

    customField: custom({
      compare: (expected, actual, context) => {
        // Access sibling fields via context.actualParent
        return actual.toLowerCase() === expected.toLowerCase();
      },
    }),
  },
});

LLMProviders

Supported LLM providers for the optimizer:

import { LLMProviders } from '@docshield/didactic';

| Value | Description |
| --- | --- |
| LLMProviders.anthropic_claude_opus | Claude Opus 4.5 — Most capable, highest cost |
| LLMProviders.anthropic_claude_sonnet | Claude Sonnet 4.5 — Balanced performance/cost |
| LLMProviders.anthropic_claude_haiku | Claude Haiku 4.5 — Fastest, lowest cost |
| LLMProviders.openai_gpt5 | GPT-5.2 — OpenAI flagship |
| LLMProviders.openai_gpt5_mini | GPT-5 Mini — OpenAI lightweight |


Output Types

EvalResult

Returned by didactic.eval() when no optimization is configured.

| Property | Type | Description |
| --- | --- | --- |
| systemPrompt | string \| undefined | System prompt that was used for this eval run. |
| testCases | TestCaseResult[] | Detailed results for each test case. Inspect for field-level failure details. |
| passed | number | Count of test cases that passed (met perTestThreshold). |
| total | number | Total number of test cases run. |
| successRate | number | Pass rate (0.0–1.0). passed / total. |
| correctFields | number | Total correct fields across all test cases. |
| totalFields | number | Total fields evaluated across all test cases. |
| accuracy | number | Field-level accuracy (0.0–1.0). correctFields / totalFields. |
| cost | number | Total execution cost aggregated from executor results. |
| comparatorCost | number | Total cost from LLM-based comparators (e.g., llmCompare). |

TestCaseResult

Per-test-case detail, accessible via EvalResult.testCases.

| Property | Type | Description |
| --- | --- | --- |
| input | TInput | The input that was passed to the executor. |
| expected | TOutput | The expected output from the test case. |
| actual | TOutput \| undefined | Actual output returned by executor. Undefined if execution failed. |
| passed | boolean | Whether this test case passed (met perTestThreshold). |
| fields | Record<string, FieldResult> | Per-field comparison results. Key is field path (e.g., "address.city"). |
| passedFields | number | Count of fields that passed comparison. |
| totalFields | number | Total fields compared. |
| passRate | number | Field pass rate for this test case (0.0–1.0). |
| cost | number \| undefined | Execution cost for this test case, if reported by executor. |
| comparatorCost | number \| undefined | Total cost from LLM-based comparators in this test case. |
| additionalContext | unknown \| undefined | Extra context extracted by executor (logs, debug info). |
| error | string \| undefined | Error message if executor threw an exception. |

OptimizeResult

Returned by didactic.optimize() or didactic.eval() with optimization configured.

| Property | Type | Description |
| --- | --- | --- |
| success | boolean | Whether the target success rate was achieved. |
| finalPrompt | string | The final optimized system prompt. Use this in production. |
| iterations | IterationResult[] | Results from each optimization iteration. Inspect to see how the prompt evolved. |
| totalCost | number | Total cost across all iterations (optimizer + executor costs). |
| logFolder | string \| undefined | Folder path where optimization logs were written (only when storeLogs is enabled). |

IterationResult

Per-iteration detail, accessible via OptimizeResult.iterations.

| Property | Type | Description |
| --- | --- | --- |
| iteration | number | Iteration number (1-indexed). |
| systemPrompt | string | System prompt used for this iteration. |
| passed | number | Test cases passed in this iteration. |
| total | number | Total test cases in this iteration. |
| testCases | TestCaseResult[] | Detailed test case results for this iteration. |
| cost | number | Cost for this iteration. |


Optimization Logs

When storeLogs is enabled in OptimizeConfig, four files are written to the log folder after optimization completes:

Default path: ./didactic-logs/optimize_<timestamp>/

| File | Description |
| --- | --- |
| summary.md | Human-readable report with configuration, metrics, and iteration progress |
| prompts.md | All system prompts used in each iteration |
| rawData.json | Complete iteration data for programmatic analysis |
| bestRun.json | Detailed results from the best-performing iteration |

rawData.json

Contains the complete optimization run data for programmatic analysis:

interface OptimizationReport {
  metadata: {
    timestamp: string; // ISO timestamp
    model: string; // LLM model used
    provider: string; // Provider (anthropic, openai, etc)
    thinking: boolean; // Extended thinking enabled
    targetSuccessRate: number; // Target (0.0-1.0)
    maxIterations: number | null; // Max iterations or null
    maxCost: number | null; // Max cost budget or null
    testCaseCount: number; // Number of test cases
    perTestThreshold: number; // Per-test threshold (default 1.0)
    rateLimitBatch?: number; // Batch size for rate limiting
    rateLimitPause?: number; // Pause seconds between batches
  };
  summary: {
    totalIterations: number;
    totalDurationMs: number;
    totalCost: number;
    totalInputTokens: number;
    totalOutputTokens: number;
    startRate: number; // Success rate at start
    endRate: number; // Success rate at end
    targetMet: boolean;
  };
  best: {
    iteration: number; // Which iteration was best
    successRate: number; // Success rate (0.0-1.0)
    passed: number; // Number of passing tests
    total: number; // Total tests
    fieldAccuracy: number; // Field-level accuracy
  };
  iterations: Array<{
    iteration: number;
    successRate: number;
    passed: number;
    total: number;
    correctFields: number;
    totalFields: number;
    fieldAccuracy: number;
    cost: number; // Cost for this iteration
    cumulativeCost: number; // Total cost so far
    durationMs: number;
    inputTokens: number;
    outputTokens: number;
    failures: Array<{
      testIndex: number;
      input: unknown;
      expected: unknown;
      actual: unknown;
      fields: Record<
        string,
        { expected: unknown; actual: unknown; passed: boolean }
      >;
    }>;
  }>;
}

bestRun.json

Contains detailed results from the best-performing iteration, with test results categorized into failures, partial failures, and successes:

interface BestRunReport {
  metadata: {
    iteration: number; // Which iteration was best
    model: string;
    provider: string;
    thinking: boolean;
    targetSuccessRate: number;
    perTestThreshold: number;
    rateLimitBatch?: number;
    rateLimitPause?: number;
  };
  results: {
    successRate: number; // Overall success rate
    passed: number; // Passed tests
    total: number; // Total tests
    fieldAccuracy: number; // Field-level accuracy
    correctFields: number;
    totalFields: number;
  };
  cost: {
    iteration: number; // Cost for this iteration
    cumulative: number; // Total cumulative cost
  };
  timing: {
    durationMs: number;
    inputTokens: number;
    outputTokens: number;
  };
  failures: Array<{
    // Tests that didn't meet the configured perTestThreshold
    testIndex: number;
    input: unknown;
    expected: unknown;
    actual: unknown;
    failedFields: Record<string, { expected: unknown; actual: unknown }>;
  }>;
  partialFailures: Array<{
    // Tests that passed but have some failing fields
    testIndex: number;
    passRate: number; // Percentage of fields passing
    input: unknown;
    expected: unknown;
    actual: unknown;
    failedFields: Record<string, { expected: unknown; actual: unknown }>;
  }>;
  successes: Array<{
    // Tests with 100% field accuracy
    testIndex: number;
    input: unknown;
    expected: unknown;
    actual: unknown;
  }>;
}

Exports

// Namespace
import { didactic } from '@docshield/didactic';
import didactic from '@docshield/didactic'; // default export

// Comparators
import {
  exact,
  within,
  oneOf,
  contains,
  presence,
  numeric,
  date,
  name,
  unordered,
  llmCompare,
  custom,
} from '@docshield/didactic';

// Executors
import { endpoint, fn } from '@docshield/didactic';

// Functions
import { evaluate, optimize } from '@docshield/didactic';

// Types
import type {
  // Creating custom comparators
  Comparator,
  ComparatorResult,
  ComparatorContext,
  // Creating custom executors
  Executor,
  ExecutorResult,
  // Main API types
  TestCase,
  EvalConfig,
  EvalResult,
  OptimizeConfig,
  OptimizeResult,
  // Executor configs
  EndpointConfig,
  FnConfig,
  // LLM configuration
  LLMConfig,
} from '@docshield/didactic';

// Enum
import { LLMProviders } from '@docshield/didactic';

Local Development

# Build and publish locally
npm run build && yalc publish

# In your project
yalc add @docshield/didactic