# @scoutos/evals
Scout Evaluation SDK for TypeScript - Evaluate AI agent outputs with type-safe assertions.
## Table of Contents
- What is this?
- Installation
- Quick Start
- Client
- Check Builders
- Config Files
- Evaluation Results
- Error Handling
- CI Integration
- TypeScript Types
- Troubleshooting
## What is this?
@scoutos/evals is an SDK for evaluating AI agent outputs. Use it to:
- Assert on outputs — Check fields, types, patterns, and values
- Use AI as a judge — Let Claude evaluate subjective qualities like helpfulness and accuracy
- Run in CI — Fail builds when quality drops below thresholds
- Track over time — Named evaluations group runs so you can spot regressions
Results are stored in the Scout dashboard where you can view history, compare runs, and debug failures.
Get your API key: app.scoutos.com/settings/api-keys
## Installation
```bash
npm install @scoutos/evals
```

## Quick Start

```ts
import { createClient, checks } from "@scoutos/evals";

const client = createClient(); // Uses SCOUT_API_KEY env var

const result = await client.evaluate({
  name: "my-agent-tests",
  tests: [
    {
      input: { query: "hello" },
      output: { response: "Hi there!" },
    },
  ],
  assert: [
    checks.field("response").exists(),
    checks.agent("Is this a friendly greeting?"),
  ],
});

console.log(result.passed); // true
console.log(result.passRate); // 1.0
console.log(result.runUrl); // https://app.scoutos.com/evaluations/runs/...
```

## Client
### Creating a Client
```ts
import { createClient } from "@scoutos/evals";

// Uses SCOUT_API_KEY environment variable
const client = createClient();

// Or pass API key directly
const client = createClient("sk_...");

// Or full configuration
const client = createClient({
  apiKey: "sk_...",
  baseUrl: "https://api.scoutos.com", // Optional
});
```

### Choosing a Method
| Method | Use When |
|--------|----------|
| evaluate() | You already have outputs (from logs, a database, or another system) |
| run() | You want to execute a function and evaluate its outputs in one step |
| runDataset() | You have a dataset stored in Scout and want to run your function against it |
Quick decision:
- Have outputs already? → evaluate()
- Want to run code and evaluate? → run()
- Using Scout datasets? → runDataset()
### evaluate()
Evaluate pre-computed outputs against assertions.
```ts
const result = await client.evaluate({
  // Name for tracking across runs (recommended)
  name: "product-search-tests",

  // Metadata for tracking versions
  metadata: {
    version: "v2.1",
    model: "gpt-4",
    gitCommit: "abc123",
  },

  // Test cases
  tests: [
    {
      name: "basic-search",
      input: { query: "laptops" },
      output: { results: [{ name: "MacBook", price: 1299 }] },
      assert: [checks.field("results").length.gte(1)], // Per-test assertions
    },
  ],

  // Global assertions (applied to all tests)
  assert: [
    checks.field("results").exists(),
    checks.agent("Are these results relevant to the query?"),
  ],

  // Tags for filtering in Scout UI
  tags: ["v1.0", "production"],
});
```

Single test shorthand:
```ts
const result = await client.evaluate({
  name: "single-test",
  input: { query: "hello" },
  output: { response: "Hi!" },
  assert: [checks.field("response").exists()],
});
```

### run()
Execute a function against inputs and evaluate the outputs.
```ts
const result = await client.run(
  // Inputs
  [
    { query: "laptops" },
    { query: "phones" },
  ],
  // Function to execute (sync or async)
  async (input) => {
    return await mySearchFunction(input.query);
  },
  // Options
  {
    name: "search-function-tests",
    metadata: { version: "v1.0" },
    assert: [
      checks.field("results").exists(),
      checks.field("results").length.gte(1),
    ],
    tags: ["integration"],
  }
);
```

The SDK automatically measures latency for each execution.
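For example, you can read the measured latency back off the result (the same fields described under Evaluation Results and TypeScript Types below):

```ts
console.log(result.avgLatencyMs); // average latency across all executions

for (const test of result.results) {
  console.log(`${test.test_name}: ${test.latency_ms}ms`);
}
```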
### runDataset()
Run a function against a dataset stored in Scout.
```ts
const result = await client.runDataset({
  dataset: "product-search-golden-v2", // Dataset name or ID from Scout UI
  fn: async (input) => {
    return await mySearchFunction(input.query);
  },
  name: "dataset-regression-test",
  assert: [
    checks.field("results").exists(),
    checks.agent("Are results accurate?"),
  ],
});
```

## Check Builders
### Field Checks
Assert on any field in the output using a fluent builder.
```ts
import { checks } from "@scoutos/evals";

// Existence
checks.field("results").exists()

// Equality
checks.field("status").equals("success")
checks.field("status").notEquals("error")

// Numeric comparisons
checks.field("count").gt(0)    // greater than
checks.field("count").gte(1)   // greater than or equal
checks.field("count").lt(100)  // less than
checks.field("count").lte(99)  // less than or equal

// String/array contains
checks.field("message").includes("hello") // substring
checks.field("tags").includes("urgent")   // array element

// Regex matching
checks.field("email").matches(/^.+@.+\..+$/)

// Type checking
checks.field("count").isType("number")
checks.field("items").isType("array")
checks.field("data").isType("object")

// Length (for strings and arrays)
checks.field("items").length.gte(5)
checks.field("name").length.lte(100)
```

Nested paths:
```ts
// Dot notation
checks.field("user.profile.name").exists()

// Array indices
checks.field("items[0].name").equals("first")

// Mixed
checks.field("users[0].profile.email").matches(/^.+@.+$/)
```

### Agent Checks
Use AI to evaluate subjective qualities.
```ts
// Simple prompt
checks.agent("Is this response helpful and accurate?")

// With criteria and weights
checks.agent({
  prompt: "Evaluate the quality of this search result",
  criteria: [
    {
      name: "relevance",
      description: "Are results relevant to the query?",
      weight: 2.0,
    },
    {
      name: "completeness",
      description: "Does the response cover all aspects?",
      weight: 1.0,
    },
  ],
  model: "claude-sonnet-4", // Default model
})
```

Available models:
- claude-sonnet-4 (default) — Good balance of speed and quality
- claude-opus-4 — Highest quality, slower and more expensive
- claude-haiku — Fastest, best for simple checks
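In practice, you might match the model to the difficulty of the check. A small sketch using the documented prompt and model options:

```ts
// Cheap binary check: haiku is usually enough.
checks.agent({ prompt: "Does the response greet the user?", model: "claude-haiku" })

// Nuanced judgment: opus for the highest-quality evaluation.
checks.agent({ prompt: "Judge the factual accuracy of this answer in depth", model: "claude-opus-4" })
```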
### Custom Checks
Define assertions that run locally in your environment.
```ts
// Return boolean
checks.custom("has-items", (input, output) => {
  return output.items && output.items.length > 0;
})

// Return detailed result
checks.custom("price-valid", (input, output) => ({
  passed: output.price > 0 && output.price < 10000,
  score: output.price > 0 ? 1.0 : 0.0,
  reason: output.price <= 0 ? "Price must be positive" : undefined,
  metadata: { actualPrice: output.price },
}))

// Async check (call external APIs, databases, etc.)
checks.custom("api-validation", async (input, output) => {
  const isValid = await validateWithExternalAPI(output);
  return { passed: isValid };
})
```

### Scorer References
Reference scorers saved in Scout UI.
```ts
checks.scorer("scorer_abc123")
```
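A minimal sketch wiring a saved scorer into an evaluation alongside a field check (the scorer ID is a placeholder; copy yours from the Scout UI):

```ts
const result = await client.evaluate({
  name: "scorer-smoke-test",
  input: { query: "hello" },
  output: { response: "Hi!" },
  assert: [
    checks.field("response").exists(),
    checks.scorer("scorer_abc123"), // placeholder ID
  ],
});
```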
## Config Files

Define evaluations in YAML or JSON for version control and sharing.
### YAML Format
```yaml
# eval.yaml
name: product-search-tests

metadata:
  version: "1.0"
  environment: staging

tests:
  - name: laptop-search
    input:
      query: "gaming laptops"
    output:
      results:
        - name: "ASUS ROG"
          price: 1299
    assert:
      - field: results
        exists: true
      - field: results
        length:
          gte: 1
      - agent: "Are these gaming laptops?"

# Global assertions
assert:
  - field: results
    exists: true

tags:
  - regression
```

### JSON Format
```json
{
  "name": "product-search-tests",
  "tests": [
    {
      "name": "basic-search",
      "input": { "query": "laptops" },
      "output": { "results": [{ "name": "MacBook" }] },
      "assert": [
        { "field": "results", "exists": true },
        { "agent": "Are results relevant?" }
      ]
    }
  ]
}
```

### Loading Config Files
```ts
import { loadConfig, createClient } from "@scoutos/evals";

const config = await loadConfig("./eval.yaml");
const client = createClient();
const result = await client.evaluate(config);
```

## Evaluation Results
```ts
const result = await client.evaluate({ ... });

// Overall status
result.passed      // boolean — did all tests pass?
result.passRate    // number — 0.0 to 1.0
result.passCount   // number
result.failCount   // number
result.totalCount  // number

// Performance
result.avgLatencyMs // number — average latency in ms

// Details
result.results  // TestResult[] — all test results
result.failures // TestResult[] — only failed tests

// Scout UI
result.runId  // string
result.runUrl // string — link to view in dashboard

// Get specific test by name
const test = result.getTest("laptop-search");

// Full API response
result.raw
```

## Error Handling
```ts
import { createClient, checks } from "@scoutos/evals";

const client = createClient();

try {
  const result = await client.evaluate({ ... });
} catch (error) {
  if (error.message.includes("401")) {
    // Invalid or expired API key
    console.error("Authentication failed. Check your SCOUT_API_KEY.");
  } else if (error.message.includes("429")) {
    // Rate limited
    console.error("Rate limited. Slow down requests or contact support.");
  } else if (/\b5\d{2}\b/.test(error.message)) {
    // Server error (500, 502, 503, etc.)
    console.error("Scout API is temporarily unavailable. Retry later.");
  } else {
    // Network error or other issue
    console.error("Evaluation failed:", error.message);
  }
}
```

Common errors:
| Status | Meaning | Solution |
|--------|---------|----------|
| 401 | Invalid API key | Check SCOUT_API_KEY is set correctly |
| 403 | Forbidden | Your API key doesn't have access to this resource |
| 404 | Not found | Dataset or scorer ID doesn't exist |
| 429 | Rate limited | Add delays between requests or contact support |
| 500+ | Server error | Retry with exponential backoff |
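The table suggests exponential backoff for server errors; here is a minimal generic retry wrapper you could adapt (illustrative, not part of the SDK):

```ts
// Illustrative retry helper, not part of @scoutos/evals.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      // Only retry 5xx-style server errors; rethrow everything else immediately.
      if (attempt >= maxAttempts || !/\b5\d{2}\b/.test(message)) throw error;
      await new Promise((r) => setTimeout(r, 2 ** attempt * 500)); // 1s, 2s, 4s...
    }
  }
}

// Usage: const result = await withRetry(() => client.evaluate({ ... }));
```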
## CI Integration
```ts
import { createClient, checks } from "@scoutos/evals";

async function runEvaluation() {
  const client = createClient();

  const result = await client.evaluate({
    name: "ci-regression-tests",
    metadata: {
      gitCommit: process.env.GIT_COMMIT,
      branch: process.env.GIT_BRANCH,
    },
    tests: [...],
    assert: [...],
  });

  console.log(`Pass rate: ${(result.passRate * 100).toFixed(1)}%`);
  console.log(`Results: ${result.passCount}/${result.totalCount}`);
  console.log(`Details: ${result.runUrl}`);

  if (!result.passed) {
    console.error("\nFailed tests:");
    for (const failure of result.failures) {
      console.error(`  ${failure.test_name}`);
      for (const scorer of failure.scorer_results.filter(s => !s.passed)) {
        console.error(`    - ${scorer.scorer_id}: ${scorer.reason}`);
      }
    }
    process.exit(1);
  }
}

runEvaluation();
```

GitHub Actions:
```yaml
name: AI Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - name: Run Evaluations
        env:
          SCOUT_API_KEY: ${{ secrets.SCOUT_API_KEY }}
          GIT_COMMIT: ${{ github.sha }}
          GIT_BRANCH: ${{ github.ref_name }}
        run: npx tsx scripts/evaluate.ts
```

## TypeScript Types
Key interfaces for IDE autocompletion and type safety:
```ts
interface TestCase {
  name?: string;
  input?: Record<string, unknown>;
  output: Record<string, unknown>;
  expected_output?: Record<string, unknown>; // For custom comparison checks
  assert?: Check[];
  latency_ms?: number;
}

interface TestResult {
  test_name: string;
  passed: boolean;
  status: "pass" | "fail" | "error";
  input: Record<string, unknown>;
  output: Record<string, unknown>;
  latency_ms: number;
  scorer_results: ScorerResult[];
  composite_score: number;
}

interface ScorerResult {
  scorer_id: string;
  passed: boolean;
  score: number;    // 0.0 to 1.0
  reason?: string;  // Why it failed
  metadata?: Record<string, unknown>;
}

interface EvaluationMetrics {
  pass_rate: number;
  pass_count: number;
  fail_count: number;
  total_count: number;
  avg_latency_ms: number;
}

// Check is a union of all check types
type Check =
  | FieldExistsCheck
  | ExactMatchCheck
  | ContainsCheck
  | RegexMatchCheck
  | TypeCheckCheck
  | AgentCheck
  | CustomCheck;
```

## Troubleshooting
"API key required" error
Make sure SCOUT_API_KEY is set:
```bash
export SCOUT_API_KEY=sk_...
```

Or pass it directly:

```ts
const client = createClient("sk_...");
```

### Tests pass locally but fail in CI
- Check that SCOUT_API_KEY is set in your CI secrets (see the guard sketch below)
- Ensure the secret name matches exactly (case-sensitive)
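A quick guard at the top of your evaluation script makes a missing secret fail loudly instead of surfacing as a confusing downstream error (sketch):

```ts
if (!process.env.SCOUT_API_KEY) {
  throw new Error("SCOUT_API_KEY is not set. Check your CI secrets configuration.");
}
```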
### Agent checks are slow
Agent checks use AI models, which have latency. Options:
- Use claude-haiku for faster checks: checks.agent({ prompt: "...", model: "claude-haiku" })
- Run fewer agent checks and more field checks
- Run evaluations in parallel if you have multiple test suites (see the sketch below)
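For the parallel option, independent suites can simply be awaited together (sketch; suiteA and suiteB stand in for your own evaluate() configs):

```ts
// suiteA and suiteB are hypothetical evaluation configs.
const [resultA, resultB] = await Promise.all([
  client.evaluate(suiteA),
  client.evaluate(suiteB),
]);
```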
### Custom check isn't running
Custom checks run locally, not on Scout servers. Make sure:
- The function doesn't throw unhandled errors (see the defensive sketch below)
- You're returning boolean or { passed: boolean, ... }
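A defensive pattern that turns exceptions into failed checks instead of crashes (a sketch using the documented detailed-result shape):

```ts
checks.custom("safe-items-check", (input, output) => {
  try {
    return { passed: Array.isArray(output.items) && output.items.length > 0 };
  } catch (error) {
    // Surface the problem as a failed check rather than an unhandled throw.
    return { passed: false, reason: `Check threw: ${String(error)}` };
  }
})
```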
### Rate limiting
If you're running many evaluations in a loop, add delays:
```ts
for (const test of tests) {
  await client.evaluate(test);
  await new Promise(r => setTimeout(r, 100)); // 100ms delay
}
```

## License
MIT
