scoutos-evals

Scout Evaluation SDK for TypeScript - Evaluate AI agent outputs with type-safe assertions.

Table of Contents

  • What is this?
  • Installation
  • Quick Start
  • Client
  • Check Builders
  • Config Files
  • Evaluation Results
  • Error Handling
  • CI Integration
  • TypeScript Types
  • Troubleshooting
  • License

What is this?

scoutos-evals is an SDK for evaluating AI agent outputs. Use it to:

  • Assert on outputs — Check fields, types, patterns, and values
  • Use AI as a judge — Let Claude evaluate subjective qualities like helpfulness and accuracy
  • Run in CI — Fail builds when quality drops below thresholds
  • Track over time — Named evaluations group runs so you can spot regressions

Results are stored in the Scout dashboard where you can view history, compare runs, and debug failures.

Get your API key: app.scoutos.com/settings/api-keys


Installation

npm install scoutos-evals
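
Then make your API key available, as shown in Troubleshooting below (get your key from app.scoutos.com/settings/api-keys):

export SCOUT_API_KEY=sk_...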

Quick Start

import { createClient, checks } from "scoutos-evals";

const client = createClient(); // Uses SCOUT_API_KEY env var

const result = await client.evaluate({
  name: "my-agent-tests",
  tests: [
    {
      input: { query: "hello" },
      output: { response: "Hi there!" },
    },
  ],
  assert: [
    checks.field("response").exists(),
    checks.agent("Is this a friendly greeting?"),
  ],
});

console.log(result.passed);   // true
console.log(result.passRate); // 1.0
console.log(result.runUrl);   // https://app.scoutos.com/evaluations/runs/...

Client

Creating a Client

import { createClient } from "scoutos-evals";

// Uses SCOUT_API_KEY environment variable
const client = createClient();

// Or pass API key directly
const client = createClient("sk_...");

// Or full configuration
const client = createClient({
  apiKey: "sk_...",
  baseUrl: "https://api.scoutos.com", // Optional
});

Choosing a Method

| Method | Use When |
|--------|----------|
| evaluate() | You already have outputs (from logs, a database, or another system) |
| run() | You want to execute a function and evaluate its outputs in one step |
| runDataset() | You have a dataset stored in Scout and want to run your function against it |

Quick decision:

  • Have outputs already? → evaluate()
  • Want to run code and evaluate? → run()
  • Using Scout datasets? → runDataset()

evaluate()

Evaluate pre-computed outputs against assertions.

const result = await client.evaluate({
  // Name for tracking across runs (recommended)
  name: "product-search-tests",

  // Metadata for tracking versions
  metadata: {
    version: "v2.1",
    model: "gpt-4",
    gitCommit: "abc123",
  },

  // Test cases
  tests: [
    {
      name: "basic-search",
      input: { query: "laptops" },
      output: { results: [{ name: "MacBook", price: 1299 }] },
      assert: [checks.field("results").length.gte(1)], // Per-test assertions
    },
  ],

  // Global assertions (applied to all tests)
  assert: [
    checks.field("results").exists(),
    checks.agent("Are these results relevant to the query?"),
  ],

  // Tags for filtering in Scout UI
  tags: ["v1.0", "production"],
});

Single test shorthand:

const result = await client.evaluate({
  name: "single-test",
  input: { query: "hello" },
  output: { response: "Hi!" },
  assert: [checks.field("response").exists()],
});

run()

Execute a function against inputs and evaluate the outputs.

const result = await client.run(
  // Inputs
  [
    { query: "laptops" },
    { query: "phones" },
  ],

  // Function to execute (sync or async)
  async (input) => {
    return await mySearchFunction(input.query);
  },

  // Options
  {
    name: "search-function-tests",
    metadata: { version: "v1.0" },
    assert: [
      checks.field("results").exists(),
      checks.field("results").length.gte(1),
    ],
    tags: ["integration"],
  }
);

The SDK automatically measures latency for each execution.
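
For example, the measured timings surface on the returned result, using the result fields documented under Evaluation Results below:

console.log(result.avgLatencyMs); // average latency across all executions
for (const test of result.results) {
  console.log(`${test.test_name}: ${test.latency_ms} ms`);
}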

runDataset()

Run a function against a dataset stored in Scout.

const result = await client.runDataset({
  dataset: "product-search-golden-v2", // Dataset name or ID from Scout UI
  fn: async (input) => {
    return await mySearchFunction(input.query);
  },
  name: "dataset-regression-test",
  assert: [
    checks.field("results").exists(),
    checks.agent("Are results accurate?"),
  ],
});

Check Builders

Field Checks

Assert on any field in the output using a fluent builder.

import { checks } from "scoutos-evals";

// Existence
checks.field("results").exists()

// Equality
checks.field("status").equals("success")
checks.field("status").notEquals("error")

// Numeric comparisons
checks.field("count").gt(0)       // greater than
checks.field("count").gte(1)      // greater than or equal
checks.field("count").lt(100)     // less than
checks.field("count").lte(99)     // less than or equal

// String/array contains
checks.field("message").includes("hello")   // substring
checks.field("tags").includes("urgent")     // array element

// Regex matching
checks.field("email").matches(/^.+@.+\..+$/)

// Type checking
checks.field("count").isType("number")
checks.field("items").isType("array")
checks.field("data").isType("object")

// Length (for strings and arrays)
checks.field("items").length.gte(5)
checks.field("name").length.lte(100)

Nested paths:

// Dot notation
checks.field("user.profile.name").exists()

// Array indices
checks.field("items[0].name").equals("first")

// Mixed
checks.field("users[0].profile.email").matches(/^.+@.+$/)

Agent Checks

Use AI to evaluate subjective qualities.

// Simple prompt
checks.agent("Is this response helpful and accurate?")

// With criteria and weights
checks.agent({
  prompt: "Evaluate the quality of this search result",
  criteria: [
    {
      name: "relevance",
      description: "Are results relevant to the query?",
      weight: 2.0,
    },
    {
      name: "completeness",
      description: "Does the response cover all aspects?",
      weight: 1.0,
    },
  ],
  model: "claude-sonnet-4", // Default model
})

Available models:

  • claude-sonnet-4 (default) — Good balance of speed and quality
  • claude-opus-4 — Highest quality, slower and more expensive
  • claude-haiku — Fastest, best for simple checks

Custom Checks

Define assertions that run locally in your environment.

// Return boolean
checks.custom("has-items", (input, output) => {
  return Array.isArray(output.items) && output.items.length > 0;
})

// Return detailed result
checks.custom("price-valid", (input, output) => ({
  passed: output.price > 0 && output.price < 10000,
  score: output.price > 0 ? 1.0 : 0.0,
  reason: output.price <= 0 ? "Price must be positive" : undefined,
  metadata: { actualPrice: output.price },
}))

// Async check (call external APIs, databases, etc.)
checks.custom("api-validation", async (input, output) => {
  const isValid = await validateWithExternalAPI(output);
  return { passed: isValid };
})

Scorer References

Reference scorers saved in the Scout UI.

checks.scorer("scorer_abc123")
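
A scorer reference can sit alongside any other checks in an assert list. A minimal sketch, reusing the single-test shorthand from above (the scorer ID is a placeholder):

const result = await client.evaluate({
  name: "scorer-ref-example",
  input: { query: "hello" },
  output: { response: "Hi!" },
  assert: [
    checks.field("response").exists(),
    checks.scorer("scorer_abc123"), // scorer saved in the Scout UI
  ],
});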

Config Files

Define evaluations in YAML or JSON for version control and sharing.

YAML Format

# eval.yaml
name: product-search-tests

metadata:
  version: "1.0"
  environment: staging

tests:
  - name: laptop-search
    input:
      query: "gaming laptops"
    output:
      results:
        - name: "ASUS ROG"
          price: 1299
    assert:
      - field: results
        exists: true
      - field: results
        length:
          gte: 1
      - agent: "Are these gaming laptops?"

# Global assertions
assert:
  - field: results
    exists: true

tags:
  - regression

JSON Format

{
  "name": "product-search-tests",
  "tests": [
    {
      "name": "basic-search",
      "input": { "query": "laptops" },
      "output": { "results": [{ "name": "MacBook" }] },
      "assert": [
        { "field": "results", "exists": true },
        { "agent": "Are results relevant?" }
      ]
    }
  ]
}

Loading Config Files

import { loadConfig, createClient } from "scoutos-evals";

const config = await loadConfig("./eval.yaml");
const client = createClient();
const result = await client.evaluate(config);
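
Because loadConfig resolves to a plain options object, you can layer run-specific fields on top before evaluating. A sketch, assuming the fields shown in the examples above:

const config = await loadConfig("./eval.yaml");
const result = await client.evaluate({
  ...config,
  metadata: { ...config.metadata, gitCommit: process.env.GIT_COMMIT }, // add CI metadata
  tags: ["nightly"], // replaces any tags from the file
});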

Evaluation Results

const result = await client.evaluate({ ... });

// Overall status
result.passed      // boolean — did all tests pass?
result.passRate    // number — 0.0 to 1.0
result.passCount   // number
result.failCount   // number
result.totalCount  // number

// Performance
result.avgLatencyMs // number — average latency in ms

// Details
result.results     // TestResult[] — all test results
result.failures    // TestResult[] — only failed tests

// Scout UI
result.runId       // string
result.runUrl      // string — link to view in dashboard

// Get specific test by name
const test = result.getTest("laptop-search");

// Full API response
result.raw
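
These fields make it straightforward to gate on a quality threshold rather than requiring every test to pass. A minimal sketch using only the fields above:

if (result.passRate < 0.9) {
  console.error(`Pass rate ${(result.passRate * 100).toFixed(1)}% is below the 90% threshold`);
  for (const failure of result.failures) {
    console.error(`  ${failure.test_name} (score: ${failure.composite_score})`);
  }
  process.exit(1);
}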

Error Handling

import { createClient, checks } from "scoutos-evals";

const client = createClient();

try {
  const result = await client.evaluate({ ... });
} catch (error) {
  if (error.message.includes("401")) {
    // Invalid or expired API key
    console.error("Authentication failed. Check your SCOUT_API_KEY.");
  } else if (error.message.includes("429")) {
    // Rate limited
    console.error("Rate limited. Slow down requests or contact support.");
  } else if (error.message.includes("5")) {
    // Server error (500, 502, 503, etc.)
    console.error("Scout API is temporarily unavailable. Retry later.");
  } else {
    // Network error or other issue
    console.error("Evaluation failed:", error.message);
  }
}

Common errors:

| Status | Meaning | Solution |
|--------|---------|----------|
| 401 | Invalid API key | Check SCOUT_API_KEY is set correctly |
| 403 | Forbidden | Your API key doesn't have access to this resource |
| 404 | Not found | Dataset or scorer ID doesn't exist |
| 429 | Rate limited | Add delays between requests or contact support |
| 500+ | Server error | Retry with exponential backoff |
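
For the 429 and 500+ rows, a small exponential-backoff wrapper is usually enough in scripts. A sketch, not part of the SDK:

async function withRetry<T>(fn: () => Promise<T>, attempts = 3): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (error) {
      const msg = error instanceof Error ? error.message : String(error);
      const retryable = msg.includes("429") || /5\d\d/.test(msg);
      if (!retryable || i >= attempts - 1) throw error;
      await new Promise((r) => setTimeout(r, 2 ** i * 1000)); // 1s, 2s, 4s, ...
    }
  }
}

const result = await withRetry(() => client.evaluate({ ... }));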


CI Integration

import { createClient, checks } from "scoutos-evals";

async function runEvaluation() {
  const client = createClient();

  const result = await client.evaluate({
    name: "ci-regression-tests",
    metadata: {
      gitCommit: process.env.GIT_COMMIT,
      branch: process.env.GIT_BRANCH,
    },
    tests: [...],
    assert: [...],
  });

  console.log(`Pass rate: ${(result.passRate * 100).toFixed(1)}%`);
  console.log(`Results: ${result.passCount}/${result.totalCount}`);
  console.log(`Details: ${result.runUrl}`);

  if (!result.passed) {
    console.error("\nFailed tests:");
    for (const failure of result.failures) {
      console.error(`  ${failure.test_name}`);
      for (const scorer of failure.scorer_results.filter(s => !s.passed)) {
        console.error(`    - ${scorer.scorer_id}: ${scorer.reason}`);
      }
    }
    process.exit(1);
  }
}

runEvaluation();

GitHub Actions:

name: AI Evaluation

on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - name: Run Evaluations
        env:
          SCOUT_API_KEY: ${{ secrets.SCOUT_API_KEY }}
          GIT_COMMIT: ${{ github.sha }}
          GIT_BRANCH: ${{ github.ref_name }}
        run: npx tsx scripts/evaluate.ts

TypeScript Types

Key interfaces for IDE autocompletion and type safety:

interface TestCase {
  name?: string;
  input?: Record<string, unknown>;
  output: Record<string, unknown>;
  expected_output?: Record<string, unknown>; // For custom comparison checks
  assert?: Check[];
  latency_ms?: number;
}

interface TestResult {
  test_name: string;
  passed: boolean;
  status: "pass" | "fail" | "error";
  input: Record<string, unknown>;
  output: Record<string, unknown>;
  latency_ms: number;
  scorer_results: ScorerResult[];
  composite_score: number;
}

interface ScorerResult {
  scorer_id: string;
  passed: boolean;
  score: number;          // 0.0 to 1.0
  reason?: string;        // Why it failed
  metadata?: Record<string, unknown>;
}

interface EvaluationMetrics {
  pass_rate: number;
  pass_count: number;
  fail_count: number;
  total_count: number;
  avg_latency_ms: number;
}

// Check is a union of all check types
type Check =
  | FieldExistsCheck
  | ExactMatchCheck
  | ContainsCheck
  | RegexMatchCheck
  | TypeCheckCheck
  | AgentCheck
  | CustomCheck;
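
If these interfaces are exported from the package root (an assumption, not confirmed above), you can type fixtures explicitly to get autocompletion on test cases:

// Assumes TestCase is exported; adjust the import if your version differs.
import type { TestCase } from "scoutos-evals";

const tests: TestCase[] = [
  {
    name: "basic-search",
    input: { query: "laptops" },
    output: { results: [{ name: "MacBook" }] },
  },
];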

Troubleshooting

"API key required" error

Make sure SCOUT_API_KEY is set:

export SCOUT_API_KEY=sk_...

Or pass it directly:

const client = createClient("sk_...");

Tests pass locally but fail in CI

  • Check that SCOUT_API_KEY is set in your CI secrets
  • Ensure the secret name matches exactly (case-sensitive)

Agent checks are slow

Agent checks call AI models, which adds latency. Options:

  • Use claude-haiku for faster checks: checks.agent({ prompt: "...", model: "claude-haiku" })
  • Run fewer agent checks, more field checks
  • Run evaluations in parallel if you have multiple test suites (see the sketch below)
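
Independent suites can run concurrently. A sketch, with suite contents elided:

const [searchResult, chatResult] = await Promise.all([
  client.evaluate({ name: "search-suite", tests: [...], assert: [...] }),
  client.evaluate({ name: "chat-suite", tests: [...], assert: [...] }),
]);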

Custom check isn't running

Custom checks run locally, not on Scout servers. Make sure:

  • The function doesn't throw unhandled errors (a defensive pattern is sketched below)
  • You're returning boolean or { passed: boolean, ... }
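
A defensive pattern: wrap risky logic so a thrown error becomes a failed result with a reason instead of an unhandled exception. A sketch; the check name and the output.payload field are placeholders:

checks.custom("safe-check", (input, output) => {
  try {
    // JSON.parse throws on malformed input; the catch turns that into a clean failure
    return { passed: JSON.parse(String(output.payload)).ok === true };
  } catch (error) {
    return { passed: false, reason: `check threw: ${error}` };
  }
})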

Rate limiting

If you're running many evaluations in a loop, add delays:

for (const test of tests) {
  await client.evaluate(test);
  await new Promise(r => setTimeout(r, 100)); // 100ms delay
}

License

MIT