# @scoutos/evals
Scout Evaluation SDK for TypeScript - Evaluate AI agent outputs with type-safe assertions.
## Table of Contents
- What is this?
- Installation
- Quick Start
- Client
- Check Builders
- Config Files
- Evaluation Results
- Error Handling
- CI Integration
- TypeScript Types
- Troubleshooting
## What is this?
@scoutos/evals is an SDK for evaluating AI agent outputs. Use it to:
- Assert on outputs — Check fields, types, patterns, and values
- Use AI as a judge — Let Claude evaluate subjective qualities like helpfulness and accuracy
- Run in CI — Fail builds when quality drops below thresholds
- Track over time — Named evaluations group runs so you can spot regressions
Results are stored in the Scout dashboard where you can view history, compare runs, and debug failures.
Get your API key: app.scoutos.com/settings/api-keys
## Installation
```bash
npm install @scoutos/evals
```

## Quick Start

```ts
import { createClient, checks } from "@scoutos/evals";

const client = createClient(); // Uses SCOUT_API_KEY env var

const result = await client.evaluate({
  name: "my-agent-tests",
  tests: [
    {
      input: { query: "hello" },
      output: { response: "Hi there!" },
    },
  ],
  assert: [
    checks.field("response").exists(),
    checks.agent("Is this a friendly greeting?"),
  ],
});

console.log(result.passed); // true
console.log(result.passRate); // 1.0
console.log(result.runUrl); // https://app.scoutos.com/evaluations/runs/...
```

## Client
### Creating a Client
```ts
import { createClient } from "@scoutos/evals";

// Uses SCOUT_API_KEY environment variable
const client = createClient();

// Or pass API key directly
const client = createClient("sk_...");

// Or full configuration
const client = createClient({
  apiKey: "sk_...",
  baseUrl: "https://api.scoutos.com", // Optional
});
```

### Choosing a Method
| Method | Use When |
|--------|----------|
| evaluate() | You already have outputs (from logs, a database, or another system) |
| run() | You want to execute a function and evaluate its outputs in one step |
| runDataset() | You have a dataset stored in Scout and want to run your function against it |
Quick decision:
- Have outputs already? → evaluate()
- Want to run code and evaluate? → run()
- Using Scout datasets? → runDataset()
### evaluate()
Evaluate pre-computed outputs against assertions.
```ts
const result = await client.evaluate({
  // Name for tracking across runs (recommended)
  name: "product-search-tests",

  // Metadata for tracking versions
  metadata: {
    version: "v2.1",
    model: "gpt-4",
    gitCommit: "abc123",
  },

  // Test cases
  tests: [
    {
      name: "basic-search",
      input: { query: "laptops" },
      output: { results: [{ name: "MacBook", price: 1299 }] },
      assert: [checks.field("results").length.gte(1)], // Per-test assertions
    },
  ],

  // Global assertions (applied to all tests)
  assert: [
    checks.field("results").exists(),
    checks.agent("Are these results relevant to the query?"),
  ],

  // Tags for filtering in Scout UI
  tags: ["v1.0", "production"],
});
```

Single test shorthand:
```ts
const result = await client.evaluate({
  name: "single-test",
  input: { query: "hello" },
  output: { response: "Hi!" },
  assert: [checks.field("response").exists()],
});
```

### run()
Execute a function against inputs and evaluate the outputs.
```ts
const result = await client.run(
  // Inputs
  [
    { query: "laptops" },
    { query: "phones" },
  ],
  // Function to execute (sync or async)
  async (input) => {
    return await mySearchFunction(input.query);
  },
  // Options
  {
    name: "search-function-tests",
    metadata: { version: "v1.0" },
    assert: [
      checks.field("results").exists(),
      checks.field("results").length.gte(1),
    ],
    tags: ["integration"],
  }
);
```

The SDK automatically measures latency for each execution.
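For example, you can read the measured latency back off the result (the same fields described under Evaluation Results and TypeScript Types below):

```ts
console.log(result.avgLatencyMs); // average latency across all executions

for (const test of result.results) {
  console.log(`${test.test_name}: ${test.latency_ms}ms`);
}
```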
### runDataset()
Run a function against a dataset stored in Scout.
```ts
const result = await client.runDataset({
  dataset: "product-search-golden-v2", // Dataset name or ID from Scout UI
  fn: async (input) => {
    return await mySearchFunction(input.query);
  },
  name: "dataset-regression-test",
  assert: [
    checks.field("results").exists(),
    checks.agent("Are results accurate?"),
  ],
});
```

## Check Builders
### Field Checks
Assert on any field in the output using a fluent builder.
```ts
import { checks } from "@scoutos/evals";

// Existence
checks.field("results").exists()

// Equality
checks.field("status").equals("success")
checks.field("status").notEquals("error")

// Numeric comparisons
checks.field("count").gt(0)    // greater than
checks.field("count").gte(1)   // greater than or equal
checks.field("count").lt(100)  // less than
checks.field("count").lte(99)  // less than or equal

// String/array contains
checks.field("message").includes("hello") // substring
checks.field("tags").includes("urgent")   // array element

// Regex matching
checks.field("email").matches(/^.+@.+\..+$/)

// Type checking
checks.field("count").isType("number")
checks.field("items").isType("array")
checks.field("data").isType("object")

// Length (for strings and arrays)
checks.field("items").length.gte(5)
checks.field("name").length.lte(100)
```

Nested paths:
```ts
// Dot notation
checks.field("user.profile.name").exists()

// Array indices
checks.field("items[0].name").equals("first")

// Mixed
checks.field("users[0].profile.email").matches(/^.+@.+$/)
```

### Agent Checks
Use AI to evaluate subjective qualities.
```ts
// Simple prompt
checks.agent("Is this response helpful and accurate?")

// With criteria and weights
checks.agent({
  prompt: "Evaluate the quality of this search result",
  criteria: [
    {
      name: "relevance",
      description: "Are results relevant to the query?",
      weight: 2.0,
    },
    {
      name: "completeness",
      description: "Does the response cover all aspects?",
      weight: 1.0,
    },
  ],
  model: "claude-sonnet-4", // Default model
})
```

Available models:
- claude-sonnet-4 (default) — Good balance of speed and quality
- claude-opus-4 — Highest quality, slower and more expensive
- claude-haiku — Fastest, best for simple checks
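In practice, you might match the model to the difficulty of the check. A small sketch using the documented prompt and model options:

```ts
// Cheap binary check: haiku is usually enough.
checks.agent({ prompt: "Does the response greet the user?", model: "claude-haiku" })

// Nuanced judgment: opus for the highest-quality evaluation.
checks.agent({ prompt: "Judge the factual accuracy of this answer in depth", model: "claude-opus-4" })
```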
### Custom Checks
Define assertions that run locally in your environment.
```ts
// Return boolean
checks.custom("has-items", (input, output) => {
  return output.items && output.items.length > 0;
})

// Return detailed result
checks.custom("price-valid", (input, output) => ({
  passed: output.price > 0 && output.price < 10000,
  score: output.price > 0 ? 1.0 : 0.0,
  reason: output.price <= 0 ? "Price must be positive" : undefined,
  metadata: { actualPrice: output.price },
}))

// Async check (call external APIs, databases, etc.)
checks.custom("api-validation", async (input, output) => {
  const isValid = await validateWithExternalAPI(output);
  return { passed: isValid };
})
```

### Scorer References
Reference scorers saved in Scout UI.
```ts
checks.scorer("scorer_abc123")
```
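A minimal sketch wiring a saved scorer into an evaluation alongside a field check (the scorer ID is a placeholder; copy yours from the Scout UI):

```ts
const result = await client.evaluate({
  name: "scorer-smoke-test",
  input: { query: "hello" },
  output: { response: "Hi!" },
  assert: [
    checks.field("response").exists(),
    checks.scorer("scorer_abc123"), // placeholder ID
  ],
});
```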
## Config Files

Define evaluations in YAML or JSON for version control and sharing.
### YAML Format
```yaml
# eval.yaml
name: product-search-tests

metadata:
  version: "1.0"
  environment: staging

tests:
  - name: laptop-search
    input:
      query: "gaming laptops"
    output:
      results:
        - name: "ASUS ROG"
          price: 1299
    assert:
      - field: results
        exists: true
      - field: results
        length:
          gte: 1
      - agent: "Are these gaming laptops?"

# Global assertions
assert:
  - field: results
    exists: true

tags:
  - regression
```

### JSON Format
```json
{
  "name": "product-search-tests",
  "tests": [
    {
      "name": "basic-search",
      "input": { "query": "laptops" },
      "output": { "results": [{ "name": "MacBook" }] },
      "assert": [
        { "field": "results", "exists": true },
        { "agent": "Are results relevant?" }
      ]
    }
  ]
}
```

### Loading Config Files
```ts
import { loadConfig, createClient } from "@scoutos/evals";

const config = await loadConfig("./eval.yaml");
const client = createClient();
const result = await client.evaluate(config);
```

## Evaluation Results
```ts
const result = await client.evaluate({ ... });

// Overall status
result.passed      // boolean — did all tests pass?
result.passRate    // number — 0.0 to 1.0
result.passCount   // number
result.failCount   // number
result.totalCount  // number

// Performance
result.avgLatencyMs // number — average latency in ms

// Details
result.results  // TestResult[] — all test results
result.failures // TestResult[] — only failed tests

// Scout UI
result.runId  // string
result.runUrl // string — link to view in dashboard

// Get specific test by name
const test = result.getTest("laptop-search");

// Full API response
result.raw
```

## Error Handling
```ts
import { createClient, checks } from "@scoutos/evals";

const client = createClient();

try {
  const result = await client.evaluate({ ... });
} catch (error) {
  if (error.message.includes("401")) {
    // Invalid or expired API key
    console.error("Authentication failed. Check your SCOUT_API_KEY.");
  } else if (error.message.includes("429")) {
    // Rate limited
    console.error("Rate limited. Slow down requests or contact support.");
  } else if (/\b5\d{2}\b/.test(error.message)) {
    // Server error (500, 502, 503, etc.)
    console.error("Scout API is temporarily unavailable. Retry later.");
  } else {
    // Network error or other issue
    console.error("Evaluation failed:", error.message);
  }
}
```

Common errors:
| Status | Meaning | Solution |
|--------|---------|----------|
| 401 | Invalid API key | Check SCOUT_API_KEY is set correctly |
| 403 | Forbidden | Your API key doesn't have access to this resource |
| 404 | Not found | Dataset or scorer ID doesn't exist |
| 429 | Rate limited | Add delays between requests or contact support |
| 500+ | Server error | Retry with exponential backoff |
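The table suggests exponential backoff for server errors; here is a minimal generic retry wrapper you could adapt (illustrative, not part of the SDK):

```ts
// Illustrative retry helper, not part of @scoutos/evals.
async function withRetry<T>(fn: () => Promise<T>, maxAttempts = 3): Promise<T> {
  for (let attempt = 1; ; attempt++) {
    try {
      return await fn();
    } catch (error) {
      const message = error instanceof Error ? error.message : String(error);
      // Only retry 5xx-style server errors; rethrow everything else immediately.
      if (attempt >= maxAttempts || !/\b5\d{2}\b/.test(message)) throw error;
      await new Promise((r) => setTimeout(r, 2 ** attempt * 500)); // 1s, 2s, 4s...
    }
  }
}

// Usage: const result = await withRetry(() => client.evaluate({ ... }));
```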
## CI Integration
```ts
import { createClient, checks } from "@scoutos/evals";

async function runEvaluation() {
  const client = createClient();

  const result = await client.evaluate({
    name: "ci-regression-tests",
    metadata: {
      gitCommit: process.env.GIT_COMMIT,
      branch: process.env.GIT_BRANCH,
    },
    tests: [...],
    assert: [...],
  });

  console.log(`Pass rate: ${(result.passRate * 100).toFixed(1)}%`);
  console.log(`Results: ${result.passCount}/${result.totalCount}`);
  console.log(`Details: ${result.runUrl}`);

  if (!result.passed) {
    console.error("\nFailed tests:");
    for (const failure of result.failures) {
      console.error(`  ${failure.test_name}`);
      for (const scorer of failure.scorer_results.filter(s => !s.passed)) {
        console.error(`    - ${scorer.scorer_id}: ${scorer.reason}`);
      }
    }
    process.exit(1);
  }
}

runEvaluation();
```

GitHub Actions:
```yaml
name: AI Evaluation
on: [push, pull_request]

jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: "20"
      - run: npm ci
      - name: Run Evaluations
        env:
          SCOUT_API_KEY: ${{ secrets.SCOUT_API_KEY }}
          GIT_COMMIT: ${{ github.sha }}
          GIT_BRANCH: ${{ github.ref_name }}
        run: npx tsx scripts/evaluate.ts
```

## TypeScript Types
Key interfaces for IDE autocompletion and type safety:
```ts
interface TestCase {
  name?: string;
  input?: Record<string, unknown>;
  output: Record<string, unknown>;
  expected_output?: Record<string, unknown>; // For custom comparison checks
  assert?: Check[];
  latency_ms?: number;
}

interface TestResult {
  test_name: string;
  passed: boolean;
  status: "pass" | "fail" | "error";
  input: Record<string, unknown>;
  output: Record<string, unknown>;
  latency_ms: number;
  scorer_results: ScorerResult[];
  composite_score: number;
}

interface ScorerResult {
  scorer_id: string;
  passed: boolean;
  score: number;    // 0.0 to 1.0
  reason?: string;  // Why it failed
  metadata?: Record<string, unknown>;
}

interface EvaluationMetrics {
  pass_rate: number;
  pass_count: number;
  fail_count: number;
  total_count: number;
  avg_latency_ms: number;
}

// Check is a union of all check types
type Check =
  | FieldExistsCheck
  | ExactMatchCheck
  | ContainsCheck
  | RegexMatchCheck
  | TypeCheckCheck
  | AgentCheck
  | CustomCheck;
```

## Troubleshooting
"API key required" error
Make sure SCOUT_API_KEY is set:
```bash
export SCOUT_API_KEY=sk_...
```

Or pass it directly:

```ts
const client = createClient("sk_...");
```

### Tests pass locally but fail in CI
- Check that SCOUT_API_KEY is set in your CI secrets (see the guard sketch below)
- Ensure the secret name matches exactly (case-sensitive)
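A quick guard at the top of your evaluation script makes a missing secret fail loudly instead of surfacing as a confusing downstream error (sketch):

```ts
if (!process.env.SCOUT_API_KEY) {
  throw new Error("SCOUT_API_KEY is not set. Check your CI secrets configuration.");
}
```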
### Agent checks are slow
Agent checks use AI models, which have latency. Options:
- Use claude-haiku for faster checks: checks.agent({ prompt: "...", model: "claude-haiku" })
- Run fewer agent checks and more field checks
- Run evaluations in parallel if you have multiple test suites (see the sketch below)
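For the parallel option, independent suites can simply be awaited together (sketch; suiteA and suiteB stand in for your own evaluate() configs):

```ts
// suiteA and suiteB are hypothetical evaluation configs.
const [resultA, resultB] = await Promise.all([
  client.evaluate(suiteA),
  client.evaluate(suiteB),
]);
```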
### Custom check isn't running
Custom checks run locally, not on Scout servers. Make sure:
- The function doesn't throw unhandled errors (see the defensive sketch below)
- You're returning boolean or { passed: boolean, ... }
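A defensive pattern that turns exceptions into failed checks instead of crashes (a sketch using the documented detailed-result shape):

```ts
checks.custom("safe-items-check", (input, output) => {
  try {
    return { passed: Array.isArray(output.items) && output.items.length > 0 };
  } catch (error) {
    // Surface the problem as a failed check rather than an unhandled throw.
    return { passed: false, reason: `Check threw: ${String(error)}` };
  }
})
```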
### Rate limiting
If you're running many evaluations in a loop, add delays:
```ts
for (const test of tests) {
  await client.evaluate(test);
  await new Promise(r => setTimeout(r, 100)); // 100ms delay
}
```

## License
MIT
