@pauly4010/evalai-sdk (v1.9.1)

AI Evaluation Platform SDK - Complete API Coverage with Performance Optimizations
One-command CI for AI evaluation. Complete pipeline: discover → manifest → impact → run → diff → PR summary.
Zero to production CI in 60 seconds. No infra. No lock-in. Remove anytime.
Quick Start (60 seconds)
Add this to your `.github/workflows/evalai.yml`:

```yaml
name: EvalAI CI
on: [push, pull_request]
jobs:
  evalai:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npx @pauly4010/evalai-sdk ci --format github --write-results --base main
      - uses: actions/upload-artifact@v4
        if: always()
        with:
          name: evalai-results
          path: .evalai/
```

Create `eval/your-spec.spec.ts`:

```ts
import { defineEval } from "@pauly4010/evalai-sdk";

defineEval({
  name: "Basic Math Operations",
  description: "Test fundamental arithmetic",
  prompt: "Test: 1+1=2, string concatenation, array includes",
  expected: "All tests should pass",
  tags: ["basic", "math"],
  category: "unit-test"
});
```

Commit and push:

```sh
git add .github/workflows/evalai.yml eval/
git commit -m "feat: add EvalAI CI pipeline"
git push
```

That's it! Your CI now:
- ✅ Discovers evaluation specs automatically
- ✅ Runs only impacted specs (smart caching)
- ✅ Compares results against base branch
- ✅ Posts rich summary in PR with regressions
- ✅ Exits with proper codes (0=clean, 1=regressions, 2=config)
🚀 New in v1.9.0: One-Command CI
evalai ci - Complete CI Pipeline
```sh
npx @pauly4010/evalai-sdk ci --format github --write-results --base main
```

What it does:
- Discover - Finds all evaluation specs automatically
- Manifest - Builds stable manifest if missing
- Impact Analysis - Runs only specs impacted by changes (optional)
- Run - Executes evaluations with artifact retention
- Diff - Compares results against base branch
- PR Summary - Posts rich markdown summary to GitHub
- Debug Flow - Prints copy/paste next step on failure
Advanced Options:

```sh
npx @pauly4010/evalai-sdk ci --base main --impacted-only    # Run only impacted specs
npx @pauly4010/evalai-sdk ci --format json --write-results  # JSON output for automation
npx @pauly4010/evalai-sdk ci --base develop                 # Custom base branch
```

Smart Diffing & GitHub Integration

```sh
npx @pauly4010/evalai-sdk diff --base main --head last --format github
```

Features:
- 📊 Pass rate delta and score changes
- 🚨 Regression detection with classifications
- 📈 Improvements and new specs
- 📁 Artifact links and technical details
- 🎯 Exit codes: 0=clean, 1=regressions, 2=config
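For teams wiring the diff step into a custom script rather than the bundled workflow, the exit codes above can be mapped to outcomes explicitly. The sketch below is illustrative glue code, not part of the SDK; the function name and the idea of a wrapper script are assumptions:

```typescript
// Minimal sketch (not SDK code): map `evalai diff` exit codes
// from the table above to CI outcomes in a custom wrapper.
type DiffOutcome = "clean" | "regressions" | "config-error" | "unknown";

function classifyExitCode(code: number | null): DiffOutcome {
  switch (code) {
    case 0: return "clean";        // no regressions
    case 1: return "regressions";  // fail the job
    case 2: return "config-error"; // misconfiguration, not a quality signal
    default: return "unknown";
  }
}
```

In practice you might feed this the `status` from a `child_process.spawnSync("npx", ["evalai", "diff", ...])` call and fail the build only on `"regressions"`.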
Self-Documenting Failures
Every failure prints a clear next step:

```
🔧 Next step for debugging:
Download base artifact and run: evalai diff --base .evalai/base-run.json --head .evalai/last-run.json
Artifacts: .evalai/runs/
```

CLI Commands
🚀 One-Command CI (v1.9.0)
| Command | Description |
|---------|-------------|
| npx evalai ci | Complete CI pipeline: discover → manifest → impact → run → diff → PR summary |
| npx evalai ci --base main | Run CI with diff against main branch |
| npx evalai ci --impacted-only | Run only specs impacted by changes |
| npx evalai ci --format github | GitHub Step Summary with rich markdown |
| npx evalai ci --format json | JSON output for automation |
Discovery & Manifest
| Command | Description |
|---------|-------------|
| npx evalai discover | Find and analyze evaluation specs |
| npx evalai discover --manifest | Generate stable manifest for incremental analysis |
Impact Analysis
| Command | Description |
|---------|-------------|
| npx evalai impact-analysis --base main | Analyze impact of changes |
| npx evalai impact-analysis --changed-files file1.ts,file2.ts | Analyze specific changed files |
Run & Diff
| Command | Description |
|---------|-------------|
| npx evalai run | Run evaluation specifications |
| npx evalai run --write-results | Run with artifact retention |
| npx evalai diff --base main | Compare results against base branch |
| npx evalai diff --base last --head last | Compare last two runs |
| npx evalai diff --format github | GitHub Step Summary with regressions |
Legacy Regression Gate (local, no account needed)
| Command | Description |
|---------|-------------|
| npx evalai init | Full project scaffolder — creates everything you need |
| npx evalai gate | Run regression gate locally |
| npx evalai gate --format json | Machine-readable JSON output |
| npx evalai gate --format github | GitHub Step Summary with delta table |
| npx evalai baseline init | Create starter evals/baseline.json |
| npx evalai baseline update | Re-run tests and update baseline with real scores |
| npx evalai upgrade --full | Upgrade from Tier 1 (built-in) to Tier 2 (full gate) |
API Gate (requires account)
| Command | Description |
|---------|-------------|
| npx evalai check | Gate on quality score from dashboard |
| npx evalai share | Create share link for a run |
Debugging & Diagnostics
| Command | Description |
|---------|-------------|
| npx evalai doctor | Comprehensive preflight checklist — verifies config, baseline, auth, API, CI wiring |
| npx evalai explain | Offline report explainer — top failures, root cause classification, suggested fixes |
| npx evalai print-config | Show resolved config with source-of-truth annotations (file/env/default/arg) |
Migration Tools
| Command | Description |
|---------|-------------|
| npx evalai migrate config --in evalai.config.json --out eval/migrated.spec.ts | Convert legacy config to DSL |
Guided failure flow:

```
evalai ci → fails → "Next: evalai explain --report .evalai/last-run.json"
        ↓
evalai explain → root causes + fixes
```

*GitHub Actions step summary — CI result at a glance with regressions and artifacts.*

*evalai explain terminal output — root causes + fix commands.*

All commands automatically write artifacts, so `explain` works with zero flags.
Gate Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Pass — no regression |
| 1 | Regression detected |
| 2 | Infra error (baseline missing, tests crashed) |
Check Exit Codes (API mode)
| Code | Meaning |
|------|---------|
| 0 | Pass |
| 1 | Score below threshold |
| 2 | Regression failure |
| 3 | Policy violation |
| 4 | API error |
| 5 | Bad arguments |
| 6 | Low test count |
| 7 | Weak evidence |
| 8 | Warn (soft regression) |
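Since code 8 is a soft warning while codes 1–7 are hard failures, a CI script may want to collapse the nine codes into a three-way decision. This is a hedged sketch of that policy, not SDK code; the function name and the pass/warn/fail split are assumptions:

```typescript
// Sketch: collapse `evalai check` exit codes (table above) into a
// three-way CI decision, treating 8 as a non-blocking warning.
function checkOutcome(code: number): "pass" | "warn" | "fail" {
  if (code === 0) return "pass";
  if (code === 8) return "warn"; // soft regression: report, don't block
  return "fail";                 // codes 1-7 all block the merge
}
```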
Doctor Exit Codes
| Code | Meaning |
|------|---------|
| 0 | Ready — all checks passed |
| 2 | Not ready — one or more checks failed |
| 3 | Infrastructure error |
How the Gate Works
Built-in mode (any Node project, no config needed):
- Runs `<pm> test`, captures exit code + test count
- Compares against `evals/baseline.json`
- Writes `evals/regression-report.json`
- Fails CI if tests regress

Project mode (advanced, for full regression gate):
- If an `eval:regression-gate` script exists in `package.json`, delegates to it
- Supports golden eval scores, confidence tests, p95 latency, cost tracking
- Full delta table with tolerances
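To opt into project mode as described above, you define the `eval:regression-gate` script yourself. A minimal `package.json` wiring might look like the following; the specific commands behind the script (the vitest invocation and the `scripts/check-deltas.mjs` helper) are hypothetical placeholders for whatever your project uses:

```json
{
  "scripts": {
    "test": "vitest run",
    "eval:regression-gate": "vitest run eval/ && node scripts/check-deltas.mjs"
  }
}
```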
Run a Regression Test Locally (no account)
```sh
npm install @pauly4010/evalai-sdk openai
```

```ts
import { openAIChatEval } from "@pauly4010/evalai-sdk";

await openAIChatEval({
  name: "chat-regression",
  cases: [
    { input: "Hello", expectedOutput: "greeting" },
    { input: "2 + 2 = ?", expectedOutput: "4" },
  ],
});
```

Output: `PASS 2/2 (score: 100)`. No account needed. Just a score.
Vitest Integration
```ts
import { openAIChatEval, extendExpectWithToPassGate } from "@pauly4010/evalai-sdk";
import { expect, it } from "vitest";

extendExpectWithToPassGate(expect);

it("passes gate", async () => {
  const result = await openAIChatEval({
    name: "chat-regression",
    cases: [
      { input: "Hello", expectedOutput: "greeting" },
      { input: "2 + 2 = ?", expectedOutput: "4" },
    ],
  });
  expect(result).toPassGate();
});
```

SDK Exports
Regression Gate Constants
```ts
import {
  GATE_EXIT, // { PASS: 0, REGRESSION: 1, INFRA_ERROR: 2, ... }
  GATE_CATEGORY, // { PASS: "pass", REGRESSION: "regression", INFRA_ERROR: "infra_error" }
  REPORT_SCHEMA_VERSION,
  ARTIFACTS, // { BASELINE, REGRESSION_REPORT, CONFIDENCE_SUMMARY, LATENCY_BENCHMARK }
} from "@pauly4010/evalai-sdk";

// Or tree-shakeable:
import { GATE_EXIT } from "@pauly4010/evalai-sdk/regression";
```

Types
```ts
import type {
  RegressionReport,
  RegressionDelta,
  Baseline,
  BaselineTolerance,
  GateExitCode,
  GateCategory,
} from "@pauly4010/evalai-sdk/regression";
```

Platform Client
```ts
import { AIEvalClient } from "@pauly4010/evalai-sdk";

const client = AIEvalClient.init(); // from EVALAI_API_KEY env
// or:
// const client = new AIEvalClient({ apiKey: "...", organizationId: 123 });
```

Framework Integrations

```ts
import { traceOpenAI } from "@pauly4010/evalai-sdk/integrations/openai";
import { traceAnthropic } from "@pauly4010/evalai-sdk/integrations/anthropic";
```

Installation
```sh
npm install @pauly4010/evalai-sdk
# or
yarn add @pauly4010/evalai-sdk
# or
pnpm add @pauly4010/evalai-sdk
```

Add `openai` as a peer dependency if using `openAIChatEval`:

```sh
npm install openai
```

Environment Support
| Feature | Node.js | Browser |
|---------|---------|---------|
| Platform APIs (Traces, Evaluations, LLM Judge) | ✅ | ✅ |
| Assertions, Test Suites, Error Handling | ✅ | ✅ |
| CJS/ESM | ✅ | ✅ |
| CLI, Snapshots, File Export | ✅ | — |
| Context Propagation | ✅ Full | ⚠️ Basic |
No Lock-in
```sh
rm evalai.config.json
```

Your local `openAIChatEval` runs continue to work. No account cancellation. No data export required.
Changelog
See CHANGELOG.md for the full release history.
- v1.8.0 — evalai doctor rewrite (9-check checklist), evalai explain command, guided failure flow, CI template with doctor preflight
- v1.7.0 — evalai init scaffolder, evalai upgrade --full, detectRunner(), machine-readable gate output, init test matrix
- v1.6.0 — evalai gate, evalai baseline, regression gate constants & types
- v1.5.8 — secureRoute fix, test infra fixes, 304 handling fix
- v1.5.5 — PASS/WARN/FAIL semantics, flake intelligence, golden regression suite
- v1.5.0 — GitHub annotations, --onFail import, evalai doctor
License
MIT
Support
- Docs: https://v0-ai-evaluation-platform-nu.vercel.app/documentation
- Issues: https://github.com/pauly7610/ai-evaluation-platform/issues
