frontevals
A minimal, vitest-native evals library for LLM applications.
Features
- Vitest-native: Works seamlessly inside your existing vitest test suite
- Simple API: Just evalSuite() or evalTest() - no custom CLI needed
- Built-in metrics: exact(), contains(), startsWith(), endsWith(), jsonValid(), hasKeys()
- LLM-as-judge: G-Eval style custom criteria evaluation via OpenAI
- Autoevals integration: Re-exports popular scorers like Factuality, Levenshtein, etc.
- Multi-run support: Run cases multiple times with pass rate thresholds for non-deterministic tasks
- Pretty output: Console tables showing scores and pass/fail status
Installation
npm install @frontsail_ai/frontevals

Quick Start
Basic Usage
import { describe, it, expect } from 'vitest';
import { evalSuite } from '@frontsail_ai/frontevals';
import { exact, contains } from '@frontsail_ai/frontevals/metrics';
describe('Greeting Evals', () => {
  it('passes all cases', async () => {
    const result = await evalSuite({
      name: 'greet function',
      data: [
        { input: 'Alice', expected: 'Hello, Alice!' },
        { input: 'Bob', expected: 'Hello, Bob!' },
      ],
      task: (name) => `Hello, ${name}!`,
      scorers: [exact(), contains('Hello')],
    });
    expect(result.summary.passRate).toBe(1);
  });
});

Even Simpler: evalTest Helper
import { evalTest } from '@frontsail_ai/frontevals/vitest';
import { exact } from '@frontsail_ai/frontevals/metrics';
describe('String Utils', () => {
  evalTest('uppercase works', {
    data: [
      { input: 'hello', expected: 'HELLO' },
      { input: 'world', expected: 'WORLD' },
    ],
    task: (s) => s.toUpperCase(),
    scorers: [exact()],
  });
});

API Reference
evalSuite(config)
Run an evaluation suite and return structured results.
interface EvalSuiteConfig<TInput, TOutput> {
  name: string;                    // Suite name for display
  data: EvalData<TInput>[];        // Test cases (or async function returning them)
  task: (input: TInput) => TOutput | Promise<TOutput>;  // Function to evaluate
  scorers: Scorer<TOutput>[];      // Scoring functions
  runs?: number;                   // Times to run each case (default: 1)
  threshold?: number;              // Pass threshold 0-1 (default: 1.0)
}

interface EvalData<TInput> {
  input: TInput;
  expected?: unknown;
  name?: string;                   // Custom display name
}
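
As the comment on data notes, cases can also come from an async function, and each case can carry its own display name. A minimal sketch of both (summarize() is a hypothetical task function, not part of this library):

import { evalSuite } from '@frontsail_ai/frontevals';
import { contains } from '@frontsail_ai/frontevals/metrics';

// Sketch: async case loader plus per-case display names.
// summarize() is a placeholder for your own task function.
const result = await evalSuite({
  name: 'summarizer',
  data: async () => [
    { name: 'one-liner', input: 'TypeScript adds static types to JavaScript.' },
  ],
  task: async (text) => await summarize(text),
  scorers: [contains('TypeScript')],
});
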
evalTest(name, config, threshold?)
Vitest helper that creates a test case automatically.
evalTest('test name', {
  data: [...],
  task: (input) => output,
  scorers: [...],
}, 0.9); // Optional threshold (default: 1.0)
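
A lowered threshold lets a noisy task pass when most, but not all, cases succeed. A sketch (classify() is a hypothetical, non-deterministic task; everything else mirrors the config shown above):

import { describe } from 'vitest';
import { evalTest } from '@frontsail_ai/frontevals/vitest';
import { exact } from '@frontsail_ai/frontevals/metrics';

// Sketch: the test passes if at least 90% of cases score 1.
// classify() stands in for a real task function.
describe('Ticket Routing Evals', () => {
  evalTest('labels support tickets', {
    data: [
      { input: 'My order never arrived', expected: 'shipping' },
      { input: 'I was charged twice', expected: 'billing' },
    ],
    task: async (text) => await classify(text),
    scorers: [exact()],
  }, 0.9);
});
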
Built-in Metrics
Import from @frontsail_ai/frontevals/metrics:
| Metric | Description |
|--------|-------------|
| exact() | Exact string match |
| contains(str) | Output contains substring |
| startsWith(str) | Output starts with prefix |
| endsWith(str) | Output ends with suffix |
| jsonValid() | Output is valid JSON |
| hasKeys(keys) | Output object has required keys |
Example
import { exact, contains, jsonValid, hasKeys } from '@frontsail_ai/frontevals/metrics';
const result = await evalSuite({
  name: 'API Response',
  data: [{ input: 'query', expected: '{"status":"ok"}' }],
  task: async (q) => await api.call(q),
  scorers: [
    jsonValid(),
    hasKeys(['status']),
    contains('ok'),
  ],
});

G-Eval (LLM-as-Judge)
Use custom criteria evaluated by an LLM:
import { gEval } from '@frontsail_ai/frontevals/metrics';
const result = await evalSuite({
  name: 'Support Bot',
  data: [
    { input: 'My order is late', expected: 'Apologize and offer help' },
  ],
  task: async (q) => await bot.respond(q),
  scorers: [
    gEval({ criteria: 'Is the response empathetic and helpful?' }),
    gEval({
      name: 'actionable',
      criteria: 'Does the response provide a clear next step?',
      threshold: 0.8,
    }),
  ],
});

gEval Options
gEval({
  name?: string;       // Scorer name (default: 'gEval')
  criteria: string;    // Plain language criteria
  steps?: string[];    // Optional evaluation steps
  threshold?: number;  // Pass threshold (default: 0.7)
  model?: string;      // OpenAI model (default: 'gpt-4o-mini')
})

Note: Requires the OPENAI_API_KEY environment variable.
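
Criteria can also be broken into explicit steps for the judge to follow. A sketch using only the options listed above (assumes OPENAI_API_KEY is set):

import { gEval } from '@frontsail_ai/frontevals/metrics';

// Sketch: walk the judge through explicit checks before it scores.
const professionalTone = gEval({
  name: 'tone',
  criteria: 'Is the response professional and on-topic?',
  steps: [
    'Check for informal or dismissive phrasing',
    'Check that the response addresses the question directly',
    'Score high only if both checks pass',
  ],
  threshold: 0.8,
  model: 'gpt-4o-mini',
});
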
Autoevals Integration
Re-exported scorers from the autoevals library:
import { Factuality, Levenshtein } from '@frontsail_ai/frontevals/autoevals';
const result = await evalSuite({
  name: 'Chatbot Quality',
  data: [
    { input: 'What is TypeScript?', expected: 'A typed superset of JavaScript' },
  ],
  task: async (q) => await chatbot.answer(q),
  scorers: [Factuality, Levenshtein],
});

Available: Factuality, Levenshtein, ClosedQA, Battle, Humor, Security, Summary, Translation
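
Because the built-in metrics follow the same scorer signature, autoevals scorers and built-ins can share one scorers array. A sketch (translate() is a hypothetical task, not part of this library):

import { evalSuite } from '@frontsail_ai/frontevals';
import { Levenshtein } from '@frontsail_ai/frontevals/autoevals';
import { contains } from '@frontsail_ai/frontevals/metrics';

// Sketch: mix a string-distance scorer from autoevals with a built-in metric.
// translate() is a placeholder for your own task function.
const result = await evalSuite({
  name: 'Translation Quality',
  data: [{ input: 'Good morning', expected: 'Buenos días' }],
  task: async (text) => await translate(text, 'es'),
  scorers: [Levenshtein, contains('Buenos')],
});
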
Multiple Runs (Non-Deterministic Tasks)
For LLM tasks with variable outputs, run multiple times and set a pass threshold:
const result = await evalSuite({
  name: 'Chatbot Consistency',
  data: [
    { input: 'Explain recursion', expected: 'A clear explanation' },
  ],
  task: async (q) => await chatbot.answer(q),
  scorers: [gEval({ criteria: 'Is the explanation clear?' })],
  runs: 5,         // Run each case 5 times
  threshold: 0.8,  // 80% of runs must pass
});
console.log(result.results[0].passRate); // e.g., 0.8
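
The returned summary (see Result Types below) can back assertions as well, for example when the suite runs inside a vitest test:

// Sketch: assumes this runs inside a vitest test, so expect is available.
// The fields used here are the ones documented under Result Types.
expect(result.summary.passRate).toBeGreaterThanOrEqual(0.8);
for (const [scorer, stats] of Object.entries(result.summary.byScorer)) {
  console.log(`${scorer}: avg ${stats.avgScore.toFixed(2)}, pass rate ${stats.passRate}`);
}
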
Console Output
Single Run Mode
┌─────────────────────────────────────────────────────────────────┐
│ greet function │
├──────────┬────────────────┬───────┬──────────┬─────────┬────────┤
│ Case │ Output │ exact │ contains │ Score │ Status │
├──────────┼────────────────┼───────┼──────────┼─────────┼────────┤
│ Alice │ Hello, Alice! │ 1.00 │ 1.00 │ 1.00 │ ✓ PASS │
│ Bob │ Hello, Bob! │ 1.00 │ 1.00 │ 1.00 │ ✓ PASS │
├──────────┴────────────────┴───────┴──────────┴─────────┴────────┤
│ Summary: 2/2 passed (100%) │
└─────────────────────────────────────────────────────────────────┘

Multiple Runs Mode
┌───────────────────────────────────────────────────────────────────────┐
│ Chatbot Consistency (5 runs, 80% threshold) │
├────────────────────┬─────────┬───────────┬──────────┬─────────────────┤
│ Case │ gEval │ Runs │ PassRate │ Status │
├────────────────────┼─────────┼───────────┼──────────┼─────────────────┤
│ Explain recursion │ 0.85 │ 4/5 │ 80% │ ✓ PASS │
│ What is a closure │ 0.60 │ 3/5 │ 60% │ ✗ FAIL (< 80%) │
├────────────────────┴─────────┴───────────┴──────────┴─────────────────┤
│ Summary: 1/2 cases passed | 7/10 total trials passed (70%) │
└───────────────────────────────────────────────────────────────────────┘

Result Types
interface SuiteResult {
  name: string;
  results: EvalResult[];
  summary: {
    total: number;
    passed: number;
    failed: number;
    passRate: number;
    totalTrials: number;
    byScorer: Record<string, { avgScore: number; passRate: number }>;
  };
  runs: number;
  threshold: number;
}

interface EvalResult {
  name: string;
  input: unknown;
  expected?: unknown;
  trials: TrialResult[];
  passRate: number;
  pass: boolean;
}

interface TrialResult {
  trialIndex: number;
  output: unknown;
  scores: Array<{ name: string; score: number; pass: boolean }>;
  pass: boolean;
}
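
These shapes make failure reporting straightforward. A sketch that prints each failing trial and the scorers responsible, given a SuiteResult named result:

// Sketch: walk the documented result shape to find failing trials.
for (const caseResult of result.results) {
  if (caseResult.pass) continue;
  for (const trial of caseResult.trials) {
    const failing = trial.scores.filter((s) => !s.pass);
    if (failing.length > 0) {
      console.log(`${caseResult.name} (trial ${trial.trialIndex}) failed:`, failing);
    }
  }
}
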
Custom Scorers
Create custom scorers matching the autoevals signature:
import type { Scorer } from '@frontsail_ai/frontevals';
const lengthScorer: Scorer<string> = ({ output, expected }) => {
  const diff = Math.abs(output.length - String(expected).length);
  const score = Math.max(0, 1 - diff / 100);
  return { score, name: 'length' };
};

// Async scorers are supported
const apiScorer: Scorer<string> = async ({ output }) => {
  const result = await externalApi.evaluate(output);
  return { score: result.score, name: 'api', metadata: result };
};

Development
Setup
git clone <repo-url>
cd frontevals
npm install

Scripts
npm test # Run tests
npm run test:watch # Run tests in watch mode
npm run build # Build TypeScript

Running Tests with Coverage
npx vitest run --coverage

Project Structure
frontevals/
├── src/
│   ├── index.ts         # Main exports
│   ├── types.ts         # TypeScript interfaces
│   ├── eval-suite.ts    # Core evalSuite function
│   ├── reporter.ts      # Console table output
│   ├── vitest.ts        # evalTest helper
│   ├── autoevals.ts     # Re-exports from autoevals
│   └── metrics/
│       ├── index.ts     # Built-in metrics
│       └── geval.ts     # G-Eval LLM-as-judge
├── tests/
│   ├── eval-suite.test.ts
│   ├── metrics.test.ts
│   ├── geval.test.ts
│   ├── reporter.test.ts
│   ├── vitest-helper.test.ts
│   ├── exports.test.ts
│   └── output.test.ts
├── package.json
├── tsconfig.json
└── vitest.config.ts

Adding New Metrics
- Create a scorer function in src/metrics/index.ts or a new file
- Export it from src/metrics/index.ts
- Add tests in tests/metrics.test.ts
Example:
// src/metrics/index.ts
export function regex(pattern: RegExp): Scorer<unknown> {
  return ({ output }): ScoreResult => {
    const score = pattern.test(String(output)) ? 1 : 0;
    return { score, name: 'regex' };
  };
}
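
A matching test might look like the following sketch, which assumes a scorer can be invoked directly with an { output } argument, as in the custom scorer examples above:

// tests/metrics.test.ts (sketch)
import { describe, it, expect } from 'vitest';
import { regex } from '../src/metrics';

describe('regex metric', () => {
  it('scores 1 when the pattern matches', () => {
    expect(regex(/^Hello/)({ output: 'Hello, world' })).toEqual({ score: 1, name: 'regex' });
  });

  it('scores 0 when it does not', () => {
    expect(regex(/^Hello/)({ output: 'Goodbye' })).toEqual({ score: 0, name: 'regex' });
  });
});
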
Dependencies
- autoevals: LLM evaluation metrics
- console-table-printer: Pretty console tables
- openai: Required for gEval
Peer Dependencies
- vitest: >=1.0.0
License
MIT
