@lov3kaizen/agentsea-evaluate

Comprehensive feedback collection and LLM evaluation platform for Node.js. Build production-ready evaluation pipelines with human-in-the-loop annotation, automated metrics, LLM-as-Judge, and preference dataset generation.

Features

  • Evaluation Metrics - Built-in metrics for accuracy, relevance, coherence, toxicity, faithfulness, and more
  • LLM-as-Judge - Use LLMs to evaluate responses with rubric-based and comparative scoring
  • Human Feedback - Collect ratings, rankings, and corrections from annotators
  • Dataset Management - Create, import, and manage evaluation datasets with HuggingFace integration
  • Continuous Evaluation - Monitor production quality with automated evaluation pipelines
  • Preference Learning - Generate datasets for RLHF, DPO, and preference optimization

Installation

pnpm add @lov3kaizen/agentsea-evaluate
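
Or, since the package is published to the npm registry, with npm:

npm install @lov3kaizen/agentsea-evaluate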

Quick Start

import {
  EvaluationPipeline,
  AccuracyMetric,
  RelevanceMetric,
  LLMJudge,
  EvalDataset,
} from '@lov3kaizen/agentsea-evaluate';

// Create metrics
const accuracy = new AccuracyMetric({ type: 'fuzzy' });
const relevance = new RelevanceMetric();

// Create evaluation pipeline
const pipeline = new EvaluationPipeline({
  metrics: [accuracy, relevance],
  parallelism: 5,
});

// Create dataset
const dataset = new EvalDataset({
  items: [
    {
      id: '1',
      input: 'What is the capital of France?',
      expectedOutput: 'Paris',
    },
    {
      id: '2',
      input: 'What is 2 + 2?',
      expectedOutput: '4',
    },
  ],
});

// Run evaluation
const results = await pipeline.evaluate({
  dataset,
  generateFn: async (input) => {
    // Your LLM generation function
    return await myAgent.run(input);
  },
});

console.log(results.summary);
// { passRate: 0.95, avgScore: 0.87, ... }

Metrics

Built-in Metrics

| Metric                 | Description                                              |
| ---------------------- | -------------------------------------------------------- |
| AccuracyMetric         | Exact, fuzzy, or semantic match against expected output  |
| RelevanceMetric        | How relevant the response is to the input                |
| CoherenceMetric        | Logical flow and consistency of the response             |
| ToxicityMetric         | Detection of harmful or inappropriate content            |
| FaithfulnessMetric     | Factual accuracy relative to provided context (RAG)      |
| ContextRelevanceMetric | Relevance of retrieved context (RAG)                     |
| FluencyMetric          | Grammar, spelling, and readability                       |
| ConcisenessMetric      | Brevity without losing important information             |
| HelpfulnessMetric      | How helpful the response is to the user                  |
| SafetyMetric           | Detection of unsafe or harmful outputs                   |
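
Built-in metrics can also be run outside a pipeline. A minimal sketch, assuming they expose the same evaluate(input) signature as the BaseMetric interface shown under Custom Metrics below:

import { FaithfulnessMetric } from '@lov3kaizen/agentsea-evaluate';

// Assumption: FaithfulnessMetric is exported from the package root and
// accepts the retrieved context via the `context` field, mirroring the
// EvalDatasetItem shape.
const faithfulness = new FaithfulnessMetric();

const result = await faithfulness.evaluate({
  input: 'What does the context say about the boiling point of water?',
  output: 'The context states it boils at 100°C at sea level.',
  context: ['Water boils at 100°C at standard atmospheric pressure.'],
});

console.log(result.score, result.explanation);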

Custom Metrics

import {
  BaseMetric,
  MetricResult,
  EvaluationInput,
} from '@lov3kaizen/agentsea-evaluate';

class CustomMetric extends BaseMetric {
  readonly type = 'custom';
  readonly name = 'my-metric';

  async evaluate(input: EvaluationInput): Promise<MetricResult> {
    // Your evaluation logic
    const score = calculateScore(input.output, input.expectedOutput);

    return {
      metric: this.name,
      score,
      explanation: `Score: ${score}`,
    };
  }
}
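
A custom metric can then be mixed with the built-in ones; this sketch assumes the same pipeline config shape as in Quick Start:

import { EvaluationPipeline, AccuracyMetric } from '@lov3kaizen/agentsea-evaluate';

// CustomMetric implements the same interface as the built-in metrics,
// so it slots into the metrics array unchanged.
const pipeline = new EvaluationPipeline({
  metrics: [new AccuracyMetric({ type: 'exact' }), new CustomMetric()],
});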

LLM-as-Judge

Rubric-Based Evaluation

import { RubricJudge } from '@lov3kaizen/agentsea-evaluate';

const judge = new RubricJudge({
  provider: anthropicProvider,
  rubric: {
    criteria: 'Response Quality',
    levels: [
      { score: 1, description: 'Poor - Incorrect or irrelevant' },
      { score: 2, description: 'Fair - Partially correct' },
      { score: 3, description: 'Good - Correct but incomplete' },
      { score: 4, description: 'Very Good - Correct and complete' },
      {
        score: 5,
        description: 'Excellent - Correct, complete, and well-explained',
      },
    ],
  },
});

const result = await judge.evaluate({
  input: 'Explain quantum entanglement',
  output: response,
});

Comparative Evaluation

import { ComparativeJudge } from '@lov3kaizen/agentsea-evaluate';

const judge = new ComparativeJudge({
  provider: openaiProvider,
  criteria: ['accuracy', 'helpfulness', 'clarity'],
});

const result = await judge.compare({
  input: 'Summarize this article',
  responseA: modelAOutput,
  responseB: modelBOutput,
});
// { winner: 'A', reasoning: '...', criteriaScores: {...} }

Human Feedback

Rating Collector

import { RatingCollector } from '@lov3kaizen/agentsea-evaluate/feedback';

const collector = new RatingCollector({
  scale: 5,
  criteria: ['accuracy', 'helpfulness', 'clarity'],
});

// Collect feedback
await collector.collect({
  itemId: 'response-123',
  input: 'What is ML?',
  output: 'Machine Learning is...',
  annotatorId: 'user-1',
  ratings: {
    accuracy: 4,
    helpfulness: 5,
    clarity: 4,
  },
  comment: 'Good explanation',
});

// Get aggregated scores
const stats = collector.getStatistics('response-123');

Preference Collection

import { PreferenceCollector } from '@lov3kaizen/agentsea-evaluate/feedback';

const collector = new PreferenceCollector();

// Collect A/B preferences
await collector.collect({
  input: 'Explain recursion',
  responseA: '...',
  responseB: '...',
  preference: 'A',
  annotatorId: 'user-1',
  reason: 'More concise explanation',
});

// Export for RLHF/DPO training
const dataset = collector.exportForDPO();
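
The export format isn't documented here; a hedged sketch of persisting it for a training run, assuming exportForDPO() returns a JSON-serializable structure:

import { writeFileSync } from 'node:fs';

// Assumption: the exported dataset is plain data (preference pairs) that
// can be serialized directly.
const dpoDataset = collector.exportForDPO();
writeFileSync('dpo-dataset.json', JSON.stringify(dpoDataset, null, 2));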

Datasets

Create Dataset

import { EvalDataset } from '@lov3kaizen/agentsea-evaluate/datasets';

const dataset = new EvalDataset({
  name: 'qa-benchmark',
  items: [
    {
      id: '1',
      input: 'Question 1',
      expectedOutput: 'Answer 1',
      context: ['Relevant context...'],
      tags: ['factual', 'science'],
    },
  ],
});

// Filter and sample
const subset = dataset
  .filter((item) => item.tags?.includes('science'))
  .sample(100);

// Split for train/test
const [train, test] = dataset.split(0.8);

HuggingFace Integration

import { loadHuggingFaceDataset } from '@lov3kaizen/agentsea-evaluate/datasets';

const dataset = await loadHuggingFaceDataset('squad', {
  split: 'validation',
  inputField: 'question',
  outputField: 'answers.text[0]',
  contextField: 'context',
  limit: 1000,
});

Continuous Evaluation

Production Monitoring

import { ContinuousEvaluator } from '@lov3kaizen/agentsea-evaluate/continuous';

const evaluator = new ContinuousEvaluator({
  metrics: [accuracy, relevance, toxicity],
  sampleRate: 0.1, // Evaluate 10% of requests
  alertThresholds: {
    accuracy: 0.8,
    toxicity: 0.1,
  },
});

// In your production code
evaluator.on('alert', (alert) => {
  console.error(`Quality alert: ${alert.metric} below threshold`);
  notifyOncall(alert);
});

// Log production interactions
await evaluator.log({
  input: userQuery,
  output: agentResponse,
  expectedOutput: groundTruth, // Optional
});

API Reference

EvaluationPipeline

interface EvaluationPipelineConfig {
  metrics: MetricInterface[];
  llmJudge?: JudgeInterface;
  parallelism?: number;
  timeout?: number;
  retries?: number;
}

// Methods
pipeline.evaluate(options: PipelineEvaluationOptions): Promise<PipelineEvaluationResult>

EvalDataset

interface EvalDatasetItem {
  id: string;
  input: string;
  expectedOutput?: string;
  context?: string[];
  reference?: string;
  metadata?: Record<string, unknown>;
  tags?: string[];
}

// Methods
dataset.getItems(): EvalDatasetItem[]
dataset.filter(predicate): EvalDataset
dataset.sample(count): EvalDataset
dataset.split(ratio): [EvalDataset, EvalDataset]

PipelineEvaluationResult

interface PipelineEvaluationResult {
  results: SingleEvaluationResult[];
  metrics: MetricsSummary;
  failures: FailureAnalysis[];
  summary: EvaluationSummary;
  exportJSON(): string;
  exportCSV(): string;
}
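
A short sketch of persisting a run's output, using only the members shown above (the file paths and console handling are illustrative):

import { writeFileSync } from 'node:fs';

// `results` is the PipelineEvaluationResult returned by pipeline.evaluate().
writeFileSync('eval-results.json', results.exportJSON());
writeFileSync('eval-results.csv', results.exportCSV());

// Inspect failing items separately.
for (const failure of results.failures) {
  console.warn('Failed item:', failure);
}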

Links