eval-genius

eval-genius enables evals of arbitrary async code. It is generally intended for making multiple assertions on outputs which are generated nondeterministically. These assertions can be used to score algorithms on their effectiveness.

eval-genius is based heavily on evalite, with some key differences:

  • eval-genius is designed to export data for analysis, whereas evalite handles the analysis internally. This gives you more flexibility in choosing evaluation algorithms.
  • eval-genius uses Vitest built-ins for its CLI and for observing test results, which makes configuration more standardized.

Installation

yarn add -D eval-genius vitest
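
Or, with npm:

npm install -D eval-genius vitest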

Setup

Override the default Vitest config so that Vitest picks up your evals from *.eval.ts files. If you already have Vitest set up, you may want to use the --config flag to keep a distinct configuration for evals, separate from your existing tests (see the example after the config below).

// vitest.config.ts
import { defineConfig } from "vitest/config";

export default defineConfig({
  test: {
    include: ["./**/*.eval.ts"],
  },
});
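
For example, assuming you keep the eval configuration in its own file (the file name here is just an illustration), you can point Vitest at it when running evals:

npx vitest run --config vitest.eval.config.ts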

Writing evals

// my-test.eval.ts
import { genius } from "eval-genius";
import * as vitest from "vitest";
import { describe } from "vitest";

describe("my-test", () =>
  genius({
    vitest,
    /**
     * Runs tests concurrently according to the Vitest
     * maxConcurrency setting. Swaps expect.soft() for
     * expect(), because expect.soft() does not work with
     * concurrent tests in Vitest. Defaults to false.
     */
    concurrent: true,
    metadata: {
      /**
       * The name of the functionality under evaluation.
       */
      name: "my-test",

      /**
       * The name of the variation being tested. For example, if you
       * are testing two prompts, you can run the suite with
       * different labels for each prompt.
       */
      label: "my-experiment",
    },

    /**
     * The data to be processed and evaluated. `input` and `expected`
     * can be any type, and can diverge from each other.
     */
    data: {
      values: async () => [
        { 
          name: "basic test", 
          input: "hello world!", 
          expected: "HELLO WORLD!" 
        },
      ],
    },

    /**
     * The work done for every entry in data.values
     */
    task: {
      /**
       * The behavior being evaluated.
       */
      execute: async (input) => input.toUpperCase(),

      /**
       * Makes assertions to be shown in the Vitest output. Not used 
       * by the exporters. Use the expect() function provided here;
       * do not use expect() from Vitest directly.
       */
      test: async (expect, { rendered, expected, output }) => {
        /**
         * Use the rendered values to represent the values sent to 
         * the exporter
         */
        expect
          .soft(
            rendered.capitalizesCorrectly, 
            "capitalizes correctly"
          )
          .toBe(true);

        /**
         * For more complex comparisons, error messages are clearer 
         * if the expect() call makes the comparison directly
         */
        expect.soft(output).toBe(expected);
      },
      /**
       * Renders output to be sent to the exporters
       */
      renderer: {
        /**
         * The properties which the exporter should consume from 
         * the return values of the render function.
         */
        fields: ["capitalizesCorrectly"],

        /**
         * The data that the exporters should consume.
         */
        render: async ({ output, expected }) => ({
          capitalizesCorrectly: output === expected,
        }),
      },
    },

    /**
     * Destinations to send the rendered data.
     */
    exporters: [],
  }));

If you want to compare multiple implementations in an experiment, you can do something like this:

[
  { label: "control", execute: controlImplementation },
  { label: "test", execute: testImplementation },
].forEach(({ label, execute }) =>
  describe(`my-test [${label}]`, () =>
    genius({
      metadata: { name: "my-test", label },
      task: {
        execute,
        // ...task
      },
      // ...config
    })),
);
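
Since each run carries its own label in its metadata, you can then compare the implementations against each other in your analysis, for example with a pivot table (see "Why Google Sheets?" below).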

Exporting data (optional)

GoogleSheetsExporter

Set up Google Service Account credentials

See the google-sheets documentation for how to create your keys. Create a .env file with:

GOOGLE_SERVICE_ACCOUNT_EMAIL=your-service-account-email
GOOGLE_PRIVATE_KEY=your-private-key

# this is the email of the account you want the documents to be saved in
MY_GOOGLE_ACCOUNT_EMAIL=your-google-account-email

Initialize exporters

// The init argument for GoogleSheetsExporter: create a new
// spreadsheet, or reuse an existing one.
type NewDocumentInit = { title: string; folderId?: string };
type ExistingDocumentInit = { spreadsheetId: string };
type InitArg = NewDocumentInit | ExistingDocumentInit;

// vitest.config.ts
import { defineConfig } from "vitest/config";
import { GoogleSheetsExporter } from "eval-genius/GoogleSheetsExporter";
import dotenv from "dotenv";

dotenv.config();

const googleSheetsExporter = GoogleSheetsExporter();

const now = new Date();
await googleSheetsExporter.init({
  title: `Evals [${now.toLocaleDateString()} ${now.toLocaleTimeString()}]`,
});

export default defineConfig({
  test: {
    include: ["./**/*.eval.ts"],
  },
});
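
If you would rather append to an existing spreadsheet than create a new one on every run, init also accepts a spreadsheetId (per the InitArg type above). The environment variable name here is just an illustration:

// Reuse an existing spreadsheet instead of creating a new one
await googleSheetsExporter.init({
  spreadsheetId: process.env.EVALS_SPREADSHEET_ID!,
});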

Use the exporter

// my-test.eval.ts
import { genius } from "eval-genius";
import { GoogleSheetsExporter } from "./src/GoogleSheetsExporter";
import * as vitest from "vitest";

genius({
  // ...config
  exporters: [GoogleSheetsExporter],
});
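
Behind the scenes, the exporter's start() receives the field names declared by your renderer, report() queues one rendered result at a time, and flush() sends the queued rows to the destination; see the type definitions under "Custom exporters" below.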

What is generated?

You will get a table of the output generated by the renderer, with a runId supplied so that separate runs can be distinguished.

[Image: spreadsheet of the exported data]

Why Google Sheets?

Google Sheets is a straightforward way of running aggregate analysis on data. In particular, Pivot Tables make it very easy to compare outputs of different runs. The below example indicates a regression when changing from the control to the experiment.

[Image: pivot table of the exported data]

Custom exporters

Custom exporters can send data to any destination. They must comply with the following type definitions:

type MaybePromise<T> = T | Promise<T>;
type RenderedValue = boolean | number | string | null;
type Rendered<T extends string> = Record<T, RenderedValue>;

export type Reporter<FieldNames extends string> = {
  /**
   * Queues data to be sent to the destination.
   */
  report: (arg: { result: Rendered<FieldNames> }) => MaybePromise<unknown>;

  /**
   * Sends data to the destination.
   */
  flush: () => MaybePromise<unknown>;
};

export type Exporter<InitArgs extends any, InitReturn extends any> = () => {
  /**
   * Any initialization logic for the reporter.
   */
  init: (arg: InitArgs) => InitReturn;

  /**
   * Creates the reporter.
   */
  start: <FieldNames extends string>(arg: {
    title: string;
    fields: Array<FieldNames>;
  }) => MaybePromise<Reporter<FieldNames>>;
};
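
As a sketch, here is a minimal exporter that satisfies the types above and simply prints the rendered rows to the console at the end of a run. The name and structure are illustrative, not part of the library:

// console-exporter.ts (illustrative)
type RenderedValue = boolean | number | string | null;
type Rendered<T extends string> = Record<T, RenderedValue>;

export const ConsoleExporter = () => ({
  // Nothing to set up for this destination.
  init: () => undefined,

  // Buffers rendered rows and prints them when the run is flushed.
  start: async <FieldNames extends string>(arg: {
    title: string;
    fields: Array<FieldNames>;
  }) => {
    const rows: Array<Rendered<FieldNames>> = [];
    return {
      report: ({ result }: { result: Rendered<FieldNames> }) =>
        rows.push(result),
      flush: () => console.table(rows, arg.fields),
    };
  },
});

You could then pass ConsoleExporter in the exporters array, the same way GoogleSheetsExporter is passed above.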

Tips

  • Make sure to cache the outputs of the algorithms under evaluation! Generating them can be slow and expensive, so cache wherever you can.
  • In general, numeric output is the easiest to aggregate, so prefer numbers where possible in the renderer output. For example, booleans are easier to work with when represented as 0 or 1, as in the sketch below.
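
A minimal sketch of that second tip, reusing the renderer from the example above (field name unchanged):

// Emit 1/0 instead of true/false so the column can be summed or
// averaged directly in a spreadsheet or pivot table.
renderer: {
  fields: ["capitalizesCorrectly"],
  render: async ({ output, expected }) => ({
    capitalizesCorrectly: output === expected ? 1 : 0,
  }),
},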