🔡🤞 Semantic Expect

LLM-based test assertions for Vitest and Jest

test('Joke writer', async () => {
  await expect(writeJoke).toGenerate('Something funny');
});
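
The writeJoke function above stands in for whatever asynchronous generator you want to assert on. A minimal sketch, assuming the OpenAI Node client (the function name, prompt, and model choice are illustrative, not part of this library):

// Hypothetical generator under test: any async function that returns a string works
import { OpenAI } from 'openai';

const client = new OpenAI();

async function writeJoke(topic?: string): Promise<string> {
  const completion = await client.chat.completions.create({
    model: 'gpt-4', // illustrative model choice
    messages: [
      { role: 'user', content: topic ? `Write a short joke ${topic}.` : 'Write a short joke.' },
    ],
  });
  return completion.choices[0].message.content ?? '';
}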

This library is in early development and is seeking contributors!

Philosophy

Developing applications backed by generative artificial intelligence (such as large language models) requires us to redefine the very notion of "reliability". No longer is it possible — or even desirable — to expect our applications to do exactly what we program them to do: Not only are LLMs fundamentally non-deterministic, but exhibiting emergent and unprogrammed behaviors is one of the key things that makes LLMs so powerful in the first place. Any production-grade LLM-powered system will require multiple quality assurance mechanisms, including run-time checks, live service monitoring, offline evaluation, and — ideally — test automation.

Semantic Expect's role is to shift basic validation left and verify essential generative behavior before shipping. It will always be possible to tweak prompts and eke out better responses, but some behaviors may be simply unacceptable to ship at all. Semantic Expect lets you write tests for generative features that can be added to your continuous integration and deployment processes, alongside end-to-end and integration tests. You should err toward defining rules that express acceptable behavior rather than perfect behavior; otherwise your tests may exhibit "flakiness" that impedes development velocity. Finding this balance and refining these techniques is perhaps the new art of "semantic testing".

Setup

To use Semantic Expect, you'll need to register custom matchers with your test runner. Instructions vary slightly by runner, but generally look like this:

// First, import your LLM client and a matcher factory
import { OpenAI } from 'openai';
import { makeOpenAIMatchers } from 'semantic-expect';

const model = new OpenAI();

// Second, build the matchers by submitting the LLM client
const matchers = makeOpenAIMatchers(model);

// Third, register the matchers
expect.extend(matchers);

You can typically combine these steps into a single line if preferred:

expect.extend(makeOpenAIMatchers(new OpenAI()));

See Jest expect.extend() and Vitest Extending Matchers for further details.

To use custom matchers across multiple test files, you can register them in a separate setup file. See Jest setupFilesAfterEnv configuration and Vitest setupFiles configuration for further details.
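
For example, a Vitest setup file might look like the following (the file path is illustrative; for Jest, drop the vitest import and list the file under setupFilesAfterEnv instead):

// tests/semantic-expect.setup.ts (illustrative path): registers the matchers once for all test files
import { expect } from 'vitest';
import { OpenAI } from 'openai';
import { makeOpenAIMatchers } from 'semantic-expect';

expect.extend(makeOpenAIMatchers(new OpenAI()));

You would then reference this file from your runner's configuration, e.g. setupFiles: ['./tests/semantic-expect.setup.ts'] in vitest.config.ts.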

Matching

Because generative AI is fundamentally non-deterministic, it's generally not possible to compare generated output against a static expected value (e.g. using toBe), nor is it typically sufficient to generate only one test value for assessment. Given these dynamics, Semantic Expect provides a toGenerate matcher that accepts a generator function, runs it n times, and checks every generation against a requirement:

it('Should write an on-topic joke', async () => {
  const generator = () => writeJoke('about computers');
  // Be sure to await the assertion
  await expect(generator).toGenerate('A joke about computers', 5);
});

Note: You must await the assertion, since the model call is asynchronous. If you don't, the test will always pass!

If the generated content does not fulfill the requirement, the matcher will provide a message explaining why:

Each generation should be 'A joke about computers' (1 of 3 were not):
  - 'Why was the electricity feeling so powerful? Because it had a high voltage personality!' (Is a joke about electricity, not computers)

By default, toGenerate will run the generator 3 times; a custom count can be specified as the second argument. Of course, it's always possible for a generator to work correctly 10 times and fail on the 11th, but such is the reality of working with LLMs; the best we can do is manage the risk, not eliminate it. Requirements should be kept broad enough that they can be met reliably despite the inherent variability of the content being tested.
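
As an illustration, a requirement phrased around the essential behavior tends to hold up across runs better than one that pins down incidental details (the requirements below are purely illustrative):

it('Should write a computer-related joke', async () => {
  const generator = () => writeJoke('about computers');

  // Broad requirement: describes the behavior that should hold on every run
  await expect(generator).toGenerate('A joke related to computers or technology', 5);

  // By contrast, an overly narrow requirement is likely to fail intermittently
  // even when the feature works, because it pins down incidental details:
  // await expect(generator).toGenerate('A pun about keyboards that mentions QWERTY', 5);
});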

If the generator being tested doesn't require any parameters, it can be submitted on its own, without a wrapping function:

it('Should write something funny', async () => {
  await expect(writeJoke).toGenerate('Something funny');
});

The toGenerate matcher can also be negated using not:

it('Should write a work-appropriate joke', async () => {
  const generator = () => writeJoke('about computers');
  await expect(generator).not.toGenerate('Anything inappropriate for work', 5);
});

Models

Semantic Expect provides multiple options for the models backing the custom matchers.

  • makeOpenAIMatchers: Uses the OpenAI backend and defaults to a chat-based model (alias for makeOpenAIChatMatchers)
  • makeOpenAIChatMatchers: Uses the OpenAI backend and always uses a chat-based model
  • makeOpenAITextMatchers: Uses the OpenAI backend and always uses a text-based (instruct) model

You can also specify a particular model via options if desired:

const textMatchers = makeOpenAITextMatchers(client, {
  model: 'text-davinci-003',
});
const chatMatchers = makeOpenAIChatMatchers(client, { model: 'gpt-4' });

Message formats

Semantic Expect generates an unformatted test result message by default, but this can be customized for your test runner and preferences:

const jestMatchers = makeOpenAIMatchers(client, { format: 'jest' });
const vitestMatchers = makeOpenAIMatchers(client, { format: 'vitest' });

Additional examples

Semantic Expect includes general examples by default, but your particular use case may benefit from additional guidance. Each example includes the following properties:

  • requirement: A description of the desired generated content, such as "A professional greeting"
  • content: The content being submitted for assessment, such as "What's up?? 🤪"
  • assessment: A brief assessment of why the content does or doesn't fulfill the requirement, such as "Uses casual language"
  • pass: true if requirement is fulfilled, false if not

Additional examples are registered when you create your matchers:

const matchers = makeOpenAIMatchers(client, {
  examples: [
    {
      requirement: 'A professional greeting',
      content: "What's up?? 🤪",
      assessment: 'Uses casual language',
      pass: false,
    },
  ],
});

There is no hard limit on the number of custom examples you can provide; however, note that you may eventually run up against the token limits imposed by your model.

To-do

  • Support LLM providers other than OpenAI
  • Message formats for additional test runners, and fully custom format function
  • Test coverage, particularly a suite that directly tests determinations (including their wording) so the prompt content can be trimmed as much as possible
  • Docs