
@arizeai/phoenix-evals

v1.0.0

A library for running evaluations for AI use cases

This package provides a TypeScript evaluation library. It is vendor-agnostic and can be used in isolation from any framework or platform. This package is still under active development and is subject to change.

Installation

# or yarn, pnpm, bun, etc...
npm install @arizeai/phoenix-evals

Usage

Creating a Classifier

The library provides a createClassifier function for building custom evaluators for tasks such as hallucination detection, relevance scoring, or any binary or multi-class classification.

import { createClassifier } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

const promptTemplate = `
In this task, you will be presented with a query, a reference text and an answer. The answer is
generated to the question based on the reference text. The answer may contain false information. You
must use the reference text to determine if the answer to the question contains false information,
if the answer is a hallucination of facts. Your objective is to determine whether the answer text
contains factual information and is not a hallucination. A 'hallucination' refers to
an answer that is not based on the reference text or assumes information that is not available in
the reference text. Your response should be a single word: either "factual" or "hallucinated", and
it should not include any other text or characters.

    [BEGIN DATA]
    ************
    [Query]: {{input}}
    ************
    [Reference text]: {{reference}}
    ************
    [Answer]: {{output}}
    ************
    [END DATA]

Is the answer above factual or hallucinated based on the query and reference text?
`;

// Create the classifier
const evaluator = await createClassifier({
  model,
  choices: { factual: 1, hallucinated: 0 },
  promptTemplate: promptTemplate,
});

// Use the classifier
const result = await evaluator({
  output: "Arize is not open source.",
  input: "Is Arize Phoenix Open Source?",
  reference:
    "Arize Phoenix is a platform for building and deploying AI applications. It is open source.",
});

console.log(result);
// Output: { label: "hallucinated", score: 0 }

See the complete example in examples/classifier_example.ts.
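The {{input}}, {{reference}}, and {{output}} placeholders in the prompt template are filled from the record you pass to the evaluator. As an illustration only (a hypothetical helper, not the library's actual templating code), mustache-style substitution can be sketched as:

```typescript
// Hypothetical sketch of mustache-style placeholder filling -- for
// illustration only; phoenix-evals handles this internally.
function fillTemplate(
  template: string,
  values: Record<string, string>
): string {
  // Replace each {{key}} with its value, leaving unknown placeholders as-is.
  return template.replace(/\{\{(\w+)\}\}/g, (match, key) =>
    key in values ? values[key] : match
  );
}

const filled = fillTemplate("[Query]: {{input}} / [Answer]: {{output}}", {
  input: "Is Arize Phoenix Open Source?",
  output: "Arize is not open source.",
});
console.log(filled);
// [Query]: Is Arize Phoenix Open Source? / [Answer]: Arize is not open source.
```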

Pre-Built Evaluators

The library includes several pre-built evaluators for common evaluation tasks. These evaluators come with optimized prompts and can be used directly with any AI SDK model.

import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";
const model = openai("gpt-4o-mini");

// Faithfulness Detection
const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model,
});

// Use the evaluators
const result = await faithfulnessEvaluator({
  input: "What is the capital of France?",
  context: "France is a country in Europe. Paris is its capital city.",
  output: "The capital of France is London.",
});

console.log(result);
// Output: { label: "unfaithful", score: 0, explanation: "..." }
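Because each result carries a numeric score (1 for faithful, 0 for unfaithful here), results over a batch of examples can be rolled up into a simple pass rate. A minimal sketch in plain TypeScript (a helper for illustration, not a library API):

```typescript
// Sketch: aggregate evaluator results into a pass rate.
// EvalResult mirrors the { label, score } shape shown above.
interface EvalResult {
  label: string;
  score: number;
}

function passRate(results: EvalResult[]): number {
  if (results.length === 0) return 0;
  const total = results.reduce((sum, r) => sum + r.score, 0);
  return total / results.length;
}

const results: EvalResult[] = [
  { label: "faithful", score: 1 },
  { label: "unfaithful", score: 0 },
  { label: "faithful", score: 1 },
];
console.log(passRate(results)); // 2 of 3 passed -> ~0.667
```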

Data Mapping

When your data structure doesn't match what an evaluator expects, use bindEvaluator to map your fields to the evaluator's expected input format:

import {
  bindEvaluator,
  createFaithfulnessEvaluator,
} from "@arizeai/phoenix-evals";
import { openai } from "@ai-sdk/openai";

const model = openai("gpt-4o-mini");

type ExampleType = {
  question: string;
  context: string;
  answer: string;
};

const evaluator = bindEvaluator<ExampleType>(
  createFaithfulnessEvaluator({ model }),
  {
    inputMapping: {
      input: "question", // Map "input" from "question"
      context: "context", // Map "context" from "context"
      output: "answer", // Map "output" from "answer"
    },
  }
);

const result = await evaluator.evaluate({
  question: "Is Arize Phoenix Open Source?",
  context:
    "Arize Phoenix is a platform for building and deploying AI applications. It is open source.",
  answer: "Arize is not open source.",
});

Mapping supports simple properties ("fieldName"), dot notation ("user.profile.name"), array access ("items[0].id"), JSONPath expressions ("$.items[*].id"), and function extractors ((data) => data.customField).

See the complete example in examples/bind_evaluator_example.ts.
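To make the mapping behavior concrete, the simple-property, dot-notation, and array-access forms can be sketched with a small path resolver (an illustrative helper, not how bindEvaluator is actually implemented; JSONPath and function extractors are omitted):

```typescript
// Illustrative path resolver -- NOT the library's implementation.
// Supports "field", "a.b.c", and "items[0].id" style paths.
function resolvePath(data: unknown, path: string): unknown {
  // Normalize "items[0].id" into ["items", "0", "id"] segments.
  const segments = path.replace(/\[(\d+)\]/g, ".$1").split(".");
  return segments.reduce<unknown>(
    (current, segment) =>
      current == null
        ? undefined
        : (current as Record<string, unknown>)[segment],
    data
  );
}

const example = {
  question: "Is Arize Phoenix Open Source?",
  user: { profile: { name: "ada" } },
  items: [{ id: 42 }],
};

console.log(resolvePath(example, "question")); // "Is Arize Phoenix Open Source?"
console.log(resolvePath(example, "user.profile.name")); // "ada"
console.log(resolvePath(example, "items[0].id")); // 42
```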

Experimentation with Phoenix

This package works seamlessly with @arizeai/phoenix-client to enable experimentation workflows. You can create datasets, run experiments, and trace evaluation calls for analysis and debugging.

Running Experiments

To run experiments with your evaluations, install the phoenix-client:

npm install @arizeai/phoenix-client

import { createFaithfulnessEvaluator } from "@arizeai/phoenix-evals/llm";
import { openai } from "@ai-sdk/openai";
import { createDataset } from "@arizeai/phoenix-client/datasets";
import {
  asExperimentEvaluator,
  runExperiment,
} from "@arizeai/phoenix-client/experiments";

// Create your evaluator
const faithfulnessEvaluator = createFaithfulnessEvaluator({
  model: openai("gpt-4o-mini"),
});

// Create a dataset for your experiment
const dataset = await createDataset({
  name: "faithfulness-eval",
  description: "Evaluate the faithfulness of the model",
  examples: [
    {
      input: {
        question: "Is Phoenix Open-Source?",
        context: "Phoenix is Open-Source.",
      },
    },
    // ... more examples
  ],
});

// Define your experimental task
const task = async (example) => {
  // Your AI system's response to the question
  return "Phoenix is not Open-Source";
};

// Create a custom evaluator to validate results
const faithfulnessCheck = asExperimentEvaluator({
  name: "faithfulness",
  kind: "LLM",
  evaluate: async ({ input, output }) => {
    // Use the faithfulness evaluator from phoenix-evals
    const result = await faithfulnessEvaluator({
      input: input.question,
      context: input.context,
      output: output,
    });

    return result; // Return the evaluation result
  },
});

// Run the experiment with automatic tracing
runExperiment({
  experimentName: "faithfulness-eval",
  experimentDescription: "Evaluate the faithfulness of the model",
  dataset: dataset,
  task,
  evaluators: [faithfulnessCheck],
});
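Conceptually, runExperiment applies the task to each dataset example and then scores the task's output with each evaluator. A simplified, self-contained sketch of that loop (hypothetical shapes, not the phoenix-client API, which also traces and persists each run):

```typescript
// Simplified experiment loop -- hypothetical sketch, not the actual
// @arizeai/phoenix-client implementation.
interface Example {
  input: Record<string, string>;
}

interface Evaluation {
  name: string;
  result: { label: string; score: number };
}

interface Run {
  input: Record<string, string>;
  output: string;
  evaluations: Evaluation[];
}

interface Evaluator {
  name: string;
  evaluate: (args: {
    input: Record<string, string>;
    output: string;
  }) => Promise<{ label: string; score: number }>;
}

async function runExperimentSketch(
  examples: Example[],
  task: (example: Example) => Promise<string>,
  evaluators: Evaluator[]
): Promise<Run[]> {
  const runs: Run[] = [];
  for (const example of examples) {
    // Run the system under test on this example.
    const output = await task(example);
    // Score the output with every registered evaluator.
    const evaluations: Evaluation[] = [];
    for (const evaluator of evaluators) {
      evaluations.push({
        name: evaluator.name,
        result: await evaluator.evaluate({ input: example.input, output }),
      });
    }
    runs.push({ input: example.input, output, evaluations });
  }
  return runs;
}
```

In the real API, runExperiment additionally records a trace for each task and evaluator call so runs can be inspected in Phoenix.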

Examples

To run examples, install dependencies using pnpm and run:

pnpm install
pnpx tsx examples/classifier_example.ts
# change the file name to run other examples

Community

Join our community to connect with thousands of AI builders: