@evalops/evalops

v0.1.12

Published

a year ago

CLI tool for evaluating LLM outputs against defined assertions

EvalOps CLI

A command-line tool for evaluating LLM outputs against defined assertions.

Installation

npm i @evalops/evalops

Or install from source:

git clone https://github.com/evalops/cli.git
cd cli
npm install
npm link

Authentication

The EvalOps CLI requires an API key from the EvalOps dashboard.

Get an API Key: Visit app.evalops.dev → Integrations → API Keys tab.

Quick Setup: Set your API key as an environment variable:
```
export EVALOPS_API_KEY="sk_your_api_key_here"
```

📖 For detailed setup instructions, see API_KEY_SETUP.md
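
In scripts it can help to confirm that the key is present before starting a run. The following is a minimal sketch: `require_evalops_key` is a hypothetical helper, not part of the EvalOps CLI, and it only checks that `EVALOPS_API_KEY` is non-empty. It does not validate the key against the service.

```shell
# Hypothetical helper (not shipped with EvalOps): fail fast when the
# API key environment variable is missing or empty.
require_evalops_key() {
  if [ -z "${EVALOPS_API_KEY:-}" ]; then
    echo "EVALOPS_API_KEY is not set; export it or pass --api-key" >&2
    return 1
  fi
}
```

A typical call would be `require_evalops_key && evalops test`, so the test run is skipped when the key is absent.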

Usage

# Initialize a new configuration file
evalops init

# Run tests defined in evalops.config.yaml
evalops test

# Run tests with a specific API key
evalops test --api-key "sk_your_api_key_here"

# Get help
evalops --help
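
The `init` and `test` commands above are often chained in a fresh checkout. The sketch below creates a configuration only when none exists. It assumes that `evalops init` writes `evalops.config.yaml` into the current directory, which this README implies but does not state, and it makes no assumption about whether `init` overwrites an existing file. `ensure_evalops_config` is a hypothetical helper name.

```shell
# Hypothetical helper: run `evalops init` only if no config file is
# present in the current directory, so an existing config is left alone.
ensure_evalops_config() {
  if [ -f evalops.config.yaml ]; then
    echo "found evalops.config.yaml"
  else
    echo "no evalops.config.yaml; running evalops init"
    evalops init
  fi
}
```

Usage: `ensure_evalops_config && evalops test`.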

How It Works

The diagram below illustrates the flow of EvalOps, showing how it interacts with LLM providers like OpenAI and Anthropic:

[Figure: EvalOps flow diagram]

Configuration

EvalOps reads its tests from a YAML configuration file, evalops.config.yaml. Each test specifies a provider, a model, a prompt, and one or more assertions. Supported assertion types include similarity, toxicity, language, content inclusion/exclusion, regression, drift, and security testing.

Basic Example

tests:
  - id: test-similarity
    provider: openai
    model: gpt-3.5-turbo
    prompt: "What is the capital of France?"
    assertions:
      - type: similarity
        expected: "The capital of France is Paris."
        threshold: 0.8
  
  - id: test-toxicity
    provider: anthropic
    model: claude-3-sonnet-20240229
    prompt: "Tell me about the solar system."
    assertions:
      - type: toxicity
        max_score: 0.1

Security Testing

Security checks are declared as assertions of type security. Each one lists the detectors to run under tests and sets a threshold. The detectors fall into seven categories:

1. Prompt Injection & System Override

- type: security
  tests:
    - ascii-smuggling
    - beavertails
    - cyberseceval
    - harmbench
    - indirect-prompt-injection
    - system-prompt-override
    - cca
    - prompt-extraction
    - tool-discovery
  threshold: 0.8

2. Access Control & Data Exfiltration

- type: security
  tests:
    - bfla
    - bola
    - rbac
    - rag-document-exfiltration
    - cross-session-leak
    - memory-poisoning
  threshold: 0.8

3. Code & Shell Vulnerabilities

- type: security
  tests:
    - harmful:cybercrime:malicious-code
    - shell-injection
    - sql-injection
    - ssrf
    - hijacking
  threshold: 0.8

4. Privacy & PII

- type: security
  tests:
    - pii:direct
    - pii:api-db
    - pii:session
    - pii:social
    - harmful:privacy
  threshold: 0.8

5. Content Safety

- type: security
  tests:
    - harmful:graphic-content
    - harmful:harassment-bullying
    - harmful:hate
    - harmful:insults
    - harmful:profanity
    - harmful:radicalization
    - harmful:self-harm
    - harmful:sexual-content
    - bias:gender
    - unsafebench
  threshold: 0.8

6. Misinformation & Misuse

- type: security
  tests:
    - hallucination
    - harmful:misinformation-disinformation
    - excessive-agency
    - overreliance
    - competitors
    - contracts
    - imitation
    - politics
    - religion
    - harmful:specialized-advice
    - harmful:unsafe-practices
  threshold: 0.8

7. Resource Exhaustion

- type: security
  tests:
    - reasoning-dos
    - divergent-repetition
  threshold: 0.8

Complete Example with Multiple Assertions

tests:
  - id: comprehensive-test
    provider: openai
    model: gpt-4
    prompt: "Write a product description."
    assertions:
      # Security tests
      - type: security
        tests:
          - ascii-smuggling
          - pii:direct
          - harmful:profanity
        threshold: 0.8
      
      # Content inclusion test
      - type: content_inclusion
        required_content: ["product features", "price", "benefits"]
        threshold: 0.7
      
      # Language test
      - type: language
        expected: "en"
        threshold: 0.9

Test Results

When running tests, EvalOps will:

1. Execute each test case against the specified model
2. Check responses against all specified assertions
3. Generate a detailed report showing:
   - Pass/fail status for each test
   - Detailed information about any detected issues
   - Confidence scores for security detections
   - Severity levels of detected issues
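
To keep a copy of the report for later review, the run can be wrapped so its output is saved to a file. This is a sketch under two assumptions the README does not confirm: that the report is printed to the terminal, and that `evalops test` exits with a non-zero status when a test fails. `run_evalops_logged` and the default file name `evalops-report.log` are hypothetical.

```shell
# Hypothetical wrapper: save everything `evalops test` prints to a log
# file, echo it back, and return the CLI's own exit status unchanged.
run_evalops_logged() {
  evalops_log="${1:-evalops-report.log}"
  evalops_status=0
  # stdout and stderr are merged so the log holds the full output
  evalops test >"$evalops_log" 2>&1 || evalops_status=$?
  cat "$evalops_log"
  return "$evalops_status"
}
```

Usage: `run_evalops_logged ci-report.log`. If the exit-status assumption holds, a CI job that calls the wrapper fails whenever the test run fails.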

License

GNU Affero General Public License v3.0 (AGPLv3)