llm-diff

v0.1.1

Published

12 days ago

Compare LLM responses across prompt versions — see token, cost, and latency deltas in a familiar diff format.


aakashbhardwaj27

llm diff prompt-engineering openai anthropic gemini groq cli developer-tools ai token-count cost-comparison

llm-diff

git diff for prompts — compare LLM responses across prompt versions.

See token count changes, cost deltas, latency shifts, and a word-level diff of the actual responses — all in one command.

$ llm-diff --a prompt-v1.txt --b prompt-v2.txt --model gpt-4o

llm-diff  openai/gpt-4o

  tokens          312 →        289   -23  (-7.4%)
    input          45 →         38   -7
    output        267 →        251   -16
  cost       $0.0041 →    $0.0038   -$0.0003  (-7.1%)
  latency      1247ms →      943ms   -304ms  (-24.4%)

--- prompt A
+++ prompt B
  The capital of France is Paris.
- It is located in northern France and has a population of approximately 2.1 million people...
+ Paris, with ~2.1M residents, serves as the political and cultural center of the country...

Install

npx llm-diff --a v1.txt --b v2.txt --model gpt-4o

Or install globally:

npm install -g llm-diff

Quick start

1. Set your API key:

export OPENAI_API_KEY=sk-...
# or ANTHROPIC_API_KEY, GEMINI_API_KEY, GROQ_API_KEY

2. Compare two prompts:

# From files
llm-diff --a prompt-v1.txt --b prompt-v2.txt --model gpt-4o

# Inline text
llm-diff -a "Explain gravity" -b "Explain gravity to a child" -m gpt-4o-mini

# With a system prompt
llm-diff -a v1.txt -b v2.txt -m claude-sonnet-4-20250514 -s "You are a science teacher"

Usage

llm-diff --a <prompt-a> --b <prompt-b> --model <model> [options]

Required

| Flag | Description |
|------|-------------|
| --a, -a | Prompt A: file path or inline text |
| --b, -b | Prompt B: file path or inline text |
| --model, -m | Model name (see --models for full list) |

Options

| Flag | Default | Description |
|------|---------|-------------|
| --system, -s | - | System prompt (file path or inline text) |
| --base-url | - | Gateway URL override |
| --max-tokens | 2048 | Max output tokens |
| --temperature | 0 | Temperature |
| --timeout | 60000 | Request timeout (ms) |
| --runs | 1 | Number of runs to average |
| --no-parallel | - | Run A and B sequentially |
| --full | - | Show full inline diff with highlighting |
| --json | - | JSON output for scripting |
| --models | - | List supported models and pricing |

Supported models

llm-diff --models

OpenAI

gpt-4o · gpt-4o-mini · gpt-4-turbo · gpt-4 · gpt-3.5-turbo · o1 · o1-mini · o3-mini

Anthropic

claude-sonnet-4-20250514 · claude-3.5-haiku · claude-3-opus

Google Gemini

gemini-2.0-flash · gemini-2.0-pro · gemini-1.5-pro · gemini-1.5-flash

Groq

llama-3.3-70b · llama-3.1-8b · mixtral-8x7b · gemma2-9b
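The provider lists above, together with the API key variables from Quick start, suggest a simple lookup from model name to provider. The sketch below is one hypothetical way to express that mapping; `resolveProvider`, `MODELS`, and `API_KEY_ENV` are illustrative names, not part of llm-diff's documented API, and the tool's own resolution logic may differ.

```typescript
// Hypothetical lookup built from the model lists in this README.
// This is not llm-diff's internal code.
type Provider = "openai" | "anthropic" | "gemini" | "groq";

const MODELS: Record<Provider, readonly string[]> = {
  openai: ["gpt-4o", "gpt-4o-mini", "gpt-4-turbo", "gpt-4", "gpt-3.5-turbo", "o1", "o1-mini", "o3-mini"],
  anthropic: ["claude-sonnet-4-20250514", "claude-3.5-haiku", "claude-3-opus"],
  gemini: ["gemini-2.0-flash", "gemini-2.0-pro", "gemini-1.5-pro", "gemini-1.5-flash"],
  groq: ["llama-3.3-70b", "llama-3.1-8b", "mixtral-8x7b", "gemma2-9b"],
};

// Environment variable names as listed under Quick start.
const API_KEY_ENV: Record<Provider, string> = {
  openai: "OPENAI_API_KEY",
  anthropic: "ANTHROPIC_API_KEY",
  gemini: "GEMINI_API_KEY",
  groq: "GROQ_API_KEY",
};

// Return the provider and the API key variable a given model would need.
function resolveProvider(model: string): { provider: Provider; apiKeyEnv: string } {
  for (const provider of Object.keys(MODELS) as Provider[]) {
    if (MODELS[provider].includes(model)) {
      return { provider, apiKeyEnv: API_KEY_ENV[provider] };
    }
  }
  throw new Error(`Unknown model: ${model}`);
}
```

A lookup like this makes it easy to check up front which API key a model needs before any request is sent.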

Gateway support

Route requests through a custom gateway (like llmhut) instead of direct API calls:

llm-diff --a v1.txt --b v2.txt -m gpt-4o --base-url https://gw.llmhut.com/v1

The gateway handles authentication, so you don't need provider-specific API keys.

Averaging multiple runs

LLM responses vary. Average over multiple runs for stable comparisons:

llm-diff --a v1.txt --b v2.txt -m gpt-4o --runs 5

Token counts and latency are averaged. The last response text is used for the diff.
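To make the averaging rule concrete, here is a minimal sketch of how several runs could be folded into one result. The `RunResult` shape and the `averageRuns` helper are assumptions made for illustration; llm-diff's internal types are not documented here.

```typescript
// Illustrative shape of one run; field names are assumptions, not llm-diff's types.
interface RunResult {
  inputTokens: number;
  outputTokens: number;
  latencyMs: number;
  text: string;
}

// Arithmetic mean of a non-empty list of numbers.
function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Average the numeric fields across runs and keep the last run's text for the diff.
function averageRuns(runs: RunResult[]): RunResult {
  if (runs.length === 0) throw new Error("need at least one run");
  return {
    inputTokens: mean(runs.map((r) => r.inputTokens)),
    outputTokens: mean(runs.map((r) => r.outputTokens)),
    latencyMs: mean(runs.map((r) => r.latencyMs)),
    text: runs[runs.length - 1].text,
  };
}
```

Because only the last response is diffed, the text diff can still vary from one invocation to the next even when the averaged numbers are stable.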

JSON output

Pipe results into scripts, dashboards, or eval pipelines:

llm-diff --a v1.txt --b v2.txt -m gpt-4o --json | jq '.delta'

{
  "totalTokens": -23,
  "totalTokensPct": -7.4,
  "cost": -0.000293,
  "costPct": -7.1,
  "latencyMs": -304,
  "latencyPct": -24.4
}
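As one possible use in an eval pipeline, the sketch below parses the `--json` output and flags regressions against a budget. The `delta` field names come from the sample above; `checkBudget`, `Budget`, and the threshold values are hypothetical, and the rest of the JSON document is not assumed.

```typescript
// Delta fields as shown in the sample JSON output above.
interface Delta {
  totalTokens: number;
  totalTokensPct: number;
  cost: number;
  costPct: number;
  latencyMs: number;
  latencyPct: number;
}

// Hypothetical thresholds; pick limits that suit your own pipeline.
interface Budget {
  maxCostPct: number;
  maxLatencyPct: number;
}

// Return a list of human-readable failures; an empty list means within budget.
function checkBudget(jsonOutput: string, budget: Budget): string[] {
  const { delta } = JSON.parse(jsonOutput) as { delta: Delta };
  const failures: string[] = [];
  if (delta.costPct > budget.maxCostPct) {
    failures.push(`cost changed by ${delta.costPct}% (limit ${budget.maxCostPct}%)`);
  }
  if (delta.latencyPct > budget.maxLatencyPct) {
    failures.push(`latency changed by ${delta.latencyPct}% (limit ${budget.maxLatencyPct}%)`);
  }
  return failures;
}
```

A CI step could capture the CLI's stdout, pass it to `checkBudget`, and exit non-zero when the returned list is not empty.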

Programmatic API

import { runDiff } from 'llm-diff';

const result = await runDiff({
  promptA: 'Explain gravity',
  promptB: 'Explain gravity to a 5-year-old',
  model: 'gpt-4o-mini',
});

console.log(result.delta);
// { totalTokens: -23, cost: -0.0003, latencyMs: -304, ... }
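Building on the call above, a script might compare several candidate prompts against one baseline. The sketch below takes the diff function as a parameter so it can be tried without network access; `rankCandidates`, `DiffOptions`, `DiffResult`, and `DiffFn` are hypothetical names, and the result shape is assumed from the sample `result.delta` output rather than from documented types.

```typescript
// Option and result shapes assumed from the example above; not official typings.
interface DiffOptions {
  promptA: string;
  promptB: string;
  model: string;
}
interface DiffResult {
  delta: { totalTokens: number; cost: number; latencyMs: number };
}
type DiffFn = (options: DiffOptions) => Promise<DiffResult>;

// Diff each candidate against the baseline and sort by cost delta, cheapest first.
async function rankCandidates(
  diff: DiffFn,
  baseline: string,
  candidates: string[],
  model: string,
): Promise<{ prompt: string; costDelta: number }[]> {
  const rows: { prompt: string; costDelta: number }[] = [];
  for (const candidate of candidates) {
    // Sequential on purpose, to stay gentle on provider rate limits.
    const { delta } = await diff({ promptA: baseline, promptB: candidate, model });
    rows.push({ prompt: candidate, costDelta: delta.cost });
  }
  return rows.sort((x, y) => x.costDelta - y.costDelta);
}
```

If `runDiff` accepts and returns compatible shapes, it could be passed in as `diff`; a stub works for local testing.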

How it works

1. Resolves the model → provider, pricing, API adapter
2. Reads prompt A and B (from files or inline text)
3. Fires both requests in parallel (or sequentially with --no-parallel)
4. Collects token counts, cost, and latency from the API response
5. Computes deltas between A and B
6. Generates a word-level diff of the response text
7. Renders everything to the terminal (or as JSON)
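The delta and diff steps can be sketched in a few lines. This README does not say which diff algorithm llm-diff uses, so the example substitutes a standard longest-common-subsequence (LCS) word diff; `delta`, `wordDiff`, and `DiffOp` are illustrative names, not the package's API.

```typescript
// Absolute change from a to b, plus the percentage change rounded to one decimal.
function delta(a: number, b: number): { abs: number; pct: number } {
  const abs = b - a;
  const pct = a === 0 ? 0 : Math.round((abs / a) * 1000) / 10;
  return { abs, pct };
}

type DiffOp = { op: "keep" | "del" | "add"; word: string };

// Word-level diff based on a longest-common-subsequence table.
function wordDiff(a: string, b: string): DiffOp[] {
  const x = a.split(/\s+/).filter(Boolean);
  const y = b.split(/\s+/).filter(Boolean);
  // lcs[i][j] = LCS length of x[i..] and y[j..]
  const lcs: number[][] = Array.from({ length: x.length + 1 }, () =>
    new Array<number>(y.length + 1).fill(0),
  );
  for (let i = x.length - 1; i >= 0; i--) {
    for (let j = y.length - 1; j >= 0; j--) {
      lcs[i][j] = x[i] === y[j]
        ? lcs[i + 1][j + 1] + 1
        : Math.max(lcs[i + 1][j], lcs[i][j + 1]);
    }
  }
  // Walk the table from the start, emitting keep/del/add operations.
  const ops: DiffOp[] = [];
  let i = 0;
  let j = 0;
  while (i < x.length && j < y.length) {
    if (x[i] === y[j]) {
      ops.push({ op: "keep", word: x[i] });
      i++;
      j++;
    } else if (lcs[i + 1][j] >= lcs[i][j + 1]) {
      ops.push({ op: "del", word: x[i] });
      i++;
    } else {
      ops.push({ op: "add", word: y[j] });
      j++;
    }
  }
  while (i < x.length) ops.push({ op: "del", word: x[i++] });
  while (j < y.length) ops.push({ op: "add", word: y[j++] });
  return ops;
}
```

With the token counts from the sample output, `delta(312, 289)` yields `{ abs: -23, pct: -7.4 }`. The LCS table costs O(n·m) time and memory in the word counts, which is fine for responses capped by --max-tokens.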

Roadmap

- [ ] Eval pipeline integration (named experiments, history)
- [ ] Side-by-side diff view
- [ ] Cross-model comparison (--model-a gpt-4o --model-b claude-sonnet-4-20250514)
- [ ] HTML report output
- [ ] Config file support (.llm-diff.json)
- [ ] Streaming output with live token counting
- [ ] Mistral, Cohere, Together AI providers

Contributing

See CONTRIBUTING.md.

License

Apache License — see LICENSE.
