# prompt-lock (v0.5.0)

Version control and behavioral regression testing for LLM prompts.
prompt-lock wraps your prompts with behavioral assertions and snapshot baselines. On every change, it runs the assertion suite and flags regressions — like Jest for LLM behavior. The lightweight, code-first alternative to promptfoo. TypeScript-native. Works with any LLM endpoint. Zero cloud dependencies.
## Demo
Run the demo without any API keys to see prompt-lock in action:
```shell
git clone https://github.com/shmulikdav/Promptlock.git
cd Promptlock
npm install && npm run build
node demo.js
```

This runs 4 simulated prompts (2 passing, 2 failing), saves snapshots, shows diffs, and generates an HTML report — all with mock LLM outputs.
## Installation
```shell
npm install prompt-lock
```

You'll also need at least one LLM provider SDK (or use a custom endpoint):
```shell
# For Anthropic
npm install @anthropic-ai/sdk

# For OpenAI
npm install openai

# For Ollama, LM Studio, Azure, etc. — no extra install needed!
# Use the custom provider: { type: 'custom', url: 'http://localhost:11434/api/generate' }
```

## Quick Start
```shell
# 1. Initialize prompt-lock in your project
npx prompt-lock init

# 2. Edit the example prompt in prompts/example.js

# 3. Set your API key
export ANTHROPIC_API_KEY=your-key-here

# 4. Run assertions
npx prompt-lock run

# 5. Save a snapshot baseline
npx prompt-lock snapshot
```

## Defining Prompts
Create config files as .js, .yaml, .yml, or .json. prompt-lock auto-discovers promptlock.yaml in your project root, or scans the prompts/ directory.
### YAML Config (recommended)
```yaml
# promptlock.yaml
id: article-summarizer
version: '1.0.0'
provider: anthropic
model: claude-sonnet-4-20250514
prompt: |
  Summarize the following article in 3 bullet points.
  Article: {{article}}
defaultVars:
  article: The quick brown fox jumped over the lazy dog.
assertions:
  - type: contains
    value: '•'
  - type: max-length
    chars: 500
  - type: max-cost
    dollars: 0.05
```

### JavaScript Config
Create .js files in your prompts/ directory:
```js
// prompts/summarizer.js
module.exports = {
  id: 'article-summarizer',
  version: '1.0.0',
  provider: 'anthropic',
  model: 'claude-sonnet-4-20250514',
  prompt: `You are a professional summarizer.
Summarize the following article in 3 bullet points.
Article: {{article}}`,
  defaultVars: {
    article: 'The quick brown fox jumped over the lazy dog.'
  },
  assertions: [
    { type: 'contains', value: '•' },
    { type: 'max-length', chars: 500 },
    { type: 'not-contains', value: 'I cannot' },
    { type: 'max-latency', ms: 10000 },
  ]
};
```

### Using Custom Providers (Ollama, LM Studio, Azure, etc.)
```js
module.exports = {
  id: 'local-test',
  provider: {
    type: 'custom',
    url: 'http://localhost:11434/api/generate',
    // Optional: custom headers for auth
    headers: { 'Authorization': 'Bearer ...' },
    // Optional: custom response path (auto-detects OpenAI, Anthropic, Ollama formats)
    responsePath: 'response',
  },
  model: 'llama3',
  prompt: 'Hello {{name}}',
  defaultVars: { name: 'world' },
  assertions: [
    { type: 'min-length', chars: 5 },
  ],
};
```

## Testing with Datasets
Test a prompt against multiple inputs — inline or from external CSV/JSON files:
```js
module.exports = {
  id: 'classifier',
  provider: 'openai',
  model: 'gpt-4o-mini',
  prompt: 'Classify this ticket: {{ticket}}',
  defaultVars: { ticket: 'My payment failed' },
  // Inline dataset
  dataset: [
    { ticket: 'My payment failed' },
    { ticket: 'How do I reset my password?' },
    { ticket: 'Your product is great!' },
  ],
  // Or load from a file:
  // dataset: './data/test-tickets.csv',
  // dataset: './data/test-tickets.json',
  assertions: [
    { type: 'json-valid' },
    { type: 'max-latency', ms: 5000 },
  ],
};
```

CSV files use the first row as headers (= template variable names):
```csv
ticket,expected_category
My payment failed,billing
How do I reset my password?,account
Your product is great!,feedback
```

## LLM-as-Judge
Use a separate LLM to evaluate output quality:
```js
module.exports = {
  id: 'creative-writer',
  provider: 'anthropic',
  model: 'claude-sonnet-4-20250514',
  prompt: 'Write a haiku about {{topic}}',
  defaultVars: { topic: 'coding' },
  assertions: [
    { type: 'max-length', chars: 200 },
    {
      type: 'llm-judge',
      judge: { provider: 'openai', model: 'gpt-4o-mini' },
      criteria: 'Is this a valid haiku with 5-7-5 syllable structure?',
      threshold: 0.7, // pass if score >= 0.7
    },
  ],
};
```

## Programmatic Usage
```js
import { PromptLock } from 'prompt-lock';

const lock = new PromptLock({
  id: 'my-prompt',
  version: '1.0.0',
  provider: 'anthropic',
  model: 'claude-sonnet-4-20250514',
  prompt: 'Translate to French: {{text}}',
  defaultVars: { text: 'Hello world' },
  assertions: [
    { type: 'not-contains', value: 'Hello' },
    { type: 'min-length', chars: 5 },
  ],
});

const results = await lock.run();
console.log(results[0].passed); // true or false
```

## CLI Reference
### `prompt-lock init`
Scaffold a .promptlock/ folder with config and an example prompt.
### `prompt-lock run`
Run all assertions against current prompts.
```shell
prompt-lock run                    # Run all prompts
prompt-lock run --id my-prompt     # Run a specific prompt
prompt-lock run --ci               # Exit code 1 on failure
prompt-lock run --report html      # Generate HTML report
prompt-lock run --report json      # Generate JSON report
prompt-lock run --report markdown  # Generate Markdown report
prompt-lock run --report both      # Generate JSON + HTML reports
prompt-lock run --dry-run          # Show what would be tested without calling LLMs
prompt-lock run --verbose          # Show detailed output per prompt
prompt-lock run --parallel         # Run prompts in parallel
prompt-lock run --concurrency 10   # Max concurrent runs (default: 5)
prompt-lock run --cache            # Cache LLM outputs (skip unchanged prompts)
prompt-lock run --watch            # Watch for file changes and re-run
prompt-lock run --ab v1:v2         # A/B compare two prompt IDs side-by-side
prompt-lock run --no-open          # Don't auto-open the HTML report in browser
prompt-lock run --github-pr owner/repo#123  # Post results as PR comment
```

### `prompt-lock snapshot`
Capture and save the current output as a baseline. Snapshots are versioned — previous snapshots are kept as history.
```shell
prompt-lock snapshot                 # Snapshot all prompts
prompt-lock snapshot --id my-prompt
```

### `prompt-lock diff`
Compare current LLM output against the last saved snapshot.
```shell
prompt-lock diff                 # Diff all prompts
prompt-lock diff --id my-prompt
```

### `prompt-lock history`
View snapshot history for a prompt.
```shell
prompt-lock history my-prompt
```

### `prompt-lock cache`
Manage the output cache (used with --cache flag).
```shell
prompt-lock cache stats  # Show cache size and entries
prompt-lock cache clear  # Clear all cached outputs
```

## Output Caching
Use --cache to skip LLM calls for prompts that haven't changed. Cached outputs are stored in .promptlock/cache/ and keyed by prompt text + model name.
```shell
# First run: calls the LLM, saves to cache
prompt-lock run --cache

# Second run: uses cache, instant results
prompt-lock run --cache

# Clear cache when you want fresh results
prompt-lock cache clear
```

## A/B Testing Mode
Compare two prompt variants side-by-side and pick a winner:
```shell
prompt-lock run --ab summarizer-v1:summarizer-v2
```

You get a console table + a dark-themed HTML comparison report that opens automatically in your browser:
A/B Comparison: summarizer-v1 vs summarizer-v2
| Metric | Variant A | Variant B | Delta |
| ------- | ----------------- | ----------------- | --------- |
| Status | ✅ 5/5 passed | ✅ 5/5 passed | — |
| Latency | 1250ms | 980ms | -270ms |
| Cost | $0.004500 | $0.003200 | -$0.00130 |
| Tokens | 1234 | 987 | -247 |
Winner: Variant B

The HTML report includes a winner banner, side-by-side variant cards, assertion chips, delta bars for each metric, and collapsible output previews. Ship it as an artifact in CI or open it locally during development. Use --no-open to skip the browser launch.
Winner logic:
- Higher pass rate wins
- Otherwise: >5% cheaper wins
- Otherwise: >10% faster wins
- Otherwise: tie
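The rules above can be sketched as a plain comparison function (a sketch of the documented logic, not prompt-lock's internal code; the stat field names `passRate`, `costDollars`, and `latencyMs` are assumptions):

```javascript
// Sketch of the documented winner rules. The stat shape
// ({ passRate, costDollars, latencyMs }) is assumed for illustration.
function pickWinner(a, b) {
  // Higher pass rate wins outright.
  if (a.passRate !== b.passRate) {
    return a.passRate > b.passRate ? 'A' : 'B';
  }
  // Otherwise: more than 5% cheaper wins.
  if (a.costDollars < b.costDollars * 0.95) return 'A';
  if (b.costDollars < a.costDollars * 0.95) return 'B';
  // Otherwise: more than 10% faster wins.
  if (a.latencyMs < b.latencyMs * 0.9) return 'A';
  if (b.latencyMs < a.latencyMs * 0.9) return 'B';
  return 'tie';
}
```

On the sample run above (both variants 5/5, Variant B roughly 29% cheaper), this logic picks Variant B at the cost step.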
For signal over noise, use a dataset of size 5+ when A/B testing — a single LLM call isn't statistically meaningful.
Programmatic API:
```js
import { runAB } from 'prompt-lock';

const result = await runAB(variantA, variantB);
console.log(result.winner); // 'A' | 'B' | 'tie'
console.log(result.deltas.costDollars);
```

## YAML Autocomplete (JSON Schema)
Get IDE autocomplete, validation, and inline docs for YAML configs. Add this comment at the top of your promptlock.yaml:
```yaml
# yaml-language-server: $schema=https://raw.githubusercontent.com/shmulikdav/Promptlock/main/schemas/promptlock.schema.json

id: my-prompt
provider: anthropic
model: claude-sonnet-4-20250514
# VS Code now suggests all valid fields and assertion types
```

Install the YAML extension in VS Code to enable schema support.
## Cost & Token Tracking
prompt-lock automatically tracks token usage and estimates cost for OpenAI and Anthropic calls. Costs are shown in console output and included in all report formats.
```shell
prompt-lock run --verbose  # Shows per-prompt token counts and cost
```

Use the max-cost assertion to enforce budget limits:
```yaml
assertions:
  - type: max-cost
    dollars: 0.05
```

Built-in pricing for GPT-4o, GPT-4o-mini, Claude Sonnet 4, Claude Haiku, and more. Unknown models show zero cost.
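The estimate itself is simple arithmetic: token counts divided by a million, times a per-million-token rate. A sketch with made-up placeholder rates (these numbers and the model name are invented, not prompt-lock's built-in pricing table):

```javascript
// Illustrative per-million-token rates. These values are
// placeholders, NOT the library's real pricing data.
const RATES = {
  'example-model': { inputPerM: 0.15, outputPerM: 0.6 },
};

function estimateCost(model, inputTokens, outputTokens) {
  const r = RATES[model];
  if (!r) return 0; // unknown models show zero cost
  return (inputTokens / 1e6) * r.inputPerM + (outputTokens / 1e6) * r.outputPerM;
}
```

The returned dollar amount is what the max-cost assertion compares against its budget.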
## Watch Mode
Auto-rerun on file changes during prompt development:
```shell
prompt-lock run --watch
```

Watches your config files and prompts/ directory. Debounces rapid changes (500ms).
## Retry Logic
All LLM provider calls automatically retry on transient errors (rate limits, timeouts, network errors) with exponential backoff. Default: 3 retries. Use --verbose to see retry activity.
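A retry loop with exponential backoff of this kind generally looks like the following sketch (illustrative; not prompt-lock's internal implementation):

```javascript
// Sketch: retry an async call on failure, doubling the wait each time.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function withRetry(fn, retries = 3, baseMs = 500) {
  let lastError;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn(); // success: return immediately
    } catch (err) {
      lastError = err;
      if (attempt === retries) break; // out of retries
      await sleep(baseMs * 2 ** attempt); // waits 500ms, 1s, 2s, ...
    }
  }
  throw lastError;
}
```

Transient errors (rate limits, timeouts) usually clear within a few doubled waits, while persistent errors still surface after the final attempt.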
## GitHub PR Comments
Post test results directly to a GitHub pull request:
```shell
export GITHUB_TOKEN=your-token
prompt-lock run --ci --github-pr owner/repo#123
```

This posts a markdown table with pass/fail results and failure details. If a comment already exists, it updates in place.
## Assertion Reference
| Assertion | Config | What it checks |
|-----------|--------|----------------|
| `contains` | `value: string` | Output contains the string |
| `not-contains` | `value: string` | Output does NOT contain the string |
| `contains-all` | `values: string[]` | Output contains ALL listed strings |
| `starts-with` | `value: string` | Output starts with the string |
| `ends-with` | `value: string` | Output ends with the string |
| `matches-regex` | `pattern: string` | Output matches the regex pattern |
| `max-length` | `chars: number` | Output is under N characters |
| `min-length` | `chars: number` | Output is over N characters |
| `json-valid` | — | Output is valid JSON |
| `json-schema` | `schema: object` | Output JSON matches a JSON Schema |
| `no-hallucination-words` | `words?: string[]` | Output does NOT contain hallucination indicators |
| `no-duplicates` | `separator?: string` | Output has no duplicate items (split by separator, default `\n`) |
| `max-latency` | `ms: number` | LLM response time is under N milliseconds |
| `max-cost` | `dollars: number` | LLM call cost is under N dollars |
| `llm-judge` | `judge: {provider, model}`, `criteria: string`, `threshold?: number` | Another LLM scores output quality (0-1) |
| `custom` | `name: string`, `fn: (output) => boolean` | User-provided function returning a boolean (JS configs only) |
## Provider Setup
### Anthropic
```shell
export ANTHROPIC_API_KEY=your-key-here
```

### OpenAI
```shell
export OPENAI_API_KEY=your-key-here
```

### Custom Provider (Ollama, LM Studio, Azure, any HTTP endpoint)
No API key needed for local models. Just set the URL:
```js
provider: {
  type: 'custom',
  url: 'http://localhost:11434/api/generate', // Ollama
  // url: 'http://localhost:1234/v1/chat/completions', // LM Studio
  // url: 'https://your-resource.openai.azure.com/openai/deployments/gpt-4/chat/completions?api-version=2024-02-01', // Azure
  headers: { 'api-key': process.env.AZURE_API_KEY }, // Optional auth
  responsePath: 'choices[0].message.content', // Optional: path to extract the response
}
```

Auto-detects response format for OpenAI, Anthropic, and Ollama APIs. Use responsePath for custom APIs.
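Conceptually, a responsePath is just a walk through the parsed JSON body; here is a sketch of the idea (illustrative only, not prompt-lock's actual extractor):

```javascript
// Sketch: resolve a dotted/indexed path like 'choices[0].message.content'
// against a parsed JSON response body.
function getByPath(obj, path) {
  // 'choices[0].message.content' -> ['choices', '0', 'message', 'content']
  const keys = path.replace(/\[(\d+)\]/g, '.$1').split('.').filter(Boolean);
  return keys.reduce((cur, key) => (cur == null ? undefined : cur[key]), obj);
}

const body = { choices: [{ message: { content: 'Bonjour' } }] };
console.log(getByPath(body, 'choices[0].message.content')); // → 'Bonjour'
```

A path of 'response' matches Ollama-style bodies, while 'choices[0].message.content' matches OpenAI-compatible ones.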
## CI/CD Integration
### GitHub Actions
```yaml
name: Prompt Regression Tests
on: [push, pull_request]

jobs:
  prompt-lock:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: '20'
      - run: npm install
      - run: npx prompt-lock run --ci --cache --github-pr ${{ github.repository }}#${{ github.event.pull_request.number }}
        env:
          ANTHROPIC_API_KEY: ${{ secrets.ANTHROPIC_API_KEY }}
          GITHUB_TOKEN: ${{ secrets.GITHUB_TOKEN }}
```

- `--ci` ensures exit code 1 when any assertion fails
- `--cache` skips LLM calls for unchanged prompts (faster, cheaper)
- `--github-pr` posts results as a PR comment with a pass/fail table
- Automatic retry on rate limits and transient errors (3 retries with exponential backoff)
## Configuration
`.promptlock/config.json` (created by `init`):
```json
{
  "promptsDir": "./prompts",
  "snapshotDir": "./.promptlock/snapshots",
  "reportDir": "./.promptlock/reports",
  "defaultProvider": "anthropic",
  "ci": {
    "failOnRegression": true,
    "reportFormat": ["json", "html"]
  }
}
```

## Template Variables
Use `{{variableName}}` in your prompts. Variable names may contain alphanumeric characters, dashes, dots, and underscores:
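Substitution with that character set can be sketched with a single regex (illustrative only; not prompt-lock's actual renderer):

```javascript
// Sketch: replace {{var}} placeholders. Names may contain letters,
// digits, underscores, dashes, and dots; unknown names are left as-is.
function render(template, vars) {
  return template.replace(/\{\{([\w.-]+)\}\}/g, (match, name) =>
    name in vars ? String(vars[name]) : match
  );
}

console.log(render('Hello {{user-name}}, v{{api.version}}', {
  'user-name': 'Ada',
  'api.version': '2',
})); // → 'Hello Ada, v2'
```

Leaving unknown placeholders untouched makes missing variables easy to spot in the rendered prompt.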
```text
{{article}}     ✅
{{user-name}}   ✅
{{api.version}} ✅
{{my_var}}      ✅
```

## License
MIT
