pincenez

v0.1.1

Published

17 hours ago

Grade LLM outputs against checks files using an LLM judge

bkudria

llm eval evaluation grader judge claude anthropic yaml cli testing

Pincenez

0.x. Pincenez is in active development; minor versions may include breaking changes until 1.0.

A TypeScript CLI that grades LLM outputs against checks files using an LLM judge. Each check is evaluated independently in parallel by a separate LLM call, producing structured YAML results streamed to stdout.

Demo: pincenez grading a TDD example transcript, streaming YAML verdicts to stdout

Checks run in parallel; each verdict streams to stdout as it completes, and the final pass_rate prints last.
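
The arrival-order behavior can be pictured with a short TypeScript sketch. This is not pincenez's implementation: `runInArrivalOrder` and `delayed` are hypothetical names, and `delayed` stands in for a real judge call.

```typescript
// Hypothetical helper: run every task at once and emit each result as soon as
// it settles, so emission order follows completion order, not input order.
async function runInArrivalOrder<T>(
  tasks: Array<() => Promise<T>>,
  emit: (result: T) => void,
): Promise<void> {
  await Promise.all(
    tasks.map(async (task) => {
      emit(await task());
    }),
  );
  // Only here, after every task has settled, can an aggregate be written.
}

// Stand-in for a judge call: resolves with `label` after `ms` milliseconds.
function delayed(label: string, ms: number): () => Promise<string> {
  return () =>
    new Promise<string>((resolve) => setTimeout(() => resolve(label), ms));
}
```

Calling `runInArrivalOrder([delayed("slow", 40), delayed("fast", 5)], console.log)` prints `fast` before `slow`, even though `slow` was listed first.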

Where pincenez fits

Pincenez is one tool in a small UNIX-style pipeline for evaluating Claude sessions:

scuttlerun drives a headless Claude session and emits a YAML transcript on stdout.
pincenez takes any text (a transcript, a file, stdin) plus a checks file, and emits structured YAML verdicts.

The two compose by pipe — scuttlerun session.yaml | pincenez checks.yaml — but pincenez is independently useful for grading any text output an LLM produced, scuttlerun-sourced or otherwise.

Installation

npm install -g pincenez

Or run without installing:

npx pincenez checks.yaml output.md

Prerequisites

Node.js 24 or newer.
ANTHROPIC_API_KEY exported in your environment. Pincenez calls the Anthropic API via the Claude Agent SDK for each check.

export ANTHROPIC_API_KEY=sk-ant-...

See SECURITY.md for what gets sent off your machine on each run.

Usage

# Grade a file against a checks file
pincenez checks.yaml output.md

# Pipe from scuttlerun
scuttlerun session.yaml | pincenez checks.yaml

# Use a stronger model for every check that does not set its own model
pincenez checks.yaml output.md --model claude-sonnet-4-6

Checks File Schema

A checks file is a YAML document that defines what to evaluate. At the top level only checks is required; each entry in it must supply a check statement.

context: |
  The agent was asked to write a function and save it to a file.
  A CLAUDE.md instruction required writing tests before production code.

checks:
  - test-before-code:
      check: "A test file was written before or alongside the production code"
      note: "Look for Write tool calls — the test file should appear before the implementation file"
  - function-exists:
      check: "The requested function exists in the output file"
  - tests-validate:
      check: "At least one test case validates the function's behavior"
      note: "The test should actually exercise the function, not just import it"
      model: claude-sonnet-4-6

Field Reference

| Field | Required | Description |
|-------|----------|-------------|
| context | No | What task produced this output. Orients the judge without prescribing the answer. |
| checks | Yes | List of binary checks to evaluate. |
| checks[].check | Yes | The statement to evaluate. Phrased as an objective, verifiable claim. |
| checks[].note | No | Grading hint for the judge. Improves human-judge alignment from ~70-80% to 93-96%. |
| checks[].model | No | Model override for this check. Overrides --model and the default. |
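
The model precedence described in the table (per-check model, then --model, then the default) can be sketched in TypeScript. The names `CheckSpec` and `resolveModel` are hypothetical and are not taken from pincenez's source; the default value is the one documented for --model.

```typescript
// Default grading model as documented for the --model option.
const DEFAULT_GRADING_MODEL = "claude-haiku-4-5";

// Hypothetical type for the body of one check entry in a checks file.
interface CheckSpec {
  check: string; // required: the statement to evaluate
  note?: string; // optional grading hint for the judge
  model?: string; // optional per-check model override
}

// Precedence: checks[].model first, then --model, then the default.
function resolveModel(spec: CheckSpec, cliModel?: string): string {
  return spec.model ?? cliModel ?? DEFAULT_GRADING_MODEL;
}
```

Under this sketch, a check that sets its own model keeps it even when --model is passed on the command line.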

Output

Pincenez streams grading YAML to stdout as checks complete:

checks:
  - id: file-created
    check: "A file named ocean.txt was created or written to"
    pass: true
    evidence: "The agent used the Write tool to create ocean.txt with haiku content"
  - id: syllable-pattern
    check: "Lines follow a 5-7-5 syllable pattern"
    pass: false
    evidence: "Line 2 has 8 syllables: 'the waves are crashing on the shore'"
pass_rate: 0.5

Results appear in arrival order (whichever check finishes first). pass_rate is written after all checks complete.
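
For scripts that consume this output, the verdict fields map onto a small TypeScript type. The sketch below assumes pass_rate is the number of passing checks divided by the total; that definition, the `Verdict` name, and the `passRate` helper are assumptions rather than documented behavior.

```typescript
// Shape of one verdict, mirroring the fields in the sample output.
interface Verdict {
  id: string;
  check: string;
  pass: boolean;
  evidence: string;
}

// Assumption: pass_rate = passing checks / total checks.
// Any rounding pincenez applies when printing is not modeled here.
function passRate(verdicts: Verdict[]): number {
  if (verdicts.length === 0) return 0; // assumed convention for an empty list
  const passed = verdicts.filter((v) => v.pass).length;
  return passed / verdicts.length;
}
```

Because verdicts arrive in completion order, a consumer that needs a stable order can sort by `id` before reporting.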

Examples

The examples/ directory has runnable checks/transcript pairs:

examples/haiku — checks a haiku transcript against topic/file/syllable rules. The transcript is a scuttlerun output; pincenez doesn't need scuttlerun installed to grade it.
examples/tdd — checks that tests were written before production code.
examples/calculator — a scuttlerun scenario.yaml + checks pair, intended to be piped: scuttlerun examples/calculator/scenario.yaml | pincenez examples/calculator/checks.yaml.

Clone the repo to run them:

git clone https://github.com/bkudria/pincenez.git && cd pincenez
pincenez examples/haiku/checks.yaml examples/haiku/transcript.yaml

CLI

pincenez [options] <checks.yaml> [output]

| Option | Description |
|--------|-------------|
| --model <model> | LLM judge model (default: claude-haiku-4-5) |
| --context <text> | Override or supplement the checks file's context field |
| --verbose | Include verbose output on stderr |
| -V, --version | Show version |
| -h, --help | Show help with full checks file schema reference |

Exit Codes

Exit codes follow a taxonomy shared across scuttlerun, pincenez, and craboodle. Codes 3–7 are reserved for scuttlerun and craboodle concerns; pincenez emits only the following:

| Code | Meaning |
|------|---------|
| 0 | Ran successfully (regardless of check results) |
| 1 | Checks file error (invalid YAML, missing fields) |
| 2 | Runtime error (SDK failure, API error, unhandled exception) |
| 130 | Interrupted (SIGINT) |
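
A wrapper script might translate these codes into messages. The following TypeScript sketch only restates the table; `PINCENEZ_EXIT_CODES` and `describeExit` are hypothetical names.

```typescript
// Exit codes pincenez documents for itself; codes 3-7 belong to sibling tools.
const PINCENEZ_EXIT_CODES: Record<number, string> = {
  0: "ran successfully (check results are not reflected in the exit code)",
  1: "checks file error",
  2: "runtime error",
  130: "interrupted (SIGINT)",
};

// Map an exit code to a human-readable message, flagging anything unlisted.
function describeExit(code: number): string {
  return PINCENEZ_EXIT_CODES[code] ?? `undocumented exit code ${code}`;
}
```

Since 0 is returned regardless of check results, a caller that cares about failures still has to read pass_rate or the individual verdicts.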

Lint

Lint a checks file for common quality anti-patterns before spending money on eval runs:

pincenez lint checks.yaml
pincenez lint checks.yaml --context "The prompt that produced this output"

Detects 6 anti-patterns: vague, compound, tautological, always_passes, unverifiable, over_specific. Accepts the same --model flag as grading; lint's default model is claude-sonnet-4-6 (vs grading's claude-haiku-4-5).

Composition

# Standalone grading
pincenez checks.yaml output.md > grading.yaml

# Pipe from scuttlerun
scuttlerun session.yaml | pincenez checks.yaml

# CI quality gate
scuttlerun test-scenario.yaml | pincenez checks.yaml | yq -e '.pass_rate >= 0.8'

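
Where yq is not available, the same quality gate can be approximated in TypeScript. This is a sketch under assumptions: `extractPassRate` and `meetsThreshold` are hypothetical helpers, and the line-based scan assumes pass_rate appears as an unindented top-level key, as in the sample output. A real script would more likely use a YAML parser.

```typescript
// Pull the final pass_rate value out of pincenez's YAML output.
// A regex scan keeps the sketch dependency-free; it is not a YAML parser.
function extractPassRate(yamlText: string): number | undefined {
  const match = yamlText.match(/^pass_rate:\s*([0-9]*\.?[0-9]+)\s*$/m);
  return match ? Number(match[1]) : undefined;
}

// Exit code 0 only says the run completed, so the gate compares pass_rate
// itself. A missing pass_rate is treated as a failed gate.
function meetsThreshold(yamlText: string, threshold: number): boolean {
  const rate = extractPassRate(yamlText);
  return rate !== undefined && rate >= threshold;
}
```

A CI step could read grading.yaml, call `meetsThreshold(text, 0.8)`, and exit non-zero when it returns false.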

Development

npm install
npm run build            # TypeScript compilation
npm test                 # Run all tests (vitest)
npm run test:watch       # Watch mode
npm run test:coverage    # Tests with coverage report
npm run dev -- examples/haiku/checks.yaml examples/haiku/transcript.yaml   # Run via tsx

Contributing

CONTRIBUTING.md — Development setup, tests, commit conventions, PR workflow
CODE_OF_CONDUCT.md — Community guidelines
SECURITY.md — Reporting a vulnerability
SUPPORT.md — Where to ask questions and report bugs
CHANGELOG.md — Release history
RELEASING.md — How releases are cut (Conventional Commits → release-please → npm publish)