@tryinget/pi-evalset-lab

v0.2.0

Published

a month ago

pi extension for fixed-task-set eval runs and prompt/system comparisons


pi-package pi-extension evalset llm-eval prompt-evaluation fixed-task-set ux-observability safety-governance review-quality-loops model-prompt-management monorepo

summary: "Overview and quickstart for @tryinget/pi-evalset-lab."
read_when:
  - "Starting work in this package workspace."
  - "Using /evalset run or /evalset compare."
system4d:
  container: "Monorepo package for a pi fixed-task-set evaluation extension."
  compass: "Keep prompt/system comparisons small, reproducible, and easy to inspect."
  engine: "Define dataset -> run or compare variants -> export JSON/HTML report -> review deltas."
  fog: "Model/provider nondeterminism can make brittle checks noisy."

@tryinget/pi-evalset-lab

Monorepo package for fixed-task-set eval workflows in Pi (/evalset run|compare) with reproducible JSON reports and static HTML export.

Workspace path: packages/pi-evalset-lab
Release component key: pi-evalset-lab
Former legacy standalone source: ~/programming/pi-extensions/pi-evalset-lab
Canonical package status: canonicalized here; the legacy repo was archived to ~/programming/pi-extensions/pi-evalset-lab-final-archive.tar.gz and removed after validation.
Session-history migration: no legacy Pi session-history directory existed for the old path, so relocation was recorded as skip-no-history.

Primary category fit: Model & Prompt Management, Review & Quality Loops, UX & Observability, Safety & Governance.

Runtime dependencies and packaged files

This package expects Pi host runtime APIs and declares them as peerDependencies:

@mariozechner/pi-coding-agent
@mariozechner/pi-ai

The npm package uses a files whitelist so required runtime artifacts are explicitly included:

extensions/evalset.ts
prompts/
examples/ (sample datasets + sample report UI)
scripts/export-evalset-report-html.mjs

Quickstart

Install package dependencies for local validation:

cd packages/pi-evalset-lab
npm install
npm run check

Install into Pi from the package directory containing package.json:

pi install /absolute/path/to/pi-extensions/packages/pi-evalset-lab
# then in Pi: /reload

For ad hoc source testing from this package directory:

pi -e ./extensions/evalset.ts

evalset command

/evalset help
/evalset init [dataset-path] [--force]
/evalset run <dataset.json> [--system-file <path>] [--system-text <text>] [--variant <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
/evalset compare <dataset.json> <baseline-system.txt> <candidate-system.txt> [--baseline-name <name>] [--candidate-name <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]

/evalset is a Pi slash command, not a shell executable.
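To make the usage grammar above concrete, here is a minimal sketch of how such a command line could be split into a subcommand, positional arguments, and flags. It is not the extension's actual parser: the names parseEvalset, tokenize, and BOOLEAN_FLAGS are hypothetical, and falling back to "help" when no subcommand is given is an assumption.

```typescript
// Hypothetical sketch; the real extension may parse arguments differently.
type ParsedEvalset = {
  subcommand: string;
  positionals: string[];
  flags: Record<string, string | true>;
};

// Flags that take no value, per the usage lines above.
const BOOLEAN_FLAGS = new Set(["force"]);

// Split on whitespace, keeping double-quoted segments together.
function tokenize(input: string): string[] {
  const tokens: string[] = [];
  const re = /"([^"]*)"|(\S+)/g;
  let m: RegExpExecArray | null;
  while ((m = re.exec(input)) !== null) {
    tokens.push(m[1] !== undefined ? m[1] : m[2]);
  }
  return tokens;
}

function parseEvalset(line: string): ParsedEvalset {
  const tokens = tokenize(line.trim());
  if (tokens[0] !== "/evalset") throw new Error("not an /evalset command");
  const subcommand = tokens[1] ?? "help"; // assumed default
  const positionals: string[] = [];
  const flags: Record<string, string | true> = {};
  for (let i = 2; i < tokens.length; i++) {
    const tok = tokens[i];
    if (!tok.startsWith("--")) {
      positionals.push(tok);
      continue;
    }
    const name = tok.slice(2);
    if (BOOLEAN_FLAGS.has(name)) {
      flags[name] = true;
      continue;
    }
    const value = tokens[i + 1];
    if (value === undefined) throw new Error(`missing value for --${name}`);
    flags[name] = value;
    i++; // skip the consumed value
  }
  return { subcommand, positionals, flags };
}
```

For example, parsing the compare line from the usage above with --max-cases 2 would yield three positionals (dataset, baseline, candidate) and a flags object holding "max-cases" as the string "2".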

Interactive mode:

pi -e ./extensions/evalset.ts
# then inside Pi:
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt

Non-interactive mode:

pi -e ./extensions/evalset.ts -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
# or, if installed/enabled:
pi -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"

Interactive sessions use Pi UI hooks (ctx.ui) for status/notify updates. Non-interactive -p mode skips those UI calls when ctx.hasUI === false.
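The guard described above might look roughly like the following. This is a sketch under assumptions: ctx.ui and ctx.hasUI come from the text, but the EvalContextLike interface, the notify signature, and the reportProgress helper are hypothetical stand-ins, not Pi's actual types.

```typescript
// Hypothetical minimal shape; the real Pi context type is richer and may differ.
interface EvalContextLike {
  hasUI: boolean;
  ui?: { notify(message: string): void };
}

// Send a progress message through the UI when one exists.
// Returns true if the message was delivered, false if it was skipped.
function reportProgress(ctx: EvalContextLike, message: string): boolean {
  if (!ctx.hasUI || !ctx.ui) return false; // headless (-p) run: skip UI calls
  ctx.ui.notify(message);
  return true;
}
```

Routing every UI call through one helper like this keeps the run logic identical in both modes; only the side channel for status updates changes.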

Included datasets and sample output

examples/fixed-task-set.json — tiny smoke set (3 cases)
examples/fixed-task-set-v2.json — larger first pass set
examples/fixed-task-set-v3.json — less brittle checks (recommended)
examples/evalset-compare-sample-embedded.html — self-contained report UI with embedded compare JSON
examples/evalset-compare-sample.png — screenshot preview of that HTML report
examples/system-baseline.txt and examples/system-candidate.txt — compare inputs

Preview: [image: Evalset compare sample screenshot]
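To illustrate what "less brittle checks" can mean when model output varies between runs, the sketch below contrasts an exact-match check with a normalized phrase check. The function names and the check style are hypothetical; the actual check schema used by the included datasets is not shown here and may differ.

```typescript
// Brittle: passes only on a character-for-character match.
function exactMatch(output: string, expected: string): boolean {
  return output === expected;
}

// Lowercase and collapse whitespace runs so formatting noise is ignored.
function normalize(text: string): string {
  return text.toLowerCase().replace(/\s+/g, " ").trim();
}

// Less brittle: every required phrase must appear somewhere in the output.
function containsAll(output: string, phrases: string[]): boolean {
  const haystack = normalize(output);
  return phrases.every((p) => haystack.includes(normalize(p)));
}
```

A reply such as "The  Answer\nis 42." fails an exact match against "42" but passes a phrase check for "answer is 42", which is the kind of tolerance that keeps comparisons from being dominated by wording noise.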

Reports are written to the path given by --out <path> when it is provided; otherwise they are written to .evalset/reports/*.json under the current project directory.
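The output-path rule can be sketched as a small resolver. This is an illustration, not the extension's code: resolveReportPath and compactTimestamp are hypothetical names, and the default file name pattern (kind, dataset base name, UTC timestamp) is an assumption inferred from the placeholder file names used in the export examples.

```typescript
import * as path from "node:path";

type ReportKind = "run" | "compare";

// Format a Date as YYYYMMDDTHHMMSS. Using UTC here is an assumption.
function compactTimestamp(d: Date): string {
  return d.toISOString().replace(/[-:]/g, "").slice(0, 15);
}

function resolveReportPath(opts: {
  out?: string;        // value of --out, if given
  cwd: string;         // current project directory
  kind: ReportKind;    // which subcommand produced the report
  datasetPath: string; // dataset file passed to the command
  now: Date;
}): string {
  // An explicit --out always wins.
  if (opts.out) return path.resolve(opts.cwd, opts.out);
  // Otherwise fall back to .evalset/reports/ with an assumed name pattern.
  const datasetName = path.basename(opts.datasetPath, path.extname(opts.datasetPath));
  const fileName = `${opts.kind}-${datasetName}-${compactTimestamp(opts.now)}.json`;
  return path.join(opts.cwd, ".evalset", "reports", fileName);
}
```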

Each report includes run identity metadata (runId, datasetHash, casesHash, and variant hashes). Session messages keep lightweight report metadata only, not full report bodies.
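One way such identity metadata could be derived is sketched below. The field names runId, datasetHash, and casesHash come from the text; everything else is an assumption for illustration, including the use of SHA-256, a random UUID for runId, key-sorted serialization, a dataset object with a cases array, and the helper names.

```typescript
import { createHash, randomUUID } from "node:crypto";

function sha256Hex(text: string): string {
  return createHash("sha256").update(text, "utf8").digest("hex");
}

// Serialize JSON-compatible values with sorted object keys so that
// logically equal values produce the same string (and the same hash).
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(stableStringify).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(",");
    return `{${body}}`;
  }
  return JSON.stringify(value);
}

// Hypothetical shapes: a dataset with a `cases` array, and variants
// keyed by name with their system prompt text as the value.
function buildRunIdentity(
  dataset: { cases: unknown[] } & Record<string, unknown>,
  variants: Record<string, string>,
) {
  const variantHashes: Record<string, string> = {};
  for (const [name, systemText] of Object.entries(variants)) {
    variantHashes[name] = sha256Hex(systemText);
  }
  return {
    runId: randomUUID(), // unique per run
    datasetHash: sha256Hex(stableStringify(dataset)),
    casesHash: sha256Hex(stableStringify(dataset.cases)),
    variantHashes,
  };
}
```

With content hashes like these, two reports can be compared safely: matching datasetHash and casesHash values indicate the same task set, while differing variant hashes show which prompt changed.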

Export report JSON to static HTML

npm run evalset:export-html -- --in .evalset/reports/compare-your-dataset-YYYYMMDDTHHMMSS.json
# optional:
npm run evalset:export-html -- --in .evalset/reports/run-your-dataset-YYYYMMDDTHHMMSS.json --out .evalset/reports/run-your-dataset.html --title "Evalset run report"

Script: scripts/export-evalset-report-html.mjs

Validation and release checks

Package-local validation:

npm run check
npm run release:check:quick

Monorepo-scoped validation:

cd ../..
bash ./scripts/package-quality-gate.sh ci packages/pi-evalset-lab
node ./scripts/release-components.mjs validate

Release metadata is root-managed through x-pi-template.releaseConfigMode=component and component key pi-evalset-lab.

The scoped package @tryinget/pi-evalset-lab is the canonical npm identity for future releases. The old unscoped pi-evalset-lab package remains historical registry state, not the canonical development target.

Optional core hooks (future, not required)

This extension works today without Pi core changes. Optional hardening could include stable agent-level lineage IDs, explicit reproducibility metadata in pi-ai, shared provider payload hashing, or a headless agent-eval API for tool-heavy/full agent-loop benchmark runs.
