pi-evalset-lab
v0.2.0
pi extension for fixed-task-set eval runs and prompt/system comparisons
---
summary: "Overview and quickstart for pi-evalset-lab."
read_when:
  - "Starting work in this repository."
system4d:
  container: "Repository scaffold for a pi extension package."
  compass: "Ship small, safe, testable extension iterations."
  engine: "Plan -> implement -> verify with docs and hooks in sync."
  fog: "Unknown runtime integration edge cases until first live sync."
---
Extension package for fixed-task-set eval workflows in pi (/evalset run|compare) with reproducible JSON reports.
Primary category fit: Model & Prompt Management, Review & Quality Loops, UX & Observability, Safety & Governance.
Quickstart
Install dependencies (if you add any):
npm install
Test with pi:
pi -e ./extensions/evalset.ts
Install the package into pi:
pi install /absolute/path/to/pi-evalset-lab
Runtime dependencies and packaged files
This extension depends on pi host APIs and declares them as peerDependencies:
- @mariozechner/pi-coding-agent
- @mariozechner/pi-ai
In normal usage, pi provides these at runtime when loading the package.
The npm package also uses a files whitelist so required runtime artifacts are explicitly included:
- extensions/evalset.ts
- prompts/
- examples/ (sample datasets + sample report UI)
Category taxonomy (reference)
Keyword slugs used for extension categorization:
- ux-observability (UX & Observability)
- safety-governance (Safety & Governance)
- context-codebase-mapping (Context & Codebase Mapping)
- web-docs-retrieval (Web & Docs Retrieval)
- background-processes (Background / Long-running Processes)
- review-quality-loops (Review & Quality Loops)
- planning-orchestration (Planning & Orchestration)
- subagents-parallelization (Subagents / Parallelization)
- model-prompt-management (Model & Prompt Management)
- interactive-clis-editors (Interactive CLIs / Editors)
- skills-rules-packs (Skills & Rules Packs)
- paste-code-extraction (Paste / Code Extraction)
evalset command (MVP)
This extension adds /evalset for fixed-task-set evaluation runs.
Commands
/evalset help
/evalset init [dataset-path] [--force]
/evalset run <dataset.json> [--system-file <path>] [--system-text <text>] [--variant <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
/evalset compare <dataset.json> <baseline-system.txt> <candidate-system.txt> [--baseline-name <name>] [--candidate-name <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
Running modes
/evalset is a pi slash command, not a shell executable.
Interactive mode:
pi -e ./extensions/evalset.ts
# then inside pi:
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
Non-interactive mode (scripts/CI):
pi -e ./extensions/evalset.ts -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
# or, if extension already installed/enabled:
pi -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
Interactive sessions use pi UI hooks (ctx.ui) for status/notify updates.
In non-interactive -p mode, those UI calls are safely skipped (ctx.hasUI === false).
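The guard pattern can be sketched as follows. This is a hypothetical minimal slice of the extension context, not the real pi API; the names `Ctx`, `notifyIfInteractive`, and the `ui.notify` signature are illustrative:

```typescript
// Hypothetical minimal slice of the pi extension context (assumption:
// the host exposes a hasUI flag and an optional ui object).
interface Ctx {
  hasUI: boolean;
  ui?: { notify(message: string): void };
}

// Only touch UI hooks when a UI is actually attached; in `pi -p` runs
// hasUI is false and the call becomes a no-op.
function notifyIfInteractive(ctx: Ctx, message: string): string | undefined {
  if (ctx.hasUI && ctx.ui) {
    ctx.ui.notify(message);
    return message;
  }
  return undefined;
}
```

The same check works for status updates or any other interactive-only behavior.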
Example workflow (inside pi)
/evalset run examples/fixed-task-set.json --variant baseline
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
Included datasets
- examples/fixed-task-set.json — tiny smoke set (3 cases)
- examples/fixed-task-set-v2.json — larger first-pass set
- examples/fixed-task-set-v3.json — less brittle checks (recommended)
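For orientation, a case in such a dataset might look like the following. The shape is hypothetical, inferred from the command flags above; the authoritative schema is whatever `/evalset init` scaffolds:

```typescript
// Hypothetical eval-case shape; field names are illustrative, not the
// extension's actual schema.
interface EvalCase {
  id: string;
  prompt: string;
  // Substring checks are less brittle than exact-match grading.
  expectSubstrings: string[];
}

const smokeCase: EvalCase = {
  id: "hello-001",
  prompt: "Say hello in one word.",
  expectSubstrings: ["hello"],
};

// A deliberately simple grader over a model answer.
function passes(c: EvalCase, answer: string): boolean {
  const lower = answer.toLowerCase();
  return c.expectSubstrings.every((s) => lower.includes(s.toLowerCase()));
}
```

Containment-style checks like this are what "less brittle" means in practice for the v3 set.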
Sample visual output (in repo)
- examples/evalset-compare-sample-embedded.html — self-contained report UI with embedded compare JSON
- examples/evalset-compare-sample.png — screenshot preview of that HTML report
Preview: see examples/evalset-compare-sample.png in the repo.
The command writes JSON reports to:
- the explicit --out <path> when provided
- otherwise .evalset/reports/*.json under your current project directory
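That fallback logic can be sketched like this (the `run-<stamp>.json` naming is an assumption; the extension's actual file-name pattern may differ):

```typescript
import * as path from "node:path";

// Use the explicit --out path when given, otherwise fall back to
// .evalset/reports/ under the current project directory.
function resolveReportPath(projectDir: string, stamp: string, out?: string): string {
  if (out) return out;
  return path.join(projectDir, ".evalset", "reports", `run-${stamp}.json`);
}
```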
Each report includes run identity metadata:
- runId
- datasetHash
- casesHash
- variantHash (run) or baseline/candidate variant hashes (compare)
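One plausible way to derive such identity hashes is a SHA-256 over canonically serialized JSON, so that semantically equal datasets hash equally regardless of key order. This is a sketch; the extension's actual hashing scheme is not documented here:

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so key order cannot change the hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function contentHash(value: unknown): string {
  return createHash("sha256").update(canonicalJson(value)).digest("hex");
}
```

Stable hashes like these are what make reports comparable across runs of the same dataset.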
Session messages only keep lightweight report metadata (reportPath, ids, summary metrics), not full report bodies.
Export report JSON to static HTML
Use the helper script to create a shareable standalone HTML file from any evalset JSON report:
npm run evalset:export-html -- --in .evalset/reports/compare-your-dataset-YYYYMMDDTHHMMSS.json
# optional:
npm run evalset:export-html -- --in .evalset/reports/run-your-dataset-YYYYMMDDTHHMMSS.json --out .evalset/reports/run-your-dataset.html --title "Evalset run report"
Script: scripts/export-evalset-report-html.mjs
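The core of such an export is embedding the report JSON into a standalone HTML page. A minimal sketch, not the actual scripts/export-evalset-report-html.mjs (the title is assumed to be pre-escaped):

```typescript
// Escape "</" in the JSON so embedded content cannot terminate the
// <script> tag early.
function embedReportHtml(title: string, report: unknown): string {
  const json = JSON.stringify(report).replace(/<\//g, "<\\/");
  return [
    "<!doctype html>",
    `<html><head><meta charset="utf-8"><title>${title}</title></head>`,
    '<body><pre id="report"></pre>',
    `<script>const REPORT = ${json};`,
    "document.getElementById('report').textContent = JSON.stringify(REPORT, null, 2);</script>",
    "</body></html>",
  ].join("\n");
}
```

The real script presumably renders the report UI rather than a raw `<pre>`, but the embedding concern is the same.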
Optional core hooks (future, not required for this MVP)
This extension works today without core changes. If we decide to harden further, optional core support could include:
- Stable agent-level lineage IDs (runId/traceId) across extension events.
- Explicit reproducibility capability metadata in pi-ai (e.g. seed support and determinism caveats per provider/model).
- Shared canonical provider payload hash helper in pi-ai.
- A headless agent-eval API for tool-heavy/full agent-loop benchmark runs.
Repository checks
Run:
npm run check
This executes scripts/validate-structure.sh.
Release + security baseline
This scaffold defaults to release-please for single-package release PR + tag flow (vX.Y.Z) and npm trusted publishing via OIDC.
Included files:
- CI workflow
- release-please workflow
- publish workflow
- Dependabot config
- CODEOWNERS
- release-please config
- release-please manifest
- Security policy
Before first production release:
- Confirm/adjust owners in .github/CODEOWNERS.
- Enable branch protection on main.
- Configure npm Trusted Publishing for this repo + publish workflow.
- Merge release PR from release-please, then publish from GitHub release.
Issue + PR intake baseline
Included files:
- Bug report form
- Feature request form
- Docs request form
- Issue template config
- PR template
- Code of conduct
- Support guide
- Top-level contributing guide
Vouch trust gate baseline
Included files:
Default behavior:
- PR workflow runs on pull_request_target (opened, reopened).
- require-vouch: true and auto-close: true are enabled by default.
- Maintainers can comment vouch, denounce, or unvouch on issues to update trust state.
- Vouch actions are SHA pinned (0e11a71bba23218a284d3ecca162e75a110fd7e3) for reproducibility and supply-chain review.
Bootstrap step:
- Confirm/adjust entries in .github/VOUCHED.td before enforcing production policy.
Docs discovery
Run:
npm run docs:list
npm run docs:list:workspace
npm run docs:list:json
Wrapper script: scripts/docs-list.sh
Resolution order:
1. DOCS_LIST_SCRIPT (environment variable)
2. ./scripts/docs-list.mjs (if vendored)
3. ~/ai-society/core/agent-scripts/scripts/docs-list.mjs
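The resolution order above is a first-match scan. A sketch of the idea (the real logic lives in scripts/docs-list.sh; the `exists` predicate is injected here so the function stays testable):

```typescript
// Return the first candidate script path that exists; the environment
// override wins, then the vendored copy, then the shared checkout.
function resolveDocsListScript(
  env: Record<string, string | undefined>,
  exists: (p: string) => boolean,
): string | undefined {
  const candidates = [
    env["DOCS_LIST_SCRIPT"],
    "./scripts/docs-list.mjs",
    // Assumed expansion of ~ to the user's home directory.
    `${env["HOME"] ?? ""}/ai-society/core/agent-scripts/scripts/docs-list.mjs`,
  ];
  return candidates.find((c): c is string => !!c && exists(c));
}
```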
Copier lifecycle policy
- Keep .copier-answers.yml committed.
- Do not edit .copier-answers.yml manually.
- Run from a clean destination repo (commit or stash pending changes first).
- Use copier update --trust when .copier-answers.yml includes _commit and update is supported.
- In non-interactive shells/CI, append --defaults to update/recopy.
- Use copier recopy --trust when update is unavailable (for example, a local non-VCS source) or cannot reconcile cleanly.
- After recopy, re-apply local deltas intentionally and run npm run check.
Hook behavior
- Git uses .githooks/pre-commit (configured by scripts/install-hooks.sh).
- If prek is available, the hook runs prek using prek.toml.
- If prek is not available, the hook falls back to scripts/validate-structure.sh.
Install options for prek:
npm add -D @j178/prek
# or
npm install -g @j178/prek
Startup interview flow (project-local)
- .pi/extensions/startup-intake-router.ts watches the first non-command message in a session.
- It converts your startup intent into a prefilled command:
/init-project-docs "<your intent>"
- .pi/prompts/init-project-docs.md then drives the interview tool using docs/org/project-docs-intake.questions.json.
Utility commands:
- /startup-intake-router-status
- /startup-intake-router-reset
Live sync helper
Use scripts/sync-to-live.sh to copy the package extension to ~/.pi/agent/extensions/.
Optional flags:
- --with-prompts
- --with-policy
- --all (prompts + policy)
After sync, run /reload in pi.
