pi-evalset-lab
v0.2.0
pi extension for fixed-task-set eval runs and prompt/system comparisons
---
summary: "Overview and quickstart for pi-evalset-lab."
read_when:
  - "Starting work in this repository."
system4d:
  container: "Repository scaffold for a pi extension package."
  compass: "Ship small, safe, testable extension iterations."
  engine: "Plan -> implement -> verify with docs and hooks in sync."
  fog: "Unknown runtime integration edge cases until first live sync."
---
Extension package for fixed-task-set eval workflows in pi (/evalset run|compare) with reproducible JSON reports.
Primary category fit: Model & Prompt Management, Review & Quality Loops, UX & Observability, Safety & Governance.
Quickstart
Install dependencies (if you add any):
npm install
Test with pi:
pi -e ./extensions/evalset.ts
Install the package into pi:
pi install /absolute/path/to/pi-evalset-lab
Runtime dependencies and packaged files
This extension depends on pi host APIs and declares them as peerDependencies:
- @mariozechner/pi-coding-agent
- @mariozechner/pi-ai
In normal usage, pi provides these at runtime when loading the package.
The npm package also uses a files whitelist so required runtime artifacts are explicitly included:
- extensions/evalset.ts
- prompts/
- examples/ (sample datasets + sample report UI)
Category taxonomy (reference)
Keyword slugs used for extension categorization:
- ux-observability (UX & Observability)
- safety-governance (Safety & Governance)
- context-codebase-mapping (Context & Codebase Mapping)
- web-docs-retrieval (Web & Docs Retrieval)
- background-processes (Background / Long-running Processes)
- review-quality-loops (Review & Quality Loops)
- planning-orchestration (Planning & Orchestration)
- subagents-parallelization (Subagents / Parallelization)
- model-prompt-management (Model & Prompt Management)
- interactive-clis-editors (Interactive CLIs / Editors)
- skills-rules-packs (Skills & Rules Packs)
- paste-code-extraction (Paste / Code Extraction)
evalset command (MVP)
This extension adds /evalset for fixed-task-set evaluation runs.
Commands
/evalset help
/evalset init [dataset-path] [--force]
/evalset run <dataset.json> [--system-file <path>] [--system-text <text>] [--variant <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
/evalset compare <dataset.json> <baseline-system.txt> <candidate-system.txt> [--baseline-name <name>] [--candidate-name <name>] [--max-cases <n>] [--temperature <n>] [--out <report.json>]
Running modes
/evalset is a pi slash command, not a shell executable.
Interactive mode:
pi -e ./extensions/evalset.ts
# then inside pi:
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
Non-interactive mode (scripts/CI):
pi -e ./extensions/evalset.ts -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
# or, if extension already installed/enabled:
pi -p "/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt"
Interactive sessions use pi UI hooks (ctx.ui) for status/notify updates.
In non-interactive -p mode, those UI calls are safely skipped (ctx.hasUI === false).
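The guard pattern can be sketched as follows. This is a hypothetical minimal slice of the extension context, not the real pi API; the names `Ctx`, `notifyIfInteractive`, and the `ui.notify` signature are illustrative:

```typescript
// Hypothetical minimal slice of the pi extension context (assumption:
// the host exposes a hasUI flag and an optional ui object).
interface Ctx {
  hasUI: boolean;
  ui?: { notify(message: string): void };
}

// Only touch UI hooks when a UI is actually attached; in `pi -p` runs
// hasUI is false and the call becomes a no-op.
function notifyIfInteractive(ctx: Ctx, message: string): string | undefined {
  if (ctx.hasUI && ctx.ui) {
    ctx.ui.notify(message);
    return message;
  }
  return undefined;
}
```

The same check works for status updates or any other interactive-only behavior.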
Example workflow (inside pi)
/evalset run examples/fixed-task-set.json --variant baseline
/evalset compare examples/fixed-task-set.json examples/system-baseline.txt examples/system-candidate.txt
Included datasets
- examples/fixed-task-set.json — tiny smoke set (3 cases)
- examples/fixed-task-set-v2.json — larger first-pass set
- examples/fixed-task-set-v3.json — less brittle checks (recommended)
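For orientation, a case in such a dataset might look like the following. The shape is hypothetical, inferred from the command flags above; the authoritative schema is whatever `/evalset init` scaffolds:

```typescript
// Hypothetical eval-case shape; field names are illustrative, not the
// extension's actual schema.
interface EvalCase {
  id: string;
  prompt: string;
  // Substring checks are less brittle than exact-match grading.
  expectSubstrings: string[];
}

const smokeCase: EvalCase = {
  id: "hello-001",
  prompt: "Say hello in one word.",
  expectSubstrings: ["hello"],
};

// A deliberately simple grader over a model answer.
function passes(c: EvalCase, answer: string): boolean {
  const lower = answer.toLowerCase();
  return c.expectSubstrings.every((s) => lower.includes(s.toLowerCase()));
}
```

Containment-style checks like this are what "less brittle" means in practice for the v3 set.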
Sample visual output (in repo)
- examples/evalset-compare-sample-embedded.html — self-contained report UI with embedded compare JSON
- examples/evalset-compare-sample.png — screenshot preview of that HTML report
Preview: see examples/evalset-compare-sample.png in the repo.
The command writes JSON reports to:
- the explicit --out <path> when provided
- otherwise .evalset/reports/*.json under your current project directory
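That fallback logic can be sketched like this (the `run-<stamp>.json` naming is an assumption; the extension's actual file-name pattern may differ):

```typescript
import * as path from "node:path";

// Use the explicit --out path when given, otherwise fall back to
// .evalset/reports/ under the current project directory.
function resolveReportPath(projectDir: string, stamp: string, out?: string): string {
  if (out) return out;
  return path.join(projectDir, ".evalset", "reports", `run-${stamp}.json`);
}
```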
Each report includes run identity metadata:
- runId
- datasetHash
- casesHash
- variantHash (run) or baseline/candidate variant hashes (compare)
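One plausible way to derive such identity hashes is a SHA-256 over canonically serialized JSON, so that semantically equal datasets hash equally regardless of key order. This is a sketch; the extension's actual hashing scheme is not documented here:

```typescript
import { createHash } from "node:crypto";

// Serialize with sorted object keys so key order cannot change the hash.
function canonicalJson(value: unknown): string {
  if (Array.isArray(value)) return `[${value.map(canonicalJson).join(",")}]`;
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const entries = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${canonicalJson(obj[k])}`);
    return `{${entries.join(",")}}`;
  }
  return JSON.stringify(value);
}

function contentHash(value: unknown): string {
  return createHash("sha256").update(canonicalJson(value)).digest("hex");
}
```

Stable hashes like these are what make reports comparable across runs of the same dataset.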
Session messages only keep lightweight report metadata (reportPath, ids, summary metrics), not full report bodies.
Export report JSON to static HTML
Use the helper script to create a shareable standalone HTML file from any evalset JSON report:
npm run evalset:export-html -- --in .evalset/reports/compare-your-dataset-YYYYMMDDTHHMMSS.json
# optional:
npm run evalset:export-html -- --in .evalset/reports/run-your-dataset-YYYYMMDDTHHMMSS.json --out .evalset/reports/run-your-dataset.html --title "Evalset run report"
Script: scripts/export-evalset-report-html.mjs
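The core of such an export is embedding the report JSON into a standalone HTML page. A minimal sketch, not the actual scripts/export-evalset-report-html.mjs (the title is assumed to be pre-escaped):

```typescript
// Escape "</" in the JSON so embedded content cannot terminate the
// <script> tag early.
function embedReportHtml(title: string, report: unknown): string {
  const json = JSON.stringify(report).replace(/<\//g, "<\\/");
  return [
    "<!doctype html>",
    `<html><head><meta charset="utf-8"><title>${title}</title></head>`,
    '<body><pre id="report"></pre>',
    `<script>const REPORT = ${json};`,
    "document.getElementById('report').textContent = JSON.stringify(REPORT, null, 2);</script>",
    "</body></html>",
  ].join("\n");
}
```

The real script presumably renders the report UI rather than a raw `<pre>`, but the embedding concern is the same.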
Optional core hooks (future, not required for this MVP)
This extension works today without core changes. If we decide to harden further, optional core support could include:
- Stable agent-level lineage IDs (runId/traceId) across extension events.
- Explicit reproducibility capability metadata in pi-ai (e.g. seed support and determinism caveats per provider/model).
- Shared canonical provider payload hash helper in pi-ai.
- A headless agent-eval API for tool-heavy/full agent-loop benchmark runs.
Repository checks
Run:
npm run check
This executes scripts/validate-structure.sh.
Release + security baseline
This scaffold defaults to release-please for single-package release PR + tag flow (vX.Y.Z) and npm trusted publishing via OIDC.
Included files:
- CI workflow
- release-please workflow
- publish workflow
- Dependabot config
- CODEOWNERS
- release-please config
- release-please manifest
- Security policy
Before first production release:
- Confirm/adjust owners in .github/CODEOWNERS.
- Enable branch protection on main.
- Configure npm Trusted Publishing for this repo + publish workflow.
- Merge release PR from release-please, then publish from GitHub release.
Issue + PR intake baseline
Included files:
- Bug report form
- Feature request form
- Docs request form
- Issue template config
- PR template
- Code of conduct
- Support guide
- Top-level contributing guide
Vouch trust gate baseline
Included files:
Default behavior:
- PR workflow runs on pull_request_target (opened, reopened).
- require-vouch: true and auto-close: true are enabled by default.
- Maintainers can comment vouch, denounce, or unvouch on issues to update trust state.
- Vouch actions are SHA pinned (0e11a71bba23218a284d3ecca162e75a110fd7e3) for reproducibility and supply-chain review.
Bootstrap step:
- Confirm/adjust entries in .github/VOUCHED.td before enforcing production policy.
Docs discovery
Run:
npm run docs:list
npm run docs:list:workspace
npm run docs:list:json
Wrapper script: scripts/docs-list.sh
Resolution order:
1. DOCS_LIST_SCRIPT (environment variable)
2. ./scripts/docs-list.mjs (if vendored)
3. ~/ai-society/core/agent-scripts/scripts/docs-list.mjs
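The resolution order above is a first-match scan. A sketch of the idea (the real logic lives in scripts/docs-list.sh; the `exists` predicate is injected here so the function stays testable):

```typescript
// Return the first candidate script path that exists; the environment
// override wins, then the vendored copy, then the shared checkout.
function resolveDocsListScript(
  env: Record<string, string | undefined>,
  exists: (p: string) => boolean,
): string | undefined {
  const candidates = [
    env["DOCS_LIST_SCRIPT"],
    "./scripts/docs-list.mjs",
    // Assumed expansion of ~ to the user's home directory.
    `${env["HOME"] ?? ""}/ai-society/core/agent-scripts/scripts/docs-list.mjs`,
  ];
  return candidates.find((c): c is string => !!c && exists(c));
}
```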
Copier lifecycle policy
- Keep .copier-answers.yml committed.
- Do not edit .copier-answers.yml manually.
- Run from a clean destination repo (commit or stash pending changes first).
- Use copier update --trust when .copier-answers.yml includes _commit and update is supported.
- In non-interactive shells/CI, append --defaults to update/recopy.
- Use copier recopy --trust when update is unavailable (for example, a local non-VCS source) or cannot reconcile cleanly.
- After recopy, re-apply local deltas intentionally and run npm run check.
Hook behavior
- Git uses .githooks/pre-commit (configured by scripts/install-hooks.sh).
- If prek is available, the hook runs prek using prek.toml.
- If prek is not available, the hook falls back to scripts/validate-structure.sh.
Install options for prek:
npm add -D @j178/prek
# or
npm install -g @j178/prek
Startup interview flow (project-local)
- .pi/extensions/startup-intake-router.ts watches the first non-command message in a session.
- It converts your startup intent into a prefilled command:
/init-project-docs "<your intent>"
- .pi/prompts/init-project-docs.md then drives the interview tool using docs/org/project-docs-intake.questions.json.
Utility commands:
- /startup-intake-router-status
- /startup-intake-router-reset
Live sync helper
Use scripts/sync-to-live.sh to copy the package extension to ~/.pi/agent/extensions/.
Optional flags:
- --with-prompts
- --with-policy
- --all (prompts + policy)
After sync, run /reload in pi.
