pi-llm-as-verifier
v0.2.2

Pi skill + extension for llm-as-verifier style pairwise, repeated, criteria-decomposed candidate selection.
Pi package for llm-as-verifier style selection and auditing.
It bundles:

- a Pi skill: `llm-as-verifier`
- a Pi extension tool: `llm_as_verifier`
- reusable prompt templates for common verifier workflows
Install

```
pi install npm:pi-llm-as-verifier
```

Or test without installing globally:

```
pi -e npm:pi-llm-as-verifier
```

What it does
This package helps Pi choose among multiple candidate artifacts using:
- pairwise comparison
- criteria decomposition
- repeated verification
- round-robin winner selection
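The steps above can be sketched roughly as follows. This is a minimal illustration, not the extension's actual implementation: `judge` is a hypothetical callback standing in for a single LLM comparison call, and the real tool's scheduling, criteria handling, and tie-breaking may differ.

```python
from itertools import combinations
from collections import Counter

def select_winner(candidates, judge, n_verifications=3):
    """Round-robin pairwise selection: compare every candidate pair
    n_verifications times, tally wins, and return the top scorer.
    `judge(a, b)` returns the id of the preferred candidate."""
    wins = Counter({c["id"]: 0 for c in candidates})
    for a, b in combinations(candidates, 2):
        for _ in range(n_verifications):
            wins[judge(a, b)] += 1
    return wins.most_common(1)[0][0]

# Toy judge that prefers the longer content (illustration only; the
# real verifier would prompt an LLM with the task and criteria).
judge = lambda a, b: a["id"] if len(a["content"]) >= len(b["content"]) else b["id"]
cands = [{"id": "patch-a", "content": "short"},
         {"id": "patch-b", "content": "much longer patch"}]
print(select_winner(cands, judge))  # patch-b
```

Repeating each comparison smooths out single-call noise, which is the point of the repeated-verification step.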
It supports three backends:
- `gemini-python` - a Python runner inspired by the upstream paper/repo
- `zai-coding-plan` - a single ZAI model through Pi's model registry
- `pi-model-ensemble` - multiple Pi models rotated across repeated attempts
Tool usage
Use the `llm_as_verifier` tool with:

- `task`
- `candidates`
- `criteria` (optional)
- `context` (optional)
- `evidencePaths` (optional)
- `outputPath`
Multi-model repeated attempts
For mixed-model verification, use:
```
backend: "pi-model-ensemble"
models: ["openai:gpt-5.4", "google:gemini-2.5-flash", "minimax:MiniMax-M2.7-highspeed"]
```
If nVerifications is omitted in ensemble mode, it defaults to the number of configured verifier models so each model gets one pass.
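That defaulting and rotation behavior can be sketched as below; `plan_attempts` is a hypothetical helper, not part of the package's API:

```python
def plan_attempts(models, n_verifications=None):
    """If n_verifications is omitted, default to one pass per configured
    verifier model; otherwise rotate models round-robin across attempts."""
    n = n_verifications if n_verifications is not None else len(models)
    return [models[i % len(models)] for i in range(n)]

models = ["openai:gpt-5.4", "google:gemini-2.5-flash", "minimax:MiniMax-M2.7-highspeed"]
plan_attempts(models)     # three attempts, one per model
plan_attempts(models, 5)  # rotation wraps back to the first models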
Weighted voting by model
For ensemble runs, you can bias some verifier models more strongly:
```json
{
  "backend": "pi-model-ensemble",
  "models": [
    "openai:gpt-5.4",
    "google:gemini-2.5-flash",
    "minimax:MiniMax-M2.7-highspeed"
  ],
  "modelWeights": [
    { "model": "openai:gpt-5.4", "weight": 1.5 },
    { "model": "google:gemini-2.5-flash", "weight": 1.0 },
    { "model": "minimax:MiniMax-M2.7-highspeed", "weight": 0.8 }
  ]
}
```

Confidence reporting
Ensemble and ZAI-backed runs return richer breakdowns in the `details` field, including:
- criterion confidence
- pairwise confidence
- disagreement scores
- per-model breakdowns
- weighted model metadata
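A weighted vote and a simple disagreement score can be computed along these lines. This is a sketch under assumptions: `weighted_tally` is a hypothetical helper, and the package's actual confidence and disagreement formulas may differ.

```python
from collections import defaultdict

def weighted_tally(votes, weights):
    """votes: list of (model, candidate_id) pairs; weights: model -> weight.
    Returns the winner, per-candidate weighted scores, and a disagreement
    score defined here as 1 minus the winner's share of total weight."""
    scores = defaultdict(float)
    for model, cand in votes:
        scores[cand] += weights.get(model, 1.0)  # unlisted models count as 1.0
    total = sum(scores.values())
    winner = max(scores, key=scores.get)
    disagreement = 1.0 - scores[winner] / total
    return winner, dict(scores), disagreement

votes = [("openai:gpt-5.4", "patch-a"),
         ("google:gemini-2.5-flash", "patch-a"),
         ("minimax:MiniMax-M2.7-highspeed", "patch-b")]
weights = {"openai:gpt-5.4": 1.5,
           "google:gemini-2.5-flash": 1.0,
           "minimax:MiniMax-M2.7-highspeed": 0.8}
winner, scores, d = weighted_tally(votes, weights)
# winner == "patch-a" with a weighted score of 2.5 out of 3.3
```

A disagreement near 0 means the verifiers converged; a value near the losing share of weight signals a split worth auditing manually.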
Example
```json
{
  "backend": "pi-model-ensemble",
  "task": "Choose the strongest patch for the bug fix.",
  "models": [
    "openai:gpt-5.4",
    "google:gemini-2.5-flash",
    "minimax:MiniMax-M2.7-highspeed"
  ],
  "modelWeights": [
    { "model": "openai:gpt-5.4", "weight": 1.3 },
    { "model": "google:gemini-2.5-flash", "weight": 1.0 },
    { "model": "minimax:MiniMax-M2.7-highspeed", "weight": 0.9 }
  ],
  "candidates": [
    {
      "id": "patch-a",
      "content": "..."
    },
    {
      "id": "patch-b",
      "content": "..."
    }
  ],
  "criteria": [
    {
      "name": "Correctness",
      "description": "Check whether the patch directly fixes the requested behavior."
    },
    {
      "name": "Requirements adherence",
      "description": "Check whether exact task constraints are satisfied."
    },
    {
      "name": "Empirical verification",
      "description": "Check whether the candidate is supported by concrete test or runtime evidence."
    }
  ]
}
```

Prompt templates
This package also ships prompt templates:
- `/compare-patches`
- `/audit-candidate`
- `/ensemble-verifier`
These expand into ready-made instructions for common verifier workflows.
Auth and setup
Gemini Python backend
Install:
```
pip install google-genai
```

Provide one of:

- `GEMINI_API_KEY`
- `GOOGLE_API_KEY`
- `VERTEX_API_KEY`
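A key-resolution helper might look like the following. This is an illustrative sketch: `resolve_gemini_key` is hypothetical, and the precedence order shown is an assumption, not documented behavior of this package.

```python
import os

def resolve_gemini_key():
    """Return the first configured key among the supported environment
    variables (precedence order here is an assumption)."""
    for var in ("GEMINI_API_KEY", "GOOGLE_API_KEY", "VERTEX_API_KEY"):
        key = os.environ.get(var)
        if key:
            return key
    raise RuntimeError("no Gemini API key configured")
```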
Pi registry backends
For `zai-coding-plan` and `pi-model-ensemble`, configure model auth in Pi for whichever providers you want to use.
Smoke tests
Python-runner smoke test:
```
/lav-smoke
```

Weighted ensemble smoke test:

```
/lav-ensemble-smoke
```

Package contents

- .pi/extensions/llm-as-verifier/index.ts
- .agents/skills/llm-as-verifier/SKILL.md
- .agents/skills/llm-as-verifier/scripts/lav_runner.py
- .agents/skills/llm-as-verifier/examples/code-patch-selection.json
- .agents/skills/llm-as-verifier/examples/weighted-ensemble-selection.json
- prompts/*.md (bundled references and examples)
