Vigilo Bench
Benchmark system for measuring Vigilo audit accuracy against verified security reports (ground truth) from Code4rena, Cantina, and Sherlock.
Overview
┌────────────────────────────────────────────────────────────┐
│                   Vigilo Bench Pipeline                    │
├────────────────────────────────────────────────────────────┤
│   checkout  →   audit     →   score     →   report         │
│                                                            │
│   ScaBench      Vigilo        LLM           Markdown       │
│   Dataset       Audit         Scoring       Report         │
└────────────────────────────────────────────────────────────┘
Scoring Algorithm: Based on Nethermind AuditAgent
- Iterates through Ground Truth vulnerabilities one by one
- Compares each vulnerability against batches of Vigilo findings via LLM
- Uses 3 iterations + majority voting for reliable results (see the sketch below)
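The matching loop is easiest to picture in code. The sketch below is a hypothetical TypeScript outline of that flow, not the actual llm-scorer.ts API: matchWithLLM and the type names are assumptions, and the defaults (3 iterations, batches of 10 findings) come from the score options documented later in this README.

```ts
// Hypothetical sketch of the batch + majority-vote matching flow described above.
// matchWithLLM() and the type names are assumptions, not the real llm-scorer.ts API.
type MatchVerdict = "exact" | "partial" | "none";

interface GroundTruthEntry { id: string; title: string; description: string }
interface Finding { id: string; title: string; description: string }

// Asks the LLM whether `truth` is covered by any finding in `batch`.
declare function matchWithLLM(truth: GroundTruthEntry, batch: Finding[]): Promise<MatchVerdict>;

function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = [];
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size));
  return out;
}

// Score one ground-truth vulnerability against all Vigilo findings.
async function scoreOne(
  truth: GroundTruthEntry,
  findings: Finding[],
  iterations = 3, // --iterations
  batchSize = 10, // --batch-size
): Promise<MatchVerdict> {
  const votes: MatchVerdict[] = [];
  for (let i = 0; i < iterations; i++) {
    let best: MatchVerdict = "none";
    for (const batch of chunk(findings, batchSize)) {
      const verdict = await matchWithLLM(truth, batch);
      if (verdict === "exact") { best = "exact"; break; }
      if (verdict === "partial") best = "partial";
    }
    votes.push(best);
  }
  // Majority vote across iterations: exact wins only with a strict majority,
  // otherwise an exact/partial majority counts as partial.
  const count = (v: MatchVerdict) => votes.filter((x) => x === v).length;
  if (count("exact") * 2 > iterations) return "exact";
  if ((count("exact") + count("partial")) * 2 > iterations) return "partial";
  return "none";
}
```

Iterating ground truth first (rather than findings) means every missed vulnerability is counted exactly once, no matter how many extra findings Vigilo produces.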
Prerequisites
- Bun (all commands are run with bun/bunx)
- OpenCode with Vigilo installed (verify with bunx vigilo doctor)
- API access for the scoring model (anthropic/claude-opus-4-5 by default; override with BENCH_MODEL)
Installation
cd packages/bench
bun install
Quick Start
One command runs the full pipeline:
bun bench <contest-id> [options]
Examples:
# Full pipeline with watch mode (see audit in OpenCode TUI)
bun bench sherlock_cork-protocol_2025_01 -w -v
# Headless mode (automated)
bun bench code4rena_loopfi_2025_02
# Skip audit (use existing .vigilo/)
bun bench code4rena_loopfi_2025_02 --skip-audit -v
Options:
| Flag | Description |
|------|-------------|
| -v, --verbose | Show detailed output (LLM responses, batch processing) |
| -w, --watch | Open OpenCode TUI to watch audit progress |
| --skip-audit | Skip audit step (use existing .vigilo/) |
Pipeline Steps
When you run bun bench <contest-id>, the following steps run in order (see the sketch after this list):
- Checkout - Clone source code from ScaBench dataset + extract ground truth
- Audit - Run Vigilo audit (headless or watch mode)
- Score - Compare findings against ground truth using LLM
- Report - Generate markdown benchmark report
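Conceptually, the pipeline command just chains those four steps. The sketch below is a simplified, hypothetical outline; the step functions and the options shape are illustrative, not the actual exports of commands/pipeline.ts.

```ts
// Hypothetical outline of the full pipeline; function names and option shape are illustrative.
interface PipelineOptions { watch?: boolean; verbose?: boolean; skipAudit?: boolean }

declare function checkout(contestId: string): Promise<void>;              // clone sources + extract ground truth
declare function audit(contestId: string, watch: boolean): Promise<void>; // run the Vigilo audit
declare function score(contestId: string, verbose: boolean): Promise<void>; // LLM scoring vs ground truth
declare function report(contestId: string): Promise<void>;                // write the markdown report

export async function runPipeline(contestId: string, opts: PipelineOptions = {}): Promise<void> {
  await checkout(contestId);
  if (!opts.skipAudit) {
    // --skip-audit reuses the findings already present in .vigilo/
    await audit(contestId, opts.watch ?? false);
  }
  await score(contestId, opts.verbose ?? false);
  await report(contestId);
}
```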
Individual Commands
For manual control, you can run each step separately:
checkout <contest-id>
Clone contest source code and extract ground truth.
bun bench checkout code4rena_loopfi_2025_02
score <contest-id>
Score findings against ground truth.
bun bench score code4rena_loopfi_2025_02 -v
bun bench score code4rena_loopfi_2025_02 --iterations 5 --batch-size 5
Options:
- -v, --verbose - Detailed logging
- --iterations <n> - LLM iterations for majority voting (default: 3)
- --batch-size <n> - Findings per batch (default: 10)
Environment Variables:
- BENCH_MODEL - Model to use (default: anthropic/claude-opus-4-5)
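Taken together, the flags and BENCH_MODEL resolve into a small scorer configuration. The shape below is hypothetical; only the defaults (3 iterations, batch size 10, anthropic/claude-opus-4-5) come from the options listed above.

```ts
// Hypothetical config resolution; only the default values are documented above.
interface ScorerConfig { model: string; iterations: number; batchSize: number; verbose: boolean }

function resolveScorerConfig(flags: { iterations?: number; batchSize?: number; verbose?: boolean }): ScorerConfig {
  return {
    model: process.env.BENCH_MODEL ?? "anthropic/claude-opus-4-5", // BENCH_MODEL overrides the default
    iterations: flags.iterations ?? 3, // --iterations <n>
    batchSize: flags.batchSize ?? 10,  // --batch-size <n>
    verbose: flags.verbose ?? false,   // -v, --verbose
  };
}
```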
report
Generate markdown reports.
bun bench report --contest code4rena_loopfi_2025_02
bun bench report --all
Data Structure
packages/bench/
├── src/
│   ├── cli.ts
│   ├── client/
│   │   └── opencode.ts
│   ├── commands/
│   │   ├── pipeline.ts
│   │   ├── checkout.ts
│   │   ├── score.ts
│   │   └── report.ts
│   ├── scorer/
│   │   ├── llm-scorer.ts
│   │   └── prompts.ts
│   └── parsers/
│       └── vigilo-findings.ts
└── data/
    ├── dataset.json          # ScaBench dataset (31 contests)
    ├── baselines/            # GPT-5 baseline results
    ├── sources/              # Cloned source code + .vigilo/
    │   └── {contest-id}/
    │       └── .vigilo/findings/
    ├── truth/                # Ground truth JSON
    │   └── {contest-id}.json
    ├── scores/               # Scoring results
    │   └── {contest-id}/
    │       └── {timestamp}.json
    └── reports/              # Generated markdown reports
        └── {contest-id}.md
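The JSON layouts are not spelled out in this README. As a rough orientation, the hypothetical shapes below are inferred only from fields mentioned here (project_id and vulnerabilities in dataset.json, and the metrics shown in the example output); check the files in data/ for the authoritative structure.

```ts
// Hypothetical shapes inferred from this README; not an authoritative schema.
interface DatasetEntry {
  project_id: string;         // e.g. "code4rena_loopfi_2025_02"
  vulnerabilities: unknown[]; // ground-truth vulnerabilities for the contest
  // ...additional fields omitted
}

interface ScoreSummary {
  exactMatches: number;
  partialMatches: number;
  missed: number;
  falsePositives: number;
  detectionRate: number;      // exact / total ground truth
  precision: number;
  f1: number;
  severityWeighted: number;
}
```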
Scoring Methodology
Match Types
| Type | Condition |
|------|-----------|
| Exact Match | Same Root Cause + Attack Scenario + Impact |
| Partial Match | Same Root Cause, incomplete scenario/impact |
| No Match | No matching finding found |
Metrics
| Metric | Description |
|--------|-------------|
| Detection Rate | Exact matches / Total ground truth |
| Partial Rate | (Exact + Partial) / Total ground truth |
| Precision | Exact matches / (Exact + False positives) |
| F1 Score | 2 × (Precision × Recall) / (Precision + Recall) |
| Severity-Weighted | Weighted score (Critical=5, High=4, Medium=2, Low=1) |
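The metrics reduce to simple arithmetic over the match counts. A minimal sketch of the formulas in the table above (names are illustrative; recall here is the detection rate over ground truth, and the severity-weighted formula, matched weight over total weight, is an assumption since the table only gives the weights):

```ts
// Minimal sketch of the metric formulas above; names are illustrative.
type Severity = "critical" | "high" | "medium" | "low";
const SEVERITY_WEIGHTS: Record<Severity, number> = { critical: 5, high: 4, medium: 2, low: 1 };

interface MatchCounts { exact: number; partial: number; totalTruth: number; falsePositives: number }

function computeMetrics(c: MatchCounts) {
  const detectionRate = c.exact / c.totalTruth;                    // Detection Rate
  const partialRate = (c.exact + c.partial) / c.totalTruth;        // Partial Rate
  const precision = c.exact / (c.exact + c.falsePositives);        // Precision
  const recall = detectionRate;
  const f1 = precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall); // F1 Score
  return { detectionRate, partialRate, precision, f1 };
}

// Assumed severity-weighted score: weight of matched ground truth over total possible weight.
function severityWeighted(matched: Severity[], allTruth: Severity[]): number {
  const weight = (xs: Severity[]) => xs.reduce((sum, s) => sum + SEVERITY_WEIGHTS[s], 0);
  return weight(matched) / weight(allTruth);
}
```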
Example Output
=== Scoring Complete ===
Exact matches: 3/7
Partial matches: 2/7
Missed: 2/7
False positives: 12
Detection rate: 42.9%
Precision: 20.0%
F1 Score: 28.6%
Severity-Weighted: 45.0%
vs Baseline (gpt-5): BETTER (+15.2%)
Available Contests
31 contests from ScaBench dataset:
cat data/dataset.json | jq -r '.[] | "\(.project_id) - \(.vulnerabilities | length) vulns"' | head -10

| Platform | Example Contests |
|----------|------------------|
| Code4rena | loopfi, pump-science, mantra-dex |
| Sherlock | cork-protocol, perennial-v2, oku |
| Cantina | minimal-delegation |
Troubleshooting
Audit Not Starting
# Use watch mode to debug
bun bench <contest-id> -w
# Or run audit manually
cd data/sources/<contest-id>
opencode
# Then type: /audit
LLM Response Parsing Failed
Failed to parse LLM JSON response
Try increasing iterations: --iterations 5
No Findings Generated
Ensure Vigilo is properly installed in OpenCode:
bunx vigilo doctor
License
See LICENSE in the root directory.
