semprove

v0.1.0

Published

24 days ago

SMT-backed semantic diff tool for TypeScript business logic. Proves equivalence or finds counterexamples.

0High
0Medium
0Low

krakennut

typescript testing smt z3 equivalence semantic-diff formal-verification refactoring

Semprove - SMT-Backed Semantic Diff for TypeScript

Semprove is an SMT-backed semantic diff tool for TypeScript business logic. It can prove equivalence within a restricted TypeScript fragment, configured bounds, and modeling assumptions (e.g., external-call abstraction), or produce a concrete counterexample and replay it at runtime.

✅ Guarantees (read this first)

UNSAT_IN_MODEL = no counterexample exists within configured bounds + supported fragment + modeling assumptions
COUNTEREXAMPLE_FOUND = a concrete input where old/new diverge (runtime-replayed)
TIMEOUT_OR_UNSUPPORTED = solver timed out or code uses unsupported constructs

🔍 Supported Fragment (Beta)

Semprove targets a strict subset of TypeScript for business logic.

✅ Supported:

Primitive types: string, number, boolean
Arithmetic: +, -, *, /
Logic: &&, ||, !, comparisons
Control flow: if/else, return, ternary, variable diffs
Constraint: Must be a pure function with stable parameter names.

⚠️ Not Supported:

Loops (for, while)
Object literals / Arrays
Async / Promises
Parameter renaming
Modulo %

See SUPPORTED_FRAGMENT.md for full details.

🎯 What It Does

Given two versions of a function (old and new), semprove:

Proves equivalence (UNSAT_IN_MODEL): No counterexample within bounds + fragment + modeling assumptions
Finds bugs (COUNTEREXAMPLE_FOUND): Shows exact input where functions diverge
Falls back gracefully (TIMEOUT_OR_UNSUPPORTED): If Z3 can't decide or features are unsupported, tries fast-check sampling

Safety note: Runtime replay is disabled by default when external calls are present.

Use cases (within supported fragment + bounds):

Verifying refactors of pricing rules
Validating policy/eligibility logic changes
Catching off-by-one errors in tax calculations
Proving discount tier adjustments are semantically equivalent

📦 Installation

Node LTS: Requires Node.js 20 or 22.

npm install -g semprove

Windows users:

npx -y semprove@beta --help

🚀 Quick Start

Compare two functions

semprove compare \
  --old src/pricing.old.ts \
  --new src/pricing.ts \
  --fn calculateDiscount

Output:

artifacts/result.json: Verdict (UNSAT/COUNTEREXAMPLE/TIMEOUT)
artifacts/counterexample.json: Exact input where functions diverge (if found)
artifacts/trace.json: Old/new return values and event logs

Analyze a Pull Request

semprove pr --base main --head feature-branch

Output:

artifacts/pr-summary.json: CTO-friendly metrics (functions analyzed, verdicts, p95 time)
Per-function artifacts in artifacts/[file]/[function]/

🛡️ Production-minded Safeguards

1. UNSAT Sampling Smoke-Test

When Z3 proves equivalence (UNSAT), semprove now validates with fast-check sampling:

semprove compare --old old.ts --new new.ts --fn myFunc --paranoid

Normal mode: 200 sampling runs
Paranoid mode (--paranoid): 2000 sampling runs
Confidence field: "unsat+sampled" vs "unsat-only"

Why? Sampling is a soundness smoke-test against lowering bugs; it cannot guarantee UNSAT is correct.

2. Dangerous Coercion Detection

Semprove explicitly refuses constructs that can diverge between Z3 and runtime:

❌ == / != operators (use === / !==)
❌ String concatenation with +
❌ undefined / null keywords
❌ NaN / Infinity identifiers

Error message:

UNSUPPORTED: == operator (use === instead). 
Reason: Type coercion can diverge between Z3 model and runtime.

3. Event Diff Severity

For event-based logic (function calls), semprove classifies changes:

Low: Only logging/metrics tags changed
Med: Guard conditions changed
High: Order/args changed on business logic

Field in result: "eventDiffSeverity": "low" | "med" | "high"

4. CI-Safe Runtime Replay (no side effects by default)

Default Safety Posture:

Runtime replay and sampling are disabled by default when external calls are detected.
This prevents accidental database writes, API calls, or file system mutations in CI.
To override (at your own risk), pass --allowSideEffects.

Float Handling:

Runtime replay treats near-equal numeric results as equal (relative epsilon) to avoid float-noise false positives.

5. PR Summary JSON

After semprove pr, get a CTO-friendly summary:

{
  "functionsAnalyzed": 12,
  "functionsSkipped": 3,
  "skippedReasons": { "UNSUPPORTED_FEATURE": 2, "TIMEOUT": 1 },
  "verdictCounts": { "unsat": 8, "found": 3, "approved": 2 },
  "runtimeP95Ms": 4500,
  "topDivergences": [
    {
      "fn": "calculateDiscount",
      "file": "src/pricing.ts",
      "counterexample": { "price": 100, "tier": "gold" },
      "oldRetRuntime": 90,
      "newRetRuntime": 85,
      "eventDiffSeverity": "high",
      "approved": false
    }
  ]
}

📏 Supported TypeScript Fragment

Semprove supports a tight, well-defined subset optimized for pure business logic:

✅ Arithmetic: +, -, *, /
✅ Comparisons: >, <, >=, <=, ===, !==
✅ Logical: &&, ||, !
✅ Control flow: if/else, switch/case, ternary
✅ Function calls (as uninterpreted functions)

❌ Loops (for, while)
❌ Arrays, objects with dynamic keys
❌ Async/await, promises
❌ Type coercions (==, undefined, null, NaN)

Non-goal: general TypeScript / app code. Semprove targets deterministic pricing/policy/rules.

See: SUPPORTED_FRAGMENT.md for full details (~20 lines of language)

🎯 Presets

Money Preset

For financial calculations (pricing, VAT, discounts):

semprove compare --old old.ts --new new.ts --fn calculatePrice --money

Inputs: integers (cents)
Intermediates: rationals (supports * 0.9, / 1.20)
Bounds: min/max to prevent overflow

🔧 Options

Global Options

| Option | Default | Description | |--------|---------|-------------| | --timeoutMs <n> | 60000 | Z3 solver timeout (ms) | | --[no-]ints | true | Treat numbers as integers | | --realInputs | false | Treat numbers as reals (no integer constraint) | | --money | false | Preset for financial calculations | | --min <n> | -1000000 | Min bound for numeric inputs | | --max <n> | 1000000 | Max bound for numeric inputs | | --paranoid | false | Increase sampling runs for UNSAT (2000 vs 200) | | --nonBlockingTags <list> | "" | Comma-separated event tags to ignore | | --out <dir> | ./artifacts | Output directory for artifacts | | --allowSideEffects | false | Allow runtime replay even if side effects detected |

PR Mode Options

| Option | Default | Description | |--------|---------|-------------| | --base <ref> | HEAD~1 | Base git ref | | --head <ref> | HEAD | Head git ref | | --mode <mode> | refactor | refactor or bugfix |

📊 Commands

`compare`

Compare two specific functions:

semprove compare --old old.ts --new new.ts --fn myFunction

`pr`

Analyze all changed functions in a PR:

semprove pr --base main --head feature-branch --mode refactor

Generates:

artifacts/pr-summary.json
Per-function artifacts

`approve`

Approve a known divergence (for refactor mode):

semprove approve \
  --fn calculateDiscount \
  --from artifacts/src_pricing_ts/calculateDiscount/counterexample.json \
  --reason "Intentional discount adjustment for Q1 promo" \
  --approvedBy "john-doe"

Stores approval in .semprove/approvals.json

📚 Documentation

PRODUCTION_RISKS.md: 5 real production risks and mitigations
SUPPORTED_FRAGMENT.md: Exact TypeScript subset supported
PRODUCTION_READY.md: Implementation summary and next steps
SPRINT_A_BENCH.md: Benchmark plan (30 test functions)
SPRINT_B_GITHUB_ACTION.md: GitHub Action integration
SPRINT_C_DESIGN_PARTNERS.md: Design partner script

🧪 Benchmarks

Run the benchmark suite:

npm run build
node bench/run-bench.ts

Generates: bench/RESULTS.md with accuracy metrics

Current benchmarks:

10 refactors (should be UNSAT)
10 bugfixes (should find COUNTEREXAMPLE)
10 Copilot errors (should find COUNTEREXAMPLE)

🏗️ Architecture

Extract (src/extract.ts): Parse TypeScript functions with ts-morph
Symbolic (src/symbolic.ts): Lower to Z3 symbolic representation
Lower (src/lower.ts): Convert TS AST to Z3 constraints
Solve (src/solve.ts): Check equivalence with Z3, validate with sampling
Report (src/report.ts): Generate artifacts (JSON, markdown)

🎓 How It Works

Example

// OLD
export function calculateDiscount(price: number, tier: "gold" | "silver" | "none"): number {
  if (tier === "gold") return price * 0.9;
  return price;
}

// NEW
export function calculateDiscount(price: number, tier: "gold" | "silver" | "none"): number {
  if (tier === "gold") return price * 0.85;  // BUG: changed discount
  return price;
}

Semprove output:

{
  "status": "COUNTEREXAMPLE_FOUND",
  "counterexample": { "price": 100, "tier": "gold" },
  "oldRet": "90",
  "newRet": "85",
  "oldRetRuntime": 90,
  "newRetRuntime": 85,
  "runtimeDiverged": true
}

Verdict: Functions diverge at { price: 100, tier: "gold" } ✅

🚀 Roadmap

Sprint A: Bench + Bug Finds (Week 1-2)

Create 30 benchmark functions, validate accuracy > 90%

Sprint B: GitHub Action (Week 3)

Build installable GitHub Action with PR comment integration

Sprint C: Design Partners (Week 4-8)

Work with 3 companies to validate on real codebases

🤝 Contributing

Contributions welcome! Focus areas:

More benchmark cases (pricing, policy, tax logic)
Lowering improvements (support more TS constructs)
Performance (faster Z3 queries, caching)
UX (better error messages, visualization)

📄 License

MIT

🙋 FAQ

Q: What if my function uses loops?

A: Not supported yet. Refactor into fixed-arity logic or extract loop-free helpers.

Q: Can I use this on React components?

A: No. Semprove targets pure business logic (pricing, policy, calculations).

Q: What's the difference from property testing?

A: Property testing (fast-check) tests examples. Semprove does SMT-based equivalence checking within the supported fragment and bounds, or finds a concrete counterexample.

Q: How fast is it?

A: In our dry run (20 functions, AMD Ryzen 9 5900X, 60s timeout): P95 ~1.3s, median ~200ms. Performance depends on function complexity, bounds, and hardware. Use --timeoutMs to adjust.

Q: Can I approve a divergence?

A: Yes! Use semprove approve to mark expected changes in refactor mode.

Built with:

Z3 (SMT solver)
ts-morph (TypeScript AST)
fast-check (Property testing)
commander (CLI framework)

⚠️ Known Issues

WASM Z3 Memory Limits

Z3 runs in WebAssembly with ~2GB heap limit.
Large functions (>50 variables) or very large bounds may hit this limit.
Mitigation:
- Reduce search bounds: --min -1000 --max 1000
- Use integer reasoning: --no-realInputs (default)
- Split complex functions into smaller helpers
- Use --timeoutMs 5000 to fail fast instead of hanging

Windows PowerShell

Use -y flag with npx: npx -y semprove ...
Long paths may require --% escape in some shells
Git worktrees work correctly (tested on Windows 11)

Performance Variability

First run may be slower (WASM initialization)
Performance scales with function complexity and bounds
Recommendation: Start with smaller bounds (--min, --max) for exploration

Questions? Open an issue or reach out!

🔐 Security & Trust

No network required for analysis.
No telemetry (if ever added, it will be opt-in).

Next: Try semprove pr on your codebase 🚀 #� �S�e�m�p�r�o�v�e� � �