ti-code-evals

v0.1.14

Published

12 days ago

Typed Effect-native evaluation harness and benchmark adapters for ti

schpnpls

ti-code-evals

Effect-native local evaluation contracts for ti.

The package owns typed suite definitions, command-backed attempts, deterministic scoring, and report summaries; tests cover the grading and summary path. The local self_host cases are runnable today. SWE-bench has a typed adapter plus a durable local workflow: the internal ti eval tool prepares an official swebench==4.1.0 virtual environment under .ti/evals/swe-bench, retains the official Docker, build, and evaluation logs there, and writes machine-readable run summaries under .ti/evals/swe-bench/reports.
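As a rough illustration of what "deterministic scoring" and "report summaries" can mean for command-backed attempts, the sketch below grades each attempt by exit code and folds the results into a summary. The names AttemptOutcome, Summary, and summarize are hypothetical and are not the package's exported API; the pass rule (exit code 0) is also an assumption.

```typescript
// Hypothetical shapes for illustration only; not the package's exported types.
type AttemptOutcome = {
  readonly caseId: string;
  readonly exitCode: number; // exit status of the command-backed attempt
  readonly durationMs: number;
};

type Summary = {
  readonly total: number;
  readonly passed: number;
  readonly failed: number;
  readonly passRate: number;
  readonly failedCaseIds: readonly string[];
};

// Assumed pass rule: a case passes only when its command exits with status 0.
const isPass = (outcome: AttemptOutcome): boolean => outcome.exitCode === 0;

// Pure function of its input: no clocks, no randomness, and failed ids are sorted,
// so the same outcomes yield the same summary regardless of input order.
function summarize(outcomes: readonly AttemptOutcome[]): Summary {
  const failedCaseIds = outcomes
    .filter((o) => !isPass(o))
    .map((o) => o.caseId)
    .sort();
  const total = outcomes.length;
  const passedCount = total - failedCaseIds.length;
  return {
    total,
    passed: passedCount,
    failed: failedCaseIds.length,
    passRate: total === 0 ? 0 : passedCount / total,
    failedCaseIds,
  };
}
```

Keeping the summary a pure function of the recorded outcomes is one way to make reports reproducible from stored attempt data.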

Validation

pnpm --filter ti-code-evals test
pnpm --filter ti-code-evals typecheck

Durable SWE-bench smoke run

From a built native ti CLI checkout, prepare the persistent harness once:

ti tool eval '{"operation":"prepare"}'

Then run the official single-instance gold validation from SWE-bench:

ti tool eval '{"operation":"gold_smoke","timeoutSeconds":1800}'

Pass "dockerHost":"unix:///path/to/docker.sock" to override Docker context detection. Set "reinstallHarness":true only when the persistent SWE-bench virtual environment must be recreated. Gold smoke reuses official Docker evaluation images by default; set "forceRebuild":true only when you explicitly need to rebuild those images.
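When these options are assembled programmatically, a small helper can keep the JSON payload and shell quoting consistent. The sketch below is an assumption-laden convenience, not part of ti: buildEvalCommand and shellQuote are hypothetical names, and only the two operations and four option fields mentioned in this README are modeled.

```typescript
// Field names mirror the JSON options described in this README.
// The helper itself is hypothetical and not shipped with ti.
type EvalRequest = {
  readonly operation: "prepare" | "gold_smoke";
  readonly timeoutSeconds?: number;
  readonly dockerHost?: string;
  readonly reinstallHarness?: boolean;
  readonly forceRebuild?: boolean;
};

// Wrap a string in single quotes for a POSIX shell, escaping embedded single quotes.
const shellQuote = (value: string): string =>
  "'" + value.replace(/'/g, "'\\''") + "'";

function buildEvalCommand(request: EvalRequest): string {
  // JSON.stringify omits properties whose value is undefined, so unset
  // options are left out of the payload entirely.
  return "ti tool eval " + shellQuote(JSON.stringify(request));
}

// Example: the gold smoke invocation with an explicit Docker socket.
const command = buildEvalCommand({
  operation: "gold_smoke",
  timeoutSeconds: 1800,
  dockerHost: "unix:///path/to/docker.sock",
});
console.log(command);
```

Leaving reinstallHarness and forceRebuild unset by default matches the guidance to enable them only when a rebuild is actually needed.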

The gold smoke run uses princeton-nlp/SWE-bench_Verified, official gold predictions, instance sympy__sympy-20590, one worker, and --namespace none for ARM/macOS compatibility. Local Docker runs remain resource-intensive, and the official harness recommends substantial free disk capacity.

Preflight checks interpreter availability, module discovery, and Docker health without importing the full SWE-bench harness, so first-run image and dataset setup is measured as execution time rather than misreported as a missing Python runtime. For swebench==4.1.0, preparation also applies an idempotent compatibility patch to generated Python 3.9 environment builds so that Conda installs pip<26 before pip is imported; newer pip releases require Python features unavailable in those official images. If an upstream historical branch named in an official instance no longer exists, preparation falls back to a full repository clone before resetting to the dataset-pinned commit.
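The preflight idea can be sketched as an ordered series of cheap probes that stops at the first failure, so the report names the real missing prerequisite. This is only an illustration under stated assumptions: CheckName, Probe, and preflight are hypothetical, the probes are injected rather than implemented, and the tool's actual internals are not shown in this README.

```typescript
// Hypothetical probe and result types; the real tool's internals may differ.
type CheckName = "interpreter" | "module_discovery" | "docker_health";
type Probe = () => boolean;

type PreflightResult =
  | { readonly ok: true }
  | { readonly ok: false; readonly failedCheck: CheckName };

// Run cheap checks in a fixed order and stop at the first failure. None of the
// probes is expected to import the full harness, so slow first-run setup is
// never mistaken for a missing runtime.
function preflight(probes: Record<CheckName, Probe>): PreflightResult {
  const order: readonly CheckName[] = [
    "interpreter",
    "module_discovery",
    "docker_health",
  ];
  for (const name of order) {
    if (!probes[name]()) {
      return { ok: false, failedCheck: name };
    }
  }
  return { ok: true };
}
```

Injecting the probes keeps the ordering logic testable without a Python interpreter or a Docker daemon.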

Supported benchmarks

self_host - local command-backed cases used for repo-native checks.
swe_bench - official harness adapter for SWE-bench / SWE-bench Verified.
terminal_bench - catalog target; adapter not implemented yet.
harbor - catalog target; adapter not implemented yet.
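The catalog above can be modeled as a small typed table so that callers can tell runnable adapters apart from catalog-only targets. The sketch below assumes hypothetical names (BenchmarkId, AdapterStatus, benchmarks, runnableBenchmarks); only the four benchmark identifiers and their described status come from this README.

```typescript
// Identifiers come from the list above; the types and table are illustrative only.
type BenchmarkId = "self_host" | "swe_bench" | "terminal_bench" | "harbor";
type AdapterStatus = "implemented" | "catalog_only";

type BenchmarkEntry = {
  readonly status: AdapterStatus;
  readonly note: string;
};

const benchmarks: Record<BenchmarkId, BenchmarkEntry> = {
  self_host: { status: "implemented", note: "local command-backed cases" },
  swe_bench: { status: "implemented", note: "official harness adapter" },
  terminal_bench: { status: "catalog_only", note: "adapter not implemented yet" },
  harbor: { status: "catalog_only", note: "adapter not implemented yet" },
};

// Return only the benchmarks that have an adapter, in catalog order.
function runnableBenchmarks(): BenchmarkId[] {
  return (Object.keys(benchmarks) as BenchmarkId[]).filter(
    (id) => benchmarks[id].status === "implemented",
  );
}

console.log(runnableBenchmarks());
```

Using a Record keyed by the union type means adding a new benchmark id without a catalog entry fails at compile time.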
