ti-code-evals
v0.1.14
Published
Typed Effect-native evaluation harness and benchmark adapters for ti
ti-code-evals
Effect-native local evaluation contracts for ti.
The package is validated: it owns typed suite definitions, command-backed
attempts, deterministic scoring, and report summaries, with test coverage over
the grading and summary path. The local self_host cases are runnable today.
SWE-bench has a typed adapter plus a durable local workflow. The internal ti
eval tool prepares an official swebench==4.1.0 virtual environment under
.ti/evals/swe-bench, retains official Docker/build/evaluation logs there, and
writes machine-readable run summaries under .ti/evals/swe-bench/reports.
Validation
pnpm --filter ti-code-evals test
pnpm --filter ti-code-evals typecheck
Durable SWE-bench smoke run
From a built native ti CLI checkout, prepare the persistent harness once:
ti tool eval '{"operation":"prepare"}'
Then run the official single-instance gold validation from SWE-bench:
ti tool eval '{"operation":"gold_smoke","timeoutSeconds":1800}'
Use "dockerHost":"unix:///path/to/docker.sock" to override Docker context
detection, and "reinstallHarness":true only when recreating the persistent
SWE-bench virtual environment is required. Gold smoke reuses official Docker
evaluation images by default; set "forceRebuild":true only when you
explicitly need to reconstruct those images.
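The request fields above can be assembled programmatically. The following is a minimal sketch, not the tool's actual schema: the EvalRequest interface and the helper names are hypothetical, built only from the fields this README names, and which fields each operation accepts is an assumption.

```typescript
// Hypothetical request shape assembled from the fields named in this README;
// the real tool may accept more fields or validate them differently.
type EvalOperation = "prepare" | "gold_smoke";

interface EvalRequest {
  operation: EvalOperation;
  timeoutSeconds?: number;
  dockerHost?: string;
  reinstallHarness?: boolean;
  forceRebuild?: boolean;
}

// Quote a string for a POSIX shell: wrap it in single quotes and
// rewrite each embedded single quote as '\''.
function shellSingleQuote(value: string): string {
  return "'" + value.split("'").join("'\\''") + "'";
}

// Build the full command line. JSON.stringify drops undefined fields,
// so optional settings only appear when they are set explicitly.
function buildEvalCommand(request: EvalRequest): string {
  return "ti tool eval " + shellSingleQuote(JSON.stringify(request));
}

const smokeCommand = buildEvalCommand({
  operation: "gold_smoke",
  timeoutSeconds: 1800,
  dockerHost: "unix:///path/to/docker.sock",
});
console.log(smokeCommand);
```

Leaving reinstallHarness and forceRebuild unset keeps the default behavior described above: the persistent virtual environment and the official images are reused.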
The gold smoke run uses princeton-nlp/SWE-bench_Verified, official gold
predictions, instance sympy__sympy-20590, one worker, and --namespace none
for ARM/macOS compatibility. Local Docker runs remain resource intensive and
the official harness recommends substantial free disk capacity.
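For orientation, the parameters above map onto the upstream swebench.harness.run_evaluation command line roughly as sketched below. The flag names follow the official harness; whether ti passes exactly this argument list is an assumption, and the GoldSmokeConfig type and the run id value are hypothetical.

```typescript
// Hypothetical config type; the values mirror the gold smoke run described above.
interface GoldSmokeConfig {
  datasetName: string;
  instanceId: string;
  maxWorkers: number;
  namespace: string;
  runId: string; // assumption: any label that identifies the run
}

// Build the Python argument list for the official evaluation entry point.
function goldSmokeArgs(config: GoldSmokeConfig): string[] {
  return [
    "-m", "swebench.harness.run_evaluation",
    "--dataset_name", config.datasetName,
    "--predictions_path", "gold", // use the official gold predictions
    "--instance_ids", config.instanceId,
    "--max_workers", String(config.maxWorkers),
    "--namespace", config.namespace,
    "--run_id", config.runId,
  ];
}

const goldArgs = goldSmokeArgs({
  datasetName: "princeton-nlp/SWE-bench_Verified",
  instanceId: "sympy__sympy-20590",
  maxWorkers: 1,
  namespace: "none",
  runId: "gold-smoke", // hypothetical run id
});
console.log(["python", ...goldArgs].join(" "));
```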
Preflight checks interpreter availability, module discovery, and Docker health
without importing the full SWE-bench harness, so first-run image/dataset setup
is measured as execution rather than misreported as a missing Python runtime.
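One way to structure such a preflight is sketched below. This is an illustration under stated assumptions, not the package's implementation: the Probe type, the preflight function, and the exact commands are hypothetical. Module discovery uses Python's importlib.util.find_spec, which locates a top-level package without executing it.

```typescript
// A probe runs a command and reports whether it exited successfully.
// It is injected so the checks can be exercised without Python or Docker.
type Probe = (command: string, args: readonly string[]) => boolean;

interface PreflightResult {
  interpreter: boolean;   // the Python executable starts
  harnessModule: boolean; // swebench is discoverable, but not imported
  docker: boolean;        // the Docker daemon answers
}

// find_spec locates the package on the import path without running it.
const FIND_SPEC_SNIPPET =
  "import importlib.util, sys; " +
  "sys.exit(0 if importlib.util.find_spec('swebench') else 1)";

function preflight(python: string, probe: Probe): PreflightResult {
  const interpreter = probe(python, ["--version"]);
  // Only ask about the module when the interpreter itself is available,
  // so a missing module is never reported as a missing runtime.
  const harnessModule = interpreter && probe(python, ["-c", FIND_SPEC_SNIPPET]);
  const docker = probe("docker", ["info"]);
  return { interpreter, harnessModule, docker };
}

// Usage with a fake probe: Python is present, Docker is not reachable.
const fakeProbe: Probe = (command) => command !== "docker";
console.log(preflight("python3", fakeProbe));
```

In a real run the probe would wrap a process spawn and check the exit status; keeping it injectable makes each failure mode easy to test separately.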
For swebench==4.1.0, preparation also applies an idempotent compatibility
patch to generated Python 3.9 environment builds so Conda installs pip<26
before pip is imported; newer pip releases require Python features unavailable
in those official images. It also falls back to a full repository clone if an
upstream historical branch named in an official instance no longer exists,
before resetting to the dataset-pinned commit.
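The idempotence property mentioned above can be illustrated with a small sketch. Everything here is hypothetical: the marker text, the anchor line, and the sample script are invented for the example and do not reflect the actual patch or the files swebench generates. The point is only the pattern of checking for a marker before modifying anything.

```typescript
// Hypothetical marker written alongside the patch so repeat runs can detect it.
const PATCH_MARKER = "# ti-compat: pin pip before first use";

// Insert `addition` immediately before the first occurrence of `anchor`.
// Returns the input unchanged if it was already patched or has no anchor.
function applyPatchOnce(script: string, anchor: string, addition: string): string {
  if (script.includes(PATCH_MARKER)) return script; // already patched
  const index = script.indexOf(anchor);
  if (index === -1) return script; // nothing to patch in this script
  return (
    script.slice(0, index) +
    PATCH_MARKER + "\n" +
    addition + "\n" +
    script.slice(index)
  );
}

// Illustrative environment script, not real harness output.
const sampleScript =
  "conda create -n testbed python=3.9 -y\n" +
  "python -m pip install -r requirements.txt\n";

const patchedOnce = applyPatchOnce(sampleScript, "python -m pip install", 'conda install -y "pip<26"');
const patchedTwice = applyPatchOnce(patchedOnce, "python -m pip install", 'conda install -y "pip<26"');
console.log(patchedOnce === patchedTwice); // true: the second pass changes nothing
```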
Supported benchmarks
self_host - local command-backed cases used for repo-native checks.
swe_bench - official harness adapter for SWE-bench / SWE-bench Verified.
terminal_bench - catalog target; adapter not implemented yet.
harbor - catalog target; adapter not implemented yet.
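The catalog above could be modeled as typed data along the following lines. This is a sketch only: the BenchmarkEntry shape and the adapterImplemented flag are hypothetical and are not the package's exported types. The identifiers and statuses come from the list above.

```typescript
type BenchmarkId = "self_host" | "swe_bench" | "terminal_bench" | "harbor";

// Hypothetical entry shape for illustration.
interface BenchmarkEntry {
  id: BenchmarkId;
  description: string;
  adapterImplemented: boolean;
}

const benchmarkCatalog: readonly BenchmarkEntry[] = [
  { id: "self_host", description: "local command-backed cases", adapterImplemented: true },
  { id: "swe_bench", description: "official harness adapter", adapterImplemented: true },
  { id: "terminal_bench", description: "catalog target", adapterImplemented: false },
  { id: "harbor", description: "catalog target", adapterImplemented: false },
];

// List the benchmarks that have an adapter today.
function implementedBenchmarks(catalog: readonly BenchmarkEntry[]): BenchmarkId[] {
  return catalog.filter((entry) => entry.adapterImplemented).map((entry) => entry.id);
}

console.log(implementedBenchmarks(benchmarkCatalog));
```

Keeping catalog-only targets in the same list, with a flag instead of omitting them, lets callers report "not implemented yet" rather than "unknown benchmark".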
