nexus-eval-swebench

v0.2.2

SWE-bench (Lite/Verified/Full) evaluation harness for nexus-agents — clean-room implementation, model-only baseline

SWE-bench (Lite / Verified / Full) evaluation harness for nexus-agents, implementing the BenchmarkAdapter contract.

v0.2: clean-room re-implementation. This harness is now self-contained: it depends only on public nexus-agents exports (BenchmarkAdapter, IModelAdapter, runBenchmark), with no internal helpers and no in-tree runtime imports. The v0.1 thin wrapper around the in-tree SWEBenchRunner has been replaced by a model-only baseline implemented locally. See nexus-agents #2515 for the extraction rationale.

Install

npm install nexus-eval-swebench nexus-agents

nexus-agents is a peer dependency.

Quick start (CLI)

# Configure the OpenAI-compatible endpoint
export OPENAI_API_KEY=sk-...
export OPENAI_BASE_URL=https://your-gateway/v1   # optional
export MODEL_ID=anthropic/claude-sonnet-4-6      # optional

# Run 5 SWE-bench Lite instances, up to 3 at a time
npx nexus-eval-swebench --variant lite --limit 5 --concurrency 3

# JSON summary for piping
npx nexus-eval-swebench --variant verified --json > run.json

Library usage

import { runBenchmark, createOpenAIAdapter } from 'nexus-agents';
import { SweBenchAdapter } from 'nexus-eval-swebench';

const modelAdapter = createOpenAIAdapter({
  apiKey: process.env.OPENAI_API_KEY!,
  modelId: 'gpt-4o',
});

const adapter = new SweBenchAdapter(modelAdapter, { variant: 'lite' });
const summary = await runBenchmark(adapter, {}, { concurrency: 4, limit: 10 });

console.log(
  `Generated ${summary.passed}/${summary.total} non-empty patches ` +
    `(${(summary.passRate * 100).toFixed(1)}%)`
);

Operators with their own IModelAdapter (Claude API, Ollama, anything implementing the contract) can substitute it for createOpenAIAdapter without changing anything else.

What this harness does (v0.2 MVP)

  • Loads SWE-bench instances from HuggingFace (princeton-nlp/SWE-bench{,_Lite,_Verified}) or a local .jsonl fixture. HF responses are cached under ~/.nexus-eval-swebench/cache/<variant>.jsonl.
  • Composes a SWE-bench prompt that surfaces repo, base commit, problem statement, and optional hints. Asks for a unified-diff patch wrapped in a fenced ```diff block.
  • Invokes the configured IModelAdapter via complete() — pure model-only, no agent loop, no workspace clone.
  • Extracts the patch from the model response (handles fenced ```diff blocks, ```patch blocks, and bare unified diffs).
  • Returns predictions in the standard SWE-bench shape: { instance_id, model_name_or_path, model_patch }.
  • Surfaces per-run metadata: empty-patch count, generation-error count, dataset variant.
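
The extraction and prediction steps above can be sketched roughly as follows. This is an illustrative approximation, not the package's actual code: extractPatch, toPrediction, and the Prediction interface are hypothetical names, and the fallback rule for bare diffs (start at the first "diff --git" or "---"/"+++" header pair) is an assumption about what "handles bare unified diffs" might mean.

```typescript
// Hypothetical sketch of patch extraction; NOT the harness's real implementation.
interface Prediction {
  instance_id: string;
  model_name_or_path: string;
  model_patch: string;
}

// Three backticks, built at runtime so this example contains no literal fence.
const FENCE = "`".repeat(3);

// Drop trailing newlines and end with exactly one; return "" for blank input.
function normalizePatch(raw: string): string {
  const stripped = raw.replace(/\n+$/, "");
  return stripped.trim() === "" ? "" : stripped + "\n";
}

function extractPatch(response: string): string {
  // 1. Prefer a fenced block labelled "diff" or "patch".
  const fenced = new RegExp(FENCE + "(?:diff|patch)[^\\n]*\\n([\\s\\S]*?)" + FENCE);
  const match = fenced.exec(response);
  if (match) return normalizePatch(match[1]);

  // 2. Fall back to a bare unified diff (assumed rule): start at the first
  //    "diff --git" line, or at a "--- " line directly followed by "+++ ".
  //    Any prose after the diff is NOT stripped in this sketch.
  const lines = response.split("\n");
  const start = lines.findIndex(
    (line, i) =>
      line.startsWith("diff --git ") ||
      (line.startsWith("--- ") && (lines[i + 1] ?? "").startsWith("+++ "))
  );
  return start === -1 ? "" : normalizePatch(lines.slice(start).join("\n"));
}

// Wrap an extracted patch in the standard SWE-bench prediction shape.
function toPrediction(instanceId: string, modelId: string, response: string): Prediction {
  return {
    instance_id: instanceId,
    model_name_or_path: modelId,
    model_patch: extractPatch(response),
  };
}
```

An empty model_patch here corresponds to what the summary counts as an empty patch; a response with no recognizable diff yields "".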

What v0.2 does NOT do

  • Does NOT run tests against the predictions. Pass/fail in the summary reflects "did the model produce a non-empty patch", not "does the patch resolve the issue." For test-based resolution, run the upstream SWE-bench Docker harness on the emitted predictions file:

    python -m swebench.harness.run_evaluation \
      --dataset_name princeton-nlp/SWE-bench_Lite \
      --predictions_path ./predictions.jsonl \
      --max_workers 8 \
      --run_id my-run

  • Does NOT clone repos. The model sees only the prompt fields (repo name, base commit, problem_statement, and optional hints_text), never the repository contents. Real agentic flows (clone repo, navigate codebase, edit files, capture diff) score considerably higher than this baseline. An agentic flow via ICliAdapter against a cloned workspace is tracked as a v0.3 follow-up.
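
If you end up assembling predictions.jsonl yourself, the file is plain JSON Lines: one prediction object per line. The sketch below assumes you already hold an array of prediction objects in the shape described earlier; how the harness exposes them is not specified here, and toJsonl is a hypothetical helper name.

```typescript
// Hypothetical helper; assumes predictions are already available as plain objects.
interface Prediction {
  instance_id: string;
  model_name_or_path: string;
  model_patch: string;
}

// Serialize predictions as JSON Lines: one JSON object per line.
// JSON.stringify escapes newlines inside model_patch, so each prediction
// stays on a single physical line.
function toJsonl(predictions: Prediction[]): string {
  if (predictions.length === 0) return "";
  return predictions.map((p) => JSON.stringify(p)).join("\n") + "\n";
}
```

The resulting string can be written to ./predictions.jsonl (for example with Node's fs.writeFileSync) and passed to --predictions_path.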

Roadmap

  • v0.3: agentic flow (ICliAdapter + workspace clone) for substantially better patch quality
  • v0.4: optional Docker harness integration for inline test-based resolution
  • v0.5+: fixture generation, dataset slicing, per-repo breakdowns

Progress is tracked in this repo's issues.

Configuration

interface SweBenchAdapterConfig {
  variant?: 'lite' | 'verified' | 'full';     // default 'lite'
  dataset?: 'huggingface' | string;             // default 'huggingface'; pass a path for .jsonl
  cacheDir?: string;                            // default ~/.nexus-eval-swebench/cache/
}
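
To make the defaults concrete, here is a rough sketch of how they might resolve. resolveConfig and ResolvedConfig are hypothetical names used only for illustration, and the home directory is passed in explicitly so the example stays self-contained; the package's real resolution logic may differ.

```typescript
// Hypothetical illustration of the documented defaults; not the package's code.
type Variant = "lite" | "verified" | "full";

interface SweBenchAdapterConfig {
  variant?: Variant;
  dataset?: "huggingface" | string;
  cacheDir?: string;
}

interface ResolvedConfig {
  variant: Variant;
  dataset: string;
  cacheDir: string;
  cacheFile: string; // <cacheDir>/<variant>.jsonl
}

function resolveConfig(config: SweBenchAdapterConfig, homeDir: string): ResolvedConfig {
  const variant = config.variant ?? "lite";
  // Anything other than 'huggingface' is treated as a path to a .jsonl file.
  const dataset = config.dataset ?? "huggingface";
  // Strip trailing slashes so the joined cache path has a single separator.
  const cacheDir = (config.cacheDir ?? `${homeDir}/.nexus-eval-swebench/cache`).replace(/\/+$/, "");
  return { variant, dataset, cacheDir, cacheFile: `${cacheDir}/${variant}.jsonl` };
}
```

For example, resolveConfig({}, '/home/me') resolves to the 'lite' variant, the 'huggingface' dataset, and a cache file at /home/me/.nexus-eval-swebench/cache/lite.jsonl.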

CLI flags: --variant, --model-id, --dataset, --cache-dir, --limit, --concurrency, --timeout, --json. See npx nexus-eval-swebench --help.

Environment for the CLI: OPENAI_API_KEY (required), OPENAI_BASE_URL (optional), MODEL_ID (optional).
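
As a sketch of the environment contract, the variables could be read like this. readCliEnv and CliEnv are hypothetical names, not part of the package, and the sketch deliberately does not guess at the CLI's fallback values for the optional variables.

```typescript
// Hypothetical sketch of reading the CLI environment; not the package's code.
interface CliEnv {
  apiKey: string;
  baseUrl?: string;
  modelId?: string;
}

function readCliEnv(env: Record<string, string | undefined>): CliEnv {
  // OPENAI_API_KEY is the only required variable.
  const apiKey = env["OPENAI_API_KEY"]?.trim();
  if (!apiKey) {
    throw new Error("OPENAI_API_KEY is required");
  }
  const result: CliEnv = { apiKey };

  // Optional variables are only set when present and non-empty.
  const baseUrl = env["OPENAI_BASE_URL"]?.trim();
  if (baseUrl) result.baseUrl = baseUrl;
  const modelId = env["MODEL_ID"]?.trim();
  if (modelId) result.modelId = modelId;

  return result;
}
```

In a Node script this would typically be called as readCliEnv(process.env).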

License

MIT.