@whitenoisenpm/testforge-mcp

v0.37.0

Published

6 days ago

TestForge MCP Server — AI-powered testing in your IDE. Analyzes code for security, unit tests, load, accessibility, vision alignment, scope coverage, and stack quality.


whitenoisenpm



AI-powered testing in your IDE. The TestForge MCP server integrates with Cursor, VS Code, Windsurf, Claude Code, and any MCP-compatible editor to provide real-time code analysis — entirely on your machine.

npx -y @whitenoisenpm/testforge-mcp@latest        # start the server → http://localhost:33221
npx -y @whitenoisenpm/testforge-mcp setup         # interactive config wizard (AI provider, port, secret)
npx -y @whitenoisenpm/testforge-mcp --help        # full env-var reference

Tier-1 (22 dimensions) needs no config. For Tier-2 (LLM test generation + sims), run setup once — it configures an AI provider (OpenRouter cloud or a local model server like Ollama / LM Studio) and writes ~/.testforge/.env. No database to install — run history is auto-stored in SQLite at ~/.testforge/history.db.

What it does

| Dimension category | Examples |
|---|---|
| Security (SAST) | SQL/NoSQL injection, eval, XSS, sensitive data in logs/responses, hardcoded secrets, CORS misconfig, OWASP coverage |
| Quality | Unit-test coverage, mutation-score estimate, predictive risk, dead-code, license/supply-chain audit |
| Performance & resilience | Load profile, rate limiting, caching, n+1 query patterns, chaos resilience |
| Product & ops | Vision/goal alignment (observability, analytics, feature flags), scope coverage, stack quality, DORA estimate, agentic-scale prediction |
| UI | Accessibility (WCAG-ish): alt text, form labels, visual-regression hints |

All Tier-1 analysis is regex/static — fast, no LLM calls, deterministic. Same input → same output. Tier 2 (Generate & Run) layers an LLM on top: it writes real tests (Vitest / pytest / Go) for the top findings — grounded in your actual source, and importing & executing your real code when it safely can — and runs them in a sandboxed Docker container. Simulate goes further: it boots your app and exercises the running system (load, chaos, real-code unit tests in the booted image, and a Playwright browser crawl + LLM-authored user journeys). See the Tier 2 and Simulate sections below.
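To make the "same input → same output" property concrete, here is a minimal sketch of what a regex-based check can look like. The `Finding` shape, the `no-eval` rule name, and `findEvalCalls` are hypothetical illustrations, not TestForge's actual analyzer code.

```typescript
// Hypothetical finding shape, for illustration only.
interface Finding {
  rule: string;
  lineNumber: number; // 1-based
  severity: "low" | "medium" | "high" | "critical";
}

// Flags direct eval( calls line by line.
// Pure function: no I/O, no randomness, no LLM call, so repeated runs
// over the same source always return the same findings.
function findEvalCalls(source: string): Finding[] {
  const findings: Finding[] = [];
  source.split("\n").forEach((line, index) => {
    // \b keeps identifiers such as "medieval(" from matching.
    if (/\beval\s*\(/.test(line)) {
      findings.push({ rule: "no-eval", lineNumber: index + 1, severity: "high" });
    }
  });
  return findings;
}
```

Calling `findEvalCalls` twice on the same string yields structurally identical results, which is the property that makes a static tier cheap to re-run and easy to diff between runs.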

Quick Start (Tier 1)

# Start the server (port 33221)
npx @whitenoisenpm/testforge-mcp@latest

# Dashboard:
open http://localhost:33221

The dashboard lets you paste a local project path or a public GitHub URL, runs the full 22-dimension analysis, and persists each run to SQLite at ~/.testforge/history.db so /reports shows your history. Everything stays on your machine — no API keys required.

Tier 2 — Generate & Run (LLM tests + sandbox)

Added in v0.25.0. Tier 1 keeps working without any extra setup. Tier 2 needs an AI provider (cloud key or a local model server) and Docker running.

Easiest: npx @whitenoisenpm/testforge-mcp setup — pick OpenRouter or a local server (Ollama/LM Studio), and it writes the config for you. Or set env vars manually:

# 1. Choose an AI provider (Option A or Option B).
# Option A: OpenRouter (cloud). Free key at https://openrouter.ai/keys
export OPENROUTER_API_KEY=sk-or-v1-...

# Option B — local model server (Ollama/LM Studio/vLLM), free + private, no key:
#   ollama pull qwen2.5-coder:14b
export TESTFORGE_LLM_BASE_URL=http://localhost:11434/v1
export TESTFORGE_PRIMARY_MODEL=qwen2.5-coder:14b
#   (from Docker, use http://host.docker.internal:11434/v1)

# 2. Make sure Docker is running (Docker Desktop on macOS / Windows;
#    docker daemon on Linux). No image build step needed — the runner
#    image is pulled from GHCR on first use.

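# 3. Start the server with the key in env (Option A shown; with Option B,
#    run the npx command alone, since exported variables are inherited)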
OPENROUTER_API_KEY=$OPENROUTER_API_KEY \
  npx @whitenoisenpm/testforge-mcp@latest

# 4. In the dashboard, click "🤖 Generate Tests (Tier 2)"
#    under any analysis report. The first call pulls
#    ghcr.io/t4tarzan/testforge-runner:latest (~92 MB, ~10s).
#    Subsequent calls reuse the local image and run in ~1s.

What it does: takes the top-3 highest-severity findings from a Tier-1 run, sends each to the LLM with a Zod-enforced schema (filename, content, reasoning), then runs the generated tests in a node:22-slim / pytest / Go container (--network=none, --rm, caps dropped) with the framework's JSON reporter.
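The "top-3 highest-severity findings" selection can be sketched as a sort followed by a slice. This is a hedged illustration: the severity ranking and the `pickTopFindings` helper are assumptions, and the real selection logic may break ties differently.

```typescript
type Severity = "critical" | "high" | "medium" | "low";

// Subset of the finding fields shown in the request example.
interface Finding {
  title: string;
  severity: Severity;
  filePath: string;
  lineNumber: number;
}

// Assumed ordering: lower rank means more severe.
const SEVERITY_RANK: Record<Severity, number> = {
  critical: 0,
  high: 1,
  medium: 2,
  low: 3,
};

// Returns at most maxFindings findings, most severe first.
// Copies the array first so the caller's list is not reordered.
function pickTopFindings(findings: Finding[], maxFindings = 3): Finding[] {
  return [...findings]
    .sort((a, b) => SEVERITY_RANK[a.severity] - SEVERITY_RANK[b.severity])
    .slice(0, maxFindings);
}
```

Because `Array.prototype.sort` is stable in current JavaScript engines, findings of equal severity keep their original report order in this sketch.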

Tests are grounded in your real code, not a description of it:

Each finding ships the actual source at the flagged line, so the generated test reproduces your logic — not a generic example.
When the finding's file is a leaf module (imports only Node built-ins), the test imports and executes the real module in the sandbox, so the pass/fail reflects your actual code.
For deeper coverage (modules with dependencies), the Simulate wired lane runs tests against the real code inside the booted app image, where its dependencies already resolve. See Simulate below.

Polyglot since v0.29: .ts/.js → Vitest, .py → pytest, .go → go test, each in its matching sandbox image.

Provider stack (default models via OpenRouter; override either, or point at a local server with TESTFORGE_LLM_BASE_URL):

| Model | Role | Override |
|---|---|---|
| deepseek/deepseek-v4-flash | Primary: cheap, fast, capable coder | TESTFORGE_PRIMARY_MODEL |
| moonshotai/kimi-k2.6 | Fallback: different provider (hit when primary rate-limits/fails) | TESTFORGE_FALLBACK_MODEL |

Endpoint shape:

curl -X POST http://localhost:33221/generate-and-run \
  -H "Content-Type: application/json" \
  -d '{
    "findings": [{ "title": "…", "description": "…", "filePath": "…",
                   "lineNumber": 42, "severity": "high", "rule": "…",
                   "fixSuggestion": "…" }],
    "maxFindings": 3,
    "cluster": "edge-case"
  }'
# → { generationId, provider, generationMs, runMs,
#     results: [{ finding, file: { filename, content, reasoning },
#                 attempts: [{ model, ok, durationMs }] }],
#     run: { numPassedTests, numFailedTests, files: [...] } }
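For callers consuming this response from code, a typed summary helper might look like the sketch below. The field names follow the response shape shown above; the concrete types (for example `generationId` as a string) and the `summarizeGeneration` helper are assumptions, as is reading "more than one attempt" as "the first model call did not succeed".

```typescript
interface Attempt {
  model: string;
  ok: boolean;
  durationMs: number;
}

interface GeneratedResult {
  file: { filename: string; content: string; reasoning: string };
  attempts: Attempt[];
}

// Only the fields used below are modelled.
interface GenerateAndRunResponse {
  generationId: string; // assumed to be a string
  provider: string;
  generationMs: number;
  runMs: number;
  results: GeneratedResult[];
  run: { numPassedTests: number; numFailedTests: number };
}

interface GenerationSummary {
  totalTests: number;
  passRate: number; // 0..1, or 0 when no tests ran
  retriedFiles: string[];
}

function summarizeGeneration(res: GenerateAndRunResponse): GenerationSummary {
  const totalTests = res.run.numPassedTests + res.run.numFailedTests;
  return {
    totalTests,
    passRate: totalTests === 0 ? 0 : res.run.numPassedTests / totalTests,
    // Files whose generation needed a second model attempt.
    retriedFiles: res.results
      .filter((r) => r.attempts.length > 1)
      .map((r) => r.file.filename),
  };
}
```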

History endpoints:

| Endpoint | Returns |
|---|---|
| GET /api/generations | List of recent Tier-2 generations (id, cluster, provider, pass/fail counts) |
| GET /api/generations/:id | One generation with the full payload (source files + run details) |

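Cost at OpenRouter list prices: roughly $0.02 per Tier-2 invocation (3 generations × ~1.5k output tokens at Qwen 3.7 Max pricing, the primary model when Tier 2 shipped in 0.25.0; the current default, deepseek/deepseek-v4-flash, may be priced differently). Sandbox compute is free locally.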

Self-host vs managed: the local MCP runs Tier 2 with no quota; you bring your own OpenRouter key (BYOK) and pay OpenRouter directly, or use a local model server at no cost. The managed SaaS at testforge.run gates Tier 2 to the Forge plan ($99/mo, 100 iterations/mo) and handles the keys for you.

Simulate — exercise the running app

Where Tier-2 runs sandboxed unit tests, Simulate boots your app and drives the running system. Needs a root Dockerfile (or docker-compose) it can build, plus Docker running. Async (real sims take minutes): you get a jobId, then poll.

# Kick off (opt into the lanes you want via "dimensions")
curl -X POST http://localhost:33221/simulate \
  -H "Content-Type: application/json" \
  -d '{"repoUrl":"https://github.com/owner/repo",
       "dimensions":["load","chaos","wired","e2e"],
       "journeys":2, "maxPages":8}'
# → { jobId, statusUrl }

# Poll for phased progress + the final result
curl http://localhost:33221/simulate/<jobId>

It clones → detects how to boot (Dockerfile/compose) → builds + boots the app once on an isolated network → runs the requested lanes against it → tears down. If it can't be auto-booted, each lane returns an honest ranReal:false + reason (and a static fallback for load/chaos).

| Lane (dimensions) | What it does |
|---|---|
| load (default) | autocannon ramp (10→500 concurrency) → p50/p90/p99, rps, error rate, breaking-point concurrency |
| chaos | baseline load → inject a fault (restart/pause) → errorRateDuringFault + recoverySeconds |
| agent | ramps a fleet of think-time agents → maxHealthyAgents |
| wired | generates node:test files that import & run your real code inside the booted image (deps resolve from the image; Node apps, v1) |
| e2e | Playwright crawls the running app → console errors, 4xx/5xx, axe a11y violations. Add journeys:N for LLM-authored user journeys (navigate/click/fill/assert) run as a deterministic step-DSL |

maxPages bounds the e2e crawl (default 8); journeys (0 = smoke only) sets how many user journeys the model authors. concurrencyLevels, durationPerLevelSec, faultType tune load/chaos.
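A small request builder can catch misspelled lane names before the job is submitted. The sketch below is hypothetical client-side code: `buildSimulateRequest` is not part of the package, the lane list and the `maxPages` default of 8 come from the text above, and treating `journeys` as 0 when omitted is an assumption.

```typescript
const LANES = ["load", "chaos", "agent", "wired", "e2e"] as const;
type Lane = (typeof LANES)[number];
const LANE_SET = new Set<string>(LANES);

interface SimulateRequest {
  repoUrl: string;
  dimensions: Lane[];
  journeys: number;
  maxPages: number;
}

// Builds the JSON body for POST /simulate, rejecting unknown lane names.
function buildSimulateRequest(
  repoUrl: string,
  dimensions: string[] = ["load"],
  opts: { journeys?: number; maxPages?: number } = {},
): SimulateRequest {
  const unknown = dimensions.filter((d) => !LANE_SET.has(d));
  if (unknown.length > 0) {
    throw new Error(`Unknown Simulate lanes: ${unknown.join(", ")}`);
  }
  return {
    repoUrl,
    dimensions: dimensions as Lane[],
    journeys: opts.journeys ?? 0, // assumed default: smoke only
    maxPages: opts.maxPages ?? 8, // documented default
  };
}
```

The returned object can be passed to `JSON.stringify` and sent as the request body in place of the hand-written JSON in the curl example.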

Manual MCP Setup

Cursor / Windsurf / Claude Desktop

Open IDE settings → MCP → add server:

{
  "mcpServers": {
    "testforge": {
      "command": "npx",
      "args": ["-y", "@whitenoisenpm/testforge-mcp@latest"],
      "env": {
        "TESTFORGE_MCP_PORT": "33221",
        "OPENROUTER_API_KEY": "sk-or-v1-…"
      }
    }
  }
}

OPENROUTER_API_KEY is optional: it is only needed for Tier 2, so omit that line for Tier-1-only use.

VS Code

Use the Continue / Cline extension and add the same JSON to its MCP config block.

MCP Tools

| Tool | What it does | Latency |
|---|---|---|
| testforge_analyze | Synchronous: scan codebase structure (files, endpoints, dependencies, tech stack) | seconds |
| testforge_quick_scan | Async: security + unit dimensions only. Streams progress via SSE. | ~30s |
| testforge_test | Async: full suite across all dimensions. Streams progress via SSE. Persists summary to SQLite on completion (since 0.2.19). | 1–5 min |
| testforge_report | Get or generate a structured PRD report for a completed test run | seconds |

REST API (running standalone)

# Health
curl http://localhost:33221/health
# → {"status":"ok","version":"0.37.0"}

# Public-status check (for badges/uptime)
curl http://localhost:33221/api/reports/latest
# → 404 {"error":"No reports yet"} if SQLite is empty;
#   the most recent report otherwise (no more seed/demo data fallback).

# Synchronous full analysis of a public repo
curl -X POST http://localhost:33221/clone-and-analyze \
  -H "Content-Type: application/json" \
  -d '{"repoUrl":"https://github.com/owner/repo"}'

# Async test run (background, streams via SSE)
curl -X POST http://localhost:33221/test \
  -H "Content-Type: application/json" \
  -d '{"projectPath":"/path/to/local/project"}'
# → {"testRunId":"...","status":"running","streamUrl":"/mcp/sse"}

# Progress for a specific run
curl http://localhost:33221/test/<testRunId>/progress

# List recent persisted runs (from SQLite)
curl http://localhost:33221/reports

# Single report by id
curl http://localhost:33221/report-view/<reportId>

Local data

| File | Contents |
|---|---|
| ~/.testforge/history.db | SQLite with a reports table: one row per analyze / test run, including per-dimension scores and the full JSON blob in full_data. WAL mode. |
| ~/.testforge/history.db-wal, .db-shm | SQLite WAL sidecars. |
| /tmp/testforge-repos/ (or $TMP_DIR) | Temp clones of public repos for /clone-and-analyze. Deleted after each analysis. |

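With Tier 1, your source never leaves the machine: the dashboard, the analyzers, and the DB are all local. The only outbound calls are the git clone step (when you give it a public URL) and dependency lookups for license/supply-chain checks. Tier 2 and Simulate additionally pull their runner images from GHCR, and Tier 2 sends each selected finding, including the source at the flagged line, to the configured LLM provider; point TESTFORGE_LLM_BASE_URL at a local model server to keep that on your machine too.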

Environment Variables

| Variable | Default | Description |
|----------|---------|-------------|
| TESTFORGE_MCP_PORT | 33221 | Server port. 33221 chosen to avoid common dev-server collisions (3000/3001/5173/8080). |
| TESTFORGE_MCP_HOST | 0.0.0.0 | Bind address. Set 127.0.0.1 to listen loopback-only, e.g. when a reverse proxy / Tailscale Serve already fronts the port and a wildcard bind would collide. Docker/managed deploys keep the wildcard. |
| TMP_DIR | /tmp/testforge-repos | Where /clone-and-analyze puts temp checkouts. |
| LOG_LEVEL | info | Fastify logger level (debug, info, warn, error). |
| DATABASE_URL | (none) | Optional. If set, the server can fall back to Neon for read-replica history. Not required for local-only use. |
| OPENROUTER_API_KEY | (none) | Tier 2 only. OpenRouter key for LLM test generation. Without it, POST /generate-and-run returns 503. Get one at https://openrouter.ai/. |
| TESTFORGE_LLM_BASE_URL | (none) | Tier 2 (local model). Point at an OpenAI-compatible local server (Ollama http://localhost:11434/v1, LM Studio, vLLM) for free, private, keyless generation. From Docker use host.docker.internal. |
| TESTFORGE_PRIMARY_MODEL | deepseek/deepseek-v4-flash | Tier 2 primary model. Any OpenRouter (or local) model id works. |
| TESTFORGE_FALLBACK_MODEL | moonshotai/kimi-k2.6 | Tier 2 fallback when the primary errors or rejects the schema. |
| TESTFORGE_RUNNER_IMAGE | ghcr.io/t4tarzan/testforge-runner:latest | Tier 2 unit-test sandbox image (+ _PYTHON / _GO variants). Auto-pulled from GHCR; builds locally from the bundled Dockerfile if the pull fails. |
| TESTFORGE_LOADGEN_IMAGE | ghcr.io/t4tarzan/testforge-loadgen:latest | Simulate load/chaos driver (autocannon). Override to a local build if needed. |
| TESTFORGE_E2E_IMAGE | ghcr.io/t4tarzan/testforge-e2e:latest | Simulate e2e crawler (Playwright + Chromium + axe). Override to a local build if needed. |
| TESTFORGE_RUNNER_TAG | v0.36.x | Version tag for the runner images (:latest is cached and never re-pulled, so fixes ship on a fresh tag). |

Changelog highlights

0.37.0 — Tier-2 runs your real code, and Simulate exercises the running app. Tier-2 test generation now ships the actual source at the flagged line (so tests reproduce your logic, not a generic example), imports & executes the real module for leaf findings (Node built-ins only), and the new Simulate wired lane runs node:test files against your real code inside the booted app image (dependencies and all). The Simulate engine also gains a browser/E2E lane — Playwright crawls the running app (console errors, 4xx/5xx, axe a11y) and, with journeys:N, runs LLM-authored user journeys as a deterministic step-DSL. New runner images testforge-loadgen + testforge-e2e (multi-arch GHCR, local-build fallback); TESTFORGE_MCP_HOST makes the bind address configurable. (Tier-1 stays at 22 dimensions — this work is all Tier-2 / Simulate.)
0.29.0–0.36.x — Simulate runtime engine (load/agent/chaos, then the Kubernetes runtime tier), polyglot Tier-2 (pytest + Go test), the analyzer flywheel, and precision passes. Per-release detail is in the git history and in Evolution (docs/knowledge/Evolution.md).
0.28.4 — Accessibility analyzer skips test paths. Same suppression the security analyzer got in 0.27.0: a11y per-file checks now skip tests/, __tests__/, e2e/, *.spec.*, etc. Test fixtures routinely contain intentional a11y violations (TestForge's own a11y-jsx fixture is deliberately broken to test the analyzer), so flagging them is noise. Removes 9 fixture findings from the TestForge self-audit + fixes it for any user with a11y component tests.
0.28.3 — Security + a11y precision pass (false-positive cleanup). Caught by the Supabase + TestForge self-audit reports, which showed mostly false positives. Four targeted fixes:
- SQL/NoSQL sink receiver-awareness: isDbQueryCall matched generic method names (get, find, all, run, count) regardless of receiver — so urlParams.get(), Promise.all(), map.get(), and an HTTP get() helper all tripped the "SQL injection" critical. Now split into STRONG methods (query/exec/execute/raw/findOne/findMany/findUnique/findFirst/aggregate — fire always) vs WEAK methods (find/get/all/run/count — only when the receiver looks like a DB handle: db/conn/client/pool/knex/prisma/sequelize/collection/mongoose/etc.). Supabase criticals dropped 14 → 1.
- Minified / vendored skip: files under vendor/, monaco-editor/, *.min.js, or with minified content shape (single >5k-char line, or >500-char average line) are excluded from per-file security analysis. Killed the monaco-editor workerMain.js findings.
- Placeholder secrets: checkHardcodedSecret skips bracketed/templated values ([YOUR-PASSWORD], <password>, ${…}, {{…}}) and common placeholder words (your-/example/changeme/dummy/sample/…) + all-symbol masks. Killed the 4 Supabase connection-string-UI password findings.
- Luminance-aware contrast: checkColorContrast matched ANY hex (#[a-fA-F0-9]{3,6}), so near-black text like #12101A fired "low contrast" — 100 false positives on the TestForge self-audit alone. Now parses the hex, computes WCAG relative luminance, and only flags genuinely light text (luminance > 0.55). Tailwind pattern narrowed to gray-300/400 (gray-500 passes AA on white). TestForge a11y findings dropped 175 → 87.
- Tests: 211 → 221 (+10).
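The luminance-aware contrast rule from 0.28.3 can be sketched with the standard WCAG relative-luminance formula. The function names below are hypothetical and the real `checkColorContrast` may differ in detail; the 0.55 threshold is the one quoted in the changelog entry.

```typescript
// WCAG relative luminance of a #rgb or #rrggbb color; null if not a hex color.
function relativeLuminance(hex: string): number | null {
  const match = /^#?([0-9a-fA-F]{3}|[0-9a-fA-F]{6})$/.exec(hex.trim());
  const raw = match?.[1];
  if (raw === undefined) return null;
  // Expand shorthand (#abc) to full form (#aabbcc).
  const digits =
    raw.length === 3 ? raw.split("").map((ch) => ch + ch).join("") : raw;
  const channel = (offset: number): number => {
    const c = parseInt(digits.slice(offset, offset + 2), 16) / 255;
    // Linearize the sRGB channel value.
    return c <= 0.03928 ? c / 12.92 : Math.pow((c + 0.055) / 1.055, 2.4);
  };
  return 0.2126 * channel(0) + 0.7152 * channel(2) + 0.0722 * channel(4);
}

// Only genuinely light text is a low-contrast candidate on a light background.
function isLightText(hex: string, threshold = 0.55): boolean {
  const luminance = relativeLuminance(hex);
  return luminance !== null && luminance > threshold;
}
```

Under this rule a near-black value such as `#12101A` has a luminance close to zero and is not flagged, while white or pale grays exceed the threshold.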
0.28.2 — Coverage + Mutation dimension-correctness fixes. Both dimensions were producing 0% scores on perfectly testable repos for reasons that turned out to be bugs in their signals — caught by the LangChain in-the-wild report (4,849 test cases but coverage 0%) and the TestForge self-audit (mutation always 0).
- Coverage (runUnitAnalysis): function-name matching between test descriptions and source function names returns ~0 on library code where test names don't echo function names (it('chain handles long context') doesn't match function formatDate()). New rule: when the precise heuristic returns near-zero on a project with substantial tests (≥10 test files AND test:source ratio ≥ 0.1), fall back to the test-to-source-file ratio as an honest secondary signal. App code keeps its precise score (TestForge stayed at 72%); libraries get a real number (LangChain went 0% → 31%).
- Mutation (runMutationAnalysis): the hasTestFramework early-return checked the ROOT package.json devDeps only, ignoring workspace members. TestForge's vitest lives in mcp-server/package.json; root has none, so root devDeps:[] → "no test framework" → score 0, even though the unit-analyzer correctly detected 17 vitest files via AST. Replaced the misleading devDeps signal with the actual test-file count (which the function already iterates anyway). Also extended the test-file regex to recognize pytest (test_*.py, *_test.py) and Go (*_test.go) conventions so cross-language repos count their tests properly. TestForge mutation went 0 → 35; LangChain went 0 → 54.
- Tests: 208 → 211. Two synthetic-fixture tests prove the ratio-fallback engages on abstract-test-name projects + the mutation analyzer scores >0 when test files exist regardless of root devDeps.
0.28.1 — Vulnerable-dependency check is version-aware. Caught by the TestForge self-audit: express ^5.2.1 fired "Potentially Vulnerable Dependency" even though the CVE in our table is on <4.17.3. checkInsecureDependencies previously matched by package name alone, ignoring the declared spec. Now collects version specs from every package.json in fileContents and short-circuits the finding when the spec's major version is strictly greater than the vulnerable upper bound's major (e.g. ^5.2.1 vs <4.17.3 → safe → no finding). When the spec is unknowable (git+URL, "latest", workspace alias), still fires conservatively — safer than hiding a real vuln. New finding description now embeds the declared spec for actionability ("declared as \"express@^4.16.0\""). Tests: 205 → 208 (+3 covering the three branches: safe-major, vulnerable-major, unknowable-spec).
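The three branches described in 0.28.1 (safe-major, vulnerable-major, unknowable-spec) can be sketched as a major-version comparison. `leadingMajor` and `shouldFlagDependency` are hypothetical names, and the parsing here is deliberately simpler than a full semver range parser.

```typescript
// Returns the leading major version of a simple spec such as "^5.2.1" or "4.x",
// or null when the spec is not a plain version (git URL, "latest", workspace alias).
function leadingMajor(spec: string): number | null {
  const match = /^[\^~>=v\s]*(\d+)(?:\.|$)/.exec(spec.trim());
  const digits = match?.[1];
  return digits === undefined ? null : parseInt(digits, 10);
}

// vulnerableBelow is an upper bound such as "<4.17.3".
function shouldFlagDependency(declaredSpec: string, vulnerableBelow: string): boolean {
  const specMajor = leadingMajor(declaredSpec);
  const boundMajor = leadingMajor(vulnerableBelow.replace(/^\s*<\s*/, ""));
  // Unknowable spec or bound: fire conservatively rather than hide a real vuln.
  if (specMajor === null || boundMajor === null) return true;
  // Safe only when the declared major is strictly greater than the bound's major.
  return specMajor <= boundMajor;
}
```

Because only majors are compared, a spec on the same major as the bound (for example `^4.18.0` against `<4.17.3`) is still flagged in this sketch, which matches the conservative rule as described.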
0.28.0 — Go native support. .go files now count in totalFiles + lines. New endpoint regex covers Gin / Echo / Chi / Fiber / Gorilla Mux (r.GET("/path", h), app.Get(...), mux.HandleFunc(...)) plus stdlib http.HandleFunc. New parseGoMod() parses require blocks (single-line + grouped form), skips // indirect deps, normalizes module paths to short package names (github.com/gin-gonic/gin → gin), and correctly handles semantic-import-versioning suffixes (jackc/pgx/v5 → pgx, not v5). Tech-stack tagging covers Gin, Echo, Chi, Fiber, Gorilla Mux, GORM/sqlx, Cobra, Viper, gRPC, structured logging (Zap/Zerolog/Logrus), testify/ginkgo, PostgreSQL (pgx/pq). Function-name extraction handles both package-level func Name(...) and receiver methods func (r *T) Name(...). go moved from UNSUPPORTED_EXT_TO_LANG to NATIVE_EXTS. Same conventional-monorepo recursion (libs/*, packages/*, apps/*, services/*) applies to go.mod. New fixture tests/fixtures/polyglot-go/ (Gin server + GORM repo + go.mod with indirect deps + semver-path suffix) covers all paths. Tests: 197 → 205.
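The go.mod handling described in 0.28.0 can be sketched as follows. This is not the actual `parseGoMod()`: the function names are hypothetical, and only the behaviors named in the entry are covered (single-line and grouped `require`, skipping `// indirect`, and the `/vN` suffix).

```typescript
// "github.com/jackc/pgx/v5" -> "pgx"; "github.com/gin-gonic/gin" -> "gin".
function shortGoPackageName(modulePath: string): string {
  const segments = modulePath.split("/").filter((s) => s.length > 0);
  let last = segments.pop() ?? modulePath;
  // A trailing /vN is a semantic-import-versioning suffix, not the package name.
  if (/^v\d+$/.test(last) && segments.length > 0) {
    last = segments.pop() ?? last;
  }
  return last;
}

// Returns short names of direct dependencies declared in a go.mod file.
function parseGoModRequires(goMod: string): string[] {
  const names: string[] = [];
  let inBlock = false;
  for (const rawLine of goMod.split("\n")) {
    const line = rawLine.trim();
    if (line === "require (") { inBlock = true; continue; }
    if (inBlock && line === ")") { inBlock = false; continue; }

    let entry: string | null = null;
    if (inBlock) entry = line;
    else if (line.startsWith("require ")) entry = line.slice("require ".length);

    // Skip blanks, comment lines, and transitive (indirect) dependencies.
    if (!entry || entry.startsWith("//") || /\/\/\s*indirect\b/.test(entry)) continue;

    const modulePath = entry.split(/\s+/)[0];
    if (modulePath) names.push(shortGoPackageName(modulePath));
  }
  return names;
}
```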
0.27.2 — Accessibility analyzer ignores non-UI files + reports applicable: false on non-UI repos. Caught by the in-the-wild LangChain report: scored 10/100 on Accessibility because the per-file loop ran checkLinkText on README.md and emitted 44 false-positive "Empty Link" findings. The glob fallback was already filtered to .{html,tsx,jsx,vue,svelte} but when called with a pre-populated fileContents (the orchestrator path), the loop iterated every file. Now hard-filters to UI files at the loop level. New applicable: boolean on A11yReport is true when the repo has any UI files, false otherwise (Python lib, CLI, data-science repo). Surfaced in the dashboard JSON so non-UI repos can be rendered as N/A instead of a fake score. Three new test cases prove the LangChain regression won't recur. Tests: 194 → 197.
0.27.1 — "Missing Rate Limiting" check now only fires on web apps. Caught by the in-the-wild LangChain report: a pure Python library with no web framework still got a medium "Missing Rate Limiting" finding because checkMissingRateLimit fired unconditionally on any project without a rate-limit package. 0.27.1 gates the check on a WEB_FRAMEWORK_DEPS set covering JS (Express, Fastify, Koa, Hono, NestJS, Next, Remix, Astro, Nuxt, SvelteKit, h3, Polka, etc.) and Python (FastAPI, Flask, Django, Starlette, Sanic, Tornado, Aiohttp, Litestar, etc.). Libraries, CLIs, data-science repos, and other non-web projects no longer get the finding. Three new test cases cover: (1) libs-monorepo fixture (no web framework) emits zero rate-limit findings, (2) polyglot-python fixture (FastAPI) still emits one, (3) vulnerable-app fixture (Express) still emits one. Tests: 191 → 194.
0.27.0 — Security findings in test paths are now suppressed. Per-file security analysis (SQL/NoSQL injection, RCE sinks, path traversal, open redirect, reflected XSS, hardcoded secrets, etc.) is skipped on any path matching common test conventions: tests/, test/, __tests__/, __mocks__/, __fixtures__/, e2e/, specs/, fixtures/, cypress/, playwright/ dir segments anywhere in the path; *.test.{js,jsx,ts,tsx,mjs,cjs,mts,cts} and *.spec.* suffixes; pytest test_*.py and *_test.py filenames; and .d.ts declaration files (which have no runtime). Triggered by the in-the-wild Supabase report: 125 "critical" findings were almost all SQL-string-concat in e2e/studio/features/*.spec.ts where building the string is exactly what the test is testing. Project-level checks (rate-limiting, vulnerable dependencies, missing security headers) are unaffected — those are real signals regardless of where test files live. New exported isTestPath() helper. New fixture tests/fixtures/test-path-suppression/ proves a production file with a SQL-injection pattern still flags while four sibling test-path files (matching .test.js / e2e/ / __tests__/ / tests/ conventions) carrying the identical pattern do not. Tests: 189 → 191.
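The path conventions listed in 0.27.0 translate fairly directly into a predicate. The sketch below is an approximation written from that list, not the exported `isTestPath()` helper itself, and is named `looksLikeTestPath` to keep the two apart.

```typescript
const TEST_DIR_SEGMENTS = new Set<string>([
  "tests", "test", "__tests__", "__mocks__", "__fixtures__",
  "e2e", "specs", "fixtures", "cypress", "playwright",
]);

function looksLikeTestPath(filePath: string): boolean {
  // Normalize Windows separators, then separate directories from the file name.
  const segments = filePath.replace(/\\/g, "/").split("/");
  const fileName = segments.pop() ?? "";

  // A test directory anywhere in the path.
  if (segments.some((segment) => TEST_DIR_SEGMENTS.has(segment))) return true;
  // *.test.{js,jsx,ts,tsx,mjs,cjs,mts,cts} and *.spec.* suffixes.
  if (/\.test\.(js|jsx|ts|tsx|mjs|cjs|mts|cts)$/.test(fileName)) return true;
  if (/\.spec\.[^.]+$/.test(fileName)) return true;
  // pytest conventions.
  if (/^test_.*\.py$/.test(fileName) || /_test\.py$/.test(fileName)) return true;
  // Declaration files have no runtime.
  return fileName.endsWith(".d.ts");
}
```

Matching whole directory segments rather than substrings keeps production paths such as `src/latest/index.js` or `src/contest.js` in scope for security analysis.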
0.26.2 — Conventional-monorepo recursion (libs/, packages/, apps/, services/). A real-world test on langchain-ai/langchain (run via the In-the-Wild showcase pipeline) caught 0.26.1 returning deps: 0 because LangChain ships every package under libs/<name>/pyproject.toml without declaring [tool.uv.workspace] at root. The workspace-recursion in 0.26.1 only followed declared workspaces. 0.26.2 also globs libs/*/<manifest>, packages/*/<manifest>, apps/*/<manifest>, services/*/<manifest> for pyproject.toml, package.json, and requirements.txt. New helper discoverConventionalMembers(root, manifest) in code-scanner.ts. Same input, very different output on the real LangChain clone: deps: 0, techStack: [] → deps: 27 + 66 dev-deps, techStack: 5 (Pydantic, SQLAlchemy, pytest, httpx/requests, Playwright). New fixture tests/fixtures/libs-monorepo/ mirrors the LangChain shape. Tests: 184 → 189.
0.26.1 — Monorepo / workspace recursion for Python + Node. A real-world test on tiangolo/full-stack-fastapi-template exposed that 0.26.0 detected endpoints + pytest files but returned dependencies: 0, techStack: [] because the manifest-discovery code only read the root pyproject.toml / package.json. The actual deps lived in backend/pyproject.toml (uv workspace member) and frontend/package.json (bun workspace member). Fixes: (1) parse [tool.uv.workspace] members = [...] and recurse, (2) parse package.json "workspaces": [...] (handles globs like packages/*) and recurse, also reads pnpm-workspace.yaml, (3) parse PEP 735 [dependency-groups] (Astral's new standard for dev/test/docs groups), (4) peerDependencies rolled into runtime so framework targeting (React/Vue/Svelte) shows in techStack, (5) @playwright/test recognized as Playwright. Critical bug fix: the PEP 621 dependencies = [...] parser used a non-greedy regex that silently truncated arrays at the first ] — which is inside "fastapi[standard]". New extractTomlArrayBody() helper does string-aware bracket balancing. Same input → very different output: full-stack-fastapi-template went from deps: 0, techStack: [] to deps: 49 + 21 dev, techStack: 11 in 25ms. New fixture tests/fixtures/uv-workspace/ mirrors the failing real-world layout. Tests: 179 → 184.
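The truncation bug fixed in 0.26.1 is easy to reproduce, and the string-aware bracket balancing can be sketched in a few lines. This is an illustration of the technique rather than the real `extractTomlArrayBody()`: the function name is hypothetical, and TOML comments and triple-quoted strings are not handled.

```typescript
// Returns the text between the "[" at openIndex and its matching "]",
// ignoring brackets that appear inside quoted strings. Null if unbalanced.
function extractArrayBody(text: string, openIndex: number): string | null {
  if (text.charAt(openIndex) !== "[") return null;
  let depth = 0;
  let quote: string | null = null;
  for (let i = openIndex; i < text.length; i++) {
    const ch = text.charAt(i);
    if (quote !== null) {
      // Inside a string: skip escaped characters in double-quoted strings.
      if (ch === "\\" && quote === '"') { i++; continue; }
      if (ch === quote) quote = null;
      continue;
    }
    if (ch === '"' || ch === "'") { quote = ch; continue; }
    if (ch === "[") depth++;
    else if (ch === "]") {
      depth--;
      if (depth === 0) return text.slice(openIndex + 1, i);
    }
  }
  return null;
}

// The naive non-greedy pattern stops at the first "]", which here is the one
// inside "fastapi[standard]", so every later entry is silently lost.
function naiveArrayBody(text: string): string | null {
  return /\[([\s\S]*?)\]/.exec(text)?.[1] ?? null;
}
```

On a `dependencies = ["fastapi[standard]>=0.110", "httpx"]` array, `naiveArrayBody` returns a fragment ending in `fastapi[standard`, while `extractArrayBody` returns the full list including `httpx`.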
0.26.0 — Python support (FastAPI / Flask / Django / pytest). Closes the polyglot blind spot that produced false-positive reports on Next.js+FastAPI repos like dclawstack/dclaw-monitor (analyzer was claiming "0 endpoints / no test framework / 0 test files" while the backend actually had 54 routes and 13 pytest files). Code-scanner now: (1) includes .py in the file glob, (2) parses requirements.txt / requirements-dev.txt / pyproject.toml (PEP 621 + Poetry tables) with version-spec / extras / env-marker / comment handling, (3) regex-detects FastAPI/Starlette/Flask/Django routes (@router.get(...), @app.route(...), path(...)), (4) emits a languageCoverage field (natively-analyzed % + counts of skipped languages like Go/Ruby/Rust/Java). Unit-analyzer counts test_*.py / *_test.py / tests/**/*.py pytest files via def test_… regex and adds pytest to the frameworks list. Tech-stack now includes FastAPI, Flask, Django, Starlette, SQLAlchemy, Pydantic, Alembic, Celery, pytest, Uvicorn/Gunicorn, APScheduler, OpenTelemetry, PostgreSQL (asyncpg/psycopg2). Dashboard shows an amber banner whenever languageCoverage < 100%, naming each unsupported language with its file count — no more pretending "0 endpoints" means "no endpoints" when really we just didn't read the files. New fixture tests/fixtures/polyglot-python/ (FastAPI + requirements.txt + pyproject.toml + Next.js); tests: 166 → 179.
0.25.2 — Runner image published to GHCR (ghcr.io/t4tarzan/testforge-runner:0.25.2). No more manual docker build step on first Tier-2 use — the MCP auto-pulls the image (~92 MB) on the first /generate-and-run call. Existing testforge-runner:local builds still work via TESTFORGE_RUNNER_IMAGE override.
0.25.1 — /health now reports the correct version (was hardcoded to "0.6.0").
0.25.0 — Tier 2: Generate & Run. New POST /generate-and-run endpoint takes findings from a Tier-1 report, generates one Vitest file per finding via OpenRouter (primary: Qwen 3.7 Max, fallback: DeepSeek V4 Flash), executes them inside a pre-baked Docker container (node:22-slim + vitest, --network=none --rm), and returns structured pass/fail JSON. Provider rotation is automatic on rate-limit or schema rejection; both attempts are recorded. New ~/.testforge/history.db.generations table persists every iteration. New GET /api/generations + GET /api/generations/:id endpoints. Dashboard grows a "🤖 Generate Tests (Tier 2)" button under any report. Env overrides: TESTFORGE_PRIMARY_MODEL, TESTFORGE_FALLBACK_MODEL, TESTFORGE_RUNNER_IMAGE. Self-host has no quota (BYOK pays OpenRouter); managed SaaS gates Tier 2 to the Forge plan ($99/mo · 100 iterations/mo). Verified end-to-end against tinyhttp/malibu: 3 findings → 3 Vitest files → sandbox run in ~45s total. Demo video at https://testforge.run/malibu-tier2.mp4.
0.24.0 — Dimension deepening, pass 16. Stack analysis polished — substring traps eliminated and new signals added. Old code: dep.includes('vite') matched vitest, vitest-mock-extended, vite-something-else (vitest is a test framework, NOT a bundler — false strength). Now uses strict Sets per category for: test frameworks (jest/vitest/mocha/ava/tap/node-tap/@japa/runner/uvu/tape), lint tools (eslint/prettier/@biomejs/biome/rome/standard/xo/oxlint), ORMs (Prisma/Drizzle/TypeORM/Sequelize/Mongoose/MikroORM/Kysely/Knex/Objection), caches (Redis/ioredis/@upstash/memcached/lru-cache/node-cache/cache-manager), monorepo (Turbo/Nx/Lerna/Rush/Changesets), modern bundlers (vite/esbuild/SWC/Turbopack/Parcel/Rspack/Rollup), and new categories: modern frameworks (Next/Remix/Astro/Nuxt/SvelteKit/SolidStart/Qwik/Hono/h3), runtime validation (Zod/Yup/Joi/ajv/Valibot/Arktype/Effect/io-ts/class-validator), tRPC, TS runtimes (tsx/ts-node/esno). New tsconfig strict-mode detection: parses tsconfig.json and emits a low-severity finding when TypeScript is present but compilerOptions.strict is not true. New "API server without validation library" finding: medium severity, fires only when a server framework is detected (Express/Fastify/Koa/Hono/NestJS) and no Zod/Yup/Joi/Valibot is in deps. Monorepo detection now also looks at nx.json and pnpm-workspace.yaml (not just turbo.json). Tests: 159 → 166. New fixtures tests/fixtures/stack-modern/ (Next + Hono + Prisma + Zod + tRPC + Vitest + Vite + Biome + tsconfig strict) and tests/fixtures/stack-legacy/ (Express + Mongo + no TS + vite-something-else and vitest-mock-extended traps that must NOT count).
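The substring trap fixed in 0.24.0 comes down to the difference between `includes()` and exact set membership. The sketch below uses a small illustrative subset of bundler package names; the helper names are hypothetical and the real category sets are larger.

```typescript
// Illustrative subset only; exact npm package names.
const BUNDLER_PACKAGES = new Set<string>(["vite", "esbuild", "parcel", "rollup"]);

// Old behavior: any dependency whose name contains "vite" counts as a bundler.
function detectBundlersBySubstring(deps: string[]): string[] {
  return deps.filter((dep) => dep.indexOf("vite") !== -1);
}

// New behavior: only exact package-name matches count.
function detectBundlers(deps: string[]): string[] {
  return deps.filter((dep) => BUNDLER_PACKAGES.has(dep));
}
```

With dependencies `vitest`, `vitest-mock-extended`, and `vite-something-else`, the substring version reports three bundlers and the set version reports none, which is the false strength the entry describes.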
0.23.0 — Dimension deepening, pass 15. Visual regression and property-based testing both move from substring soup to AST-aware signals. Visual regression: new lib/visual-regression.ts walks JSXAttribute nodes for the style attribute, counts REAL style={{…}} props (not lines containing "style="), and inspects each object property's string value for hardcoded pixel values (/(\d{2,4})px\b/g) and inline hex color literals (#abc / #abcdef / #abcdef00). Findings fire at proper thresholds (≥3 files with inline styles + no CSS Modules → medium; ≥10 hardcoded px / ≥5 inline colors → low). Property-based testing: new lib/property-based.ts removes the previous noisy "function with this.* > 1 is impure" heuristic (fired on every class method) and replaces substring checks with proper AST detection: imports of fast-check / jsverify / @fast-check/vitest; fc.assert() / fc.property() / fc.check() call sites; typeof x === '…' / Array.isArray(x) / x instanceof Class type guards; assert(...) / invariant(...) runtime invariants. New scope-aware findings: "no framework", "framework but no fc.assert calls" (catches import without usage), "no runtime invariants". Tests: 150 → 159. New fixtures: tests/fixtures/visual-quality/ (Bad.tsx with 3 components heavy in inline styles + comment trap; Good.tsx using CSS Modules) and tests/fixtures/property-quality/ (util.js with type guards + assert.ok; util.property.test.js with two fc.property invariants).
0.22.0 — Dimension deepening, pass 14. Edge-case detection moves from broken line-level checks (the old code asked "does the ENTIRE PROJECT contain .length?" to decide if any array access was bounds-checked — false clean on every real project) to AST-aware footgun detection. New lib/edge-cases.ts catches six real bug shapes: (1) parseInt(x) without explicit radix (MDN best-practice); (2) JSON.parse(x) outside a try/catch (range-tracks try blocks via pre-pass); (3) new Date(nonLiteralString) (Invalid Date silently breaks downstream math); (4) loose equality ==/!= (with the == null exception preserved as canonical nullish check); (5) Number(x) used inline in a binary expression / return / member access where guarding is structurally impossible (parent-aware Babel traverse to skip const n = Number(x) cases); (6) switch without default:. Public report grows a byRule field with hit counts per rule. Score: weighted cost per rule (JSON-parse 4pts, parseInt/Number 2pts, switch/loose 1pt). Tests: 141 → 150. New fixture tests/fixtures/edge-cases/ with src/bad.js (every rule fires) and src/good.js (well-guarded variants; nothing fires, including == null and const n = Number(x); if (isNaN(n))…).
0.21.0 — Dimension deepening, pass 13. Vision and Scope analyzers cleaned up from substring soup to precise matching. The old code had several broken cases the new module fixes: analytics substring matched cache-analytics, crypto-analytics-lib (false positive); author in a README matched the auth feature; the implementation check scanned the README itself so any feature documented there was automatically "implemented." New lib/strategic-signals.ts consolidates: case-insensitive README discovery (README.md, Readme.md, readme.md, docs/README.md); explicit Features-section extraction via markdown parsing (capture content under ## Features until the next ## heading or EOF); strict dep-name sets for product analytics, feature flags, error tracking, APM — no more .includes() traps; word-boundary keyword matching (\b<word>\b) so auth no longer matches author. Vision dimension drops its CI/CD finding (now lives only in DORA, pass 12) to avoid double-surfacing. Code-scanner now loads .md files (was excluded, hence README invisible). Scope's implementation check explicitly excludes README/markdown/package.json from the haystack so docs can't satisfy their own claims. Tests: 133 → 141. New fixtures tests/fixtures/strategic-strong/ (every documented feature actually implemented) and tests/fixtures/strategic-weak/ (Payments + Notifications in README without implementation, plus cache-analytics / crypto-analytics-lib deps as substring traps).
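The three matching changes in 0.21.0 (strict dep-name sets, word-boundary keywords, Features-section extraction) can be sketched like this. All names here are hypothetical, and the dep names in `ANALYTICS_DEPS` are an illustrative subset chosen for the example, not the list shipped in lib/strategic-signals.ts.

```typescript
// Illustrative subset only; the real set of analytics dep names is not listed in the entry.
const ANALYTICS_DEPS = new Set(["posthog-js", "mixpanel", "amplitude-js"]);

// Exact set membership: "cache-analytics" no longer counts as an analytics dep.
function hasAnalyticsDep(deps: string[]): boolean {
  return deps.some((d) => ANALYTICS_DEPS.has(d));
}

// Word-boundary match (\b<word>\b), case-insensitive, with regex metacharacters escaped.
function mentionsKeyword(text: string, word: string): boolean {
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
  return new RegExp(`\\b${escaped}\\b`, "i").test(text);
}

// Capture the content under "## Features" until the next "## " heading or EOF.
function extractFeaturesSection(markdown: string): string {
  const lines = markdown.split("\n");
  const start = lines.findIndex((l) => /^##\s+Features\s*$/i.test(l));
  if (start === -1) return "";
  const rest = lines.slice(start + 1);
  const end = rest.findIndex((l) => /^##\s/.test(l));
  return (end === -1 ? rest : rest.slice(0, end)).join("\n").trim();
}
```

With these helpers, `mentionsKeyword("Original author: Jane", "auth")` is false, while a README line that actually says "auth" still matches.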
0.20.0 — Dimension deepening, pass 12. DORA metrics reframed from fabricated estimates ("Daily (estimated)") to honest capability framing. Real DORA needs git/deploy history; a static analyzer can't see how often the team deploys. What it CAN see is the STATIC SIGNALS that map to each axis. New lib/dora-signals.ts extracts: CI workflow files (.github/workflows/*.yml, .gitlab-ci.yml, .circleci/config.yml, etc.) — parsed via js-yaml to count jobs, detect type-check steps, and identify deploy jobs by name pattern; deployment platform configs (Dockerfile, vercel.json, render.yaml, fly.toml, app.yaml, netlify.toml, serverless.yml, terraform/, kubernetes/, helm/); observability deps (Sentry, Datadog, NewRelic, OpenTelemetry, Honeycomb, Rollbar, Bugsnag — 20+ specific dep names); structured logging deps (pino, winston, bunyan, roarr); feature-flag deps (LaunchDarkly, Statsig, Unleash, Flagsmith, Posthog, GrowthBook, Split, ConfigCat); CODEOWNERS / branch-protection files. Each of the 4 DORA axes is now described as Capability: Good | Partial | Weak rather than a fake frequency string. Per-axis findings fire only when the matching capability is weak: missing CI, missing deploy automation, missing type-check in CI, missing observability, missing feature flags, missing CODEOWNERS. Code-scanner extended to load .github/**, Dockerfile, Procfile, CODEOWNERS (extensionless config files were silently excluded before). Tests: 125 → 133. New fixtures tests/fixtures/dora-mature/ (full CI workflow + Dockerfile + CODEOWNERS + Sentry + pino + Posthog) and tests/fixtures/dora-immature/ (nothing wired).
0.19.0 — Dimension deepening, pass 11. Mutation testing moves from a single test-to-source ratio to AST-based assertion-quality analysis. True mutation testing requires running mutated code (out of scope for a static analyzer), but assertion shapes are a strong proxy for mutation-kill rate and statically observable. New lib/mutation-quality.ts walks each test file's AST, classifies every assertion call into strong (toBe, toEqual, toThrow, toBeInstanceOf, toHaveLength, toMatchObject, toHaveBeenCalledWith, toBeCloseTo, comparison matchers, …), weak (toBeTruthy, toBeFalsy, toBeDefined, toBeNull, toBeNaN, toHaveBeenCalled — a mutation 42 → 41 still satisfies toBeTruthy()), snapshot (toMatchSnapshot, toMatchInlineSnapshot), or other. Handles .not / .resolves / .rejects modifiers, ava t.X style, chai should.X. Public report grows assertionStats (per file) + assertionTotals (project rollup including weakRatio, snapshotRatio, overallVariety = count of distinct strong-matcher types). Score model: base from test-to-source ratio plus adjustments (+5 if variety ≥ 5, −10 if weakRatio > 0.3, −5 if snapshotRatio > 0.5, +10 if Stryker present). Bounded [10, 90]. New findings: medium "test file(s) dominated by weak assertions" (>50% weak); low "snapshot-dominated test file(s)" (≥90% snapshot); low "low matcher variety" (<4 distinct strong types across the project); existing "Stryker not configured" preserved. Tests: 118 → 125. New fixture tests/fixtures/mutation-quality/ with three contrast test files (strong / weak / snapshot) plus a focused source under test.
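The 0.19.0 score adjustments are simple enough to write down directly. The sketch below takes the base score as an input, because the entry does not say how the test-to-source ratio is turned into a base; `mutationScore` and the `AssertionTotals` interface are hypothetical names that mirror the fields mentioned above.

```typescript
interface AssertionTotals {
  weakRatio: number;      // weak assertions / all assertions
  snapshotRatio: number;  // snapshot assertions / all assertions
  overallVariety: number; // count of distinct strong-matcher types
}

// Applies the adjustments listed in the entry, then clamps to [10, 90].
function mutationScore(base: number, totals: AssertionTotals, hasStryker: boolean): number {
  let score = base;
  if (totals.overallVariety >= 5) score += 5;
  if (totals.weakRatio > 0.3) score -= 10;
  if (totals.snapshotRatio > 0.5) score -= 5;
  if (hasStryker) score += 10;
  return Math.min(90, Math.max(10, score));
}

// 60 + 5 (variety) - 10 (weak-heavy) = 55
const example = mutationScore(60, { weakRatio: 0.4, snapshotRatio: 0.1, overallVariety: 6 }, false);
```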
0.18.0 — Dimension deepening, pass 10. Chaos / resilience analyzer moves from substring matching (allContent.includes('SIGTERM') — any comment with the word fooled it) to AST-based detection of actual resilience patterns. New lib/chaos-patterns.ts walks parsed ASTs for: graceful shutdown handlers (process.on('SIGTERM'|'SIGINT', ...)); process-level safety nets (process.on('unhandledRejection'|'uncaughtException', ...)); retry library imports + call sites (p-retry / async-retry / axios-retry / exponential-backoff / cockatiel); manual retry loops (for/while + try/catch + setTimeout — heuristic); Express global error middleware (4-arg handler signature); Fastify setErrorHandler; new AbortController() instantiation; Idempotency-Key header reads. Public report grows patterns field with ChaosPatternHit[] per category. New findings: critical "no try/catch anywhere"; high "no graceful shutdown" / "no global error handler"; medium "no retry/backoff" / "no unhandledRejection guard"; medium "payment code without Idempotency-Key" (only fires when stripe/payment deps detected). Pass 6 covered circuit breakers + outbound timeouts; pass 10 doesn't duplicate those. Tests: 109 → 118. New fixtures tests/fixtures/chaos-resilient/ (every pattern wired) and tests/fixtures/chaos-fragile/ (nothing wired, plus comment-trap mentioning the keywords).
0.17.0 — Dimension deepening, pass 9. License compliance check rewritten from broken to functional. The previous version had a knownGPL list containing ['react', 'vue', 'angular', 'moment', 'underscore'] — all of which are MIT-licensed. It also never populated the copyleftDeps array it promised in its return type. The new version: walks node_modules/ (when present) and reads each package's license field; categorizes per SPDX into permissive (MIT/ISC/Apache-2.0/BSD-*/0BSD/CC0/Unlicense), copyleftWeak (LGPL/MPL/EPL — LGPL correctly classified as weak even though it contains "GPL"), copyleftStrong (GPL/AGPL/OSL/SSPL — incompatible with proprietary distribution), proprietary (UNLICENSED, "SEE LICENSE IN …"), or unknown (missing field). Emits per-category findings: strong copyleft = HIGH severity with the warning about source-disclosure obligations; weak copyleft = MEDIUM with the linking-exception caveat; UNLICENSED = MEDIUM; missing field = LOW. When node_modules/ isn't present, emits an honest "license audit could not run" finding instead of silently returning a fake clean report. Public report shape grows: inspected, byCategory (counts per category), strongCopyleft/weakCopyleft (full package lists), plus the existing fields preserved. runLicenseCheck(deps, projectPath?) — back-compat, but you need projectPath for the audit to actually run. Both index.ts and mcp-server.ts call sites updated. Tests: 101 → 109. New fixture tests/fixtures/license-mixed/ with a synthetic node_modules/ containing MIT, GPL-3.0, LGPL-2.1, UNLICENSED, no-license, and a scoped @scope/scoped-mit package; 8 tests including a focused unit test for categorizeLicense covering 11 edge cases.
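A rough sketch of the license bucketing described in 0.17.0 is shown below. It is a hypothetical reimplementation (hence the `Sketch` suffix), not the package's categorizeLicense: it handles single SPDX identifiers only, ignores compound expressions such as "(MIT OR GPL-3.0)", and treats unrecognized identifiers as unknown, which is an assumption.

```typescript
type LicenseCategory =
  | "permissive" | "copyleftWeak" | "copyleftStrong" | "proprietary" | "unknown";

function categorizeLicenseSketch(license: string | undefined): LicenseCategory {
  if (!license || license.trim() === "") return "unknown"; // missing field
  const id = license.trim().toUpperCase();
  // "UNLICENSED" (proprietary) must be told apart from "Unlicense" (permissive).
  if (id === "UNLICENSED" || id.startsWith("SEE LICENSE IN")) return "proprietary";
  // Anchoring at the start keeps LGPL out of the GPL bucket.
  if (/^(LGPL|MPL|EPL)\b/.test(id)) return "copyleftWeak";
  if (/^(AGPL|GPL|OSL|SSPL)\b/.test(id)) return "copyleftStrong";
  if (/^(MIT|ISC|APACHE-2\.0|BSD-|0BSD|CC0|UNLICENSE)/.test(id)) return "permissive";
  return "unknown"; // assumption: unrecognized identifiers are not guessed at
}
```

The ordering matters mostly for the proprietary check: testing for the exact string `UNLICENSED` first is what prevents it from falling through to the permissive `Unlicense` prefix.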
0.16.0 — Dimension deepening, pass 8. Supply-chain audit becomes lockfile-aware. New lib/supply-chain.ts reads package-lock.json (lockfileVersion 2/3, npm v7+), surfaces the full transitive dependency graph, and adds detection for: (1) non-registry sources — packages installed via git+, github:, file:, link:, or http:// (skip npm's tarball signing); (2) missing integrity hashes — registry-resolved entries with no integrity SRI; (3) duplicate-version drift — same package resolved to multiple versions; (4) transitive CVE matches — the existing hardcoded vuln list now scans EVERY entry in the lockfile, not just direct deps. New "no lockfile" finding when a project has no package-lock.json at all. CVE catalogue extended to include minimist, word-wrap, jsonwebtoken alongside the existing list. Public report shape grows fields: totalTransitive, nonRegistrySources, missingIntegrity, duplicateVersions. runSupplyChainAudit(deps, devDeps, projectPath?) is back-compat: old two-arg callers still work but only see direct-dep CVEs (with the new "no lockfile" finding emitted to flag the limitation). Both index.ts (HTTP server) and mcp-server.ts (test runner) call sites updated to pass projectPath. Tests: 93 → 101. New fixtures tests/fixtures/supply-chain-dirty/ (lock with all four red flags) and tests/fixtures/supply-chain-clean/ (negative).
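Three of the four 0.16.0 lockfile checks (non-registry sources, missing integrity, duplicate versions) can be sketched over an in-memory lockfile. The `packages` map keyed by `node_modules/...` paths follows the lockfileVersion 2/3 layout; the `LockEntry` type is a simplified assumption, and `auditLockfile` is a hypothetical name rather than the package's runSupplyChainAudit.

```typescript
interface LockEntry { version?: string; resolved?: string; integrity?: string; link?: boolean }
interface Lockfile { lockfileVersion: number; packages: Record<string, LockEntry> }

// Source prefixes the entry lists as bypassing the registry.
const NON_REGISTRY = /^(git\+|github:|file:|link:|http:\/\/)/;
const MARKER = "node_modules/";

function auditLockfile(lock: Lockfile) {
  const nonRegistrySources: string[] = [];
  const missingIntegrity: string[] = [];
  const versions = new Map<string, Set<string>>();

  for (const [key, entry] of Object.entries(lock.packages)) {
    if (key === "") continue; // the root project itself
    const idx = key.lastIndexOf(MARKER);
    const pkg = idx === -1 ? key : key.slice(idx + MARKER.length);
    const resolved = entry.resolved ?? "";

    if (entry.link === true || NON_REGISTRY.test(resolved)) {
      nonRegistrySources.push(pkg);
    } else if (resolved !== "" && !entry.integrity) {
      missingIntegrity.push(pkg); // registry-resolved but no SRI hash
    }
    if (entry.version) {
      const seen = versions.get(pkg) ?? new Set<string>();
      seen.add(entry.version);
      versions.set(pkg, seen);
    }
  }

  const duplicateVersions = Array.from(versions.entries())
    .filter(([, seen]) => seen.size > 1)
    .map(([pkg]) => pkg);
  return { nonRegistrySources, missingIntegrity, duplicateVersions };
}
```

Because nested installs appear under keys like `node_modules/b/node_modules/a`, taking the text after the last `node_modules/` is what lets the same package be grouped across the tree.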
0.15.0 — Dimension deepening, pass 7. OWASP Top 10 (2021) coverage redesigned to be honest about what the analyzer can and can't detect. The previous version counted "categories with any finding" as "covered" — which inverted the meaning (more vulnerabilities = higher score). The new report distinguishes three orthogonal signals: analyzer-coverage (which categories the analyzer ships rules for, project-independent), project-findings (per-category severity breakdown from this project's findings), and gaps (categories the analyzer doesn't yet cover — currently A08 Software Integrity and A10 SSRF, explicitly flagged). Public report grows byCategory: OwaspCategoryReport[] with severity-bucketed counts per code and the detector categories that contributed to each. New rollup findings: any OWASP code with ≥1 critical or ≥3 high findings surfaces as a category-level finding (e.g. A03:2021 — Injection: 4 finding(s) with severity breakdown) so dashboards can show the OWASP framing. New lib/owasp-map.ts is the single source of truth for security-category → OWASP code mapping (a finding can map to multiple codes; CORS now correctly maps to A05, not A01). Tests: 86 → 93. New cases assert score-stability across finding count (analyzer-coverage doesn't change when project findings change), correct bucketing, rollup triggers, no-false-positive on sparse low-severity findings, plus an integration test mapping vulnerable-fixture findings into OWASP categories.
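The 0.15.0 rollup trigger (at least 1 critical or at least 3 high findings per OWASP code) can be sketched as below. The category names and the multi-code mapping in `OWASP_MAP` are illustrative assumptions; only the CORS to A05 and injection to A03 pairings come from the entry, and `rollupCodes` is a hypothetical name.

```typescript
type Severity = "critical" | "high" | "medium" | "low";
interface Finding { category: string; severity: Severity }

// Hypothetical excerpt of a security-category to OWASP-code map.
const OWASP_MAP: Record<string, string[]> = {
  "SQL Injection": ["A03"],
  "CORS Misconfiguration": ["A05"],
  "Hardcoded Secrets": ["A02", "A07"], // illustrative: one finding, two codes
};

// Returns the OWASP codes that should surface as category-level findings.
function rollupCodes(findings: Finding[]): string[] {
  const counts: Record<string, { critical: number; high: number }> = {};
  for (const f of findings) {
    for (const code of OWASP_MAP[f.category] ?? []) {
      const bucket = counts[code] ?? { critical: 0, high: 0 };
      if (f.severity === "critical") bucket.critical += 1;
      if (f.severity === "high") bucket.high += 1;
      counts[code] = bucket;
    }
  }
  return Object.keys(counts)
    .filter((code) => counts[code].critical >= 1 || counts[code].high >= 3)
    .sort();
}
```

Note that nothing here feeds into analyzer-coverage: the set of codes the analyzer ships rules for is fixed by the map, so it cannot rise just because a project has more findings.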
0.14.0 — Dimension deepening, pass 6. Load analyzer moves from substring-matching to AST-based middleware + call-pattern detection. New lib/load-patterns.ts walks parsed ASTs for: app.use(rateLimit(...)) / fastify.register(fastifyRateLimit) style middleware registration; app.use(compression()); cache calls (redis.get/set, cache.get/set, etc. — receiver-name discriminated); pool constructions (new Pool({...}), mysql.createPool); timeout configurations (server.timeout = N, axios.create({ timeout }), fetch(url, { signal })); health endpoints (actual route registration matching /health, /ready, /live, /healthz, /status); circuit breaker imports (opossum/brakes/cockatiel) and breaker.fire() calls; sync I/O inside route handlers (readFileSync / writeFileSync / execSync etc. — a new HIGH-severity finding for a real production performance bug). Public report shape grows a patterns object with file+line locations for every detected hit. Bug fixes: circuit-breaker rule had a precedence bug (!hasCircuitBreaker && allContent.includes('fetch') || allContent.includes('axios')) that fired on every codebase containing the word "axios"; now it correctly requires both external-call presence AND missing breaker. Boolean flags (hasRateLimiting, hasCaching, etc.) are now backed by AST hits OR explicit strong-evidence deps (not loose substring matches like 'cache'). Tests: 79 → 86. New fixtures: tests/fixtures/load-resilient/ (every pattern correctly wired) and tests/fixtures/load-fragile/ (nothing wired, plus sync I/O in handlers, plus comments that mention the keywords as a regex trap).
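The precedence bug fixed in 0.14.0 is easy to reproduce in isolation. In the sketch below the function names are hypothetical, and substring checks are kept only to mirror the old expression; the entry says the real fix relies on AST hits instead.

```typescript
// Old shape: && binds tighter than ||, so this parses as
// (!hasCircuitBreaker && includes("fetch")) || includes("axios").
function buggyNeedsBreaker(hasCircuitBreaker: boolean, content: string): boolean {
  return !hasCircuitBreaker && content.includes("fetch") || content.includes("axios");
}

// Intended logic: external calls are present AND no breaker is wired.
function fixedNeedsBreaker(hasCircuitBreaker: boolean, content: string): boolean {
  return !hasCircuitBreaker && (content.includes("fetch") || content.includes("axios"));
}

const withBreaker = "import axios from 'axios'";
const buggyResult = buggyNeedsBreaker(true, withBreaker); // true: fires despite the breaker
const fixedResult = fixedNeedsBreaker(true, withBreaker); // false
```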
0.13.0 — Dimension deepening, pass 5. Accessibility analyzer for JSX/TSX moves from line-by-line regex to AST-based JSX attribute inspection. New lib/a11y-jsx.ts walks JSXElement nodes (parent of JSXOpeningElement so it can see children) and runs proper attribute-aware checks. New rules: img-no-alt (HIGH, WCAG 1.1.1); button-no-accessible-name (HIGH, WCAG 4.1.2) — icon-only buttons that have only a self-closing <svg/> child with no aria-label are now correctly flagged; anchor-no-accessible-name (HIGH, WCAG 2.4.4); anchor-target-blank-no-noopener (MEDIUM — tab-nabbing risk); input-no-label (MEDIUM, WCAG 3.3.2) — excludes type="hidden" | "submit" | "button"; clickable-non-interactive (MEDIUM, WCAG 2.1.1) — <div onClick> without role="button" tabIndex={0}; aria-empty (MEDIUM) — aria-label="" is worse than no aria-label. Children-aware accessible-name resolution: text content, expression containers (string/template/identifier), and recursive child JSX elements all contribute. Same-content <a target="_blank" rel="noopener noreferrer"> no longer fires the tab-nabbing rule (rel parses correctly). Findings carry the matching WCAG criterion in the public wcagCriterion field. HTML/Vue/Svelte files keep the prior regex path (Babel doesn't parse them). Tests: 69 → 79. New fixture tests/fixtures/a11y-jsx/ with 11 JSX patterns covering 6 anti-patterns + their accessible counterparts; 10 tests assert each detection plus negative cases for the accessible variants and <input type="hidden">.
0.12.0 — Dimension deepening, pass 4. Predictive failures goes from 5 project-level heuristic counts to cross-signal per-file risk aggregation. New lib/predictive.ts ingests signals from other dimensions (security findings by severity, N+1 hits, dead exports) plus AST-derived cyclomatic complexity (new lib/complexity.ts) and TODO/FIXME density per file, and produces a ranked list of risk hotspots. Each hotspot carries a reasons[] breakdown so you know exactly why a file scored high (e.g. "security: 2 critical · 1 N+1 hit · hot function (cc=23 in processOrder)"). The aggregator runs in two modes: standalone (derives N+1, dead-code, complexity itself) or cross-signal (caller passes pre-computed findings — preferred when running the full pipeline). The dimension's public report grows a topRiskyFiles: FileRisk[] field; up to 5 hotspots also surface as findings with category: 'Predictive' and severity scaling with the per-file score. Deterministic by construction — same inputs → same scores; weights centralized in lib/predictive.ts. Replaces the previous brace-counting "max nesting" heuristic which over-fired on arrow functions and template literals. Tests: 64 → 69. New test cases cover hotspot surfacing, cross-signal aggregation, multi-reason scoring, the Risk hotspot: finding shape, and the Low-risk no-signal floor.
0.11.0 — Dimension deepening, pass 3. Contract analysis goes from a substring check on filenames to real cross-referencing between OpenAPI/Swagger specs and AST-discovered routes. New lib/openapi-parse.ts loads openapi.{yaml,yml,json} / swagger.* / api-spec.* files (parses with js-yaml, validates the openapi: / swagger: root, extracts paths → method tuples + operationIds). New lib/endpoint-discovery.ts AST-walks source files for app.get/router.post/fastify.put/etc., recording the canonical (method, path) of each registration. canonicalPath normalizes /users/{id} (OpenAPI) and /users/:id (Express) to the same shape so they match. New findings emitted: undocumented endpoints (in code, missing from spec), orphan endpoints (in spec, no handler in code), invalid-but-named spec files (file looks like a spec but no openapi: root / parse error), and a smarter missing-versioning check (only fires when a spec exists to compare). code-scanner.ts now loads YAML/JSON files (with package-lock and yarn-lock explicitly excluded). Tests: 58 → 64. New fixtures: tests/fixtures/contracts/ (spec + Express server with intentional mismatches: 2 matched, 2 undocumented, 1 orphan) and tests/fixtures/contracts-missing/ (8 endpoints, no spec at all).
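The spec-versus-code cross-reference in 0.11.0 comes down to normalizing paths and diffing two sets. The sketch below is an assumption-laden illustration: `canonicalPathSketch` collapses every parameter to `:param` (so differing parameter names still match) and strips trailing slashes, neither of which is confirmed by the entry.

```typescript
interface Endpoint { method: string; path: string }

// Map both "/users/{id}" (OpenAPI) and "/users/:id" (Express) to "/users/:param".
function canonicalPathSketch(path: string): string {
  const normalized = path
    .replace(/\{[^/}]+\}/g, ":param")
    .replace(/:[A-Za-z_][A-Za-z0-9_]*/g, ":param")
    .replace(/\/+$/, "");
  return normalized === "" ? "/" : normalized;
}

function endpointKey(e: Endpoint): string {
  return `${e.method.toUpperCase()} ${canonicalPathSketch(e.path)}`;
}

// undocumented: in code, missing from spec. orphan: in spec, no handler in code.
function compareContract(spec: Endpoint[], code: Endpoint[]) {
  const specKeys = new Set(spec.map(endpointKey));
  const codeKeys = new Set(code.map(endpointKey));
  return {
    undocumented: code.filter((e) => !specKeys.has(endpointKey(e))),
    orphan: spec.filter((e) => !codeKeys.has(endpointKey(e))),
  };
}
```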
0.10.0 — Dimension deepening, pass 2. Unit-analyzer goes from a regex test-counter to an AST-aware test-quality analyzer. New lib/test-quality.ts walks each parseable test file and produces a structured TestFileQuality per file. New report shape carries a quality block at the top level: { totalCases, skippedCases, focusedCases, assertionlessCases, emptyCases, isolatedTestFiles }. Detection of new anti-patterns: it.skip / xit / it.todo (skipped — rot risk); it.only / fit (focused — silently kills sibling tests in CI); test bodies with NO recognized assertion call (expect/assert/should/t.X/snapshot matchers from Jest/Vitest/Mocha/Chai/AVA/tap/node:test/Testing Library); empty test bodies (only comments / trivial statements); test files that import nothing project-relative (testing only the framework). Recognized frameworks expanded: Jest, Vitest, Mocha, AVA, Node Tap, node:test, Testing Library, Chai. Tests: 51 → 58. New fixture tests/fixtures/test-quality/ with healthy + 5 unhealthy patterns; 7 tests assert each detection.
0.9.0 — Dimension deepening, pass 1. The keepers across the 21 dimensions stay at 21; the depth inside each one grows. This pass takes N+1 detection and dead-code detection from regex-and-substring heuristics to AST-aware analysis using the same Babel + visitor + cross-file infrastructure the security spine uses. N+1 detection: new lib/n-plus-one.ts. Walks parsed ASTs for db sinks (.query/.exec/.findOne/.findUnique, sql`, prisma.x, mongoose.x, sequelize.x) nested inside for/for-of/for-in/while/do-while, plus the higher-order arr.forEach/map/filter/reduce/some/every/find/flatMap forms (callback body = loop body). Skips calls already wrapped in Promise.all/Promise.allSettled (parallelised, not N+1). Replaces the prior {/} line-counter that over-fired on inner closures and missed db calls in arrow-function loop bodies. Dead-code detection: new lib/dead-code.ts. For each project file, the AST yields its declared/exported symbols + every referenced identifier + every imported module specifier. An exported symbol is "dead" iff no OTHER file references its name. Replaces the prior allContent.includes(name) heuristic that flagged nothing because every symbol's own declaration line contained its name. Unused-deps check now matches on the module ROOT (lodash) so import { get } from 'lodash/get' counts as a use; this covers a common false-positive that wrongly flagged sub-path-only imports. Tests: 41 → 51. New fixtures at tests/fixtures/n-plus-one/ (3 positive cases: for-of, forEach, classic for; 2 negative cases: Promise.all-wrapped, no-loop) and tests/fixtures/dead-code/ (used vs. unused exports, sub-path import, genuinely-unused dep). Limitations called out in the source: cross-file dead-code is name-based (global-scope collisions over-count as "used"), and N+1 doesn't follow function calls into closures (intentional: following them would explode the FP rate).
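The module-root matching used by the unused-deps check in 0.9.0 can be sketched in a few lines. `moduleRoot` and `unusedDeps` are hypothetical names, and the scoped-package handling (`@scope/pkg/sub` resolving to `@scope/pkg`) is an assumption about how a root would reasonably be defined.

```typescript
// "lodash/get" -> "lodash"; "@scope/pkg/sub" -> "@scope/pkg".
function moduleRoot(specifier: string): string {
  const parts = specifier.split("/");
  return specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0];
}

// A declared dependency counts as used if any non-relative import shares its root.
function unusedDeps(declared: string[], importedSpecifiers: string[]): string[] {
  const used = new Set(
    importedSpecifiers.filter((s) => !s.startsWith(".")).map(moduleRoot),
  );
  return declared.filter((dep) => !used.has(dep));
}

const unused = unusedDeps(["lodash", "left-pad"], ["lodash/get", "./util"]);
// unused is ["left-pad"]: the sub-path import keeps lodash off the list.
```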
0.8.1 — Patch. Internal cleanup matching the repo-wide lint backlog clearance (127 → 0 errors). Type tightening across mcp-server.ts, local-db.ts, types.d.ts: replaced any / as any with Awaited<ReturnType<…>>, unknown with narrowing, and concrete shapes (e.g. new ReportRow). Dead-code purges (unused regex constants in accessibility-analyzer.ts, unused findParamForExpression in function-summaries.ts, dead imports in mcp-server.ts / test-runner.ts). catch (err: any) → (err) then (err as Error).message. No behavior changes; same analyzer outputs on the same fixtures (41/41 tests pass).
0.8.0 — Spine, Phase 4c. User-authored rules DSL. Projects can drop a .testforge/rules.yaml (or .yml / .json) at the repo root to declare custom pattern detectors that ride on top of the built-in analyzer — no fork required. Each rule has id, title, severity, category, an optional description/fixSuggestion, and a match block. Match shapes in v1: callee (exact dotted match, string or array), calleeRegex (anchored as written), taintedArg (require the arg at this index to come back tainted via the Phase 2 engine), and argRegex (require the string-literal arg at this index to match). Taint-gated rules get HIGH confidence (real source-to-sink flow); shape-only rules get MEDIUM. Malformed rules log a one-shot warning and are skipped — one bad rule never aborts analysis. Up to 200 rules per project. Rules can also be supplied programmatically via the new userRules?: UserRule[] config field (overrides the on-disk file). Tests: 36 → 41; new tests/fixtures/user-rules/ exercises all three match shapes plus the no-fire negative paths.
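As a rough picture of the 0.8.0 rule shape, the sketch below models a rule as a TypeScript object, as it might be passed through the `userRules` config field. The `UserRuleSketch` interface is a guess built from the field names in the entry, not the package's `UserRule` type; `argRegex` is left out because the entry does not describe its shape, and the rule contents (`legacyDb.exec`) are invented for illustration.

```typescript
interface UserRuleSketch {
  id: string;
  title: string;
  severity: "critical" | "high" | "medium" | "low";
  category: string;
  description?: string;
  fixSuggestion?: string;
  match: {
    callee?: string | string[]; // exact dotted match
    calleeRegex?: string;       // used as written, no implicit anchors
    taintedArg?: number;        // index of the argument that must be tainted
  };
}

// Shape-only check: does this dotted callee name satisfy the rule?
function calleeMatches(rule: UserRuleSketch, callee: string): boolean {
  const { callee: exact, calleeRegex } = rule.match;
  if (exact !== undefined) {
    const names = Array.isArray(exact) ? exact : [exact];
    if (names.includes(callee)) return true;
  }
  if (calleeRegex !== undefined) return new RegExp(calleeRegex).test(callee);
  return false;
}

// Taint-gated rules get HIGH confidence, shape-only rules get MEDIUM.
function confidenceFor(rule: UserRuleSketch): "HIGH" | "MEDIUM" {
  return rule.match.taintedArg !== undefined ? "HIGH" : "MEDIUM";
}

const exampleRule: UserRuleSketch = {
  id: "no-legacy-exec",
  title: "Tainted input reaches legacyDb.exec",
  severity: "high",
  category: "Custom",
  match: { callee: ["legacyDb.exec", "legacyDb.run"], taintedArg: 0 },
};
```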
0.7.0 — Spine, Phase 4b. Cross-file taint propagation. New lib/cross-file-summaries.ts walks every parseable file in a single pre-pass, computes the per-file function summary table (from Phase 4a), then publishes the ones that carry sinks under <resolvedPath>::<exportName> keys. A companion lib/module-resolver.ts resolves relative imports against the candidate file set (.ts, .tsx, .mts, .cts, .js, .jsx, .mjs, .cjs, plus /index.* directory-imports and explicit-extension swaps) without touching disk. Each file gets its own collectFileImports map of "local-name → cross-file key" — handles ESM (import { x }, import x, import * as ns) and CJS (const { x } = require(...), const x = require(...).y, const ns = require(...)). The analyzer's checkCrossFunctionSinkCall now consults the cross-file index for both direct identifier calls and ns.X member calls, emitting findings at the call site of the importing file. Deferred: re-exports (export { x } from './y'), tsconfig path aliases, node_modules resolution, dynamic require(). Tests: 30 → 36; new helpers/db-helper.js (CJS) + helpers/redirect-helper.js (ESM) + cross-file-cjs.js + cross-file-esm.js fixture set.
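Resolving relative imports against a candidate file set, without touching disk, can be sketched as follows. This is a simplified guess at what lib/module-resolver.ts does: `resolveRelative`, `joinPath`, and `crossFileKey` are hypothetical names, paths are assumed to be POSIX-style and project-relative, and the lookup order (exact, extension, index file, extension swap) is an assumption.

```typescript
const EXTENSIONS = [".ts", ".tsx", ".mts", ".cts", ".js", ".jsx", ".mjs", ".cjs"];

// Join a relative specifier onto the importing file's directory.
function joinPath(fromFile: string, spec: string): string {
  const parts = fromFile.split("/").slice(0, -1);
  for (const seg of spec.split("/")) {
    if (seg === "" || seg === ".") continue;
    if (seg === "..") parts.pop();
    else parts.push(seg);
  }
  return parts.join("/");
}

function resolveRelative(fromFile: string, spec: string, files: Set<string>): string | null {
  if (!spec.startsWith(".")) return null; // bare specifiers are deferred
  const base = joinPath(fromFile, spec);
  if (files.has(base)) return base;
  for (const ext of EXTENSIONS) {
    if (files.has(base + ext)) return base + ext;
  }
  for (const ext of EXTENSIONS) {
    if (files.has(`${base}/index${ext}`)) return `${base}/index${ext}`; // directory import
  }
  // Explicit-extension swap: "./a.js" written in source, "a.ts" in the file set.
  const stripped = base.replace(/\.[mc]?jsx?$/, "");
  if (stripped !== base) {
    for (const ext of EXTENSIONS) {
      if (files.has(stripped + ext)) return stripped + ext;
    }
  }
  return null;
}

// Key format from the entry: <resolvedPath>::<exportName>.
function crossFileKey(resolvedPath: string, exportName: string): string {
  return `${resolvedPath}::${exportName}`;
}
```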
0.6.0 — Spine, Phase 4a. Cross-function taint propagation (intra-file). New lib/function-summaries.ts builds a per-file table summarizing each named/aliased function: which parameters land in a sink (and which category), which sanitizers wrap them, whether the return value propagates taint. The analyzer then emits findings at the call site when a helper with a sink summary is called with tainted arguments. Catches function runQuery(q) { db.query(q); } + runQuery('...' + req.body.x) as critical/high SQL injection. Handles named declarations, aliased function expressions (const fn = function() {…}), arrow functions (const fn = (a, b) => …). Per-helper intra-procedural taint runs to a small fixpoint so chains param → const A = param + '…' → const B = A → sink(B) resolve cleanly. Deferred: cross-file resolution (Phase 4b), higher-order references like [].map(handler). Tests: 25 → 30, new cross-function.js fixture covering SQL inj / open redirect / path traversal / XSS via helpers.
0.5.0 — Spine, Phase 3. Structured fix suggestions. Each finding can now carry fix: { description, before, after, importsNeeded?, applicable }. applicable: true means "safe to apply mechanically" — the dashboard / CLI can offer a one-click apply (still asking confirmation). applicable: false means "directional advice, the rewrite needs human judgment." Categories that auto-rewrite: SQL injection (concat / template → parameterized form with $N placeholders + bind array), hardcoded named secrets (const api_key = 'sk_…' → const api_key = process.env.API_KEY), reflected XSS via res.send (wrap argument with escape()), innerHTML / dangerouslySetInnerHTML (wrap with DOMPurify.sanitize(...)). Description-only suggestions for eval/Function/exec, open redirect, path traversal, CORS wildcard, sensitive field in res.json (destructure-omit). Public response shape stays additive; old consumers unaffected.
0.4.0 — Spine, Phase 2. Generalized intra-procedural taint tracking across all sinks (was only SQL injection in 0.3.0). New lib/taint.ts engine: per-file table of Map<localName, {source, sanitizers[]}>, expression-tree walker that traces taint through identifiers, member access, template literals, string concat, conditional/logical ops, and JSON.parse. Recognizes 20+ sanitizers (DOMPurify, sanitize-html, escape, path.normalize, parseInt/Number, encodeURIComponent, allowlist .includes()/.has()). New per-finding flow field — narrative like "argument flows from request through DOMPurify.sanitize". confidence semantics tightened: HIGH = source→sink no sanitizer, MEDIUM = sanitizer in path, LOW = pattern matched without taint. All 6 sink categories (SQL inj, RCE, path traversal, open redirect, reflected XSS, DOM XSS) now share the same engine — adding a new source or sanitizer extends all of them at once.
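The confidence semantics introduced in 0.4.0 follow directly from the taint table. The sketch below assumes the table shape quoted in the entry (`Map<localName, {source, sanitizers[]}>`); `confidenceOf` and `describeFlow` are hypothetical helpers, and the flow wording is modeled on the single example given above.

```typescript
interface TaintInfo { source: string; sanitizers: string[] }
type TaintTable = Map<string, TaintInfo>;

// HIGH = source reaches sink with no sanitizer, MEDIUM = sanitizer in the path,
// LOW = the sink pattern matched but the argument is not tracked as tainted.
function confidenceOf(argName: string, table: TaintTable): "HIGH" | "MEDIUM" | "LOW" {
  const info = table.get(argName);
  if (!info) return "LOW";
  return info.sanitizers.length === 0 ? "HIGH" : "MEDIUM";
}

// Narrative for the per-finding flow field.
function describeFlow(info: TaintInfo): string {
  const via = info.sanitizers.length > 0 ? ` through ${info.sanitizers.join(", ")}` : "";
  return `argument flows from ${info.source}${via}`;
}

const table: TaintTable = new Map([
  ["rawHtml", { source: "request", sanitizers: [] }],
  ["cleanHtml", { source: "request", sanitizers: ["DOMPurify.sanitize"] }],
]);
```

Because every sink category reads from the same table, registering one more sanitizer name changes the confidence for all of them at once, which matches the design point made in the entry.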
0.3.0 — Spine, Phase 1. Security analyzer moved from line-level regex to a Babel AST traversal. New per-finding confidence field (high / medium / low). Inline suppression comments (// testforge-disable-next-line <category> and // testforge-disable-file <category>). Findings now carry a column number alongside the line. File-size cap (500 KB) and per-file 250 ms parse-and-traverse budget. Basic intra-procedural taint: SQL injection detection catches const q = '…' + req.x; db.query(q); shape, not just inline interpolation. False-positive corpus and true-positive corpus added under tests/fixtures/ to lock in the new precision. eval() re-categorized from XSS to "Dangerous Functions" (more accurate — it's RCE, not script-injection). Old consumers unaffected: the public response shape is additive-only.
0.2.19 — /test and /quick-scan now persist their summary to ~/.testforge/history.db on completion (previously written to in-memory Maps only — runs evaporated on restart).
0.2.18 — Default port changed from 3001 → 33221 to avoid local-dev collisions. /api/reports/latest returns 404 when the local DB is empty instead of fabricated seed data. fast-json-stringify listed as direct dep (defensive against npx cache quirks). /health now reports the actual package version.
0.2.17 and earlier — see git history.

License

MIT
