deadweightjs

v0.4.0

Published

16 days ago

Scan LLM traces, ablate prompts, surface dead weight. Local CLI + dashboard. Pitch: "Sentry for prompt waste."


arwenizer

llm prompt-engineering ablation observability langfuse openai anthropic cli

deadweight

Sentry for prompt waste. Scan LLM traces, ablate each chunk of every prompt, surface the parts that don't measurably influence the output — ranked by tokens you'd save if you cut them.

Top dead-weight chunks
┌────┬──────────┬──────────────────┬─────┬──────┬──────┬──────┬────────┐
│  # │ Chunk    │ Type             │ N   │ Mean │ p95  │ Flag │ Saved  │
├────┼──────────┼──────────────────┼─────┼──────┼──────┼──────┼────────┤
│  1 │ ex-3     │ few_shot_example │ 18  │ 0.97 │ 1.00 │ dead │ $14.20 │
│  2 │ rag-7    │ rag_document     │ 12  │ 0.93 │ 0.98 │ dead │ $11.05 │
│  3 │ sys-pol  │ system_block     │ 20  │ 0.91 │ 0.95 │ dead │  $8.66 │
│  …                                                                   │
└────┴──────────┴──────────────────┴─────┴──────┴──────┴──────┴────────┘

Status: v0.1, ready for early testers. Full Phase A (Sprints 1–5) shipped. The MVP runs locally against your trace store, costs a few cents per scan, and produces a ranked leaderboard. Hosted backend, auth, and CI integrations are Phase B (not built yet).

What is deadweight?

deadweight answers one question: "is each part of my prompt worth the tokens it costs?" It reads LLM traces from your observability platform, ablates each chunk of every prompt, and produces a ranked list of chunks with dollar-savings estimates. The internal comparator LLM judges pairwise equivalence (original output vs. ablated output), which is a deliberately different operation from rubric scoring. The result is a ranking of prompt chunks, not a quality grade for your outputs.

What it's not

deadweight is not an output evaluator. It does not score whether your LLM responses are helpful, accurate, safe, or on-brand — Langfuse, LangSmith, Arize Phoenix, and Braintrust already do that, and deadweight reads from them rather than competing.

(Full framing in SPEC.md §0.)

What it does

Ingests recent LLM traces from Langfuse (or any JSONL file with the right shape — see docs/openai-sdk-shim.md).
Segments each trace's prompt into chunks: system blocks, RAG docs, few-shot examples, tool descriptions.
Ablates every ablatable chunk by re-running the call with that chunk removed.
Judges each output pair (original vs. without-chunk) for semantic equivalence using a separate LLM with a calibrated prompt (SPEC §9.1).
Ranks chunks by tokens_saved × mean_score — top of the list is the most-removable, most-token-heavy content.
Caches every replay + judgment to disk, so re-runs cost ~$0.
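
As a rough illustration of the ablation step, here is a minimal sketch that builds one prompt variant per ablatable chunk. The `Chunk` shape, the `ablatable` field, the `renderPrompt` helper, and the `tool_description` and `user_message` type names are assumptions made for this example; the real types live in packages/core and may differ.

```typescript
// Hypothetical shapes for illustration only; not the deadweightjs-core API.
type ChunkType =
  | "system_block"
  | "rag_document"
  | "few_shot_example"
  | "tool_description" // assumed name
  | "user_message"; // assumed name

interface Chunk {
  id: string;
  type: ChunkType;
  text: string;
  ablatable: boolean;
}

interface Ablation {
  removedId: string;
  prompt: string;
}

// Join chunk texts back into a single prompt string.
function renderPrompt(chunks: Chunk[]): string {
  return chunks.map((c) => c.text).join("\n\n");
}

// One variant per ablatable chunk, each with exactly that chunk removed.
function ablations(chunks: Chunk[]): Ablation[] {
  return chunks
    .filter((c) => c.ablatable)
    .map((target) => ({
      removedId: target.id,
      prompt: renderPrompt(chunks.filter((c) => c.id !== target.id)),
    }));
}

const exampleChunks: Chunk[] = [
  { id: "sys-pol", type: "system_block", text: "Follow the refund policy.", ablatable: true },
  { id: "ex-3", type: "few_shot_example", text: "Q: hi\nA: hello", ablatable: true },
  { id: "user", type: "user_message", text: "Where is my order?", ablatable: false },
];

const variants = ablations(exampleChunks);
```

Each variant would then be replayed against the original model and its output compared with the original output by the judge.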

The output is a ranked leaderboard (CLI table or web dashboard) with sample ablations and judge reasoning per chunk. From there, you decide what to delete.
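
The ranking rule (tokens_saved × mean_score) can be sketched in a few lines. The field names, the token counts, the `tool-1` chunk, and the rule that a chunk is flagged dead when its mean score reaches the 0.85 threshold are assumptions for illustration; only the product used for ordering comes from the description above.

```typescript
// Hypothetical shapes; the mean scores for ex-3, rag-7 and sys-pol are taken
// from the sample leaderboard, the token counts are made up.
interface ChunkStats {
  id: string;
  tokensSaved: number; // tokens you would save by cutting the chunk
  meanScore: number; // mean equivalence score in [0, 1]
}

interface RankedChunk extends ChunkStats {
  rank: number;
  priority: number;
  dead: boolean;
}

function rankChunks(stats: ChunkStats[], threshold = 0.85): RankedChunk[] {
  return stats
    .map((s) => ({
      ...s,
      priority: s.tokensSaved * s.meanScore,
      dead: s.meanScore >= threshold, // assumed flag rule
    }))
    .sort((a, b) => b.priority - a.priority)
    .map((s, i) => ({ ...s, rank: i + 1 }));
}

const ranked = rankChunks([
  { id: "sys-pol", tokensSaved: 400, meanScore: 0.91 },
  { id: "tool-1", tokensSaved: 2000, meanScore: 0.4 },
  { id: "ex-3", tokensSaved: 1200, meanScore: 0.97 },
  { id: "rag-7", tokensSaved: 900, meanScore: 0.93 },
]);
```

Note that under this rule a large chunk with a low score (`tool-1` here) can outrank a small chunk with a high score, which is why the flag is worth reading alongside the rank.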

Quickstart (90 seconds)

npm i -g deadweightjs                        # adds `deadweight` to PATH

mkdir my-scan && cd my-scan
deadweight init                              # interactive — paste keys
deadweight scan --limit 200                  # ingest, ablate, rank
deadweight report                            # leaderboard
deadweight serve                             # dashboard at :3737

Requires Node 20+. The dashboard is bundled — no extra install steps.

From source (for hacking on deadweight itself)

git clone https://github.com/arwenizEr/deadweight && cd deadweight
pnpm install && pnpm -r build
cd packages/cli && npm link && cd ../..      # adds `deadweight` to PATH

Try it without keys (local mocks)

Two terminals, no real LLM credentials needed:

node dev/mock-langfuse.mjs                   # 18 fixture traces on :3001
node dev/mock-llm.mjs                        # mock judge + replay on :3002

Then in a third terminal:

mkdir mock-scan && cd mock-scan
deadweight init     # host: http://localhost:3001 ; any pk/sk
                    # judge baseURL: http://localhost:3002 ; any key
deadweight scan --limit 18 -y
deadweight serve

Open http://localhost:3737. The mock comparator always returns "outputs differ" so nothing flags as dead weight — the structure of the leaderboard is what to inspect. Real LLMs give you a real influence-score distribution.

What's in scope (v0.2)

| Feature                                                 | Status                                     |
| ------------------------------------------------------- | ------------------------------------------ |
| Langfuse trace ingestion                                | ✓                                          |
| LangSmith trace ingestion                               | ✓ v0.2.x                                   |
| Arize Phoenix trace ingestion                           | ✓ v0.2.x                                   |
| OpenLLMetry / OTel JSONL ingestion                      | ✓ v0.2.x                                   |
| openai-sdk JSONL ingestion                              | ✓                                          |
| Segmenter (system/RAG/few-shot/tools, 29 fixture tests) | ✓                                          |
| Replay against original LLM                             | ✓ Anthropic + OpenAI                       |
| Comparator with verbatim SPEC §9.1 prompt               | ✓                                          |
| Disk-cached idempotent re-runs                          | ✓                                          |
| Template-signature grouping + sampling                  | ✓                                          |
| deadweight scan / report / serve CLI                    | ✓                                          |
| Continuous monitoring (cron-style schedules)            | ✓ v0.2.0                                   |
| Hosted backend (packages/server, Postgres + Hono)       | ✓ Sprint 8.1+8.2 (Docker-only, unreleased) |
| Local Next.js dashboard                                 | ✓                                          |
| Cost guardrail above $5                                 | ✓                                          |
| --explain-judge for top-3 chunks                        | ✓                                          |

What's not in scope (Phase B)

| Feature                                | Status     |
| -------------------------------------- | ---------- |
| Hosted backend / Postgres / accounts   | Sprint 8–9 |
| Slack / GitHub PR-comment integrations | Sprint 11  |
| Prompt-rewriting suggestions           | Sprint 12  |
| Local / fine-tuned judges (Ollama)     | Sprint 13  |

If any of these matter to you, open an issue — early signal influences priority.

Remote mode (hosted backend, Sprint 8 — unreleased on npm)

If you want to accumulate scan history across multiple machines or teammates, point the CLI at a deadweightjs-server instance and skip the local SQLite write entirely:

docker compose up -d                                    # starts postgres + server
deadweight scan --remote http://localhost:8080 --limit 200
deadweight serve --remote http://localhost:8080         # dashboard reads from the server too

The server is a separate workspace package (packages/server, not yet on npm — Docker is the supported distribution for now). It exposes a small REST API (/v1/scans, /v1/health, six write endpoints the CLI uses, and seven read endpoints the dashboard uses). Remote mode is end-to-end as of Sprint 8.3 — scan, dashboard list, scan detail, traces, and per-trace chunks all work against either backend. (Schedules are still local-only; the dashboard's /schedules page renders a notice in remote mode.)

Config-driven equivalent:

// .deadweight/config.json
{
  "source": { "...": "..." },
  "judge":  { "...": "..." },
  "remote": { "url": "http://localhost:8080" }
}

The --remote <url> flag overrides config.remote.url. Pass --remote-token <token> (or set config.remote.token) once Sprint 8.4 turns on the bearer-token gate; until then, a token is accepted and passed along but has no effect.
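
The precedence between the flag and the config file can be pictured as below. This is a sketch under assumptions, not the CLI's actual code: the `CliFlags` and `ResolvedRemote` shapes and the `resolveRemote` helper are hypothetical, and it assumes `--remote-token` takes precedence over `config.remote.token` in the same way the URL flag does.

```typescript
// Hypothetical shapes for illustration; not the deadweight CLI internals.
interface RemoteConfig {
  url?: string;
  token?: string;
}

interface CliFlags {
  remote?: string; // value of --remote
  remoteToken?: string; // value of --remote-token
}

interface ResolvedRemote {
  url: string;
  token?: string;
}

// Returns undefined when neither source names a server, i.e. local mode.
function resolveRemote(
  flags: CliFlags,
  config: { remote?: RemoteConfig },
): ResolvedRemote | undefined {
  const url = flags.remote ?? config.remote?.url; // flag wins over config
  if (!url) return undefined;
  const token = flags.remoteToken ?? config.remote?.token; // assumed precedence
  return token ? { url, token } : { url };
}

const fromFlag = resolveRemote(
  { remote: "http://localhost:9090" },
  { remote: { url: "http://localhost:8080" } },
);
const fromConfig = resolveRemote({}, { remote: { url: "http://localhost:8080", token: "t" } });
const localMode = resolveRemote({}, {});
```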

Schedules (continuous monitoring, Sprint 6)

Re-scan on a cron schedule and watch chunks drift over time:

deadweight schedule add weekly-prompts --cron "0 9 * * 1" --limit 200
deadweight schedule install weekly-prompts   # prints the cron / Task Scheduler snippet — paste it yourself

The snippet calls deadweight schedule run weekly-prompts, which is also what you'd run manually to test. The dashboard's /schedules page shows trend lines per chunk and a sparkline of recent mean-scores; /scans/<id> gets a "from schedule: " badge for scheduled runs. deadweight never writes to crontab on your behalf.
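
For readers less familiar with cron syntax, the expression in the example reads as minute 0, hour 9, any day of month, any month, day-of-week 1, i.e. 09:00 every Monday. The helpers below are a small illustrative sketch for standard five-field expressions; `splitCron` and `weekdayName` are names invented here and are not part of deadweight.

```typescript
interface CronFields {
  minute: string;
  hour: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
}

// Split a standard five-field cron expression into named fields.
function splitCron(expr: string): CronFields {
  const parts = expr.trim().split(/\s+/);
  if (parts.length !== 5) {
    throw new Error("expected 5 cron fields, got " + parts.length);
  }
  // The defaults are never used after the length check; they only keep
  // strict index checking satisfied.
  const [minute = "*", hour = "*", dayOfMonth = "*", month = "*", dayOfWeek = "*"] = parts;
  return { minute, hour, dayOfMonth, month, dayOfWeek };
}

const WEEKDAYS = ["Sunday", "Monday", "Tuesday", "Wednesday", "Thursday", "Friday", "Saturday"];

// Map a single numeric day-of-week field to a name (both 0 and 7 mean Sunday).
// Returns undefined for wildcards, lists, ranges, or anything else.
function weekdayName(field: string): string | undefined {
  if (!/^[0-7]$/.test(field)) return undefined;
  return WEEKDAYS[Number(field) % 7];
}

const weekly = splitCron("0 9 * * 1");
const weeklyDay = weekdayName(weekly.dayOfWeek);
```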

Configuration

deadweight init writes .deadweight/config.json in the current directory. The schema is in packages/core/src/config/schema.ts, and fully annotated examples are in docs/config.example.json (Langfuse), docs/config.example.langsmith.json (LangSmith), docs/config.example.phoenix.json (Arize Phoenix), docs/config.example.otel.json (OpenLLMetry / OTel JSONL), and docs/config.example.openai-sdk.json (openai-sdk JSONL). A shorter inline version follows. Keys stay on your machine and are only ever sent to their own provider's API.

{
  "source": { "type": "langfuse", "host": "...", "publicKey": "...", "secretKey": "..." },
  "judge": { "provider": "anthropic", "model": "claude-opus-4-7", "apiKey": "..." },
  "replay": { "useOriginalProvider": true, "fallback": { "provider": "openai", "apiKey": "..." } },
  "thresholds": { "semanticEquivalence": 0.85 },
  "sample": { "perTemplate": 20 },
  "pricing": {}
}

Sensible per-model pricing defaults are baked in (claude-opus-4-7, sonnet-4-6, haiku-4-5; gpt-4o, gpt-4o-mini). Override per model in pricing to match your contract rates.

FAQ

How much does a scan cost? A scan of N sampled traces × M ablatable chunks runs roughly N×M replay calls plus N×M judge calls, so about 2×N×M LLM calls in total. With claude-opus-4-7 as the judge that's ~$0.10–$1 per 100 ablations. Re-runs hit the disk cache and cost ~$0. The cost guardrail blocks scans above $5 unless you pass -y.
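
To make the arithmetic concrete, here is a back-of-the-envelope estimator. It is only a sketch: `estimateScan` is a hypothetical helper, it treats the quoted $0.10 to $1 per 100 ablations as an all-in range, and it assumes the guardrail compares the high end of that range against $5, none of which is confirmed by the implementation.

```typescript
interface ScanEstimate {
  ablations: number;
  judgeCalls: number;
  replayCalls: number;
  lowUsd: number;
  highUsd: number;
  blocked: boolean;
}

const LOW_USD_PER_100 = 0.1; // quoted lower bound per 100 ablations
const HIGH_USD_PER_100 = 1.0; // quoted upper bound per 100 ablations
const GUARDRAIL_USD = 5;

// traces = N sampled traces, chunksPerTrace = M ablatable chunks per trace.
function estimateScan(traces: number, chunksPerTrace: number, assumeYes = false): ScanEstimate {
  const ablations = traces * chunksPerTrace;
  const lowUsd = (ablations / 100) * LOW_USD_PER_100;
  const highUsd = (ablations / 100) * HIGH_USD_PER_100;
  return {
    ablations,
    judgeCalls: ablations, // one judge call per ablation
    replayCalls: ablations, // one replay call per ablation
    lowUsd,
    highUsd,
    blocked: highUsd > GUARDRAIL_USD && !assumeYes, // assumed guardrail rule
  };
}

const bigScan = estimateScan(200, 5); // 1000 ablations
const bigScanYes = estimateScan(200, 5, true); // same scan with -y
const smallScan = estimateScan(20, 5); // 100 ablations
```

Under these assumptions, 200 traces with 5 ablatable chunks each means 1000 ablations and 2000 LLM calls, which lands above the guardrail at the high end of the range.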

Can I trust the judge? The judge prompt is copied verbatim from SPEC §9.1 and there's a 10-fixture calibration suite (SPEC §9.3) that asserts Pearson r ≥ 0.85 against hand-anchored targets. Run it with DEADWEIGHT_RUN_CALIBRATION=1 pnpm --filter deadweightjs-core test (costs ~$0.10).
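
For reference, the Pearson r check can be sketched as follows. The `Pair` type, the `passesCalibration` helper, and the sample scores are illustrative assumptions; the actual calibration suite and its fixtures are defined by SPEC §9.3.

```typescript
// Each pair is [judge score, hand-anchored target]. Values below are made up.
type Pair = [number, number];

// Pearson correlation coefficient of the two series.
function pearson(pairs: Pair[]): number {
  const n = pairs.length;
  if (n < 2) throw new Error("need at least two pairs");

  let sumX = 0;
  let sumY = 0;
  pairs.forEach(([x, y]) => {
    sumX += x;
    sumY += y;
  });
  const meanX = sumX / n;
  const meanY = sumY / n;

  let cov = 0;
  let varX = 0;
  let varY = 0;
  pairs.forEach(([x, y]) => {
    cov += (x - meanX) * (y - meanY);
    varX += (x - meanX) * (x - meanX);
    varY += (y - meanY) * (y - meanY);
  });
  if (varX === 0 || varY === 0) {
    throw new Error("correlation is undefined for a constant series");
  }
  return cov / Math.sqrt(varX * varY);
}

// The calibration criterion: r must reach the minimum (0.85 in the FAQ above).
function passesCalibration(pairs: Pair[], minR = 0.85): boolean {
  return pearson(pairs) >= minR;
}

const agreeing: Pair[] = [[0, 0], [0.5, 0.5], [1, 1]];
const opposed: Pair[] = [[0, 1], [1, 0]];
```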

Does removing a flagged chunk really save money? Probably yes — the judge says the model produced equivalent output without it. But always spot-check the sample ablations on the dashboard before deleting. Don't auto-apply.

Why local-only / no auth? v0.1 is meant for solo developers inspecting their own prompts. Multi-tenant + auth is Phase B (Sprint 8–9).

Are my keys safe? Never logged, never sent anywhere except their own provider's API. Stored in .deadweight/config.json in your project directory — gitignore it.

Development

pnpm install
pnpm -r build
pnpm -r test

Requires Node 20+ and pnpm 9+. See docs/cli-tests.md for the manual test runbook (one section per sprint step).

Architecture decisions log: DECISIONS.md.

Packages

| Package       | npm name          | What it is                                                          |
| ------------- | ----------------- | ------------------------------------------------------------------- |
| packages/core | deadweightjs-core | Adapters, segmenter, ablation, judge, scan aggregation. Pure logic. |
| packages/cli  | deadweightjs      | The deadweight command (depends on deadweightjs-core).              |
| packages/web  | deadweightjs-web  | Local Next.js dashboard. Bundled into the cli tarball.              |

License

MIT. See LICENSE.