deadweightjs
v0.4.0
Scan LLM traces, ablate prompts, surface dead weight. Local CLI + dashboard. Pitch: "Sentry for prompt waste."
deadweight
Sentry for prompt waste. Scan LLM traces, ablate each chunk of every prompt, surface the parts that don't measurably influence the output — ranked by tokens you'd save if you cut them.
Top dead-weight chunks
┌────┬──────────┬──────────────────┬─────┬──────┬──────┬──────┬────────┐
│ #  │ Chunk    │ Type             │ N   │ Mean │ p95  │ Flag │ Saved  │
├────┼──────────┼──────────────────┼─────┼──────┼──────┼──────┼────────┤
│ 1  │ ex-3     │ few_shot_example │ 18  │ 0.97 │ 1.00 │ dead │ $14.20 │
│ 2  │ rag-7    │ rag_document     │ 12  │ 0.93 │ 0.98 │ dead │ $11.05 │
│ 3  │ sys-pol  │ system_block     │ 20  │ 0.91 │ 0.95 │ dead │ $8.66  │
│ … │
└────┴──────────┴──────────────────┴─────┴──────┴──────┴──────┴────────┘

Status: v0.1, ready for early testers. Full Phase A (Sprints 1–5) shipped. The MVP runs locally against your trace store, costs a few cents per scan, and produces a ranked leaderboard. Hosted backend, auth, and CI integrations are Phase B (not built yet).
What is deadweight?
deadweight answers one question: "is each part of my prompt worth the tokens it costs?" It reads LLM traces from your observability platform, ablates each chunk of every prompt, and produces a ranked list of chunks with dollar-savings estimates. The internal comparator LLM does pairwise equivalence — original output vs. ablated output — a deliberately different operation from rubric scoring. The output is a ranked list of prompt chunks, not a quality grade.
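The ablation step described above can be pictured as a leave-one-out loop over the prompt's chunks. The following is a minimal sketch, not deadweight's actual code: the `Chunk` and `Ablation` shapes, the `ablatable` field, and the `leaveOneOut` helper are all hypothetical names chosen for illustration.

```typescript
// Hypothetical chunk shape; the real segmenter's types may differ.
interface Chunk {
  id: string;
  type: string; // e.g. "system_block", "rag_document", "few_shot_example"
  text: string;
  ablatable: boolean;
}

interface Ablation {
  removedId: string; // the chunk that was cut
  prompt: string;    // the prompt rebuilt without that chunk
}

// Build one prompt variant per ablatable chunk, each with exactly that chunk removed.
function leaveOneOut(chunks: Chunk[]): Ablation[] {
  return chunks
    .filter((c) => c.ablatable)
    .map((removed) => ({
      removedId: removed.id,
      prompt: chunks
        .filter((c) => c.id !== removed.id)
        .map((c) => c.text)
        .join("\n\n"),
    }));
}

const promptChunks: Chunk[] = [
  { id: "sys-pol", type: "system_block", text: "You are a support agent.", ablatable: false },
  { id: "ex-3", type: "few_shot_example", text: "Q: hi\nA: hello", ablatable: true },
  { id: "rag-7", type: "rag_document", text: "Refund policy: 30 days.", ablatable: true },
];

const ablations = leaveOneOut(promptChunks);
```

Each variant would then be replayed against the original model, and the original and ablated outputs handed to the comparator as a pair.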
What it's not
deadweight is not an output evaluator. It does not score whether your LLM responses are helpful, accurate, safe, or on-brand — Langfuse, LangSmith, Arize Phoenix, and Braintrust already do that, and deadweight reads from them rather than competing.
(Full framing in SPEC.md §0.)
What it does
- Ingests recent LLM traces from Langfuse (or any JSONL file with the right shape — see docs/openai-sdk-shim.md).
- Segments each trace's prompt into chunks: system blocks, RAG docs, few-shot examples, tool descriptions.
- Ablates every ablatable chunk by re-running the call with that chunk removed.
- Judges each output pair (original vs. without-chunk) for semantic equivalence using a separate LLM with a calibrated prompt (SPEC §9.1).
- Ranks chunks by tokens_saved × mean_score: the top of the list is the most-removable, most-token-heavy content.
- Caches every replay + judgment to disk, so re-runs cost ~$0.
The output is a ranked leaderboard (CLI table or web dashboard) with sample ablations and judge reasoning per chunk. Decide what to delete.
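The ranking rule in the list above (tokens_saved × mean_score, highest first) can be sketched in a few lines. The `ChunkStats` shape and `rankChunks` function are hypothetical, and the token counts are made-up illustrative values; only the scoring formula comes from this README.

```typescript
// Hypothetical per-chunk aggregate; field names are assumptions.
interface ChunkStats {
  id: string;
  tokensSaved: number; // tokens removed when the chunk is cut
  meanScore: number;   // mean judge equivalence score in [0, 1]
}

// Sort a copy of the input by tokensSaved * meanScore, highest first.
function rankChunks(stats: ChunkStats[]): ChunkStats[] {
  return [...stats].sort(
    (a, b) => b.tokensSaved * b.meanScore - a.tokensSaved * a.meanScore,
  );
}

// Illustrative numbers only.
const ranked = rankChunks([
  { id: "sys-pol", tokensSaved: 300, meanScore: 0.91 },
  { id: "ex-3", tokensSaved: 500, meanScore: 0.97 },
  { id: "rag-7", tokensSaved: 400, meanScore: 0.93 },
]);
```

Multiplying by the mean score means a large chunk that clearly changes the output (low equivalence) sinks in the list, while a large chunk the model ignores rises to the top.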
Quickstart (90 seconds)
npm i -g deadweightjs # adds `deadweight` to PATH
mkdir my-scan && cd my-scan
deadweight init # interactive — paste keys
deadweight scan --limit 200 # ingest, ablate, rank
deadweight report # leaderboard
deadweight serve # dashboard at :3737

Requires Node 20+. The dashboard is bundled, so there are no extra install steps.
From source (for hacking on deadweight itself)
git clone https://github.com/arwenizEr/deadweight && cd deadweight
pnpm install && pnpm -r build
cd packages/cli && npm link && cd ../.. # adds `deadweight` to PATH

Try it without keys (local mocks)
Two terminals, no real LLM credentials needed:
node dev/mock-langfuse.mjs # 18 fixture traces on :3001
node dev/mock-llm.mjs # mock judge + replay on :3002

Then in a third terminal:
mkdir mock-scan && cd mock-scan
deadweight init # host: http://localhost:3001 ; any pk/sk
# judge baseURL: http://localhost:3002 ; any key
deadweight scan --limit 18 -y
deadweight serve

Open http://localhost:3737. The mock comparator always returns "outputs differ", so nothing flags as dead weight; the structure of the leaderboard is what to inspect. Real LLMs give you a real influence-score distribution.
What's in scope (v0.2)
| Feature | Status |
| ------------------------------------------------------- | -------------------- |
| Langfuse trace ingestion | ✓ |
| LangSmith trace ingestion | ✓ v0.2.x |
| Arize Phoenix trace ingestion | ✓ v0.2.x |
| OpenLLMetry / OTel JSONL ingestion | ✓ v0.2.x |
| openai-sdk JSONL ingestion | ✓ |
| Segmenter (system/RAG/few-shot/tools, 29 fixture tests) | ✓ |
| Replay against original LLM | ✓ Anthropic + OpenAI |
| Comparator with verbatim SPEC §9.1 prompt | ✓ |
| Disk-cached idempotent re-runs | ✓ |
| Template-signature grouping + sampling | ✓ |
| deadweight scan / report / serve CLI | ✓ |
| Continuous monitoring (cron-style schedules) | ✓ v0.2.0 |
| Hosted backend (packages/server, Postgres + Hono) | ✓ Sprint 8.1+8.2 (Docker-only, unreleased) |
| Local Next.js dashboard | ✓ |
| Cost guardrail above $5 | ✓ |
| --explain-judge for top-3 chunks | ✓ |
What's not in scope (Phase B)
| Feature | Status |
| -------------------------------------- | ---------------------------- |
| Hosted backend / Postgres / accounts | Sprint 8–9 |
| Slack / GitHub PR-comment integrations | Sprint 11 |
| Prompt-rewriting suggestions | Sprint 12 |
| Local / fine-tuned judges (Ollama) | Sprint 13 |
If any of these matter to you, open an issue — early signal influences priority.
Remote mode (hosted backend, Sprint 8 — unreleased on npm)
If you want to accumulate scan history across multiple machines or teammates, point the CLI at a deadweightjs-server instance and skip the local SQLite write entirely:
docker compose up -d # starts postgres + server
deadweight scan --remote http://localhost:8080 --limit 200
deadweight serve --remote http://localhost:8080 # dashboard reads from the server too

The server is a separate workspace package (packages/server, not yet on npm; Docker is the supported distribution for now). It exposes a small REST API (/v1/scans, /v1/health, six write endpoints the CLI uses, and seven read endpoints the dashboard uses). Remote mode is end-to-end as of Sprint 8.3: scan, dashboard list, scan detail, traces, and per-trace chunks all work against either backend. (Schedules are still local-only; the dashboard's /schedules page renders a notice in remote mode.)
Config-driven equivalent:
// .deadweight/config.json
{
"source": { "...": "..." },
"judge": { "...": "..." },
"remote": { "url": "http://localhost:8080" }
}

The --remote <url> flag overrides config.remote.url. Pass --remote-token <token> (or set config.remote.token) once Sprint 8.4 turns on the bearer-token gate; for now, tokens thread through harmlessly.
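The flag-over-config precedence can be sketched as below. This is an illustration under assumptions, not the CLI's implementation: `CliFlags`, `RemoteConfig`, and `resolveRemote` are hypothetical names, and applying the same precedence to the token is a guess (the README only states it for the URL).

```typescript
// Hypothetical shapes mirroring the flags and config keys named above.
interface RemoteConfig {
  url?: string;
  token?: string;
}

interface CliFlags {
  remote?: string;      // --remote <url>
  remoteToken?: string; // --remote-token <token>
}

// Returns the effective remote settings, or undefined for local mode.
function resolveRemote(
  flags: CliFlags,
  config: { remote?: RemoteConfig },
): RemoteConfig | undefined {
  // The flag wins over config.remote.url.
  const url = flags.remote ?? config.remote?.url;
  if (url === undefined) return undefined;
  // Assumption: the token follows the same flag-first precedence.
  const token = flags.remoteToken ?? config.remote?.token;
  return token === undefined ? { url } : { url, token };
}

const fromFlag = resolveRemote(
  { remote: "http://localhost:8080" },
  { remote: { url: "http://other-host:8080" } },
);
const localMode = resolveRemote({}, {});
```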
Schedules (continuous monitoring, Sprint 6)
Re-scan on a cron schedule and watch chunks drift over time:
deadweight schedule add weekly-prompts --cron "0 9 * * 1" --limit 200
deadweight schedule install weekly-prompts # prints the cron / Task Scheduler snippet; paste it yourself

The snippet calls deadweight schedule run weekly-prompts, which is also what you'd run manually to test. The dashboard's /schedules page shows trend lines per chunk and a sparkline of recent mean-scores; /scans/<id> gets a "from schedule: " badge for scheduled runs. deadweight never writes to crontab on your behalf.
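For readers less familiar with cron syntax, the expression "0 9 * * 1" from the example reads as minute 0, hour 9, any day of month, any month, day-of-week 1, which in standard five-field cron is 09:00 every Monday. The helper below is a small illustrative sketch (`splitCron` is a hypothetical name, not part of deadweight) that names the five fields.

```typescript
interface CronFields {
  minute: string;
  hour: string;
  dayOfMonth: string;
  month: string;
  dayOfWeek: string;
}

// Split a standard five-field cron expression into named fields.
// This only labels the fields; it does not validate ranges or step syntax.
function splitCron(expr: string): CronFields {
  const parts = expr.trim().split(/\s+/);
  if (parts.length !== 5) {
    throw new Error(`expected 5 cron fields, got ${parts.length}`);
  }
  const [minute, hour, dayOfMonth, month, dayOfWeek] = parts as [
    string, string, string, string, string,
  ];
  return { minute, hour, dayOfMonth, month, dayOfWeek };
}

// minute 0, hour 9, any day of month, any month, day-of-week 1 (Monday)
const weekly = splitCron("0 9 * * 1");
```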
Configuration
deadweight init writes .deadweight/config.json in the current directory. The schema is in packages/core/src/config/schema.ts; full annotated examples are in:
- docs/config.example.json (Langfuse)
- docs/config.example.langsmith.json (LangSmith)
- docs/config.example.phoenix.json (Arize Phoenix)
- docs/config.example.otel.json (OpenLLMetry / OTel JSONL)
- docs/config.example.openai-sdk.json (openai-sdk JSONL)

A shorter inline example follows. Keys never leave your machine.
{
"source": { "type": "langfuse", "host": "...", "publicKey": "...", "secretKey": "..." },
"judge": { "provider": "anthropic", "model": "claude-opus-4-7", "apiKey": "..." },
"replay": { "useOriginalProvider": true, "fallback": { "provider": "openai", "apiKey": "..." } },
"thresholds": { "semanticEquivalence": 0.85 },
"sample": { "perTemplate": 20 },
"pricing": {}
}

Sensible per-model pricing defaults are baked in (claude-opus-4-7, sonnet-4-6, haiku-4-5; gpt-4o, gpt-4o-mini). Override per model in pricing to match your contract rates.
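One plausible reading of thresholds.semanticEquivalence is that a chunk is flagged as dead weight once its mean equivalence score reaches the threshold, which is consistent with the sample leaderboard (means of 0.91 to 0.97 against a 0.85 threshold). The sketch below encodes that assumption only; the `flagChunk` name and the "live" label are hypothetical, and the authoritative behavior is whatever packages/core implements.

```typescript
// Mirrors only the threshold field visible in the inline example.
interface Thresholds {
  semanticEquivalence: number;
}

// Assumption: "dead" means the mean equivalence score meets or exceeds the threshold.
// "live" is a placeholder label for everything else.
function flagChunk(meanScore: number, thresholds: Thresholds): "dead" | "live" {
  return meanScore >= thresholds.semanticEquivalence ? "dead" : "live";
}

const thresholds: Thresholds = { semanticEquivalence: 0.85 };
const sysPolFlag = flagChunk(0.91, thresholds);
const lowScoreFlag = flagChunk(0.4, thresholds);
```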
FAQ
How much does a scan cost? A scan of N sampled traces with M ablatable chunks each runs roughly N×M judge calls + N×M replay calls. With claude-opus-4-7 as the judge that's ~$0.10–$1 per 100 ablations. Re-runs hit the disk cache and cost $0. The cost guardrail blocks at $5 unless you pass -y.
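The arithmetic above can be sketched as a back-of-the-envelope estimator. This is illustrative only: `estimateScan` and its parameters are hypothetical, the per-ablation price is something you would supply yourself, and the README's range covers judge cost, so replay cost is not included here.

```typescript
interface ScanEstimate {
  judgeCalls: number;
  replayCalls: number;
  estimatedUsd: number;
  blocked: boolean; // true when the guardrail would stop the scan
}

// N traces with M ablatable chunks each give N*M ablations,
// and each ablation needs one replay call and one judge call.
function estimateScan(
  traces: number,
  chunksPerTrace: number,
  usdPerAblation: number, // assumption: judge cost only
  assumeYes = false,      // corresponds to passing -y
  guardrailUsd = 5,
): ScanEstimate {
  const ablations = traces * chunksPerTrace;
  const estimatedUsd = ablations * usdPerAblation;
  return {
    judgeCalls: ablations,
    replayCalls: ablations,
    estimatedUsd,
    blocked: estimatedUsd > guardrailUsd && !assumeYes,
  };
}

// 200 traces with 5 chunks each, priced at the upper end of the quoted range
// ($1 per 100 ablations, i.e. $0.01 per ablation).
const estimate = estimateScan(200, 5, 0.01);
```

With these inputs the estimate is 1,000 ablations, which at the upper-end price lands above the $5 guardrail, so the scan would need -y to proceed.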
Can I trust the judge? The judge prompt is copied verbatim from SPEC §9.1, and there's a 10-fixture calibration suite (SPEC §9.3) that asserts Pearson r ≥ 0.85 against hand-anchored targets. Run it with
DEADWEIGHT_RUN_CALIBRATION=1 pnpm --filter deadweightjs-core test
(costs ~$0.10).
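To make the calibration criterion concrete, here is a minimal sketch of a Pearson correlation check between judge scores and hand-anchored targets. The `pearson` function is a standard textbook implementation, not the calibration suite's code, and the sample scores are invented for illustration (five values rather than the suite's ten fixtures).

```typescript
// Pearson correlation coefficient of two equal-length samples.
// Returns NaN if either sample has zero variance.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length samples of size >= 2");
  }
  const n = xs.length;
  const mean = (v: number[]) => v.reduce((sum, x) => sum + x, 0) / n;
  const mx = mean(xs);
  const my = mean(ys);
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = (xs[i] as number) - mx;
    const dy = (ys[i] as number) - my;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  return cov / Math.sqrt(varX * varY);
}

// Invented example values: hand-anchored targets vs. scores a judge might return.
const targets = [0.0, 0.25, 0.5, 0.75, 1.0];
const judgeScores = [0.1, 0.2, 0.6, 0.7, 0.95];
const calibrated = pearson(targets, judgeScores) >= 0.85;
```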
Does removing a flagged chunk really save money? Probably yes — the judge says the model produced equivalent output without it. But always spot-check the sample ablations on the dashboard before deleting. Don't auto-apply.
Why local-only / no auth? v0.1 is meant for solo developers inspecting their own prompts. Multi-tenant + auth is Phase B (Sprint 8–9).
Are my keys safe? Keys are never logged and are sent only to their own provider's API. They are stored in .deadweight/config.json in your project directory, so add that file to .gitignore.
Development
pnpm install
pnpm -r build
pnpm -r test

Requires Node 20+ and pnpm 9+. See docs/cli-tests.md for the manual test runbook (one section per sprint step).
Architecture decisions log: DECISIONS.md.
Packages
| Package | npm name | What it is |
| --------------- | ------------------- | ------------------------------------------------------------------- |
| packages/core | deadweightjs-core | Adapters, segmenter, ablation, judge, scan aggregation. Pure logic. |
| packages/cli | deadweightjs | The deadweight command (depends on deadweightjs-core). |
| packages/web | deadweightjs-web | Local Next.js dashboard. Bundled into the cli tarball. |
License
MIT. See LICENSE.
