deadweightjs-core
v0.4.0
Core engine for deadweight: trace adapters, segmenter, ablation, judge, scan aggregation. Used by the deadweightjs CLI; safe to depend on directly if you want to embed scanning into another tool.
deadweight
Sentry for prompt waste. Scan LLM traces, ablate each chunk of every prompt, surface the parts that don't measurably influence the output — ranked by tokens you'd save if you cut them.
Top dead-weight chunks
┌────┬──────────┬──────────────────┬─────┬──────┬──────┬──────┬────────┐
│ # │ Chunk │ Type │ N │ Mean │ p95 │ Flag │ Saved │
├────┼──────────┼──────────────────┼─────┼──────┼──────┼──────┼────────┤
│ 1 │ ex-3 │ few_shot_example │ 18 │ 0.97 │ 1.00 │ dead │ $14.20 │
│ 2 │ rag-7 │ rag_document │ 12 │ 0.93 │ 0.98 │ dead │ $11.05 │
│ 3 │ sys-pol │ system_block │ 20 │ 0.91 │ 0.95 │ dead │ $8.66 │
│ … │
└────┴──────────┴──────────────────┴─────┴──────┴──────┴──────┴────────┘
Status: v0.1, ready for early testers. Full Phase A (Sprints 1–5) shipped. The MVP runs locally against your trace store, costs a few cents per scan, and produces a ranked leaderboard. Hosted backend, auth, and CI integrations are Phase B (not built yet).
What is deadweight?
deadweight answers one question: "is each part of my prompt worth the tokens it costs?" It reads LLM traces from your observability platform, ablates each chunk of every prompt, and produces a ranked list of chunks with dollar-savings estimates. The internal comparator LLM does pairwise equivalence — original output vs. ablated output — a deliberately different operation from rubric scoring. The output is a ranked list of prompt chunks, not a quality grade.
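As a rough illustration of the ablation step described above, the sketch below builds one prompt variant per chunk, each with exactly that chunk left out. The `Chunk` shape, the `ablatable` field, the `user_message` and `tool_description` type names, and the `buildAblations` function are assumptions made for this example; they are not taken from deadweightjs-core.

```typescript
// Hypothetical chunk shape; the real segmenter types in deadweightjs-core may differ.
type ChunkType =
  | "system_block"
  | "rag_document"
  | "few_shot_example"
  | "tool_description" // assumed name
  | "user_message"; // assumed name

interface Chunk {
  id: string;
  type: ChunkType;
  text: string;
  ablatable: boolean; // assumption: some chunks (e.g. the user turn) are never removed
}

interface Ablation {
  removedChunkId: string;
  prompt: string; // the prompt with exactly one chunk left out
}

// Leave-one-out: build one ablated prompt per ablatable chunk.
function buildAblations(chunks: Chunk[]): Ablation[] {
  return chunks
    .filter((c) => c.ablatable)
    .map((removed) => ({
      removedChunkId: removed.id,
      prompt: chunks
        .filter((c) => c.id !== removed.id)
        .map((c) => c.text)
        .join("\n\n"),
    }));
}

const demoChunks: Chunk[] = [
  { id: "sys-pol", type: "system_block", text: "You are a support agent.", ablatable: true },
  { id: "ex-3", type: "few_shot_example", text: "Q: hi\nA: hello", ablatable: true },
  { id: "user", type: "user_message", text: "Where is my order?", ablatable: false },
];

const ablations = buildAblations(demoChunks);
```

Each variant would then be replayed against the model, and the judge would compare its output with the original output for equivalence.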
What it's not
deadweight is not an output evaluator. It does not score whether your LLM responses are helpful, accurate, safe, or on-brand — Langfuse, LangSmith, Arize Phoenix, and Braintrust already do that, and deadweight reads from them rather than competing.
(Full framing in SPEC.md §0.)
What it does
- Ingests recent LLM traces from Langfuse (or any JSONL file with the right shape — see docs/openai-sdk-shim.md).
- Segments each trace's prompt into chunks: system blocks, RAG docs, few-shot examples, tool descriptions.
- Ablates every ablatable chunk by re-running the call with that chunk removed.
- Judges each output pair (original vs. without-chunk) for semantic equivalence using a separate LLM with a calibrated prompt (SPEC §9.1).
- Ranks chunks by tokens_saved × mean_score: the top of the list is the most-removable, most-token-heavy content.
- Caches every replay + judgment to disk, so re-runs cost ~$0.
The output is a ranked leaderboard (CLI table or web dashboard) with sample ablations and judge reasoning per chunk. Decide what to delete.
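The ranking step can be sketched as follows. Only the tokens_saved × mean_score formula and the 0.85 equivalence threshold come from this README; the field names, the `rankChunks` function, and the rule that a chunk is flagged "dead" when its mean score reaches the threshold are assumptions for illustration.

```typescript
interface ChunkStats {
  chunkId: string;
  tokensSaved: number; // tokens removed if the chunk is cut
  scores: number[]; // judge equivalence scores in [0, 1], one per sampled ablation
}

interface RankedChunk {
  chunkId: string;
  n: number; // number of ablations judged
  mean: number; // mean equivalence score
  rankScore: number; // tokens_saved × mean_score
  flag: "dead" | "live";
}

// Sort chunks so the most removable, most token-heavy ones come first.
function rankChunks(stats: ChunkStats[], threshold = 0.85): RankedChunk[] {
  return stats
    .filter((s) => s.scores.length > 0)
    .map((s): RankedChunk => {
      const mean = s.scores.reduce((a, b) => a + b, 0) / s.scores.length;
      return {
        chunkId: s.chunkId,
        n: s.scores.length,
        mean,
        rankScore: s.tokensSaved * mean,
        // Assumption: flag on the mean score alone.
        flag: mean >= threshold ? "dead" : "live",
      };
    })
    .sort((a, b) => b.rankScore - a.rankScore);
}

// Made-up inputs, chosen only to show the ordering.
const ranked = rankChunks([
  { chunkId: "sys-pol", tokensSaved: 300, scores: [0.9, 1.0] },
  { chunkId: "ex-3", tokensSaved: 1200, scores: [1.0, 1.0] },
  { chunkId: "rag-7", tokensSaved: 400, scores: [0.5, 0.5] },
]);
```

Here `ex-3` ranks first (1200 × 1.0), `sys-pol` second, and `rag-7` last with a "live" flag because its mean of 0.5 is below the threshold.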
Quickstart (90 seconds)
npm i -g deadweightjs # adds `deadweight` to PATH
mkdir my-scan && cd my-scan
deadweight init # interactive — paste keys
deadweight scan --limit 200 # ingest, ablate, rank
deadweight report # leaderboard
deadweight serve # dashboard at :3737
Requires Node 20+. The dashboard is bundled, so there are no extra install steps.
From source (for hacking on deadweight itself)
git clone https://github.com/arwenizEr/deadweight && cd deadweight
pnpm install && pnpm -r build
cd packages/cli && npm link && cd ../.. # adds `deadweight` to PATH
Try it without keys (local mocks)
Two terminals, no real LLM credentials needed:
node dev/mock-langfuse.mjs # 18 fixture traces on :3001
node dev/mock-llm.mjs # mock judge + replay on :3002
Then in a third terminal:
mkdir mock-scan && cd mock-scan
deadweight init # host: http://localhost:3001 ; any pk/sk
# judge baseURL: http://localhost:3002 ; any key
deadweight scan --limit 18 -y
deadweight serve
Open http://localhost:3737. The mock comparator always returns "outputs differ", so nothing flags as dead weight; the structure of the leaderboard is what to inspect. Real LLMs give you a real influence-score distribution.
What's in scope (v0.2)
| Feature | Status |
| ------------------------------------------------------- | -------------------- |
| Langfuse trace ingestion | ✓ |
| LangSmith trace ingestion | ✓ v0.2.x |
| Arize Phoenix trace ingestion | ✓ v0.2.x |
| OpenLLMetry / OTel JSONL ingestion | ✓ v0.2.x |
| openai-sdk JSONL ingestion | ✓ |
| Segmenter (system/RAG/few-shot/tools, 29 fixture tests) | ✓ |
| Replay against original LLM | ✓ Anthropic + OpenAI |
| Comparator with verbatim SPEC §9.1 prompt | ✓ |
| Disk-cached idempotent re-runs | ✓ |
| Template-signature grouping + sampling | ✓ |
| deadweight scan / report / serve CLI | ✓ |
| Continuous monitoring (cron-style schedules) | ✓ v0.2.0 |
| Hosted backend (packages/server, Postgres + Hono) | ✓ Sprint 8.1+8.2 (Docker-only, unreleased) |
| Local Next.js dashboard | ✓ |
| Cost guardrail above $5 | ✓ |
| --explain-judge for top-3 chunks | ✓ |
What's not in scope (Phase B)
| Feature | Status |
| -------------------------------------- | ---------------------------- |
| Hosted backend / Postgres / accounts | Sprint 8–9 |
| Slack / GitHub PR-comment integrations | Sprint 11 |
| Prompt-rewriting suggestions | Sprint 12 |
| Local / fine-tuned judges (Ollama) | Sprint 13 |
If any of these matter to you, open an issue — early signal influences priority.
Remote mode (hosted backend, Sprint 8 — unreleased on npm)
If you want to accumulate scan history across multiple machines or teammates, point the CLI at a deadweightjs-server instance and skip the local SQLite write entirely:
docker compose up -d # starts postgres + server
deadweight scan --remote http://localhost:8080 --limit 200
deadweight serve --remote http://localhost:8080 # dashboard reads from the server too
The server is a separate workspace package (packages/server,
not yet on npm — Docker is the supported distribution for now). It
exposes a small REST API (/v1/scans, /v1/health, six write
endpoints the CLI uses, and seven read endpoints the dashboard
uses). Remote mode is end-to-end as of Sprint 8.3 — scan, dashboard
list, scan detail, traces, and per-trace chunks all work against
either backend. (Schedules are still local-only; the dashboard's
/schedules page renders a notice in remote mode.)
Config-driven equivalent:
// .deadweight/config.json
{
"source": { "...": "..." },
"judge": { "...": "..." },
"remote": { "url": "http://localhost:8080" }
}
The --remote <url> flag overrides config.remote.url. Pass
--remote-token <token> (or set config.remote.token) once Sprint
8.4 turns on the bearer-token gate; for now, tokens thread through
harmlessly.
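The flag-over-config precedence can be pictured with a small sketch. Only `remote.url`, `remote.token`, `--remote`, and `--remote-token` are named in this README; the `resolveRemote` helper, its types, and the assumption that the token flag also takes precedence over `config.remote.token` are hypothetical.

```typescript
// Hypothetical shapes for illustration.
interface RemoteConfig {
  url?: string;
  token?: string;
}

interface CliFlags {
  remote?: string; // value of --remote <url>
  remoteToken?: string; // value of --remote-token <token>
}

interface ResolvedRemote {
  url: string;
  token?: string;
}

// Returns undefined when neither the flag nor the config sets a URL (local mode).
function resolveRemote(
  flags: CliFlags,
  config: { remote?: RemoteConfig },
): ResolvedRemote | undefined {
  const url = flags.remote ?? config.remote?.url; // flag wins over config
  if (!url) return undefined;
  // Assumption: the token follows the same flag-first precedence.
  const token = flags.remoteToken ?? config.remote?.token;
  return token ? { url, token } : { url };
}

const fromFlag = resolveRemote(
  { remote: "http://other-host:8080" },
  { remote: { url: "http://localhost:8080", token: "example-token" } },
);
const localMode = resolveRemote({}, {});
```

In this example `fromFlag.url` is the flag value while the token still comes from the config, and `localMode` is `undefined`.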
Schedules (continuous monitoring, Sprint 6)
Re-scan on a cron schedule and watch chunks drift over time:
deadweight schedule add weekly-prompts --cron "0 9 * * 1" --limit 200
deadweight schedule install weekly-prompts # prints the cron / Task Scheduler snippet (paste it yourself)
The snippet calls deadweight schedule run weekly-prompts, which is
also what you'd run manually to test. The dashboard's /schedules
page shows trend lines per chunk and a sparkline of recent
mean-scores; /scans/<id> gets a "from schedule: " badge for
scheduled runs. deadweight never writes to crontab on your behalf.
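For readers less familiar with cron syntax, the sketch below labels the five fields of the `--cron "0 9 * * 1"` value used above, assuming the standard five-field crontab format. The `describeCron` helper is hypothetical and is not part of the deadweight CLI.

```typescript
const CRON_FIELDS = ["minute", "hour", "dayOfMonth", "month", "dayOfWeek"] as const;
type CronField = (typeof CRON_FIELDS)[number];

// Split a five-field cron expression into named fields (no validation of values).
function describeCron(expr: string): Record<CronField, string> {
  const parts = expr.trim().split(/\s+/);
  if (parts.length !== CRON_FIELDS.length) {
    throw new Error(`expected 5 cron fields, got ${parts.length}`);
  }
  const out = {} as Record<CronField, string>;
  CRON_FIELDS.forEach((name, i) => {
    out[name] = parts[i];
  });
  return out;
}

// minute 0, hour 9, any day of month, any month, day-of-week 1 (Monday):
// in standard cron this means 09:00 every Monday.
const weekly = describeCron("0 9 * * 1");
```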
Configuration
deadweight init writes .deadweight/config.json in the current directory.
Schema in packages/core/src/config/schema.ts;
full annotated examples in
docs/config.example.json (Langfuse),
docs/config.example.langsmith.json (LangSmith),
docs/config.example.phoenix.json (Arize Phoenix),
docs/config.example.otel.json (OpenLLMetry / OTel JSONL), and
docs/config.example.openai-sdk.json
(openai-sdk JSONL). A shorter inline example follows. Keys never leave your machine.
{
"source": { "type": "langfuse", "host": "...", "publicKey": "...", "secretKey": "..." },
"judge": { "provider": "anthropic", "model": "claude-opus-4-7", "apiKey": "..." },
"replay": { "useOriginalProvider": true, "fallback": { "provider": "openai", "apiKey": "..." } },
"thresholds": { "semanticEquivalence": 0.85 },
"sample": { "perTemplate": 20 },
"pricing": {}
}
Sensible per-model pricing defaults are baked in (claude-opus-4-7,
sonnet-4-6, haiku-4-5; gpt-4o, gpt-4o-mini). Override per model in
pricing to match your contract rates.
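One way to picture a per-model override is the sketch below. The shape of the `pricing` object is not documented in this README, so the `inputPerMTok` field, the `dollarsSaved` helper, the model names, and every rate shown are placeholders invented for illustration; they are neither real provider rates nor deadweight's built-in defaults.

```typescript
// Assumed shape: USD per one million input tokens, keyed by model name.
type Pricing = Record<string, { inputPerMTok: number }>;

// Placeholder defaults for illustration only.
const DEFAULT_PRICING: Pricing = {
  "model-a": { inputPerMTok: 10 },
  "model-b": { inputPerMTok: 1 },
};

// Estimate the dollars saved by cutting `tokensSaved` tokens from each of `calls` calls.
function dollarsSaved(
  model: string,
  tokensSaved: number,
  calls: number,
  overrides: Pricing = {},
): number {
  // Overrides replace the default entry for the same model name.
  const merged: Pricing = { ...DEFAULT_PRICING, ...overrides };
  const price = merged[model];
  if (!price) throw new Error(`no pricing for model ${model}`);
  return (tokensSaved * calls * price.inputPerMTok) / 1_000_000;
}

const atDefaultRate = dollarsSaved("model-a", 500, 1000);
const atContractRate = dollarsSaved("model-a", 500, 1000, {
  "model-a": { inputPerMTok: 4 },
});
```

With these placeholder numbers, cutting a 500-token chunk from 1,000 calls saves 5 dollars at the default rate and 2 dollars at the overridden rate.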
FAQ
How much does a scan cost? A scan of N sampled traces × M ablatable
chunks runs roughly N×M judge calls + N×M replay calls. With
claude-opus-4-7 as the judge that's ~$0.10–$1 per 100 ablations. Re-runs
hit the disk cache and cost $0. The cost guardrail blocks at $5 unless
you pass -y.
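The arithmetic in this answer can be sketched as below. The call counts and the ~$0.10–$1 per 100 ablations range are taken from the text above; the `estimateScan` helper and the choice to compare the upper bound against the $5 guardrail are assumptions, and the real guardrail may estimate cost differently.

```typescript
interface ScanEstimate {
  judgeCalls: number;
  replayCalls: number;
  lowUsd: number;
  highUsd: number;
  needsConfirmation: boolean; // true when the estimate exceeds the guardrail
}

function estimateScan(
  sampledTraces: number, // N
  ablatableChunksPerTrace: number, // M
  guardrailUsd = 5,
): ScanEstimate {
  const ablations = sampledTraces * ablatableChunksPerTrace;
  // ~$0.10–$1 per 100 ablations is $0.001–$0.01 per ablation.
  const lowUsd = ablations * 0.001;
  const highUsd = ablations * 0.01;
  return {
    judgeCalls: ablations, // one judge call per ablation
    replayCalls: ablations, // one replay call per ablation
    lowUsd,
    highUsd,
    // Assumption: the guardrail is checked against the pessimistic bound.
    needsConfirmation: highUsd > guardrailUsd,
  };
}

const small = estimateScan(20, 8); // 160 ablations
const large = estimateScan(200, 8); // 1600 ablations
```

Under these assumptions the 160-ablation scan stays below the guardrail, while the 1,600-ablation scan has an upper bound of about $16 and would need `-y`.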
Can I trust the judge? The judge prompt is copied verbatim from
SPEC §9.1 and there's a 10-fixture calibration
suite (SPEC §9.3) that asserts Pearson
r ≥ 0.85 against hand-anchored targets. Run it with
DEADWEIGHT_RUN_CALIBRATION=1 pnpm --filter deadweightjs-core test
(costs ~$0.10).
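For reference, the Pearson correlation that the calibration suite asserts on can be computed as in this minimal sketch. The `pearson` function and the sample numbers are illustrative only; they are not the suite's code or its fixture data.

```typescript
// Pearson correlation coefficient between two equal-length samples.
function pearson(xs: number[], ys: number[]): number {
  if (xs.length !== ys.length || xs.length < 2) {
    throw new Error("need two equal-length samples with at least 2 points");
  }
  const n = xs.length;
  const meanX = xs.reduce((a, b) => a + b, 0) / n;
  const meanY = ys.reduce((a, b) => a + b, 0) / n;
  let cov = 0;
  let varX = 0;
  let varY = 0;
  for (let i = 0; i < n; i++) {
    const dx = xs[i] - meanX;
    const dy = ys[i] - meanY;
    cov += dx * dy;
    varX += dx * dx;
    varY += dy * dy;
  }
  if (varX === 0 || varY === 0) {
    throw new Error("correlation is undefined for a constant sample");
  }
  return cov / Math.sqrt(varX * varY);
}

// Made-up judge scores and hand-anchored targets.
const judgeScores = [0.95, 0.8, 0.4, 0.1];
const targets = [1.0, 0.75, 0.5, 0.0];
const r = pearson(judgeScores, targets);
const passesCalibration = r >= 0.85;
```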
Does removing a flagged chunk really save money? Probably yes — the judge says the model produced equivalent output without it. But always spot-check the sample ablations on the dashboard before deleting. Don't auto-apply.
Why local-only / no auth? v0.1 is meant for solo developers inspecting their own prompts. Multi-tenant + auth is Phase B (Sprint 8–9).
Are my keys safe? Never logged, never sent anywhere except their
own provider's API. Stored in .deadweight/config.json in your project
directory — gitignore it.
Development
pnpm install
pnpm -r build
pnpm -r test
Requires Node 20+ and pnpm 9+. See docs/cli-tests.md
for the manual test runbook (one section per sprint step).
Architecture decisions log: DECISIONS.md.
Packages
| Package | npm name | What it is |
| --------------- | ------------------- | ------------------------------------------------------------------- |
| packages/core | deadweightjs-core | Adapters, segmenter, ablation, judge, scan aggregation. Pure logic. |
| packages/cli | deadweightjs | The deadweight command (depends on deadweightjs-core). |
| packages/web | deadweightjs-web | Local Next.js dashboard. Bundled into the cli tarball. |
License
MIT. See LICENSE.
