
scenegrad v0.0.1

Agents close the gap between scene-now and scene-then. Tiny substrate for goal-conditioned planning over scene state.

Downloads: 99

Readme

scenegrad

Drop into your agent in 2 lines. Get a scrubbable replay. Level up when you're ready.

v0.0.1 · MIT license

[Screenshot: scenegrad viewer]

Above: 12 support-triage trajectories at a glance, then drill into one. The bulk view shows the distribution (3 escalated-vip in red, 6 escalated-t2 in amber, 3 auto-resolved in emerald). Filter chips narrow the grid; click any card to scrub through the full trajectory with typed widgets — the Ticket card on the left shows the world morphing (status: new → investigating → escalated-vip, [CRITICAL] reply appearing), and the Customer card on the right materializes with an ENTERPRISE badge + LTV when the agent runs enrich_with_account. All 12 runs cost ~$0.02 in Haiku tokens.

→ Try the live demo · Single-trace viewer


Is this for you?

scenegrad pays off when your agent does multi-step work that mutates state. Specifically:

  • ✅ Your agent calls multiple tools across multiple turns
  • ✅ Those tools change something — a CRM record, a ticket status, a database row, a scheduled job
  • ✅ You can't tell from the tool-call log alone whether the world ended up in the right state
  • ✅ You want to compare runs across models or prompts and see trajectories, not just final scores
  • ✅ You're using Vercel AI SDK, Anthropic SDK, LangChain, or your own loop (5-line drop-in for any)

scenegrad is not the right fit if:

  • ❌ Your agent is single-call chat / RAG with no state mutation — Phoenix or Helicone will serve you better
  • ❌ You only need output scoring (LLM-as-judge) — Braintrust does that better; consider scenegrad alongside, not instead
  • ❌ You need a full eval platform with annotation queues, datasets-as-a-service, etc. — that's not the scope here

We're explicitly not trying to replace those tools. scenegrad fills the gap they leave: visualizing typed scene state across multi-step trajectories, with a viewer that scales from single-trace to bulk grid.


scenegrad is a tiny observability + evaluation substrate for AI agents. Pay only for what you use:

import { trace } from "scenegrad";
import { generateText } from "ai";
import { anthropic } from "@ai-sdk/anthropic";

const t = trace.start();

// your existing Vercel AI SDK / LangChain / custom loop, untouched:
const result = await generateText({
  model: anthropic("claude-haiku-4-5"),
  tools: { /* your tools, unchanged */ },
  onStepFinish: t.captureStep,         // ← only addition
  prompt: "...",
});

t.dump("./traces/run.jsonl");

Then view it:

npx scenegrad view ./traces           # bulk view of every JSONL in ./traces
npx scenegrad view ./traces/run.jsonl # single-trace view of one file

v0.0.1 note: the CLI requires Bun on your PATH (the bin scripts are TypeScript with a #!/usr/bin/env bun shebang). Node-compatible bin compilation lands in v0.0.2.

That's it. Drop in, run, view your runs in a browser. No goal to design. No assertions to write. No restructuring of your agent.

When you want more — add a snapshot() to capture world state between calls. Add a goal() of assertions to measure gap closure. Each level is opt-in.


The four-tier ladder

| Tier | What you write | What you get |
|---|---|---|
| 0 — trace | trace.start() + one hook | Tool-call timeline, scrubbable. Replaces logs. |
| 1 — + snapshot | Add snapshot: () => fetchWorld() | World deltas between calls. See what changed, not just what was called. |
| 2 — + goal | Add goal: (s) => [...assertions] | Gap-closure curve, drift detection, runtime status() for agent guidance. |
| 3 — + solver | Use defineEnv + LLMSolver / GreedySolver | scenegrad drives the loop — for benches, comparison, leaderboards. |

The same trajectory format flows through all four tiers. You can adopt at tier 0, level up months later as you understand your agent's failure modes.
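For orientation, a per-step record in that format looks roughly like the sketch below. The field names mirror the ones mentioned in this README (tool, scene_before, scene_after); treat the exact shape as an assumption, since scene-otel defines the real wire format.

// Illustrative sketch only; not the documented scenegrad / scene-otel types.
type StepRecord = {
  // which tool the agent called, and with what arguments
  tool: { name: string; args: unknown };
  // world snapshots around the call; present once a snapshot() is configured (tier 1+)
  scene_before?: unknown;
  scene_after?: unknown;
  // assertion results against the goal; present once a goal() is configured (tier 2+)
  assertions?: { name: string; satisfied: boolean; gap: number }[];
};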


Tier 1 — add a snapshot

When tools mutate external state (DBs, APIs, CRMs, queues), seeing what changed is more useful than seeing what was called. Pass a snapshot function:

const t = trace.start({
  snapshot: async () => ({
    user:        await db.users.findOne({ session_id }),
    welcome_sent: await emails.exists({ template: "welcome" }),
  }),
});

Now each step in the trajectory captures both scene_before and scene_after. The viewer can render the world delta per step.
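If you want to eyeball those deltas outside the viewer, a few lines of your own code over the dumped snapshots are enough. This helper is not part of scenegrad; it's just a sketch over plain snapshot objects:

// Shallow diff of two snapshots: returns only the keys whose values changed.
function shallowDelta(
  before: Record<string, unknown>,
  after: Record<string, unknown>,
) {
  const changed: Record<string, { from: unknown; to: unknown }> = {};
  for (const key of new Set([...Object.keys(before), ...Object.keys(after)])) {
    if (JSON.stringify(before[key]) !== JSON.stringify(after[key])) {
      changed[key] = { from: before[key], to: after[key] };
    }
  }
  return changed;
}

// usage, given a parsed step from a dumped trace (shape assumed):
// shallowDelta(step.scene_before, step.scene_after)
//   → e.g. { welcome_sent: { from: false, to: true } }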


Tier 2 — add a goal, get drift detection + status()

Define what "done" looks like as assertions:

import { observe } from "scenegrad";

const watcher = observe({
  snapshot: async () => fetchWorld(),
  goal: (s) => [
    { name: "name_collected",   check: (s) => ({ satisfied: !!s.user?.name,        gap: 1 }) },
    { name: "email_collected",  check: (s) => ({ satisfied: !!s.user?.email,       gap: 1 }) },
    { name: "role_specified",   check: (s) => ({ satisfied: !!s.user?.role,        gap: 1 }) },
    { name: "welcome_email_sent", check: (s) => ({ satisfied: s.welcome_sent,       gap: 1 }) },
  ],
});

// In your existing loop, each turn:
const status = await watcher.status();

const result = await generateText({
  model: anthropic("..."),
  system: `Onboarding. Still unmet: ${status.unmet.map(a => a.name).join(", ")}.`,
  tools: { /* unchanged */ },
  onStepFinish: async ({ toolCalls }) => {
    for (const c of toolCalls ?? [])
      await watcher.recordStep({ tool: { name: c.toolName, args: c.input } });
  },
});

The agent's checklist is now grounded in actual world state — not its working memory. When you evaluate post-hoc, you read the same assertions back through watcher.trajectory(). Spec written once; serves both runtime guidance and evaluation.
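A minimal sketch of that post-hoc read-back, assuming watcher.trajectory() returns the recorded steps with their evaluated assertions (the field names here are assumptions, not the documented API):

// Pull the recorded trajectory and report which assertions the final step left unmet.
const steps = await watcher.trajectory();
const final = steps[steps.length - 1];

for (const a of final.assertions ?? []) {
  console.log(`${a.name}: ${a.satisfied ? "met" : `unmet (gap ${a.gap})`}`);
}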

This is also TDD-shaped agent development: write the assertion → run → watch the gap → tighten. See examples/inbox.ts for the canonical three-iteration progression.


Tier 3 — solver mode, for benches

When you want scenegrad to drive the loop (for benchmarks, comparison across models, controlled tests):

import { defineEnv, LLMSolver, GreedySolver } from "scenegrad";

const task = defineEnv({
  init:  () => ({ count: 0 }),
  goal:  (s) => [{ name: "count = 5",
                   check: s => ({ satisfied: s.count === 5, gap: 5 - s.count }) }],
  tools: () => [{ name: "inc" }, { name: "dec" }],
  step:  (s, t) => t.name === "inc" ? { count: s.count + 1 } : { count: s.count - 1 },
});

await new GreedySolver().solve(task, "default");           // baseline
await new LLMSolver({ model: "claude-haiku-4-5" }).solve(task, "default");  // LLM-driven

Both solvers produce the same SolveResult shape. Compare them on the same env to see how much the LLM drifts from the optimal greedy baseline.
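A small comparison sketch: SolveResult's exact fields aren't spelled out in this README, so the property names below are assumptions.

const baseline = await new GreedySolver().solve(task, "default");
const llmRun   = await new LLMSolver({ model: "claude-haiku-4-5" }).solve(task, "default");

// Assumed fields on SolveResult: a steps array per run.
console.log("greedy steps:", baseline.steps?.length);
console.log("llm steps:   ", llmRun.steps?.length);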


Why this isn't another agent framework

| If you already use… | scenegrad adds… |
|---|---|
| LangChain / LangGraph | Drift measurement and per-task evaluation. LangChain orchestrates; scenegrad measures. |
| OpenTelemetry / Phoenix | A vocabulary for what to observe (scene + goal + diff), not just how to ship spans. |
| Custom eval scripts | A common shape so your evals are comparable across runs, models, teams. |
| System prompts to constrain behavior | A way to say "done" the framework can VERIFY, not just hope the LLM honors. |

scenegrad doesn't replace your agent. It instruments your task so behavior becomes visible, comparable, and refinable.


What scenegrad does NOT do

  • Doesn't write your distance function. Domain-specific. Sometimes hard.
  • Doesn't induce your toolkit. You author it; autocompile refines it over time.
  • Doesn't solve local-optima / dead-end paths. It exposes them; your solver picks the search strategy.
  • Doesn't replace your agent. Your agent runs whatever loop it runs; scenegrad measures it.
  • Doesn't unify cross-domain distance. ARC's "12 cells off" and SAP's "3 audit controls violated" aren't directly comparable — each domain owns its units.
  • Doesn't force the agent to predict gradients. The natural drift signals (gap-not-closing, goal-claimed-but-unmet, vs-baseline) work without polluting the prompt.

The thinking framework

We sense the scene, the world as it is now, and make assertions on it from what we can observe — sensor values, query responses, photos, descriptions. It's always an image of the scene, not the scene itself.

We have a future scene vaguely in mind. We assert that as best we can. The diff between now and then is the gradient. Tools are actions that close it. We make progress, re-evaluate, sometimes redefine the goal as we learn. Reality is complex; the loop is simple.

Two gradients flow through this: the agent's (closing scene-now to scene-then per step) and yours (closing the spec to reality, by tightening assertions when behavior surprises you). Both are gradient descent. Both happen in the same framework.


Install + run the examples

npm install scenegrad
# optional, for LLMSolver:
npm install @anthropic-ai/sdk
# optional, for tier-0 with Vercel AI SDK:
npm install ai @ai-sdk/anthropic zod

# tier 0 — drop-in trace, no goal
ANTHROPIC_API_KEY=... bun examples/trace-only-aisdk.ts

# tier 2 — observer mode with goal + status injection
ANTHROPIC_API_KEY=... bun examples/onboarding.ts                  # Anthropic SDK
ANTHROPIC_API_KEY=... bun examples/support-triage-aisdk.ts        # Vercel AI SDK

# tier 3 — solver mode (for benches)
bun examples/counter.ts                                            # no LLM
ANTHROPIC_API_KEY=... bun examples/inbox.ts                        # LLMSolver, TDD progression

Status

v0.0.1 — substrate types, defineEnv, observe (tiers 0/1/2), trace.start() (tier-0 alias), GreedySolver, LLMSolver, JSONL trace format, viewer scaffold.

Reference benches live in scenebench: ARC-trajectory ships first; AutomationBench (806 real tasks); S4Bench (SAP) and LeRobot (robotics) follow.

Related

  • scene-otel — wire format scenegrad emits trajectories in
  • scenecast — typed scene shapes + multi-size widgets that render trajectories visually
  • scenebench — benchmarks built on scenegrad
  • autocompile — observes accumulated trajectories, hardens patterns to code

License

MIT.