
@polarityinc/polarity-keystone

v0.2.5

Published

TypeScript/JavaScript SDK for the Keystone agent validation platform

Readme

Keystone SDK for TypeScript / JavaScript

TypeScript client for the Keystone agent evaluation + sandboxed-execution platform. Shares a single pricing + prompt SSOT with the Python and Go SDKs — byte-identical cost estimates and prompt rendering across all three runtimes.

Install

npm install @polarityinc/polarity-keystone

Zero runtime dependencies — uses only the standard Node APIs (fetch, AsyncLocalStorage). Node ≥ 18.

60-second quick start: Eval()

The shortest path from "I have an agent" to "I have an evaluation":

import {
  Eval,
  Factuality,
  AnswerRelevancy,
} from '@polarityinc/polarity-keystone';

const result = await Eval('summarisation-quality', {
  data: [
    { input: 'Long article about whales...', expected: 'Whales are mammals.' },
    { input: 'Article about Java GC...',     expected: 'Java GC reclaims memory.' },
  ],
  task: async (input) => myAgent(input),         // your agent / prompt
  scores: [
    new Factuality({ model: 'paragon-fast' }),
    new AnswerRelevancy({}),
  ],
  maxConcurrency: 4,
});

console.log(result.summary);                     // p50/p95/mean per scorer

If KEYSTONE_API_KEY is set, the run is also recorded to your dashboard; otherwise it stays purely local. Same shape in Python and Go.

Sandbox-as-a-tool ergonomics

create() / get() / list() return a bound SandboxHandle so an agent loop can call the sandbox without threading the ID:

const sb = await ks.sandboxes.create({ spec_id: 'spec-123' });

await sb.exec('python script.py');
await sb.write('/tmp/input.json', JSON.stringify(payload));
const out = await sb.read('/tmp/output.json');
const diff = await sb.diff();

await sb.destroy();

Same pattern on ExperimentHandle and AgentSnapshotHandle:

const exp = await ks.experiments.create({ name: 'nightly', spec_id: 's' });
const results = await exp.runAndWait({
  scores: [new Factuality({}), new ExactMatch({ expectedKey: 'expected' })],
});
const cmp = await exp.compare(otherExp);                  // handle or string ID
const m   = await exp.metrics();

const snap = await ks.agents.upload({ name: 'codex', /* ... */ });
await snap.delete();

The handles still implement the underlying Sandbox / Experiment / AgentSnapshot shape, so reading sb.id, exp.status, snap.version keeps working unchanged. The old service-level methods (ks.sandboxes.runCommand(id, …), ks.experiments.run(id)) stay too — handle methods just delegate.
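The delegation is mechanical. A minimal sketch (not the SDK source, and with a simplified SandboxService shape assumed for illustration) of how a bound handle forwards to the service-level method:

```typescript
// Illustration only: a bound handle carries its own id and forwards to the
// service-level method, so agent loops never thread the id themselves.
interface SandboxService {
  runCommand(id: string, cmd: string): Promise<string>;
}

class BoundSandboxHandle {
  constructor(
    readonly id: string,
    private readonly svc: SandboxService,
  ) {}

  exec(cmd: string): Promise<string> {
    return this.svc.runCommand(this.id, cmd); // delegate with the bound id
  }
}
```

Under this picture, sb.exec('python script.py') and ks.sandboxes.runCommand(sb.id, 'python script.py') are two spellings of the same call.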

Auto-instrument every LLM client at once

import { autoInstrument } from '@polarityinc/polarity-keystone';

autoInstrument({
  openai,                                   // import OpenAI from 'openai'
  anthropic,                                // import Anthropic from '@anthropic-ai/sdk'
  aiSdk: { generateText, streamText },      // Vercel AI SDK
  langchainCallbackManager: cm,             // LangChain.js
  sandboxId: process.env.KEYSTONE_SANDBOX_ID,
});

Wraps OpenAI, Anthropic, Mistral, Google GenAI, LiteLLM, Claude Agent SDK, DSPy, LangChain in one call — every prompt, token count, and tool call shows up in your dashboard with no other code changes.

Manual tracing when you want it

import { traced, TracedSpan } from '@polarityinc/polarity-keystone';

// 1. As a function decorator (auto-spans every call)
const fetchUser = traced(async (id: string) => db.users.find(id), { name: 'fetchUser' });

// 2. As a one-shot wrapper
await traced('embed-doc', async () => await openai.embeddings.create({ ... }));

// 3. Class-based for finer control
const span = new TracedSpan({ name: 'planning' });
try { /* ... */ } finally { span.end(); }

Spans automatically nest using AsyncLocalStorage — no need to plumb a context object through your code.
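The nesting can be pictured in a few lines of plain Node. This is an illustration of the AsyncLocalStorage mechanism, not the SDK's internal span model:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Each new span reads the ambient store to find its parent, so nesting
// happens without passing a context object through your call stack.
interface Span { name: string; parent?: string }

const store = new AsyncLocalStorage<Span>();
const spans: Span[] = [];

function withSpan<T>(name: string, fn: () => T): T {
  const span: Span = { name, parent: store.getStore()?.name };
  spans.push(span);
  return store.run(span, fn); // inside fn, `span` is the ambient parent
}

withSpan('outer', () => withSpan('inner', () => 0));
// 'inner' records 'outer' as its parent with no explicit plumbing
```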

Multi-provider gateways / proxies — recordLLMCall()

ks.wrap(client) patches a client object's .create() method. If your code is a gateway / proxy / custom routing layer that calls upstream LLMs through raw fetch() — switching across Anthropic, OpenAI, OpenRouter, Gemini, etc. per request — there's no client object to wrap. Use ks.recordLLMCall(opts) to emit the same llm_call event shape wrap() produces internally:

import { Keystone } from '@polarityinc/polarity-keystone';

const ks = new Keystone();

// Inside your gateway handler, after the upstream call settles:
const start = Date.now();
const upstream = await fetch(upstreamUrl, { method: 'POST', body: JSON.stringify(req) });
const json = await upstream.json();

ks.recordLLMCall({
  provider: 'openrouter',                          // free-form label
  model: json.model,                               // resolved upstream model
  requestedModel: req.model,                       // what the caller asked for
  inputTokens: json.usage.prompt_tokens,
  outputTokens: json.usage.completion_tokens,
  durationMs: Date.now() - start,
  inputMessages: req.messages,                     // truncated to ~4KB on the wire
  outputText: json.choices[0].message.content ?? '',
  toolCalls: json.choices[0].message.tool_calls?.map((tc) => ({
    name: tc.function.name,
    id: tc.id,
    arguments: tc.function.arguments,
  })),
  metadata: { 'gen_ai.proxy.fell_back': false },   // any custom OTel-style attrs
});

Fire-and-forget. Never throws. Same on-the-wire shape as wrap() events, so traces emitted from a gateway and from a wrapped SDK client land in the dashboard with identical schema. Sandbox routing follows the same rules as wrap() (explicit sandboxId → KEYSTONE_SANDBOX_ID env → agent mode).

If you also wrap a client locally on the caller side, you'll get one event per call from each side. Pick one, or distinguish them with metadata.gen_ai.proxy.recorded_by to dedup server-side.

What's in the SDK

  • 9 client services: sandboxes, specs, experiments, alerts, agents, datasets, scoring, export, prompts
  • 3 bound handles: SandboxHandle, ExperimentHandle, AgentSnapshotHandle with delegated methods
  • 29 built-in scorers (5 families):
    • Heuristic (6): ExactMatch, Levenshtein, NumericDiff, JSONDiff, JSONValidity, SemanticListContains
    • LLM-judge (9): Factuality, Battle, ClosedQA, Humor, Moderation, Summarization, SQLJudge, Translation, Security
    • RAG (8): ContextPrecision, ContextRecall, ContextRelevancy, ContextEntityRecall, Faithfulness, AnswerRelevancy, AnswerSimilarity, AnswerCorrectness
    • Embedding (1): EmbeddingSimilarity
    • Sandbox invariants (5): FileExists, FileContains, CommandExits, SQLEquals, LLMJudge
  • scorer(fn, opts?) — wrap any (scenario) → score function as a custom scorer
  • Eval(name, { data, task, scores }) — Braintrust-style one-call eval primitive
  • Tracing: traced(fn, { name? }) decorator + TracedSpan class-based form + AsyncLocalStorage parent linking
  • wrapClient + per-provider helpers (wrapOpenAI, wrapAnthropic, wrapMistral, wrapGoogleGenAI, wrapClaudeAgentSDK, wrapAISDK, wrapMastraAgent)
  • ks.recordLLMCall(opts) — gateway/proxy entry point: emit llm_call events without a wrappable SDK client object
  • autoInstrument — patches OpenAI, Anthropic, Mistral, Google GenAI, LiteLLM, Claude Agent SDK, DSPy, LangChain in one call
  • Prompt management: ks.prompts.create/get/list/delete, Prompt.render(vars), byte-identical renderer matching Python & Go
  • Bulk export: ks.export.{traces,spans,scenarios,scores}(filter, pageSize) returning AsyncIterables; ks.export.experiment(id, { format }) for JSON or NDJSON
  • OpenTelemetry bridge: wrap() emits gen_ai.* metadata on LLM spans; registerOtelFlush(cb) hook
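As one concrete example of the scorer(fn, opts?) escape hatch: any plain function with the (scenario) → score shape can serve. The token-overlap F1 below is an illustrative heuristic, and its { output, expected } scenario shape is an assumption for this sketch, not the SDK's exact type:

```typescript
// A hypothetical (scenario) => score function of the kind scorer(fn) wraps.
interface Scenario { output: string; expected: string }

// Token-overlap F1: a simple heuristic score in [0, 1].
function tokenF1({ output, expected }: Scenario): number {
  const out = output.toLowerCase().split(/\s+/).filter(Boolean);
  const exp = expected.toLowerCase().split(/\s+/).filter(Boolean);
  if (out.length === 0 || exp.length === 0) return 0;

  // Count expected tokens, then consume them as output tokens match.
  const counts = new Map<string, number>();
  for (const t of exp) counts.set(t, (counts.get(t) ?? 0) + 1);
  let overlap = 0;
  for (const t of out) {
    const c = counts.get(t) ?? 0;
    if (c > 0) { overlap++; counts.set(t, c - 1); }
  }

  const precision = overlap / out.length;
  const recall = overlap / exp.length;
  return precision + recall === 0 ? 0 : (2 * precision * recall) / (precision + recall);
}
```

With the package installed, a function like this could presumably be registered via scorer(tokenF1, { name: 'token-f1' }) and passed in scores: [...] alongside the built-ins.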

Versioning

Semver. Currently on 2.0.0-alpha while the Python/Go/TS parity surface stabilises.

License

MIT.