Keystone SDK for TypeScript / JavaScript
TypeScript client for the Keystone agent evaluation + sandboxed-execution platform. Shares a single pricing + prompt SSOT with the Python and Go SDKs — byte-identical cost estimates and prompt rendering across all three runtimes.
Install
```sh
npm install @polarityinc/polarity-keystone
```

Zero runtime dependencies — uses only the standard Node APIs (`fetch`, `AsyncLocalStorage`). Node ≥ 18.
60-second quick start: Eval()
The shortest path from "I have an agent" to "I have an evaluation":
```ts
import {
  Eval,
  Factuality,
  AnswerRelevancy,
} from '@polarityinc/polarity-keystone';

const result = await Eval('summarisation-quality', {
  data: [
    { input: 'Long article about whales...', expected: 'Whales are mammals.' },
    { input: 'Article about Java GC...', expected: 'Java GC reclaims memory.' },
  ],
  task: async (input) => myAgent(input), // your agent / prompt
  scores: [
    new Factuality({ model: 'paragon-fast' }),
    new AnswerRelevancy({}),
  ],
  maxConcurrency: 4,
});

console.log(result.summary); // p50/p95/mean per scorer
```

If `KEYSTONE_API_KEY` is set, the run is also recorded to your dashboard; otherwise it stays purely local. Same shape in Python and Go.
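The per-scorer summary boils down to percentiles and a mean over each scorer's results. As a standalone illustration (not the SDK's internal code), a nearest-rank percentile summary might look like:

```typescript
// Illustrative only: nearest-rank percentiles plus a mean, approximating
// the p50/p95/mean figures reported per scorer in result.summary.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

function summarize(scores: number[]) {
  const mean = scores.reduce((sum, v) => sum + v, 0) / scores.length;
  return { p50: percentile(scores, 50), p95: percentile(scores, 95), mean };
}
```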
Sandbox-as-a-tool ergonomics
`create()` / `get()` / `list()` return a bound `SandboxHandle`, so an agent loop can call the sandbox without threading the ID around:
```ts
const sb = await ks.sandboxes.create({ spec_id: 'spec-123' });
await sb.exec('python script.py');
await sb.write('/tmp/input.json', JSON.stringify(payload));
const out = await sb.read('/tmp/output.json');
const diff = await sb.diff();
await sb.destroy();
```

Same pattern on `ExperimentHandle` and `AgentSnapshotHandle`:

```ts
const exp = await ks.experiments.create({ name: 'nightly', spec_id: 's' });
const results = await exp.runAndWait({
  scores: [new Factuality({}), new ExactMatch({ expectedKey: 'expected' })],
});
const cmp = await exp.compare(otherExp); // handle or string ID
const m = await exp.metrics();

const snap = await ks.agents.upload({ name: 'codex', /* ... */ });
await snap.delete();
```

The handles still implement the underlying `Sandbox` / `Experiment` / `AgentSnapshot` shapes, so reading `sb.id`, `exp.status`, or `snap.version` keeps working unchanged. The old service-level methods (`ks.sandboxes.runCommand(id, …)`, `ks.experiments.run(id)`) stay too — handle methods just delegate.
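The bound-handle pattern itself is simple enough to sketch. This is a hypothetical reconstruction, not the SDK's actual classes: the handle captures the resource ID once and forwards each method to the corresponding service call.

```typescript
// Hypothetical sketch of the bound-handle delegation described above.
interface SandboxService {
  runCommand(id: string, cmd: string): Promise<string>;
  destroy(id: string): Promise<void>;
}

function bindSandbox(svc: SandboxService, id: string) {
  return {
    id, // plain fields keep working, as on the real handles
    exec: (cmd: string) => svc.runCommand(id, cmd), // delegates; no new behaviour
    destroy: () => svc.destroy(id),
  };
}
```

Because handle methods only close over the ID, the service-level methods and the handle methods stay interchangeable.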
Auto-instrument every LLM client at once
```ts
import { autoInstrument } from '@polarityinc/polarity-keystone';

autoInstrument({
  openai, // import OpenAI from 'openai'
  anthropic, // import Anthropic from '@anthropic-ai/sdk'
  aiSdk: { generateText, streamText }, // Vercel AI SDK
  langchainCallbackManager: cm, // LangChain.js
  sandboxId: process.env.KEYSTONE_SANDBOX_ID,
});
```

Wraps OpenAI, Anthropic, Mistral, Google GenAI, LiteLLM, the Claude Agent SDK, DSPy, and LangChain in one call — every prompt, token count, and tool call shows up in your dashboard with no other code changes.
Manual tracing when you want it
```ts
import { traced, TracedSpan } from '@polarityinc/polarity-keystone';

// 1. As a function decorator (auto-spans every call)
const fetchUser = traced(async (id: string) => db.users.find(id), { name: 'fetchUser' });

// 2. As a one-shot wrapper
await traced('embed-doc', async () => await openai.embeddings.create({ ... }));

// 3. Class-based for finer control
const span = new TracedSpan({ name: 'planning' });
try { /* ... */ } finally { span.end(); }
```

Spans automatically nest using `AsyncLocalStorage` — no need to plumb a context object through your code.
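The `AsyncLocalStorage` trick that makes this work is standard Node. As a minimal sketch of the idea (not the SDK's internals), each span reads its parent from the ambient store, so callers never pass a context explicitly:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Sketch of AsyncLocalStorage-based span nesting: a child span discovers
// its parent from the store that the enclosing run() established.
type Span = { name: string; parent?: string };
const als = new AsyncLocalStorage<Span>();

async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span: Span = { name, parent: als.getStore()?.name };
  return als.run(span, fn); // fn (and everything it awaits) sees this span
}

async function demo() {
  return withSpan('outer', async () =>
    withSpan('inner', async () => als.getStore()?.parent), // parent is 'outer'
  );
}
```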
Multi-provider gateways / proxies — recordLLMCall()
`ks.wrap(client)` patches a client object's `.create()` method. If your code is a gateway / proxy / custom routing layer that calls upstream LLMs through raw `fetch()` — switching across Anthropic, OpenAI, OpenRouter, Gemini, etc. per request — there's no client object to wrap. Use `ks.recordLLMCall(opts)` to emit the same `llm_call` event shape `wrap()` produces internally:
```ts
import { Keystone } from '@polarityinc/polarity-keystone';

const ks = new Keystone();

// Inside your gateway handler, after the upstream call settles:
const start = Date.now();
const upstream = await fetch(upstreamUrl, { method: 'POST', body: JSON.stringify(req) });
const json = await upstream.json();

ks.recordLLMCall({
  provider: 'openrouter', // free-form label
  model: json.model, // resolved upstream model
  requestedModel: req.model, // what the caller asked for
  inputTokens: json.usage.prompt_tokens,
  outputTokens: json.usage.completion_tokens,
  durationMs: Date.now() - start,
  inputMessages: req.messages, // truncated to ~4KB on the wire
  outputText: json.choices[0].message.content ?? '',
  toolCalls: json.choices[0].message.tool_calls?.map((tc) => ({
    name: tc.function.name,
    id: tc.id,
    arguments: tc.function.arguments,
  })),
  metadata: { 'gen_ai.proxy.fell_back': false }, // any custom OTel-style attrs
});
```

Fire-and-forget. Never throws. Same on-the-wire shape as `wrap()` events, so traces emitted from a gateway and from a wrapped SDK client land in the dashboard with identical schema. Sandbox routing follows the same rules as `wrap()` (explicit `sandboxId` → `KEYSTONE_SANDBOX_ID` env → agent mode).

If you also wrap a client locally on the caller side, you'll get one event per call from each side. Pick one, or distinguish them with `metadata.gen_ai.proxy.recorded_by` to dedup server-side.
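Server-side, that metadata label makes dedup a one-pass grouping. The sketch below is hypothetical — the `callId` field and the `'gateway'` label value are assumptions for illustration, not part of the event schema:

```typescript
// Hypothetical dedup: when both a gateway and a wrapped client record the
// same call, keep one event per call, preferring the one whose
// metadata['gen_ai.proxy.recorded_by'] marks it as the gateway's record.
type LLMCallEvent = {
  callId: string; // assumed correlation key for this sketch
  metadata?: Record<string, string | boolean>;
};

function dedupe(events: LLMCallEvent[]): LLMCallEvent[] {
  const byCall = new Map<string, LLMCallEvent>();
  for (const ev of events) {
    const prev = byCall.get(ev.callId);
    const isGateway = ev.metadata?.['gen_ai.proxy.recorded_by'] === 'gateway';
    if (!prev || isGateway) byCall.set(ev.callId, ev);
  }
  return [...byCall.values()];
}
```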
What's in the SDK
- 9 client services — `sandboxes`, `specs`, `experiments`, `alerts`, `agents`, `datasets`, `scoring`, `export`, `prompts`
- 3 bound handles — `SandboxHandle`, `ExperimentHandle`, `AgentSnapshotHandle` with delegated methods
- 29 built-in scorers (5 families):
  - Heuristic (6): `ExactMatch`, `Levenshtein`, `NumericDiff`, `JSONDiff`, `JSONValidity`, `SemanticListContains`
  - LLM-judge (9): `Factuality`, `Battle`, `ClosedQA`, `Humor`, `Moderation`, `Summarization`, `SQLJudge`, `Translation`, `Security`
  - RAG (8): `ContextPrecision`, `ContextRecall`, `ContextRelevancy`, `ContextEntityRecall`, `Faithfulness`, `AnswerRelevancy`, `AnswerSimilarity`, `AnswerCorrectness`
  - Embedding (1): `EmbeddingSimilarity`
  - Sandbox invariants (5): `FileExists`, `FileContains`, `CommandExits`, `SQLEquals`, `LLMJudge`
- `scorer(fn, opts?)` — wrap any `(scenario) → score` function as a custom scorer
- `Eval(name, { data, task, scores })` — Braintrust-style one-call eval primitive
- Tracing — `traced(fn, { name? })` decorator + `TracedSpan` class-based form + `AsyncLocalStorage` parent linking
- `wrapClient` + per-provider helpers (`wrapOpenAI`, `wrapAnthropic`, `wrapMistral`, `wrapGoogleGenAI`, `wrapClaudeAgentSDK`, `wrapAISDK`, `wrapMastraAgent`)
- `ks.recordLLMCall(opts)` — gateway/proxy entry point: emit `llm_call` events without a wrappable SDK client object
- `autoInstrument` — patches OpenAI, Anthropic, Mistral, Google GenAI, LiteLLM, Claude Agent SDK, DSPy, LangChain in one call
- Prompt management — `ks.prompts.create/get/list/delete`, `Prompt.render(vars)`, byte-identical renderer matching Python & Go
- Bulk export — `ks.export.{traces,spans,scenarios,scores}(filter, pageSize)` returning `AsyncIterable`s; `ks.export.experiment(id, { format })` for JSON or NDJSON
- OpenTelemetry bridge — `wrap()` emits `gen_ai.*` metadata on LLM spans; `registerOtelFlush(cb)` hook
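The custom-scorer hook accepts any `(scenario) → score` function. As a standalone sketch — the scenario shape here is an assumption for illustration — an exact-match style scorer is just a plain function:

```typescript
// Assumed scenario shape for illustration; scorer(fn) would wrap a
// function like this into a named scorer.
type Scenario = { output: string; expected: string };

// Returns 1 for an exact match (ignoring surrounding whitespace), else 0 —
// the contract of a simple heuristic scorer, written as a plain function.
const exactMatch = (s: Scenario): number =>
  s.output.trim() === s.expected.trim() ? 1 : 0;
```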
Versioning
Semver. Currently on 2.0.0-alpha while the Python/Go/TS parity surface stabilises.
License
MIT.
