Keystone SDK for TypeScript / JavaScript
TypeScript client for the Keystone agent evaluation + sandboxed-execution platform. Shares a single pricing + prompt SSOT with the Python and Go SDKs — byte-identical cost estimates and prompt rendering across all three runtimes.
Install
```sh
npm install @polarityinc/polarity-keystone
```

Zero runtime dependencies — uses only the standard Node APIs (`fetch`, `AsyncLocalStorage`). Node ≥ 18.
60-second quick start: Eval()
The shortest path from "I have an agent" to "I have an evaluation":
```ts
import {
  Eval,
  Factuality,
  AnswerRelevancy,
} from '@polarityinc/polarity-keystone';

const result = await Eval('summarisation-quality', {
  data: [
    { input: 'Long article about whales...', expected: 'Whales are mammals.' },
    { input: 'Article about Java GC...', expected: 'Java GC reclaims memory.' },
  ],
  task: async (input) => myAgent(input), // your agent / prompt
  scores: [
    new Factuality({ model: 'paragon-fast' }),
    new AnswerRelevancy({}),
  ],
  maxConcurrency: 4,
});

console.log(result.summary); // p50/p95/mean per scorer
```

If `KEYSTONE_API_KEY` is set, the run is also recorded to your dashboard; otherwise it stays purely local. Same shape in Python and Go.
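The per-scorer summary boils down to percentiles and a mean over each scorer's results. As a standalone illustration (not the SDK's internal code), a nearest-rank percentile summary might look like:

```typescript
// Illustrative only: nearest-rank percentiles plus a mean, approximating
// the p50/p95/mean figures reported per scorer in result.summary.
function percentile(values: number[], p: number): number {
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p / 100) * sorted.length); // nearest-rank method
  return sorted[Math.max(0, rank - 1)];
}

function summarize(scores: number[]) {
  const mean = scores.reduce((sum, v) => sum + v, 0) / scores.length;
  return { p50: percentile(scores, 50), p95: percentile(scores, 95), mean };
}
```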
Sandbox-as-a-tool ergonomics
`create()` / `get()` / `list()` return a bound `SandboxHandle`, so an agent loop can call the sandbox without threading the ID around:
```ts
const sb = await ks.sandboxes.create({ spec_id: 'spec-123' });
await sb.exec('python script.py');
await sb.write('/tmp/input.json', JSON.stringify(payload));
const out = await sb.read('/tmp/output.json');
const diff = await sb.diff();
await sb.destroy();
```

Same pattern on `ExperimentHandle` and `AgentSnapshotHandle`:

```ts
const exp = await ks.experiments.create({ name: 'nightly', spec_id: 's' });
const results = await exp.runAndWait({
  scores: [new Factuality({}), new ExactMatch({ expectedKey: 'expected' })],
});
const cmp = await exp.compare(otherExp); // handle or string ID
const m = await exp.metrics();

const snap = await ks.agents.upload({ name: 'codex', /* ... */ });
await snap.delete();
```

The handles still implement the underlying `Sandbox` / `Experiment` / `AgentSnapshot` shapes, so reading `sb.id`, `exp.status`, or `snap.version` keeps working unchanged. The old service-level methods (`ks.sandboxes.runCommand(id, …)`, `ks.experiments.run(id)`) stay too — handle methods just delegate.
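The bound-handle pattern itself is simple enough to sketch. This is a hypothetical reconstruction, not the SDK's actual classes: the handle captures the resource ID once and forwards each method to the corresponding service call.

```typescript
// Hypothetical sketch of the bound-handle delegation described above.
interface SandboxService {
  runCommand(id: string, cmd: string): Promise<string>;
  destroy(id: string): Promise<void>;
}

function bindSandbox(svc: SandboxService, id: string) {
  return {
    id, // plain fields keep working, as on the real handles
    exec: (cmd: string) => svc.runCommand(id, cmd), // delegates; no new behaviour
    destroy: () => svc.destroy(id),
  };
}
```

Because handle methods only close over the ID, the service-level methods and the handle methods stay interchangeable.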
Auto-instrument every LLM client at once
```ts
import { autoInstrument } from '@polarityinc/polarity-keystone';

autoInstrument({
  openai, // import OpenAI from 'openai'
  anthropic, // import Anthropic from '@anthropic-ai/sdk'
  aiSdk: { generateText, streamText }, // Vercel AI SDK
  langchainCallbackManager: cm, // LangChain.js
  sandboxId: process.env.KEYSTONE_SANDBOX_ID,
});
```

Wraps OpenAI, Anthropic, Mistral, Google GenAI, LiteLLM, the Claude Agent SDK, DSPy, and LangChain in one call — every prompt, token count, and tool call shows up in your dashboard with no other code changes.
Manual tracing when you want it
```ts
import { traced, TracedSpan } from '@polarityinc/polarity-keystone';

// 1. As a function decorator (auto-spans every call)
const fetchUser = traced(async (id: string) => db.users.find(id), { name: 'fetchUser' });

// 2. As a one-shot wrapper
await traced('embed-doc', async () => await openai.embeddings.create({ ... }));

// 3. Class-based for finer control
const span = new TracedSpan({ name: 'planning' });
try { /* ... */ } finally { span.end(); }
```

Spans automatically nest using `AsyncLocalStorage` — no need to plumb a context object through your code.
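The `AsyncLocalStorage` trick that makes this work is standard Node. As a minimal sketch of the idea (not the SDK's internals), each span reads its parent from the ambient store, so callers never pass a context explicitly:

```typescript
import { AsyncLocalStorage } from 'node:async_hooks';

// Sketch of AsyncLocalStorage-based span nesting: a child span discovers
// its parent from the store that the enclosing run() established.
type Span = { name: string; parent?: string };
const als = new AsyncLocalStorage<Span>();

async function withSpan<T>(name: string, fn: () => Promise<T>): Promise<T> {
  const span: Span = { name, parent: als.getStore()?.name };
  return als.run(span, fn); // fn (and everything it awaits) sees this span
}

async function demo() {
  return withSpan('outer', async () =>
    withSpan('inner', async () => als.getStore()?.parent), // parent is 'outer'
  );
}
```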
Multi-provider gateways / proxies — recordLLMCall()
`ks.wrap(client)` patches a client object's `.create()` method. If your code is a gateway / proxy / custom routing layer that calls upstream LLMs through raw `fetch()` — switching across Anthropic, OpenAI, OpenRouter, Gemini, etc. per request — there's no client object to wrap. Use `ks.recordLLMCall(opts)` to emit the same `llm_call` event shape `wrap()` produces internally:
```ts
import { Keystone } from '@polarityinc/polarity-keystone';

const ks = new Keystone();

// Inside your gateway handler, after the upstream call settles:
const start = Date.now();
const upstream = await fetch(upstreamUrl, { method: 'POST', body: JSON.stringify(req) });
const json = await upstream.json();

ks.recordLLMCall({
  provider: 'openrouter', // free-form label
  model: json.model, // resolved upstream model
  requestedModel: req.model, // what the caller asked for
  inputTokens: json.usage.prompt_tokens,
  outputTokens: json.usage.completion_tokens,
  durationMs: Date.now() - start,
  inputMessages: req.messages, // truncated to ~4KB on the wire
  outputText: json.choices[0].message.content ?? '',
  toolCalls: json.choices[0].message.tool_calls?.map((tc) => ({
    name: tc.function.name,
    id: tc.id,
    arguments: tc.function.arguments,
  })),
  metadata: { 'gen_ai.proxy.fell_back': false }, // any custom OTel-style attrs
});
```

Fire-and-forget. Never throws. Same on-the-wire shape as `wrap()` events, so traces emitted from a gateway and from a wrapped SDK client land in the dashboard with identical schema. Sandbox routing follows the same rules as `wrap()` (explicit `sandboxId` → `KEYSTONE_SANDBOX_ID` env → agent mode).

If you also wrap a client locally on the caller side, you'll get one event per call from each side. Pick one, or distinguish them with `metadata.gen_ai.proxy.recorded_by` to dedup server-side.
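Server-side, that metadata label makes dedup a one-pass grouping. The sketch below is hypothetical — the `callId` field and the `'gateway'` label value are assumptions for illustration, not part of the event schema:

```typescript
// Hypothetical dedup: when both a gateway and a wrapped client record the
// same call, keep one event per call, preferring the one whose
// metadata['gen_ai.proxy.recorded_by'] marks it as the gateway's record.
type LLMCallEvent = {
  callId: string; // assumed correlation key for this sketch
  metadata?: Record<string, string | boolean>;
};

function dedupe(events: LLMCallEvent[]): LLMCallEvent[] {
  const byCall = new Map<string, LLMCallEvent>();
  for (const ev of events) {
    const prev = byCall.get(ev.callId);
    const isGateway = ev.metadata?.['gen_ai.proxy.recorded_by'] === 'gateway';
    if (!prev || isGateway) byCall.set(ev.callId, ev);
  }
  return [...byCall.values()];
}
```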
What's in the SDK
- 9 client services — `sandboxes`, `specs`, `experiments`, `alerts`, `agents`, `datasets`, `scoring`, `export`, `prompts`
- 3 bound handles — `SandboxHandle`, `ExperimentHandle`, `AgentSnapshotHandle` with delegated methods
- 29 built-in scorers (5 families):
  - Heuristic (6): `ExactMatch`, `Levenshtein`, `NumericDiff`, `JSONDiff`, `JSONValidity`, `SemanticListContains`
  - LLM-judge (9): `Factuality`, `Battle`, `ClosedQA`, `Humor`, `Moderation`, `Summarization`, `SQLJudge`, `Translation`, `Security`
  - RAG (8): `ContextPrecision`, `ContextRecall`, `ContextRelevancy`, `ContextEntityRecall`, `Faithfulness`, `AnswerRelevancy`, `AnswerSimilarity`, `AnswerCorrectness`
  - Embedding (1): `EmbeddingSimilarity`
  - Sandbox invariants (5): `FileExists`, `FileContains`, `CommandExits`, `SQLEquals`, `LLMJudge`
- `scorer(fn, opts?)` — wrap any `(scenario) → score` function as a custom scorer
- `Eval(name, { data, task, scores })` — Braintrust-style one-call eval primitive
- Tracing — `traced(fn, { name? })` decorator + `TracedSpan` class-based form + `AsyncLocalStorage` parent linking
- `wrapClient` + per-provider helpers (`wrapOpenAI`, `wrapAnthropic`, `wrapMistral`, `wrapGoogleGenAI`, `wrapClaudeAgentSDK`, `wrapAISDK`, `wrapMastraAgent`)
- `ks.recordLLMCall(opts)` — gateway/proxy entry point: emit `llm_call` events without a wrappable SDK client object
- `autoInstrument` — patches OpenAI, Anthropic, Mistral, Google GenAI, LiteLLM, Claude Agent SDK, DSPy, LangChain in one call
- Prompt management — `ks.prompts.create/get/list/delete`, `Prompt.render(vars)`, byte-identical renderer matching Python & Go
- Bulk export — `ks.export.{traces,spans,scenarios,scores}(filter, pageSize)` returning `AsyncIterable`s; `ks.export.experiment(id, { format })` for JSON or NDJSON
- OpenTelemetry bridge — `wrap()` emits `gen_ai.*` metadata on LLM spans; `registerOtelFlush(cb)` hook
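The custom-scorer hook accepts any `(scenario) → score` function. As a standalone sketch — the scenario shape here is an assumption for illustration — an exact-match style scorer is just a plain function:

```typescript
// Assumed scenario shape for illustration; scorer(fn) would wrap a
// function like this into a named scorer.
type Scenario = { output: string; expected: string };

// Returns 1 for an exact match (ignoring surrounding whitespace), else 0 —
// the contract of a simple heuristic scorer, written as a plain function.
const exactMatch = (s: Scenario): number =>
  s.output.trim() === s.expected.trim() ? 1 : 0;
```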
Versioning
Semver. Currently on 2.0.0-alpha while the Python/Go/TS parity surface stabilises.
License
MIT.
