@runtime-judgement/rj-langsmith
v0.1.0
LangSmith → Runtime Judgement bridge — pulls runs from a LangSmith project, runs a snapshot suite, writes feedback back as LangSmith feedback. Zero customer code change.
@runtime-judgement/rj-langsmith
The LangSmith ↔ Runtime Judgement bridge: point it at a LangSmith
project, and RJ pulls failed runs, runs a snapshot suite, and writes the
verdict back to LangSmith as feedback. Zero customer code change.
Sister package: @runtime-judgement/rj-langfuse — same shape for Langfuse users.
If you're choosing between them, pick the package that matches the tracer your team already runs. The RJ side of each bridge is identical; only the source-tracer SDK and auth differ.
5-minute setup
1. Install
npm install @runtime-judgement/rj-langsmith
# or
pnpm add @runtime-judgement/rj-langsmith
2. Set the environment
# LangSmith — get LANGSMITH_API_KEY from https://smith.langchain.com/settings
# LANGSMITH_API_URL is optional; defaults to https://api.smith.langchain.com.
# Self-hosted LangSmith customers point it at their internal URL.
export LANGSMITH_API_KEY="lsv2_pt_..."
# export LANGSMITH_API_URL="https://api.smith.langchain.com" # optional
# Runtime Judgement — get these from https://runtime-judgement-app.vercel.app/app/settings
export RJ_API_URL="https://runtime-judgement-app.vercel.app"
export RJ_API_KEY="rj_..."
# The snapshot suite to run on each cycle
export RJ_SUITE_ID="01HZ..."
3. Run a cycle
npx rj-langsmith run --project my-agent --since 24h
You'll see a single line on stdout:
rj-langsmith: ingested=12 passed=10 changed=2 feedback=12 elapsed=4318ms
Behind the scenes the bridge:
- Queried LangSmith for every root run in the named project since 24h ago.
- Hydrated each trace's full run graph (root + descendants) and
  filtered to traces that contain at least one run with a non-empty
  error or status === "error".
- Normalised each trace from LangSmith's run model to the OTEL
  gen_ai.* semconv that RJ speaks.
- POSTed each normalised trace to /api/traces on your RJ instance. RJ
  deduplicates by SHA-256(body) per user, so re-running this is safe.
- Triggered the named snapshot suite against your snapshots.
- Wrote one LangSmith feedback row per ingested trace
  (key=runtime_judgement.<suite-id>) attached to the trace's root run
  with score=1.0 so the verdict shows up in LangSmith dashboards and
  eval views.
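The failed-trace filter in the second step can be sketched roughly as follows. This is an illustration, not the bridge's actual code: the RunRow shape is a hypothetical, trimmed-down stand-in for a LangSmith run, and the names isFailedRun and failedTraces are invented for this example.

```typescript
// Hypothetical, simplified run shape: only the fields this sketch reads.
interface RunRow {
  id: string
  trace_id: string
  parent_run_id: string | null
  error?: string | null
  status?: string
}

// A run counts as failed if it has a non-empty error or status === "error".
function isFailedRun(run: RunRow): boolean {
  const hasError = typeof run.error === "string" && run.error.trim().length > 0
  return hasError || run.status === "error"
}

// Group a flat list of runs by trace_id, then keep only the traces
// that contain at least one failed run (root or descendant).
function failedTraces(runs: RunRow[]): Record<string, RunRow[]> {
  const byTrace: Record<string, RunRow[]> = {}
  for (const run of runs) {
    const bucket = byTrace[run.trace_id] ?? []
    bucket.push(run)
    byTrace[run.trace_id] = bucket
  }
  const failed: Record<string, RunRow[]> = {}
  for (const traceId of Object.keys(byTrace)) {
    const traceRuns = byTrace[traceId] ?? []
    if (traceRuns.some(isFailedRun)) failed[traceId] = traceRuns
  }
  return failed
}
```

Note that a trace is kept whole: one failing descendant is enough to retain the root and all of its siblings, which matches the "root + descendants" wording above.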
SDK usage
The CLI is a thin wrapper around the LangSmithBridge class. Drop the
SDK into your own job system (Inngest, Vercel Cron, Trigger.dev, plain
cron):
import { LangSmithBridge } from "@runtime-judgement/rj-langsmith"
const bridge = new LangSmithBridge({
  langsmithApiKey: process.env.LANGSMITH_API_KEY!,
  // Optional; defaults to https://api.smith.langchain.com
  langsmithApiUrl: process.env.LANGSMITH_API_URL,
  rjApiUrl: process.env.RJ_API_URL!,
  rjApiKey: process.env.RJ_API_KEY!,
  rjSuiteId: process.env.RJ_SUITE_ID!,
})
// One-shot
const summary = await bridge.cycle({
  project: "my-agent",
  since: "24h",
  limit: 100,
})
// → { ingested: 12, verdicts: { passed: 10, changed: 2 }, feedback: 12 }
// Or decomposed
const ingested = await bridge.pullAndIngest({
  project: "my-agent",
  since: "24h",
})
const run = await bridge.runSnapshotSuite()
const feedback = await bridge.writeBackFeedback(run.suiteRunId)
What's mapped (v0.1, the 80% case)
LangSmith and Langfuse expose observability in different shapes. LangSmith
has a flat list of run rows linked by parent_run_id, where each run is
typed by run_type (llm | tool | chain | retriever | embedding
| prompt | parser). The normalizer translates each run into one OTEL
gen-ai span.
| LangSmith | OTEL gen-ai |
| ------------------------------------------ | -------------------------------------------------------- |
| run.id | span_id |
| run.parent_run_id | parent_span_id |
| run.trace_id (root run's id if missing) | trace_id |
| run.name | span_name |
| run.start_time / run.end_time | timestamp + duration (ns) |
| run.error non-empty / status='error' | status_code = "Error" + status_message |
| run.events | events (timestamp + attributes normalised) |
| run_type='llm' extra.invocation_params | gen_ai.request.{model,temperature,max_tokens,top_p} |
| run_type='llm' {prompt,completion,total}_tokens | gen_ai.usage.{input,output,total}_tokens |
| run_type='llm' inputs.messages / outputs.generations | gen_ai.prompt / gen_ai.completion (JSON-stringified) |
| run_type='tool' name/inputs/outputs | tool.name / tool.parameters / tool.output |
| run_type='retriever' outputs.documents | retrieval.documents |
| run_type='chain' and unknown types | input.value / output.value (passthrough) |
| run.tags / session_name | langsmith.run.tags / langsmith.run.session |
| run.extra.metadata | langsmith.metadata |
| root manifest_id | resource_attributes.service.version |
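To make the first few table rows concrete, here is a minimal sketch of the core per-run mapping. It is not the package's langSmithRunToOtelSpan: the type names, the runToSpanSketch function, the duration_ns field name, and the "Ok" status for non-failed runs are all assumptions made for illustration. Only the id, parent, trace, name, timing, and error rows are covered.

```typescript
// Hypothetical input shape: the subset of LangSmith run fields used here.
interface LangSmithRunLike {
  id: string
  parent_run_id?: string | null
  trace_id?: string | null
  name: string
  start_time: string // ISO 8601
  end_time?: string | null
  error?: string | null
  status?: string
}

// Hypothetical output shape; field names follow the table where it gives them.
interface SpanSketch {
  span_id: string
  parent_span_id: string | null
  trace_id: string
  span_name: string
  timestamp: string
  duration_ns: number
  status_code: "Ok" | "Error" // "Ok" is an assumption; the table only names "Error"
  status_message?: string
}

function runToSpanSketch(run: LangSmithRunLike, rootRunId: string): SpanSketch {
  const startMs = Date.parse(run.start_time)
  // A run without end_time is treated as zero-length in this sketch.
  const endMs = run.end_time ? Date.parse(run.end_time) : startMs
  const hasError = typeof run.error === "string" && run.error.length > 0
  const failed = hasError || run.status === "error"
  const span: SpanSketch = {
    span_id: run.id,
    parent_span_id: run.parent_run_id ?? null,
    // Fall back to the root run's id when trace_id is missing.
    trace_id: run.trace_id ?? rootRunId,
    span_name: run.name,
    timestamp: run.start_time,
    // Milliseconds to nanoseconds.
    duration_ns: Math.max(0, endMs - startMs) * 1e6,
    status_code: failed ? "Error" : "Ok",
  }
  if (failed) span.status_message = run.error || "error"
  return span
}
```

Keeping the mapping a pure function of one run (plus the root id for the trace_id fallback) is what makes it easy to unit-test against captured runs.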
What's not mapped yet (v0.1 gaps)
- Tool calls inside an llm run's outputs (OpenAI tool_calls array) are
  stringified verbatim into gen_ai.completion. Extracting them into
  separate tool_call spans belongs in the RJ parser, not the normalizer.
- inputs_s3_url / outputs_s3_url (used by LangSmith Hub for large
  payloads) are surfaced as attributes but not auto-hydrated. Callers
  with large traces should pre-fetch the S3 payloads before passing the
  run into the normalizer.
- LangSmith feedback already present on the source run is not
  round-tripped into the OTEL output. Feedback is handled separately via
  writeBackFeedback.
If your workload hits a gap, the normalizer's langSmithRunToOtelSpan
is a pure function — open an issue with a sample run and the gap can
land in a v0.2 patch.
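Until the first gap is closed, a caller who needs the tool calls can recover them from the stringified completion on their own side. The sketch below is one way to do that, assuming the completion is valid JSON and that tool calls follow the OpenAI chat format (a tool_calls array whose entries carry function.name and function.arguments). ToolCallSketch, collectToolCalls, and toolCallsFromCompletion are hypothetical names, not part of this package or of RJ.

```typescript
// Hypothetical result shape for one extracted tool call.
interface ToolCallSketch {
  id: string | null
  name: string
  arguments: string
}

// Walk any parsed JSON value and collect entries of every "tool_calls" array.
function collectToolCalls(value: unknown, found: ToolCallSketch[] = []): ToolCallSketch[] {
  if (Array.isArray(value)) {
    for (const item of value) collectToolCalls(item, found)
    return found
  }
  if (value === null || typeof value !== "object") return found
  const obj = value as Record<string, unknown>
  for (const key of Object.keys(obj)) {
    const child = obj[key]
    if (key === "tool_calls" && Array.isArray(child)) {
      for (const call of child) {
        const c = call as { id?: unknown; function?: { name?: unknown; arguments?: unknown } } | null
        const fn = c?.function
        if (c && fn && typeof fn.name === "string") {
          found.push({
            id: typeof c.id === "string" ? c.id : null,
            name: fn.name,
            // OpenAI sends arguments as a JSON string; re-stringify anything else.
            arguments: typeof fn.arguments === "string" ? fn.arguments : JSON.stringify(fn.arguments ?? {}),
          })
        }
      }
    } else {
      collectToolCalls(child, found)
    }
  }
  return found
}

// Returns an empty list when the completion is not JSON (e.g. plain text).
function toolCallsFromCompletion(completion: string): ToolCallSketch[] {
  try {
    return collectToolCalls(JSON.parse(completion))
  } catch {
    return []
  }
}
```

The recursive walk is deliberately shape-agnostic, since the exact nesting of outputs.generations varies between model wrappers.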
CLI reference
rj-langsmith run --project <name> --since <duration|ISO> [--suite <id>] [--limit <n>]
| Flag | Env var | Default | Notes |
| ----------- | ------------------- | ------- | -------------------------------------------------------- |
| --project | — | — | LangSmith project/session name (required) |
| --suite | RJ_SUITE_ID | — | Snapshot suite ULID |
| --since | — | 24h | 24h, 7d, 30m or ISO 8601 timestamp |
| --limit | — | 100 | Max LangSmith traces per cycle (capped at 1000) |
Exit codes:
- 0: cycle completed (verdicts in stdout)
- 1: config / argument error
- 2: runtime error (HTTP failure, malformed response)
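For anyone reimplementing the flag handling around the SDK, the --since and --limit semantics from the table can be sketched as below. This is an assumption-laden illustration, not the CLI's source: parseSince and clampLimit are hypothetical names, and only the m, h, and d units shown in the table are accepted.

```typescript
// Parse "30m", "24h", "7d", or an ISO 8601 timestamp into an absolute Date.
function parseSince(since: string, now: Date = new Date()): Date {
  const match = /^(\d+)([mhd])$/.exec(since.trim())
  if (match) {
    const amount = Number(match[1])
    const unit = match[2]
    const unitMs = unit === "m" ? 60000 : unit === "h" ? 3600000 : 86400000
    return new Date(now.getTime() - amount * unitMs)
  }
  const parsed = Date.parse(since)
  // An unparseable value is an argument error (exit code 1 in the CLI's scheme).
  if (isNaN(parsed)) throw new Error(`invalid --since value: ${since}`)
  return new Date(parsed)
}

// Apply the documented default (100) and cap (1000) to --limit.
function clampLimit(limit?: number): number {
  if (limit === undefined) return 100
  if (!isFinite(limit) || limit < 1) throw new Error(`invalid --limit value: ${limit}`)
  return Math.min(Math.floor(limit), 1000)
}
```

Passing now explicitly keeps the duration arithmetic deterministic in tests; in a real job it would default to the current time.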
Why direct HTTP, not the langsmith SDK?
The langsmith npm package exists and works, but it pulls a heavy
LangChain peer-graph and has a broader surface than this bridge needs
(eval harness, LangSmith Hub, async client). We hit LangSmith's REST
endpoints directly (/runs/query, /runs/<id>, /feedback) so the
bridge stays single-purpose, zero-peer-dep, and easy to embed in any
cron job. If a future feature needs an SDK-only API (e.g. typed
LangSmith Hub access), we'll add it as a peer dep at that point.
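As a rough picture of what "direct HTTP" means here, the sketch below builds (but does not send) a request for the /runs/query endpoint. Everything beyond the endpoint path and the default base URL is an assumption to verify against the LangSmith API reference: the x-api-key header, the body field names (session, is_root, start_time, limit), and the need to resolve the project name to a session id beforehand. buildRunsQueryRequest and HttpRequestSketch are hypothetical names.

```typescript
// Plain description of an HTTP request, so it can be passed to any client.
interface HttpRequestSketch {
  url: string
  method: "POST"
  headers: Record<string, string>
  body: string
}

function buildRunsQueryRequest(opts: {
  apiKey: string
  apiUrl?: string
  sessionId: string // assumed: the project resolved to its session id
  since: Date
  limit: number
}): HttpRequestSketch {
  // Default base URL as documented above; trailing slashes are trimmed.
  const base = (opts.apiUrl ?? "https://api.smith.langchain.com").replace(/\/+$/, "")
  return {
    url: `${base}/runs/query`,
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": opts.apiKey, // assumed header name
    },
    // Assumed body fields: root runs only, newer than `since`.
    body: JSON.stringify({
      session: [opts.sessionId],
      is_root: true,
      start_time: opts.since.toISOString(),
      limit: opts.limit,
    }),
  }
}
```

The result can be handed to the built-in fetch as fetch(req.url, { method: req.method, headers: req.headers, body: req.body }), which is all the client machinery a single-purpose bridge needs.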
Building from source
pnpm install --filter @runtime-judgement/rj-langsmith
pnpm --filter @runtime-judgement/rj-langsmith build
pnpm --filter @runtime-judgement/rj-langsmith test
The output lands in dist/ with .d.ts declarations.
