@runtime-judgement/rj-langsmith

v0.1.0

Published

19 days ago

LangSmith → Runtime Judgement bridge — pulls runs from a LangSmith project, runs a snapshot suite, writes feedback back as LangSmith feedback. Zero customer code change.

rossamac01

langsmith langchain runtime-judgement observability agents llm attribution bridge

@runtime-judgement/rj-langsmith

The LangSmith ↔ Runtime Judgement bridge: point it at a LangSmith project, and RJ pulls failed runs, runs a snapshot suite, and writes the verdict back to LangSmith as feedback. Zero customer code change.

Sister package: @runtime-judgement/rj-langfuse — same shape for Langfuse users.

If you're choosing between them, pick the package that matches the tracer your team already runs. The RJ side of each bridge is identical; only the source-tracer SDK and auth differ.

5-minute setup

1. Install

npm install @runtime-judgement/rj-langsmith
# or
pnpm add @runtime-judgement/rj-langsmith

2. Set the environment

# LangSmith — get LANGSMITH_API_KEY from https://smith.langchain.com/settings
# LANGSMITH_API_URL is optional; defaults to https://api.smith.langchain.com.
# Self-hosted LangSmith customers point it at their internal URL.
export LANGSMITH_API_KEY="lsv2_pt_..."
# export LANGSMITH_API_URL="https://api.smith.langchain.com"   # optional

# Runtime Judgement — get these from https://runtime-judgement-app.vercel.app/app/settings
export RJ_API_URL="https://runtime-judgement-app.vercel.app"
export RJ_API_KEY="rj_..."

# The snapshot suite to run on each cycle
export RJ_SUITE_ID="01HZ..."

3. Run a cycle

npx rj-langsmith run --project my-agent --since 24h

You'll see a single line on stdout:

rj-langsmith: ingested=12 passed=10 changed=2 feedback=12 elapsed=4318ms
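
If a cron wrapper needs to act on that line (for example, to alert when `changed` is non-zero), it can be parsed as space-separated `key=value` pairs. The sketch below assumes every cycle prints the same keys as the example above. The README does not document the format beyond that one example, and `parseSummaryLine` and `CycleLine` are hypothetical names, not part of the package.

```typescript
// Hypothetical result shape; field names mirror the keys in the example line.
interface CycleLine {
  ingested: number;
  passed: number;
  changed: number;
  feedback: number;
  elapsedMs: number;
}

// Parse a line such as
// "rj-langsmith: ingested=12 passed=10 changed=2 feedback=12 elapsed=4318ms".
// Returns null when the prefix or any expected key is missing.
function parseSummaryLine(line: string): CycleLine | null {
  const trimmed = line.trim();
  if (trimmed.indexOf("rj-langsmith:") !== 0) return null;

  // Collect every key=<digits> pair; a trailing unit such as "ms" is ignored.
  const fields: Record<string, number> = {};
  const pattern = /(\w+)=(\d+)/g;
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(trimmed)) !== null) {
    fields[match[1]] = parseInt(match[2], 10);
  }

  const required = ["ingested", "passed", "changed", "feedback", "elapsed"];
  for (let i = 0; i < required.length; i++) {
    if (!(required[i] in fields)) return null;
  }

  return {
    ingested: fields["ingested"],
    passed: fields["passed"],
    changed: fields["changed"],
    feedback: fields["feedback"],
    elapsedMs: fields["elapsed"],
  };
}
```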

Behind the scenes, the bridge:

1. Queried LangSmith for every root run in the named project since 24h ago.
2. Hydrated each trace's full run graph (root + descendants) and filtered to traces that contain at least one run with a non-empty error or status === "error".
3. Normalised each trace from LangSmith's run model to the OTEL gen_ai.* semconv that RJ speaks.
4. POSTed each normalised trace to /api/traces on your RJ instance. RJ deduplicates by SHA-256(body) per user, so re-running this is safe.
5. Triggered the named snapshot suite against your snapshots.
6. Wrote one LangSmith feedback row per ingested trace (key=runtime_judgement.<suite-id>) attached to the trace's root run with score=1.0 so the verdict shows up in LangSmith dashboards and eval views.
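
The filtering rule in the second step can be sketched as a small pure function. This is an illustration under assumptions, not the bridge's actual code: `RunLike` is a hypothetical, trimmed-down run shape, and `isFailedRun` and `failedTraces` are names invented for this example.

```typescript
// Hypothetical, trimmed-down run shape; real LangSmith runs carry many more fields.
interface RunLike {
  id: string;
  trace_id: string;
  parent_run_id: string | null;
  status?: string;
  error?: string | null;
}

// A run counts as failed when it has a non-empty error or status === "error".
function isFailedRun(run: RunLike): boolean {
  const hasError = typeof run.error === "string" && run.error.length > 0;
  return hasError || run.status === "error";
}

// Group a flat list of runs by trace_id, then keep only the traces
// that contain at least one failed run (root or descendant).
function failedTraces(runs: RunLike[]): Record<string, RunLike[]> {
  const byTrace: Record<string, RunLike[]> = {};
  for (let i = 0; i < runs.length; i++) {
    const run = runs[i];
    if (!byTrace[run.trace_id]) byTrace[run.trace_id] = [];
    byTrace[run.trace_id].push(run);
  }

  const kept: Record<string, RunLike[]> = {};
  const traceIds = Object.keys(byTrace);
  for (let i = 0; i < traceIds.length; i++) {
    const group = byTrace[traceIds[i]];
    if (group.some(isFailedRun)) kept[traceIds[i]] = group;
  }
  return kept;
}
```

Note that the whole trace is kept, including its successful runs, so a downstream consumer still sees the full run graph around the failure.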

SDK usage

The CLI is a thin wrapper around the LangSmithBridge class. Drop the SDK into your own job system (Inngest, Vercel Cron, Trigger.dev, plain cron):

import { LangSmithBridge } from "@runtime-judgement/rj-langsmith"

const bridge = new LangSmithBridge({
  langsmithApiKey: process.env.LANGSMITH_API_KEY!,
  // Optional — defaults to https://api.smith.langchain.com
  langsmithApiUrl: process.env.LANGSMITH_API_URL,
  rjApiUrl: process.env.RJ_API_URL!,
  rjApiKey: process.env.RJ_API_KEY!,
  rjSuiteId: process.env.RJ_SUITE_ID!,
})

// One-shot
const summary = await bridge.cycle({
  project: "my-agent",
  since: "24h",
  limit: 100,
})
// → { ingested: 12, verdicts: { passed: 10, changed: 2 }, feedback: 12 }

// Or decomposed
const ingested = await bridge.pullAndIngest({
  project: "my-agent",
  since: "24h",
})
const run = await bridge.runSnapshotSuite()
const feedback = await bridge.writeBackFeedback(run.suiteRunId)

What's mapped (v0.1 — the 80% case)

LangSmith and Langfuse expose observability in different shapes. LangSmith has a flat list of run rows linked by parent_run_id, where each run is typed by run_type (llm | tool | chain | retriever | embedding | prompt | parser). The normalizer translates each run into one OTEL gen-ai span.

| LangSmith | OTEL gen-ai |
| --- | --- |
| run.id | span_id |
| run.parent_run_id | parent_span_id |
| run.trace_id (root run's id if missing) | trace_id |
| run.name | span_name |
| run.start_time / run.end_time | timestamp + duration (ns) |
| run.error non-empty / status='error' | status_code = "Error" + status_message |
| run.events | events (timestamp + attributes normalised) |
| run_type='llm' extra.invocation_params | gen_ai.request.{model,temperature,max_tokens,top_p} |
| run_type='llm' {prompt,completion,total}_tokens | gen_ai.usage.{input,output,total}_tokens |
| run_type='llm' inputs.messages / outputs.generations | gen_ai.prompt / gen_ai.completion (JSON-stringified) |
| run_type='tool' name/inputs/outputs | tool.name / tool.parameters / tool.output |
| run_type='retriever' outputs.documents | retrieval.documents |
| run_type='chain' and unknown types | input.value / output.value (passthrough) |
| run.tags / session_name | langsmith.run.tags / langsmith.run.session |
| run.extra.metadata | langsmith.metadata |
| root manifest_id | resource_attributes.service.version |
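
To make the first few rows of the table concrete, here is a minimal sketch of the identity, timing, and status mapping. It is not the package's normalizer: `sketchRunToSpan`, `LangSmithRunLike`, and `SpanSketch` are hypothetical names, the output field `duration_ns` is an assumed spelling of "duration (ns)", and the type-specific rows (llm, tool, retriever) are left out.

```typescript
// Hypothetical input shape covering only the fields used below.
interface LangSmithRunLike {
  id: string;
  name: string;
  start_time: string; // ISO 8601
  end_time?: string | null;
  parent_run_id?: string | null;
  trace_id?: string | null;
  error?: string | null;
  status?: string;
}

// Hypothetical output shape; field names follow the right-hand table column.
interface SpanSketch {
  span_id: string;
  parent_span_id: string | null;
  trace_id: string;
  span_name: string;
  timestamp: string;
  duration_ns: number | null;
  status_code?: string;
  status_message?: string;
}

function sketchRunToSpan(run: LangSmithRunLike, rootRunId: string): SpanSketch {
  const startMs = Date.parse(run.start_time);
  const endMs = run.end_time ? Date.parse(run.end_time) : NaN;

  const span: SpanSketch = {
    span_id: run.id,
    parent_span_id: run.parent_run_id || null,
    // Fall back to the root run's id when the run has no trace_id.
    trace_id: run.trace_id || rootRunId,
    span_name: run.name,
    timestamp: run.start_time,
    // Milliseconds to nanoseconds; null when either timestamp is unusable.
    duration_ns: isNaN(startMs) || isNaN(endMs) ? null : (endMs - startMs) * 1e6,
  };

  const hasError = typeof run.error === "string" && run.error.length > 0;
  if (hasError || run.status === "error") {
    span.status_code = "Error";
    span.status_message = run.error || "";
  }
  return span;
}
```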

What's not mapped yet (v0.1 gaps)

Tool calls inside an llm run's outputs (OpenAI tool_calls array) are stringified verbatim into gen_ai.completion. Extracting them into separate tool_call spans belongs in the RJ parser, not the normalizer.
inputs_s3_url / outputs_s3_url (used by LangSmith Hub for large payloads) are surfaced as attributes but not auto-hydrated. Callers with large traces should pre-fetch the S3 payloads before passing the run into the normalizer.
LangSmith feedback already present on the source run is not round-tripped into the OTEL output. Feedback is handled separately via writeBackFeedback.

If your workload hits a gap, the normalizer's langSmithRunToOtelSpan is a pure function — open an issue with a sample run and the gap can land in a v0.2 patch.

CLI reference

rj-langsmith run --project <name> [--since <duration|ISO>] [--suite <id>] [--limit <n>]

| Flag | Env var | Default | Notes |
| --- | --- | --- | --- |
| --project | none | none | LangSmith project/session name (required) |
| --suite | RJ_SUITE_ID | none | Snapshot suite ULID |
| --since | none | 24h | 24h, 7d, 30m or ISO 8601 timestamp |
| --limit | none | 100 | Max LangSmith traces per cycle (capped at 1000) |
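
A rough sketch of how the `--since` and `--limit` values could be interpreted, based only on the table above. `parseSince` and `clampLimit` are hypothetical helpers, not exports of the package. Two behaviours are assumptions: that only the m, h, and d units are accepted, and that an over-large limit is silently clamped to 1000 rather than rejected.

```typescript
// Milliseconds per supported unit suffix (assumed set: m, h, d).
const UNIT_MS: Record<string, number> = {
  m: 60 * 1000,
  h: 60 * 60 * 1000,
  d: 24 * 60 * 60 * 1000,
};

// Turn "24h", "7d", "30m" or an ISO 8601 timestamp into an absolute Date.
// `now` is injectable so the function stays deterministic in tests.
function parseSince(value: string, now: Date = new Date()): Date {
  const trimmed = value.trim();
  const relative = /^(\d+)([mhd])$/.exec(trimmed);
  if (relative) {
    const amount = parseInt(relative[1], 10);
    return new Date(now.getTime() - amount * UNIT_MS[relative[2]]);
  }
  const absolute = Date.parse(trimmed);
  if (isNaN(absolute)) {
    throw new Error("Invalid --since value: " + value);
  }
  return new Date(absolute);
}

// Apply the documented default (100) and cap (1000).
function clampLimit(limit?: number): number {
  if (limit === undefined || isNaN(limit)) return 100;
  return Math.min(Math.max(1, Math.floor(limit)), 1000);
}
```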

Exit codes:

0 — cycle completed (verdicts in stdout)
1 — config / argument error
2 — runtime error (HTTP failure, malformed response)

Why direct HTTP, not the `langsmith` SDK?

The langsmith npm package exists and works, but it pulls a heavy LangChain peer-graph and has a broader surface than this bridge needs (eval harness, LangSmith Hub, async client). We hit LangSmith's REST endpoints directly (/runs/query, /runs/<id>, /feedback) so the bridge stays single-purpose, zero-peer-dep, and easy to embed in any cron job. If a future feature needs an SDK-only API (e.g. typed LangSmith Hub access), we'll add it as a peer dep at that point.
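
As an illustration of what "direct HTTP" means here, the sketch below builds (but does not send) a request for the /runs/query endpoint named above. It is not the bridge's implementation. The `x-api-key` header and the body fields `session`, `is_root`, `start_time`, and `limit` are assumptions to verify against the LangSmith API reference, and `buildRunsQueryRequest` and its option names are hypothetical. It also assumes the project has already been resolved to a session ID, which this README does not describe.

```typescript
// Plain description of an HTTP request, so the sketch needs no fetch typings.
interface HttpRequestSketch {
  url: string;
  method: "POST";
  headers: Record<string, string>;
  body: string;
}

// Hypothetical options; none of these names come from the package.
interface RunsQueryOptions {
  apiKey: string;
  apiUrl?: string;   // defaults to the public LangSmith API host
  sessionId: string; // assumed: project already resolved to an ID
  startTime: string; // ISO 8601 lower bound
  limit: number;
}

function buildRunsQueryRequest(opts: RunsQueryOptions): HttpRequestSketch {
  // Strip trailing slashes so self-hosted base URLs join cleanly.
  const base = (opts.apiUrl || "https://api.smith.langchain.com").replace(/\/+$/, "");
  return {
    url: base + "/runs/query",
    method: "POST",
    headers: {
      "content-type": "application/json",
      "x-api-key": opts.apiKey, // assumed auth header name
    },
    // Assumed body field names; check them before relying on this.
    body: JSON.stringify({
      session: [opts.sessionId],
      is_root: true,
      start_time: opts.startTime,
      limit: opts.limit,
    }),
  };
}
```

The returned object can be handed to any HTTP client, for example `fetch(req.url, { method: req.method, headers: req.headers, body: req.body })`, which keeps the request construction testable without network access.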

Building from source

pnpm install --filter @runtime-judgement/rj-langsmith
pnpm --filter @runtime-judgement/rj-langsmith build
pnpm --filter @runtime-judgement/rj-langsmith test

The output lands in dist/ with .d.ts declarations.
