@runtime-judgement/rj-langfuse
v0.1.0
Langfuse → Runtime Judgement bridge — pulls failed observations, runs a snapshot suite, writes scores back as Langfuse scores. Zero customer code change.
@runtime-judgement/rj-langfuse
The Langfuse ↔ Runtime Judgement bridge. Point it at a Langfuse project:
RJ pulls failed observations, runs a snapshot suite, and writes the verdict
back to Langfuse as a score. Zero customer code change.
5-minute setup
1. Install
npm install @runtime-judgement/rj-langfuse
# or
pnpm add @runtime-judgement/rj-langfuse
2. Set the environment
# Langfuse — get these from https://cloud.langfuse.com/project/.../settings
export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
# Runtime Judgement — get these from https://runtime-judgement-app.vercel.app/app/settings
export RJ_API_URL="https://runtime-judgement-app.vercel.app"
export RJ_API_KEY="rj_..."
# The snapshot suite to run on each cycle
export RJ_SUITE_ID="01HZ..."
3. Run a cycle
npx rj-langfuse run --since 24h
You'll see a single line on stdout:
rj-langfuse: ingested=12 passed=10 changed=2 scores=12 elapsed=4318ms
Behind the scenes, the bridge:
- Pulled every Langfuse trace from the last 24 hours that contains at least one observation with level=ERROR or a statusMessage.
- Normalised each one from Langfuse's traces + observations + scores model to the OTEL gen_ai.* semconv that RJ speaks.
- POSTed each normalised trace to /api/traces on your RJ instance. RJ deduplicates by SHA-256(body) per user, so re-running this is safe.
- Triggered the named snapshot suite against your snapshots.
- Wrote one Langfuse score per ingested trace (runtime_judgement.<suite-id>) with value=1.0 so the verdict shows up in Langfuse dashboards.
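The deduplication step is what makes re-running a cycle safe. A minimal sketch of the idea, assuming the key is the hex SHA-256 of the exact request body and is scoped per user; `bodyKey` and `SeenStore` are hypothetical names for illustration, not part of the RJ API, and the server-side scheme may differ in detail:

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: hex SHA-256 of the exact serialized request body.
function bodyKey(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}

// Hypothetical in-memory stand-in for a per-user dedup store.
class SeenStore {
  private seen = new Map<string, Set<string>>();

  // Returns true if the body is new for this user, false if it is a duplicate.
  accept(userId: string, body: string): boolean {
    const keys = this.seen.get(userId) ?? new Set<string>();
    this.seen.set(userId, keys);
    const key = bodyKey(body);
    if (keys.has(key)) return false;
    keys.add(key);
    return true;
  }
}

const store = new SeenStore();
const body = JSON.stringify({ trace_id: "t1", spans: [] });
console.log(store.accept("user-a", body)); // true: first time this user sends the body
console.log(store.accept("user-a", body)); // false: identical bytes, same user
console.log(store.accept("user-b", body)); // true: same bytes, different user
```

Because the hash covers the body bytes, a scheme like this only deduplicates when the same trace serializes to the same string on every run.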
SDK usage
The CLI is a thin wrapper around the LangfuseBridge class. Drop the SDK
into your own job system (Inngest, Vercel Cron, Trigger.dev, plain cron):
import { LangfuseBridge } from "@runtime-judgement/rj-langfuse"
const bridge = new LangfuseBridge({
  langfuseHost: process.env.LANGFUSE_HOST!,
  langfusePublicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  langfuseSecretKey: process.env.LANGFUSE_SECRET_KEY!,
  rjApiUrl: process.env.RJ_API_URL!,
  rjApiKey: process.env.RJ_API_KEY!,
  rjSuiteId: process.env.RJ_SUITE_ID!,
})
// One-shot
const summary = await bridge.cycle({ since: "24h", limit: 100 })
// → { ingested: 12, verdicts: { passed: 10, changed: 2 }, scoresWritten: 12 }
// Or decomposed
const ingested = await bridge.pullAndIngest({ since: "24h" })
const run = await bridge.runSnapshotSuite()
const scores = await bridge.writeBackScores(run.suiteRunId)
What's mapped (v0.1: the 80% case)
| Langfuse | OTEL gen-ai |
| ----------------------------------- | ----------------------------------------------------- |
| trace.id | trace_id |
| observation.id | span_id |
| observation.parentObservationId | parent_span_id |
| observation.name | span_name |
| observation.startTime / endTime | timestamp + duration (ns) |
| observation.level=ERROR | status_code = "Error" |
| observation.statusMessage | status_message |
| GENERATION model | gen_ai.request.model, gen_ai.response.model |
| GENERATION usage.{input,output} | gen_ai.usage.{input,output}_tokens |
| GENERATION modelParameters.* | gen_ai.request.{temperature,max_tokens,top_p,...} |
| GENERATION input / output | gen_ai.prompt / gen_ai.completion (JSON-stringified) |
| GENERATION promptName/Version | gen_ai.prompt.name / gen_ai.prompt.version |
| SPAN with name=tool:* | tool.name, tool.parameters, tool.output |
| SPAN input / output (non-tool) | input.value / output.value |
| trace.{userId,sessionId,tags,...} | resource_attributes.langfuse.trace.* |
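To make a few rows of the table concrete, here is a rough sketch of how one observation might be turned into a span. The `Observation` and `OtelSpan` shapes and the `toOtelSpan` function are simplified assumptions for illustration; the package's real types and its `langfuseObservationToOtelSpan` may look different, and only a subset of the mapped fields is shown:

```typescript
// Hypothetical, simplified shapes; the package's real types may differ.
interface Observation {
  id: string;
  parentObservationId?: string;
  name: string;
  type: "GENERATION" | "SPAN";
  startTime: string; // ISO 8601
  endTime: string; // ISO 8601
  level?: string;
  statusMessage?: string;
  model?: string;
  usage?: { input?: number; output?: number };
}

interface OtelSpan {
  trace_id: string;
  span_id: string;
  parent_span_id?: string;
  span_name: string;
  timestamp: string;
  duration_ns: number;
  status_code?: string;
  status_message?: string;
  attributes: Record<string, string | number>;
}

// Pure function: same observation in, same span out, no I/O.
function toOtelSpan(traceId: string, obs: Observation): OtelSpan {
  const attributes: Record<string, string | number> = {};
  if (obs.type === "GENERATION") {
    if (obs.model !== undefined) {
      attributes["gen_ai.request.model"] = obs.model;
      attributes["gen_ai.response.model"] = obs.model;
    }
    if (obs.usage?.input !== undefined) attributes["gen_ai.usage.input_tokens"] = obs.usage.input;
    if (obs.usage?.output !== undefined) attributes["gen_ai.usage.output_tokens"] = obs.usage.output;
  }
  // Date.parse yields milliseconds, so the duration is scaled to nanoseconds.
  const durationMs = Date.parse(obs.endTime) - Date.parse(obs.startTime);
  const span: OtelSpan = {
    trace_id: traceId,
    span_id: obs.id,
    span_name: obs.name,
    timestamp: obs.startTime,
    duration_ns: durationMs * 1e6,
    attributes,
  };
  if (obs.parentObservationId !== undefined) span.parent_span_id = obs.parentObservationId;
  if (obs.level === "ERROR") span.status_code = "Error";
  if (obs.statusMessage !== undefined) span.status_message = obs.statusMessage;
  return span;
}

const exampleSpan = toOtelSpan("trace-1", {
  id: "obs-1",
  name: "chat-completion",
  type: "GENERATION",
  startTime: "2024-01-01T00:00:00.000Z",
  endTime: "2024-01-01T00:00:01.500Z",
  level: "ERROR",
  statusMessage: "upstream timeout",
  model: "example-model",
  usage: { input: 10, output: 5 },
});
console.log(exampleSpan.status_code, exampleSpan.duration_ns); // Error 1500000000
```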
What's not mapped yet (v0.1 gaps)
- Multi-turn message arrays inside input/output are stringified verbatim. Real per-message events would need splitting into OTEL events under the parent span. The RJ class-6 extractor handles the stringified form today.
- Langfuse scores are NOT round-tripped back into the OTEL output; they're handled separately via writeBackScores.
- Tool-call arguments inside the OpenAI-shaped tool_calls array are captured as tool.parameters only when the span name follows the tool:<name>, tool_call, function_call, or *_tool patterns. Real OpenAI tool-call extraction belongs in the RJ parser, not the normalizer.
If your workload hits a gap, the normalizer's langfuseObservationToOtelSpan
is a pure function — open an issue with a sample observation and the
gap can land in a v0.2 patch.
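To check quickly whether a span name would fall under the tool-span patterns listed above, something like the following could be used. This is a guess at the matching rules, assuming a prefix match for `tool:<name>`, exact matches for `tool_call` and `function_call`, and a suffix match for `*_tool`; `isToolSpanName` is a hypothetical helper, not an export of the package:

```typescript
// Hypothetical matcher for the span-name patterns named in the gaps list.
function isToolSpanName(spanName: string): boolean {
  // tool:<name> requires a non-empty name after the prefix.
  if (spanName.startsWith("tool:") && spanName.length > "tool:".length) return true;
  // tool_call and function_call are treated as exact names (an assumption).
  if (spanName === "tool_call" || spanName === "function_call") return true;
  // *_tool requires at least one character before the suffix.
  if (spanName.endsWith("_tool") && spanName.length > "_tool".length) return true;
  return false;
}

console.log(isToolSpanName("tool:search")); // true
console.log(isToolSpanName("weather_tool")); // true
console.log(isToolSpanName("function_call")); // true
console.log(isToolSpanName("chat-completion")); // false
```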
CLI reference
rj-langfuse run --suite <id> --since <duration|ISO> [--limit <n>]
| Flag | Env var | Default | Notes |
| ---------- | ---------------- | ------- | ----------------------------------------------- |
| --suite | RJ_SUITE_ID | — | Snapshot suite ULID |
| --since | — | 24h | 24h, 7d, 30m or ISO 8601 timestamp |
| --limit | — | 100 | Max Langfuse traces per cycle (capped at 1000) |
Exit codes:
- 0: cycle completed (verdicts in stdout)
- 1: config / argument error
- 2: runtime error (HTTP failure, malformed response)
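For callers who build the same arguments in their own scheduler, the `--since` and `--limit` semantics from the table can be approximated as below. This is an illustrative sketch, not the CLI's actual parser: `resolveSince` and `clampLimit` are hypothetical names, and the real implementation may accept or reject inputs differently.

```typescript
// Hypothetical: resolve "24h", "7d", "30m" or an ISO 8601 timestamp to a Date.
function resolveSince(value: string, now: Date = new Date()): Date {
  const match = /^(\d+)([mhd])$/.exec(value);
  if (match) {
    const amount = Number(match[1]);
    const unit = match[2];
    const unitMs = unit === "m" ? 60 * 1000 : unit === "h" ? 3600 * 1000 : 86400 * 1000;
    return new Date(now.getTime() - amount * unitMs);
  }
  // Date.parse is lenient; a stricter ISO 8601 check may be preferable in practice.
  const parsed = Date.parse(value);
  if (Number.isNaN(parsed)) throw new Error(`invalid --since value: ${value}`);
  return new Date(parsed);
}

// Hypothetical: default of 100, capped at 1000, as listed in the flag table.
function clampLimit(limit: number = 100): number {
  return Math.min(limit, 1000);
}

const now = new Date("2024-01-02T00:00:00.000Z");
console.log(resolveSince("24h", now).toISOString()); // 2024-01-01T00:00:00.000Z
console.log(resolveSince("30m", now).toISOString()); // 2024-01-01T23:30:00.000Z
console.log(clampLimit(5000)); // 1000
```

In a wrapper script, an error thrown by `resolveSince` would correspond to the config / argument case (exit code 1) rather than a runtime failure.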
Building from source
pnpm install --filter @runtime-judgement/rj-langfuse
pnpm --filter @runtime-judgement/rj-langfuse build
pnpm --filter @runtime-judgement/rj-langfuse test
The output lands in dist/ with .d.ts declarations.
