@runtime-judgement/rj-langfuse

v0.1.0, published 19 days ago

Langfuse → Runtime Judgement bridge — pulls failed observations, runs a snapshot suite, writes scores back as Langfuse scores. Zero customer code change.

Vulnerabilities: 0 high, 0 medium, 0 low

Maintainer: rossamac01

Keywords: langfuse, runtime-judgement, observability, agents, llm, attribution, bridge

The Langfuse ↔ Runtime Judgement bridge. Point it at a Langfuse project and RJ pulls failed observations, runs a snapshot suite, and writes the verdict back to Langfuse as a score. No customer code changes are required.

5-minute setup

1. Install

npm install @runtime-judgement/rj-langfuse
# or
pnpm add @runtime-judgement/rj-langfuse

2. Set the environment

# Langfuse — get these from https://cloud.langfuse.com/project/.../settings
export LANGFUSE_HOST="https://cloud.langfuse.com"
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."

# Runtime Judgement — get these from https://runtime-judgement-app.vercel.app/app/settings
export RJ_API_URL="https://runtime-judgement-app.vercel.app"
export RJ_API_KEY="rj_..."

# The snapshot suite to run on each cycle
export RJ_SUITE_ID="01HZ..."

3. Run a cycle

npx rj-langfuse run --since 24h

You'll see a single line on stdout:

rj-langfuse: ingested=12 passed=10 changed=2 scores=12 elapsed=4318ms

Behind the scenes the bridge:

1. Pulled every Langfuse trace from the last 24 hours (up to --limit, default 100) that contains at least one observation with level=ERROR or a statusMessage.
2. Normalised each one from Langfuse's traces+observations+scores model to the OTEL gen_ai.* semconv that RJ speaks.
3. POSTed each normalised trace to /api/traces on your RJ instance. RJ deduplicates by SHA-256(body) per user, so re-running the cycle is safe.
4. Triggered the snapshot suite named by RJ_SUITE_ID against your snapshots.
5. Wrote one Langfuse score per ingested trace, named runtime_judgement.<suite-id>, with value=1.0 so the verdict shows up in Langfuse dashboards.
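The selection and deduplication steps above can be sketched as follows. This is a minimal illustration, not the bridge's actual code: the `Observation` and `Trace` shapes, the function names, and the choice of a hex-encoded digest over the UTF-8 body are all assumptions made for the example.

```typescript
import { createHash } from "node:crypto";

// Hypothetical, trimmed-down shapes; real Langfuse payloads carry many more fields.
interface Observation {
  id: string;
  level?: string; // e.g. "DEFAULT", "WARNING", "ERROR"
  statusMessage?: string | null;
}

interface Trace {
  id: string;
  observations: Observation[];
}

// An observation counts as failed when its level is ERROR or it carries a
// non-empty statusMessage (treating "" as absent is an assumption).
function isFailedObservation(obs: Observation): boolean {
  return obs.level === "ERROR" || Boolean(obs.statusMessage);
}

// Keep only traces that contain at least one failed observation.
function selectFailedTraces(traces: Trace[]): Trace[] {
  return traces.filter((t) => t.observations.some(isFailedObservation));
}

// Hex SHA-256 of a request body: identical bodies always give the same digest,
// which is the property that makes re-posting the same trace idempotent.
function bodyDigest(body: string): string {
  return createHash("sha256").update(body, "utf8").digest("hex");
}
```

Because the digest depends only on the body, posting the same normalised trace twice yields the same key, which is why a repeated cycle does not create duplicates.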

SDK usage

The CLI is a thin wrapper around the LangfuseBridge class. Drop the SDK into your own job system (Inngest, Vercel Cron, Trigger.dev, plain cron):

import { LangfuseBridge } from "@runtime-judgement/rj-langfuse"

const bridge = new LangfuseBridge({
  langfuseHost: process.env.LANGFUSE_HOST!,
  langfusePublicKey: process.env.LANGFUSE_PUBLIC_KEY!,
  langfuseSecretKey: process.env.LANGFUSE_SECRET_KEY!,
  rjApiUrl: process.env.RJ_API_URL!,
  rjApiKey: process.env.RJ_API_KEY!,
  rjSuiteId: process.env.RJ_SUITE_ID!,
})

// One-shot
const summary = await bridge.cycle({ since: "24h", limit: 100 })
// → { ingested: 12, verdicts: { passed: 10, changed: 2 }, scoresWritten: 12 }

// Or decomposed
const ingested = await bridge.pullAndIngest({ since: "24h" })
const run = await bridge.runSnapshotSuite()
const scores = await bridge.writeBackScores(run.suiteRunId)
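As a rough sketch of wiring a cycle into a job handler, the snippet below runs one cycle and builds a log line in the same shape as the CLI output shown earlier. `CycleSummary` is inferred from the example result above rather than taken from the package's type declarations, and `CycleRunner`, `formatSummary`, and `runScheduledCycle` are hypothetical names. `CycleRunner` is a structural stand-in so the sketch runs without the package installed.

```typescript
// Summary shape inferred from the example result above (an assumption).
interface CycleSummary {
  ingested: number;
  verdicts: { passed: number; changed: number };
  scoresWritten: number;
}

// Anything with a compatible cycle() method fits, including a test stub.
interface CycleRunner {
  cycle(opts: { since: string; limit?: number }): Promise<CycleSummary>;
}

// Build a one-line summary in the same shape as the CLI's stdout line.
function formatSummary(s: CycleSummary, elapsedMs: number): string {
  return (
    `rj-langfuse: ingested=${s.ingested} passed=${s.verdicts.passed} ` +
    `changed=${s.verdicts.changed} scores=${s.scoresWritten} ` +
    `elapsed=${Math.round(elapsedMs)}ms`
  );
}

// Hypothetical job handler: run one cycle, time it, return the log line.
async function runScheduledCycle(
  bridge: CycleRunner,
  since: string = "24h",
  limit: number = 100,
): Promise<string> {
  const started = Date.now();
  const summary = await bridge.cycle({ since, limit });
  return formatSummary(summary, Date.now() - started);
}
```

Depending on a structural interface rather than the concrete class keeps the handler easy to test with a stub.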

What's mapped (v0.1 — the 80% case)

| Langfuse | OTEL gen-ai |
| --- | --- |
| trace.id | trace_id |
| observation.id | span_id |
| observation.parentObservationId | parent_span_id |
| observation.name | span_name |
| observation.startTime / endTime | timestamp + duration (ns) |
| observation.level=ERROR | status_code = "Error" |
| observation.statusMessage | status_message |
| GENERATION model | gen_ai.request.model, gen_ai.response.model |
| GENERATION usage.{input,output} | gen_ai.usage.{input,output}_tokens |
| GENERATION modelParameters.* | gen_ai.request.{temperature,max_tokens,top_p,...} |
| GENERATION input / output | gen_ai.prompt / gen_ai.completion (JSON-stringified) |
| GENERATION promptName/Version | gen_ai.prompt.name / gen_ai.prompt.version |
| SPAN with name=tool:* | tool.name, tool.parameters, tool.output |
| SPAN input / output (non-tool) | input.value / output.value |
| trace.{userId,sessionId,tags,...} | resource_attributes.langfuse.trace.* |
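To make a few rows of the table concrete, here is a minimal sketch of the mapping for IDs, timing, status, and model. It is not the package's normalizer: the input and output shapes, the `duration_ns` field name, and `toOtelSpanSketch` itself are assumptions chosen for illustration.

```typescript
// Hypothetical, reduced input shape covering only a few rows of the table.
interface LangfuseObservationSketch {
  id: string;
  traceId: string;
  parentObservationId?: string | null;
  name: string;
  startTime: string; // ISO 8601
  endTime: string; // ISO 8601
  level?: string;
  statusMessage?: string | null;
  model?: string;
}

// Hypothetical output shape; field names follow the right-hand table column.
interface OtelSpanSketch {
  trace_id: string;
  span_id: string;
  parent_span_id: string | null;
  span_name: string;
  timestamp: string;
  duration_ns: number;
  status_code?: string;
  status_message?: string;
  attributes: Record<string, string>;
}

function toOtelSpanSketch(obs: LangfuseObservationSketch): OtelSpanSketch {
  const startMs = Date.parse(obs.startTime);
  const endMs = Date.parse(obs.endTime);
  const span: OtelSpanSketch = {
    trace_id: obs.traceId,
    span_id: obs.id,
    parent_span_id: obs.parentObservationId ?? null,
    span_name: obs.name,
    timestamp: obs.startTime,
    duration_ns: (endMs - startMs) * 1e6, // milliseconds to nanoseconds
    attributes: {},
  };
  if (obs.level === "ERROR") span.status_code = "Error";
  if (obs.statusMessage) span.status_message = obs.statusMessage;
  if (obs.model) {
    // Per the table, one Langfuse model field feeds both attributes.
    span.attributes["gen_ai.request.model"] = obs.model;
    span.attributes["gen_ai.response.model"] = obs.model;
  }
  return span;
}
```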

What's not mapped yet (v0.1 gaps)

- Multi-turn message arrays inside input/output are stringified verbatim. Real per-message events would need splitting into OTEL events under the parent span. The RJ class-6 extractor handles the stringified form today.
- Langfuse scores are NOT round-tripped back into the OTEL output; they're handled separately via writeBackScores.
- Tool-call arguments inside the OpenAI-shaped tool_calls array are captured as tool.parameters only when the span name follows the tool:<name>, tool_call, function_call, or *_tool patterns. Real OpenAI tool-call extraction belongs in the RJ parser, not the normalizer.

If your workload hits a gap, the normalizer's langfuseObservationToOtelSpan is a pure function — open an issue with a sample observation and the gap can land in a v0.2 patch.
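For checking whether your own span names would be treated as tool spans, the four patterns listed above can be approximated like this. The function name and the exact matching rules (prefix match for `tool:`, exact match for `tool_call` and `function_call`, suffix match for `_tool`) are assumptions; the package may match differently.

```typescript
// Approximation of the span-name patterns listed above (assumed semantics).
function looksLikeToolSpan(spanName: string): boolean {
  return (
    spanName.startsWith("tool:") || // tool:<name>
    spanName === "tool_call" ||
    spanName === "function_call" ||
    spanName.endsWith("_tool") // *_tool
  );
}
```

A span named `search` would not match under these rules, so its OpenAI-shaped arguments would fall into the gap described above.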

CLI reference

rj-langfuse run [--suite <id>] [--since <duration|ISO>] [--limit <n>]

| Flag | Env var | Default | Notes |
| --- | --- | --- | --- |
| --suite | RJ_SUITE_ID | (none) | Snapshot suite ULID |
| --since | (none) | 24h | 24h, 7d, 30m or ISO 8601 timestamp |
| --limit | (none) | 100 | Max Langfuse traces per cycle (capped at 1000) |
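The `--since` and `--limit` semantics in the table can be illustrated with a small sketch. It assumes that durations are a whole number followed by `m`, `h`, or `d`, that anything else is parsed as an ISO 8601 timestamp, and that invalid limits are rejected. `resolveSince` and `clampLimit` are hypothetical helpers, not part of the package.

```typescript
// Resolve a --since value to an absolute cutoff time.
function resolveSince(value: string, now: Date = new Date()): Date {
  const match = /^(\d+)([mhd])$/.exec(value);
  if (match !== null) {
    const amount = Number(match[1]);
    const unit = match[2];
    const unitMs = unit === "m" ? 60000 : unit === "h" ? 3600000 : 86400000;
    return new Date(now.getTime() - amount * unitMs);
  }
  const parsed = Date.parse(value);
  if (Number.isNaN(parsed)) {
    throw new Error(`invalid --since value: ${value}`);
  }
  return new Date(parsed);
}

// Apply the default of 100 and the cap of 1000 from the table.
function clampLimit(raw?: number): number {
  if (raw === undefined) return 100;
  if (!Number.isInteger(raw) || raw < 1) {
    throw new Error(`invalid --limit value: ${raw}`);
  }
  return Math.min(raw, 1000);
}
```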

Exit codes:

- 0: cycle completed (verdicts on stdout)
- 1: config / argument error
- 2: runtime error (HTTP failure, malformed response)
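If you wrap the SDK in your own entry point and want it to follow the same exit-code convention, one possible shape is sketched below. The `CycleOutcome` type and `exitCodeFor` are hypothetical; the CLI's internal error handling is not documented here.

```typescript
// Hypothetical outcome type for a wrapper script around a cycle.
type CycleOutcome =
  | { kind: "ok" }
  | { kind: "config_error"; message: string } // bad flag, missing env var
  | { kind: "runtime_error"; message: string }; // HTTP failure, malformed response

// Map an outcome to the exit codes listed above.
function exitCodeFor(outcome: CycleOutcome): 0 | 1 | 2 {
  switch (outcome.kind) {
    case "ok":
      return 0;
    case "config_error":
      return 1;
    case "runtime_error":
      return 2;
  }
}
```

A scheduler can then treat 1 as "fix the configuration, do not retry" and 2 as "possibly transient, retry later", although that retry policy is a suggestion rather than documented behaviour.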

Building from source

pnpm install --filter @runtime-judgement/rj-langfuse
pnpm --filter @runtime-judgement/rj-langfuse build
pnpm --filter @runtime-judgement/rj-langfuse test

The output lands in dist/ with .d.ts declarations.
