# @ai-eval/bedrock-tracer
TypeScript SDK for integrating your Node.js / Amazon Bedrock agent with the Argus evaluation and observability dashboard.
The SDK provides two integration modes:
| Mode | When to use |
|---|---|
| `instrument()` | Live tracing — wrap an existing `BedrockRuntimeClient` to stream real-time traces into Argus with zero agent logic changes |
| `EvalTracer` + `EvalServer` | Batch evaluation — expose an eval endpoint that the Python runner calls to score offline prompt sets |
## Live Tracing — `instrument()`
### Installation

```bash
npm install
npm run build   # compiles TypeScript → dist/
```

### Usage
```typescript
import { BedrockRuntimeClient } from "@aws-sdk/client-bedrock-runtime";
import { instrument } from "./dist/instrument";

const client = new BedrockRuntimeClient({ region: "us-east-1" });

const tracer = instrument(client, {
  server: "http://localhost:7070",    // Argus UI base URL
  apiKey: "aek_...",                  // project API key from Argus Settings
  runName: "My Agent — Production",   // optional label shown in the UI
  tags: ["prod", "v2"],
  agentId: "my-agent",
  verbose: true,                      // log each push to console (default: true)
});

// Use `client` exactly as before — instrument() intercepts transparently
const response = await client.send(converseCommand);

// Stop intercepting when done
tracer.stop();
console.log("run ID:", tracer.runId);
```

### How it works
`instrument()` patches `client.send()` to intercept `ConverseCommand` and `ConverseStreamCommand` calls.

- On the first call of each agent invocation it captures:
  - **System prompt** from `command.input.system`
  - **Context** — the injected content block(s) prepended before the user question (e.g. page URL, user state JSON)
  - **User prompt** — the actual last user question (last content block in the current turn)
- After each LLM step it pushes a trace snapshot to `POST /api/ingest` — the same run is updated in-place, so you see live progress in the dashboard.
- When the model returns `end_turn`, the run is marked complete and state is reset for the next invocation.
- A conversation ID is auto-derived from a djb2 hash of the system prompt + first user message — the same conversation always gets the same ID across turns with no extra code (see the sketch below).
### Context vs. User Prompt separation
If your agent prepends an injected context block as a separate content block before the user's question, the SDK detects and separates them automatically:
```typescript
command.input.messages = [
  {
    role: "user",
    content: [
      { text: "Current page url is https://... user state: {...}" }, // ← context
      { text: "Give me a login overview" }                           // ← user prompt
    ]
  }
]
```

- `context` is sent separately in the ingest payload and shown in a collapsible Context card in the Argus trace detail panel.
- `prompt` is the actual user question — shown in the traces table and the User Prompt card.
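A rough sketch of the separation rule implied above, assuming the split happens on the text blocks of the final user message (the SDK's actual detection heuristics may differ):

```typescript
// Sketch only: with two or more text blocks in the final user message,
// treat everything before the last block as context.
interface TextBlock { text?: string }

function splitContextAndPrompt(blocks: TextBlock[]) {
  const texts = blocks.map((b) => b.text ?? "").filter((t) => t.length > 0);
  if (texts.length < 2) {
    return { context: undefined, prompt: texts[0] ?? "" };
  }
  return {
    context: texts.slice(0, -1).join("\n"),
    prompt: texts[texts.length - 1],
  };
}
```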
### InstrumentOptions

| Option | Type | Default | Description |
|---|---|---|---|
| `server` | `string` | required | Argus UI base URL, e.g. `"http://localhost:7070"` |
| `apiKey` | `string` | — | Project API key — automatically assigns traces to the matching project |
| `runName` | `string` | `"Live — <date>"` | Label for the run shown in the UI |
| `tags` | `string[]` | — | Tags attached to the run |
| `agentId` | `string` | — | Agent identifier stored in run metadata |
| `verbose` | `boolean` | `true` | Log each push to console |
### Instrumenter interface

```typescript
interface Instrumenter {
  readonly runId: string | null;  // eval-server run ID (null until first push succeeds)
  stop(): void;                   // restores the original client.send()
}
```

## Batch Evaluation — `EvalTracer` + `EvalServer`
Use this mode when the Python eval runner drives evaluation (it calls your agent server with each prompt from a prompt set).
```
Python eval runner
        │
        │  POST /eval { prompt, system_prompt, run_id, prompt_id }
        ▼
EvalServer (this SDK)
        │
        │  calls your handler
        ▼
EvalTracer.run(prompt, toolHandler, tools, systemPrompt)
        │
        │  Bedrock ConverseCommand loop
        ▼
Bedrock (Claude / Nova / Llama …)
        │
        │  tool_use → toolHandler → tool results
        ▼
EvalTracer returns { content, trace }
        │
        │  POST response { content, trace: EvalTrace }
        ▼
Python eval runner  ← evaluators run against content + trace
```
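To make the wire format concrete, one exchange could look like this; the field names come from the diagram above, while the values are invented for illustration:

```typescript
// Illustrative payloads only: values are made up, shapes follow the diagram.
const evalRequest = {
  prompt: "Give me a login overview",
  system_prompt: "You are a helpful assistant.",
  run_id: "run-2024-01-15",    // assigned by the Python runner
  prompt_id: "prompt-007",     // identifies the entry in the prompt set
};

const evalResponse = {
  content: "Here is an overview of the login flow...",
  trace: { /* EvalTrace, see "Shared types" below */ },
};
```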
### Quick start

```typescript
import {
  EvalTracer,
  EvalServer,
  ToolHandler,
} from "@ai-eval/bedrock-tracer";
import { Tool } from "@aws-sdk/client-bedrock-runtime";

const tools: Tool[] = [
  {
    toolSpec: {
      name: "calculator",
      description: "Evaluate a simple arithmetic expression",
      inputSchema: {
        json: {
          type: "object",
          properties: {
            expression: { type: "string", description: "e.g. '2 + 2'" },
          },
          required: ["expression"],
        },
      },
    },
  },
];

const toolHandler: ToolHandler = async (name, input) => {
  if (name === "calculator") {
    // Demo only: avoid evaluating untrusted input in production.
    const result = Function(`"use strict"; return (${input.expression})`)();
    return String(result);
  }
  throw new Error(`Unknown tool: ${name}`);
};

const tracer = new EvalTracer({
  modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
  region: "us-east-1",
  maxIterations: 10,
});

const server = new EvalServer({
  port: 3000,
  handler: async (req) => {
    const { content, trace } = await tracer.run(
      req.prompt,
      toolHandler,
      tools,
      req.system_prompt ?? undefined
    );
    return { content, trace };
  },
});

server.start();
// Listening on http://localhost:3000
//   POST /eval   — eval endpoint
//   GET  /health — health check
```
### EvalTracer options

| Option | Type | Default | Description |
|---|---|---|---|
| `modelId` | `string` | required | Bedrock model ID |
| `region` | `string` | `AWS_REGION` env / `"us-east-1"` | AWS region |
| `maxIterations` | `number` | `20` | Max agentic loop iterations |
| `client` | `BedrockRuntimeClient` | auto-created | Bring your own client |
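For example, `client` lets you bring a preconfigured client; a sketch, assuming the option simply replaces the auto-created `BedrockRuntimeClient`:

```typescript
import { BedrockRuntimeClient } from "@aws-sdk/client-bedrock-runtime";
import { EvalTracer } from "@ai-eval/bedrock-tracer";

// Sketch: custom region and retry behaviour via your own client.
const tracer = new EvalTracer({
  modelId: "anthropic.claude-3-5-sonnet-20241022-v2:0",
  client: new BedrockRuntimeClient({ region: "eu-west-1", maxAttempts: 5 }),
});
```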
```typescript
tracer.run(
  prompt: string,
  toolHandler?: ToolHandler,
  tools?: Tool[],
  systemPrompt?: string,
): Promise<{ content: string; trace: EvalTrace }>
```
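Both tool parameters are optional, so a minimal no-tool call reduces to:

```typescript
// Sketch: single-shot call with no tools and the default system prompt.
const { content, trace } = await tracer.run("What is 2 + 2?");
console.log(content, trace.totalOutputTokens, trace.terminatedReason);
```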
### EvalServer options

| Option | Type | Default | Description |
|---|---|---|---|
| `port` | `number` | `3000` | Listen port |
| `evalPath` | `string` | `"/eval"` | POST endpoint path |
| `healthPath` | `string` | `"/health"` | GET health check path |
| `handler` | `AgentHandler` | required | Your agent function |
| `logger` | Console-like | `console` | Custom logger |
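If the defaults don't fit your deployment, the port, paths, and logger can all be overridden; a sketch using the options from the table above:

```typescript
// Sketch: non-default port and endpoint paths; logger defaults to console.
const server = new EvalServer({
  port: 8080,
  evalPath: "/run-eval",
  healthPath: "/ping",
  handler: async (req) => tracer.run(req.prompt),
  logger: console,
});
server.start();
```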
### Connecting to the Python eval runner
```python
from eval_framework.connectors.node_connector import NodeAgentConnector
from eval_framework.runner.runner import EvaluationRunner

connector = NodeAgentConnector(
    base_url="http://localhost:3000",
    timeout_seconds=120,
)
runner = EvaluationRunner(connector=connector, ...)
```

## Shared types
```typescript
interface EvalTrace {
  steps: StepTrace[];          // one per Converse API call
  toolCallChain: string[];     // ordered list of every tool invoked
  toolsUsed: string[];         // unique tools (ordered by first use)
  totalInputTokens: number;
  totalOutputTokens: number;
  totalToolCalls: number;
  totalLatencyMs: number;
  finalResponse: string;
  terminatedReason: "end_turn" | "max_iterations" | "error" | "unknown";
}
```
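As an illustration of how these fields compose in practice, a custom check might assert over a finished trace (hypothetical helper, not part of the SDK):

```typescript
// Hypothetical helper: verifies the agent used the calculator tool,
// finished cleanly, and stayed within a tool-call budget.
function passesToolPolicy(trace: EvalTrace): boolean {
  return (
    trace.toolsUsed.includes("calculator") &&
    trace.terminatedReason === "end_turn" &&
    trace.totalToolCalls <= 5
  );
}
```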
## Building

```bash
npm run build   # compiles to dist/
npm run dev     # watch mode
```

## AWS credentials
Credentials are resolved by `@aws-sdk/client-bedrock-runtime` in standard priority order:

1. `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` / `AWS_SESSION_TOKEN` env vars
2. `~/.aws/credentials` profile (`AWS_PROFILE` env var)
3. IAM instance / task role
