
@bitkaio/langchain-kserve

v0.2.1

LangChain.js integration for KServe inference services

@bitkaio/langchain-kserve

LangChain.js integration for KServe inference services.

Connect LangChain chains and agents to any model hosted on KServe — whether it's a vLLM-served Qwen2.5-Coder behind an Istio ingress, a Triton-served custom model, or a TGI-served Llama — with full streaming, tool calling, vision, and production-grade connection handling.

Features

  • Two model classes: ChatKServe for chat/instruct models, KServeLLM for base completion models
  • Dual protocol support: OpenAI-compatible API (/v1/chat/completions) and V2 Inference Protocol (/v2/models/{model}/infer), with auto-detection
  • Full streaming: SSE for OpenAI-compat, NDJSON for V2
  • Tool calling: Full OpenAI function-calling format with tool_choice, parallel_tool_calls, and proper invalid_tool_calls handling for malformed responses
  • Vision / multimodal: Send images alongside text via OpenAI content blocks
  • Token usage tracking: llmOutput and generationInfo populated from vLLM responses, including streaming
  • Logprobs: Optional per-token log-probabilities in generationInfo
  • Finish reason: Always propagated in generationInfo and response_metadata
  • Model introspection: getModelInfo() returns unified KServeModelInfo for both protocols
  • Production-ready: TLS/CA bundles, bearer token auth, async token providers (K8s service account tokens), exponential backoff with jitter, 120s default timeout for cold starts

Installation

npm install @bitkaio/langchain-kserve @langchain/core
# or
pnpm add @bitkaio/langchain-kserve @langchain/core

Requirements: Node.js 22.14+

Quick Start

import { ChatKServe } from "@bitkaio/langchain-kserve";

const llm = new ChatKServe({
  baseUrl: "https://qwen-coder.my-cluster.example.com",
  modelName: "qwen2.5-coder-32b-instruct",
  temperature: 0.2,
});

const response = await llm.invoke("Write a TypeScript binary search function.");
console.log(response.content);

Usage

Basic invocation

import { ChatKServe } from "@bitkaio/langchain-kserve";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

const llm = new ChatKServe({
  baseUrl: "https://qwen-coder.my-cluster.example.com",
  modelName: "qwen2.5-coder-32b-instruct",
});

// Single string (shorthand)
const result = await llm.invoke("Explain KServe in one sentence.");

// With messages array
const result2 = await llm.invoke([
  new SystemMessage("You are an expert TypeScript developer."),
  new HumanMessage("What is a monad?"),
]);

Streaming

const stream = await llm.stream("Implement a red-black tree in TypeScript.");

for await (const chunk of stream) {
  process.stdout.write(chunk.content as string);
}
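Under the hood, tokens arrive as SSE events on the OpenAI-compatible path and as NDJSON lines on V2. NDJSON parsing is essentially buffered line splitting; a minimal sketch (illustrative plain TypeScript, not the package's actual internals):

```typescript
// Split a buffered NDJSON stream chunk into complete JSON objects,
// returning parsed objects plus any trailing partial line to re-buffer.
function parseNDJSON(buffer: string): { objects: unknown[]; rest: string } {
  const lines = buffer.split("\n");
  const rest = lines.pop() ?? ""; // last element may be an incomplete line
  const objects = lines.filter((l) => l.trim() !== "").map((l) => JSON.parse(l));
  return { objects, rest };
}

// Two complete lines parse; the truncated tail is carried over to the next chunk.
const { objects, rest } = parseNDJSON('{"token":"Hello"}\n{"token":" world"}\n{"tok');
console.log(objects.length); // 2
console.log(rest);           // '{"tok'
```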

In a chain

import { ChatKServe } from "@bitkaio/langchain-kserve";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { StringOutputParser } from "@langchain/core/output_parsers";

const llm = new ChatKServe({
  baseUrl: "https://qwen-coder.my-cluster.example.com",
  modelName: "qwen2.5-coder-32b-instruct",
  temperature: 0.2,
  streaming: true,
});

const prompt = ChatPromptTemplate.fromMessages([
  ["system", "You are an expert {language} developer."],
  ["human", "{input}"],
]);

const chain = prompt.pipe(llm).pipe(new StringOutputParser());

const response = await chain.invoke({
  language: "TypeScript",
  input: "Implement an LRU cache",
});

Token usage tracking

Token usage is available in both llmOutput (request-level) and generationInfo (per-generation):

const result = await llm._generate([new HumanMessage("Hello")], {});

// Request-level
console.log(result.llmOutput);
// { tokenUsage: { promptTokens: 5, completionTokens: 12, totalTokens: 17 } }

// Per-generation
const info = result.generations[0].generationInfo;
console.log(info?.tokenUsage);    // { promptTokens: 5, completionTokens: 12, totalTokens: 17 }
console.log(info?.finishReason);  // "stop"

Logprobs

import { ChatKServe } from "@bitkaio/langchain-kserve";

const llm = new ChatKServe({
  baseUrl: "https://qwen-coder.my-cluster.example.com",
  modelName: "qwen2.5-coder-32b-instruct",
  protocol: "openai",
  logprobs: true,
  topLogprobs: 5,
});

const result = await llm.invoke("Hello");
const info = result.response_metadata;
console.log(info.logprobs); // { content: [{ token: "Hi", logprob: -0.3, top_logprobs: [...] }] }

Tool calling

Tool calling requires the OpenAI-compatible protocol (vLLM, TGI, etc.).

import { ChatKServe } from "@bitkaio/langchain-kserve";
import { tool } from "@langchain/core/tools";
import { z } from "zod";

const llm = new ChatKServe({
  baseUrl: "https://qwen-coder.my-cluster.example.com",
  modelName: "qwen2.5-coder-32b-instruct",
  protocol: "openai",
  parallelToolCalls: true,
});

const searchTool = tool(
  async ({ query }) => `Results for: ${query}`,
  {
    name: "search_codebase",
    description: "Search the codebase for relevant code",
    schema: z.object({ query: z.string() }),
  }
);

const llmWithTools = llm.bindTools([searchTool], { toolChoice: "auto" });

const result = await llmWithTools.invoke(
  "Find all usages of the AuthService class"
);

if (result.tool_calls && result.tool_calls.length > 0) {
  console.log("Tool call:", result.tool_calls[0]);
}

// Malformed arguments from the model are captured in invalid_tool_calls
if (result.invalid_tool_calls && result.invalid_tool_calls.length > 0) {
  console.log("Invalid call:", result.invalid_tool_calls[0]);
}
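The split between tool_calls and invalid_tool_calls comes down to whether the model's argument string parses as JSON. A minimal sketch of that dispatch (illustrative; partitionToolCalls and RawToolCall are hypothetical names, not the package's internals):

```typescript
// A raw tool call as emitted by an OpenAI-compatible runtime: the
// arguments field is a JSON *string* that the model may have malformed.
type RawToolCall = { name: string; arguments: string };

function partitionToolCalls(raw: RawToolCall[]) {
  const tool_calls: { name: string; args: unknown }[] = [];
  const invalid_tool_calls: { name: string; args: string; error: string }[] = [];
  for (const call of raw) {
    try {
      // Parses cleanly -> usable tool call with structured args
      tool_calls.push({ name: call.name, args: JSON.parse(call.arguments) });
    } catch (e) {
      // Malformed JSON -> preserved verbatim with the parse error
      invalid_tool_calls.push({ name: call.name, args: call.arguments, error: String(e) });
    }
  }
  return { tool_calls, invalid_tool_calls };
}

const { tool_calls, invalid_tool_calls } = partitionToolCalls([
  { name: "search_codebase", arguments: '{"query":"AuthService"}' },
  { name: "search_codebase", arguments: '{"query": "unterminated' },
]);
console.log(tool_calls.length, invalid_tool_calls.length); // 1 1
```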

Vision / multimodal

Send images alongside text. Works with OpenAI-compatible runtimes that support vision (e.g., vLLM with a multimodal model).

import { ChatKServe } from "@bitkaio/langchain-kserve";
import { HumanMessage } from "@langchain/core/messages";
import { readFile } from "node:fs/promises";

const llm = new ChatKServe({
  baseUrl: "https://llava.my-cluster.example.com",
  modelName: "llava-1.6",
  protocol: "openai",
});

// Base64-encoded image (preferred for cluster-internal use)
const imageData = await readFile("chart.png", { encoding: "base64" });
const message = new HumanMessage({
  content: [
    { type: "text", text: "Describe what you see in this chart:" },
    {
      type: "image_url",
      image_url: {
        url: `data:image/png;base64,${imageData}`,
        detail: "high",
      },
    },
  ],
});

const response = await llm.invoke([message]);
console.log(response.content);

URL-based images are also supported (the model pod fetches the image):

const message = new HumanMessage({
  content: [
    { type: "text", text: "What's in this image?" },
    { type: "image_url", image_url: { url: "https://example.com/image.jpg" } },
  ],
});

JSON mode / Response format

// Force valid JSON output
const llm = new ChatKServe({
  baseUrl: "...",
  modelName: "...",
  protocol: "openai",
  responseFormat: { type: "json_object" },
});

// Force output matching a specific JSON schema (vLLM grammar-constrained decoding)
const llmWithSchema = new ChatKServe({
  baseUrl: "...",
  modelName: "...",
  protocol: "openai",
  responseFormat: {
    type: "json_schema",
    json_schema: {
      name: "person",
      strict: true,
      schema: {
        type: "object",
        properties: { name: { type: "string" }, age: { type: "integer" } },
        required: ["name", "age"],
      },
    },
  },
});

Structured output

import { ChatKServe } from "@bitkaio/langchain-kserve";
import { z } from "zod";

const llm = new ChatKServe({
  baseUrl: "...",
  modelName: "...",
  protocol: "openai",
});

const Person = z.object({ name: z.string(), age: z.number() });

// functionCalling (default) — most reliable
const structured = llm.withStructuredOutput(Person);
const result = await structured.invoke("Extract: John is 30 years old.");
console.log(result.name, result.age); // "John", 30

// jsonSchema — vLLM grammar-constrained decoding
const structured2 = llm.withStructuredOutput(Person, { method: "jsonSchema" });

// jsonMode — json_object with schema instruction
const structured3 = llm.withStructuredOutput(Person, { method: "jsonMode" });

// includeRaw — get both raw AIMessage and parsed output
const structured4 = llm.withStructuredOutput(Person, { includeRaw: true });
const result4 = await structured4.invoke("Extract: John is 30.");
console.log(result4.parsed);       // { name: "John", age: 30 }
console.log(result4.raw);          // AIMessage(...)
console.log(result4.parsingError); // null

Embeddings

import { KServeEmbeddings } from "@bitkaio/langchain-kserve";

const embeddings = new KServeEmbeddings({
  baseUrl: "https://my-embedding-model.cluster.example.com",
  modelName: "Qwen/Qwen3-Embedding-0.6B",
});

// Embed documents (batched automatically)
const vectors = await embeddings.embedDocuments(["Hello world", "How are you?"]);
console.log(vectors.length, vectors[0].length); // 2, <embedding_dim>

// Embed a single query
const queryVector = await embeddings.embedQuery("What is KServe?");

// With dimensions (Matryoshka models)
const embeddingsWithDims = new KServeEmbeddings({
  baseUrl: "...",
  modelName: "...",
  dimensions: 512,
  chunkSize: 500, // max texts per API call (default: 1000)
});
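The returned vectors are plain number[] arrays, so downstream similarity ranking needs no extra dependencies; a small cosine-similarity helper (plain TypeScript, not part of the package):

```typescript
// Cosine similarity between two embedding vectors: dot(a, b) / (|a| * |b|).
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("Vector dimensions must match");
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// e.g. rank embedded documents against an embedded query:
//   vectors.map((v) => cosineSimilarity(queryVector, v));
console.log(cosineSimilarity([1, 0], [1, 0])); // 1
console.log(cosineSimilarity([1, 0], [0, 1])); // 0
```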

Model introspection

import { ChatKServe } from "@bitkaio/langchain-kserve";

const llm = new ChatKServe({
  baseUrl: "https://qwen-coder.my-cluster.example.com",
  modelName: "qwen2.5-coder-32b-instruct",
});

const info = await llm.getModelInfo();

console.log(info.modelName);    // "qwen2.5-coder-32b-instruct"
console.log(info.platform);     // "openai-compat" or V2 platform string
console.log(info.raw);          // full response from the endpoint

V2 Inference Protocol

Use protocol: "v2" for runtimes that expose the native KServe V2/Open Inference Protocol (e.g., Triton Inference Server):

import { ChatKServe } from "@bitkaio/langchain-kserve";

const llm = new ChatKServe({
  baseUrl: "https://triton.my-cluster.example.com",
  modelName: "llama-3-8b-instruct",
  protocol: "v2",
  chatTemplate: "llama",
});

const result = await llm.invoke("Explain transformers in simple terms.");

Note: Tool calling and vision are not supported on V2. Attempting either throws a KServeInferenceError immediately (before any HTTP call) with a clear message.
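Because V2 runtimes accept raw strings rather than structured messages, the client flattens the message list into a single prompt using the selected chatTemplate. A minimal sketch of ChatML-style flattening (illustrative only; the package's real template handling may differ):

```typescript
type Msg = { role: "system" | "user" | "assistant"; content: string };

// ChatML-style flattening: each message is wrapped in <|im_start|>/<|im_end|>
// markers, with a trailing assistant header to cue the model's reply.
function applyChatMLTemplate(messages: Msg[]): string {
  const body = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join("\n");
  return `${body}\n<|im_start|>assistant\n`;
}

console.log(applyChatMLTemplate([
  { role: "system", content: "You are helpful." },
  { role: "user", content: "Hi" },
]));
```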

Base (non-chat) models with KServeLLM

import { KServeLLM } from "@bitkaio/langchain-kserve";

const llm = new KServeLLM({
  baseUrl: "https://base-model.my-cluster.example.com",
  modelName: "llama-3-base",
  protocol: "v2",
  temperature: 0.8,
  maxTokens: 256,
});

const completion = await llm.invoke("Once upon a time");

// Token usage (from streaming or non-streaming)
const result = await llm.generate(["Once upon a time"]);
console.log(result.llmOutput); // { tokenUsage: {...} } if vLLM, else undefined

Environment variable configuration

The connection-related constructor options can also be set via environment variables:

| Environment Variable | Constructor Field |
|----------------------|-------------------|
| KSERVE_BASE_URL | baseUrl |
| KSERVE_MODEL_NAME | modelName |
| KSERVE_API_KEY | apiKey |
| KSERVE_PROTOCOL | protocol |
| KSERVE_CA_BUNDLE | caBundle |

export KSERVE_BASE_URL=https://qwen-coder.my-cluster.example.com
export KSERVE_MODEL_NAME=qwen2.5-coder-32b-instruct
export KSERVE_API_KEY=my-bearer-token

Authentication

Static API key / bearer token:

const llm = new ChatKServe({
  baseUrl: "https://my-model.cluster.example.com",
  modelName: "my-model",
  apiKey: "my-bearer-token",
});

Dynamic token provider (e.g., Kubernetes service account tokens):

import { readFile } from "node:fs/promises";

const llm = new ChatKServe({
  baseUrl: "https://my-model.cluster.example.com",
  modelName: "my-model",
  tokenProvider: async () => {
    const token = await readFile(
      "/var/run/secrets/kubernetes.io/serviceaccount/token",
      "utf-8"
    );
    return token.trim();
  },
});

Custom TLS / CA bundle

const llm = new ChatKServe({
  baseUrl: "https://my-model.internal.example.com",
  modelName: "my-model",
  caBundle: "/etc/ssl/certs/my-internal-ca.pem",
});

Connection tuning

const llm = new ChatKServe({
  baseUrl: "https://my-model.cluster.example.com",
  modelName: "my-model",
  timeout: 300_000,  // 5 minutes for cold starts
  maxRetries: 5,
});
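Retries use exponential backoff with jitter (per the features list); the usual shape of such a delay calculation looks like the sketch below (backoffDelayMs and the base/cap values are illustrative assumptions, not the package's actual numbers):

```typescript
// Exponential backoff with "full jitter": the delay ceiling grows as
// base * 2^attempt up to a cap, and the actual wait is a uniformly
// random point in [0, ceiling] to de-synchronize retrying clients.
function backoffDelayMs(attempt: number, baseMs = 500, capMs = 30_000): number {
  const ceiling = Math.min(capMs, baseMs * 2 ** attempt);
  return Math.random() * ceiling;
}

for (let attempt = 0; attempt < 5; attempt++) {
  console.log(`attempt ${attempt}: wait up to ${Math.min(30_000, 500 * 2 ** attempt)} ms`);
}
```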

DeepAgents integration

import { Agent } from "deepagents";
import { ChatKServe } from "@bitkaio/langchain-kserve";
import { tool } from "@langchain/core/tools";
import { z } from "zod";

const readFile = tool(
  async ({ path }) => { /* ... */ },
  {
    name: "read_file",
    description: "Read a file from the codebase",
    schema: z.object({ path: z.string() }),
  }
);

const llm = new ChatKServe({
  baseUrl: process.env.KSERVE_BASE_URL!,
  modelName: "qwen2.5-coder-32b-instruct",
  temperature: 0.2,
  streaming: true,
});

const agent = new Agent({ llm, tools: [readFile] });

const result = await agent.invoke({
  input: "Refactor the authentication module to use JWT tokens",
});

API Reference

ChatKServe

Extends BaseChatModel. Use for chat/instruct models.

Constructor options

| Option | Type | Default | Description |
|--------|------|---------|-------------|
| baseUrl | string | KSERVE_BASE_URL env | KServe inference service URL |
| modelName | string | KSERVE_MODEL_NAME env | Model name as registered in KServe |
| protocol | "openai" \| "v2" \| "auto" | "auto" | Inference protocol |
| apiKey | string | KSERVE_API_KEY env | Static bearer token |
| tokenProvider | () => Promise<string> | — | Dynamic token provider |
| verifySsl | boolean | true | Verify SSL certificates |
| caBundle | string | KSERVE_CA_BUNDLE env | Path to CA certificate bundle |
| temperature | number | — | Sampling temperature |
| maxTokens | number | — | Max tokens to generate |
| topP | number | — | Top-p nucleus sampling |
| stop | string[] | — | Stop sequences |
| streaming | boolean | false | Enable streaming |
| logprobs | boolean | — | Return per-token log-probabilities (OpenAI-compat only) |
| topLogprobs | number | — | Number of top logprobs per token (OpenAI-compat only) |
| parallelToolCalls | boolean | — | Allow multiple tool calls in one turn (OpenAI-compat only) |
| responseFormat | OpenAIResponseFormat | — | Response format constraint, e.g. { type: "json_object" } (OpenAI-compat only) |
| timeout | number | 120000 | Request timeout (ms) |
| maxRetries | number | 3 | Max retry attempts |
| chatTemplate | "chatml" \| "llama" \| "custom" | "chatml" | Template for V2 protocol |
| customChatTemplate | string | — | Custom template string |

Methods

| Method | Returns | Description |
|--------|---------|-------------|
| invoke(input) | Promise<AIMessage> | Single-turn generation |
| stream(input) | AsyncIterable<AIMessageChunk> | Streaming generation |
| bindTools(tools, kwargs?) | Runnable | Bind tools for function calling |
| withStructuredOutput(schema, config?) | Runnable | Structured output via function calling, JSON schema, or JSON mode |
| getModelInfo() | Promise<KServeModelInfo> | Fetch model metadata from endpoint |

KServeLLM

Extends BaseLLM. Use for base/completion models. Same options as ChatKServe, minus chatTemplate and customChatTemplate.

KServeModelInfo

interface KServeModelInfo {
  modelName: string;
  modelVersion?: string;
  platform?: string;
  inputs?: Array<Record<string, unknown>>;
  outputs?: Array<Record<string, unknown>>;
  raw: Record<string, unknown>;
}

Error classes

import {
  KServeError,                // base class
  KServeConnectionError,      // network/DNS failures
  KServeAuthenticationError,  // 401/403
  KServeModelNotFoundError,   // 404, model not loaded
  KServeInferenceError,       // 4xx/5xx during inference, or V2 unsupported feature
  KServeTimeoutError,         // timeout exceeded
} from "@bitkaio/langchain-kserve";

KServeEmbeddings

const embeddings = new KServeEmbeddings({
  baseUrl: "...",           // or KSERVE_EMBEDDINGS_BASE_URL / KSERVE_BASE_URL env
  modelName: "...",         // or KSERVE_EMBEDDINGS_MODEL_NAME env
  apiKey: "...",            // optional, or KSERVE_API_KEY env
  dimensions: 512,          // optional, for Matryoshka models
  encodingFormat: "float",  // "float" (default) or "base64"
  chunkSize: 1000,          // max texts per API call
  timeout: 120_000,         // ms
  maxRetries: 3,
});

| Method | Returns | Description |
|--------|---------|-------------|
| embedDocuments(docs) | Promise<number[][]> | Embed a list of documents |
| embedQuery(doc) | Promise<number[]> | Embed a single query |

Protocol Capability Matrix

| Feature | OpenAI-compat | V2 |
|---|:---:|:---:|
| Text generation | ✅ | ✅ |
| Streaming | ✅ | ✅ |
| Tool calling | ✅ | ❌ |
| Vision / multimodal | ✅ | ❌ |
| Logprobs | ✅ | ❌ |
| Token usage tracking | ✅ | ❌ |
| Finish reason | ✅ | partial |
| JSON mode / responseFormat | ✅ | ❌ |
| Structured output | ✅ | ❌ |
| Embeddings | ✅ | — |

Attempting tool calling, vision, or responseFormat with V2 throws KServeInferenceError immediately (before any HTTP call) with a clear message.

Protocol auto-detection

When protocol: "auto" (the default), the client probes GET /v1/models:

  • 200 response → OpenAI-compatible protocol (/v1/chat/completions)
  • Non-200 / connection error → V2 Inference Protocol (/v2/models/{model}/infer)

The detected protocol is cached for the lifetime of the model instance. Pin a protocol explicitly to skip detection:

const llm = new ChatKServe({
  baseUrl: "...",
  modelName: "...",
  protocol: "openai", // or "v2"
});

Development

# Install dependencies
pnpm install

# Build
pnpm build

# Run unit tests
pnpm test

# Run with watch mode
pnpm test:watch

# Type check
pnpm typecheck

# Integration tests (requires live KServe endpoint)
KSERVE_BASE_URL=https://... KSERVE_MODEL_NAME=... pnpm test:integration

License

MIT