universal-llm-client
v4.0.0
Published
A universal LLM client with transparent provider failover, streaming tool execution, pluggable reasoning, and native observability.
Downloads
325
Maintainers
Readme
universal-llm-client
A universal LLM client for JavaScript/TypeScript with transparent provider failover, streaming tool execution, pluggable reasoning strategies, and native observability.
import { AIModel } from 'universal-llm-client';
const model = new AIModel({
model: 'gemini-2.5-flash',
providers: [
{ type: 'google', apiKey: process.env.GOOGLE_API_KEY },
{ type: 'openai', url: 'https://openrouter.ai/api', apiKey: process.env.OPENROUTER_KEY },
{ type: 'ollama' },
],
});
const response = await model.chat([
{ role: 'user', content: 'Hello!' },
]);One model, multiple backends. If Google fails, it transparently fails over to OpenRouter, then to local Ollama. Your code never knows the difference.
Features
- 🔄 Transparent Failover — Priority-ordered provider chain with retries, health tracking, and cooldowns
- 🛠️ Tool Calling — Register tools once, works across all providers. Autonomous multi-turn execution loop
- 📋 Structured Output — Zod schema validation, JSON Schema support, streaming, and type-safe responses
- 🌊 Streaming — First-class async generator streaming with pluggable decoder strategies
- 🧠 Reasoning — Native
<think>tag parsing, interleaved reasoning, and model thinking support - 🔍 Observability — Built-in auditor interface for logging, cost tracking, and behavioral analysis
- 🌐 Universal Runtime — Node.js 22+, Bun, Deno, and modern browsers
- 🤖 MCP Native — Bridge MCP servers to LLM tools with zero glue code
- 📊 Embeddings — Single and batch embedding generation
Supported Providers
| Provider | Type | Notes |
|---|---|---|
| Ollama | ollama | Local or cloud models, NDJSON streaming, model pulling, vision/multimodal |
| OpenAI | openai | GPT-4o, o3, etc. Also works with OpenRouter, Groq, LM Studio, vLLM |
| Google AI Studio | google | Gemini models, system instructions, multimodal |
| Vertex AI | vertex | Same as Google AI but with regional endpoints and Bearer tokens |
| LlamaCpp | llamacpp | Local llama.cpp / llama-server instances |
Installation
bun add universal-llm-client
# or
npm install universal-llm-clientOptional: For MCP integration:
bun add @modelcontextprotocol/sdkQuick Start
Basic Chat
import { AIModel } from 'universal-llm-client';
const model = new AIModel({
model: 'qwen3:4b',
providers: [{ type: 'ollama' }],
});
const response = await model.chat([
{ role: 'system', content: 'You are a helpful assistant.' },
{ role: 'user', content: 'What is the capital of France?' },
]);
console.log(response.message.content);
// "The capital of France is Paris."Streaming
for await (const event of model.chatStream([
{ role: 'user', content: 'Write a haiku about code.' },
])) {
if (event.type === 'text') {
process.stdout.write(event.content);
} else if (event.type === 'thinking') {
// Model reasoning (when supported)
console.log('[thinking]', event.content);
}
}Tool Calling
model.registerTool(
'get_weather',
'Get current weather for a location',
{
type: 'object',
properties: {
city: { type: 'string', description: 'City name' },
},
required: ['city'],
},
async (args) => {
const { city } = args as { city: string };
return { temperature: 22, condition: 'sunny', city };
},
);
// Autonomous tool execution — the model calls tools and loops until done
const response = await model.chatWithTools([
{ role: 'user', content: "What's the weather in Tokyo?" },
]);
console.log(response.message.content);
// "The weather in Tokyo is 22°C and sunny."
console.log(response.toolTrace);
// [{ name: 'get_weather', args: { city: 'Tokyo' }, result: {...}, duration: 5 }]Provider Failover
const model = new AIModel({
model: 'gemini-2.5-flash',
retries: 2, // retries per provider before failover
timeout: 30000, // request timeout in ms
providers: [
{ type: 'google', apiKey: process.env.GOOGLE_KEY, priority: 0 },
{ type: 'openai', url: 'https://openrouter.ai/api', apiKey: process.env.OPENROUTER_KEY, priority: 1 },
{ type: 'ollama', url: 'http://localhost:11434', priority: 2 },
],
});
// If Google returns 500, retries twice, then seamlessly tries OpenRouter.
// If OpenRouter also fails, falls back to local Ollama.
// Your code sees a single response.
const response = await model.chat([{ role: 'user', content: 'Hello' }]);
// Check provider health at any time
console.log(model.getProviderStatus());
// [{ id: 'google-0', healthy: true }, { id: 'openai-1', healthy: true }, ...]Multimodal (Vision)
import { AIModel, multimodalMessage } from 'universal-llm-client';
const model = new AIModel({
model: 'gemini-2.5-flash',
providers: [{ type: 'google', apiKey: process.env.GOOGLE_KEY }],
});
const response = await model.chat([
multimodalMessage('What do you see in this image?', [
'https://example.com/photo.jpg',
]),
]);Embeddings
const embedModel = new AIModel({
model: 'nomic-embed-text-v2-moe:latest',
providers: [{ type: 'ollama' }],
});
const vector = await embedModel.embed('Hello world');
// [0.006, 0.026, -0.009, ...]
const vectors = await embedModel.embedArray(['Hello', 'World']);
// [[0.006, ...], [0.012, ...]]Structured Output
Get typed, validated JSON responses from any LLM using Zod schemas:
import { AIModel } from 'universal-llm-client';
import { z } from 'zod';
const model = new AIModel({
model: 'gemini-2.5-flash',
providers: [
{ type: 'google', apiKey: process.env.GOOGLE_API_KEY },
{ type: 'ollama' },
],
});
// Define your schema
const UserSchema = z.object({
name: z.string(),
age: z.number(),
email: z.string().email(),
interests: z.array(z.string()),
});
// Method 1: generateStructured (throws on validation failure)
const user = await model.generateStructured(UserSchema, [
{ role: 'user', content: 'Generate a user profile for a software developer' },
]);
console.log(user.name); // TypeScript knows this is string
console.log(user.age); // TypeScript knows this is number
console.log(user.email); // TypeScript knows this is string
console.log(user.interests); // TypeScript knows this is string[]Non-throwing variant:
// Method 2: tryParseStructured (returns result object, never throws)
const result = await model.tryParseStructured(UserSchema, messages);
if (result.ok) {
console.log('User:', result.value.name);
} else {
console.log('Error:', result.error.message);
console.log('Raw LLM output:', result.rawOutput);
}Via chat options:
// Method 3: chat with output parameter
const response = await model.chat(messages, {
output: { schema: UserSchema },
});
// response.structured is typed as { name: string, age: number, ... }
if (response.structured) {
console.log(response.structured.name);
}Streaming structured output:
// Stream partial validated objects as JSON generates
for await (const partial of model.generateStructuredStream(UserSchema, messages)) {
console.log('Partial:', partial);
// Partial: { name: 'Alice' }
// Partial: { name: 'Alice', age: 30 }
// Partial: { name: 'Alice', age: 30, email: '[email protected]' }
}Raw JSON Schema (without Zod):
const response = await model.chat(messages, {
jsonSchema: {
type: 'object',
properties: {
name: { type: 'string' },
age: { type: 'number' },
},
required: ['name', 'age'],
},
name: 'Person', // Optional, used for LLM guidance
});Separate module import (tree-shaking):
// Import only structured output types if you don't need the full client
import {
StructuredOutputError,
type StructuredOutputResult,
type StructuredOutputOptions,
parseStructured,
tryParseStructured,
zodToJsonSchema,
} from 'universal-llm-client/structured-output';Vision with structured output:
const ImageAnalysisSchema = z.object({
objects: z.array(z.string()),
scene: z.string(),
mood: z.string(),
});
const response = await model.generateStructured(ImageAnalysisSchema, [
multimodalMessage('Analyze this image', ['https://example.com/photo.jpg']),
]);Provider compatibility:
| Provider | Method | Notes |
|----------|--------|-------|
| OpenAI | response_format.json_schema | Strict mode enabled |
| Ollama | format: { schema } | Model must support grammar |
| Google | responseMimeType + responseSchema | Some features stripped |
Observability
import { AIModel, ConsoleAuditor, BufferedAuditor } from 'universal-llm-client';
// Simple console logging
const model = new AIModel({
model: 'qwen3:4b',
providers: [{ type: 'ollama' }],
auditor: new ConsoleAuditor('[LLM]'),
});
// [LLM] REQUEST [ollama] (qwen3:4b) →
// [LLM] RESPONSE [ollama] (qwen3:4b) 1200ms 68 tokens
// Buffered for custom sinks (OpenTelemetry, DB, etc.)
const auditor = new BufferedAuditor({
maxBufferSize: 100,
onFlush: async (events) => {
await sendToOpenTelemetry(events);
},
});MCP Integration
import { AIModel, MCPToolBridge } from 'universal-llm-client';
const model = new AIModel({
model: 'qwen3:4b',
providers: [{ type: 'ollama' }],
});
const mcp = new MCPToolBridge({
servers: {
filesystem: {
command: 'npx',
args: ['-y', '@modelcontextprotocol/server-filesystem', './'],
},
weather: {
url: 'https://mcp.example.com/weather',
},
},
});
await mcp.connect();
await mcp.registerTools(model);
// MCP tools are now callable via chatWithTools
const response = await model.chatWithTools([
{ role: 'user', content: 'List files in the current directory' },
]);
await mcp.disconnect();Stream Decoders
import { AIModel, createDecoder } from 'universal-llm-client';
// Passthrough — raw text, no parsing
// Standard Chat — text + native reasoning + tool calls
// Interleaved Reasoning — parses <think> and <progress> tags from text streams
const decoder = createDecoder('interleaved-reasoning', (event) => {
switch (event.type) {
case 'text': console.log(event.content); break;
case 'thinking': console.log('[think]', event.content); break;
case 'progress': console.log('[progress]', event.content); break;
case 'tool_call': console.log('[tool]', event.calls); break;
}
});
decoder.push('<think>Let me analyze this</think>The answer is 42');
decoder.flush();
console.log(decoder.getCleanContent()); // "The answer is 42"
console.log(decoder.getReasoning()); // "Let me analyze this"API Reference
AIModel
The universal client. One class, multiple backends.
new AIModel(config: AIModelConfig)Config:
| Property | Type | Default | Description |
|---|---|---|---|
| model | string | — | Model name (e.g., 'gemini-2.5-flash') |
| providers | ProviderConfig[] | — | Ordered list of provider backends |
| retries | number | 2 | Retries per provider before failover |
| timeout | number | 30000 | Request timeout in ms |
| auditor | Auditor | NoopAuditor | Observability sink |
| thinking | boolean | false | Enable model thinking/reasoning |
| debug | boolean | false | Debug logging |
| defaultParameters | object | — | Default parameters for all requests |
Provider Config:
| Property | Type | Description |
|---|---|---|
| type | string | 'ollama', 'openai', 'google', 'vertex', 'llamacpp' |
| url | string | Provider URL (has sensible defaults) |
| apiKey | string | API key or Bearer token |
| priority | number | Lower = tried first (defaults to array index) |
| model | string | Override model name for this provider |
| region | string | Vertex AI region (e.g., 'us-central1') |
| apiVersion | string | API version (e.g., 'v1beta') |
Methods:
| Method | Returns | Description |
|---|---|---|
| chat(messages, options?) | Promise<LLMChatResponse> | Send chat request |
| chatWithTools(messages, options?) | Promise<LLMChatResponse> | Chat with autonomous tool execution |
| chatStream(messages, options?) | AsyncGenerator<DecodedEvent> | Stream chat response |
| generateStructured(schema, messages, options?) | Promise<T> | Generate typed JSON validated against Zod schema |
| tryParseStructured(schema, messages, options?) | Promise<StructuredOutputResult<T>> | Non-throwing variant returning result object |
| generateStructuredStream(schema, messages, options?) | AsyncGenerator<T, T> | Stream partial validated objects as JSON generates |
| embed(text) | Promise<number[]> | Generate single embedding |
| embedArray(texts) | Promise<number[][]> | Generate batch embeddings |
| registerTool(name, desc, params, handler) | void | Register a callable tool |
| registerTools(tools) | void | Register multiple tools |
| getModels() | Promise<string[]> | List available models |
| getModelInfo() | Promise<ModelMetadata> | Get model metadata |
| getProviderStatus() | ProviderStatus[] | Check provider health |
| setModel(name) | void | Switch model at runtime |
| dispose() | Promise<void> | Clean shutdown |
Structured Output
import { z } from 'zod';
// Define your schema
const UserSchema = z.object({
name: z.string(),
age: z.number(),
email: z.string().email(),
});
// Generate typed JSON
const user = await model.generateStructured(UserSchema, messages);
// TypeScript infers: { name: string; age: number; email: string }
// Non-throwing variant
const result = await model.tryParseStructured(UserSchema, messages);
if (result.ok) {
console.log(result.value.name); // Fully typed
} else {
console.log(result.error.message);
}
// Stream partial objects
for await (const partial of model.generateStructuredStream(UserSchema, messages)) {
console.log(partial); // Partial validated objects
}Separate module import (tree-shaking):
import {
StructuredOutputError,
type StructuredOutputResult,
parseStructured,
tryParseStructured,
zodToJsonSchema,
} from 'universal-llm-client/structured-output';
// Use without importing the full client
const schema = z.object({ name: z.string() });
const jsonSchema = zodToJsonSchema(schema);ToolBuilder / ToolExecutor
import { ToolBuilder, ToolExecutor } from 'universal-llm-client';
// Fluent builder
const tool = new ToolBuilder('search')
.description('Search the web')
.addParameter('query', 'string', 'Search query', true)
.addParameter('limit', 'number', 'Max results', false)
.build();
// Execution wrappers
const safeHandler = ToolExecutor.compose(
myHandler,
h => ToolExecutor.withTimeout(h, 5000),
h => ToolExecutor.safe(h),
h => ToolExecutor.withValidation(h, ['query']),
);Auditor Interface
Implement custom observability by providing an Auditor:
interface Auditor {
record(event: AuditEvent): void;
flush?(): Promise<void>;
}Built-in implementations:
NoopAuditor— Zero overhead (default)ConsoleAuditor— Structured console loggingBufferedAuditor— Collects events for custom sinks
Architecture
universal-llm-client
├── AIModel ← Public API (the only class you import)
├── Router ← Internal failover engine
├── BaseLLMClient ← Abstract client with tool execution
├── Providers
│ ├── OllamaClient
│ ├── OpenAICompatibleClient (OpenAI, OpenRouter, Groq, LM Studio, vLLM, LlamaCpp)
│ └── GoogleClient (AI Studio + Vertex AI)
├── StreamDecoder ← Pluggable reasoning strategies
├── Auditor ← Observability interface
├── MCPToolBridge ← MCP server integration
└── HTTP Utilities ← Universal fetch-based transportDesign Principles
- Single import —
AIModelis the only class users need - Provider agnostic — Same code works with any backend
- Transparent failover — Health tracking and cooldowns happen behind the scenes
- Zero dependencies — Core library depends only on native
fetch - Agent-ready — Stateless, composable instances designed as foundation for agent frameworks
- Observable — Every request, response, tool call, retry, and failover is auditable
Runtime Support
| Runtime | Version | Status | |---|---|---| | Node.js | 22+ | ✅ Full support | | Bun | 1.0+ | ✅ Full support | | Deno | 2.0+ | ✅ Full support | | Browsers | Modern | ✅ No stdio MCP, HTTP transport only |
For Agent Framework Authors
AIModel is designed as the transport layer for agentic systems:
- Stateless — No conversation history stored. Your framework manages memory
- Composable — Create separate instances for chat, embeddings, vision
- Tool tracing —
chatWithTools()returns full execution trace - Context budget —
getModelInfo()exposescontextLength - Auditor as system bus — Inject custom sinks for cost tracking, behavioral scoring
- StreamDecoder as UI bridge — Select decoder strategy per-call
License
MIT
