🚀 LLM Middleware
A comprehensive TypeScript middleware library for building robust multi-provider LLM backends. Currently supports Ollama, Anthropic Claude, Google Gemini (Direct API & Vertex AI), and Requesty.AI (300+ models). Features EU data residency via Vertex AI, reasoning control, advanced JSON cleaning, logging, error handling, cost tracking, and more.
- ✨ Features
- 🚀 Quick Start
- 📋 Prerequisites
- ⚙️ Configuration
- 🏗️ Architecture
- 📖 Documentation
- 🧪 Testing and Examples
- 🔧 Advanced Features
- 🤝 Contributing
- 📄 License
- 🙏 Acknowledgments
- 🔗 Links
✨ Features
- 🏗️ Clean Architecture: Base classes and interfaces for scalable AI applications
  - ✨ v2.11.0: Dynamic system messages via `getSystemMessage(request)` override
- 🤖 Multi-Provider Architecture: Extensible provider system with strategy pattern
- ✅ Ollama: Fully supported with comprehensive parameter control
- ✅ Anthropic Claude: Complete support for Claude models (Opus, Sonnet, Haiku)
- ✅ Google Gemini Direct: Complete support for Gemini models via API Key
- ✅ Google Vertex AI: CDPA/GDPR-compliant with EU data residency (Service Account auth)
- ✅ Requesty.AI: 300+ models via unified API, built-in cost tracking
- 🔌 Pluggable: Easy to add custom providers - see LLM Providers Guide
- 🧠 Reasoning Control: Control model thinking effort via the `reasoningEffort` parameter
  - ✨ v2.14.0: Supports Gemini 2.5 (`thinkingBudget`) and Gemini 3 (`thinkingLevel`)
  - 📊 Track reasoning tokens separately for cost analysis
- 🧹 JSON Cleaning: Recipe-based JSON repair system with automatic strategy selection
  - ✨ v2.4.0: Enhanced array extraction support - properly handles JSON arrays `[...]` in addition to objects `{...}`
- 🎨 FlatFormatter System: Advanced data formatting for LLM consumption
- 📊 Comprehensive Logging: Multi-level logging with metadata support
- ⚙️ Configuration Management: Flexible model and application configuration
- 🛡️ Error Handling: Robust error handling and recovery mechanisms (see the sketch after this list)
- 🔧 TypeScript First: Full type safety throughout the entire stack
- 📦 Modular Design: Use only what you need
- 🧪 Testing Ready: Includes example implementations and test utilities
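The error-handling item above can be illustrated with a small, hedged sketch: it wraps the llmService.call API from the Quick Start in a try/catch with a single retry. The retry policy and the logged error shape are illustrative assumptions, not the library's documented recovery behavior, which runs internally.
import { llmService, LLMProvider } from '@loonylabs/llm-middleware';
// Application-level guard around a documented call: retry once, then let the error propagate.
async function callWithRetry(prompt: string) {
  const options = { provider: LLMProvider.OLLAMA, model: 'llama2', temperature: 0.7 };
  try {
    return await llmService.call(prompt, options);
  } catch (error) {
    console.warn('LLM call failed, retrying once', error);
    return llmService.call(prompt, options); // a second failure is thrown to the caller
  }
}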
🚀 Quick Start
Installation
Install from npm:
npm install @loonylabs/llm-middleware
Or install directly from GitHub:
npm install github:loonylabs-dev/llm-middleware
Or using a specific version/tag:
npm install github:loonylabs-dev/llm-middleware#v1.3.0
Basic Usage
import { BaseAIUseCase, BaseAIRequest, BaseAIResult, LLMProvider } from '@loonylabs/llm-middleware';
// Define your request/response interfaces
interface MyRequest extends BaseAIRequest<string> {
message: string;
}
interface MyResult extends BaseAIResult {
response: string;
}
// Create your use case (uses Ollama by default)
class MyChatUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
protected readonly systemMessage = "You are a helpful assistant.";
// Required: return user message template function
protected getUserTemplate(): (formattedPrompt: string) => string {
return (message) => message;
}
protected formatUserMessage(prompt: any): string {
return typeof prompt === 'string' ? prompt : prompt.message;
}
protected createResult(content: string, usedPrompt: string, thinking?: string): MyResult {
return {
generatedContent: content,
model: this.modelConfig.name,
usedPrompt: usedPrompt,
thinking: thinking,
response: content
};
}
}
// Switch to different provider (optional)
class MyAnthropicChatUseCase extends MyChatUseCase {
protected getProvider(): LLMProvider {
return LLMProvider.ANTHROPIC; // Use Claude instead of Ollama
}
}
// Dynamic system message based on request data (v2.11.0+)
class DynamicSystemMessageUseCase extends BaseAIUseCase<MyPrompt, MyRequest, MyResult> {
protected readonly systemMessage = "Default system message";
// Override to customize system message per-request
protected getSystemMessage(request?: MyRequest): string {
const context = request?.prompt?.context;
if (context === 'technical') {
return "You are a technical expert. Be precise and detailed.";
}
return this.systemMessage;
}
// ... other methods
}
import { llmService, LLMProvider, ollamaProvider, anthropicProvider, geminiProvider } from '@loonylabs/llm-middleware';
// Option 1: Use the LLM Service orchestrator (recommended for flexibility)
const response1 = await llmService.call(
"Write a haiku about coding",
{
provider: LLMProvider.OLLAMA, // Explicitly specify provider
model: "llama2",
temperature: 0.7
}
);
// Use Anthropic Claude
const response2 = await llmService.call(
"Explain quantum computing",
{
provider: LLMProvider.ANTHROPIC,
model: "claude-3-5-sonnet-20241022",
authToken: process.env.ANTHROPIC_API_KEY,
maxTokens: 1024,
temperature: 0.7
}
);
// Use Google Gemini (Direct API)
const response3 = await llmService.call(
"What is machine learning?",
{
provider: LLMProvider.GOOGLE,
model: "gemini-1.5-pro",
authToken: process.env.GEMINI_API_KEY,
maxTokens: 1024,
temperature: 0.7
}
);
// Use Google Vertex AI (EU data residency, CDPA/GDPR compliant)
const response3b = await llmService.call(
"Explain GDPR compliance",
{
provider: LLMProvider.VERTEX_AI,
model: "gemini-2.5-flash",
// Uses Service Account auth (GOOGLE_APPLICATION_CREDENTIALS)
// Region defaults to europe-west3 (Frankfurt)
reasoningEffort: 'medium' // Control thinking effort
}
);
// Option 2: Use provider directly for provider-specific features
const response4 = await ollamaProvider.callWithSystemMessage(
"Write a haiku about coding",
"You are a creative poet",
{
model: "llama2",
temperature: 0.7,
// Ollama-specific parameters
repeat_penalty: 1.1,
top_k: 40
}
);
// Or use Anthropic provider directly
const response5 = await anthropicProvider.call(
"Write a haiku about coding",
{
model: "claude-3-5-sonnet-20241022",
authToken: process.env.ANTHROPIC_API_KEY,
maxTokens: 1024
}
);
// Or use Gemini provider directly
const response6 = await geminiProvider.call(
"Write a haiku about coding",
{
model: "gemini-1.5-pro",
authToken: process.env.GEMINI_API_KEY,
maxOutputTokens: 1024
}
);
// Set default provider for your application
llmService.setDefaultProvider(LLMProvider.OLLAMA);
// Now calls use Ollama by default
const response7 = await llmService.call("Hello!", { model: "llama2" });
For more details on the multi-provider system, see the LLM Providers Guide.
import {
FlatFormatter,
personPreset
} from '@loonylabs/llm-middleware';
class ProfileGeneratorUseCase extends BaseAIUseCase {
protected readonly systemMessage = `You are a professional profile creator.
IMPORTANT: Respond with ONLY valid JSON following this schema:
{
"name": "Person name",
"title": "Professional title",
"summary": "Brief professional overview",
"skills": "Key skills and expertise",
"achievements": "Notable accomplishments"
}`;
// Use FlatFormatter and presets for rich context building
protected formatUserMessage(prompt: any): string {
const { person, preferences, guidelines } = prompt;
const contextSections = [
// Use preset for structured data
personPreset.formatForLLM(person, "## PERSON INFO:"),
// Use FlatFormatter for custom structures
`## PREFERENCES:\n${FlatFormatter.flatten(preferences, {
format: 'bulleted',
keyValueSeparator: ': '
})}`,
// Format guidelines with FlatFormatter
`## GUIDELINES:\n${FlatFormatter.flatten(
guidelines.map(g => ({
guideline: g,
priority: "MUST FOLLOW"
})),
{
format: 'numbered',
entryTitleKey: 'guideline',
ignoredKeys: ['guideline']
}
)}`
];
return contextSections.join('\n\n');
}
protected createResult(content: string, usedPrompt: string, thinking?: string): MyResult {
return {
generatedContent: content,
model: this.modelConfig.name,
usedPrompt,
thinking,
profile: JSON.parse(content)
};
}
}
// Use it
const profileGen = new ProfileGeneratorUseCase();
const result = await profileGen.execute({
prompt: {
person: { name: "Alice", occupation: "Engineer" },
preferences: { tone: "professional", length: "concise" },
guidelines: ["Highlight technical skills", "Include leadership"]
},
authToken: "optional-token"
});
📋 Prerequisites
- Node.js 18+
- TypeScript 4.9+
- An LLM provider configured (e.g., a running Ollama server for the Ollama provider)
⚙️ Configuration
Create a .env file in your project root:
# Server Configuration
PORT=3000
NODE_ENV=development
# Logging
LOG_LEVEL=info
# LLM Provider Configuration
MODEL1_NAME=phi3:mini # Required: Your model name
MODEL1_URL=http://localhost:11434 # Optional: Defaults to localhost (Ollama)
MODEL1_TOKEN=optional-auth-token # Optional: For authenticated providers
# Anthropic API Configuration (Optional)
ANTHROPIC_API_KEY=your_anthropic_api_key_here # Your Anthropic API key
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022 # Default Claude model
# Google Gemini Direct API Configuration (Optional)
GEMINI_API_KEY=your_gemini_api_key_here # Your Google Gemini API key
GEMINI_MODEL=gemini-1.5-pro # Default Gemini model
# Google Vertex AI Configuration (Optional - for CDPA/GDPR compliance)
GOOGLE_CLOUD_PROJECT=your_project_id # Google Cloud Project ID
VERTEX_AI_REGION=europe-west3 # EU region (Frankfurt)
VERTEX_AI_MODEL=gemini-2.5-flash # Default Vertex AI model
GOOGLE_APPLICATION_CREDENTIALS=./vertex-ai-service-account.json # Service Account
Multi-Provider Support: The middleware is fully integrated with Ollama, Anthropic Claude, Google Gemini (Direct API & Vertex AI), and Requesty.AI. See the LLM Providers Guide for details on the provider system and how to use or add providers.
🏗️ Architecture
The middleware follows Clean Architecture principles:
src/
├── middleware/
│ ├── controllers/base/ # Base HTTP controllers
│ ├── usecases/base/ # Base AI use cases
│ ├── services/ # External service integrations
│ │ ├── llm/ # LLM provider services (Ollama, OpenAI, etc.)
│ │ ├── json-cleaner/ # JSON repair and validation
│ │ └── response-processor/ # AI response processing
│ └── shared/ # Common utilities and types
│ ├── config/ # Configuration management
│ ├── types/ # TypeScript interfaces
│ └── utils/ # Utility functions
└── examples/ # Example implementations
└── simple-chat/ # Basic chat example
📖 Documentation
- Getting Started Guide
- Architecture Overview
- LLM Providers Guide - Multi-provider architecture and how to use different LLM services
- Reasoning Control Guide - Control model thinking with the `reasoningEffort` parameter
- LLM Provider Parameters - Ollama-specific parameter reference and presets
- Request Formatting Guide - FlatFormatter vs RequestFormatterService
- Performance Monitoring - Metrics and logging
- API Reference
- Examples
- CHANGELOG - Release notes and breaking changes
🧪 Testing
The middleware includes comprehensive test suites covering unit tests, integration tests, robustness tests, and end-to-end workflows.
Quick Start
# Build the middleware first
npm run build
# Run all automated tests
npm run test:all
# Run unit tests only
npm run test:unit
📖 For complete testing documentation, see tests/README.md
The test documentation includes:
- 📋 Quick reference table for all tests
- 🚀 Detailed test descriptions and prerequisites
- ⚠️ Troubleshooting guide
- 🔬 Development workflow best practices
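The exported utilities can also be exercised in a consuming project's own unit tests. Below is a minimal sketch, assuming Jest as the test runner and asserting only loosely on the formatter's output, since the exact layout depends on the chosen options:
import { FlatFormatter } from '@loonylabs/llm-middleware';
describe('FlatFormatter', () => {
  it('flattens simple objects for LLM consumption', () => {
    const flat = FlatFormatter.flatten({ name: 'Alice', age: 30 });
    // Assert only that the input values survive flattening; separators and
    // bullets are left to the formatter defaults.
    expect(flat).toContain('Alice');
    expect(flat).toContain('30');
  });
  it('accepts formatting options such as bulleted output', () => {
    const flat = FlatFormatter.flatten(
      { tone: 'professional', length: 'concise' },
      { format: 'bulleted', keyValueSeparator: ': ' }
    );
    expect(flat).toContain('professional');
  });
});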
🐦 Tweet Generator Example
The Tweet Generator example showcases parameter configuration for controlling output length:
import { TweetGeneratorUseCase } from '@loonylabs/llm-middleware';
const tweetGenerator = new TweetGeneratorUseCase();
const result = await tweetGenerator.execute({
prompt: 'The importance of clean code in software development'
});
console.log(result.tweet); // Generated tweet
console.log(result.characterCount); // Character count
console.log(result.withinLimit); // true if ≤ 280 chars
Key Features:
- 🎯 Token Limiting: Uses `maxTokens: 70` to limit output to ~280 characters (provider-agnostic!)
- 📊 Character Validation: Automatically checks if output is within Twitter's limit
- 🎨 Marketing Preset: Optimized parameters for engaging, concise content
- ✅ Testable: Integration test verifies parameter effectiveness
Parameter Configuration:
protected getParameterOverrides(): ModelParameterOverrides {
return {
// ✅ NEW in v2.7.0: Provider-agnostic maxTokens (recommended)
maxTokens: 70, // Works for Anthropic, OpenAI, Ollama, Google
// Parameter tuning
temperatureOverride: 0.7,
repeatPenalty: 1.3,
frequencyPenalty: 0.3,
presencePenalty: 0.2,
topP: 0.9,
topK: 50,
repeatLastN: 32
};
}
// 💡 Legacy Ollama-specific approach (still works):
protected getParameterOverrides(): ModelParameterOverrides {
return {
num_predict: 70, // Ollama-specific (deprecated)
// ... other params
};
};
This example demonstrates:
- How to configure parameters for specific output requirements
- Token limiting as a practical use case
- Validation and testing of parameter effectiveness
- Real-world application (social media content generation)
See src/examples/tweet-generator/ for full implementation.
🎯 Example Application
Run the included examples:
# Clone the repository
git clone https://github.com/loonylabs-dev/llm-middleware.git
cd llm-middleware
# Install dependencies
npm install
# Copy environment template
cp .env.example .env
# Start your LLM provider (example for Ollama)
ollama serve
# Run the example
npm run dev
Test the API:
curl -X POST http://localhost:3000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Hello, how are you?"}'🔧 Advanced Features
Advanced JSON repair with automatic strategy selection and modular operations:
import { JsonCleanerService, JsonCleanerFactory } from '@loonylabs/llm-middleware';
// Simple usage (async - uses new recipe system with fallback)
const result = await JsonCleanerService.processResponseAsync(malformedJson);
console.log(result.cleanedJson);
// Legacy sync method (still works)
const cleaned = JsonCleanerService.processResponse(malformedJson);
// Advanced: Quick clean with automatic recipe selection
const result = await JsonCleanerFactory.quickClean(malformedJson);
console.log('Success:', result.success);
console.log('Confidence:', result.confidence);
console.log('Changes:', result.totalChanges);
Features:
- 🎯 Automatic strategy selection (Conservative/Aggressive/Adaptive)
- 🔧 Modular detectors & fixers for specific problems
- ✨ Extracts JSON from Markdown/Think-Tags (see the sketch after this list)
- 🔄 Checkpoint/Rollback support for safe repairs
- 📊 Detailed metrics (confidence, quality, performance)
- 🛡️ Fallback to legacy system for compatibility
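To make the Markdown/think-tag extraction above concrete, here is a small sketch built on the processResponseAsync call shown earlier; the raw response string is a made-up example of typical LLM output:
import { JsonCleanerService } from '@loonylabs/llm-middleware';
// Made-up raw response: reasoning in <think> tags plus a markdown-fenced
// JSON object with a trailing comma.
const rawResponse = [
  '<think>The user wants a JSON profile.</think>',
  '```json',
  '{ "name": "Alice", "title": "Engineer", }',
  '```'
].join('\n');
const result = await JsonCleanerService.processResponseAsync(rawResponse);
console.log(result.cleanedJson);                  // Extracted and repaired JSON string
console.log(JSON.parse(result.cleanedJson).name); // "Alice"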
Available Templates:
import { RecipeTemplates } from '@loonylabs/llm-middleware';
const conservativeRecipe = RecipeTemplates.conservative();
const aggressiveRecipe = RecipeTemplates.aggressive();
const adaptiveRecipe = RecipeTemplates.adaptive();
See Recipe System Documentation for details.
For simple data: Use FlatFormatter
const flat = FlatFormatter.flatten({ name: 'Alice', age: 30 });
For complex nested prompts: Use RequestFormatterService
import { RequestFormatterService } from '@loonylabs/llm-middleware';
const prompt = {
context: { genre: 'sci-fi', tone: 'dark' },
instruction: 'Write an opening'
};
const formatted = RequestFormatterService.formatUserMessage(
prompt, (s) => s, 'MyUseCase'
);
// Outputs: ## CONTEXT:\ngenre: sci-fi\ntone: dark\n\n## INSTRUCTION:\nWrite an opening
See Request Formatting Guide for details.
Automatic performance tracking with UseCaseMetricsLoggerService:
// Automatically logged for all use cases:
// - Execution time
// - Token usage (input/output)
// - Generation speed (tokens/sec)
// - Parameters used
Metrics appear in console logs:
✅ Completed AI use case [MyUseCase = phi3:mini] SUCCESS
Time: 2.5s | Input: 120 tokens | Output: 85 tokens | Speed: 34.0 tokens/sec
See Performance Monitoring Guide for advanced usage.
Multi-level logging with contextual metadata:
import { logger } from '@loonylabs/llm-middleware';
logger.info('Operation completed', {
context: 'MyService',
metadata: { userId: 123, duration: 150 }
});
Flexible model management:
import { getModelConfig } from '@loonylabs/llm-middleware';
// MODEL1_NAME is required in .env or will throw error
const config = getModelConfig('MODEL1');
console.log(config.name); // Value from MODEL1_NAME env variable
console.log(config.baseUrl); // Value from MODEL1_URL or default localhost
Override the model configuration provider to use your own custom model configurations:
Use Cases:
- Multi-environment deployments (dev, staging, production)
- Dynamic model selection based on runtime conditions
- Loading model configs from external sources (database, API)
- Testing with different model configurations
New Pattern (Recommended):
import { BaseAIUseCase, ModelConfigKey, ValidatedLLMModelConfig } from '@loonylabs/llm-middleware';
// Define your custom model configurations
const MY_CUSTOM_MODELS: Record<string, ValidatedLLMModelConfig> = {
'PRODUCTION_MODEL': {
name: 'llama3.2:latest',
baseUrl: 'http://production-server.com:11434',
temperature: 0.7
},
'DEVELOPMENT_MODEL': {
name: 'llama3.2:latest',
baseUrl: 'http://localhost:11434',
temperature: 0.9
}
};
class MyCustomUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
// Override this method to provide custom model configurations
protected getModelConfigProvider(key: ModelConfigKey): ValidatedLLMModelConfig {
const config = MY_CUSTOM_MODELS[key];
if (!config?.name) {
throw new Error(`Model ${key} not found`);
}
return config;
}
// ... rest of your use case implementation
}
Environment-Aware Example:
class EnvironmentAwareUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
protected getModelConfigProvider(key: ModelConfigKey): ValidatedLLMModelConfig {
const env = process.env.NODE_ENV || 'development';
// Automatically select model based on environment
const modelKey = env === 'production' ? 'PRODUCTION_MODEL' :
env === 'staging' ? 'STAGING_MODEL' :
'DEVELOPMENT_MODEL';
return MY_CUSTOM_MODELS[modelKey];
}
}
Old Pattern (Still Supported):
// Legacy approach - still works but not recommended
class LegacyUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
protected get modelConfig(): ValidatedLLMModelConfig {
return myCustomGetModelConfig(this.modelConfigKey);
}
}
See the Custom Config Example for a complete working implementation.
LLM-middleware provides fine-grained control over model parameters to optimize output for different use cases:
import { BaseAIUseCase, ModelParameterOverrides } from '@loonylabs/llm-middleware';
class MyUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
protected getParameterOverrides(): ModelParameterOverrides {
return {
temperatureOverride: 0.8, // Control creativity vs. determinism
repeatPenalty: 1.3, // Reduce word repetition
frequencyPenalty: 0.2, // Penalize frequent words
presencePenalty: 0.2, // Encourage topic diversity
topP: 0.92, // Nucleus sampling threshold
topK: 60, // Vocabulary selection limit
repeatLastN: 128 // Context window for repetition
};
}
}
Parameter Levels:
- Global defaults: Set in `ModelParameterManagerService`
- Use-case level: Override via the `getParameterOverrides()` method
- Request level: Pass parameters directly in requests (see the sketch below)
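As a sketch of the request level, parameters can be passed with an individual call via the llmService options shown in the Quick Start; how they interact with use-case overrides is detailed in the Provider Parameters Guide:
import { llmService, LLMProvider } from '@loonylabs/llm-middleware';
// Request-level parameters: apply only to this single call.
const response = await llmService.call(
  "Summarize this paragraph in one sentence.",
  {
    provider: LLMProvider.ANTHROPIC,
    model: "claude-3-5-sonnet-20241022",
    authToken: process.env.ANTHROPIC_API_KEY,
    maxTokens: 120,    // cap output for this request only
    temperature: 0.2   // lower than the use-case default for determinism
  }
);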
Available Presets:
import { ModelParameterManagerService } from '@loonylabs/llm-middleware';
// Use curated presets for common use cases
const creativeParams = ModelParameterManagerService.getDefaultParametersForType('creative_writing');
const factualParams = ModelParameterManagerService.getDefaultParametersForType('factual');
const poeticParams = ModelParameterManagerService.getDefaultParametersForType('poetic');
const dialogueParams = ModelParameterManagerService.getDefaultParametersForType('dialogue');
const technicalParams = ModelParameterManagerService.getDefaultParametersForType('technical');
const marketingParams = ModelParameterManagerService.getDefaultParametersForType('marketing');
Presets Include:
- 📚 Creative Writing: Novels, stories, narrative fiction
- 📊 Factual: Reports, documentation, journalism
- 🎭 Poetic: Poetry, lyrics, artistic expression
- 💬 Dialogue: Character dialogue, conversational content
- 🔧 Technical: Code documentation, API references
- 📢 Marketing: Advertisements, promotional content
For detailed documentation about all parameters, value ranges, and preset configurations, see: Provider Parameters Guide (Ollama-specific)
🔧 Response Processing Options (v2.8.0)
Starting in v2.8.0, you can customize how responses are processed with ResponseProcessingOptions:
Available Options
interface ResponseProcessingOptions {
extractThinkTags?: boolean; // default: true
extractMarkdown?: boolean; // default: true
validateJson?: boolean; // default: true
cleanJson?: boolean; // default: true
recipeMode?: 'conservative' | 'aggressive' | 'adaptive';
}
Usage in Use Cases
Override getResponseProcessingOptions() to customize processing:
// Plain text response (compression, summarization)
class CompressEntityUseCase extends BaseAIUseCase {
protected getResponseProcessingOptions(): ResponseProcessingOptions {
return {
extractThinkTags: true, // YES: Extract <think> tags
extractMarkdown: true, // YES: Extract markdown blocks
validateJson: false, // NO: Skip JSON validation
cleanJson: false // NO: Skip JSON cleaning
};
}
}
// Keep think tags in content
class DebugUseCase extends BaseAIUseCase {
protected getResponseProcessingOptions(): ResponseProcessingOptions {
return {
extractThinkTags: false // Keep <think> tags visible
};
}
}
// Conservative JSON cleaning
class StrictJsonUseCase extends BaseAIUseCase {
protected getResponseProcessingOptions(): ResponseProcessingOptions {
return {
recipeMode: 'conservative' // Minimal JSON fixes
};
}
}
Direct Service Usage
You can also use ResponseProcessorService directly:
import { ResponseProcessorService, ResponseProcessingOptions } from '@loonylabs/llm-middleware';
// Plain text (no JSON processing)
const result = await ResponseProcessorService.processResponseAsync(response, {
validateJson: false,
cleanJson: false
});
// Extract markdown but skip JSON
const result2 = await ResponseProcessorService.processResponseAsync(response, {
extractMarkdown: true,
validateJson: false
});
Use Cases
- ✅ Plain text responses: Compression, summarization, text generation
- ✅ Pre-validated JSON: Skip redundant validation
- ✅ Debug/analysis: Keep think tags in content
- ✅ Performance: Skip unnecessary processing steps
- ✅ Custom workflows: Mix and match extraction features
Backward Compatibility
All options are optional with sensible defaults. Existing code works without changes:
// Still works exactly as before
const result = await ResponseProcessorService.processResponseAsync(response);
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Ollama for the amazing local LLM platform
- The open-source community for inspiration and contributions
🔗 Links
Made with ❤️ for the AI community
