🚀 LLM Middleware
A comprehensive TypeScript middleware library for building robust multi-provider LLM backends. Currently supports Ollama, Anthropic Claude, Google Gemini (Direct API & Vertex AI), and Requesty.AI (300+ models). Features EU data residency via Vertex AI, reasoning control, advanced JSON cleaning, logging, error handling, cost tracking, and more.
- ✨ Features
- 🚀 Quick Start
- 📋 Prerequisites
- ⚙️ Configuration
- 🏗️ Architecture
- 📖 Documentation
- 🧪 Testing and Examples
- 🔧 Advanced Features
- 🤝 Contributing
- 📄 License
- 🙏 Acknowledgments
- 🔗 Links
✨ Features
- 🏗️ Clean Architecture: Base classes and interfaces for scalable AI applications
  - ✨ v2.11.0: Dynamic system messages via `getSystemMessage(request)` override
- 🤖 Multi-Provider Architecture: Extensible provider system with strategy pattern
- ✅ Ollama: Fully supported with comprehensive parameter control
- ✅ Anthropic Claude: Complete support for Claude models (Opus, Sonnet, Haiku)
- ✅ Google Gemini Direct: Complete support for Gemini models via API Key
- ✅ Google Vertex AI: CDPA/GDPR-compliant with EU data residency (Service Account auth)
- ✅ Requesty.AI: 300+ models via unified API, built-in cost tracking
- 🔌 Pluggable: Easy to add custom providers - see LLM Providers Guide
- 🧠 Reasoning Control: Control model thinking effort via the `reasoningEffort` parameter
  - ✨ v2.14.0: Supports Gemini 2.5 (`thinkingBudget`) and Gemini 3 (`thinkingLevel`)
  - 📊 Track reasoning tokens separately for cost analysis
- 🧹 JSON Cleaning: Recipe-based JSON repair system with automatic strategy selection
  - ✨ v2.4.0: Enhanced array extraction support - properly handles JSON arrays `[...]` in addition to objects `{...}`
- 🎨 FlatFormatter System: Advanced data formatting for LLM consumption
- 📊 Comprehensive Logging: Multi-level logging with metadata support
- ⚙️ Configuration Management: Flexible model and application configuration
- 🛡️ Error Handling: Robust error handling and recovery mechanisms (see the sketch after this list)
- 🔧 TypeScript First: Full type safety throughout the entire stack
- 📦 Modular Design: Use only what you need
- 🧪 Testing Ready: Includes example implementations and test utilities
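The error-handling item above can be illustrated with a small, hedged sketch: it wraps the llmService.call API from the Quick Start in a try/catch with a single retry. The retry policy and the logged error shape are illustrative assumptions, not the library's documented recovery behavior, which runs internally.
import { llmService, LLMProvider } from '@loonylabs/llm-middleware';
// Application-level guard around a documented call: retry once, then let the error propagate.
async function callWithRetry(prompt: string) {
  const options = { provider: LLMProvider.OLLAMA, model: 'llama2', temperature: 0.7 };
  try {
    return await llmService.call(prompt, options);
  } catch (error) {
    console.warn('LLM call failed, retrying once', error);
    return llmService.call(prompt, options); // a second failure is thrown to the caller
  }
}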
🚀 Quick Start
Installation
Install from npm:
npm install @loonylabs/llm-middleware
Or install directly from GitHub:
npm install github:loonylabs-dev/llm-middleware
Or using a specific version/tag:
npm install github:loonylabs-dev/llm-middleware#v1.3.0
Basic Usage
import { BaseAIUseCase, BaseAIRequest, BaseAIResult, LLMProvider } from '@loonylabs/llm-middleware';
// Define your request/response interfaces
interface MyRequest extends BaseAIRequest<string> {
message: string;
}
interface MyResult extends BaseAIResult {
response: string;
}
// Create your use case (uses Ollama by default)
class MyChatUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
protected readonly systemMessage = "You are a helpful assistant.";
// Required: return user message template function
protected getUserTemplate(): (formattedPrompt: string) => string {
return (message) => message;
}
protected formatUserMessage(prompt: any): string {
return typeof prompt === 'string' ? prompt : prompt.message;
}
protected createResult(content: string, usedPrompt: string, thinking?: string): MyResult {
return {
generatedContent: content,
model: this.modelConfig.name,
usedPrompt: usedPrompt,
thinking: thinking,
response: content
};
}
}
// Switch to different provider (optional)
class MyAnthropicChatUseCase extends MyChatUseCase {
protected getProvider(): LLMProvider {
return LLMProvider.ANTHROPIC; // Use Claude instead of Ollama
}
}
// Dynamic system message based on request data (v2.11.0+)
class DynamicSystemMessageUseCase extends BaseAIUseCase<MyPrompt, MyRequest, MyResult> {
protected readonly systemMessage = "Default system message";
// Override to customize system message per-request
protected getSystemMessage(request?: MyRequest): string {
const context = request?.prompt?.context;
if (context === 'technical') {
return "You are a technical expert. Be precise and detailed.";
}
return this.systemMessage;
}
// ... other methods
}
import { llmService, LLMProvider, ollamaProvider, anthropicProvider, geminiProvider } from '@loonylabs/llm-middleware';
// Option 1: Use the LLM Service orchestrator (recommended for flexibility)
const response1 = await llmService.call(
"Write a haiku about coding",
{
provider: LLMProvider.OLLAMA, // Explicitly specify provider
model: "llama2",
temperature: 0.7
}
);
// Use Anthropic Claude
const response2 = await llmService.call(
"Explain quantum computing",
{
provider: LLMProvider.ANTHROPIC,
model: "claude-3-5-sonnet-20241022",
authToken: process.env.ANTHROPIC_API_KEY,
maxTokens: 1024,
temperature: 0.7
}
);
// Use Google Gemini (Direct API)
const response3 = await llmService.call(
"What is machine learning?",
{
provider: LLMProvider.GOOGLE,
model: "gemini-1.5-pro",
authToken: process.env.GEMINI_API_KEY,
maxTokens: 1024,
temperature: 0.7
}
);
// Use Google Vertex AI (EU data residency, CDPA/GDPR compliant)
const response3b = await llmService.call(
"Explain GDPR compliance",
{
provider: LLMProvider.VERTEX_AI,
model: "gemini-2.5-flash",
// Uses Service Account auth (GOOGLE_APPLICATION_CREDENTIALS)
// Region defaults to europe-west3 (Frankfurt)
reasoningEffort: 'medium' // Control thinking effort
}
);
// Option 2: Use provider directly for provider-specific features
const response4 = await ollamaProvider.callWithSystemMessage(
"Write a haiku about coding",
"You are a creative poet",
{
model: "llama2",
temperature: 0.7,
// Ollama-specific parameters
repeat_penalty: 1.1,
top_k: 40
}
);
// Or use Anthropic provider directly
const response5 = await anthropicProvider.call(
"Write a haiku about coding",
{
model: "claude-3-5-sonnet-20241022",
authToken: process.env.ANTHROPIC_API_KEY,
maxTokens: 1024
}
);
// Or use Gemini provider directly
const response6 = await geminiProvider.call(
"Write a haiku about coding",
{
model: "gemini-1.5-pro",
authToken: process.env.GEMINI_API_KEY,
maxOutputTokens: 1024
}
);
// Set default provider for your application
llmService.setDefaultProvider(LLMProvider.OLLAMA);
// Now calls use Ollama by default
const response7 = await llmService.call("Hello!", { model: "llama2" });
For more details on the multi-provider system, see the LLM Providers Guide.
import {
FlatFormatter,
personPreset
} from '@loonylabs/llm-middleware';
class ProfileGeneratorUseCase extends BaseAIUseCase {
protected readonly systemMessage = `You are a professional profile creator.
IMPORTANT: Respond with ONLY valid JSON following this schema:
{
"name": "Person name",
"title": "Professional title",
"summary": "Brief professional overview",
"skills": "Key skills and expertise",
"achievements": "Notable accomplishments"
}`;
// Use FlatFormatter and presets for rich context building
protected formatUserMessage(prompt: any): string {
const { person, preferences, guidelines } = prompt;
const contextSections = [
// Use preset for structured data
personPreset.formatForLLM(person, "## PERSON INFO:"),
// Use FlatFormatter for custom structures
`## PREFERENCES:\n${FlatFormatter.flatten(preferences, {
format: 'bulleted',
keyValueSeparator: ': '
})}`,
// Format guidelines with FlatFormatter
`## GUIDELINES:\n${FlatFormatter.flatten(
guidelines.map(g => ({
guideline: g,
priority: "MUST FOLLOW"
})),
{
format: 'numbered',
entryTitleKey: 'guideline',
ignoredKeys: ['guideline']
}
)}`
];
return contextSections.join('\n\n');
}
protected createResult(content: string, usedPrompt: string, thinking?: string): MyResult {
return {
generatedContent: content,
model: this.modelConfig.name,
usedPrompt,
thinking,
profile: JSON.parse(content)
};
}
}
// Use it
const profileGen = new ProfileGeneratorUseCase();
const result = await profileGen.execute({
prompt: {
person: { name: "Alice", occupation: "Engineer" },
preferences: { tone: "professional", length: "concise" },
guidelines: ["Highlight technical skills", "Include leadership"]
},
authToken: "optional-token"
});
📋 Prerequisites
- Node.js 18+
- TypeScript 4.9+
- An LLM provider configured (e.g., a running Ollama server for the Ollama provider)
⚙️ Configuration
Create a .env file in your project root:
# Server Configuration
PORT=3000
NODE_ENV=development
# Logging
LOG_LEVEL=info
# LLM Provider Configuration
MODEL1_NAME=phi3:mini # Required: Your model name
MODEL1_URL=http://localhost:11434 # Optional: Defaults to localhost (Ollama)
MODEL1_TOKEN=optional-auth-token # Optional: For authenticated providers
# Anthropic API Configuration (Optional)
ANTHROPIC_API_KEY=your_anthropic_api_key_here # Your Anthropic API key
ANTHROPIC_MODEL=claude-3-5-sonnet-20241022 # Default Claude model
# Google Gemini Direct API Configuration (Optional)
GEMINI_API_KEY=your_gemini_api_key_here # Your Google Gemini API key
GEMINI_MODEL=gemini-1.5-pro # Default Gemini model
# Google Vertex AI Configuration (Optional - for CDPA/GDPR compliance)
GOOGLE_CLOUD_PROJECT=your_project_id # Google Cloud Project ID
VERTEX_AI_REGION=europe-west3 # EU region (Frankfurt)
VERTEX_AI_MODEL=gemini-2.5-flash # Default Vertex AI model
GOOGLE_APPLICATION_CREDENTIALS=./vertex-ai-service-account.json # Service Account
Multi-Provider Support: The middleware is fully integrated with Ollama, Anthropic Claude, Google Gemini (Direct API & Vertex AI), and Requesty.AI. See the LLM Providers Guide for details on the provider system and how to use or add providers.
🏗️ Architecture
The middleware follows Clean Architecture principles:
src/
├── middleware/
│ ├── controllers/base/ # Base HTTP controllers
│ ├── usecases/base/ # Base AI use cases
│ ├── services/ # External service integrations
│ │ ├── llm/ # LLM provider services (Ollama, OpenAI, etc.)
│ │ ├── json-cleaner/ # JSON repair and validation
│ │ └── response-processor/ # AI response processing
│ └── shared/ # Common utilities and types
│ ├── config/ # Configuration management
│ ├── types/ # TypeScript interfaces
│ └── utils/ # Utility functions
└── examples/ # Example implementations
└── simple-chat/ # Basic chat example
📖 Documentation
- Getting Started Guide
- Architecture Overview
- LLM Providers Guide - Multi-provider architecture and how to use different LLM services
- Reasoning Control Guide - Control model thinking with the `reasoningEffort` parameter
- LLM Provider Parameters - Ollama-specific parameter reference and presets
- Request Formatting Guide - FlatFormatter vs RequestFormatterService
- Performance Monitoring - Metrics and logging
- API Reference
- Examples
- CHANGELOG - Release notes and breaking changes
🧪 Testing
The middleware includes comprehensive test suites covering unit tests, integration tests, robustness tests, and end-to-end workflows.
Quick Start
# Build the middleware first
npm run build
# Run all automated tests
npm run test:all
# Run unit tests only
npm run test:unit
📖 For complete testing documentation, see tests/README.md
The test documentation includes:
- 📋 Quick reference table for all tests
- 🚀 Detailed test descriptions and prerequisites
- ⚠️ Troubleshooting guide
- 🔬 Development workflow best practices
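The exported utilities can also be exercised in a consuming project's own unit tests. Below is a minimal sketch, assuming Jest as the test runner and asserting only loosely on the formatter's output, since the exact layout depends on the chosen options:
import { FlatFormatter } from '@loonylabs/llm-middleware';
describe('FlatFormatter', () => {
  it('flattens simple objects for LLM consumption', () => {
    const flat = FlatFormatter.flatten({ name: 'Alice', age: 30 });
    // Assert only that the input values survive flattening; separators and
    // bullets are left to the formatter defaults.
    expect(flat).toContain('Alice');
    expect(flat).toContain('30');
  });
  it('accepts formatting options such as bulleted output', () => {
    const flat = FlatFormatter.flatten(
      { tone: 'professional', length: 'concise' },
      { format: 'bulleted', keyValueSeparator: ': ' }
    );
    expect(flat).toContain('professional');
  });
});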
🐦 Tweet Generator Example
The Tweet Generator example showcases parameter configuration for controlling output length:
import { TweetGeneratorUseCase } from '@loonylabs/llm-middleware';
const tweetGenerator = new TweetGeneratorUseCase();
const result = await tweetGenerator.execute({
prompt: 'The importance of clean code in software development'
});
console.log(result.tweet); // Generated tweet
console.log(result.characterCount); // Character count
console.log(result.withinLimit); // true if ≤ 280 chars
Key Features:
- 🎯 Token Limiting: Uses `maxTokens: 70` to limit output to ~280 characters (provider-agnostic!)
- 📊 Character Validation: Automatically checks if output is within Twitter's limit
- 🎨 Marketing Preset: Optimized parameters for engaging, concise content
- ✅ Testable: Integration test verifies parameter effectiveness
Parameter Configuration:
protected getParameterOverrides(): ModelParameterOverrides {
return {
// ✅ NEW in v2.7.0: Provider-agnostic maxTokens (recommended)
maxTokens: 70, // Works for Anthropic, OpenAI, Ollama, Google
// Parameter tuning
temperatureOverride: 0.7,
repeatPenalty: 1.3,
frequencyPenalty: 0.3,
presencePenalty: 0.2,
topP: 0.9,
topK: 50,
repeatLastN: 32
};
}
// 💡 Legacy Ollama-specific approach (still works):
protected getParameterOverrides(): ModelParameterOverrides {
return {
num_predict: 70, // Ollama-specific (deprecated)
// ... other params
};
};
This example demonstrates:
- How to configure parameters for specific output requirements
- Token limiting as a practical use case
- Validation and testing of parameter effectiveness
- Real-world application (social media content generation)
See src/examples/tweet-generator/ for full implementation.
🎯 Example Application
Run the included examples:
# Clone the repository
git clone https://github.com/loonylabs-dev/llm-middleware.git
cd llm-middleware
# Install dependencies
npm install
# Copy environment template
cp .env.example .env
# Start your LLM provider (example for Ollama)
ollama serve
# Run the example
npm run dev
Test the API:
curl -X POST http://localhost:3000/api/chat \
-H "Content-Type: application/json" \
-d '{"message": "Hello, how are you?"}'🔧 Advanced Features
Advanced JSON repair with automatic strategy selection and modular operations:
import { JsonCleanerService, JsonCleanerFactory } from '@loonylabs/llm-middleware';
// Simple usage (async - uses new recipe system with fallback)
const result = await JsonCleanerService.processResponseAsync(malformedJson);
console.log(result.cleanedJson);
// Legacy sync method (still works)
const cleaned = JsonCleanerService.processResponse(malformedJson);
// Advanced: Quick clean with automatic recipe selection
const result = await JsonCleanerFactory.quickClean(malformedJson);
console.log('Success:', result.success);
console.log('Confidence:', result.confidence);
console.log('Changes:', result.totalChanges);
Features:
- 🎯 Automatic strategy selection (Conservative/Aggressive/Adaptive)
- 🔧 Modular detectors & fixers for specific problems
- ✨ Extracts JSON from Markdown/Think-Tags (see the sketch after this list)
- 🔄 Checkpoint/Rollback support for safe repairs
- 📊 Detailed metrics (confidence, quality, performance)
- 🛡️ Fallback to legacy system for compatibility
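To make the Markdown/think-tag extraction above concrete, here is a small sketch built on the processResponseAsync call shown earlier; the raw response string is a made-up example of typical LLM output:
import { JsonCleanerService } from '@loonylabs/llm-middleware';
// Made-up raw response: reasoning in <think> tags plus a markdown-fenced
// JSON object with a trailing comma.
const rawResponse = [
  '<think>The user wants a JSON profile.</think>',
  '```json',
  '{ "name": "Alice", "title": "Engineer", }',
  '```'
].join('\n');
const result = await JsonCleanerService.processResponseAsync(rawResponse);
console.log(result.cleanedJson);                  // Extracted and repaired JSON string
console.log(JSON.parse(result.cleanedJson).name); // "Alice"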
Available Templates:
import { RecipeTemplates } from '@loonylabs/llm-middleware';
const conservativeRecipe = RecipeTemplates.conservative();
const aggressiveRecipe = RecipeTemplates.aggressive();
const adaptiveRecipe = RecipeTemplates.adaptive();
See Recipe System Documentation for details.
For simple data: Use FlatFormatter
const flat = FlatFormatter.flatten({ name: 'Alice', age: 30 });
For complex nested prompts: Use RequestFormatterService
import { RequestFormatterService } from '@loonylabs/llm-middleware';
const prompt = {
context: { genre: 'sci-fi', tone: 'dark' },
instruction: 'Write an opening'
};
const formatted = RequestFormatterService.formatUserMessage(
prompt, (s) => s, 'MyUseCase'
);
// Outputs: ## CONTEXT:\ngenre: sci-fi\ntone: dark\n\n## INSTRUCTION:\nWrite an opening
See Request Formatting Guide for details.
Automatic performance tracking with UseCaseMetricsLoggerService:
// Automatically logged for all use cases:
// - Execution time
// - Token usage (input/output)
// - Generation speed (tokens/sec)
// - Parameters used
Metrics appear in console logs:
✅ Completed AI use case [MyUseCase = phi3:mini] SUCCESS
Time: 2.5s | Input: 120 tokens | Output: 85 tokens | Speed: 34.0 tokens/sec
See Performance Monitoring Guide for advanced usage.
Multi-level logging with contextual metadata:
import { logger } from '@loonylabs/llm-middleware';
logger.info('Operation completed', {
context: 'MyService',
metadata: { userId: 123, duration: 150 }
});
Flexible model management:
import { getModelConfig } from '@loonylabs/llm-middleware';
// MODEL1_NAME is required in .env or will throw error
const config = getModelConfig('MODEL1');
console.log(config.name); // Value from MODEL1_NAME env variable
console.log(config.baseUrl); // Value from MODEL1_URL or default localhost
Override the model configuration provider to use your own custom model configurations:
Use Cases:
- Multi-environment deployments (dev, staging, production)
- Dynamic model selection based on runtime conditions
- Loading model configs from external sources (database, API)
- Testing with different model configurations
New Pattern (Recommended):
import { BaseAIUseCase, ModelConfigKey, ValidatedLLMModelConfig } from '@loonylabs/llm-middleware';
// Define your custom model configurations
const MY_CUSTOM_MODELS: Record<string, ValidatedLLMModelConfig> = {
'PRODUCTION_MODEL': {
name: 'llama3.2:latest',
baseUrl: 'http://production-server.com:11434',
temperature: 0.7
},
'DEVELOPMENT_MODEL': {
name: 'llama3.2:latest',
baseUrl: 'http://localhost:11434',
temperature: 0.9
}
};
class MyCustomUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
// Override this method to provide custom model configurations
protected getModelConfigProvider(key: ModelConfigKey): ValidatedLLMModelConfig {
const config = MY_CUSTOM_MODELS[key];
if (!config?.name) {
throw new Error(`Model ${key} not found`);
}
return config;
}
// ... rest of your use case implementation
}
Environment-Aware Example:
class EnvironmentAwareUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
protected getModelConfigProvider(key: ModelConfigKey): ValidatedLLMModelConfig {
const env = process.env.NODE_ENV || 'development';
// Automatically select model based on environment
const modelKey = env === 'production' ? 'PRODUCTION_MODEL' :
env === 'staging' ? 'STAGING_MODEL' :
'DEVELOPMENT_MODEL';
return MY_CUSTOM_MODELS[modelKey];
}
}
Old Pattern (Still Supported):
// Legacy approach - still works but not recommended
class LegacyUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
protected get modelConfig(): ValidatedLLMModelConfig {
return myCustomGetModelConfig(this.modelConfigKey);
}
}
See the Custom Config Example for a complete working implementation.
LLM-middleware provides fine-grained control over model parameters to optimize output for different use cases:
import { BaseAIUseCase, ModelParameterOverrides } from '@loonylabs/llm-middleware';
class MyUseCase extends BaseAIUseCase<string, MyRequest, MyResult> {
protected getParameterOverrides(): ModelParameterOverrides {
return {
temperatureOverride: 0.8, // Control creativity vs. determinism
repeatPenalty: 1.3, // Reduce word repetition
frequencyPenalty: 0.2, // Penalize frequent words
presencePenalty: 0.2, // Encourage topic diversity
topP: 0.92, // Nucleus sampling threshold
topK: 60, // Vocabulary selection limit
repeatLastN: 128 // Context window for repetition
};
}
}
Parameter Levels:
- Global defaults: Set in `ModelParameterManagerService`
- Use-case level: Override via the `getParameterOverrides()` method
- Request level: Pass parameters directly in requests (see the sketch below)
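As a sketch of the request level, parameters can be passed with an individual call via the llmService options shown in the Quick Start; how they interact with use-case overrides is detailed in the Provider Parameters Guide:
import { llmService, LLMProvider } from '@loonylabs/llm-middleware';
// Request-level parameters: apply only to this single call.
const response = await llmService.call(
  "Summarize this paragraph in one sentence.",
  {
    provider: LLMProvider.ANTHROPIC,
    model: "claude-3-5-sonnet-20241022",
    authToken: process.env.ANTHROPIC_API_KEY,
    maxTokens: 120,    // cap output for this request only
    temperature: 0.2   // lower than the use-case default for determinism
  }
);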
Available Presets:
import { ModelParameterManagerService } from '@loonylabs/llm-middleware';
// Use curated presets for common use cases
const creativeParams = ModelParameterManagerService.getDefaultParametersForType('creative_writing');
const factualParams = ModelParameterManagerService.getDefaultParametersForType('factual');
const poeticParams = ModelParameterManagerService.getDefaultParametersForType('poetic');
const dialogueParams = ModelParameterManagerService.getDefaultParametersForType('dialogue');
const technicalParams = ModelParameterManagerService.getDefaultParametersForType('technical');
const marketingParams = ModelParameterManagerService.getDefaultParametersForType('marketing');
Presets Include:
- 📚 Creative Writing: Novels, stories, narrative fiction
- 📊 Factual: Reports, documentation, journalism
- 🎭 Poetic: Poetry, lyrics, artistic expression
- 💬 Dialogue: Character dialogue, conversational content
- 🔧 Technical: Code documentation, API references
- 📢 Marketing: Advertisements, promotional content
For detailed documentation about all parameters, value ranges, and preset configurations, see: Provider Parameters Guide (Ollama-specific)
🔧 Response Processing Options (v2.8.0)
Starting in v2.8.0, you can customize how responses are processed with ResponseProcessingOptions:
Available Options
interface ResponseProcessingOptions {
extractThinkTags?: boolean; // default: true
extractMarkdown?: boolean; // default: true
validateJson?: boolean; // default: true
cleanJson?: boolean; // default: true
recipeMode?: 'conservative' | 'aggressive' | 'adaptive';
}
Usage in Use Cases
Override getResponseProcessingOptions() to customize processing:
// Plain text response (compression, summarization)
class CompressEntityUseCase extends BaseAIUseCase {
protected getResponseProcessingOptions(): ResponseProcessingOptions {
return {
extractThinkTags: true, // YES: Extract <think> tags
extractMarkdown: true, // YES: Extract markdown blocks
validateJson: false, // NO: Skip JSON validation
cleanJson: false // NO: Skip JSON cleaning
};
}
}
// Keep think tags in content
class DebugUseCase extends BaseAIUseCase {
protected getResponseProcessingOptions(): ResponseProcessingOptions {
return {
extractThinkTags: false // Keep <think> tags visible
};
}
}
// Conservative JSON cleaning
class StrictJsonUseCase extends BaseAIUseCase {
protected getResponseProcessingOptions(): ResponseProcessingOptions {
return {
recipeMode: 'conservative' // Minimal JSON fixes
};
}
}
Direct Service Usage
You can also use ResponseProcessorService directly:
import { ResponseProcessorService, ResponseProcessingOptions } from '@loonylabs/llm-middleware';
// Plain text (no JSON processing)
const result = await ResponseProcessorService.processResponseAsync(response, {
validateJson: false,
cleanJson: false
});
// Extract markdown but skip JSON
const result2 = await ResponseProcessorService.processResponseAsync(response, {
extractMarkdown: true,
validateJson: false
});
Use Cases
- ✅ Plain text responses: Compression, summarization, text generation
- ✅ Pre-validated JSON: Skip redundant validation
- ✅ Debug/analysis: Keep think tags in content
- ✅ Performance: Skip unnecessary processing steps
- ✅ Custom workflows: Mix and match extraction features
Backward Compatibility
All options are optional with sensible defaults. Existing code works without changes:
// Still works exactly as before
const result = await ResponseProcessorService.processResponseAsync(response);
🤝 Contributing
We welcome contributions! Please see our Contributing Guidelines for details.
- Fork the repository
- Create your feature branch (`git checkout -b feature/amazing-feature`)
- Commit your changes (`git commit -m 'Add some amazing feature'`)
- Push to the branch (`git push origin feature/amazing-feature`)
- Open a Pull Request
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Ollama for the amazing local LLM platform
- The open-source community for inspiration and contributions
🔗 Links
Made with ❤️ for the AI community
