web-llm-middleware

v1.4.0

Published

a year ago

OpenAI-compatible middleware for running WebLLM models locally with offline support


ragingwind

webllm llm ai machine-learning openai offline browser middleware local-ai inference

web-llm-middleware

https://github.com/user-attachments/assets/4d5a6160-9985-4e63-b812-fe595e84c0af

🚀 Usage

Basic Middleware Integration

The WebLLM middleware provides an OpenAI-compatible API for running large language models locally in the browser.

Node.js HTTP Server

import { createServer } from 'node:http';
import { parse } from 'node:url';
import { WebLLMMiddleware } from 'web-llm-middleware';

const webllm = new WebLLMMiddleware({
  dev: true, // Enable development logging
  model: 'Llama-3.2-1B-Instruct-q4f32_1-MLC',
});

const handler = webllm.getRequestHandler();

const server = createServer((req, res) => {
  const parsedUrl = parse(req.url ?? '/', true);
  handler(req, res, parsedUrl);
});

server.listen(15408, () => {
  console.log('WebLLM server running on http://localhost:15408');
});

Express.js Integration

import express from 'express';
import { WebLLMMiddleware } from 'web-llm-middleware';

const app = express();
const webllm = new WebLLMMiddleware({
  dev: process.env.NODE_ENV === 'development',
  dir: './public',
  model: 'Llama-3.2-1B-Instruct-q4f32_1-MLC',
});

const handler = webllm.getRequestHandler();

// Route every request through the WebLLM handler
app.use((req, res) => {
  handler(req, res);
});

app.listen(15408, () => {
  console.log('Express + WebLLM server running on http://localhost:15408');
});

Next.js API Route

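// pages/api/chat.ts (Pages Router; App Router route handlers in app/api/chat/route.ts
// export GET/POST functions and do not receive Node-style req/res objects)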
import { WebLLMMiddleware } from 'web-llm-middleware';

const webllm = new WebLLMMiddleware({
  dev: process.env.NODE_ENV === 'development',
  dir: './public',
  model: 'Llama-3.2-1B-Instruct-q4f32_1-MLC',
});

const handler = webllm.getRequestHandler();

export default function chatHandler(req: any, res: any) {
  return handler(req, res);
}

Configuration Options

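interface WebLLMMiddlewareOptions {
  model: string; // Model ID to initialize
  dev?: boolean; // Enable development logging (default: false)
  dir?: string; // Directory option passed in the Express and Next.js examples above (e.g. './public')
}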
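
As a rough sketch, the options could be assembled from environment variables. `buildOptions` and the `WEBLLM_MODEL` variable are hypothetical names used only for illustration (the package is not known to read them), and the interface is copied locally so the snippet runs on its own.

```typescript
// Local copy of the documented options shape, so this sketch is standalone.
interface WebLLMMiddlewareOptions {
  model: string;
  dev?: boolean;
}

// Hypothetical helper: WEBLLM_MODEL is an assumed variable name, not one the package defines.
function buildOptions(env: Record<string, string | undefined>): WebLLMMiddlewareOptions {
  return {
    // Fall back to the model used throughout this README.
    model: env['WEBLLM_MODEL'] ?? 'Llama-3.2-1B-Instruct-q4f32_1-MLC',
    // Development logging only when NODE_ENV is exactly "development".
    dev: env['NODE_ENV'] === 'development',
  };
}

const exampleOptions = buildOptions({ NODE_ENV: 'development' });
console.log(exampleOptions);
```

The resulting object can then be passed to `new WebLLMMiddleware(...)` as in the examples above.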

Available Models

The middleware supports 36+ models including:

Llama Series: 3, 3.1, 3.2 (1B, 3B, 8B, 70B)
Qwen Series: 1.5, 2, 2.5, 3 with Math/Coder variants
Phi Series: 3, 3.5 mini and vision models
SmolLM: Lightweight 135M, 360M, 1.7B models
Gemma, Hermes, Mistral: Various sizes and specializations

See the /v1/models endpoint for the complete list.
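
To pick out one model family from that list, something like the following might work. It assumes the response follows OpenAI's model list format, with each entry in `data` carrying an `id` field; `modelsInFamily` and `printFamily` are hypothetical helper names.

```typescript
// Assumed response shape, following OpenAI's model list format.
interface ModelList {
  data: { id: string }[];
}

// Return the model IDs that start with the given family prefix (case-insensitive).
function modelsInFamily(list: ModelList, family: string): string[] {
  const prefix = family.toLowerCase();
  return list.data
    .map((model) => model.id)
    .filter((id) => id.toLowerCase().startsWith(prefix));
}

// Fetch the list from a running server and print one family.
// Defined but not called here, because it needs the server to be up.
async function printFamily(baseUrl: string, family: string): Promise<void> {
  const res = await fetch(`${baseUrl}/v1/models`);
  if (!res.ok) throw new Error(`GET /v1/models failed with status ${res.status}`);
  const list = (await res.json()) as ModelList;
  console.log(modelsInFamily(list, family));
}

// Usage, with the server running:
// await printFamily('http://localhost:15408', 'Qwen');
```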

🤖 Vercel AI SDK Integration

The middleware is fully compatible with Vercel AI SDK's generateText and streamText functions:

import { generateText, streamText } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';

const openai = createOpenAI({
  baseURL: 'http://localhost:15408/v1',
  apiKey: 'not-needed',
});

// Non-streaming text generation
const { text } = await generateText({
  model: openai('Llama-3.2-1B-Instruct-q4f32_1-MLC'),
  prompt: 'Write a short story about a robot.',
});

// Streaming text generation
const { textStream } = await streamText({
  model: openai('Llama-3.2-1B-Instruct-q4f32_1-MLC'),
  prompt: 'Write a creative story...',
});

for await (const textPart of textStream) {
  process.stdout.write(textPart);
}

Both functions use the standard OpenAI /v1/chat/completions endpoint with automatic streaming detection.
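
For clients that do not use the AI SDK, a streamed response can be read by hand. The sketch below assumes the middleware emits OpenAI-style server-sent events (`data: {...}` lines ending with `data: [DONE]`) when the request sets `stream: true`; the source only states that streaming is detected automatically, so treat the chunk shape as an assumption. `deltaFromSseLine` is a hypothetical helper.

```typescript
// Assumed chunk shape, following OpenAI's chat.completion.chunk format.
interface ChatChunk {
  choices: { delta: { content?: string } }[];
}

// Extract the text delta from one server-sent-event line.
// Returns null for blank lines, non-data lines, and the final "[DONE]" marker.
function deltaFromSseLine(line: string): string | null {
  const trimmed = line.trim();
  if (!trimmed.startsWith('data:')) return null;
  const payload = trimmed.slice('data:'.length).trim();
  if (payload === '' || payload === '[DONE]') return null;
  const chunk = JSON.parse(payload) as ChatChunk;
  return chunk.choices[0]?.delta.content ?? null;
}

// Sample lines written by hand for illustration, not captured from a real server.
const sampleLines = [
  'data: {"choices":[{"delta":{"content":"Hel"}}]}',
  'data: {"choices":[{"delta":{"content":"lo"}}]}',
  'data: [DONE]',
];

const assembled = sampleLines
  .map(deltaFromSseLine)
  .filter((delta): delta is string => delta !== null)
  .join('');
console.log(assembled); // prints "Hello"
```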

🛠️ Development

This project uses:

TypeScript for type safety
ES Modules for modern JavaScript
tsx for running TypeScript files directly
Strict mode enabled in TypeScript for better type checking

Building

To build the project:

pnpm run build

This will compile TypeScript files from src/ to JavaScript in dist/.

Development Mode

For development with automatic reloading:

pnpm run dev

🧪 Testing

Quick Start Testing

Start the test server:
```
pnpm test:server
```
Test Vercel AI SDK integration:
```
pnpm test:ai-sdk
```

Test chat completions endpoint with curl:

curl -X POST http://localhost:15408/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @./example/hello.json | jq .choices

API Endpoints Testing

1. Health Check

curl -X GET http://localhost:15408/health | jq

Expected response:

{
  "status": "healthy",
  "webllm_initialized": true,
  "timestamp": "2024-06-19T..."
}
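
A script that waits for the server could check this payload before sending requests. The following is a minimal sketch based only on the fields shown above; `isReady` is a hypothetical helper and the timestamps are sample values.

```typescript
// Returns true only when the payload has the documented fields and reports
// both a healthy status and an initialized WebLLM engine.
function isReady(payload: unknown): boolean {
  if (typeof payload !== 'object' || payload === null) return false;
  const fields = payload as Record<string, unknown>;
  return (
    fields['status'] === 'healthy' &&
    fields['webllm_initialized'] === true &&
    typeof fields['timestamp'] === 'string'
  );
}

console.log(isReady({ status: 'healthy', webllm_initialized: true, timestamp: '2024-06-19T00:00:00Z' })); // true
console.log(isReady({ status: 'healthy', webllm_initialized: false, timestamp: '2024-06-19T00:00:00Z' })); // false
```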

2. List Available Models

curl -X GET http://localhost:15408/v1/models | jq .data

Returns an array of the supported models (36 or more), including the Llama, Phi, Qwen, and other series.

3. Chat Completions

Using example file:

curl -X POST http://localhost:15408/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @./example/hello.json

Custom request:

curl -X POST http://localhost:15408/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "user", "content": "What is the capital of France?"}
    ],
    "model": "Llama-3.2-1B-Instruct-q4f32_1-MLC",
    "max_tokens": 50,
    "temperature": 0.7
  }'
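
The same custom request can be sent from TypeScript. This sketch assumes the response follows OpenAI's chat completion format (`choices[0].message.content`); `buildRequest`, `firstReply`, and `ask` are hypothetical helper names, not part of the package.

```typescript
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

interface ChatRequest {
  model: string;
  messages: ChatMessage[];
  max_tokens?: number;
  temperature?: number;
}

// Assumed response shape, following OpenAI's chat completion format.
interface ChatResponse {
  choices: { message: ChatMessage }[];
}

// Build the same body as the curl example: one user message, 50 tokens, temperature 0.7.
function buildRequest(model: string, prompt: string): ChatRequest {
  return {
    model,
    messages: [{ role: 'user', content: prompt }],
    max_tokens: 50,
    temperature: 0.7,
  };
}

// Pull the assistant's text out of the first choice.
function firstReply(response: ChatResponse): string {
  const choice = response.choices[0];
  if (!choice) throw new Error('Response contained no choices');
  return choice.message.content;
}

// Send the request to a running server. Defined but not called here.
async function ask(baseUrl: string, model: string, prompt: string): Promise<string> {
  const res = await fetch(`${baseUrl}/v1/chat/completions`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(buildRequest(model, prompt)),
  });
  if (!res.ok) throw new Error(`Request failed with status ${res.status}`);
  return firstReply((await res.json()) as ChatResponse);
}

// Usage, with the server running:
// console.log(await ask('http://localhost:15408', 'Llama-3.2-1B-Instruct-q4f32_1-MLC', 'What is the capital of France?'));
```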

Offline Functionality Verification

1. Disconnect from the internet or block external requests
2. Start server: pnpm test:server
3. Verify WebLLM loads: Check that lib/web-llm.js (5.6MB) is served locally
4. Test completion: Use any of the above curl commands
5. Check logs: Server should show WebLLM initialization without external requests

Testing Different Models

Test various model families:

# Small model (fast)
curl -X POST http://localhost:15408/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Hi"}], "model": "SmolLM-135M-Instruct-q4f16_1-MLC"}'

# Math specialist
curl -X POST http://localhost:15408/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What is 15 * 23?"}], "model": "Qwen2-Math-7B-Instruct-q4f16_1-MLC"}'

# Code specialist
curl -X POST http://localhost:15408/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "Write a hello world in Python"}], "model": "Qwen2.5-Coder-7B-Instruct-q4f16_1-MLC"}'

Performance Testing

Monitor initialization and response times:

time curl -X POST http://localhost:15408/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d @./example/hello.json
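
A single `time` run mixes any one-off initialization cost with steady-state latency, so it can help to time several requests and summarize them. The helpers below are a hypothetical sketch, not part of the package: `timeIt` measures one async call and `summarize` reports the minimum, median, and maximum.

```typescript
// Measure how long one async operation takes, in milliseconds.
async function timeIt(run: () => Promise<unknown>): Promise<number> {
  const start = Date.now();
  await run();
  return Date.now() - start;
}

// Summarize a non-empty list of latencies in milliseconds.
function summarize(latenciesMs: number[]): { min: number; median: number; max: number } {
  if (latenciesMs.length === 0) throw new Error('No samples to summarize');
  const sorted = [...latenciesMs].sort((a, b) => a - b);
  const mid = Math.floor(sorted.length / 2);
  // Odd count: middle element. Even count: mean of the two middle elements.
  const median =
    sorted.length % 2 === 1 ? sorted[mid]! : (sorted[mid - 1]! + sorted[mid]!) / 2;
  return { min: sorted[0]!, median, max: sorted[sorted.length - 1]! };
}

// Sample numbers for illustration only, not measurements from a real server.
console.log(summarize([30, 10, 20])); // { min: 10, median: 20, max: 30 }
```

In practice, `timeIt` would wrap a request to the chat completions endpoint, and the first sample could be reported separately if it includes model loading.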

Terminate Server Process

lsof -ti:15408 | xargs kill -9

📝 License

MIT
