@loxia-labs/spitfire v1.2.1
# Spitfire

Standalone Node.js module for running LLMs locally. No external dependencies required - includes both WebAssembly and WebGPU inference engines built from scratch.
## Features
- **Truly standalone** - no Ollama or external runtime required
- **WebGPU acceleration** - GPU-accelerated inference with automatic fallback
- **WebAssembly engine** - cross-platform CPU inference; runs anywhere Node.js runs
- **Ollama-compatible** - works with existing GGUF models
- **Quantization support** - Q4_0, Q4_K, Q5_K, Q6_K, Q8_0, F16, and F32 dequantization on GPU
- **Chat templates** - automatic Jinja2 chat template support (Qwen2, Llama, etc.)
- **HTTP API** - drop-in replacement for Ollama's REST API
- **Programmatic API** - clean TypeScript/JavaScript API
- **Streaming** - full streaming support for text generation
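On the client side, the HTTP API and streaming features combine the way Ollama's do: the server emits newline-delimited JSON chunks, each carrying a fragment of text and a final `done` flag. The sketch below accumulates such a stream into the full response; the chunk shape (`response`, `done`) follows Ollama's `/api/generate` format, and whether Spitfire's fields match it exactly is an assumption.

```typescript
// Accumulate Ollama-style NDJSON streaming chunks into the full response text.
// Assumed chunk shape (from Ollama's /api/generate): {"response":"...","done":false}
function accumulateChunks(ndjson: string): { text: string; done: boolean } {
  let text = '';
  let done = false;
  for (const line of ndjson.split('\n')) {
    if (!line.trim()) continue; // skip blank lines between chunks
    const chunk = JSON.parse(line) as { response?: string; done?: boolean };
    text += chunk.response ?? ''; // concatenate the streamed fragments
    if (chunk.done) done = true;  // last chunk marks completion
  }
  return { text, done };
}
```

Against a running server this would be fed from a `fetch` response body on `http://localhost:11434`, reading the stream chunk by chunk.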
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                      Your Application                       │
├─────────────────────────────────────────────────────────────┤
│    Spitfire API    │     SpitfireServer (HTTP/Fastify)      │
├────────────────────┴────────────────────────────────────────┤
│                       Engine Factory                        │
│                (Auto-detects best available)                │
├──────────────────────────┬──────────────────────────────────┤
│      WebGPU Engine       │           WASM Engine            │
│    (GPU Acceleration)    │    (llama.cpp via Emscripten)    │
│  ┌────────────────────┐  │  ┌────────────────────────────┐  │
│  │ Tensor Ops (WGSL)  │  │  │      llama.cpp (WASM)      │  │
│  │ Attention/FFN      │  │  │     SIMD128 optimized      │  │
│  │ Quantization       │  │  └────────────────────────────┘  │
│  │ GGUF Loader        │  │                                  │
│  └────────────────────┘  │                                  │
├──────────────────────────┴──────────────────────────────────┤
│                        Node.js / V8                         │
└─────────────────────────────────────────────────────────────┘
```

No subprocess, no external binaries - everything runs in-process.
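The factory's auto-detection boils down to a simple policy: probe for a usable GPU, and fall back to WASM if the probe fails for any reason. A sketch of that policy (illustrative only - `probe` stands in for the real WebGPU adapter check, which is not shown here):

```typescript
type EngineType = 'webgpu' | 'wasm';

// Illustrative fallback policy; `probe` is a stand-in for a real
// WebGPU adapter/device check, not part of Spitfire's actual API.
function detectEngine(probe: () => boolean): EngineType {
  try {
    return probe() ? 'webgpu' : 'wasm';
  } catch {
    // A throwing probe (e.g. no GPU bindings at all) also falls back to CPU.
    return 'wasm';
  }
}
```

The point of the `try`/`catch` is that "no GPU" can surface either as a negative probe result or as an exception from missing native bindings; both paths land on the WASM engine.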
## Installation

```sh
npm install @loxia-labs/spitfire
```

## Quick Start
### Automatic Engine Selection
```ts
import { createBestEngine } from '@loxia-labs/spitfire';

// Automatically uses WebGPU if available, falls back to WASM
const engine = await createBestEngine();

await engine.loadModel('/path/to/model.gguf');

const result = await engine.generate('Hello, how are you?', {
  maxTokens: 100,
  temperature: 0.8
});

console.log(result.text);

await engine.shutdown();
```

### Using WebGPU Engine (GPU Acceleration)
```ts
import { createWebGPUEngine } from '@loxia-labs/spitfire';

const engine = createWebGPUEngine();

// Initialize GPU device
await engine.init();

// Check GPU capabilities
const caps = engine.getCapabilities();
console.log(`Max buffer size: ${caps.maxBufferSize / 1024 / 1024} MB`);

// Load and run model
await engine.loadModel('/path/to/model.gguf', {
  contextLength: 2048
});

const result = await engine.generate('Explain quantum computing', {
  maxTokens: 200,
  temperature: 0.7,
  topK: 40
});

console.log(result.text);

await engine.shutdown();
```

### Using WASM Engine (CPU)
```ts
import { createWasmEngine } from '@loxia-labs/spitfire';

const engine = createWasmEngine();
await engine.init();

await engine.loadModel('/path/to/model.gguf', {
  contextLength: 2048,
  numThreads: 4
});

const result = await engine.generate('Hello!', {
  maxTokens: 100
});

await engine.shutdown();
```

## Chat Templates
Spitfire automatically applies chat templates from GGUF metadata. For models like Qwen2, your prompt is formatted for you:
```ts
// Your input:
const result = await engine.generate('What is 2+2?');

// Automatically formatted as:
// <|im_start|>system
// You are Qwen, created by Alibaba Cloud...
// <|im_end|>
// <|im_start|>user
// What is 2+2?<|im_end|>
// <|im_start|>assistant
```

If you want to handle formatting yourself, use `rawPrompt: true`:

```ts
const prompt = '<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n';

const result = await engine.generate(prompt, {
  rawPrompt: true // Skip automatic chat template
});
```

## Using the High-Level API
```ts
import { Spitfire } from '@loxia-labs/spitfire';

const spitfire = new Spitfire();

// Generate completion
const response = await spitfire.generate({
  model: 'llama3.2',
  prompt: 'What is the meaning of life?'
});
console.log(response.response);

// Chat
const chat = await spitfire.chat({
  model: 'llama3.2',
  messages: [
    { role: 'user', content: 'Hello!' }
  ]
});
console.log(chat.message.content);

await spitfire.shutdown();
```

## HTTP Server
```ts
import { SpitfireServer } from '@loxia-labs/spitfire';

const server = new SpitfireServer({ port: 11434 });
await server.start();

// Now accessible at http://localhost:11434
// Compatible with Ollama API clients
```

## CLI
```sh
# List models
spitfire list

# Run inference
spitfire run /path/to/model.gguf "Hello!"

# Start HTTP server
spitfire serve --port 11434
```

## Building from Source
### Prerequisites

For the WASM engine, install the Emscripten SDK:
```sh
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh
```

For the WebGPU engine:
WebGPU support is included via the `webgpu` npm package (Dawn bindings).
### Build

```sh
cd spitfire
npm install
npm run build:wasm   # Build WASM engine
npm run build:ts     # Build TypeScript
```

## API Reference
### Engine Factory
```ts
import {
  createEngine,
  createBestEngine,
  detectBestEngine
} from '@loxia-labs/spitfire';

// Create specific engine type
const engine = createEngine({ type: 'webgpu' }); // or 'wasm'

// Auto-detect and create best engine
const bestEngine = await createBestEngine();

// Just detect without creating
const engineType = await detectBestEngine(); // 'webgpu' or 'wasm'
```

### WebGPUEngine
GPU-accelerated inference engine.
```ts
const engine = createWebGPUEngine();

await engine.init();
engine.getCapabilities(); // GPU limits and features

await engine.loadModel(path, {
  contextLength?: number;
  batchSize?: number;
});

await engine.generate(prompt, {
  maxTokens?: number;
  temperature?: number;
  topK?: number;
  topP?: number;
  stop?: string[];
  rawPrompt?: boolean; // Skip chat template formatting
});

await engine.embed(text);  // Get embeddings
await engine.unload();     // Unload model
await engine.shutdown();   // Release GPU
```

### WasmEngine
CPU-based inference engine.
```ts
const engine = createWasmEngine({
  wasmPath?: string;
  numThreads?: number;
});

await engine.init();
await engine.loadModel(path, options);
await engine.generate(prompt, options);
await engine.embed(text);
await engine.shutdown();
```

### Spitfire (High-Level API)
```ts
const spitfire = new Spitfire({
  modelsPath?: string;
  maxLoadedModels?: number;
  defaultKeepAlive?: string;
  engineType?: 'webgpu' | 'wasm' | 'auto';
});

await spitfire.generate(request);
await spitfire.chat(request);
await spitfire.embed(request);
await spitfire.list();
await spitfire.shutdown();
```

### SpitfireServer (HTTP API)
```ts
const server = new SpitfireServer({
  host?: string; // Default: '127.0.0.1'
  port?: number; // Default: 11434
});

await server.start();
await server.stop();
```

## Model Compatibility
Spitfire works with GGUF model files. Supported quantization formats:
| Format | WebGPU | WASM |
|--------|--------|------|
| F32    | ✅ Yes | ✅ Yes |
| F16    | ✅ Yes | ✅ Yes |
| Q8_0   | ✅ Yes | ✅ Yes |
| Q4_0   | ✅ Yes | ✅ Yes |
| Q4_K   | ✅ Yes | ✅ Yes |
| Q5_K   | ✅ Yes | ✅ Yes |
| Q6_K   | ✅ Yes | ✅ Yes |
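For intuition, Q8_0 is the simplest of these formats: weights are stored in blocks of 32 signed bytes sharing a single scale, and dequantizing a block is just `scale * q` per weight. A simplified sketch (in a real GGUF file the scale is stored inline as an f16 preceding the 32 bytes; here it is passed as a plain number):

```typescript
// Dequantize one Q8_0 block: 32 signed 8-bit weights sharing one scale.
// Simplified: real GGUF stores the scale as an inline f16 before the quants.
function dequantizeQ8_0(scale: number, quants: Int8Array): Float32Array {
  if (quants.length !== 32) {
    throw new Error('Q8_0 blocks hold exactly 32 weights');
  }
  const out = new Float32Array(32);
  for (let i = 0; i < 32; i++) {
    out[i] = scale * quants[i]; // each weight is scale * signed byte
  }
  return out;
}
```

The K-quants (Q4_K, Q5_K, Q6_K) follow the same idea but with nested sub-block scales and sub-byte packing, which is why they need dedicated dequantization kernels on the GPU.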
Tested Models:
- Qwen2.5-Coder-3B-Instruct (Q4_K_M, Q6_K)
- Llama 3.2 (various quantizations)
- Other GGUF-compatible models
Get models from:
- **HuggingFace** - many quantized models available
- **Ollama** - copy from `~/.ollama/models/`
- **Convert your own** - use llama.cpp's conversion tools
## Performance
| Metric    | WebGPU Engine  | WASM Engine       |
|-----------|----------------|-------------------|
| Speed     | ~80% of native | ~70-85% of native |
| Memory    | GPU VRAM       | Up to 4 GB        |
| GPU       | Required       | Not used          |
| Threading | GPU parallel   | Multi-thread CPU  |
| SIMD      | GPU compute    | WASM SIMD128      |
### WebGPU Optimizations
- Tiled matrix multiplication (8x8 tiles)
- Shader caching and precompilation
- Buffer pooling for memory reuse
- Numerically stable softmax
- Fused attention kernels
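The "numerically stable softmax" item refers to subtracting the row maximum before exponentiating, which keeps `exp()` arguments non-positive so large logits never overflow to `Infinity`. The same algebra in scalar TypeScript (the WGSL kernel presumably applies the identical trick per row):

```typescript
// Numerically stable softmax: exp(x - max) / sum(exp(x - max)).
// Subtracting the max is mathematically a no-op (it cancels in the ratio),
// but it keeps every exp() argument <= 0, avoiding overflow for large logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```

Without the subtraction, `softmax([1000, 1000])` would compute `exp(1000) = Infinity` and return `NaN`; with it, the result is the correct `[0.5, 0.5]`.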
## Project Structure

```
spitfire/
├── src/
│   ├── engine/
│   │   ├── index.ts          # Engine factory
│   │   ├── wasm-engine.ts    # WASM inference
│   │   ├── webgpu-engine.ts  # WebGPU inference
│   │   └── webgpu/
│   │       ├── device.ts     # GPU device management
│   │       ├── buffer.ts     # Buffer utilities
│   │       ├── shader.ts     # WGSL shader compilation
│   │       ├── tensor.ts     # GPU Tensor class
│   │       ├── ops/          # Tensor operations
│   │       ├── layers/       # Transformer layers
│   │       ├── quant/        # Quantization support
│   │       ├── model/        # GGUF loading
│   │       └── perf/         # Performance monitoring
│   ├── types/                # TypeScript definitions
│   ├── model/                # Model management
│   ├── server/               # HTTP API
│   └── spitfire.ts           # Main API class
├── native/
│   ├── llama.cpp/            # llama.cpp source
│   ├── ggml/                 # ggml source
│   └── wasm/                 # WASM bindings
├── tests/
│   └── webgpu/               # WebGPU test suite (146 tests)
└── dist/
    └── wasm/                 # Compiled WASM files
```

## Testing
```sh
# Run all tests
npm test

# Run WebGPU tests only
npm test -- --testPathPattern=webgpu

# Run specific test file
npm test -- --testPathPattern=webgpu/tensor
```

Test coverage: 228 tests across 10 test suites.
## Requirements
- Node.js 18+
- WebGPU (for GPU acceleration): Supported in Node.js via Dawn bindings
- Emscripten (for building WASM): Only needed if building from source
## License
MIT
## Credits
Written by Daniel Suissa, Loxia.ai
Visit us at https://autopilot.loxia.ai
