@loxia-labs/spitfire v1.2.1
# Spitfire

Standalone Node.js module for running LLMs locally. No external dependencies required - includes both WebAssembly and WebGPU inference engines built from scratch.
## Features
- **Truly standalone** - no Ollama or external runtime required
- **WebGPU acceleration** - GPU-accelerated inference with automatic fallback
- **WebAssembly engine** - cross-platform CPU inference; runs anywhere Node.js runs
- **Ollama-compatible** - works with existing GGUF models
- **Quantization support** - Q4_0, Q4_K, Q5_K, Q6_K, Q8_0, F16, and F32 dequantization on GPU
- **Chat templates** - automatic Jinja2 chat template support (Qwen2, Llama, etc.)
- **HTTP API** - drop-in replacement for Ollama's REST API
- **Programmatic API** - clean TypeScript/JavaScript API
- **Streaming** - full streaming support for text generation
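On the client side, the HTTP API and streaming features combine the way Ollama's do: the server emits newline-delimited JSON chunks, each carrying a fragment of text and a final `done` flag. The sketch below accumulates such a stream into the full response; the chunk shape (`response`, `done`) follows Ollama's `/api/generate` format, and whether Spitfire's fields match it exactly is an assumption.

```typescript
// Accumulate Ollama-style NDJSON streaming chunks into the full response text.
// Assumed chunk shape (from Ollama's /api/generate): {"response":"...","done":false}
function accumulateChunks(ndjson: string): { text: string; done: boolean } {
  let text = '';
  let done = false;
  for (const line of ndjson.split('\n')) {
    if (!line.trim()) continue; // skip blank lines between chunks
    const chunk = JSON.parse(line) as { response?: string; done?: boolean };
    text += chunk.response ?? ''; // concatenate the streamed fragments
    if (chunk.done) done = true;  // last chunk marks completion
  }
  return { text, done };
}
```

Against a running server this would be fed from a `fetch` response body on `http://localhost:11434`, reading the stream chunk by chunk.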
## Architecture
```
┌─────────────────────────────────────────────────────────────┐
│                      Your Application                       │
├─────────────────────────────────────────────────────────────┤
│    Spitfire API    │     SpitfireServer (HTTP/Fastify)      │
├────────────────────┴────────────────────────────────────────┤
│                       Engine Factory                        │
│                (Auto-detects best available)                │
├──────────────────────────┬──────────────────────────────────┤
│      WebGPU Engine       │           WASM Engine            │
│    (GPU Acceleration)    │    (llama.cpp via Emscripten)    │
│  ┌────────────────────┐  │  ┌────────────────────────────┐  │
│  │ Tensor Ops (WGSL)  │  │  │      llama.cpp (WASM)      │  │
│  │ Attention/FFN      │  │  │     SIMD128 optimized      │  │
│  │ Quantization       │  │  └────────────────────────────┘  │
│  │ GGUF Loader        │  │                                  │
│  └────────────────────┘  │                                  │
├──────────────────────────┴──────────────────────────────────┤
│                        Node.js / V8                         │
└─────────────────────────────────────────────────────────────┘
```

No subprocess, no external binaries - everything runs in-process.
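The factory's auto-detection boils down to a simple policy: probe for a usable GPU, and fall back to WASM if the probe fails for any reason. A sketch of that policy (illustrative only - `probe` stands in for the real WebGPU adapter check, which is not shown here):

```typescript
type EngineType = 'webgpu' | 'wasm';

// Illustrative fallback policy; `probe` is a stand-in for a real
// WebGPU adapter/device check, not part of Spitfire's actual API.
function detectEngine(probe: () => boolean): EngineType {
  try {
    return probe() ? 'webgpu' : 'wasm';
  } catch {
    // A throwing probe (e.g. no GPU bindings at all) also falls back to CPU.
    return 'wasm';
  }
}
```

The point of the `try`/`catch` is that "no GPU" can surface either as a negative probe result or as an exception from missing native bindings; both paths land on the WASM engine.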
## Installation

```sh
npm install @loxia-labs/spitfire
```

## Quick Start
### Automatic Engine Selection
```ts
import { createBestEngine } from '@loxia-labs/spitfire';

// Automatically uses WebGPU if available, falls back to WASM
const engine = await createBestEngine();

await engine.loadModel('/path/to/model.gguf');

const result = await engine.generate('Hello, how are you?', {
  maxTokens: 100,
  temperature: 0.8
});

console.log(result.text);

await engine.shutdown();
```

### Using WebGPU Engine (GPU Acceleration)
```ts
import { createWebGPUEngine } from '@loxia-labs/spitfire';

const engine = createWebGPUEngine();

// Initialize GPU device
await engine.init();

// Check GPU capabilities
const caps = engine.getCapabilities();
console.log(`Max buffer size: ${caps.maxBufferSize / 1024 / 1024} MB`);

// Load and run model
await engine.loadModel('/path/to/model.gguf', {
  contextLength: 2048
});

const result = await engine.generate('Explain quantum computing', {
  maxTokens: 200,
  temperature: 0.7,
  topK: 40
});

console.log(result.text);

await engine.shutdown();
```

### Using WASM Engine (CPU)
```ts
import { createWasmEngine } from '@loxia-labs/spitfire';

const engine = createWasmEngine();
await engine.init();

await engine.loadModel('/path/to/model.gguf', {
  contextLength: 2048,
  numThreads: 4
});

const result = await engine.generate('Hello!', {
  maxTokens: 100
});

await engine.shutdown();
```

## Chat Templates
Spitfire automatically applies chat templates from GGUF metadata. For models like Qwen2, your prompt is formatted for you:
```ts
// Your input:
const result = await engine.generate('What is 2+2?');

// Automatically formatted as:
// <|im_start|>system
// You are Qwen, created by Alibaba Cloud...
// <|im_end|>
// <|im_start|>user
// What is 2+2?<|im_end|>
// <|im_start|>assistant
```

If you want to handle formatting yourself, use `rawPrompt: true`:

```ts
const prompt = '<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n';

const result = await engine.generate(prompt, {
  rawPrompt: true // Skip automatic chat template
});
```

## Using the High-Level API
```ts
import { Spitfire } from '@loxia-labs/spitfire';

const spitfire = new Spitfire();

// Generate completion
const response = await spitfire.generate({
  model: 'llama3.2',
  prompt: 'What is the meaning of life?'
});
console.log(response.response);

// Chat
const chat = await spitfire.chat({
  model: 'llama3.2',
  messages: [
    { role: 'user', content: 'Hello!' }
  ]
});
console.log(chat.message.content);

await spitfire.shutdown();
```

## HTTP Server
```ts
import { SpitfireServer } from '@loxia-labs/spitfire';

const server = new SpitfireServer({ port: 11434 });
await server.start();

// Now accessible at http://localhost:11434
// Compatible with Ollama API clients
```

## CLI
```sh
# List models
spitfire list

# Run inference
spitfire run /path/to/model.gguf "Hello!"

# Start HTTP server
spitfire serve --port 11434
```

## Building from Source
### Prerequisites

For the WASM engine, install the Emscripten SDK:
```sh
git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh
```

For the WebGPU engine:
WebGPU support is included via the `webgpu` npm package (Dawn bindings).
### Build

```sh
cd spitfire
npm install
npm run build:wasm   # Build WASM engine
npm run build:ts     # Build TypeScript
```

## API Reference
### Engine Factory
```ts
import {
  createEngine,
  createBestEngine,
  detectBestEngine
} from '@loxia-labs/spitfire';

// Create specific engine type
const engine = createEngine({ type: 'webgpu' }); // or 'wasm'

// Auto-detect and create best engine
const bestEngine = await createBestEngine();

// Just detect without creating
const engineType = await detectBestEngine(); // 'webgpu' or 'wasm'
```

### WebGPUEngine
GPU-accelerated inference engine.
```ts
const engine = createWebGPUEngine();

await engine.init();
engine.getCapabilities(); // GPU limits and features

await engine.loadModel(path, {
  contextLength?: number;
  batchSize?: number;
});

await engine.generate(prompt, {
  maxTokens?: number;
  temperature?: number;
  topK?: number;
  topP?: number;
  stop?: string[];
  rawPrompt?: boolean; // Skip chat template formatting
});

await engine.embed(text);  // Get embeddings
await engine.unload();     // Unload model
await engine.shutdown();   // Release GPU
```

### WasmEngine
CPU-based inference engine.
```ts
const engine = createWasmEngine({
  wasmPath?: string;
  numThreads?: number;
});

await engine.init();
await engine.loadModel(path, options);
await engine.generate(prompt, options);
await engine.embed(text);
await engine.shutdown();
```

### Spitfire (High-Level API)
```ts
const spitfire = new Spitfire({
  modelsPath?: string;
  maxLoadedModels?: number;
  defaultKeepAlive?: string;
  engineType?: 'webgpu' | 'wasm' | 'auto';
});

await spitfire.generate(request);
await spitfire.chat(request);
await spitfire.embed(request);
await spitfire.list();
await spitfire.shutdown();
```

### SpitfireServer (HTTP API)
```ts
const server = new SpitfireServer({
  host?: string; // Default: '127.0.0.1'
  port?: number; // Default: 11434
});

await server.start();
await server.stop();
```

## Model Compatibility
Spitfire works with GGUF model files. Supported quantization formats:
| Format | WebGPU | WASM |
|--------|--------|------|
| F32    | ✅ Yes | ✅ Yes |
| F16    | ✅ Yes | ✅ Yes |
| Q8_0   | ✅ Yes | ✅ Yes |
| Q4_0   | ✅ Yes | ✅ Yes |
| Q4_K   | ✅ Yes | ✅ Yes |
| Q5_K   | ✅ Yes | ✅ Yes |
| Q6_K   | ✅ Yes | ✅ Yes |
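For intuition, Q8_0 is the simplest of these formats: weights are stored in blocks of 32 signed bytes sharing a single scale, and dequantizing a block is just `scale * q` per weight. A simplified sketch (in a real GGUF file the scale is stored inline as an f16 preceding the 32 bytes; here it is passed as a plain number):

```typescript
// Dequantize one Q8_0 block: 32 signed 8-bit weights sharing one scale.
// Simplified: real GGUF stores the scale as an inline f16 before the quants.
function dequantizeQ8_0(scale: number, quants: Int8Array): Float32Array {
  if (quants.length !== 32) {
    throw new Error('Q8_0 blocks hold exactly 32 weights');
  }
  const out = new Float32Array(32);
  for (let i = 0; i < 32; i++) {
    out[i] = scale * quants[i]; // each weight is scale * signed byte
  }
  return out;
}
```

The K-quants (Q4_K, Q5_K, Q6_K) follow the same idea but with nested sub-block scales and sub-byte packing, which is why they need dedicated dequantization kernels on the GPU.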
Tested Models:
- Qwen2.5-Coder-3B-Instruct (Q4_K_M, Q6_K)
- Llama 3.2 (various quantizations)
- Other GGUF-compatible models
Get models from:
- **HuggingFace** - many quantized models available
- **Ollama** - copy from `~/.ollama/models/`
- **Convert your own** - use llama.cpp's conversion tools
## Performance
| Metric    | WebGPU Engine  | WASM Engine       |
|-----------|----------------|-------------------|
| Speed     | ~80% of native | ~70-85% of native |
| Memory    | GPU VRAM       | Up to 4 GB        |
| GPU       | Required       | Not used          |
| Threading | GPU parallel   | Multi-thread CPU  |
| SIMD      | GPU compute    | WASM SIMD128      |
### WebGPU Optimizations
- Tiled matrix multiplication (8x8 tiles)
- Shader caching and precompilation
- Buffer pooling for memory reuse
- Numerically stable softmax
- Fused attention kernels
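The "numerically stable softmax" item refers to subtracting the row maximum before exponentiating, which keeps `exp()` arguments non-positive so large logits never overflow to `Infinity`. The same algebra in scalar TypeScript (the WGSL kernel presumably applies the identical trick per row):

```typescript
// Numerically stable softmax: exp(x - max) / sum(exp(x - max)).
// Subtracting the max is mathematically a no-op (it cancels in the ratio),
// but it keeps every exp() argument <= 0, avoiding overflow for large logits.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((x) => Math.exp(x - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```

Without the subtraction, `softmax([1000, 1000])` would compute `exp(1000) = Infinity` and return `NaN`; with it, the result is the correct `[0.5, 0.5]`.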
## Project Structure

```
spitfire/
├── src/
│   ├── engine/
│   │   ├── index.ts          # Engine factory
│   │   ├── wasm-engine.ts    # WASM inference
│   │   ├── webgpu-engine.ts  # WebGPU inference
│   │   └── webgpu/
│   │       ├── device.ts     # GPU device management
│   │       ├── buffer.ts     # Buffer utilities
│   │       ├── shader.ts     # WGSL shader compilation
│   │       ├── tensor.ts     # GPU Tensor class
│   │       ├── ops/          # Tensor operations
│   │       ├── layers/       # Transformer layers
│   │       ├── quant/        # Quantization support
│   │       ├── model/        # GGUF loading
│   │       └── perf/         # Performance monitoring
│   ├── types/                # TypeScript definitions
│   ├── model/                # Model management
│   ├── server/               # HTTP API
│   └── spitfire.ts           # Main API class
├── native/
│   ├── llama.cpp/            # llama.cpp source
│   ├── ggml/                 # ggml source
│   └── wasm/                 # WASM bindings
├── tests/
│   └── webgpu/               # WebGPU test suite (146 tests)
└── dist/
    └── wasm/                 # Compiled WASM files
```

## Testing
```sh
# Run all tests
npm test

# Run WebGPU tests only
npm test -- --testPathPattern=webgpu

# Run specific test file
npm test -- --testPathPattern=webgpu/tensor
```

Test coverage: 228 tests across 10 test suites.
## Requirements
- Node.js 18+
- WebGPU (for GPU acceleration): Supported in Node.js via Dawn bindings
- Emscripten (for building WASM): Only needed if building from source
## License
MIT
## Credits
Written by Daniel Suissa, Loxia.ai
Visit us at https://autopilot.loxia.ai
