
@loxia-labs/spitfire

v1.2.1


Standalone Node.js module for running LLMs locally - no external dependencies


Spitfire

Standalone Node.js module for running LLMs locally. No external dependencies are required: it includes both WebAssembly and WebGPU inference engines built from scratch.

Features

  • Truly Standalone - No Ollama or external runtime required
  • WebGPU Acceleration - GPU-accelerated inference with automatic fallback
  • WebAssembly Engine - Cross-platform CPU inference, runs anywhere Node.js runs
  • Ollama-compatible - Works with existing GGUF models
  • Quantization Support - Q4_0, Q4_K, Q5_K, Q6_K, Q8_0, F16, F32 dequantization on GPU
  • Chat Templates - Automatic Jinja2 chat template support (Qwen2, Llama, etc.)
  • HTTP API - Drop-in replacement for Ollama's REST API
  • Programmatic API - Clean TypeScript/JavaScript API
  • Streaming - Full streaming support for text generation

Architecture

┌─────────────────────────────────────────────────────────────┐
│                     Your Application                         │
├─────────────────────────────────────────────────────────────┤
│    Spitfire API    │    SpitfireServer (HTTP/Fastify)       │
├────────────────────┴────────────────────────────────────────┤
│                      Engine Factory                          │
│              (Auto-detects best available)                   │
├──────────────────────────┬──────────────────────────────────┤
│      WebGPU Engine       │         WASM Engine              │
│   (GPU Acceleration)     │   (llama.cpp via Emscripten)     │
│  ┌────────────────────┐  │  ┌────────────────────────────┐  │
│  │ Tensor Ops (WGSL)  │  │  │    llama.cpp (WASM)        │  │
│  │ Attention/FFN      │  │  │    SIMD128 optimized       │  │
│  │ Quantization       │  │  └────────────────────────────┘  │
│  │ GGUF Loader        │  │                                  │
│  └────────────────────┘  │                                  │
├──────────────────────────┴──────────────────────────────────┤
│                      Node.js / V8                            │
└─────────────────────────────────────────────────────────────┘

No subprocesses, no external binaries - everything runs in-process.

Installation

npm install @loxia-labs/spitfire

Quick Start

Automatic Engine Selection

import { createBestEngine } from '@loxia-labs/spitfire';

// Automatically uses WebGPU if available, falls back to WASM
const engine = await createBestEngine();

await engine.loadModel('/path/to/model.gguf');

const result = await engine.generate('Hello, how are you?', {
  maxTokens: 100,
  temperature: 0.8
});
console.log(result.text);

await engine.shutdown();

Using WebGPU Engine (GPU Acceleration)

import { createWebGPUEngine } from '@loxia-labs/spitfire';

const engine = createWebGPUEngine();

// Initialize GPU device
await engine.init();

// Check GPU capabilities
const caps = engine.getCapabilities();
console.log(`Max buffer size: ${caps.maxBufferSize / 1024 / 1024} MB`);

// Load and run model
await engine.loadModel('/path/to/model.gguf', {
  contextLength: 2048
});

const result = await engine.generate('Explain quantum computing', {
  maxTokens: 200,
  temperature: 0.7,
  topK: 40
});

console.log(result.text);
await engine.shutdown();

Using WASM Engine (CPU)

import { createWasmEngine } from '@loxia-labs/spitfire';

const engine = createWasmEngine();

await engine.init();
await engine.loadModel('/path/to/model.gguf', {
  contextLength: 2048,
  numThreads: 4
});

const result = await engine.generate('Hello!', {
  maxTokens: 100
});

await engine.shutdown();

Chat Templates

Spitfire automatically applies chat templates from GGUF metadata. For models like Qwen2, your prompt is automatically formatted:

// Your input:
const result = await engine.generate('What is 2+2?');

// Automatically formatted as:
// <|im_start|>system
// You are Qwen, created by Alibaba Cloud...
// <|im_end|>
// <|im_start|>user
// What is 2+2?<|im_end|>
// <|im_start|>assistant

If you want to handle formatting yourself, use rawPrompt: true:

const prompt = '<|im_start|>user\nHello<|im_end|>\n<|im_start|>assistant\n';
const result = await engine.generate(prompt, {
  rawPrompt: true  // Skip automatic chat template
});
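For reference, the ChatML-style format shown above can also be produced by hand when using rawPrompt. The helper below is purely illustrative and not part of Spitfire's API:

```typescript
// Hypothetical helper (not part of the Spitfire API): builds a Qwen2/ChatML
// prompt from a message list, matching the format shown above.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

function toChatML(messages: ChatMessage[]): string {
  const turns = messages
    .map((m) => `<|im_start|>${m.role}\n${m.content}<|im_end|>`)
    .join('\n');
  // End with an open assistant turn so the model continues from there.
  return `${turns}\n<|im_start|>assistant\n`;
}

const prompt = toChatML([
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'What is 2+2?' },
]);
```

The resulting string can then be passed to engine.generate with rawPrompt: true.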

Using the High-Level API

import { Spitfire } from '@loxia-labs/spitfire';

const spitfire = new Spitfire();

// Generate completion
const response = await spitfire.generate({
  model: 'llama3.2',
  prompt: 'What is the meaning of life?'
});
console.log(response.response);

// Chat
const chat = await spitfire.chat({
  model: 'llama3.2',
  messages: [
    { role: 'user', content: 'Hello!' }
  ]
});
console.log(chat.message.content);

await spitfire.shutdown();

HTTP Server

import { SpitfireServer } from '@loxia-labs/spitfire';

const server = new SpitfireServer({ port: 11434 });
await server.start();

// Now accessible at http://localhost:11434
// Compatible with Ollama API clients
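Because the server mirrors Ollama's REST API, any Ollama client should work against it. A minimal client sketch using fetch against the standard /api/generate endpoint (the model name and host here are placeholders for your setup):

```typescript
// Minimal client sketch for the Ollama-compatible /api/generate endpoint.
interface GenerateRequest {
  model: string;
  prompt: string;
  stream: boolean;
}

function buildGenerateRequest(model: string, prompt: string): GenerateRequest {
  // stream: false requests a single JSON response instead of NDJSON chunks.
  return { model, prompt, stream: false };
}

async function generate(host: string, req: GenerateRequest): Promise<string> {
  const res = await fetch(`${host}/api/generate`, {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify(req),
  });
  const data = await res.json();
  return data.response; // Ollama-style responses carry the text in `response`
}

// Usage (requires a running server):
// const text = await generate('http://localhost:11434',
//   buildGenerateRequest('llama3.2', 'Hello!'));
```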

CLI

# List models
spitfire list

# Run inference
spitfire run /path/to/model.gguf "Hello!"

# Start HTTP server
spitfire serve --port 11434

Building from Source

Prerequisites

For WASM Engine: Install Emscripten SDK:

git clone https://github.com/emscripten-core/emsdk.git
cd emsdk
./emsdk install latest
./emsdk activate latest
source ./emsdk_env.sh

For WebGPU Engine: WebGPU support is included via the webgpu npm package (Dawn bindings).

Build

cd spitfire
npm install
npm run build:wasm  # Build WASM engine
npm run build:ts    # Build TypeScript

API Reference

Engine Factory

import {
  createEngine,
  createBestEngine,
  detectBestEngine
} from '@loxia-labs/spitfire';

// Create specific engine type
const engine = createEngine({ type: 'webgpu' }); // or 'wasm'

// Auto-detect and create best engine
const bestEngine = await createBestEngine();

// Just detect without creating
const engineType = await detectBestEngine(); // 'webgpu' or 'wasm'

WebGPUEngine

GPU-accelerated inference engine.

const engine = createWebGPUEngine();

await engine.init();
engine.getCapabilities();  // GPU limits and features

await engine.loadModel(path, {
  contextLength?: number;
  batchSize?: number;
});

await engine.generate(prompt, {
  maxTokens?: number;
  temperature?: number;
  topK?: number;
  topP?: number;
  stop?: string[];
  rawPrompt?: boolean;  // Skip chat template formatting
});

await engine.embed(text);  // Get embeddings
await engine.unload();     // Unload model
await engine.shutdown();   // Release GPU
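The temperature and topK options above refer to standard sampling over the model's output logits. A minimal CPU-side sketch of the technique (illustrative only, not Spitfire's internal implementation):

```typescript
// Illustrative top-k sampling with temperature over raw logits.
// Not Spitfire's internal code - just the standard technique the
// `temperature` and `topK` options refer to.
function sampleTopK(
  logits: number[],
  topK: number,
  temperature: number,
  rand: () => number = Math.random
): number {
  // Scale logits by temperature and keep only the k most likely tokens.
  const candidates = logits
    .map((l, i) => ({ i, l: l / temperature }))
    .sort((a, b) => b.l - a.l)
    .slice(0, topK);

  // Numerically stable softmax over the surviving candidates.
  const max = candidates[0].l;
  const exps = candidates.map((c) => Math.exp(c.l - max));
  const sum = exps.reduce((a, b) => a + b, 0);

  // Sample an index from the renormalized distribution.
  let r = rand() * sum;
  for (let k = 0; k < candidates.length; k++) {
    r -= exps[k];
    if (r <= 0) return candidates[k].i;
  }
  return candidates[candidates.length - 1].i;
}
```

With topK: 1 this degenerates to greedy decoding; higher temperatures flatten the distribution and increase diversity.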

WasmEngine

CPU-based inference engine.

const engine = createWasmEngine({
  wasmPath?: string;
  numThreads?: number;
});

await engine.init();
await engine.loadModel(path, options);
await engine.generate(prompt, options);
await engine.embed(text);
await engine.shutdown();

Spitfire (High-Level API)

const spitfire = new Spitfire({
  modelsPath?: string;
  maxLoadedModels?: number;
  defaultKeepAlive?: string;
  engineType?: 'webgpu' | 'wasm' | 'auto';
});

await spitfire.generate(request);
await spitfire.chat(request);
await spitfire.embed(request);
await spitfire.list();
await spitfire.shutdown();

SpitfireServer (HTTP API)

const server = new SpitfireServer({
  host?: string;  // Default: '127.0.0.1'
  port?: number;  // Default: 11434
});

await server.start();
await server.stop();

Model Compatibility

Spitfire works with GGUF model files. Supported quantization formats:

| Format | WebGPU | WASM |
|--------|--------|------|
| F32 | ✅ Yes | ✅ Yes |
| F16 | ✅ Yes | ✅ Yes |
| Q8_0 | ✅ Yes | ✅ Yes |
| Q4_0 | ✅ Yes | ✅ Yes |
| Q4_K | ✅ Yes | ✅ Yes |
| Q5_K | ✅ Yes | ✅ Yes |
| Q6_K | ✅ Yes | ✅ Yes |
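For context on what dequantization involves: a GGUF Q4_0 block stores 32 weights as 4-bit integers plus a single per-block scale, reconstructed as w = scale * (q - 8). A simplified sketch, using a plain number for the scale rather than the on-disk f16:

```typescript
// Simplified Q4_0 dequantization sketch. A real GGUF Q4_0 block is a 16-bit
// float scale followed by 16 bytes, each packing two 4-bit quants; here the
// scale is already a plain number for clarity.
function dequantizeQ4_0(scale: number, packed: Uint8Array): Float32Array {
  const half = packed.length; // 16 bytes per block -> 32 weights
  const out = new Float32Array(half * 2);
  for (let i = 0; i < half; i++) {
    // llama.cpp order: low nibbles fill the first half of the block,
    // high nibbles the second half; both carry an implicit offset of 8.
    out[i] = scale * ((packed[i] & 0x0f) - 8);
    out[i + half] = scale * ((packed[i] >> 4) - 8);
  }
  return out;
}
```

The K-quants (Q4_K, Q5_K, Q6_K) use the same idea but with super-blocks and per-sub-block scales and minimums.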

Tested Models:

  • Qwen2.5-Coder-3B-Instruct (Q4_K_M, Q6_K)
  • Llama 3.2 (various quantizations)
  • Other GGUF-compatible models

Get models from:

  1. HuggingFace - Many quantized models available
  2. Ollama models - Copy from ~/.ollama/models/
  3. Convert your own - Use llama.cpp's conversion tools

Performance

| Metric | WebGPU Engine | WASM Engine |
|--------|---------------|-------------|
| Speed | ~80% of native | ~70-85% of native |
| Memory | GPU VRAM | Up to 4GB |
| GPU | Required | Not used |
| Threading | GPU parallel | Multi-thread CPU |
| SIMD | GPU compute | WASM SIMD128 |

WebGPU Optimizations

  • Tiled matrix multiplication (8x8 tiles)
  • Shader caching and precompilation
  • Buffer pooling for memory reuse
  • Numerically stable softmax
  • Fused attention kernels
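The "numerically stable softmax" item refers to the standard max-subtraction trick, which keeps the exponentials bounded and avoids overflow for large logits. In plain TypeScript:

```typescript
// Numerically stable softmax: subtracting the max logit before exp()
// keeps every exponential in (0, 1] and avoids overflow to Infinity.
function softmax(logits: number[]): number[] {
  const max = Math.max(...logits);
  const exps = logits.map((l) => Math.exp(l - max));
  const sum = exps.reduce((a, b) => a + b, 0);
  return exps.map((e) => e / sum);
}
```

Without the subtraction, an input like [1000, 1001] would produce Infinity / Infinity = NaN; with it, the result is a well-defined probability distribution.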

Project Structure

spitfire/
├── src/
│   ├── engine/
│   │   ├── index.ts           # Engine factory
│   │   ├── wasm-engine.ts     # WASM inference
│   │   ├── webgpu-engine.ts   # WebGPU inference
│   │   └── webgpu/
│   │       ├── device.ts      # GPU device management
│   │       ├── buffer.ts      # Buffer utilities
│   │       ├── shader.ts      # WGSL shader compilation
│   │       ├── tensor.ts      # GPU Tensor class
│   │       ├── ops/           # Tensor operations
│   │       ├── layers/        # Transformer layers
│   │       ├── quant/         # Quantization support
│   │       ├── model/         # GGUF loading
│   │       └── perf/          # Performance monitoring
│   ├── types/                 # TypeScript definitions
│   ├── model/                 # Model management
│   ├── server/                # HTTP API
│   └── spitfire.ts            # Main API class
├── native/
│   ├── llama.cpp/             # llama.cpp source
│   ├── ggml/                  # ggml source
│   └── wasm/                  # WASM bindings
├── tests/
│   └── webgpu/                # WebGPU test suite (146 tests)
└── dist/
    └── wasm/                  # Compiled WASM files

Testing

# Run all tests
npm test

# Run WebGPU tests only
npm test -- --testPathPattern=webgpu

# Run specific test file
npm test -- --testPathPattern=webgpu/tensor

Test Coverage: 228 tests across 10 test suites

Requirements

  • Node.js 18+
  • WebGPU (for GPU acceleration): Supported in Node.js via Dawn bindings
  • Emscripten (for building WASM): Only needed if building from source

License

MIT

Credits

Written by Daniel Suissa, Loxia.ai

Visit us at https://autopilot.loxia.ai

Acknowledgments

  • llama.cpp - The inference engine
  • ggml - Tensor library
  • Ollama - API design inspiration
  • WebLLM - WebGPU LLM inspiration
  • Dawn - WebGPU implementation for Node.js