@fugood/node-llama-wasm
v1.7.8
WebAssembly package for llama.node browser inference
This package exposes the same high-level `loadModel()` and context methods used
by `@fugood/llama.node`, with browser-specific I/O:

- `model` strings are fetched as URLs by default.
- URL downloads are saved in browser Cache Storage by default, including model,
  session, mmproj, and media URLs. Use `wasm: { cacheDownloads: false }` to
  force network fetches, `wasm.cacheName` to isolate a cache, or
  `clearWasmDownloadCache()` to clear the default cache. The `loadModel()`
  progress callback receives the percentage plus an optional detail object with
  `source: 'network' | 'cache' | 'memory' | 'buffer'`.
- `saveSession()` returns an `ArrayBuffer`. `loadSession()` accepts a URL,
  `Blob`, `ArrayBuffer`, or typed array.
- `initMultimodal()` accepts an mmproj URL, `Blob`, `ArrayBuffer`, typed array,
  or preloaded MEMFS path. Image/audio URL media in `messages` or `media_paths`
  is staged into the virtual filesystem before inference.
- WebGPU can be opted into with `n_gpu_layers` when the WASM binary is built
  with `GGML_WEBGPU=ON`, `navigator.gpu` is available, and the browser exposes
  WebAssembly JSPI (`WebAssembly.promising` and `WebAssembly.Suspending`).
- The distributed build uses WebAssembly Memory64, with the same constraints as
  wllama. Browsers without Memory64 support are not supported.
- `loadModel()` uses a dedicated Web Worker by default so WASM work does not
  block the browser UI thread. On isolated pages with `SharedArrayBuffer`, the
  CPU path selects the pthread artifact and defaults `n_threads` to
  `min(4, navigator.hardwareConcurrency)`. Use `wasm: { threads: false }` for
  the single-thread fallback, or set `n_threads` / `wasm.maxThreads` to tune
  CPU threading. Use `wasm: { worker: false }` only for direct
  Emscripten-module debugging or integration code that must run on the current
  thread.

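
The runtime conditions in the list above (WebGPU availability, JSPI, and the thread default on isolated pages) can be sketched as a small probe. This is an illustration only, not the package's implementation: `probeWasmCapabilities` and its result fields are hypothetical names, the `crossOriginIsolated` check is an assumption about what "isolated pages" means here, and whether the binary was built with `GGML_WEBGPU=ON` cannot be detected this way. In application code, prefer the exported `isWebGpuSupported()`.

```javascript
// Illustrative capability probe; NOT the package's own implementation.
// `env` defaults to globalThis so the checks can be exercised with stub objects.
function probeWasmCapabilities(env = globalThis) {
  const wasm = env.WebAssembly

  // JSPI: both entry points named in the README must be present.
  const hasJspi =
    !!wasm &&
    typeof wasm.promising === 'function' &&
    typeof wasm.Suspending === 'function'

  // WebGPU: the browser must expose navigator.gpu.
  const hasWebGpu = !!(env.navigator && env.navigator.gpu)

  // Threads: assume SharedArrayBuffer on a cross-origin isolated page.
  const hasThreads =
    typeof env.SharedArrayBuffer === 'function' &&
    env.crossOriginIsolated === true

  const cores = (env.navigator && env.navigator.hardwareConcurrency) || 1

  return {
    webgpuEligible: hasJspi && hasWebGpu,
    threadsEligible: hasThreads,
    // Mirrors the documented default of min(4, navigator.hardwareConcurrency).
    defaultThreads: hasThreads ? Math.min(4, cores) : 1,
  }
}
```

Passing a stub object instead of `globalThis` makes the checks easy to exercise outside a browser.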
Large model files at or above the browser WebAssembly ArrayBuffer limit are
rejected. Split large GGUF files into shards, preferably 512 MB or smaller.
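
As a rough planning aid, the number of shards for a given file size can be estimated as follows. This is a sketch under assumptions: `shardCountFor` is a hypothetical helper, "512 MB" is read as 512 MiB, and real GGUF splitting tools cut on tensor boundaries, so the actual number of shards may be higher.

```javascript
// Assumption: "512 MB" is treated as 512 MiB (512 * 1024 * 1024 bytes).
const DEFAULT_SHARD_BYTES = 512 * 1024 * 1024

// Hypothetical helper: estimates how many shards of at most `shardBytes`
// are needed to hold a model file of `totalBytes`.
function shardCountFor(totalBytes, shardBytes = DEFAULT_SHARD_BYTES) {
  if (!Number.isFinite(totalBytes) || totalBytes < 0) {
    throw new RangeError('totalBytes must be a non-negative finite number')
  }
  if (!Number.isFinite(shardBytes) || shardBytes <= 0) {
    throw new RangeError('shardBytes must be a positive finite number')
  }
  // Always report at least one shard, even for an empty file.
  return Math.max(1, Math.ceil(totalBytes / shardBytes))
}
```

For example, a 3 GiB file needs an estimated `shardCountFor(3 * 1024 ** 3)` shards at the default size.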
```js
import {
  clearWasmDownloadCache,
  isWebGpuSupported,
  loadModel,
} from '@fugood/node-llama-wasm'

const context = await loadModel({
  model: 'https://huggingface.co/Durlabh/gemma-270m-q4-k-m-gguf/resolve/main/gemma3-270m-it-q4_k_m.gguf',
  n_ctx: 2048,
  n_gpu_layers: isWebGpuSupported() ? 99999 : 0,
})

const tokens = await context.tokenize('Hello')
const text = await context.detokenize(tokens.tokens)
const state = await context.saveSession()
await context.loadSession(new Blob([state]))
const result = await context.completion({ prompt: text, n_predict: 32 })

// Optional when you want to force the next run to fetch URLs again.
await clearWasmDownloadCache()
```

Build from the repository root:

```bash
npm run build-wasm
npm run build-wasm-docker
npm run build-wasm -- --webgpu
npm run serve-wasm-test
```

The build script keeps CPU and WebGPU artifacts in separate build directories,
uses Ninja on fresh build dirs when available, respects `JOBS`, and enables
ccache automatically when installed. It also stores Emscripten's system-library
cache in `build-wasm/emcache` unless `EM_CACHE` is already set. The Docker
helper selects `emscripten/emsdk:4.0.14-arm64` on arm64 hosts such as Apple
Silicon Macs, and `emscripten/emsdk:4.0.13` on amd64 hosts. Override with
`EMSCRIPTEN_IMAGE` or `EMSCRIPTEN_PLATFORM` when a specific image is required.
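
The image selection described above can be pictured roughly like this. It is a sketch, not the actual Docker helper: `pickEmsdkImage` is a hypothetical function, the architecture strings follow Node's `process.arch` naming (`x64` for amd64), the handling of other architectures is an assumption, and the effect of `EMSCRIPTEN_PLATFORM` is not modeled because the README does not specify it.

```javascript
// Hypothetical illustration of the documented defaults; not the real helper.
function pickEmsdkImage(arch, env = {}) {
  // An explicit EMSCRIPTEN_IMAGE override wins over the per-architecture default.
  if (env.EMSCRIPTEN_IMAGE) return env.EMSCRIPTEN_IMAGE

  if (arch === 'arm64') return 'emscripten/emsdk:4.0.14-arm64'
  if (arch === 'x64' || arch === 'amd64') return 'emscripten/emsdk:4.0.13'

  // Assumption: other hosts need an explicit image, since no default is documented.
  throw new Error(`No default emsdk image for "${arch}"; set EMSCRIPTEN_IMAGE`)
}
```

In a Node script this could be called as `pickEmsdkImage(process.arch, process.env)`.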
