inferis-ml
v1.0.3
Worker pool for running AI models in the browser — WebGPU/WASM auto-detection, model lifecycle, streaming, cross-tab dedup
Run AI models in the browser. No server, no per-request cost, no data leaving the device.
Live Demo — try it in your browser.
import { createPool } from 'inferis-ml';
import { transformersAdapter } from 'inferis-ml/adapters/transformers';
const pool = await createPool({ adapter: transformersAdapter() });
const model = await pool.load<number[][]>('feature-extraction', {
  model: 'mixedbread-ai/mxbai-embed-xsmall-v1',
});
const embeddings = await model.run(['Hello world', 'Another sentence']);
Why
Existing browser runtimes (transformers.js, web-llm, onnxruntime-web) give you inference but leave everything else to you — worker management, postMessage boilerplate, model lifecycle, memory budgets, cross-tab dedup, WebGPU fallback, streaming.
inferis-ml handles all of it. You get a clean async API and focus on the product.
| Problem | Without inferis-ml | With inferis-ml |
|---------|-------------------|-----------------|
| UI freezes during inference | Main thread blocked | Runs in Web Workers |
| 5 tabs = 5 model copies | 10 GB RAM, browser crashes | crossTab: true — one shared copy |
| WebGPU not everywhere | Manual detection + swap | defaultDevice: 'auto' |
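The defaultDevice: 'auto' row can be pictured with a small sketch. This is not the library's implementation; the Capabilities shape below is a hypothetical stand-in that borrows only the webgpu.supported and wasm.simd fields shown by detectCapabilities(), and resolveDevice is an illustrative name.

```typescript
// Hypothetical types: only `webgpu.supported` and `wasm.simd` come from the README.
type Device = 'webgpu' | 'wasm';
type DevicePreference = Device | 'auto';

interface Capabilities {
  webgpu: { supported: boolean };
  wasm: { simd: boolean };
}

// Resolve a device preference to a concrete device.
// Explicit choices pass through unchanged; 'auto' prefers WebGPU and
// falls back to WASM when WebGPU is not available.
function resolveDevice(preference: DevicePreference, caps: Capabilities): Device {
  if (preference !== 'auto') return preference;
  return caps.webgpu.supported ? 'webgpu' : 'wasm';
}

const modernLaptop: Capabilities = { webgpu: { supported: true }, wasm: { simd: true } };
const olderBrowser: Capabilities = { webgpu: { supported: false }, wasm: { simd: true } };

console.log(resolveDevice('auto', modernLaptop)); // webgpu
console.log(resolveDevice('auto', olderBrowser)); // wasm
```

Keeping the decision in one pure function makes the fallback easy to test without a real GPU.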
Install
npm install inferis-ml
# Pick your adapter (peer deps):
npm install @huggingface/transformers # transformersAdapter
npm install @mlc-ai/web-llm # webLlmAdapter
npm install onnxruntime-web # onnxAdapter
Quick Start
LLM Streaming
import { createPool } from 'inferis-ml';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';
const pool = await createPool({
  adapter: webLlmAdapter(),
  defaultDevice: 'webgpu',
  maxWorkers: 1,
});
const llm = await pool.load<string>('text-generation', {
  model: 'Llama-3.2-3B-Instruct-q4f32_1-MLC',
  onProgress: ({ phase }) => console.log(phase),
});
const stream = llm.stream({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain WebGPU in 3 sentences.' },
  ],
});
// Append each token to the page as it arrives.
for await (const token of stream) {
  output.textContent += token;
}
Speech Transcription
const transcriber = await pool.load<{ text: string }>('automatic-speech-recognition', {
  model: 'openai/whisper-base',
  estimatedMemoryMB: 80,
});
const result = await transcriber.run(audioData);
console.log(result.text);
Abort Inference
const ctrl = new AbortController();
stopButton.onclick = () => ctrl.abort();
try {
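  for await (const token of llm.stream(input, { signal: ctrl.signal })) {
    output.textContent += token;
  }
} catch (e) {
  // Treat a user-initiated abort as a normal stop; rethrow anything else.
  if (e instanceof Error && e.name === 'AbortError') output.textContent += ' [stopped]';
  else throw e;
}
Cross-Tab Deduplication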
const pool = await createPool({
  adapter: transformersAdapter(),
  crossTab: true, // SharedWorker > leader election > per-tab fallback
});
Model State Changes
model.onStateChange((state) => {
  if (state === 'loading') showSpinner();
  if (state === 'ready') hideSpinner();
  if (state === 'error') showError('Failed to load model');
  if (state === 'disposed') disableUI();
});
Features
- Runtime-agnostic: adapters for @huggingface/transformers, @mlc-ai/web-llm, onnxruntime-web, or your own
- Zero framework deps: works with React, Vue, Svelte, or vanilla JS
- WebGPU -> WASM fallback: auto-detected or configured explicitly
- Streaming: ReadableStream + for await for token-by-token output
- Memory budget: LRU eviction when models exceed the configured cap
- Cross-tab dedup: SharedWorker (tier 1), leader election (tier 2), per-tab (tier 3)
- AbortController: cancel any in-flight inference
- TypeScript: full type safety, generic output types
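The memory budget item can be illustrated with a minimal LRU sketch. It assumes eviction works on per-model memory estimates and a single cap in MB; the MemoryBudget class and its methods are hypothetical names, not part of the inferis-ml API.

```typescript
interface LoadedModel {
  id: string;
  memoryMB: number;
}

// Hypothetical helper: tracks loaded models and evicts the least recently
// used ones when a new model would exceed the cap.
class MemoryBudget {
  // A Map preserves insertion order, so the first key is the least recently used.
  private models = new Map<string, LoadedModel>();
  private maxMemoryMB: number;

  constructor(maxMemoryMB: number) {
    this.maxMemoryMB = maxMemoryMB;
  }

  get usedMB(): number {
    let total = 0;
    for (const m of this.models.values()) total += m.memoryMB;
    return total;
  }

  // Mark a model as recently used by moving it to the end of the map.
  touch(id: string): void {
    const m = this.models.get(id);
    if (!m) return;
    this.models.delete(id);
    this.models.set(id, m);
  }

  // Register a model, evicting LRU entries until it fits. Returns evicted IDs.
  // A model larger than the cap evicts everything and is still registered.
  add(model: LoadedModel): string[] {
    const evicted: string[] = [];
    this.models.delete(model.id);
    while (this.models.size > 0 && this.usedMB + model.memoryMB > this.maxMemoryMB) {
      const oldest = this.models.keys().next().value as string;
      this.models.delete(oldest);
      evicted.push(oldest);
    }
    this.models.set(model.id, model);
    return evicted;
  }
}

const budget = new MemoryBudget(100);
budget.add({ id: 'feature-extraction:embedder', memoryMB: 30 });
budget.add({ id: 'automatic-speech-recognition:whisper', memoryMB: 60 });
budget.touch('feature-extraction:embedder'); // the embedder is now the most recent
const evicted = budget.add({ id: 'text-classification:sentiment', memoryMB: 50 });

console.log(evicted); // [ 'automatic-speech-recognition:whisper' ]
console.log(budget.usedMB); // 80
```

The sketch also shows why accurate estimates matter: eviction decisions are only as good as the memoryMB values fed in.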
API Reference
createPool(config)
const pool = await createPool({
  adapter: transformersAdapter(), // required
  workerUrl: new URL('inferis-ml/worker', import.meta.url),
  maxWorkers: navigator.hardwareConcurrency - 1,
  maxMemoryMB: 2048,
  defaultDevice: 'auto', // 'webgpu' | 'wasm' | 'auto'
  crossTab: false,
  taskTimeout: 120_000,
});
pool.load<TOutput>(task, config)
Loads a model and returns a ModelHandle. If already loaded, returns the existing handle.
const model = await pool.load<number[][]>('feature-extraction', {
  model: 'mixedbread-ai/mxbai-embed-xsmall-v1',
  estimatedMemoryMB: 30,
  onProgress: (p) => { ... },
});
ModelHandle<TOutput>
| Method | Description |
|--------|-------------|
| run(input, options?) | Non-streaming inference. Returns Promise<TOutput>. |
| stream(input, options?) | Streaming inference. Returns ReadableStream<TOutput>. |
| dispose() | Unload model and free memory. |
| onStateChange(cb) | Subscribe to state changes. Returns unsubscribe function. |
| id | Unique model ID (task:model). |
| state | Current state: idle \| loading \| ready \| inferring \| unloading \| error \| disposed. |
| memoryMB | Approximate memory usage. |
| device | Resolved device: webgpu or wasm. |
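Because a handle's id is task:model and pool.load returns the existing handle for an already loaded model, the lookup can be pictured as a keyed registry. The sketch below is an assumption about how such deduplication might work, not the library's code; HandleRegistry and getOrCreate are hypothetical names.

```typescript
// Hypothetical registry: caches whatever `create` returns under the key `task:model`.
class HandleRegistry<T> {
  private handles = new Map<string, T>();

  // Return the cached handle for this task and model, creating it on first use.
  getOrCreate(task: string, model: string, create: () => T): T {
    const id = `${task}:${model}`;
    const existing = this.handles.get(id);
    if (existing !== undefined) return existing;
    const handle = create();
    this.handles.set(id, handle);
    return handle;
  }

  // Forget a handle so the next request creates a fresh one.
  remove(task: string, model: string): boolean {
    return this.handles.delete(`${task}:${model}`);
  }
}

let loads = 0;
const registry = new HandleRegistry<Promise<string>>();
const fakeLoad = (): Promise<string> => {
  loads += 1;
  return Promise.resolve('handle');
};

const first = registry.getOrCreate('feature-extraction', 'mixedbread-ai/mxbai-embed-xsmall-v1', fakeLoad);
const second = registry.getOrCreate('feature-extraction', 'mixedbread-ai/mxbai-embed-xsmall-v1', fakeLoad);

console.log(first === second, loads); // true 1
```

Storing the promise rather than the resolved value means two callers that request the same model at the same time share one in-flight load.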
InferenceOptions
interface InferenceOptions {
  signal?: AbortSignal;
  priority?: 'high' | 'normal' | 'low';
}
detectCapabilities()
import { detectCapabilities } from 'inferis-ml';
const caps = await detectCapabilities();
if (caps.webgpu.supported) {
  console.log('GPU vendor:', caps.webgpu.adapter?.vendor);
} else {
  console.log('WASM SIMD:', caps.wasm.simd);
}
Custom Adapter
import type { ModelAdapter, ModelAdapterFactory } from 'inferis-ml';
export function myCustomAdapter(): ModelAdapterFactory {
  return {
    name: 'my-adapter',
    async create(): Promise<ModelAdapter> {
      // Load the runtime lazily, only when the adapter is created.
      const { MyRuntime } = await import('my-runtime');
      // MyRuntime is a runtime value here, so derive its instance type for the casts below.
      type Runtime = InstanceType<typeof MyRuntime>;
      return {
        name: 'my-adapter',
        estimateMemoryMB(_task, config) {
          // Fall back to 50 MB when the caller gives no estimate.
          return (config.estimatedMemoryMB as number | undefined) ?? 50;
        },
        async load(task, config, device, onProgress) {
          onProgress({ phase: 'loading', loaded: 0, total: 1 });
          const instance = await MyRuntime.load(config.model as string, { device });
          onProgress({ phase: 'done', loaded: 1, total: 1 });
          return { instance, memoryMB: 50 };
        },
        async run(model, input) {
          return (model.instance as Runtime).infer(input);
        },
        async stream(model, input, onChunk) {
          for await (const chunk of (model.instance as Runtime).stream(input)) {
            onChunk(chunk);
          }
        },
        async unload(model) {
          await (model.instance as Runtime).dispose();
        },
      };
    },
  };
}
Framework Integrations
Official bindings with idiomatic APIs for popular frameworks:
| Package | Install | Docs |
|---------|---------|------|
| inferis-react | npm i inferis-react | README |
| inferis-vue | npm i inferis-vue | README |
| inferis-svelte | npm i inferis-svelte | README |
Each package provides context/provider setup, model lifecycle management, streaming, capability detection, and memory monitoring -- all wired into the framework's reactivity system.
// React
const { text, start } = useStream(model);
// Vue
const { text, start } = useStream(model);
// Svelte
const { text, start } = useStream(model); // $text in template
Bundler & Framework Setup
inferis-ml is browser-only. In SSR frameworks, ensure initialization runs only on the client.
Vite
// vite.config.ts
export default {
  worker: { format: 'es' },
};
webpack 5
// webpack.config.js
module.exports = {
  experiments: { asyncWebAssembly: true },
};
Next.js
'use client';
import { useEffect, useState } from 'react';
import type { WorkerPoolInterface } from 'inferis-ml';
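export default function AI() {
  const [pool, setPool] = useState<WorkerPoolInterface | null>(null);
  useEffect(() => {
    // Import on the client only; the adapter is created the same way as in createPool(config).
    Promise.all([
      import('inferis-ml'),
      import('inferis-ml/adapters/transformers'),
    ])
      .then(([{ createPool }, { transformersAdapter }]) =>
        createPool({ adapter: transformersAdapter() })
      )
      .then(setPool);
  }, []);
  if (!pool) return <p>Loading...</p>;
  // use pool
}
Nuxt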
<template>
  <ClientOnly>
    <InferenceComponent />
  </ClientOnly>
</template>
// composables/useInferis.ts
export async function useInferis() {
  const { createPool } = await import('inferis-ml');
  const { transformersAdapter } = await import('inferis-ml/adapters/transformers');
  return createPool({ adapter: transformersAdapter() });
}
SvelteKit
import { browser } from '$app/environment';
let pool;
if (browser) {
  const { createPool } = await import('inferis-ml');
  const { transformersAdapter } = await import('inferis-ml/adapters/transformers');
  pool = await createPool({ adapter: transformersAdapter() });
}
Popular Models
Models download from Hugging Face Hub on first use and are cached in the browser's Cache API. Subsequent loads come from the cache (typically 1-3 s instead of 5-60 s) and work offline.
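The cache-first flow can be sketched as follows. To keep the example runnable outside a browser, it uses a hypothetical ModelCache interface and an in-memory implementation rather than the real Cache API; loadModelBytes, createMemoryCache, and the example.com URL are illustrative assumptions, not inferis-ml internals.

```typescript
// Hypothetical cache contract; in a browser the Cache API could back it.
interface ModelCache {
  get(url: string): Promise<Uint8Array | undefined>;
  put(url: string, bytes: Uint8Array): Promise<void>;
}

type Downloader = (url: string) => Promise<Uint8Array>;

interface LoadResult {
  bytes: Uint8Array;
  source: 'cache' | 'network';
}

// Cache-first lookup: serve cached bytes when present, otherwise download and store.
async function loadModelBytes(url: string, cache: ModelCache, download: Downloader): Promise<LoadResult> {
  const cached = await cache.get(url);
  if (cached) return { bytes: cached, source: 'cache' };
  const bytes = await download(url);
  await cache.put(url, bytes);
  return { bytes, source: 'network' };
}

// In-memory stand-in for a persistent cache, used only for this demo.
function createMemoryCache(): ModelCache {
  const store = new Map<string, Uint8Array>();
  return {
    async get(url) { return store.get(url); },
    async put(url, bytes) { store.set(url, bytes); },
  };
}

async function demo(): Promise<string[]> {
  const cache = createMemoryCache();
  let downloads = 0;
  const download: Downloader = async () => {
    downloads += 1;
    return new Uint8Array([1, 2, 3]); // placeholder bytes, not a real model
  };
  const url = 'https://example.com/model.onnx'; // hypothetical URL
  const first = await loadModelBytes(url, cache, download);
  const second = await loadModelBytes(url, cache, download);
  return [first.source, second.source, String(downloads)];
}

demo().then((result) => console.log(result)); // [ 'network', 'cache', '1' ]
```

The second call never reaches the downloader, which is why cached loads keep working without a network connection.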
Embeddings / Semantic Search
| Model | Size | Notes |
|-------|------|-------|
| mixedbread-ai/mxbai-embed-xsmall-v1 | 23 MB | Best quality/size for English |
| Xenova/all-MiniLM-L6-v2 | 23 MB | Popular general-purpose English model |
| Xenova/multilingual-e5-small | 118 MB | 100+ languages |
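For semantic search, the number[][] output of a feature-extraction model is usually compared with cosine similarity. The sketch below uses toy 3-dimensional vectors in place of real embeddings; cosineSimilarity and rankBySimilarity are illustrative helpers, not part of inferis-ml.

```typescript
// Cosine similarity between two vectors of equal length.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('Vectors must have the same length');
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  // A zero vector has no direction; treat its similarity as 0.
  return denom === 0 ? 0 : dot / denom;
}

// Rank document embeddings against a query embedding, best match first.
function rankBySimilarity(query: number[], docs: number[][]): { index: number; score: number }[] {
  return docs
    .map((doc, index) => ({ index, score: cosineSimilarity(query, doc) }))
    .sort((x, y) => y.score - x.score);
}

// Toy vectors standing in for real model output.
const queryEmbedding = [1, 0, 0];
const docEmbeddings = [
  [0, 1, 0], // orthogonal to the query
  [1, 0, 0], // identical direction
  [1, 1, 0], // 45 degrees from the query
];

const ranked = rankBySimilarity(queryEmbedding, docEmbeddings);
console.log(ranked.map((r) => r.index)); // [ 1, 2, 0 ]
```

In an application, the query and document vectors would come from the same embedding model so their dimensions match.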
Text Generation (LLM)
Requires @mlc-ai/web-llm and defaultDevice: 'webgpu'.
| Model | Size | Notes |
|-------|------|-------|
| Llama-3.2-1B-Instruct-q4f32_1-MLC | 0.8 GB | Fastest |
| Llama-3.2-3B-Instruct-q4f32_1-MLC | 2 GB | Good balance |
| Phi-3.5-mini-instruct-q4f16_1-MLC | 2.2 GB | Strong reasoning |
| gemma-2-2b-it-q4f16_1-MLC | 1.5 GB | Fast on mobile GPU |
Speech Recognition
| Model | Size | Notes |
|-------|------|-------|
| openai/whisper-tiny | 39 MB | Fastest |
| openai/whisper-base | 74 MB | Good balance |
| openai/whisper-small | 244 MB | Better accuracy |
Text Classification
| Model | Size | Notes |
|-------|------|-------|
| Xenova/distilbert-base-uncased-finetuned-sst-2-english | 67 MB | Sentiment |
| Xenova/toxic-bert | 438 MB | Toxicity detection |
Translation
| Model | Size | Notes |
|-------|------|-------|
| Xenova/opus-mt-en-ru | 74 MB | EN -> RU |
| Xenova/opus-mt-ru-en | 74 MB | RU -> EN |
| Xenova/nllb-200-distilled-600M | 600 MB | 200 languages |
Image Classification
| Model | Size | Notes |
|-------|------|-------|
| Xenova/efficientnet-lite4 | 13 MB | Fastest, 1000 classes |
| Xenova/mobilevit-small | 22 MB | Mobile-friendly |
Model Sources
Models are not locked to Hugging Face. Each adapter has its own sources:
- transformers.js: HF Hub ID or any direct URL
- web-llm: MLC registry, or register custom models
- onnxruntime-web: direct URL to an .onnx file
- Custom adapter: load from anywhere (fetch, IndexedDB, bundled)
Caching
First visit: download -> Cache API -> run (5-60s)
Next visits: Cache API -> run (1-3s, no network)
Offline: Cache API -> run (works without internet)
Browser Support
| Feature | Chrome | Firefox | Safari | Edge |
|---------|--------|---------|--------|------|
| Core (Worker + WASM) | 57+ | 52+ | 11+ | 16+ |
| WebGPU | 113+ | 141+ | 26+ | 113+ |
| WASM SIMD | 91+ | 89+ | 16.4+ | 91+ |
| SharedWorker | 4+ | 29+ | 16+ | 79+ |
| Leader Election | 69+ | 96+ | 15.4+ | 79+ |
Minimum: Web Workers + WebAssembly (97%+ of browsers). All advanced features are progressive enhancements.
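The progressive-enhancement idea behind the three cross-tab tiers can be sketched as a simple selection function. How each feature is detected is left out on purpose; the CrossTabSupport flags and pickCrossTabStrategy are hypothetical names used only for illustration.

```typescript
type CrossTabStrategy = 'shared-worker' | 'leader-election' | 'per-tab';

// Hypothetical feature flags; a real app would fill these in via feature detection.
interface CrossTabSupport {
  sharedWorker: boolean;
  leaderElection: boolean;
}

// Pick the highest tier the browser supports:
// SharedWorker (tier 1) > leader election (tier 2) > per-tab (tier 3).
function pickCrossTabStrategy(support: CrossTabSupport): CrossTabStrategy {
  if (support.sharedWorker) return 'shared-worker';
  if (support.leaderElection) return 'leader-election';
  return 'per-tab';
}

console.log(pickCrossTabStrategy({ sharedWorker: true, leaderElection: true })); // shared-worker
console.log(pickCrossTabStrategy({ sharedWorker: false, leaderElection: true })); // leader-election
console.log(pickCrossTabStrategy({ sharedWorker: false, leaderElection: false })); // per-tab
```

The last tier always succeeds, so enabling cross-tab sharing never breaks a browser that lacks the newer features; it only loses the deduplication.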
Performance Tips
- maxWorkers: 1 for GPU-bound workloads (LLMs)
- defaultDevice: 'webgpu' when targeting modern hardware
- estimatedMemoryMB for accurate LRU eviction
- crossTab: true for multi-tab apps (chat, editors)
- Reuse ModelHandle: re-loading a ready model is a no-op
When To Use
| Use case | Fit? |
|----------|------|
| Semantic search, chatbot, speech, classification, translation | Yes |
| Private data (never leaves device) | Yes |
| Offline after first load | Yes |
| Server-side batch processing | No |
| Models > 4 GB | No |
License
MIT
