inferis-ml

v1.0.3

Published

2 months ago

Worker pool for running AI models in the browser — WebGPU/WASM auto-detection, model lifecycle, streaming, cross-tab dedup

0High
0Medium
0Low

pashunechka

web-worker worker-pool webgpu wasm ai inference transformers onnx llm browser streaming shared-worker model-management machine-learning

inferis-ml

Run AI models in the browser. No server, no per-request cost, no data leaving the device.

Live Demo — try it in your browser.

import { createPool } from 'inferis-ml';
import { transformersAdapter } from 'inferis-ml/adapters/transformers';

const pool = await createPool({ adapter: transformersAdapter() });
const model = await pool.load<number[][]>('feature-extraction', {
  model: 'mixedbread-ai/mxbai-embed-xsmall-v1',
});

const embeddings = await model.run(['Hello world', 'Another sentence']);

Why

Existing browser runtimes (transformers.js, web-llm, onnxruntime-web) give you inference but leave everything else to you — worker management, postMessage boilerplate, model lifecycle, memory budgets, cross-tab dedup, WebGPU fallback, streaming.

inferis-ml handles all of it. You get a clean async API and focus on the product.

| Problem | Without inferis-ml | With inferis-ml | |---------|-------------------|-----------------| | UI freezes during inference | Main thread blocked | Runs in Web Workers | | 5 tabs = 5 model copies | 10 GB RAM, browser crashes | crossTab: true — one shared copy | | WebGPU not everywhere | Manual detection + swap | defaultDevice: 'auto' |

Install

npm install inferis-ml

# Pick your adapter (peer deps):
npm install @huggingface/transformers   # transformersAdapter
npm install @mlc-ai/web-llm             # webLlmAdapter
npm install onnxruntime-web             # onnxAdapter

Quick Start

LLM Streaming

import { createPool } from 'inferis-ml';
import { webLlmAdapter } from 'inferis-ml/adapters/web-llm';

const pool = await createPool({
  adapter: webLlmAdapter(),
  defaultDevice: 'webgpu',
  maxWorkers: 1,
});

const llm = await pool.load<string>('text-generation', {
  model: 'Llama-3.2-3B-Instruct-q4f32_1-MLC',
  onProgress: ({ phase }) => console.log(phase),
});

const stream = llm.stream({
  messages: [
    { role: 'system', content: 'You are a helpful assistant.' },
    { role: 'user', content: 'Explain WebGPU in 3 sentences.' },
  ],
});

for await (const token of stream) {
  output.textContent += token;
}

Speech Transcription

const transcriber = await pool.load<{ text: string }>('automatic-speech-recognition', {
  model: 'openai/whisper-base',
  estimatedMemoryMB: 80,
});

const result = await transcriber.run(audioData);
console.log(result.text);

Abort Inference

const ctrl = new AbortController();
stopButton.onclick = () => ctrl.abort();

try {
  for await (const token of llm.stream(input, { signal: ctrl.signal })) {
    output.textContent += token;
  }
} catch (e) {
  if (e.name === 'AbortError') output.textContent += ' [stopped]';
}

Cross-Tab Deduplication

const pool = await createPool({
  adapter: transformersAdapter(),
  crossTab: true, // SharedWorker > leader election > per-tab fallback
});

Model State Changes

model.onStateChange((state) => {
  if (state === 'loading')  showSpinner();
  if (state === 'ready')    hideSpinner();
  if (state === 'error')    showError('Failed to load model');
  if (state === 'disposed') disableUI();
});

Features

Runtime-agnostic — adapters for @huggingface/transformers, @mlc-ai/web-llm, onnxruntime-web, or your own
Zero framework deps — works with React, Vue, Svelte, or vanilla JS
WebGPU -> WASM fallback — auto-detected or configured explicitly
Streaming — ReadableStream + for await for token-by-token output
Memory budget — LRU eviction when models exceed the configured cap
Cross-tab dedup — SharedWorker (tier 1), leader election (tier 2), per-tab (tier 3)
AbortController — cancel any in-flight inference
TypeScript — full type safety, generic output types

API Reference

`createPool(config)`

const pool = await createPool({
  adapter: transformersAdapter(),   // required
  workerUrl: new URL('inferis-ml/worker', import.meta.url),
  maxWorkers: navigator.hardwareConcurrency - 1,
  maxMemoryMB: 2048,
  defaultDevice: 'auto',           // 'webgpu' | 'wasm' | 'auto'
  crossTab: false,
  taskTimeout: 120_000,
});

`pool.load<TOutput>(task, config)`

Loads a model and returns a ModelHandle. If already loaded, returns the existing handle.

const model = await pool.load<number[][]>('feature-extraction', {
  model: 'mixedbread-ai/mxbai-embed-xsmall-v1',
  estimatedMemoryMB: 30,
  onProgress: (p) => { ... },
});

`ModelHandle<TOutput>`

| Method | Description | |--------|-------------| | run(input, options?) | Non-streaming inference. Returns Promise<TOutput>. | | stream(input, options?) | Streaming inference. Returns ReadableStream<TOutput>. | | dispose() | Unload model and free memory. | | onStateChange(cb) | Subscribe to state changes. Returns unsubscribe function. | | id | Unique model ID (task:model). | | state | Current state: idle \| loading \| ready \| inferring \| unloading \| error \| disposed. | | memoryMB | Approximate memory usage. | | device | Resolved device: webgpu or wasm. |

`InferenceOptions`

interface InferenceOptions {
  signal?: AbortSignal;
  priority?: 'high' | 'normal' | 'low';
}

`detectCapabilities()`

import { detectCapabilities } from 'inferis-ml';

const caps = await detectCapabilities();
if (caps.webgpu.supported) {
  console.log('GPU vendor:', caps.webgpu.adapter?.vendor);
} else {
  console.log('WASM SIMD:', caps.wasm.simd);
}

Custom Adapter

import type { ModelAdapter, ModelAdapterFactory } from 'inferis-ml';

export function myCustomAdapter(): ModelAdapterFactory {
  return {
    name: 'my-adapter',

    async create(): Promise<ModelAdapter> {
      const { MyRuntime } = await import('my-runtime');

      return {
        name: 'my-adapter',

        estimateMemoryMB(_task, config) {
          return (config.estimatedMemoryMB as number) ?? 50;
        },

        async load(task, config, device, onProgress) {
          onProgress({ phase: 'loading', loaded: 0, total: 1 });
          const instance = await MyRuntime.load(config.model as string, { device });
          onProgress({ phase: 'done', loaded: 1, total: 1 });
          return { instance, memoryMB: 50 };
        },

        async run(model, input) {
          return (model.instance as MyRuntime).infer(input);
        },

        async stream(model, input, onChunk) {
          for await (const chunk of (model.instance as MyRuntime).stream(input)) {
            onChunk(chunk);
          }
        },

        async unload(model) {
          await (model.instance as MyRuntime).dispose();
        },
      };
    },
  };
}

Framework Integrations

Official bindings with idiomatic APIs for popular frameworks:

| Package | Install | Docs | |---------|---------|------| | inferis-react | npm i inferis-react | README | | inferis-vue | npm i inferis-vue | README | | inferis-svelte | npm i inferis-svelte | README |

Each package provides context/provider setup, model lifecycle management, streaming, capability detection, and memory monitoring -- all wired into the framework's reactivity system.

// React
const { text, start } = useStream(model);

// Vue
const { text, start } = useStream(model);

// Svelte
const { text, start } = useStream(model);  // $text in template

Bundler & Framework Setup

inferis-ml is browser-only. In SSR frameworks, ensure initialization runs only on the client.

Vite

// vite.config.ts
export default {
  worker: { format: 'es' },
};

webpack 5

// webpack.config.js
module.exports = {
  experiments: { asyncWebAssembly: true },
};

Next.js

'use client';

import { useEffect, useState } from 'react';
import type { WorkerPoolInterface } from 'inferis-ml';

export default function AI() {
  const [pool, setPool] = useState<WorkerPoolInterface | null>(null);

  useEffect(() => {
    import('inferis-ml').then(({ createPool }) =>
      createPool({ adapter: { type: 'transformers' } })
    ).then(setPool);
  }, []);

  if (!pool) return <p>Loading...</p>;
  // use pool
}

Nuxt

<template>
  <ClientOnly>
    <InferenceComponent />
  </ClientOnly>
</template>

// composables/useInferis.ts
export async function useInferis() {
  const { createPool } = await import('inferis-ml');
  return createPool({ adapter: { type: 'transformers' } });
}

SvelteKit

import { browser } from '$app/environment';

let pool;
if (browser) {
  const { createPool } = await import('inferis-ml');
  pool = await createPool({ adapter: { type: 'transformers' } });
}

Popular Models

Models download from Hugging Face Hub on first use and are cached in the browser's Cache API. Subsequent loads are instant and work offline.

Embeddings / Semantic Search

| Model | Size | Notes | |-------|------|-------| | mixedbread-ai/mxbai-embed-xsmall-v1 | 23 MB | Best quality/size for English | | Xenova/all-MiniLM-L6-v2 | 23 MB | Popular multilingual | | Xenova/multilingual-e5-small | 118 MB | 100+ languages |

Text Generation (LLM)

Requires @mlc-ai/web-llm + defaultDevice: 'webgpu'.

| Model | Size | Notes | |-------|------|-------| | Llama-3.2-1B-Instruct-q4f32_1-MLC | 0.8 GB | Fastest | | Llama-3.2-3B-Instruct-q4f32_1-MLC | 2 GB | Good balance | | Phi-3.5-mini-instruct-q4f16_1-MLC | 2.2 GB | Strong reasoning | | gemma-2-2b-it-q4f16_1-MLC | 1.5 GB | Fast on mobile GPU |

Speech Recognition

| Model | Size | Notes | |-------|------|-------| | openai/whisper-tiny | 39 MB | Fastest | | openai/whisper-base | 74 MB | Good balance | | openai/whisper-small | 244 MB | Better accuracy |

Text Classification

| Model | Size | Notes | |-------|------|-------| | Xenova/distilbert-base-uncased-finetuned-sst-2-english | 67 MB | Sentiment | | Xenova/toxic-bert | 438 MB | Toxicity detection |

Translation

| Model | Size | Notes | |-------|------|-------| | Xenova/opus-mt-en-ru | 74 MB | EN -> RU | | Xenova/opus-mt-ru-en | 74 MB | RU -> EN | | Xenova/nllb-200-distilled-600M | 600 MB | 200 languages |

Image Classification

| Model | Size | Notes | |-------|------|-------| | Xenova/efficientnet-lite4 | 13 MB | Fastest, 1000 classes | | Xenova/mobilevit-small | 22 MB | Mobile-friendly |

Model Sources

Models are not locked to Hugging Face. Each adapter has its own sources:

transformers.js — HF Hub ID or any direct URL
web-llm — MLC registry, or register custom models
onnxruntime-web — direct URL to .onnx file
Custom adapter — load from anywhere (fetch, IndexedDB, bundled)

Caching

First visit:  download -> Cache API -> run  (5-60s)
Next visits:  Cache API -> run              (1-3s, no network)
Offline:      Cache API -> run              (works without internet)

Browser Support

| Feature | Chrome | Firefox | Safari | Edge | |---------|--------|---------|--------|------| | Core (Worker + WASM) | 57+ | 52+ | 11+ | 16+ | | WebGPU | 113+ | 141+ | 26+ | 113+ | | WASM SIMD | 91+ | 89+ | 16.4+ | 91+ | | SharedWorker | 4+ | 29+ | 16+ | 79+ | | Leader Election | 69+ | 96+ | 15.4+ | 79+ |

Minimum: Web Workers + WebAssembly (97%+ of browsers). All advanced features are progressive enhancements.

Performance Tips

maxWorkers: 1 for GPU-bound workloads (LLMs)
defaultDevice: 'webgpu' when targeting modern hardware
estimatedMemoryMB for accurate LRU eviction
crossTab: true for multi-tab apps (chat, editors)
Reuse ModelHandle — re-loading a ready model is a no-op

When To Use

| Use case | Fit? | |----------|------| | Semantic search, chatbot, speech, classification, translation | Yes | | Private data (never leaves device) | Yes | | Offline after first load | Yes | | Server-side batch processing | No | | Models > 4 GB | No |

License

MIT

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

inferis-ml

Why

Install

Quick Start

LLM Streaming

Speech Transcription

Abort Inference

Cross-Tab Deduplication

Model State Changes

Features

API Reference

createPool(config)

pool.load<TOutput>(task, config)

ModelHandle<TOutput>

InferenceOptions

detectCapabilities()

Custom Adapter

Framework Integrations

Bundler & Framework Setup

Vite

webpack 5

Next.js

Nuxt

SvelteKit

Popular Models

Embeddings / Semantic Search

Text Generation (LLM)

Speech Recognition

Text Classification

Translation

Image Classification

Model Sources

Caching

Browser Support

Performance Tips

When To Use

License

`createPool(config)`

`pool.load<TOutput>(task, config)`

`ModelHandle<TOutput>`

`InferenceOptions`

`detectCapabilities()`