aether-slm-framework

v1.1.2

Published

2 months ago

Zero-Cost, privacy-first, modular SLM & RAG framework for the browser. Run AI entirely on-device with no API keys and no data leaving the browser sandbox.

vibe-coder-ofek

Aether-SLM Framework

Browser-native AI with local inference, local RAG, SharedWorker VRAM sharing, and zero API-key cost.

Aether-SLM lets web apps run private AI directly in the browser. It combines a local small language model, a local RAG database, hardware-aware backend selection, OPFS model caching, and a production Hub for model delivery.

import { Aether } from 'aether-slm-framework';

const ai = await Aether.init();
const answer = await ai.say('Explain local-first AI in one sentence.');

console.log(answer);

No API keys. No inference server. No prompt or document upload. The first model download is cached locally for future sessions.

Why Aether

Aether is built for apps that need useful AI without sending user data to a remote model API.

Local inference: generation runs in the browser through ONNX and transformers.js.
Local RAG: documents are embedded, indexed, searched, and retrieved on device.
Smart runtime: uses SharedWorker when cross-origin isolation is available, and Lite Worker fallback when it is not.
Shared VRAM: same-origin tabs share one model host instead of loading a copy per tab.
Smart defaults: Aether.init() chooses the production Hub, runtime mode, and model tier automatically.
Hub delivery: model assets are served from https://vibercoderofek.uk.
Hardware dispatch: picks WebNN, WebGPU, or WASM based on browser capability.
OPFS cache: large model files are cached in the browser's Origin Private File System.

Install

npm install aether-slm-framework

Aether ships its browser workers in the package and loads them automatically. Model assets are delivered by the production Hub at https://vibercoderofek.uk.

Create Aether App

For a new project, start with the generated Vite template:

npm exec --package aether-slm-framework create-aether-app -- my-local-ai
cd my-local-ai
npm install
npm run dev

The template includes Aether.init({ model: 'fast' }), createAetherStatusPanel(...), the production Hub URL https://vibercoderofek.uk, and COOP/COEP headers for SharedWorker mode. If you remove the headers, Aether automatically falls back to Lite mode instead of crashing.

Aether ships ready-to-run browser workers and loads the Transformers runtime on demand. You do not need to install model-runtime packages for the quick start.

Quick Start

Two-line chatbot

import { Aether } from 'aether-slm-framework';

const ai = await Aether.init();
await ai.say('Hello from my local AI app.');

On slower devices, the fast preset reaches a first response sooner:

const ai = await Aether.init({ model: 'fast' });
await ai.ready({ capability: 'DRAFT', timeoutMs: 180_000 });

Stream tokens into the page

import { Aether } from 'aether-slm-framework';

const ai = await Aether.init();
const output = document.querySelector('#output')!;

for await (const { chunk, mode } of ai.stream('Write a haiku about WebGPU.', 120)) {
  output.textContent += chunk;
  output.setAttribute('data-model-mode', mode);
}

Copy-Paste Examples

1. Minimal HTML chat

<main>
  <input id="prompt" placeholder="Ask Aether..." />
  <button id="send">Send</button>
  <pre id="output"></pre>
</main>

<script type="module">
  import { Aether } from 'aether-slm-framework';

  const ai = await Aether.init({ debug: true });
  const prompt = document.querySelector('#prompt');
  const output = document.querySelector('#output');

  document.querySelector('#send').addEventListener('click', async () => {
    output.textContent = '';

    for await (const { chunk } of ai.stream(prompt.value, 160)) {
      output.textContent += chunk;
    }
  });
</script>

2. Streaming with model download status

import { Aether, createAetherStatusPanel } from 'aether-slm-framework';
import type { DownloadProgressResponse, SystemStateResponse } from 'aether-slm-framework';

const ai = await Aether.init({
  model: 'fast',
  hubUrl: 'https://vibercoderofek.uk',
  debug: true,
});
const client = ai.client;

createAetherStatusPanel(client, '#aether-status', { showDetails: true });

client.onStateChange = (msg) => {
  if (msg.type === 'SYSTEM_STATE') {
    const state = (msg as SystemStateResponse).state;
    console.log('Runtime state:', state);
  }

  if (msg.type === 'DOWNLOAD_PROGRESS') {
    const progress = msg as DownloadProgressResponse;
    console.log(
      `${progress.modelRole}: ${progress.status} ${progress.progress}%`,
      progress.file ?? '',
    );
  }
};

let answer = '';
for await (const { chunk, mode } of ai.stream('What is OPFS?', 140)) {
  answer += chunk;
  console.log(mode, chunk);
}

Framework Guides

3. Local RAG search

import { AetherRAGClient } from 'aether-slm-framework';

const rag = new AetherRAGClient(
  { hubUrl: 'https://vibercoderofek.uk' },
  {
    onStatus: console.log,
    onProgress: ({ indexed, total, filename }) =>
      console.log(`Indexed ${indexed}/${total}: ${filename}`),
  },
);

await rag.indexText(
  'product-docs',
  'Aether stores embeddings locally and never uploads user documents.',
  { namespace: 'docs', persist: true },
);

const results = await rag.query('Where are embeddings stored?', {
  namespace: 'docs',
  topK: 3,
});

console.log(results.map((r) => r.text));

4. Grounded local answer with RAG

import { Aether, AetherRAGClient } from 'aether-slm-framework';

const ai = await Aether.init();

const rag = new AetherRAGClient({
  mode: 'BOTH',
  hubUrl: 'https://vibercoderofek.uk',
});

await rag.indexText(
  'app-facts',
  'The VibeCoder framework is powered by local RAG and browser-side inference.',
  { namespace: 'facts' },
);

const question = 'What powers the framework?';
const hits = await rag.query(question, { namespace: 'facts', topK: 4 });
const context = hits.map((hit) => hit.text).join('\n---\n');

let answer = '';
for await (const { chunk } of ai.stream(`Context:\n${context}\n\nQuestion: ${question}`, 160)) {
  answer += chunk;
}

console.log(answer);
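The example above joins every hit into the prompt. With a small context window it can help to cap how much retrieved text goes in. The following is a minimal sketch, assuming hits expose a text field as in the example; buildGroundedPrompt and maxContextChars are hypothetical names for illustration and are not part of the Aether API.

```typescript
// Hypothetical helper, not part of aether-slm-framework.
interface Hit {
  text: string;
}

// Builds the same "Context / Question" prompt as the example above,
// but stops adding hits once a character budget would be exceeded.
function buildGroundedPrompt(
  hits: Hit[],
  question: string,
  maxContextChars = 4000, // assumed budget; tune for your model
): string {
  const parts: string[] = [];
  let used = 0;

  for (const hit of hits) {
    const text = hit.text.trim();
    if (text.length === 0) continue; // skip empty chunks
    if (used + text.length > maxContextChars) break; // budget reached
    parts.push(text);
    used += text.length;
  }

  const context = parts.join('\n---\n');
  return `Context:\n${context}\n\nQuestion: ${question}`;
}
```

The result can be passed to ai.stream(...) in place of the inline template string. Hits are assumed to arrive ordered by relevance, so dropping from the end loses the weakest matches first.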

5. React component

import { useEffect, useRef, useState } from 'react';
import { Aether } from 'aether-slm-framework';
import type { AetherSession } from 'aether-slm-framework';

export function LocalChat() {
  const ai = useRef<AetherSession | null>(null);
  const [prompt, setPrompt] = useState('');
  const [answer, setAnswer] = useState('');
  const [ready, setReady] = useState(false);

  useEffect(() => {
    Aether.init().then((session) => {
      ai.current = session;
      setReady(true);
    });
  }, []);

  async function send() {
    if (!ai.current) return;
    setAnswer('');

    for await (const { chunk } of ai.current.stream(prompt, 160)) {
      setAnswer((value) => value + chunk);
    }
  }

  return (
    <section>
      <textarea value={prompt} onChange={(event) => setPrompt(event.target.value)} />
      <button disabled={!ready} onClick={send}>Send</button>
      <pre>{answer}</pre>
    </section>
  );
}

Configuration

Aether.init() accepts the same config object as AetherClient. Only the default mode differs, as listed in the table below.

import { Aether } from 'aether-slm-framework';

const ai = await Aether.init({
  mode: 'BOTH',
  hubUrl: 'https://vibercoderofek.uk',
  model: 'auto',
  runtimeMode: 'auto',
  deviceTier: 'auto',
  debug: true,
  maxContextTokens: 2048,
  vramHardLimitMB: 4096,
  downloadConcurrency: 6,
});

| Option | Default | Description |
| --- | --- | --- |
| mode | 'SLM' for AetherClient, 'BOTH' for Aether.init() | 'SLM' loads inference only, 'DB' loads RAG only, 'BOTH' loads both. |
| hubUrl | https://vibercoderofek.uk | Production model Hub origin. Do not change unless you operate your own Hub. |
| model | 'auto' | Friendly model preset or custom model object. See Models. |
| runtimeMode | 'auto' | 'auto', 'shared', or 'lite'. Auto uses SharedWorker when possible and Lite Worker otherwise. |
| deviceTier | 'auto' | 'auto', 'mobile', or 'pc'. Controls default model choice. |
| ragEmbeddingMode | 'lite' | 'lite', 'semantic', or 'auto'. Lite gives instant local RAG; semantic loads a local embedding model. |
| debug | false | Emit detailed runtime logs. |
| allowConcurrent | false | Allow more than one generate() call on the same client instance. |
| maxContextTokens | 2048 | Approximate prompt budget before middle truncation. |
| vramHardLimitMB | 4096 | Safety ceiling used by the runtime. |
| downloadConcurrency | 6 | Parallel range-request count, clamped from 1 to 16. |
| bypassCache | false | Force fresh model downloads instead of OPFS cache hits. |
| queuingStrategy | 'ROUND_ROBIN' | SharedWorker scheduling: 'ROUND_ROBIN' or 'FIFO'. |
| thermalThrottleMs | 1500 | Adds a small delay after long inference work. |
| sharedWorkerUrl | auto | Advanced worker URL override. |
| liteWorkerUrl | auto | Advanced Lite Worker URL override. |
| ragWorkerUrl | auto | Advanced RAG Worker URL override. |
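Two rows in the table describe behavior rather than plain values: maxContextTokens triggers middle truncation, and downloadConcurrency is clamped to the range 1 to 16. The sketch below illustrates what those two behaviors generally look like. It is not Aether's implementation: it treats whitespace-separated words as tokens, and the names clampConcurrency and truncateMiddle are hypothetical.

```typescript
// Illustrative only; Aether's real tokenizer and truncation policy may differ.

// Clamp a requested parallel-download count into [min, max].
function clampConcurrency(requested: number, min = 1, max = 16): number {
  if (Number.isNaN(requested)) return min;
  return Math.min(max, Math.max(min, Math.floor(requested)));
}

// Keep the start and end of a prompt and drop the middle when it is too long.
// Words stand in for tokens here, which is only a rough approximation.
function truncateMiddle(prompt: string, maxTokens: number): string {
  const tokens = prompt.split(/\s+/).filter((t) => t.length > 0);
  if (tokens.length <= maxTokens) return prompt;
  if (maxTokens <= 0) return '';

  const headCount = Math.ceil(maxTokens / 2);
  const tailCount = maxTokens - headCount;
  const head = tokens.slice(0, headCount);
  // Guard tailCount === 0: slice(tokens.length - 0) would be fine,
  // but being explicit avoids the slice(-0) pitfall in other phrasings.
  const tail = tailCount > 0 ? tokens.slice(tokens.length - tailCount) : [];

  // The marker shows the reader (and the model) that text was removed.
  return [...head, '[...]', ...tail].join(' ');
}
```

Middle truncation keeps the instructions at the start of a prompt and the question at the end, which are usually the parts a model needs most.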

Runtime Modes

SharedWorker mode

Used when the page is cross-origin isolated. This is the best runtime for production apps.

One model host per same-origin app.
Multiple tabs share the same model instance.
Best for VRAM deduplication and long sessions.
Requires COOP/COEP headers.

Lite mode

Used when headers are missing or SharedWorker is unavailable.

Runs in a dedicated Worker.
Works with zero server configuration.
Does not provide cross-tab VRAM sharing.
On constrained devices, the big target model may be loaded On Demand instead of being preloaded.

You can force a runtime:

await Aether.init({ runtimeMode: 'lite' });
await Aether.init({ runtimeMode: 'shared' });
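To see which mode a page is likely to get before calling Aether.init(), you can check the same browser signals yourself. This is a rough sketch built on the standard crossOriginIsolated global and the SharedWorker constructor; resolveRuntimeMode and readRuntimeEnv are hypothetical names, and the assumption that a forced 'shared' request degrades to Lite when the signals are missing is mine, not documented behavior.

```typescript
// Hypothetical pre-flight check, not part of aether-slm-framework.
type RuntimeMode = 'shared' | 'lite';
type RequestedMode = 'auto' | RuntimeMode;

interface RuntimeEnv {
  crossOriginIsolated: boolean;
  hasSharedWorker: boolean;
}

// Read the two signals from the current global scope (browser or worker).
function readRuntimeEnv(): RuntimeEnv {
  const g = globalThis as unknown as Record<string, unknown>;
  return {
    crossOriginIsolated: g['crossOriginIsolated'] === true,
    hasSharedWorker: typeof g['SharedWorker'] === 'function',
  };
}

// Decide which mode is achievable for a requested runtimeMode.
function resolveRuntimeMode(
  requested: RequestedMode,
  env: RuntimeEnv = readRuntimeEnv(),
): RuntimeMode {
  if (requested === 'lite') return 'lite';
  const sharedPossible = env.crossOriginIsolated && env.hasSharedWorker;
  // Assumption: 'shared' without isolation falls back to Lite, like 'auto'.
  return sharedPossible ? 'shared' : 'lite';
}
```

Logging resolveRuntimeMode('auto') during startup is a quick way to confirm that the COOP/COEP headers actually reached the browser.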

Models

Aether chooses a model plan automatically, but you can override it with a preset name or a custom model object.

await Aether.init(); // default: model: 'auto'
await Aether.init({ model: 'fast' });
await Aether.init({ model: 'llama-3.2-1b' });
await Aether.init({ model: 'qwen2.5-1.5b' });
await Aether.init({ model: 'qwen2.5-coder-3b' });
await Aether.init({ model: 'llama-3.1-8b', deviceTier: 'pc' });

await Aether.init({
  model: {
    id: 'your-org/your-onnx-model',
  },
});

await Aether.init({
  model: {
    draft: 'your-org/small-draft-onnx',
    target: 'your-org/big-target-onnx',
  },
});

console.table(Aether.models());
console.table(Aether.models({ hubOnly: true }));
console.table(await Aether.modelsWithHubStatus());

Aether.models({ hubOnly: true }) returns the presets already known to be Hub-backed. Aether.modelsWithHubStatus() also asks the live Hub for a model manifest when the Hub exposes one, then merges that with the built-in catalog. Custom Hugging Face/ONNX model IDs remain allowed through { id } or { draft, target }.

| Preset | Model plan | Best fit | Hub status |
| --- | --- | --- | --- |
| auto | Mobile: Llama 3.2 1B. PC: Llama 3.2 1B draft + Llama 3.1 8B target. | Default smart plan | Available |
| fast | Llama 3.2 1B only | First-run demos, mobile | Available |
| llama-3.2-1b | onnx-community/Llama-3.2-1B-Instruct-ONNX | Mobile/local chat | Available |
| llama-3.1-8b | 1B draft + llmware/llama-3.1-instruct-onnx target | PC high-reasoning | Available |
| qwen2.5-0.5b | onnx-community/Qwen2.5-0.5B-Instruct | Tiny multilingual demos | Candidate |
| qwen2.5-1.5b | onnx-community/Qwen2.5-1.5B-Instruct | Balanced small chat | Candidate |
| qwen2.5-coder-3b | 1.5B draft + onnx-community/Qwen2.5-Coder-3B-Instruct target | Coding/repo assistants | Candidate |
| phi-3.5-mini | 1B draft + onnx-community/Phi-3.5-mini-instruct-onnx-web target | Compact reasoning | Candidate |
| smollm2-1.7b | HuggingFaceTB/SmolLM2-1.7B-Instruct | Lightweight general chat | Candidate |
| gemma-3-1b | onnx-community/gemma-3-1b-it-ONNX | Gemma-family mobile flows | Candidate |

Model status events include:

queued: waiting for the small model to finish downloading.
checking: resolving model files and cache state.
downloading: actively fetching model assets.
cached: loading from OPFS.
ready: loaded in the active worker.
on-demand: not preloaded; loaded during generation in serial-swap mode.
error: load failed.
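If you render these statuses in your own UI, a string union keeps the handling exhaustive. The sketch below is an assumption-laden illustration: the status strings come from the list above, but the ModelStatus type, the label text, and the helpers describeStatus and isSettled are hypothetical and not exported by the package.

```typescript
// Hypothetical UI helpers; only the status strings come from the list above.
type ModelStatus =
  | 'queued'
  | 'checking'
  | 'downloading'
  | 'cached'
  | 'ready'
  | 'on-demand'
  | 'error';

const STATUS_LABELS: Record<ModelStatus, string> = {
  queued: 'Waiting for the small model',
  checking: 'Resolving model files and cache state',
  downloading: 'Downloading model assets',
  cached: 'Loading from OPFS',
  ready: 'Loaded in the active worker',
  'on-demand': 'Will load during generation',
  error: 'Load failed',
};

// Human-readable label, with a percentage while downloading.
function describeStatus(status: ModelStatus, progress?: number): string {
  const label = STATUS_LABELS[status];
  if (status === 'downloading' && progress !== undefined) {
    const pct = Math.round(Math.min(100, Math.max(0, progress)));
    return `${label} (${pct}%)`;
  }
  return label;
}

// Assumption: these three statuses mean no further preload work is pending.
function isSettled(status: ModelStatus): boolean {
  return status === 'ready' || status === 'on-demand' || status === 'error';
}
```

Because STATUS_LABELS is typed as Record<ModelStatus, string>, the compiler flags a missing label if a status is added to the union later.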

Local RAG

The RAG pipeline runs in a Worker and uses Orama plus local embeddings. By default it uses ragEmbeddingMode: 'lite', so indexing and querying work immediately without downloading an embedding model.

import { AetherRAGClient } from 'aether-slm-framework';

const rag = new AetherRAGClient();

await rag.indexText(
  'facts',
  'Aether answers from private local context in the browser.',
  { namespace: 'demo' },
);

const hits = await rag.query('Where does Aether get context?', {
  namespace: 'demo',
  topK: 3,
});

Available methods:

| Method | Use |
| --- | --- |
| indexText(source, text, options) | Index one text blob. |
| indexFiles(files, options) | Index browser File[] values. |
| indexEntries(entries, options) | Batch-index structured records. |
| query(text, options) | Hybrid BM25/vector search. |
| upsert(id, text, meta, options) | Replace a stable record. |
| delete(id, options) | Delete one record. |
| clear(options) | Clear a namespace or all local RAG data. |

RAG options:

{
  namespace: 'support-docs',
  persist: true,
  embeddingMode: 'lite'
}

namespace isolates data domains.
persist: true stores raw entries in IndexedDB and rehydrates them on reload.
embeddingMode: 'lite' is instant and default. Use 'semantic' when you want model-backed local embeddings.

Aether Hub

The production Hub is:

https://vibercoderofek.uk

Aether uses it for model asset delivery and cross-origin coordination. In application code, use:

const ai = await Aether.init({
  hubUrl: 'https://vibercoderofek.uk',
});

The Hub handshake is automatic in AetherClient, Aether.init(), and AetherRAGClient when hubUrl is set.

Feature Gallery App

This repository includes a full UI examples app.

npm install
npm run gallery

Open:

http://127.0.0.1:5200/

The gallery includes:

Repo Roaster: GitHub URL ingestion plus local RAG.
Knowledge Nexus: local vector database visualization.
Sovereign Chat: token streaming and draft/speculative visibility.
Stress Test: multi-tab SharedWorker and VRAM deduplication test.

The sidebar shows:

Active backend: WebNN, WebGPU, or WASM.
Shared VRAM usage.
SharedWorker connection count.
Small model download/readiness.
Big model queued/downloading/on-demand/readiness.
System state.

Browser Headers

For the best runtime, serve these headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp
Cross-Origin-Resource-Policy: cross-origin
Permissions-Policy: cross-origin-isolated=(self)

Vite example:

// vite.config.ts
import { defineConfig } from 'vite';

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy': 'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
      'Cross-Origin-Resource-Policy': 'cross-origin',
      'Permissions-Policy': 'cross-origin-isolated=(self)',
    },
  },
});

If these headers are missing, Aether falls back to Lite mode instead of crashing.
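The Vite snippet above configures the dev server only, so a production host has to send the same headers itself. Here is a minimal sketch of a reusable helper, assuming a Node-style response object with a setHeader method (Node's http.ServerResponse and Express responses both fit that shape); ISOLATION_HEADERS, HeaderSink, and applyIsolationHeaders are hypothetical names.

```typescript
// Hypothetical helper for a Node-based production server.
const ISOLATION_HEADERS: Readonly<Record<string, string>> = {
  'Cross-Origin-Opener-Policy': 'same-origin',
  'Cross-Origin-Embedder-Policy': 'require-corp',
  'Cross-Origin-Resource-Policy': 'cross-origin',
  'Permissions-Policy': 'cross-origin-isolated=(self)',
};

// Anything with a setHeader(name, value) method, such as http.ServerResponse.
interface HeaderSink {
  setHeader(name: string, value: string): void;
}

// Copy every isolation header onto the outgoing response.
function applyIsolationHeaders(res: HeaderSink): void {
  for (const [name, value] of Object.entries(ISOLATION_HEADERS)) {
    res.setHeader(name, value);
  }
}

// Usage sketch with Node's built-in http module (not executed here):
//
//   import { createServer } from 'node:http';
//   createServer((req, res) => {
//     applyIsolationHeaders(res);
//     res.end('ok');
//   }).listen(8080);
```

The headers must be on the HTML document response in particular; checking self.crossOriginIsolated in the browser console confirms whether they took effect.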

Troubleshooting

| Symptom | Meaning | Fix |
| --- | --- | --- |
| Big model stays Queued | Small model is still downloading. | Wait for the small model to become Ready. |
| Big model becomes On Demand | Runtime cannot safely preload both models. | This is expected in Lite mode, WASM fallback, or constrained WebGPU. |
| self.crossOriginIsolated is false | COOP/COEP headers are missing. | Add the headers above or use Lite mode. |
| App logs Falling back to LITE mode | SharedWorker/SAB path is unavailable. | Add headers for SharedWorker mode, or accept Lite mode. |
| No available adapters | WebGPU adapter is unavailable in that browser/session. | Use Chrome/Edge with WebGPU enabled, update GPU drivers, or use WASM fallback. |
| Model download is slow | First load fetches large ONNX files. | Leave the tab open; OPFS cache makes later loads faster. |
| RAG query works but generation is slow | Embeddings are smaller than SLM weights. | Wait for model readiness, reduce maxTokens, or use mobile tier. |
| Hub request fails | Origin is blocked or offline. | Confirm https://vibercoderofek.uk/hub.html returns 200 and CORS is allowed. |
| Storage quota error | Browser storage is full. | Clear site data for the app origin and reload. |
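For the storage quota row, the standard navigator.storage.estimate() call reports how much space the origin has used and how much it may use. The sketch below shows one way to check headroom before a large model download; remainingBytes, hasRoomFor, checkStorage, and the 10% safety margin are hypothetical choices, and the reported quota is only a browser estimate.

```typescript
// Hypothetical pre-download check built on the standard StorageManager API.
interface StorageEstimateLike {
  usage?: number;
  quota?: number;
}

// Bytes the origin can still write, according to the browser's estimate.
function remainingBytes(estimate: StorageEstimateLike): number {
  const quota = estimate.quota ?? 0;
  const usage = estimate.usage ?? 0;
  return Math.max(0, quota - usage);
}

// True when the download fits with a safety margin on top (10% by default).
function hasRoomFor(
  estimate: StorageEstimateLike,
  bytesNeeded: number,
  safetyMargin = 0.1,
): boolean {
  return remainingBytes(estimate) >= bytesNeeded * (1 + safetyMargin);
}

// Returns null when the StorageManager API is unavailable in this context.
async function checkStorage(bytesNeeded: number): Promise<boolean | null> {
  const nav = (globalThis as unknown as {
    navigator?: { storage?: { estimate?: () => Promise<StorageEstimateLike> } };
  }).navigator;
  const storage = nav?.storage;
  if (!storage || typeof storage.estimate !== 'function') return null;
  return hasRoomFor(await storage.estimate(), bytesNeeded);
}
```

A false result is a good moment to suggest clearing site data before the download starts, rather than after it fails partway through.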

Architecture

Browser origin
  Tabs
    AetherClient
      SharedWorker or Lite Worker
        Multiplexer
        ONNXEngine
        UMADispatcher: WebNN -> WebGPU -> WASM
        OPFS model cache
        Hub fetch interceptor

  AetherRAGClient
    RAG Worker
      Lite embeddings or optional gte-small semantic embeddings
      Orama BM25/vector index
      optional IndexedDB persistence

Data flow:

Your app creates Aether.init(), AetherClient, or AetherRAGClient.
Aether connects to the production Hub at https://vibercoderofek.uk.
Model assets download once and are cached in OPFS.
Prompts and documents remain in the browser.
The runtime streams tokens or RAG results back to your UI.
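The UMADispatcher line in the diagram lists a fallback order of WebNN, then WebGPU, then WASM. As a rough illustration of that kind of ordering, and not the dispatcher's actual code, the sketch below probes the standard browser entry points (navigator.ml, navigator.gpu.requestAdapter(), and the WebAssembly global). The names Backend, BackendCapabilities, pickBackend, and detectCapabilities are hypothetical.

```typescript
// Illustrative capability probe; Aether's real dispatcher may check more.
type Backend = 'webnn' | 'webgpu' | 'wasm';

interface BackendCapabilities {
  webnn: boolean;
  webgpu: boolean;
  wasm: boolean;
}

// Apply the WebNN -> WebGPU -> WASM preference order.
function pickBackend(caps: BackendCapabilities): Backend | null {
  if (caps.webnn) return 'webnn';
  if (caps.webgpu) return 'webgpu';
  if (caps.wasm) return 'wasm';
  return null; // nothing usable in this environment
}

// Probe the current environment using standard web platform entry points.
async function detectCapabilities(): Promise<BackendCapabilities> {
  const nav = (globalThis as unknown as {
    navigator?: {
      ml?: unknown;
      gpu?: { requestAdapter(): Promise<unknown | null> };
    };
  }).navigator;

  let webgpu = false;
  if (nav?.gpu) {
    try {
      // requestAdapter() resolves to null when no adapter is available.
      webgpu = (await nav.gpu.requestAdapter()) !== null;
    } catch {
      webgpu = false;
    }
  }

  return {
    webnn: nav?.ml !== undefined,
    webgpu,
    wasm: 'WebAssembly' in globalThis,
  };
}
```

Calling pickBackend(await detectCapabilities()) gives a quick hint about which backend to expect. The presence of navigator.ml or navigator.gpu does not guarantee that a given model runs on that backend, so treat the result as a first guess.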

Development

npm install
npm run typecheck
npm run build
npm run test
npm run gallery

Common scripts:

| Script | Description |
| --- | --- |
| npm run dev | Run the root Vite demo. |
| npm run gallery | Run the examples app on port 5200. |
| npm run build | Build library and workers. |
| npm run typecheck | Run TypeScript checks. |
| npm run test | Run unit tests. |
| npm run test:e2e | Run Playwright tests. |

License

ISC. See LICENSE.

Project Promise

Aether-SLM is for local-first AI applications: chat, RAG, document search, private copilots, offline tools, and multi-tab browser apps that should not require a model API bill.

The framework downloads model files, but user prompts, indexed documents, embeddings, and retrieval context stay local to the browser runtime.