mellon-stt

v1.1.0

Published

21 days ago

Offline, in-browser hotword detection powered by EfficientWord-Net (ResNet-50 ArcFace). Works as a standalone app or npm library.

comicscrip

hotword wake-word speech onnx offline browser wasm audio voice

Offline, fully in-browser hotword / wake-word detection powered by EfficientWord-Net (ResNet-50 ArcFace). Works as a zero-dependency npm library or as a standalone PWA.

100% offline — ONNX inference runs in the browser via WebAssembly; no server, no cloud.
Speaker-independent — the model generalises across voices out of the box.
Custom words — enroll any phrase with ≥ 3 audio samples; no retraining.
TypeScript-ready — ships with full .d.ts declarations.
Tiny API surface — one class for simple use, low-level primitives for advanced use.

Browser requirements

mellon-stt uses ONNX Runtime's multi-threaded WebAssembly backend, which requires SharedArrayBuffer. This in turn requires the page to be served with the following HTTP headers:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

See Server / bundler configuration for ready-to-use snippets.

Additionally:

The page must be served over HTTPS (or localhost).
Microphone permission is requested when start() is called.
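The requirements above can be checked up front, before the model download starts. The following is a minimal sketch, not part of mellon-stt: `missingRequirements` and `EnvLike` are hypothetical names, and the function only inspects standard browser globals (`isSecureContext`, `crossOriginIsolated`, `SharedArrayBuffer`).

```typescript
// Hypothetical helper (not a mellon-stt API): lists which of the documented
// browser requirements are not met by the given global-like object.
interface EnvLike {
  isSecureContext?: boolean
  crossOriginIsolated?: boolean
  SharedArrayBuffer?: unknown
}

function missingRequirements(env: EnvLike): string[] {
  const missing: string[] = []
  if (env.isSecureContext !== true) {
    missing.push('secure context (HTTPS or localhost)')
  }
  if (env.crossOriginIsolated !== true) {
    missing.push('cross-origin isolation (COOP/COEP headers)')
  }
  if (typeof env.SharedArrayBuffer !== 'function') {
    missing.push('SharedArrayBuffer')
  }
  return missing
}

// In a browser you would pass the global object:
// const problems = missingRequirements(globalThis)
```

An empty result means the page should be able to use the multi-threaded WASM backend. Anything else is worth showing to the user before calling `init()`.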

Installation

npm install mellon-stt

The package ships with the ONNX model (~88 MB) and all ORT WASM runtime files. Copy them to your public directory before your first deployment — see Asset setup.

Quick start

import { MellonStt } from 'mellon-stt'

const stt = new MellonStt({
  // Tell the library where you copied the assets (see Asset setup below)
  wasmBasePath: '/mellon-assets/wasm/',
  modelUrl:     '/mellon-assets/model.onnx',
})

// Optional: show a progress bar while the 88 MB model loads
await stt.init(pct => console.log(`Loading model: ${Math.round(pct * 100)}%`))

// Register the listener before starting so that no early match is missed
stt.addEventListener('match', (e) => {
  console.log(`Detected "${e.detail.name}" (confidence ${(e.detail.confidence * 100).toFixed(1)}%)`)
})

// Request mic and start listening for the built-in words
await stt.start()

Built-in words: suivant (French: "next") and precedent (French: "previous"). You can enroll any custom word — see Enrolling custom words.

Asset setup

The WASM runtime and model cannot be bundled into JavaScript — they must be served as static files. After installing, run the provided helper to copy them to your project's public directory:

# Copy to public/mellon-assets/  (adjust --dest as needed)
node node_modules/mellon-stt/scripts/copy-assets.js --dest ./public/mellon-assets

Or copy manually:

cp -r node_modules/mellon-stt/dist/assets/wasm  public/mellon-assets/wasm
cp    node_modules/mellon-stt/dist/assets/model.onnx  public/mellon-assets/model.onnx

Then pass the serving paths to the constructor:

new MellonStt({
  wasmBasePath: '/mellon-assets/wasm/',   // trailing slash required
  modelUrl:     '/mellon-assets/model.onnx',
})
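Because a missing trailing slash on `wasmBasePath` is an easy mistake, it may help to normalise the value before passing it in. This is a small sketch under that assumption. `withTrailingSlash` and `assetPaths` are hypothetical names, not part of the library.

```typescript
// Hypothetical helper: guarantees exactly the trailing slash that
// wasmBasePath is documented to require.
function withTrailingSlash(basePath: string): string {
  return basePath.endsWith('/') ? basePath : basePath + '/'
}

// Options object in the shape the constructor is documented to accept.
const assetPaths = {
  wasmBasePath: withTrailingSlash('/mellon-assets/wasm'),
  modelUrl: '/mellon-assets/model.onnx',
}
```

The resulting `assetPaths` object can then be passed to `new MellonStt(...)`.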

Vite projects

Add the copy step to your Vite config using the vite-plugin-static-copy plugin:

// vite.config.js
import { defineConfig }   from 'vite'
import { viteStaticCopy } from 'vite-plugin-static-copy'

export default defineConfig({
  server: {
    headers: {
      'Cross-Origin-Opener-Policy':  'same-origin',
      'Cross-Origin-Embedder-Policy': 'require-corp',
    },
  },
  plugins: [
    viteStaticCopy({
      targets: [
        { src: 'node_modules/mellon-stt/dist/assets/wasm/*',       dest: 'mellon-assets/wasm' },
        { src: 'node_modules/mellon-stt/dist/assets/model.onnx',   dest: 'mellon-assets' },
      ],
    }),
  ],
})

API reference

`MellonStt` (high-level)

The easiest way to use the library. Wraps mic access, AudioWorklet wiring, and detector management into a single class.

class MellonStt extends EventTarget {
  static BUILTIN_WORDS: string[]          // ['suivant', 'precedent']

  constructor(opts?: MellonSttOptions)
  readonly isInitialized: boolean
  readonly isRunning:     boolean

  init(onProgress?: (pct: number) => void): Promise<void>
  start(words?: string[]): Promise<void>
  stop(): void
  addCustomWord(refData: RefData): void
  enrollWord(wordName: string): EnrollmentSession
}

`MellonSttOptions`

| Option | Type | Default | Description |
|---|---|---|---|
| words | string[] | BUILTIN_WORDS | Words to detect |
| threshold | number | 0.65 | Detection threshold (0–1) |
| relaxationMs | number | 2000 | Min ms between match events |
| inferenceGapMs | number | 300 | Min ms between inference runs |
| wasmBasePath | string | — | Base URL for ORT WASM (trailing /) |
| modelUrl | string | — | URL to model.onnx |
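To make the interplay of `threshold` and `relaxationMs` concrete, here is an illustrative sketch of a gate that emits a match only when the score reaches the threshold and the cooldown has elapsed. `MatchGate` is a hypothetical class, not the library's implementation, and whether the real comparison is `>=` or `>` is an assumption.

```typescript
// Illustrative only: models the documented semantics of `threshold` and
// `relaxationMs`; it is not how mellon-stt is known to be implemented.
class MatchGate {
  private readonly threshold: number
  private readonly relaxationMs: number
  private lastMatchAt = -Infinity

  constructor(threshold = 0.65, relaxationMs = 2000) {
    this.threshold = threshold
    this.relaxationMs = relaxationMs
  }

  // Returns true when a match event should be emitted for this score.
  accept(score: number, nowMs: number): boolean {
    if (score < this.threshold) return false // assumed: >= threshold matches
    if (nowMs - this.lastMatchAt < this.relaxationMs) return false // cooling down
    this.lastMatchAt = nowMs
    return true
  }
}
```

With the defaults, a score of 0.7 at t = 0 ms is accepted and a score of 0.9 at t = 1000 ms is suppressed by the cooldown. Raising `threshold` trades missed detections for fewer false positives.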

Events

| Event | Detail type | Fired when |
|---|---|---|
| ready | — | init() completes |
| match | { name, confidence, timestamp } | A word is detected |
| error | { error: Error } | Model load or mic access fails |

`HotwordDetector`

Stateful, single-word detector. Wire it to your own AudioWorklet pipeline.

class HotwordDetector extends EventTarget {
  constructor(opts: DetectorOptions)

  readonly name:      string
  readonly lastScore: number       // most recent similarity score
  threshold:      number
  relaxationMs:   number
  inferenceGapMs: number

  scoreFrame(audioBuffer: Float32Array): Promise<number | null>
}

`DetectorOptions`

| Option | Type | Default | Description |
|---|---|---|---|
| name | string | — | Label for this word |
| refEmbeddings | number[][] | — | N × 256 embedding vectors |
| threshold | number | 0.65 | Detection threshold |
| relaxationMs | number | 2000 | Cooldown between matches |
| inferenceGapMs | number | 300 | Rate-limit on scoreFrame() |

Example

import { loadModel, configure, HotwordDetector, BUILTIN_REFS } from 'mellon-stt'

configure({ wasmBasePath: '/assets/wasm/', modelUrl: '/assets/model.onnx' })
await loadModel()

const ref = BUILTIN_REFS['suivant']
const detector = new HotwordDetector({ name: 'suivant', refEmbeddings: ref.embeddings })

detector.addEventListener('match', e => {
  console.log(e.detail)  // { name: 'suivant', confidence: 0.72, timestamp: 1711234567890 }
})

// In your AudioWorklet onmessage handler:
workletNode.port.onmessage = async (e) => {
  await detector.scoreFrame(e.data)   // e.data is Float32Array[24000]
}
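The example above expects each message to carry a full 24 000-sample window, whereas an AudioWorklet processor normally receives audio in 128-sample render quanta. One way to bridge the two is a rolling buffer, sketched below. `RollingWindow` is a hypothetical helper, and how the library's own worklet assembles its windows is not specified here.

```typescript
// Hypothetical helper: keeps the most recent `size` samples and returns a
// copy of the window once enough audio has been collected.
class RollingWindow {
  private readonly buffer: Float32Array
  private filled = 0

  constructor(size = 24000) {
    this.buffer = new Float32Array(size)
  }

  // Appends a chunk, discarding the oldest samples.
  // Returns a copy of the full window, or null while still filling up.
  push(chunk: Float32Array): Float32Array | null {
    const size = this.buffer.length
    if (chunk.length >= size) {
      // Chunk alone covers the window: keep only its tail.
      this.buffer.set(chunk.subarray(chunk.length - size))
      this.filled = size
    } else {
      // Shift existing samples left, then write the chunk at the end.
      this.buffer.copyWithin(0, chunk.length)
      this.buffer.set(chunk, size - chunk.length)
      this.filled = Math.min(size, this.filled + chunk.length)
    }
    return this.filled === size ? this.buffer.slice() : null
  }
}
```

Each non-null result could be handed to `detector.scoreFrame()`. Since `inferenceGapMs` is documented as a rate limit on `scoreFrame()`, it should be safe to call it more often than inference actually runs.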

`EnrollmentSession`

Records audio samples from the mic (or uploaded files) and generates reference embeddings for a new custom word.

class EnrollmentSession extends EventTarget {
  constructor(wordName: string)

  readonly wordName:    string
  readonly sampleCount: number
  readonly samples:     { audioBuffer: Float32Array; name: string }[]

  recordSample():            Promise<number>   // → 1-based sample index
  addAudioFile(file: File):  Promise<number>   // → 1-based sample index
  removeSample(idx: number): void
  clearSamples():            void
  generateRef():             Promise<RefData>  // requires ≥ 3 samples
}

Events

| Event | Detail |
|---|---|
| recording-start | — |
| sample-added | { count: number; name: string } |
| samples-changed | { count: number } |
| generating | { total: number } |
| progress | { done: number; total: number } |

Engine functions

// Configure asset paths (once, before loadModel)
configure({ wasmBasePath?: string, modelUrl?: string }): void

// Load (or return cached) ONNX inference session
loadModel(onProgress?: (pct: number) => void): Promise<void>

// Run inference — returns 256-dim L2-normalised embedding
embed(spectrogram: Float32Array): Promise<Float32Array>

Audio features

// Compute log-mel spectrogram — input: 24 000 samples at 16 kHz
// Output: Float32Array[149 × 64]
logfbank(signal: Float32Array): Float32Array
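The 149 × 64 output shape can be reproduced with a short calculation. The sketch below assumes a 25 ms analysis window, a 10 ms hop, and zero-padding of the final partial frame. These are common filterbank defaults and are assumptions here, since the package does not state its framing parameters. `frameCount` is a hypothetical helper.

```typescript
// Hypothetical helper: number of analysis frames for a signal, assuming the
// last partial frame is zero-padded (hence Math.ceil).
function frameCount(samples: number, sampleRate: number, winMs = 25, stepMs = 10): number {
  const winLen = Math.round((sampleRate * winMs) / 1000) // 400 samples at 16 kHz
  const step = Math.round((sampleRate * stepMs) / 1000)  // 160 samples at 16 kHz
  if (samples <= winLen) return 1
  return 1 + Math.ceil((samples - winLen) / step)
}

// 24 000 samples at 16 kHz is 1.5 s of audio.
const frameTotal = frameCount(24000, 16000) // 149 under the assumptions above
const featureValues = frameTotal * 64       // 9536 floats, one row of 64 mel bins per frame
```

Under these assumptions the arithmetic is 1 + ceil((24000 − 400) / 160) = 1 + ceil(147.5) = 149 frames, which matches the documented output size.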

Similarity helpers

// Cosine similarity normalised to [0, 1]
cosineSim(a: Float32Array | number[], b: Float32Array | number[]): number

// Maximum cosine similarity against an array of reference embeddings
maxSimilarity(embedding: Float32Array, refs: number[][]): number
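For intuition, the two helpers can be approximated as follows. This is a sketch, not the library's source: it assumes the [0, 1] normalisation is the linear map (cos + 1) / 2, and the names `cosineSim01` and `maxSim01` are hypothetical, chosen to avoid implying they are the exported functions.

```typescript
type Vec = Float32Array | number[]

// Cosine similarity mapped from [-1, 1] to [0, 1] via (cos + 1) / 2.
// The mapping is an assumption; zero-length vectors yield 0.
function cosineSim01(a: Vec, b: Vec): number {
  if (a.length !== b.length) throw new Error('vectors must have equal length')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  if (normA === 0 || normB === 0) return 0
  const cos = dot / (Math.sqrt(normA) * Math.sqrt(normB))
  return (cos + 1) / 2
}

// Best score of one embedding against every reference embedding.
function maxSim01(embedding: Vec, refs: number[][]): number {
  let best = 0
  for (const ref of refs) {
    best = Math.max(best, cosineSim01(embedding, ref))
  }
  return best
}
```

Taking the maximum over references means a word matches as soon as it resembles any one enrolled sample. That may explain why enrolling several varied samples helps.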

Storage helpers

// Constants
BUILTIN_WORDS: string[]                              // ['suivant', 'precedent']
BUILTIN_REFS:  Record<string, RefData>               // bundled, no fetch needed

// Network-based fetch (demo app / server usage)
fetchBuiltinRef(word: string): Promise<RefData>

// localStorage persistence
loadCustomRefs():                  RefData[]
saveCustomRef(refData: RefData):   void
deleteCustomRef(wordName: string): void

// File I/O
exportRef(refData: RefData):         void           // triggers browser download
importRefFile(file: File): Promise<RefData>

`RefData` shape

interface RefData {
  word_name:  string           // e.g. 'hello'
  model_type: 'resnet_50_arc'
  embeddings: number[][]       // N × 256 vectors
}

Compatible with the EfficientWord-Net _ref.json format — you can import reference files generated by the Python toolkit directly.
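When reference files come from outside the app (for example, exported by the Python toolkit), it can be worth validating the parsed JSON against the shape above before using it. The type guard below is a hypothetical sketch. `isRefData` and `RefDataLike` are not library exports, and the check for exactly 256 values per row follows the "N × 256" note above.

```typescript
// Local mirror of the documented RefData shape (hypothetical name).
interface RefDataLike {
  word_name: string
  model_type: 'resnet_50_arc'
  embeddings: number[][]
}

// Type guard: true only for objects matching the documented shape,
// with at least one embedding row of 256 finite numbers.
function isRefData(value: unknown): value is RefDataLike {
  if (typeof value !== 'object' || value === null) return false
  const v = value as Record<string, unknown>
  if (typeof v['word_name'] !== 'string') return false
  if (v['model_type'] !== 'resnet_50_arc') return false
  const emb = v['embeddings']
  return (
    Array.isArray(emb) &&
    emb.length > 0 &&
    emb.every(
      (row) =>
        Array.isArray(row) &&
        row.length === 256 &&
        row.every((x) => typeof x === 'number' && Number.isFinite(x)),
    )
  )
}
```

A typical use would be `isRefData(JSON.parse(text))` on an uploaded file's contents, rejecting the file with a clear message when the guard returns false.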

Enrolling custom words

import { MellonStt, saveCustomRef } from 'mellon-stt'

const stt = new MellonStt({ wasmBasePath: '/assets/wasm/', modelUrl: '/assets/model.onnx' })
await stt.init()

// 1. Create an enrollment session
const session = stt.enrollWord('hey computer')

session.addEventListener('recording-start', () => console.log('Recording…'))
session.addEventListener('sample-added', e => console.log(`Sample ${e.detail.count} recorded`))

// 2. Record at least 3 samples (1.5 s each)
await session.recordSample()
await session.recordSample()
await session.recordSample()

// 3. Generate reference embeddings
session.addEventListener('progress', e => console.log(`Embedding ${e.detail.done}/${e.detail.total}`))
const ref = await session.generateRef()

// 4a. Use immediately in the running detector
stt.addCustomWord(ref)

// 4b. Persist for future sessions
saveCustomRef(ref)

You can also enroll from pre-recorded audio files:

const file = document.querySelector('input[type=file]').files[0]
await session.addAudioFile(file)

Server / bundler configuration

SharedArrayBuffer (required by multi-threaded WASM) is only available when the page is served with:

Cross-Origin-Opener-Policy: same-origin
Cross-Origin-Embedder-Policy: require-corp

Vite dev server

Already configured in the demo app's vite.config.js. For your own project:

// vite.config.js
export default {
  server:  { headers: { 'Cross-Origin-Opener-Policy': 'same-origin', 'Cross-Origin-Embedder-Policy': 'require-corp' } },
  preview: { headers: { 'Cross-Origin-Opener-Policy': 'same-origin', 'Cross-Origin-Embedder-Policy': 'require-corp' } },
}

Express

app.use((req, res, next) => {
  res.setHeader('Cross-Origin-Opener-Policy',  'same-origin')
  res.setHeader('Cross-Origin-Embedder-Policy', 'require-corp')
  next()
})

Nginx

add_header Cross-Origin-Opener-Policy  "same-origin";
add_header Cross-Origin-Embedder-Policy "require-corp";

Netlify (`public/_headers`)

/*
  Cross-Origin-Opener-Policy: same-origin
  Cross-Origin-Embedder-Policy: require-corp

Browser support

| Browser | Supported | Notes |
|---|---|---|
| Chrome / Edge 89+ | ✅ | Full support |
| Firefox 79+ | ✅ | Full support |
| Safari 15.2+ | ✅ | SharedArrayBuffer re-enabled with COOP/COEP |
| Safari < 15.2 | ❌ | SharedArrayBuffer not available |
| iOS Safari 15.2+ | ✅ | Works over HTTPS |
| Node.js | ❌ | Browser-only (AudioContext, getUserMedia) |

License

MIT
