@jsilvanus/embedeer

v1.7.3

A node.js embedding tool with optional GPU acceleration

jsilvanus

embedeer

A Node.js Embedding Tool

A Node.js tool for generating text embeddings using transformers.js with ONNX models from Hugging Face.

Supports batched input, parallel execution, isolated child-process workers (default), in-process threads, a shared socket daemon (one model across multiple OS processes), a gRPC server (HTTP/2 + protobuf, remote-ready), quantization, optional GPU acceleration, and Hugging Face auth.

Features

Downloads any Hugging Face feature-extraction model on first use (cached in ~/.embedeer/models)
Isolated processes (default) — a worker crash cannot bring down the caller
In-process threads — opt-in via mode: 'thread' for lower overhead
Socket daemon — mode: 'socket' runs one persistent server shared across multiple OS processes; one model copy in RAM regardless of client count
gRPC server — mode: 'grpc' exposes the model as a typed HTTP/2 service; works locally or remotely, supports server-streaming for large batches
Multi-server load balancing — point a WorkerPool at multiple servers (e.g. 2 GPU + 1 CPU); the idle-worker queue distributes work naturally
Model idle offload — servers optionally release GPU/CPU memory after inactivity (--idle-timeout) and reload on next request
Sequential execution when concurrency: 1
Configurable batch size and concurrency
GPU acceleration — optional CUDA (Linux x64) and DirectML (Windows x64), no extra packages needed
Hugging Face API token support (--token / HF_TOKEN env var)
Quantization via dtype (fp32 · fp16 · q8 · q4 · q4f16 · auto)
Rich CLI: pull model, embed from file, dump output as JSON / TXT / SQL

How it works

embed(texts)
  │
  ├─ split into batches of batchSize
  │
  └─ Promise.all(batches) ──► WorkerPool
                                 │
                                 ├─ [process mode] ChildProcessWorker 0  → own model copy
                                 ├─ [process mode] ChildProcessWorker 1  → own model copy
                                 │
                                 ├─ [thread mode]  ThreadWorker 0        → own model copy
                                 │
                                 ├─ [socket mode]  SocketWorker 0  ──┐
                                 ├─ [socket mode]  SocketWorker 1  ──┼──► socket-model-server (one shared model)
                                 │                                   │    also connectable from other OS processes
                                 │
                                 └─ [grpc mode]    GrpcWorker 0  ──┐
                                    [grpc mode]    GrpcWorker 1  ──┼──► grpc-model-server (one shared model)
                                                                   │    works locally or over the network

In process and thread modes, each worker loads its own model copy — N workers means N models in memory. In socket and grpc modes, one server process holds the model and all workers are lightweight client connections to it.
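
The flow above can be sketched in a few lines. This is an illustrative stand-in, not the package's WorkerPool: splitIntoBatches, embedAll, and the runBatch callback are hypothetical names, and runBatch is assumed to be an async function that turns one batch of texts into one array of vectors.

```javascript
// Split an array into consecutive batches of at most `batchSize` items.
function splitIntoBatches(texts, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical pool: at most `concurrency` batches are in flight at once.
// `runBatch` is an assumed async callback (batch of texts -> array of vectors).
async function embedAll(texts, { batchSize = 32, concurrency = 2, runBatch }) {
  const batches = splitIntoBatches(texts, batchSize);
  const results = new Array(batches.length);
  let next = 0;

  // Each "worker" claims the next unprocessed batch as soon as it is idle.
  async function worker() {
    while (next < batches.length) {
      const index = next++;
      results[index] = await runBatch(batches[index]);
    }
  }

  const workerCount = Math.min(concurrency, batches.length);
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results.flat(); // one vector per input text, in input order
}
```

Results are written back by batch index, so the output order matches the input order even when batches finish out of order.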

Installation

npm install @jsilvanus/embedeer

TypeScript: The package includes TypeScript declarations so imports are typed automatically.

GPU acceleration: GPU support (CUDA on Linux x64, DirectML on Windows x64) is built into onnxruntime-node, which ships as a transitive dependency, so no additional packages are required. For CUDA on Linux x64 you also need the CUDA 12 system libraries: sudo apt install cuda-toolkit-12-6 libcudnn9-cuda-12

gRPC: @grpc/grpc-js and @grpc/proto-loader are listed as optionalDependencies. They are installed by default but can be skipped with npm install --omit=optional. They are only loaded when mode: 'grpc' is actually used (lazy import at runtime).

Using the package

Embed texts

import { Embedder } from '@jsilvanus/embedeer';

// Default: a CPU embedder
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  batchSize:   32,          // texts per worker task   (default: 32)
  concurrency: 2,           // parallel workers        (default: 2)
  mode:       'process',    // 'process' | 'thread' | 'socket' | 'grpc' (default: 'process')
  pooling:    'mean',       // 'mean' | 'cls' | 'none' (default: 'mean')
  normalize:   true,        // L2-normalise vectors    (default: true)
  token:      'hf_...',     // HF API token (optional; also reads HF_TOKEN env)
  dtype:      'q8',         // quantization dtype      (optional)
  cacheDir:   '/my/cache',  // override model cache    (default: ~/.embedeer/models)
});

// OR: Auto-detect GPU (falls back to CPU if no provider is installed)
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  device: 'auto',
});

// OR: Require GPU (throws if no provider is available)
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  device: 'gpu',
});

// OR: Explicitly select an execution provider
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  provider: 'cuda',  // 'cuda' | 'dml'
});

// embed() returns one vector per input text.
const vectors = await embedder.embed(['Hello world', 'Foo bar baz']);
// → number[][]  (one 384-dim vector per text for all-MiniLM-L6-v2)

await embedder.destroy(); // shut down worker processes
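
To make the pooling and normalize options concrete, here is a minimal sketch of what mean pooling and L2 normalisation do to raw token vectors. It is not the package's code: the function names are hypothetical, and real mean pooling usually weights tokens by the attention mask, which this simplified version ignores.

```javascript
// Mean-pool token vectors (tokens x dims) into a single vector of length dims.
// Simplification: every token counts equally (no attention mask).
function meanPool(tokenVectors) {
  const dims = tokenVectors[0].length;
  const sum = new Array(dims).fill(0);
  for (const vec of tokenVectors) {
    for (let d = 0; d < dims; d++) sum[d] += vec[d];
  }
  return sum.map((v) => v / tokenVectors.length);
}

// Scale a vector to unit L2 length; a zero vector is returned unchanged.
function l2Normalize(vec) {
  const norm = Math.hypot(...vec);
  return norm === 0 ? vec.slice() : vec.map((v) => v / norm);
}

// Dot product; for L2-normalised vectors this equals cosine similarity.
function dot(a, b) {
  return a.reduce((acc, v, i) => acc + v * b[i], 0);
}
```

With normalize: true, comparing two embeddings therefore needs only dot(a, b), with no extra division by the vector lengths.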

Socket daemon mode

Run one persistent model server shared across multiple OS processes. Any process that knows the socket path can connect — useful when several services on the same machine all need embeddings.

# Start the daemon — default model (Xenova/all-MiniLM-L6-v2), CPU, no idle offload
npm run daemon

# Pass arguments after --
npm run daemon -- --model nomic-ai/nomic-embed-text-v1

| Argument | Default | Description |
|---|---|---|
| --model | Xenova/all-MiniLM-L6-v2 | Hugging Face model identifier |
| --socket | auto (/tmp/embedeer-<model>.sock) | Unix socket path |
| --pooling | mean | mean, cls, or none |
| --normalize / --no-normalize | enabled | L2-normalise output vectors |
| --dtype | (none) | Quantization: fp32, fp16, q8, q4, q4f16, or auto |
| --device | cpu | cpu, gpu, or auto |
| --provider | (none) | cuda or dml |
| --token | (none) | Hugging Face API token (also reads HF_TOKEN) |
| --cache-dir | ~/.embedeer/models | Model cache directory |
| --idle-timeout | (none) | Offload model after N ms of inactivity; reload on next request |

Connect from any number of processes:

// In process A (web server) and process B (background worker) — same API
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'socket',
  socketPath: '/tmp/emb.sock',
  autoStartServer: false,   // daemon is already running
});
const vectors = await embedder.embed(['Hello world']);
await embedder.destroy();   // closes connection; daemon keeps running

One model in RAM serves all connected processes. autoStartServer: true (the default) spawns the server automatically and shuts it down with embedder.destroy().
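
The --idle-timeout behaviour (offload after N ms of inactivity, reload on the next request) follows a common pattern that can be sketched as below. This is an assumption about the general technique, not the server's actual implementation; IdleOffloader and its load/unload callbacks are hypothetical. The class caches whatever load returns, so an async loader can return a Promise that callers await.

```javascript
// Sketch only: keeps a resource loaded while it is in use and drops it
// after `idleMs` without activity. `load` and `unload` are hypothetical callbacks.
class IdleOffloader {
  constructor({ load, unload = () => {}, idleMs }) {
    this.load = load;
    this.unload = unload;
    this.idleMs = idleMs;
    this.resource = undefined;
    this.loaded = false;
    this.timer = null;
  }

  // Return the resource, loading it first if it was offloaded.
  acquire() {
    if (!this.loaded) {
      this.resource = this.load();
      this.loaded = true;
    }
    this.touch();
    return this.resource;
  }

  // Restart the inactivity countdown.
  touch() {
    clearTimeout(this.timer);
    this.timer = setTimeout(() => this.offload(), this.idleMs);
    // Do not keep the Node.js process alive just for this timer.
    if (typeof this.timer.unref === 'function') this.timer.unref();
  }

  // Release the resource now; the next acquire() reloads it.
  offload() {
    clearTimeout(this.timer);
    this.timer = null;
    if (!this.loaded) return;
    this.unload(this.resource);
    this.resource = undefined;
    this.loaded = false;
  }
}
```

The trade-off is the one the option implies: memory is freed while the server is idle, at the cost of a reload delay on the first request afterwards.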

gRPC server mode

Run the model as a gRPC service (HTTP/2 + Protocol Buffers). Works locally or over a network.

# Start the server — default model (Xenova/all-MiniLM-L6-v2), localhost:50051
npm run server

# Pass arguments after --
npm run server -- --address 0.0.0.0:50051        # listen on all interfaces

| Argument | Default | Description |
|---|---|---|
| --model | Xenova/all-MiniLM-L6-v2 | Hugging Face model identifier |
| --address | localhost:50051 | Bind address (host:port) |
| --pooling | mean | mean, cls, or none |
| --normalize / --no-normalize | enabled | L2-normalise output vectors |
| --dtype | (none) | Quantization: fp32, fp16, q8, q4, q4f16, or auto |
| --device | cpu | cpu, gpu, or auto |
| --provider | (none) | cuda or dml |
| --token | (none) | Hugging Face API token (also reads HF_TOKEN) |
| --cache-dir | ~/.embedeer/models | Model cache directory |
| --idle-timeout | (none) | Offload model after N ms of inactivity; reload on next request |

Connect from Node.js using this package's client:

// Auto-start a local server (it exits with this process), then send it requests
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'grpc',
  grpcAddress: 'localhost:50051',
});

// OR: connect to a server that is already running on a remote host
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'grpc',
  grpcAddress: '10.0.1.42:50051',
  autoStartServer: false,
});

const vectors = await embedder.embed(['Hello world']);
await embedder.destroy();

Multi-server load balancing

Point a WorkerPool at multiple servers. The idle-worker queue acts as a natural load balancer — workers on faster servers (GPU) finish sooner and pick up proportionally more tasks.

const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'grpc',               // or 'socket'
  dtype: 'fp16',              // uniform across all servers
  servers: [
    { address: 'localhost:50051', workers: 6, device: 'gpu', provider: 'cuda' },
    { address: 'localhost:50052', workers: 6, device: 'gpu', provider: 'cuda' },
    { address: 'localhost:50053', workers: 2, device: 'cpu' },
  ],
  autoStartServer: true,
});
// 14 workers total; GPU servers receive ~5× more requests than the CPU server

Note that with autoStartServer: true, the Embedder starts these servers itself and then connects its worker clients to them. In this example all three servers run locally and are tied to the lifetime of this process.
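
The "natural load balancer" effect can be reproduced with a small deterministic simulation. This is a sketch under stated assumptions, not the package's scheduler: simulateIdleQueue and the msPerTask field are hypothetical, and each task is assumed to take a fixed time on a given server.

```javascript
// Simulate an idle-worker queue: every task goes to whichever worker frees up first.
// `servers` entries use hypothetical fields: name, workers (count), msPerTask (assumed cost).
function simulateIdleQueue(servers, taskCount) {
  const workers = servers.flatMap((s) =>
    Array.from({ length: s.workers }, () => ({
      server: s.name,
      msPerTask: s.msPerTask,
      freeAt: 0, // simulated time at which this worker is idle again
    }))
  );
  const counts = Object.fromEntries(servers.map((s) => [s.name, 0]));

  for (let t = 0; t < taskCount; t++) {
    // Pick the worker that becomes idle earliest (ties go to the first listed).
    let pick = workers[0];
    for (const w of workers) {
      if (w.freeAt < pick.freeAt) pick = w;
    }
    pick.freeAt += pick.msPerTask;
    counts[pick.server] += 1;
  }
  return counts;
}

// One fast worker (10 ms per task) and one slow worker (50 ms per task), 12 tasks.
console.log(simulateIdleQueue(
  [
    { name: 'gpu', workers: 1, msPerTask: 10 },
    { name: 'cpu', workers: 1, msPerTask: 50 },
  ],
  12
)); // { gpu: 10, cpu: 2 }
```

No weights are configured anywhere: the faster worker simply returns to the idle queue more often, so it ends up with a share of tasks proportional to its speed.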

Model management

Model compatibility (ONNX)

Embedeer runs models via onnxruntime-node. Models chosen from Hugging Face must provide an ONNX export compatible with ONNX Runtime, or be convertible to ONNX (see Optimum).

Pulling/pre-caching models

Embedeer supports pre-caching and managing downloaded models.

Pull (pre-cache) a model via the CLI: npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2
Programmatic pre-cache using loadModel()
Cache location: default is ~/.embedeer/models. Override with the CLI --cache-dir option or the cacheDir argument to loadModel().

Local models

Use a local model path directly (no copying): npx @jsilvanus/embedeer --use-local /path/to/local-model --data "Hello world"
Copy a local model into the cache and give it a stable name: npx @jsilvanus/embedeer --load-local /path/to/local-model --name my-local-model
Use the local model after copying: npx @jsilvanus/embedeer --model my-local-model or npx @jsilvanus/embedeer --model ~/.embedeer/models/my-local-model

Programmatic helpers

Some helpers are available in the public API:

importLocalModel(src, { name?, cacheDir? }) — copy a local model into the cache and return { modelName, path }.
await Embedder.create('/path/to/local-model', { cacheDir: '/my/cache' }); — use a model from a custom path
getCacheDir() — return the resolved cache directory used by embedeer (useful when you want to manage files yourself).
isModelDownloaded(name) / listModels() / getCachedModels() — inspect the cache.
getLoadedModels() — returns an array of model names currently loaded by active worker pools.
deleteModel(modelName, { cacheDir? }) — remove cached model directories matching modelName.

These functions are exported from the public package entry (src/index.js) so you can import them from @jsilvanus/embedeer.

CLI

Full command-line documentation has moved to CLI.md.

GPU Acceleration

GPU support is built into onnxruntime-node (a dependency of @huggingface/transformers):

| Platform | Provider | Requirement |
|---|---|---|
| Linux x64 | CUDA | NVIDIA GPU + driver ≥ 525, CUDA 12 toolkit, cuDNN 9 |
| Windows x64 | DirectML | Any DirectX 12 GPU (most GPUs since 2016), Windows 10+ |

Provider selection logic

| device | provider | Behavior |
|---|---|---|
| cpu (default) | (unset) | Always CPU |
| auto | (unset) | Try GPU providers for the platform in order; silent CPU fallback |
| gpu | (unset) | Try GPU providers; throw if none available |
| any | cuda | Load CUDA provider; throw if not available or not supported |
| any | dml | Load DirectML provider; throw if not available or not supported |
| any | cpu | Always CPU |

On Linux x64: GPU order is cuda.
On Windows x64: GPU order is cuda → dml.
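
The selection rules above can be read as a small decision function. The sketch below only mirrors the table and the per-platform order listed here; it is not the library's implementation, and providersToTry is a hypothetical name. It returns the providers to attempt in order; a caller would throw if the list is exhausted without success.

```javascript
// Sketch of the selection table above; not the library's actual code.
// `platform` and `arch` default to the current Node.js process values.
function providersToTry(
  { device = 'cpu', provider } = {},
  platform = process.platform,
  arch = process.arch
) {
  // An explicit provider overrides `device` ("any" in the table).
  if (provider) return [provider];
  if (device === 'cpu') return ['cpu'];

  // GPU provider order per platform, as listed above.
  let gpuOrder = [];
  if (arch === 'x64' && platform === 'linux') gpuOrder = ['cuda'];
  if (arch === 'x64' && platform === 'win32') gpuOrder = ['cuda', 'dml'];

  // 'auto' falls back to CPU; 'gpu' has no fallback (empty list means "throw").
  if (device === 'auto') return [...gpuOrder, 'cpu'];
  if (device === 'gpu') return gpuOrder;
  throw new RangeError(`Unknown device: ${device}`);
}
```

For example, providersToTry({ device: 'auto' }, 'win32', 'x64') yields cuda, then dml, then cpu, while device: 'gpu' on an unsupported platform yields an empty list, which corresponds to the "throw if none available" row.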

Testing

CI is enabled via GitHub Actions (.github/workflows/ci.yml) which runs tests and collects coverage on push and pull requests.

Performance Optimizations

How to tune performance?

Embedeer exposes runtime knobs and helper scripts to tune throughput for your host.

Pre-load models: run Embedder.loadModel(model, { dtype, cacheDir }) or use the bench scripts so workers start instantly without re-downloading models.
Reuse Embedder instances: create a single Embedder and call embed() repeatedly instead of creating and destroying instances per batch.
Batch size vs concurrency:
- CPU: moderate batch sizes (16–64) with multiple workers (concurrency ≥ 2) usually give best throughput.
- GPU: larger batches (64–256) with low concurrency (1–2) are typically fastest.
BLAS threading: avoid oversubscription by setting OMP_NUM_THREADS and MKL_NUM_THREADS to Math.floor(cpu_cores / concurrency) before starting workers.
Device/provider: use cuda on Linux and dml (DirectML) on Windows when available; device: 'auto' will try providers and fall back to CPU.
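
The BLAS threading rule is simple arithmetic, sketched here as two helpers. The function names are hypothetical, and the clamp to a minimum of 1 thread is an added assumption for the case where concurrency exceeds the core count; the formula itself is the Math.floor(cpu_cores / concurrency) from the tip above.

```javascript
// Threads each worker may use so that workers x threads does not exceed the core count.
// Clamped to at least 1 (an assumption, for concurrency > cpuCores).
function blasThreadsPerWorker(cpuCores, concurrency) {
  return Math.max(1, Math.floor(cpuCores / concurrency));
}

// Environment variables to set before any workers are started.
function blasThreadEnv(cpuCores, concurrency) {
  const threads = String(blasThreadsPerWorker(cpuCores, concurrency));
  return { OMP_NUM_THREADS: threads, MKL_NUM_THREADS: threads };
}

// Example: 8 cores shared by 2 workers gives 4 threads per worker.
console.log(blasThreadEnv(8, 2)); // { OMP_NUM_THREADS: '4', MKL_NUM_THREADS: '4' }
```

In a real script the core count would typically come from the node:os module (for example os.cpus().length), and the result would be merged into process.env before Embedder.create() is called so that spawned workers inherit it.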

Automatic performance tuning

Use bench/grid-search.js to sweep batchSize, concurrency, and dtype for your host and save the results. You can also generate and persist a per-user profile and apply it automatically via the Embedder APIs.

Examples:

# CPU quick grid
node bench/grid-search.js --device cpu --sample-size 200 --out bench/grid-results-cpu.json

# GPU quick grid
node bench/grid-search.js --device gpu --sample-size 100 --out bench/grid-results-gpu.json

Programmatic performance tuning

You can generate and save a per-user performance profile which Embedder.create() will automatically apply. This is useful to pick the best batchSize / concurrency for your machine without manual tuning.

import { Embedder } from '@jsilvanus/embedeer';

// Quick profile generation (writes ~/.embedeer/perf-profile.json)
await Embedder.generateAndSaveProfile({ mode: 'quick', device: 'cpu', sampleSize: 100 });
// Subsequent calls to Embedder.create() will auto-apply the saved profile by default.

Server mode benchmark

Compare socket and gRPC server throughput against the process/thread baseline:

npm run server-bench
# or with options:
node bench/server-bench.js --model Xenova/all-MiniLM-L6-v2 --batch-size 32 --sample-size 500

The benchmark starts each server as a subprocess, waits for it to load the model, runs embeddings, then shuts it down. It reports startup time (spawn → ready) separately from embedding throughput so you can see the fixed cost of model loading vs. steady-state performance.

Options:
  --model       <name>   HF model identifier  (default: Xenova/all-MiniLM-L6-v2)
  --batch-size  <n>      Texts per request     (default: 32)
  --dtype       <type>   Quantization dtype    (default: none)
  --sample-size <n>      Number of texts       (default: 200)
  --skip-socket          Skip socket runner
  --skip-grpc            Skip gRPC runner
  --skip-baseline        Skip process/thread baseline
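
To show why startup time and throughput are reported separately, here is a minimal sketch of how the two figures relate to three timestamps. The field names and summarizeRun are hypothetical and do not describe the benchmark script's real output format.

```javascript
// Derive startup cost and steady-state throughput from three timestamps (milliseconds).
// Field names are hypothetical, not the benchmark's actual output.
function summarizeRun({ spawnedAt, readyAt, finishedAt, sampleSize }) {
  const startupMs = readyAt - spawnedAt; // spawn -> ready (model loading)
  const embedMs = finishedAt - readyAt;  // time spent embedding only
  return {
    startupMs,
    embedMs,
    textsPerSecond: embedMs > 0 ? (sampleSize * 1000) / embedMs : Infinity,
  };
}

// 2 s to load the model, then 500 texts embedded in 2.5 s.
console.log(summarizeRun({ spawnedAt: 0, readyAt: 2000, finishedAt: 4500, sampleSize: 500 }));
// { startupMs: 2000, embedMs: 2500, textsPerSecond: 200 }
```

Folding the 2 s startup into the throughput figure would report about 111 texts per second instead of 200, which understates steady-state performance for a long-running server.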

License

MIT