@jsilvanus/embedeer

v1.7.3

A node.js embedding tool with optional GPU acceleration

embedeer

Embedeer Logo: a deer with vector numbers between antlers. Logo generated by ChatGPT. Public Domain.

A Node.js Embedding Tool

A Node.js tool for generating text embeddings using transformers.js with ONNX models from Hugging Face.

Supports batched input, parallel execution, isolated child-process workers (default), in-process threads, a shared socket daemon (one model across multiple OS processes), a gRPC server (HTTP/2 + protobuf, remote-ready), quantization, optional GPU acceleration, and Hugging Face auth.


Features

  • Downloads any Hugging Face feature-extraction model on first use (cached in ~/.embedeer/models)
  • Isolated processes (default) — a worker crash cannot bring down the caller
  • In-process threads — opt-in via mode: 'thread' for lower overhead
  • Socket daemon (mode: 'socket'): runs one persistent server shared across multiple OS processes; one model copy in RAM regardless of client count
  • gRPC server (mode: 'grpc'): exposes the model as a typed HTTP/2 service; works locally or remotely, supports server-streaming for large batches
  • Multi-server load balancing — point a WorkerPool at multiple servers (e.g. 2 GPU + 1 CPU); the idle-worker queue distributes work naturally
  • Model idle offload — servers optionally release GPU/CPU memory after inactivity (--idle-timeout) and reload on next request
  • Sequential execution when concurrency: 1
  • Configurable batch size and concurrency
  • GPU acceleration — optional CUDA (Linux x64) and DirectML (Windows x64), no extra packages needed
  • Hugging Face API token support (--token / HF_TOKEN env var)
  • Quantization via dtype (fp32 · fp16 · q8 · q4 · q4f16 · auto)
  • Rich CLI: pull model, embed from file, dump output as JSON / TXT / SQL

How it works

embed(texts)
  │
  ├─ split into batches of batchSize
  │
  └─ Promise.all(batches) ──► WorkerPool
                                 │
                                 ├─ [process mode] ChildProcessWorker 0  → own model copy
                                 ├─ [process mode] ChildProcessWorker 1  → own model copy
                                 │
                                 ├─ [thread mode]  ThreadWorker 0        → own model copy
                                 │
                                 ├─ [socket mode]  SocketWorker 0  ──┐
                                 ├─ [socket mode]  SocketWorker 1  ──┼──► socket-model-server (one shared model)
                                 │                                   │    also connectable from other OS processes
                                 │
                                 └─ [grpc mode]    GrpcWorker 0  ──┐
                                    [grpc mode]    GrpcWorker 1  ──┼──► grpc-model-server (one shared model)
                                                                   │    works locally or over the network

In process and thread modes, each worker loads its own model copy — N workers means N models in memory. In socket and grpc modes, one server process holds the model and all workers are lightweight client connections to it.
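
The flow in the diagram above can be pictured with a minimal sketch. It is not embedeer's implementation: splitIntoBatches, embedAll and the embedBatch callback are hypothetical names, and the "workers" here are plain async loops standing in for real worker processes or threads.

```javascript
// Split an array into consecutive batches of at most `batchSize` items.
function splitIntoBatches(texts, batchSize) {
  if (!Number.isInteger(batchSize) || batchSize < 1) {
    throw new RangeError('batchSize must be a positive integer');
  }
  const batches = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }
  return batches;
}

// Hypothetical stand-in for a worker pool: at most `concurrency` batches
// are in flight at once, and results come back in input order.
async function embedAll(texts, { batchSize, concurrency, embedBatch }) {
  const batches = splitIntoBatches(texts, batchSize);
  const results = new Array(batches.length);
  let next = 0;
  async function workerLoop() {
    while (next < batches.length) {
      const index = next++; // claim the next unprocessed batch
      results[index] = await embedBatch(batches[index]);
    }
  }
  const loops = Array.from({ length: Math.min(concurrency, batches.length) }, () => workerLoop());
  await Promise.all(loops);
  return results.flat(); // one vector per input text
}

// Usage with a fake embedder that maps each text to a 1-dim "vector".
const fakeEmbedBatch = async (batch) => batch.map((text) => [text.length]);
embedAll(['a', 'bb', 'ccc'], { batchSize: 2, concurrency: 2, embedBatch: fakeEmbedBatch })
  .then((vectors) => console.log(vectors)); // [ [ 1 ], [ 2 ], [ 3 ] ]
```

Because each loop claims the next batch only when it is idle, slower batches do not block faster ones, yet the output order still matches the input order.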


Installation

npm install @jsilvanus/embedeer

TypeScript: The package includes TypeScript declarations so imports are typed automatically.

GPU acceleration: CUDA (Linux x64) and DirectML (Windows x64) support is built into onnxruntime-node, which ships as a transitive dependency, so no additional packages are required. For CUDA on Linux x64 you also need the CUDA 12 system libraries:

sudo apt install cuda-toolkit-12-6 libcudnn9-cuda-12

gRPC: @grpc/grpc-js and @grpc/proto-loader are listed as optionalDependencies. They are installed by default but can be skipped by passing --omit=optional to npm install. They are only loaded when mode: 'grpc' is actually used (lazy import at runtime).

Using the package

Embed texts

import { Embedder } from '@jsilvanus/embedeer';

// Default: CPU embedder
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  batchSize:   32,          // texts per worker task   (default: 32)
  concurrency: 2,           // parallel workers        (default: 2)
  mode:       'process',    // 'process' | 'thread' | 'socket' | 'grpc' (default: 'process')
  pooling:    'mean',       // 'mean' | 'cls' | 'none' (default: 'mean')
  normalize:   true,        // L2-normalise vectors    (default: true)
  token:      'hf_...',     // HF API token (optional; also reads HF_TOKEN env)
  dtype:      'q8',         // quantization dtype      (optional)
  cacheDir:   '/my/cache',  // override model cache    (default: ~/.embedeer/models)
});

// Alternatives: use ONE of the following in place of the call above.

// Auto-detect GPU (falls back to CPU if no provider is installed)
// const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
//   device: 'auto',
// });

// Require GPU (throws if no provider is available)
// const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
//   device: 'gpu',
// });

// Explicitly select an execution provider
// const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
//   provider: 'cuda',  // 'cuda' | 'dml'
// });

// embed() returns the embeddings.
const vectors = await embedder.embed(['Hello world', 'Foo bar baz']);
// → number[][]  (one 384-dim vector per text for all-MiniLM-L6-v2)

await embedder.destroy(); // shut down worker processes
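
To make the pooling and normalize options more concrete, here is a simplified sketch of what mean pooling and L2 normalisation do to a set of per-token vectors. The helper names (meanPool, l2Normalize, dot) are illustrative and not part of the package, and real models also weight tokens by an attention mask, which this sketch ignores.

```javascript
// Mean pooling: average the per-token vectors into one sentence vector.
function meanPool(tokenVectors) {
  const dim = tokenVectors[0].length;
  const sum = new Array(dim).fill(0);
  for (const vec of tokenVectors) {
    for (let i = 0; i < dim; i++) sum[i] += vec[i];
  }
  return sum.map((x) => x / tokenVectors.length);
}

// L2 normalisation: scale a vector to unit length (zero vectors are returned unchanged).
function l2Normalize(vec) {
  const norm = Math.hypot(...vec);
  return norm === 0 ? vec.slice() : vec.map((x) => x / norm);
}

// For unit-length vectors, cosine similarity reduces to a dot product.
function dot(a, b) {
  return a.reduce((acc, x, i) => acc + x * b[i], 0);
}

const pooled = meanPool([[1, 2], [5, 6]]); // [3, 4]
const unit = l2Normalize(pooled);          // approximately [0.6, 0.8]
console.log(dot(unit, unit).toFixed(3));   // "1.000"
```

This is why vectors produced with normalize: true can be compared with a plain dot product instead of a full cosine-similarity computation.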

Socket daemon mode

Run one persistent model server shared across multiple OS processes. Any process that knows the socket path can connect — useful when several services on the same machine all need embeddings.

# Start the daemon — default model (Xenova/all-MiniLM-L6-v2), CPU, no idle offload
npm run daemon

# Pass arguments after --
npm run daemon -- --model nomic-ai/nomic-embed-text-v1

| Argument | Default | Description |
|---|---|---|
| --model | Xenova/all-MiniLM-L6-v2 | Hugging Face model identifier |
| --socket | auto (/tmp/embedeer-<model>.sock) | Unix socket path |
| --pooling | mean | mean / cls / none |
| --normalize / --no-normalize | enabled | L2-normalise output vectors |
| --dtype | (none) | Quantization: fp32 / fp16 / q8 / q4 / q4f16 / auto |
| --device | cpu | cpu / gpu / auto |
| --provider | (none) | cuda / dml |
| --token | (none) | Hugging Face API token (also reads HF_TOKEN) |
| --cache-dir | ~/.embedeer/models | Model cache directory |
| --idle-timeout | (none) | Offload model after N ms of inactivity; reload on next request |

Connect from any number of processes:

// In process A (web server) and process B (background worker) — same API
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'socket',
  socketPath: '/tmp/emb.sock',
  autoStartServer: false,   // daemon is already running
});
const vectors = await embedder.embed(['Hello world']);
await embedder.destroy();   // closes connection; daemon keeps running

One model in RAM serves all connected processes. autoStartServer: true (the default) spawns the server automatically and shuts it down with embedder.destroy().
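
The --idle-timeout behaviour from the argument table (offload after N ms of inactivity, reload on the next request) can be pictured with a small sketch. This is only an illustration of the idea, not the daemon's code: IdleOffloader and its load and release callbacks are hypothetical stand-ins for loading and freeing a model.

```javascript
// Hypothetical helper illustrating idle offload; `load` and `release`
// are caller-supplied stand-ins for loading and freeing a model.
class IdleOffloader {
  constructor({ load, release, idleMs }) {
    this.load = load;
    this.release = release;
    this.idleMs = idleMs;
    this.pending = null; // promise for the loaded resource, or null when offloaded
    this.timer = null;
  }

  // Return the resource, (re)loading it if needed, and restart the idle countdown.
  async acquire() {
    if (this.pending === null) {
      this.pending = Promise.resolve()
        .then(() => this.load())
        .catch((err) => {
          this.pending = null; // allow a retry after a failed load
          throw err;
        });
    }
    const resource = await this.pending;
    clearTimeout(this.timer);
    this.timer = setTimeout(() => this.offload(), this.idleMs);
    if (typeof this.timer.unref === 'function') this.timer.unref(); // don't keep Node alive
    return resource;
  }

  // Free the resource now; the next acquire() triggers a fresh load.
  offload() {
    clearTimeout(this.timer);
    const pending = this.pending;
    this.pending = null;
    if (pending !== null) pending.then((r) => this.release(r), () => {});
  }
}
```

Concurrent callers share a single in-flight load, so a burst of requests after an idle period triggers only one reload.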

gRPC server mode

Run the model as a gRPC service (HTTP/2 + Protocol Buffers). Works locally or over a network.

# Start the server — default model (Xenova/all-MiniLM-L6-v2), localhost:50051
npm run server

# Pass arguments after --
npm run server -- --address 0.0.0.0:50051        # listen on all interfaces

| Argument | Default | Description |
|---|---|---|
| --model | Xenova/all-MiniLM-L6-v2 | Hugging Face model identifier |
| --address | localhost:50051 | Bind address (host:port) |
| --pooling | mean | mean / cls / none |
| --normalize / --no-normalize | enabled | L2-normalise output vectors |
| --dtype | (none) | Quantization: fp32 / fp16 / q8 / q4 / q4f16 / auto |
| --device | cpu | cpu / gpu / auto |
| --provider | (none) | cuda / dml |
| --token | (none) | Hugging Face API token (also reads HF_TOKEN) |
| --cache-dir | ~/.embedeer/models | Model cache directory |
| --idle-timeout | (none) | Offload model after N ms of inactivity; reload on next request |

Connect from Node.js using this package's client:

// Option 1: auto-start a local server (it exits with this process), then send it data
const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'grpc',
  grpcAddress: 'localhost:50051',
});

// Option 2: connect to an already-running remote server instead
// const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
//   mode: 'grpc',
//   grpcAddress: '10.0.1.42:50051',
//   autoStartServer: false,
// });

const vectors = await embedder.embed(['Hello world']);
await embedder.destroy();

Multi-server load balancing

Point a WorkerPool at multiple servers. The idle-worker queue acts as a natural load balancer — workers on faster servers (GPU) finish sooner and pick up proportionally more tasks.

const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  mode: 'grpc',               // or 'socket'
  dtype: 'fp16',              // uniform across all servers
  servers: [
    { address: 'localhost:50051', workers: 6, device: 'gpu', provider: 'cuda' },
    { address: 'localhost:50052', workers: 6, device: 'gpu', provider: 'cuda' },
    { address: 'localhost:50053', workers: 2, device: 'cpu' },
  ],
  autoStartServer: true,
});
// 14 workers total; GPU servers receive ~5× more requests than the CPU server

With autoStartServer: true, the Embedder spawns these three servers itself and then connects worker clients to them. All three servers are local and tied to the lifetime of this process.
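
The "natural load balancer" effect of an idle-worker queue can be shown with a tiny simulation. This is a sketch under stated assumptions rather than embedeer's scheduler: simulateIdleQueue is a hypothetical function, and the per-task timings (10 ms for a GPU worker, 40 ms for a CPU worker) are invented for illustration.

```javascript
// Simulate an idle-worker queue: each task goes to whichever worker frees up first.
// `servers` entries are hypothetical: { name, workers, msPerTask }.
function simulateIdleQueue(servers, taskCount) {
  // One entry per worker, tracking the time at which it next becomes idle.
  const workers = servers.flatMap((s) =>
    Array.from({ length: s.workers }, () => ({ server: s.name, msPerTask: s.msPerTask, freeAt: 0 })));
  const handled = Object.fromEntries(servers.map((s) => [s.name, 0]));
  for (let t = 0; t < taskCount; t++) {
    // The earliest-idle worker takes the next task (ties go to the first listed).
    const w = workers.reduce((best, cur) => (cur.freeAt < best.freeAt ? cur : best));
    w.freeAt += w.msPerTask;
    handled[w.server] += 1;
  }
  return handled;
}

const counts = simulateIdleQueue(
  [
    { name: 'gpu', workers: 2, msPerTask: 10 }, // assumed timings, for illustration only
    { name: 'cpu', workers: 1, msPerTask: 40 },
  ],
  90,
);
console.log(counts); // { gpu: 80, cpu: 10 }
```

No weights are configured anywhere: the faster workers simply return to the idle queue more often, so they end up with proportionally more of the work.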


Model management

Model compatibility (ONNX)

Embedeer runs models via onnxruntime-node. Models chosen from Hugging Face must provide an ONNX export compatible with ONNX Runtime, or be convertible to ONNX (see Optimum).

Pulling/pre-caching models

Embedeer supports pre-caching and managing downloaded models.

  • Pull (pre-cache) a model via the CLI: npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2
  • Programmatic pre-cache using loadModel()
  • Cache location: default is ~/.embedeer/models. Override with the CLI --cache-dir option or the cacheDir argument to loadModel().

Local models

  • Use a local model path directly (no copying): npx @jsilvanus/embedeer --use-local /path/to/local-model --data "Hello world"
  • Copy a local model into the cache and give it a stable name: npx @jsilvanus/embedeer --load-local /path/to/local-model --name my-local-model
  • Use the local model after copying: npx @jsilvanus/embedeer --model my-local-model or npx @jsilvanus/embedeer --model ~/.embedeer/models/my-local-model

Programmatic helpers

Some helpers are available from the public API:

  • importLocalModel(src, { name?, cacheDir? }) — copy a local model into the cache and return { modelName, path }.
  • await Embedder.create('/path/to/local-model', { cacheDir: '/my/cache' }); — use a model from a custom path
  • getCacheDir() — return the resolved cache directory used by embedeer (useful when you want to manage files yourself).
  • isModelDownloaded(name) / listModels() / getCachedModels() — inspect the cache.
  • getLoadedModels() — returns an array of model names currently loaded by active worker pools.
  • deleteModel(modelName, { cacheDir? }) — remove cached model directories matching modelName.

These functions are exported from the public package entry (src/index.js) so you can import them from @jsilvanus/embedeer.


CLI

Full command-line documentation has moved to CLI.md.

GPU Acceleration

GPU support is built into onnxruntime-node (a dependency of @huggingface/transformers):

| Platform | Provider | Requirement |
|---|---|---|
| Linux x64 | CUDA | NVIDIA GPU + driver ≥ 525, CUDA 12 toolkit, cuDNN 9 |
| Windows x64 | DirectML | Any DirectX 12 GPU (most GPUs since 2016), Windows 10+ |

Provider selection logic

| device | provider | Behavior |
|---|---|---|
| cpu (default) | (unset) | Always CPU |
| auto | (unset) | Try GPU providers for the platform in order; silent CPU fallback |
| gpu | (unset) | Try GPU providers; throw if none available |
| any | cuda | Load CUDA provider; throw if not available or not supported |
| any | dml | Load DirectML provider; throw if not available or not supported |
| any | cpu | Always CPU |

On Linux x64: GPU order is cuda.
On Windows x64: GPU order is cuda → dml.
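
The selection rules above can be restated as a small pure function. This is a sketch of the documented rules only, not the package's actual code: planProviders, GPU_ORDER and the returned field names are hypothetical, and platform/arch are assumed to carry the values Node.js reports in process.platform and process.arch.

```javascript
// GPU provider order per platform, taken from the notes above.
const GPU_ORDER = {
  linux: ['cuda'],
  win32: ['cuda', 'dml'],
};

// Return which providers to try, in order, and whether CPU fallback is allowed.
function planProviders({ device = 'cpu', provider, platform, arch }) {
  if (provider === 'cpu') return { tryInOrder: [], cpuFallback: true };
  // An explicit GPU provider is tried alone; failure should throw, not fall back.
  if (provider) return { tryInOrder: [provider], cpuFallback: false };
  if (device === 'cpu') return { tryInOrder: [], cpuFallback: true };
  const order = arch === 'x64' ? (GPU_ORDER[platform] ?? []) : [];
  // 'auto' may silently fall back to CPU; 'gpu' must not.
  return { tryInOrder: order, cpuFallback: device === 'auto' };
}

console.log(planProviders({ device: 'auto', platform: 'win32', arch: 'x64' }));
// { tryInOrder: [ 'cuda', 'dml' ], cpuFallback: true }
```

A caller would walk tryInOrder, use the first provider that loads, and then either fall back to CPU or throw depending on cpuFallback.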


Testing

CI is enabled via GitHub Actions (.github/workflows/ci.yml) which runs tests and collects coverage on push and pull requests.


Performance Optimizations

How to tune performance?

Embedeer exposes runtime knobs and helper scripts to tune throughput for your host.

  • Pre-load models: run Embedder.loadModel(model, { dtype, cacheDir }) or use the bench scripts so workers start instantly without re-downloading models.
  • Reuse Embedder instances: create a single Embedder and call embed() repeatedly instead of creating and destroying instances per batch.
  • Batch size vs concurrency:
    • CPU: moderate batch sizes (16–64) with multiple workers (concurrency ≥ 2) usually give best throughput.
    • GPU: larger batches (64–256) with low concurrency (1–2) are typically fastest.
  • BLAS threading: avoid oversubscription by setting OMP_NUM_THREADS and MKL_NUM_THREADS to Math.floor(cpu_cores / concurrency) before starting workers.
  • Device/provider: use cuda on Linux and dml (DirectML) on Windows when available; device: 'auto' will try providers and fall back to CPU.
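
As a worked example of the BLAS threading rule above, the following sketch computes the per-worker thread count. blasThreadEnv is a hypothetical helper, and the clamp to a minimum of one thread is an added assumption for the case where concurrency exceeds the core count.

```javascript
// Threads per worker, so that workers × threads does not exceed the core count.
function blasThreadEnv(cpuCores, concurrency) {
  const threads = Math.max(1, Math.floor(cpuCores / concurrency));
  return { OMP_NUM_THREADS: String(threads), MKL_NUM_THREADS: String(threads) };
}

// Example: 8 cores shared by 2 workers gives 4 threads per worker.
console.log(blasThreadEnv(8, 2));
// { OMP_NUM_THREADS: '4', MKL_NUM_THREADS: '4' }
```

In a real script the core count would typically come from os.availableParallelism() (or os.cpus().length) in node:os, and the result would be merged into process.env before any workers are created, for example with Object.assign(process.env, blasThreadEnv(cores, concurrency)).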

Automatic performance tuning

  • Automatic tuning: use bench/grid-search.js to sweep batchSize, concurrency, and dtype for your host and save results. You can generate and persist a per-user profile and apply it automatically via the Embedder APIs.

Examples:

# CPU quick grid
node bench/grid-search.js --device cpu --sample-size 200 --out bench/grid-results-cpu.json

# GPU quick grid
node bench/grid-search.js --device gpu --sample-size 100 --out bench/grid-results-gpu.json

Programmatic performance tuning

You can generate and save a per-user performance profile which Embedder.create() will automatically apply. This is useful to pick the best batchSize / concurrency for your machine without manual tuning.

import { Embedder } from '@jsilvanus/embedeer';

// Quick profile generation (writes ~/.embedeer/perf-profile.json)
await Embedder.generateAndSaveProfile({ mode: 'quick', device: 'cpu', sampleSize: 100 });
// Subsequent calls to Embedder.create() will auto-apply the saved profile by default.

Server mode benchmark

Compare socket and gRPC server throughput against the process/thread baseline:

npm run server-bench
# or with options:
node bench/server-bench.js --model Xenova/all-MiniLM-L6-v2 --batch-size 32 --sample-size 500

The benchmark starts each server as a subprocess, waits for it to load the model, runs embeddings, then shuts it down. It reports startup time (spawn → ready) separately from embedding throughput, so you can see the fixed cost of model loading vs. steady-state performance.

Options:
  --model       <name>   HF model identifier  (default: Xenova/all-MiniLM-L6-v2)
  --batch-size  <n>      Texts per request     (default: 32)
  --dtype       <type>   Quantization dtype    (default: none)
  --sample-size <n>      Number of texts       (default: 200)
  --skip-socket          Skip socket runner
  --skip-grpc            Skip gRPC runner
  --skip-baseline        Skip process/thread baseline

License

MIT