@jsilvanus/embedeer
v1.4.0
A Node.js embedding tool with optional GPU acceleration
embedeer

A Node.js Embedding Tool
A Node.js tool for generating text embeddings using models from Hugging Face.
Supports batched input, parallel execution, isolated child-process workers (default) or in-process threads, quantization, optional GPU acceleration, and Hugging Face auth.
Features
- Downloads any Hugging Face feature-extraction model on first use (cached in `~/.embedeer/models`)
- Isolated processes (default) — a worker crash cannot bring down the caller
- In-process threads — opt-in via `mode: 'thread'` for lower overhead
- Sequential execution when `concurrency: 1`
- Configurable batch size and concurrency
- GPU acceleration — optional CUDA (Linux x64) and DirectML (Windows x64), no extra packages needed
- Hugging Face API token support (`--token` / `HF_TOKEN` env var)
- Quantization via `dtype` (fp32 · fp16 · q8 · q4 · q4f16 · auto)
- Rich CLI: pull a model, embed from a file, dump output as JSON / TXT / SQL
Installation

```
npm install @jsilvanus/embedeer
```

GPU acceleration (CUDA on Linux x64, DirectML on Windows x64) is built into onnxruntime-node, which ships as a transitive dependency. No additional packages are required.

For CUDA on Linux x64 you also need the CUDA 12 system libraries:

```
# Ubuntu / Debian
sudo apt install cuda-toolkit-12-6 libcudnn9-cuda-12
```
## Programmatic API

Model management

Embedeer supports pre-caching and managing downloaded models.

- Pull (pre-cache) a model via the CLI:

```
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2
```

- Programmatic pre-cache using `loadModel()`:

```js
import { loadModel } from '@jsilvanus/embedeer';

const { modelName, cacheDir } = await loadModel('Xenova/all-MiniLM-L6-v2', {
  token: 'hf_...',       // optional HF token
  dtype: 'q8',           // optional quantization
  cacheDir: '/my/cache', // optional override
});
```

Cache location: the default is `~/.embedeer/models`. Override it with the CLI `--cache-dir` option or the `cacheDir` argument to `loadModel()`.

Removing cached models: delete the model directory from the cache. Example:

```
# Unix
rm -rf ~/.embedeer/models/Xenova-all-MiniLM-L6-v2

# PowerShell (Windows)
Remove-Item -Recurse -Force $env:USERPROFILE\.embedeer\models\Xenova-all-MiniLM-L6-v2
```

- Advanced: see `src/model-management.js` for low-level cache helpers.
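Judging by the example paths above, a cached model's folder name is the model id with `/` replaced by `-`. A small helper for locating the directory to delete (an illustration built on that assumption, not part of the package's API):

```javascript
// Build the on-disk path of a cached model, assuming the layout shown
// above: <cacheRoot>/<model id with '/' replaced by '-'>.
function cachedModelPath(modelName, cacheRoot) {
  const folder = modelName.replace(/\//g, '-');
  return `${cacheRoot}/${folder}`;
}

// The directory you would delete to evict this model from the cache:
const target = cachedModelPath('Xenova/all-MiniLM-L6-v2', '/home/me/.embedeer/models');
// → '/home/me/.embedeer/models/Xenova-all-MiniLM-L6-v2'
```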
Explainer — deterministic LLM interface
This feature was deprecated in v1.3.0 and moved to the npm package @jsilvanus/chattydeer.
Embed texts (CPU — default)

```js
import { Embedder } from '@jsilvanus/embedeer';

const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  batchSize: 32,         // texts per worker task (default: 32)
  concurrency: 2,        // parallel workers (default: 2)
  mode: 'process',       // 'process' | 'thread' (default: 'process')
  pooling: 'mean',       // 'mean' | 'cls' | 'none' (default: 'mean')
  normalize: true,       // L2-normalise vectors (default: true)
  token: 'hf_...',       // HF API token (optional; also reads HF_TOKEN env)
  dtype: 'q8',           // quantization dtype (optional)
  cacheDir: '/my/cache', // override model cache (default: ~/.embedeer/models)
});

const vectors = await embedder.embed(['Hello world', 'Foo bar baz']);
// → number[][] (one 384-dim vector per text for all-MiniLM-L6-v2)

await embedder.destroy(); // shut down worker processes
```

TypeScript example
The package includes TypeScript declarations, so imports are typed automatically.

```ts
import { Embedder } from '@jsilvanus/embedeer';

async function main() {
  const embedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', { batchSize: 32, concurrency: 2 });
  const vectors = await embedder.embed(['Hello world', 'Foo bar baz']);
  // vectors: number[][]
  await embedder.destroy();
}

main().catch(console.error);
```

Programmatic profile generation (optional)
You can generate and save a per-user performance profile, which Embedder.create() will automatically apply. This is useful for picking the best batchSize / concurrency for your machine without manual tuning.
```js
import { Embedder } from '@jsilvanus/embedeer';

// Quick profile generation (writes ~/.embedeer/perf-profile.json)
await Embedder.generateAndSaveProfile({ mode: 'quick', device: 'cpu', sampleSize: 100 });

// Subsequent calls to Embedder.create() will auto-apply the saved profile by default.
```

Embed texts with GPU
```js
import { Embedder } from '@jsilvanus/embedeer';

// Auto-detect GPU (falls back to CPU if no provider is installed)
const autoEmbedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  device: 'auto',
});

// Require GPU (throws if no provider is available)
const gpuEmbedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  device: 'gpu',
});

// Explicitly select an execution provider
const cudaEmbedder = await Embedder.create('Xenova/all-MiniLM-L6-v2', {
  provider: 'cuda', // 'cuda' | 'dml'
});
```

Pull (pre-cache) a model
Like ollama pull — downloads the model once so workers start instantly:
```js
import { loadModel } from '@jsilvanus/embedeer';

const { modelName, cacheDir } = await loadModel('Xenova/all-MiniLM-L6-v2', {
  token: 'hf_...', // optional
  dtype: 'q8',     // optional
});
```

CLI
```
npx @jsilvanus/embedeer [options]
```

Model management (pull / cache a model):

```
npx @jsilvanus/embedeer --model <name>
```

Embed texts (batch):

```
npx @jsilvanus/embedeer --model <name> --data "text1" "text2" ...
npx @jsilvanus/embedeer --model <name> --data '["text1","text2"]'
npx @jsilvanus/embedeer --model <name> --file texts.txt
echo '["t1","t2"]' | npx @jsilvanus/embedeer --model <name>
printf 'a\0b\0c' | npx @jsilvanus/embedeer --model <name> --delimiter '\0'
```

Interactive / streaming line-reader:

```
npx @jsilvanus/embedeer --model <name> --interactive --dump out.jsonl
cat big.txt | npx @jsilvanus/embedeer --model <name> -i --output csv --dump out.csv
```

Options:

```
-m, --model <name>      Hugging Face model (default: Xenova/all-MiniLM-L6-v2)
-d, --data <text...>    Text(s) or JSON array to embed
    --file <path>       Input file: JSON array or delimited texts
-D, --delimiter <str>   Record separator for stdin/file (default: \n)
                        Escape sequences supported: \0 \n \t \r
-i, --interactive       Interactive line-reader (see below)
    --dump <path>       Write output to file instead of stdout
    --output <format>   Output: json|jsonl|csv|txt|sql (default: json)
    --with-text         Include source text alongside each embedding
-b, --batch-size <n>    Texts per worker batch (default: 32)
-c, --concurrency <n>   Parallel workers (default: 2)
    --mode <mode>       Worker mode: process|thread (default: process)
-p, --pooling <mode>    mean|cls|none (default: mean)
    --no-normalize      Disable L2 normalisation
    --dtype <type>      Quantization: fp32|fp16|q8|q4|q4f16|auto
    --token <tok>       Hugging Face API token (or set HF_TOKEN env)
    --cache-dir <path>  Model cache directory (default: ~/.embedeer/models)
    --device <mode>     Compute device: auto|cpu|gpu (default: cpu)
    --provider <name>   Execution provider override: cpu|cuda|dml
-h, --help              Show this help
```

Input Sources
Texts can be provided in any of these ways (checked in order):
| Source | How |
|--------|-----|
| Inline args | --data "text1" "text2" "text3" |
| Inline JSON | --data '["text1","text2"]' |
| File | --file texts.txt (JSON array or one record per line) |
| Stdin | Pipe or redirect — auto-detected; TTY is skipped |
| Interactive | --interactive / -i — line-reader, embeds as you type |
Stdin auto-detection: when stdin is not a TTY (i.e. data is piped or redirected), embedeer reads it before deciding what to do. JSON arrays are accepted directly; otherwise records are split on the delimiter.
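The detection rule above (a JSON array is accepted as-is, anything else is split on the delimiter) can be sketched as a tiny parser. This is an illustration of the documented behaviour, not the package's internal code:

```javascript
// Parse raw stdin/file input into records, mirroring the documented
// rule: a JSON array is taken directly; anything else is split on the
// delimiter, with empty records dropped.
function parseInput(raw, delimiter = '\n') {
  const trimmed = raw.trim();
  if (trimmed.startsWith('[')) {
    try {
      const parsed = JSON.parse(trimmed);
      if (Array.isArray(parsed)) return parsed.map(String);
    } catch {
      // Not valid JSON after all; fall through to delimiter splitting.
    }
  }
  return raw.split(delimiter).filter((rec) => rec.length > 0);
}
```

For example, `parseInput('["a","b"]')` and `parseInput('a\nb\n')` both yield two records.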
Interactive Line-Reader Mode (-i / --interactive)
The interactive mode opens a line-by-line reader that starts embedding as records arrive — ideal for pasting large datasets into a terminal or streaming data from another process.
```
# Open an interactive session (paste lines, Ctrl+D when done)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --interactive --dump embeddings.jsonl

# Stream a large file through interactive mode with CSV output
cat big.txt | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 \
  --interactive --output csv --dump embeddings.csv

# Interactive with GPU, custom batch size, txt output
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 \
  --interactive --device auto --batch-size 16 --output txt --dump vecs.txt
```

How it works:
| Event | What happens |
|-------|-------------|
| Type a line, press Enter | Record is buffered |
| Buffer reaches --batch-size | Auto-flush: embed + append to output |
| Type an empty line | Manual flush: embed whatever is buffered |
| Ctrl+D (EOF) | Flush remaining records and exit |
| Ctrl+C | Flush remaining records and exit |
Behaviour notes:
- Progress messages (`Batch N: M record(s) → file`) always go to stderr — they never pollute piped output.
- When stdin is a TTY, a `>` prompt is shown on stderr.
- Output defaults to stdout if `--dump` is omitted; a tip is printed when running in TTY mode.
- `--output json` and `--output sql` are automatically promoted to `jsonl`, since those formats produce complete documents that cannot be appended to incrementally.
- `--output csv` writes the dimension header (`text,dim_0,dim_1,...`) on the first batch only; subsequent batches append data rows.
- Each interactive session clears the `--dump` file on start, so you always get a fresh output file.
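The buffering rules in the table and notes above amount to a small accumulator: lines are buffered, and a full buffer, an empty line, or EOF triggers a flush. A hypothetical illustration of the documented flow, not the actual implementation:

```javascript
// Buffer interactive lines and flush them in batches, following the
// documented rules: auto-flush at batchSize, manual flush on an empty
// line, final flush on EOF / Ctrl+C.
function createLineBuffer(batchSize, onFlush) {
  let buffer = [];
  const flush = () => {
    if (buffer.length > 0) {
      onFlush(buffer);   // embed + append to output in the real tool
      buffer = [];
    }
  };
  return {
    push(line) {
      if (line === '') return flush();          // empty line → manual flush
      buffer.push(line);
      if (buffer.length >= batchSize) flush();  // auto-flush at batchSize
    },
    end: flush,                                 // EOF → flush the remainder
  };
}
```

Feeding the lines `a`, `b`, `c`, an empty line, then `d` with `batchSize` 2 would produce the batches `[a, b]`, `[c]`, and `[d]` on end.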
Configurable delimiter (-D / --delimiter)
By default records in stdin and files are split on newline (\n). Use --delimiter to change it:
```
# Newline-delimited (default)
printf 'Hello\nWorld\n' | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2

# Null-byte delimited — safe with filenames/texts that contain newlines
printf 'Hello\0World\0' | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --delimiter '\0'

# Tab-delimited
printf 'Hello\tWorld' | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --delimiter '\t'

# Custom multi-character delimiter
printf 'Hello|||World|||Foo' | npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --delimiter '|||'

# File with null-byte delimiter
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --file records.bin --delimiter '\0'

# Integrate with find -print0 (handles filenames with spaces / newlines)
find ./docs -name '*.txt' -print0 | \
  xargs -0 cat | \
  npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --delimiter '\0'
```

Supported escape sequences in --delimiter:
| Sequence | Character |
|----------|-----------|
| \0 | Null byte (U+0000) |
| \n | Newline (U+000A) |
| \t | Tab (U+0009) |
| \r | Carriage return (U+000D) |
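The escape table above implies a small translation step from the CLI string to the actual separator character. A sketch of that mapping (hypothetical, not the tool's own code; only whole-string escapes are handled here):

```javascript
// Translate the escape sequences supported by --delimiter into their
// literal characters; anything else is used verbatim (e.g. '|||').
function resolveDelimiter(raw) {
  const escapes = { '\\0': '\0', '\\n': '\n', '\\t': '\t', '\\r': '\r' };
  return escapes[raw] ?? raw;
}
```

So the two-character shell argument `\0` becomes a real null byte, while a custom string like `|||` passes through unchanged.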
Output Formats
| Format | Description |
|--------|-------------|
| json (default) | JSON array of float arrays: [[0.1,0.2,...],[...]] |
| json --with-text | JSON array of objects: [{"text":"...","embedding":[...]}] |
| jsonl | Newline-delimited JSON, one object per line: {"text":"...","embedding":[...]} |
| csv | CSV with header: text,dim_0,dim_1,...,dim_N |
| txt | Space-separated floats, one vector per line |
| txt --with-text | Tab-separated: <original text>\t<float float ...> |
| sql | INSERT INTO embeddings (text, vector) VALUES ...; |
Use --dump <path> to write the output to a file instead of stdout. Progress messages always go to stderr so they never interfere with piped output.
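To make the jsonl and csv shapes in the table concrete, here is a sketch of how those formats could be produced (illustrative only; the real tool also handles CSV quoting and incremental output):

```javascript
// Render embeddings in the documented jsonl shape:
// one {"text":...,"embedding":[...]} object per line.
function toJsonl(texts, vectors) {
  return texts
    .map((text, i) => JSON.stringify({ text, embedding: vectors[i] }))
    .join('\n');
}

// Render embeddings in the documented csv shape:
// header text,dim_0,...,dim_N followed by one row per input text.
function toCsv(texts, vectors) {
  const dims = vectors[0].length;
  const header = ['text', ...Array.from({ length: dims }, (_, d) => `dim_${d}`)];
  const rows = texts.map((text, i) => [text, ...vectors[i]].join(','));
  return [header.join(','), ...rows].join('\n');
}
```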
Piping examples
MODEL=Xenova/all-MiniLM-L6-v2
# --- json (default) ---
# Embed and pretty-print with jq
echo '["Hello","World"]' | npx @jsilvanus/embedeer --model $MODEL | jq '.[0] | length'
# --- jsonl ---
# One object per line — pipe to jq, grep, awk, etc.
npx @jsilvanus/embedeer --model $MODEL --data "foo" "bar" --output jsonl
# Filter by similarity: extract embedding for downstream processing
npx @jsilvanus/embedeer --model $MODEL --data "query text" --output jsonl \
| jq -c '.embedding'
# Stream a large file and store as JSONL
npx @jsilvanus/embedeer --model $MODEL --file big.txt --output jsonl --dump out.jsonl
# --- json --with-text ---
# Keep the source text next to each vector (useful for building a search index)
npx @jsilvanus/embedeer --model $MODEL --output json --with-text \
--data "cat" "dog" "fish" \
| jq '.[] | {text, dims: (.embedding | length)}'
# --- csv ---
# Embed then open in Python/pandas
npx @jsilvanus/embedeer --model $MODEL --file texts.txt --output csv --dump vectors.csv
python3 -c "import pandas as pd; df = pd.read_csv('vectors.csv'); print(df.shape)"
# --- txt ---
# Raw floats — useful for awk/paste/numpy text loading
npx @jsilvanus/embedeer --model $MODEL --data "Hello" "World" --output txt \
| awk '{print NF, "dimensions"}'
# txt --with-text: original text + tab + floats, easy to parse
npx @jsilvanus/embedeer --model $MODEL --file texts.txt --output txt --with-text \
| while IFS=$'\t' read -r text vec; do echo "TEXT: $text"; done
# --- sql ---
# Generate INSERT statements for a vector DB or SQLite
npx @jsilvanus/embedeer --model $MODEL --file texts.txt --output sql --dump inserts.sql
sqlite3 mydb.sqlite < inserts.sql
# --- Chaining with other tools ---
# Embed stdin from another command
cat docs/*.txt | npx @jsilvanus/embedeer --model $MODEL --output jsonl > embeddings.jsonl
# Null-byte input from find (handles any filename or text with newlines)
find ./corpus -name '*.txt' -print0 \
| xargs -0 cat \
| npx @jsilvanus/embedeer --model $MODEL --delimiter '\0' --output jsonlCLI Examples
```
# Pull a model (like ollama pull)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2

# Embed a few strings, output JSON (CPU)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --data "Hello" "World"

# Auto-detect GPU, fall back to CPU if unavailable
# (uses CUDA on Linux, DirectML on Windows, CPU everywhere else)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device auto --data "Hello"

# Require GPU (throws with install instructions if no provider found)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device gpu --data "Hello GPU"

# Explicit CUDA (Linux x64 — requires CUDA 12 system libraries)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --provider cuda --data "Hello CUDA"

# Explicit DirectML (Windows x64)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --provider dml --data "Hello DML"

# Embed from a file, dump SQL to disk
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 \
  --file texts.txt --output sql --dump out.sql

# Use a quantized model, in-process threads, private model with token
npx @jsilvanus/embedeer --model my-org/private-model \
  --token hf_xxx --dtype q8 --mode thread \
  --data "embed me"
```

Using GPU
No additional packages are needed — onnxruntime-node (installed with @jsilvanus/embedeer) already
bundles the CUDA provider on Linux x64 and DirectML on Windows x64.
Linux x64 — NVIDIA CUDA:

```
# One-time: install CUDA 12 system libraries (Ubuntu/Debian)
sudo apt install cuda-toolkit-12-6 libcudnn9-cuda-12

# Auto-detect: uses CUDA here, CPU fallback on any other machine
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device auto --data "Hello"

# Hard-require CUDA (throws with a diagnostic error if unavailable)
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device gpu --data "Hello GPU"

# Explicit CUDA provider
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --provider cuda --data "Hello CUDA"
```

Windows x64 — DirectML (any GPU: NVIDIA / AMD / Intel):

```
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device auto --data "Hello"
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --device gpu --data "Hello GPU"
npx @jsilvanus/embedeer --model Xenova/all-MiniLM-L6-v2 --provider dml --data "Hello DML"
```

GPU Acceleration
GPU support is built into onnxruntime-node (a dependency of @huggingface/transformers):
| Platform | Provider | Requirement |
|----------|----------|-------------|
| Linux x64 | CUDA | NVIDIA GPU + driver ≥ 525, CUDA 12 toolkit, cuDNN 9 |
| Windows x64 | DirectML | Any DirectX 12 GPU (most GPUs since 2016), Windows 10+ |
Testing

Run the project's tests locally:

```
# install deps
pnpm install

# run tests
pnpm test

# run tests with coverage
pnpm run coverage
```

CI is enabled via GitHub Actions (.github/workflows/ci.yml), which runs tests and collects coverage on pushes and pull requests.

Provider selection logic

| device | provider | Behavior |
|--------|----------|----------|
| cpu (default) | — | Always CPU |
| auto | — | Try GPU providers for the platform in order; silent CPU fallback |
| gpu | — | Try GPU providers; throw if none available |
| any | cuda | Load the CUDA provider; throw if not available or not supported |
| any | dml | Load the DirectML provider; throw if not available or not supported |
| any | cpu | Always CPU |

On Linux x64 the GPU order is cuda.
On Windows x64 the GPU order is cuda → dml.
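The selection rules above can be read as a small resolution function. The diagram later in this README names a resolveProvider helper inside the workers; the sketch below is a hypothetical rendering of the documented rules, not the actual implementation:

```javascript
// Resolve which execution providers to try, following the documented
// rules: an explicit provider wins; device 'cpu' (the default) stays on
// CPU; 'auto' tries the platform's GPU order then falls back to CPU;
// 'gpu' throws when no GPU provider exists for the platform.
function resolveProviders(device = 'cpu', provider, platform = 'linux-x64') {
  const gpuOrder = { 'linux-x64': ['cuda'], 'win32-x64': ['cuda', 'dml'] };
  if (provider) return [provider];       // explicit override
  if (device === 'cpu') return ['cpu'];  // default: always CPU
  const gpus = gpuOrder[platform] ?? [];
  if (device === 'gpu') {
    if (gpus.length === 0) throw new Error('no GPU provider for this platform');
    return gpus;
  }
  return [...gpus, 'cpu'];               // 'auto': GPUs first, silent CPU fallback
}
```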
Performance Optimizations
Embedeer exposes runtime knobs and helper scripts to tune throughput for your host.
- Pre-load models: run
Embedder.loadModel(model, { dtype, cacheDir })or use thebenchscripts so workers start instantly without re-downloading models. - Reuse
Embedderinstances: create a singleEmbedderand callembed()repeatedly instead of creating and destroying instances per batch. - Batch size vs concurrency:
- CPU: moderate batch sizes (16–64) with multiple workers (concurrency ≥ 2) usually give best throughput.
- GPU: larger batches (64–256) with low concurrency (1–2) are typically fastest.
- BLAS threading: avoid oversubscription by setting
OMP_NUM_THREADSandMKL_NUM_THREADStoMath.floor(cpu_cores / concurrency)before starting workers. - Device/provider: use
cudaon Linux anddml(DirectML) on Windows when available;device: 'auto'will try providers and fall back to CPU. - Automatic tuning: use
bench/grid-search.jsto sweepbatchSize,concurrency, anddtypefor your host and save results. You can generate and persist a per-user profile and apply it automatically via theEmbedderAPIs.
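The thread-budget arithmetic from the BLAS note, written out as a one-liner (the function name is mine, purely illustrative):

```javascript
// Per-worker BLAS thread budget: divide CPU cores across workers so
// OMP/MKL threads do not oversubscribe the machine (minimum 1).
function blasThreadsPerWorker(cpuCores, concurrency) {
  return Math.max(1, Math.floor(cpuCores / concurrency));
}

// e.g. on an 8-core host with concurrency 2, export before launching:
//   OMP_NUM_THREADS=4 MKL_NUM_THREADS=4
```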
Examples:
```
# CPU quick grid
node bench/grid-search.js --device cpu --sample-size 200 --out bench/grid-results-cpu.json

# GPU quick grid
node bench/grid-search.js --device gpu --sample-size 100 --out bench/grid-results-gpu.json
```

Programmatic profile generation (writes ~/.embedeer/perf-profile.json):
```js
import { Embedder } from '@jsilvanus/embedeer';

await Embedder.generateAndSaveProfile({ mode: 'quick', device: 'cpu', sampleSize: 100 });
// Embedder.create() will auto-apply a saved per-user profile by default
```

How it works
```
embed(texts)
  │
  ├─ split into batches of batchSize
  │
  └─ Promise.all(batches) ──► WorkerPool
        │
        ├─ [process mode] ChildProcessWorker 0
        │     resolveProvider(device, provider)
        │     → pipeline('feature-extraction', model, { device: 'cuda' })
        │     → embed batch A
        │
        └─ [process mode] ChildProcessWorker 1
              resolveProvider(device, provider)
              → pipeline(...) → embed batch B
```

Workers load the model once at startup and reuse it for all batches.
Provider activation happens per-worker before the pipeline is created.
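The fan-out at the top of the diagram (split into batchSize chunks, embed the chunks in parallel, flatten the results back into one list) can be sketched in a few lines. A simplified illustration, not the real WorkerPool:

```javascript
// Split texts into batches of batchSize.
function toBatches(texts, batchSize) {
  const batches = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    batches.push(texts.slice(i, i + batchSize));
  }
  return batches;
}

// Embed all batches in parallel (embedBatch stands in for a pool
// worker) and flatten the per-batch results into one vector list.
async function embedAll(texts, batchSize, embedBatch) {
  const results = await Promise.all(toBatches(texts, batchSize).map(embedBatch));
  return results.flat();
}
```

The real pool bounds parallelism by `concurrency` rather than launching every batch at once; this sketch only shows the split / gather shape.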
E2E testing

Note: HF authentication has not been tested end-to-end.

Collaboration

Suggestions and PRs are welcome, especially help with performance. Opening issues is also appreciated.
License
MIT
