@seifkhaled/auto-embed

v1.0.0

auto-embed

Zero-config CLI that ingests files into vector databases for RAG projects. Parse, chunk, embed, upsert — one command.

auto-embed is the file-side of a RAG pipeline as a single command. It parses your docs (Markdown, PDF, HTML, DOCX, CSV, JSON, code), chunks them with a token-aware splitter, calls an embedding provider, and upserts the vectors into your vector DB. Re-runs are idempotent: unchanged files cost zero API calls, edited files re-embed only what changed.

  • One command, full pipeline. Parse → chunk → embed → upsert.
  • Zero-key prototype. --local runs end-to-end with fastembed + Chroma — no API keys.
  • Idempotent. A .auto-embed/ lockfile makes re-runs cheap; editing a paragraph re-embeds only the affected chunks.
  • 5 embedding providers, 4 vector DBs. OpenAI · Google · Voyage · Cohere · fastembed. pgvector · Pinecone · Qdrant · Chroma.
  • Provider-agnostic. Same flags, same UX, same lockfile shape.
  • CI-friendly. Deterministic chunk IDs, exit codes that mean things, no interactive prompts in non-TTY.

auto-embed is a sibling of auto-seed and shares its opinionated, one-command philosophy.


Install

Published on npm as @seifkhaled/auto-embed. The CLI binary is auto-embed.

# One-off (no install):
npx @seifkhaled/auto-embed embed ./README.md --local

# Or install globally and use the short name:
npm i -g @seifkhaled/auto-embed
auto-embed embed ./README.md --local

Quick start

# Zero-key prototype: fastembed + Chroma, no setup needed
npx @seifkhaled/auto-embed embed ./README.md --local

# Or: configure a real provider + DB once
npx @seifkhaled/auto-embed init
npx @seifkhaled/auto-embed embed ./docs/handbook.pdf
npx @seifkhaled/auto-embed embed "./docs/**/*.md" --collection handbook

The init flow asks you to pick an embedding provider, paste an API key, pick a vector DB, and paste its connection details. The config is saved to ~/.auto-embed/config.json with file mode 0600, and API keys are masked whenever it is displayed.


Supported inputs and outputs

| Input file | Parser | Default splitter |
|---|---|---|
| .md / .mdx | unified + remark | header-aware (H1/H2/H3) |
| .pdf | unpdf | per-page + recursive within |
| .html / .htm | cheerio | strips nav/aside/footer/script |
| .docx | mammoth → markdown | header-aware |
| .csv | built-in | one chunk per row, header in meta |
| .json / .jsonl | built-in | one chunk per element / line |
| .txt, code (.ts/.js/.py/.go/.rs/.java) | built-in | recursive (language-tuned for code) |

| Embedding provider | Default model | Dim |
|---|---|---|
| OpenAI | text-embedding-3-small | 1536 |
| Google | gemini-embedding-001 | 3072 |
| Voyage | voyage-3 | 1024 |
| Cohere | embed-english-v3.0 | 1024 |
| Local (fastembed) | BAAI/bge-small-en-v1.5 | 384 |

| Vector DB | Connection |
|---|---|
| Chroma | local path (./chroma) or HTTP URL |
| pgvector | DATABASE_URL |
| Pinecone | PINECONE_API_KEY + index name |
| Qdrant | URL + optional API key |

See docs/providers/ and docs/vector-dbs/ for per-target setup notes.


Common flags

| Flag | Default | Description |
|---|---|---|
| --collection <name> | derived from filename | Target collection / index / table. |
| --provider <name> | from config | openai · google · voyage · cohere · local. |
| --model <id> | provider default | Override embedding model. |
| --db <name> | from config | pgvector · pinecone · qdrant · chroma. |
| --local | off | Shortcut: --provider local --db chroma with auto-spawned local Chroma. |
| --chunk-size <n> | 800 (tokens) | Target chunk size. |
| --overlap <n> | 100 (tokens) | Token overlap between chunks. |
| --splitter <type> | from extension | recursive · markdown · pdf · html · code · jsonl · csv. |
| --metadata <k=v,…> | none | Static metadata stamped onto every chunk. |
| --plan | off | One LLM call to tune the plan (cheap; reuse via --plan plan.json). |
| --plan <path> | n/a | Reuse a saved plan; skips the LLM call entirely. |
| --plan-only | off | Write the plan and stop; no embedding. |
| --out <path> | plan.json | Where to write the plan when --plan-only is set. |
| --batch-size <n> | provider default | Batch size for embedding API calls. |
| --concurrency <n> | 4 | Parallel embedding requests. |
| --force | off | Ignore the lockfile; re-embed and replace. |
| --dry-run | off | Print the plan + chunk table + USD cost estimate; embed nothing. |
| --verbose | off | Debug logging. |

Run auto-embed embed --help for the complete list.
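
To make --chunk-size and --overlap concrete, here is a minimal Python sketch of a fixed-window chunker with overlap. It is not the package's splitter (which is recursive and counts tokens with js-tiktoken): whitespace-separated words stand in for tokens, and the function name window_chunks is hypothetical.

```python
from typing import List


def window_chunks(tokens: List[str], chunk_size: int = 800, overlap: int = 100) -> List[List[str]]:
    """Split tokens into windows of at most chunk_size tokens, each sharing
    `overlap` tokens with the window before it."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the input
    return chunks


# Stand-in tokenizer: plain words instead of real model tokens.
words = [f"w{i}" for i in range(10)]
for chunk in window_chunks(words, chunk_size=4, overlap=1):
    print(chunk)
```

With chunk_size=4 and overlap=1, each window starts 3 tokens after the previous one, so the last token of one chunk is repeated as the first token of the next.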


Examples

# Smallest possible flow: embed the README locally
npx @seifkhaled/auto-embed embed ./README.md --local

# Real flow: PDFs into Pinecone
PINECONE_API_KEY=… npx @seifkhaled/auto-embed embed ./docs/handbook.pdf \
  --provider openai --db pinecone --collection handbook

# Preview a plan + cost without making API calls
npx @seifkhaled/auto-embed embed ./docs/handbook.pdf --dry-run

# Tune the plan with one LLM call, then run offline forever
ANTHROPIC_API_KEY=sk-ant-… npx @seifkhaled/auto-embed embed ./docs/handbook.pdf --plan --plan-only
npx @seifkhaled/auto-embed embed ./docs/handbook.pdf --plan plan.json --provider openai --db chroma

# Glob ingestion
npx @seifkhaled/auto-embed embed "./docs/**/*.md" --collection handbook --concurrency 8

# CI: deterministic, non-interactive
DATABASE_URL=… npx @seifkhaled/auto-embed embed ./docs/handbook.md \
  --provider openai --db pgvector --collection handbook --yes

More patterns in examples/.


Config & secrets

~/.auto-embed/config.json (file mode 0600):

{
  "defaults": { "provider": "openai", "db": "chroma" },
  "models":   { "openai": "text-embedding-3-small" },
  "apiKeys":  { "openai": "sk-…", "pinecone": "pcsk-…" },
  "dbs":      { "pgvector": { "url": "postgres://…" }, "chroma": { "url": "./chroma" } }
}

Environment variables take precedence over the file (recommended in CI):

  • Embedding providers: OPENAI_API_KEY · GOOGLE_API_KEY · VOYAGE_API_KEY · COHERE_API_KEY
  • Vector DBs: DATABASE_URL · PINECONE_API_KEY · QDRANT_URL · QDRANT_API_KEY · CHROMA_URL
  • Selection: AUTO_EMBED_PROVIDER · AUTO_EMBED_DB · AUTO_EMBED_MODEL
  • Optional LLM planner: ANTHROPIC_API_KEY (alternative: OPENAI_API_KEY or GOOGLE_API_KEY), AUTO_EMBED_PLAN_PROVIDER
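
The precedence rule above can be sketched in a few lines of Python. This is an illustration only: the function name resolve_setting is hypothetical, the mapping from settings to variables is inferred from the names listed above, and treating an empty variable as unset is an assumption.

```python
import json
import os
from typing import Mapping, Optional

# Assumed mapping from a selection setting to its environment variable,
# based on the variable names listed in this README.
ENV_FOR_SETTING = {
    "provider": "AUTO_EMBED_PROVIDER",
    "db": "AUTO_EMBED_DB",
}


def resolve_setting(name: str, config: Mapping, env: Mapping[str, str] = os.environ) -> Optional[str]:
    """Return the environment value if set and non-empty, else config["defaults"][name]."""
    from_env = env.get(ENV_FOR_SETTING[name])
    if from_env:
        return from_env
    return config.get("defaults", {}).get(name)


config = json.loads('{"defaults": {"provider": "openai", "db": "chroma"}}')
print(resolve_setting("provider", config, env={}))                                 # openai
print(resolve_setting("provider", config, env={"AUTO_EMBED_PROVIDER": "voyage"}))  # voyage
```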

Useful one-liners:

auto-embed providers                              # who's configured, who's missing keys
auto-embed config list                            # all stored values, API keys masked
auto-embed config get defaults.provider
auto-embed config set defaults.db pinecone
auto-embed config path                            # absolute path to the config file

auto-embed never logs a full API key and never writes a key into a generated file or lockfile.


How it works

┌──────────┐   ┌──────────┐   ┌──────────────┐   ┌──────────┐   ┌──────────┐
│ 1. PARSE │   │ 2. CHUNK │   │ 3. LOCKFILE  │   │ 4. EMBED │   │ 5. UPSERT│
│ file →   │──▶│ token-   │──▶│ diff vs.     │──▶│ provider │──▶│ vector DB│
│ sections │   │ aware    │   │ prior chunks │   │ batches  │   │ adapter  │
└──────────┘   └──────────┘   └──────┬───────┘   └──────────┘   └──────────┘
                                     │
                                fast-path:
                              "up to date" if
                             nothing changed,
                              0 API calls.
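
Stage 4 of the diagram (EMBED, provider batches) interacts with --batch-size and --concurrency. The Python sketch below shows one common way to combine the two: split the texts into batches and keep at most `concurrency` requests in flight. It is an assumption about the general technique, not the package's code; embed_all, fake_embed, and the batch size of 64 are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable, List, Sequence

Vector = List[float]


async def embed_all(
    texts: Sequence[str],
    embed_batch: Callable[[Sequence[str]], Awaitable[List[Vector]]],
    batch_size: int = 64,   # hypothetical; the real default depends on the provider
    concurrency: int = 4,   # documented --concurrency default
) -> List[Vector]:
    """Embed texts in batches with at most `concurrency` requests in flight."""
    batches = [texts[i:i + batch_size] for i in range(0, len(texts), batch_size)]
    gate = asyncio.Semaphore(concurrency)

    async def run(batch: Sequence[str]) -> List[Vector]:
        async with gate:
            return await embed_batch(batch)

    # gather() returns results in the order the batches were submitted.
    results = await asyncio.gather(*(run(b) for b in batches))
    return [vec for batch_vecs in results for vec in batch_vecs]


async def fake_embed(batch: Sequence[str]) -> List[Vector]:
    # Stand-in for a provider call: one 1-dimensional "vector" per text.
    await asyncio.sleep(0)
    return [[float(len(t))] for t in batch]


vectors = asyncio.run(embed_all(["a", "bb", "ccc"], fake_embed, batch_size=2))
print(vectors)  # [[1.0], [2.0], [3.0]]
```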

The chunker uses js-tiktoken for token counts and a port of LangChain's recursive splitter (no LangChain dependency). Each chunk gets a deterministic ID derived from sha256(sourcePath + index + chunkerVersion + text) — same input file + same plan + same model → byte-identical chunk IDs across machines.
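
As a rough illustration of the deterministic ID, here is a Python sketch that hashes the four inputs named above. The README does not say how the fields are concatenated or encoded, so the NUL separator and UTF-8 encoding are assumptions, and the hex digests produced here will not necessarily match the IDs the tool writes.

```python
import hashlib


def chunk_id(source_path: str, index: int, chunker_version: str, text: str) -> str:
    """SHA-256 over sourcePath, index, chunkerVersion and text.

    Assumption: fields are joined with a NUL byte so that ("a", 12) and
    ("a1", 2) cannot produce the same payload, then UTF-8 encoded.
    """
    payload = "\x00".join([source_path, str(index), chunker_version, text])
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


a = chunk_id("docs/handbook.md", 0, "1", "Hello world")
b = chunk_id("docs/handbook.md", 0, "1", "Hello world")
c = chunk_id("docs/handbook.md", 0, "1", "Hello, world")
print(a == b)  # True: same inputs, same ID
print(a == c)  # False: any text change yields a new ID
print(len(a))  # 64 hex characters
```

Because the source path is part of the hash, IDs can only match across machines if the path is expressed the same way on each of them (for example, relative to the project root).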

A per-file lockfile lives at ./.auto-embed/<hash>.lock.json. Commit .auto-embed/ to share idempotency state with CI.

The EmbedPlan

The plan is a small JSON document describing how to chunk one file: splitter type, chunk size, overlap, target collection, embedding model. By default it's derived heuristically from the file extension. --plan opts into a single LLM call that inspects the first ~4KB of the file and tunes the plan. --plan <file> reuses a saved plan and never calls the LLM — free, offline, reproducible.
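
A heuristic plan of this kind might look like the Python sketch below. The field names, the EmbedPlan class, and the extension-to-splitter mapping are all hypothetical; the mapping is inferred from the supported-inputs table and the --splitter values, and the on-disk plan.json schema may differ.

```python
from dataclasses import asdict, dataclass
from pathlib import Path

# Extension -> splitter, inferred from the tables in this README (assumption).
SPLITTER_BY_EXT = {
    ".md": "markdown", ".mdx": "markdown", ".docx": "markdown",
    ".pdf": "pdf",
    ".html": "html", ".htm": "html",
    ".csv": "csv",
    ".json": "jsonl", ".jsonl": "jsonl",
    ".ts": "code", ".js": "code", ".py": "code",
    ".go": "code", ".rs": "code", ".java": "code",
}


@dataclass
class EmbedPlan:  # hypothetical field names
    splitter: str
    chunk_size: int
    overlap: int
    collection: str
    model: str


def heuristic_plan(path: str, model: str = "text-embedding-3-small") -> EmbedPlan:
    p = Path(path)
    return EmbedPlan(
        splitter=SPLITTER_BY_EXT.get(p.suffix.lower(), "recursive"),
        chunk_size=800,     # documented default, in tokens
        overlap=100,        # documented default, in tokens
        collection=p.stem,  # "derived from filename"; the exact rule is assumed
        model=model,
    )


print(asdict(heuristic_plan("./docs/handbook.pdf")))
```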

Re-run semantics

  • File hash unchanged, plan unchanged, model unchanged → skip, up to date, zero API calls.
  • File content changed → diff chunk IDs; embed only the new ones; delete the removed ones.
  • Embedding model or dimensions changed → refuse to write (would corrupt the collection with mixed-dim vectors). Use --force or pick a fresh --collection.
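
The three rules above amount to a set difference over chunk IDs plus a guard on the model. The Python sketch below is one way to express that; LockState and plan_rerun are hypothetical names, the real lockfile schema is not shown in this README, and plan changes are left out for brevity.

```python
from dataclasses import dataclass
from typing import FrozenSet, Set, Tuple


@dataclass(frozen=True)
class LockState:  # hypothetical shape, not the real lockfile schema
    file_hash: str
    model: str
    dim: int
    chunk_ids: FrozenSet[str]


class IntegrityError(Exception):
    """Would map to exit code 5 in the Exit codes table."""


def plan_rerun(prev: LockState, cur: LockState, force: bool = False) -> Tuple[Set[str], Set[str]]:
    """Return (ids_to_embed, ids_to_delete) for one file."""
    if force:
        # --force: ignore the lockfile, re-embed everything, drop stale IDs.
        return set(cur.chunk_ids), set(prev.chunk_ids - cur.chunk_ids)
    if (prev.model, prev.dim) != (cur.model, cur.dim):
        raise IntegrityError("embedding model or dimensions changed")
    if prev.file_hash == cur.file_hash:
        return set(), set()  # up to date, zero API calls
    return set(cur.chunk_ids - prev.chunk_ids), set(prev.chunk_ids - cur.chunk_ids)


prev = LockState("h1", "text-embedding-3-small", 1536, frozenset({"a", "b", "c"}))
cur = LockState("h2", "text-embedding-3-small", 1536, frozenset({"a", "c", "d"}))
to_embed, to_delete = plan_rerun(prev, cur)
print(sorted(to_embed), sorted(to_delete))  # ['d'] ['b']
```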

Exit codes

| Code | Meaning |
|---|---|
| 0 | Success |
| 1 | User / config error (missing key, invalid flag, missing file) |
| 2 | Parser error (broken PDF, malformed JSON, etc.) |
| 3 | Provider API error (auth, rate limit, LLM JSON failure after retry) |
| 4 | Vector DB error (connection failure, schema error) |
| 5 | Integrity error (dimension mismatch, plan-hash mismatch, etc.) |

Errors print a single human-readable line plus an actionable hint. Use --verbose for stack traces.
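
In a CI wrapper, the exit codes can drive retry decisions. The Python sketch below mirrors the documented codes in an enum; the member names and the choice to retry only codes 3 and 4 are assumptions made for illustration, not behavior of the tool.

```python
from enum import IntEnum


class ExitCode(IntEnum):  # values from the table; member names are hypothetical
    SUCCESS = 0
    USER_ERROR = 1
    PARSER_ERROR = 2
    PROVIDER_ERROR = 3
    VECTOR_DB_ERROR = 4
    INTEGRITY_ERROR = 5


# Assumption: only provider and vector DB failures can be transient
# (rate limits, dropped connections), so only those are retried.
RETRYABLE = {ExitCode.PROVIDER_ERROR, ExitCode.VECTOR_DB_ERROR}


def should_retry(returncode: int) -> bool:
    try:
        return ExitCode(returncode) in RETRYABLE
    except ValueError:
        return False  # unknown code: fail fast


for code in (0, 1, 3, 5):
    print(code, ExitCode(code).name, should_retry(code))
```

The return code would typically come from subprocess.run([...]).returncode after invoking the CLI.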


--local notes

--local spawns a Chroma server in the background (node_modules/.bin/chroma run --path ./chroma) the first time it's needed and leaves it running so subsequent CLI invocations reuse it. Stop it with:

pkill -f "chroma run"

The fastembed model (BAAI/bge-small-en-v1.5, ~30 MB) is downloaded once to a local cache on first use.


License

MIT.