@lojban/semantic-search-mcp
v1.0.16
Local-first MCP server for semantic search using transformers.js and SQLite
Semantic Local MCP
A local-first MCP (Model Context Protocol) server for semantic search over your documents. Index text files (e.g. TSV, CSV, TXT) line-by-line, then search or filter by meaning using embeddings—all on your machine, no API keys required.
Use it in Cursor, Claude Code, or any IDE that supports MCP to search through dictionaries, glossaries, and corpora by semantic similarity.
Use cases
- Lojban (or any) dictionary: Index a TSV where each line is a word/definition. Find entries similar to a phrase or concept, or discover gaps—word combinations or concepts your dictionary doesn't cover yet.
- Glossaries & term bases: "Find entries that mean something like …" without exact keyword match.
- Corpora & line-based data: Any file where each line is a record (TSV, CSV, one-sentence-per-line TXT). Index once, query by meaning.
How it works
- Indexing: On startup, the server indexes content in the background. If SEMANTIC_SEARCH_INDEX_DIRS is set (comma-separated paths), it scans those directories. If it is not set, the server downloads the lojban/sampu_vlaste repository from GitHub and indexes that instead. In both cases, the server looks for .txt, .md, .tsv, and .csv files. Each non-empty line gets a vector embedding (via Hugging Face Transformers.js, model Xenova/all-MiniLM-L6-v2) and is stored in a local SQLite database with @dao-xyz/sqlite3-vec (SQLite + sqlite-vec for Node and browser). Indexing runs asynchronously so the server stays responsive and uses bounded memory.
- Search: You send a natural-language query; the server embeds it and returns the closest lines by cosine similarity.
- Storage: The index is stored in your project's .semantic-search/data/ (or set SEMANTIC_SEARCH_DATA_DIR). No cloud, no API keys.
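The search step above can be illustrated with a minimal in-memory sketch. This is not the server's implementation (the README says vectors live in SQLite with sqlite-vec); it is a brute-force ranking that assumes the embeddings are already available as plain number arrays. The names IndexedLine, Hit, cosineSimilarity, and topMatches are hypothetical.

```typescript
// Hypothetical shape of one indexed line; field names are illustrative only.
interface IndexedLine {
  file: string;
  line: number;
  content: string;
  embedding: number[];
}

// Hypothetical shape of one search hit.
interface Hit {
  file: string;
  line: number;
  content: string;
  score: number;
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error("vectors must have the same length");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Brute-force ranking: score every line, sort by descending score, keep the top `limit`.
function topMatches(query: number[], lines: IndexedLine[], limit = 10): Hit[] {
  return lines
    .map(({ file, line, content, embedding }) => ({
      file,
      line,
      content,
      score: cosineSimilarity(query, embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}
```

A vector index avoids scoring every row, but the ranking criterion is the same: a higher cosine similarity means the line is closer in meaning to the query.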
Requirements
- Node.js 18+ (20+ recommended)
- npm or pnpm
First run will download the embedding model (~80MB) and cache it locally.
Use in Cursor IDE
There is no build step and no need to run npm install yourself. The server runs only via npx tsx (TypeScript is run directly). Add a single command to MCP; on first run, npx will download the package and its dependencies, and the server will download the embedding model (~80MB) when you first index or search.
The package is published as @lojban/semantic-search-mcp. (To run from source before/without publishing, see the From source setup in the Development section.)
Add the MCP server in Cursor:
- Open Settings → Cursor Settings → MCP (or edit ~/.cursor/mcp.json).
- Add:

  ```json
  {
    "mcpServers": {
      "semantic-search": {
        "command": "npx",
        "args": ["-y", "@lojban/semantic-search-mcp"]
      }
    }
  }
  ```

  No cwd needed: the server stores its index in your project directory (.semantic-search/data/), so open your project in Cursor and the index is per-workspace. To use a fixed data directory instead, add "env": { "SEMANTIC_SEARCH_DATA_DIR": "/path/to/data" }. To have the server index specific directories on startup, set "env": { "SEMANTIC_SEARCH_INDEX_DIRS": "./dictionary,./glossary" } (comma-separated paths). If you omit SEMANTIC_SEARCH_INDEX_DIRS, the server will download and index the lojban/sampu_vlaste repo automatically.
- Restart Cursor (or reload the window). Indexing starts automatically in the background: from your configured SEMANTIC_SEARCH_INDEX_DIRS, or from the downloaded sampu_vlaste repo if that variable is not set.
- In chat or Composer, ask the AI to use the tools:
  - Search: "Use semantic-search tool: find combinations of words that can express the concept of …", "Use semantic-search tool: search the index for …" or "Use semantic-search tool: Find entries similar to …"
  - Stats: "use semantic-search mcp. run get_index_stats". Stats include progress and start time (locale-formatted) when indexing is in progress.
The AI will call search and get_index_stats for you.
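If you generate the MCP configuration from a script, the pieces described above (the npx command plus the two optional environment variables) could be assembled as in the following sketch. The helper buildMcpConfig and its option names are hypothetical; only the command, arguments, server key, and environment variable names come from this README.

```typescript
// Hypothetical helper inputs; they map onto the env variables named in the README.
interface ServerOptions {
  dataDir?: string;     // becomes SEMANTIC_SEARCH_DATA_DIR
  indexDirs?: string[]; // becomes SEMANTIC_SEARCH_INDEX_DIRS (joined with commas)
}

interface McpServerEntry {
  command: string;
  args: string[];
  env?: Record<string, string>;
}

interface McpConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Build the object to write to the MCP config file (e.g. ~/.cursor/mcp.json).
function buildMcpConfig(options: ServerOptions = {}): McpConfig {
  const env: Record<string, string> = {};
  if (options.dataDir) {
    env["SEMANTIC_SEARCH_DATA_DIR"] = options.dataDir;
  }
  if (options.indexDirs && options.indexDirs.length > 0) {
    env["SEMANTIC_SEARCH_INDEX_DIRS"] = options.indexDirs.join(",");
  }
  const entry: McpServerEntry = {
    command: "npx",
    args: ["-y", "@lojban/semantic-search-mcp"],
  };
  // Only attach "env" when at least one variable is set, matching the minimal config.
  if (Object.keys(env).length > 0) {
    entry.env = env;
  }
  return { mcpServers: { "semantic-search": entry } };
}
```

Serializing the result of buildMcpConfig({ indexDirs: ["./dictionary", "./glossary"] }) with JSON.stringify gives the same structure as the hand-written configuration, with an added env block.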
Use in other AI IDEs (Claude Code, etc.)
Any environment that supports MCP over stdio can use this server. Run:
- One-liner: run npx -y @lojban/semantic-search-mcp (dependencies are installed on first run). The index is stored in the current working directory's .semantic-search/data/. Set the environment variable SEMANTIC_SEARCH_INDEX_DIRS (comma-separated paths) to index those directories on startup; if unset, the server downloads and indexes lojban/sampu_vlaste from GitHub. Tools: search, get_index_stats.
- From source: Clone the repo, run npm install once, then use "command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp" or "command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"] (no cwd needed with the latter). See MCP_SETUP.md for details.
MCP tools
| Tool | Description |
|------|-------------|
| search | Semantic search: query (string), optional limit (default 10). Returns file path, line number, content, and similarity score. |
| get_index_stats | Returns total number of indexed files and lines. When indexing is running in the background, also returns progress: indexing.started_at (locale-formatted), lines_indexed_so_far, files_indexed_so_far, and in_progress. |
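A client that consumes get_index_stats might model the result roughly as below. This is a sketch under assumptions: the progress field names (started_at, lines_indexed_so_far, files_indexed_so_far, in_progress) come from the table, but their nesting under an indexing object and the total_files and total_lines names are guesses, so check the actual tool output before relying on them.

```typescript
// Progress fields named in the tool table; nesting under `indexing` is an assumption.
interface IndexingProgress {
  started_at: string; // locale-formatted start time
  lines_indexed_so_far: number;
  files_indexed_so_far: number;
  in_progress: boolean;
}

interface IndexStats {
  total_files: number; // hypothetical field name
  total_lines: number; // hypothetical field name
  indexing?: IndexingProgress; // assumed to be present only while indexing runs
}

// Turn a stats object into a one-line, human-readable summary.
function describeStats(stats: IndexStats): string {
  const totals = `${stats.total_files} files, ${stats.total_lines} lines indexed`;
  const progress = stats.indexing;
  if (!progress || !progress.in_progress) {
    return totals;
  }
  return (
    `${totals}; indexing since ${progress.started_at} ` +
    `(${progress.files_indexed_so_far} files, ${progress.lines_indexed_so_far} lines so far)`
  );
}
```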
Indexing on startup
- With your own dirs: Set the environment variable SEMANTIC_SEARCH_INDEX_DIRS to a comma-separated list of directories to index. When the MCP server starts, it begins indexing those directories in the background (async).
- Default (no env set): If SEMANTIC_SEARCH_INDEX_DIRS is not set, the server downloads the lojban/sampu_vlaste repository from GitHub (as a zip), extracts it under .semantic-search/sampu_vlaste/, and indexes that. The download is cached; subsequent starts reuse the cached copy.
The index is cleared and rebuilt each time the server starts. Use absolute paths or paths relative to the server's working directory when setting SEMANTIC_SEARCH_INDEX_DIRS. The server reads and indexes all supported .txt, .md, .tsv, .csv files under each directory recursively. Indexing uses bounded memory and yields to the event loop so the OS stays responsive.
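The input handling described above can be sketched as three small pure functions: splitting the comma-separated variable, filtering by supported extension, and turning a file's text into one record per non-empty line. This illustrates the described behavior and is not the server's code; the function names are hypothetical, and the whitespace trimming, case-insensitive extension match, and 1-based line numbers are assumptions.

```typescript
// Extensions listed in the README.
const SUPPORTED_EXTENSIONS = [".txt", ".md", ".tsv", ".csv"];

// Split the comma-separated value, trimming whitespace and dropping empty entries.
// Trimming is an assumption; the README only says "comma-separated".
function parseIndexDirs(value: string | undefined): string[] {
  if (!value) return [];
  return value
    .split(",")
    .map((p) => p.trim())
    .filter((p) => p.length > 0);
}

// True when the path ends with a supported extension (case-insensitive by assumption).
function isSupportedFile(path: string): boolean {
  const lower = path.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

interface LineRecord {
  line: number; // 1-based line number (assumed convention)
  content: string;
}

// One record per non-empty line; blank and whitespace-only lines are skipped.
function toLineRecords(text: string): LineRecord[] {
  const records: LineRecord[] = [];
  text.split(/\r?\n/).forEach((content, i) => {
    if (content.trim().length > 0) {
      records.push({ line: i + 1, content });
    }
  });
  return records;
}
```

Keeping the original line number, rather than a running count of non-empty lines, lets a search hit point back to the exact line in the source file.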
Example: Lojban dictionary gaps
- Put your dictionary TSV (e.g. jbo-eng.tsv) in a folder (e.g. ./dictionary).
- Set SEMANTIC_SEARCH_INDEX_DIRS=./dictionary in your MCP config (or in the environment). Restart the server; indexing runs in the background.
- In Cursor: "Search for entries similar to 'to cause to become warm' and limit 20."
- Or: "Search for 'emotional state of joy' and show me what we have; then suggest word combinations the dictionary might be missing."
The index is stored in .semantic-search/data/vectors.db under your project root (or under SEMANTIC_SEARCH_DATA_DIR, if set). Restart the server to re-index when you add or change files.
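The example prompts above end up as arguments to the search tool: a query string and an optional limit that defaults to 10. A minimal sketch of how a caller might prepare those arguments follows. The helper normalizeSearchArgs and its validation rules are hypothetical; only the parameter names and the default come from the tool description.

```typescript
// Arguments of the search tool as described in the README.
interface SearchArgs {
  query: string;
  limit?: number;
}

// Default taken from the tool table ("optional limit (default 10)").
const DEFAULT_LIMIT = 10;

// Fill in the default limit and reject obviously invalid input.
// The validation rules are an assumption, not documented server behavior.
function normalizeSearchArgs(args: SearchArgs): Required<SearchArgs> {
  const query = args.query.trim();
  if (query.length === 0) {
    throw new Error("query must not be empty");
  }
  const limit = args.limit ?? DEFAULT_LIMIT;
  if (!Number.isInteger(limit) || limit < 1) {
    throw new Error("limit must be a positive integer");
  }
  return { query, limit };
}
```

For the first prompt, the assistant would be expected to pass something like { query: "to cause to become warm", limit: 20 }; the second prompt omits the limit, so the default applies.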
Development
The server is not built to JavaScript; it runs via npx tsx src/index.ts or node run.mjs. No tsc or node dist/ usage.
From source (e.g. before publishing to npm):
- Run npm install once in the repo.
- In MCP config use either "command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp", or "command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"] (run.mjs sets cwd automatically; see MCP_SETUP.md).
To run the server from the repo: npm run dev or npx tsx src/index.ts.
License
MIT
