@lojban/semantic-search-mcp

v1.0.18

Published

2 months ago

Local-first MCP server for semantic search using transformers.js and SQLite

lojban

mcp model-context-protocol semantic-search embeddings transformers sqlite lojban

Semantic Local MCP

A local-first MCP (Model Context Protocol) server for semantic search over your documents. Index text files (e.g. TSV, CSV, TXT) line-by-line, then search or filter by meaning using embeddings—all on your machine, no API keys required.

Use it in Cursor, Claude Code, or any IDE that supports MCP to search through dictionaries, glossaries, and corpora by semantic similarity.

Use cases

Lojban (or any) dictionary: Index a TSV where each line is a word/definition. Find entries similar to a phrase or concept, or discover gaps—word combinations or concepts your dictionary doesn't cover yet.
Glossaries & term bases: "Find entries that mean something like …" without exact keyword match.
Corpora & line-based data: Any file where each line is a record (TSV, CSV, one-sentence-per-line TXT). Index once, query by meaning.

How it works

Indexing: On startup, the server indexes content in the background. If SEMANTIC_SEARCH_INDEX_DIRS is set (comma-separated paths), it scans those directories. If it is not set, the server downloads the lojban/sampu_vlaste repository from GitHub and indexes that instead. In both cases, the server looks for .txt, .md, .tsv, and .csv files:
- .txt, .tsv, .csv: each non-empty line is one record.
- .md: chunked by paragraphs and blocks: merged multi-line > blockquotes (e.g. Lojban + glosses), whole HTML <table>...</table> blocks, and blank-line-separated prose (including consecutive list items). The latest ## / ### titles are prepended as Context: … on each chunk for better retrieval.

Each chunk gets one embedding (via Hugging Face Transformers.js, model Xenova/all-MiniLM-L6-v2) and is stored in a local SQLite database with @dao-xyz/sqlite3-vec (SQLite + sqlite-vec for Node and browser). The line field in search results is the start line of that chunk in the file. After upgrading to a version that changes chunking, restart the server so files are re-indexed (mtime/content hash refresh).
Search: You send a natural-language query; the server embeds it and returns the closest chunks (single lines, for line-based files) by cosine similarity.
Storage: The index is stored in your project's .semantic-search/data/ directory (or in the directory set by SEMANTIC_SEARCH_DATA_DIR). No cloud, no API keys.
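The search step can be illustrated with a small sketch. It is not the server's implementation (the server keeps its vectors in SQLite via sqlite-vec); it only shows what "closest by cosine similarity" means. The names IndexedChunk, Hit, cosineSimilarity, and rankBySimilarity are hypothetical, and the 3-dimensional vectors are toy values (all-MiniLM-L6-v2 produces 384-dimensional embeddings).

```typescript
// Illustrative only: hypothetical types and functions, not the package's API.
interface IndexedChunk {
  file: string;        // path of the source file
  line: number;        // start line of the chunk in that file
  content: string;     // the chunk text
  embedding: number[]; // one vector per chunk
}

interface Hit {
  file: string;
  line: number;
  content: string;
  score: number; // cosine similarity, between -1 and 1
}

// Cosine similarity: dot(a, b) / (|a| * |b|). Returns 0 if either vector is all zeros.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

// Score every chunk against the query vector and keep the `limit` best matches.
function rankBySimilarity(query: number[], chunks: IndexedChunk[], limit = 10): Hit[] {
  return chunks
    .map(({ file, line, content, embedding }) => ({
      file,
      line,
      content,
      score: cosineSimilarity(query, embedding),
    }))
    .sort((x, y) => y.score - x.score)
    .slice(0, limit);
}

// Toy usage: the query points the same way as the first chunk's vector.
const toyChunks: IndexedChunk[] = [
  { file: "dict.tsv", line: 1, content: "first", embedding: [1, 0, 0] },
  { file: "dict.tsv", line: 2, content: "second", embedding: [0, 1, 0] },
  { file: "dict.tsv", line: 3, content: "third", embedding: [1, 1, 0] },
];
const toyHits = rankBySimilarity([1, 0, 0], toyChunks, 2);
// toyHits holds line 1 (score 1) followed by line 3 (score about 0.707).
```

A brute-force scan like this is only practical for small indexes; a vector extension such as sqlite-vec exists so the database can do this work instead.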

Requirements

Node.js 18+ (20+ recommended)
npm or pnpm

First run will download the embedding model (~80MB) and cache it locally.

Use in Cursor IDE

There is no build step and no need to run npm install yourself. The server runs only via npx tsx (TypeScript is run directly). Add a single command to MCP; on first run, npx will download the package and its dependencies, and the server will download the embedding model (~80MB) when you first index or search.

The package is published as @lojban/semantic-search-mcp. (To run from source before/without publishing, see the From source setup in the Development section.)

Add the MCP server in Cursor:
- Open Settings → Cursor Settings → MCP (or edit ~/.cursor/mcp.json).
- Add:
```
{
  "mcpServers": {
    "semantic-search": {
      "command": "npx",
      "args": ["-y", "@lojban/semantic-search-mcp"]
    }
  }
}
```
No cwd needed: the server stores its index in your project directory (.semantic-search/data/), so open your project in Cursor and the index is per-workspace. Optional env settings:
- To use a fixed data directory instead, add "env": { "SEMANTIC_SEARCH_DATA_DIR": "/path/to/data" }.
- To have the server index specific directories on startup, set "env": { "SEMANTIC_SEARCH_INDEX_DIRS": "./dictionary,./glossary" } (comma-separated paths). If you omit SEMANTIC_SEARCH_INDEX_DIRS, the server downloads and indexes the lojban/sampu_vlaste repo automatically.
Restart Cursor (or reload the window). Indexing starts automatically in the background: from your configured SEMANTIC_SEARCH_INDEX_DIRS, or from the downloaded sampu_vlaste repo if that env is not set.
In chat or Composer, ask the AI to use the tools:
- Search: "Use semantic-search tool: find combinations of words that can express the concept of …", "Use semantic-search tool: search the index for …" or "Use semantic-search tool: Find entries similar to …"
- Stats: "use semantic-search mcp. run get_index_stats" — stats include progress and start time (locale-formatted) when indexing is in progress.

The AI will call search and get_index_stats for you.

Use in other AI IDEs (Claude Code, etc.)

Any environment that supports MCP over stdio can use this server. There are two ways to run it:

One-liner: npx -y @lojban/semantic-search-mcp — dependencies are installed on first run; index is stored in the current working directory's .semantic-search/data/. Set env SEMANTIC_SEARCH_INDEX_DIRS (comma-separated paths) to index those directories on startup; if unset, the server downloads and indexes lojban/sampu_vlaste from GitHub. Tools: search, get_index_stats.

From source: Clone the repo, run npm install once, then use "command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp" or "command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"] (no cwd needed with the latter). See MCP_SETUP.md for details.

MCP tools

| Tool | Description |
|------|-------------|
| search | Semantic search: query (string), optional limit (default 10). Returns file path, line number, content, and similarity score. |
| get_index_stats | Returns total number of indexed files and lines. When indexing is running in the background, also returns progress: indexing.started_at (locale-formatted), lines_indexed_so_far, files_indexed_so_far, and in_progress. |
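For readers writing their own MCP client instead of using an IDE, the sketch below shows roughly what a call to each tool looks like on the wire. The tools/call method and the params.name / params.arguments fields come from the MCP specification; the argument names query and limit come from the tool table. The helper names buildSearchRequest and buildStatsRequest are hypothetical, and a real client would first complete the MCP initialize handshake, which is omitted here.

```typescript
// Argument shape for the `search` tool: `query` is required, `limit` is optional.
interface SearchArgs {
  query: string;
  limit?: number;
}

// JSON-RPC 2.0 envelope for an MCP `tools/call` request.
interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: string; arguments: Record<string, unknown> };
}

// Hypothetical helper: build a `search` call, leaving out `limit` when it is not given
// so the server-side default applies.
function buildSearchRequest(id: number, args: SearchArgs): ToolCallRequest {
  const toolArgs: Record<string, unknown> = { query: args.query };
  if (args.limit !== undefined) {
    toolArgs["limit"] = args.limit;
  }
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "search", arguments: toolArgs },
  };
}

// Hypothetical helper: `get_index_stats` takes no arguments.
function buildStatsRequest(id: number): ToolCallRequest {
  return {
    jsonrpc: "2.0",
    id,
    method: "tools/call",
    params: { name: "get_index_stats", arguments: {} },
  };
}

// Over stdio, each message is sent as a single line of JSON.
const searchMessage = JSON.stringify(
  buildSearchRequest(1, { query: "to cause to become warm", limit: 20 }),
);
const statsMessage = JSON.stringify(buildStatsRequest(2));
```

In practice an MCP client library handles this framing for you; the sketch is only meant to make the tool names and arguments concrete.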

Indexing on startup

With your own dirs: Set the environment variable SEMANTIC_SEARCH_INDEX_DIRS to a comma-separated list of directories to index. When the MCP server starts, it begins indexing those directories in the background (async).
Default (no env set): If SEMANTIC_SEARCH_INDEX_DIRS is not set, the server downloads the lojban/sampu_vlaste repository from GitHub (as a zip), extracts it under .semantic-search/sampu_vlaste/, and indexes that. The download is cached; subsequent starts reuse the cached copy.

The index is cleared and rebuilt each time the server starts. Use absolute paths or paths relative to the server's working directory when setting SEMANTIC_SEARCH_INDEX_DIRS. The server reads and indexes all supported .txt, .md, .tsv, .csv files under each directory recursively. Indexing uses bounded memory and yields to the event loop so the OS stays responsive.
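The directory setting and the file filter described above can be sketched as two small functions. This is an illustration under assumptions, not the server's code: parseIndexDirs and isSupportedFile are hypothetical names, and trimming whitespace around entries, ignoring empty entries, and matching extensions case-insensitively are all guesses about reasonable behavior.

```typescript
// Extensions the README lists as supported.
const SUPPORTED_EXTENSIONS = [".txt", ".md", ".tsv", ".csv"];

// Split a comma-separated SEMANTIC_SEARCH_INDEX_DIRS value into directory paths.
// Assumption: surrounding whitespace is trimmed and empty entries are dropped.
function parseIndexDirs(value: string | undefined): string[] {
  if (value === undefined) return [];
  return value
    .split(",")
    .map((dir) => dir.trim())
    .filter((dir) => dir.length > 0);
}

// Decide whether a path has one of the supported extensions.
// Assumption: the comparison ignores case, so "NOTES.MD" counts as Markdown.
function isSupportedFile(path: string): boolean {
  const lower = path.toLowerCase();
  return SUPPORTED_EXTENSIONS.some((ext) => lower.endsWith(ext));
}

// Example: the value used elsewhere in this README.
const exampleDirs = parseIndexDirs("./dictionary,./glossary");
// exampleDirs is ["./dictionary", "./glossary"]
```

An empty result from parseIndexDirs would correspond to the default case, where the server falls back to downloading lojban/sampu_vlaste.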

Example: Lojban dictionary gaps

Put your dictionary TSV (e.g. jbo-eng.tsv) in a folder (e.g. ./dictionary).
Set SEMANTIC_SEARCH_INDEX_DIRS=./dictionary in your MCP config (or in the environment). Restart the server; indexing runs in the background.
In Cursor: "Search for entries similar to 'to cause to become warm' and limit 20."
Or: "Search for 'emotional state of joy' and show me what we have; then suggest word combinations the dictionary might be missing."

The index is stored in .semantic-search/data/vectors.db under your project root (or in the directory set by SEMANTIC_SEARCH_DATA_DIR). Restart the server to re-index when you add or change files.
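To make "each non-empty line is one record" concrete for a dictionary TSV, here is a minimal sketch of the splitting step. It is an assumption-laden illustration rather than the server's code: LineRecord and splitIntoRecords are hypothetical names, line numbers are assumed to be 1-based, whitespace-only lines are assumed to count as empty, and the two sample entries are short made-up glosses.

```typescript
// One record per non-empty line, remembering where it came from.
interface LineRecord {
  line: number;    // 1-based line number in the file (assumption)
  content: string; // the full line, tabs included
}

// Split file text into records, skipping empty and whitespace-only lines.
// Handles both "\n" and "\r\n" line endings.
function splitIntoRecords(text: string): LineRecord[] {
  const records: LineRecord[] = [];
  const lines = text.split(/\r?\n/);
  lines.forEach((raw, index) => {
    if (raw.trim().length > 0) {
      records.push({ line: index + 1, content: raw });
    }
  });
  return records;
}

// Sample TSV text: word, tab, gloss. The blank line in the middle is skipped,
// but it still counts toward the line numbers of later records.
const sampleTsv = "glare\thot\n\ngleki\thappy\n";
const sampleRecords = splitIntoRecords(sampleTsv);
// sampleRecords has two entries, at lines 1 and 3.
```

Each record would then be embedded as a whole, so the word and its definition contribute to the same vector.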

Development

The server is not built to JavaScript; it runs via npx tsx src/index.ts or node run.mjs. No tsc or node dist/ usage.

From source (e.g. before publishing to npm):

Run npm install once in the repo.
In MCP config use either:
- "command": "npx", "args": ["tsx", "src/index.ts"], "cwd": "/path/to/semantic-search-mcp", or
- "command": "node", "args": ["/path/to/semantic-search-mcp/run.mjs"] (run.mjs sets cwd automatically; see MCP_SETUP.md).

To run the server from the repo: npm run dev or npx tsx src/index.ts.

Run tests: npm test.

License

MIT