@wb200/mgrep

v0.1.18

Published

10 hours ago

Local semantic code search with LanceDB indexing and DeepInfra-powered retrieval

wb200

semantic-search embeddings lancedb code-search grep search

Why mgrep?

- Ask your repo questions in natural language instead of guessing exact symbols.
- Keep a local LanceDB index on disk under ~/.mgrep/lancedb/.
- Combine vector retrieval, full-text search, reranking, and optional answer synthesis.
- Work directly in the CLI or wire it into coding agents.
- Use it as a hybrid semantic complement to rg, grep, and ast-grep, not as a replacement for exact or structural search.

mgrep is for local repository search. It does not do web search in this fork.

# index a project
mgrep watch

# search semantically
mgrep "where do we set up auth?"

# synthesize an answer from retrieved local results
mgrep -a "how does the sync pipeline work?"

Quick Start

Install
```
npm install -g @wb200/mgrep
```
Set required API key
```
export DEEPINFRA_API_KEY=your_deepinfra_key
```
- DEEPINFRA_API_KEY is used for embeddings, reranking, synthesized answers, and agentic query planning.
- It is required for normal use in this fork.
Validate configuration
```
mgrep validate
```
Know where config lives
- Project-local: .mgreprc.yaml or .mgreprc.yml in the directory you are indexing/searching from
- Global: ~/.config/mgrep/config.yaml or ~/.config/mgrep/config.yml
Index a project
```
cd path/to/repo
mgrep watch
```
Inspect the effective indexing rules
```
mgrep rules
```

Search

mgrep "where do we set up auth?"
mgrep -m 25 "store schema"
mgrep -a "how is rate limiting implemented?"

What It Does

mgrep keeps a local searchable index of your repository.

- Indexing is allowlist-first. A file must match an allowed extension, exact basename, or exact hidden basename before it is eligible for indexing.
- Configured blockedPaths can exclude path prefixes regardless of filename.
- After allowlist admission, .gitignore, .mgrepignore, built-in deny patterns, hidden-directory blocking, and text-file detection can still exclude a file.
- Indexed content is chunked and stored locally in LanceDB.
- Embeddings, reranking, answer synthesis, and agentic planning are done through DeepInfra.

This means the index itself is local, but text chunks are sent to DeepInfra during embedding, reranking, and answer-generation flows.
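
The admission order above can be sketched as a small shell function. This is an illustration only, not mgrep's implementation: the `is_eligible` name and both allowlists are hypothetical, the deny patterns are the built-in ones named under Search Behavior and Limits, and the .gitignore, .mgrepignore, blockedPaths, and text-detection stages are left out. Run `mgrep rules` to see the real effective rules.

```shell
# Hypothetical allowlists, for illustration only.
ALLOWED_EXTENSIONS="ts md py yaml"
ALLOWED_BASENAMES="Makefile Dockerfile"

# is_eligible <relative-path>: returns 0 if the path passes every stage below.
is_eligible() {
  path="${1#./}"
  base="${path##*/}"
  case "$base" in
    *.*) ext="${base##*.}" ;;
    *)   ext="" ;;
  esac

  # Stage 1: allowlist admission by exact basename or extension.
  admitted=no
  for allowed in $ALLOWED_BASENAMES; do
    if [ "$base" = "$allowed" ]; then admitted=yes; fi
  done
  for allowed in $ALLOWED_EXTENSIONS; do
    if [ -n "$ext" ] && [ "$ext" = "$allowed" ]; then admitted=yes; fi
  done
  [ "$admitted" = yes ] || return 1

  # Stage 2: built-in deny patterns still apply after admission.
  case "$base" in
    *.bin|*.lock|*.pt|*.pyc|*.safetensors|*.sqlite) return 1 ;;
  esac

  # Stage 3: any hidden directory component blocks the file.
  case "/$path" in
    */.[!./]*/*) return 1 ;;
  esac
  return 0
}
```

With these assumed lists, `is_eligible src/app.ts` succeeds, while `is_eligible .github/workflows/ci.yaml` fails because of the hidden directory even though the extension is allowlisted.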

Search Strategy

mgrep works best as the semantic layer in a local-search toolkit:

- Use mgrep for intent-based discovery, architecture questions, and unfamiliar codebases.
- Use rg or grep for exact strings, regexes, and exhaustive audits.
- Use ast-grep for syntax-aware structural matches and refactor prep.

A common workflow is to use mgrep first to find candidate files or concepts, then confirm exact implementation details with rg or ast-grep.

Commands

Top-level commands:

mgrep or mgrep search <pattern> [path]
mgrep rules [path]
mgrep watch
mgrep validate
mgrep install-claude-code
mgrep uninstall-claude-code
mgrep install-codex
mgrep uninstall-codex
mgrep install-opencode
mgrep uninstall-opencode
mgrep install-droid
mgrep uninstall-droid
mgrep mcp

Global options:

--store <string>: logical store name to use, default mgrep

Understanding Stores

The --store flag controls which logical index mgrep reads from and writes to.

This is one of the most important things to understand when using mgrep across multiple folders.

The default store name

If you neither pass --store nor set MGREP_STORE, mgrep uses the built-in store named:

mgrep

That means these are equivalent:

mgrep watch
mgrep --store mgrep watch

and:

mgrep "query"
mgrep --store mgrep "query"

You can also change the default for a shell session with:

export MGREP_STORE=my-store

After that, plain mgrep ... commands in that shell use my-store unless you override them with --store.

One command searches exactly one store

mgrep does not search all stores automatically.

Each command uses exactly one logical store, chosen in this order:

- the --store flag, if given
- otherwise the MGREP_STORE environment variable, if set
- otherwise the built-in default, mgrep

So if you indexed a folder with:

mgrep --store factory-specs watch

and later run:

mgrep "query"

you are not searching factory-specs. You are searching the default store mgrep.

To search the store you indexed, you must use:

mgrep --store factory-specs "query"

or:

export MGREP_STORE=factory-specs
mgrep "query"
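
The lookup order in this section can be sketched as a shell function. `resolve_store` is a hypothetical helper, not part of mgrep, and treating an empty MGREP_STORE the same as an unset one is an assumption of this sketch.

```shell
# resolve_store [flag_value]
# Prints the store a command would use: the --store value if given,
# otherwise MGREP_STORE, otherwise the built-in default "mgrep".
resolve_store() {
  flag_value="${1:-}"
  if [ -n "$flag_value" ]; then
    printf '%s\n' "$flag_value"
  elif [ -n "${MGREP_STORE:-}" ]; then
    printf '%s\n' "$MGREP_STORE"
  else
    printf '%s\n' "mgrep"
  fi
}
```

Calling `resolve_store` before running a search is a quick way to reason about which index a plain `mgrep "query"` would hit in the current shell.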

What `watch` indexes

mgrep watch indexes:

- the current working directory
- all subdirectories under it
- only files that match the current allowlist, then survive .gitignore, .mgrepignore, built-in deny patterns, hidden-directory blocking, and text-file detection

So this:

cd /path/to/project
mgrep --store my-project watch

indexes /path/to/project and all of its eligible subfolders into store my-project.

Stores are additive across multiple folders

If you run watch in two different folders with the same store name, the index is additive.

Example:

cd /path/project-a
mgrep --store shared watch

cd /path/project-b
mgrep --store shared watch

After that, store shared contains indexed content for both:

/path/project-a/...
/path/project-b/...

The second watch does not wipe the first one.

Deletions are scoped to the watched folder

When watch or search --sync removes stale entries, it only removes index entries for paths inside the folder subtree being synced.

That means if project-a and project-b both live in store shared:

- syncing project-a can remove stale entries from project-a
- syncing project-a does not remove entries from project-b

This is what makes additive multi-root stores possible.

Search is store-scoped and path-scoped

Search always works in two layers:

- it selects a single store
- it filters results to the current directory or the path argument you pass

So if you are inside project-a and search against a shared store, mgrep still scopes results to your current path by default.

Examples:

cd /path/project-a
mgrep --store shared "auth middleware"

searches store shared, but only for content under /path/project-a/....

And:

mgrep --store shared "auth middleware" /path/project-b

searches the same shared store, but only under /path/project-b/....
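
The path-scoping layer amounts to a prefix check on each indexed path. The sketch below shows the idea with a hypothetical `in_scope` helper; it assumes absolute, normalized paths and is not taken from mgrep's source.

```shell
# in_scope <indexed-path> <search-root>
# Returns 0 when the indexed path is the root itself or lies under it.
in_scope() {
  case "$1" in
    "$2"|"$2"/*) return 0 ;;
    *)           return 1 ;;
  esac
}
```

Matching on `"$2"/*` rather than `"$2"*` keeps a sibling such as /path/project-ab from being treated as part of /path/project-a.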

Recommended usage patterns

One store per project is the easiest model to reason about:

cd ~/code/project-a
mgrep --store project-a watch

cd ~/code/project-b
mgrep --store project-b watch

Then search with the matching store name:

mgrep --store project-a "query"
mgrep --store project-b "query"

Shared store across multiple roots is also supported if you want it intentionally:

cd ~/notes
mgrep --store personal watch

cd ~/specs
mgrep --store personal watch

This combines both roots into one logical store named personal.

Practical rule of thumb

If you are unsure, use this rule:

- for one project: the default store mgrep is fine
- for multiple unrelated projects: give each project its own --store
- if you want one combined multi-root index: reuse the same --store deliberately
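
If you follow the one-store-per-project pattern, a small wrapper can pick the store name for you. This is a sketch under one assumption: the store is named after the current folder's basename. `project_store_name` and `mgrep_project` are hypothetical helpers; only the `--store` flag comes from mgrep itself.

```shell
# project_store_name [dir]
# Derives a store name from a directory's basename (default: current directory).
project_store_name() {
  basename "${1:-$PWD}"
}

# mgrep_project <mgrep arguments...>
# Runs mgrep against the store named after the current directory.
mgrep_project() {
  mgrep --store "$(project_store_name "$PWD")" "$@"
}

# Example (not run here):
#   cd ~/code/project-a && mgrep_project watch
#   mgrep_project "where do we set up auth?"
```

Two projects whose folders share a basename would end up in the same store with this scheme, so pass --store explicitly in that case.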

`mgrep search`

mgrep search is the default command. It searches the current directory unless you pass a path.

Arguments:

- <pattern>: natural-language query
- [path]: optional search root or scoped path

Options:

- -m, --max-count <max_count>: maximum number of results, default 10
- -c, --content: include matched chunk content in output
- -a, --answer: synthesize an answer from retrieved local results
- -s, --sync: sync files before searching
- -d, --dry-run: preview sync work without uploading or deleting
- --no-rerank: disable reranking
- --max-file-size <bytes>: override upload size limit for sync
- --max-file-count <count>: override sync file-count limit
- --agentic: enable multi-query planning before retrieval

Examples:

mgrep "Where is the auth middleware configured?"
mgrep "How are chunks defined?" src/lib
mgrep -m 5 "maximum concurrent workers"
mgrep -c "How does caching work?"
mgrep -a "How is rate limiting implemented?"
mgrep --agentic -a "How does authentication work and where is it configured?"
mgrep --sync "Where is the API server started?"
mgrep --sync --dry-run "search query"

`mgrep watch`

mgrep watch performs an initial sync, then keeps the current project directory in sync via file watching.

Options:

- -d, --dry-run: preview what would be uploaded or deleted
- --max-file-size <bytes>: override upload size limit
- --max-file-count <count>: override sync file-count limit

Examples:

mgrep watch
mgrep watch --dry-run
mgrep watch --max-file-size 1048576
mgrep watch --max-file-count 5000

`mgrep rules`

mgrep rules shows the effective allow/block indexing logic for the current directory, or for the path you pass, after merging defaults, global config, and local config.

Arguments:

[path]: optional directory or file path to inspect

Options:

--json: emit the effective rules as JSON

Examples:

mgrep rules
mgrep rules --json
mgrep rules src

Use this when you want to confirm:

- the effective allowlists for extensions, exact names, and dotfiles
- the effective ignorePatterns and blockedPaths
- which local and global config files are currently being applied

`mgrep validate`

Validates the DeepInfra configuration by exercising embeddings, rerank, and chat completions.

mgrep validate

Agent Integration Commands

mgrep includes helper installers for several agent environments:

mgrep install-claude-code
mgrep uninstall-claude-code
mgrep install-codex
mgrep uninstall-codex
mgrep install-opencode
mgrep uninstall-opencode
mgrep install-droid
mgrep uninstall-droid

These integrations are focused on local search plus background indexing. After installation, mgrep warns that background sync will run automatically for supported agent flows.

`mgrep mcp`

Starts the internal MCP server process used by some integrations.

This command is not needed for normal CLI use.

Configuration

Configuration sources, highest precedence first:

- CLI flags
- Environment variables
- Local config file: .mgreprc.yaml or .mgreprc.yml
- Global config file: ~/.config/mgrep/config.yaml or ~/.config/mgrep/config.yml
- Built-in defaults

Config Locations

Project-local: .mgreprc.yaml or .mgreprc.yml in the directory you are indexing/searching from
Global: ~/.config/mgrep/config.yaml or ~/.config/mgrep/config.yml

Use the project-local file when you want rules that only apply to one repo or workspace. Use the global file when you want defaults applied across projects.

Config File

Example override:

This example narrows indexing further than the built-in defaults. For the full current default allowlist and a full ready-to-paste .mgreprc.yaml, see guides/README.md.

maxFileSize: 5242880
maxFileCount: 5000
syncConcurrency: 10
blockedPaths:
  - private
  - ~/scratch/generated-docs
ignorePatterns:
  - "*.csv"
  - "*.jsonl"

blockedPaths is for path-prefix exclusions that should always be skipped. Entries may be:

- relative paths, resolved relative to the config file that defines them
- absolute paths
- ~/... paths expanded against the current home directory

Use ignorePatterns for glob-style filename/path filtering after allowlist admission. Use blockedPaths when you want to exclude a directory subtree or specific path prefix regardless of file naming.
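
The three kinds of blockedPaths entry resolve differently, which the following sketch makes concrete. `resolve_blocked_path` is a hypothetical helper written for this README, not mgrep's code; it takes the entry and the directory of the config file that defines it.

```shell
# resolve_blocked_path <entry> <config-dir>
# Prints the absolute path prefix that a blockedPaths entry refers to.
resolve_blocked_path() {
  entry="$1"
  config_dir="$2"
  case "$entry" in
    "~/"*) printf '%s\n' "$HOME/${entry#??}" ;;    # ~/... expands against $HOME
    /*)    printf '%s\n' "$entry" ;;               # absolute paths are used as-is
    *)     printf '%s\n' "$config_dir/$entry" ;;   # relative to the defining config file
  esac
}
```

For the example config above, stored in /repo/.mgreprc.yaml, the entry private would resolve to /repo/private, while ~/scratch/generated-docs would resolve under the current user's home directory.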

Defaults:

- maxFileSize: 4194304 bytes
- maxFileCount: 10000
- syncConcurrency: 20
- lancedbPath: ~/.mgrep/lancedb
- embedModel: Qwen/Qwen3-Embedding-4B
- embedDimensions: 2560
- rerankModel: Qwen/Qwen3-Reranker-4B
- llmModel: MiniMaxAI/MiniMax-M2.5
- blockedPaths: empty by default

Environment Variables

Provider key:

DEEPINFRA_API_KEY

Store:

MGREP_STORE
MGREP_LANCEDB_PATH

Search behavior:

MGREP_MAX_COUNT
MGREP_CONTENT
MGREP_ANSWER
MGREP_AGENTIC
MGREP_AGENT
MGREP_SYNC
MGREP_DRY_RUN
MGREP_RERANK

Sync behavior:

MGREP_MAX_FILE_SIZE
MGREP_MAX_FILE_COUNT
MGREP_SYNC_CONCURRENCY

Model overrides:

MGREP_EMBED_MODEL
MGREP_EMBED_DIMENSIONS
MGREP_RERANK_MODEL
MGREP_LLM_MODEL

Search Behavior and Limits

- mgrep is text-first and allowlist-first.
- A file must match an allowed extension, exact basename, or exact hidden basename before it is eligible for indexing.
- Configured blockedPaths are always excluded from indexing.
- Hidden directories remain excluded by default, even if a file inside them would otherwise match the allowlist.
- Default indexed content spans config and structured text, developer artifacts, docs and notes, exact filenames, exact hidden basenames, markup and templates, Python and related files, queries and infra, shell and automation, and web or mixed-language repo text.
- Built-in deny patterns include *.bin, *.lock, *.pt, *.pyc, *.safetensors, and *.sqlite.
- .gitignore, .mgrepignore, and configured ignorePatterns still apply after allowlist admission.
- Non-text and binary files are skipped even if they match the allowlist.
- For the exhaustive default inventory, see guides/README.md.
- watch and search --sync refuse to operate on the home directory or any of its parent directories.
- Sync is bounded by maxFileSize and maxFileCount.
Output

Search results are printed as:

./path/to/file:<line-start>-<line-end> (<score>% match)

With --content, chunk text is included below each result.

With --answer, mgrep prints the synthesized answer and the cited local source chunks it used.

Architecture

- Local storage: LanceDB under ~/.mgrep/lancedb/
- Retrieval: vector similarity + full-text search
- Fusion: reciprocal-rank fusion
- Reranking: DeepInfra
- Answer synthesis and agentic planning: DeepInfra chat completions
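
Reciprocal-rank fusion merges several ranked lists by giving each item a score of 1 / (k + rank) per list and summing. The sketch below shows the textbook form with the conventional k = 60. The constant mgrep uses and the details of its implementation are not stated in this README, so treat `rrf_fuse` and `RRF_K` as hypothetical names. Each input file holds one result id per line, best first.

```shell
# rrf_fuse <ranked-file>...
# Prints ids ordered by fused score, highest first (ties broken by id).
rrf_fuse() {
  tab="$(printf '\t')"
  LC_ALL=C awk -v k="${RRF_K:-60}" '
    # FNR restarts at 1 for every input file, so it is the 1-based rank.
    { score[$0] += 1 / (k + FNR) }
    END { for (id in score) printf "%.6f\t%s\n", score[id], id }
  ' "$@" | LC_ALL=C sort -t "$tab" -k1,1nr -k2,2 | cut -f2-
}
```

Given a vector ranking of a, b, c and a full-text ranking of b, d, a, this prints b, a, d, c: items found by both retrievers rise above items found by only one.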

Development

pnpm install
pnpm build
pnpm test
pnpm format
pnpm typecheck

The built CLI entrypoint is dist/index.js.

Troubleshooting

- Missing API keys: run mgrep validate
- Sync blocked at home directory: run from a specific project subdirectory
- Inspect current indexing rules: run mgrep rules or mgrep rules --json
- Why a file may not be indexed: it may be outside the allowlist, inside a hidden directory, excluded by blockedPaths, .gitignore, .mgrepignore, or ignorePatterns, or rejected as binary or non-text
- Store incompatibility after changing embedding settings: delete the affected store under ~/.mgrep/lancedb/<store-name>/ and re-index
- Slow initial indexing: lower syncConcurrency if you are rate-limited, or tune file limits for very large repos

License

Apache-2.0. See LICENSE.
