codesift-mcp

v0.8.2

Published

5 days ago

MCP server for code intelligence — 146 tools for symbol search, call graph, semantic search, route tracing, community detection, LSP bridge, secret detection, conversation search, and Hono framework intelligence


greglas75

mcp code-search tree-sitter semantic-search claude anthropic

CodeSift -- Token-efficient code intelligence for AI agents

CodeSift indexes your codebase with tree-sitter AST parsing and gives AI agents 150 MCP tools (55 core + 95 discoverable) via CLI or MCP server. It uses 61-95% fewer tokens than raw grep/Read workflows on typical code navigation tasks.

Works with: Claude Code, Cursor, Codex, Gemini CLI, Zed, Aider, Continue — any MCP client.

Install

Bulletproof one-liner (clears stale cache, installs latest, auto-configures all platforms):

npm cache clean --force && npm i -g codesift-mcp@latest

npm cache clean --force clears stale registry metadata that can cause ETARGET errors. The postinstall script then runs codesift setup all automatically.

Restart your AI client (close + reopen) so the new MCP server is picked up. New terminal sessions in your IDE work fine — no need to quit the IDE itself.

To configure individual platforms manually:

codesift setup claude    # Claude Code — config + rules + hooks + CLAUDE.md
codesift setup codex     # Codex CLI — config + AGENTS.md rules
codesift setup cursor    # Cursor IDE — config + .cursor/rules
codesift setup gemini    # Gemini CLI — config + GEMINI.md rules
codesift setup antigravity # Google Antigravity — config only
codesift setup all       # All platforms at once

Verify installed version:

codesift --version

What setup installs (all by default):

| Component | What it does | Opt-out |
|-----------|-------------|---------|
| MCP config | Registers codesift-mcp server | (required) |
| Rules file | Tool mapping, hints, ALWAYS/NEVER rules for your AI agent | --no-rules |
| Hooks (where supported) | Auto-index after Edit/Write, redirect large Read/Bash flows to CodeSift | --no-hooks |

Additionally, every MCP client receives ~800 tokens of compact guidance automatically via the MCP instructions field — zero setup needed.

Update

npm update -g codesift-mcp
codesift setup all              # Updates rules files to latest version
codesift setup all --force      # Force-update even if you modified rules

If you use npx -y codesift-mcp (the default), each platform automatically picks up the latest published version on next session start. Re-run setup to update rules files to the latest version.

Quick start

# Index a project
codesift index /path/to/project

# Search for a function
codesift symbols local/my-project "createUser" --kind function --include-source

# Semantic search (requires embedding provider)
codesift retrieve local/my-project \
  --queries '[{"type":"semantic","query":"how does caching work?"}]'
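The `--queries` argument is a JSON array, which is easy to get wrong when quoting by hand. As a minimal sketch, the helper below (`build_retrieve_command` is a hypothetical name, not part of CodeSift) serializes a list of sub-queries and shell-quotes the result. It assumes only the sub-query fields shown in this README (`type`, `query`, `top_k`).

```python
import json
import shlex


def build_retrieve_command(repo, queries):
    """Return a shell-safe `codesift retrieve` command line (hypothetical helper)."""
    # Compact separators keep the payload on one short line.
    payload = json.dumps(queries, separators=(",", ":"))
    return f"codesift retrieve {shlex.quote(repo)} --queries {shlex.quote(payload)}"


queries = [
    {"type": "semantic", "query": "how does caching work?"},
    {"type": "hybrid", "query": "caching strategy", "top_k": 10},
]
print(build_retrieve_command("local/my-project", queries))
```

The printed line can be pasted into a terminal or passed to a process runner; the quoting is handled by `shlex.quote`, so queries containing spaces or quotes stay intact.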

Benchmark results

Combo benchmark (real-world tool sequences)

772 real tasks from usage.jsonl — exact query sequences agents used across 33+ repos. Native (grep/find/read) vs CodeSift.

| Sequence | Runs | Tok native | Tok Sift | Delta | Wins |
|----------|------|-----------|----------|-------|------|
| pat→st→pat→st (4-gram) | 37 | 377,258 | 36,758 | -90% | 28/37 |
| pat→st→pat | 39 | 186,436 | 20,500 | -89% | 31/39 |
| st→pat→st→pat | 35 | 307,490 | 35,905 | -88% | 25/35 |
| ss→st | 78 | 202,837 | 36,408 | -82% | 35/78 |
| st→pat→st | 40 | 250,240 | 44,424 | -82% | 27/40 |
| st→tree→st | 28 | 262,703 | 61,093 | -77% | 22/28 |
| tree→st | 57 | 380,324 | 133,578 | -65% | 44/57 |
| AGGREGATE | 772 | 5,130,240 | 1,994,825 | -61% | 542/772 |
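The Delta column can be reproduced from the two token columns. The sketch below assumes Delta is the change relative to the native baseline, rounded to a whole percent; `token_delta_pct` is a hypothetical helper name, and the figures are copied from the table above (sequence arrows written as `->`).

```python
def token_delta_pct(native_tokens, sift_tokens):
    """Percentage change versus the native baseline (negative means fewer tokens)."""
    return round(100 * (sift_tokens - native_tokens) / native_tokens)


rows = {
    "pat->st->pat->st (4-gram)": (377_258, 36_758),
    "tree->st": (380_324, 133_578),
    "AGGREGATE": (5_130_240, 1_994_825),
}
for name, (native, sift) in rows.items():
    print(f"{name}: {token_delta_pct(native, sift)}%")
```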

Per-tool (single-tool benchmark)

| Tool | Tok native | Tok Sift | Delta |
|------|-----------|----------|-------|
| search_text vs rg | 1,015,245 | 49,718 | -95% |
| search_symbols vs rg | 192,486 | 34,186 | -82% |
| get_file_outline vs Read | 91,796 | 58,229 | -37% |

Performance features

| Feature | Description | Impact |
|---------|-------------|--------|
| mtime-based incremental indexing | Skip files with unchanged mtime on reindex | 5.6x faster reindex (57s → 10s on 778-file repo) |
| index_file | Re-index a single file without full repo walk | 9ms (unchanged) / 153ms (changed) vs 3-8s full folder |
| detail_level on search_symbols | compact (~15 tok/result), standard, full | compact is 63% fewer tokens than standard |
| token_budget on search_symbols | Pack results to token limit instead of guessing top_k | Precise budget control |
| Centrality bonus in BM25 | Symbols in frequently-imported files rank higher | Core utilities surface first in search |
| Response dedup cache | Identical calls within 30s return cached result | Eliminates duplicate API calls |
| In-flight dedup | Parallel identical requests coalesce into one | Prevents race condition duplicates |
| Auto-grouping | Force group_by_file when output exceeds 80K chars | Prevents 100K+ token responses |
| Relevance-gap filtering | Cut search results below 15% of top score | 50→21 results (cleaner output) |
| Semantic chunking | Chunk by symbol boundaries, not fixed lines | Functions stay intact for semantic search |
| Token savings display | "Saved ~X tokens ($Y)" on every response | Visible ROI per call |
| Framework-aware dead code | Whitelist React hooks, NestJS lifecycle, Next.js handlers | <10% false positives (was ~40%) |
| Mermaid diagrams | detect_communities, get_knowledge_map, trace_route output Mermaid | Paste-ready architecture diagrams |
| HTML report | generate_report → standalone browser report | Complexity, dead code, hotspots, communities |
| Progressive cascade | >15K tok → compact format, >25K → counts only, >30K → truncate | Auto-adjusting response size |
| Tool visibility | Non-core tools hidden via MCP disable(), discoverable on demand | ~10K fewer tokens in system prompt |
| MCP instructions | ~800 tok of agent guidance sent automatically to every client | Zero-setup onboarding |
| Ranked search | search_text(ranked=true) classifies hits by containing symbol, deduplicates | Saves 1-3 follow-up calls |
| PreToolUse hooks | Redirect large-file Read to CodeSift outline/search | Prevents 5K+ token file dumps |
| PostToolUse hooks | Auto-reindex after Edit/Write | Always-fresh index |
| Sequential hints | Prepended hints (H1-H9) suggest batching after 3+ consecutive calls | Guides agents toward efficient usage |
| Wiki generation | generate_wiki produces markdown wiki from code topology | Architecture docs from Louvain communities + hubs + surprises |
| Lens HTML | Self-contained HTML dashboard with D3 chord diagram | Visual architecture overview in one file |
| Wiki hook inject | PreToolUse injects community context on file Read | Agent gets architectural context automatically |
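Two of the rules in the table are simple threshold checks and can be sketched in a few lines. This is an illustration under stated assumptions, not CodeSift's implementation: the function names are hypothetical, the boundary handling (inclusive vs exclusive) is a guess, and only the thresholds (15% of top score; 15K / 25K / 30K tokens) come from the table.

```python
def relevance_gap_filter(results, min_ratio=0.15):
    """Drop (name, score) results scoring below min_ratio of the best score."""
    if not results:
        return []
    top = max(score for _, score in results)
    return [(name, score) for name, score in results if score >= min_ratio * top]


def cascade_mode(estimated_tokens):
    """Pick a response format from an estimated response size in tokens."""
    if estimated_tokens > 30_000:
        return "truncate"
    if estimated_tokens > 25_000:
        return "counts"
    if estimated_tokens > 15_000:
        return "compact"
    return "full"


hits = [("createUser", 10.0), ("createUserDto", 2.0), ("userFixture", 1.0)]
print(relevance_gap_filter(hits))  # the 1.0 hit falls below 15% of 10.0
print(cascade_mode(18_000))
```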

CLI commands

Indexing

| Command | Description |
|---------|-------------|
| codesift index <path> | Index a local folder (mtime-based incremental — skips unchanged files) |
| codesift index-repo <url> | Clone and index a remote git repository |
| codesift repos | List all indexed repositories |
| codesift invalidate <repo> | Clear index cache for a repository |

Search

| Command | Description |
|---------|-------------|
| codesift search <repo> <query> | Full-text search across all files |
| codesift symbols <repo> <query> | Search symbols by name/signature (supports --detail compact\|standard\|full and --token-budget N) |

Outline

| Command | Description |
|---------|-------------|
| codesift tree <repo> | File tree with symbol counts |
| codesift outline <repo> <file> | Symbol outline of a single file |
| codesift repo-outline <repo> | High-level repository outline |

Symbol retrieval

| Command | Description |
|---------|-------------|
| codesift symbol <repo> <id> | Get a single symbol by ID |
| codesift symbols-batch <repo> <ids...> | Get multiple symbols by ID |
| codesift find <repo> <query> | Find symbol and show source |
| codesift refs <repo> <name> | Find all references to a symbol |
| codesift context-bundle <repo> <name> | Symbol + imports + siblings + types used in one call |

Graph & analysis

| Command | Description |
|---------|-------------|
| codesift trace <repo> <name> | Trace call chain (callers/callees). Supports --format mermaid for flowchart output. |
| codesift impact <repo> --since <ref> | Blast radius of git changes + affected tests + risk scores per file |
| codesift context <repo> <query> | Assemble relevant code context. Supports --level L0\|L1\|L2\|L3 for compression. |
| codesift knowledge-map <repo> | Module dependency map with circular dependency detection |
| codesift trace-route <repo> <path> | Trace HTTP route → handler → service → DB calls (NestJS/Next.js/Express/Ktor/Spring Boot Kotlin) |
| codesift communities <repo> | Louvain community detection — discover code clusters from import graph |

Code analysis

| Command | Description |
|---------|-------------|
| codesift dead-code <repo> | Find exported symbols with zero external references |
| codesift complexity <repo> | Cyclomatic complexity + nesting depth per function |
| codesift clones <repo> | Copy-paste detection (hash bucketing + line similarity) |
| codesift hotspots <repo> | Git churn x complexity = risk-ranked file list |
| codesift patterns <repo> <pattern> | Structural anti-pattern search (33 built-in + custom regex) |
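The hotspot ranking is described as git churn multiplied by complexity. A minimal sketch of that scoring follows; the input shape, the `rank_hotspots` name, and the sample numbers are hypothetical, and how CodeSift measures churn or normalizes the score is not specified here.

```python
def rank_hotspots(files):
    """Rank files by churn * complexity, highest risk first.

    `files` is a list of dicts with 'path', 'churn' (e.g. commit count)
    and 'complexity' (e.g. summed cyclomatic complexity). Hypothetical shape.
    """
    scored = [(f["path"], f["churn"] * f["complexity"]) for f in files]
    return sorted(scored, key=lambda item: item[1], reverse=True)


sample = [
    {"path": "src/utils/format.ts", "churn": 40, "complexity": 3},
    {"path": "src/billing/invoice.ts", "churn": 25, "complexity": 18},
    {"path": "src/legacy/parser.ts", "churn": 2, "complexity": 60},
]
for path, score in rank_hotspots(sample):
    print(f"{score:5d}  {path}")
```

A file that changes often but is simple, or is complex but rarely touched, ranks below one that is both, which is the point of multiplying the two signals.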

Wiki & Lens

| Command | Description |
|---------|-------------|
| codesift wiki-generate | Generate wiki pages + Lens HTML from code topology (communities, hubs, surprises, hotspots) |
| codesift wiki-generate --focus src/tools | Scope wiki to a specific directory |
| codesift wiki-generate --no-lens | Skip Lens HTML generation |
| codesift wiki-lint <wiki-dir> | Check wiki for broken links, orphan pages, stale content |

Output goes to .codesift/wiki/ in the repo root. Includes markdown pages with [[wikilinks]], backlinks, community summaries, and a self-contained codesift-lens.html with D3 chord diagram and force-directed graph.

Cross-repo

| Command | Description |
|---------|-------------|
| codesift cross-search <query> | Search symbols across ALL indexed repositories |
| codesift cross-refs <name> | Find references across ALL indexed repositories |

Diff

| Command | Description |
|---------|-------------|
| codesift diff <repo> --since <ref> | Structural diff between git refs |
| codesift changed <repo> --since <ref> | List changed symbols between refs |

Batch & utility

| Command | Description |
|---------|-------------|
| codesift retrieve <repo> --queries <json> | Batch multiple queries in one call |
| codesift stats | Show usage statistics |
| codesift generate-claude-md <repo> | Generate CLAUDE.md project summary |
| codesift list-patterns | List all built-in anti-pattern names |

MCP tools (146 total — 55 core + 95 discoverable)

When running as an MCP server, CodeSift exposes 51 core tools directly. The remaining 95 niche tools are discoverable via discover_tools and describe_tools, or via plan_turn which routes a natural-language task to the best-fit tools and auto-reveals any hidden ones.

| Category | Tools |
|----------|-------|
| Indexing | index_folder (mtime skip, dirty propagation), index_repo, index_file (single-file reindex, 9ms), list_repos, invalidate_cache |
| Search | search_symbols (detail_level: compact/standard/full, token_budget, kind filter incl. component/hook), search_text (auto_group, group_by_file, ranked) |
| Outline | get_file_tree, get_file_outline, get_repo_outline, suggest_queries (React-aware: suggests component/hook queries when detected) |
| Symbol retrieval | get_symbol, get_symbols, find_and_show, get_context_bundle (React enrichment: hooks_used, child_components, parent_components, wrapper pattern) |
| References & graph | find_references (LSP-enhanced), trace_call_chain (JSX-aware: <Component> = call edge; filter_react_hooks option), impact_analysis, trace_route (HTTP route → handler → DB — NestJS/Next.js/Express/Ktor/Spring Boot/Yii2/Laravel) |
| React | trace_component_tree (BFS JSX composition tree with Mermaid output), analyze_hooks (hook inventory, Rule of Hooks violations, custom hook composition), analyze_renders (re-render risk: inline props, missing memo, children-aware threshold, markdown output), analyze_context_graph (createContext → Provider → useContext consumer mapping) |
| LSP bridge | go_to_definition (LSP + index fallback), get_type_info (hover), rename_symbol (cross-file type-safe rename) |
| Context & knowledge | assemble_context (level: L0/L1/L2/L3), get_knowledge_map, detect_communities (Louvain) |
| Conversation search | index_conversations, search_conversations, find_conversations_for_symbol |
| Diff | diff_outline, changed_symbols |
| Batch retrieval | codebase_retrieval (batch multiple sub-queries with shared token budget, incl. type: "conversation") |
| Security | scan_secrets (AST-aware secret detection, ~1,100 rules, masked output) |
| PHP / Yii2 | resolve_php_namespace (PSR-4 FQCN→file), trace_php_event (event→listener chain), find_php_views (render→view mapping), resolve_php_service (Yii::$app→concrete class), php_security_scan (compound: SQL injection, XSS, eval, exec, unserialize), php_project_audit (meta-tool — includes ActiveRecord analysis, N+1 detection, god-model detection via checks= parameter) |
| Analysis | find_dead_code (framework-aware incl. React/Next.js route entry points), analyze_complexity (React: hook_count, state_count, effect_count, jsx_depth), find_clones, analyze_hotspots, search_patterns (33 built-in: JS/TS ×9, React ×20, Kotlin ×6, PHP ×4), list_patterns, frequency_analysis (AST subtree clustering), find_perf_hotspots (6 perf anti-patterns: unbounded queries, sync I/O, N+1 loops, unbounded parallel, missing pagination, expensive recompute), explain_query (Prisma→SQL with EXPLAIN ANALYZE), audit_scan (5-gate composite: dead code + clones + patterns + complexity + hotspots) |
| Architecture | classify_roles (symbol role classification via call graph), check_boundaries (architecture boundary enforcement), ast_query (structural grep via tree-sitter), fan_in_fan_out (import graph coupling: most-imported, most-dependent, hub files, coupling score 0-100), co_change_analysis (temporal coupling from git history: Jaccard similarity, cluster detection), architecture_summary (one-call composite: stack + communities + coupling + circular deps + LOC + entry points, Mermaid output) |
| Cross-repo | cross_repo_search, cross_repo_refs |
| Report | generate_report (standalone HTML with complexity, dead code, hotspots, communities), generate_wiki (markdown wiki pages + Lens HTML from code topology — communities, hubs, surprises, hotspots, framework pages) |
| Tool discovery | discover_tools (keyword search across hidden tools), describe_tools (full schema on demand, optional reveal) |
| Discovery | plan_turn(query=...) — route natural-language task description to best-fit tools, symbols, and files; returns ranked recommendations with confidence scores, reveal_required hints, and gap analysis |
| Meta | index_status (check if repo is indexed: file/symbol counts, language breakdown, text_stub languages), analyze_project (stack + conventions detection), get_extractor_versions (parser language support) |
| Utility | generate_claude_md (architecture + behavioral guidance), usage_stats (with token savings tracking) |

Conversation search

Search past Claude Code conversation history — the decisions, rationale, and debugging sessions that shaped your code.

# Index conversations for current project (auto-detected from cwd)
# Also runs automatically at startup via auto-discovery
index_conversations()

# Index a specific project's conversations
index_conversations(project_path="/Users/me/.claude/projects/-Users-me-DEV-my-project")

# Search past conversations
search_conversations(query="auth middleware bug", limit=5)

# Find conversations that discussed a specific code symbol
find_conversations_for_symbol(symbol_name="processPayment", repo="local/my-project")

# In codebase_retrieval batch queries
codebase_retrieval(repo, queries=[
  {"type": "semantic", "query": "how does auth work"},
  {"type": "conversation", "query": "why we chose Redis over Postgres cache"}
])

Features:

Auto-discovery at startup (zero config)
Session-end hook for immediate re-indexing
Noise filtering: tool_result dumps stripped, tool_use truncated, images → [image]
Compaction-aware: skips summary injections, indexes last summary as meta-doc
Cross-reference: link code symbols to the conversations that discussed them

Secret scanning

Detect hardcoded secrets (API keys, JWT tokens, passwords, connection strings) in your indexed codebase. Uses ~1,100 detection rules from TruffleHog via @sanity-labs/secret-scan, with CodeSift's tree-sitter AST for false-positive reduction.

# Scan entire repo for secrets
scan_secrets(repo="local/my-project")

# Filter by severity
scan_secrets(repo="local/my-project", severity="critical")

# Only high-confidence findings, including test files
scan_secrets(repo="local/my-project", min_confidence="high", exclude_tests=false)

# Scope to specific directory
scan_secrets(repo="local/my-project", file_pattern="src/config/**")

Features:

Eager scanning on file change — results are cached and instant on query
AST-aware confidence: test files, docs, placeholder variables auto-demoted to low
Masked output — secrets shown as sk-p***hijk, raw values never in cache or logs
Inline allowlist — add // codesift:allow-secret to suppress a finding
Config files indexed — .env, .yaml, .toml, .json, .ini, .properties scanned
Severity mapping: cloud keys (AWS, GCP) = critical, API keys (OpenAI, GitHub) = high
Inline warnings in index_file responses when secrets detected
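The masked form `sk-p***hijk` suggests that a short prefix and suffix are kept and the middle is replaced. The sketch below reproduces that shape; the exact rule (four characters on each side, full masking of short values) is an assumption, `mask_secret` is a hypothetical name, and the sample key is made up.

```python
def mask_secret(value, keep=4):
    """Mask a secret, keeping `keep` characters at each end (assumed rule)."""
    # Short values are masked entirely so the prefix/suffix cannot reveal them.
    if len(value) <= 2 * keep:
        return "***"
    return f"{value[:keep]}***{value[-keep:]}"


print(mask_secret("sk-proj-abcdefghijk"))  # fake key used only for illustration
```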

Wiki & Lens — auto-generated architecture documentation

Generate browsable wiki pages and an interactive HTML dashboard from your codebase's topology — zero manual writing.

# Generate wiki for current repo
codesift wiki-generate

# Scope to a directory
codesift wiki-generate --focus src/tools

# Check wiki integrity
codesift wiki-lint .codesift/wiki

What it generates (in .codesift/wiki/):

Community pages — one per Louvain community (module), with members, cohesion score, cross-boundary edges
Hubs page — top symbols by fan-in (load-bearing code)
Surprises page — unexpected cross-community connections (structural + temporal coupling)
Hotspots page — files ranked by git churn × complexity
Framework pages — conditional pages for Next.js routes, Hono middleware, Astro islands (when detected)
Index page — links to all pages with [[wikilinks]] and auto-generated backlinks
Summaries — compact *.summary.md files (~400 tokens) for AI agent context injection

Lens HTML dashboard (codesift-lens.html):

Self-contained single HTML file — open in any browser, no server needed
D3 chord diagram showing cross-community connections
D3 force-directed graph with community nodes
5 tabs: Overview, Communities, Hubs, Surprises, Wiki browser
Dark/light theme, responsive

AI agent integration:

Hook inject via handlePrecheckRead — when an agent reads a file, it automatically receives the file's community wiki summary as context
Configurable size budget for the injected summary (default 2000 chars)
Staleness detection — warns when wiki is outdated vs current index

MCP tool:

generate_wiki(repo, focus?, output_dir?, include_lens?)

PHP / Yii2 support

Full PHP code intelligence with first-on-market Yii2 framework awareness. No other general-purpose MCP tool provides static Yii2 intelligence.

Symbol extraction (tree-sitter-based):

Namespaces, classes, interfaces, traits, enums (PHP 8.1), functions, methods, properties, constants
PHPDoc extraction, signature extraction with type hints and return types
PHPUnit test detection: TestCase subclass = test_suite, test* methods = test_case, setUp/tearDown = test_hook

Yii2 framework awareness:

Convention routing: trace_route("site/index") resolves to SiteController::actionIndex() (incl. module nesting)
analyze_project detects Yii2 via composer.json and extracts: controllers, models, modules, widgets, behaviors, components, assets, config files
6 PHP-specific tools: namespace resolution (PSR-4), event/listener tracing, view mapping, service locator resolution, security scanning, project audit (meta-tool with ActiveRecord analysis, N+1 detection, god-model detection via checks= parameter)
Auto-load: PHP tools are automatically enabled when composer.json is detected at CWD — no need to call discover_tools/describe_tools first
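Yii2's convention routing maps a route string to a controller class and action method by name. A simplified sketch of that naming rule is shown below; it treats every leading segment as a module ID and ignores default routes, controller subdirectories, and controller maps, so it is an approximation of the convention rather than how `trace_route` resolves routes. `yii2_route_to_handler` is a hypothetical name.

```python
def yii2_route_to_handler(route):
    """Map 'module/controller/action' to (modules, controller class, action method)."""

    def studly(part):
        # Yii2 IDs are hyphen-separated: 'post-comment' -> 'PostComment'.
        return "".join(word.capitalize() for word in part.split("-"))

    segments = [s for s in route.strip("/").split("/") if s]
    if len(segments) < 2:
        raise ValueError("expected at least 'controller/action'")
    *modules, controller, action = segments
    return modules, f"{studly(controller)}Controller", f"action{studly(action)}"


print(yii2_route_to_handler("site/index"))
print(yii2_route_to_handler("admin/post-comment/view-all"))
```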

Laravel support:

Route tracing via Route::get('/path', [Controller::class, 'method']) pattern matching
Convention extraction: controllers, middleware, models, routes, migrations

# Trace a Yii2 route
trace_route(repo, path="site/about")

# Analyze ActiveRecord models (via php_project_audit)
php_project_audit(repo, checks=["activerecord"], model_name="User")

# PHP security scan (8 parallel checks)
php_security_scan(repo)

# Resolve PSR-4 namespace to file
resolve_php_namespace(repo, class_name="App\\Models\\User")

LSP bridge: Intelephense configured for go-to-definition, find-references, type-info, and rename across PHP files.

Next.js intelligence

Deep Next.js static analysis — 3 core tools covering routing, rendering, security, SEO, and architecture:

framework_audit — one-call meta-audit: runs route map + metadata + server actions + boundary + data flow + middleware + component classification checks. Returns composite score with prioritized findings. Use checks= parameter to run individual checks (e.g., checks=["server-actions"], checks=["boundary"], checks=["link-integrity"], checks=["data-flow"], checks=["middleware"], checks=["components"], checks=["api-contract"])
nextjs_route_map — maps all App Router and Pages Router routes with rendering strategy (SSG/SSR/ISR/PPR), dynamic params, route groups, parallel routes, and intercepting routes
nextjs_metadata_audit — detects missing/incomplete metadata exports, OpenGraph gaps, missing robots/sitemap, and SEO anti-patterns across all routes

Auto-load: Next.js tools are automatically enabled when next is detected in package.json — no manual discovery needed.

When to use CodeSift vs grep

| Task | Best tool | Why |
|------|-----------|-----|
| Find text in files | codesift search | 33% fewer tokens, BM25 ranking |
| Find function by name | codesift symbols | Returns signature + body in 1 call |
| File structure | codesift tree | 20% fewer tokens, symbol counts |
| "How does X work?" | codesift retrieve (semantic) | 20% better quality on concept queries |
| Call chain tracing | codesift trace | AST-based caller/callee graph, Mermaid output |
| Dead code / unused exports | codesift dead-code | Automated scan, no manual grep needed |
| Complexity hotspots | codesift complexity | Cyclomatic complexity + nesting depth |
| Copy-paste detection | codesift clones | Hash bucketing + line similarity scoring |
| Anti-pattern search | codesift patterns | 9 built-in CQ patterns + custom regex |
| Explore new codebase | codesift suggest-queries | Instant overview: top files, kind distribution, example queries |
| Re-index after edit | index_file | 9ms skip / 153ms reparse vs 3-8s full folder |
| Trace HTTP route | trace_route | URL → handler → service → DB calls in one call |
| Discover code modules | detect_communities | Louvain clustering finds architectural boundaries |
| Dense context (5-10x) | assemble_context --level L1 | Signatures only — fits 56 symbols where L0 fits 19 |
| Go to definition | go_to_definition | LSP-precise when available, index fallback |
| Get type info | get_type_info | Return types + docs via LSP hover — no file reading |
| Rename across files | rename_symbol | LSP type-safe rename in all files at once |
| Detect hardcoded secrets | scan_secrets | ~1,100 rules, AST-aware, masked output, auto-cached |
| Ranked text search | search_text(ranked=true) | Classifies hits by function, saves follow-up get_symbol calls |
| Find hidden tools | discover_tools + describe_tools | 95 tools hidden by default — search by keyword, get full schema |
| Route task → tools | plan_turn(query="...") | Natural-language router: ranked tool/symbol/file recommendations with auto-reveal |
| Architecture wiki | codesift wiki-generate | Auto-generated markdown wiki from Louvain communities, hubs, surprises |
| Visual architecture | Open codesift-lens.html | D3 chord diagram + force graph in one self-contained HTML file |
| Find ALL occurrences | grep -rn | Exhaustive, no top_k cap |
| Count matches | grep -c | Simple exact count |

Built-in anti-patterns (33 total)

The patterns command searches for common code quality issues across your codebase:

| Pattern | What it finds |
|---------|---------------|
| empty-catch | catch (e) {} — swallowed errors |
| any-type | : any or as any — lost type safety |
| console-log | console.log/debug/info in production code |
| await-in-loop | Sequential await inside for loops |
| no-error-type | Catch without instanceof Error narrowing |
| toctou | Read-then-write without atomic operation |
| unbounded-findmany | Prisma findMany without take limit |
| scaffolding | TODO/FIXME/HACK markers, Phase/Step stubs, "not implemented" throws |
| runblocking-in-coroutine | Kotlin: runBlocking inside suspend function — deadlock risk |
| globalscope-launch | Kotlin: GlobalScope.launch/async — lifecycle leak |
| data-class-mutable | Kotlin: data class with var property — breaks hashCode contract |
| lateinit-no-check | Kotlin: lateinit var without isInitialized check |
| empty-when-branch | Kotlin: empty when branch — swallowed case |
| mutable-shared-state | Kotlin: mutable var inside object/companion — thread-unsafe |
| React (14 + 6 below) | |
| useEffect-no-cleanup | useEffect without cleanup return — memory leak |
| hook-in-condition | Hook inside if/for/while/switch — Rule of Hooks violation |
| useEffect-async | async function directly in useEffect |
| useEffect-object-dep | Object/array literal in dep array — infinite re-render |
| missing-display-name | React.memo/forwardRef without displayName |
| index-as-key | Array index used as React key — incorrect reconciliation |
| inline-handler | Arrow function in JSX event handler — memoization killer |
| conditional-render-hook | Hook called after early return — Rule of Hooks violation |
| dangerously-set-html | dangerouslySetInnerHTML — XSS risk |
| direct-dom-access | document.getElementById/querySelector — use useRef |
| unstable-default-value | = []/= {} default in params — new ref every render |
| jsx-falsy-and | {count && <Comp/>} renders "0" when count is 0 |
| nested-component-def | Component inside component — remounts every render |
| usecallback-no-deps | useCallback/useMemo without dep array — useless memoization |
| React 19 (4) | |
| react19-use-without-suspense | use(promise) call — verify Suspense boundary |
| react19-server-action-not-async | Non-async function in "use server" file |
| react19-form-action-non-function | <form action="url"> instead of action={fn} |
| react19-useoptimistic-no-transition | useOptimistic without useTransition pair |
| RSC (2) | |
| rsc-non-serializable-prop | Function passed as prop across RSC boundary |
| rsc-date-prop | Date object in JSX prop — loses prototype across boundary |
| PHP (7) | |
| sql-injection-php | User input flowing into SQL query |
| xss-php | Unescaped user input echoed to output |
| eval-php / exec-php | eval/shell execution — injection risk |
| unserialize-php | unserialize() on user input |
| unescaped-yii-view | Yii2 view without Html::encode() |
| raw-query-yii | Yii2 createCommand with string interpolation |

Custom regex is also supported: codesift patterns local/project "Promise<.*any>".

Performance anti-patterns (`find_perf_hotspots`)

A separate tool scans for performance-specific issues with balanced-brace loop body extraction (not just regex):

| Pattern | What it finds | Severity |
|---------|---------------|----------|
| unbounded-query | findMany/find without take/limit | high |
| sync-in-handler | readFileSync/execSync in route/handler/controller files | high |
| n-plus-one | DB/fetch call inside for/while loop body | high |
| unbounded-parallel | Promise.all(arr.map(...)) without concurrency control | medium |
| missing-pagination | API response from unbounded list query | medium |
| expensive-recompute | Same method called 2+ times in loop body (excludes common methods) | low |

# Scan all patterns
find_perf_hotspots(repo)

# Only N+1 and unbounded queries
find_perf_hotspots(repo, patterns="n-plus-one,unbounded-query")

# Scope to API directory
find_perf_hotspots(repo, file_pattern="src/api")
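Balanced-brace extraction means finding a loop keyword, then walking forward while counting `{` and `}` until the body closes, so nested blocks stay inside the body. The sketch below shows the idea on JavaScript-like source. It is a simplification with stated assumptions: it only handles braced `for`/`while` loops, ignores braces inside strings and comments, and the list of "DB/fetch" call names is hypothetical.

```python
import re

LOOP_RE = re.compile(r"\b(?:for|while)\s*\(")
# Hypothetical set of call names treated as DB/fetch calls.
IO_CALL_RE = re.compile(r"\b(?:fetch|findMany|findUnique|query)\s*\(")


def extract_loop_bodies(source):
    """Return the text between the braces of each for/while loop body."""
    bodies = []
    for match in LOOP_RE.finditer(source):
        # Skip the loop header by balancing parentheses.
        i, depth = match.end() - 1, 0
        while i < len(source):
            if source[i] == "(":
                depth += 1
            elif source[i] == ")":
                depth -= 1
                if depth == 0:
                    break
            i += 1
        start = source.find("{", i)
        if start == -1:
            continue
        # Walk the body, counting braces until the opening one is closed.
        depth = 0
        for j in range(start, len(source)):
            if source[j] == "{":
                depth += 1
            elif source[j] == "}":
                depth -= 1
                if depth == 0:
                    bodies.append(source[start + 1:j])
                    break
    return bodies


def loops_with_io_calls(source):
    """Flag each loop body that contains a DB/fetch-style call (n-plus-one candidate)."""
    return [bool(IO_CALL_RE.search(body)) for body in extract_loop_bodies(source)]


sample = """
for (const id of ids) {
  if (id) { await db.user.findUnique({ where: { id } }); }
}
while (queue.length) { queue.pop(); }
"""
print(loops_with_io_calls(sample))
```

A plain regex would stop at the first `}` and miss the call nested inside the `if` block; counting braces keeps the whole body together.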

MCP server setup

CodeSift runs as an MCP server, exposing 146 tools to AI agents (55 core + 95 discoverable). The fastest setup method is codesift setup <platform> which handles everything automatically. Manual configuration is also supported:

OpenAI Codex

Add this to ~/.codex/config.toml:

[mcp_servers.codesift]
command = "npx"
args = ["-y", "codesift-mcp"]
tool_timeout_sec = 120

You can also add it via the Codex CLI:

codex mcp add codesift -- npx -y codesift-mcp

Claude Code

Add this to ~/.claude/settings.json:

{
  "mcpServers": {
    "codesift": {
      "command": "npx",
      "args": ["-y", "codesift-mcp"]
    }
  }
}

With semantic search (OpenAI embeddings), add the env var manually:

{
  "mcpServers": {
    "codesift": {
      "command": "/bin/sh",
      "args": ["-c", "CODESIFT_OPENAI_API_KEY='sk-...' exec codesift-mcp"]
    }
  }
}

Claude Desktop

Add to ~/Library/Application Support/Claude/claude_desktop_config.json (macOS) or %APPDATA%\Claude\claude_desktop_config.json (Windows):

{
  "mcpServers": {
    "codesift": {
      "command": "node",
      "args": ["/path/to/codesift-mcp/dist/server.js"]
    }
  }
}

Cursor

Add this to ~/.cursor/mcp.json, or to .cursor/mcp.json in your project:

{
  "mcpServers": {
    "codesift": {
      "command": "npx",
      "args": ["-y", "codesift-mcp"]
    }
  }
}

Gemini CLI

Add this to ~/.gemini/settings.json, or to .gemini/settings.json in your project:

{
  "mcpServers": {
    "codesift": {
      "command": "npx",
      "args": ["-y", "codesift-mcp"]
    }
  }
}

You can also use the Gemini CLI:

gemini mcp add codesift -s user npx -- -y codesift-mcp

Google Antigravity

Add this to ~/.gemini/antigravity/mcp_config.json:

{
  "mcpServers": {
    "codesift": {
      "command": "npx",
      "args": ["-y", "codesift-mcp"]
    }
  }
}

All platforms at once

codesift setup all

This configures Codex, Claude Code, Cursor, Gemini CLI, and Antigravity in one command. Safe to run multiple times — skips platforms that are already configured.
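The "safe to run multiple times" behavior amounts to an idempotent merge: add the `codesift` entry only when it is missing. The sketch below illustrates that on an in-memory config dict using the `mcpServers` shape shown in the JSON examples above. It is not the setup command's actual code; `ensure_codesift_server` is a hypothetical name, and file reading and writing are left out.

```python
import copy

CODESIFT_ENTRY = {"command": "npx", "args": ["-y", "codesift-mcp"]}


def ensure_codesift_server(config):
    """Return (config, changed); an existing 'codesift' entry is left untouched."""
    if "codesift" in config.get("mcpServers", {}):
        return config, False
    updated = copy.deepcopy(config)  # never mutate the caller's dict
    updated.setdefault("mcpServers", {})["codesift"] = copy.deepcopy(CODESIFT_ENTRY)
    return updated, True


first, changed_first = ensure_codesift_server({"theme": "dark"})
second, changed_second = ensure_codesift_server(first)
print(changed_first, changed_second)
```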

Semantic search

Semantic search uses embeddings to answer concept queries like "how does authentication work?" that keyword search misses.

Setup

Zero config — semantic search works out of the box. CodeSift defaults to local on-device embeddings (nomic-ai/nomic-embed-text-v1.5 via @huggingface/transformers v3, INT8 ONNX, ~140MB downloaded on first use, cached after). No API key, no internet after first run, no data leaves your machine. The provider applies the model's task-aware prefixes (search_document: / search_query:) automatically, so retrieval quality matches remote providers.

To opt into a remote provider for higher quality, set one of these:

| Variable | Provider | Model | Cost |
|----------|----------|-------|------|
| (default) | Local (ONNX) | nomic-ai/nomic-embed-text-v1.5 | Free, runs on CPU |
| CODESIFT_VOYAGE_API_KEY | Voyage AI | voyage-code-3 | Best for code |
| CODESIFT_OPENAI_API_KEY | OpenAI | text-embedding-3-small | ~$0.02/1M tok (~$0.21 for 44 repos) |
| CODESIFT_OLLAMA_URL | Ollama (local) | nomic-embed-text | Free (local) |

To disable local embeddings entirely (BM25-only), set CODESIFT_DISABLE_LOCAL_EMBEDDINGS=true. To pin a different local model, set CODESIFT_LOCAL_MODEL=<owner>/<model> (e.g. Xenova/bge-small-en-v1.5).

Usage

# Pure semantic search
codesift retrieve local/my-project \
  --queries '[{"type":"semantic","query":"error handling and retry logic","top_k":10}]'

# Hybrid search (semantic + BM25 text, RRF-merged)
codesift retrieve local/my-project \
  --queries '[{"type":"hybrid","query":"caching strategy","top_k":10}]'

Semantic and hybrid queries exclude test files by default to maximize token efficiency. To include test files, set "exclude_tests": false in the sub-query or pass --exclude-tests=false on the CLI.
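When the --queries payload is generated by a script rather than typed by hand, building it from typed objects avoids JSON quoting mistakes. The sketch below is illustrative only: the SubQuery interface and buildQueriesArg helper are hypothetical names, the fields are the ones shown in the examples above, and it assumes several sub-queries may be sent in one array.

```typescript
// Fields mirror the CLI examples; the interface itself is an assumption.
interface SubQuery {
  type: "semantic" | "hybrid";
  query: string;
  top_k: number;
  exclude_tests?: boolean; // omitted means the documented default (tests excluded)
}

// Hypothetical helper: serialize sub-queries into the string passed to --queries.
function buildQueriesArg(queries: SubQuery[]): string {
  return JSON.stringify(queries);
}

const queriesArg = buildQueriesArg([
  { type: "semantic", query: "error handling and retry logic", top_k: 10 },
  // Opt back into test files for this sub-query only.
  { type: "hybrid", query: "caching strategy", top_k: 10, exclude_tests: false },
]);

// Single quotes protect the JSON from the shell (assumes no single quote in the queries).
console.log(`codesift retrieve local/my-project --queries '${queriesArg}'`);
```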

Configuration

All configuration is via environment variables.

| Variable | Description | Default |
|----------|-------------|---------|
| CODESIFT_DATA_DIR | Storage directory for indexes | ~/.codesift |
| CODESIFT_WATCH_DEBOUNCE_MS | File watcher debounce interval | 500 |
| CODESIFT_DEFAULT_TOKEN_BUDGET | Default token budget for retrieval | 8000 |
| CODESIFT_DEFAULT_TOP_K | Default max results for search | 50 |
| CODESIFT_EMBEDDING_BATCH_SIZE | Symbols per embedding API call | 128 |
| CODESIFT_SECRET_SCAN | Enable/disable secret scanning | true (set false to disable) |
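To make the fallback behaviour concrete, here is a minimal sketch of reading these variables with the documented defaults. The loadConfig and numberFrom helpers and the CodeSiftConfig shape are hypothetical; CodeSift's real parsing may treat invalid values differently.

```typescript
interface CodeSiftConfig {
  dataDir: string;
  watchDebounceMs: number;
  defaultTokenBudget: number;
  defaultTopK: number;
  embeddingBatchSize: number;
  secretScan: boolean;
}

type Env = Record<string, string | undefined>;

// Parse a numeric variable; fall back to the default when unset or not a finite number.
function numberFrom(env: Env, name: string, fallback: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallback;
  const parsed = Number(raw);
  return Number.isFinite(parsed) ? parsed : fallback;
}

// Hypothetical loader; the defaults are the ones documented in the table.
function loadConfig(env: Env): CodeSiftConfig {
  return {
    dataDir: env["CODESIFT_DATA_DIR"] ?? "~/.codesift", // tilde is not expanded in this sketch
    watchDebounceMs: numberFrom(env, "CODESIFT_WATCH_DEBOUNCE_MS", 500),
    defaultTokenBudget: numberFrom(env, "CODESIFT_DEFAULT_TOKEN_BUDGET", 8000),
    defaultTopK: numberFrom(env, "CODESIFT_DEFAULT_TOP_K", 50),
    embeddingBatchSize: numberFrom(env, "CODESIFT_EMBEDDING_BATCH_SIZE", 128),
    // Assumption: only the literal string "false" disables scanning.
    secretScan: env["CODESIFT_SECRET_SCAN"] !== "false",
  };
}

const config = loadConfig({ CODESIFT_DEFAULT_TOP_K: "20", CODESIFT_SECRET_SCAN: "false" });
console.log(config);
```

Passing the environment as a plain object keeps the sketch testable; in a Node process the argument would typically be process.env.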

How it works

Indexing -- Tree-sitter WASM grammars parse source files into ASTs. Symbol extraction produces functions, classes, methods, types, constants, etc. with signatures, docstrings, and source code. Filesystem mtime is stored per file for incremental skip on reindex.
BM25F search -- Symbols are tokenized (camelCase/snake_case splitting) and indexed with field-weighted BM25 scoring. Name matches rank 5x higher than body matches. Symbols in frequently-imported files get a log-scaled centrality bonus as tiebreaker.
Semantic search (optional) -- Source code is chunked and embedded via the configured provider. Queries are embedded at search time and ranked by cosine similarity. A query can be decomposed into multiple sub-queries, whose rankings are merged with Reciprocal Rank Fusion (RRF, k=60).
Hybrid search -- Combines semantic embedding similarity with BM25 text matches via RRF, getting the best of both keyword and concept search.
File watcher -- chokidar watches indexed folders for changes. Modified files are re-parsed and the index is updated incrementally.
Response guards -- Multiple layers prevent token waste: progressive cascade (>15K tok → compact, >25K → counts, >30K → truncate), response dedup cache (30s), in-flight request coalescing, H1-H9 sequential hints, and source truncation.
Agent onboarding -- MCP instructions field sends ~800 tokens of guidance (tool discovery, hints, ALWAYS/NEVER rules) to every client automatically. codesift setup installs full rules files per platform + Claude Code hooks for enforcement.
LSP bridge (optional) -- When a language server is installed (typescript-language-server, pylsp, gopls, rust-analyzer, kotlin-language-server, solargraph, intelephense), CodeSift uses it for type-safe find_references, precise go_to_definition, get_type_info via hover, and cross-file rename_symbol. Falls back to tree-sitter/grep when LSP is unavailable. Lazy start + 5 min idle kill — zero overhead when not used.
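The RRF merge used by semantic and hybrid search can be sketched as follows. This is the textbook formulation, where each list contributes 1 / (k + rank) with ranks starting at 1 and k = 60. The function name and the sample symbol names are made up for illustration, and CodeSift's actual implementation may differ in details such as tie-breaking.

```typescript
// Reciprocal Rank Fusion: score(id) = sum over rankings of 1 / (k + rank), rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): Array<[string, number]> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first.
  return Array.from(scores.entries()).sort((a, b) => b[1] - a[1]);
}

// Two hypothetical result lists for the same query.
const semanticRanking = ["retryWithBackoff", "handleError", "fetchUser"];
const bm25Ranking = ["handleError", "parseConfig", "retryWithBackoff"];

const fused = reciprocalRankFusion([semanticRanking, bm25Ranking]);
// handleError comes first: it sits near the top of both lists.
console.log(fused.map(([id]) => id));
```

RRF only looks at ranks, not raw scores, which is why it can combine cosine similarities and BM25 scores without normalizing either.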

Glob pattern support

File pattern parameters (file_pattern) support full glob syntax via picomatch:

*.ts — match by extension at any depth
*.{ts,tsx} — brace expansion
src/**/*.service.ts — directory globbing
[!.]*.ts — character classes
service — plain substring match (no glob chars)
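The split between glob patterns and plain substrings can be pictured with the sketch below. It is an assumption about how such a dispatch might look, not CodeSift's code: the GLOB_CHARS set, the matchesFilePattern helper, and the stand-in matcher are all hypothetical, and the real glob matching is done by picomatch, which is injected here as a function to keep the example dependency-free.

```typescript
// Characters treated as glob syntax in this sketch (an assumption, not CodeSift's actual list).
const GLOB_CHARS = /[*?[\]{}!]/;

type GlobMatcher = (pattern: string, path: string) => boolean;

// Hypothetical dispatch: no glob characters means substring match,
// anything else is delegated to the supplied glob matcher.
function matchesFilePattern(pattern: string, path: string, globMatch: GlobMatcher): boolean {
  if (!GLOB_CHARS.test(pattern)) {
    return path.includes(pattern);
  }
  return globMatch(pattern, path);
}

// Stand-in matcher that only understands "*.ext" against the file's basename.
const extensionOnly: GlobMatcher = (pattern, path) => {
  const base = path.slice(path.lastIndexOf("/") + 1);
  return pattern.startsWith("*.") && base.endsWith(pattern.slice(1));
};

console.log(matchesFilePattern("service", "src/user.service.ts", extensionOnly)); // substring branch
console.log(matchesFilePattern("*.ts", "src/user.service.ts", extensionOnly)); // glob branch
```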

React workflow with CodeSift

CodeSift auto-loads 6 React tools when a React project is detected (package.json with react + .tsx/.jsx files). Zero config.

Day 1 — new React codebase (1 command, ~5s)

react_quickstart

One call returns: component/hook counts, stack (state mgmt, routing, UI lib, form lib, build tool), critical pattern violations, top hooks used, and suggested next queries. Replaces 5+ manual exploration calls.

Daily development

analyze_renders("MyComponent")          # re-render risk for a specific component
trace_component_tree("App")             # JSX composition hierarchy
analyze_hooks(component_name="Foo")     # hook inventory + Rule of Hooks check
trace_call_chain("useAuth", filter_react_hooks=true)  # hook dependency graph, stdlib filtered
find_references("UserContext")          # where this context is consumed
analyze_context_graph                   # all createContext → Provider → useContext flows

PR review

review_diff                             # 10-check composite (React patterns auto-skipped on non-.tsx diffs)
changed_symbols(since="HEAD~3")         # what changed structurally
search_patterns("hook-in-condition")    # Rule of Hooks violations in changed files
impact_analysis(since="HEAD~3")         # blast radius of your changes

CI gates (via `audit_scan` REACT gate + `audit_compiler_readiness`)

audit_scan                              # includes REACT gate: hook-in-condition, useEffect-async,
                                        # dangerously-set-html, index-as-key, nested-component-def
audit_compiler_readiness                # React Compiler (v1.0) adoption score — flags bailout
                                        # patterns before migration, counts redundant memo to remove

Set CI to fail on: any dangerously-set-html, any Rule of Hooks violation, any useEffect-missing-cleanup in new code.
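The gate policy above could be expressed as a small filter over scan findings. This is a sketch under stated assumptions: the Finding shape (including the inNewCode flag) is hypothetical and does not describe the real audit_scan or search_patterns output, and Rule of Hooks violations are represented here by the hook-in-condition pattern.

```typescript
// Hypothetical finding shape; the real tool output may differ.
interface Finding {
  pattern: string;
  file: string;
  inNewCode: boolean; // assumption: the caller knows whether the line was added in this change
}

// Patterns that fail the build wherever they appear.
const ALWAYS_BLOCKING = new Set(["dangerously-set-html", "hook-in-condition"]);
// Patterns that fail the build only when introduced by new code.
const BLOCKING_IN_NEW_CODE = new Set(["useEffect-missing-cleanup"]);

function blockingFindings(findings: Finding[]): Finding[] {
  return findings.filter(
    (f) => ALWAYS_BLOCKING.has(f.pattern) || (BLOCKING_IN_NEW_CODE.has(f.pattern) && f.inNewCode),
  );
}

const findings: Finding[] = [
  { pattern: "dangerously-set-html", file: "src/Legacy.tsx", inNewCode: false },
  { pattern: "useEffect-missing-cleanup", file: "src/Old.tsx", inNewCode: false },
  { pattern: "useEffect-missing-cleanup", file: "src/New.tsx", inNewCode: true },
  { pattern: "index-as-key", file: "src/List.tsx", inNewCode: true },
];

const blocking = blockingFindings(findings);
console.log(blocking.length > 0 ? "CI should fail" : "CI can pass");
console.log(blocking.map((f) => `${f.pattern} in ${f.file}`));
```

A CI script would exit non-zero when the returned list is non-empty; findings outside the two sets are reported but do not block.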

Common queries — "how do I..."

| Question | Command |
|----------|---------|
| Find all components | search_symbols(kind="component") |
| Find all custom hooks | search_symbols(kind="hook") |
| Why is my app re-rendering? | analyze_renders (ranks components by risk) |
| Is my code React Compiler ready? | audit_compiler_readiness (scans 7 bailout patterns) |
| Who uses AuthContext? | analyze_context_graph (lists all consumers) |
| Rule of Hooks violations? | search_patterns("hook-in-condition") |
| Memory leaks in useEffect? | search_patterns("useEffect-missing-cleanup") |
| Missing TanStack invalidation? | search_patterns("tanstack-missing-invalidation") |
| Should this class be a function component? | search_patterns("prefer-function-component") |
| XSS risks from dangerouslySetInnerHTML? | search_patterns("dangerously-set-html") |

Supported languages

TypeScript, JavaScript (JSX/TSX), Python, Go, Rust, Kotlin, Java, Ruby, PHP, Markdown, CSS, Prisma, Astro.

React/JSX/TSX has first-class support across 8 waves: component and hook SymbolKind values, JSX-aware call graph (all graph tools see <Component> usage as call edges), and 43 React anti-patterns with engine-level comment/string preprocessing.

Tier 8 (May 2026) added a declarative preprocess: "strip-comments-strings" field on BUILTIN_PATTERNS entries. A single-pass, 7-state state-machine source stripper at src/utils/source-stripper.ts strips comments and string/template/regex literals before the regex match, which closes the false-positive class where comment-embedded mentions spoofed detection.
Tier 7 (May 2026) fixed 3 pre-existing CRITICAL bugs (useOptimistic lookahead trivial bypass, useEffect-setstate-loop array-arg false positive, react19-server-action-not-async missing arrow/default-export forms) and added cross-file findSuspenseAncestor + findLazyComponentsWithoutSuspense walkers reusing reverse JSX adjacency from Tier 5.
Tier 6 (May 2026): derived-state-reducer (useReducer sync action), derived-state-custom-setter (custom setter naming), stale-closure-toggle (setX(!X)), stale-closure-broken-functional (setX(prev => X+1) wrong reference), context-provider-value-via-variable (intermediate-var inline), context-provider-value-inline-destructured ({Provider} form), react-lazy-no-suspense-same-file (single-file heuristic), rsc-non-serializable-prop-deep (Map/Set/Class across RSC boundary), error-boundary-incomplete (partial lifecycle), plus full severity migration on all 29 prior patterns.
Also included: multiline hook-in-condition, bug-free nested-component-def, and the Tier 5 (May 2026) patterns: derived-state (useState(props.X) + useEffect sync), stale-closure-setstate (setX(X+1) non-functional update), context-provider-value-inline (inline object/array forces consumer re-render), jsx-no-target-blank (tabnabbing security with postFilter validator), button-no-type (implicit submit foot-gun, lookahead-bounded for HTML only).

Other React capabilities: trace_component_tree (BFS JSX composition tree), analyze_hooks (hook inventory + Rule of Hooks violation detection), analyze_renders (re-render risk + prop_chain_depth render-tree depth metric with explicit "NOT prop-drilling depth" disclaimer in suggestion text; semantic prop-flow tracking is Tier 6 scope), buildContextGraph (createContext → Provider → useContext consumer mapping), React complexity metrics, enriched get_context_bundle, filter_react_hooks option on trace_call_chain, audit_scan REACT gate, React-aware review_diff, generate_report React section, route entry point detection, shadcn/ui + Tailwind + form library detection, @/ alias resolution, RSC boundary detection, build tool detection (Vite, CRA, webpack, Parcel, esbuild, Rspack, Rsbuild, Turbopack), severity-aware react_quickstart bucketing into critical_issues / warnings / style_issues, and declarative postFilter field on BUILTIN_PATTERNS entries. Auto-loaded on React projects (package.json + .tsx files).

Astro has deep framework intelligence — the first and only static code intelligence for Astro in the MCP ecosystem. 4 dedicated tools: astro_analyze_islands (detect all client:*/server:defer directives, group by framework, track server islands), astro_hydration_audit (12 anti-pattern detectors AH01-AH12 with A/B/C/D scoring — catches client:load on Astro components, islands in loops, missing framework hints, below-fold eager hydration, and more), astro_route_map (file-based routing analysis with dynamic params, route conflicts, rendering mode per page, endpoint method detection), astro_config_analyze (tree-sitter AST walker for astro.config.mjs — extracts output mode, adapter, integrations, i18n, redirects with config_resolution honesty field). Also: 6 Astro anti-patterns in search_patterns, Astro-aware trace_route, analyze_project returns full astro_conventions, .astro extension normalization in import graph, framework detection for dead-code analysis, .mdx file indexing. Template parser (parseAstroTemplate) extracts islands, slots, component usages, and directives from HTML template section with balanced-brace tracking, conditional/loop detection, and landmark section awareness.

Kotlin support includes full tree-sitter parsing with a dedicated extractor for functions, classes (data/sealed/enum/abstract/annotation), interfaces, objects (singleton + companion), properties (val/var/const), type aliases, extension functions, suspend functions, generics, KDoc comments, and JUnit test detection (@Test, @BeforeEach, @AfterEach, @BeforeAll, @AfterAll). Route tracing supports Ktor DSL and Spring Boot Kotlin. Six Kotlin anti-patterns are built-in.

PHP/Yii2 support spans the following files: src/parser/extractors/php.ts (+ PHPDoc @property/@method synthesis), src/tools/php-tools.ts (6 tools: resolve_php_namespace, trace_php_event, find_php_views, resolve_php_service, php_security_scan, php_project_audit), src/tools/project-tools.ts (Yii2Conventions), src/tools/route-tools.ts (findYii2Handlers, findLaravelHandlers), src/tools/pattern-tools.ts (8 PHP anti-patterns), src/tools/graph-tools.ts (PHP method call detection), src/utils/import-graph.ts (PHP require/include + PSR-4 cross-file edges via resolvePhpNamespace), src/utils/walk.ts (BACKUP_FILE_PATTERNS auto-exclusion), src/parser/parser-manager.ts (error recovery try/catch), src/lsp/lsp-servers.ts (Intelephense), scripts/download-wasm.ts ([email protected]).

Development

git clone https://github.com/greglas75/codesift.git
cd codesift             # git clone names the directory after the repository
npm install
npm run download-wasm   # Download tree-sitter WASM grammars
npm run build           # TypeScript compilation
npm test                # Run tests (Vitest, 2900+ tests)
npm run test:coverage   # Coverage report
npm run lint            # Type check (tsc --noEmit)

Publishing a new version

After making changes, follow these steps to publish to npm:

# 1. Ensure clean working tree
git status              # No uncommitted changes

# 2. Build and verify
npm run build           # Must succeed with 0 errors
npm test                # Must pass (flaky ast-query tests may fail in full suite — OK if they pass individually)

# 3. Bump version (choose one)
npm version patch       # 0.2.0 → 0.2.1 (bug fixes)
npm version minor       # 0.2.0 → 0.3.0 (new features)
npm version major       # 0.2.0 → 1.0.0 (breaking changes)
# This creates a git commit + tag automatically

# 4. Publish to npm
npm publish --ignore-scripts
# npm will open browser for WebAuthn/Keychain authentication
# Press Enter, confirm in browser, done

# 5. Push to GitHub (commit + tag)
git push && git push --tags

What gets published

The files field in package.json controls what ships:

dist/ — compiled JavaScript
rules/ — platform-specific agent rules (codesift.md, codesift.mdc, codex.md, gemini.md)
src/parser/languages/ — tree-sitter WASM grammars
README.md, LICENSE

After publishing

Users update with:

npm update -g codesift-mcp        # Update package
codesift setup all                 # Update rules files to latest version

If using npx -y codesift-mcp (the default in MCP config), a newer version is normally picked up on the next session start. If npx keeps serving a cached copy, request codesift-mcp@latest explicitly.

Checklist before publishing

[ ] npm run build — 0 TypeScript errors
[ ] npm test — 2900+ tests pass
[ ] rules/codesift.md updated if hints or tools changed
[ ] src/instructions.ts updated if rules changed (compact version)
[ ] README.md updated if features added
[ ] CLAUDE.md updated if architecture changed
[ ] Version bumped via npm version
[ ] Changes committed and pushed to GitHub

License

BSL-1.1