@pugi/plugin-codegraph
v0.1.0-alpha.2
Pugi codegraph plugin - exposes pugi.code.search / definition / callers / trace / outline / repo_map tools backed by tree-sitter + SQLite FTS5 + PageRank.
Tree-sitter + SQLite FTS5 + PageRank code intelligence for Pugi.
Part of the Pugi 1.0 soft fork sprint (see ADR-0081).
What it does
Builds a local symbol-level index of the active workspace so the model can answer "where is X defined?", "what calls Y?", and "rank these files by relevance" without re-reading source files. Inspired by Aider's PageRank-over-symbols repo map and Cody / NotebookLM-style grounded citations.
Six tools are exposed via tool.register:
| Tool | Purpose |
|---|---|
| pugi.code.search | FTS5 keyword search ranked by BM25 + PageRank |
| pugi.code.definition | Locate symbol declaration(s) |
| pugi.code.callers | List call sites for a symbol |
| pugi.code.trace | Follow a call chain up to N hops |
| pugi.code.outline | Hierarchical symbol tree for a single file |
| pugi.code.repo_map | Top files by PageRank (with optional task bias) |
In addition, the plugin transparently injects a "relevant symbols" section into the system prompt whenever the user's last turn mentions a PascalCase identifier or function call that the index can resolve (token budget configurable, defaults to 2000).
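
As a rough illustration of the trigger described above, the sketch below pulls candidate symbol names out of a user message. The function name, both regular expressions, and the cap of 8 (borrowed from the `injectMaxSymbols` example value) are assumptions made for illustration; the plugin's actual patterns are not documented here.

```typescript
// Hypothetical helper, not the plugin's API: finds PascalCase identifiers and
// function-call-looking tokens in a user message. Both patterns are guesses.
const PASCAL_CASE = /\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]*)+\b/g;
const FUNCTION_CALL = /\b([A-Za-z_$][\w$]*)\s*\(/g;

export function extractCandidateSymbols(message: string, maxSymbols = 8): string[] {
  const found = new Set<string>();

  // PascalCase identifiers such as OrderService.
  for (const name of message.match(PASCAL_CASE) ?? []) {
    found.add(name);
  }

  // Identifiers immediately followed by "(", such as parseConfig().
  const callPattern = new RegExp(FUNCTION_CALL.source, "g");
  let match: RegExpExecArray | null = callPattern.exec(message);
  while (match !== null) {
    found.add(match[1]);
    match = callPattern.exec(message);
  }

  // Cap the list so the injected section stays small.
  return Array.from(found).slice(0, maxSymbols);
}
```

Each candidate would then be looked up in the index, and only names that resolve would be rendered into the prompt, within the configured token budget.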
Install
    pnpm add @pugi/plugin-codegraph

Bring your own tree-sitter grammars. We treat them as peer dependencies so each consumer ships only what their codebase needs:

    pnpm add better-sqlite3 chokidar tree-sitter tree-sitter-typescript tree-sitter-javascript tree-sitter-python

Optional language grammars (tree-sitter-rust, tree-sitter-go, tree-sitter-java, tree-sitter-ruby) are loaded lazily; if one is missing, the plugin logs once per language and skips those files.
Usage
    // pugi.config.ts
    export default {
      plugin: ['@pugi/plugin-codegraph'],
    };

Or with options:

    export default {
      plugin: [['@pugi/plugin-codegraph', {
        dbPath: '/custom/path/codegraph.sqlite',
        languages: ['typescript', 'tsx', 'python'],
        excludePatterns: ['node_modules/**', 'dist/**', '*.min.js'],
        maxFileBytes: 1_000_000,
        embeddingProvider: 'none',
        watchMode: true,
        injectTokenBudget: 2_000,
        injectMaxSymbols: 8,
      }]],
    };

Architecture
Single SQLite file at <workspace>/.pugi/codegraph.sqlite (WAL mode so the
watcher's writes don't block SDK reads). Schema:

    files
      id PK, path UNIQUE, language, size_bytes, mtime_ms, sha256, pagerank
    symbols
      id PK, file_id FK CASCADE, name, kind, start/end_line, start/end_col,
      signature, docstring, parent_symbol_id (self-ref FK)
    references_table
      id PK, symbol_id FK SET NULL, file_id FK CASCADE, name,
      line, col, context_snippet
    symbols_fts (FTS5 virtual table)
      name, signature, docstring -- BM25-ranked
      triggers keep it in sync with symbols on INSERT / UPDATE / DELETE

references_table carries the _table suffix because references is a SQL reserved word.
Indexing pipeline (per file):
- Skip if .gitignore-style excludePatterns match (matches the named directory at any depth).
- Skip if extension not in supported language list.
- Compute SHA-256 over file contents.
- If existing row has matching SHA, return unchanged (~1ms).
- Parse via tree-sitter native binding.
- Extract symbol declarations + call-site references.
- Single transaction: DELETE existing file row (CASCADE) + INSERT new file/symbols/references. FTS5 stays in sync via triggers.
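
The early-exit part of the pipeline above can be sketched as follows. This is a minimal illustration, not the plugin's code: `indexFile`, `IndexOutcome`, and the in-memory `fileRows` map (standing in for the `files` table) are hypothetical names, and the parse and extract steps are reduced to a comment.

```typescript
import { createHash } from "node:crypto";

type IndexOutcome = "skipped" | "unchanged" | "reindexed";

interface FileRow {
  path: string;
  sha256: string;
}

// Hypothetical in-memory stand-in for the files table.
const fileRows = new Map<string, FileRow>();

export function sha256Hex(contents: string): string {
  return createHash("sha256").update(contents).digest("hex");
}

export function indexFile(
  path: string,
  contents: string,
  supportedExts: string[],
): IndexOutcome {
  // Skip files whose extension is not in the supported list.
  const ext = path.slice(path.lastIndexOf(".") + 1);
  if (supportedExts.indexOf(ext) === -1) return "skipped";

  // Hash the contents and compare against the stored row.
  const sha = sha256Hex(contents);
  const existing = fileRows.get(path);
  if (existing !== undefined && existing.sha256 === sha) return "unchanged";

  // Parsing and symbol extraction would happen here. In this sketch the
  // "transaction" is just a replacement of the stored row.
  fileRows.set(path, { path, sha256: sha });
  return "reindexed";
}
```

Hashing before parsing is what keeps the unchanged path cheap: a file that was touched but not modified never reaches tree-sitter.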
Cross-file reference resolution: each call site looks up the highest-PageRank declaration of its target name across all files. Order-dependent edge case handled by resolveDanglingRefs after the initial crawl completes.
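
A minimal sketch of that lookup, assuming declarations are available as plain objects carrying their file's PageRank. The `Declaration` shape and the `resolveReference` name are hypothetical; returning `null` models a dangling reference that a later pass would revisit.

```typescript
// Hypothetical row shape: one declared symbol plus the rank of its file.
interface Declaration {
  symbolId: number;
  name: string;
  filePagerank: number;
}

// Returns the symbol id of the highest-ranked declaration with the given
// name, or null when no declaration is known yet (a dangling reference).
export function resolveReference(
  targetName: string,
  declarations: Declaration[],
): number | null {
  let best: Declaration | null = null;
  for (const decl of declarations) {
    if (decl.name !== targetName) continue;
    if (best === null || decl.filePagerank > best.filePagerank) {
      best = decl;
    }
  }
  return best === null ? null : best.symbolId;
}
```

The order dependence arises because a call site may be indexed before the file that declares its target, so the first lookup returns nothing until the crawl has seen every file.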
PageRank: power iteration over the file-to-file graph (damping 0.85, max
50 iters, epsilon 1e-4). Recomputed after the initial crawl and every
100 file edits during watch mode. Stored in files.pagerank.
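
For readers unfamiliar with power iteration, here is a small sketch using the parameters quoted above (damping 0.85, at most 50 iterations, epsilon 1e-4). The function signature, the L1 convergence test, and the even redistribution of rank from files with no outgoing edges are assumptions of this sketch, not confirmed details of the plugin.

```typescript
// Power-iteration PageRank over a directed graph given as [from, to] pairs.
export function pageRank(
  nodeCount: number,
  edges: Array<[number, number]>,
  damping = 0.85,
  maxIters = 50,
  epsilon = 1e-4,
): { ranks: number[]; iterations: number } {
  if (nodeCount === 0) return { ranks: [], iterations: 0 };

  const outDegree = new Array<number>(nodeCount).fill(0);
  for (const [from] of edges) outDegree[from] += 1;

  let ranks = new Array<number>(nodeCount).fill(1 / nodeCount);
  let iterations = 0;

  while (iterations < maxIters) {
    iterations += 1;

    // Rank held by nodes with no outgoing edges is spread evenly (assumption).
    let dangling = 0;
    for (let i = 0; i < nodeCount; i++) {
      if (outDegree[i] === 0) dangling += ranks[i];
    }

    const base = (1 - damping) / nodeCount + (damping * dangling) / nodeCount;
    const next = new Array<number>(nodeCount).fill(base);
    for (const [from, to] of edges) {
      next[to] += (damping * ranks[from]) / outDegree[from];
    }

    // Stop once the total change (L1 norm) drops below epsilon.
    let delta = 0;
    for (let i = 0; i < nodeCount; i++) delta += Math.abs(next[i] - ranks[i]);
    ranks = next;
    if (delta < epsilon) break;
  }

  return { ranks, iterations };
}
```

In this formulation the ranks always sum to 1, so a file's score can be read as its share of attention across the workspace.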
Why these choices
- Native tree-sitter, not WASM. ~2x parse throughput, no WASM bundle, no async init. Trade-off: native ABI per Node major version. Pinned to engines.node >= 20.
- better-sqlite3 with WAL. Synchronous API keeps the hot path simple; WAL lets the watcher write while queries read.
- FTS5 with prefix-quoted tokens. unicode61 keeps camelCase as one token. Prefix matching (OrderService*) catches partial matches without trigram indexes.
- BM25 + PageRank hybrid ranking. Cheap, deterministic, no embedding round-trip. Delivers most of the retrieval value at a fraction of the moving parts.
- No Effect-TS. Per ADR-0081 containment rule.
- No cross-plugin imports. @pugi/plugin-codegraph runs standalone.
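
To make the FTS5 and ranking bullets concrete, the sketch below builds a prefix-quoted MATCH expression and combines a BM25 value with a file's PageRank. Both function names are hypothetical, and the combining formula in particular is an illustrative assumption; the plugin's real weighting is not documented here. The sketch relies on two FTS5 behaviours: a quoted string followed by `*` is a prefix query, and `bm25()` returns lower (more negative) values for better matches.

```typescript
// Turns free text into an FTS5 MATCH expression where every whitespace-
// separated token is quoted and prefix-matched, e.g. "OrderService"*.
export function toFtsPrefixQuery(input: string): string {
  return input
    .split(/\s+/)
    .filter((token) => token.length > 0)
    // Embedded double quotes are escaped by doubling them.
    .map((token) => `"${token.replace(/"/g, '""')}"*`)
    .join(" ");
}

// Illustrative hybrid score (assumed formula): negate BM25 so that higher is
// better, then boost by the PageRank of the file that declares the symbol.
export function hybridScore(bm25: number, pagerank: number, weight = 1): number {
  return -bm25 * (1 + weight * pagerank);
}
```

Quoting every token keeps user input from being interpreted as FTS5 operators such as AND, OR, or NEAR.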
Performance budget
Measured on packages/pugi-plugins/ (M-series Mac, 116 source files):
| Operation | Target | Measured |
|---|---|---|
| Cold index | <60s / 10k files | 330ms / 116 files |
| Incremental reindex | <50ms | ~5-15ms |
| pugi.code.definition | <10ms | 0.05ms avg |
| pugi.code.callers | <100ms | 0.07ms avg |
| pugi.code.search | <200ms | 0.11ms avg |
| pugi.code.repo_map | <50ms | 1ms |
| pugi.code.outline | <10ms | 0.03ms avg |
| PageRank recompute | <500ms / 10k files | 1ms / 69 files (7 iters, converged) |
| DB size | ~5MB / 1k files | 4.3MB / 1k files (extrapolated) |
Upstream contract note
@pugi-ai/[email protected] does not expose userMessageText on the
experimental.chat.system.transform hook. We capture the last user
message via chat.message and look it up by sessionID when the
transform fires. Session cache is bounded (256 entries) so a long-running Pugi process cannot leak memory.
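
A bounded cache of this kind can be sketched with a `Map`, which iterates in insertion order. The class name and the evict-oldest policy are assumptions; the source only states that the cache holds at most 256 entries.

```typescript
// Hypothetical bounded cache keyed by sessionID. When the limit is exceeded,
// the entry that was written least recently is evicted.
export class BoundedSessionCache {
  private readonly entries = new Map<string, string>();
  private readonly maxEntries: number;

  constructor(maxEntries = 256) {
    this.maxEntries = maxEntries;
  }

  set(sessionID: string, userMessageText: string): void {
    // Delete first so that a rewritten key moves to the newest position.
    this.entries.delete(sessionID);
    this.entries.set(sessionID, userMessageText);

    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  get(sessionID: string): string | undefined {
    return this.entries.get(sessionID);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

In this arrangement the chat.message hook would call `set`, and the system transform hook would call `get` with the same sessionID.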
Deferred (P1)
- Embedding-backed reranking for pugi.code.repo_map (embeddingProvider option is plumbed but only none is implemented).
- Additional grammars (Rust, Go, Java, Ruby) are loaded lazily but their declaration tables are sketched rather than exhaustive.
License
MIT. See LICENSE.
