@pugi/plugin-codegraph

v0.1.0-alpha.2

Published

13 days ago

Pugi codegraph plugin - exposes pugi.code.search / definition / callers / trace / outline / repo_map tools backed by tree-sitter + SQLite FTS5 + PageRank.


pugi.dev

pugi codegraph tree-sitter sqlite fts5 pagerank

Tree-sitter + SQLite FTS5 + PageRank code intelligence for Pugi.

Part of the Pugi 1.0 soft fork sprint (see ADR-0081).

What it does

Builds a local symbol-level index of the active workspace so the model can answer "where is X defined?", "what calls Y?", and "rank these files by relevance" without re-reading source files. Inspired by Aider's PageRank-over-symbols repo map and Cody / NotebookLM-style grounded citations.

Six tools are exposed via tool.register:

| Tool | Purpose |
|---|---|
| pugi.code.search | FTS5 keyword search ranked by BM25 + PageRank |
| pugi.code.definition | Locate symbol declaration(s) |
| pugi.code.callers | List call sites for a symbol |
| pugi.code.trace | Follow a call chain up to N hops |
| pugi.code.outline | Hierarchical symbol tree for a single file |
| pugi.code.repo_map | Top files by PageRank (with optional task bias) |

In addition, the plugin transparently injects a "relevant symbols" section into the system prompt whenever the user's last turn mentions a PascalCase identifier or function call that the index can resolve (token budget configurable, defaults to 2000).
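
As a rough illustration of the trigger described above, the sketch below pulls PascalCase identifiers and function-call names out of a user message. The function name `extractCandidateSymbols` and both regular expressions are assumptions made for this example, not the plugin's actual detection logic; the default cap of 8 mirrors the `injectMaxSymbols` option shown under Usage.

```typescript
// Hypothetical helper: collect candidate symbol names from a user message.
// The patterns below are illustrative assumptions, not the plugin's own.
function extractCandidateSymbols(message: string, maxSymbols = 8): string[] {
  const found = new Set<string>();

  // PascalCase: at least two capitalised segments, e.g. "OrderService".
  const pascalCase = /\b[A-Z][a-z0-9]+(?:[A-Z][a-z0-9]*)+\b/g;
  for (const name of message.match(pascalCase) ?? []) {
    found.add(name);
  }

  // Function call: an identifier immediately followed by "(".
  const functionCall = /\b([A-Za-z_$][\w$]*)\s*\(/g;
  let match: RegExpExecArray | null = functionCall.exec(message);
  while (match !== null) {
    found.add(match[1]);
    match = functionCall.exec(message);
  }

  // Keep insertion order and cap the list.
  return Array.from(found).slice(0, maxSymbols);
}
```

Each candidate would then be looked up in the index, and only names that resolve would contribute to the injected section.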

Install

pnpm add @pugi/plugin-codegraph

Bring your own tree-sitter grammars - we treat them as peer dependencies so each consumer ships only what their codebase needs:

pnpm add better-sqlite3 chokidar tree-sitter tree-sitter-typescript tree-sitter-javascript tree-sitter-python

Optional language grammars (tree-sitter-rust, tree-sitter-go, tree-sitter-java, tree-sitter-ruby) are loaded lazily; if missing, the plugin logs once per language and skips those files.
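
The lazy-loading behaviour can be pictured roughly as follows. This is a minimal sketch: `loadOptionalGrammar`, the injected `load` callback, and the log message are hypothetical names chosen for the example, and the real plugin may resolve packages differently.

```typescript
// Hypothetical sketch of "load lazily, log once per language, then skip".
type GrammarLoader = (packageName: string) => unknown;

const grammarCache = new Map<string, unknown>();
const warnedLanguages = new Set<string>();

function loadOptionalGrammar(
  language: string,
  load: GrammarLoader,
  log: (message: string) => void = console.warn,
): unknown {
  if (grammarCache.has(language)) return grammarCache.get(language);
  // A previous attempt already failed: stay quiet and skip.
  if (warnedLanguages.has(language)) return null;

  try {
    // Assumes the package naming seen above, e.g. "tree-sitter-rust".
    const grammar = load(`tree-sitter-${language}`);
    grammarCache.set(language, grammar);
    return grammar;
  } catch {
    warnedLanguages.add(language);
    log(`codegraph: no grammar installed for ${language}; skipping those files`);
    return null;
  }
}
```

In a CommonJS context the `load` argument could simply be `require`; passing it in keeps the sketch independent of the module system.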

Usage

// pugi.config.ts
export default {
  plugin: ['@pugi/plugin-codegraph'],
};

Or with options:

export default {
  plugin: [['@pugi/plugin-codegraph', {
    dbPath: '/custom/path/codegraph.sqlite',
    languages: ['typescript', 'tsx', 'python'],
    excludePatterns: ['node_modules/**', 'dist/**', '*.min.js'],
    maxFileBytes: 1_000_000,
    embeddingProvider: 'none',
    watchMode: true,
    injectTokenBudget: 2_000,
    injectMaxSymbols: 8,
  }]],
};

Architecture

Single SQLite file at <workspace>/.pugi/codegraph.sqlite (WAL mode so the watcher's writes don't block SDK reads). Schema:

files
  id PK, path UNIQUE, language, size_bytes, mtime_ms, sha256, pagerank

symbols
  id PK, file_id FK CASCADE, name, kind, start/end_line, start/end_col,
  signature, docstring, parent_symbol_id (self-ref FK)

references_table
  id PK, symbol_id FK SET NULL, file_id FK CASCADE, name,
  line, col, context_snippet

symbols_fts (FTS5 virtual table)
  name, signature, docstring  -- BM25-ranked
  triggers keep it in sync with symbols on INSERT / UPDATE / DELETE

references_table carries the _table suffix because references is a SQL reserved word.

Indexing pipeline (per file):

1. Skip if .gitignore-style excludePatterns match (matches the named directory at any depth).
2. Skip if extension not in supported language list.
3. Compute SHA-256 over file contents.
4. If existing row has matching SHA, return unchanged (~1ms).
5. Parse via tree-sitter native binding.
6. Extract symbol declarations + call-site references.
7. Single transaction: DELETE existing file row (CASCADE) + INSERT new file/symbols/references. FTS5 stays in sync via triggers.
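
The hash-based short circuit in the pipeline above can be sketched like this. An in-memory `Map` stands in for the `files` table, and `indexFile` and `parseAndStore` are hypothetical names for the example; the plugin itself works against SQLite inside a transaction.

```typescript
import { createHash } from "node:crypto";

// Stand-in for the `files` table: path -> stored SHA-256 (assumption for the sketch).
const storedHashes = new Map<string, string>();

function sha256Hex(contents: string): string {
  return createHash("sha256").update(contents).digest("hex");
}

type IndexResult = "unchanged" | "reindexed";

// parseAndStore is a placeholder for "parse, extract, replace rows".
function indexFile(
  path: string,
  contents: string,
  parseAndStore: (path: string, contents: string) => void,
): IndexResult {
  const digest = sha256Hex(contents);

  // Matching hash: nothing to do, so the expensive parse is skipped.
  if (storedHashes.get(path) === digest) return "unchanged";

  parseAndStore(path, contents);
  storedHashes.set(path, digest);
  return "reindexed";
}
```

Hashing contents rather than comparing mtimes means a file that is touched without being changed does not trigger a reparse.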

Cross-file reference resolution: each call site looks up the highest-PageRank declaration of its target name across all files. Order-dependent edge case handled by resolveDanglingRefs after the initial crawl completes.
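
A minimal sketch of that lookup, assuming declarations are available as plain objects carrying the PageRank of their containing file. The `Declaration` shape and the `resolveReference` name are made up for this example.

```typescript
// Hypothetical in-memory view of a declaration joined with its file's PageRank.
interface Declaration {
  symbolId: number;
  name: string;
  filePagerank: number;
}

// Returns the id of the best-ranked declaration named `targetName`,
// or null when no declaration is indexed yet (a dangling reference).
function resolveReference(targetName: string, declarations: Declaration[]): number | null {
  let best: Declaration | null = null;
  for (const candidate of declarations) {
    if (candidate.name !== targetName) continue;
    if (best === null || candidate.filePagerank > best.filePagerank) {
      best = candidate;
    }
  }
  return best === null ? null : best.symbolId;
}
```

A null result corresponds to the order-dependent case: the declaring file has not been crawled yet, so the reference has to be retried once the crawl is complete.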

PageRank: power iteration over the file-to-file graph (damping 0.85, max 50 iters, epsilon 1e-4). Recomputed after the initial crawl and every 100 file edits during watch mode. Stored in files.pagerank.
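
For readers unfamiliar with power iteration, here is a small sketch using the parameters quoted above (damping 0.85, at most 50 iterations, epsilon 1e-4). Several details are assumptions for the example: edges point from the referencing file to the declaring file, rank held by files with no outgoing edges is redistributed evenly, and convergence is measured as the summed absolute change per iteration.

```typescript
// Power-iteration PageRank over a file-to-file graph (illustrative sketch).
// Nodes are 0..nodeCount-1; each edge is [fromFile, toFile].
function pageRank(
  nodeCount: number,
  edges: Array<[number, number]>,
  damping = 0.85,
  maxIters = 50,
  epsilon = 1e-4,
): { ranks: number[]; iterations: number } {
  if (nodeCount === 0) return { ranks: [], iterations: 0 };

  const outDegree = new Array<number>(nodeCount).fill(0);
  for (const [from] of edges) outDegree[from] += 1;

  let ranks = new Array<number>(nodeCount).fill(1 / nodeCount);
  let iterations = 0;

  while (iterations < maxIters) {
    iterations += 1;

    // Rank sitting on nodes without outgoing edges is shared by everyone.
    let danglingMass = 0;
    for (let i = 0; i < nodeCount; i++) {
      if (outDegree[i] === 0) danglingMass += ranks[i];
    }

    const base = (1 - damping) / nodeCount + (damping * danglingMass) / nodeCount;
    const next = new Array<number>(nodeCount).fill(base);
    for (const [from, to] of edges) {
      next[to] += (damping * ranks[from]) / outDegree[from];
    }

    // Summed absolute change between successive rank vectors.
    let delta = 0;
    for (let i = 0; i < nodeCount; i++) delta += Math.abs(next[i] - ranks[i]);

    ranks = next;
    if (delta < epsilon) break;
  }

  return { ranks, iterations };
}
```

With this formulation the ranks always sum to 1, so values stored per file stay comparable across recomputations.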

Why these choices

Native tree-sitter, not WASM. ~2x parse throughput, no WASM bundle, no async init. Trade-off: native ABI per Node major version. Pinned to engines.node >= 20.
better-sqlite3 with WAL. Synchronous API keeps the hot path simple; WAL lets the watcher write while queries read.
FTS5 with prefix-quoted tokens. unicode61 keeps camelCase as one token. Prefix matching (OrderService*) catches partial matches without trigram indexes.
BM25 + PageRank hybrid ranking. Cheap, deterministic, no embedding round-trip. Delivers most of the retrieval value at a fraction of the moving parts.
No Effect-TS. Per ADR-0081 containment rule.
No cross-plugin imports. @pugi/plugin-codegraph runs standalone.
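
To make the "prefix-quoted tokens" choice concrete, the sketch below turns free-text input into an FTS5 MATCH expression in which every token is a quoted string followed by `*`. The helper name `toPrefixQuery` is hypothetical; the quoting rules (double quotes around the token, embedded quotes doubled, a trailing `*` for prefix matching) follow FTS5's documented query syntax.

```typescript
// Build an FTS5 MATCH expression where every token is a quoted prefix term.
// Quoting keeps characters that FTS5 would otherwise treat as operators
// from being interpreted as query syntax.
function toPrefixQuery(input: string): string {
  return input
    .split(/\s+/)
    .filter((token) => token.length > 0)
    // Double any embedded quote, wrap in quotes, then mark as a prefix term.
    .map((token) => `"${token.replace(/"/g, '""')}"*`)
    .join(" "); // whitespace-separated terms are an implicit AND in FTS5
}
```

For example, `toPrefixQuery("OrderService")` yields `"OrderService"*`, which matches `OrderService` as well as longer names that start with it.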

Performance budget

Measured on packages/pugi-plugins/ (M-series Mac, 116 source files):

| Operation | Target | Measured |
|---|---|---|
| Cold index | <60s / 10k files | 330ms / 116 files |
| Incremental reindex | <50ms | ~5-15ms |
| pugi.code.definition | <10ms | 0.05ms avg |
| pugi.code.callers | <100ms | 0.07ms avg |
| pugi.code.search | <200ms | 0.11ms avg |
| pugi.code.repo_map | <50ms | 1ms |
| pugi.code.outline | <10ms | 0.03ms avg |
| PageRank recompute | <500ms / 10k files | 1ms / 69 files (7 iters, converged) |
| DB size | ~5MB / 1k files | 4.3MB / 1k files (extrapolated) |

Upstream contract note

@pugi-ai/[email protected] does not expose userMessageText on the experimental.chat.system.transform hook. We capture the last user message via chat.message and look it up by sessionID when the transform fires. Session cache is bounded (256 entries) so a long-running Pugi process cannot leak memory.
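
One way to picture the bounded session cache is the sketch below. The class name `BoundedSessionCache` and the oldest-first eviction policy are assumptions for the example; the text above only states that the cache is capped at 256 entries.

```typescript
// Hypothetical bounded cache: sessionID -> last user message text.
// Relies on Map preserving insertion order to find the oldest entry.
class BoundedSessionCache {
  private readonly entries = new Map<string, string>();
  private readonly maxEntries: number;

  constructor(maxEntries = 256) {
    this.maxEntries = maxEntries;
  }

  // Called when a user message arrives (e.g. from a chat.message handler).
  set(sessionID: string, text: string): void {
    // Delete first so a re-inserted key moves to the newest position.
    this.entries.delete(sessionID);
    this.entries.set(sessionID, text);

    if (this.entries.size > this.maxEntries) {
      const oldest = this.entries.keys().next().value;
      if (oldest !== undefined) this.entries.delete(oldest);
    }
  }

  // Called when the system-prompt transform fires for a session.
  get(sessionID: string): string | undefined {
    return this.entries.get(sessionID);
  }

  get size(): number {
    return this.entries.size;
  }
}
```

Because each `set` adds at most one entry and evicts at most one, the map never grows past the cap, no matter how many sessions the process sees.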

Deferred (P1)

Embedding-backed reranking for pugi.code.repo_map (embeddingProvider option is plumbed but only none is implemented).
Additional grammars (Rust, Go, Java, Ruby) are loaded lazily but their declaration tables are sketched rather than exhaustive.

License

MIT. See LICENSE.
