@jafreck/lore
v0.3.7
Language-aware codebase indexer — tree-sitter parsing, symbol extraction, call-graph construction, and optional embeddings for semantic search
Lore
The teammate that knows it all
Lore holds your agent's institutional knowledge over the codebase — it knows what was built, why it changed, and how it all connects. Lore indexes your code and git history into a structured knowledge base that agents query through MCP. It maps symbols, imports, call relationships, type relationships, routes, annotations, docs, and all git data — with optional embeddings for semantic search — so agents can reason about your codebase without re-reading it from scratch.
Lore-enabled agents achieve up to 8.8× faster responses, up to 97% fewer tokens, and up to +33pp correctness improvement on code intelligence tasks compared to a baseline agent with grep and file reads alone. See the full benchmark results for details.
What Lore does
- Indexes source files using SCIP indexers by default for pre-resolved symbols and edges, with tree-sitter parsing as a fallback for languages without a SCIP indexer
- Extracts symbols, imports, call refs, type refs, annotations, and API routes across all 23 supported languages
- Resolves internal vs external imports and builds call/import/module/inheritance/type-dependency graph edges using a 3-tier resolution strategy (SCIP/LSP containment, same-file name match, unique name match)
- Discovers and indexes documentation (.md, .rst, .adoc, .txt) with inferred kinds/titles
- Stores everything in a normalized SQL schema with optional vector search
- Enables RAG-style retrieval with semantic/fused search across symbols and doc sections
- Indexes git history (commits, touched files, refs/branches/tags)
- Enriches symbols with resolved type signatures and definitions via optional index-time LSP integration (batch-pipelined hover + definition requests)
- Supports line-level git blame through MCP
- Supports automatic refresh via watch mode, poll mode, and git hooks
How Lore integrates with LLMs
flowchart LR
subgraph Codebase
SRC[Source Files]
DOCS[Documentation<br/>md · rst · adoc · txt]
GIT[Git Repo]
COV[Coverage Reports]
end
subgraph INDEXER[Lore Indexer]
SCIPDIRECT[SCIP Source<br/>pre-resolved symbols + refs] --> WALK
WALK[Walker] --> PARSE[Parser] --> EXTRACT[Extractors<br/>symbols · imports · call refs<br/>type refs · routes · annotations]
EXTRACT --> RESOLVE[Import Resolver<br/>internal ↔ external]
EXTRACT --> CALLGRAPH[Relationship Resolver]
EXTRACT -.-> LSPENRICH[LSP Enrichment<br/>type signatures · definition locations]
DOCSINGEST[Docs Ingest<br/>sections · headings · notes]
GITHIST[Git History Ingest<br/>commits · diffs · refs]
COVINGEST[Coverage Ingest<br/>lcov · cobertura]
end
SRC --> SCIPDIRECT
SRC --> WALK
DOCS --> DOCSINGEST
GIT --> GITHIST
COV --> COVINGEST
DB[(SQL DB)]
EMBED([Embedding Model])
subgraph MCP_SERVER[MCP Server]
LOOKUP[lore_lookup]
SEARCH[lore_search]
DOCS_TOOL[lore_docs]
GRAPH[lore_graph]
ROUTES[lore_routes]
NOTES[lore_notes_read/write]
TESTMAP[lore_test_map]
SNIPPET[lore_snippet]
BLAME[lore_blame]
HISTORY[lore_history]
METRICS[lore_metrics]
end
subgraph MCP_CLIENTS[MCP Clients — Agents]
CLAUDE_CODE[Claude Code / Desktop]
COPILOT[VS Code + Copilot]
CURSOR[Cursor]
CUSTOM[Custom Agent Frameworks]
CLAUDE_CODE ~~~ COPILOT ~~~ CURSOR ~~~ CUSTOM
end
DOCSINGEST --> DB
GITHIST --> DB
COVINGEST --> DB
RESOLVE & CALLGRAPH --> DB
LSPENRICH -.->|optional| DB
RESOLVE -.->|optional| EMBED
EMBED -.-> DB
DB --- LOOKUP & SEARCH & DOCS_TOOL & GRAPH & ROUTES & NOTES & TESTMAP & SNIPPET & BLAME & HISTORY & METRICS
EMBED <-.->|semantic/fused| SEARCH
LOOKUP & SEARCH & DOCS_TOOL & GRAPH & ROUTES & NOTES & TESTMAP & SNIPPET & BLAME & HISTORY & METRICS <--> MCP_CLIENTS
Lore sits between your codebase and any LLM-powered tool. A LoreRuntime
instance owns all long-lived resources (database, embedder, LSP coordinators,
watcher/poller) and dispatches both CLI commands and MCP server requests through
a single lifecycle.
The indexer is decomposed into composable pipeline stages orchestrated
by IndexPipeline. IndexBuilder is now a thin façade (~310 lines, down from
~1,230) that delegates to the pipeline for both full builds and incremental
updates. The stage ordering enforces data dependencies structurally:
ScipSource → SourceIndex → DocsIndex → ImportResolution → DependencyApi
→ ScipEnrichment → LspEnrichment → Resolution → TestMap → History → Embedding
SCIP is the default source stage. ScipSourceStage runs first and
populates symbols, refs, and pre-resolved edges for every language that has
a SCIP indexer available. SourceIndexStage then handles remaining languages
(or all languages if SCIP is explicitly disabled) via tree-sitter as a
fallback. A SCIP enrichment stage then writes definition locations and
type metadata from SCIP data, while an optional LSP enrichment stage can
enrich symbols not already covered by SCIP. Symbol references are
resolved using a 3-tier strategy: SCIP/LSP containment → same-file name
match → unique name match. An optional embedder generates dense vectors
for semantic search (with async initialization so the MCP server starts
instantly), and a parallel git history ingest captures commits, diffs,
and refs. Everything is persisted to a normalized SQL database.
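The 3-tier reference resolution described above can be sketched as follows. This is a minimal illustration, not Lore's implementation: the `SymbolDef` and `Ref` shapes and the `resolveRef` function are hypothetical, and preferring the innermost containing symbol in tier 1 is an assumption.

```typescript
// Hypothetical shapes for illustration; Lore's real schema may differ.
interface SymbolDef {
  id: number;
  name: string;
  file: string;
  startLine: number;
  endLine: number;
}

interface Ref {
  name: string;
  file: string;
  // Definition location reported by SCIP or a language server, if any.
  definition?: { file: string; line: number };
}

type Resolution = { symbolId: number; tier: 1 | 2 | 3 } | null;

function resolveRef(ref: Ref, symbols: SymbolDef[]): Resolution {
  // Tier 1: containment. Pick the symbol whose range contains the reported
  // definition line (assumption: the smallest enclosing range wins).
  const def = ref.definition;
  if (def) {
    const containing = symbols
      .filter((s) => s.file === def.file && s.startLine <= def.line && def.line <= s.endLine)
      .sort((a, b) => (a.endLine - a.startLine) - (b.endLine - b.startLine))[0];
    if (containing) return { symbolId: containing.id, tier: 1 };
  }

  // Tier 2: a symbol with the same name declared in the referencing file.
  const sameFile = symbols.find((s) => s.file === ref.file && s.name === ref.name);
  if (sameFile) return { symbolId: sameFile.id, tier: 2 };

  // Tier 3: the name must be unique across the whole index.
  const byName = symbols.filter((s) => s.name === ref.name);
  const only = byName[0];
  if (byName.length === 1 && only) return { symbolId: only.id, tier: 3 };

  // Ambiguous or unknown names stay unresolved rather than guessed.
  return null;
}
```

Ordering the tiers from most to least precise means a name-based guess is only used when no SCIP/LSP location is available, and an ambiguous name produces no edge at all.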
The MCP server uses a ToolRegistry that auto-discovers tool modules
from their exported toolDef / handler definitions — no duplicate schema
code. It exposes the database as a set of tools that any MCP-compatible
client can call to look up symbols, search code, traverse call graphs, read
snippets, query blame/history, and write summaries back.
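A registry driven by exported toolDef / handler pairs might look roughly like the sketch below. The `SketchToolRegistry` class and the `ToolDef`, `ToolHandler`, and `ToolModule` types are hypothetical stand-ins; Lore's actual ToolRegistry API is not shown in this README.

```typescript
// Hypothetical shapes; the real toolDef/handler types are assumptions here.
interface ToolDef {
  name: string;
  description: string;
}
type ToolHandler = (args: Record<string, unknown>) => unknown;

// A tool module is any object that may export both `toolDef` and `handler`.
interface ToolModule {
  toolDef?: ToolDef;
  handler?: ToolHandler;
}

class SketchToolRegistry {
  private readonly tools = new Map<string, { def: ToolDef; handler: ToolHandler }>();

  // Registers every module that exports a complete toolDef/handler pair and
  // returns the names it accepted; incomplete modules are skipped.
  discover(modules: ToolModule[]): string[] {
    const registered: string[] = [];
    for (const mod of modules) {
      if (!mod.toolDef || !mod.handler) continue;
      this.tools.set(mod.toolDef.name, { def: mod.toolDef, handler: mod.handler });
      registered.push(mod.toolDef.name);
    }
    return registered;
  }

  // The tool list served to clients comes straight from the stored toolDefs,
  // so the schema is declared once, next to its handler.
  list(): ToolDef[] {
    return Array.from(this.tools.values(), (t) => t.def);
  }

  // Dispatches a call by tool name.
  call(name: string, args: Record<string, unknown>): unknown {
    const tool = this.tools.get(name);
    if (!tool) throw new Error(`unknown tool: ${name}`);
    return tool.handler(args);
  }
}
```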
The index stays fresh automatically. You can install git hooks
(post-commit, post-merge, etc.) that trigger an incremental refresh on
every commit, run a watch mode that reacts to filesystem events in
real time (with live embedding updates), or use poll mode for
environments where watch events are unreliable (also with live embedding
updates). Each refresh only re-processes files whose content hash has
changed, so updates are fast even on large repositories.
See docs/architecture.md for the full schema and pipeline breakdown.
Supported languages
Lore currently supports extractors for:
- C, C++, C#
- Rust, Go, Java, Kotlin, Scala, Swift, Objective-C, Zig
- Python, JavaScript, TypeScript, PHP, Ruby, Lua, Bash, Elixir
- OCaml, Haskell, Julia, Elm
Install
npm install @jafreck/lore
Note: Lore uses native add-ons (tree-sitter, better-sqlite3). A working
C/C++ toolchain is required the first time dependencies are built.
Quick start (CLI)
# 1) Build an index
npx @jafreck/lore index --root ./my-project --db ./lore.db
# 2) Start MCP server over stdio
npx @jafreck/lore mcp --db ./lore.db
Quick start (programmatic)
import { IndexBuilder } from '@jafreck/lore';
const builder = new IndexBuilder(
  './lore.db',
  { rootDir: './my-project' },
  undefined,
  { history: true },
);
await builder.build();
MCP tools
| Tool | Purpose |
|------|---------|
| lore_lookup | Find symbols by name or files by path, including external dependency API symbols and LSP-resolved metadata when available |
| lore_search | Structural BM25, semantic vector, or fused RRF search across symbols and doc sections |
| lore_docs | List, fetch, or search indexed documentation with branch, kind, and path filters |
| lore_routes | Query extracted API routes/endpoints with optional method, path prefix, and framework filters |
| lore_notes_write | Upsert agent-authored notes by key and scope, with optional source hash for staleness tracking |
| lore_notes_read | Read notes by exact key or key prefix with scope-aware staleness metadata |
| lore_graph | Query call/import/inheritance/type-dependency edges; supports source_id for outbound and target_id for inbound/reverse queries; call edges include callee_coverage_percent |
| lore_snippet | Return snippets from indexed source snapshots by file path + line range or by symbol name; path/symbol resolution is branch-aware and responses include containing-symbol context metadata (name, kind, start/end lines) when available |
| lore_test_map | Return mapped test files (with confidence) for a given source file path |
| lore_blame | Query blame, line-range history, or ownership aggregates with optional symbol targeting, commit-context enrichment, and risk signals |
| lore_history | Query commit history by file, commit, author, ref, recency, or semantic commit-message similarity |
| lore_metrics | Aggregate index metrics plus coverage/staleness fields |
lore_lookup query options
For symbol lookups (kind: "symbol"), lore_lookup supports:
- match_mode: optional symbol-name matching mode (exact, prefix, contains); defaults to exact (case-insensitive).
- symbol_kind: optional symbol kind filter (for example, function or class).
- path_prefix: optional indexed file-path prefix filter.
- language: optional indexed file language filter.
- limit: optional maximum rows for empty/browse symbol queries (default 20).
- offset: optional rows to skip for empty/browse symbol queries (default 0).
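As a rough illustration of the three match modes, the sketch below compares names after lowercasing both sides. The `matchesSymbolName` function is hypothetical, and applying case-insensitivity to prefix and contains as well as exact is an assumption; the server may implement matching differently (for example in SQL).

```typescript
type MatchMode = "exact" | "prefix" | "contains";

// Hypothetical helper: returns true when `name` matches `query` under `mode`.
function matchesSymbolName(name: string, query: string, mode: MatchMode = "exact"): boolean {
  const n = name.toLowerCase();
  const q = query.toLowerCase();
  switch (mode) {
    case "exact":
      return n === q;
    case "prefix":
      return n.startsWith(q);
    case "contains":
      return n.includes(q);
  }
}
```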
Example symbol lookup requests:
{ "kind": "symbol", "query": "IndexBuilder", "match_mode": "prefix", "symbol_kind": "class" }
{ "kind": "symbol", "query": "", "path_prefix": "src/indexer/", "language": "typescript", "limit": 20, "offset": 20 }
MCP config example
{
  "mcpServers": {
    "lore": {
      "command": "npx",
      "args": ["@jafreck/lore", "mcp", "--db", "/path/to/lore.db"]
    }
  }
}
lore_docs examples
{ "action": "list", "branch": "main", "kinds": ["readme", "architecture"] }
{ "action": "get", "path": "/repo/docs/architecture.md", "branch": "main", "include_sections": true }
{ "action": "search", "query": "incremental refresh", "kinds": ["guide", "architecture"], "limit": 10 }
lore_search filter parameters
lore_search supports additional optional filters to narrow symbol and documentation hits:
| Parameter | Applies to | Description |
|-----------|------------|-------------|
| path_prefix | Symbol results | Restrict symbol hits to files whose source path starts with the prefix |
| language | Symbol results | Restrict symbol hits to indexed file language (for example typescript, python) |
| kind | Symbol results | Restrict symbol hits to a symbol kind (for example function, class) |
| doc_path_prefix | Doc-section results | Restrict semantic/fused doc hits to docs whose path starts with the prefix |
| doc_kind | Doc-section results | Restrict semantic/fused doc hits to a documentation kind (for example readme, architecture) |
Mode behavior:
- structural: returns symbol hits only; applies path_prefix, language, and kind.
- semantic: may return symbol and doc-section hits; symbol filters (path_prefix, language, kind) apply to symbol results, while doc_path_prefix and doc_kind apply to doc-section results before ranking output.
- fused: combines structural and semantic candidates; symbol filters apply to symbol candidates and doc filters apply to semantic doc-section candidates before final fused ranking.
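For intuition about fused mode, here is a minimal sketch of reciprocal rank fusion (RRF) over two already-filtered candidate lists. The `reciprocalRankFusion` function is hypothetical, and the constant k = 60 is the value commonly used in the RRF literature, not a documented Lore setting.

```typescript
// Each ranking is a list of candidate ids, best first.
// An item's fused score is the sum over rankings of 1 / (k + rank), rank starting at 1.
function reciprocalRankFusion(rankings: string[][], k = 60): { id: string; score: number }[] {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, index) => {
      const rank = index + 1;
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank));
    });
  }
  // Highest fused score first; ties broken by id for deterministic output.
  return Array.from(scores, ([id, score]) => ({ id, score })).sort(
    (a, b) => b.score - a.score || a.id.localeCompare(b.id),
  );
}

// Example: "b" ranks well in both lists, so it ends up first.
const structuralHits = ["a", "b", "c"];
const semanticHits = ["b", "d", "a"];
const fusedHits = reciprocalRankFusion([structuralHits, semanticHits]);
console.log(fusedHits.map((hit) => hit.id).join(", ")); // b, a, d, c
```

Because RRF only uses rank positions, BM25 scores and cosine similarities never need to be normalized onto a shared scale.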
lore_history modes
| Mode | Query |
|------|-------|
| recent | Newest commits |
| semantic | Conceptual commit-message search (falls back to recent when vectors are unavailable) |
| file | Commits that touched a path |
| commit | Full/prefix SHA lookup (+files +refs) |
| author | Commits by author/email substring |
| ref | Commits matching branch/tag ref name |
lore_blame examples
{ "path": "/repo/src/index.ts", "line": 120 }
{ "path": "/repo/src/index.ts", "start_line": 120, "end_line": 140 }
{ "path": "/repo/src/index.ts", "line": 120, "ref": "main" }
{ "symbol": "handleAuth", "path": "/repo/src/auth.ts", "branch": "main" }
{ "mode": "history", "symbol": "handleAuth", "path": "/repo/src/auth.ts", "ref": "main" }
{ "mode": "ownership", "path": "/repo/src", "scope": "directory", "ref": "main" }
Legacy line and line-range requests remain fully supported; mode defaults to "blame" when omitted.
History and ownership responses include commit context (commits, history[*].commit_context with message/files/refs) and risk indicators (recency, author_dispersion, churn, overall), and symbol-targeted requests return resolved_symbol.
Data ingestion
Lore indexes multiple data sources into a normalized SQLite schema. Each source has its own ingestion pipeline and can be enabled independently.
Source code
The indexer uses a SCIP-first strategy: for languages with a SCIP indexer it produces symbols and pre-resolved edges directly, then falls back to tree-sitter parsing for remaining languages. Optional LSP enrichment can augment symbols from either path. The import resolver classifies each import as internal or external, and a call-graph builder creates edges between symbols.
Programmatic example:
import { IndexBuilder } from '@jafreck/lore';
await new IndexBuilder('./lore.db', {
  rootDir: './my-project',
  includeGlobs: ['src/**'],
  excludeGlobs: ['**/*.gen.ts'],
  extensions: ['.ts', '.tsx'],
}).build();
Documentation
Lore discovers and indexes documentation files (.md, .rst, .adoc, .txt)
during both index and refresh flows. By default it scans:
- README* variants
- docs/**/*.{md,rst,adoc,txt}
- ADR-style paths (**/{adr,adrs,ADR,ADRS}/**/* and **/{ADR,adr}-*)
- Top-level architecture/design/overview/changelog/guide files
Indexed docs are stored per (path, branch) in docs, with heading-based
chunks in doc_sections. When embeddings are enabled, section vectors are stored
in doc_section_embeddings.
CLI discovery controls:
- --docs-include <glob> / --docs-exclude <glob>: repeatable include/exclude filters
- --docs-extension <ext>: repeatable extension filter (e.g. .md)
- --docs-auto-notes / --no-docs-auto-notes: toggle seeded doc-note upserts (default: enabled)
When auto-notes are enabled, Lore seeds notes rows for README, architecture,
and ADR docs using deterministic keys. Each note tracks a source_hash for
staleness detection — lore_notes_read reports doc-scoped notes as stale when
the backing document changes or disappears.
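The staleness rule for doc-scoped notes can be sketched as a hash comparison. The `Note` shape, the `Staleness` labels, and `noteStaleness` are hypothetical; the README only states that a note is reported as stale when its backing document changes or disappears.

```typescript
// Hypothetical note shape: the hash was recorded when the note was written.
interface Note {
  key: string;
  sourcePath: string;
  sourceHash: string;
}

type Staleness = "fresh" | "stale-changed" | "stale-missing";

// `currentDocHashes` maps each indexed doc path to its current content hash.
function noteStaleness(note: Note, currentDocHashes: Map<string, string>): Staleness {
  const current = currentDocHashes.get(note.sourcePath);
  if (current === undefined) return "stale-missing"; // backing doc disappeared
  return current === note.sourceHash ? "fresh" : "stale-changed";
}
```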
Programmatic example:
await new IndexBuilder('./lore.db', {
  rootDir: './my-project',
  docsIncludeGlobs: ['**/README*', 'handbook/**/*.rst'],
  docsExcludeGlobs: ['**/docs/private/**'],
  docsExtensions: ['.md', '.rst'],
}).build();
Git history
Lore ingests commits, touched files (with change type and diff stats), and
refs (branches/tags). Enable with --history; use --history-all to traverse
all refs and --history-depth <n> to cap the number of commits.
Indexed tables:
- commits: sha, author, author_email, timestamp, message, parents
- commit_files: per-commit touched paths with change type and diff stats
- commit_refs: refs currently pointing at commits (branch/tag/other)
- commit_embeddings: commit-message vectors keyed to commits for semantic history retrieval
Programmatic example:
await new IndexBuilder('./lore.db', {
  rootDir: './my-project',
}, undefined, {
  history: { all: true, depth: 2000 },
}).build();
Coverage
Coverage reports are auto-detected during build/update/refresh from known paths
(coverage/lcov.info, coverage/cobertura-coverage.xml, coverage.xml) and
only ingested when newer than the last stored coverage run.
For non-standard report locations, use lore ingest-coverage:
npx @jafreck/lore ingest-coverage --db ./lore.db --root ./my-project \
  --file ./custom/coverage.xml --format cobertura
Embeddings
Lore optionally generates dense vector embeddings for semantic search using
@huggingface/transformers (Transformers.js), which runs ONNX models natively
in Node.js — no Python or external processes required. The default model is
Qwen/Qwen3-Embedding-0.6B (1024-dim); override with --embedding-model:
npx @jafreck/lore index --root ./my-project --db ./lore.db \
  --embedding-model 'nomic-ai/nomic-embed-text-v1.5'
Hardware acceleration is automatic: CoreML on Apple Silicon, WebGPU when
available, CPU elsewhere. Override via the LORE_EMBED_DEVICE env var.
Quantized ONNX dtype (fp32/fp16/q8/q4) is configurable with LORE_EMBED_DTYPE.
In update/watch/poll mode, symbols and docs whose embedding text is unchanged
(SHA-256 hash comparison) are skipped entirely for fast incremental re-embeds.
At query time, lore_search in semantic or fused mode embeds the query
and performs cosine similarity against stored vectors. If the model cannot
initialize, search gracefully degrades to structural BM25.
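The query-time behavior can be illustrated with a small sketch: cosine similarity against stored vectors, plus a fallback to structural mode when no embedder is ready. The function names, the `StoredVector` shape, and the brute-force scan are assumptions for illustration, not Lore's actual search code.

```typescript
// Cosine similarity between two equal-length vectors; 0 for a zero vector.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error("vectors must have the same dimension");
  let dot = 0;
  let normA = 0;
  let normB = 0;
  a.forEach((x, i) => {
    const y = b[i] ?? 0;
    dot += x * y;
    normA += x * x;
    normB += y * y;
  });
  if (normA === 0 || normB === 0) return 0;
  return dot / Math.sqrt(normA * normB);
}

interface StoredVector {
  id: string;
  vector: number[];
}

// Brute-force ranking of stored vectors against the embedded query.
function rankBySimilarity(query: number[], stored: StoredVector[], limit: number): { id: string; score: number }[] {
  return stored
    .map((s) => ({ id: s.id, score: cosineSimilarity(query, s.vector) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, limit);
}

type SearchMode = "structural" | "semantic" | "fused";

// If the embedder failed to initialize, every request is served structurally.
function effectiveSearchMode(requested: SearchMode, embedderReady: boolean): SearchMode {
  return embedderReady ? requested : "structural";
}
```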
When history indexing is enabled, Lore also stores commit-message vectors in
commit_embeddings so lore_history can serve semantic commit retrieval.
LSP enrichment
Lore can enrich symbols and call refs with resolved type metadata at index time by querying language servers via the Language Server Protocol. Enriched columns:
- resolved_type_signature, resolved_return_type
- definition_uri, definition_path
These are persisted in symbols, symbol_refs, and external_symbols tables.
lore_lookup and lore_search return them when present. Query handlers stay
SQLite-only — language servers are never invoked at runtime.
LSP precedence (highest first):
- CLI flag (--lsp)
- .lore.config lsp.enabled
- Built-in default (false)
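The precedence order amounts to taking the first defined value. A minimal sketch, assuming a hypothetical `resolveLspEnabled` helper and a config object that mirrors the `lsp.enabled` key of .lore.config:

```typescript
// Mirrors only the part of .lore.config that matters for this decision.
interface LoreConfigFile {
  lsp?: { enabled?: boolean };
}

// `cliFlag` is undefined when --lsp was not passed on the command line.
function resolveLspEnabled(cliFlag: boolean | undefined, config: LoreConfigFile | undefined): boolean {
  if (cliFlag !== undefined) return cliFlag; // 1. CLI flag wins
  const fromConfig = config?.lsp?.enabled;
  if (fromConfig !== undefined) return fromConfig; // 2. then the config file
  return false; // 3. built-in default
}
```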
.lore.config example:
{
  "lsp": {
    "enabled": true,
    "timeoutMs": 5000,
    "servers": {
      "typescript": { "command": "typescript-language-server", "args": ["--stdio"] },
      "python": { "command": "pyright-langserver", "args": ["--stdio"] }
    }
  }
}
Default server mappings cover all supported extractor languages:
| Language(s) | Default command |
|-------------|------------------|
| c, cpp, objc | clangd |
| rust | rust-analyzer |
| python | pyright-langserver --stdio |
| typescript, javascript | typescript-language-server --stdio |
| go | gopls |
| java | jdtls |
| csharp | csharp-ls |
| ruby | solargraph stdio |
| php | intelephense --stdio |
| swift | sourcekit-lsp |
| kotlin | kotlin-language-server |
| scala | metals |
| lua | lua-language-server |
| bash | bash-language-server start |
| elixir | elixir-ls |
| zig | zls |
| ocaml | ocamllsp |
| haskell | haskell-language-server-wrapper --lsp |
| julia | julia --startup-file=no --history-file=no --quiet --eval "using LanguageServer, SymbolServer; runserver()" |
| elm | elm-language-server |
Install whichever language servers you need on PATH; unavailable servers are
auto-detected and skipped without failing indexing.
Dependency APIs
Lore can index declaration-level public API surface from direct dependencies.
Enable with --index-deps or indexDependencies: true programmatically.
Supported ecosystems:
- TypeScript/JavaScript: exported declarations from .d.ts files in direct npm dependencies
- Python: stubbed/public declarations from direct dependencies via .pyi and py.typed
- Go: exported declarations from direct module requirements in go.mod
- Rust: pub declarations from crates in Cargo.toml
Implementation bodies are excluded and transitive dependencies are not crawled.
Keeping the index fresh
The index stays current automatically through three mechanisms:
Git hooks are installed once with lore hooks; Lore then refreshes on every
post-commit, post-merge, post-checkout, and post-rewrite:
npx @jafreck/lore hooks --root ./my-project --db ./lore.db --history
Watch mode reacts to filesystem events in real time:
npx @jafreck/lore refresh --db ./lore.db --root ./my-project --watch
Poll mode uses periodic mtime diffing and is the most reliable option across filesystems:
npx @jafreck/lore refresh --db ./lore.db --root ./my-project --poll
Both watch and poll modes support live embeddings: when an embedding model is configured, changed files have their vectors re-generated incrementally during each refresh cycle.
Each refresh only re-processes files whose content hash has changed, so updates are fast even on large repositories.
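A minimal sketch of hash-based change detection is shown below. The `planRefresh` function and `RefreshPlan` shape are hypothetical, files are passed in as in-memory strings to keep the example self-contained, and SHA-256 is an assumption (the README names it only for embedding text, not for file content).

```typescript
import { createHash } from "node:crypto";

// Assumption: SHA-256 over UTF-8 content; the real file-hash algorithm is not documented here.
function contentHash(content: string): string {
  return createHash("sha256").update(content, "utf8").digest("hex");
}

interface RefreshPlan {
  changed: string[]; // new or modified paths that need re-processing
  removed: string[]; // paths that were indexed before but no longer exist
  nextHashes: Map<string, string>; // path -> hash to store for the next refresh
}

// previousHashes: path -> hash stored by the last run; currentFiles: path -> content.
function planRefresh(previousHashes: Map<string, string>, currentFiles: Map<string, string>): RefreshPlan {
  const changed: string[] = [];
  const nextHashes = new Map<string, string>();
  currentFiles.forEach((content, path) => {
    const hash = contentHash(content);
    nextHashes.set(path, hash);
    if (previousHashes.get(path) !== hash) changed.push(path);
  });
  const removed: string[] = [];
  previousHashes.forEach((_hash, path) => {
    if (!currentFiles.has(path)) removed.push(path);
  });
  return { changed, removed, nextHashes };
}
```

Comparing hashes rather than timestamps means a file that was touched but not actually modified is skipped.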
CLI reference
lore index
Build or update a knowledge base.
npx @jafreck/lore index --root <dir> --db <path> [--embedding-model <id>] [--blocking-embedder] [--index-deps] [--history] [--history-depth <n>] [--history-all] [--include <glob>] [--exclude <glob>] [--language <lang>] [--docs-include <glob>] [--docs-exclude <glob>] [--docs-extension <ext>] [--docs-auto-notes|--no-docs-auto-notes] [--lsp] [--no-scip]
lore refresh
Incremental refresh (one-shot, watch, or poll).
npx @jafreck/lore refresh --db <path> --root <dir> [--index-deps] [--history] [--history-depth <n>] [--history-all] [--docs-include <glob>] [--docs-exclude <glob>] [--docs-extension <ext>] [--docs-auto-notes|--no-docs-auto-notes] [--lsp] [--no-scip]
npx @jafreck/lore refresh --db <path> --root <dir> --watch [--index-deps] [--history] [--docs-include <glob>] [--docs-exclude <glob>] [--docs-extension <ext>] [--lsp] [--no-scip]
npx @jafreck/lore refresh --db <path> --root <dir> --poll [--index-deps] [--history] [--docs-include <glob>] [--docs-exclude <glob>] [--docs-extension <ext>] [--lsp] [--no-scip]
lore hooks
Install repo-local git hooks for automatic refresh.
npx @jafreck/lore hooks --root <repo> --db <path> [--history] [--lsp] [--no-scip]
lore ingest-coverage
Manually ingest a coverage report.
npx @jafreck/lore ingest-coverage --db <path> --root <dir> --file <path> --format <lcov|cobertura> [--commit <sha>]
lore mcp
Start the MCP server over stdio.
npx @jafreck/lore mcp --db <path> [--blocking-embedder]
Build from source
git clone https://github.com/jafreck/Lore.git
cd Lore
npm install
npm run build
Contributing
Environment expectations:
- Node.js >=22.0.0
- Native build toolchain for tree-sitter and better-sqlite3
Common local workflow:
npm run build
npm test
npm run coverage
CI currently enforces minimum coverage thresholds of 77% statements, 64% branches, 80% functions, and 79% lines.
Publish authentication (npm)
Publishing uses NODE_AUTH_TOKEN (see .npmrc); tokens are never committed
to the repository.
Local publish flow:
export NODE_AUTH_TOKEN=<npm automation token>
npm publish --access public
CI publish flow:
- Add NODE_AUTH_TOKEN as a secret in your CI provider (for GitHub Actions, use a repository or environment secret).
- Ensure publish jobs expose that secret as the NODE_AUTH_TOKEN environment variable before running npm publish.
Release publish workflow
Publishing is automated by .github/workflows/publish.yml. Publishing a
GitHub Release triggers the npm publish job.
Release steps:
- Ensure package.json has the target version.
- Publish a GitHub Release with the matching vX.Y.Z tag.
- Confirm the workflow logs show npm publish --dry-run output before the live npm publish step.
Post-publish verification:
- Check the package metadata: npm view @jafreck/lore version.
- Confirm installability: npm view @jafreck/lore@<version> name version.
Benchmarking index performance (500+ file repos)
Use this procedure when you need measurable before/after evidence for indexing changes:
- Pick a repository with at least 500 source files and note the exact commit SHA you will test.
- Capture a baseline timing from the same machine and environment:
time npx @jafreck/lore index --root /path/to/repo --db ./lore-baseline.db
- Apply your change, rebuild Lore, then capture a post-change timing against the same repository commit:
npm run build
time npx @jafreck/lore index --root /path/to/repo --db ./lore-after.db
- Record both timings (baseline and post-change) in the related GitHub issue or PR under an "Acceptance Evidence" section, including repo name, commit SHA, and command used.
