diffdoc
v0.7.0
Translate repository code shifts into plain-English business context
DiffDoc
Your codebase already knows how the product works. DiffDoc turns that implementation into a living, portable knowledgebase that humans and agents can search, question, and reuse.
It generates plain-English summaries from source files, records them in a manifest-first artifact model, and keeps the resulting context close to the repository. Use it to give developers, agents, reviewers, and stakeholders implementation-grounded answers without asking them to read every file first.
Guiding Principles
- The codebase is the source of truth. Requirements documents, tickets, wikis, and tribal knowledge can drift, but product behavior is ultimately defined by the code that ships.
- Summaries should describe implemented behavior, not imagined intent. DiffDoc focuses on what the current files do so product questions are answered from the implementation first.
- The knowledgebase should evolve with the product. When files change, DiffDoc refreshes affected summaries and manifest entries so generated context does not become a stale snapshot.
- The manifest is the durable contract. DiffDoc is intentionally manifest-first: the manifest is the source of truth for generated summaries, and downstream tools should be able to consume the manifest and summary assets without depending on DiffDoc's built-in embedding workflow.
- Retrieval is optional infrastructure. The built-in embed command, local Vectra index, search, query, and MCP server are convenience features for teams that want an end-to-end local workflow, but consumers should be free to use their own embedding provider, vector store, search system, or documentation pipeline.
- Useful context should serve humans and agents. The generated knowledgebase is intended for product questions, onboarding, code review, agent workflows, audits, and long-term maintenance.
Requirements
- Node.js >=22
- An OpenAI-compatible chat model for summarize and query
- An OpenAI-compatible embedding model for embed, search, and query
- A local model server such as Ollama, LM Studio, or vLLM, or a cloud OpenAI-compatible endpoint
Install
Run DiffDoc without adding it to your project:
npx diffdoc --help
Install it as a project dev dependency:
npm install --save-dev diffdoc
Recommended package scripts:
{
  "scripts": {
    "diffdoc:init": "diffdoc init",
    "diffdoc:summarize": "diffdoc summarize",
    "diffdoc:embed": "diffdoc embed",
    "diffdoc:search": "diffdoc search",
    "diffdoc:query": "diffdoc query",
    "diffdoc:status": "diffdoc status",
    "diffdoc:mcp": "diffdoc-mcp"
  }
}
Quick Start
Initialize DiffDoc in your repository:
npx diffdoc init
For a non-interactive setup using defaults:
npx diffdoc init --yes
Create summaries:
npx diffdoc summarize --path . --mode all
Build the local search index:
npx diffdoc embed
Search raw matches:
npx diffdoc search "How does authentication work?"
Ask a question using retrieved project context:
npx diffdoc query "What business behavior does this repository implement?"
After the first full run, refresh changed files with delta mode:
npx diffdoc summarize --path . --mode delta
npx diffdoc embed
What Init Creates
diffdoc init creates or updates repository-local setup files:
- .diffdocrc: local DiffDoc configuration
- .diffdocignore: gitignore-style file selection rules for summarization
- .gitignore: entries for local/generated DiffDoc files when needed
It does not summarize or embed anything. Run summarize and embed after initialization.
Configuration
DiffDoc reads settings in this order:
- CLI flags
- .diffdocrc or the file passed with --config <path>
- Environment variables
- Built-in defaults
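The lookup order above can be sketched as a layered merge. This is a minimal illustration, not DiffDoc's actual loader: it assumes that sources listed earlier win over later ones, and the `Settings` shape, `resolveSettings`, and `settingsFromEnv` are hypothetical names that model only two of the documented keys.

```typescript
// Hypothetical subset of the documented settings.
type Settings = {
  baseDir: string;
  summarizeConcurrency: number;
};

// Defaults stated in this README: baseDir ./.diffdoc, concurrency 2.
const DEFAULTS: Settings = {
  baseDir: "./.diffdoc",
  summarizeConcurrency: 2,
};

// Translate the documented environment variables into a partial layer.
function settingsFromEnv(
  vars: Record<string, string | undefined>,
): Partial<Settings> {
  const layer: Partial<Settings> = {};
  if (vars.DIFFDOC_BASE_DIR) {
    layer.baseDir = vars.DIFFDOC_BASE_DIR;
  }
  const concurrency = Number(vars.DIFFDOC_SUMMARIZE_CONCURRENCY);
  if (Number.isInteger(concurrency) && concurrency > 0) {
    layer.summarizeConcurrency = concurrency;
  }
  return layer;
}

// Assumed precedence: CLI flags, then config file, then environment, then defaults.
function resolveSettings(
  cliFlags: Partial<Settings>,
  rcFile: Partial<Settings>,
  env: Partial<Settings>,
): Settings {
  return {
    baseDir:
      cliFlags.baseDir ?? rcFile.baseDir ?? env.baseDir ?? DEFAULTS.baseDir,
    summarizeConcurrency:
      cliFlags.summarizeConcurrency ??
      rcFile.summarizeConcurrency ??
      env.summarizeConcurrency ??
      DEFAULTS.summarizeConcurrency,
  };
}
```

Under these assumptions, a `--summarize-concurrency 4` flag would override a value of 3 in `.diffdocrc`, while `baseDir` would still come from the config file if no flag sets it.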
Example .diffdocrc for local models:
{
  "baseDir": "./.diffdoc",
  "aiProvider": "local",
  "localLlmEndpoint": "http://localhost:11434/v1",
  "localEmbedEndpoint": "http://localhost:11434/v1/embeddings",
  "localChatModel": "qwen2.5-coder:7b",
  "localEmbedModel": "nomic-embed-code",
  "embedBatchSize": 25,
  "summarizeConcurrency": 2,
  "includeGlobs": [],
  "excludeGlobs": [],
  "ignoreFile": ".diffdocignore"
}
Example .diffdocrc for a cloud OpenAI-compatible endpoint:
{
  "baseDir": "./.diffdoc",
  "aiProvider": "cloud",
  "cloudLlmEndpoint": "https://api.openai.com/v1",
  "cloudChatModel": "gpt-4o-mini",
  "cloudEmbedModel": "text-embedding-3-small",
  "embedBatchSize": 25,
  "summarizeConcurrency": 2,
  "includeGlobs": [],
  "excludeGlobs": [],
  "ignoreFile": ".diffdocignore"
}
Set OPENAI_API_KEY for cloud providers instead of committing API keys:
OPENAI_API_KEY="..." npx diffdoc summarize --path . --mode all
Supported environment variables:
AI_PROVIDER
DIFFDOC_BASE_DIR
DIFFDOC_EMBED_BATCH_SIZE
DIFFDOC_SUMMARIZE_CONCURRENCY
DIFFDOC_INCLUDE_GLOBS
DIFFDOC_EXCLUDE_GLOBS
DIFFDOC_IGNORE_FILE
DIFFDOC_SUMMARY_PROMPT
DIFFDOC_SUMMARY_PROMPT_FILE
LOCAL_LLM_ENDPOINT
LOCAL_CHAT_MODEL
LOCAL_EMBED_ENDPOINT
LOCAL_EMBED_MODEL
CLOUD_LLM_ENDPOINT
CLOUD_CHAT_MODEL
CLOUD_EMBED_MODEL
OPENAI_API_KEY
File Selection
.diffdocignore uses .gitignore-style syntax. This is the main way to keep generated files, dependencies, secrets, binaries, and local artifacts out of summaries.
Example .diffdocignore:
.git/
.diffdoc/
node_modules/
dist/
coverage/
.env
*.log
Precedence is intentionally conservative:
- .diffdocignore skips files first
- excludeGlobs skip files second
- includeGlobs narrow whatever remains
An included file is still skipped if it matches .diffdocignore or excludeGlobs.
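The precedence described above can be sketched as three ordered checks. This is an illustration only: `isSelected` and `SelectionRules` are hypothetical names, the matchers are plain predicates standing in for real gitignore and glob matching, and treating an empty includeGlobs list as "no narrowing" is an assumption based on the empty default in the example config.

```typescript
// A matcher answers whether a repository-relative path matches one rule.
type Matcher = (filePath: string) => boolean;

interface SelectionRules {
  ignoreFile: Matcher[]; // rules loaded from .diffdocignore
  excludeGlobs: Matcher[];
  includeGlobs: Matcher[]; // assumed: empty means every remaining file is kept
}

function isSelected(filePath: string, rules: SelectionRules): boolean {
  // 1. .diffdocignore skips files first.
  if (rules.ignoreFile.some((matches) => matches(filePath))) return false;
  // 2. excludeGlobs skip files second.
  if (rules.excludeGlobs.some((matches) => matches(filePath))) return false;
  // 3. includeGlobs narrow whatever remains.
  if (rules.includeGlobs.length === 0) return true;
  return rules.includeGlobs.some((matches) => matches(filePath));
}

// Stand-in predicates approximating node_modules/, .env, **/*.test.ts, and src/**/*.ts.
const exampleRules: SelectionRules = {
  ignoreFile: [(p) => p.startsWith("node_modules/"), (p) => p === ".env"],
  excludeGlobs: [(p) => p.endsWith(".test.ts")],
  includeGlobs: [(p) => p.startsWith("src/") && p.endsWith(".ts")],
};
```

With these rules, `src/auth.ts` is selected, while `src/auth.test.ts` is skipped even though it matches the include pattern.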
Use include and exclude filters from config:
{
  "includeGlobs": ["src/**/*.ts"],
  "excludeGlobs": ["**/*.test.ts"]
}
Or pass them at runtime:
npx diffdoc summarize --path . --mode all --include-glob "src/**/*.ts" --exclude-glob "**/*.test.ts"
Commands
Initialize setup files:
npx diffdoc init
npx diffdoc init --yes
npx diffdoc init --provider cloud --force
Summarize files into .diffdoc/manifest.json and .diffdoc/summaries/*.json:
npx diffdoc summarize --path . --mode all
npx diffdoc summarize --path . --mode delta
npx diffdoc summarize --path . --mode delta --json
npx diffdoc summarize --path . --mode all --summarize-concurrency 4
npx diffdoc summarize --path . --mode all --refresh
Summarization runs with bounded concurrency. The default is 2; use 1 for strict rate limits, 2-4 for most providers, and higher values only when your local model server or API quota can handle the request volume.
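For readers unfamiliar with the term, bounded concurrency means that at most a fixed number of requests are in flight at once. The following is a generic sketch of that technique, not DiffDoc's internal scheduler; `mapWithConcurrency` is a hypothetical helper name.

```typescript
// Run `worker` over `items` with at most `limit` calls in flight at a time.
// Results are returned in the same order as the input items.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

A limit of 1 makes the work strictly sequential, which is why it suits strict rate limits; raising the limit trades request volume for throughput.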
Use --summary-prompt or --summary-prompt-file to add domain-specific guidance without replacing DiffDoc's default structured prompt:
npx diffdoc summarize --summary-prompt "Emphasize billing behavior, permissions, data retention, and operational risk."
npx diffdoc summarize --summary-prompt-file ./diffdoc-summary-prompt.md
Raw code snapshots are optional. DiffDoc normally stores file path and content hash metadata so tools can look up source files from the repository when needed. Store raw code snapshots only when you need exported, offline, or point-in-time audit artifacts to include source text:
npx diffdoc summarize --path . --mode all --include-code-snapshot
Snapshots increase artifact size and duplicate source code, which can include sensitive or proprietary content.
Check manifest and index freshness:
npx diffdoc status
npx diffdoc status --json
status also recommends the next command to run. It prioritizes refreshing missing or stale summaries before rebuilding the vector index.
Embed summaries into the local Vectra index:
npx diffdoc embed
npx diffdoc embed --rebuild
npx diffdoc embed --embed-batch-size 20
Search indexed summaries:
npx diffdoc search "How does this project process changed files?"
npx diffdoc search "How does embedding work?" --top 3 --code
Ask questions with retrieval-augmented answers:
npx diffdoc query "How does this project process changed files?"
npx diffdoc query "How does embedding work?" --top 3 --code
Use a custom config or artifact directory:
npx diffdoc query "How does embedding work?" --config ./config/diffdoc.local.json
npx diffdoc embed --config ./.diffdocrc --base-dir ./tmp-diffdoc
Artifacts
DiffDoc keeps generated project context under baseDir, which defaults to ./.diffdoc:
.diffdoc/
  manifest.json
  summaries/
    <content-hash>.json
  vectra/
The manifest maps repository-relative file paths to content hashes:
{
  "schemaVersion": 2,
  "lastSyncedCommit": "string-hash",
  "files": {
    "src/example.ts": "md5-string"
  }
}
Each summary asset is portable JSON:
{
  "schemaVersion": 2,
  "content_hash": "md5-string",
  "metadata": {
    "file_path": "src/example.ts",
    "file_name": "example.ts",
    "extension": ".ts",
    "line_count": 42,
    "byte_size": 1200,
    "content_hash": "md5-string",
    "generated_at": "2026-05-27T00:00:00.000Z",
    "generator": {
      "provider": "local",
      "model": "qwen2.5-coder:7b",
      "base_url": "http://localhost:11434/v1"
    },
    "prompt_version": 1,
    "summary_format": "structured-functional-v1"
  },
  "summary": "## Metadata\n- File path: src/example.ts\n...",
  "raw_code_snapshot": "Optional code text when --include-code-snapshot is enabled"
}
The JSON metadata contains deterministic source and generation facts. The markdown summary begins with ## Metadata, which is embedded with the rest of the summary so file paths, hashes, inferred language/type, symbols, functions, classes, and dependencies are searchable. Language/type and symbol/dependency details are inferred by the model from the file path, extension, and code content rather than maintained through a static parser.
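Because the summary field is markdown organized under level-two headings such as ## Metadata, a consumer can split it into named sections without any DiffDoc code. The helper below is a hypothetical sketch; it assumes every section starts with a line of the form `## Title` and ignores any text that appears before the first heading.

```typescript
// Split a markdown summary into a map of section title -> section body.
function splitSections(summary: string): Map<string, string> {
  const sections = new Map<string, string>();
  let currentTitle: string | undefined;
  let buffer: string[] = [];

  // Store the section collected so far, if any.
  const flush = (): void => {
    if (currentTitle !== undefined) {
      sections.set(currentTitle, buffer.join("\n").trim());
    }
  };

  for (const line of summary.split("\n")) {
    const heading = /^## (.+)$/.exec(line);
    if (heading) {
      flush();
      currentTitle = heading[1].trim();
      buffer = [];
    } else {
      buffer.push(line);
    }
  }
  flush();
  return sections;
}
```

A tool that only needs one part of a summary could then call `splitSections(asset.summary).get("Metadata")` and skip the rest.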
Structured summaries use these sections in order:
## Metadata
## Purpose
## User-Visible Behavior
## Business Rules
## Data Inputs And Outputs
## Side Effects
## Error And Edge Cases
## Dependencies
## Operational Notes
Summary assets are regenerated when the source hash changes, summary schema changes, prompt version changes, summary format changes, custom prompt hash changes, provider/model changes, or --refresh is passed. Regenerate existing schema 1 artifacts with npx diffdoc summarize --mode all --refresh. The embed command remains tolerant of older summary assets as long as they contain a content hash and summary text; use status or summarize to identify and refresh stale metadata.
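The regeneration triggers listed above amount to comparing a stored asset against the current run. The sketch below illustrates that comparison using only fields shown in the example summary asset; `regenerationReasons` and `CurrentRun` are hypothetical names, and the custom prompt hash trigger is left out because this README does not show where that hash is stored.

```typescript
// Subset of the summary asset fields shown in the example JSON.
interface StoredSummary {
  schemaVersion: number;
  content_hash: string;
  metadata: {
    prompt_version: number;
    summary_format: string;
    generator: { provider: string; model: string };
  };
}

// Hypothetical description of the current run's inputs.
interface CurrentRun {
  schemaVersion: number;
  sourceHash: string;
  promptVersion: number;
  summaryFormat: string;
  provider: string;
  model: string;
  refresh: boolean; // true when --refresh is passed
}

// Return every reason the stored summary would need regenerating.
function regenerationReasons(
  stored: StoredSummary | undefined,
  run: CurrentRun,
): string[] {
  if (stored === undefined) return ["no summary asset yet"];
  const reasons: string[] = [];
  if (run.refresh) reasons.push("--refresh passed");
  if (stored.content_hash !== run.sourceHash) reasons.push("source hash changed");
  if (stored.schemaVersion !== run.schemaVersion) reasons.push("summary schema changed");
  if (stored.metadata.prompt_version !== run.promptVersion) reasons.push("prompt version changed");
  if (stored.metadata.summary_format !== run.summaryFormat) reasons.push("summary format changed");
  const generator = stored.metadata.generator;
  if (generator.provider !== run.provider || generator.model !== run.model) {
    reasons.push("provider/model changed");
  }
  return reasons;
}
```

An empty result means the stored summary can be reused as-is.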
Commit .diffdoc/manifest.json and .diffdoc/summaries/*.json if you want summaries shared across machines or CI runs. Keep .diffdoc/vectra/ local unless you have a specific reason to commit the generated vector index.
The manifest and summary assets are the stable handoff point for consumers. The local Vectra index produced by diffdoc embed is optional and can be replaced by any embedding model and storage backend that fits your environment.
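As an illustration of that handoff, a downstream tool could join the manifest to its summary assets with nothing but file reads. This is a sketch under the layout documented above (manifest.json plus summaries/<content-hash>.json under baseDir); `summariesByPath` and `loadFromDisk` are hypothetical helper names, not part of DiffDoc.

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Only the fields this example needs, taken from the documented JSON shapes.
interface Manifest {
  schemaVersion: number;
  files: Record<string, string>; // repository-relative path -> content hash
}

interface PortableSummary {
  content_hash: string;
  summary: string;
}

type AssetReader = (contentHash: string) => PortableSummary | undefined;

// Map each manifest path to its summary text, skipping hashes with no asset.
function summariesByPath(
  manifest: Manifest,
  readAsset: AssetReader,
): Map<string, string> {
  const result = new Map<string, string>();
  for (const [filePath, contentHash] of Object.entries(manifest.files)) {
    const asset = readAsset(contentHash);
    if (asset) result.set(filePath, asset.summary);
  }
  return result;
}

// Disk-backed variant that reads the documented artifact layout.
function loadFromDisk(baseDir: string): Map<string, string> {
  const manifest = JSON.parse(
    readFileSync(join(baseDir, "manifest.json"), "utf8"),
  ) as Manifest;
  return summariesByPath(manifest, (contentHash) => {
    const assetPath = join(baseDir, "summaries", `${contentHash}.json`);
    if (!existsSync(assetPath)) return undefined;
    return JSON.parse(readFileSync(assetPath, "utf8")) as PortableSummary;
  });
}
```

The resulting map can be fed to any embedding provider or search system; `loadFromDisk("./.diffdoc")` would read the default artifact directory.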
MCP Server
DiffDoc ships an MCP stdio server as diffdoc-mcp. Run summarize and embed before using it so the MCP tools have a local index to query.
Run the server manually:
npx diffdoc-mcp --config ./.diffdocrc
Example MCP client configuration:
{
  "mcpServers": {
    "diffdoc": {
      "command": "npx",
      "args": ["diffdoc-mcp", "--config", "./.diffdocrc"]
    }
  }
}
Available MCP tools:
- diffdoc_search: search the local index and return matching files, summaries, scores, hashes, and optional code snapshots
- diffdoc_answer: retrieve relevant context and ask the configured chat model to answer a question
- diffdoc_index_stats: return index path, existence status, and indexed item count
CI
For CI, prefer environment variables or a generated config file instead of committing local credentials.
Typical CI flow:
npm ci
npx diffdoc summarize --path . --mode delta --json
npx diffdoc embed
Use summarize --json and status --json when a workflow needs machine-readable output.
Commit the manifest and summary assets from CI if you want DiffDoc state to advance with the branch. Ignore .diffdoc/vectra/ unless your workflow intentionally persists the local index.
Notes
- summarize requires a configured chat model.
- embed and search require a configured embedding model.
- query requires both chat and embedding configuration.
- status does not require chat or embedding configuration.
- Delta summarization uses Git changes plus the existing manifest state.
- Manifest schema is currently schemaVersion: 2; older manifest shapes are not auto-migrated.
- For code-oriented embedding models such as nomic-embed-code, DiffDoc prefixes query embeddings with "Represent this query for searching relevant code:".
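Teams that bring their own embedding pipeline may want to apply the same query prefix so that search results stay comparable. The sketch below shows one way to do that. How DiffDoc decides that a model is code-oriented is not documented here, so the name check in `isCodeEmbeddingModel` is a hypothetical stand-in, and the single space after the prefix is also an assumption.

```typescript
// Prefix text quoted in the note above; the trailing space is an assumption.
const CODE_QUERY_PREFIX = "Represent this query for searching relevant code: ";

// Hypothetical detection rule: treat model names containing "embed-code" as code-oriented.
function isCodeEmbeddingModel(model: string): boolean {
  return model.toLowerCase().includes("embed-code");
}

// Return the text to embed for a search query. Summaries themselves are not prefixed here.
function prepareQueryText(model: string, query: string): string {
  return isCodeEmbeddingModel(model) ? CODE_QUERY_PREFIX + query : query;
}
```

For a general-purpose model such as text-embedding-3-small, this helper returns the query unchanged.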
