gitwiki-cli
v0.1.1
Turn a repo's PR history into a living wiki. Installs the `gitwiki` CLI and a `git wiki` subcommand.
gitwiki
Turn a repo's PR history into a living wiki.
gitwiki ingests merged pull requests (and optionally linked issues, commits, and review comments), extracts structured facts with an LLM, and maintains a cross-linked Markdown wiki plus a SQLite graph of how pages relate.
Install
Requires Node 20+.
# npm (global)
npm install -g gitwiki
# Homebrew (via the tap, once published)
brew install yourorg/tap/gitwiki
Installing registers two equivalent binaries on your PATH:
- `gitwiki`: the CLI directly.
- `git-wiki`: picked up by git as a subcommand, so you can run `git wiki <cmd>`.
From source
git clone <repo-url> gitwiki && cd gitwiki
npm install
npm run build
npm link # exposes `gitwiki` and `git wiki` on your PATH
Credentials
Set via env vars or ~/.gitwiki/credentials (KEY=VALUE per line):
- `GITHUB_TOKEN`: GitHub personal access token (read access to the target repo).
- `GEMINI_API_KEY`: Google Gemini API key (also accepts `GOOGLE_API_KEY` / `GOOGLE_GEMINI_API_KEY`).
.env in the working directory is loaded automatically.
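As a rough illustration of the `KEY=VALUE` credentials format, the sketch below parses such a file and picks the first Gemini key alias that is set. The function names, the handling of blank lines and `#` comments, and the alias precedence order are assumptions made for this example, not gitwiki's actual behavior.

```typescript
// Hypothetical helpers; names and precedence are illustrative assumptions.
type Credentials = Record<string, string>;

// Parse "KEY=VALUE per line" text. Skipping blank lines and "#" comments
// is an assumption; the README only specifies the KEY=VALUE shape.
function parseCredentials(text: string): Credentials {
  const out: Credentials = {};
  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.trim();
    if (line === "" || line.startsWith("#")) continue;
    const eq = line.indexOf("=");
    if (eq <= 0) continue; // no "=" at all, or an empty key
    // Split on the first "=" only, so values may themselves contain "=".
    out[line.slice(0, eq).trim()] = line.slice(eq + 1).trim();
  }
  return out;
}

// Return the first Gemini key alias that is set. The order is assumed.
function resolveGeminiKey(
  env: Record<string, string | undefined>,
): string | undefined {
  const aliases = ["GEMINI_API_KEY", "GOOGLE_API_KEY", "GOOGLE_GEMINI_API_KEY"];
  for (const name of aliases) {
    const value = env[name];
    if (value) return value;
  }
  return undefined;
}
```

In a Node process, `resolveGeminiKey(process.env)` would be the natural call site.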
Usage
Run from inside a git checkout — gitwiki auto-detects owner/name from the origin remote. Pass it explicitly to override.
# First run: ingest all PR history for the current repo
git wiki init
# …or target any repo
git wiki init owner/name
# Incrementally pull new PRs since the last cursor
git wiki update
# Rebuild the landing page without fetching
git wiki index
# Agentically explore the wiki and suggest next things to do
git wiki suggest [--focus reliability] [--write]
# Inspect store state
git wiki status
# Validate wiki frontmatter
git wiki verify
# List configured sources
git wiki sources
All commands are also available as `gitwiki <cmd>` if you prefer.
Common flags:
- `--store <path>`: store location (default `.gitwiki/`).
- `--from-pr <n>` / `--to-pr <n>`: bound the PR range on `init`.
- `--model <name>`: override the Gemini model.
- `--dry-run`: fetch and cache raw data, skip LLM and wiki writes.
- `--resume`: continue an interrupted `init`.
- `--offline`: reprocess cached raw data without hitting GitHub.
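The owner/name auto-detection mentioned at the top of this section could work roughly as follows. This is a sketch only: `parseOwnerRepo` is a hypothetical name, and it handles just the common HTTPS and SSH remote URL forms, which may not match what gitwiki actually accepts.

```typescript
// Hypothetical helper: derive "owner/name" from a git remote URL.
// Covers https://host/owner/name(.git), git@host:owner/name(.git) and
// ssh://git@host/owner/name(.git); anything else yields null.
function parseOwnerRepo(remoteUrl: string): string | null {
  const match = remoteUrl
    .trim()
    .match(
      /^(?:https?:\/\/|ssh:\/\/git@|git@)[^\/:]+[\/:]([^\/]+)\/([^\/]+?)(?:\.git)?\/?$/,
    );
  if (!match) return null;
  return `${match[1]}/${match[2]}`;
}
```

The input would typically be the output of `git remote get-url origin`; an explicit `owner/name` argument would bypass this step entirely.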
Store layout
Everything lives under the store directory (default .gitwiki/):
- `config.json`: repo binding and tuning knobs.
- `raw/`: cached GitHub payloads per PR.
- `graph.sqlite`: pages, edges, cursor.
- `wiki/`: generated Markdown pages with YAML frontmatter.
Configuration
config.json is seeded from defaults on init. Adjustable fields include model, halfLifeDays, per-stage concurrency, what to include (review comments, linked issues, commits), and byte-level truncation limits for bodies, diffs, commit messages, and issue bodies.
Scripts
- `npm run dev`: run the CLI via `tsx` without building.
- `npm run build`: compile TypeScript to `dist/`.
- `npm run typecheck`: type-check without emitting.
- `npm start`: run the built CLI.
Design notes for the next agent
This section captures what has been iterated on so far and why, so the next person (or agent) picking this up doesn't re-derive the constraints. Read top-down — earliest changes first.
Initial architecture (v0)
- Three-layer store, following the Karpathy "LLM Wiki v2" pattern:
  - `raw/`: append-only JSONL of PR metadata + gzipped diffs + cached extractions.
  - `wiki/`: markdown pages with YAML frontmatter (tier, confidence, sources, supersedes).
  - `gitwiki-cli.db`: SQLite graph (cursor, nodes, typed edges).
- Strict per-PR serial pipeline (fetch → extract → synthesize → graph-write → cursor-advance) in one SQLite transaction so the cursor only advances after all side-effects succeed. This is the resumption invariant.
- Pluggable `IssueSource` abstraction with a working `GitHubIssueSource` (regex-parses `Fixes #N` / cross-repo refs, hydrates issue bodies). Linear/Jira are stubbed. Keep this interface stable; it is the extension point for other trackers.
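To make the `Fixes #N` / cross-repo parsing concrete, here is a minimal sketch of such a regex pass. The `IssueRef` shape and `parseIssueRefs` name are hypothetical, and the keyword list follows GitHub's documented closing keywords (fix, close, resolve and their variants); the real `GitHubIssueSource` may recognize a different set.

```typescript
// Illustrative only: the return shape and keyword list are assumptions.
interface IssueRef {
  repo: string | null; // "owner/name" for cross-repo refs, null for same-repo
  number: number;
}

function parseIssueRefs(prBody: string): IssueRef[] {
  // Matches e.g. "Fixes #12" and "Closes owner/name#34" (case-insensitive).
  const pattern =
    /\b(?:fix(?:es|ed)?|close[sd]?|resolve[sd]?)\s+([\w.-]+\/[\w.-]+)?#(\d+)/gi;
  const refs: IssueRef[] = [];
  let m: RegExpExecArray | null;
  while ((m = pattern.exec(prBody)) !== null) {
    // Group 1 is absent for same-repo references.
    refs.push({ repo: m[1] ?? null, number: Number(m[2]) });
  }
  return refs;
}
```

A bare `#99` with no keyword is deliberately ignored here, since it does not express a "fixes" relationship.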
Caching and offline iteration
- 24h cache for repo metadata (`raw/repo.json`): README, description, topics, default branch, license, stars. Avoids hitting `GET /repos` on every run.
- Per-PR GitHub cache: if a PR already lives in `raw/prs.jsonl`, the fetcher skips the 6-call hydration and yields the cached row instead. Only the paging request still costs API quota.
- `--offline` flag on `init`/`update`: iterates purely from on-disk raw data, never calls GitHub, and (in offline mode only) ignores the cursor so you can re-extract and re-synthesize with new prompts without burning rate limits. Essential for iterating on prompts/schemas.
- Extraction cache (`raw/extractions/<prNumber>.<inputHash>.json`) keyed by prompt hash. Changing the extractor prompt invalidates the cache automatically. Delete the directory manually when you want a hard reset.
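The sketch below shows why a hash in the file name makes prompt changes invalidate the extraction cache on their own, plus the kind of 24h freshness check the repo-metadata cache implies. Everything here is an assumption for illustration: the helper names, which inputs feed the hash, and the hash itself (a dependency-free 32-bit FNV-1a stand-in; a real implementation would more likely use something like SHA-256).

```typescript
// Stand-in hash (32-bit FNV-1a) so the sketch needs no imports.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5; // FNV offset basis
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    // Multiply by the FNV prime 16777619 via shifts, then wrap to 32 bits.
    hash =
      (hash + (hash << 1) + (hash << 4) + (hash << 7) + (hash << 8) + (hash << 24)) >>> 0;
  }
  return ("00000000" + hash.toString(16)).slice(-8);
}

// Hypothetical: hashing prompt + payload together is an assumption.
// A new prompt yields a new file name, so old cache entries are never hit.
function extractionCachePath(prNumber: number, prompt: string, prPayload: string): string {
  const inputHash = fnv1a(prompt + "\n" + prPayload);
  return "raw/extractions/" + prNumber + "." + inputHash + ".json";
}

// 24h freshness check for cached repo metadata.
const DAY_MS = 24 * 60 * 60 * 1000;
function isFresh(fetchedAtMs: number, nowMs: number): boolean {
  return nowMs - fetchedAtMs < DAY_MS;
}
```

Stale files from older prompts are simply never read again, which is why a manual delete is only needed for a hard reset.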
Living landing page (not just a TOC)
- `README.md` is maintained incrementally per PR, the same way component pages are: every PR triggers an `updateReadme` call with the existing README + the PR's extraction. The first PR seeds the page from `repo.readme`; later PRs revise in place.
- Explicit section scaffold: What is this / Quick start / When NOT to use this / Architecture at a glance / Key engineering decisions / Build, test, deploy / Observability & debugging / Gotchas / Recent activity / Where to look next.
- `index.md` is deterministic: lists every page on disk grouped by directory, with a tier/confidence legend. Do NOT let the LLM touch this file; it is the machine-readable TOC.
- `CONTRIBUTING.md` and `how-to.md` are LLM-synthesized once at the end of each run from procedural pages + recipes the extractor tagged.
- `glossary.md` is a static template (`src/templates/glossary.ts`): tier / confidence / edges / node kinds.
Extraction schema and prompt rules
Key rules baked into the extractor prompt (in priority order):
- Canonical id prefixes only: `architecture/*`, `components/*`, `decisions/*`, `ops/*`, `guides/*`. Kebab-case slugs, no numbers on ADRs.
- Do NOT create per-test-file pages (`components/X-tests`). Testing knowledge goes on `architecture/testing-suite` AND the subject's component page.
- ADR slug dedup is critical. Jaccard-60% on significant tokens between a proposed slug and any existing one → reuse the existing id via `affected_components`.
- Tier defaults to `semantic`. `procedural` for stable ops/CI/release workflows. `working` only for genuinely in-flux subsystems. `episodic` only for superseded content.
- Extractor also emits `component_paths` (source files touched per component) and `how_to` (imperative-question recipes with steps).
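The Jaccard-60% slug rule can be sketched as follows. `tokenJaccard` is named after the check described in these notes, but this body is a guess at its behavior: in particular, approximating "significant tokens" by dropping a small stop-word list is an assumption, as are the `significantTokens` and `findExistingSlug` helpers.

```typescript
// Assumed stop-word list; the real definition of "significant" is not documented here.
const STOP_WORDS = new Set(["a", "an", "the", "of", "for", "to", "and", "in", "on", "with"]);

function significantTokens(id: string): Set<string> {
  // Compare slugs only, so the shared "decisions/" prefix does not inflate scores.
  const slug = id.slice(id.lastIndexOf("/") + 1).toLowerCase();
  const tokens = slug.split(/[^a-z0-9]+/).filter((t) => t !== "" && !STOP_WORDS.has(t));
  return new Set(tokens);
}

// Jaccard similarity: |A ∩ B| / |A ∪ B| over significant tokens.
function tokenJaccard(a: string, b: string): number {
  const ta = significantTokens(a);
  const tb = significantTokens(b);
  if (ta.size === 0 && tb.size === 0) return 0;
  let shared = 0;
  ta.forEach((t) => {
    if (tb.has(t)) shared++;
  });
  return shared / (ta.size + tb.size - shared);
}

// Return the best-matching existing id at or above the threshold, else null.
function findExistingSlug(proposed: string, existing: string[], threshold = 0.6): string | null {
  let best: string | null = null;
  let bestScore = 0;
  for (const id of existing) {
    const score = tokenJaccard(proposed, id);
    if (score >= threshold && score > bestScore) {
      best = id;
      bestScore = score;
    }
  }
  return best;
}
```

For example, `adopt-sqlite-graph-store` and `use-sqlite-graph-store` share three of five distinct tokens, which lands exactly on the 0.6 threshold and would be treated as the same decision.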
Synthesizer rules (per page)
- First line is the Source marker: ``**Source:** `path/a`, `path/b` · browse on GitHub``. The system rebuilds this line deterministically; the LLM only supplies the backticked paths.
- Update in place, do NOT append. When a PR refines an existing sentence, rewrite the sentence. The page must read as a current-state snapshot, not a changelog. Only `## History` is append-only.
## Historyis append-only. - No meta-framing: never "This page documents…" or "This project provides…". Start with the fact.
- Group decisions under subheadings when a page accumulates 10+ decisions.
- Rationale ≠ Decisions: rationale explains why, never restates a decision.
Hardening passes at run end
- Backlinks injection on every page: for each node with inbound edges (non-PR, non-issue sources), append a `## See also` section listing the src nodes + edge type. Idempotent: the previous section is stripped before rewriting.
- Link validator (`src/util/linkValidator.ts`): scans every page body + README for `[label](x.md)` links and either repairs (same basename in a different dir → rewrite) or unlinks (drop brackets, keep label) when the target doesn't exist.
- PR reference hyperlinker: turns every plain `(PR #N)` into `[PR #N](https://github.com/owner/repo/pull/N)` with negative-lookbehind to skip already-linked refs.
- Source path normalizer (`src/util/links.ts::linkSourcePaths`): rebuilds the `**Source:**` line from whatever format the LLM emitted, into backticks + one trailing GitHub-tree link.
- Fuzzy ADR dedup in the orchestrator: before creating `decisions/<slug>`, scans existing `decisions/*` ids with `tokenJaccard >= 0.6` and reuses the match. Belt-and-suspenders on top of the extractor prompt rule.
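As an example of how one of these passes can stay idempotent, here is a minimal sketch of the PR reference hyperlinker. The function name and exact pattern are assumptions; in particular, this version wraps only the `PR #N` text and leaves any surrounding parentheses in place, which may differ from the real pass.

```typescript
// Hypothetical sketch of the PR hyperlinker pass.
function linkPrRefs(markdown: string, ownerRepo: string): string {
  // (?<!\[) is a negative lookbehind: a ref directly preceded by "[" is
  // already the label of a Markdown link and is left untouched.
  return markdown.replace(
    /(?<!\[)PR #(\d+)/g,
    (_match: string, num: string) =>
      `[PR #${num}](https://github.com/${ownerRepo}/pull/${num})`,
  );
}
```

Because linked refs are skipped, running the pass twice over the same page yields the same text as running it once.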
Confidence and supersession
- Stored `confidence` starts at 0.6 for new pages, +0.1 per reinforcement (capped at 1.0).
- Decay is computed at read time, not written: `stored * exp(-ln(2) * days / halfLifeDays)`. This keeps `git diff` on the wiki clean between runs.
- Supersession drops confidence to 0.2, moves the page to `episodic` tier, and prepends a "Superseded by PR #N" banner.
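The confidence rules above translate into a few lines of arithmetic. The constants and the decay formula come straight from the bullets; the function and constant names are illustrative, not the project's own.

```typescript
// Constants taken from the rules above; names are hypothetical.
const INITIAL_CONFIDENCE = 0.6;
const REINFORCEMENT_STEP = 0.1;
const SUPERSEDED_CONFIDENCE = 0.2;

// Each reinforcement adds 0.1, capped at 1.0.
function reinforce(stored: number): number {
  return Math.min(1.0, stored + REINFORCEMENT_STEP);
}

// Read-time decay: stored * exp(-ln(2) * days / halfLifeDays).
// After exactly one half-life the effective value is half the stored one.
function effectiveConfidence(stored: number, days: number, halfLifeDays: number): number {
  return stored * Math.exp((-Math.LN2 * days) / halfLifeDays);
}
```

With a hypothetical `halfLifeDays` of 90, a page stored at 0.8 reads as about 0.4 after 90 days and about 0.2 after 180, while the value on disk never changes.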
CLI commands worth knowing
- `gitwiki-cli init <owner/repo> [--offline] [--resume]`: ingest chronologically.
- `gitwiki-cli update [--offline]`: continue from cursor.
- `gitwiki-cli index [--skip-refresh]`: regenerate only landing-page artifacts (glossary / README / index / CONTRIBUTING / how-to) without reprocessing PRs. Cheap and handy while tuning the final prompts.
- `gitwiki-cli suggest [--focus X] [--write]`: agentic "what should I look at next" over the wiki. Writes `wiki/SUGGESTIONS.md` with `--write`.
- `gitwiki-cli status` / `gitwiki-cli verify` / `gitwiki-cli sources`: diagnostics.
Known rough edges
- Some ADRs still end up with near-duplicate slugs when the extractor thinks of them as distinct. The fuzzy dedup catches most but not all; a follow-up pass could merge post-hoc by adding `supersedes` edges.
- `components/*-tests` proliferation is prompt-controlled; if you see it recurring, tighten the extractor's "do not create per-test-file pages" rule.
- `gemini-3-flash-preview` and `gemini-3.1-flash-lite-preview` are the verified model ids as of this writing. `gemini-3-flash` / `gemini-3.1-flash-lite` (without `-preview`) return 404.
- Fallback `indexSynthesizer.synthesizeReadme()` is still kept around for stores that somehow have no per-PR README (e.g. dry-run only runs). Prefer the incremental path.
Regeneration workflow
To iterate on prompts without re-fetching GitHub:
# 1. Archive current state but keep raw/.
TS=$(date +%Y%m%d-%H%M%S)
mv .gitwiki-cli-liteparse/wiki .gitwiki-cli-liteparse/wiki.archive.$TS
mv .gitwiki-cli-liteparse/gitwiki-cli.db .gitwiki-cli-liteparse/gitwiki-cli.db.archive.$TS
rm -f .gitwiki-cli-liteparse/gitwiki-cli.db-{shm,wal}
rm -rf .gitwiki-cli-liteparse/raw/extractions # force fresh extractions
# 2. Regenerate from cache.
npx tsx src/cli/index.ts init run-llama/liteparse --store .gitwiki-cli-liteparse --offline --resume
This reuses all cached PRs + diffs + repo metadata. The only API cost is Gemini (≈ $0.10 for ~30 PRs on Flash-Lite).
