gitwiki-cli

v0.1.1

Published

a month ago

Turn a repo's PR history into a living wiki. Installs the `gitwiki` CLI and a `git wiki` subcommand.

0High
0Medium
0Low

hexapode

git wiki github pull-requests documentation cli

gitwiki

Turn a repo's PR history into a living wiki.

gitwiki ingests merged pull requests (and optionally linked issues, commits, and review comments), extracts structured facts with an LLM, and maintains a cross-linked Markdown wiki plus a SQLite graph of how pages relate.

Install

Requires Node 20+.

# npm (global)
npm install -g gitwiki

# Homebrew (via the tap, once published)
brew install yourorg/tap/gitwiki

Installing registers two equivalent binaries on your PATH:

gitwiki — the CLI directly.
git-wiki — picked up by git as a subcommand, so you can run git wiki <cmd>.

From source

git clone <repo-url> gitwiki && cd gitwiki
npm install
npm run build
npm link   # exposes `gitwiki` and `git wiki` on your PATH

Credentials

Set via env vars or ~/.gitwiki/credentials (KEY=VALUE per line):

GITHUB_TOKEN — GitHub personal access token (read access to the target repo).
GEMINI_API_KEY — Google Gemini API key (also accepts GOOGLE_API_KEY / GOOGLE_GEMINI_API_KEY).

.env in the working directory is loaded automatically.

Usage

Run from inside a git checkout — gitwiki auto-detects owner/name from the origin remote. Pass it explicitly to override.

# First run: ingest all PR history for the current repo
git wiki init
# …or target any repo
git wiki init owner/name

# Incrementally pull new PRs since the last cursor
git wiki update

# Rebuild the landing page without fetching
git wiki index

# Agentically explore the wiki and suggest next things to do
git wiki suggest [--focus reliability] [--write]

# Inspect store state
git wiki status

# Validate wiki frontmatter
git wiki verify

# List configured sources
git wiki sources

All commands are also available as gitwiki <cmd> if you prefer.

Common flags:

--store <path> — store location (default .gitwiki/).
--from-pr <n> / --to-pr <n> — bound the PR range on init.
--model <name> — override the Gemini model.
--dry-run — fetch and cache raw data, skip LLM and wiki writes.
--resume — continue an interrupted init.
--offline — reprocess cached raw data without hitting GitHub.

Store layout

Everything lives under the store directory (default .gitwiki/):

config.json — repo binding and tuning knobs.
raw/ — cached GitHub payloads per PR.
graph.sqlite — pages, edges, cursor.
wiki/ — generated Markdown pages with YAML frontmatter.

Configuration

config.json is seeded from defaults on init. Adjustable fields include model, halfLifeDays, per-stage concurrency, what to include (review comments, linked issues, commits), and byte-level truncation limits for bodies, diffs, commit messages, and issue bodies.

Scripts

npm run dev — run the CLI via tsx without building.
npm run build — compile TypeScript to dist/.
npm run typecheck — type-check without emitting.
npm start — run the built CLI.

Design notes for the next agent

This section captures what has been iterated on so far and why, so the next person (or agent) picking this up doesn't re-derive the constraints. Read top-down — earliest changes first.

Initial architecture (v0)

Three-layer store, following the Karpathy "LLM Wiki v2" pattern:
- raw/ — append-only JSONL of PR metadata + gzipped diffs + cached extractions.
- wiki/ — markdown pages with YAML frontmatter (tier, confidence, sources, supersedes).
- gitwiki-cli.db — SQLite graph (cursor, nodes, typed edges).
Strict per-PR serial pipeline (fetch → extract → synthesize → graph-write → cursor-advance) in one SQLite transaction so the cursor only advances after all side-effects succeed. This is the resumption invariant.
Pluggable IssueSource abstraction with a working GitHubIssueSource (regex-parses Fixes #N / cross-repo refs, hydrates issue bodies). Linear/Jira are stubbed. Keep this interface stable — it is the extension point for other trackers.

Caching and offline iteration

24h cache for repo metadata (raw/repo.json) — README, description, topics, default branch, license, stars. Avoids hitting GET /repos on every run.
Per-PR GitHub cache: if a PR already lives in raw/prs.jsonl, the fetcher skips the 6-call hydration and yields the cached row instead. Only the paging request still costs API quota.
--offline flag on init / update: iterates purely from on-disk raw data, never calls GitHub, and (in offline mode only) ignores the cursor so you can re-extract and re-synthesize with new prompts without burning rate limits. Essential for iterating on prompts/schemas.
Extraction cache (raw/extractions/<prNumber>.<inputHash>.json) keyed by prompt hash. Changing the extractor prompt invalidates the cache automatically. Delete the directory manually when you want a hard reset.

Living landing page (not just a TOC)

README.md is maintained incrementally per PR, the same way component pages are — every PR triggers an updateReadme call with the existing README + the PR's extraction. The first PR seeds the page from repo.readme; later PRs revise in place.
Explicit section scaffold: What is this / Quick start / When NOT to use this / Architecture at a glance / Key engineering decisions / Build, test, deploy / Observability & debugging / Gotchas / Recent activity / Where to look next.
index.md is deterministic: lists every page on disk grouped by directory, with a tier/confidence legend. Do NOT let the LLM touch this file — it is the machine-readable TOC.
CONTRIBUTING.md and how-to.md are LLM-synthesized once at the end of each run from procedural pages + recipes the extractor tagged.
glossary.md is a static template (src/templates/glossary.ts) — tier / confidence / edges / node kinds.

Extraction schema and prompt rules

Key rules baked into the extractor prompt (in priority order):

Canonical id prefixes only: architecture/*, components/*, decisions/*, ops/*, guides/*. Kebab-case slugs, no numbers on ADRs.
Do NOT create per-test-file pages (components/X-tests). Testing knowledge goes on architecture/testing-suite AND the subject's component page.
ADR slug dedup is critical. Jaccard-60% on significant tokens between a proposed slug and any existing one → reuse the existing id via affected_components.
Tier defaults to semantic. procedural for stable ops/CI/release workflows. working only for genuinely in-flux subsystems. episodic only for superseded content.
Extractor also emits component_paths (source files touched per component) and how_to (imperative-question recipes with steps).

Synthesizer rules (per page)

First line is the Source marker: **Source:** \path/a`, `path/b` · browse on GitHub`. The system rebuilds this line deterministically — the LLM only supplies the backticked paths.
Update in place, do NOT append. When a PR refines an existing sentence, rewrite the sentence. The page must read as a current-state snapshot, not a changelog. Only ## History is append-only.
No meta-framing: never "This page documents…" or "This project provides…". Start with the fact.
Group decisions under subheadings when a page accumulates 10+ decisions.
Rationale ≠ Decisions: rationale explains why, never restates a decision.

Hardening passes at run end

Backlinks injection on every page: for each node with inbound edges (non-PR, non-issue sources), append a ## See also section listing the src nodes + edge type. Idempotent — previous section is stripped before rewriting.
Link validator (src/util/linkValidator.ts): scans every page body + README for [label](x.md) links and either repairs (same basename in a different dir → rewrite) or unlinks (drop brackets, keep label) when the target doesn't exist.
PR reference hyperlinker: turns every plain (PR #N) into [PR #N](https://github.com/owner/repo/pull/N) with negative-lookbehind to skip already-linked refs.
Source path normalizer (src/util/links.ts::linkSourcePaths): rebuilds the **Source:** line from whatever format the LLM emitted, into backticks + one trailing GitHub-tree link.
Fuzzy ADR dedup in the orchestrator: before creating decisions/<slug>, scans existing decisions/* ids with tokenJaccard >= 0.6 and reuses the match. Belt-and-suspenders on top of the extractor prompt rule.

Confidence and supersession

Stored confidence starts at 0.6 for new pages, +0.1 per reinforcement (capped at 1.0).
Decay is computed at read time, not written — stored * exp(-ln(2) * days / halfLifeDays). This keeps git diff on the wiki clean between runs.
Supersession drops confidence to 0.2, moves the page to episodic tier, and prepends a "Superseded by PR #N" banner.

CLI commands worth knowing

gitwiki-cli init <owner/repo> [--offline] [--resume] — ingest chronologically.
gitwiki-cli update [--offline] — continue from cursor.
gitwiki-cli index [--skip-refresh] — regenerate only landing-page artifacts (glossary / README / index / CONTRIBUTING / how-to) without reprocessing PRs. Cheap and handy while tuning the final prompts.
gitwiki-cli suggest [--focus X] [--write] — agentic "what should I look at next" over the wiki. Writes wiki/SUGGESTIONS.md with --write.
gitwiki-cli status / gitwiki-cli verify / gitwiki-cli sources — diagnostics.

Known rough edges

Some ADRs still end up with near-duplicate slugs when the extractor thinks of them as distinct. The fuzzy dedup catches most but not all — a follow-up pass could merge post-hoc by adding supersedes edges.
components/*-tests proliferation is prompt-controlled; if you see it recurring, tighten the extractor's "do not create per-test-file pages" rule.
gemini-3-flash-preview and gemini-3.1-flash-lite-preview are the verified model ids as of this writing. gemini-3-flash / gemini-3.1-flash-lite (without -preview) return 404.
Fallback indexSynthesizer.synthesizeReadme() is still kept around for stores that somehow have no per-PR README (e.g. dry-run only runs). Prefer the incremental path.

Regeneration workflow

To iterate on prompts without re-fetching GitHub:

# 1. Archive current state but keep raw/.
TS=$(date +%Y%m%d-%H%M%S)
mv .gitwiki-cli-liteparse/wiki .gitwiki-cli-liteparse/wiki.archive.$TS
mv .gitwiki-cli-liteparse/gitwiki-cli.db .gitwiki-cli-liteparse/gitwiki-cli.db.archive.$TS
rm -f .gitwiki-cli-liteparse/gitwiki-cli.db-{shm,wal}
rm -rf .gitwiki-cli-liteparse/raw/extractions  # force fresh extractions

# 2. Regenerate from cache.
npx tsx src/cli/index.ts init run-llama/liteparse --store .gitwiki-cli-liteparse --offline --resume

This reuses all cached PRs + diffs + repo metadata. The only API cost is Gemini (≈ $0.10 for ~30 PRs on Flash-Lite).