docpup
v0.1.10
CLI for indexing documentation from GitHub repositories.
CLI tool to clone GitHub documentation and source code, generating AGENTS.md indexes for AI coding agents.
What it does
Docpup fetches documentation or source code from GitHub repositories using sparse checkout, copies files to a local directory, and generates compact index files in the AGENTS.md format. These indexes provide persistent context to AI coding agents.
Supports:
- Documentation files (.md, .mdx)
- Source code with custom extensions (.ts, .js, .py, etc.)
- Selective directory fetching (e.g., only src and samples)
- Single file fetching (e.g., just README.md)
- Fetching docs directly from URLs (live HTML pages converted to Markdown)
Paths in the config are resolved from the current working directory where you run the CLI.
For git repos, docpup keeps a root-level docpup-lock.json file. Each run resolves the current remote commit first and skips re-downloading repos whose commit, generation inputs, and outputs are unchanged.
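The skip decision described above can be sketched as follows. This is a minimal illustration, not docpup's actual implementation: the layout of docpup-lock.json is not documented here, so the field names `commit` and `inputsHash` and the function name `should_skip` are hypothetical.

```python
from typing import Optional


def should_skip(lock_entry: Optional[dict], remote_commit: str,
                inputs_hash: str, outputs_exist: bool) -> bool:
    """Return True when a git repo can be skipped (hypothetical field names)."""
    if lock_entry is None:
        return False  # no lock entry yet, so the repo was never generated
    if not outputs_exist:
        return False  # docs or index files are missing, so rebuild
    # Skip only when both the commit and the generation inputs are unchanged.
    return (lock_entry.get("commit") == remote_commit
            and lock_entry.get("inputsHash") == inputs_hash)
```

Resolving the remote commit before cloning is what makes the check cheap: the comparison needs only a commit hash, not the repository contents.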
Installation
    npm install -g docpup

Or run directly with npx:

    npx docpup generate

Quick Start
- Create a docpup.config.yaml in your project root:

    docsDir: documentation
    indicesDir: documentation/indices
    repos:
      - name: nextjs
        repo: https://github.com/vercel/next.js
        sourcePath: docs
        ref: canary

- Run docpup:

    docpup generate

- Find your docs in documentation/nextjs/, the index in documentation/indices/nextjs-index.md, and the git freshness metadata in docpup-lock.json.
Configuration
Full Configuration Example
    docsDir: documentation
    indicesDir: documentation/indices
    gitignore:
      addDocsDir: true
      addDocsSubDirs: false
      addIndexFiles: false
      sectionHeader: "Docpup generated docs"
    scan:
      includeMd: true
      includeMdx: true
      includeHiddenDirs: false
      excludeDirs:
        - .git
        - node_modules
        - images
        - assets
    concurrency: 2
    repos:
      # Traditional documentation indexing
      - name: nextjs
        repo: https://github.com/vercel/next.js
        sourcePath: docs
        ref: canary
      - name: auth0-docs
        repo: https://github.com/auth0/docs-v2
        sourcePath: main/docs
      # Source code indexing with multiple directories
      - name: codex-sdk
        repo: https://github.com/openai/codex
        contentType: source
        sourcePaths:
          - sdk/typescript/src
          - sdk/typescript/samples
        scan:
          extensions: [".ts", ".tsx"]
      # Single file indexing
      - name: codex-readme
        repo: https://github.com/openai/codex
        sourcePaths:
          - sdk/typescript/README.md
      # URL-based documentation fetching
      - name: claude-docs
        urls:
          - https://docs.anthropic.com/en/docs/overview
          - https://docs.anthropic.com/en/docs/quickstart
        selector: main
        password: ${DOCS_PASSWORD}
      # Sitemap-based documentation discovery
      - name: anthropic-api-docs
        sitemap: https://platform.claude.com/sitemap.xml
        paths:
          - prefix: docs/en/api
            subs:
              - sdks
        selector: main

Configuration Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| docsDir | string | "documentation" | Output directory for copied docs |
| indicesDir | string | "documentation/indices" | Output directory for index files |
| gitignore.addDocsDir | boolean | true | Add docs directory to .gitignore |
| gitignore.addDocsSubDirs | boolean | false | Add per-repo subdirectories to .gitignore (e.g., docs/nextjs/) instead of whole docs dir |
| gitignore.addIndexFiles | boolean | false | Add indices directory to .gitignore |
| gitignore.sectionHeader | string | "Docpup generated docs" | Header for .gitignore section |
| scan.includeMd | boolean | true | Include .md files (ignored if extensions is set) |
| scan.includeMdx | boolean | true | Include .mdx files (ignored if extensions is set) |
| scan.includeHiddenDirs | boolean | false | Scan hidden directories (dotfolders) |
| scan.excludeDirs | string[] | [...] | Directories to exclude |
| scan.extensions | string[] | - | Custom file extensions to include (e.g., [".ts", ".js"]). Overrides includeMd/includeMdx |
| concurrency | number | 2 | Number of repos to process in parallel |
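The interaction between scan.extensions and the includeMd/includeMdx flags can be illustrated with a short sketch. It assumes the scan config has already been parsed into a plain dictionary; the function name `effective_extensions` is hypothetical and not part of docpup.

```python
def effective_extensions(scan: dict) -> list[str]:
    """Return the file extensions a scan would include (illustrative only)."""
    extensions = scan.get("extensions")
    if extensions:
        # A custom list overrides includeMd/includeMdx entirely.
        return list(extensions)
    result = []
    if scan.get("includeMd", True):   # documented default: true
        result.append(".md")
    if scan.get("includeMdx", True):  # documented default: true
        result.append(".mdx")
    return result
```

For example, a scan block with `extensions: [".ts", ".tsx"]` and `includeMd: true` would match only .ts and .tsx files under this reading of the table.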
Repo Configuration
| Option | Type | Required | Description |
|--------|------|----------|-------------|
| name | string | Yes | Unique identifier for this repo |
| repo | string | No | GitHub repository URL. Exactly one of repo, urls, or sitemap must be provided |
| urls | string[] | No | List of URLs to fetch docs from. Exactly one of repo, urls, or sitemap must be provided |
| sitemap | string | No | Sitemap URL to discover doc pages. Exactly one of repo, urls, or sitemap must be provided |
| paths | object[] | No | Path prefix rules for filtering sitemap URLs (see Sitemap Sources). sitemap sources only |
| selector | string | No | CSS selector to extract content from HTML pages (e.g., main, article, #content). Used with urls and sitemap |
| password | string | No | Password for protected doc sites (e.g., Readme.com). Supports ${ENV_VAR} interpolation. urls or sitemap sources only |
| sourcePath | string | No | Single path to fetch (use . for root). repo sources require either sourcePath or sourcePaths |
| sourcePaths | string[] | No | Multiple paths to fetch (directories or single files). repo sources require either sourcePath or sourcePaths |
| ref | string | No | Branch, tag, or commit. repo sources only (auto-detects default branch if not specified) |
| contentType | string | No | "docs" (default) or "source" - affects index title and warning message |
| preprocess | object | No | Optional preprocess step (sphinx or html, single path only). repo sources only |
| scan | object | No | Per-repo scan overrides (merged with global scan config) |
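The source-type rules in the table (exactly one of repo, urls, or sitemap; repo-only and sitemap-only options) can be checked along these lines. This is a sketch under the assumption that each repos entry is a plain dictionary; `source_kind` and the error messages are hypothetical, and docpup's own validation may differ.

```python
SOURCE_KEYS = ("repo", "urls", "sitemap")
REPO_ONLY = ("sourcePath", "sourcePaths", "ref", "preprocess")


def source_kind(entry: dict) -> str:
    """Return 'repo', 'urls', or 'sitemap', or raise ValueError (illustrative)."""
    name = entry.get("name")
    present = [key for key in SOURCE_KEYS if entry.get(key)]
    if len(present) != 1:
        raise ValueError(f"{name}: exactly one of repo, urls, or sitemap is required")
    kind = present[0]
    if kind != "repo":
        invalid = [key for key in REPO_ONLY if key in entry]
        if invalid:
            raise ValueError(f"{name}: {', '.join(invalid)} not valid with {kind}")
    if kind != "sitemap" and "paths" in entry:
        raise ValueError(f"{name}: paths is only valid with sitemap")
    return kind
```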
Preprocess
Preprocess currently supports Sphinx and HTML, and is designed to be extended with other preprocessors.
The Sphinx preprocessor uses Sphinx to build docs before scanning. This is useful for projects like Django that rely on reStructuredText includes, substitutions, and directives.
    repos:
      - name: django-docs
        repo: https://github.com/django/django
        sourcePath: docs
        preprocess:
          type: sphinx
          workDir: docs
          builder: markdown
          outputDir: docpup-build

Prerequisites:
- Python 3 on PATH (python)
- Sphinx + Markdown builder:

    python -m pip install sphinx sphinx-markdown-builder

Notes:
- sourcePath (or a single entry in sourcePaths) must exist in the repo (used for sparse checkout).
- If workDir is omitted, it defaults to sourcePath or the single sourcePaths entry.
- builder must be markdown (requires sphinx-markdown-builder).
- outputDir must be a non-hidden directory unless scan.includeHiddenDirs is true.
- Preprocess is not supported with multiple sourcePaths.
HTML Preprocess
The HTML preprocessor converts HTML pages into Markdown before scanning and indexing.
    repos:
      - name: python-genai
        repo: https://github.com/googleapis/python-genai
        sourcePath: docs
        preprocess:
          type: html
          workDir: docs
          outputDir: docpup-build
          rewriteLinks: true

Notes:
- workDir defaults to sourcePath.
- rewriteLinks defaults to true, converting .html/.htm links to .md.
- selector can be used to target a specific content node (e.g., main, article, or #content).
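To make the rewriteLinks behaviour concrete, here is a rough sketch of rewriting .html/.htm link targets to .md in converted Markdown. It is an assumption-laden illustration rather than docpup's code: the function name `rewrite_links` is hypothetical, only inline links of the form `[text](target)` are handled, and leaving absolute URLs untouched is a guess at sensible behaviour.

```python
import re

# Matches the target of an inline Markdown link: ](target)
_LINK = re.compile(r"\]\(([^)\s]+)\)")
# Matches targets that start with a URL scheme such as https: or mailto:
_SCHEME = re.compile(r"[a-zA-Z][a-zA-Z0-9+.-]*:")


def rewrite_links(markdown: str) -> str:
    """Rewrite relative .html/.htm link targets to .md (illustrative only)."""
    def fix(match: re.Match) -> str:
        target = match.group(1)
        if _SCHEME.match(target):
            return match.group(0)  # assumption: absolute URLs are left alone
        path, sep, fragment = target.partition("#")
        path = re.sub(r"\.html?$", ".md", path)
        return f"]({path}{sep}{fragment})"
    return _LINK.sub(fix, markdown)
```

Splitting off the `#fragment` before matching keeps anchors such as `ref.htm#section` intact after the extension changes.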
Source Code Indexing
Docpup can index source code in addition to documentation. Use contentType: source and custom extensions to fetch specific file types:
    repos:
      - name: my-sdk
        repo: https://github.com/example/sdk
        contentType: source
        sourcePaths:
          - src
          - samples
        scan:
          extensions: [".ts", ".tsx", ".js"]
          excludeDirs: [node_modules, dist, __tests__]

This generates an index with a "Source Index" title and appropriate warning:

    <!-- MY-SDK-AGENTS-MD-START -->[my-sdk Source Index]|root: documentation/my-sdk|STOP. This is source code from my-sdk. Search and read files before making changes.|src:{index.ts,client.ts}|samples:{basic.ts}<!-- MY-SDK-AGENTS-MD-END -->

Single File Fetching
You can fetch individual files by specifying file paths in sourcePaths:
    repos:
      - name: project-readme
        repo: https://github.com/example/project
        sourcePaths:
          - README.md
          - docs/CONTRIBUTING.md

URL Sources
Docpup can fetch documentation directly from live HTML pages using the urls option, as an alternative to cloning a Git repository.
    repos:
      - name: claude-docs
        urls:
          - https://docs.anthropic.com/en/docs/overview
          - https://docs.anthropic.com/en/docs/quickstart
        selector: main

For each URL, docpup uses a three-tier fetching strategy:
- Requests the URL with an Accept: text/markdown header
- Tries a .md URL variant (e.g., /overview → /overview.md)
- Falls back to fetching the HTML and converting it to Markdown
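The three tiers above can be sketched as a fallback chain. To keep the example self-contained, the HTTP client and the HTML converter are passed in as callables; `fetch_markdown`, `get`, and `html_to_markdown` are hypothetical names, and the success checks (content-type tests) are assumptions rather than docpup's documented behaviour.

```python
from typing import Callable, Optional

# get(url, headers) returns (content_type, body) on success, or None on failure.
Getter = Callable[[str, dict], Optional[tuple[str, str]]]


def fetch_markdown(url: str, get: Getter,
                   html_to_markdown: Callable[[str], str]) -> Optional[str]:
    """Try Markdown negotiation, then a .md variant, then HTML conversion."""
    # Tier 1: ask the server for Markdown directly.
    response = get(url, {"Accept": "text/markdown"})
    if response and response[0].startswith("text/markdown"):
        return response[1]
    # Tier 2: try the ".md" variant of the URL (/overview -> /overview.md).
    response = get(url.rstrip("/") + ".md", {})
    if response and not response[0].startswith("text/html"):
        return response[1]
    # Tier 3: fetch the HTML page and convert it.
    response = get(url, {})
    if response:
        return html_to_markdown(response[1])
    return None
```

Ordering the tiers this way prefers server-provided Markdown, which avoids lossy HTML conversion whenever the site offers it.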
Filenames are automatically derived from page titles, with common prefixes/suffixes stripped and collisions resolved by appending a numeric suffix.
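Collision handling of this kind might look like the sketch below. The slug rules and the exact suffix format (`-2`, `-3`, ...) are assumptions, prefix/suffix stripping is omitted, and `unique_filename` is a hypothetical helper, not a docpup function.

```python
import re


def unique_filename(title: str, used: set[str]) -> str:
    """Derive a .md filename from a page title, avoiding names already in use."""
    # Lowercase the title and collapse anything non-alphanumeric into hyphens.
    slug = re.sub(r"[^a-z0-9]+", "-", title.lower()).strip("-") or "page"
    candidate = f"{slug}.md"
    counter = 2
    while candidate in used:
        # Assumed suffix format: append -2, -3, ... until the name is free.
        candidate = f"{slug}-{counter}.md"
        counter += 1
    used.add(candidate)
    return candidate
```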
Notes:
- selector is optional. When omitted, docpup falls back through common content elements (main, article, #content, .content, body).
- password can be used to authenticate with password-protected doc sites. Supports ${ENV_VAR} interpolation for secrets.
- sourcePath, sourcePaths, ref, and preprocess are not valid with urls.
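The ${ENV_VAR} interpolation mentioned for password could work roughly as follows. This is a sketch: `interpolate_env` is a hypothetical name, and raising an error for an unset variable is an assumption, since the behaviour in that case is not documented here.

```python
import os
import re
from typing import Mapping, Optional

# Matches ${NAME} where NAME is a conventional environment variable name.
_ENV_REF = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def interpolate_env(value: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace each ${NAME} in value with the matching environment variable."""
    env = os.environ if env is None else env

    def replace(match: re.Match) -> str:
        name = match.group(1)
        if name not in env:
            # Assumption: fail loudly rather than send an empty password.
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_REF.sub(replace, value)
```

Keeping the secret in the environment means the config file itself can be committed without exposing the password.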
Sitemap Sources
Docpup can automatically discover documentation URLs from a sitemap.xml, with path prefix filtering to control which pages to include.
    repos:
      - name: anthropic-api-docs
        sitemap: https://platform.claude.com/sitemap.xml
        paths:
          - prefix: docs/en/api
            subs:
              - sdks
              - skills
        selector: main

The paths array controls which URLs from the sitemap are included:
- First-level children of each prefix are included automatically (e.g., docs/en/api/overview, docs/en/api/errors)
- Nested paths are excluded by default (e.g., docs/en/api/sdks/python)
- subs opts in specific sub-directories at full depth (e.g., subs: [sdks] includes docs/en/api/sdks/python, docs/en/api/sdks/typescript, etc.)
- The prefix page itself (e.g., docs/en/api) is also included if it exists in the sitemap
When paths is omitted, all URLs from the sitemap are included without filtering.
Sitemap index files (sitemaps that reference other sitemaps) are handled automatically.
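The path-filtering rules can be expressed as a small predicate. This sketch assumes the paths rules are plain dictionaries with `prefix` and optional `subs` keys, as in the config above; `matches_paths` is a hypothetical name and edge cases (trailing slashes, query strings) may be handled differently by docpup.

```python
from urllib.parse import urlparse


def matches_paths(url: str, rules: list[dict]) -> bool:
    """Decide whether a sitemap URL passes the paths filter (illustrative)."""
    if not rules:
        return True  # paths omitted: every sitemap URL is included
    path = urlparse(url).path.strip("/")
    for rule in rules:
        prefix = rule["prefix"].strip("/")
        if path == prefix:
            return True  # the prefix page itself
        if not path.startswith(prefix + "/"):
            continue
        rest = path[len(prefix) + 1:].split("/")
        if len(rest) == 1:
            return True  # first-level child of the prefix
        if rest[0] in rule.get("subs", []):
            return True  # opted-in sub-directory, at any depth
    return False
```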
Notes:
- sourcePath, sourcePaths, ref, and preprocess are not valid with sitemap.
- password can be used to authenticate with password-protected doc sites.
- paths is only valid with sitemap.
CLI Usage
    # Run with default config
    docpup generate

    # Specify config file
    docpup generate --config ./custom-config.yaml

    # Process only specific repos
    docpup generate --only nextjs,temporal

    # Override concurrency
    docpup generate --concurrency 4

    # Force git repos to rebuild even if unchanged
    docpup generate --refresh

    # Show help
    docpup --help

    # Show version
    docpup --version

Index File Format
Docpup generates index files in the AGENTS.md format:
Documentation Index:

    <!-- NEXTJS-AGENTS-MD-START -->[nextjs Docs Index]|root: documentation/nextjs|STOP. What you remember about nextjs may be WRONG for this project. Always search docs and read before any task.|(root):{index.mdx}|guides:{setup.md,intro.md}<!-- NEXTJS-AGENTS-MD-END -->

Source Code Index:

    <!-- CODEX-SDK-AGENTS-MD-START -->[codex-sdk Source Index]|root: documentation/codex-sdk|STOP. This is source code from codex-sdk. Search and read files before making changes.|sdk/typescript/src:{index.ts,client.ts}|sdk/typescript/samples:{basic.ts}<!-- CODEX-SDK-AGENTS-MD-END -->

This compact format provides:
- Start/end markers for easy parsing
- Root path for the files
- Context-aware warning (docs vs source code)
- Directory-to-file mapping with preserved path structure
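Because the format is delimited by markers and pipes, it can be read back with a few lines of code. The parser below is a sketch inferred only from the two examples above, not from a format specification: `parse_index` is a hypothetical name, and it assumes the warning text and file names never contain a `|` character.

```python
import re

# The marker ID (e.g. NEXTJS) must match in the START and END comments.
_INDEX = re.compile(
    r"<!-- (?P<id>[A-Z0-9-]+)-AGENTS-MD-START -->(?P<body>.*?)"
    r"<!-- (?P=id)-AGENTS-MD-END -->",
    re.S,
)
_DIR = re.compile(r"^(?P<dir>.+):\{(?P<files>.*)\}$")


def parse_index(text: str) -> dict:
    """Parse one index block into title, root, warning, and a dir-to-files map."""
    match = _INDEX.search(text)
    if match is None:
        raise ValueError("no AGENTS.md index markers found")
    title, root, warning, *entries = match.group("body").split("|")
    dirs = {}
    for entry in entries:
        m = _DIR.match(entry)
        if m:
            files = m.group("files")
            dirs[m.group("dir")] = files.split(",") if files else []
    return {
        "title": title.strip("[]"),
        "root": root.removeprefix("root:").strip(),
        "warning": warning,
        "dirs": dirs,
    }
```

Applied to the documentation index example above, this yields the root `documentation/nextjs` and a mapping with the keys `(root)` and `guides`.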
Authentication
Docpup uses your existing git credentials (SSH keys, credential helpers, or stored tokens). No additional authentication configuration is required.
For private repositories, ensure you have access configured in your git environment.
Error Handling
- If a repository fails to clone, docpup logs a warning and continues with other repos
- If a git repo is unchanged and its outputs still exist, docpup skips downloading it and reuses the existing generated files
- If docpup-lock.json matches but the docs or index files are missing, docpup downloads the repo again to rebuild them
- The CLI always exits with status 0 if it can continue running (non-fatal errors)
- Invalid configuration or unexpected errors result in non-zero exit
Requirements
- Node.js 20 or later
- Git 2.25 or later (for sparse-checkout support)
License
MIT
