# @ebowwa/markdown-docs-scraper

v1.2.2

Scrape and mirror markdown-based documentation sites.
## Features
- 📥 Download full markdown documentation
- 🔄 Organize into directory structure
- 📊 Track downloads and failures
- 🚀 Fast concurrent downloads
- 🎯 CLI and programmatic API
## Installation

```sh
bun add @ebowwa/markdown-docs-scraper
# or
npm install @ebowwa/markdown-docs-scraper
```

## CLI Usage
### Quick Start - Anthropic Docs

```sh
markdown-docs-scraper anthropic -o ./docs
```

### Scrape Any Site

```sh
markdown-docs-scraper scrape -u https://docs.example.com -o ./docs
```

### Discover Available Pages

```sh
markdown-docs-scraper discover -u https://code.claude.com
```

### Options
```text
Commands:
  scrape     Scrape documentation from a URL
  discover   Discover all available documentation pages
  anthropic  Quick scrape of Anthropic Claude Code docs
  help       Display help for command

Options:
  -u, --url <url>          Base URL of the documentation site
  -o, --output <dir>       Output directory (default: "./docs")
  --docs-path <path>       Docs path (default: "/docs/en")
  -c, --concurrency <num>  Concurrency level (default: "5")
  --llms-paths <paths>     Comma-separated llms.txt paths (default: "/llms.txt,/docs/llms.txt")
  --no-subdomain           Disable docs/doc subdomain fallback
```

## llms.txt Discovery
The scraper automatically tries multiple paths to find `llms.txt`:

- Configured paths (default: `/llms.txt,/docs/llms.txt`)
- Docs subdomain (e.g., `https://docs.example.com/llms.txt`)
- Doc subdomain (e.g., `https://doc.example.com/llms.txt`)
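The discovery order above can be sketched as a pure function that builds the candidate URL list. This is an illustration only, not the package's actual implementation, and `buildLlmsTxtCandidates` is a hypothetical name:

```typescript
// Build the ordered list of llms.txt URLs to try, mirroring the
// discovery order described above (configured paths first, then the
// docs/doc subdomain fallbacks). Illustrative sketch only.
function buildLlmsTxtCandidates(
  baseUrl: string,
  llmsPaths: string[] = ["/llms.txt", "/docs/llms.txt"],
  trySubdomains = true,
): string[] {
  const base = new URL(baseUrl);
  // Resolve each configured path against the base URL.
  const candidates = llmsPaths.map((p) => new URL(p, base).toString());
  // Fall back to docs./doc. subdomains unless the host already has one.
  if (trySubdomains && !/^docs?\./.test(base.hostname)) {
    for (const sub of ["docs", "doc"]) {
      candidates.push(`${base.protocol}//${sub}.${base.hostname}/llms.txt`);
    }
  }
  return candidates;
}
```

For `https://example.com` this yields the root and `/docs` paths first, then the `docs.` and `doc.` subdomain variants.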
Example with custom paths:

```sh
markdown-docs-scraper scrape -u https://example.com --llms-paths "/llms.txt,/api/llms.txt"
```

Disable subdomain fallback:

```sh
markdown-docs-scraper scrape -u https://example.com --no-subdomain
```

## Programmatic Usage
```ts
import { MarkdownDocsScraper } from "@ebowwa/markdown-docs-scraper";

const scraper = new MarkdownDocsScraper({
  baseUrl: "https://code.claude.com",
  docsPath: "/docs/en",
  categories: {
    "getting-started": ["introduction", "installation", "quick-start"],
    features: ["inline-edits", "tool-use", "file-operations"],
  },
  outputDir: "./docs",
  concurrency: 5,
});

const result = await scraper.scrape();
console.log(`Downloaded: ${result.downloaded.length}`);
console.log(`Failed: ${result.failed.length}`);

// Save pages to disk
await scraper.savePages(result.downloaded);
```

### Convenience Function
```ts
import { scrapeMarkdownDocs } from "@ebowwa/markdown-docs-scraper";

const result = await scrapeMarkdownDocs({
  baseUrl: "https://docs.example.com",
  outputDir: "./docs",
});
```

## API
### MarkdownDocsScraper

#### Constructor Options

```ts
interface ScraperOptions {
  baseUrl: string; // Base URL of the documentation site
  docsPath?: string; // Docs path (default: "/docs/en")
  categories?: Record<string, string[]>; // Categories and pages
  outputDir?: string; // Output directory (default: "./docs")
  concurrency?: number; // Concurrent downloads (default: 5)
  onProgress?: (current: number, total: number) => void;
  llmsPaths?: string[]; // llms.txt paths to try (default: ["/llms.txt", "/docs/llms.txt"])
  tryDocsSubdomain?: boolean; // Also try docs/doc subdomains (default: true)
}
```

#### Methods
- `scrape()` - Scrape all configured pages
- `fetchMarkdown(url)` - Fetch markdown from a URL
- `downloadPage(category, page)` - Download a single page
- `savePages(pages)` - Save pages to disk
- `discoverPages()` - Discover available pages
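The `concurrency` constructor option caps how many page downloads run in parallel. A minimal sketch of how such a limiter might work (illustrative only, not the package's internals; `runLimited` is a hypothetical helper):

```typescript
// Run async tasks with at most `limit` executing at once, preserving
// result order. Each worker pulls the next unclaimed task index.
async function runLimited<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const i = next++; // claim an index before awaiting
      results[i] = await tasks[i]();
    }
  }
  // Spawn at most `limit` workers (fewer if there are fewer tasks).
  await Promise.all(
    Array.from({ length: Math.min(limit, tasks.length) }, () => worker()),
  );
  return results;
}
```

With `limit: 5` and a list of page-download thunks, at most five fetches are in flight at any moment, matching the default `concurrency` of 5.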
#### Result
```ts
interface ScraperResult {
  downloaded: DocPage[]; // Successfully downloaded pages
  failed: Array<{ url: string; error: string }>;
  duration: number; // Duration in milliseconds
}
```

## Output Format
Each downloaded file includes a header comment:
```markdown
<!--
Source: https://code.claude.com/docs/en/introduction.md
Downloaded: 2026-02-06T00:00:00.000Z
-->

# Introduction

Original markdown content...
```

## Composable Scrapers Module
The package includes a composable scraper architecture for multiple documentation source types.
### Usage
```ts
import {
  scrapeSource,
  registerScraper,
  llmsTxtScraper,
  githubRawScraper,
  type SourceConfig,
} from "@ebowwa/markdown-docs-scraper/scrapers";

// Configure a source
const config: SourceConfig = {
  name: "My Docs",
  sourceType: "llms-txt",
  baseUrl: "https://docs.example.com",
  docsPath: "/docs",
  outputDir: "./docs/my-docs",
  reportDir: "./reports/my-docs",
};

// Scrape using the registry (auto-selects scraper by sourceType)
const result = await scrapeSource(config);
```

### Built-in Scrapers
- llms-txt: Scrapes docs sites with llms.txt index files
- github-raw: Downloads markdown directly from GitHub repos
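One way to picture how `scrapeSource` dispatches on `sourceType` is a map from type to scraper. The sketch below is illustrative and self-contained, with simplified local stand-ins for the package's types, not its actual internals:

```typescript
// Simplified local stand-ins for the package's types.
type SourceType = string;
interface ScrapeResult {
  downloaded: unknown[];
  failed: Array<{ url: string; error: string }>;
  duration: number;
}
interface SourceConfig {
  sourceType: SourceType;
  [key: string]: unknown;
}
interface Scraper {
  type: SourceType;
  scrape(config: SourceConfig): Promise<ScrapeResult>;
}

// The registry: one scraper per source type.
const registry = new Map<SourceType, Scraper>();

function registerScraper(scraper: Scraper): void {
  registry.set(scraper.type, scraper);
}

// Look up the scraper for the config's sourceType and delegate to it.
async function scrapeSource(config: SourceConfig): Promise<ScrapeResult> {
  const scraper = registry.get(config.sourceType);
  if (!scraper) {
    throw new Error(`No scraper registered for "${config.sourceType}"`);
  }
  return scraper.scrape(config);
}
```

Registering a new scraper type is then just another `registerScraper` call, which is the pattern the Custom Scrapers section below uses with the package's real API.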
### Custom Scrapers
```ts
import { registerScraper, type Scraper, type SourceType } from "@ebowwa/markdown-docs-scraper/scrapers";

const myScraper: Scraper = {
  type: "my-type" as SourceType,
  async scrape(config) {
    // Custom scraping logic
    return {
      downloaded: [],
      failed: [],
      duration: 0,
    };
  },
};

registerScraper(myScraper);
```

### Types
```ts
type SourceType = "llms-txt" | "github-raw";

interface SourceConfig {
  name: string;
  sourceType: SourceType;
  baseUrl: string;
  docsPath: string;
  outputDir: string;
  reportDir: string;
  llmsTxtPath?: string;
  linkPattern?: RegExp;
  github?: {
    repo: string;
    includeCommits: boolean;
    includeReleases: boolean;
    includePRs: boolean;
  };
}

interface Scraper {
  type: SourceType;
  scrape(config: SourceConfig): Promise<ScrapeResult>;
}
```

## License
MIT
## Contributing
This package is part of the codespaces monorepo.
