docscraper-ai

v1.4.0

Fast documentation scraper and HTML-to-Markdown converter built for AI agents. Discover, crawl, and extract docs into clean structured Markdown.

Claude Code Skill — /skill-tree

Generate structured knowledge graphs from any documentation, directly in Claude Code.

Install the skill with one command:

curl -fsSL --create-dirs https://raw.githubusercontent.com/Alex28042/docscraper-ai/main/skills/skill-tree.md -o ~/.claude/commands/skill-tree.md

Use it:

/skill-tree hono-routing
/skill-tree supabase-auth
/skill-tree https://docs.stripe.com/webhooks
/skill-tree stripe-webhooks --global

What it does:

  1. Discovers official documentation sources
  2. Scrapes and converts to Markdown
  3. Builds an atomic concept graph (8-20 nodes)
  4. Generates interlinked .md files with [[wikilinks]]
  5. Saves to .claude/skills/<topic>/ (or ~/.claude/skills/ with --global)
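
Step 4 can be pictured with a small sketch. The `ConceptNode` shape and both function names below are hypothetical, chosen only to show how nodes in a concept graph could be rendered as Markdown files that point at each other with [[wikilinks]]. The files the skill actually produces may be laid out differently.

```typescript
// Hypothetical shape of one node in a concept graph (not the skill's real format).
interface ConceptNode {
  slug: string;      // file name without extension
  title: string;
  summary: string;
  related: string[]; // slugs of linked concepts
}

// Render one node as Markdown, linking related concepts with [[wikilinks]].
function renderConceptNode(node: ConceptNode): string {
  const links = node.related.map((slug) => `- [[${slug}]]`).join("\n");
  return `# ${node.title}\n\n${node.summary}\n\n## Related\n\n${links}\n`;
}

// Map every node to a path -> markdown entry under the topic directory.
function renderConceptGraph(topic: string, nodes: ConceptNode[]): Map<string, string> {
  const files = new Map<string, string>();
  for (const node of nodes) {
    files.set(`.claude/skills/${topic}/${node.slug}.md`, renderConceptNode(node));
  }
  return files;
}
```

Returning a map instead of writing to disk keeps the sketch free of side effects; a caller would decide where the files go.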

Features

  • Discover official documentation sources via DuckDuckGo search
  • Crawl multi-page sites with BFS, depth control, and concurrency
  • Convert HTML to clean Markdown (Turndown + Cheerio)
  • Sitemap parsing for efficient URL discovery via sitemap.xml
  • Robots.txt compliance with User-Agent matching and crawl delay
  • Retry with backoff — exponential backoff + jitter on failures
  • Progress events — real-time callbacks for page-by-page tracking
  • Code snippet extraction — pull code blocks with language detection
  • Single-file export — concatenate all pages into one Markdown file
  • 3 HTTP clients — native fetch, Axios, or Undici (fastest)
  • Branded types + Zod schemas for type-safe runtime validation
  • SOLID architecture with full dependency injection

Install

npm install docscraper-ai
# or
pnpm add docscraper-ai

CLI

Discover documentation sources

npx docscraper-ai discover "hono framework"
[
  { "url": "https://hono.dev/docs/", "title": "Hono - Web framework", "snippet": "..." },
  { "url": "https://github.com/honojs/hono", "title": "GitHub - honojs/hono", "snippet": "..." }
]

Scrape URLs to Markdown

# Output as JSON to stdout
npx docscraper-ai scrape https://hono.dev/docs/

# Save as .md files to a directory
npx docscraper-ai scrape https://hono.dev/docs/ --output ./raw

# Multiple URLs with concurrency
npx docscraper-ai scrape https://hono.dev/docs https://hono.dev/api --output ./raw --concurrency 5

CLI Options

Commands:
  discover <topic>              Search for official documentation URLs
  scrape <url> [<url>...]       Scrape URLs and output markdown

Discover options:
  --max-results <n>             Max URLs to return (default: 4)

Scrape options:
  --output <dir>                Output directory for .md files (default: stdout as JSON)
  --max-chars <n>               Skip pages over N characters (default: 80000)
  --concurrency <n>             Max concurrent requests (default: 3)

Programmatic API

Quick Start

import { createDefaultCrawler, createDefaultDiscoverer } from 'docscraper-ai';

// Discover documentation sources
const discoverer = await createDefaultDiscoverer();
const sources = await discoverer.discover('supabase auth');
console.log(sources);

// Crawl a documentation site
const crawler = await createDefaultCrawler();
const result = await crawler.crawl('https://hono.dev/docs/', {
  maxDepth: 2,
  maxPages: 20,
});

for (const page of result.pages) {
  console.log(`${page.title}: ${page.markdown.length} chars`);
}
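
To make `maxDepth` and `maxPages` concrete, here is a minimal sketch of a breadth-first traversal over an in-memory link graph. It does not use the package. `LinkGraph`, `BfsLimits`, and `bfsCrawlOrder` are illustrative names, and the sketch assumes the start URL counts as depth 0, which may not match the crawler's own convention.

```typescript
// Hypothetical in-memory "site": each URL maps to the links found on that page.
type LinkGraph = Record<string, string[]>;

interface BfsLimits {
  maxDepth: number; // how many link hops from the start URL to follow
  maxPages: number; // hard cap on visited pages
}

// Breadth-first traversal: visit pages level by level, never revisiting a URL.
function bfsCrawlOrder(graph: LinkGraph, startUrl: string, limits: BfsLimits): string[] {
  const visited = new Set<string>([startUrl]);
  const queue: Array<{ url: string; depth: number }> = [{ url: startUrl, depth: 0 }];
  const order: string[] = [];

  while (queue.length > 0 && order.length < limits.maxPages) {
    const current = queue.shift()!;
    order.push(current.url);
    if (current.depth >= limits.maxDepth) continue; // do not expand past maxDepth
    for (const link of graph[current.url] ?? []) {
      if (!visited.has(link)) {
        visited.add(link);
        queue.push({ url: link, depth: current.depth + 1 });
      }
    }
  }
  return order;
}
```

Because the queue is first-in, first-out, pages closer to the start URL are visited first, so a low `maxPages` tends to keep the top-level pages and drop the deep ones.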

Scrape a Single Page

const crawler = await createDefaultCrawler();
const page = await crawler.scrapePage('https://hono.dev/docs/getting-started');

console.log(page.title);       // "Getting Started - Hono"
console.log(page.markdown);    // Clean markdown content
console.log(page.links);       // ["https://hono.dev/docs/api", ...]

Choose Your HTTP Client

Three built-in HTTP clients — pick the one that fits your needs:

import { createDefaultCrawler } from 'docscraper-ai';

// Native fetch (default, zero deps)
const fetchCrawler = await createDefaultCrawler({ httpClient: 'fetch' });

// Axios (feature-rich)
const axiosCrawler = await createDefaultCrawler({ httpClient: 'axios' });

// Undici (fastest for high-throughput)
const undiciCrawler = await createDefaultCrawler({ httpClient: 'undici' });

Retry with Exponential Backoff

Wrap any HTTP client with automatic retries on network errors and 5xx responses:

import { RetryHttpClient, FetchHttpClient } from 'docscraper-ai';
import { TokenBucketRateLimiter } from 'docscraper-ai/dist/http/rate-limiter';

const inner = new FetchHttpClient(new TokenBucketRateLimiter(5));

const resilientClient = new RetryHttpClient(inner, {
  maxRetries: 3,        // retry up to 3 times
  baseDelayMs: 1000,    // start with 1s delay
  maxDelayMs: 10_000,   // cap at 10s
});
// Delay formula: min(baseDelay * 2^attempt, maxDelay) + random jitter
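
The delay formula in the comment above can be written out as a standalone function. This is a sketch, not the package's code: `BackoffOptions`, `backoffDelayMs`, the `jitterMs` bound, and the 0-based `attempt` are all assumptions, and the random source is injectable so the result can be checked deterministically.

```typescript
interface BackoffOptions {
  baseDelayMs: number;
  maxDelayMs: number;
  jitterMs: number; // upper bound of the random jitter (assumed parameter)
}

// attempt is 0-based here: 0 for the first retry, 1 for the second, and so on.
function backoffDelayMs(
  attempt: number,
  opts: BackoffOptions,
  random: () => number = Math.random,
): number {
  const exponential = opts.baseDelayMs * Math.pow(2, attempt);
  const capped = Math.min(exponential, opts.maxDelayMs);
  // Jitter spreads retries out so many clients do not retry at the same instant.
  return capped + random() * opts.jitterMs;
}
```

With `baseDelayMs: 1000` and `maxDelayMs: 10_000`, the delay before jitter doubles from 1s to 2s, 4s, and 8s, then stays at the 10s cap.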

Progress Events

Track crawling progress in real-time:

import {
  Crawler,
  type ICrawlProgress,
  type CrawlProgressEvent,
  type CrawlStats,
} from 'docscraper-ai';

const progress: ICrawlProgress = {
  onPageComplete(event: CrawlProgressEvent) {
    const icon = event.status === 'success' ? 'ok' : 'ERR';
    console.log(`[${event.currentPage}/${event.totalPages}] ${icon} ${event.url}`);
  },
  onCrawlComplete(stats: CrawlStats) {
    console.log(`Done: ${stats.totalPages} pages in ${stats.durationMs}ms`);
  },
};

// When constructing Crawler directly, pass progress as the last constructor argument:
// new Crawler(httpClient, concurrencyLimiter, parser, converter, linkExtractor, logger, progress);

Sitemap Parsing

Discover all pages from a site's sitemap.xml:

import { SitemapParser, FetchHttpClient } from 'docscraper-ai';
import { TokenBucketRateLimiter } from 'docscraper-ai/dist/http/rate-limiter';

const httpClient = new FetchHttpClient(new TokenBucketRateLimiter(5));
const sitemap = new SitemapParser(httpClient);

const urls = await sitemap.parse('https://hono.dev/sitemap.xml');
console.log(`Found ${urls.length} pages`);
// ["https://hono.dev/docs/", "https://hono.dev/docs/api", ...]
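
For intuition about what a sitemap parser has to do, the sketch below pulls `<loc>` entries out of a sitemap XML string with a regular expression. `extractSitemapLocs` is a hypothetical helper, not part of the package, and SitemapParser may well use a real XML parser instead.

```typescript
// Pull every <loc> value out of a sitemap XML string.
function extractSitemapLocs(xml: string): string[] {
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  const urls: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = locPattern.exec(xml)) !== null) {
    // Sitemaps escape "&" as "&amp;", so undo that common case.
    urls.push(match[1].replace(/&amp;/g, "&"));
  }
  return urls;
}
```

A sitemap index file also uses `<loc>`, but its entries point at further sitemaps, so a complete parser would need to fetch those as well.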

Robots.txt Compliance

Check a site's robots.txt rules before crawling:

import { RobotsParser, FetchHttpClient } from 'docscraper-ai';
import { TokenBucketRateLimiter } from 'docscraper-ai/dist/http/rate-limiter';

const httpClient = new FetchHttpClient(new TokenBucketRateLimiter(5));
const robots = new RobotsParser(httpClient);

// Check if a URL is allowed
const allowed = await robots.isAllowed('https://example.com/docs/api');

// Get crawl delay (seconds)
const delay = robots.getCrawlDelay(); // e.g. 2

// Discover sitemaps from robots.txt
const sitemaps = robots.getSitemapUrls(); // ["https://example.com/sitemap.xml"]
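
The sketch below shows roughly what a robots.txt check involves. It is a simplified stand-in, not RobotsParser itself: `RobotsRules`, `parseRobotsTxt`, and `isPathAllowed` are hypothetical names, wildcards (`*` and `$` inside paths) are not supported, and rules from the `*` group are merged with agent-specific groups, whereas RFC 9309 says to use only the most specific matching group.

```typescript
interface RobotsRules {
  disallow: string[];
  allow: string[];
  crawlDelaySeconds: number | null;
}

// Collect rules from every group whose User-agent matches `agent` (or is "*").
function parseRobotsTxt(text: string, agent: string): RobotsRules {
  const rules: RobotsRules = { disallow: [], allow: [], crawlDelaySeconds: null };
  let groupApplies = false;
  let lastWasUserAgent = false;

  for (const rawLine of text.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const colon = line.indexOf(":");
    if (colon === -1) continue;
    const field = line.slice(0, colon).trim().toLowerCase();
    const value = line.slice(colon + 1).trim();

    if (field === "user-agent") {
      const matches =
        value === "*" || (value !== "" && agent.toLowerCase().includes(value.toLowerCase()));
      // Consecutive User-agent lines belong to the same group.
      groupApplies = lastWasUserAgent ? groupApplies || matches : matches;
      lastWasUserAgent = true;
      continue;
    }
    lastWasUserAgent = false;
    if (!groupApplies) continue;
    if (field === "disallow" && value !== "") rules.disallow.push(value);
    else if (field === "allow" && value !== "") rules.allow.push(value);
    else if (field === "crawl-delay") rules.crawlDelaySeconds = Number(value);
  }
  return rules;
}

// Longest matching prefix wins; Allow wins ties, and no match means allowed.
function isPathAllowed(path: string, rules: RobotsRules): boolean {
  const longest = (prefixes: string[]): number =>
    prefixes.filter((p) => path.startsWith(p)).reduce((max, p) => Math.max(max, p.length), -1);
  return longest(rules.allow) >= longest(rules.disallow);
}
```

The longest-prefix rule is what lets a narrow `Allow: /private/docs` override a broader `Disallow: /private`.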

Code Snippet Extraction

Extract code blocks from HTML with language detection:

import { CheerioCodeExtractor } from 'docscraper-ai';

const extractor = new CheerioCodeExtractor();
const snippets = extractor.extract(html);

for (const snippet of snippets) {
  console.log(`Language: ${snippet.language}`);  // "typescript", "bash", etc.
  console.log(`Context: ${snippet.context}`);    // nearest heading
  console.log(`Code: ${snippet.code}`);
}
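
As a rough picture of language detection, the sketch below finds `<pre><code>` blocks with a regular expression and reads the language from a `language-*` or `lang-*` class. This is a deliberate simplification and not how CheerioCodeExtractor works (the package uses Cheerio, and the sketch skips the nearest-heading context). `ExtractedSnippet`, `decodeBasicEntities`, and `extractCodeBlocks` are hypothetical names.

```typescript
interface ExtractedSnippet {
  language: string | null; // null when no language class is present
  code: string;
}

// Decode the handful of entities that commonly appear inside <code>.
function decodeBasicEntities(text: string): string {
  return text
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" becomes "&lt;" and not "<"
}

function extractCodeBlocks(html: string): ExtractedSnippet[] {
  const blockPattern = /<pre[^>]*>\s*<code([^>]*)>([\s\S]*?)<\/code>\s*<\/pre>/gi;
  const snippets: ExtractedSnippet[] = [];
  let match: RegExpExecArray | null;
  while ((match = blockPattern.exec(html)) !== null) {
    // Look for class="language-xyz" or class="lang-xyz" on the <code> tag.
    const langMatch = /class="[^"]*\blang(?:uage)?-([\w+#-]+)/i.exec(match[1]);
    snippets.push({
      language: langMatch ? langMatch[1].toLowerCase() : null,
      code: decodeBasicEntities(match[2]),
    });
  }
  return snippets;
}
```

Regular expressions are fragile on real-world HTML (nested tags, syntax-highlighter spans), which is a good reason to use a DOM library for anything beyond a demo.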

Single-File Export

Concatenate all crawled pages into one Markdown file:

import {
  createDefaultCrawler,
  exportToSingleFile,
  exportToSingleFileAndWrite,
} from 'docscraper-ai';
import { FsContentWriter } from 'docscraper-ai/dist/generation/fs-writer';

const crawler = await createDefaultCrawler();
const result = await crawler.crawl('https://hono.dev/docs/', { maxPages: 20 });

// As a string
const markdown = exportToSingleFile(result, {
  includeMetadata: true,  // adds crawl stats header
  separator: '\n\n---\n\n',
});

// Write to file
exportToSingleFileAndWrite(result, './docs-all.md', new FsContentWriter());
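
Conceptually, single-file export is a join over the crawled pages. The sketch below shows one way to do it under stated assumptions: `ExportedPage`, `ConcatOptions`, and `concatPages` are hypothetical, and the metadata header here is only a page count, whereas the package's header is described as containing crawl stats.

```typescript
// Hypothetical page shape with only the fields this sketch needs.
interface ExportedPage {
  url: string;
  title: string;
  markdown: string;
}

interface ConcatOptions {
  includeMetadata: boolean;
  separator: string;
}

function concatPages(pages: ExportedPage[], opts: ConcatOptions): string {
  // One section per page: title heading, source URL, then the page body.
  const sections = pages.map(
    (page) => `# ${page.title}\n\nSource: ${page.url}\n\n${page.markdown.trim()}`,
  );
  const body = sections.join(opts.separator);
  if (!opts.includeMetadata) return body;
  const header = `<!-- ${pages.length} pages -->`; // assumed header format
  return `${header}${opts.separator}${body}`;
}
```

Keeping the source URL in each section lets a reader, or an AI agent, trace any passage back to the page it came from.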

Generate a Skill Tree

Build a hierarchical documentation tree with navigation:

import { createDefaultCrawler, generateSkillTree } from 'docscraper-ai';
import { FsContentWriter } from 'docscraper-ai/dist/generation/fs-writer';

const crawler = await createDefaultCrawler();
const result = await crawler.crawl('https://hono.dev/docs/', {
  maxDepth: 2,
  maxPages: 30,
});

const writer = new FsContentWriter();
const { files, indexPath } = generateSkillTree(result, './output', writer);

console.log(`Generated ${files.length} files`);
console.log(`Index: ${indexPath}`);

Custom Configuration

import { createDefaultCrawler, StderrLogger } from 'docscraper-ai';

const crawler = await createDefaultCrawler(
  {
    rateLimit: 10,        // requests per second
    timeoutMs: 15_000,    // 15 second timeout
    concurrency: 5,       // 5 concurrent requests
    httpClient: 'undici',
  },
  new StderrLogger(),
);

const result = await crawler.crawl('https://docs.stripe.com/api', {
  maxDepth: 3,
  maxPages: 100,
  includePatterns: ['/api/'],
  excludePatterns: ['/changelog'],
});
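
The README does not say how `includePatterns` and `excludePatterns` are matched. The sketch below assumes plain substring matching on the URL path, with exclusion taking priority. `UrlFilter` and `shouldCrawl` are hypothetical names, and the package may use globs or regular expressions instead.

```typescript
interface UrlFilter {
  includePatterns: string[]; // empty means "include everything"
  excludePatterns: string[];
}

// Assumed semantics: substring match on the path; an exclude match always wins.
function shouldCrawl(url: string, filter: UrlFilter): boolean {
  const path = new URL(url).pathname;
  const included =
    filter.includePatterns.length === 0 ||
    filter.includePatterns.some((pattern) => path.includes(pattern));
  const excluded = filter.excludePatterns.some((pattern) => path.includes(pattern));
  return included && !excluded;
}
```

Under this assumption `https://docs.stripe.com/api/charges` passes and `https://docs.stripe.com/api/changelog` is rejected. The start URL `https://docs.stripe.com/api` has no trailing slash, so it would not match `/api/` either, which suggests the real crawler either exempts the start page or matches differently.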

Dependency Injection

All core classes take their dependencies as interfaces through the constructor, so any component can be swapped:

import {
  Crawler,
  type IHttpClient,
  type IHtmlParser,
  type IHtmlConverter,
  type ILinkExtractor,
  NullLogger,
  NullProgress,
} from 'docscraper-ai';
import { SemaphoreConcurrencyLimiter } from 'docscraper-ai/dist/http/concurrency-limiter';

const myClient: IHttpClient = {
  async fetch(url) {
    const res = await fetch(url);
    return res.text();
  },
};

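// myParser, myConverter, and myLinkExtractor are placeholders for your own
// implementations of IHtmlParser, IHtmlConverter, and ILinkExtractor.
const crawler = new Crawler(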
  myClient,
  new SemaphoreConcurrencyLimiter(3),
  myParser,         // IHtmlParser
  myConverter,      // IHtmlConverter
  myLinkExtractor,  // ILinkExtractor
  new NullLogger(),
  new NullProgress(),
);

Zod Schemas for Validation

Validate data at runtime boundaries:

import { ScraperConfigSchema, CrawlOptionsSchema, UrlSchema } from 'docscraper-ai';

const config = ScraperConfigSchema.parse({
  rateLimit: 5,
  timeoutMs: 10_000,
  concurrency: 3,
});

const url = UrlSchema.parse('https://docs.example.com');

Branded Types

Type-safe domain primitives: a plain string or number cannot be passed where a Url or Milliseconds is expected:

import type { Url, Milliseconds } from 'docscraper-ai';
import { toUrl, toMilliseconds } from 'docscraper-ai';

const url: Url = toUrl('https://docs.example.com');
const timeout: Milliseconds = toMilliseconds(10_000);

// TypeScript errors — can't assign raw primitives:
// const bad: Url = 'https://example.com';
// const wrong: Milliseconds = 5000;
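
For readers new to the pattern, here is a minimal sketch of how branded types can be built in TypeScript. It is not the package's implementation: `Brand`, `BrandedUrl`, `BrandedMs`, `makeUrl`, `makeMilliseconds`, and `describeRequest` are hypothetical names, and the validation rules are assumptions.

```typescript
// A brand is a phantom property that exists only at the type level.
type Brand<T, Name extends string> = T & { readonly __brand: Name };

type BrandedUrl = Brand<string, "Url">;
type BrandedMs = Brand<number, "Milliseconds">;

// Smart constructors validate once, then cast to the branded type.
function makeUrl(value: string): BrandedUrl {
  new URL(value); // throws a TypeError on malformed input
  return value as BrandedUrl;
}

function makeMilliseconds(value: number): BrandedMs {
  if (!Number.isFinite(value) || value < 0) {
    throw new RangeError(`Invalid duration: ${value}`);
  }
  return value as BrandedMs;
}

// Only branded values are accepted, so raw strings and numbers fail to compile.
function describeRequest(url: BrandedUrl, timeout: BrandedMs): string {
  return `GET ${url} (timeout ${timeout}ms)`;
}
```

The brand costs nothing at runtime: a `BrandedUrl` is still an ordinary string, and the only extra work is the validation inside the constructor function.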

Architecture

src/
  interfaces/     # Pure interfaces (Strategy pattern)
  types/          # Branded types and domain types
  schemas/        # Zod runtime validation
  http/           # FetchHttpClient, AxiosHttpClient, UndiciHttpClient, RetryHttpClient
  parsing/        # CheerioHtmlParser, TurndownConverter, CheerioLinkExtractor, CheerioCodeExtractor
  discovery/      # Discoverer, DuckDuckGoSearchEngine, SitemapParser, RobotsParser, scoring
  crawling/       # Crawler with DI (7 injectable deps)
  generation/     # Skill tree, content generator, single-file exporter, file writer
  cli/            # CLI commands
  factories.ts    # createDefaultCrawler(), createDefaultDiscoverer()

The code follows SOLID principles with constructor dependency injection. There is no DI container; factory functions wire the dependency graph.
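
The factory-function approach can be illustrated with a toy example. Everything below is hypothetical (`MarkupParser`, `LogSink`, `TitleReporter`, `createDefaultTitleReporter`, and the rest) and far smaller than the package's real dependency graph; it only shows where the wiring lives.

```typescript
// Hypothetical interfaces, for illustration only.
interface MarkupParser {
  title(html: string): string;
}
interface LogSink {
  info(message: string): void;
}

class RegexTitleParser implements MarkupParser {
  title(html: string): string {
    const found = /<title>([^<]*)<\/title>/i.exec(html);
    return found ? found[1].trim() : "";
  }
}

class MemoryLog implements LogSink {
  readonly lines: string[] = [];
  info(message: string): void {
    this.lines.push(message);
  }
}

// The consumer depends only on interfaces, never on concrete classes.
class TitleReporter {
  private readonly parser: MarkupParser;
  private readonly log: LogSink;
  constructor(parser: MarkupParser, log: LogSink) {
    this.parser = parser;
    this.log = log;
  }
  report(html: string): string {
    const pageTitle = this.parser.title(html);
    this.log.info(`title=${pageTitle}`);
    return pageTitle;
  }
}

// The factory is the one place that knows which concrete classes are used.
function createDefaultTitleReporter(log: LogSink = new MemoryLog()): TitleReporter {
  return new TitleReporter(new RegexTitleParser(), log);
}
```

Tests can pass their own `LogSink` (or construct `TitleReporter` with a fake parser), while ordinary callers use the factory and never see the wiring.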

License

MIT