DocNom 🍴

DocNom = doc + nom (the sound of eating)

A powerful blog crawler that hungrily "noms" through blogs and documentation sites, consuming and archiving web content into clean, organized Markdown files.


Features

  • 🍽️ Hungry for Content - Devours blogs, documentation sites, and multi-page articles
  • πŸ”Œ Plugin-Based Adapters - Custom extractors for VictoriaMetrics, Cloudflare blogs, and more
  • πŸ“ Clean Markdown Output - Converts HTML to readable Markdown with frontmatter
  • πŸ–ΌοΈ Image Archiving - Downloads and localizes all images (with parallel downloads!)
  • πŸ’Ύ Incremental Crawling - Smart caching to avoid re-downloading known content
  • 🎯 Multiple Formats - Export as Markdown and/or JSON
  • πŸ—„οΈ Database Support - SQLite, PostgreSQL, or MySQL for content tracking
  • ⚑ High Performance - Rate limiting, concurrency control, and retry logic

Installation

Global Installation (Recommended)

npm install -g @leibniz/docnom

Using npx (No Installation Required)

npx @leibniz/docnom --help

Local Project Installation

npm install @leibniz/docnom

Quick Start

Basic Usage

# Crawl a blog or documentation site
docnom crawl https://victoriametrics.com/blog/

# Crawl with options
docnom crawl https://blog.cloudflare.com/ \
  --max-posts 50 \
  --output ./my-archive \
  --formats markdown json

# Crawl a specific post
docnom crawl https://zh.wikipedia.org/wiki/Euler

Using npx

# No installation needed!
npx @leibniz/docnom crawl https://example.com/blog/

# With options
npx @leibniz/docnom crawl https://blog.example.com/ \
  --max-posts 20 \
  --formats markdown

Command Line Options

crawl Command

docnom crawl <url> [options]

Options

| Option | Description | Default |
|--------|-------------|---------|
| -o, --output <dir> | Output directory | ./output |
| -f, --formats <types> | Export formats (markdown, json) | markdown |
| --max-posts <number> | Maximum posts to crawl | Unlimited |
| --max-pages <number> | Maximum list pages to scan | 10 |
| -c, --concurrency <number> | Concurrent requests | 3 |
| --rate-limit <ms> | Delay between requests (ms) | 1000 |
| --retries <number> | Retry attempts for failed requests | 3 |
| --headless | Run browser in headless mode | true |
| --incremental | Skip known URLs (requires a database) | false |
| --stop-after-known <number> | Stop after N consecutive known posts | 10 |
| --db <type> | Database type (sqlite, postgres, mysql) | sqlite |
| --db-path <path> | SQLite database file path | ./data/blog-crawler.db |

Examples

Crawl with limits:

docnom crawl https://blog.example.com/ \
  --max-posts 100 \
  --max-pages 5

High-speed crawling:

docnom crawl https://docs.example.com/ \
  --concurrency 10 \
  --rate-limit 500

Incremental mode (skip existing):

docnom crawl https://blog.example.com/ \
  --incremental \
  --stop-after-known 5

Export both formats:

docnom crawl https://example.com/blog/ \
  --formats markdown json

Configuration File

Create a .config.json file in your project root:

{
  "outputDir": "./archive",
  "formats": ["markdown", "json"],
  "maxPosts": 100,
  "concurrency": 5,
  "rateLimit": 1000,
  "retries": 3,
  "incremental": true,
  "stopAfterKnown": 10,
  "database": {
    "type": "sqlite",
    "path": "./data/archive.db"
  }
}

Then run:

docnom crawl https://blog.example.com/

Output Structure

DocNom organizes content by domain:

output/
β”œβ”€β”€ example.com/              # Domain-based directory
β”‚   β”œβ”€β”€ index.md              # Index of all posts
β”‚   β”œβ”€β”€ posts.json            # Metadata (if --formats json)
β”‚   β”œβ”€β”€ post-title-1.md       # Individual post
β”‚   β”œβ”€β”€ post-title-2.md
β”‚   └── assets/               # Downloaded images
β”‚       β”œβ”€β”€ post-title-1/
β”‚       β”‚   β”œβ”€β”€ image_0.jpg
β”‚       β”‚   └── image_1.png
β”‚       └── post-title-2/
β”‚           └── diagram.svg

Markdown Format

Each .md file includes:

---
title: "Post Title"
url: "https://example.com/post"
image: "https://example.com/cover.jpg"
wordCount: 1500
readingTime: 7
crawledAt: 2026-01-26T05:30:00.000Z
imagesCount: 5
---

# Post Title

Content with localized images:

![Alt text](./assets/post-title/image_0.jpg)

...
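
To consume the archived files programmatically, any front-matter parser will do. Here is a minimal sketch using the third-party gray-matter package (illustrative only, not part of DocNom):

import { readFileSync } from 'node:fs';
import matter from 'gray-matter';

// Split an archived post into its frontmatter (data) and Markdown body (content).
const file = readFileSync('./output/example.com/post-title-1.md', 'utf8');
const { data, content } = matter(file);
console.log(data.title, data.wordCount, content.length);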

Supported Sites

DocNom includes specialized adapters for:

Built-in Adapters

  • VictoriaMetrics Blog (victoriametrics.com/blog)
  • Cloudflare Blog (blog.cloudflare.com)
  • Default Adapter - Works with most blogs and documentation sites

Custom Adapters

You can create your own adapter by implementing the SiteAdapter interface:

import { SiteAdapter, BlogPost, BlogPostPreview } from '@leibniz/docnom';

export class MyBlogAdapter implements SiteAdapter {
  name = 'my-blog';
  baseUrl = 'https://myblog.com';
  
  canHandle(url: string): boolean {
    return url.includes('myblog.com');
  }
  
  async getListPages(startUrl: string): Promise<string[]> {
    // Return URLs of list pages
  }
  
  async extractPosts(listPageHtml: string, listUrl: string): Promise<BlogPostPreview[]> {
    // Extract post previews from list page
  }
  
  async extractFullPost(html: string, url: string): Promise<BlogPost> {
    // Extract full post content
  }
}
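
Once implemented, the adapter is passed to the Crawler the same way as the built-in ones (a short sketch; see Programmatic Usage below for the full options object):

import { Crawler } from '@leibniz/docnom';

// Hand the custom adapter to the Crawler and point it at the site it handles.
const crawler = new Crawler(new MyBlogAdapter(), { outputDir: './output' });
await crawler.crawl('https://myblog.com/blog/');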

Programmatic Usage

import { Crawler } from '@leibniz/docnom';
import { VictoriaMetricsAdapter } from '@leibniz/docnom/adapters';

const crawler = new Crawler(new VictoriaMetricsAdapter(), {
  outputDir: './output',
  formats: ['markdown', 'json'],
  maxPosts: 50,
  concurrency: 5,
});

const result = await crawler.crawl('https://victoriametrics.com/blog/');

console.log(`Crawled ${result.posts.length} posts in ${result.duration}ms`);
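
The exact shape of the returned posts isn't documented here; assuming each entry carries at least the title and url fields that appear in the Markdown frontmatter, post-processing might look like this:

// Hypothetical field names, inferred from the frontmatter example above.
for (const post of result.posts) {
  console.log(`${post.title} -> ${post.url}`);
}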

Advanced Features

Incremental Crawling

Save time by skipping already-crawled posts:

docnom crawl https://blog.example.com/ \
  --incremental \
  --stop-after-known 10

DocNom will:

  1. Check URLs against the database
  2. Skip known posts
  3. Stop after encountering 10 consecutive known posts (configurable)
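
The same behavior is available from the programmatic API, assuming the Crawler accepts the incremental and stopAfterKnown options shown in the configuration file above (a sketch, not verified against the published types):

import { Crawler } from '@leibniz/docnom';
import { VictoriaMetricsAdapter } from '@leibniz/docnom/adapters';

// Skip URLs already recorded in the database; stop after ten known posts in a row.
const crawler = new Crawler(new VictoriaMetricsAdapter(), {
  outputDir: './output',
  incremental: true,
  stopAfterKnown: 10,
});
await crawler.crawl('https://victoriametrics.com/blog/');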

Custom Databases

PostgreSQL

docnom crawl https://example.com/blog/ \
  --db postgres \
  --db-connection "postgresql://user:pass@localhost:5432/docnom"

MySQL

docnom crawl https://example.com/blog/ \
  --db mysql \
  --db-connection "mysql://user:pass@localhost:3306/docnom"

Rate Limiting

Respect server resources:

# Two seconds between requests, one request at a time
docnom crawl https://example.com/ \
  --rate-limit 2000 \
  --concurrency 1

Troubleshooting

Images Not Downloading

  • Check network connectivity
  • Verify image URLs are accessible
  • Images are downloaded with 10 concurrent requests by default (configurable in @leibniz/extractor)

Database Locked (SQLite)

  • Ensure only one crawler instance is running
  • Use PostgreSQL/MySQL for concurrent access

Out of Memory

  • Reduce --concurrency
  • Crawl in smaller batches with --max-posts
  • Use --incremental mode

Architecture

DocNom is built on:

  • @leibniz/extractor - Content extraction and image downloading
  • Playwright - JavaScript rendering for dynamic sites
  • Drizzle ORM - Type-safe database operations
  • Turndown - HTML to Markdown conversion
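
As an illustration of the last step in that pipeline, this is roughly what Turndown does with an extracted HTML fragment (a standalone example, not DocNom's internal code):

import TurndownService from 'turndown';

// Convert an HTML fragment to Markdown, as DocNom does for each post body.
const turndown = new TurndownService({ headingStyle: 'atx' });
const markdown = turndown.turndown('<h1>Post Title</h1><p>Hello world.</p>');
console.log(markdown); // "# Post Title\n\nHello world."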

Contributing

Contributions are welcome! Please:

  1. Fork the repository
  2. Create a feature branch
  3. Add tests for new adapters
  4. Submit a pull request

License

MIT © 2026


Happy Nomming! πŸ΄πŸ“š