DocNom
DocNom = doc + nom (the sound of eating)
A powerful blog crawler that hungrily "noms" through blogs and documentation sites, consuming and archiving web content into clean, organized Markdown files.
Features
- Hungry for Content - Devours blogs, documentation sites, and multi-page articles
- Plugin-Based Adapters - Custom extractors for VictoriaMetrics, Cloudflare blogs, and more
- Clean Markdown Output - Converts HTML to readable Markdown with frontmatter
- Image Archiving - Downloads and localizes all images (with parallel downloads!)
- Incremental Crawling - Smart caching to avoid re-downloading known content
- Multiple Formats - Export as Markdown and/or JSON
- Database Support - SQLite, PostgreSQL, or MySQL for content tracking
- High Performance - Rate limiting, concurrency control, and retry logic
Installation
Global Installation (Recommended)
```
npm install -g @leibniz/docnom
```
Using npx (No Installation Required)
```
npx @leibniz/docnom --help
```
Local Project Installation
```
npm install @leibniz/docnom
```
Quick Start
Basic Usage
```
# Crawl a blog or documentation site
docnom crawl https://victoriametrics.com/blog/

# Crawl with options
docnom crawl https://blog.cloudflare.com/ \
  --max-posts 50 \
  --output ./my-archive \
  --formats markdown json

# Crawl a specific post
docnom crawl https://zh.wikipedia.org/wiki/Euler
```
Using npx
```
# No installation needed!
npx @leibniz/docnom crawl https://example.com/blog/

# With options
npx @leibniz/docnom crawl https://blog.example.com/ \
  --max-posts 20 \
  --formats markdown
```
Command Line Options
crawl Command
```
docnom crawl <url> [options]
```
Options
| Option | Description | Default |
|--------|-------------|---------|
| -o, --output <dir> | Output directory | ./output |
| -f, --formats <types> | Export formats (markdown, json) | markdown |
| --max-posts <number> | Maximum posts to crawl | Unlimited |
| --max-pages <number> | Maximum list pages to scan | 10 |
| -c, --concurrency <number> | Concurrent requests | 3 |
| --rate-limit <ms> | Delay between requests (ms) | 1000 |
| --retries <number> | Retry attempts for failed requests | 3 |
| --headless | Run browser in headless mode | true |
| --incremental | Skip known URLs (requires database) | false |
| --stop-after-known <number> | Stop after N consecutive known posts | 10 |
| --db <type> | Database type (sqlite, postgres, mysql) | sqlite |
| --db-path <path> | SQLite database file path | ./data/blog-crawler.db |
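All of these flags can be combined in a single run. As an illustration (the URL and paths below are placeholders; the flags and defaults are the ones documented in the table above):
```
docnom crawl https://blog.example.com/ \
  --output ./archive \
  --formats markdown json \
  --max-posts 200 \
  --concurrency 5 \
  --rate-limit 1000 \
  --db sqlite \
  --db-path ./data/blog-crawler.db
```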
Examples
Crawl with limits:
```
docnom crawl https://blog.example.com/ \
  --max-posts 100 \
  --max-pages 5
```
High-speed crawling:
```
docnom crawl https://docs.example.com/ \
  --concurrency 10 \
  --rate-limit 500
```
Incremental mode (skip existing):
```
docnom crawl https://blog.example.com/ \
  --incremental \
  --stop-after-known 5
```
Export both formats:
```
docnom crawl https://example.com/blog/ \
  --formats markdown json
```
Configuration File
Create a .config.json file in your project root:
```
{
  "outputDir": "./archive",
  "formats": ["markdown", "json"],
  "maxPosts": 100,
  "concurrency": 5,
  "rateLimit": 1000,
  "retries": 3,
  "incremental": true,
  "stopAfterKnown": 10,
  "database": {
    "type": "sqlite",
    "path": "./data/archive.db"
  }
}
```
Then run:
```
docnom crawl https://blog.example.com/
```
Output Structure
DocNom organizes content by domain:
```
output/
└── example.com/              # Domain-based directory
    ├── index.md              # Index of all posts
    ├── posts.json            # Metadata (if --formats json)
    ├── post-title-1.md       # Individual post
    ├── post-title-2.md
    └── assets/               # Downloaded images
        ├── post-title-1/
        │   ├── image_0.jpg
        │   └── image_1.png
        └── post-title-2/
            └── diagram.svg
```
Markdown Format
Each .md file includes:
```
---
title: "Post Title"
url: "https://example.com/post"
image: "https://example.com/cover.jpg"
wordCount: 1500
readingTime: 7
crawledAt: 2026-01-26T05:30:00.000Z
imagesCount: 5
---

# Post Title

Content with localized images:

![Image](./assets/post-title-1/image_0.jpg)

...
```
Supported Sites
DocNom includes specialized adapters for:
Built-in Adapters
- VictoriaMetrics Blog (victoriametrics.com/blog)
- Cloudflare Blog (blog.cloudflare.com)
- Default Adapter - Works with most blogs and documentation sites
Custom Adapters
You can create your own adapter by implementing the SiteAdapter interface:
```
import { SiteAdapter, BlogPostPreview, BlogPost } from '@leibniz/docnom';

export class MyBlogAdapter implements SiteAdapter {
  name = 'my-blog';
  baseUrl = 'https://myblog.com';

  canHandle(url: string): boolean {
    return url.includes('myblog.com');
  }

  async getListPages(startUrl: string): Promise<string[]> {
    // Return URLs of list pages
  }

  async extractPosts(listPageHtml: string, listUrl: string): Promise<BlogPostPreview[]> {
    // Extract post previews from a list page
  }

  async extractFullPost(html: string, url: string): Promise<BlogPost> {
    // Extract full post content
  }
}
```
Programmatic Usage
```
import { Crawler } from '@leibniz/docnom';
import { VictoriaMetricsAdapter } from '@leibniz/docnom/adapters';

const crawler = new Crawler(new VictoriaMetricsAdapter(), {
  outputDir: './output',
  formats: ['markdown', 'json'],
  maxPosts: 50,
  concurrency: 5,
});

const result = await crawler.crawl('https://victoriametrics.com/blog/');
console.log(`Crawled ${result.posts.length} posts in ${result.duration}ms`);
```
Advanced Features
Incremental Crawling
Save time by skipping already-crawled posts:
```
docnom crawl https://blog.example.com/ \
  --incremental \
  --stop-after-known 10
```
DocNom will:
- Check URLs against the database
- Skip known posts
- Stop after encountering 10 consecutive known posts (configurable)
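If you would rather not pass these flags on every run, the same behavior can be set in the configuration file. A minimal sketch, reusing only keys that appear in the Configuration File example above:
```
{
  "incremental": true,
  "stopAfterKnown": 10,
  "database": {
    "type": "sqlite",
    "path": "./data/blog-crawler.db"
  }
}
```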
Custom Databases
PostgreSQL
```
docnom crawl https://example.com/blog/ \
  --db postgres \
  --db-connection "postgresql://user:pass@localhost:5432/docnom"
```
MySQL
```
docnom crawl https://example.com/blog/ \
  --db mysql \
  --db-connection "mysql://user:pass@localhost:3306/docnom"
```
Rate Limiting
Respect server resources:
```
# 2 seconds between requests, sequential requests
docnom crawl https://example.com/ \
  --rate-limit 2000 \
  --concurrency 1
```
Troubleshooting
Images Not Downloading
- Check network connectivity
- Verify image URLs are accessible
- Images are downloaded with 10 concurrent requests by default (configurable in @leibniz/extractor)
Database Locked (SQLite)
- Ensure only one crawler instance is running
- Use PostgreSQL/MySQL for concurrent access
Out of Memory
- Reduce --concurrency
- Crawl in smaller batches with --max-posts
- Use --incremental mode
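For example, a memory-friendly run that combines all three suggestions (the URL and numbers are illustrative; the flags are documented above):
```
docnom crawl https://blog.example.com/ \
  --concurrency 1 \
  --max-posts 25 \
  --incremental
```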
Architecture
DocNom is built on:
- @leibniz/extractor - Content extraction and image downloading
- Playwright - JavaScript rendering for dynamic sites
- Drizzle ORM - Type-safe database operations
- Turndown - HTML to Markdown conversion
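As a rough illustration of the final conversion step, the sketch below calls Turndown's public API directly on a small HTML fragment. This is not DocNom's internal code, just the library it builds on:
```
import TurndownService from 'turndown';

// Convert an HTML fragment to Markdown, similar in spirit to what
// DocNom does with each extracted article body.
const turndown = new TurndownService({ headingStyle: 'atx' });

const markdown = turndown.turndown(
  '<h1>Post Title</h1><p>Some <strong>content</strong> with a <a href="https://example.com">link</a>.</p>'
);

console.log(markdown);
// # Post Title
//
// Some **content** with a [link](https://example.com).
```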
Contributing
Contributions are welcome! Please:
- Fork the repository
- Create a feature branch
- Add tests for new adapters (see the sketch after this list)
- Submit a pull request
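For adapter tests, something along these lines is usually enough to start with. This is only a sketch and assumes a Vitest/Jest-style runner and a local module path, neither of which is prescribed by this README:
```
import { describe, expect, it } from 'vitest';
// Hypothetical path to the adapter from the Custom Adapters section above.
import { MyBlogAdapter } from './my-blog-adapter';

describe('MyBlogAdapter', () => {
  it('only handles URLs from its own domain', () => {
    const adapter = new MyBlogAdapter();
    expect(adapter.canHandle('https://myblog.com/posts/hello')).toBe(true);
    expect(adapter.canHandle('https://other.example.com/')).toBe(false);
  });
});
```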
License
MIT © 2026
Related Projects
- @leibniz/extractor - The extraction library powering DocNom
- Readability - Mozilla's content extraction algorithm (inspiration)
Happy Nomming!
