
@tremendous.dev/clippy

v0.0.1


clippy

Crawl any site. Save as markdown.

A fast, simple web scraper that saves crawled content as individual markdown files with frontmatter.

Note: Not affiliated with that helpful paperclip from your childhood. This one just grabs web pages.

Install

npm install -g @tremendous.dev/clippy

Quick Start

# Crawl a website and save as markdown files
clippy https://react.dev

# Specify output directory
clippy https://docs.python.org -o python-docs

# Crawl a GitHub repo
clippy https://github.com/user/repo -o repo-docs

# Crawl local codebase
clippy . -o my-code

What It Does

  • Crawls websites with configurable depth and concurrency
  • Extracts clean content using Mozilla Readability
  • Converts to markdown automatically
  • Saves individual files — one .md file per page with YAML frontmatter
  • Handles JavaScript sites — automatically falls back to browser mode when needed
  • Supports authentication — crawl sites that require login using persistent sessions
  • Respects robots.txt and rate limits by default
  • Works with Git repos — can crawl GitHub repos or local directories

Output Format

Each crawled page is saved as a separate markdown file with frontmatter:

---
title: "Page Title"
url: "https://example.com/page"
author: "Author Name"
date: "2024-01-01"
crawled: "2024-01-15T10:30:00.000Z"
---

# Page Title

Page content in clean markdown format...

Usage

Basic Crawling

# Crawl with default settings (depth=2, max=150 pages)
clippy https://example.com

# Customize depth and page limit
clippy https://example.com --depth 3 --max-pages 500

# Multiple sources into one directory
clippy https://react.dev https://nextjs.org -o frontend-docs

Output Control

# Specify output directory
clippy https://example.com -o my-docs

# Default output: ./clippy-output
clippy https://example.com

Crawling Behavior

# Control concurrency and rate limiting
clippy https://example.com -c 5 -r 2

# Include/exclude patterns
clippy https://example.com --include "docs/.*" --exclude ".*\.pdf"

# Disable sitemap discovery
clippy https://example.com --no-sitemap

# Ignore robots.txt (use responsibly!)
clippy https://example.com --no-robots

Browser Modes

# Force browser mode (for JS-heavy sites)
clippy https://spa-site.com --browser

# Force stealth mode (bypass anti-bot)
clippy https://protected-site.com --stealth

Authentication

Crawl sites that require login using persistent browser sessions:

# Login once (opens browser for manual authentication)
clippy auth login https://example.com

# Crawl authenticated site (automatically uses saved session)
clippy https://example.com/private-docs

# Manage sessions
clippy auth list              # List all stored sessions
clippy auth logout <url>      # Remove a session
clippy auth clear             # Clear all sessions

# Disable auth for a specific crawl
clippy https://example.com --no-auth

Preview Before Crawling

# See what's available via sitemap
clippy preview https://example.com

Options

| Flag | Description | Default |
|------|-------------|---------|
| -o, --output <dir> | Output directory for markdown files | ./clippy-output |
| -d, --depth <n> | Crawl depth (0 = single page only) | 2 |
| -m, --max-pages <n> | Maximum pages to crawl | 150 |
| -c, --concurrency <n> | Concurrent requests | 10 |
| -r, --rate-limit <n> | Max requests per second | 10 |
| -t, --timeout <ms> | Request timeout | 10000 |
| --include <regex> | Only crawl URLs matching pattern | - |
| --exclude <regex> | Skip URLs matching pattern | - |
| --label <label> | Label for crawled documents | web |
| --sitemap | Use sitemap.xml for discovery | true |
| --no-sitemap | Disable sitemap discovery | - |
| --no-robots | Ignore robots.txt | - |
| --no-auth | Disable automatic auth detection | - |
| --browser | Force browser mode | - |
| --stealth | Force stealth mode | - |
| -q, --quiet | Minimal output | - |
| -v, --verbose | Verbose output | - |

Examples

Documentation Sites

# Crawl React docs
clippy https://react.dev -o react-docs

# Crawl Python docs (large site)
clippy https://docs.python.org --depth 3 --max-pages 1000 -o python-docs

# Crawl Stripe API docs
clippy https://stripe.com/docs -o stripe-docs

Blogs & Articles

# Archive a blog
clippy https://paulgraham.com/articles.html -o pg-essays

# Specific article
clippy "https://example.com/article" -o articles

GitHub Repositories

# Crawl a GitHub repo
clippy https://github.com/user/repo -o repo-docs

# Local codebase
clippy . -o my-project-docs
clippy /path/to/project -o project-docs

Advanced Usage

# Slow and steady for rate-sensitive sites
clippy https://example.com -c 2 -r 1

# Fast crawl with high concurrency
clippy https://example.com -c 20 -r 50

# Deep crawl of specific section
clippy https://example.com/docs --depth 5 --include "docs/.*"

# JavaScript-heavy SPA
clippy https://spa.example.com --browser --max-pages 50

How It Works

Crawling Strategy

  1. URL Discovery

    • Starts with provided URLs
    • Checks for sitemap.xml (unless disabled)
    • Follows links up to specified depth
    • Respects robots.txt by default
  2. Content Extraction

    • Fetches pages with an optimized waterfall (see the sketch after this list):
      • fetch (fast, works for 90% of sites)
      • playwright (real browser, for JS sites)
      • rebrowser (stealth mode, bypasses anti-bot)
    • Extracts clean content using Mozilla Readability
    • Converts HTML to markdown
  3. File Output

    • Sanitizes URL/title into safe filename
    • Adds YAML frontmatter with metadata
    • Writes individual .md file per page
    • Handles duplicate filenames with counters
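
To make the waterfall in step 2 concrete, here is a minimal sketch of the fallback logic, assuming Node 18+ (for the global fetch) and the playwright package; the helper names and the content heuristic are illustrative, not clippy's actual internals:

import { chromium } from 'playwright';

// Render a page in a real headless browser and return the resulting HTML.
async function renderWithBrowser(url: string): Promise<string> {
  const browser = await chromium.launch({ headless: true });
  try {
    const page = await browser.newPage();
    await page.goto(url, { waitUntil: 'networkidle' });
    return await page.content();
  } finally {
    await browser.close();
  }
}

// Crude heuristic: a page with almost no visible text is probably a JS shell.
function looksLikeRealContent(html: string): boolean {
  return html.replace(/<[^>]*>/g, ' ').trim().length > 200;
}

export async function fetchWithWaterfall(url: string): Promise<string> {
  // 1. Plain HTTP fetch: cheapest, covers most static sites.
  try {
    const res = await fetch(url);
    const html = await res.text();
    if (res.ok && looksLikeRealContent(html)) return html;
  } catch {
    // Network-level failure: fall through to browser mode.
  }

  // 2. Real browser for JavaScript-rendered pages. A full implementation
  //    would add a final stealth tier (a patched browser build) for sites
  //    that block stock headless browsers.
  return renderWithBrowser(url);
}

The threshold in the heuristic is arbitrary; the point is that the crawler only pays the cost of launching a headless browser when a plain HTTP fetch clearly did not return usable content.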

Deduplication

  • Skips duplicate URLs automatically
  • Detects locale variants (/en/, /es/, etc.)
  • Identifies similar content to avoid redundancy
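
As a rough illustration of the locale handling (a sketch under assumed behavior, not clippy's actual logic), URLs can be normalized by stripping a leading two-letter language segment from the path before comparing:

// Illustrative locale-variant deduplication; names and regex are hypothetical.
const LOCALE_SEGMENT = /^\/(?:[a-z]{2}(?:-[A-Za-z]{2})?)(?=\/|$)/;

// Normalize a URL so /en/docs, /es/docs and /docs all map to the same key.
function dedupeKey(rawUrl: string): string {
  const url = new URL(rawUrl);
  const path = url.pathname.replace(LOCALE_SEGMENT, '') || '/';
  return `${url.origin}${path}`;
}

// Keep only the first URL seen for each normalized key.
function dedupe(urls: string[]): string[] {
  const seen = new Set<string>();
  return urls.filter((u) => {
    const key = dedupeKey(u);
    if (seen.has(key)) return false;
    seen.add(key);
    return true;
  });
}

// dedupe(['https://example.com/en/docs', 'https://example.com/es/docs'])
// -> ['https://example.com/en/docs']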

Use Cases

  • Offline documentation — Read docs without internet
  • Documentation archival — Preserve documentation versions
  • Content backup — Archive websites before they change
  • Research — Collect content for analysis
  • Training data — Gather markdown content for ML
  • Knowledge base — Build searchable documentation collections

Programmatic Usage

import { clippy, preview } from '@tremendous.dev/clippy';

// Crawl a site
const result = await clippy(['https://example.com'], {
  output: './docs',
  depth: 2,
  maxPages: 100,
  quiet: false
});

console.log(`Crawled ${result.pages} pages`);
console.log(`Saved to: ${result.output}`);

// Preview available pages
const sitePreview = await preview('https://example.com');
console.log(`${sitePreview.totalPages} pages available`);

Rate Limiting & Ethics

  • Default: 10 requests/second with exponential backoff (see the sketch below)
  • Respects robots.txt by default (use --no-robots to override)
  • Be responsible: Don't hammer servers or bypass restrictions maliciously
  • Consider: Reduce concurrency/rate-limit for smaller sites
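
For reference, a retry loop with exponential backoff along the lines of the first bullet might look like this sketch (assumes Node 18+ for the global fetch; it is not clippy's implementation):

// Retry a request, backing off on 429 and 5xx responses.
async function fetchWithBackoff(url: string, maxRetries = 4): Promise<Response> {
  for (let attempt = 0; ; attempt++) {
    const res = await fetch(url);
    if (res.status !== 429 && res.status < 500) return res;
    if (attempt >= maxRetries) return res;
    const delayMs = 500 * 2 ** attempt; // 500ms, 1s, 2s, 4s, ...
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
}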

Limitations

  • No form submission — Only GET requests
  • Basic JavaScript support — Complex SPAs may need --browser or --stealth
  • Cloudflare/reCAPTCHA — Stealth mode helps but isn't perfect

Troubleshooting

Site blocks requests?

clippy https://example.com --stealth

JavaScript not rendering?

clippy https://example.com --browser

Rate limited?

clippy https://example.com -c 2 -r 1

Too many pages?

clippy https://example.com --max-pages 50 --depth 1

Development

# Clone and install
git clone https://github.com/yerffejytnac/clippy.git
cd clippy
bun install

# Build
bun run build

# Link locally (creates symlink in ~/.bun/bin)
bun link

# Make sure ~/.bun/bin is in your PATH

# Use
clippy https://example.com


License

MIT

Contributing

Contributions welcome! Please open an issue or PR.

Credits