@fwdslsh/inform

v0.2.0

A high-performance web crawler powered by Bun that downloads pages and converts them to Markdown

Inform

Features

  • 🚀 Powered by Bun - Significantly faster than Node.js with built-in optimizations
  • ⚡ Native HTML parsing - Uses Bun's built-in HTMLRewriter for zero-dependency HTML processing
  • ⚡ Concurrent crawling - Processes multiple pages simultaneously for better performance
  • Crawls websites starting from a base URL
  • Stays within the same domain
  • Maintains original folder structure (e.g., /docs/button becomes docs/button.md)
  • Extracts main content by removing navigation, ads, and other non-content elements
  • Properly converts HTML code examples to markdown code blocks
  • Converts HTML to clean Markdown
  • Respects rate limiting with configurable delays
  • Saves files with meaningful names based on URL structure
  • Skips binary files and non-HTML content
  • Performance monitoring - Shows processing time for each page
  • Minimal dependencies - Only essential packages, no heavy DOM libraries

Installation

Quick Install Script

curl -fsSL https://raw.githubusercontent.com/fwdslsh/inform/main/install.sh | sh

Manual Downloads

Download pre-built binaries from GitHub Releases.

Docker

docker run fwdslsh/inform:latest --help

Advanced Installation

See docs/installation.md for full instructions, including how to install Inform without Bun using pre-built binaries for Linux, macOS, and Windows.

If you have Bun installed, you can also install the package from npm:

bun add @fwdslsh/inform

Or install globally:

bun install -g @fwdslsh/inform

Usage

Basic Usage

inform https://example.com

With Options

inform https://docs.example.com --max-pages 50 --delay 500 --concurrency 5 --output-dir ./documentation

Git Repository Downloads

# Download entire repository
inform https://github.com/owner/repo

# Download specific directory
inform https://github.com/owner/repo/tree/main/docs

# Download with filtering
inform https://github.com/owner/repo --include "*.md" --exclude "node_modules/**"

GitHub API Rate Limits:

For unauthenticated requests, GitHub limits you to 60 requests per hour. With authentication, this increases to 5,000 requests per hour. To authenticate:

export GITHUB_TOKEN="your_github_personal_access_token"
inform https://github.com/owner/repo
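
Under the hood, authenticated GitHub API requests simply carry the token in an Authorization header. The snippet below is a minimal sketch of what that amounts to, not the actual GitCrawler.js code; the endpoint path is illustrative.

// Sketch (assumed behavior, not the actual GitCrawler.js code): when
// GITHUB_TOKEN is set, GitHub API requests carry an Authorization header,
// which raises the rate limit from 60 to 5,000 requests per hour.
const headers = { Accept: "application/vnd.github+json" };
if (process.env.GITHUB_TOKEN) {
  headers.Authorization = `Bearer ${process.env.GITHUB_TOKEN}`;
}
// Endpoint shown for illustration: list the docs/ directory of a repository.
const res = await fetch("https://api.github.com/repos/owner/repo/contents/docs", { headers });
console.log(res.headers.get("x-ratelimit-remaining"));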

See docs/github-integration.md for detailed information on authentication and rate limits.

Documentation

Quick Examples

Download docs from fwdslsh/unify repository:

inform https://github.com/fwdslsh/unify/tree/main/docs --output-dir ./unify-docs

Download all Scala Play 2.9 documentation:

inform https://www.playframework.com/documentation/2.9.x/ \
  --output-dir ./play-docs --max-pages 500 --delay 500

Complete documentation pipeline with fwdslsh tools:

# Download with Inform
inform https://docs.example.com --output-dir ./docs

# Process with ecosystem tools
npx @fwdslsh/unify --input ./docs --output ./unified
npx @fwdslsh/catalog ./unified --output ./llms.txt

Command Line Options

  • --max-pages <number>: Maximum number of pages to crawl (default: 100)
  • --delay <ms>: Delay between requests in milliseconds (default: 1000)
  • --concurrency <number>: Number of concurrent requests (default: 3)
  • --max-queue-size <number>: Maximum URLs in queue before skipping new links (default: 10000)
  • --max-retries <number>: Maximum retry attempts for failed requests (default: 3)
  • --ignore-robots: Ignore robots.txt directives (use with caution, web mode only)
  • --output-dir <path>: Output directory for saved files (default: crawled-pages)
  • --raw: Output raw HTML content without Markdown conversion
  • --include <pattern>: Include files matching glob pattern (can be used multiple times)
  • --exclude <pattern>: Exclude files matching glob pattern (can be used multiple times)
  • --ignore-errors: Exit with code 0 even if some pages/files fail
  • --verbose: Enable verbose logging (detailed output including retries, skipped files, and queue status)
  • --quiet: Enable quiet mode (errors only, no progress messages)
  • --help: Show help message
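
These options can be combined freely. For instance, a slower, more careful crawl with verbose logging might look like this (the URL and values are placeholders):

# Illustrative combination of the options above
inform https://docs.example.com \
  --max-pages 200 --delay 1500 --concurrency 2 \
  --max-retries 5 --output-dir ./site-docs --verbose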

robots.txt Support

By default, Inform respects robots.txt files. It will:

  • Fetch and parse robots.txt from the target site
  • Respect Disallow directives for the "Inform/1.0" user agent and wildcard "*" rules
  • Apply Crawl-delay directives (overrides --delay if robots.txt specifies a higher value)
  • Skip URLs blocked by robots.txt with a log message
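
The Crawl-delay rule in particular boils down to taking the larger of the two values; a tiny sketch of that assumed behavior:

// Sketch of the override rule (assumed behavior): the effective delay is
// whichever is larger, the CLI --delay value (ms) or robots.txt's
// Crawl-delay directive (seconds).
function effectiveDelayMs(cliDelayMs, crawlDelaySeconds) {
  return Math.max(cliDelayMs, (crawlDelaySeconds ?? 0) * 1000);
}

// effectiveDelayMs(1000, 5)   -> 5000  (robots.txt wins)
// effectiveDelayMs(1000, 0.5) -> 1000  (--delay wins)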

To bypass robots.txt (only if you have explicit permission):

inform https://example.com --ignore-robots

Warning: Ignoring robots.txt may violate a site's terms of service. Only use --ignore-robots when you have explicit permission.

Examples

Crawl a documentation site with high concurrency

inform https://docs.example.com --max-pages 50 --delay 500 --concurrency 5

Crawl a blog with custom output directory

inform https://blog.example.com --output-dir ./blog-content

Quick crawl with minimal delay

inform https://example.com --max-pages 20 --delay 200

Raw HTML output without Markdown conversion

inform https://docs.example.com --raw --output-dir ./raw-content

Integration with @fwdslsh/catalog

For users who need LLMS.txt file generation capabilities, we recommend using @fwdslsh/catalog in combination with Inform. This workflow allows you to:

  1. First, use Inform to crawl and convert web content to clean Markdown
  2. Then, use @fwdslsh/catalog to generate LLMS.txt files from the Markdown output

Example Workflow

# Step 1: Crawl documentation site with Inform
inform https://docs.example.com --output-dir ./docs-content

# Step 2: Generate LLMS.txt files with @fwdslsh/catalog  
npx @fwdslsh/catalog ./docs-content --output llms.txt

Benefits of this approach:

  • Separation of concerns: Inform focuses on high-quality web crawling and Markdown conversion
  • Flexibility: Use @fwdslsh/catalog's advanced LLMS.txt generation features with any Markdown content
  • Maintainability: Each tool can be optimized for its specific purpose
  • Reusability: Generated Markdown can be used for multiple purposes beyond LLMS.txt generation

For more information about @fwdslsh/catalog, see the official documentation.

How It Works

  1. URL Validation: Validates the provided base URL
  2. Content Extraction: Uses Bun's native HTMLRewriter for efficient, streaming HTML parsing
  3. Smart Content Selection: Intelligently identifies main content using selectors (main, article, .content, etc.)
  4. Cleanup: Removes navigation, ads, scripts, and other non-content elements during parsing
  5. Conversion: Converts clean HTML to Markdown using Turndown
  6. Link Discovery: Extracts and queues internal links during the streaming parse
  7. Rate Limiting: Respects delay settings to avoid overwhelming servers
  8. File Naming: Generates meaningful filenames based on URL structure

Technical Implementation

  • Zero-dependency HTML parsing: Uses Bun's built-in HTMLRewriter (no jsdom required)
  • Streaming processing: HTMLRewriter processes HTML as a stream for better memory efficiency
  • Native performance: All HTML parsing and DOM manipulation uses Bun's optimized native APIs
  • Minimal footprint: Reduced bundle size by eliminating heavy DOM libraries
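
The following is a minimal, self-contained sketch of this streaming pipeline, not the actual src/WebCrawler.js code: it strips obvious non-content elements with HTMLRewriter, collects same-origin links, and hands the remaining HTML to Turndown. The selector list and option values here are illustrative assumptions.

import TurndownService from "turndown";

// Selectors treated as non-content in this sketch (an assumed list, not
// the exact constants used by src/WebCrawler.js).
const REMOVE_SELECTORS = ["nav", "header", "footer", "aside", "script", "style", ".ad", ".advertisement"];

async function fetchAsMarkdown(url) {
  const response = await fetch(url);
  const origin = new URL(url).origin;
  const links = new Set();

  // HTMLRewriter works on the response as a stream: unwanted elements are
  // dropped and same-origin links are collected while the HTML is parsed.
  let rewriter = new HTMLRewriter().on("a[href]", {
    element(el) {
      try {
        const resolved = new URL(el.getAttribute("href"), url);
        if (resolved.origin === origin) links.add(resolved.href);
      } catch {} // ignore malformed hrefs
    },
  });
  for (const selector of REMOVE_SELECTORS) {
    rewriter = rewriter.on(selector, { element(el) { el.remove(); } });
  }
  const cleanedHtml = await rewriter.transform(response).text();

  // Turndown converts the cleaned HTML into Markdown.
  const turndown = new TurndownService({ headingStyle: "atx", codeBlockStyle: "fenced" });
  return { markdown: turndown.turndown(cleanedHtml), links: [...links] };
}

// Usage (run with Bun): const { markdown, links } = await fetchAsMarkdown("https://example.com");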

Output

  • Files are saved as .md (Markdown) files by default, or .html (raw HTML) files when using --raw
  • Folder structure matches the original website (e.g., /docs/api/ becomes docs/api.md or docs/api.html)
  • Root pages become index.md or index.html
  • Query parameters are included in filenames when present
  • HTML code examples are converted to proper markdown code blocks (Markdown mode only)

Content Extraction Strategy

The crawler attempts to find main content using this priority order:

  1. <main> element
  2. [role="main"] attribute
  3. Common content class names (.main-content, .content, etc.)
  4. <article> elements
  5. Bootstrap-style containers
  6. Fallback to <body> content

Unwanted elements are automatically removed:

  • Navigation (nav, .menu, .navigation)
  • Headers and footers
  • Advertisements (.ad, .advertisement)
  • Social sharing buttons
  • Comments sections
  • Scripts and styles
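
Expressed as data, the priority order above is just an ordered selector list with a first-match-wins lookup. The literals below are an illustrative reconstruction from this README, not the tool's actual constants.

// Candidate selectors tried in order; the first one that matches is used
// (sketch, reconstructed from the list above).
const CONTENT_SELECTORS = [
  "main",
  '[role="main"]',
  ".main-content",
  ".content",
  "article",
  ".container",   // Bootstrap-style containers
  "body",         // final fallback
];

// Given a predicate that reports whether a selector matches anything in the
// page, pick the highest-priority selector that does.
function pickContentSelector(matches) {
  return CONTENT_SELECTORS.find(matches) ?? "body";
}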

Requirements

  • Bun v1.0.0 or higher (https://bun.sh)
  • Internet connection for crawling

Dependencies

  • turndown: For converting HTML to Markdown
  • minimatch: For glob pattern matching
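
As a rough illustration of how the two dependencies divide the work: Turndown handles the HTML-to-Markdown step (see the sketch under Technical Implementation), while minimatch backs the --include/--exclude glob matching. A simplified filter, not the actual src/FileFilter.js logic, might look like this:

import { minimatch } from "minimatch";

// Sketch of --include / --exclude semantics (assumed behavior): a path is
// kept when it matches at least one include pattern (or no includes were
// given) and matches none of the exclude patterns.
function shouldKeep(path, includes = [], excludes = []) {
  const opts = { dot: true };
  const included =
    includes.length === 0 || includes.some((p) => minimatch(path, p, opts));
  const excluded = excludes.some((p) => minimatch(path, p, opts));
  return included && !excluded;
}

// shouldKeep("docs/guide.md", ["**/*.md"], ["node_modules/**"])              -> true
// shouldKeep("node_modules/pkg/readme.md", ["**/*.md"], ["node_modules/**"]) -> false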

Recent Changes

  • Refactored: The main crawler logic is now in src/WebCrawler.js for easier testing and maintenance.
  • The CLI script (src/cli.js) now imports the crawler class and handles only argument parsing.
  • Improved modularity and testability.
  • Unit tests for the crawler are provided in tests/test_cli.js.

Development

Setup

If you want to contribute to Inform or run it from source:

  1. Clone the repository:

    git clone https://github.com/fwdslsh/inform.git
    cd inform
  2. Install dependencies:

    bun install
  3. Verify setup by running tests:

    bun test

All tests should pass (63 tests expected: 52 unit + 11 integration).

Testing

For comprehensive testing guidelines, best practices, and incident reports, see the Testing Guide.

# Run all tests
bun test

# Run specific test suite
bun test tests/web-crawler.test.js
bun test tests/integration/

# Run tests in watch mode
bun test --watch

Development Workflow

# Run from source
bun src/cli.js https://example.com

# Run tests
bun test

# Run tests in watch mode
bun test --watch

# Run performance benchmarks
bun run bench              # Run all benchmarks
bun run bench:save         # Save benchmark results to JSON
bun run bench:crawl        # Crawl benchmarks only
bun run bench:parsing      # HTML parsing benchmarks only

# Build binaries
bun run build              # Build for current platform
bun run build:all          # Build for all platforms (Linux, macOS, Windows)

# Clean build artifacts
bun run clean

Project Structure

inform/
├── src/                    # Source code
│   ├── cli.js             # CLI entry point and argument parsing
│   ├── WebCrawler.js      # Web crawling implementation
│   ├── GitCrawler.js      # Git repository downloading
│   ├── GitUrlParser.js    # Git URL parsing utilities
│   └── FileFilter.js      # File filtering logic
├── tests/                  # Test suites
│   ├── README.md          # Testing guide and best practices
│   ├── *.test.js          # Unit tests
│   └── integration/       # Integration tests
│       ├── test-server.js
│       ├── web-crawler-integration.test.js
│       └── git-crawler-integration.test.js
├── benchmarks/             # Performance benchmarks
│   ├── README.md
│   ├── index.js
│   ├── crawl-benchmark.js
│   └── html-parsing-benchmark.js
├── docs/                   # Documentation
└── docker/                 # Docker configuration

Building Binaries

# Build for your current platform
bun run build

# Build for specific platforms
bun run build:linux       # Linux x86_64
bun run build:macos       # macOS x86_64
bun run build:windows     # Windows x86_64

# Build for all platforms
bun run build:all

The binaries are standalone executables that include all dependencies and don't require Bun to be installed on the target system.

Contributing

We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:

  • Development setup and workflow
  • Code style guidelines
  • Testing requirements
  • Submitting pull requests
  • Reporting issues

Quick checklist before submitting:

  • All tests pass (bun test)
  • Code follows the existing style
  • New features include tests
  • Documentation is updated

Ethical Use & Terms of Service

Please respect the work of others when crawling websites.

  • Always review and abide by the target site's robots.txt, terms of service, and copyright policies.
  • Do not use this tool for scraping or redistributing proprietary or copyrighted content without permission.
  • Use reasonable rate limits and avoid overwhelming servers.

Roadmap

  • Create Distribution Process: Add a build process to compile and package inform for zero-dependency cross-platform support.
  • Efficient Git Directory Download: ✅ COMPLETED - Add support for downloading only specific directories (e.g., docs/) from public git repositories, enabling quick access to documentation without cloning the entire repo.
  • Configurable Extraction: Allow users to specify custom selectors or extraction rules for different sites.
  • Advanced Filtering: Add more granular controls for what content is included/excluded.
  • Improved Markdown Conversion: Enhance code block and table handling for more accurate documentation conversion.

License

CC-BY