@fwdslsh/inform
A high-performance web crawler powered by Bun that downloads pages and converts them to Markdown
Inform
Features
- 🚀 Powered by Bun - Significantly faster than Node.js with built-in optimizations
- ⚡ Native HTML parsing - Uses Bun's built-in HTMLRewriter for zero-dependency HTML processing
- ⚡ Concurrent crawling - Process multiple pages simultaneously for better performance
- Crawls websites starting from a base URL
- Stays within the same domain
- Maintains original folder structure (e.g., /docs/button becomes docs/button.md)
- Extracts main content by removing navigation, ads, and other non-content elements
- Properly converts HTML code examples to markdown code blocks
- Converts HTML to clean Markdown
- Respects rate limiting with configurable delays
- Saves files with meaningful names based on URL structure
- Skips binary files and non-HTML content
- Performance monitoring - Shows processing time for each page
- Minimal dependencies - Only essential packages, no heavy DOM libraries
Installation
Quick Install Script
curl -fsSL https://raw.githubusercontent.com/fwdslsh/inform/main/install.sh | sh
Manual Downloads
Download pre-built binaries from GitHub Releases.
Docker
docker run fwdslsh/inform:latest --help
Advanced Installation
See docs/installation.md for full instructions, including how to install Inform without Bun using pre-built binaries for Linux, macOS, and Windows.
If you want to use Inform with Bun, you can still install via npm:
bun add @fwdslsh/inform
Or install globally:
bun install -g @fwdslsh/inform
Usage
Basic Usage
inform https://example.com
With Options
inform https://docs.example.com --max-pages 50 --delay 500 --concurrency 5 --output-dir ./documentation
Git Repository Downloads
# Download entire repository
inform https://github.com/owner/repo
# Download specific directory
inform https://github.com/owner/repo/tree/main/docs
# Download with filtering
inform https://github.com/owner/repo --include "*.md" --exclude "node_modules/**"
GitHub API Rate Limits:
For unauthenticated requests, GitHub limits you to 60 requests per hour. With authentication, this increases to 5,000 requests per hour. To authenticate:
export GITHUB_TOKEN="your_github_personal_access_token"
inform https://github.com/owner/repo
See docs/github-integration.md for detailed information on authentication and rate limits.
Documentation
Complete Guides
- 📖 Documentation Index - Navigate all available documentation
- 🚀 Getting Started - Basic usage, best practices, and troubleshooting
- 🔗 GitHub Integration - Download specific directories from GitHub repos
- 🕷️ Web Crawling - Advanced crawling techniques with real examples
- 🤖 Automation & Scripting - CI/CD integration and workflow automation
- 🔧 fwdslsh Ecosystem - Integration with unify, catalog, and other tools
- 💡 Examples - Real-world use cases and practical scripts
Quick Examples
Download docs from fwdslsh/unify repository:
inform https://github.com/fwdslsh/unify/tree/main/docs --output-dir ./unify-docs
Download all Scala Play 2.9 documentation:
inform https://www.playframework.com/documentation/2.9.x/ \
  --output-dir ./play-docs --max-pages 500 --delay 500
Complete documentation pipeline with fwdslsh tools:
# Download with Inform
inform https://docs.example.com --output-dir ./docs
# Process with ecosystem tools
npx @fwdslsh/unify --input ./docs --output ./unified
npx @fwdslsh/catalog ./unified --output ./llms.txt
Command Line Options
- --max-pages <number>: Maximum number of pages to crawl (default: 100)
- --delay <ms>: Delay between requests in milliseconds (default: 1000)
- --concurrency <number>: Number of concurrent requests (default: 3)
- --max-queue-size <number>: Maximum URLs in queue before skipping new links (default: 10000)
- --max-retries <number>: Maximum retry attempts for failed requests (default: 3)
- --ignore-robots: Ignore robots.txt directives (use with caution, web mode only)
- --output-dir <path>: Output directory for saved files (default: crawled-pages)
- --raw: Output raw HTML content without Markdown conversion
- --include <pattern>: Include files matching glob pattern (can be used multiple times)
- --exclude <pattern>: Exclude files matching glob pattern (can be used multiple times)
- --ignore-errors: Exit with code 0 even if some pages/files fail
- --verbose: Enable verbose logging (detailed output including retries, skipped files, and queue status)
- --quiet: Enable quiet mode (errors only, no progress messages)
- --help: Show help message
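Most of these options can be combined in a single run. The sketch below only illustrates how the flags compose; the URL and output directory are placeholders, not part of the project's own examples:
# Hypothetical larger crawl: cap the queue and retries, keep going past failures,
# and print errors only
inform https://docs.example.com \
  --max-pages 300 \
  --concurrency 4 \
  --max-queue-size 5000 \
  --max-retries 2 \
  --ignore-errors \
  --quiet \
  --output-dir ./docs-mirror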
robots.txt Support
By default, Inform respects robots.txt files. It will:
- Fetch and parse robots.txt from the target site
- Respect Disallow directives for the "Inform/1.0" user agent and wildcard "*" rules
- Apply Crawl-delay directives (overrides --delay if robots.txt specifies a higher value)
- Skip URLs blocked by robots.txt with a log message
To bypass robots.txt (only if you have explicit permission):
inform https://example.com --ignore-robots
Warning: Ignoring robots.txt may violate a site's terms of service. Only use --ignore-robots when you have explicit permission.
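If you are unsure what the crawler will honor for a given site, you can inspect its robots.txt yourself before running Inform (plain curl, placeholder URL):
# View the Disallow and Crawl-delay directives Inform will apply for this host
curl -s https://example.com/robots.txt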
Examples
Crawl a documentation site with high concurrency
inform https://docs.example.com --max-pages 50 --delay 500 --concurrency 5
Crawl a blog with custom output directory
inform https://blog.example.com --output-dir ./blog-content
Quick crawl with minimal delay
inform https://example.com --max-pages 20 --delay 200
Raw HTML output without Markdown conversion
inform https://docs.example.com --raw --output-dir ./raw-content
Integration with @fwdslsh/catalog
For users who need LLMS.txt file generation capabilities, we recommend using @fwdslsh/catalog in combination with Inform. This workflow allows you to:
- First, use Inform to crawl and convert web content to clean Markdown
- Then, use @fwdslsh/catalog to generate LLMS.txt files from the Markdown output
Example Workflow
# Step 1: Crawl documentation site with Inform
inform https://docs.example.com --output-dir ./docs-content
# Step 2: Generate LLMS.txt files with @fwdslsh/catalog
npx @fwdslsh/catalog ./docs-content --output llms.txt
Benefits of this approach:
- Separation of concerns: Inform focuses on high-quality web crawling and Markdown conversion
- Flexibility: Use @fwdslsh/catalog's advanced LLMS.txt generation features with any Markdown content
- Maintainability: Each tool can be optimized for its specific purpose
- Reusability: Generated Markdown can be used for multiple purposes beyond LLMS.txt generation
For more information about @fwdslsh/catalog, see the official documentation.
How It Works
- URL Validation: Validates the provided base URL
- Content Extraction: Uses Bun's native HTMLRewriter for efficient, streaming HTML parsing
- Smart Content Selection: Intelligently identifies main content using selectors (main, article, .content, etc.)
- Cleanup: Removes navigation, ads, scripts, and other non-content elements during parsing
- Conversion: Converts clean HTML to Markdown using Turndown
- Link Discovery: Extracts and queues internal links during the streaming parse
- Rate Limiting: Respects delay settings to avoid overwhelming servers
- File Naming: Generates meaningful filenames based on URL structure
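To watch these stages on a small sample before committing to a full crawl, a short verbose run is usually enough (placeholder URL; --verbose output covers retries, skipped files, and queue status as described in the options above):
# Trial crawl of a handful of pages with detailed logging
inform https://docs.example.com --max-pages 5 --verbose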
Technical Implementation
- Zero-dependency HTML parsing: Uses Bun's built-in HTMLRewriter (no jsdom required)
- Streaming processing: HTMLRewriter processes HTML as a stream for better memory efficiency
- Native performance: All HTML parsing and DOM manipulation uses Bun's optimized native APIs
- Minimal footprint: Reduced bundle size by eliminating heavy DOM libraries
Output
- Files are saved as .md (Markdown) files by default, or .html (raw HTML) files when using --raw
- Folder structure matches the original website (e.g., /docs/api/ becomes docs/api.md or docs/api.html)
- Root pages become index.md or index.html
- Query parameters are included in filenames when present
- HTML code examples are converted to proper markdown code blocks (Markdown mode only)
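Putting these rules together, a crawl of a hypothetical site with a root page plus /docs/api/ and /docs/guide/ pages would be expected to produce a layout roughly like this (illustrative, not captured output):
inform https://docs.example.com --output-dir ./crawled-pages
# crawled-pages/
# ├── index.md        # root page
# └── docs/
#     ├── api.md      # /docs/api/
#     └── guide.md    # /docs/guide/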
Content Extraction Strategy
The crawler attempts to find main content using this priority order:
- <main> element
- [role="main"] attribute
- Common content class names (.main-content, .content, etc.)
- <article> elements
- Bootstrap-style containers
- Fallback to <body> content
Unwanted elements are automatically removed:
- Navigation (nav, .menu, .navigation)
- Headers and footers
- Advertisements (.ad, .advertisement)
- Social sharing buttons
- Comments sections
- Scripts and styles
Requirements
- Bun v1.0.0 or higher (https://bun.sh)
- Internet connection for crawling
Dependencies
- turndown: For converting HTML to Markdown
- minimatch: For glob pattern matching
Recent Changes
- Refactored: The main crawler logic is now in src/WebCrawler.js for easier testing and maintenance.
- CLI script (cli.js) now imports the crawler class and handles argument parsing only.
- Improved modularity and testability.
- Unit tests for the crawler are provided in tests/test_cli.js.
Development
Setup
If you want to contribute to Inform or run it from source:
Clone the repository:
git clone https://github.com/fwdslsh/inform.git
cd inform
Install dependencies:
bun install
Verify setup by running tests:
bun test
All tests should pass (63 tests expected: 52 unit + 11 integration).
Testing
For comprehensive testing guidelines, best practices, and incident reports, see the Testing Guide.
# Run all tests
bun test
# Run specific test suite
bun test tests/web-crawler.test.js
bun test tests/integration/
# Run tests in watch mode
bun test --watch
Development Workflow
# Run from source
bun src/cli.js https://example.com
# Run tests
bun test
# Run tests in watch mode
bun test --watch
# Run performance benchmarks
bun run bench # Run all benchmarks
bun run bench:save # Save benchmark results to JSON
bun run bench:crawl # Crawl benchmarks only
bun run bench:parsing # HTML parsing benchmarks only
# Build binaries
bun run build # Build for current platform
bun run build:all # Build for all platforms (Linux, macOS, Windows)
# Clean build artifacts
bun run clean
Project Structure
inform/
├── src/ # Source code
│ ├── cli.js # CLI entry point and argument parsing
│ ├── WebCrawler.js # Web crawling implementation
│ ├── GitCrawler.js # Git repository downloading
│ ├── GitUrlParser.js # Git URL parsing utilities
│ └── FileFilter.js # File filtering logic
├── tests/ # Test suites
│ ├── README.md # Testing guide and best practices
│ ├── *.test.js # Unit tests
│ └── integration/ # Integration tests
│ ├── test-server.js
│ ├── web-crawler-integration.test.js
│ └── git-crawler-integration.test.js
├── benchmarks/ # Performance benchmarks
│ ├── README.md
│ ├── index.js
│ ├── crawl-benchmark.js
│ └── html-parsing-benchmark.js
├── docs/ # Documentation
└── docker/              # Docker configuration
Building Binaries
# Build for your current platform
bun run build
# Build for specific platforms
bun run build:linux # Linux x86_64
bun run build:macos # macOS x86_64
bun run build:windows # Windows x86_64
# Build for all platforms
bun run build:all
The binaries are standalone executables that include all dependencies and don't require Bun to be installed on the target system.
Contributing
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:
- Development setup and workflow
- Code style guidelines
- Testing requirements
- Submitting pull requests
- Reporting issues
Quick checklist before submitting:
- All tests pass (bun test)
- Code follows the existing style
- New features include tests
- Documentation is updated
Ethical Use & Terms of Service
Please respect the work of others when crawling websites.
- Always review and abide by the target site's robots.txt, terms of service, and copyright policies.
- Do not use this tool for scraping or redistributing proprietary or copyrighted content without permission.
- Use reasonable rate limits and avoid overwhelming servers.
Roadmap
- Create Distribution Process: Add a build process to compile and package inform for zero-dependency cross-platform support.
- Efficient Git Directory Download: ✅ COMPLETED - Add support for downloading only specific directories (e.g., docs/) from public git repositories, enabling quick access to documentation without cloning the entire repo.
- Configurable Extraction: Allow users to specify custom selectors or extraction rules for different sites.
- Advanced Filtering: Add more granular controls for what content is included/excluded.
- Improved Markdown Conversion: Enhance code block and table handling for more accurate documentation conversion.
License
CC-BY
