@fwdslsh/inform
A high-performance web crawler powered by Bun that downloads pages and converts them to Markdown
Inform
Features
- 🚀 Powered by Bun - Significantly faster than Node.js with built-in optimizations
- ⚡ Native HTML parsing - Uses Bun's built-in HTMLRewriter for zero-dependency HTML processing
- ⚡ Concurrent crawling - Process multiple pages simultaneously for better performance
- Crawls websites starting from a base URL
- Stays within the same domain
- Maintains original folder structure (e.g., /docs/button becomes docs/button.md)
- Extracts main content by removing navigation, ads, and other non-content elements
- Properly converts HTML code examples to markdown code blocks
- Converts HTML to clean Markdown
- Respects rate limiting with configurable delays
- Saves files with meaningful names based on URL structure
- Skips binary files and non-HTML content
- Performance monitoring - Shows processing time for each page
- Minimal dependencies - Only essential packages, no heavy DOM libraries
Installation
Quick Install Script
curl -fsSL https://raw.githubusercontent.com/fwdslsh/inform/main/install.sh | sh
Manual Downloads
Download pre-built binaries from GitHub Releases.
Docker
docker run fwdslsh/inform:latest --help
Advanced Installation
See docs/installation.md for full instructions, including how to install Inform without Bun using pre-built binaries for Linux, macOS, and Windows.
If you want to use Inform with Bun, you can still install via npm:
bun add @fwdslsh/inform
Or install globally:
bun install -g @fwdslsh/inform
Usage
Basic Usage
inform https://example.com
With Options
inform https://docs.example.com --max-pages 50 --delay 500 --concurrency 5 --output-dir ./documentation
Git Repository Downloads
# Download entire repository
inform https://github.com/owner/repo
# Download specific directory
inform https://github.com/owner/repo/tree/main/docs
# Download with filtering
inform https://github.com/owner/repo --include "*.md" --exclude "node_modules/**"
GitHub API Rate Limits:
For unauthenticated requests, GitHub limits you to 60 requests per hour. With authentication, this increases to 5,000 requests per hour. To authenticate:
export GITHUB_TOKEN="your_github_personal_access_token"
inform https://github.com/owner/repo
See docs/github-integration.md for detailed information on authentication and rate limits.
Documentation
Complete Guides
- 📖 Documentation Index - Navigate all available documentation
- 🚀 Getting Started - Basic usage, best practices, and troubleshooting
- 🔗 GitHub Integration - Download specific directories from GitHub repos
- 🕷️ Web Crawling - Advanced crawling techniques with real examples
- 🤖 Automation & Scripting - CI/CD integration and workflow automation
- 🔧 fwdslsh Ecosystem - Integration with unify, catalog, and other tools
- 💡 Examples - Real-world use cases and practical scripts
Quick Examples
Download docs from fwdslsh/unify repository:
inform https://github.com/fwdslsh/unify/tree/main/docs --output-dir ./unify-docs
Download all Scala Play 2.9 documentation:
inform https://www.playframework.com/documentation/2.9.x/ \
  --output-dir ./play-docs --max-pages 500 --delay 500
Complete documentation pipeline with fwdslsh tools:
# Download with Inform
inform https://docs.example.com --output-dir ./docs
# Process with ecosystem tools
npx @fwdslsh/unify --input ./docs --output ./unified
npx @fwdslsh/catalog ./unified --output ./llms.txt
Command Line Options
- --max-pages <number>: Maximum number of pages to crawl (default: 100)
- --delay <ms>: Delay between requests in milliseconds (default: 1000)
- --concurrency <number>: Number of concurrent requests (default: 3)
- --max-queue-size <number>: Maximum URLs in queue before skipping new links (default: 10000)
- --max-retries <number>: Maximum retry attempts for failed requests (default: 3)
- --ignore-robots: Ignore robots.txt directives (use with caution, web mode only)
- --output-dir <path>: Output directory for saved files (default: crawled-pages)
- --raw: Output raw HTML content without Markdown conversion
- --include <pattern>: Include files matching glob pattern (can be used multiple times)
- --exclude <pattern>: Exclude files matching glob pattern (can be used multiple times)
- --ignore-errors: Exit with code 0 even if some pages/files fail
- --verbose: Enable verbose logging (detailed output including retries, skipped files, and queue status)
- --quiet: Enable quiet mode (errors only, no progress messages)
- --help: Show help message
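Most of these options can be combined in a single run. The sketch below only illustrates how the flags compose; the URL and output directory are placeholders, not part of the project's own examples:
# Hypothetical larger crawl: cap the queue and retries, keep going past failures,
# and print errors only
inform https://docs.example.com \
  --max-pages 300 \
  --concurrency 4 \
  --max-queue-size 5000 \
  --max-retries 2 \
  --ignore-errors \
  --quiet \
  --output-dir ./docs-mirror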
robots.txt Support
By default, Inform respects robots.txt files. It will:
- Fetch and parse robots.txt from the target site
- Respect Disallow directives for the "Inform/1.0" user agent and wildcard "*" rules
- Apply Crawl-delay directives (overrides --delay if robots.txt specifies a higher value)
- Skip URLs blocked by robots.txt with a log message
To bypass robots.txt (only if you have explicit permission):
inform https://example.com --ignore-robots
Warning: Ignoring robots.txt may violate a site's terms of service. Only use --ignore-robots when you have explicit permission.
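If you are unsure what the crawler will honor for a given site, you can inspect its robots.txt yourself before running Inform (plain curl, placeholder URL):
# View the Disallow and Crawl-delay directives Inform will apply for this host
curl -s https://example.com/robots.txt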
Examples
Crawl a documentation site with high concurrency
inform https://docs.example.com --max-pages 50 --delay 500 --concurrency 5
Crawl a blog with custom output directory
inform https://blog.example.com --output-dir ./blog-content
Quick crawl with minimal delay
inform https://example.com --max-pages 20 --delay 200
Raw HTML output without Markdown conversion
inform https://docs.example.com --raw --output-dir ./raw-content
Integration with @fwdslsh/catalog
For users who need LLMS.txt file generation capabilities, we recommend using @fwdslsh/catalog in combination with Inform. This workflow allows you to:
- First, use Inform to crawl and convert web content to clean Markdown
- Then, use @fwdslsh/catalog to generate LLMS.txt files from the Markdown output
Example Workflow
# Step 1: Crawl documentation site with Inform
inform https://docs.example.com --output-dir ./docs-content
# Step 2: Generate LLMS.txt files with @fwdslsh/catalog
npx @fwdslsh/catalog ./docs-content --output llms.txt
Benefits of this approach:
- Separation of concerns: Inform focuses on high-quality web crawling and Markdown conversion
- Flexibility: Use @fwdslsh/catalog's advanced LLMS.txt generation features with any Markdown content
- Maintainability: Each tool can be optimized for its specific purpose
- Reusability: Generated Markdown can be used for multiple purposes beyond LLMS.txt generation
For more information about @fwdslsh/catalog, see the official documentation.
How It Works
- URL Validation: Validates the provided base URL
- Content Extraction: Uses Bun's native HTMLRewriter for efficient, streaming HTML parsing
- Smart Content Selection: Intelligently identifies main content using selectors (main, article, .content, etc.)
- Cleanup: Removes navigation, ads, scripts, and other non-content elements during parsing
- Conversion: Converts clean HTML to Markdown using Turndown
- Link Discovery: Extracts and queues internal links during the streaming parse
- Rate Limiting: Respects delay settings to avoid overwhelming servers
- File Naming: Generates meaningful filenames based on URL structure
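To watch these stages on a small sample before committing to a full crawl, a short verbose run is usually enough (placeholder URL; --verbose output covers retries, skipped files, and queue status as described in the options above):
# Trial crawl of a handful of pages with detailed logging
inform https://docs.example.com --max-pages 5 --verbose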
Technical Implementation
- Zero-dependency HTML parsing: Uses Bun's built-in HTMLRewriter (no jsdom required)
- Streaming processing: HTMLRewriter processes HTML as a stream for better memory efficiency
- Native performance: All HTML parsing and DOM manipulation uses Bun's optimized native APIs
- Minimal footprint: Reduced bundle size by eliminating heavy DOM libraries
Output
- Files are saved as .md (Markdown) files by default, or .html (raw HTML) files when using --raw
- Folder structure matches the original website (e.g., /docs/api/ becomes docs/api.md or docs/api.html)
- Root pages become index.md or index.html
- Query parameters are included in filenames when present
- HTML code examples are converted to proper markdown code blocks (Markdown mode only)
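Putting these rules together, a crawl of a hypothetical site with a root page plus /docs/api/ and /docs/guide/ pages would be expected to produce a layout roughly like this (illustrative, not captured output):
inform https://docs.example.com --output-dir ./crawled-pages
# crawled-pages/
# ├── index.md        # root page
# └── docs/
#     ├── api.md      # /docs/api/
#     └── guide.md    # /docs/guide/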
Content Extraction Strategy
The crawler attempts to find main content using this priority order:
- <main> element
- [role="main"] attribute
- Common content class names (.main-content, .content, etc.)
- <article> elements
- Bootstrap-style containers
- Fallback to <body> content
Unwanted elements are automatically removed:
- Navigation (nav, .menu, .navigation)
- Headers and footers
- Advertisements (.ad, .advertisement)
- Social sharing buttons
- Comments sections
- Scripts and styles
Requirements
- Bun v1.0.0 or higher (https://bun.sh)
- Internet connection for crawling
Dependencies
- turndown: For converting HTML to Markdown
- minimatch: For glob pattern matching
Recent Changes
- Refactored: The main crawler logic is now in src/WebCrawler.js for easier testing and maintenance.
- CLI script (cli.js) now imports the crawler class and handles argument parsing only.
- Improved modularity and testability.
- Unit tests for the crawler are provided in tests/test_cli.js.
Development
Setup
If you want to contribute to Inform or run it from source:
Clone the repository:
git clone https://github.com/fwdslsh/inform.git
cd inform
Install dependencies:
bun install
Verify setup by running tests:
bun test
All tests should pass (63 tests expected: 52 unit + 11 integration).
Testing
For comprehensive testing guidelines, best practices, and incident reports, see the Testing Guide.
# Run all tests
bun test
# Run specific test suite
bun test tests/web-crawler.test.js
bun test tests/integration/
# Run tests in watch mode
bun test --watch
Development Workflow
# Run from source
bun src/cli.js https://example.com
# Run tests
bun test
# Run tests in watch mode
bun test --watch
# Run performance benchmarks
bun run bench # Run all benchmarks
bun run bench:save # Save benchmark results to JSON
bun run bench:crawl # Crawl benchmarks only
bun run bench:parsing # HTML parsing benchmarks only
# Build binaries
bun run build # Build for current platform
bun run build:all # Build for all platforms (Linux, macOS, Windows)
# Clean build artifacts
bun run clean
Project Structure
inform/
├── src/ # Source code
│ ├── cli.js # CLI entry point and argument parsing
│ ├── WebCrawler.js # Web crawling implementation
│ ├── GitCrawler.js # Git repository downloading
│ ├── GitUrlParser.js # Git URL parsing utilities
│ └── FileFilter.js # File filtering logic
├── tests/ # Test suites
│ ├── README.md # Testing guide and best practices
│ ├── *.test.js # Unit tests
│ └── integration/ # Integration tests
│ ├── test-server.js
│ ├── web-crawler-integration.test.js
│ └── git-crawler-integration.test.js
├── benchmarks/ # Performance benchmarks
│ ├── README.md
│ ├── index.js
│ ├── crawl-benchmark.js
│ └── html-parsing-benchmark.js
├── docs/ # Documentation
└── docker/              # Docker configuration
Building Binaries
# Build for your current platform
bun run build
# Build for specific platforms
bun run build:linux # Linux x86_64
bun run build:macos # macOS x86_64
bun run build:windows # Windows x86_64
# Build for all platforms
bun run build:all
The binaries are standalone executables that include all dependencies and don't require Bun to be installed on the target system.
Contributing
We welcome contributions! Please see CONTRIBUTING.md for detailed guidelines on:
- Development setup and workflow
- Code style guidelines
- Testing requirements
- Submitting pull requests
- Reporting issues
Quick checklist before submitting:
- All tests pass (bun test)
- Code follows the existing style
- New features include tests
- Documentation is updated
Ethical Use & Terms of Service
Please respect the work of others when crawling websites.
- Always review and abide by the target site's robots.txt, terms of service, and copyright policies.
- Do not use this tool for scraping or redistributing proprietary or copyrighted content without permission.
- Use reasonable rate limits and avoid overwhelming servers.
Roadmap
- Create Distribution Process: Add a build process to compile and package inform for zero-dependency cross-platform support.
- Efficient Git Directory Download: ✅ COMPLETED - Add support for downloading only specific directories (e.g., docs/) from public git repositories, enabling quick access to documentation without cloning the entire repo.
- Configurable Extraction: Allow users to specify custom selectors or extraction rules for different sites.
- Advanced Filtering: Add more granular controls for what content is included/excluded.
- Improved Markdown Conversion: Enhance code block and table handling for more accurate documentation conversion.
License
CC-BY
