site2rag
v0.5.3
Published
CLI tool to convert websites into RAG-ready local markdown knowledge bases with AI-powered content improvement.
Readme
🚀 site2rag
Transform any website into a maintained, RAG-ready local knowledge base with a single command
npx site2rag docs.example.comThat's it! Your entire documentation site is now a clean, searchable, AI-ready knowledge base in ./docs.example.com/ 🎯
✨ Why site2rag?
🏎️ Free, Efficient HTML Preprocessing (with Site Learning!)
site2rag features a novel two-stage HTML preprocessing pipeline that delivers professional-grade content extraction with maximum efficiency:
- Maximum Rule-Based Filtering: 90%+ of noise is eliminated instantly using fast, free heuristics—no AI required for most pages.
- Strategic Minimal AI: Only truly ambiguous blocks are summarized and sent to AI, saving time and compute.
- Site Structure Learning: If a site has repeated or consistent structure, site2rag learns from your choices and past runs, automatically resolving ambiguity for similar pages in the future. Over time, this means fewer and fewer AI calls are needed!
Why is this unique?
- Most tools either use crude selectors (inaccurate) or send entire pages to AI (expensive/slow). site2rag combines the best of both: blazing fast, free preprocessing with just enough AI to handle the hard cases—and gets smarter the more you use it.
- This approach makes site2rag ideal for large websites, iterative crawls, and anyone who wants to minimize AI costs while maximizing quality.
See html-preprocessing.md for technical details.
The Problem: You want to use local RAG (Retrieval-Augmented Generation) with documentation websites, but:
- 📄 Raw HTML is messy and full of navigation noise
- 🔄 Sites change frequently - manual downloads get stale
- 🧠 Content needs semantic enhancement for better AI retrieval
- 📁 You need clean citations back to original sources
The Solution: site2rag intelligently converts entire websites into maintained, AI-optimized knowledge bases that stay fresh automatically.
Website site2rag RAG-Ready Knowledge Base
┌─────────────┐ ┌─────────────┐ ┌─────────────────────────┐
│ 🌐 Raw HTML │ ────────▶ │ 🧠 AI Magic │ ────────▶ │ 📚 Clean Markdown │
│ • Navigation│ │ • Content │ │ • Semantic hints │
│ • Ads │ │ filtering │ │ • Perfect citations │
│ • Clutter │ │ • Context │ │ • Auto-updated │
│ • Mess │ │ injection │ │ • RAG-optimized │
└─────────────┘ └─────────────┘ └─────────────────────────┘🎯 Perfect For
🔬 Researchers
- Build local knowledge bases from documentation sites
- Keep research materials automatically updated
- Generate perfect citations for academic work
🤖 AI Engineers
- Create high-quality RAG datasets from any website
- Enhance content with semantic context for better retrieval
- Maintain fresh training data with zero effort
📝 Content Creators
- Archive competitor sites and documentation
- Track content changes over time
- Build searchable reference libraries
⚡ Lightning-Fast Updates
The magic happens on subsequent runs:
# First run: downloads everything
npx site2rag docs.kubernetes.io
# 📥 Processing 47,000 pages...
# Later runs: lightning fast
npx site2rag docs.kubernetes.io
# ⚡ Checked 47,000 pages in 30 seconds
# ✅ 3 pages updated, 46,997 unchangedHow? Smart change detection using ETags, Last-Modified headers, and content hashing means only changed content gets re-downloaded. Turn daily documentation syncing into a habit!
🎯 Sitemap-First Architecture (v0.4.0+)
Intelligent crawling that saves bandwidth and time:
# Traditional crawlers: Download everything, then filter
❌ Downloads 1000 pages → Filters → Keeps 200 pages (80% waste)
# site2rag: Discover, filter, then download
✅ Discovers 1000 URLs → Filters → Downloads 200 pages (80% saved!)Key Features:
- 📋 Sitemap Discovery: Automatically finds and parses XML sitemaps
- 🌍 Language Detection: Uses hreflang attributes for language filtering
- 🎛️ Smart Filtering: Apply path and pattern filters before downloading
- 💾 Database-Driven: Stores URL metadata for efficient re-crawls
- ⚡ Bandwidth Savings: Up to 79% reduction in unnecessary downloads
🧠 AI-Enhanced Content Processing
Intelligent Content vs Noise Detection
Traditional scrapers use crude CSS selectors. site2rag uses local AI to understand content semantically:
❌ Traditional: "Remove all .sidebar elements" ✅ site2rag AI: "This sidebar contains valuable API references - keep it!"Enhanced Metadata Extraction 📊
NEW in v0.4.3: Comprehensive metadata extraction from multiple sources:
- JSON-LD structured data - Extracts Article, Person, PodcastEpisode schemas
- Author biographical information - Captures author bios from Person JSON-LD
- Multi-source keyword aggregation - Combines keywords from meta tags, article tags, and JSON-LD
- Fallback chains - Title: JSON-LD → meta → OG; Author: JSON-LD → meta → Dublin Core → byline
- Rich metadata fields - Author bio, job title, organization, published/modified dates
Document Download Support 📄
NEW in v0.4.3: Automatically downloads searchable documents linked in pages:
- PDF, Word, OpenDocument formats supported
- External CDN support - Downloads PDFs even from S3, CloudFlare, etc.
- Smart deduplication - Won't download the same document twice
- Organized storage - Documents saved in
assets/documents/directory - Binary file tracking - PDFs and other binary files tracked as first-class pages
- Change detection - Re-downloads binary files when source changes
- Database integration - Binary files tracked in SQLite with metadata
Before (raw HTML):
<nav>Documentation</nav>
<div class="sidebar">
<h3>Related APIs</h3>
<a href="/auth">Authentication</a>
<a href="/rate-limits">Rate Limits</a>
</div>
<main>
<h1>Getting Started</h1>
<p>To use this API...</p>
</main>
<footer>© 2024 Company</footer>After (AI-processed):
---
source_url: 'https://docs.example.com/getting-started'
title: 'Getting Started Guide'
author: 'John Doe'
authorDescription: 'John Doe is a senior software engineer with 10+ years of experience in API design'
authorJobTitle: 'Senior Software Engineer'
authorOrganization: 'Tech Corp'
datePublished: '2025-06-20T00:00:00Z'
dateModified: '2025-06-25T12:00:00Z'
keywords: ['api', 'authentication', 'getting-started', 'rest-api']
processing: ['ai_content_classification', 'context_injection']
---
# Getting Started
To use <span data-ctx="REST API authentication">this API</span>, first configure your credentials...
## Related APIs
- [Authentication](https://docs.example.com/auth)
- [Rate Limits](https://docs.example.com/rate-limits)Context Disambiguation for RAG Excellence 🧠
Revolutionary inline context enhancement inspired by Anthropic's Contextual Retrieval
Traditional RAG systems fail when chunks lose context. A paragraph mentioning "the company's revenue grew 15%" is useless without knowing which company or time period. site2rag solves this with intelligent inline context insertion using [[...]] notation:
Original: "The company achieved remarkable growth last quarter." Enhanced: "The company [[Microsoft]] achieved remarkable growth last quarter [[Q4 2023]]."How It Works
- Optimized Windowing: Uses fixed 1200/600 word windows for optimal results with mini models
- Parallel Processing: All blocks processed simultaneously with rate limiting (10 concurrent)
- Plain Text Responses: AI returns enhanced text directly, no JSON parsing overhead
- Strict Validation: Ensures only
[[context]]insertions are added, original text preserved exactly
Context Types We Disambiguate
- Pronouns: "he" → "he [[John Smith]]", "they" → "they [[the team]]"
- References: "this approach" → "this [[machine learning]] approach"
- Time: "last year" → "last year [[2023]]"
- Places: "the facility" → "the facility [[San Francisco]]"
- Acronyms: "AI" → "AI [[Artificial Intelligence]]"
- Cross-refs: "as mentioned above" → "as mentioned above [[in Section 2.3]]"
Performance Impact
- 67% reduction in RAG retrieval failures (based on Anthropic's research)
- 10x faster processing through parallel batch optimization
- 90% less tokens used via intelligent caching
- 100% traceable - all context derived from the document itself
No Hallucination Guarantee 🛡️
All disambiguation context is derived only from information found elsewhere in the same document - no external knowledge is added. This ensures accuracy and traceability. The system uses strict validation to reject any responses that modify the original text beyond adding [[context]] insertions.
Result: Every paragraph becomes a self-contained, context-rich unit perfect for RAG retrieval! 🎯
📁 Perfect File Organization
Hierarchical Structure (Default)
./docs.example.com/
├── .site2rag/ # 🗄️ Smart change tracking & config
│ ├── crawl.db # SQLite database
│ └── config.json # Site configuration
├── getting-started.md # 📄 Clean markdown content
├── api/
│ ├── authentication.md
│ └── rate-limits.md
├── guides/
│ └── best-practices.md
└── assets/ # 🖼️ All site assets
├── images/
│ └── architecture.png
└── documents/
└── api-spec.pdfFlat Structure (--flat, Perfect for RAG)
./docs.example.com/
├── .site2rag/ # 🗄️ Smart change tracking & config
│ ├── crawl.db # SQLite database
│ └── config.json # Site configuration
├── getting-started.md # 📄 Root page
├── api_authentication.md # 🔥 Flattened with path-derived names
├── api_rate-limits.md
├── guides_best-practices.md
└── assets/ # 🖼️ All site assets
├── images/
│ └── architecture.png
└── documents/
└── api-spec.pdfWhy this structure?
- 📚 RAG-friendly: Clean markdown files perfect for vector databases
- 🔗 Citation-ready: Assets maintain exact URL structure for perfect citations
- 🔄 Update-efficient: Database tracks changes without file system overhead
- 🎯 Flat mode: Single directory structure ideal for RAG systems that prefer flat file lists
🎮 Dead Simple Usage
Basic Usage
# Convert any documentation site with RAG disambiguation
npx site2rag docs.react.dev
npx site2rag kubernetes.io/docs
npx site2rag python.org/dev/peps
# For RAG systems that prefer flat file structure
npx site2rag docs.example.com --flat
# With smart LLM fallback (automatically selects best available AI)
npx site2rag docs.example.com --auto-fallback
# That's it! Your knowledge base is ready 🎉Advanced Configuration
# Limit pages and use custom output directory
npx site2rag docs.example.com --limit 100 --output ./my-knowledge-base
# Progress bar accurately shows remaining pages based on limit
# Update existing crawl (only downloads changed content)
npx site2rag docs.example.com --update
# Debug mode with test logging
npx site2rag docs.example.com --debug --test
# Use specific AI provider
npx site2rag docs.example.com --use_gpt4o --flat
# Real-world example from our test suite
npx site2rag bahai-education.org --limit 20 --output ./tests/tmp/sites/bahai-education --flat --test --use_gpt4o🤖 AI Integration (Optional)
site2rag includes optional AI features for enhanced content processing with support for multiple providers:
Smart LLM Fallback (Recommended)
# Automatically use the best available AI provider
npx site2rag docs.example.com --auto-fallback
# 🔄 Auto-fallback enabled, trying: gpt4o → gpt4o-mini → opus4 → gpt4-turbo → ollama
# ✅ gpt4o: openai/gpt-4o availableSpecific AI Provider Selection
# OpenAI Models (requires OPENAI_API_KEY)
npx site2rag docs.example.com --use_gpt4o
npx site2rag docs.example.com --use_gpt4o_mini
npx site2rag docs.example.com --use_gpt4_turbo
npx site2rag docs.example.com --use_o1_mini
# Anthropic Claude (requires ANTHROPIC_API_KEY)
npx site2rag docs.example.com --use_opus4
npx site2rag docs.example.com --use_haiku
# Other Providers
npx site2rag docs.example.com --use_mistral_large # MISTRAL_API_KEY
npx site2rag docs.example.com --use_perplexity # PERPLEXITY_API_KEY
npx site2rag docs.example.com --use_r1_grok # XAI_API_KEYLocal AI (Privacy First)
# Install Ollama: https://ollama.ai
ollama pull qwen2.5:14b
# AI features work automatically when Ollama is available
npx site2rag docs.example.com
# ✅ 🧠 AI Processing: qwen2.5:14b readyCustom Fallback Order
# Specify your preferred provider order
npx site2rag docs.example.com --auto-fallback --fallback-order "gpt4o,opus4,ollama"No AI? No Problem!
# Works perfectly without any AI
npx site2rag docs.example.com --no-enhancement
# ⚠ AI Processing: AI not available
# → Falls back to excellent heuristic-based content extractionPrivacy First: Local AI processing with Ollama means your data never leaves your machine! 🔒
📊 Real-World Performance
Large Documentation Sites
| Site | Pages | First Run | Update Time | Storage | | ------------------- | ---------- | ---------- | ----------- | ------- | | Kubernetes Docs | 47K pages | 12 minutes | 30 seconds | 450MB | | AWS Documentation | 89K pages | 23 minutes | 45 seconds | 890MB | | React Documentation | 1.2K pages | 45 seconds | 3 seconds | 12MB |
Update Efficiency
🔍 Change Detection Results:
├── Total pages checked: 47,000
├── HTTP requests made: 47,000 (HEAD only)
├── Pages changed: 3
├── Pages downloaded: 3
├── Time taken: 28 seconds
└── Bandwidth used: 234KB (vs 450MB full re-download)99.9% efficiency - only download what actually changed! ⚡
🛠️ Installation & Setup
Prerequisites
- Node.js 18+ (for npx)
- Optional: Ollama for AI-enhanced content processing
Quick Start
# No installation needed - just run!
npx site2rag docs.example.com
# Optional: AI setup for enhanced processing
ollama pull qwen2.5:14b
npx site2rag docs.example.com
# ✅ 🧠 AI Processing: qwen2.5:14b readyCheck Status
# View crawl status for a site
npx site2rag docs.example.com --status
# Clean crawl state (start fresh)
npx site2rag docs.example.com --clean💡 Practical Examples
Production RAG Pipeline
# Use auto-fallback for maximum reliability
npx site2rag docs.kubernetes.io --auto-fallback --flat --limit 1000
# 🔄 Tries: gpt4o → gpt4o-mini → opus4 → gpt4-turbo → ollama
# 📁 Flat structure perfect for vector databasesDevelopment & Testing
# Test mode with specific model and debugging
npx site2rag docs.example.com --use_gpt4o_mini --test --debug --limit 10
# 🧪 Detailed logging for development
# 💰 Cost-effective testing with mini modelHigh-Quality Content Processing
# Use premium models for best results
npx site2rag important-docs.com --use_opus4 --verbose
# 🎯 Claude 3.5 Sonnet for highest quality context enhancementCustom Provider Fallback
# Prefer local processing, fallback to cloud
npx site2rag docs.example.com --auto-fallback --fallback-order "ollama,gpt4o_mini,opus4"
# 🔒 Privacy-first with intelligent fallbackUpdate Existing Knowledge Base
# Only process changed content
npx site2rag docs.example.com --update --auto-fallback
# ⚡ Lightning-fast updates using smart change detection🔧 Configuration Options
CLI Options
| Option | Description | Example |
| --- | --- | --- |
| input | URL to crawl or file path to process | npx site2rag docs.example.com |
| -o, --output <path> | Output directory for URLs or file path for files | --output ./my-docs |
| --limit <num> | Limit the number of pages downloaded (URL mode only) - progress bar shows accurate count | --limit 100 |
| --flat | Store all files in top-level folder with path-derived names | --flat |
| --update | Update existing crawl (only changed content) | --update |
| -s, --status | Show status for a previous crawl | --status |
| -c, --clean | Clean crawl state before starting | --clean |
| -v, --verbose | Enable verbose logging | --verbose |
| --dry-run | Show what would be crawled without downloading | --dry-run |
| -d, --debug | Enable debug mode to save removed content blocks | --debug |
| --test | Enable test mode with detailed skip/download decision logging | --test |
| --no-enhancement | Extract entities only, do not enhance text content | --no-enhancement |
🎯 New Filtering Options (v0.4.0+)
Smart URL Filtering with Sitemap-First Architecture
| Option | Description | Example |
| --- | --- | --- |
| --exclude-paths <paths> | Comma-separated list of URL paths to exclude | --exclude-paths "/admin,/login,/api" |
| --include-patterns <patterns> | Comma-separated regex patterns for URLs to include | --include-patterns ".*docs.*,.*guides.*" |
| --exclude-patterns <patterns> | Comma-separated regex patterns for URLs to exclude | --exclude-patterns ".*\.pdf,.*admin.*" |
| --include-language <lang> | Only crawl pages in specified language (uses sitemap hreflang) | --include-language en |
🚀 Sitemap-First Filtering Examples
Filter out admin and user-specific content:
npx site2rag docs.example.com --exclude-paths "/admin,/login,/user,/profile"Crawl only English documentation:
npx site2rag docs.example.com --include-language en --exclude-paths "/blog,/news"Complex filtering with patterns:
npx site2rag example.com \
--exclude-paths "/contact,/terms,/privacy" \
--exclude-patterns ".*\.pdf,.*admin.*" \
--include-language en \
--limit 50Bandwidth-efficient crawling:
# Only crawl English pages, exclude common non-content paths
npx site2rag knowledge-site.com \
--include-language en \
--exclude-paths "/authors,/tags,/categories,/archive" \
--limit 100🎯 Pro Tip: The new sitemap-first architecture discovers all URLs from sitemaps first, then applies filters before downloading. This can save up to 79% bandwidth by avoiding downloads of pages that don't match your criteria!
AI Provider Options
| Option | Description | Requirements |
| --- | --- | --- |
| --auto-fallback | Enable smart LLM fallback (try best available LLM) | Any AI API key |
| --fallback-order <sequence> | Custom fallback order (comma-separated) | --fallback-order "gpt4o,opus4,ollama" |
| --use_gpt4o | Use OpenAI GPT-4o | OPENAI_API_KEY |
| --use_gpt4o_mini | Use OpenAI GPT-4o-mini | OPENAI_API_KEY |
| --use_gpt4_turbo | Use OpenAI GPT-4 Turbo | OPENAI_API_KEY |
| --use_o1_mini | Use OpenAI o1-mini | OPENAI_API_KEY |
| --use_opus4 | Use Anthropic Claude 3.5 Sonnet | ANTHROPIC_API_KEY |
| --use_haiku | Use Anthropic Claude 3.5 Haiku | ANTHROPIC_API_KEY |
| --use_mistral_large | Use Mistral Large | MISTRAL_API_KEY |
| --use_perplexity | Use Perplexity API | PERPLEXITY_API_KEY |
| --use_r1_grok | Use xAI Grok | XAI_API_KEY |
| --anthropic <env_var> | Use Anthropic Claude API with key from environment variable | Custom env var |
Site Configuration
Each site gets its own .site2rag/config.json:
{
"site_url": "https://docs.example.com",
"patterns": {
"include": ["*/docs/*", "*/api/*"],
"exclude": ["*/admin/*", "*/login/*"]
},
"processing": {
"ai_enhanced": true,
"content_classification": true,
"context_injection": true,
"rag_disambiguation": true,
"cache_context": true
},
"crawl_settings": {
"max_depth": 5,
"concurrency": 3,
"delay": 500
}
}🎯 RAG-Optimized Output
Why RAG Context Disambiguation Matters
Traditional content extraction creates ambiguous, context-less paragraphs that confuse RAG systems:
❌ Traditional RAG Content: "I started working on the project. It was challenging but rewarding."When your RAG system retrieves this, it has no idea:
- Who "I" refers to
- What "project" means
- When this happened
- What context "challenging" refers to
site2rag's Enhanced RAG Content
✅ site2rag Enhanced Content: "I (Chad Jones, author) started working on the project (Ocean search software). It was challenging but rewarding."Now your RAG system knows:
- Who: Chad Jones, the author
- What: Ocean search software project
- Context: Software development challenge
- Relevance: Perfect for queries about Ocean, Chad Jones, or software development
Result: 🎯 Dramatically better RAG search relevance and answer quality!
🎯 RAG Integration Examples
With LangChain
from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import MarkdownTextSplitter
# Load your site2rag output
loader = DirectoryLoader('./docs.example.com/pages/', glob="**/*.md")
docs = loader.load()
# The markdown is already clean and context-enhanced!
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)With Local Vector DB
import chromadb
from pathlib import Path
# site2rag output is perfect for vector storage
docs_path = Path('./docs.example.com/pages/')
for md_file in docs_path.glob('**/*.md'):
content = md_file.read_text()
# YAML frontmatter has perfect metadata
# Context hints improve vector search quality
collection.add(documents=[content], metadatas=[yaml_frontmatter])🚀 Advanced Features
Smart Change Detection
- ETag & Last-Modified headers for HTTP-level change detection
- Content hashing for precise change identification
- Incremental updates - only download what actually changed
- Database state - resume interrupted crawls seamlessly
Asset Management
- Automatic asset discovery - images, PDFs, documents
- Local asset storage with proper path rewriting
- Asset deduplication by content hash
- Maintains link relationships for perfect citations
🤝 Contributing
We'd love your help making site2rag even better!
Quick Development Setup
git clone https://github.com/chadananda/site2rag
cd site2rag
npm install
npm testRunning Tests
npm test # All tests
npm run test:watch # Watch mode
npm run test:coverage # Coverage reportTask-Based Development
Each feature is broken into small, testable tasks (T01-T32). Perfect for:
- 🐛 Bug fixes
- ✨ New features
- 📚 Documentation
- 🧪 Test improvements
📝 License
MIT License - see LICENSE for details.
🙏 Acknowledgments
- Crawl4AI - Inspiration for AI-powered web crawling
- Turndown - Excellent HTML to Markdown conversion
- Ollama - Making local AI accessible to everyone
🚀 Get Started Now!
Ready to transform websites into RAG-ready knowledge bases?
npx site2rag docs.your-favorite-site.comThat's it! Your journey to effortless knowledge base maintenance starts now. 🎉
⭐ Star on GitHub • 📖 Documentation • 🐛 Report Issues • 💬 Discussions
Made with ❤️ by Chad Ananda
