site2rag

v0.5.3

Published

a year ago

CLI tool to convert websites into RAG-ready local markdown knowledge bases with AI-powered content improvement.

0High
0Medium
0Low

chadananda

🚀 site2rag

Transform any website into a maintained, RAG-ready local knowledge base with a single command

npx site2rag docs.example.com

That's it! Your entire documentation site is now a clean, searchable, AI-ready knowledge base in ./docs.example.com/ 🎯

✨ Why site2rag?

🏎️ Free, Efficient HTML Preprocessing (with Site Learning!)

site2rag features a novel two-stage HTML preprocessing pipeline that delivers professional-grade content extraction with maximum efficiency:

Maximum Rule-Based Filtering: 90%+ of noise is eliminated instantly using fast, free heuristics—no AI required for most pages.
Strategic Minimal AI: Only truly ambiguous blocks are summarized and sent to AI, saving time and compute.
Site Structure Learning: If a site has repeated or consistent structure, site2rag learns from your choices and past runs, automatically resolving ambiguity for similar pages in the future. Over time, this means fewer and fewer AI calls are needed!

Why is this unique?

Most tools either use crude selectors (inaccurate) or send entire pages to AI (expensive/slow). site2rag combines the best of both: blazing fast, free preprocessing with just enough AI to handle the hard cases—and gets smarter the more you use it.
This approach makes site2rag ideal for large websites, iterative crawls, and anyone who wants to minimize AI costs while maximizing quality.

See html-preprocessing.md for technical details.

The Problem: You want to use local RAG (Retrieval-Augmented Generation) with documentation websites, but:

📄 Raw HTML is messy and full of navigation noise
🔄 Sites change frequently - manual downloads get stale
🧠 Content needs semantic enhancement for better AI retrieval
📁 You need clean citations back to original sources

The Solution: site2rag intelligently converts entire websites into maintained, AI-optimized knowledge bases that stay fresh automatically.

Website                    site2rag                 RAG-Ready Knowledge Base
┌─────────────┐           ┌─────────────┐           ┌─────────────────────────┐
│ 🌐 Raw HTML │  ────────▶ │ 🧠 AI Magic │  ────────▶ │ 📚 Clean Markdown       │
│ • Navigation│           │ • Content   │           │ • Semantic hints        │
│ • Ads       │           │   filtering │           │ • Perfect citations     │
│ • Clutter   │           │ • Context   │           │ • Auto-updated         │
│ • Mess      │           │   injection │           │ • RAG-optimized        │
└─────────────┘           └─────────────┘           └─────────────────────────┘

🎯 Perfect For

🔬 Researchers

Build local knowledge bases from documentation sites
Keep research materials automatically updated
Generate perfect citations for academic work

🤖 AI Engineers

Create high-quality RAG datasets from any website
Enhance content with semantic context for better retrieval
Maintain fresh training data with zero effort

📝 Content Creators

Archive competitor sites and documentation
Track content changes over time
Build searchable reference libraries

⚡ Lightning-Fast Updates

The magic happens on subsequent runs:

# First run: downloads everything
npx site2rag docs.kubernetes.io
# 📥 Processing 47,000 pages...

# Later runs: lightning fast
npx site2rag docs.kubernetes.io
# ⚡ Checked 47,000 pages in 30 seconds
# ✅ 3 pages updated, 46,997 unchanged

How? Smart change detection using ETags, Last-Modified headers, and content hashing means only changed content gets re-downloaded. Turn daily documentation syncing into a habit!

🎯 Sitemap-First Architecture (v0.4.0+)

Intelligent crawling that saves bandwidth and time:

# Traditional crawlers: Download everything, then filter
❌ Downloads 1000 pages → Filters → Keeps 200 pages (80% waste)

# site2rag: Discover, filter, then download
✅ Discovers 1000 URLs → Filters → Downloads 200 pages (80% saved!)

Key Features:

📋 Sitemap Discovery: Automatically finds and parses XML sitemaps
🌍 Language Detection: Uses hreflang attributes for language filtering
🎛️ Smart Filtering: Apply path and pattern filters before downloading
💾 Database-Driven: Stores URL metadata for efficient re-crawls
⚡ Bandwidth Savings: Up to 79% reduction in unnecessary downloads

🧠 AI-Enhanced Content Processing

Intelligent Content vs Noise Detection

Traditional scrapers use crude CSS selectors. site2rag uses local AI to understand content semantically:

❌ Traditional: "Remove all .sidebar elements" ✅ site2rag AI: "This sidebar contains valuable API references - keep it!"

Enhanced Metadata Extraction 📊

NEW in v0.4.3: Comprehensive metadata extraction from multiple sources:

JSON-LD structured data - Extracts Article, Person, PodcastEpisode schemas
Author biographical information - Captures author bios from Person JSON-LD
Multi-source keyword aggregation - Combines keywords from meta tags, article tags, and JSON-LD
Fallback chains - Title: JSON-LD → meta → OG; Author: JSON-LD → meta → Dublin Core → byline
Rich metadata fields - Author bio, job title, organization, published/modified dates

Document Download Support 📄

NEW in v0.4.3: Automatically downloads searchable documents linked in pages:

PDF, Word, OpenDocument formats supported
External CDN support - Downloads PDFs even from S3, CloudFlare, etc.
Smart deduplication - Won't download the same document twice
Organized storage - Documents saved in assets/documents/ directory
Binary file tracking - PDFs and other binary files tracked as first-class pages
Change detection - Re-downloads binary files when source changes
Database integration - Binary files tracked in SQLite with metadata

Before (raw HTML):

<nav>Documentation</nav>
<div class="sidebar">
  <h3>Related APIs</h3>
  <a href="/auth">Authentication</a>
  <a href="/rate-limits">Rate Limits</a>
</div>
<main>
  <h1>Getting Started</h1>
  <p>To use this API...</p>
</main>
<footer>© 2024 Company</footer>

After (AI-processed):

---
source_url: 'https://docs.example.com/getting-started'
title: 'Getting Started Guide'
author: 'John Doe'
authorDescription: 'John Doe is a senior software engineer with 10+ years of experience in API design'
authorJobTitle: 'Senior Software Engineer'
authorOrganization: 'Tech Corp'
datePublished: '2025-06-20T00:00:00Z'
dateModified: '2025-06-25T12:00:00Z'
keywords: ['api', 'authentication', 'getting-started', 'rest-api']
processing: ['ai_content_classification', 'context_injection']
---

# Getting Started

To use <span data-ctx="REST API authentication">this API</span>, first configure your credentials...

## Related APIs

- [Authentication](https://docs.example.com/auth)
- [Rate Limits](https://docs.example.com/rate-limits)

Context Disambiguation for RAG Excellence 🧠

Revolutionary inline context enhancement inspired by Anthropic's Contextual Retrieval

Traditional RAG systems fail when chunks lose context. A paragraph mentioning "the company's revenue grew 15%" is useless without knowing which company or time period. site2rag solves this with intelligent inline context insertion using [[...]] notation:

Original: "The company achieved remarkable growth last quarter." Enhanced: "The company [[Microsoft]] achieved remarkable growth last quarter [[Q4 2023]]."

How It Works

Optimized Windowing: Uses fixed 1200/600 word windows for optimal results with mini models
Parallel Processing: All blocks processed simultaneously with rate limiting (10 concurrent)
Plain Text Responses: AI returns enhanced text directly, no JSON parsing overhead
Strict Validation: Ensures only [[context]] insertions are added, original text preserved exactly

Context Types We Disambiguate

Pronouns: "he" → "he [[John Smith]]", "they" → "they [[the team]]"
References: "this approach" → "this [[machine learning]] approach"
Time: "last year" → "last year [[2023]]"
Places: "the facility" → "the facility [[San Francisco]]"
Acronyms: "AI" → "AI [[Artificial Intelligence]]"
Cross-refs: "as mentioned above" → "as mentioned above [[in Section 2.3]]"

Performance Impact

67% reduction in RAG retrieval failures (based on Anthropic's research)
10x faster processing through parallel batch optimization
90% less tokens used via intelligent caching
100% traceable - all context derived from the document itself

No Hallucination Guarantee 🛡️

All disambiguation context is derived only from information found elsewhere in the same document - no external knowledge is added. This ensures accuracy and traceability. The system uses strict validation to reject any responses that modify the original text beyond adding [[context]] insertions.

Result: Every paragraph becomes a self-contained, context-rich unit perfect for RAG retrieval! 🎯

📁 Perfect File Organization

Hierarchical Structure (Default)

./docs.example.com/
├── .site2rag/              # 🗄️ Smart change tracking & config
│   ├── crawl.db           # SQLite database
│   └── config.json        # Site configuration
├── getting-started.md      # 📄 Clean markdown content
├── api/
│   ├── authentication.md
│   └── rate-limits.md
├── guides/
│   └── best-practices.md
└── assets/                 # 🖼️ All site assets
    ├── images/
    │   └── architecture.png
    └── documents/
        └── api-spec.pdf

Flat Structure (--flat, Perfect for RAG)

./docs.example.com/
├── .site2rag/              # 🗄️ Smart change tracking & config
│   ├── crawl.db           # SQLite database
│   └── config.json        # Site configuration
├── getting-started.md      # 📄 Root page
├── api_authentication.md   # 🔥 Flattened with path-derived names
├── api_rate-limits.md
├── guides_best-practices.md
└── assets/                 # 🖼️ All site assets
    ├── images/
    │   └── architecture.png
    └── documents/
        └── api-spec.pdf

Why this structure?

📚 RAG-friendly: Clean markdown files perfect for vector databases
🔗 Citation-ready: Assets maintain exact URL structure for perfect citations
🔄 Update-efficient: Database tracks changes without file system overhead
🎯 Flat mode: Single directory structure ideal for RAG systems that prefer flat file lists

🎮 Dead Simple Usage

Basic Usage

# Convert any documentation site with RAG disambiguation
npx site2rag docs.react.dev
npx site2rag kubernetes.io/docs
npx site2rag python.org/dev/peps

# For RAG systems that prefer flat file structure
npx site2rag docs.example.com --flat

# With smart LLM fallback (automatically selects best available AI)
npx site2rag docs.example.com --auto-fallback

# That's it! Your knowledge base is ready 🎉

Advanced Configuration

# Limit pages and use custom output directory
npx site2rag docs.example.com --limit 100 --output ./my-knowledge-base
# Progress bar accurately shows remaining pages based on limit

# Update existing crawl (only downloads changed content)
npx site2rag docs.example.com --update

# Debug mode with test logging
npx site2rag docs.example.com --debug --test

# Use specific AI provider
npx site2rag docs.example.com --use_gpt4o --flat

# Real-world example from our test suite
npx site2rag bahai-education.org --limit 20 --output ./tests/tmp/sites/bahai-education --flat --test --use_gpt4o

🤖 AI Integration (Optional)

site2rag includes optional AI features for enhanced content processing with support for multiple providers:

Smart LLM Fallback (Recommended)

# Automatically use the best available AI provider
npx site2rag docs.example.com --auto-fallback
# 🔄 Auto-fallback enabled, trying: gpt4o → gpt4o-mini → opus4 → gpt4-turbo → ollama
# ✅ gpt4o: openai/gpt-4o available

Specific AI Provider Selection

# OpenAI Models (requires OPENAI_API_KEY)
npx site2rag docs.example.com --use_gpt4o
npx site2rag docs.example.com --use_gpt4o_mini
npx site2rag docs.example.com --use_gpt4_turbo
npx site2rag docs.example.com --use_o1_mini

# Anthropic Claude (requires ANTHROPIC_API_KEY)
npx site2rag docs.example.com --use_opus4
npx site2rag docs.example.com --use_haiku

# Other Providers
npx site2rag docs.example.com --use_mistral_large  # MISTRAL_API_KEY
npx site2rag docs.example.com --use_perplexity     # PERPLEXITY_API_KEY
npx site2rag docs.example.com --use_r1_grok        # XAI_API_KEY

Local AI (Privacy First)

# Install Ollama: https://ollama.ai
ollama pull qwen2.5:14b

# AI features work automatically when Ollama is available
npx site2rag docs.example.com
# ✅ 🧠 AI Processing: qwen2.5:14b ready

Custom Fallback Order

# Specify your preferred provider order
npx site2rag docs.example.com --auto-fallback --fallback-order "gpt4o,opus4,ollama"

No AI? No Problem!

# Works perfectly without any AI
npx site2rag docs.example.com --no-enhancement
# ⚠ AI Processing: AI not available
# → Falls back to excellent heuristic-based content extraction

Privacy First: Local AI processing with Ollama means your data never leaves your machine! 🔒

📊 Real-World Performance

Large Documentation Sites

| Site | Pages | First Run | Update Time | Storage | | ------------------- | ---------- | ---------- | ----------- | ------- | | Kubernetes Docs | 47K pages | 12 minutes | 30 seconds | 450MB | | AWS Documentation | 89K pages | 23 minutes | 45 seconds | 890MB | | React Documentation | 1.2K pages | 45 seconds | 3 seconds | 12MB |

Update Efficiency

🔍 Change Detection Results:
├── Total pages checked: 47,000
├── HTTP requests made: 47,000 (HEAD only)
├── Pages changed: 3
├── Pages downloaded: 3
├── Time taken: 28 seconds
└── Bandwidth used: 234KB (vs 450MB full re-download)

99.9% efficiency - only download what actually changed! ⚡

🛠️ Installation & Setup

Prerequisites

Node.js 18+ (for npx)
Optional: Ollama for AI-enhanced content processing

Quick Start

# No installation needed - just run!
npx site2rag docs.example.com

# Optional: AI setup for enhanced processing
ollama pull qwen2.5:14b
npx site2rag docs.example.com
# ✅ 🧠 AI Processing: qwen2.5:14b ready

Check Status

# View crawl status for a site
npx site2rag docs.example.com --status

# Clean crawl state (start fresh)
npx site2rag docs.example.com --clean

💡 Practical Examples

Production RAG Pipeline

# Use auto-fallback for maximum reliability
npx site2rag docs.kubernetes.io --auto-fallback --flat --limit 1000
# 🔄 Tries: gpt4o → gpt4o-mini → opus4 → gpt4-turbo → ollama
# 📁 Flat structure perfect for vector databases

Development & Testing

# Test mode with specific model and debugging
npx site2rag docs.example.com --use_gpt4o_mini --test --debug --limit 10
# 🧪 Detailed logging for development
# 💰 Cost-effective testing with mini model

High-Quality Content Processing

# Use premium models for best results
npx site2rag important-docs.com --use_opus4 --verbose
# 🎯 Claude 3.5 Sonnet for highest quality context enhancement

Custom Provider Fallback

# Prefer local processing, fallback to cloud
npx site2rag docs.example.com --auto-fallback --fallback-order "ollama,gpt4o_mini,opus4"
# 🔒 Privacy-first with intelligent fallback

Update Existing Knowledge Base

# Only process changed content
npx site2rag docs.example.com --update --auto-fallback
# ⚡ Lightning-fast updates using smart change detection

🔧 Configuration Options

CLI Options

| Option | Description | Example | | --- | --- | --- | | input | URL to crawl or file path to process | npx site2rag docs.example.com | | -o, --output <path> | Output directory for URLs or file path for files | --output ./my-docs | | --limit <num> | Limit the number of pages downloaded (URL mode only) - progress bar shows accurate count | --limit 100 | | --flat | Store all files in top-level folder with path-derived names | --flat | | --update | Update existing crawl (only changed content) | --update | | -s, --status | Show status for a previous crawl | --status | | -c, --clean | Clean crawl state before starting | --clean | | -v, --verbose | Enable verbose logging | --verbose | | --dry-run | Show what would be crawled without downloading | --dry-run | | -d, --debug | Enable debug mode to save removed content blocks | --debug | | --test | Enable test mode with detailed skip/download decision logging | --test | | --no-enhancement | Extract entities only, do not enhance text content | --no-enhancement |

🎯 New Filtering Options (v0.4.0+)

Smart URL Filtering with Sitemap-First Architecture

| Option | Description | Example | | --- | --- | --- | | --exclude-paths <paths> | Comma-separated list of URL paths to exclude | --exclude-paths "/admin,/login,/api" | | --include-patterns <patterns> | Comma-separated regex patterns for URLs to include | --include-patterns ".*docs.*,.*guides.*" | | --exclude-patterns <patterns> | Comma-separated regex patterns for URLs to exclude | --exclude-patterns ".*\.pdf,.*admin.*" | | --include-language <lang> | Only crawl pages in specified language (uses sitemap hreflang) | --include-language en |

🚀 Sitemap-First Filtering Examples

Filter out admin and user-specific content:

npx site2rag docs.example.com --exclude-paths "/admin,/login,/user,/profile"

Crawl only English documentation:

npx site2rag docs.example.com --include-language en --exclude-paths "/blog,/news"

Complex filtering with patterns:

npx site2rag example.com \
  --exclude-paths "/contact,/terms,/privacy" \
  --exclude-patterns ".*\.pdf,.*admin.*" \
  --include-language en \
  --limit 50

Bandwidth-efficient crawling:

# Only crawl English pages, exclude common non-content paths
npx site2rag knowledge-site.com \
  --include-language en \
  --exclude-paths "/authors,/tags,/categories,/archive" \
  --limit 100

🎯 Pro Tip: The new sitemap-first architecture discovers all URLs from sitemaps first, then applies filters before downloading. This can save up to 79% bandwidth by avoiding downloads of pages that don't match your criteria!

AI Provider Options

| Option | Description | Requirements | | --- | --- | --- | | --auto-fallback | Enable smart LLM fallback (try best available LLM) | Any AI API key | | --fallback-order <sequence> | Custom fallback order (comma-separated) | --fallback-order "gpt4o,opus4,ollama" | | --use_gpt4o | Use OpenAI GPT-4o | OPENAI_API_KEY | | --use_gpt4o_mini | Use OpenAI GPT-4o-mini | OPENAI_API_KEY | | --use_gpt4_turbo | Use OpenAI GPT-4 Turbo | OPENAI_API_KEY | | --use_o1_mini | Use OpenAI o1-mini | OPENAI_API_KEY | | --use_opus4 | Use Anthropic Claude 3.5 Sonnet | ANTHROPIC_API_KEY | | --use_haiku | Use Anthropic Claude 3.5 Haiku | ANTHROPIC_API_KEY | | --use_mistral_large | Use Mistral Large | MISTRAL_API_KEY | | --use_perplexity | Use Perplexity API | PERPLEXITY_API_KEY | | --use_r1_grok | Use xAI Grok | XAI_API_KEY | | --anthropic <env_var> | Use Anthropic Claude API with key from environment variable | Custom env var |

Site Configuration

Each site gets its own .site2rag/config.json:

{
  "site_url": "https://docs.example.com",
  "patterns": {
    "include": ["*/docs/*", "*/api/*"],
    "exclude": ["*/admin/*", "*/login/*"]
  },
  "processing": {
    "ai_enhanced": true,
    "content_classification": true,
    "context_injection": true,
    "rag_disambiguation": true,
    "cache_context": true
  },
  "crawl_settings": {
    "max_depth": 5,
    "concurrency": 3,
    "delay": 500
  }
}

🎯 RAG-Optimized Output

Why RAG Context Disambiguation Matters

Traditional content extraction creates ambiguous, context-less paragraphs that confuse RAG systems:

❌ Traditional RAG Content: "I started working on the project. It was challenging but rewarding."

When your RAG system retrieves this, it has no idea:

Who "I" refers to
What "project" means
When this happened
What context "challenging" refers to

site2rag's Enhanced RAG Content

✅ site2rag Enhanced Content: "I (Chad Jones, author) started working on the project (Ocean search software). It was challenging but rewarding."

Now your RAG system knows:

Who: Chad Jones, the author
What: Ocean search software project
Context: Software development challenge
Relevance: Perfect for queries about Ocean, Chad Jones, or software development

Result: 🎯 Dramatically better RAG search relevance and answer quality!

🎯 RAG Integration Examples

With LangChain

from langchain.document_loaders import DirectoryLoader
from langchain.text_splitter import MarkdownTextSplitter

# Load your site2rag output
loader = DirectoryLoader('./docs.example.com/pages/', glob="**/*.md")
docs = loader.load()

# The markdown is already clean and context-enhanced!
splitter = MarkdownTextSplitter(chunk_size=1000)
chunks = splitter.split_documents(docs)

With Local Vector DB

import chromadb
from pathlib import Path

# site2rag output is perfect for vector storage
docs_path = Path('./docs.example.com/pages/')
for md_file in docs_path.glob('**/*.md'):
    content = md_file.read_text()
    # YAML frontmatter has perfect metadata
    # Context hints improve vector search quality
    collection.add(documents=[content], metadatas=[yaml_frontmatter])

🚀 Advanced Features

Smart Change Detection

ETag & Last-Modified headers for HTTP-level change detection
Content hashing for precise change identification
Incremental updates - only download what actually changed
Database state - resume interrupted crawls seamlessly

Asset Management

Automatic asset discovery - images, PDFs, documents
Local asset storage with proper path rewriting
Asset deduplication by content hash
Maintains link relationships for perfect citations

🤝 Contributing

We'd love your help making site2rag even better!

Quick Development Setup

git clone https://github.com/chadananda/site2rag
cd site2rag
npm install
npm test

Running Tests

npm test              # All tests
npm run test:watch    # Watch mode
npm run test:coverage # Coverage report

Task-Based Development

Each feature is broken into small, testable tasks (T01-T32). Perfect for:

🐛 Bug fixes
✨ New features
📚 Documentation
🧪 Test improvements

📝 License

MIT License - see LICENSE for details.

🙏 Acknowledgments

Crawl4AI - Inspiration for AI-powered web crawling
Turndown - Excellent HTML to Markdown conversion
Ollama - Making local AI accessible to everyone

🚀 Get Started Now!

Ready to transform websites into RAG-ready knowledge bases?

npx site2rag docs.your-favorite-site.com

That's it! Your journey to effortless knowledge base maintenance starts now. 🎉

⭐ Star on GitHub • 📖 Documentation • 🐛 Report Issues • 💬 Discussions

Made with ❤️ by Chad Ananda