🎯 Siphon Knowledge
Universal documentation scraper with AI-powered analysis, anti-hallucination protection, and Claude Skills generation
🌟 Features
🎨 Interactive & Command-Line Interface
- ✨ Interactive Mode - Beautiful persistent CLI with guided workflows
- ⚡ Command Mode - Traditional CLI for automation and scripts
- 🔄 Session Persistence - Settings remembered across menu cycles
📚 Documentation Scraping
- 🌐 Website Crawling - Intelligent web scraping with 3 crawl modes (see the preset sketch after this feature list):
- Rapid - Fast, shallow crawl (depth 2, up to 100 pages)
- Controlled - Balanced crawl (depth 5, up to 1,000 pages) - DEFAULT
- Worm - Deep, exhaustive crawl (unlimited depth, up to 10,000 pages)
- 📦 GitHub Integration - Extract docs from repos without cloning
- 🎨 Content Extraction - Specialized extractors for branding, design, tone, API docs
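To make the trade-off between the crawl modes concrete, here is a minimal TypeScript sketch of how they could be modeled as presets (the type and option names are illustrative, not the tool's actual internals):

```ts
// Illustrative presets for the three crawl modes described above.
type CrawlMode = "rapid" | "controlled" | "worm";

interface CrawlPreset {
  maxDepth: number; // how many link hops from the start URL
  maxPages: number; // hard cap on pages visited
}

const CRAWL_PRESETS: Record<CrawlMode, CrawlPreset> = {
  rapid: { maxDepth: 2, maxPages: 100 },
  controlled: { maxDepth: 5, maxPages: 1_000 }, // default
  worm: { maxDepth: Number.POSITIVE_INFINITY, maxPages: 10_000 },
};

function resolvePreset(mode: CrawlMode = "controlled"): CrawlPreset {
  return CRAWL_PRESETS[mode];
}
```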
🤖 AI-Powered Analysis
- 🛡️ Anti-Hallucination Protection - Real-time npm package validation
- 📊 Package Validation - Validates 25+ known hallucinations + npm registry checks
- 📈 Knowledge Gap Detection - Tracks 15+ AI model training cutoffs
- 🔧 Structured Output - JSON analysis with quality scores
- 🎯 Auto-Protection - Generates .cursorrules for AI coding assistants
🎯 Claude Skills Generation
- ✨ Auto-Generation - Create Skills from documentation analysis
- 📦 Distribution - Package for Claude Code or Claude.ai
- 🔍 Validation - Built-in Skill structure validation
- 📋 Management - List, validate, and package Skills
🏗️ Enterprise-Grade Architecture
- ⚡ 3x Faster Extraction - Optimized browser context management
- 🧪 Type-Safe - Full TypeScript with runtime validation
- 🚨 Error Handling - Structured errors with recovery strategies
- 🔐 Secure - API key validation and .env management
🏗️ Architecture Overview
Modular Command System
- Clean Separation: Commands are isolated modules with consistent interfaces
- Easy Extension: Add new commands without touching existing code
- Type Safety: Full TypeScript coverage with runtime validation
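As a rough illustration of what "isolated modules with consistent interfaces" means in practice, here is a hedged TypeScript sketch (the interface and command names are hypothetical, not siphon-knowledge's actual API):

```ts
// Hypothetical shape of a modular command system: every command implements
// one shared interface and is registered by name.
interface Command {
  name: string;
  description: string;
  run(args: string[]): Promise<void>;
}

const scrapeCommand: Command = {
  name: "scrape",
  description: "Crawl a documentation site and save its pages",
  async run(args) {
    const [url] = args;
    if (!url) throw new Error("scrape: a URL argument is required");
    console.log(`Scraping ${url}...`);
  },
};

// Adding a new command only means adding a registry entry; existing
// commands are never modified.
const registry = new Map<string, Command>([[scrapeCommand.name, scrapeCommand]]);

async function dispatch(argv: string[]): Promise<void> {
  const [name, ...rest] = argv;
  const command = registry.get(name ?? "");
  if (!command) throw new Error(`Unknown command: ${name}`);
  await command.run(rest);
}
```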
Performance Optimizations
- 3x Faster Extraction: Concurrent processing with intelligent batching
- Memory Efficient: Optimized browser context management
- Smart Retry Logic: Exponential backoff with failure recovery
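The retry behavior can be pictured with a small, self-contained sketch (an illustration of exponential backoff in general, not the project's exact implementation):

```ts
// Retry an async operation with exponential backoff: 500ms, 1s, 2s, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage: retry a flaky page fetch up to 4 times.
// const html = await withRetry(() => fetch("https://docs.example.com").then((r) => r.text()));
```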
Enterprise Features
- Configuration Management: Multi-source config with validation
- Error Handling: Structured errors with recovery strategies
- Comprehensive Testing: Unit and integration tests for reliability
- Code Quality: ESLint + Prettier with strict TypeScript rules
🚀 Quick Start
Prerequisites
- Bun runtime installed
- GitHub Personal Access Token (create one)
- Required scopes: public_repo, read:org
- Vercel AI Gateway API Key (REQUIRED - all AI inference goes through the gateway)
Installation
# Clone the repository
git clone <your-repo-url>
cd siphon-knowledge
# Install dependencies
bun install
# Install Playwright browsers
bunx playwright install chromium
# Link globally for `siphon` and `siphon-knowledge` commands
bun link
Configuration
Create a .env file (or use interactive prompts):
# GitHub Personal Access Token
GITHUB_PAT=ghp_your_token_here
# Vercel AI Gateway API Key (REQUIRED - ALL MODEL INFERENCE GOES THROUGH GATEWAY)
AI_GATEWAY_API_KEY=your_api_key_here
# Model to use via AI Gateway (creator/model-name format)
# Examples: openai/gpt-4o, anthropic/claude-3-5-sonnet-20241022, openai/gpt-4-turbo
AI_MODEL=openai/gpt-4o
# Optional: Custom AI Gateway URL (defaults to Vercel's gateway)
# AI_GATEWAY_URL=https://ai-gateway.vercel.sh/v1/ai
📖 Usage
Global Commands
After linking, you can use either command:
siphon # Interactive mode (recommended)
siphon <url> # Command mode
siphon-knowledge # Alias for siphon
🎨 Interactive Mode (Recommended)
Launch the beautiful interactive CLI with no arguments:
siphon
Navigate through menus:
- 🌐 Scrape Documentation - Guided website/GitHub scraping with crawl mode selection
- 🎨 Extract Content - Extract branding, design, tone, API docs, etc.
- 🎯 Generate Skills - Create and manage Claude Skills
- 🚀 Export Skills - Export to Claude Code or Claude.ai
- 🛡️ Anti-Hallucination - Package validation and protection
- ⚙️ Settings - Configure crawl mode, AI model, and preferences
Features:
- ✅ Persistent session - settings remembered
- ✅ Beautiful UI with @clack/prompts
- ✅ Guided workflows for complex tasks
- ✅ Discoverable - explore all features interactively
⚡ Command Mode (Automation)
Traditional CLI for scripts and automation:
Scrape Website Documentation
# Scrape from any documentation website
siphon https://docs.example.com
# The CLI will:
# 1. Validate GitHub PAT and AI Gateway API key (prompts if missing)
# 2. Crawl the website to discover all pages
# 3. Categorize URLs into developer/user contexts
# 4. Scrape content from all discovered pages
# 5. Generate structured documentation
Scrape GitHub Repository
# Scrape documentation from a GitHub repo (no cloning!)
siphon https://github.com/owner/repo
# Or specify a branch
siphon https://github.com/owner/repo/tree/develop
# The CLI will:
# 1. Validate credentials
# 2. Use GitHub API to fetch documentation files
# 3. Download only docs (README, /docs, .md files, etc.)
# 4. Organize by directory structure
AI-Powered Analysis with Anti-Hallucination Protection
After scraping, analyze the documentation with AI. Anti-hallucination protection runs automatically:
# Analyze all documentation in scraped-content/
bun scripts/ai-analyzer.ts scraped-content
# Output:
# - documentation-analysis.json (structured data with package validation)
# - documentation-analysis-report.md (human-readable with warnings)
# - .cursorrules (anti-hallucination protection rules)
Extract Specialized Content
Extract branding, design systems, and brand voice from websites:
# Extract branding (logos, colors, fonts)
siphon extract branding https://company.com
# Extract design system (components, patterns, tokens)
siphon extract design https://design-system.company.com
# Extract brand voice and tone
siphon extract tone https://brand-guidelines.company.com
# Output directories:
# - extracted-branding/
# - extracted-design/
# - extracted-tone/
See docs/EXTRACTION.md for detailed extraction capabilities.
Generate Agent Skills
Transform your analyzed documentation into Anthropic Claude Skills:
# Generate Skills from analysis results
bun scripts/generate-skills.ts documentation-analysis.json skills-output
# Output:
# ✅ Created skill: jeju-documentation-knowledge
# ✅ Created skill: jeju-quick-start
# ✅ Created skill: jeju-package-validation
# ✅ Created skill: jeju-api-reference
# ...and more category-specific skills
Skills are modular packages that extend Claude's capabilities. See docs/SKILLS.md for details.
The analyzer will:
- Extract all package references from documentation
- Validate them against a database of 25+ known hallucinated packages
- Check the npm registry in real time to confirm each package exists
- Detect deprecated packages and suggest alternatives
- Automatically generate .cursorrules to protect future AI interactions
- Display protection summary with critical warnings
Individual Scripts
# Just crawl URLs
bun scripts/crawl.ts
# Just categorize existing links
bun scripts/categorize.ts
# Just scrape content
bun scripts/scrape-content.ts
# Enhanced scraping with metadata
bun scripts/scrape-content-enhanced.ts
# Scrape GitHub only
bun scripts/scrape-github.ts https://github.com/owner/repo
# Extract specialized content
bun scripts/extractors/branding-extractor.ts https://company.com
bun scripts/extractors/design-extractor.ts https://design-system.com
bun scripts/extractors/tone-extractor.ts https://brand-guidelines.com
# Generate Skills from documentation
bun scripts/generate-skills.ts documentation-analysis.json skills-output
# Create a new skill manually
bun scripts/init-skill.ts my-new-skill --path skills
# Validate a skill
bun scripts/validate-skill.ts skills/my-skill
# Package a skill for distribution
bun scripts/package-skill.ts skills/my-skill
🛡️ Anti-Hallucination Protection
What is it?
AI models can "hallucinate" package names that don't exist or suggest deprecated packages they learned about during training. Siphon Knowledge automatically protects you from these issues.
How it Works
1. Known Hallucination Database (25+ Packages)
Tracks commonly hallucinated packages:
{
wrong: "request",
correct: "axios or node-fetch",
severity: "critical",
reason: "request package is fully deprecated"
}
Examples:
- tslint → eslint with @typescript-eslint
- protractor → playwright or cypress
- node-sass → sass (Dart Sass)
- @testing-library/react-hooks → @testing-library/react
- enzyme → @testing-library/react
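A lookup against such a database is essentially a table scan keyed on the hallucinated name; a minimal sketch (the entries mirror the examples above, while the real database covers 25+ packages):

```ts
// Sketch of querying a known-hallucination database.
interface HallucinationEntry {
  wrong: string;
  correct: string;
  severity: "critical" | "high" | "medium";
  reason: string;
}

const KNOWN_HALLUCINATIONS: HallucinationEntry[] = [
  {
    wrong: "request",
    correct: "axios or node-fetch",
    severity: "critical",
    reason: "request package is fully deprecated",
  },
  {
    wrong: "tslint",
    correct: "eslint with @typescript-eslint",
    severity: "high",
    reason: "TSLint is deprecated in favor of ESLint",
  },
];

function checkKnownHallucination(name: string): HallucinationEntry | undefined {
  return KNOWN_HALLUCINATIONS.find((entry) => entry.wrong === name);
}
```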
2. Real-Time npm Validation
Every package reference is validated against the actual npm registry:
// Checks if package exists
const result = await validatePackageRealtime("some-package");
// Returns:
{
packageName: "some-package",
exists: true,
isDeprecated: false,
validatedAt: "2025-10-26T...",
source: "npm-api"
}
Features:
- Rate limiting (max 10 requests/second)
- Caching with 15-minute TTL
- Parallel validation with concurrency control
- Automatic retry on errors
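A stripped-down version of such a validator might look like this (a sketch only: it queries the public npm registry packument endpoint and uses a simple in-memory cache, and it omits the rate limiting and concurrency control listed above):

```ts
interface ValidationResult {
  packageName: string;
  exists: boolean;
  isDeprecated: boolean;
  validatedAt: string;
  source: "npm-api" | "cache";
}

const cache = new Map<string, { result: ValidationResult; expiresAt: number }>();
const TTL_MS = 15 * 60 * 1000; // 15-minute cache

async function validatePackage(packageName: string): Promise<ValidationResult> {
  const cached = cache.get(packageName);
  if (cached && cached.expiresAt > Date.now()) {
    return { ...cached.result, source: "cache" };
  }

  // The registry returns 404 for unknown packages; scoped names may need
  // the "/" URL-encoded as %2F.
  const response = await fetch(`https://registry.npmjs.org/${packageName}`);
  let isDeprecated = false;
  if (response.ok) {
    const packument = await response.json();
    const latest = packument["dist-tags"]?.latest;
    isDeprecated = Boolean(latest && packument.versions?.[latest]?.deprecated);
  }

  const result: ValidationResult = {
    packageName,
    exists: response.ok,
    isDeprecated,
    validatedAt: new Date().toISOString(),
    source: "npm-api",
  };
  cache.set(packageName, { result, expiresAt: Date.now() + TTL_MS });
  return result;
}
```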
3. Model Training Cutoff Tracking (15+ Models)
Knows when each AI model's training data was cut off:
{
id: "gpt-4o",
cutoffDate: "2023-10-01",
provider: "openai",
commonIssues: [
"May not know packages released after October 2023",
"Unaware of Vite 5, React 19, Next.js 14+"
]
}
Automatically calculates the knowledge gap and warns about potentially outdated suggestions.
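The gap calculation itself is just date arithmetic; a small sketch consistent with the example above:

```ts
// Compute a model's knowledge gap in whole months from its training cutoff.
interface ModelCutoff {
  id: string;
  cutoffDate: string; // ISO date, e.g. "2023-10-01"
  provider: string;
}

function knowledgeGapMonths(model: ModelCutoff, now: Date = new Date()): number {
  const cutoff = new Date(model.cutoffDate);
  const months =
    (now.getFullYear() - cutoff.getFullYear()) * 12 +
    (now.getMonth() - cutoff.getMonth());
  return Math.max(0, months);
}

// { id: "gpt-4o", cutoffDate: "2023-10-01", provider: "openai" } evaluated in
// December 2024 yields a 14-month gap, matching the figures shown in this README.
```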
4. Auto-Generated .cursorrules
After analysis, automatically creates protection rules:
# Anti-Hallucination Protection Rules
# Model: openai/gpt-4o
# Knowledge Gap: 14 months
## 🚨 CRITICAL PACKAGE WARNINGS
- ❌ NEVER use `request` - use `axios or node-fetch` instead
Reason: request package is fully deprecated
## 📦 Package Validation Rules
1. ALWAYS verify package names against npm registry
2. Check deprecation status before suggesting
3. Knowledge gap awareness: This model's knowledge is 14 months old
Protection Summary
After running the analyzer, you'll see:
============================================================
🛡️ ANTI-HALLUCINATION PROTECTION SUMMARY
============================================================
📦 Total packages validated: 47
✅ Valid packages: 43
⚠️ Deprecated packages: 2
❌ Not found: 2
🚨 Critical warnings: 2
🔶 High priority warnings: 3
🚨 CRITICAL ISSUES FOUND:
- request: request package is fully deprecated
💡 Use: axios or node-fetch
- protractor: Protractor is officially deprecated by Angular team
💡 Use: playwright or cypress
============================================================
CI/CD Integration
Integrate into your pipeline to block hallucinated packages from reaching production.
See docs/CI_CD_INTEGRATION.md for:
- GitHub Actions workflows
- Pre-commit hooks
- GitLab CI/CD
- Jenkins pipelines
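For a sense of what such a gate does, here is a hypothetical TypeScript sketch that reads the analysis output and fails the build on critical warnings (the bundled scripts/validate-packages-ci.ts and the exact JSON field shapes may differ):

```ts
// Fail CI when the analysis contains critical hallucination warnings.
import { readFileSync } from "node:fs";

interface HallucinationWarning {
  packageName: string;
  severity: "critical" | "high" | "medium";
  reason: string;
}

const analysis = JSON.parse(
  readFileSync("documentation-analysis.json", "utf8"),
) as { hallucinationWarnings?: HallucinationWarning[] };

const critical = (analysis.hallucinationWarnings ?? []).filter(
  (warning) => warning.severity === "critical",
);

if (critical.length > 0) {
  for (const warning of critical) {
    console.error(`❌ ${warning.packageName}: ${warning.reason}`);
  }
  process.exit(1); // non-zero exit blocks the pipeline
}
console.log("✅ No critical hallucinations detected");
```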
Example GitHub Action:
- name: Validate packages
run: bun scripts/validate-packages-ci.ts
# Exits with code 1 if critical hallucinations detected
🤖 AI Features
Structured Output
The AI analyzer uses Zod schemas to generate structured analysis:
{
title: "Getting Started Guide",
category: "getting-started",
summary: "Introduction to the platform...",
keyTopics: ["installation", "configuration", "first app"],
complexity: "beginner",
qualityScore: 85,
improvements: ["Add more code examples", "Include troubleshooting section"],
// Anti-hallucination fields
packageReferences: ["vite", "react", "typescript"],
analyzedWithModel: "openai/gpt-4o",
knowledgeGapMonths: 14,
hallucinationWarnings: [...]
}
AI Agent Tools
The documentation agent has access to:
- readDocFile - Read documentation files
- analyzeStructure - Analyze headings, code blocks, links
- extractCodeExamples - Extract and categorize code samples
- findGaps - Identify missing topics
- validatePackages - Check packages against npm registry (anti-hallucination)
Multi-Step Reasoning
Agents automatically:
- Read documentation content
- Extract package references
- Validate packages in real-time
- Analyze structure and extract examples
- Compare against best practices
- Generate quality scores and warnings
- Suggest improvements
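For readers curious how structured analysis like the JSON above is typically produced, here is a hedged sketch using the Vercel AI SDK's generateObject with a Zod schema (the field names are illustrative and the model wiring assumes the SDK resolves gateway-style "creator/model-name" ids; the project's actual schema and setup may differ):

```ts
import { generateObject } from "ai";
import { z } from "zod";

// Illustrative schema: a subset of the fields shown in the example above.
const analysisSchema = z.object({
  title: z.string(),
  category: z.string(),
  summary: z.string(),
  keyTopics: z.array(z.string()),
  complexity: z.enum(["beginner", "intermediate", "advanced"]),
  qualityScore: z.number().min(0).max(100),
  improvements: z.array(z.string()),
  packageReferences: z.array(z.string()),
});

export async function analyzeDoc(markdown: string) {
  const { object } = await generateObject({
    // Assumes the AI Gateway is configured via AI_GATEWAY_API_KEY so a plain
    // "creator/model-name" id (the AI_MODEL setting) can be passed here.
    model: process.env.AI_MODEL ?? "openai/gpt-4o",
    schema: analysisSchema,
    prompt: `Analyze this documentation page and fill every field:\n\n${markdown}`,
  });
  return object; // already validated against analysisSchema
}
```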
📁 Output Structure
scraped-content/
├── developer-context/
│ ├── architecture-&-core-concepts/
│ │ ├── 1-overview.md
│ │ ├── 2-architecture.md
│ │ └── _summary.md
│ ├── api-reference/
│ └── ...
├── user-context/
│ ├── getting-started/
│ └── ...
├── README.md
└── .cursorrules # ⭐ Anti-hallucination rules (auto-generated)
extracted-branding/ # ⭐ Extracted branding content
├── _extraction_summary.json
├── logos.json
├── colors.json
├── fonts.json
└── screenshots/
extracted-design/ # ⭐ Extracted design systems
├── _extraction_summary.json
├── components.json
├── patterns.json
├── tokens.json
└── assets/
extracted-tone/ # ⭐ Extracted brand voice
├── _extraction_summary.json
├── tones.json
├── samples.json
├── voice-guidelines.json
└── personality.json
documentation-analysis.json # AI analysis with package validation
documentation-analysis-report.md # Report with hallucination warnings
github-owner-repo/ # GitHub scrapes
├── README.md
├── docs/
├── .cursorrules # ⭐ Protection rules for this repo
└── ...
skills-output/ # ⭐ Generated Claude Skills
├── jeju-documentation-knowledge/
│ └── SKILL.md
├── jeju-quick-start/
│ └── SKILL.md
├── jeju-package-validation/
│ └── SKILL.md
└── ...
New Anti-Hallucination Files:
- .cursorrules - Protection rules for AI coding assistants
- documentation-analysis.json - Now includes packageValidation and hallucinationWarnings fields
- documentation-analysis-report.md - Now includes an anti-hallucination summary section
🛠️ Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| GITHUB_PAT | Yes | GitHub Personal Access Token |
| AI_GATEWAY_API_KEY | Yes | Vercel AI Gateway API Key |
| AI_GATEWAY_URL | No | Custom AI Gateway URL (defaults to Vercel) |
| START_URL | No | Target URL (can pass as argument) |
| MAX_PAGES | No | Max pages to crawl (default: 1000) |
| CONCURRENCY | No | Browser workers (default: 10) |
🎯 Examples
Example 1: Scrape and Analyze ElizaOS
# Scrape
siphon https://github.com/elizaOS/eliza
# The CLI automatically:
# 1. Scrapes the repository
# 2. Runs AI analysis
# 3. Validates all package references
# 4. Generates .cursorrules protection
# View protection summary
cat github-elizaOS-eliza/.cursorrules
# View full analysis
cat documentation-analysis-report.md
Sample Output:
============================================================
🛡️ ANTI-HALLUCINATION PROTECTION SUMMARY
============================================================
📦 Total packages validated: 47
✅ Valid packages: 43
⚠️ Deprecated packages: 2
❌ Not found: 2
🚨 Critical warnings: 2
Example 2: Stripe Documentation with Package Validation
# Scrape and analyze (runs automatically)
siphon https://docs.stripe.com
# Check for hallucinations in the analysis
grep "hallucination" documentation-analysis.json
# View generated protection rules
cat scraped-content/.cursorrules
Example 3: Protect Your Own Documentation
# Analyze your project's documentation
bun scripts/ai-analyzer.ts ./docs
# This will:
# - Extract all package references from your markdown files
# - Validate them against npm registry
# - Warn about deprecated packages
# - Generate .cursorrules to prevent future issues
Example 4: Interactive Mode
# Run without .env file - CLI will prompt for credentials
siphon https://docs.example.com
# Enter GitHub PAT: ghp_xxxxx
# Enter AI Gateway Key: sk_xxxxx
# Save to .env? Yes ✓
# Credentials saved for future use!
🔧 Advanced Usage
Custom AI Gateway
# Use your own AI Gateway instance
export AI_GATEWAY_URL=https://your-gateway.example.com/v1/ai
siphon https://docs.example.com
Skip AI Processing
# Scrape only, no AI analysis
siphon --skip-ai https://docs.example.com
Estimate Costs
# Estimate AI processing costs before running
bun scripts/cost-estimator.ts
🧪 Development
# Run in development mode
bun run dev
# Type check
bun run type-check
# Build
bun run build
# Clean output
bun run clean
📚 Project Structure
siphon-knowledge/
├── cli.ts # Main CLI entry point
├── logger.ts # Logging utilities
├── run-all.ts # Pipeline orchestrator
├── scripts/
│ ├── crawl.ts # Website crawler (Playwright)
│ ├── categorize.ts # URL categorization
│ ├── scrape-content.ts # Content scraper
│ ├── scrape-github.ts # GitHub API scraper
│ ├── ai-analyzer.ts # AI-powered analysis
│ └── generate-docs.ts # Doc generation
├── utils/
│ ├── env-validator.ts # Credential validation
│ └── ui.ts # TUI components
├── menus/ # Interactive menus
└── packages/
└── core/
└── utils/ # Shared utilities
🤝 Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Run type checking: bun run type-check
- Submit a pull request
📄 License
MIT
🙏 Credits
Built with:
- Bun - JavaScript runtime
- Vercel AI SDK - AI agents and structured output
- Playwright - Web scraping
- @clack/prompts - Interactive CLI
Happy Documentation Siphoning! 🎯✨
