🎯 Siphon Knowledge
Universal documentation scraper with AI-powered analysis, anti-hallucination protection, and Claude Skills generation
🌟 Features
🎨 Interactive & Command-Line Interface
- ✨ Interactive Mode - Beautiful persistent CLI with guided workflows
- ⚡ Command Mode - Traditional CLI for automation and scripts
- 🔄 Session Persistence - Settings remembered across menu cycles
📚 Documentation Scraping
- 🌐 Website Crawling - Intelligent web scraping with 3 crawl modes (see the preset sketch after this feature list):
- Rapid - Fast, shallow crawl (depth 2, up to 100 pages)
- Controlled - Balanced crawl (depth 5, up to 1,000 pages) - DEFAULT
- Worm - Deep, exhaustive crawl (unlimited depth, up to 10,000 pages)
- 📦 GitHub Integration - Extract docs from repos without cloning
- 🎨 Content Extraction - Specialized extractors for branding, design, tone, API docs
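To make the trade-off between the crawl modes concrete, here is a minimal TypeScript sketch of how they could be modeled as presets (the type and option names are illustrative, not the tool's actual internals):

```ts
// Illustrative presets for the three crawl modes described above.
type CrawlMode = "rapid" | "controlled" | "worm";

interface CrawlPreset {
  maxDepth: number; // how many link hops from the start URL
  maxPages: number; // hard cap on pages visited
}

const CRAWL_PRESETS: Record<CrawlMode, CrawlPreset> = {
  rapid: { maxDepth: 2, maxPages: 100 },
  controlled: { maxDepth: 5, maxPages: 1_000 }, // default
  worm: { maxDepth: Number.POSITIVE_INFINITY, maxPages: 10_000 },
};

function resolvePreset(mode: CrawlMode = "controlled"): CrawlPreset {
  return CRAWL_PRESETS[mode];
}
```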
🤖 AI-Powered Analysis
- 🛡️ Anti-Hallucination Protection - Real-time npm package validation
- 📊 Package Validation - Validates 25+ known hallucinations + npm registry checks
- 📈 Knowledge Gap Detection - Tracks 15+ AI model training cutoffs
- 🔧 Structured Output - JSON analysis with quality scores
- 🎯 Auto-Protection - Generates .cursorrules for AI coding assistants
🎯 Claude Skills Generation
- ✨ Auto-Generation - Create Skills from documentation analysis
- 📦 Distribution - Package for Claude Code or Claude.ai
- 🔍 Validation - Built-in Skill structure validation
- 📋 Management - List, validate, and package Skills
🏗️ Enterprise-Grade Architecture
- ⚡ 3x Faster Extraction - Optimized browser context management
- 🧪 Type-Safe - Full TypeScript with runtime validation
- 🚨 Error Handling - Structured errors with recovery strategies
- 🔐 Secure - API key validation and .env management
🏗️ Architecture Overview
Modular Command System
- Clean Separation: Commands are isolated modules with consistent interfaces
- Easy Extension: Add new commands without touching existing code
- Type Safety: Full TypeScript coverage with runtime validation
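As a rough illustration of what "isolated modules with consistent interfaces" means in practice, here is a hedged TypeScript sketch (the interface and command names are hypothetical, not siphon-knowledge's actual API):

```ts
// Hypothetical shape of a modular command system: every command implements
// one shared interface and is registered by name.
interface Command {
  name: string;
  description: string;
  run(args: string[]): Promise<void>;
}

const scrapeCommand: Command = {
  name: "scrape",
  description: "Crawl a documentation site and save its pages",
  async run(args) {
    const [url] = args;
    if (!url) throw new Error("scrape: a URL argument is required");
    console.log(`Scraping ${url}...`);
  },
};

// Adding a new command only means adding a registry entry; existing
// commands are never modified.
const registry = new Map<string, Command>([[scrapeCommand.name, scrapeCommand]]);

async function dispatch(argv: string[]): Promise<void> {
  const [name, ...rest] = argv;
  const command = registry.get(name ?? "");
  if (!command) throw new Error(`Unknown command: ${name}`);
  await command.run(rest);
}
```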
Performance Optimizations
- 3x Faster Extraction: Concurrent processing with intelligent batching
- Memory Efficient: Optimized browser context management
- Smart Retry Logic: Exponential backoff with failure recovery
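The retry behavior can be pictured with a small, self-contained sketch (an illustration of exponential backoff in general, not the project's exact implementation):

```ts
// Retry an async operation with exponential backoff: 500ms, 1s, 2s, ...
async function withRetry<T>(
  fn: () => Promise<T>,
  maxAttempts = 4,
  baseDelayMs = 500,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await fn();
    } catch (error) {
      lastError = error;
      const delayMs = baseDelayMs * 2 ** attempt;
      await new Promise((resolve) => setTimeout(resolve, delayMs));
    }
  }
  throw lastError;
}

// Usage: retry a flaky page fetch up to 4 times.
// const html = await withRetry(() => fetch("https://docs.example.com").then((r) => r.text()));
```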
Enterprise Features
- Configuration Management: Multi-source config with validation
- Error Handling: Structured errors with recovery strategies
- Comprehensive Testing: Unit and integration tests for reliability
- Code Quality: ESLint + Prettier with strict TypeScript rules
🚀 Quick Start
Prerequisites
- Bun runtime installed
- GitHub Personal Access Token (create one)
- Required scopes: public_repo, read:org
- Vercel AI Gateway API Key (REQUIRED - all AI inference goes through the gateway)
Installation
# Clone the repository
git clone <your-repo-url>
cd siphon-knowledge
# Install dependencies
bun install
# Install Playwright browsers
bunx playwright install chromium
# Link globally for `siphon` and `siphon-knowledge` commands
bun link
Configuration
Create a .env file (or use interactive prompts):
# GitHub Personal Access Token
GITHUB_PAT=ghp_your_token_here
# Vercel AI Gateway API Key (REQUIRED - ALL MODEL INFERENCE GOES THROUGH GATEWAY)
AI_GATEWAY_API_KEY=your_api_key_here
# Model to use via AI Gateway (creator/model-name format)
# Examples: openai/gpt-4o, anthropic/claude-3-5-sonnet-20241022, openai/gpt-4-turbo
AI_MODEL=openai/gpt-4o
# Optional: Custom AI Gateway URL (defaults to Vercel's gateway)
# AI_GATEWAY_URL=https://ai-gateway.vercel.sh/v1/ai
📖 Usage
Global Commands
After linking, you can use either command:
siphon # Interactive mode (recommended)
siphon <url> # Command mode
siphon-knowledge # Alias for siphon
🎨 Interactive Mode (Recommended)
Launch the beautiful interactive CLI with no arguments:
siphon
Navigate through menus:
- 🌐 Scrape Documentation - Guided website/GitHub scraping with crawl mode selection
- 🎨 Extract Content - Extract branding, design, tone, API docs, etc.
- 🎯 Generate Skills - Create and manage Claude Skills
- 🚀 Export Skills - Export to Claude Code or Claude.ai
- 🛡️ Anti-Hallucination - Package validation and protection
- ⚙️ Settings - Configure crawl mode, AI model, and preferences
Features:
- ✅ Persistent session - settings remembered
- ✅ Beautiful UI with @clack/prompts
- ✅ Guided workflows for complex tasks
- ✅ Discoverable - explore all features interactively
⚡ Command Mode (Automation)
Traditional CLI for scripts and automation:
Scrape Website Documentation
# Scrape from any documentation website
siphon https://docs.example.com
# The CLI will:
# 1. Validate GitHub PAT and AI Gateway API key (prompts if missing)
# 2. Crawl the website to discover all pages
# 3. Categorize URLs into developer/user contexts
# 4. Scrape content from all discovered pages
# 5. Generate structured documentation
Scrape GitHub Repository
# Scrape documentation from a GitHub repo (no cloning!)
siphon https://github.com/owner/repo
# Or specify a branch
siphon https://github.com/owner/repo/tree/develop
# The CLI will:
# 1. Validate credentials
# 2. Use GitHub API to fetch documentation files
# 3. Download only docs (README, /docs, .md files, etc.)
# 4. Organize by directory structure
AI-Powered Analysis with Anti-Hallucination Protection
After scraping, analyze the documentation with AI. Anti-hallucination protection runs automatically:
# Analyze all documentation in scraped-content/
bun scripts/ai-analyzer.ts scraped-content
# Output:
# - documentation-analysis.json (structured data with package validation)
# - documentation-analysis-report.md (human-readable with warnings)
# - .cursorrules (anti-hallucination protection rules)
Extract Specialized Content
Extract branding, design systems, and brand voice from websites:
# Extract branding (logos, colors, fonts)
siphon extract branding https://company.com
# Extract design system (components, patterns, tokens)
siphon extract design https://design-system.company.com
# Extract brand voice and tone
siphon extract tone https://brand-guidelines.company.com
# Output directories:
# - extracted-branding/
# - extracted-design/
# - extracted-tone/
See docs/EXTRACTION.md for detailed extraction capabilities.
Generate Agent Skills
Transform your analyzed documentation into Anthropic Claude Skills:
# Generate Skills from analysis results
bun scripts/generate-skills.ts documentation-analysis.json skills-output
# Output:
# ✅ Created skill: jeju-documentation-knowledge
# ✅ Created skill: jeju-quick-start
# ✅ Created skill: jeju-package-validation
# ✅ Created skill: jeju-api-reference
# ...and more category-specific skills
Skills are modular packages that extend Claude's capabilities. See docs/SKILLS.md for details.
The analyzer will:
- Extract all package references from documentation
- Validate them against a database of 25+ known hallucinated packages
- Check the npm registry in real time to confirm each package exists
- Detect deprecated packages and suggest alternatives
- Automatically generate .cursorrules to protect future AI interactions
- Display protection summary with critical warnings
Individual Scripts
# Just crawl URLs
bun scripts/crawl.ts
# Just categorize existing links
bun scripts/categorize.ts
# Just scrape content
bun scripts/scrape-content.ts
# Enhanced scraping with metadata
bun scripts/scrape-content-enhanced.ts
# Scrape GitHub only
bun scripts/scrape-github.ts https://github.com/owner/repo
# Extract specialized content
bun scripts/extractors/branding-extractor.ts https://company.com
bun scripts/extractors/design-extractor.ts https://design-system.com
bun scripts/extractors/tone-extractor.ts https://brand-guidelines.com
# Generate Skills from documentation
bun scripts/generate-skills.ts documentation-analysis.json skills-output
# Create a new skill manually
bun scripts/init-skill.ts my-new-skill --path skills
# Validate a skill
bun scripts/validate-skill.ts skills/my-skill
# Package a skill for distribution
bun scripts/package-skill.ts skills/my-skill
🛡️ Anti-Hallucination Protection
What is it?
AI models can "hallucinate" package names that don't exist or suggest deprecated packages they learned about during training. Siphon Knowledge automatically protects you from these issues.
How it Works
1. Known Hallucination Database (25+ Packages)
Tracks commonly hallucinated packages:
{
wrong: "request",
correct: "axios or node-fetch",
severity: "critical",
reason: "request package is fully deprecated"
}
Examples:
- tslint → eslint with @typescript-eslint
- protractor → playwright or cypress
- node-sass → sass (Dart Sass)
- @testing-library/react-hooks → @testing-library/react
- enzyme → @testing-library/react
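A lookup against such a database is essentially a table scan keyed on the hallucinated name; a minimal sketch (the entries mirror the examples above, while the real database covers 25+ packages):

```ts
// Sketch of querying a known-hallucination database.
interface HallucinationEntry {
  wrong: string;
  correct: string;
  severity: "critical" | "high" | "medium";
  reason: string;
}

const KNOWN_HALLUCINATIONS: HallucinationEntry[] = [
  {
    wrong: "request",
    correct: "axios or node-fetch",
    severity: "critical",
    reason: "request package is fully deprecated",
  },
  {
    wrong: "tslint",
    correct: "eslint with @typescript-eslint",
    severity: "high",
    reason: "TSLint is deprecated in favor of ESLint",
  },
];

function checkKnownHallucination(name: string): HallucinationEntry | undefined {
  return KNOWN_HALLUCINATIONS.find((entry) => entry.wrong === name);
}
```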
2. Real-Time npm Validation
Every package reference is validated against the actual npm registry:
// Checks if package exists
const result = await validatePackageRealtime("some-package");
// Returns:
{
packageName: "some-package",
exists: true,
isDeprecated: false,
validatedAt: "2025-10-26T...",
source: "npm-api"
}
Features:
- Rate limiting (max 10 requests/second)
- Caching with 15-minute TTL
- Parallel validation with concurrency control
- Automatic retry on errors
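A stripped-down version of such a validator might look like this (a sketch only: it queries the public npm registry packument endpoint and uses a simple in-memory cache, and it omits the rate limiting and concurrency control listed above):

```ts
interface ValidationResult {
  packageName: string;
  exists: boolean;
  isDeprecated: boolean;
  validatedAt: string;
  source: "npm-api" | "cache";
}

const cache = new Map<string, { result: ValidationResult; expiresAt: number }>();
const TTL_MS = 15 * 60 * 1000; // 15-minute cache

async function validatePackage(packageName: string): Promise<ValidationResult> {
  const cached = cache.get(packageName);
  if (cached && cached.expiresAt > Date.now()) {
    return { ...cached.result, source: "cache" };
  }

  // The registry returns 404 for unknown packages; scoped names may need
  // the "/" URL-encoded as %2F.
  const response = await fetch(`https://registry.npmjs.org/${packageName}`);
  let isDeprecated = false;
  if (response.ok) {
    const packument = await response.json();
    const latest = packument["dist-tags"]?.latest;
    isDeprecated = Boolean(latest && packument.versions?.[latest]?.deprecated);
  }

  const result: ValidationResult = {
    packageName,
    exists: response.ok,
    isDeprecated,
    validatedAt: new Date().toISOString(),
    source: "npm-api",
  };
  cache.set(packageName, { result, expiresAt: Date.now() + TTL_MS });
  return result;
}
```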
3. Model Training Cutoff Tracking (15+ Models)
Knows when each AI model's training data was cut off:
{
id: "gpt-4o",
cutoffDate: "2023-10-01",
provider: "openai",
commonIssues: [
"May not know packages released after October 2023",
"Unaware of Vite 5, React 19, Next.js 14+"
]
}
Automatically calculates the knowledge gap and warns about potentially outdated suggestions.
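The gap calculation itself is just date arithmetic; a small sketch consistent with the example above:

```ts
// Compute a model's knowledge gap in whole months from its training cutoff.
interface ModelCutoff {
  id: string;
  cutoffDate: string; // ISO date, e.g. "2023-10-01"
  provider: string;
}

function knowledgeGapMonths(model: ModelCutoff, now: Date = new Date()): number {
  const cutoff = new Date(model.cutoffDate);
  const months =
    (now.getFullYear() - cutoff.getFullYear()) * 12 +
    (now.getMonth() - cutoff.getMonth());
  return Math.max(0, months);
}

// { id: "gpt-4o", cutoffDate: "2023-10-01", provider: "openai" } evaluated in
// December 2024 yields a 14-month gap, matching the figures shown in this README.
```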
4. Auto-Generated .cursorrules
After analysis, automatically creates protection rules:
# Anti-Hallucination Protection Rules
# Model: openai/gpt-4o
# Knowledge Gap: 14 months
## 🚨 CRITICAL PACKAGE WARNINGS
- ❌ NEVER use `request` - use `axios or node-fetch` instead
Reason: request package is fully deprecated
## 📦 Package Validation Rules
1. ALWAYS verify package names against npm registry
2. Check deprecation status before suggesting
3. Knowledge gap awareness: This model's knowledge is 14 months old
Protection Summary
After running the analyzer, you'll see:
============================================================
🛡️ ANTI-HALLUCINATION PROTECTION SUMMARY
============================================================
📦 Total packages validated: 47
✅ Valid packages: 43
⚠️ Deprecated packages: 2
❌ Not found: 2
🚨 Critical warnings: 2
🔶 High priority warnings: 3
🚨 CRITICAL ISSUES FOUND:
- request: request package is fully deprecated
💡 Use: axios or node-fetch
- protractor: Protractor is officially deprecated by Angular team
💡 Use: playwright or cypress
============================================================
CI/CD Integration
Integrate into your pipeline to block hallucinated packages from reaching production.
See docs/CI_CD_INTEGRATION.md for:
- GitHub Actions workflows
- Pre-commit hooks
- GitLab CI/CD
- Jenkins pipelines
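For a sense of what such a gate does, here is a hypothetical TypeScript sketch that reads the analysis output and fails the build on critical warnings (the bundled scripts/validate-packages-ci.ts and the exact JSON field shapes may differ):

```ts
// Fail CI when the analysis contains critical hallucination warnings.
import { readFileSync } from "node:fs";

interface HallucinationWarning {
  packageName: string;
  severity: "critical" | "high" | "medium";
  reason: string;
}

const analysis = JSON.parse(
  readFileSync("documentation-analysis.json", "utf8"),
) as { hallucinationWarnings?: HallucinationWarning[] };

const critical = (analysis.hallucinationWarnings ?? []).filter(
  (warning) => warning.severity === "critical",
);

if (critical.length > 0) {
  for (const warning of critical) {
    console.error(`❌ ${warning.packageName}: ${warning.reason}`);
  }
  process.exit(1); // non-zero exit blocks the pipeline
}
console.log("✅ No critical hallucinations detected");
```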
Example GitHub Action:
- name: Validate packages
run: bun scripts/validate-packages-ci.ts
# Exits with code 1 if critical hallucinations detected
🤖 AI Features
Structured Output
The AI analyzer uses Zod schemas to generate structured analysis:
{
title: "Getting Started Guide",
category: "getting-started",
summary: "Introduction to the platform...",
keyTopics: ["installation", "configuration", "first app"],
complexity: "beginner",
qualityScore: 85,
improvements: ["Add more code examples", "Include troubleshooting section"],
// Anti-hallucination fields
packageReferences: ["vite", "react", "typescript"],
analyzedWithModel: "openai/gpt-4o",
knowledgeGapMonths: 14,
hallucinationWarnings: [...]
}
AI Agent Tools
The documentation agent has access to:
- readDocFile - Read documentation files
- analyzeStructure - Analyze headings, code blocks, links
- extractCodeExamples - Extract and categorize code samples
- findGaps - Identify missing topics
- validatePackages - Check packages against npm registry (anti-hallucination)
Multi-Step Reasoning
Agents automatically:
- Read documentation content
- Extract package references
- Validate packages in real-time
- Analyze structure and extract examples
- Compare against best practices
- Generate quality scores and warnings
- Suggest improvements
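For readers curious how structured analysis like the JSON above is typically produced, here is a hedged sketch using the Vercel AI SDK's generateObject with a Zod schema (the field names are illustrative and the model wiring assumes the SDK resolves gateway-style "creator/model-name" ids; the project's actual schema and setup may differ):

```ts
import { generateObject } from "ai";
import { z } from "zod";

// Illustrative schema: a subset of the fields shown in the example above.
const analysisSchema = z.object({
  title: z.string(),
  category: z.string(),
  summary: z.string(),
  keyTopics: z.array(z.string()),
  complexity: z.enum(["beginner", "intermediate", "advanced"]),
  qualityScore: z.number().min(0).max(100),
  improvements: z.array(z.string()),
  packageReferences: z.array(z.string()),
});

export async function analyzeDoc(markdown: string) {
  const { object } = await generateObject({
    // Assumes the AI Gateway is configured via AI_GATEWAY_API_KEY so a plain
    // "creator/model-name" id (the AI_MODEL setting) can be passed here.
    model: process.env.AI_MODEL ?? "openai/gpt-4o",
    schema: analysisSchema,
    prompt: `Analyze this documentation page and fill every field:\n\n${markdown}`,
  });
  return object; // already validated against analysisSchema
}
```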
📁 Output Structure
scraped-content/
├── developer-context/
│ ├── architecture-&-core-concepts/
│ │ ├── 1-overview.md
│ │ ├── 2-architecture.md
│ │ └── _summary.md
│ ├── api-reference/
│ └── ...
├── user-context/
│ ├── getting-started/
│ └── ...
├── README.md
└── .cursorrules # ⭐ Anti-hallucination rules (auto-generated)
extracted-branding/ # ⭐ Extracted branding content
├── _extraction_summary.json
├── logos.json
├── colors.json
├── fonts.json
└── screenshots/
extracted-design/ # ⭐ Extracted design systems
├── _extraction_summary.json
├── components.json
├── patterns.json
├── tokens.json
└── assets/
extracted-tone/ # ⭐ Extracted brand voice
├── _extraction_summary.json
├── tones.json
├── samples.json
├── voice-guidelines.json
└── personality.json
documentation-analysis.json # AI analysis with package validation
documentation-analysis-report.md # Report with hallucination warnings
github-owner-repo/ # GitHub scrapes
├── README.md
├── docs/
├── .cursorrules # ⭐ Protection rules for this repo
└── ...
skills-output/ # ⭐ Generated Claude Skills
├── jeju-documentation-knowledge/
│ └── SKILL.md
├── jeju-quick-start/
│ └── SKILL.md
├── jeju-package-validation/
│ └── SKILL.md
└── ...
New Anti-Hallucination Files:
- .cursorrules - Protection rules for AI coding assistants
- documentation-analysis.json - Now includes packageValidation and hallucinationWarnings fields
- documentation-analysis-report.md - Now includes an anti-hallucination summary section
🛠️ Environment Variables
| Variable | Required | Description |
|----------|----------|-------------|
| GITHUB_PAT | Yes | GitHub Personal Access Token |
| AI_GATEWAY_API_KEY | Yes | Vercel AI Gateway API Key |
| AI_GATEWAY_URL | No | Custom AI Gateway URL (defaults to Vercel) |
| START_URL | No | Target URL (can pass as argument) |
| MAX_PAGES | No | Max pages to crawl (default: 1000) |
| CONCURRENCY | No | Browser workers (default: 10) |
🎯 Examples
Example 1: Scrape and Analyze ElizaOS
# Scrape
siphon https://github.com/elizaOS/eliza
# The CLI automatically:
# 1. Scrapes the repository
# 2. Runs AI analysis
# 3. Validates all package references
# 4. Generates .cursorrules protection
# View protection summary
cat github-elizaOS-eliza/.cursorrules
# View full analysis
cat documentation-analysis-report.md
Sample Output:
============================================================
🛡️ ANTI-HALLUCINATION PROTECTION SUMMARY
============================================================
📦 Total packages validated: 47
✅ Valid packages: 43
⚠️ Deprecated packages: 2
❌ Not found: 2
🚨 Critical warnings: 2
Example 2: Stripe Documentation with Package Validation
# Scrape and analyze (runs automatically)
siphon https://docs.stripe.com
# Check for hallucinations in the analysis
grep "hallucination" documentation-analysis.json
# View generated protection rules
cat scraped-content/.cursorrules
Example 3: Protect Your Own Documentation
# Analyze your project's documentation
bun scripts/ai-analyzer.ts ./docs
# This will:
# - Extract all package references from your markdown files
# - Validate them against npm registry
# - Warn about deprecated packages
# - Generate .cursorrules to prevent future issues
Example 4: Interactive Mode
# Run without .env file - CLI will prompt for credentials
siphon https://docs.example.com
# Enter GitHub PAT: ghp_xxxxx
# Enter AI Gateway Key: sk_xxxxx
# Save to .env? Yes ✓
# Credentials saved for future use!
🔧 Advanced Usage
Custom AI Gateway
# Use your own AI Gateway instance
export AI_GATEWAY_URL=https://your-gateway.example.com/v1/ai
siphon https://docs.example.com
Skip AI Processing
# Scrape only, no AI analysis
siphon --skip-ai https://docs.example.com
Estimate Costs
# Estimate AI processing costs before running
bun scripts/cost-estimator.ts
🧪 Development
# Run in development mode
bun run dev
# Type check
bun run type-check
# Build
bun run build
# Clean output
bun run clean
📚 Project Structure
siphon-knowledge/
├── cli.ts # Main CLI entry point
├── logger.ts # Logging utilities
├── run-all.ts # Pipeline orchestrator
├── scripts/
│ ├── crawl.ts # Website crawler (Playwright)
│ ├── categorize.ts # URL categorization
│ ├── scrape-content.ts # Content scraper
│ ├── scrape-github.ts # GitHub API scraper
│ ├── ai-analyzer.ts # AI-powered analysis
│ └── generate-docs.ts # Doc generation
├── utils/
│ ├── env-validator.ts # Credential validation
│ └── ui.ts # TUI components
├── menus/ # Interactive menus
└── packages/
└── core/
└── utils/ # Shared utilities
🤝 Contributing
- Fork the repository
- Create a feature branch
- Make your changes
- Run type checking: bun run type-check
- Submit a pull request
📄 License
MIT
🙏 Credits
Built with:
- Bun - JavaScript runtime
- Vercel AI SDK - AI agents and structured output
- Playwright - Web scraping
- @clack/prompts - Interactive CLI
Happy Documentation Siphoning! 🎯✨
