
siphon-knowledge v1.0.1 · 21 downloads

🎯 Siphon Knowledge

Universal documentation scraper with AI-powered analysis, anti-hallucination protection, and Claude Skills generation

License: MIT · Bun · TypeScript

🌟 Features

🎨 Interactive & Command-Line Interface

  • ✨ Interactive Mode - Beautiful persistent CLI with guided workflows
  • ⚡ Command Mode - Traditional CLI for automation and scripts
  • 🔄 Session Persistence - Settings remembered across menu cycles

📚 Documentation Scraping

  • 🌐 Website Crawling - Intelligent web scraping with 3 crawl modes:
    • Rapid - Fast, shallow crawl (depth 2, up to 100 pages)
    • Controlled - Balanced approach (depth 5, up to 1,000 pages) - DEFAULT
    • Worm - Deep, exhaustive crawl (unlimited depth, up to 10,000 pages)
  • 📦 GitHub Integration - Extract docs from repos without cloning
  • 🎨 Content Extraction - Specialized extractors for branding, design, tone, API docs

🤖 AI-Powered Analysis

  • 🛡️ Anti-Hallucination Protection - Real-time npm package validation
  • 📊 Package Validation - Validates 25+ known hallucinations + npm registry checks
  • 📈 Knowledge Gap Detection - Tracks 15+ AI model training cutoffs
  • 🔧 Structured Output - JSON analysis with quality scores
  • 🎯 Auto-Protection - Generates .cursorrules for AI coding assistants

🎯 Claude Skills Generation

  • ✨ Auto-Generation - Create Skills from documentation analysis
  • 📦 Distribution - Package for Claude Code or Claude.ai
  • 🔍 Validation - Built-in Skill structure validation
  • 📋 Management - List, validate, and package Skills

🏗️ Enterprise-Grade Architecture

  • ⚡ 3x Performance - Optimized browser context management
  • 🧪 Type-Safe - Full TypeScript with runtime validation
  • 🚨 Error Handling - Structured errors with recovery strategies
  • 🔐 Secure - API key validation and .env management

🏗️ Architecture Overview

Modular Command System

  • Clean Separation: Commands are isolated modules with consistent interfaces
  • Easy Extension: Add new commands without touching existing code
  • Type Safety: Full TypeScript coverage with runtime validation

Performance Optimizations

  • 3x Faster Extraction: Concurrent processing with intelligent batching
  • Memory Efficient: Optimized browser context management
  • Smart Retry Logic: Exponential backoff with failure recovery
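
The retry behavior described above can be sketched as a small helper. This is an illustrative sketch, not the package's actual API; `withRetry` and its option names are hypothetical.

```typescript
// Hypothetical sketch of a retry-with-exponential-backoff helper; the
// function name and options are illustrative, not siphon-knowledge's API.
export interface RetryOptions {
  maxAttempts?: number; // give up after this many tries
  baseDelayMs?: number; // first backoff delay
  maxDelayMs?: number;  // cap on any single delay
}

export async function withRetry<T>(
  task: () => Promise<T>,
  { maxAttempts = 4, baseDelayMs = 250, maxDelayMs = 5_000 }: RetryOptions = {},
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await task();
    } catch (err) {
      lastError = err;
      if (attempt === maxAttempts - 1) break;
      // Exponential backoff: 250ms, 500ms, 1000ms, ... capped at maxDelayMs.
      const delay = Math.min(baseDelayMs * 2 ** attempt, maxDelayMs);
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```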

Enterprise Features

  • Configuration Management: Multi-source config with validation
  • Error Handling: Structured errors with recovery strategies
  • Comprehensive Testing: Unit and integration tests for reliability
  • Code Quality: ESLint + Prettier with strict TypeScript rules

🚀 Quick Start

Prerequisites

  • Bun runtime installed
  • GitHub Personal Access Token (create one)
    • Required scopes: public_repo, read:org
  • Vercel AI Gateway API Key (REQUIRED - all AI inference goes through gateway)

Installation

# Clone the repository
git clone <your-repo-url>
cd siphon-knowledge

# Install dependencies
bun install

# Install Playwright browsers
bunx playwright install chromium

# Link globally for `siphon` and `siphon-knowledge` commands
bun link

Configuration

Create a .env file (or use interactive prompts):

# GitHub Personal Access Token
GITHUB_PAT=ghp_your_token_here

# Vercel AI Gateway API Key (REQUIRED - ALL MODEL INFERENCE GOES THROUGH GATEWAY)
AI_GATEWAY_API_KEY=your_api_key_here

# Model to use via AI Gateway (creator/model-name format)
# Examples: openai/gpt-4o, anthropic/claude-3-5-sonnet-20241022, openai/gpt-4-turbo
AI_MODEL=openai/gpt-4o

# Optional: Custom AI Gateway URL (defaults to Vercel's gateway)
# AI_GATEWAY_URL=https://ai-gateway.vercel.sh/v1/ai
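
A sketch of how the required variables might be checked at startup (the repo's utils/env-validator.ts handles this; the function below is illustrative only):

```typescript
// Illustrative sketch of required-variable validation; the required keys
// come from the environment variables documented in this README.
const REQUIRED = ["GITHUB_PAT", "AI_GATEWAY_API_KEY"] as const;

export function missingEnvVars(env: Record<string, string | undefined>): string[] {
  // Treat unset and whitespace-only values as missing.
  return REQUIRED.filter((key) => !env[key] || env[key]!.trim() === "");
}
```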

📖 Usage

Global Commands

After linking, you can use either command:

siphon              # Interactive mode (recommended)
siphon <url>        # Command mode
siphon-knowledge    # Alias for siphon

🎨 Interactive Mode (Recommended)

Launch the beautiful interactive CLI with no arguments:

siphon

Navigate through menus:

  • 🌐 Scrape Documentation - Guided website/GitHub scraping with crawl mode selection
  • 🎨 Extract Content - Extract branding, design, tone, API docs, etc.
  • 🎯 Generate Skills - Create and manage Claude Skills
  • 🚀 Export Skills - Export to Claude Code or Claude.ai
  • 🛡️ Anti-Hallucination - Package validation and protection
  • ⚙️ Settings - Configure crawl mode, AI model, and preferences

Features:

  • ✅ Persistent session - settings remembered
  • ✅ Beautiful UI with @clack/prompts
  • ✅ Guided workflows for complex tasks
  • ✅ Discoverable - explore all features interactively

⚡ Command Mode (Automation)

Traditional CLI for scripts and automation:

Scrape Website Documentation

# Scrape from any documentation website
siphon https://docs.example.com

# The CLI will:
# 1. Validate GitHub PAT and AI Gateway API key (prompts if missing)
# 2. Crawl the website to discover all pages
# 3. Categorize URLs into developer/user contexts
# 4. Scrape content from all discovered pages
# 5. Generate structured documentation

Scrape GitHub Repository

# Scrape documentation from a GitHub repo (no cloning!)
siphon https://github.com/owner/repo

# Or specify a branch
siphon https://github.com/owner/repo/tree/develop

# The CLI will:
# 1. Validate credentials
# 2. Use GitHub API to fetch documentation files
# 3. Download only docs (README, /docs, .md files, etc.)
# 4. Organize by directory structure

AI-Powered Analysis with Anti-Hallucination Protection

After scraping, analyze the documentation with AI. Anti-hallucination protection runs automatically:

# Analyze all documentation in scraped-content/
bun scripts/ai-analyzer.ts scraped-content

# Output:
# - documentation-analysis.json (structured data with package validation)
# - documentation-analysis-report.md (human-readable with warnings)
# - .cursorrules (anti-hallucination protection rules)

Extract Specialized Content

Extract branding, design systems, and brand voice from websites:

# Extract branding (logos, colors, fonts)
siphon extract branding https://company.com

# Extract design system (components, patterns, tokens)
siphon extract design https://design-system.company.com

# Extract brand voice and tone
siphon extract tone https://brand-guidelines.company.com

# Output directories:
# - extracted-branding/
# - extracted-design/
# - extracted-tone/

See docs/EXTRACTION.md for detailed extraction capabilities.

Generate Agent Skills

Transform your analyzed documentation into Anthropic Claude Skills:

# Generate Skills from analysis results
bun scripts/generate-skills.ts documentation-analysis.json skills-output

# Output:
# ✅ Created skill: jeju-documentation-knowledge
# ✅ Created skill: jeju-quick-start
# ✅ Created skill: jeju-package-validation
# ✅ Created skill: jeju-api-reference
# ...and more category-specific skills

Skills are modular packages that extend Claude's capabilities. See docs/SKILLS.md for details.

The analyzer will:

  1. Extract all package references from documentation
  2. Validate against 25+ known hallucinations database
  3. Check npm registry in real-time for package existence
  4. Detect deprecated packages and suggest alternatives
  5. Automatically generate .cursorrules to protect future AI interactions
  6. Display protection summary with critical warnings
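
Step 1 (extracting package references) might look roughly like the sketch below; the real analyzer's extraction logic may differ, and this regex only catches import/require specifiers in code samples.

```typescript
// Illustrative sketch of pulling package references out of markdown by
// scanning import/require specifiers; not the analyzer's actual logic.
const IMPORT_RE = /(?:from\s+|require\()\s*["']([^"'./][^"']*)["']/g;

export function extractPackageReferences(markdown: string): string[] {
  const names = new Set<string>();
  for (const match of markdown.matchAll(IMPORT_RE)) {
    const specifier = match[1];
    // Normalize deep imports like "lodash/merge" to the package name,
    // keeping the scope for "@scope/pkg/sub".
    const parts = specifier.split("/");
    names.add(specifier.startsWith("@") ? parts.slice(0, 2).join("/") : parts[0]);
  }
  return [...names].sort();
}
```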

Individual Scripts

# Just crawl URLs
bun scripts/crawl.ts

# Just categorize existing links
bun scripts/categorize.ts

# Just scrape content
bun scripts/scrape-content.ts

# Enhanced scraping with metadata
bun scripts/scrape-content-enhanced.ts

# Scrape GitHub only
bun scripts/scrape-github.ts https://github.com/owner/repo

# Extract specialized content
bun scripts/extractors/branding-extractor.ts https://company.com
bun scripts/extractors/design-extractor.ts https://design-system.com
bun scripts/extractors/tone-extractor.ts https://brand-guidelines.com

# Generate Skills from documentation
bun scripts/generate-skills.ts documentation-analysis.json skills-output

# Create a new skill manually
bun scripts/init-skill.ts my-new-skill --path skills

# Validate a skill
bun scripts/validate-skill.ts skills/my-skill

# Package a skill for distribution
bun scripts/package-skill.ts skills/my-skill

🛡️ Anti-Hallucination Protection

What is it?

AI models can "hallucinate" package names that don't exist or suggest deprecated packages they learned during training. Siphon Knowledge automatically protects you from these issues.

How it Works

1. Known Hallucination Database (25+ Packages)

Tracks commonly hallucinated packages:

{
  wrong: "request",
  correct: "axios or node-fetch",
  severity: "critical",
  reason: "request package is fully deprecated"
}

Examples:

  • tslint → eslint with @typescript-eslint
  • protractor → playwright or cypress
  • node-sass → sass (Dart Sass)
  • @testing-library/react-hooks → @testing-library/react
  • enzyme → @testing-library/react
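
A lookup against that table could be as simple as the sketch below. The entries mirror the examples above; the real database tracks 25+ packages, and the shape here is illustrative.

```typescript
// Minimal sketch of a known-hallucination lookup table; entries mirror the
// README's examples, the real database is larger.
type Severity = "critical" | "high" | "medium";

interface HallucinationEntry {
  wrong: string;
  correct: string;
  severity: Severity;
  reason: string;
}

const KNOWN_HALLUCINATIONS: HallucinationEntry[] = [
  { wrong: "request", correct: "axios or node-fetch", severity: "critical", reason: "request package is fully deprecated" },
  { wrong: "tslint", correct: "eslint with @typescript-eslint", severity: "high", reason: "TSLint is deprecated" },
  { wrong: "node-sass", correct: "sass (Dart Sass)", severity: "high", reason: "node-sass is deprecated" },
];

export function checkHallucination(pkg: string): HallucinationEntry | undefined {
  return KNOWN_HALLUCINATIONS.find((entry) => entry.wrong === pkg);
}
```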

2. Real-Time npm Validation

Every package reference is validated against the actual npm registry:

// Checks if package exists
const result = await validatePackageRealtime("some-package");

// Returns:
{
  packageName: "some-package",
  exists: true,
  isDeprecated: false,
  validatedAt: "2025-10-26T...",
  source: "npm-api"
}

Features:

  • Rate limiting (max 10 requests/second)
  • Caching with 15-minute TTL
  • Parallel validation with concurrency control
  • Automatic retry on errors
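
Putting the registry check and the 15-minute cache together might look like the sketch below. `validatePackageRealtime` and the TTL come from this README; the injectable fetcher and the rest of the shape are illustrative (rate limiting and retries omitted for brevity).

```typescript
// Sketch of cached real-time npm validation. The fetcher is injectable so
// the registry call can be stubbed in tests; shape is illustrative.
interface ValidationResult {
  packageName: string;
  exists: boolean;
  validatedAt: string;
  source: "npm-api" | "cache";
}

const TTL_MS = 15 * 60 * 1000; // 15-minute TTL, per the README
const cache = new Map<string, { result: ValidationResult; expires: number }>();

type ExistsFetcher = (name: string) => Promise<boolean>;

// Default fetcher hits the npm registry: a 200 response means the package exists.
const npmFetcher: ExistsFetcher = async (name) =>
  (await fetch(`https://registry.npmjs.org/${encodeURIComponent(name)}`)).ok;

export async function validatePackageRealtime(
  name: string,
  fetcher: ExistsFetcher = npmFetcher,
): Promise<ValidationResult> {
  const hit = cache.get(name);
  if (hit && hit.expires > Date.now()) {
    return { ...hit.result, source: "cache" };
  }
  const result: ValidationResult = {
    packageName: name,
    exists: await fetcher(name),
    validatedAt: new Date().toISOString(),
    source: "npm-api",
  };
  cache.set(name, { result, expires: Date.now() + TTL_MS });
  return result;
}
```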

3. Model Training Cutoff Tracking (15+ Models)

Knows when each AI model's training data was cut off:

{
  id: "gpt-4o",
  cutoffDate: "2023-10-01",
  provider: "openai",
  commonIssues: [
    "May not know packages released after October 2023",
    "Unaware of Vite 5, React 19, Next.js 14+"
  ]
}

Automatically calculates knowledge gap and warns about potentially outdated suggestions.
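
The gap calculation could be as simple as a whole-month difference between the cutoff and now; this is an illustrative sketch, not necessarily how the tool computes it.

```typescript
// Illustrative whole-month knowledge-gap calculation from a model cutoff date.
export function knowledgeGapMonths(cutoffDate: string, now: Date = new Date()): number {
  const cutoff = new Date(cutoffDate);
  const months =
    (now.getUTCFullYear() - cutoff.getUTCFullYear()) * 12 +
    (now.getUTCMonth() - cutoff.getUTCMonth());
  return Math.max(0, months); // never report a negative gap
}
```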

4. Auto-Generated .cursorrules

After analysis, automatically creates protection rules:

# Anti-Hallucination Protection Rules
# Model: openai/gpt-4o
# Knowledge Gap: 14 months

## 🚨 CRITICAL PACKAGE WARNINGS

- ❌ NEVER use `request` - use `axios or node-fetch` instead
  Reason: request package is fully deprecated

## 📦 Package Validation Rules

1. ALWAYS verify package names against npm registry
2. Check deprecation status before suggesting
3. Knowledge gap awareness: This model's knowledge is 14 months old
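
Rendering warnings into that file format might look like the following. This is purely an illustrative formatter; the generator in the repo may structure its output differently.

```typescript
// Sketch of rendering critical warnings into a .cursorrules header block
// like the example above; formatting is illustrative only.
interface CriticalWarning {
  wrong: string;
  correct: string;
  reason: string;
}

export function renderCursorRules(
  model: string,
  gapMonths: number,
  warnings: CriticalWarning[],
): string {
  const lines = [
    "# Anti-Hallucination Protection Rules",
    `# Model: ${model}`,
    `# Knowledge Gap: ${gapMonths} months`,
    "",
    "## 🚨 CRITICAL PACKAGE WARNINGS",
    "",
  ];
  for (const w of warnings) {
    lines.push(`- ❌ NEVER use \`${w.wrong}\` - use \`${w.correct}\` instead`);
    lines.push(`  Reason: ${w.reason}`);
  }
  return lines.join("\n");
}
```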

Protection Summary

After running the analyzer, you'll see:

============================================================
🛡️  ANTI-HALLUCINATION PROTECTION SUMMARY
============================================================
📦 Total packages validated: 47
✅ Valid packages: 43
⚠️  Deprecated packages: 2
❌ Not found: 2
🚨 Critical warnings: 2
🔶 High priority warnings: 3

🚨 CRITICAL ISSUES FOUND:
  - request: request package is fully deprecated
    💡 Use: axios or node-fetch
  - protractor: Protractor is officially deprecated by Angular team
    💡 Use: playwright or cypress
============================================================

CI/CD Integration

Integrate into your pipeline to block hallucinated packages from reaching production.

See docs/CI_CD_INTEGRATION.md for:

  • GitHub Actions workflows
  • Pre-commit hooks
  • GitLab CI/CD
  • Jenkins pipelines

Example GitHub Action:

- name: Validate packages
  run: bun scripts/validate-packages-ci.ts
  # Exits with code 1 if critical hallucinations detected
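
The exit-code contract the Action relies on (nonzero only for critical findings) can be isolated into a small pure function; the names below are illustrative, not the script's actual internals.

```typescript
// Sketch of the CI exit-code rule: fail the pipeline only when a critical
// hallucination is found. Names are illustrative.
interface PackageWarning {
  packageName: string;
  severity: "critical" | "high" | "medium";
}

export function ciExitCode(warnings: PackageWarning[]): 0 | 1 {
  return warnings.some((w) => w.severity === "critical") ? 1 : 0;
}
```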

🤖 AI Features

Structured Output

The AI analyzer uses Zod schemas to generate structured analysis:

{
  title: "Getting Started Guide",
  category: "getting-started",
  summary: "Introduction to the platform...",
  keyTopics: ["installation", "configuration", "first app"],
  complexity: "beginner",
  qualityScore: 85,
  improvements: ["Add more code examples", "Include troubleshooting section"],

  // Anti-hallucination fields
  packageReferences: ["vite", "react", "typescript"],
  analyzedWithModel: "openai/gpt-4o",
  knowledgeGapMonths: 14,
  hallucinationWarnings: [...]
}
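
The analyzer uses Zod for this; as a dependency-free sketch of the same idea, a runtime check over a subset of that shape might look like:

```typescript
// Dependency-free sketch of validating part of the analysis shape above.
// The real analyzer uses Zod schemas; this only illustrates the idea.
interface DocAnalysis {
  title: string;
  category: string;
  qualityScore: number; // 0-100
  packageReferences: string[];
}

export function isDocAnalysis(value: unknown): value is DocAnalysis {
  if (typeof value !== "object" || value === null) return false;
  const v = value as Record<string, unknown>;
  return (
    typeof v.title === "string" &&
    typeof v.category === "string" &&
    typeof v.qualityScore === "number" &&
    v.qualityScore >= 0 &&
    v.qualityScore <= 100 &&
    Array.isArray(v.packageReferences) &&
    v.packageReferences.every((p) => typeof p === "string")
  );
}
```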

AI Agent Tools

The documentation agent has access to:

  • readDocFile - Read documentation files
  • analyzeStructure - Analyze headings, code blocks, links
  • extractCodeExamples - Extract and categorize code samples
  • findGaps - Identify missing topics
  • validatePackages - Check packages against npm registry (anti-hallucination)

Multi-Step Reasoning

Agents automatically:

  1. Read documentation content
  2. Extract package references
  3. Validate packages in real-time
  4. Analyze structure and extract examples
  5. Compare against best practices
  6. Generate quality scores and warnings
  7. Suggest improvements

📁 Output Structure

scraped-content/
├── developer-context/
│   ├── architecture-&-core-concepts/
│   │   ├── 1-overview.md
│   │   ├── 2-architecture.md
│   │   └── _summary.md
│   ├── api-reference/
│   └── ...
├── user-context/
│   ├── getting-started/
│   └── ...
├── README.md
└── .cursorrules                 # ⭐ Anti-hallucination rules (auto-generated)

extracted-branding/              # ⭐ Extracted branding content
├── _extraction_summary.json
├── logos.json
├── colors.json
├── fonts.json
└── screenshots/

extracted-design/                # ⭐ Extracted design systems
├── _extraction_summary.json
├── components.json
├── patterns.json
├── tokens.json
└── assets/

extracted-tone/                  # ⭐ Extracted brand voice
├── _extraction_summary.json
├── tones.json
├── samples.json
├── voice-guidelines.json
└── personality.json

documentation-analysis.json       # AI analysis with package validation
documentation-analysis-report.md  # Report with hallucination warnings

github-owner-repo/               # GitHub scrapes
├── README.md
├── docs/
├── .cursorrules                 # ⭐ Protection rules for this repo
└── ...

skills-output/                   # ⭐ Generated Claude Skills
├── jeju-documentation-knowledge/
│   └── SKILL.md
├── jeju-quick-start/
│   └── SKILL.md
├── jeju-package-validation/
│   └── SKILL.md
└── ...

New Anti-Hallucination Files:

  • .cursorrules - Protection rules for AI coding assistants
  • documentation-analysis.json - Now includes packageValidation and hallucinationWarnings fields
  • documentation-analysis-report.md - Now includes anti-hallucination summary section

🛠️ Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| GITHUB_PAT | Yes | GitHub Personal Access Token |
| AI_GATEWAY_API_KEY | Yes | Vercel AI Gateway API Key |
| AI_GATEWAY_URL | No | Custom AI Gateway URL (defaults to Vercel) |
| START_URL | No | Target URL (can pass as argument) |
| MAX_PAGES | No | Max pages to crawl (default: 1000) |
| CONCURRENCY | No | Browser workers (default: 10) |

🎯 Examples

Example 1: Scrape and Analyze ElizaOS

# Scrape
siphon https://github.com/elizaOS/eliza

# The CLI automatically:
# 1. Scrapes the repository
# 2. Runs AI analysis
# 3. Validates all package references
# 4. Generates .cursorrules protection

# View protection summary
cat github-elizaOS-eliza/.cursorrules

# View full analysis
cat documentation-analysis-report.md

Sample Output:

============================================================
🛡️  ANTI-HALLUCINATION PROTECTION SUMMARY
============================================================
📦 Total packages validated: 47
✅ Valid packages: 43
⚠️  Deprecated packages: 2
❌ Not found: 2
🚨 Critical warnings: 2

Example 2: Stripe Documentation with Package Validation

# Scrape and analyze (runs automatically)
siphon https://docs.stripe.com

# Check for hallucinations in the analysis
grep "hallucination" documentation-analysis.json

# View generated protection rules
cat scraped-content/.cursorrules

Example 3: Protect Your Own Documentation

# Analyze your project's documentation
bun scripts/ai-analyzer.ts ./docs

# This will:
# - Extract all package references from your markdown files
# - Validate them against npm registry
# - Warn about deprecated packages
# - Generate .cursorrules to prevent future issues

Example 4: Interactive Mode

# Run without .env file - CLI will prompt for credentials
siphon https://docs.example.com

# Enter GitHub PAT: ghp_xxxxx
# Enter AI Gateway Key: sk_xxxxx
# Save to .env? Yes ✓

# Credentials saved for future use!

🔧 Advanced Usage

Custom AI Gateway

# Use your own AI Gateway instance
export AI_GATEWAY_URL=https://your-gateway.example.com/v1/ai
siphon https://docs.example.com

Skip AI Processing

# Scrape only, no AI analysis
siphon --skip-ai https://docs.example.com

Estimate Costs

# Estimate AI processing costs before running
bun scripts/cost-estimator.ts

🧪 Development

# Run in development mode
bun run dev

# Type check
bun run type-check

# Build
bun run build

# Clean output
bun run clean

📚 Project Structure

siphon-knowledge/
├── cli.ts                    # Main CLI entry point
├── logger.ts                 # Logging utilities
├── run-all.ts               # Pipeline orchestrator
├── scripts/
│   ├── crawl.ts             # Website crawler (Playwright)
│   ├── categorize.ts        # URL categorization
│   ├── scrape-content.ts    # Content scraper
│   ├── scrape-github.ts     # GitHub API scraper
│   ├── ai-analyzer.ts       # AI-powered analysis
│   └── generate-docs.ts     # Doc generation
├── utils/
│   ├── env-validator.ts     # Credential validation
│   └── ui.ts                # TUI components
├── menus/                   # Interactive menus
└── packages/
    └── core/
        └── utils/           # Shared utilities

🤝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make your changes
  4. Run type checking: bun run type-check
  5. Submit a pull request

📄 License

MIT

🙏 Credits

Built with:


Happy Documentation Siphoning! 🎯✨