
@promptordie/siphon-knowledge

v1.0.2

AI-powered documentation generation system for AI Coding Agents.

Universal Documentation Generator

A comprehensive, AI-powered documentation generation system that can crawl, categorize, scrape, and polish documentation from any website.

📦 Installation

Global Installation (Recommended)

# Install globally with Bun (recommended)
bun i -g @promptordie/siphon-knowledge

# Or with npm
npm i -g @promptordie/siphon-knowledge

This will install all necessary dependencies including:

  • ✅ Playwright browsers (for web scraping)
  • ✅ OpenAI SDK (for AI polishing)
  • ✅ All required Node.js modules

Manual Installation

# Clone the repository
git clone https://github.com/Dexploarer/siphon-knowledge
cd siphon-knowledge

# Install dependencies
bun install

# Run setup (installs Playwright browsers)
bun run setup
# or
npm run setup

🚀 Quick Start

Using Global Installation

# Generate documentation from any website
siphon-knowledge https://docs.example.com

# Skip AI polishing for cost savings
siphon-knowledge --skip-ai https://docs.example.com

# Run comprehensive setup check
siphon-knowledge --setup

# Show help
siphon-knowledge --help

Using Local Installation

# Generate documentation from any website
START_URL="https://docs.example.com" bun run-all.ts

# Or run individual steps
bun scripts/crawl.ts
bun scripts/categorize.ts
bun scripts/scrape-content.ts
bun scripts/generate-docs.ts
bun scripts/cost-estimator.ts
bun scripts/polish-docs.ts

🔨 Building & Development

Build Commands

# Build verification (confirms project is ready)
bun run build

# Clean build artifacts and output
bun run clean

# Show all available commands
bun run help

Development Workflow

# Install dependencies
bun install

# Run setup (installs Playwright browsers)
bun run setup

# Verify build (optional - Bun runs TypeScript natively)
bun run build

# Run individual scripts directly
bun scripts/crawl.ts
bun scripts/categorize.ts
bun scripts/scrape-content.ts
bun scripts/generate-docs.ts
bun scripts/polish-docs.ts

🤖 AI Features & OpenAI Setup

The system includes AI-powered documentation polishing using OpenAI's GPT-4o-mini model.

Automatic API Key Prompt

When you run the system with AI features enabled (default), you'll be securely prompted to enter your OpenAI API key:

siphon-knowledge https://docs.example.com
# Enter your OpenAI API key: ***************

Your input is masked with asterisks for security.

Using Environment Variable

You can also set the API key as an environment variable:

# Linux/macOS
export OPENAI_API_KEY=sk-your-api-key-here
siphon-knowledge https://docs.example.com

# Windows
set OPENAI_API_KEY=sk-your-api-key-here
siphon-knowledge https://docs.example.com

# Or inline
OPENAI_API_KEY=sk-your-api-key-here siphon-knowledge https://docs.example.com

Cost Estimation

  • GPT-4o-mini pricing: $0.00015 per 1K input tokens, $0.0006 per 1K output tokens
  • The system includes a cost estimator to help you understand expenses
  • Use --skip-ai to disable AI features and avoid costs
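
For reference, the quoted rates translate directly into a tiny estimator like this (an illustrative sketch; the package's own scripts/cost-estimator.ts may compute this differently):

```typescript
// GPT-4o-mini rates as quoted above (USD per 1K tokens)
const INPUT_RATE = 0.00015;
const OUTPUT_RATE = 0.0006;

function estimateCost(inputTokens: number, outputTokens: number): number {
  return (inputTokens / 1000) * INPUT_RATE + (outputTokens / 1000) * OUTPUT_RATE;
}

// Example: 100K input tokens + 20K output tokens
console.log(estimateCost(100_000, 20_000).toFixed(4)); // "0.0270"
```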

AI Polishing Features

  • Grammar & Style: Fixes errors and improves readability
  • Structure: Enhances document organization
  • Clarity: Makes technical concepts more accessible
  • Consistency: Ensures uniform terminology and formatting
  • Cross-references: Adds helpful related information links

Polished documents are saved with _polished suffix (e.g., document_polished.md).
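
The _polished suffix convention amounts to a simple rename (a hypothetical helper, shown only to illustrate the naming scheme):

```typescript
// Map an input filename to its polished counterpart, e.g.
// "document.md" -> "document_polished.md". Files without a .md
// extension are returned unchanged.
function polishedName(path: string): string {
  return path.replace(/\.md$/, "_polished.md");
}
```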

📁 Directory Structure

universal-doc-generator/
├── scripts/                    # Processing scripts
│   ├── crawl.ts               # Universal web crawler
│   ├── categorize.ts          # Smart URL categorization
│   ├── scrape-content.ts      # Content extraction
│   ├── generate-docs.ts       # Documentation generation
│   ├── cost-estimator.ts      # AI cost estimation
│   └── polish-docs.ts         # AI-powered polishing
├── output/                    # Generated content
│   └── scraped-content/       # Final documentation
├── docs/                      # Additional documentation
├── tools/                     # Utility tools
├── package.json               # Dependencies
├── tsconfig.json              # TypeScript configuration
├── run-all.ts                 # Master execution script
└── README.md                  # This file

🎯 Features

🔍 Universal Crawling

  • Any Website: Works with any documentation site
  • Concurrent Processing: Up to 10 simultaneous browser instances (customizable)
  • Intelligent Filtering: Only crawls relevant documentation pages
  • Rate Limiting: Respectful crawling with delays
  • Error Handling: Graceful failure recovery
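
The concurrency model described above can be sketched as a simple worker pool that caps simultaneous work at a fixed limit (a hypothetical helper, not the crawler's actual internals):

```typescript
// Process items with at most `limit` tasks in flight at once,
// preserving input order in the results.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++; // claim the next index
      results[i] = await fn(items[i]);
    }
  }
  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker),
  );
  return results;
}
```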

📊 Smart Categorization

  • Universal Patterns: Works with most documentation structures
  • Context Separation: Developer vs User documentation
  • Category Organization: 11 detailed categories
  • Pattern Matching: Smart URL classification
  • Summary Generation: Automated overview creation
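
Pattern-based URL classification in the spirit of scripts/categorize.ts might look like this (the categories and patterns here are invented examples, not the package's actual rules):

```typescript
// First matching pattern wins; unmatched URLs fall through.
const patterns: [RegExp, string][] = [
  [/\/api\//, "api-reference"],
  [/\/(guide|tutorial)s?\//, "guides"],
  [/\/faq|troubleshoot/, "troubleshooting"],
];

function categorize(url: string): string {
  for (const [re, category] of patterns) {
    if (re.test(url)) return category;
  }
  return "uncategorized";
}
```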

🤖 AI-Powered Enhancement

  • Quality Assessment: 1-10 scoring system
  • Content Polishing: Grammar, clarity, and structure improvements
  • Cost Optimization: Token usage optimization
  • Rate Limit Protection: Automatic retry and delays
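
Retry-with-delay behavior like the rate-limit protection above can be sketched as exponential backoff around any async call (names here are illustrative, not the package's internals):

```typescript
// Retry fn up to `retries` times, doubling the delay after each failure.
async function withRetry<T>(
  fn: () => Promise<T>,
  retries = 3,
  delayMs = 3000,
): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err) {
      if (attempt >= retries) throw err;
      // Backoff: delayMs, 2*delayMs, 4*delayMs, ...
      await new Promise((r) => setTimeout(r, delayMs * 2 ** attempt));
    }
  }
}
```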

💰 Cost Effective

  • GPT-4o-mini: A current, cost-efficient model
  • Token Optimization: Content truncation and limits
  • Batch Processing: Controlled API usage
  • Cost Estimation: Pre-execution cost analysis

📋 Prerequisites

System Requirements

  • Node.js: 18+ (or Bun runtime)
  • Bun: Recommended for faster execution
  • Chromium: For web scraping
  • OpenAI API Key: For AI polishing (optional)

Installation

# Install dependencies
bun install

# Install Chromium for Playwright
bunx playwright install chromium

🔧 Configuration

Environment Variables

# Target website URL
START_URL=https://docs.example.com

# Crawling configuration
MAX_PAGES=2000
CONCURRENCY=10

# OpenAI API Key (for AI polishing)
OPENAI_API_KEY=your_api_key_here
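
One way to read these variables with the documented defaults (a sketch; the package's actual config loading may differ):

```typescript
// Parse crawler settings from an environment map, falling back to the
// defaults shown above (MAX_PAGES=2000, CONCURRENCY=10).
function loadConfig(env: Record<string, string | undefined>) {
  return {
    startUrl: env.START_URL ?? "",
    maxPages: Number(env.MAX_PAGES ?? 2000),
    concurrency: Number(env.CONCURRENCY ?? 10),
  };
}
```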

Cost Management

The system includes built-in cost controls:

  • Max files per session: 20 (configurable)
  • Token limits: 2000 per request
  • Rate limiting: 3-second delays
  • Model selection: GPT-4o-mini for efficiency

📖 Usage

Complete Pipeline

# Generate docs from any website
START_URL="https://docs.example.com" bun run-all.ts

# Skip AI polishing for cost savings
START_URL="https://docs.example.com" bun run-all.ts --skip-ai

# Custom crawling limits
START_URL="https://docs.example.com" MAX_PAGES=500 CONCURRENCY=8 bun run-all.ts

Individual Steps

# 1. Crawl documentation URLs
START_URL="https://docs.example.com" bun scripts/crawl.ts

# 2. Categorize URLs
bun scripts/categorize.ts

# 3. Scrape content
bun scripts/scrape-content.ts

# 4. Generate documentation structure
bun scripts/generate-docs.ts

# 5. Estimate AI costs
bun scripts/cost-estimator.ts

# 6. Polish with AI
bun scripts/polish-docs.ts

📊 Output Structure

Generated Documentation

output/scraped-content/
├── developer-context/
│   ├── architecture-&-core-concepts/
│   │   ├── rules.md
│   │   ├── workflows.md
│   │   ├── knowledge.md
│   │   ├── guiding-docs.md
│   │   ├── sanity-checks.md
│   │   ├── architectural-docs.md
│   │   ├── llm.txt
│   │   ├── agent.md
│   │   ├── README.md
│   │   └── *.md (content files)
│   └── [other categories...]
├── user-context/
│   └── [user categories...]
├── README.md
├── DOCUMENTATION_OVERVIEW.md
└── POLISHING_REPORT.md

File Types

  • rules.md: Coding standards and best practices
  • workflows.md: Development and deployment processes
  • knowledge.md: Reference materials and troubleshooting
  • guiding-docs.md: Usage guidelines and principles
  • sanity-checks.md: Validation procedures and checklists
  • architectural-docs.md: System design and architecture
  • llm.txt: AI context for language models
  • agent.md: AI agent configuration
  • README.md: Category overview and navigation

💡 Cost Analysis

Typical Costs (GPT-4o-mini)

  • 89 files: ~$0.026 (2.6 cents)
  • 20 files (batch): ~$0.005 (0.5 cents)
  • Per file: ~$0.0003 (0.03 cents)

Cost Optimizations

  • ✅ Content truncation for long files
  • ✅ Skipped final review step
  • ✅ Batch processing with delays
  • ✅ Token usage limits
  • ✅ Model selection optimization

🔍 Quality Assurance

Automated Checks

  • Content Validation: Technical accuracy preservation
  • Format Consistency: Markdown formatting standards
  • Link Verification: URL accessibility checks
  • Quality Scoring: 1-10 assessment system

Manual Review

  • Backup Files: Original content preserved
  • Assessment Reports: Detailed quality metrics
  • Execution Summary: Complete process overview
  • Error Logging: Comprehensive error tracking

🛠️ Customization

Adding New Sources

  1. Simply change the START_URL environment variable
  2. The system automatically adapts to different website structures
  3. Universal patterns work with most documentation sites

Extending Categories

  1. Edit scripts/categorize.ts for new categories
  2. Add pattern matching rules
  3. Update documentation templates
  4. Re-run categorization

AI Enhancement

  1. Modify prompts in scripts/polish-docs.ts
  2. Adjust model parameters and settings
  3. Add new quality criteria
  4. Implement custom assessment logic

🚨 Troubleshooting

Common Issues

  • Rate Limiting: Automatic retry with delays
  • API Errors: Graceful fallback to original content
  • Browser Issues: Chromium installation verification
  • Memory Usage: Batch processing to manage resources

Debug Mode

# Enable verbose logging
DEBUG=1 bun scripts/crawl.ts

# Check system requirements
bun scripts/cost-estimator.ts

📈 Performance

Benchmarks

  • Crawling: 52 URLs in ~30 seconds
  • Content Scraping: 52 pages in ~2 minutes
  • AI Polishing: 20 files in ~2 minutes
  • Total Pipeline: ~5-10 minutes for complete run

Optimization Tips

  • Use SSD storage for faster I/O
  • Increase concurrency for faster crawling
  • Adjust batch sizes based on system resources
  • Monitor memory usage during processing

🤝 Contributing

Development Setup

# Clone and setup
git clone <repository>
cd universal-doc-generator
bun install

# Run tests
bun test

# Format code
bun run format

Code Standards

  • TypeScript for type safety
  • Comprehensive error handling
  • Detailed logging and documentation
  • Performance optimization
  • Cost efficiency considerations

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.

🙏 Acknowledgments

  • OpenAI: For AI model capabilities
  • Playwright: For web scraping functionality
  • Bun: For fast JavaScript runtime
  • Community: For feedback and contributions

📞 Support

For issues, questions, or contributions:

  1. Check the troubleshooting section
  2. Review execution logs
  3. Examine generated documentation
  4. Create an issue with detailed information

🔮 Future Enhancements

Planned Features

  • Multi-Source Support: Crawl multiple documentation sites simultaneously
  • Advanced AI Models: Support for newer OpenAI models
  • Custom Templates: User-defined documentation formats
  • Real-time Updates: Continuous documentation monitoring
  • API Integration: REST API for programmatic access

Potential Improvements

  • Machine Learning: Automated category detection
  • Content Analysis: Advanced quality assessment
  • Collaboration: Multi-user editing and review
  • Version Control: Git integration for documentation
  • Deployment: Automated deployment to various platforms

Generated: 2025-08-31
Version: 1.0.0
Status: Production Ready