@promptordie/siphon-knowledge v1.0.2
AI-powered documentation generation system for AI coding agents.
Universal Documentation Generator
A comprehensive, AI-powered documentation generation system that can crawl, categorize, scrape, and polish documentation from any website.
📦 Installation
Global Installation (Recommended)
```bash
# Install globally with Bun (recommended)
bun i -g @promptordie/siphon-knowledge

# Or with npm
npm i -g @promptordie/siphon-knowledge
```

This will install all necessary dependencies, including:
- ✅ Playwright browsers (for web scraping)
- ✅ OpenAI SDK (for AI polishing)
- ✅ All required Node.js modules
Manual Installation
```bash
# Clone the repository
git clone https://github.com/Dexploarer/siphon-knowledge
cd siphon-knowledge

# Install dependencies
bun install

# Run setup (installs Playwright browsers)
bun run setup
# or
npm run setup
```

🚀 Quick Start
Using Global Installation
```bash
# Generate documentation from any website
siphon-knowledge https://docs.example.com

# Skip AI polishing for cost savings
siphon-knowledge --skip-ai https://docs.example.com

# Run a comprehensive setup check
siphon-knowledge --setup

# Show help
siphon-knowledge --help
```

Using Local Installation
```bash
# Generate documentation from any website
START_URL="https://docs.example.com" bun run-all.ts

# Or run individual steps
bun scripts/crawl.ts
bun scripts/categorize.ts
bun scripts/scrape-content.ts
bun scripts/generate-docs.ts
bun scripts/cost-estimator.ts
bun scripts/polish-docs.ts
```

🔨 Building & Development
Build Commands
```bash
# Build verification (confirms the project is ready)
bun run build

# Clean build artifacts and output
bun run clean

# Show all available commands
bun run help
```

Development Workflow
```bash
# Install dependencies
bun install

# Run setup (installs Playwright browsers)
bun run setup

# Verify the build (optional - Bun runs TypeScript natively)
bun run build

# Run individual scripts directly
bun scripts/crawl.ts
bun scripts/categorize.ts
bun scripts/scrape-content.ts
bun scripts/generate-docs.ts
bun scripts/polish-docs.ts
```

🤖 AI Features & OpenAI Setup
The system includes AI-powered documentation polishing using OpenAI's GPT-4o-mini model.
Automatic API Key Prompt
When you run the system with AI features enabled (default), you'll be securely prompted to enter your OpenAI API key:
```bash
siphon-knowledge https://docs.example.com
# Enter your OpenAI API key: ***************
```

Your input is masked with asterisks for security.
Using Environment Variable
You can also set the API key as an environment variable:
```bash
# Linux/macOS
export OPENAI_API_KEY=sk-your-api-key-here
siphon-knowledge https://docs.example.com

# Windows
set OPENAI_API_KEY=sk-your-api-key-here
siphon-knowledge https://docs.example.com

# Or inline
OPENAI_API_KEY=sk-your-api-key-here siphon-knowledge https://docs.example.com
```

Cost Estimation
- GPT-4o-mini pricing: $0.00015 per 1K input tokens, $0.0006 per 1K output tokens
- The system includes a cost estimator to help you understand expenses
- Use `--skip-ai` to disable AI features and avoid costs
AI Polishing Features
- Grammar & Style: Fixes errors and improves readability
- Structure: Enhances document organization
- Clarity: Makes technical concepts more accessible
- Consistency: Ensures uniform terminology and formatting
- Cross-references: Adds helpful related information links
Polished documents are saved with a `_polished` suffix (e.g., `document_polished.md`).
📁 Directory Structure
```
universal-doc-generator/
├── scripts/               # Processing scripts
│   ├── crawl.ts           # Universal web crawler
│   ├── categorize.ts      # Smart URL categorization
│   ├── scrape-content.ts  # Content extraction
│   ├── generate-docs.ts   # Documentation generation
│   ├── cost-estimator.ts  # AI cost estimation
│   └── polish-docs.ts     # AI-powered polishing
├── output/                # Generated content
│   └── scraped-content/   # Final documentation
├── docs/                  # Additional documentation
├── tools/                 # Utility tools
├── package.json           # Dependencies
├── tsconfig.json          # TypeScript configuration
├── run-all.ts             # Master execution script
└── README.md              # This file
```

🎯 Features
🔍 Universal Crawling
- Any Website: Works with any documentation site
- Concurrent Processing: Up to 10 simultaneous browser instances (customizable)
- Intelligent Filtering: Only crawls relevant documentation pages
- Rate Limiting: Respectful crawling with delays
- Error Handling: Graceful failure recovery
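The concurrency cap and rate limiting described above can be sketched with a minimal limiter. This is an illustrative, dependency-free sketch, not the crawler's actual internals; the 250 ms delay and the placeholder task are assumptions.

```typescript
// Minimal concurrency limiter: at most `max` tasks run at once.
function createLimiter(max: number) {
  let active = 0;
  const queue: (() => void)[] = [];
  return async function run<T>(task: () => Promise<T>): Promise<T> {
    if (active >= max) await new Promise<void>((resolve) => queue.push(resolve));
    active++;
    try {
      return await task();
    } finally {
      active--;
      queue.shift()?.(); // wake the next waiting task, if any
    }
  };
}

// Usage sketch: crawl URLs with at most 10 in flight and a polite delay.
async function crawlAll(urls: string[]): Promise<string[]> {
  const limit = createLimiter(10);
  return Promise.all(
    urls.map((url) =>
      limit(async () => {
        await new Promise((r) => setTimeout(r, 250)); // rate-limiting delay
        return url; // placeholder for a real page fetch/scrape
      })
    )
  );
}
```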
📊 Smart Categorization
- Universal Patterns: Works with most documentation structures
- Context Separation: Developer vs User documentation
- Category Organization: 11 detailed categories
- Pattern Matching: Smart URL classification
- Summary Generation: Automated overview creation
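The pattern-matching step can be illustrated with a small first-match-wins classifier. The categories and regexes below are illustrative examples, not the actual rules in `scripts/categorize.ts`.

```typescript
// Illustrative URL classifier: the first matching pattern wins.
type Rule = { category: string; pattern: RegExp };

const rules: Rule[] = [
  { category: "api-reference", pattern: /\/(api|reference)\// },
  { category: "guides", pattern: /\/(guide|tutorial|getting-started)/ },
  { category: "changelog", pattern: /\/(changelog|releases)/ },
];

function categorize(url: string): string {
  const path = new URL(url).pathname.toLowerCase();
  return rules.find((r) => r.pattern.test(path))?.category ?? "uncategorized";
}
```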
🤖 AI-Powered Enhancement
- Quality Assessment: 1-10 scoring system
- Content Polishing: Grammar, clarity, and structure improvements
- Cost Optimization: Token usage optimization
- Rate Limit Protection: Automatic retry and delays
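The automatic retry behaviour can be sketched as exponential backoff around a flaky call. The attempt count and base delay here are assumptions for illustration, not the tool's configured values.

```typescript
// Retry a flaky async call with exponential backoff (1s, 2s, 4s, ...).
async function withRetry<T>(
  fn: () => Promise<T>,
  attempts = 3,
  baseDelayMs = 1000
): Promise<T> {
  for (let i = 0; ; i++) {
    try {
      return await fn();
    } catch (err) {
      if (i + 1 >= attempts) throw err; // out of retries: surface the error
      await new Promise((r) => setTimeout(r, baseDelayMs * 2 ** i));
    }
  }
}
```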
💰 Cost Effective
- GPT-4o-mini: A cost-efficient model
- Token Optimization: Content truncation and limits
- Batch Processing: Controlled API usage
- Cost Estimation: Pre-execution cost analysis
📋 Prerequisites
System Requirements
- Node.js: 18+ (or Bun runtime)
- Bun: Recommended for faster execution
- Chromium: For web scraping
- OpenAI API Key: For AI polishing (optional)
Installation
```bash
# Install dependencies
bun install

# Install Chromium for Playwright
bunx playwright install chromium
```

🔧 Configuration
Environment Variables
```bash
# Target website URL
START_URL=https://docs.example.com

# Crawling configuration
MAX_PAGES=2000
CONCURRENCY=10

# OpenAI API key (for AI polishing)
OPENAI_API_KEY=your_api_key_here
```

Cost Management
The system includes built-in cost controls:
- Max files per session: 20 (configurable)
- Token limits: 2000 per request
- Rate limiting: 3-second delays
- Model selection: GPT-4o-mini for efficiency
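The batch-with-delay control can be sketched as follows. The function name is illustrative; only the batch size (20) and 3-second pause mirror the defaults listed above.

```typescript
// Process items in fixed-size batches with a pause between batches.
async function processInBatches<T, R>(
  items: T[],
  batchSize: number,
  delayMs: number,
  handler: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = [];
  for (let i = 0; i < items.length; i += batchSize) {
    const batch = items.slice(i, i + batchSize);
    results.push(...(await Promise.all(batch.map(handler))));
    if (i + batchSize < items.length) {
      await new Promise((r) => setTimeout(r, delayMs)); // rate-limit pause
    }
  }
  return results;
}

// Usage sketch: polish files 20 at a time with a 3-second pause.
// processInBatches(files, 20, 3000, polishOneFile)
```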
📖 Usage
Complete Pipeline
```bash
# Generate docs from any website
START_URL="https://docs.example.com" bun run-all.ts

# Skip AI polishing for cost savings
START_URL="https://docs.example.com" bun run-all.ts --skip-ai

# Custom crawling limits
START_URL="https://docs.example.com" MAX_PAGES=500 CONCURRENCY=8 bun run-all.ts
```

Individual Steps
```bash
# 1. Crawl documentation URLs
START_URL="https://docs.example.com" bun scripts/crawl.ts

# 2. Categorize URLs
bun scripts/categorize.ts

# 3. Scrape content
bun scripts/scrape-content.ts

# 4. Generate documentation structure
bun scripts/generate-docs.ts

# 5. Estimate AI costs
bun scripts/cost-estimator.ts

# 6. Polish with AI
bun scripts/polish-docs.ts
```

📊 Output Structure
Generated Documentation
```
output/scraped-content/
├── developer-context/
│   ├── architecture-&-core-concepts/
│   │   ├── rules.md
│   │   ├── workflows.md
│   │   ├── knowledge.md
│   │   ├── guiding-docs.md
│   │   ├── sanity-checks.md
│   │   ├── architectural-docs.md
│   │   ├── llm.txt
│   │   ├── agent.md
│   │   ├── README.md
│   │   └── *.md (content files)
│   └── [other categories...]
├── user-context/
│   └── [user categories...]
├── README.md
├── DOCUMENTATION_OVERVIEW.md
└── POLISHING_REPORT.md
```

File Types
- rules.md: Coding standards and best practices
- workflows.md: Development and deployment processes
- knowledge.md: Reference materials and troubleshooting
- guiding-docs.md: Usage guidelines and principles
- sanity-checks.md: Validation procedures and checklists
- architectural-docs.md: System design and architecture
- llm.txt: AI context for language models
- agent.md: AI agent configuration
- README.md: Category overview and navigation
💡 Cost Analysis
Typical Costs (GPT-4o-mini)
- 89 files: ~$0.026 (2.6 cents)
- 20 files (batch): ~$0.005 (0.5 cents)
- Per file: ~$0.0003 (0.03 cents)
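At the GPT-4o-mini rates quoted earlier, the per-file figure follows from simple token arithmetic. The token counts in the example are assumed averages for illustration, not measured values.

```typescript
// Estimate cost at the GPT-4o-mini rates quoted above:
// $0.00015 per 1K input tokens, $0.0006 per 1K output tokens.
const INPUT_RATE = 0.00015 / 1000;  // $ per input token
const OUTPUT_RATE = 0.0006 / 1000;  // $ per output token

function estimateCost(inputTokens: number, outputTokens: number): number {
  return inputTokens * INPUT_RATE + outputTokens * OUTPUT_RATE;
}

// A file averaging ~1,000 input and ~250 output tokens:
// 1000 * $0.00000015 + 250 * $0.0000006 = $0.0003, matching the ~0.03-cent figure.
```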
Cost Optimizations
- ✅ Content truncation for long files
- ✅ Skipped final review step
- ✅ Batch processing with delays
- ✅ Token usage limits
- ✅ Model selection optimization
🔍 Quality Assurance
Automated Checks
- Content Validation: Technical accuracy preservation
- Format Consistency: Markdown formatting standards
- Link Verification: URL accessibility checks
- Quality Scoring: 1-10 assessment system
Manual Review
- Backup Files: Original content preserved
- Assessment Reports: Detailed quality metrics
- Execution Summary: Complete process overview
- Error Logging: Comprehensive error tracking
🛠️ Customization
Adding New Sources
- Simply change the `START_URL` environment variable
- The system automatically adapts to different website structures
- Universal patterns work with most documentation sites
Extending Categories
- Edit `scripts/categorize.ts` for new categories
- Add pattern-matching rules
- Update documentation templates
- Re-run categorization
AI Enhancement
- Modify prompts in `scripts/polish-docs.ts`
- Adjust model parameters and settings
- Add new quality criteria
- Implement custom assessment logic
🚨 Troubleshooting
Common Issues
- Rate Limiting: Automatic retry with delays
- API Errors: Graceful fallback to original content
- Browser Issues: Chromium installation verification
- Memory Usage: Batch processing to manage resources
Debug Mode
```bash
# Enable verbose logging
DEBUG=1 bun scripts/crawl.ts

# Check system requirements
bun scripts/cost-estimator.ts
```

📈 Performance
Benchmarks
- Crawling: 52 URLs in ~30 seconds
- Content Scraping: 52 pages in ~2 minutes
- AI Polishing: 20 files in ~2 minutes
- Total Pipeline: ~5-10 minutes for complete run
Optimization Tips
- Use SSD storage for faster I/O
- Increase concurrency for faster crawling
- Adjust batch sizes based on system resources
- Monitor memory usage during processing
🤝 Contributing
Development Setup
```bash
# Clone and setup
git clone <repository>
cd universal-doc-generator
bun install

# Run tests
bun test

# Format code
bun run format
```

Code Standards
- TypeScript for type safety
- Comprehensive error handling
- Detailed logging and documentation
- Performance optimization
- Cost efficiency considerations
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- OpenAI: For AI model capabilities
- Playwright: For web scraping functionality
- Bun: For fast JavaScript runtime
- Community: For feedback and contributions
📞 Support
For issues, questions, or contributions:
- Check the troubleshooting section
- Review execution logs
- Examine generated documentation
- Create an issue with detailed information
🔮 Future Enhancements
Planned Features
- Multi-Source Support: Crawl multiple documentation sites simultaneously
- Advanced AI Models: Support for newer OpenAI models
- Custom Templates: User-defined documentation formats
- Real-time Updates: Continuous documentation monitoring
- API Integration: REST API for programmatic access
Potential Improvements
- Machine Learning: Automated category detection
- Content Analysis: Advanced quality assessment
- Collaboration: Multi-user editing and review
- Version Control: Git integration for documentation
- Deployment: Automated deployment to various platforms
Generated: 2025-08-31
Version: 1.0.0
Status: Production Ready
