Crava 🕷️✨

AI-powered web scraping that extracts structured data as JSON. Crava uses artificial intelligence to automatically detect and extract data from web pages without manual selector configuration.

🚀 Features

  • 🤖 AI-Powered Extraction: Automatically generates CSS selectors using Google Gemini AI
  • 🥷 Stealth Scraping: Uses Puppeteer with stealth plugins to avoid bot detection
  • 📊 JSON Output: Clean, structured JSON data output
  • 🔄 Smart Retry Logic: Built-in retry mechanism with exponential backoff
  • 🧩 Extensible LLM Support: Ready for OpenAI, Anthropic, and other AI providers
  • 📘 TypeScript: Full TypeScript support with comprehensive type definitions
  • 🛠️ CLI Interface: Use via command line or programmatically
  • 🌐 Global Installation: Available as crava command or npx crava

📦 Installation

Global Installation (Recommended)

npm install -g crava

Project Installation

npm install crava

🎯 Quick Start

CLI Usage

# Console output
crava https://example-shop.com --keys "Product Name,Price,Rating" --api-key YOUR_GEMINI_API_KEY

# Save to file
crava https://example-shop.com --keys "Product Name,Price" --api-key YOUR_API_KEY --output results.json

# With custom prompt
crava https://news-site.com --keys "Headline,Author,Date" --api-key YOUR_API_KEY --custom-prompt "Focus on main articles only"

Programmatic Usage

import { Crava } from "crava";

const crava = new Crava();

const config = {
    keys: ["Product Name", "Price", "Product Category"],
    llm: {
        provider: "gemini",
        apiKey: "your-gemini-api-key",
        model: "gemini-2.5-pro-preview-06-05",
    },
};

// Scrape data
const result = await crava.scrape("https://example-shop.com", config);
console.log(`Extracted ${result.metadata.totalRecords} records`);
console.log(result.data); // Array of extracted objects

⚙️ Configuration

CLI Options

Options:
  --keys <string>        Comma-separated list of data fields to extract
  --api-key <string>     Gemini API key
  --output <filename>    Save JSON to file (default: console output only)
  --model <string>       AI model to use (default: gemini-2.5-pro-preview-06-05)
  --timeout <number>     Page load timeout in ms (default: 30000)
  --custom-prompt <str>  Additional instructions for the AI
  --help                 Show help message

ScrapingConfig Interface

interface ScrapingConfig {
    keys: string[]; // Data fields to extract
    llm: LLMConfig; // AI provider configuration
    customPrompt?: string; // Additional AI instructions
    maxRetries?: number; // Retry attempts (default: 3)
    timeout?: number; // Page load timeout in ms (default: 30000)
}
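The optional fields fall back to the documented defaults. A minimal sketch of how those defaults could be merged; the `withDefaults` helper is hypothetical and not part of the crava API (the interfaces are reproduced locally so the snippet stands alone):

```typescript
// Interfaces as documented in this README.
interface LLMConfig {
    provider: "gemini" | "openai" | "anthropic";
    apiKey: string;
    model?: string;
    temperature?: number;
}

interface ScrapingConfig {
    keys: string[];
    llm: LLMConfig;
    customPrompt?: string;
    maxRetries?: number; // default: 3
    timeout?: number; // default: 30000
}

// Hypothetical helper illustrating the documented defaults.
function withDefaults(config: ScrapingConfig): ScrapingConfig {
    return { maxRetries: 3, timeout: 30000, ...config };
}

const merged = withDefaults({
    keys: ["Title"],
    llm: { provider: "gemini", apiKey: "your-api-key" },
});
console.log(merged.maxRetries, merged.timeout); // 3 30000
```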

LLMConfig Interface

interface LLMConfig {
    provider: "gemini" | "openai" | "anthropic"; // AI provider
    apiKey: string; // API key
    model?: string; // Model name
    temperature?: number; // Response creativity (0-1)
}

🌟 Examples

E-commerce Product Scraping

import { Crava } from "crava";

const crava = new Crava();

const config = {
    keys: ["Product Name", "Price", "Rating", "Availability"],
    llm: {
        provider: "gemini",
        apiKey: process.env.GEMINI_API_KEY,
        model: "gemini-2.5-pro-preview-06-05",
    },
    customPrompt:
        "Focus on product listings. Extract numerical ratings and stock status.",
};

const result = await crava.scrape("https://shop.example.com/products", config);

// Save to file
import { OutputManager } from "crava/dist/output/output-manager";
await OutputManager.exportToJson(result, "products.json");

News Article Scraping

const config = {
    keys: ["Headline", "Author", "Publication Date", "Summary"],
    llm: {
        provider: "gemini",
        apiKey: process.env.GEMINI_API_KEY,
    },
    customPrompt: "Extract news articles. Format dates as ISO strings.",
};

const result = await crava.scrapeWithRetry("https://news.example.com", config);

CLI Examples

# Basic scraping with console output
crava https://quotes.toscrape.com --keys "Quote,Author,Tags" --api-key YOUR_API_KEY

# Save results to file
crava https://books.toscrape.com --keys "Title,Price,Rating" --api-key YOUR_API_KEY --output books.json

# With custom model and prompt
crava https://news.ycombinator.com --keys "Title,Points,Comments" \
  --api-key YOUR_API_KEY \
  --model gemini-2.5-pro-preview-06-05 \
  --custom-prompt "Focus on the main story listings"

# Using npx (no installation required)
npx crava https://example.com --keys "Title,Description" --api-key YOUR_API_KEY

📚 API Reference

Crava.scrape(url, config)

Scrapes data from a single URL.

Parameters:

  • url (string): Target URL to scrape
  • config (ScrapingConfig): Scraping configuration

Returns: Promise<ScrapingResult>

Crava.scrapeWithRetry(url, config)

Scrapes data with automatic retry logic and exponential backoff.

Parameters:

  • url (string): Target URL to scrape
  • config (ScrapingConfig): Scraping configuration

Returns: Promise<ScrapingResult>

ScrapingResult Interface

interface ScrapingResult {
    data: Record<string, any>[]; // Array of extracted objects
    metadata: {
        url: string; // Source URL
        timestamp: string; // Extraction timestamp
        totalRecords: number; // Number of records found
        keys: string[]; // Requested data fields
    };
}
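As a sketch of consuming a result, with the interface reproduced locally and illustrative sample values so the snippet is self-contained:

```typescript
interface ScrapingResult {
    data: Record<string, any>[];
    metadata: {
        url: string;
        timestamp: string;
        totalRecords: number;
        keys: string[];
    };
}

// Sample shaped like what crava.scrape() resolves to; values are illustrative.
const result: ScrapingResult = {
    data: [
        { "Product Name": "Widget", "Price": "$9.99" },
        { "Product Name": "Gadget", "Price": "$19.99" },
    ],
    metadata: {
        url: "https://shop.example.com",
        timestamp: new Date().toISOString(),
        totalRecords: 2,
        keys: ["Product Name", "Price"],
    },
};

// Each entry in data carries the requested keys.
for (const record of result.data) {
    console.log(result.metadata.keys.map((k) => record[k]).join(" | "));
}
```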

OutputManager Utilities

import { OutputManager } from "crava/dist/output/output-manager";

// Save as JSON file
await OutputManager.exportToJson(result, "output.json");

// Save as CSV file
await OutputManager.exportToCsv(result, "output.csv");

// Console formatting
console.log(OutputManager.formatConsoleOutput(result));

🤖 Supported AI Providers

Google Gemini (Default & Recommended)

const config = {
    llm: {
        provider: "gemini",
        apiKey: "your-gemini-api-key",
        model: "gemini-2.5-pro-preview-06-05", // Latest model
        temperature: 0.3, // Optional: Controls creativity (0-1)
    },
};

Getting a Gemini API Key:

  1. Visit Google AI Studio
  2. Create a new API key
  3. Set it as environment variable: export GEMINI_API_KEY="your-key"

OpenAI (Architecture Ready)

const config = {
    llm: {
        provider: "openai",
        apiKey: "your-openai-api-key",
        model: "gpt-4o", // or gpt-3.5-turbo
        temperature: 0.3,
    },
};

Anthropic (Architecture Ready)

const config = {
    llm: {
        provider: "anthropic",
        apiKey: "your-anthropic-api-key",
        model: "claude-3-5-sonnet-20241022",
        temperature: 0.3,
    },
};

🔧 How It Works

  1. 🌐 Page Loading: Crava uses Puppeteer with stealth plugins to load the target webpage, avoiding bot detection
  2. 🧠 AI Analysis: The page HTML is cleaned and sent to AI (Gemini) to analyze content structure and generate extraction selectors
  3. 🎯 Smart Extraction: Generated selectors are used to extract structured data, with fallback strategies for dynamic content
  4. 📋 Data Processing: Extracted data is cleaned, validated, and formatted as structured JSON
  5. 💾 Output: Results can be displayed in console or saved to JSON/CSV files
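The retry behavior used by scrapeWithRetry is not spelled out in this README; a plausible sketch of exponential backoff, with the delay schedule as an assumption (the actual crava implementation may differ):

```typescript
// Hypothetical sketch of retry with exponential backoff; not the actual
// crava implementation, and the delay schedule is an assumption.
async function retryWithBackoff<T>(
    fn: () => Promise<T>,
    maxRetries = 3,
    baseDelayMs = 1000
): Promise<T> {
    let lastError: unknown;
    for (let attempt = 0; attempt < maxRetries; attempt++) {
        try {
            return await fn();
        } catch (err) {
            lastError = err;
            // Wait 1s, 2s, 4s, ... between attempts.
            const delay = baseDelayMs * 2 ** attempt;
            await new Promise((resolve) => setTimeout(resolve, delay));
        }
    }
    throw lastError;
}
```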

💡 Best Practices

✅ Do's

  • Use Descriptive Keys: "Product Name" instead of "name"
  • Add Custom Prompts: Provide context like "Focus on main product listings"
  • Handle Errors: Always wrap scraping calls in try-catch blocks
  • Store API Keys Securely: Use environment variables or secret management
  • Test on Simple Pages First: Start with well-structured sites
  • Respect Rate Limits: Add delays between requests for the same domain

❌ Don'ts

  • Don't scrape sites without checking robots.txt
  • Don't use overly generic key names like "text" or "link"
  • Don't ignore error responses - they contain valuable debugging info
  • Don't exceed reasonable timeout values (>60s)
  • Don't hardcode API keys in your source code
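The "Handle Errors" advice above, sketched with a stubbed scrape call so the snippet is self-contained (in real use the call would be crava.scrape or crava.scrapeWithRetry):

```typescript
// Stub standing in for crava.scrape(); the real call can reject on
// navigation failures, AI provider errors, or timeouts.
async function scrape(url: string): Promise<{ data: Record<string, any>[] }> {
    throw new Error(`Failed to load ${url}: timeout after 30000ms`);
}

async function safeScrape(url: string): Promise<Record<string, any>[]> {
    try {
        const result = await scrape(url);
        return result.data;
    } catch (err) {
        // Error messages carry debugging info; log them, don't swallow them.
        console.error(`Scrape failed for ${url}:`, (err as Error).message);
        return [];
    }
}
```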

🎯 Pro Tips

// Use specific, descriptive field names
const goodConfig = {
    keys: ["Product Title", "Sale Price", "Customer Rating", "Stock Status"],
};

// Add context with custom prompts
const betterConfig = {
    keys: ["Product Title", "Sale Price"],
    customPrompt: "Extract only products that are currently on sale",
};

// Handle dynamic content
const robustConfig = {
    keys: ["Article Title", "Author"],
    timeout: 45000, // Longer timeout for slow sites
    maxRetries: 5, // More retries for unreliable sites
};

⚠️ Limitations & Considerations

  • AI Dependency: Requires AI provider API key and internet connection
  • Performance: Speed depends on page complexity and AI response time
  • Anti-Bot Measures: Some websites may block automated scraping despite stealth mode
  • Dynamic Content: Heavy JavaScript sites may need longer timeout values
  • Rate Limits: AI providers have rate limits that may affect high-volume usage
  • Data Quality: AI extraction accuracy depends on page structure and content clarity

🚀 Performance Tips

// For better performance on similar pages
const config = {
    keys: ["Title", "Price"],
    llm: {
        provider: "gemini",
        apiKey: process.env.GEMINI_API_KEY,
        temperature: 0.1, // Lower temperature = more consistent results
    },
    timeout: 20000, // Shorter timeout for fast sites
    maxRetries: 2, // Fewer retries for reliable sites
};

// For complex or slow sites
const robustConfig = {
    keys: ["Article Title", "Full Content", "Author"],
    llm: {
        provider: "gemini",
        apiKey: process.env.GEMINI_API_KEY,
        temperature: 0.3,
    },
    timeout: 60000, // Longer timeout
    maxRetries: 5, // More retries
    customPrompt:
        "Wait for all content to load. Focus on main article content.",
};

🛠️ Development & Testing

Running Tests

cd /path/to/crava/package
npm test

Building from Source

git clone <repository-url>
cd crava/package
npm install
npm run build

Local Development

# Install dependencies
npm install

# Build TypeScript
npm run build

# Install globally for testing
npm install -g .

# Test CLI
crava --help

🤝 Contributing

We welcome contributions! Here's how to get started:

  1. Fork the Repository
  2. Create a Feature Branch
    git checkout -b feature/amazing-feature
  3. Make Your Changes
  4. Add Tests
  5. Ensure All Tests Pass
    npm test
    npm run build
  6. Submit a Pull Request

Contribution Ideas

  • Add support for more AI providers (OpenAI, Anthropic)
  • Improve error handling and retry logic
  • Add more output formats (XML, YAML)
  • Enhance documentation and examples
  • Performance optimizations

📄 License

MIT License - see LICENSE file for details.

🆘 Support & Resources

  • 🐛 Issues: GitHub Issues - Report bugs and request features
  • 📖 Documentation: Check the examples/ directory for more use cases
  • 🔑 API Keys: Get a Gemini API key from Google AI Studio (see "Supported AI Providers" above)
  • 💬 Discussions: GitHub Discussions for questions and ideas

🎉 Changelog

v1.0.0

  • ✅ Initial release with Gemini AI integration
  • ✅ CLI interface with global command support
  • ✅ TypeScript support with full type definitions
  • ✅ Puppeteer stealth mode for bot detection avoidance
  • ✅ JSON output with optional file saving
  • ✅ Comprehensive error handling and retry logic
  • ✅ Extensible architecture for multiple AI providers

Made with ❤️ by the Crava team

Star ⭐ this repo if you find it useful!