# Crava 🕷️✨
AI-powered web scraping that extracts structured data as JSON. Crava uses artificial intelligence to automatically detect and extract data from web pages without manual selector configuration.
## 🚀 Features
- 🤖 AI-Powered Extraction: Automatically generates CSS selectors using Google Gemini AI
- 🥷 Stealth Scraping: Uses Puppeteer with stealth plugins to avoid bot detection
- 📊 JSON Output: Clean, structured JSON data output
- 🔄 Smart Retry Logic: Built-in retry mechanism with exponential backoff
- 🧩 Extensible LLM Support: Ready for OpenAI, Anthropic, and other AI providers
- ⚡ TypeScript: Full TypeScript support with comprehensive type definitions
- 🛠️ CLI Interface: Use via command line or programmatically
- 🌐 Global Installation: Available as the `crava` command or via `npx crava`
## 📦 Installation

### Global Installation (Recommended)

```bash
npm install -g crava
```

### Project Installation

```bash
npm install crava
```

## 🎯 Quick Start
### CLI Usage

```bash
# Console output
crava https://example-shop.com --keys "Product Name,Price,Rating" --api-key YOUR_GEMINI_API_KEY

# Save to file
crava https://example-shop.com --keys "Product Name,Price" --api-key YOUR_API_KEY --output results.json

# With custom prompt
crava https://news-site.com --keys "Headline,Author,Date" --api-key YOUR_API_KEY --custom-prompt "Focus on main articles only"
```

### Programmatic Usage
```typescript
import { Crava } from "crava";

const crava = new Crava();

const config = {
  keys: ["Product Name", "Price", "Product Category"],
  llm: {
    provider: "gemini",
    apiKey: "your-gemini-api-key",
    model: "gemini-2.5-pro-preview-06-05",
  },
};

// Scrape data
const result = await crava.scrape("https://example-shop.com", config);
console.log(`Extracted ${result.metadata.totalRecords} records`);
console.log(result.data); // Array of extracted objects
```

## ⚙️ Configuration
### CLI Options

```text
Options:
  --keys <string>          Comma-separated list of data fields to extract
  --api-key <string>       Gemini API key
  --output <filename>      Save JSON to file (default: console output only)
  --model <string>         AI model to use (default: gemini-2.5-pro-preview-06-05)
  --timeout <number>       Page load timeout in ms (default: 30000)
  --custom-prompt <str>    Additional instructions for the AI
  --help                   Show help message
```

### ScrapingConfig Interface
```typescript
interface ScrapingConfig {
  keys: string[]; // Data fields to extract
  llm: LLMConfig; // AI provider configuration
  customPrompt?: string; // Additional AI instructions
  maxRetries?: number; // Retry attempts (default: 3)
  timeout?: number; // Page load timeout in ms (default: 30000)
}
```

### LLMConfig Interface
```typescript
interface LLMConfig {
  provider: "gemini" | "openai" | "anthropic"; // AI provider
  apiKey: string; // API key
  model?: string; // Model name
  temperature?: number; // Response creativity (0-1)
}
```

## 🌟 Examples
### E-commerce Product Scraping
```typescript
import { Crava } from "crava";
import { OutputManager } from "crava/dist/output/output-manager";

const crava = new Crava();

const config = {
  keys: ["Product Name", "Price", "Rating", "Availability"],
  llm: {
    provider: "gemini",
    apiKey: process.env.GEMINI_API_KEY,
    model: "gemini-2.5-pro-preview-06-05",
  },
  customPrompt:
    "Focus on product listings. Extract numerical ratings and stock status.",
};

const result = await crava.scrape("https://shop.example.com/products", config);

// Save to file
await OutputManager.exportToJson(result, "products.json");
```

### News Article Scraping
```typescript
const config = {
  keys: ["Headline", "Author", "Publication Date", "Summary"],
  llm: {
    provider: "gemini",
    apiKey: process.env.GEMINI_API_KEY,
  },
  customPrompt: "Extract news articles. Format dates as ISO strings.",
};

const result = await crava.scrapeWithRetry("https://news.example.com", config);
```

### CLI Examples
```bash
# Basic scraping with console output
crava https://quotes.toscrape.com --keys "Quote,Author,Tags" --api-key YOUR_API_KEY

# Save results to file
crava https://books.toscrape.com --keys "Title,Price,Rating" --api-key YOUR_API_KEY --output books.json

# With custom model and prompt
crava https://news.ycombinator.com --keys "Title,Points,Comments" \
  --api-key YOUR_API_KEY \
  --model gemini-2.5-pro-preview-06-05 \
  --custom-prompt "Focus on the main story listings"

# Using npx (no installation required)
npx crava https://example.com --keys "Title,Description" --api-key YOUR_API_KEY
```

## 📚 API Reference
### `Crava.scrape(url, config)`
Scrapes data from a single URL.
**Parameters:**

- `url` (string): Target URL to scrape
- `config` (ScrapingConfig): Scraping configuration

**Returns:** `Promise<ScrapingResult>`
### `Crava.scrapeWithRetry(url, config)`
Scrapes data with automatic retry logic and exponential backoff.
**Parameters:**

- `url` (string): Target URL to scrape
- `config` (ScrapingConfig): Scraping configuration

**Returns:** `Promise<ScrapingResult>`
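The retry strategy behind `scrapeWithRetry` can be sketched generically. The attempt counts and delays below are illustrative assumptions, not Crava's exact internals:

```typescript
// Generic exponential-backoff retry, illustrating the strategy
// scrapeWithRetry uses. The default maxRetries and base delay here
// are illustrative assumptions, not Crava's exact internals.
async function withRetry<T>(
  fn: () => Promise<T>,
  maxRetries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt <= maxRetries; attempt++) {
    try {
      return await fn();
    } catch (err) {
      lastError = err;
      if (attempt === maxRetries) break;
      // Exponential backoff: 1s, 2s, 4s, ... between attempts
      await new Promise((resolve) =>
        setTimeout(resolve, baseDelayMs * 2 ** attempt),
      );
    }
  }
  throw lastError;
}
```

In this shape, transient failures (timeouts, rate-limit responses) get progressively longer pauses before the next attempt, while the final error is preserved for debugging.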
### ScrapingResult Interface
```typescript
interface ScrapingResult {
  data: Record<string, any>[]; // Array of extracted objects
  metadata: {
    url: string; // Source URL
    timestamp: string; // Extraction timestamp
    totalRecords: number; // Number of records found
    keys: string[]; // Requested data fields
  };
}
```

### OutputManager Utilities
```typescript
import { OutputManager } from "crava/dist/output/output-manager";

// Save as JSON file
await OutputManager.exportToJson(result, "output.json");

// Save as CSV file
await OutputManager.exportToCsv(result, "output.csv");

// Console formatting
console.log(OutputManager.formatConsoleOutput(result));
```

## 🤖 Supported AI Providers
### Google Gemini (Default & Recommended)
```typescript
const config = {
  llm: {
    provider: "gemini",
    apiKey: "your-gemini-api-key",
    model: "gemini-2.5-pro-preview-06-05", // Latest model
    temperature: 0.3, // Optional: controls creativity (0-1)
  },
};
```

**Getting a Gemini API Key:**
1. Visit Google AI Studio
2. Create a new API key
3. Set it as an environment variable:

   ```bash
   export GEMINI_API_KEY="your-key"
   ```
### OpenAI (Architecture Ready)
```typescript
const config = {
  llm: {
    provider: "openai",
    apiKey: "your-openai-api-key",
    model: "gpt-4o", // or gpt-3.5-turbo
    temperature: 0.3,
  },
};
```

### Anthropic (Architecture Ready)
```typescript
const config = {
  llm: {
    provider: "anthropic",
    apiKey: "your-anthropic-api-key",
    model: "claude-3-5-sonnet-20241022",
    temperature: 0.3,
  },
};
```

## 🔧 How It Works
1. 🌐 Page Loading: Crava uses Puppeteer with stealth plugins to load the target webpage, avoiding bot detection
2. 🧠 AI Analysis: The page HTML is cleaned and sent to the AI (Gemini) to analyze content structure and generate extraction selectors
3. 🎯 Smart Extraction: Generated selectors are used to extract structured data, with fallback strategies for dynamic content
4. 📋 Data Processing: Extracted data is cleaned, validated, and formatted as structured JSON
5. 💾 Output: Results can be displayed in the console or saved to JSON/CSV files
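The "Data Processing" step can be pictured as a pure cleanup pass over raw records. The helper below is only an illustrative sketch of that idea, not Crava's actual implementation:

```typescript
type RawRecord = Record<string, string | null | undefined>;

// Illustrative sketch of a data-processing pass: trim whitespace,
// drop empty values, and discard records with no usable fields.
// This is NOT Crava's actual implementation.
function cleanRecords(records: RawRecord[]): Record<string, string>[] {
  return records
    .map((record) => {
      const cleaned: Record<string, string> = {};
      for (const [key, value] of Object.entries(record)) {
        const trimmed = (value ?? "").trim();
        if (trimmed) cleaned[key] = trimmed;
      }
      return cleaned;
    })
    .filter((record) => Object.keys(record).length > 0);
}
```

A pass like this is what turns noisy selector matches into the clean JSON arrays shown in `result.data`.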
## 💡 Best Practices
### ✅ Do's
- Use Descriptive Keys: "Product Name" instead of "name"
- Add Custom Prompts: Provide context like "Focus on main product listings"
- Handle Errors: Always wrap scraping calls in try-catch blocks
- Store API Keys Securely: Use environment variables or secret management
- Test on Simple Pages First: Start with well-structured sites
- Respect Rate Limits: Add delays between requests for the same domain
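Two of these practices (wrapping calls in try/catch and spacing out requests to the same domain) can be combined in one small helper. `scrapeAllPolitely` and its injected `scrapeFn` are hypothetical names, stand-ins for `crava.scrape`, used so the sketch stays self-contained:

```typescript
// Hedged sketch: sequentially scrape a list of URLs, catching errors
// per URL and pausing between requests. scrapeFn is an injected
// stand-in for crava.scrape (an assumption, not part of Crava's API).
async function scrapeAllPolitely<T>(
  urls: string[],
  scrapeFn: (url: string) => Promise<T>,
  delayMs = 2000,
): Promise<{ url: string; result?: T; error?: string }[]> {
  const out: { url: string; result?: T; error?: string }[] = [];
  for (const url of urls) {
    try {
      out.push({ url, result: await scrapeFn(url) });
    } catch (err) {
      // Keep the error message: failed responses carry debugging info
      out.push({ url, error: String(err) });
    }
    // Delay between requests to respect the target site
    await new Promise((resolve) => setTimeout(resolve, delayMs));
  }
  return out;
}
```

One failed page then no longer aborts a whole batch, and each failure is recorded next to its URL for later inspection.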
### ❌ Don'ts
- Don't scrape sites without checking robots.txt
- Don't use overly generic key names like "text" or "link"
- Don't ignore error responses - they contain valuable debugging info
- Don't exceed reasonable timeout values (>60s)
- Don't hardcode API keys in your source code
### 🎯 Pro Tips
```typescript
// Use specific, descriptive field names
const goodConfig = {
  keys: ["Product Title", "Sale Price", "Customer Rating", "Stock Status"],
};

// Add context with custom prompts
const betterConfig = {
  keys: ["Product Title", "Sale Price"],
  customPrompt: "Extract only products that are currently on sale",
};

// Handle dynamic content
const robustConfig = {
  keys: ["Article Title", "Author"],
  timeout: 45000, // Longer timeout for slow sites
  maxRetries: 5, // More retries for unreliable sites
};
```

## ⚠️ Limitations & Considerations
- AI Dependency: Requires AI provider API key and internet connection
- Performance: Speed depends on page complexity and AI response time
- Anti-Bot Measures: Some websites may block automated scraping despite stealth mode
- Dynamic Content: Heavy JavaScript sites may need longer timeout values
- Rate Limits: AI providers have rate limits that may affect high-volume usage
- Data Quality: AI extraction accuracy depends on page structure and content clarity
## 🚀 Performance Tips
```typescript
// For better performance on similar pages
const config = {
  keys: ["Title", "Price"],
  llm: {
    provider: "gemini",
    apiKey: process.env.GEMINI_API_KEY,
    temperature: 0.1, // Lower temperature = more consistent results
  },
  timeout: 20000, // Shorter timeout for fast sites
  maxRetries: 2, // Fewer retries for reliable sites
};

// For complex or slow sites
const robustConfig = {
  keys: ["Article Title", "Full Content", "Author"],
  llm: {
    provider: "gemini",
    apiKey: process.env.GEMINI_API_KEY,
    temperature: 0.3,
  },
  timeout: 60000, // Longer timeout
  maxRetries: 5, // More retries
  customPrompt:
    "Wait for all content to load. Focus on main article content.",
};
```

## 🛠️ Development & Testing
### Running Tests

```bash
cd /path/to/crava/package
npm test
```

### Building from Source

```bash
git clone <repository-url>
cd crava/package
npm install
npm run build
```

### Local Development

```bash
# Install dependencies
npm install

# Build TypeScript
npm run build

# Install globally for testing
npm install -g .

# Test CLI
crava --help
```

## 🤝 Contributing
We welcome contributions! Here's how to get started:
1. Fork the Repository
2. Create a Feature Branch:

   ```bash
   git checkout -b feature/amazing-feature
   ```

3. Make Your Changes
4. Add Tests
5. Ensure All Tests Pass:

   ```bash
   npm test
   npm run build
   ```

6. Submit a Pull Request
### Contribution Ideas
- Add support for more AI providers (OpenAI, Anthropic)
- Improve error handling and retry logic
- Add more output formats (XML, YAML)
- Enhance documentation and examples
- Performance optimizations
## 📄 License
MIT License - see LICENSE file for details.
## 🆘 Support & Resources
- 🐛 Issues: GitHub Issues - report bugs and request features
- 📖 Documentation: Check the `examples/` directory for more use cases
- 🔑 API Keys:
  - Google AI Studio - get your Gemini API key
  - OpenAI Platform - get your OpenAI API key
- 💬 Discussions: GitHub Discussions for questions and ideas
## 🎉 Changelog
### v1.0.0
- ✅ Initial release with Gemini AI integration
- ✅ CLI interface with global command support
- ✅ TypeScript support with full type definitions
- ✅ Puppeteer stealth mode for bot detection avoidance
- ✅ JSON output with optional file saving
- ✅ Comprehensive error handling and retry logic
- ✅ Extensible architecture for multiple AI providers
Made with ❤️ by the Crava team
Star ⭐ this repo if you find it useful!
