crawlee-scraper-toolkit
v2.0.2
Published
A comprehensive TypeScript toolkit for building robust web scrapers with Crawlee, featuring maximum configurability and CLI generator
Downloads
26
Maintainers
Readme
Crawlee Scraper Toolkit
A comprehensive TypeScript toolkit for building robust web scrapers with Crawlee, featuring maximum configurability, plugin system, and CLI generator.
🚀 Features
- 🎯 TypeScript First: Full TypeScript support with comprehensive type definitions
- 🔧 Maximum Configurability: Flexible configuration system with profiles and environment variables
- 🔌 Plugin System: Extensible architecture with built-in plugins for retry, caching, metrics, and more
- 🛠️ CLI Generator: Interactive command-line tool to generate scraper templates
- 🌐 Multiple Navigation Strategies: Support for direct navigation, form submission, and API interception
- ⚡ Browser Pool Management: Efficient browser instance pooling and resource management
- 📊 Built-in Monitoring: Metrics collection, logging, and error handling
- 🔄 Retry Logic: Configurable retry strategies with exponential backoff
- 💾 Result Caching: Optional caching system to avoid redundant requests
- 🎨 Multiple Templates: Pre-built templates for common scraping scenarios
📦 Installation
Prerequisites
- Node.js >= 20.0.0
- pnpm >= 8.0.0 (recommended package manager)
Install with pnpm (recommended)
pnpm add crawlee-scraper-toolkitInstall with npm
npm install crawlee-scraper-toolkit🏃 Quick Start
1. Initialize a New Project
pnpm dlx crawlee-scraper init my-scraper-project
cd my-scraper-project
pnpm install2. Generate Your First Scraper
After initializing and installing dependencies, you can use the CLI bundled with the project:
pnpm crawlee-scraper generate
# or npx crawlee-scraper generateFollow the interactive prompts to configure your scraper.
3. Run the Scraper
pnpm crawlee-scraper run --scraper=my-scraper --input="search term"
# or npx crawlee-scraper run --scraper=my-scraper --input="search term"🔧 Programmatic Usage
Basic Example
import { CrawleeScraperEngine, ScraperDefinition, configManager } from 'crawlee-scraper-toolkit';
// Define your scraper
const myScraper: ScraperDefinition<string, any> = {
id: 'example-scraper',
name: 'Example Scraper',
url: 'https://example.com',
navigation: { type: 'direct' },
waitStrategy: { type: 'timeout', config: { duration: 2000 } },
requiresCaptcha: false,
parse: async (context) => {
const { page } = context;
const title = await page.textContent('h1');
const description = await page.textContent('p');
return {
title,
description,
url: page.url(),
timestamp: new Date().toISOString(),
};
},
};
// Initialize the engine
const engine = new CrawleeScraperEngine(
configManager.getConfig(),
configManager.getLogger()
);
// Register and execute
engine.register(myScraper);
const result = await engine.execute(myScraper, 'input-data');
console.log(result.data);
await engine.shutdown();Advanced Configuration
import { createConfig, CrawleeScraperEngine, RetryPlugin, CachePlugin } from 'crawlee-scraper-toolkit';
// Build custom configuration
const config = createConfig()
.browserPool({
maxSize: 10,
maxAge: 30 * 60 * 1000,
launchOptions: {
headless: true,
args: ['--no-sandbox', '--disable-setuid-sandbox'],
},
})
.defaultOptions({
retries: 5,
timeout: 60000,
loadImages: false,
})
.logging({
level: 'info',
format: 'json',
})
.build();
// Create engine with custom config
// const logger = configManager.getLogger(); // Assuming you want to use the logger from configManager
// Or, create a new logger instance if you have specific needs for this engine
const engine = new CrawleeScraperEngine(config, configManager.getLogger());
// Install plugins
engine.use(new RetryPlugin({ maxBackoffDelay: 30000 }));
engine.use(new CachePlugin({ defaultTtl: 10 * 60 * 1000 }));📋 Scraper Templates
The toolkit includes several pre-built templates to get you started quickly:
Basic Template
Ideal for simple page scraping tasks. Allows configuration of selectors to extract data from static or dynamically rendered pages.
API Template
Designed for scenarios where data can be more reliably extracted by intercepting and processing API (e.g., JSON) responses triggered by website interactions.
Form Template
Useful for scrapers that need to interact with forms. Provides a structure for filling input fields, submitting forms, and then extracting data from the results.
Advanced Template
A comprehensive template that includes setup for custom hooks, plugins, and more complex configurations. Suitable for sophisticated scraping tasks requiring fine-grained control.
Infinite Scroll Template
Handles pages with infinite scrolling pagination. Ideal for sites that load more content as the user scrolls.
JS-Heavy Site Template
Provides a robust setup for sites that rely heavily on JavaScript for rendering content, focusing on advanced waitFor strategies and interaction patterns.
⚙️ Configuration
Configuration File
Create a scraper.config.yaml file:
browserPool:
maxSize: 5
maxAge: 1800000 # 30 minutes
launchOptions:
headless: true
args:
- "--no-sandbox"
- "--disable-setuid-sandbox"
defaultOptions:
retries: 3
timeout: 30000
loadImages: false
logging:
level: info
format: text
profiles:
development:
name: development
config:
browserPool:
maxSize: 2
logging:
level: debug
production:
name: production
config:
browserPool:
maxSize: 10
logging:
level: errorEnvironment Variables
BROWSER_POOL_SIZE=5
BROWSER_HEADLESS=true
SCRAPING_MAX_RETRIES=3
LOG_LEVEL=info🔌 Plugins
Built-in Plugins
- RetryPlugin: Exponential backoff retry logic
- CachePlugin: Result caching with TTL
- ProxyPlugin: Proxy rotation support
- RateLimitPlugin: Request rate limiting
- MetricsPlugin: Performance metrics collection
Custom Plugins
import { ScraperPlugin, ScraperEngine } from 'crawlee-scraper-toolkit';
class MyCustomPlugin implements ScraperPlugin {
name = 'my-plugin';
version = '1.0.0';
install(scraper: ScraperEngine): void {
scraper.addHook('beforeRequest', async (context) => {
// Custom logic before each request
console.log(`Processing: ${context.input}`);
});
}
}
engine.use(new MyCustomPlugin());🎯 Navigation Strategies
Direct Navigation
navigation: {
type: 'direct'
}Form Submission
navigation: {
type: 'form',
config: {
inputSelector: 'input[name="search"]',
submitSelector: 'button[type="submit"]'
}
}API Interception
navigation: {
type: 'api',
config: {
paramName: 'q'
}
}⏱️ Wait Strategies
Wait for Selector
waitStrategy: {
type: 'selector',
config: {
selector: '.results'
}
}Wait for Response
waitStrategy: {
type: 'response',
config: {
urlPattern: '/api/search'
}
}Wait for Timeout
waitStrategy: {
type: 'timeout',
config: {
duration: 5000
}
}🎣 Hooks
Add custom logic at different stages of scraping:
const scraper: ScraperDefinition = {
// ... other config
hooks: {
beforeRequest: [
async (context) => {
// Set custom headers
await context.page.setExtraHTTPHeaders({
'X-Custom-Header': 'value'
});
}
],
afterRequest: [
async (context) => {
// Save screenshot
await context.page.screenshot({
path: `screenshots/${context.input}.png`
});
}
],
onError: [
async (context) => {
console.error('Scraping failed:', context.error);
}
]
}
};📊 Monitoring and Metrics
import { MetricsPlugin } from 'crawlee-scraper-toolkit';
const metricsPlugin = new MetricsPlugin();
engine.use(metricsPlugin);
// After scraping
const metrics = metricsPlugin.getMetrics();
console.log(`Success rate: ${metrics.successfulRequests / metrics.totalRequests * 100}%`);🛠️ CLI Commands
The crawlee-scraper CLI provides several commands to manage your scraping projects.
These commands can be run using pnpm crawlee-scraper <command> or npx crawlee-scraper <command> if crawlee-scraper-toolkit is a project dependency (as shown in the Quick Start).
If you've installed crawlee-scraper-toolkit globally (e.g., pnpm add -g crawlee-scraper-toolkit), you can invoke the commands directly: crawlee-scraper <command>.
When installing from a local clone of this repository, run pnpm install && pnpm run build beforehand so the compiled CLI is generated with its path aliases properly resolved.
Alternatively, pnpm dlx crawlee-scraper <command> can be used to ensure you're executing the latest version of the CLI without local installation.
Generate Scraper
pnpm crawlee-scraper generate --template=basic --name=my-scraperInitialize Project
pnpm dlx crawlee-scraper init --name=my-project --template=advanced
# (Using pnpm dlx for init is recommended to get the latest project scaffolder)Validate Configuration
pnpm crawlee-scraper validate --config=./config/scraper.yamlRun Scraper
pnpm crawlee-scraper run --scraper=my-scraper --input="search term" --profile=productionUpdate Project Configurations (In Development)
pnpm crawlee-scraper updateHelps migrate existing project and scraper configurations to the latest version of the toolkit. This command is currently under development.
🔍 Examples
Check the examples/ directory for complete working examples:
- Basic web scraping
- API data extraction
- Form-based scraping
- E-commerce product scraping
- News article extraction
📚 Documentation
Comprehensive documentation is available in multiple formats:
🔗 Quick Links
- 📖 Complete API Documentation - Full API reference
- 🌐 Interactive HTML Docs - Browse documentation interactively
- 📊 Coverage Report - Test coverage analysis
- 💡 Usage Examples - Detailed examples documentation
🚀 Generate Documentation
# Generate all documentation
pnpm run docs:build
# Generate specific formats
pnpm run docs # Markdown API docs
pnpm run docs:html # HTML documentation
pnpm run docs:json # JSON API schema
# Serve documentation locally
pnpm run docs:serve # Available at http://localhost:8080📖 Documentation Scripts
| Command | Description |
|---------|-------------|
| docs:build | Generate complete documentation suite |
| docs:watch | Watch mode for development |
| docs:serve | Serve docs locally with HTTP server |
| docs:clean | Clean documentation directory |
| docs:preview | Build and serve in one command |
| docs:package | Create compressed documentation archive |
🚀 Release Process
This project uses semantic-release for fully automated versioning and publishing. All releases are handled automatically by CI/CD based on conventional commits.
For a more comprehensive guide to the release process, including advanced troubleshooting, emergency procedures, and configuration details, please see the Detailed Release Process Documentation.
📋 Conventional Commits
Use conventional commit messages to trigger automatic releases. The commit message should be structured as follows:
<type>(<scope>): <description>
[optional body]
[optional footer(s) e.g. BREAKING CHANGE:, Closes #issue]Commit Types & Version Impact:
| Type | Description | Version Bump | Example |
|--------------|----------------------------------------------------------------------|-----------------------|----------------------------------------------------|
| feat | A new feature | Minor (1.0.0 → 1.1.0) | feat: add caching plugin |
| fix | A bug fix | Patch (1.0.0 → 1.0.1) | fix: resolve memory leak |
| feat! | A commit that introduces a breaking API change (correlates with feat) | Major (1.0.0 → 2.0.0) | feat!: redesign configuration API |
| fix! | A commit that introduces a breaking API change (correlates with fix) | Major (1.0.0 → 2.0.0) | fix!: change default behavior of existing method |
| perf | A code change that improves performance | Patch (sometimes None) | perf: improve scraping speed by 20% |
| docs | Documentation only changes | None | docs: update README examples |
| style | Changes that do not affect the meaning of the code (white-space, etc) | None | style: fix code formatting |
| refactor | A code change that neither fixes a bug nor adds a feature | None | refactor: simplify parser logic |
| test | Adding missing tests or correcting existing tests | None | test: add unit tests for browser pool |
| chore | Changes to the build process or auxiliary tools and libraries | None | chore: update dependencies |
| ci | Changes to CI configuration files and scripts | None | ci: update GitHub Actions workflow |
Scope Examples: core, cli, docs, api, utils. e.g., feat(core): add browser pool management.
Breaking Changes:
Must be indicated at the very beginning of the commit body or footer, starting with BREAKING CHANGE:.
feat!: redesign configuration API
BREAKING CHANGE: The configuration schema has changed.
Old format is no longer supported. See MIGRATION.md for upgrade guide.Or (for non-! commits):
refactor: internal restructuring of user module
BREAKING CHANGE: User session handling has been modified.
Refer to the updated documentation for details.🔄 Automated Release Flow
- Development: Create feature branch, make changes with conventional commits
- Pull Request: Open PR to
mainbranch - CI Validation: Automated tests, linting, examples validation
- Release Preview: Comment on PR shows what would be released
- Merge: Merge PR to
mainbranch - Automatic Release: CI/CD automatically:
- Analyzes commits since last release
- Determines next version (patch/minor/major)
- Generates CHANGELOG.md
- Creates GitHub release with notes
- Publishes to npm registry
- Deploys documentation to GitHub Pages
- Sends notifications
🛠️ Release Commands
# 🔍 Local Analysis (Recommended for developers)
pnpm run release:analyze # Fast offline analysis of commits
# 🧪 Release Simulation
pnpm run release:dry # Safe offline dry-run (no auth required)
pnpm run release:dry-full # Full dry-run with GitHub/npm simulation (may fail locally)
# 🔄 CI/CD Integration
pnpm run release:preview # CI-based release preview
pnpm run release # Automated release (CI only)
# 🛠️ Manual Tools
pnpm run changelog # Generate changelog manually
pnpm run release:legacy # Emergency manual release
pnpm run health-check # Validate CI/CD configurationCommand Details
| Command | Environment | Auth Required | Output |
|---------|-------------|---------------|---------|
| release:analyze | Local | ❌ No | Commit analysis + version prediction |
| release:dry | Local | ❌ No | Basic semantic-release simulation |
| release:dry-full | Local | ⚠️ Optional | Full simulation (may fail without auth) |
| release:preview | CI/CD | ✅ Yes | Complete release preview |
| release | CI/CD | ✅ Yes | Actual release |
📊 Release Validation
Every release automatically runs:
- ✅ ESLint code quality checks
- ✅ Prettier formatting validation
- ✅ Jest test suite with coverage
- ✅ TypeScript compilation
- ✅ Examples validation
- ✅ Documentation generation
- ✅ Security audit
- ✅ Bundle size analysis
🎯 Best Practices
- Use conventional commits for all changes
- Write descriptive commit messages with clear scope
- Include breaking change notes when applicable
- Keep commits atomic and focused
- Test locally before pushing
- Let CI/CD handle releases (avoid manual versioning)
🔧 Troubleshooting Common Release Issues
- Release Not Triggered:
- Ensure your commit messages strictly follow the Conventional Commits format.
- Verify the commit was pushed to the
mainbranch (for production releases) or the relevant branch for previews. - Check that the commit message does not include
[skip ci]or similar tags that prevent CI execution.
- Incorrect Version Bump:
- Carefully review your commit history to ensure
feat,fix, andBREAKING CHANGE:annotations are used correctly according to the version impact table. - Ensure any breaking changes are clearly documented in the commit body or footer, starting with
BREAKING CHANGE:.
- Carefully review your commit history to ensure
🤝 Contributing
We welcome contributions! Please follow these guidelines:
💻 Development Setup
Fork and Clone
git clone https://github.com/devalexanderdaza/crawlee-scraper-toolkit.git cd crawlee-scraper-toolkitCore Prerequisites & Environment Setup
- Node.js: This project uses Node.js version
>=20.0.0(as specified in.nvmrc). We recommend using nvm (Node Version Manager).# Install and use the project's Node.js version (reads .nvmrc) nvm install nvm use # Verify node --version - pnpm: This project uses pnpm (version
>=8.0.0, seepackage.jsonpackageManagerfield) as the package manager.
This project includes an# Install pnpm globally (if you don't have it) npm install -g pnpm@latest # Or, enable corepack (recommended, comes with Node.js >=16.10) # corepack enable && corepack prepare pnpm@latest --activate # Verify pnpm --version.npmrcfile with settings likeauto-install-peers=true. Key structure files:.nvmrc,.npmrc,pnpm-lock.yaml,tsconfig.json.
- Node.js: This project uses Node.js version
Install Dependencies & Setup Hooks
# Install all project dependencies using pnpm pnpm install # Setup Husky git hooks (for linting, commit messages, etc.) pnpm run prepareVerify Setup
# Run tests to ensure everything is working pnpm test
🛠️ Common Development Commands
A list of common commands you might use during development:
pnpm dev: Run the application in development mode (e.g., with hot reloading, if applicable).pnpm build: Build the project for production.pnpm test: Run the full test suite.pnpm test:watch: Run tests in interactive watch mode.pnpm test:coverage: Generate a test coverage report.pnpm lint: Check code for linting issues using ESLint.pnpm lint:fix: Attempt to automatically fix linting issues.pnpm format: Format code using Prettier.pnpm example:news: Run the basic news scraper example.pnpm example:products: Run the advanced product scraper example.pnpm cli -- --help: Run the toolkit's CLI locally to see its commands. (Note:pnpm cliitself might run a default command; use--to pass arguments to the CLI script).
🎯 Development Workflow
Create Feature Branch
git checkout -b feat/your-feature-nameMake Changes following conventional commits:
# Examples of good commit messages git commit -m "feat: add retry mechanism to browser pool" git commit -m "fix: resolve memory leak in scraper engine" git commit -m "docs: add configuration examples" git commit -m "test: add integration tests for API scraper"Validate Changes
pnpm run lint # Check code style pnpm run test # Run test suite pnpm run build # Verify build worksSubmit Pull Request
- Push your branch:
git push origin feat/your-feature-name - Open PR against
mainbranch - CI will automatically validate and show release preview
- Address any feedback from reviewers
- Push your branch:
📝 Commit Message Convention
We use Conventional Commits for automatic versioning:
<type>(<scope>): <description>
[optional body]
[optional footer(s)]Types:
feat: New feature (minor version bump)fix: Bug fix (patch version bump)docs: Documentation changesstyle: Code style changes (formatting, etc.)refactor: Code refactoringtest: Adding or updating testschore: Maintenance tasksperf: Performance improvementsci: CI/CD changes
Breaking Changes:
feat!: redesign configuration API
BREAKING CHANGE: The configuration schema has changed.
Migration guide available in MIGRATION.md🛡️ Quality Gates
All contributions must pass:
- Pre-commit hooks: Linting and formatting
- Commit message validation: Conventional commit format
- CI pipeline: Tests, build, examples validation
- Code review: Maintainer approval required
🚫 What Not to Do
- ❌ Don't manually update version numbers
- ❌ Don't edit CHANGELOG.md directly
- ❌ Don't skip conventional commit format
- ❌ Don't push directly to main branch
- ❌ Don't bypass pre-commit hooks
📄 License
This project is licensed under the MIT License - see the LICENSE file for details.
🙏 Acknowledgments
- Built on top of Crawlee and Playwright
- Inspired by the need for a more configurable and extensible scraping toolkit
- Thanks to all contributors and the open-source community
📞 Support
Made with ❤️ by devalexanderdaza
