@firekid/scraper
v1.0.1
The most advanced web scraping machine ever built
Firekid Scraper
Advanced web scraping framework built on Playwright with intelligent anti-detection, automatic healing, and distributed crawling capabilities.
GitHub: Firekid-is-him/firekid-scraper-sdk
Features
Anti-Detection & Stealth:
- Ghost fingerprinting system spoofs canvas, WebGL, audio, fonts, and navigator properties
- Cloudflare bypass with automatic fallback to manual solving
- Behavioral profiles that mimic human interaction patterns
- Network forensics cleaning removes tracking artifacts
Intelligent Automation:
- Self-healing selectors with 7 fallback strategies
- Pattern caching with SQLite storage for learned behaviors
- Smart fetch with automatic referer chain management
- Action recorder captures and replays user interactions
Distributed Crawling:
- Queue-based task distribution across multiple workers
- Browser worker pool with resource management
- Rate limiting with configurable windows and thresholds
- Session persistence and recovery
Developer Experience:
- Simple command-based scripting language
- Plugin system for extensibility
- Multiple scraping modes: auto, manual, SSR, infinite scroll, pagination
- Built-in scheduler for recurring tasks
- Webhook notifications and database export
Installation
npm install @firekid/scraper
npx playwright install chromium
Global CLI Installation
npm install -g @firekid/scraper
firekid-scraper --help
Docker Installation
docker pull firekid/scraper:latest
docker run -v $(pwd)/data:/data firekid/scraper
Quick Start
Basic Scraping
import { FirekidScraper } from '@firekid/scraper'
const scraper = new FirekidScraper({
headless: true,
bypassCloudflare: true
})
await scraper.init()
const data = await scraper.scrape('https://example.com', {
selectors: {
title: 'h1',
content: '.article-body',
author: '.author-name'
}
})
console.log(data)
await scraper.close()
Command-Based Scripting
const scraper = new FirekidScraper()
await scraper.init()
await scraper.runCommands(`
GOTO https://example.com
WAIT .product-list
EXTRACT .product-title text AS titles
EXTRACT .product-price text AS prices
SCREENSHOT products.png
`)
await scraper.close()
Auto Mode
const scraper = new FirekidScraper()
await scraper.init()
const data = await scraper.auto('https://example.com/products', {
depth: 2,
extractPattern: 'product'
})
await scraper.close()
Core API
FirekidScraper
Main scraper class that orchestrates all operations.
Constructor Options:
new FirekidScraper({
headless: boolean, // Run browser in headless mode (default: true)
bypassCloudflare: boolean, // Enable Cloudflare bypass (default: false)
useGhost: boolean, // Enable fingerprint spoofing (default: true)
browserArgs: string[], // Additional Chromium arguments
timeout: number, // Default timeout in ms (default: 30000)
userAgent: string, // Custom user agent
viewport: { width, height }, // Browser viewport size
proxy: string, // Proxy URL (http://user:pass@host:port)
rateLimit: { // Rate limiting configuration
enabled: boolean,
max: number,
window: number
}
})
Methods:
await scraper.init(): Initialize browser and context.
await scraper.close(): Close browser and cleanup resources.
await scraper.goto(url): Navigate to URL with anti-detection measures.
await scraper.scrape(url, options): Extract data using CSS selectors.
Options:
- selectors: Object mapping field names to CSS selectors
- attribute: Extract attribute instead of text (default: text)
- multiple: Return array of all matches (default: false)
- screenshot: Take screenshot after extraction
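As an illustrative sketch of how these options combine, the object below pairs `attribute` with `multiple` to collect one attribute from every match. The result shape is an assumption based on the option descriptions above, not a verified API contract:

```javascript
// Hypothetical options object combining `attribute` and `multiple`
// (a sketch based on the option descriptions above).
const options = {
  selectors: {
    links: 'a.article-link'   // field name -> CSS selector
  },
  attribute: 'href',          // extract the href attribute instead of text
  multiple: true              // return an array of all matches
}

// With multiple: true, each field would presumably hold an array:
const result = { links: ['/articles/1', '/articles/2'] }
```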
await scraper.runCommands(script): Execute command-based script.
await scraper.auto(url, options): Automatically detect and extract data.
Options:
- depth: Maximum crawl depth (default: 1)
- extractPattern: Pattern hint (product, article, listing, etc.)
- followLinks: Follow pagination/navigation links
await scraper.paginate(url, selector, options): Scrape paginated content.
Options:
- maxPages: Maximum pages to scrape
- waitBetween: Delay between pages in ms
- nextSelector: Selector for next page button
await scraper.infiniteScroll(url, options): Scrape infinite scroll pages.
Options:
- maxScrolls: Maximum scroll iterations
- itemSelector: Selector for items to extract
- scrollDelay: Delay between scrolls in ms
Plugin System
Extend functionality through plugins.
Loading Plugins:
const scraper = new FirekidScraper()
await scraper.loadPlugin('./plugins/custom-plugin.js')
Plugin Structure:
export default {
name: 'custom-extractor',
type: 'extractor',
async execute(page, options) {
const data = await page.evaluate(() => {
return {
title: document.title,
meta: Array.from(document.querySelectorAll('meta'))
.map(m => ({ name: m.name, content: m.content }))
}
})
return data
}
}
Plugin Types:
- scraping: Custom scraping logic
- action: Custom page actions
- extractor: Data extraction methods
- filter: Data filtering and validation
- output: Custom output formats
- parser: Data parsing and transformation
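For instance, a minimal filter-type plugin might look like the following. This is a hypothetical sketch that mirrors the extractor plugin structure above; the exact argument passed to a filter plugin's execute method is an assumption:

```javascript
// Hypothetical filter plugin: drops records that lack a price field.
// Structure mirrors the extractor example above; the execute(data)
// signature for filter plugins is assumed, not confirmed by the docs.
const priceFilter = {
  name: 'price-filter',
  type: 'filter',
  async execute(data) {
    // Keep only records with a non-empty price
    return data.filter(item => item.price != null && item.price !== '')
  }
}

// The filtering logic itself is plain JavaScript and can be exercised directly:
// priceFilter.execute(records) resolves to only the priced records.
```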
Distributed Scraping
Scale scraping across multiple workers.
import { DistributedEngine } from '@firekid/scraper'
const engine = new DistributedEngine({
workers: 5,
queueSize: 100,
retries: 3
})
await engine.init()
engine.addTask({
id: 'task-1',
url: 'https://example.com',
mode: 'scrape',
options: {
selectors: { title: 'h1' }
},
priority: 10
})
engine.on('taskComplete', (result) => {
console.log('Task completed:', result)
})
engine.on('taskFailed', (error) => {
console.error('Task failed:', error)
})
await engine.start()
Command Reference
Commands use simple syntax for browser automation.
Navigation Commands
GOTO url: Navigate to URL.
Example: GOTO https://example.com
BACK: Go back in history.
FORWARD: Go forward in history.
REFRESH: Reload current page.
Interaction Commands
CLICK selector: Click element.
Example: CLICK button.submit
TYPE selector text: Type text into input.
Example: TYPE input[name="search"] laptop
PRESS key: Press keyboard key.
Example: PRESS Enter
SELECT selector value: Select dropdown option.
Example: SELECT select[name="country"] US
CHECK selector: Check checkbox.
Example: CHECK input[type="checkbox"]
UPLOAD selector filepath: Upload file.
Example: UPLOAD input[type="file"] ./document.pdf
Wait Commands
WAIT selector: Wait for element to appear.
Example: WAIT .product-list
WAITLOAD: Wait for page load.
Scroll Commands
SCROLL selector: Scroll element into view.
Example: SCROLL .footer
SCROLLDOWN pixels: Scroll down by pixels.
Example: SCROLLDOWN 500
Extraction Commands
SCAN: Analyze page structure.
EXTRACT selector type AS variable: Extract data.
Types: text, html, attr:name, href, src
Example: EXTRACT h1 text AS title
SCREENSHOT filename: Take screenshot.
Example: SCREENSHOT page.png
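Putting the extraction types together, a script might read as follows. This is a sketch assuming each type slots into the same EXTRACT selector type AS variable shape shown above; the selectors are illustrative:

```
GOTO https://example.com/articles
WAIT .article-list
EXTRACT a.article-link href AS urls
EXTRACT img.thumbnail attr:alt AS altTexts
EXTRACT .article-body html AS bodies
SCREENSHOT articles.png
```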
Advanced Commands
PAGINATE selector: Auto-paginate through results.
Example: PAGINATE .next-page
INFINITESCROLL count: Scroll and load more items.
Example: INFINITESCROLL 10
FETCH url: Fetch URL with smart referer.
Example: FETCH https://api.example.com/data
DOWNLOAD url: Download file.
Example: DOWNLOAD https://example.com/file.pdf
REFERER url: Set custom referer.
Example: REFERER https://google.com
BYPASS_CLOUDFLARE: Attempt Cloudflare bypass.
Flow Control
REPEAT selector: Loop over matching elements.
REPEAT .product
EXTRACT .title text AS titles
EXTRACT .price text AS prices
IF selector: Conditional execution.
IF .login-button
CLICK .login-button
TYPE input[name="username"] admin
LOOP count: Repeat commands N times.
LOOP 5
SCROLLDOWN 300
WAIT 1000
Configuration
Environment Variables
HEADLESS: Run in headless mode (true/false)
MAX_QUEUE_WORKERS: Maximum concurrent workers (number)
BROWSER_TIMEOUT: Browser timeout in ms (number)
CF_BYPASS: Cloudflare bypass mode (auto/manual/skip)
TURNSTILE_SOLVER: Turnstile solver (skip/manual/2captcha/capsolver)
CAPTCHA_API_KEY: API key for captcha solver
API_ENABLED: Enable web API (true/false)
API_PORT: API server port (number)
API_KEY: API authentication key
PROXY_ENABLED: Enable proxy (true/false)
PROXY_URL: Proxy URL
DATA_DIR: Data storage directory
PATTERNS_DB: Pattern cache database path
SESSIONS_DB: Session storage database path
LOG_LEVEL: Logging level (error/warn/info/debug)
RECORD_SCREENSHOTS: Record screenshots (true/false)
RATE_LIMIT_ENABLED: Enable rate limiting (true/false)
RATE_LIMIT_MAX: Max requests per window (number)
RATE_LIMIT_WINDOW: Rate limit window in ms (number)
Configuration File
Create .env file in project root:
HEADLESS=true
MAX_QUEUE_WORKERS=5
BROWSER_TIMEOUT=30000
CF_BYPASS=auto
LOG_LEVEL=info
RATE_LIMIT_ENABLED=true
RATE_LIMIT_MAX=100
RATE_LIMIT_WINDOW=3600000
Advanced Usage
Custom Behavioral Profiles
const scraper = new FirekidScraper()
await scraper.init()
await scraper.setProfile('human')
await scraper.goto('https://example.com')
Available profiles:
- fast: 30-60ms delays, minimal randomization
- normal: 80-120ms delays, moderate randomization
- careful: 120-180ms delays, high randomization
- human: 50-150ms delays, natural patterns
Pattern Learning
const scraper = new FirekidScraper()
await scraper.init()
await scraper.goto('https://example.com/products')
const pattern = await scraper.learnPattern('product', {
containerSelector: '.product-card',
fields: ['title', 'price', 'image']
})
const products = await scraper.applyPattern('product')
Self-Healing Selectors
const scraper = new FirekidScraper()
await scraper.init()
const healer = scraper.getHealer()
const element = await healer.find('.old-selector', {
strategies: ['id', 'className', 'text', 'position'],
savePattern: true
})
Webhook Integration
const scraper = new FirekidScraper({
webhook: {
url: 'https://your-api.com/webhook',
events: ['scrapeComplete', 'error']
}
})
await scraper.init()
await scraper.scrape('https://example.com')
Database Export
const scraper = new FirekidScraper()
await scraper.init()
const data = await scraper.scrape('https://example.com', {
selectors: { title: 'h1' }
})
await scraper.exportToDatabase(data, {
type: 'postgresql',
connection: {
host: 'localhost',
database: 'scraping',
user: 'user',
password: 'pass'
},
table: 'products'
})
Scheduled Tasks
import { TaskScheduler } from '@firekid/scraper'
const scheduler = new TaskScheduler()
scheduler.schedule('daily-scrape', '0 0 * * *', async () => {
const scraper = new FirekidScraper()
await scraper.init()
await scraper.scrape('https://example.com')
await scraper.close()
})
Examples
Product Scraper
import { FirekidScraper } from '@firekid/scraper'
const scraper = new FirekidScraper({ headless: true })
await scraper.init()
const products = await scraper.paginate('https://store.example.com/products', '.next-page', {
maxPages: 10,
selectors: {
title: '.product-title',
price: '.product-price',
image: 'img.product-image',
rating: '.product-rating'
}
})
await scraper.export(products, 'json', './products.json')
await scraper.close()
Login and Scrape
const scraper = new FirekidScraper()
await scraper.init()
await scraper.runCommands(`
GOTO https://example.com/login
TYPE input[name="username"] myuser
TYPE input[name="password"] mypass
CLICK button[type="submit"]
WAITLOAD
GOTO https://example.com/dashboard
EXTRACT .data-table text AS tableData
`)
await scraper.close()
Infinite Scroll
const scraper = new FirekidScraper()
await scraper.init()
const items = await scraper.infiniteScroll('https://example.com/feed', {
maxScrolls: 20,
itemSelector: '.feed-item',
scrollDelay: 1000,
extractFields: {
content: '.feed-content',
author: '.feed-author',
timestamp: '.feed-time'
}
})
await scraper.close()
API Hunting
import { APIHunter } from '@firekid/scraper'
const hunter = new APIHunter()
await hunter.init()
const apis = await hunter.hunt('https://example.com', {
captureXHR: true,
captureFetch: true,
captureWebSocket: true
})
console.log('Discovered APIs:', apis)
await hunter.close()
Video Download
const scraper = new FirekidScraper()
await scraper.init()
await scraper.runCommands(`
GOTO https://video-site.com/video/123
WAIT video
BYPASS_CLOUDFLARE
DOWNLOAD https://cdn.video-site.com/videos/file.mp4
`)
await scraper.close()
TypeScript Support
Full TypeScript definitions included.
import { FirekidScraper, ScraperOptions, ScrapeResult } from '@firekid/scraper'
interface Product {
title: string
price: number
image: string
}
const scraper = new FirekidScraper({
headless: true,
bypassCloudflare: true
})
await scraper.init()
const result: ScrapeResult<Product> = await scraper.scrape('https://example.com', {
selectors: {
title: 'h1.product-title',
price: '.price',
image: 'img.main'
}
})
await scraper.close()
Docker Usage
Using Docker Compose
version: '3.8'
services:
scraper:
image: firekid/scraper:latest
volumes:
- ./data:/data
- ./output:/output
environment:
- HEADLESS=true
- LOG_LEVEL=info
command: firekid-scraper run ./scripts/scrape.cmd
Custom Dockerfile
FROM firekid/scraper:latest
COPY ./scripts /app/scripts
COPY ./plugins /app/plugins
WORKDIR /app
CMD ["firekid-scraper", "run", "./scripts/main.cmd"]
Performance Optimization
Browser Launch Flags
const scraper = new FirekidScraper({
browserArgs: [
'--disable-dev-shm-usage',
'--no-sandbox',
'--disable-setuid-sandbox',
'--disable-gpu'
]
})
Resource Blocking
await scraper.optimizeRequests({
blockImages: true,
blockFonts: true,
blockMedia: true
})
Parallel Scraping
import { DistributedEngine } from '@firekid/scraper'
const engine = new DistributedEngine({ workers: 10 })
await engine.init()
const urls = ['url1', 'url2', 'url3']
urls.forEach((url, i) => {
engine.addTask({
id: `task-${i}`,
url,
mode: 'scrape',
priority: 10
})
})
await engine.start()
Troubleshooting
Cloudflare Challenges
If automatic bypass fails, enable manual solving:
const scraper = new FirekidScraper({
bypassCloudflare: true,
cloudflareMode: 'manual'
})
The browser will open in headed mode for manual solving.
Memory Issues
Reduce memory usage by limiting concurrent workers:
const scraper = new FirekidScraper({
maxWorkers: 3,
timeout: 15000
})
Rate Limiting
Implement delays between requests:
const scraper = new FirekidScraper({
rateLimit: {
enabled: true,
max: 10,
window: 60000
}
})
Selector Not Found
Enable self-healing selectors:
const element = await scraper.healSelector('.old-selector', {
savePattern: true,
strategies: ['id', 'className', 'text']
})
Contributing
Contributions are welcome. Please read the contributing guidelines before submitting pull requests.
License
MIT License. See LICENSE file for details.
Support
For issues and questions:
- GitHub Repository: https://github.com/Firekid-is-him/firekid-scraper-sdk
- GitHub Issues: Report bugs and request features
- Documentation: Complete guides in the docs folder
- Examples: Sample scripts in the examples folder
Changelog
See CHANGELOG.md for version history and release notes.
