imperium-crawl

v2.6.1

Published

21 days ago

41-tool open-source CLI for web scraping, PDF extraction, content monitoring, reusable browser flows, RSS aggregation, and custom skills. CamoFox C++ anti-detection engine. Zero API keys for core tools.

imperiumhub

scraping crawling web-search brave-search firecrawl cli pdf-extract web-monitoring url-watch content-diff intelligence-digest browser-workflows workflow-recorder flow-api

imperium-crawl

The most powerful open-source CLI tool for web scraping, crawling, and data extraction.

41 tools. CamoFox C++ anti-detection. Zero API keys required. One npx command.

What's new in 2.6.0

CamoFox browser engine — Firefox fork with C++ anti-fingerprinting that bypasses Cloudflare, Google, and most bot detection:

C++-level patches: navigator.hardwareConcurrency, WebGL, AudioContext, and WebRTC are spoofed before JavaScript sees them
Engine abstraction: switch between Playwright and CamoFox with a single engine flag
Auto-update: imperium-crawl camofox-update pulls the latest CamoFox release from npm
Engine factory: import { resolveEngine } from "imperium-crawl/engines" for agent-native use
Zero breaking changes: same tool API, same response format, same env vars. Just add engine: "camofox".
Codebase reorganized: CLI in src/cli/, core in src/core/, tests in 14 category folders.
41 tools total: added camofox_status and camofox_update.

# Download ALL images from any page (100% coverage)
imperium-crawl download <url> --images --output ./slike

# Target exactly the 3rd image
imperium-crawl download <url> --images --index 3

# Auto-click "Prikaži više" ("Show more") and scan iframes
imperium-crawl download <url> --images --auto-click --iframe-scan

See CHANGELOG.md for the full release notes.

Quick Start

Get running in 30 seconds.

CLI (zero install):

npx -y imperium-crawl scrape --url https://example.com

Global install:

npm install -g imperium-crawl

Install from a local tarball (e.g. pre-release testing):

npm install -g ./imperium-crawl-2.5.2.tgz

That's it. 33 of 41 tools work with zero API keys. Add optional keys later to unlock search, AI extraction, and CAPTCHA solving.

Power Examples

Real results. Copy-paste and try.

Scrape through Cloudflare

imperium-crawl scrape --url https://blog.cloudflare.com

Level 1 (headers) → blocked
Level 2 (TLS fingerprint) → blocked
Level 3 (browser + stealth) → success ✅
→ Full markdown content extracted, 213K characters
→ Next visit: skips straight to Level 3 (learned)

Discover hidden APIs on any website

imperium-crawl discover-apis --url https://weather.com

Found 11 hidden API endpoints:
  • api.weather.com — main weather API (exposed API key!)
  • mParticle analytics endpoints
  • Taboola content recommendation API
  • OneTrust consent management API
  • DAA/AdChoices opt-out endpoints
→ Call any endpoint directly with query_api — 10x faster than DOM scraping

AI extraction in plain English

imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
  --schema "extract product name, price, rating, and review count"

{
  "product_name": "Apple AirPods Pro 2",
  "price": "$189.99",
  "rating": "4.7 out of 5",
  "review_count": "45,297"
}

Extract ALL images from any page (100% coverage)

imperium-crawl download https://www.njuskalo.hr/nekretnine/stan-Zagreb --images --output ./slike

Discovered 23 unique images
  ✅ njuskalo.hr-001.jpg — 142KB
  ✅ njuskalo.hr-002.jpg — 89KB
  ✅ njuskalo.hr-003.jpg — 256KB
→ 23/23 downloaded. Total: 4.2MB

Target a specific image:

imperium-crawl download https://olx.ba/artikal/12345 \
  --images --selector "img.gallery-main" --output ./oglas.jpg

Auto-click "Load more" + iframe scan:

imperium-crawl download https://www.leboncoin.fr/ad/12345 \
  --images --auto-click --iframe-scan --limit 50

Batch scrape with resume

imperium-crawl batch-scrape \
  --urls '["https://bbc.com","https://cnn.com","https://reuters.com","https://techcrunch.com"]' \
  --concurrency 3

Scraping 4 URLs (concurrency: 3)...
  ✅ bbc.com — 47K chars
  ✅ cnn.com — 52K chars
  ✅ reuters.com — 38K chars
  ✅ techcrunch.com — 61K chars
→ 4/4 succeeded. Job ID: abc123 (resume with --job-id if interrupted)

Why imperium-crawl?

🔓 Zero API Keys Required 33 of 41 tools work out of the box. No accounts, no tokens, no credit cards. Just npx and go.

🛡️ 3-Level Auto-Escalating Stealth Headers → TLS fingerprinting → headless browser + CAPTCHA solving. Automatically escalates until it gets through.

🧠 Self-Improving Adaptive learning engine remembers what works per domain. Second visit is 3x faster. The more you use it, the smarter it gets.

🧰 41 Tools, 2 Modes CLI tool or interactive TUI. Scraping, crawling, search, extraction, API discovery, WebSocket monitoring, browser automation, batch processing.

📜 14 Built-in Recipes Pre-built workflows for common tasks — news extraction, e-commerce scraping, API reverse engineering, and more.

⚡ Skills System Teach it once, run forever. Auto-detect patterns on any page, save as reusable skills, get fresh data on demand.

vs. The Competition

| Feature | imperium-crawl | Firecrawl | Crawl4AI | Browserbase | Puppeteer |
|---------|:------------------:|:---------:|:--------:|:-----------:|:---------:|
| Price | Free forever | $19+/month | Free | $0.01/min | Free |
| Total tools | 41 | 5 | 2 | 4 | N/A |
| Stealth levels | 3 (auto-escalate) | Cloud-based | 1 | Cloud-based | None |
| Anti-bot detection | 7 systems | Partial | Partial | Partial | None |
| TLS fingerprinting | JA3/JA4 | No | No | No | No |
| CAPTCHA auto-solving | Yes | No | No | No | No |
| API discovery | Yes | No | No | No | No |
| WebSocket monitoring | Yes | No | No | No | No |
| AI-powered extraction | Yes | No | No | No | No |
| Adaptive learning | Yes | No | No | No | No |
| Batch processing | Yes | No | No | No | No |
| ARIA Snapshots | Yes | No | No | No | No |
| Session Encryption | Yes | No | No | No | No |
| Self-hosted | Yes | No | Yes | No | Yes |
| Requires external service | No | Yes | No | Yes | No |

Stealth Engine

Request → [L1: Headers + UA rotation]
              │
              ├─ success → Done
              ↓ fail
          [L2: TLS Fingerprint (JA3/JA4)]
              │
              ├─ success → Done
              ↓ fail
          [L3: Browser + Fingerprint Injection + CAPTCHA]
              │
              ├─ success → Done
              ↓
          [Learning Engine records optimal level for next time]
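The escalation flow above can be sketched as a short loop. This is a minimal illustration, not the engine's actual code: the `Fetcher` type, the `escalate` function, and the `startLevel` parameter are hypothetical names, and the fetchers are synchronous only to keep the sketch small (real fetchers would be asynchronous).

```typescript
// Hypothetical types; the real engine's internals are not shown in this README.
type Level = 1 | 2 | 3;

interface AttemptResult {
  ok: boolean;
  body?: string;
}

// One fetch strategy per stealth level (headers, TLS fingerprint, browser).
type Fetcher = (url: string) => AttemptResult;

interface EscalationOutcome {
  level: Level | null; // level that succeeded, or null if every level failed
  body: string | null;
  tried: Level[]; // levels attempted, in order
}

function escalate(
  url: string,
  fetchers: Record<Level, Fetcher>,
  startLevel: Level = 1,
): EscalationOutcome {
  const tried: Level[] = [];
  const levels: Level[] = [1, 2, 3];
  for (const level of levels) {
    // Skip levels below the starting point (for example, a learned level).
    if (level < startLevel) continue;
    tried.push(level);
    const result = fetchers[level](url);
    if (result.ok) {
      return { level, body: result.body ?? "", tried };
    }
  }
  return { level: null, body: null, tried };
}
```

Passing a learned `startLevel` of 3 makes the loop skip Levels 1 and 2, which mirrors the "skips straight to Level 3" behaviour described for repeat visits.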

Stealth Levels

| Level | Method | What It Defeats |
|-------|--------|-----------------|
| 1 | header-generator: Bayesian realistic headers + UA rotation | Basic bot detection, simple WAFs |
| 2 | impit: browser-identical TLS fingerprints (JA3/JA4) | Cloudflare, Akamai, TLS fingerprinting WAFs |
| 3 | rebrowser-playwright + fingerprint-injector + auto CAPTCHA | JavaScript challenges, SPAs, advanced anti-bot, CAPTCHAs |

Anti-Bot System Detection

Automatically identifies which anti-bot system a site uses and chooses the optimal strategy:

| System | Detection Method |
|--------|-----------------|
| Cloudflare | cf_clearance cookies, cf-mitigated header, challenge page title |
| Akamai | _abck, bm_sz cookies |
| PerimeterX / HUMAN | _px cookies, _pxhd headers |
| DataDome | datadome cookies, datadome response header |
| Kasada | x-kpsdk-* headers |
| AWS WAF | aws-waf-token cookie |
| F5 / Shape Security | TS prefix cookies |
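As a rough illustration of how the signals in this table could be matched, here is a minimal sketch. The `ResponseSignals` shape and the `detectAntiBot` function are hypothetical, the check order is an assumption, and the Cloudflare challenge page title check is omitted. Only the cookie and header names come from the table.

```typescript
type AntiBotSystem =
  | "Cloudflare"
  | "Akamai"
  | "PerimeterX / HUMAN"
  | "DataDome"
  | "Kasada"
  | "AWS WAF"
  | "F5 / Shape Security";

// Hypothetical input shape: names only, values are not needed for this sketch.
interface ResponseSignals {
  cookieNames: string[];
  headerNames: string[];
}

function detectAntiBot(signals: ResponseSignals): AntiBotSystem | null {
  const cookies = signals.cookieNames.map((c) => c.toLowerCase());
  const headers = signals.headerNames.map((h) => h.toLowerCase());
  const cookie = (re: RegExp): boolean => cookies.some((c) => re.test(c));
  const header = (re: RegExp): boolean => headers.some((h) => re.test(h));

  if (cookie(/^cf_clearance$/) || header(/^cf-mitigated$/)) return "Cloudflare";
  if (cookie(/^(_abck|bm_sz)$/)) return "Akamai";
  if (cookie(/^_px/) || header(/^_pxhd/)) return "PerimeterX / HUMAN";
  if (cookie(/^datadome/) || header(/datadome/)) return "DataDome";
  if (header(/^x-kpsdk-/)) return "Kasada";
  if (cookie(/^aws-waf-token$/)) return "AWS WAF";
  // The "TS" prefix is a loose signal, so it is checked last and
  // against the original (case-sensitive) cookie names.
  if (signals.cookieNames.some((c) => /^TS/.test(c))) return "F5 / Shape Security";
  return null;
}
```

Returning `null` for unknown sites lets a caller fall back to plain level-by-level escalation.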

Smart Rendering Cache

Once imperium-crawl determines a domain needs Level 3 (browser), it caches that decision for 1 hour. Subsequent requests to the same domain skip straight to browser rendering — no wasted time on failed lower levels.
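A per-domain cache with a one-hour expiry is enough to express this behaviour. The sketch below is an assumption about how such a cache could look, not the tool's implementation. `RenderingCache` and its method names are hypothetical, and the clock is passed in so the expiry is easy to reason about.

```typescript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Hypothetical cache: remembers which domains need Level 3 (browser) rendering.
class RenderingCache {
  private readonly expiresAt = new Map<string, number>();
  private readonly ttlMs: number;

  constructor(ttlMs: number = ONE_HOUR_MS) {
    this.ttlMs = ttlMs;
  }

  // Record that a domain needed the browser; the entry lives for ttlMs.
  markNeedsBrowser(domain: string, now: number = Date.now()): void {
    this.expiresAt.set(domain, now + this.ttlMs);
  }

  // True while a non-expired entry exists; expired entries are dropped lazily.
  needsBrowser(domain: string, now: number = Date.now()): boolean {
    const expiry = this.expiresAt.get(domain);
    if (expiry === undefined) return false;
    if (now >= expiry) {
      this.expiresAt.delete(domain);
      return false;
    }
    return true;
  }
}
```

A caller would check `needsBrowser(domain)` before a request and start at Level 3 when it returns true.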

Adaptive Learning Engine

imperium-crawl learns from every request and gets smarter over time. No configuration needed — fully automatic.

Every time you scrape a website, the engine records which stealth level worked, which anti-bot system was detected, whether a proxy was needed, response timing, and success/failure. Next time you hit the same domain, it predicts the optimal configuration — skipping failed levels and going straight to what works.

First visit to cloudflare.com:
  Level 1 → blocked ❌
  Level 2 → blocked ❌
  Level 3 → success ✅ (Cloudflare detected)
  → Engine records: cloudflare.com needs Level 3

Second visit to cloudflare.com:
  → Engine predicts: Level 3, confidence 85%, Cloudflare
  → Skips Level 1 and 2 entirely — goes straight to browser
  → 3x faster than first visit

Smart Features

Time decay — Knowledge older than 7 days loses weight, adapts when sites change defenses
Confidence scoring — Low data = start from level 1. High confidence = skip to optimal level
Auto-prune — Domains unused for 30 days are cleaned up. Max 2,000 domains stored
Atomic persistence — Knowledge saved via atomic write (tmp → rename). Never corrupts
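To make time decay and confidence scoring concrete, here is a minimal sketch. The README only states that knowledge older than 7 days loses weight and that low confidence falls back to Level 1. The decay curve (halving every further 7 days), the smoothing constant, and the 0.6 threshold are assumptions chosen for illustration, as are all the names.

```typescript
type Level = 1 | 2 | 3;

// Hypothetical record of one successful request at a given stealth level.
interface SuccessRecord {
  level: Level;
  at: number; // epoch milliseconds
}

interface Prediction {
  startLevel: Level;
  confidence: number;
}

const DAY_MS = 24 * 60 * 60 * 1000;
const FRESH_WINDOW_MS = 7 * DAY_MS;

// Assumption: full weight within 7 days, then halving every further 7 days.
function decayWeight(ageMs: number): number {
  if (ageMs <= FRESH_WINDOW_MS) return 1;
  return Math.pow(0.5, (ageMs - FRESH_WINDOW_MS) / FRESH_WINDOW_MS);
}

function predictStartLevel(
  history: SuccessRecord[],
  now: number,
  minConfidence: number = 0.6,
): Prediction {
  const weights: Record<Level, number> = { 1: 0, 2: 0, 3: 0 };
  let total = 0;
  for (const record of history) {
    const w = decayWeight(now - record.at);
    weights[record.level] += w;
    total += w;
  }
  // Pick the level with the most decayed evidence; ties go to the lower level.
  const levels: Level[] = [1, 2, 3];
  let best: Level = 1;
  for (const level of levels) {
    if (weights[level] > weights[best]) best = level;
  }
  // The +1 acts as a pseudo-count so that sparse data yields low confidence.
  const confidence = weights[best] / (total + 1);
  return { startLevel: confidence >= minConfidence ? best : 1, confidence };
}
```

With this weighting, a single recent success is not enough to skip levels, while several recent successes at the same level are. Old records fade until the prediction falls back to Level 1.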

The more you use it, the faster it gets.

All 41 Tools

📄 Scraping (no API key needed)

| Tool | What It Does |
|------|-------------|
| scrape | URL to clean Markdown/HTML with 3-level auto-escalating stealth. Structured data (JSON-LD, OpenGraph, Microdata), metadata, and links. |
| crawl | Priority-based crawling with depth control, concurrency limiting, and smart URL scoring. |
| map | Discover all URLs on a domain via sitemap.xml + page link extraction. |
| extract | CSS selectors to structured JSON. Point at any repeating pattern and get clean data. |
| readability | Mozilla Readability article extraction: title, author, content, publish date. |
| screenshot | Full-page or viewport PNG screenshots via headless Chromium. |

🔍 Search (requires free Brave API key)

| Tool | What It Does |
|------|-------------|
| search | Web search via Brave Search API. |
| news_search | News-specific search with freshness ranking. |
| image_search | Image search with thumbnails and source URLs. |
| video_search | Video search across platforms. |

⚡ Skills (no API key needed)

| Tool | What It Does |
|------|-------------|
| create_skill | Analyze any page, auto-detect repeating patterns, generate CSS selectors, save as reusable skill. |
| run_skill | Run a saved skill for fresh structured data. Supports pagination. |
| list_skills | List all saved skills with configurations. |

🔓 API Discovery & Real-Time (no API key needed, requires Playwright)

| Tool | What It Does |
|------|-------------|
| discover_apis | Navigate to any page, intercept XHR/fetch calls, map hidden REST/GraphQL endpoints. Auto-detects GraphQL, filters noise, returns response previews. |
| query_api | Call any API endpoint directly with stealth headers. Bypass DOM rendering for 10x faster data access. |
| monitor_websocket | Capture real-time WebSocket messages: financial tickers, chat feeds, live dashboards. |

🧠 AI Extraction (requires LLM API key)

| Tool | What It Does |
|------|-------------|
| ai_extract | Describe what you want in natural language or JSON schema. 3 providers (Anthropic, OpenAI, MiniMax). The extract tool also supports llm_fallback: true for hybrid CSS→AI extraction. |

🖱️ Interaction (no API key needed, requires Playwright)

| Tool | What It Does |
|------|-------------|
| interact | Browser automation with 20 action types (click, type, scroll, wait, screenshot, evaluate, select, hover, press, navigate, drag, upload, storage, cookies, pdf, auth_login, refresh, auto_click). Ref targeting via ARIA snapshot, session encryption, action policy, domain filter, network interception, device emulation. auto_click finds and clicks "load more" / "gallery" buttons with multilingual keyword matching. |
| snapshot | ARIA-based page snapshot with interactive element refs. Use refs in interact for precise targeting. Annotated screenshots. |

📱 Social Media (no API key needed)

| Tool | What It Does |
|------|-------------|
| youtube | Search videos, get video details, comments, transcripts, chapters, and channel info. Parses ytInitialData, so no API key is needed. Add OPENAI_API_KEY to unlock Whisper AI transcription for videos without captions. |
| reddit | Search Reddit, browse subreddits, get posts and comments via Reddit's public JSON API. |
| instagram | Search profiles, get detailed profile info with engagement metrics, and discover influencers by niche/location. Search/discover require BRAVE_API_KEY. |

📥 Media & Feeds (no API key needed)

| Tool | What It Does |
|------|-------------|
| download | Download media files from any URL: images, video, YouTube, TikTok, bulk. v2.5.1: Browser-based image extraction with 100% coverage (lazy-load, shadow DOM, iframes, JSON-LD, CSS backgrounds). Target specific images via --selector, --index, --alt-match. Auto-click "load more" buttons. Referer injection fixes 403 on CDNs. |
| batch_download | Download multiple files (PDFs, images, documents) in parallel with session cookie support. Uses L1 HTTP fetch, 10x faster than browser-based downloads. Ideal for bulk file retrieval from authenticated sessions. |
| rss | Fetch and parse RSS/Atom feeds. Filter by date, output as JSON or Markdown. |

📦 Batch Processing (no API key needed)

| Tool | What It Does |
|------|-------------|
| batch_scrape | Parallel URL scraping with configurable concurrency, soft failure, and resume via job_id. Optional AI extraction per URL. |
| list_jobs | List all batch jobs with status and progress. |
| job_status | Full results for a specific batch job including per-URL outcomes. |
| delete_job | Clean up completed or failed batch jobs. |
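The combination of configurable concurrency and soft failure can be illustrated with a small worker-pool sketch. This is not the tool's implementation: `batchRun`, `UrlOutcome`, and the injected `task` callback are hypothetical, and resume via job_id is left out.

```typescript
// Hypothetical per-URL outcome: a failed URL is recorded, not thrown.
interface UrlOutcome<T> {
  url: string;
  ok: boolean;
  value?: T;
  error?: string;
}

async function batchRun<T>(
  urls: string[],
  concurrency: number,
  task: (url: string) => Promise<T>,
): Promise<UrlOutcome<T>[]> {
  const outcomes: UrlOutcome<T>[] = new Array(urls.length);
  let next = 0; // index of the next URL to claim

  // Each worker claims URLs one at a time until none are left.
  async function worker(): Promise<void> {
    while (next < urls.length) {
      const index = next++;
      const url = urls[index];
      try {
        outcomes[index] = { url, ok: true, value: await task(url) };
      } catch (err) {
        // Soft failure: store the error and keep processing other URLs.
        const message = err instanceof Error ? err.message : String(err);
        outcomes[index] = { url, ok: false, error: message };
      }
    }
  }

  const workerCount = Math.max(1, Math.min(concurrency, urls.length));
  const workers: Promise<void>[] = [];
  for (let i = 0; i < workerCount; i++) workers.push(worker());
  await Promise.all(workers);
  return outcomes;
}
```

Results keep the input order, so a caller can report per-URL outcomes in the same order the URLs were given.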

🧠 Knowledge Engine (no API key needed)

| Tool | What It Does |
|------|-------------|
| knowledge | Dump adaptive knowledge engine stats: per-domain success rates, optimal stealth levels, anti-bot detection history, rate limits. Use to debug scraping issues and understand problematic domains. |

📄 Documents (no API key needed)

| Tool | What It Does |
|------|-------------|
| pdf_extract | Extract text, pages, tables, and metadata from a local or remote PDF. Native text-layer strategy via pdfjs-dist. OCR + Claude Vision fallbacks deferred to v2.6.0. Use for sustainability reports, invoices, regulatory PDFs. |

imperium-crawl pdf-extract --input ./report.pdf --output ./extracted.json
imperium-crawl pdf-extract --input https://example.com/report.pdf --max-pages 20

👀 Change Tracking (no API key needed)

| Tool | What It Does |
|------|-------------|
| watch | One-shot change detector: scrape a URL, hash its content (readability / markdown / full), compare against the last snapshot, fire a webhook on change. Pair with cron for periodic monitoring. |
| monitor | Portfolio-level change tracker across many URLs grouped by topic. Reads a JSON config, runs watch on each URL, emits a markdown digest filtered by minimum change percentage. |

# Watch a single URL — run periodically via cron
imperium-crawl watch --url https://carbonchain.com/pricing \
  --output-dir ./data/watch \
  --webhook https://hooks.example.com/on-change

# Monitor many URLs grouped by topic, emit a daily digest
imperium-crawl monitor --config ./monitor.json --output-dir ./data/monitor

monitor.json:

{
  "topics": [
    {
      "name": "Competitor pricing",
      "urls": ["https://carbonchain.com/pricing", "https://spherasolutions.com/cbam"]
    }
  ]
}
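The hash-and-compare step that watch performs can be sketched as follows. This is an illustration under stated assumptions: the `Snapshot` shape, the whitespace normalization, and the FNV-1a-style fingerprint (computed over UTF-16 code units) are stand-ins, since the README does not say which hash or snapshot format the tool uses. A first run with no previous snapshot is treated here as a baseline, not a change.

```typescript
// Hypothetical snapshot record stored between runs.
interface Snapshot {
  hash: string;
  takenAt: number; // epoch milliseconds
}

interface WatchResult {
  changed: boolean;
  snapshot: Snapshot;
}

// FNV-1a-style 32-bit fingerprint over UTF-16 code units (non-cryptographic).
function fingerprint(content: string): string {
  // Assumption: collapse whitespace so formatting-only edits are ignored.
  const normalized = content.replace(/\s+/g, " ").trim();
  let hash = 0x811c9dc5;
  for (let i = 0; i < normalized.length; i++) {
    hash ^= normalized.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193) >>> 0;
  }
  return hash.toString(16).padStart(8, "0");
}

function checkForChange(
  previous: Snapshot | null,
  content: string,
  now: number = Date.now(),
): WatchResult {
  const hash = fingerprint(content);
  // No previous snapshot means this run only establishes the baseline.
  const changed = previous !== null && previous.hash !== hash;
  return { changed, snapshot: { hash, takenAt: now } };
}
```

A cron-driven caller would persist `snapshot` after each run and send the webhook only when `changed` is true.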

🔁 Imperium Flows (no API key needed; browser workflows may require Playwright)

| Tool | What It Does |
|------|-------------|
| record_flow | Record a headed browser workflow as a generic flow family/variant. Stores smart selector metadata and reusable input placeholders. |
| run_flow | Run a saved flow with runtime JSON input, CAPTCHA policy, browser mode, and evidence collection. |
| serve_flow | Expose saved flows through a local HTTP API. Requires bearer auth when bound publicly. |
| list_flows | List project-local and global flow definitions. |
| inspect_flow | Inspect a saved flow JSON definition. |
| validate_flow | Validate a flow schema and report inputs, steps, and storage path. |

imperium-crawl record-flow --family generic-search --variant site-a --url https://example.com
imperium-crawl run-flow generic-search/site-a --input '{"query":"example"}'
imperium-crawl serve-flow generic-search --port 8787

Setup

API Keys

| Key | What It Unlocks | Where to Get It |
|-----|----------------|-----------------|
| BRAVE_API_KEY | 4 search tools (web, news, image, video) | brave.com/search/api (free tier available) |
| TWOCAPTCHA_API_KEY | Auto CAPTCHA solving (reCAPTCHA v2/v3, hCaptcha, Turnstile) | 2captcha.com |
| LLM_API_KEY | AI-powered data extraction (ai_extract tool) | Anthropic, OpenAI, or MiniMax API key |
| OPENAI_API_KEY | Whisper AI transcription: transcribe any YouTube video, even without captions | platform.openai.com |
| CHROME_PROFILE_PATH | Authenticated browser sessions (use your Chrome cookies) | Path to Chrome user data dir |
| PROXY_URL | Route all requests through a proxy (http/https/socks4/socks5) | Any proxy provider |

Enable Full Stealth (Level 3)

npm i rebrowser-playwright
npx playwright install chromium

CLI Usage

With a subcommand, imperium-crawl runs that tool. With no arguments in a TTY, it starts the interactive TUI. With no arguments in a pipe, it shows help.

# Scrape a website to markdown
imperium-crawl scrape --url https://bbc.com/news

# Crawl with depth control
imperium-crawl crawl --url https://blog.cloudflare.com --max-depth 2 --max-pages 5

# AI-powered extraction — plain English
imperium-crawl ai-extract --url https://amazon.com/dp/B0D1XD1ZV3 \
  --schema "extract product name, price, rating, and review count"

# Discover hidden APIs
imperium-crawl discover-apis --url https://weather.com

# Batch scrape in parallel
imperium-crawl batch-scrape --urls '["https://site1.com","https://site2.com"]' --concurrency 3

# Interactive setup wizard
imperium-crawl setup

Output Formats

imperium-crawl scrape --url https://example.com                          # JSON (default)
imperium-crawl scrape --url https://example.com --output-format markdown  # Markdown
imperium-crawl scrape --url https://example.com --output-format csv       # CSV
imperium-crawl scrape --url https://example.com --pretty                  # Pretty JSON
imperium-crawl scrape --url https://example.com --output result.json      # Write to file

TUI Mode

imperium-crawl tui

Interactive slash-command terminal with parameter prompts, table rendering, markdown display, and session state. Use /save to export results and /again to re-run the last command.

Explore REPL

Interactively explore a site in a headed browser, then save the session as a reusable skill:

imperium-crawl explore https://example.com

> navigate https://example.com/login
> type "#email" "[email protected]"
> type "#password" "{{env:MY_PASSWORD}}"
> click "#submit"
> snapshot
> save-skill my-login
✅ Saved skill: my-login (4 actions, 1 parameter detected)

Commands: navigate, click, type, select, hover, press, scroll, wait, screenshot, snapshot, evaluate, save-skill, history, undo, status, help, exit

Skills & Recipes

Skills let you teach imperium-crawl how to extract data from any website, then re-run for fresh content whenever you want.

Create a skill:

create_skill({
  url: "https://techcrunch.com/category/artificial-intelligence",
  name: "tc-ai-news",
  description: "Latest AI news from TechCrunch"
})

Run a skill:

run_skill({ name: "tc-ai-news" })
→ Returns fresh structured data with all detected fields

Skills are saved in ~/.imperium-crawl/skills/ as JSON files — human-readable, editable, portable.

Skill Parameters

Use template variables in skills — resolved at run time:

# In skill JSON actions:
{ "value": "{{input:query}}" }           # passed via --params or prompted
{ "value": "{{env:SITE_PASSWORD}}" }     # from environment variable
{ "value": "{{computed:date_today}}" }   # auto-computed (date_today, timestamp, random_string, year, month, day)

# Run with params:
imperium-crawl run-skill my-search --params '{"query": "machine learning"}'
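The placeholder syntax above can be resolved with a single regular-expression pass. The sketch below is one possible approach, not the tool's resolver: `resolveTemplate`, `ResolveContext`, and the exact formats of the computed values (UTC dates, an 8-character random string) are assumptions.

```typescript
// Hypothetical context: values are injected so nothing is read from globals.
interface ResolveContext {
  input: Record<string, string>;
  env: Record<string, string | undefined>;
  now: Date;
}

// Assumption: computed values are derived from the UTC date.
function computeValue(name: string, now: Date): string {
  const iso = now.toISOString(); // YYYY-MM-DDTHH:mm:ss.sssZ
  switch (name) {
    case "date_today":
      return iso.slice(0, 10);
    case "timestamp":
      return String(now.getTime());
    case "year":
      return iso.slice(0, 4);
    case "month":
      return iso.slice(5, 7);
    case "day":
      return iso.slice(8, 10);
    case "random_string":
      return Math.random().toString(36).slice(2, 10);
    default:
      throw new Error(`Unknown computed value: ${name}`);
  }
}

function resolveTemplate(template: string, ctx: ResolveContext): string {
  const pattern = /\{\{(input|env|computed):([A-Za-z0-9_]+)\}\}/g;
  return template.replace(pattern, (_match: string, kind: string, name: string) => {
    if (kind === "computed") return computeValue(name, ctx.now);
    const value = kind === "input" ? ctx.input[name] : ctx.env[name];
    // Fail loudly on missing values; never substitute an empty string.
    if (value === undefined) throw new Error(`Missing ${kind} value: ${name}`);
    return value;
  });
}
```

Throwing on a missing input or environment variable keeps a skill from silently submitting a blank field.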

Skill Chains

Chain skills together — output of one step becomes input to the next:

{
  "type": "chain",
  "name": "search-and-extract",
  "steps": [
    { "skill": "search-results", "output": "search" },
    { "skill": "extract-details", "input": { "url": "$search.results[0].url" }, "output": "details" }
  ]
}

Variable syntax: $step_name.field.nested[0] — simple dot-path access, no eval.
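A dot-path lookup of this kind needs no eval. The following sketch shows one way to resolve such an expression against the outputs of earlier steps. `resolvePath` and the shape of `outputs` are hypothetical, and returning `undefined` for a missing path is an assumption.

```typescript
// Resolve "$step.field.nested[0]" against a map of step outputs, without eval.
function resolvePath(expr: string, outputs: Record<string, unknown>): unknown {
  if (expr.charAt(0) !== "$") {
    throw new Error(`Expected a variable starting with "$": ${expr}`);
  }
  // "search.results[0].url" -> ["search", "results", "0", "url"]
  const keys = expr
    .slice(1)
    .replace(/\[(\d+)\]/g, ".$1")
    .split(".")
    .filter((key) => key.length > 0);

  let current: unknown = outputs;
  for (const key of keys) {
    if (current === null || typeof current !== "object") return undefined;
    // Only follow own properties, so "__proto__" and similar keys resolve to nothing.
    if (!Object.prototype.hasOwnProperty.call(current, key)) return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}
```

For the chain shown earlier, the `outputs` map would hold the `search` step's result, and `$search.results[0].url` would resolve to the first result's URL.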

Built-in Recipes

| Recipe | What It Does |
|--------|-------------|
| hn-top-stories | Hacker News front page: titles, scores, comment counts |
| github-trending | GitHub trending repos: stars, language, description |
| job-listings-greenhouse | Greenhouse job boards: title, team, location |
| ecommerce-product | Product name, price, rating, reviews, images |
| product-reviews | Review text, ratings, author, date from product pages |
| crypto-websocket | Live crypto prices via WebSocket monitoring |
| news-article-reader | Article title, author, date, content from news sites |
| reddit-posts | Subreddit posts: title, score, comments, flair |
| seo-page-audit | SEO signals: meta tags, headings, structured data |
| social-media-mentions | Brand mentions across social platforms |
| influencer-niche-discovery | Find influencers by niche + location via Instagram |
| influencer-hashtag-scout | Discover influencers through hashtag analysis |
| influencer-competitor-spy | Find influencers from competitor brand mentions |
| influencer-content-scout | Analyze content patterns of niche influencers |

See SKILL/ for detailed workflow guides and agent integration.

API Discovery Workflow

Turn any website into an API. No documentation needed.

1. discover_apis({ url: "https://weather.com" })
   → Found 11 hidden API endpoints:
     • Main weather API (api.weather.com) with exposed API key
     • mParticle analytics endpoints
     • Taboola content recommendation API
     • OneTrust consent management API

2. query_api({ url: "https://api.weather.com/v3/...", method: "GET" })
   → Direct API call, bypasses DOM entirely — 10x faster, structured JSON

3. monitor_websocket({ url: "https://binance.com/en/trade/BTC_USDT", duration_seconds: 10 })
   → Captures real-time WebSocket messages — live BTC price feed

AI Agent Guide

imperium-crawl ships with SKILL/ — a structured guide that teaches AI agents how to use all 41 tools effectively. Includes proven workflows, decision trees, error recovery, and advanced patterns.

Two Ways to Connect

| Method | Setup | Works With |
|--------|-------|-----------|
| CLI + SKILL/ | npm i -g imperium-crawl + SKILL.md in agent context | Any agent with bash access: Claude Code, Cursor, OpenClaw, ChatGPT, custom agents |
| TUI | imperium-crawl tui (interactive terminal) | Direct human use, demos, debugging |

Per-Agent Setup

| AI Agent | How to Add SKILL/ |
|----------|-------------------|
| Claude Code | Copy SKILL.md to project root (auto-detected) |
| Cursor / Windsurf | Add SKILL.md to project rules or system prompt |
| OpenClaw / custom agents | Include SKILL.md in system prompt or context window |
| ChatGPT / GPT agents | Paste SKILL.md content into custom instructions |

Resilience

Exponential backoff with full jitter — AWS-recommended retry pattern, no thundering herd
Per-domain circuit breaker — 5 failures opens circuit for 60s, then half-open probing with auto recovery
URL normalization — 11-step pipeline removes tracking params (utm_*, fbclid, gclid), sorts query params
Proxy support — single proxy or rotating pool with http/https/socks4/socks5
Browser pool — keyed by proxy URL, auto-eviction, configurable pool size
robots.txt — respected by default (configurable)
Graceful shutdown — 10s timeout on browser cleanup to prevent hung processes
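Two of the mechanisms in this list, full-jitter backoff and the per-domain circuit breaker, are sketched below. The 5-failure threshold and 60-second cooldown come from the list above. The base delay, the delay cap, and all names are assumptions, and the clock and random source are injected so the behaviour can be checked deterministically.

```typescript
// Full jitter: pick a random delay in [0, min(cap, base * 2^attempt)).
function backoffDelayMs(
  attempt: number,
  baseMs: number = 500, // assumption, not from the README
  capMs: number = 30000, // assumption, not from the README
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

type CircuitState = "closed" | "open" | "half-open";

// One breaker per domain; a caller would keep these in a Map keyed by domain.
class CircuitBreaker {
  private failures = 0;
  private openedAt: number | null = null;
  private readonly threshold: number;
  private readonly cooldownMs: number;

  constructor(threshold: number = 5, cooldownMs: number = 60000) {
    this.threshold = threshold;
    this.cooldownMs = cooldownMs;
  }

  state(now: number): CircuitState {
    if (this.openedAt === null) return "closed";
    return now - this.openedAt >= this.cooldownMs ? "half-open" : "open";
  }

  // Requests are allowed when closed, or as a probe when half-open.
  canRequest(now: number): boolean {
    return this.state(now) !== "open";
  }

  recordSuccess(): void {
    this.failures = 0;
    this.openedAt = null; // a successful probe closes the circuit
  }

  recordFailure(now: number): void {
    if (this.state(now) === "half-open") {
      this.openedAt = now; // failed probe: reopen for another cooldown
      return;
    }
    this.failures += 1;
    if (this.failures >= this.threshold) this.openedAt = now;
  }
}
```

Spreading retries across the whole interval, instead of adding a small jitter to a fixed delay, is what keeps many clients from retrying at the same instant.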

Real-World Test Results

Every tool tested against production websites with real anti-bot defenses:

| Tool | Target | Result |
|------|--------|--------|
| 📄 scrape | BBC News | Full markdown, stealth level 3 auto-escalation |
| 🕸️ crawl | Cloudflare Blog | 213K characters crawled with depth control |
| 🗺️ map | BBC | Full URL discovery via sitemap + link extraction |
| 🕷️ extract | Amazon (AirPods Pro 2) | Product title, 45,297 reviews, brand extracted |
| 📖 readability | Medium article | Clean: title, author, content, publish date |
| 📸 screenshot | ProductHunt | Captured Cloudflare Turnstile challenge page |
| 🔍 search | Brave Web | Web results with snippets and URLs |
| 📰 news_search | Brave News | News results with freshness ranking |
| 🖼️ image_search | Brave Image | Images with thumbnails and source URLs |
| 🎬 video_search | Brave Video | Video results across platforms |
| 🛠️ create_skill | Hacker News | Auto-detected 30 stories with CSS selectors |
| ▶️ run_skill | Saved skill | Fresh structured data from saved config |
| 📋 list_skills | - | Lists all skills with configurations |
| 🔓 discover_apis | Airbnb Paris | 34 hidden APIs: DataDome, Google Maps key, internal APIs |
| ⚡ query_api | jsonplaceholder | Direct JSON API call with stealth headers |
| 📡 monitor_websocket | Binance BTC/USDT | 3 WebSocket connections, 23 live messages; BTC price live |
| 🧠 ai_extract | Amazon product | AI extracted name, price, rating, review count |
| 🎯 snapshot | GitHub, Wikipedia | ARIA tree with 107/113 refs, annotated screenshots |
| 🖱️ interact | Login flow | Click → type → submit; ref targeting, session encryption, 18 action types |
| 📦 batch_scrape | 10 news sites | Parallel, concurrency 3, soft failure, 9/10 succeeded |
| 📋 list_jobs | - | Batch jobs with status and progress |
| 📊 job_status | Batch job | Full per-URL results with timing |
| 🗑️ delete_job | Completed job | Cleaned up job data from disk |
| 🧠 knowledge | Local knowledge file | Per-domain stats: stealth levels, success rates, anti-bot systems detected |
| 🎬 youtube | "web scraping tutorial" | Search results, video details, comments, transcripts; no API key |
| 💬 reddit | r/webscraping | Subreddit posts, comments, search via public JSON API |
| 📸 instagram | @nike profile | Profile details, engagement rate, recent posts via internal API |
| 📥 download | YouTube video, web page images | Auto-detect URL type, download media files: images, video, og:image |
| 📡 rss | Hacker News RSS | Parsed feed items with title, link, date, author, categories |

41 tools. 34 hidden APIs on Airbnb. Live BTC feed. Reusable browser flows. Zero API keys for scraping.

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| BRAVE_API_KEY | No | Brave Search API key (enables 4 search tools) |
| TWOCAPTCHA_API_KEY | No | 2Captcha API key (enables auto CAPTCHA solving) |
| LLM_API_KEY | No | Anthropic, OpenAI, or MiniMax API key (enables ai_extract) |
| LLM_PROVIDER | No | anthropic, openai, or minimax (default: anthropic). Recommended: minimax with MiniMax-M1 (best price/performance for extraction) |
| LLM_MODEL | No | Override default LLM model |
| OPENAI_API_KEY | No | OpenAI API key for Whisper transcription (transcribe any YouTube video without captions) |
| SESSION_ENCRYPTION_KEY | No | 32-byte hex key for encrypting session files at rest |
| PROXY_URL | No | Single proxy URL (http/https/socks4/socks5) |
| PROXY_URLS | No | Comma-separated proxy URLs for rotation |
| BROWSER_POOL_SIZE | No | Max pooled browser instances (default: 3) |
| RESPECT_ROBOTS | No | Respect robots.txt (default: true) |
| CHROME_PROFILE_PATH | No | Chrome user data dir for authenticated sessions |
| NO_COLOR | No | Disable colored output |
| CI | No | Auto-detected; disables TTY features |

Development

git clone https://github.com/ceoimperiumprojects/imperium-crawl
cd imperium-crawl
npm install
npm run build
npm run dev         # Watch mode (rebuild on changes)
npm test            # 546 tests
npm start           # Start CLI (shows help or TUI)

Contributing

Contributions welcome! Whether it's a bug fix, new tool, or documentation improvement — open an issue or PR.

# Fork the repo, then:
git clone https://github.com/YOUR_USERNAME/imperium-crawl
cd imperium-crawl
npm install
git checkout -b my-feature
# Make changes...
npm test
git push origin my-feature
# Open a PR

License

MIT — use it however you want. Free forever.
