lupin-cli

v0.2.0

Published

21 days ago

Adaptive scraper with HTTP-first routing, Camoufox headless escalation, and Patchright fallback.

scraping camoufox playwright patchright browser-automation mcp

npm install -g lupin-cli
lupin setup
lupin fetch https://www.nytimes.com/ --format markdown

Why Lupin?

Most web pages don't need a stealth browser, but when one does, you shouldn't have to work that out yourself. A scraping pipeline can succeed over plain HTTP ten times in a row and then fail on the eleventh request. Lupin handles this with automatic escalation.

HTTP (fast, ~0.2s) ──→ Blocked ? ──→ Camoufox (stealth Firefox) ──→ Blocked ? ──→ Patchright (stealth Chrome)

Lupin starts with a plain HTTP request. If the response looks blocked (a Cloudflare challenge, an empty body, a bot-detection page), it automatically escalates through two heavily patched stealth browsers in turn: Camoufox, an anti-fingerprint Firefox fork, then Patchright, a patched Chromium that passes every major bot detector. Using two different engines (chosen for their efficiency), maintained by two different teams, reduces the risk that a request is suddenly blocked on every engine at once.
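The escalation ladder described above can be sketched in TypeScript roughly as follows. This is an illustration, not Lupin's actual implementation: the `looksBlocked` heuristics, the marker strings, the status codes, and the `EngineFn` stub signature are all assumptions, and the engines are modelled as synchronous functions only to keep the example short.

```typescript
// Illustrative sketch only: names and heuristics are assumptions,
// not Lupin's internal API.
type Engine = "http" | "camoufox" | "patchright";

interface EngineResponse {
  status: number;
  body: string;
}

// Stand-in for "fetch this URL with a given engine". Real engines would be async.
type EngineFn = (url: string) => EngineResponse;

// Order taken from the diagram: HTTP first, then Camoufox, then Patchright.
const ESCALATION_ORDER: Engine[] = ["http", "camoufox", "patchright"];

// Hypothetical markers for challenge and bot-detection pages.
const BLOCK_MARKERS = ["just a moment", "checking your browser", "captcha", "access denied"];

function looksBlocked(res: EngineResponse): boolean {
  if ([403, 429, 503].includes(res.status)) return true; // assumed "blocked" statuses
  if (res.body.trim().length === 0) return true; // empty body
  const lower = res.body.toLowerCase();
  return BLOCK_MARKERS.some((marker) => lower.includes(marker));
}

function fetchWithEscalation(
  url: string,
  engines: Record<Engine, EngineFn>,
): { engine: Engine; response: EngineResponse } | null {
  for (const engine of ESCALATION_ORDER) {
    const response = engines[engine](url);
    if (!looksBlocked(response)) return { engine, response };
  }
  return null; // every engine was blocked
}

// Usage with stub engines: HTTP is blocked, Camoufox succeeds.
const stubs: Record<Engine, EngineFn> = {
  http: () => ({ status: 403, body: "Access denied" }),
  camoufox: () => ({ status: 200, body: "<html>Real content</html>" }),
  patchright: () => ({ status: 200, body: "<html>Real content</html>" }),
};
const result = fetchWithEscalation("https://example.com", stubs);
console.log(result?.engine); // "camoufox"
```

Keeping the block check separate from the loop means the heuristics can be tuned without touching the escalation order.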

Domains that needed escalation are remembered with their engine. Next time, Lupin skips straight to the engine that worked (24h sticky memory). Over time, your scraping gets faster automatically.
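One way to picture the 24h sticky memory is a small map from domain to the engine that last worked, with a timestamp used for expiry. The class and method names below are hypothetical, and the sketch says nothing about how Lupin actually stores this state.

```typescript
// Illustrative sketch only: class and method names are hypothetical.
type Engine = "http" | "camoufox" | "patchright";

const STICKY_TTL_MS = 24 * 60 * 60 * 1000; // 24h, as described above

class StickyEngineMemory {
  private readonly entries = new Map<string, { engine: Engine; storedAt: number }>();

  // Record the engine that succeeded for a domain.
  remember(domain: string, engine: Engine, now: number = Date.now()): void {
    this.entries.set(domain, { engine, storedAt: now });
  }

  // Return the remembered engine, or "http" when nothing fresh is stored.
  startingEngine(domain: string, now: number = Date.now()): Engine {
    const entry = this.entries.get(domain);
    if (!entry) return "http";
    if (now - entry.storedAt >= STICKY_TTL_MS) {
      this.entries.delete(domain); // expired: start from plain HTTP again
      return "http";
    }
    return entry.engine;
  }
}

// Usage with explicit timestamps (milliseconds) so the behaviour is deterministic.
const memory = new StickyEngineMemory();
memory.remember("example.com", "camoufox", 0);
console.log(memory.startingEngine("example.com", 60000)); // "camoufox"
console.log(memory.startingEngine("example.com", STICKY_TTL_MS + 1)); // "http"
```

Passing the clock in as a parameter keeps the expiry logic easy to test.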

This matters because:

- Most requests go through plain HTTP, which is 10-20x cheaper in time and bandwidth than a headless browser.
- Requests use a stealth browser only when HTTP has been exhausted or the domain is known to be hard.
- The result is faster, somewhat more reliable scraping and lower egress and proxy costs across your projects.
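A back-of-the-envelope model makes the saving concrete. The ~0.2s HTTP figure comes from the diagram above and the browser cost is taken as the midpoint of the 10-20x range; the 90% HTTP share is a made-up input for illustration, not a measurement.

```typescript
// Back-of-the-envelope latency model. The 90% HTTP share is a hypothetical input.
function meanSecondsPerRequest(
  httpShare: number,
  httpSeconds: number,
  browserSeconds: number,
): number {
  // Requests that escalate pay for the failed HTTP attempt plus the browser fetch.
  const escalatedShare = 1 - httpShare;
  return httpShare * httpSeconds + escalatedShare * (httpSeconds + browserSeconds);
}

const httpSeconds = 0.2; // "~0.2s" from the diagram
const browserSeconds = httpSeconds * 15; // midpoint of the 10-20x range
const escalating = meanSecondsPerRequest(0.9, httpSeconds, browserSeconds);
const browserOnly = browserSeconds;

console.log(escalating.toFixed(2), browserOnly.toFixed(2)); // 0.50 3.00
```

Under these assumptions the escalating pipeline averages about 0.5s per request against 3s for a browser-only pipeline. With sticky memory, repeat requests to a known-hard domain would skip the failed HTTP attempt as well.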

Benchmark

Benchmark as of 2026-04-07, on 25 real-world targets considered hard. These results are not definitive: anti-bot protections evolve constantly, and a website that was crawlable one day may be blocked the next.

| Site | Lupin | Crawlee | Scrapling | Crawl4AI | Exa MCP | Claude Code fetch() |
| --------------- | --------- | --------- | --------- | --------- | --------- | ------------------- |
| Reuters | ✅ | ✅ | ❌ | ❌ | ✅ | ❌ |
| Bloomberg | ✅ | ❌ | ✅ | ❌ | ✅ | ❌ |
| NY Times | ✅ | ❌ | ✅ | ❌ | ❌ | ❌ |
| Booking.com | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Zillow | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| TikTok | ✅ | ✅ | ❌ | ❌ | ❌ | ❌ |
| Indeed | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| ScienceDirect | ✅ | ✅ | ✅ | ✅ | — | — |
| Reddit | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Instagram | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| YouTube | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| X.com | ✅ | ✅ | ✅ | ❌ | ❌ | ❌ |
| Pinterest | ✅ | ✅ | ✅ | ✅ | ❌ | ❌ |
| Amazon | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| LinkedIn | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Washington Post | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| Medium | ✅ | ❌ | ✅ | ✅ | ✅ | ✅ |
| Cloudflare | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Polymarket | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Airbnb | ✅ | ✅ | ✅ | ✅ | ✅ | ❌ |
| eBay | ✅ | ✅ | ❌ | ✅ | ✅ | ❌ |
| ArXiv | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Wikipedia | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Craigslist | ✅ | ✅ | ✅ | ✅ | — | — |
| example.com | ✅ | ✅ | ✅ | ✅ | ✅ | ✅ |
| Score | 25/25 | 22/25 | 21/25 | 19/25 | 17/23 | 7/23 |

Benchmark run 2026-04-07. Crawlee uses PlaywrightCrawler, Scrapling uses curl_cffi (HTTP-only), Claude Code uses the native fetch web function, Crawl4AI uses Playwright via patchright. Exa MCP and CC fetch tested on 23 of 25 URLs (— = not tested). Please note that in our tests, some heavily protected websites still fail after 4-5 consecutive attempts; these websites need either proxy rotation or more custom fingerprinting.

Built-in web search

Lupin provides built-in web search as a convenience, with DuckDuckGo and Google as supported engines. DuckDuckGo is the default and was the most reliable in our tests.

# Search the web (default engine: DuckDuckGo)
lupin search web "best open source web scraping tools" --limit 10

# Search a specific site with most recent results first and in markdown format
lupin search web "agent memory" --site docs.anthropic.com --sort recent --format markdown

Popular social media platforms

Lupin provides built-in scrapers for the 8 most popular social platforms, using web search as a source for links. No API keys and no cookie exports required.

lupin search x "from:elonmusk AI" --limit 5
lupin search tiktok "productivity hacks" --limit 10
lupin search instagram "street photography" --limit 5
lupin fetch reddit https://reddit.com/r/node/comments/abc --max-comments 20

| Platform | Search | Fetch | Method |
| ------------ | ------ | ----- | ------------------ |
| Web / Google | ✅ | ✅ | Browser |
| X / Twitter | ✅ | ✅ | Browser |
| Reddit | ✅ | ✅ | HTTP only |
| Hacker News | ✅ | ✅ | HTTP only |
| YouTube | ✅ | ✅ | HTTP only |
| Instagram | ✅ | ✅ | Browser for search |
| TikTok | ✅ | ✅ | Browser for search |
| Polymarket | ✅ | ✅ | HTTP only |

Platform scrapers are provided as a convenience. You can install/uninstall them at any time. Please note that scrapers for popular platforms often change and require updates (see below). Need a site that isn't built in? You can build your own installable platform package. See the custom platform guide.

Platform updates and health checks

Social sites change often. Lupin separates platform health from core scraping so you can see what is installed, check whether a provider still works, and update platform packages when fixes ship.

# Show installed platforms, source, status, and version
lupin platform list

# Check whether Lupin core or platform packages have updates
lupin update check
lupin platform update --check

# Run manifest/tool checks for every platform
lupin platform doctor --all

# Run live smoke checks against known public targets
lupin platform doctor --all --smoke

Quick start

npm install -g lupin-cli
lupin setup               # installs browser engines
lupin setup --with-video  # adds yt-dlp + FFmpeg for video download
lupin doctor              # shows what's ready

# Scrape any page
lupin fetch https://example.com

# Output as markdown (for LLMs, RAG pipelines)
lupin fetch https://example.com --format markdown

# Output as JSON (for scripts/crawl)
lupin fetch https://example.com --format json

# Output as HTML (for scripts/crawl)
lupin fetch https://example.com --format html

# Search the web
lupin search web "best web scraping library 2026"

# Crawl an entire site
lupin crawl https://docs.example.com --depth 2 --limit 50 --format markdown -o docs.jsonl

# Extract structured data with an LLM
lupin fetch https://example.com --schema '{"type":"object","properties":{"title":{"type":"string"}}}'

# Download YT/TikTok/Instagram video content
lupin download "https://www.youtube.com/watch?v=dQw4w9WgXcQ"
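The crawl command above writes its results to docs.jsonl. A script consuming that file could parse it along these lines, assuming the file holds one JSON object per line. The `title` and `status` fields are assumptions about the record shape (borrowed from the fetch output described elsewhere in this README), so check the actual output before relying on them.

```typescript
// Sketch: parse a JSON Lines string (one JSON object per line).
// The record fields ("title", "status") are assumptions about the output shape.
interface PageRecord {
  title?: string;
  status?: number;
  [key: string]: unknown;
}

function parseJsonLines(contents: string): PageRecord[] {
  return contents
    .split("\n")
    .map((line) => line.trim())
    .filter((line) => line.length > 0) // tolerate blank lines and a trailing newline
    .map((line) => JSON.parse(line) as PageRecord);
}

// In a real script the string would come from reading docs.jsonl from disk.
const sample = '{"title":"Intro","status":200}\n{"title":"Setup","status":200}\n';
const pages = parseJsonLines(sample);
console.log(pages.map((page) => page.title)); // [ 'Intro', 'Setup' ]
```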

Docker

docker build -t lupin .
docker run --rm -i lupin fetch https://example.com
docker run --rm -i lupin --mcp

HTTP-only flows (fetch in auto mode, search reddit, search hn, search youtube) work before browser setup.

Using in AI Agents

We recommend that your agents use Lupin as a CLI or as an MCP server. Both let your agents scrape, search, browse and crawl.

CLI Setup (recommended, less token usage, similar features)

Claude Code / Codex / OpenCode / Hermes / OpenClaw: add instructions to your AGENTS.md:

## Web Scraping

This project uses `lupin-cli` for web scraping. Run `lupin --help` for full usage.

Common commands:
- `lupin fetch <url>`: scrape any page (returns JSON with text, title, status)
- `lupin fetch <url> --format markdown`: get clean LLM-ready markdown
- `lupin search web "query"`: web search
- `lupin search x "query"`: search X/Twitter without API keys
- `lupin search reddit "query"`: search Reddit

This setup uses ~90% fewer tokens than the MCP server and works with any agent that can run shell commands.
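If your agent harness wraps shell commands in typed tool functions, a small argument builder keeps the invocation safe. The sketch below uses only flags that appear in this README (`--format`, `--proxy`); the function name and option shape are hypothetical.

```typescript
// Sketch of a helper that builds argv for `lupin fetch`.
// Function name and option shape are hypothetical; flags are those shown in this README.
type OutputFormat = "markdown" | "json" | "html";

interface FetchOptions {
  format?: OutputFormat;
  proxy?: string;
}

function buildFetchArgs(url: string, options: FetchOptions = {}): string[] {
  const args = ["fetch", url];
  if (options.format) args.push("--format", options.format);
  if (options.proxy) args.push("--proxy", options.proxy);
  return args;
}

// Pass the array to child_process.execFile("lupin", args) rather than building a shell
// string, so URLs containing characters such as & or ? need no quoting.
console.log(buildFetchArgs("https://example.com/?a=1&b=2", { format: "markdown" }));
```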

MCP Setup

Claude Desktop — add to ~/Library/Application Support/Claude/claude_desktop_config.json:

{
  "mcpServers": {
    "lupin": {
      "command": "npx",
      "args": ["lupin-cli", "--mcp"]
    }
  }
}

Cursor / other MCP clients:

{
  "command": "npx",
  "args": ["lupin-cli", "--mcp"]
}

Available tools in MCP

| Category | Tools |
| ------------ | ----- |
| Search (9) | search_web, search_google, search_x, search_reddit, search_hn, search_youtube, search_polymarket, search_instagram, search_tiktok |
| Fetch (10) | fetch_page, fetch_x_post, fetch_reddit_post, fetch_hn_item, fetch_polymarket_market, fetch_youtube_video, fetch_instagram_post, fetch_instagram_profile, fetch_tiktok_post, fetch_tiktok_profile |
| Browser (10) | browser_open_session, browser_navigate, browser_click, browser_type, browser_press, browser_wait_for, browser_snapshot, browser_extract, browser_screenshot, browser_close_session |
| Site (2) | crawl_site, map_site |
| Video (1) | download_video (requires lupin setup --with-video) |

Use as a Library

import { Lupin } from "lupin-cli";

const scraper = new Lupin();

try {
  const result = await scraper.scrape("https://example.com");
  console.log(result.engine, result.confidence, result.text.slice(0, 300));
} finally {
  await scraper.close();
}

One-shot convenience:

import { scrapePage } from "lupin-cli";

const result = await scrapePage("https://example.com", { engine: "auto" });

LLM summarization and structured schemas

Like Firecrawl and similar tools, Lupin can wire in an LLM (Ollama or any OpenAI-compatible endpoint) to extract structured data from any page, returning the content as summarized markdown or structured JSON.

# Free-form extraction
lupin fetch <url> --extract "what are the prices?"

# Structured extraction with JSON Schema
lupin fetch <url> --schema '{"type":"object","properties":{"title":{"type":"string"},"price":{"type":"number"}}}'

# Multimodal: analyze images and video from platform posts
lupin fetch instagram <url> --extract "what brands are visible in the image?"
lupin fetch youtube <url> --extract "list the products shown in this video"
lupin fetch tiktok <url> --schema '{"type":"object","properties":{"products_shown":{"type":"array","items":{"type":"string"}}}}'

# Text-only extraction on any platform
lupin fetch reddit <url> --extract "summarize the top comments"

# Per-page extraction during crawls
lupin crawl https://docs.example.com --extract "summarize" --llm ollama

For platform providers (Instagram, TikTok, YouTube, X), the model receives the actual images and video alongside text, not just metadata. You can ask about what's in a photo or video, not just what the caption says.
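Model output does not always match the requested shape, so callers may want a defensive check on the parsed result. The sketch below validates the title/price schema from the structured-extraction example above. It treats both properties as required, which is stricter than that schema (it declares no `required` list), and it makes no claim about how Lupin itself reports schema violations.

```typescript
// Sketch: defensive check that a parsed extraction result has the expected shape.
// Both fields are treated as required here, which is stricter than the example schema.
interface Product {
  title: string;
  price: number;
}

function isProduct(value: unknown): value is Product {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record.title === "string" &&
    typeof record.price === "number" &&
    Number.isFinite(record.price)
  );
}

const raw = '{"title":"Widget","price":19.99}'; // hypothetical model output
const parsed: unknown = JSON.parse(raw);
console.log(isProduct(parsed)); // true
console.log(isProduct({ title: "Widget", price: "19.99" })); // false
```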

Recommended setup: Ollama (free, local LLM, no API keys; requires 2-4GB of VRAM)

ollama pull qwen3.5:4b
lupin llm add ollama --base-url http://localhost:11434/v1 --model qwen3.5:4b --default

Alternative: OpenAI / OpenRouter-like endpoint

export OPENROUTER_API_KEY=sk-or-...

lupin llm add openrouter \
  --base-url https://openrouter.ai/api/v1 \
  --api-key '${OPENROUTER_API_KEY}' \
  --model qwen/qwen3.5-9b \
  --default

Also supports any OpenAI-compatible endpoint. See LLM extraction docs for all options.

Video, audio & social content download

Lupin can download video or audio from YouTube, TikTok, Instagram, and 1000+ other sites. It relies on yt-dlp, which is installed by lupin setup --with-video.

lupin setup --with-video                                      # one-time setup
lupin download "https://www.youtube.com/watch?v=dQw4w9WgXcQ"  # video as MP4
lupin download <url> --audio-only                             # extract MP3
lupin download <url> --subtitles                              # grab subs too

Content is downloaded temporarily into ~/.lupin/; yt-dlp will auto-update on each run.

Proxy Support

Lupin can route fetch, search, and crawl traffic through a single proxy or a rotating proxy list.

lupin fetch https://example.com --proxy socks5://127.0.0.1:1080
lupin search web "agentic AI" --proxy http://user:pass@host:port
lupin crawl https://example.com --proxy-list proxies.txt --proxy-rotate sticky-domain
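One plausible reading of the sticky-domain rotation mode is that every request to the same hostname goes through the same proxy from the list. The sketch below implements that policy with a simple string hash. It is an assumption about the behaviour, not Lupin's documented algorithm, and the proxy addresses are placeholders.

```typescript
// Sketch of one possible "sticky-domain" policy: the same hostname always maps
// to the same proxy. This is an assumed behaviour, not Lupin's documented algorithm.
function hashString(text: string): number {
  let hash = 0;
  for (let i = 0; i < text.length; i++) {
    hash = (hash * 31 + text.charCodeAt(i)) >>> 0; // keep as unsigned 32-bit
  }
  return hash;
}

function pickProxy(url: string, proxies: string[]): string {
  if (proxies.length === 0) throw new Error("proxy list is empty");
  const hostname = new URL(url).hostname;
  return proxies[hashString(hostname) % proxies.length];
}

// Placeholder proxy addresses for illustration.
const proxies = ["http://proxy-a:8080", "http://proxy-b:8080", "http://proxy-c:8080"];
const first = pickProxy("https://example.com/page/1", proxies);
const second = pickProxy("https://example.com/page/2", proxies);
console.log(first === second); // true: same hostname, same proxy
```

Hashing the hostname keeps the mapping stable without storing per-domain state, although it reshuffles assignments whenever the list length changes.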

Docs

| Document | Description |
| ---------------- | ----------------------------------------------------- |
| CLI Reference | Full flag reference for every command |
| Configuration | Environment variables, result schemas, engine routing |
| Custom Platforms | Build, install, and share your own Lupin platforms |

Tests

npm test           # local/fixture suite
npm run test:live  # public-site verification
npm run test:all   # both

License

MIT