# @agentcomputer/torch v0.1.4

🔥 The self-healing AI scraping agent 🔥
Point torch at a URL → it writes a scraper → it writes the playbook → it ships the playbook. When the site changes and the playbook breaks → torch redoes recon → updates the playbook → ships again.
```bash
curl -fsSL https://raw.githubusercontent.com/AgentComputerAI/torch/main/install.sh | sh
```

Point it at any website. It does the rest.
```bash
torch https://news.ycombinator.com
```

→ recon, framework detection, anti-bot escalation, extraction, and a reusable `skills/sites/hackernews/SKILL.md` playbook, written by torch, for torch.
## 🎯 What torch actually does
You give it a URL. Torch does all of the following autonomously while you get coffee:
```
URL ──┐
      ▼
┌───────────────────────────────────────────────────────────────────────┐
│ 🕵️ Recon        curl it, detect framework (Next.js / Shopify / SPA / etc)
│ 🔍 Reverse eng  find hidden APIs, decrypt encrypted endpoints, trace WS
│ 🧩 Strategy     pick lightest approach: API → sitemap → cheerio → browser
│ 🛡️ Evasion      real Chrome profile → stealth → solver → proxy (escalating)
│ ⚙️ Extract      write scraper, run as background process, validate output
│ 📋 Playbook     save what worked to skills/sites/<slug>/SKILL.md forever
│ 🔁 Propagate    prompt user to PR the skill back upstream
└───────────────────────────────────────────────────────────────────────┘
      ▼
./output/<slug>.json + ./skills/sites/<slug>/SKILL.md
```

Torch doesn't just scrape HTML: it reverse-engineers sites. It reads obfuscated JS, extracts API endpoints, probes encrypted CloudFront payloads, establishes WebSocket sessions, and builds custom scrapers against internal APIs that were never meant to be public. When a site encrypts its data with NaCl `crypto_secretbox`, torch extracts the keys from the page source and attempts decryption autonomously.
The killer move is playbook persistence. Every site torch figures out becomes a reusable skill that future runs read first, skipping recon entirely. The skills that ship with this repo were all generated by torch itself, driving itself via RPC mode against a real Chrome profile.
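The lookup itself is cheap compared to recon. A minimal sketch of the idea in TypeScript (`slugFor` and `playbookPathFor` are hypothetical illustrations, not torch's actual helpers; the real slug is chosen by the agent, e.g. `hackernews`):

```typescript
// Naive hostname-based slug as a stand-in for the agent-chosen slug.
function slugFor(url: string): string {
  return new URL(url).hostname
    .replace(/^www\./, "")
    .replace(/[^a-zA-Z0-9]+/g, "-")
    .toLowerCase();
}

// A run checks this path first and skips recon when the file exists.
function playbookPathFor(url: string): string {
  return `skills/sites/${slugFor(url)}/SKILL.md`;
}

console.log(playbookPathFor("https://news.ycombinator.com"));
// → skills/sites/news-ycombinator-com/SKILL.md
```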
## ⚡ Quick start
Install with the one-liner above, or from source:
```bash
git clone https://github.com/AgentComputerAI/torch
cd torch
npm install
npm run build
npm install -g .
```

```bash
# Interactive session - chat with torch about what to scrape
torch

# One-shot - point and shoot
torch https://www.digikey.com/en/products/category/microcontrollers/685

# One-shot with a target description
torch https://www.amazon.com/s?k=mechanical+keyboard "top 30 keyboards with price, rating, reviews"

# JSONL RPC mode - drive torch from any language over stdin/stdout
torch --rpc
```

Torch auto-clones your Chrome profile on first run (one-time, ~10-30 s via rsync with cache exclusion, ~200 MB on disk), auto-launches Chrome with `--remote-debugging-port=9222`, and every subsequent run reuses the same Chrome instance instantly.
## 🧠 The core trick: real Chrome > stealth patches
Every other scraper fights the same losing battle: launch a fresh Chromium, patch navigator.webdriver, rotate a fake fingerprint, and lose anyway, because the site's bot scorer weighs reputation and browsing history more than any single fingerprint signal.
Torch flips it. On first run it clones your actual Chrome profile (excluding caches via rsync) into `~/.torch/chrome-profile`, then auto-launches a second Chrome instance against that clone with the debug port open on `127.0.0.1:9222`. When the scrape skill needs a browser, it does:
```typescript
import puppeteer from "puppeteer-core";

const browser = await puppeteer.connect({ browserURL: "http://127.0.0.1:9222" });
```

That browser has your cookies, your history, your TLS session state, your Client Hints. Amazon, Walmart, Target, eBay, Zillow, Booking, Airbnb, Costco: all landed on the first try with this approach. No stealth patches. No solvers. No proxies.
## 🖥️ Running on a VM or headless server
The real-Chrome-clone trick obviously can't work if there's no host Chrome to clone: VMs, CI boxes, remote scraping pods, Docker containers, anything without a logged-in user profile. On those machines torch falls back through two cheaper tiers:
1. **Camoufox** (if `TORCH_CAMOUFOX_ENDPOINT` is set): a Firefox fork with fingerprint spoofing patched into the engine at the C++ level. Unlike puppeteer-stealth's JS shims, Camoufox's patches are invisible to JavaScript, so anti-bot systems can't detect the tampering itself. Includes a built-in virtual display so it runs headfully on headless servers without xvfb. See the `camoufox` skill for the full integration playbook.

   ```bash
   # On your VM / CI base image, install once:
   pip install camoufox[geoip] && python -m camoufox fetch
   npm install playwright-core

   # Launch as a Playwright server torch connects to
   python -m camoufox server --port 4444 &
   echo "TORCH_CAMOUFOX_ENDPOINT=ws://127.0.0.1:4444" >> .env
   ```

2. **Disposable Chromium + puppeteer-extra-stealth** (no env var set, last-resort fallback): bundled with torch by default. Works for soft targets, gets blocked on anything with serious bot scoring. This is where the 9-layer anti-blocking ladder exists to fight its way through.
On your own laptop, real-Chrome-clone is the right answer and torch defaults to it. On a VM, install Camoufox and torch will transparently route browser scrapes through it instead.
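This fallback chain amounts to a simple decision based on what's available at startup. A hedged sketch (function and backend names are hypothetical, not torch's internals):

```typescript
// Three tiers, cheapest-to-detect first, as described above.
type Backend = "real-chrome" | "camoufox" | "stealth-chromium";

function pickBackend(
  env: Record<string, string | undefined>,
  hostChromeAvailable: boolean,
): Backend {
  if (hostChromeAvailable) return "real-chrome";       // laptop: clone + connect
  if (env.TORCH_CAMOUFOX_ENDPOINT) return "camoufox";  // VM/CI with a Camoufox server
  return "stealth-chromium";                           // last resort: the 9-layer ladder
}

console.log(pickBackend({ TORCH_CAMOUFOX_ENDPOINT: "ws://127.0.0.1:4444" }, false));
// → camoufox
```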
## 🔧 Skills
Torch is built on pi-coding-agent's skill system. Every capability is a SKILL.md the agent routes to on demand.
### Core skills
| Skill | Purpose |
|---|---|
| 🏷️ scrape | Full scraping workflow: recon, strategy, extraction, anti-blocking, playbook authoring |
| 🔍 reverse-engineer | Find hidden APIs, decrypt encrypted endpoints, trace WebSocket streams, extract auth tokens from obfuscated JS |
| 🦊 camoufox | Firefox fork with C++-level fingerprint spoofing; use on VMs / CI where real-Chrome-clone can't run |
| 🤖 2captcha | Solve reCAPTCHA v2/v3, Turnstile, hCaptcha via the 2Captcha API (human workers, ~$1/1k) |
| 🧠 capmonster | Cheaper AI-based solver with cf_clearance support (~$0.60/1k) |
| 🌐 proxy | Authenticated residential proxy integration: Oxylabs, Bright Data, Smartproxy, IPRoyal |
| 📬 agentmail | Disposable email inboxes for gated signup flows |
| 🤝 contributing | PR workflow and quality bar for sharing new site skills upstream |
### Site skills
All generated by torch itself via RPC mode against a real Chrome profile. Each documents detection signals, the strategy that worked, copy-pasteable stealth config, selectors and endpoints, an anti-blocking table, real data shape, pagination, and gotchas.
| Category | Sites |
|---|---|
| 📡 Public API (skip browser) | arxiv · github · hackernews · huggingface · pypi · reddit · stackoverflow · wikipedia |
| 📄 SSR / embedded JSON | apple · doordash · ikea · imdb · nike · producthunt |
| 🛒 E-commerce (real Chrome) | amazon · costco · ebay · etsy · homedepot · target · walmart |
| 🧳 Marketplace / travel | airbnb · booking · ubereats |
| 🏠 Real estate / local | redfin · yelp · zillow |
| 🛡️ Hardened (PerimeterX / DataDome / Akamai) | digikey · stockx |
### Adding a new site skill
Just run torch on it:
```bash
torch https://www.whatever.com
```

Torch does Phase 0 recon → Phase 1 framework detection → Phase 2 browser scraping if needed → writes `./output/<slug>.json` → writes `./skills/sites/<slug>/SKILL.md` → tells you to open a PR. If you do, the next torch user inherits your playbook automatically. A self-propagating knowledge base.
## 🛡️ Anti-blocking ladder
Torch escalates through these layers only as far as needed, stopping at the first one that works.
```
Layer 0  🔌  Connect to real Chrome at 127.0.0.1:9222 (auto-launched cloned profile)
Layer 1  💻  Headed mode + puppeteer-extra-plugin-stealth (fallback)
Layer 2  🎭  Realistic headers + randomized viewport + UA rotation
Layer 3  🍪  Cookie / session persistence across runs
Layer 4  🖱️  Behavioral mimicry (delays, scroll, mouse jitter)
Layer 5  ☁️  Cloudflare challenge handling + Turnstile detection
Layer 6  🤖  2captcha or capmonster solver invocation
Layer 7  🌐  Residential proxy rotation via the proxy skill
Layer 8  ⚡  Resource blocking (images/css/fonts) for speed
Layer 9  👨‍💻  Interactive fallback: opens site in your browser for manual click-through
```

Layer 0 solves 27 of the 29 shipped sites on its own.
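In code, the ladder is just an ordered try-until-success loop. A minimal sketch under assumed types (nothing here is torch's actual API; each layer is any async attempt returning data or null):

```typescript
type Layer = { name: string; attempt: () => Promise<unknown | null> };

// Try each layer in order, cheapest first; stop escalating at the first hit.
async function escalate(layers: Layer[]): Promise<{ layer: string; data: unknown }> {
  for (const l of layers) {
    const data = await l.attempt();
    if (data !== null) return { layer: l.name, data };
  }
  throw new Error("all layers exhausted");
}

const result = await escalate([
  { name: "real-chrome", attempt: async () => ({ items: 30 }) }, // Layer 0 usually wins,
  { name: "stealth", attempt: async () => null },                // so this never runs
]);
console.log(result.layer); // → real-chrome
```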
## 🔌 RPC mode
Drive torch from any language. Stream JSONL commands on stdin, get JSONL events on stdout.
```bash
(echo '{"type":"prompt","message":"scrape https://news.ycombinator.com"}'; sleep 300) | torch --rpc
```

See the pi-mono RPC docs for the full protocol. Commands: `prompt`, `steer`, `follow_up`, `abort`, `new_session`, `get_state`, `get_messages`, `set_model`, `cycle_model`, `set_thinking_level`.
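The wire format is plain JSONL: every command and event is a single JSON object terminated by a newline. A small illustrative sketch of the framing (the helper names are made up; the event names follow the protocol described above):

```typescript
// One JSON object per line in each direction.
function encode(cmd: Record<string, unknown>): string {
  return JSON.stringify(cmd) + "\n";
}

function* decode(chunk: string): Generator<Record<string, unknown>> {
  for (const line of chunk.split("\n")) {
    if (line.trim() !== "") yield JSON.parse(line);
  }
}

// A driver writes this frame to torch's stdin...
const frame = encode({ type: "prompt", message: "scrape https://news.ycombinator.com" });

// ...then reads stdout line by line until an agent_end event arrives.
for (const evt of decode('{"type":"agent_start"}\n{"type":"agent_end"}\n')) {
  console.log(evt.type);
}
```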
This is how the 29 site skills in this repo were generated: a small Node driver that spawns `torch --rpc`, sends one prompt per site, waits for `agent_end`, and moves on, 10 instances in parallel. The driver ships with the repo at `scripts/drive-torch.mjs`:
```bash
# Scrape one site, print every tool call the agent makes
node scripts/drive-torch.mjs --verbose amazon 'scrape https://www.amazon.com/s?k=mechanical+keyboard'

# Parallelize with xargs
printf '%s\n' hackernews reddit github | xargs -P 3 -I{} \
  node scripts/drive-torch.mjs {} 'scrape https://{}.com'
```

## 📦 Prerequisites
| Required | Optional |
|---|---|
| Node.js ≥ 20 | AgentMail API key: only for agentmail (gated signups) |
| Google Chrome (for real-profile scraping) | 2Captcha / CapMonster key: only when a target hits a captcha |
| Anthropic / OpenAI API key (for the agent brain) | Residential proxy creds: only when IP-banned |
The real Chrome auto-clone is optional but strongly recommended: it's the difference between landing on Amazon instantly and burning an hour fighting bot scores.
## 🏗️ Architecture
```
torch <url>
 │
 ├─ cli.ts                       parse args, load .env
 │   │
 │   ├─ ensureChromeEndpoint()   detect / clone / launch Chrome debug port
 │   │
 │   └─ spawn pi-coding-agent with:
 │       ├─ SYSTEM.md            invariants, scout mode, naming, cleanup
 │       ├─ skills/
 │       │   ├─ scrape/          reconnaissance + extraction workflow
 │       │   ├─ 2captcha/        solver API
 │       │   ├─ capmonster/      solver API
 │       │   ├─ proxy/           residential proxy patterns
 │       │   ├─ agentmail/       disposable inboxes
 │       │   ├─ camoufox/        Firefox-fork stealth (VM / CI fallback)
 │       │   ├─ contributing/    PR workflow
 │       │   └─ sites/<slug>/    per-site playbooks
 │       ├─ extensions/
 │       │   └─ header.ts        fire-themed terminal banner
 │       └─ pi-processes         background scrape process management
 │
 └─ output/<slug>.json + skills/sites/<slug>/SKILL.md
```

## 🧪 Development
```bash
npm run dev    # run via tsx, no build step
npm run build  # compile src/ → dist/
npm start      # run compiled entry
```

## 📜 License
MIT. See LICENSE. Built on pi-coding-agent by Mario Zechner.
🔥 Self-healing. Self-propagating. Self-improving. 🔥
Every site anyone figures out becomes a skill the whole community inherits. Every broken playbook auto-repairs itself on the next run.
Contribute back at github.com/AgentComputerAI/torch
