# @agentcomputer/torch v0.1.4

🔥 The self-healing AI scraping agent 🔥
Point torch at a URL → it writes a scraper → it writes the playbook → it ships the playbook. When the site changes and the playbook breaks → torch redoes recon → updates the playbook → ships again.
```bash
curl -fsSL https://raw.githubusercontent.com/AgentComputerAI/torch/main/install.sh | sh
```

Point it at any website. It does the rest.
```bash
torch https://news.ycombinator.com
```

→ recon, framework detection, anti-bot escalation, extraction, and a reusable `skills/sites/hackernews/SKILL.md` playbook, written by torch, for torch.
## 🎯 What torch actually does
You give it a URL. Torch does all of the following autonomously while you get coffee:
```
URL ──┐
      ▼
┌───────────────────────────────────────────────────────────────────────┐
│ 🕵️ Recon        curl it, detect framework (Next.js / Shopify / SPA / etc)
│ 🔍 Reverse eng  find hidden APIs, decrypt encrypted endpoints, trace WS
│ 🧩 Strategy     pick lightest approach: API → sitemap → cheerio → browser
│ 🛡️ Evasion      real Chrome profile → stealth → solver → proxy (escalating)
│ ⚙️ Extract      write scraper, run as background process, validate output
│ 📋 Playbook     save what worked to skills/sites/<slug>/SKILL.md forever
│ 🔁 Propagate    prompt user to PR the skill back upstream
└───────────────────────────────────────────────────────────────────────┘
      ▼
./output/<slug>.json + ./skills/sites/<slug>/SKILL.md
```

Torch doesn't just scrape HTML: it reverse-engineers sites. It reads obfuscated JS, extracts API endpoints, probes encrypted CloudFront payloads, establishes WebSocket sessions, and builds custom scrapers against internal APIs that were never meant to be public. When a site encrypts its data with NaCl `crypto_secretbox`, torch extracts the keys from the page source and attempts decryption autonomously.
The killer move is playbook persistence. Every site torch figures out becomes a reusable skill that future runs read first, skipping recon entirely. The skills that ship with this repo were all generated by torch itself, driving itself via RPC mode against a real Chrome profile.
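The lookup itself is cheap compared to recon. A minimal sketch of the idea in TypeScript (`slugFor` and `playbookPathFor` are hypothetical illustrations, not torch's actual helpers; the real slug is chosen by the agent, e.g. `hackernews`):

```typescript
// Naive hostname-based slug as a stand-in for the agent-chosen slug.
function slugFor(url: string): string {
  return new URL(url).hostname
    .replace(/^www\./, "")
    .replace(/[^a-zA-Z0-9]+/g, "-")
    .toLowerCase();
}

// A run checks this path first and skips recon when the file exists.
function playbookPathFor(url: string): string {
  return `skills/sites/${slugFor(url)}/SKILL.md`;
}

console.log(playbookPathFor("https://news.ycombinator.com"));
// → skills/sites/news-ycombinator-com/SKILL.md
```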
## ⚡ Quick start
Install with the one-liner above, or from source:
```bash
git clone https://github.com/AgentComputerAI/torch
cd torch
npm install
npm run build
npm install -g .
```

```bash
# Interactive session - chat with torch about what to scrape
torch

# One-shot - point and shoot
torch https://www.digikey.com/en/products/category/microcontrollers/685

# One-shot with a target description
torch https://www.amazon.com/s?k=mechanical+keyboard "top 30 keyboards with price, rating, reviews"

# JSONL RPC mode - drive torch from any language over stdin/stdout
torch --rpc
```

Torch auto-clones your Chrome profile on first run (one-time, ~10-30 s via rsync with cache exclusion, ~200 MB on disk), auto-launches Chrome with `--remote-debugging-port=9222`, and every subsequent run reuses the same Chrome instance instantly.
## 🧠 The core trick: real Chrome > stealth patches
Every other scraper fights the same losing battle: launch a fresh Chromium, patch navigator.webdriver, rotate a fake fingerprint, and lose anyway, because the site's bot scorer weighs reputation and browsing history more than any single fingerprint signal.
Torch flips it. On first run it clones your actual Chrome profile (excluding caches via rsync) into `~/.torch/chrome-profile`, then auto-launches a second Chrome instance against that clone with the debug port open on `127.0.0.1:9222`. When the scrape skill needs a browser, it does:
```typescript
import puppeteer from "puppeteer-core";

const browser = await puppeteer.connect({ browserURL: "http://127.0.0.1:9222" });
```

That browser has your cookies, your history, your TLS session state, your Client Hints. Amazon, Walmart, Target, eBay, Zillow, Booking, Airbnb, Costco: all landed on the first try with this approach. No stealth patches. No solvers. No proxies.
## 🖥️ Running on a VM or headless server
The real-Chrome-clone trick obviously can't work if there's no host Chrome to clone: VMs, CI boxes, remote scraping pods, Docker containers, anything without a logged-in user profile. On those machines torch falls back through two cheaper tiers:
1. **Camoufox** (if `TORCH_CAMOUFOX_ENDPOINT` is set): a Firefox fork with fingerprint spoofing patched into the engine at the C++ level. Unlike puppeteer-stealth's JS shims, Camoufox's patches are invisible to JavaScript, so anti-bot systems can't detect the tampering itself. Includes a built-in virtual display so it runs headfully on headless servers without xvfb. See the `camoufox` skill for the full integration playbook.

   ```bash
   # On your VM / CI base image, install once:
   pip install camoufox[geoip] && python -m camoufox fetch
   npm install playwright-core

   # Launch as a Playwright server torch connects to
   python -m camoufox server --port 4444 &
   echo "TORCH_CAMOUFOX_ENDPOINT=ws://127.0.0.1:4444" >> .env
   ```

2. **Disposable Chromium + puppeteer-extra-stealth** (no env var set, last-resort fallback): bundled with torch by default. Works for soft targets, gets blocked on anything with serious bot scoring. This is where the 9-layer anti-blocking ladder exists to fight its way through.
On your own laptop, real-Chrome-clone is the right answer and torch defaults to it. On a VM, install Camoufox and torch will transparently route browser scrapes through it instead.
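This fallback chain amounts to a simple decision based on what's available at startup. A hedged sketch (function and backend names are hypothetical, not torch's internals):

```typescript
// Three tiers, cheapest-to-detect first, as described above.
type Backend = "real-chrome" | "camoufox" | "stealth-chromium";

function pickBackend(
  env: Record<string, string | undefined>,
  hostChromeAvailable: boolean,
): Backend {
  if (hostChromeAvailable) return "real-chrome";       // laptop: clone + connect
  if (env.TORCH_CAMOUFOX_ENDPOINT) return "camoufox";  // VM/CI with a Camoufox server
  return "stealth-chromium";                           // last resort: the 9-layer ladder
}

console.log(pickBackend({ TORCH_CAMOUFOX_ENDPOINT: "ws://127.0.0.1:4444" }, false));
// → camoufox
```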
## 🔧 Skills
Torch is built on pi-coding-agent's skill system. Every capability is a SKILL.md the agent routes to on demand.
### Core skills
| Skill | Purpose |
|---|---|
| 🏷️ scrape | Full scraping workflow: recon, strategy, extraction, anti-blocking, playbook authoring |
| 🔍 reverse-engineer | Find hidden APIs, decrypt encrypted endpoints, trace WebSocket streams, extract auth tokens from obfuscated JS |
| 🦊 camoufox | Firefox fork with C++-level fingerprint spoofing; use on VMs / CI where real-Chrome-clone can't run |
| 🤖 2captcha | Solve reCAPTCHA v2/v3, Turnstile, hCaptcha via the 2Captcha API (human workers, ~$1/1k) |
| 🧠 capmonster | Cheaper AI-based solver with cf_clearance support (~$0.60/1k) |
| 🌐 proxy | Authenticated residential proxy integration: Oxylabs, Bright Data, Smartproxy, IPRoyal |
| 📬 agentmail | Disposable email inboxes for gated signup flows |
| 🤝 contributing | PR workflow and quality bar for sharing new site skills upstream |
### Site skills
All generated by torch itself via RPC mode against a real Chrome profile. Each documents detection signals, the strategy that worked, copy-pasteable stealth config, selectors and endpoints, an anti-blocking table, real data shape, pagination, and gotchas.
| Category | Sites |
|---|---|
| 📡 Public API (skip browser) | arxiv · github · hackernews · huggingface · pypi · reddit · stackoverflow · wikipedia |
| 📄 SSR / embedded JSON | apple · doordash · ikea · imdb · nike · producthunt |
| 🛒 E-commerce (real Chrome) | amazon · costco · ebay · etsy · homedepot · target · walmart |
| 🧳 Marketplace / travel | airbnb · booking · ubereats |
| 🏠 Real estate / local | redfin · yelp · zillow |
| 🛡️ Hardened (PerimeterX / DataDome / Akamai) | digikey · stockx |
### Adding a new site skill
Just run torch on it:
```bash
torch https://www.whatever.com
```

Torch does Phase 0 recon → Phase 1 framework detection → Phase 2 browser scraping if needed → writes `./output/<slug>.json` → writes `./skills/sites/<slug>/SKILL.md` → tells you to open a PR. If you do, the next torch user inherits your playbook automatically. A self-propagating knowledge base.
## 🛡️ Anti-blocking ladder
Torch escalates through these layers only as far as needed, stopping at the first one that works.
```
Layer 0  🔌  Connect to real Chrome at 127.0.0.1:9222 (auto-launched cloned profile)
Layer 1  💻  Headed mode + puppeteer-extra-plugin-stealth (fallback)
Layer 2  🎭  Realistic headers + randomized viewport + UA rotation
Layer 3  🍪  Cookie / session persistence across runs
Layer 4  🖱️  Behavioral mimicry (delays, scroll, mouse jitter)
Layer 5  ☁️  Cloudflare challenge handling + Turnstile detection
Layer 6  🤖  2captcha or capmonster solver invocation
Layer 7  🌐  Residential proxy rotation via the proxy skill
Layer 8  ⚡  Resource blocking (images/css/fonts) for speed
Layer 9  👨‍💻  Interactive fallback: opens site in your browser for manual click-through
```

Layer 0 solves 27 of the 29 shipped sites on its own.
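In code, the ladder is just an ordered try-until-success loop. A minimal sketch under assumed types (nothing here is torch's actual API; each layer is any async attempt returning data or null):

```typescript
type Layer = { name: string; attempt: () => Promise<unknown | null> };

// Try each layer in order, cheapest first; stop escalating at the first hit.
async function escalate(layers: Layer[]): Promise<{ layer: string; data: unknown }> {
  for (const l of layers) {
    const data = await l.attempt();
    if (data !== null) return { layer: l.name, data };
  }
  throw new Error("all layers exhausted");
}

const result = await escalate([
  { name: "real-chrome", attempt: async () => ({ items: 30 }) }, // Layer 0 usually wins,
  { name: "stealth", attempt: async () => null },                // so this never runs
]);
console.log(result.layer); // → real-chrome
```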
## 🔌 RPC mode
Drive torch from any language. Stream JSONL commands on stdin, get JSONL events on stdout.
```bash
(echo '{"type":"prompt","message":"scrape https://news.ycombinator.com"}'; sleep 300) | torch --rpc
```

See the pi-mono RPC docs for the full protocol. Commands: `prompt`, `steer`, `follow_up`, `abort`, `new_session`, `get_state`, `get_messages`, `set_model`, `cycle_model`, `set_thinking_level`.
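The wire format is plain JSONL: every command and event is a single JSON object terminated by a newline. A small illustrative sketch of the framing (the helper names are made up; the event names follow the protocol described above):

```typescript
// One JSON object per line in each direction.
function encode(cmd: Record<string, unknown>): string {
  return JSON.stringify(cmd) + "\n";
}

function* decode(chunk: string): Generator<Record<string, unknown>> {
  for (const line of chunk.split("\n")) {
    if (line.trim() !== "") yield JSON.parse(line);
  }
}

// A driver writes this frame to torch's stdin...
const frame = encode({ type: "prompt", message: "scrape https://news.ycombinator.com" });

// ...then reads stdout line by line until an agent_end event arrives.
for (const evt of decode('{"type":"agent_start"}\n{"type":"agent_end"}\n')) {
  console.log(evt.type);
}
```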
This is how the 29 site skills in this repo were generated: a small Node driver that spawns `torch --rpc`, sends one prompt per site, waits for `agent_end`, and moves on, 10 instances in parallel. The driver ships with the repo at `scripts/drive-torch.mjs`:
```bash
# Scrape one site, print every tool call the agent makes
node scripts/drive-torch.mjs --verbose amazon 'scrape https://www.amazon.com/s?k=mechanical+keyboard'

# Parallelize with xargs
printf '%s\n' hackernews reddit github | xargs -P 3 -I{} \
  node scripts/drive-torch.mjs {} 'scrape https://{}.com'
```

## 📦 Prerequisites
| Required | Optional |
|---|---|
| Node.js ≥ 20 | AgentMail API key: only for agentmail (gated signups) |
| Google Chrome (for real-profile scraping) | 2Captcha / CapMonster key: only when a target hits a captcha |
| Anthropic / OpenAI API key (for the agent brain) | Residential proxy creds: only when IP-banned |
The real Chrome auto-clone is optional but strongly recommended: it's the difference between landing on Amazon instantly and burning an hour fighting bot scores.
## 🏗️ Architecture
```
torch <url>
 │
 ├─ cli.ts                       parse args, load .env
 │   │
 │   ├─ ensureChromeEndpoint()   detect / clone / launch Chrome debug port
 │   │
 │   └─ spawn pi-coding-agent with:
 │       ├─ SYSTEM.md            invariants, scout mode, naming, cleanup
 │       ├─ skills/
 │       │   ├─ scrape/          reconnaissance + extraction workflow
 │       │   ├─ 2captcha/        solver API
 │       │   ├─ capmonster/      solver API
 │       │   ├─ proxy/           residential proxy patterns
 │       │   ├─ agentmail/       disposable inboxes
 │       │   ├─ camoufox/        Firefox-fork stealth (VM / CI fallback)
 │       │   ├─ contributing/    PR workflow
 │       │   └─ sites/<slug>/    per-site playbooks
 │       ├─ extensions/
 │       │   └─ header.ts        fire-themed terminal banner
 │       └─ pi-processes         background scrape process management
 │
 └─ output/<slug>.json + skills/sites/<slug>/SKILL.md
```

## 🧪 Development
```bash
npm run dev    # run via tsx, no build step
npm run build  # compile src/ → dist/
npm start      # run compiled entry
```

## 📜 License
MIT. See LICENSE. Built on pi-coding-agent by Mario Zechner.
🔥 Self-healing. Self-propagating. Self-improving. 🔥
Every site anyone figures out becomes a skill the whole community inherits. Every broken playbook auto-repairs itself on the next run.
Contribute back at github.com/AgentComputerAI/torch
