sitepull
Reverse-engineer a hosted web app and run it locally. Auto-detects SPA vs MPA, vendors every asset, generates a zero-dependency local server with safe stubs for any backend endpoints found.
npx sitepull https://example.com

v0.2 adds: --diff (re-audit + change report), --beautify (pretty-print bundles), --browser (Playwright for JS-heavy sites), --auth-flow (interactive login), and an MCP server (sitepull-mcp) so AI agents can call it as a tool.
▶ sitepull v0.2.1
Target: https://example.com/
Output: ./audits/example.com/
[1/6] Reconnaissance
✓ HTTP 200, server: ECS
· Scripts: 0, stylesheets: 0, icons: 0, images: 0
[2/6] Detecting site type (SPA vs MPA)
✓ Detected: MPA
→ 3/3 unknown paths returned 404
[3/6] Probing endpoints
✓ Found 1 real endpoints (out of 38 probed)
[4/6] Crawling site (max-pages=200, max-depth=4)
✓ [depth 0] / (1256B)
✓ [depth 1] /domains (4081B)
...
[5/6] Generating server.js + package.json + README
✓ Wrote app/ at audits/example.com/app
[6/6] Smoke test on port 8080
✓ GET / -> 200 (1256 bytes)
✓ Done in 1.7s. cd audits/example.com/app && node server.js

Install
# One-off
npx sitepull <URL>
# Globally
npm install -g sitepull
sitepull <URL>
# From source
git clone <this repo>
cd sitepull && npm link
sitepull <URL>

Requires Node ≥ 18. Zero npm dependencies.
What it does
- Recon — fetches /, parses every script/stylesheet/image URL, captures HTTP headers.
- Detect — probes 3 random unknown paths to decide SPA vs MPA (see the sketch after this list).
- Probe — hits 38 well-known endpoints (/api/*, /health, /.env, /admin, …) plus light POST-fuzz on any API endpoints discovered.
- Vendor — downloads every same-origin asset to audits/<host>/app/public/ byte-for-byte.
  - SPA: walks the asset graph from index.html + recurses into CSS url(...) refs.
  - MPA: BFS-crawls links from the homepage with depth + page caps + robots.txt support.
- Generate — writes server.js (zero deps, Node built-in http), package.json, README.md, AUDIT.md. Stubs each discovered backend endpoint with a safe placeholder response.
- Smoke test — boots the generated server, fetches /, prints status + bytes.
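The Detect step fits in a few lines. The heuristic: a request for a path that cannot exist 404s on an MPA, but an SPA's catch-all rewrite answers 200 with the app shell. A minimal sketch (detectSiteType and the probe path are illustrative, not sitepull's internals):

```js
// Illustrative sketch of the SPA-vs-MPA heuristic: probe random paths
// that cannot exist. An MPA 404s them; an SPA's catch-all rewrite
// answers 200 with the app shell.
async function detectSiteType(origin, probes = 3) {
  let spaHits = 0;
  for (let i = 0; i < probes; i++) {
    const path = `/sitepull-probe-${Math.random().toString(36).slice(2)}`;
    const res = await fetch(new URL(path, origin));
    if (res.ok) spaHits++; // 200 for an unknown path smells like an SPA fallback
  }
  return spaHits === probes ? 'spa' : 'mpa';
}
```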
Flags
--out <dir> Output directory (default: ./audits/<host>)
--port <N> Port the generated server.js will listen on (default: 8080)
--cookie <str> Cookie header for auth-walled sites
--user-agent <str> Override the default User-Agent
--max-pages <N> MPA crawler page cap (default: 200)
--max-depth <N> MPA crawler depth cap (default: 4)
--include <regex> Only follow URLs matching this regex
--exclude <regex> Skip URLs matching this regex
--no-respect-robots Ignore robots.txt
--rate-ms <N> Polite delay between requests in ms (default: 50)
--concurrency <N> Parallel HTTP fetches (default: 6)
--force-mode spa|mpa Override auto-detection
--no-probe Skip endpoint probing
--no-fuzz Skip POST-fuzzing
--no-smoke-test Skip the final boot test
--beautify Pretty-print minified .js/.css/.html bundles to *.pretty.* siblings
--browser Use Playwright Chromium instead of fetch (JS-rendered sites)
--storage-state <f> Use Playwright storage-state file (cookies + localStorage) for auth
--auth-flow Open Chromium for interactive login, save state, then exit
(use with --storage-state to point at the file to write)
--diff Re-audit and write DIFF.md against the previous run at --out

v0.2 features
--beautify — readable bundles
Adds a post-vendor pass that line-breaks and indents minified .js/.css/.html files larger than 30 KB. Saved as *.pretty.* siblings of the originals. Zero dependencies (uses a brace-aware walker that respects strings, template literals, regex, and comments).
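As a rough idea of how such a walker works, here is a heavily simplified sketch that tracks only plain strings (the real pass also has to recognize template literals, regex literals, and comments):

```js
// Heavily simplified sketch of a brace-aware pretty-printer: break after
// braces and semicolons, but never inside string literals.
function prettyPrint(src, indent = '  ') {
  let out = '', depth = 0, quote = null;
  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    if (quote) {                        // inside a string: copy verbatim
      out += ch;
      if (ch === '\\') out += src[++i] ?? '';
      else if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch; out += ch;
    } else if (ch === '{') {
      out += '{\n' + indent.repeat(++depth);
    } else if (ch === '}') {
      depth = Math.max(0, depth - 1);
      out += '\n' + indent.repeat(depth) + '}';
    } else if (ch === ';') {
      out += ';\n' + indent.repeat(depth);
    } else {
      out += ch;
    }
  }
  return out;
}
```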
sitepull https://example.com --beautify
# → public/assets/index-abc123.js (original, untouched)
# → public/assets/index-abc123.pretty.js (~15 K lines, grep-friendly)

--diff — detect changes since last clone
Each audit writes a .sitepull-manifest.json with SHA-256 of every vendored file. A subsequent run with --diff re-audits, compares manifests, and writes DIFF.md:
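Conceptually the comparison is a plain key/hash diff. A minimal sketch (diffManifests is hypothetical, assuming each manifest maps relative file paths to SHA-256 hex digests):

```js
import { createHash } from 'node:crypto';
import { readFile } from 'node:fs/promises';

// Hash a vendored file the way a manifest entry would record it.
const sha256 = (buf) => createHash('sha256').update(buf).digest('hex');

// Compare two { "path": "sha256hex" } manifests into DIFF.md categories.
async function diffManifests(prevPath, nextPath) {
  const prev = JSON.parse(await readFile(prevPath, 'utf8'));
  const next = JSON.parse(await readFile(nextPath, 'utf8'));
  return {
    added: Object.keys(next).filter((f) => !(f in prev)),
    removed: Object.keys(prev).filter((f) => !(f in next)),
    changed: Object.keys(next).filter((f) => f in prev && next[f] !== prev[f]),
    unchanged: Object.keys(next).filter((f) => next[f] === prev[f]),
  };
}
```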
sitepull https://my-site.com # initial clone
# ... a week later ...
sitepull https://my-site.com --diff # writes DIFF.md
# → 3 added, 1 removed, 7 changed, 42 unchanged

Use this to monitor a site for changes (CSP rotation, asset cache-bust, content edits, removed routes).
--browser — render JS-heavy sites
When a site needs JavaScript to produce content (Cloudflare interstitials, hydrated SSR, infinite-scroll feeds), pass --browser to use Playwright Chromium instead of raw fetch. Playwright is an optional peer dependency — only loaded when this flag is used.
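Internally this amounts to swapping raw fetch() for a rendered-page snapshot. Roughly (a sketch using Playwright's standard API, not sitepull's exact code):

```js
// Sketch of the --browser path: render with Chromium, wait for the network
// to settle, then hand the hydrated HTML to the same pipeline that raw
// fetch() would have fed.
const { chromium } = await import('playwright'); // lazy: only when --browser
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://complex-spa.example.com', { waitUntil: 'networkidle' });
const html = await page.content(); // post-hydration markup
await browser.close();
```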
# One-time setup:
npm install -g playwright && npx playwright install chromium
sitepull https://complex-spa.example.com --browser

--auth-flow — clone sites behind login
Opens a real Chromium window. You log in interactively. On close, cookies + localStorage are persisted. Re-use that state on subsequent audits with --storage-state.
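In Playwright terms the flow is roughly the following (a sketch; sitepull's window-close handling may differ):

```js
// Sketch of the --auth-flow idea: headed browser, manual login, then
// persist cookies + localStorage for later runs via --storage-state.
const { chromium } = await import('playwright');
const browser = await chromium.launch({ headless: false });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://app.example.com');
await page.waitForEvent('close', { timeout: 0 }); // you log in, then close the tab
await context.storageState({ path: './auth.json' }); // what --storage-state reads back
await browser.close();
```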
sitepull --auth-flow https://app.example.com --storage-state ./auth.json
# (browser opens; you log in; close the window)
sitepull https://app.example.com --browser --storage-state ./auth.json

Or use the simpler --cookie flag if you already have the Cookie header from devtools:
sitepull https://app.example.com --cookie "session=eyJhbGc...; csrf=xyz"

sitepull-mcp — MCP server for AI agents
A second binary, sitepull-mcp, exposes sitepull as MCP tools any compatible AI agent (Claude Desktop, Cursor, Cline, etc.) can call directly. Stdio transport.
Tools exposed:
- web_audit(url, ...options) — run a full audit
- web_audit_diff(url, out) — re-audit and produce DIFF.md
- web_audit_serve(out, port) — boot the local server
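For example, an agent-side call over stdio might look like this (a sketch using the MCP TypeScript SDK; option shapes may differ by SDK version):

```js
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

// Sketch of an MCP client invoking the web_audit tool over stdio.
const client = new Client({ name: 'demo-agent', version: '1.0.0' });
await client.connect(new StdioClientTransport({ command: 'sitepull-mcp' }));
const result = await client.callTool({
  name: 'web_audit',
  arguments: { url: 'https://example.com' },
});
console.log(result.content);
await client.close();
```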
Add to your agent's MCP config, e.g. for Claude Desktop:
{ "mcpServers": { "sitepull": { "command": "sitepull-mcp" } } }Output layout
audits/<host>/
├── AUDIT.md # Findings report (architecture, endpoints, fuzz, hashes)
└── app/
├── server.js # Generated zero-dep Node server with stubs
├── package.json # `"type": "module"`
├── README.md # How to run + integrity hashes + caveats
└── public/ # Byte-for-byte vendored assets
├── index.html
├── assets/...
        └── (all other files)
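To give a feel for the generated server.js, here is a minimal sketch of its shape (illustrative only; the real file is generated per audit, with one stub branch per discovered endpoint and content types for every vendored extension):

```js
import http from 'node:http';
import { readFile } from 'node:fs/promises';
import { extname, join, normalize } from 'node:path';

const TYPES = { '.html': 'text/html', '.js': 'text/javascript', '.css': 'text/css' };

http.createServer(async (req, res) => {
  const { pathname } = new URL(req.url, 'http://localhost');
  if (pathname.startsWith('/api/')) { // one inert stub branch per discovered endpoint
    res.writeHead(200, { 'content-type': 'application/json' });
    return res.end(JSON.stringify({ stub: true, path: pathname }));
  }
  const file = normalize(join('public', pathname === '/' ? 'index.html' : pathname));
  try {
    const body = await readFile(file);
    res.writeHead(200, { 'content-type': TYPES[extname(file)] ?? 'application/octet-stream' });
    res.end(body);
  } catch {
    res.writeHead(404);
    res.end('not found');
  }
}).listen(8080);
```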
Auth (cloning sites behind a login)

# Open the site in your browser, log in, copy the Cookie header from devtools,
# then pass it to sitepull:
sitepull https://app.example.com --cookie "session=eyJhbGc...; csrf=xyz"

The cookie is attached to every request the crawler/probe makes. Same-origin only — never sent to external CDNs.
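The gate is a one-liner in spirit (requestHeaders is hypothetical, for illustration):

```js
// Attach the --cookie value only when the request stays on the audit
// target's origin; fetches to external CDNs go out without it.
function requestHeaders(targetUrl, auditOrigin, cookie) {
  const headers = { 'user-agent': 'sitepull' };
  if (cookie && new URL(targetUrl).origin === auditOrigin) {
    headers.cookie = cookie;
  }
  return headers;
}
```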
Safety / responsible use
- Polite by default: 50 ms delay between requests, max 6 concurrent fetches, respects robots.txt. Tune with --rate-ms and --concurrency (see the sketch after this list).
- Read-only: never POSTs except during endpoint probing (and only with safe payloads to a fixed list of known API paths).
- No upstream forwarding: generated stubs never forward to the real third-party APIs (Stripe, OpenAI, etc.). They return placeholders so the local frontend can boot.
- No secret theft: any API keys spotted in bundles are NOT copied into the generated server. Stubs are inert.
- Don't audit what you don't have permission to: this tool is for your own sites, sites you've been hired to test, OSS frontends, and educational reverse-engineering of public static pages. Don't use it to scrape sites that prohibit it in their ToS.
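For reference, the politeness defaults amount to something like this worker-pool throttle (a sketch, not util.js verbatim):

```js
// Sketch of the politeness defaults: a 50 ms pause between requests and
// at most 6 fetches in flight at once.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeFetchAll(urls, { rateMs = 50, concurrency = 6 } = {}) {
  const results = [];
  let next = 0;
  const worker = async () => {
    while (next < urls.length) {
      const url = urls[next++];
      results.push({ url, res: await fetch(url) });
      await sleep(rateMs);
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```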
Architecture
bin/sitepull.mjs ─ CLI entry, argv parser, error handling
lib/
├── audit.js ─ Phases 1–6 orchestrator
├── recon.js ─ Phase 1: fetch /, extract asset URLs from HTML
├── detect.js ─ Phase 2: SPA/MPA classifier
├── probe.js ─ Phase 3: endpoint sweep + POST fuzzer
├── crawl.js ─ Phase 4 (MPA): BFS crawler with robots support
├── vendor.js ─ Phase 4 (SPA + MPA assets): downloads to disk
├── stubs.js ─ Phase 5a: synthesize safe stub branches
├── server-template.js ─ Phase 5b: builds server.js + package.json
├── report.js ─ Phase 6: writes AUDIT.md + README.md
  └── util.js ─ shared (logging, fetch, hashing, throttle)

License
MIT
