sitepull
Reverse-engineer a hosted web app and run it locally. Auto-detects SPA vs MPA, vendors every asset, generates a zero-dependency local server with safe stubs for any backend endpoints found.
npx sitepull https://example.com

v0.2 adds: --diff (re-audit + change report), --beautify (pretty-print bundles), --browser (Playwright for JS-heavy sites), --auth-flow (interactive login), and an MCP server (sitepull-mcp) so AI agents can call it as a tool.
▶ sitepull v0.2.1
Target: https://example.com/
Output: ./audits/example.com/
[1/6] Reconnaissance
✓ HTTP 200, server: ECS
· Scripts: 0, stylesheets: 0, icons: 0, images: 0
[2/6] Detecting site type (SPA vs MPA)
✓ Detected: MPA
→ 3/3 unknown paths returned 404
[3/6] Probing endpoints
✓ Found 1 real endpoints (out of 38 probed)
[4/6] Crawling site (max-pages=200, max-depth=4)
✓ [depth 0] / (1256B)
✓ [depth 1] /domains (4081B)
...
[5/6] Generating server.js + package.json + README
✓ Wrote app/ at audits/example.com/app
[6/6] Smoke test on port 8080
✓ GET / -> 200 (1256 bytes)
✓ Done in 1.7s. cd audits/example.com/app && node server.js

Install
# One-off
npx sitepull <URL>
# Globally
npm install -g sitepull
sitepull <URL>
# From source
git clone <this repo>
cd sitepull && npm link
sitepull <URL>

Requires Node ≥ 18. Zero npm dependencies.
What it does
- Recon — fetches /, parses every script/stylesheet/image URL, captures HTTP headers.
- Detect — probes 3 random unknown paths to decide SPA vs MPA (see the sketch after this list).
- Probe — hits 38 well-known endpoints (/api/*, /health, /.env, /admin, …) plus light POST-fuzz on any API endpoints discovered.
- Vendor — downloads every same-origin asset to audits/<host>/app/public/ byte-for-byte.
  - SPA: walks the asset graph from index.html + recurses into CSS url(...) refs.
  - MPA: BFS-crawls links from the homepage with depth + page caps + robots.txt support.
- Generate — writes server.js (zero deps, Node built-in http), package.json, README.md, AUDIT.md. Stubs each discovered backend endpoint with a safe placeholder response.
- Smoke test — boots the generated server, fetches /, prints status + bytes.
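The Detect step fits in a few lines. The heuristic: a request for a path that cannot exist 404s on an MPA, but an SPA's catch-all rewrite answers 200 with the app shell. A minimal sketch (detectSiteType and the probe path are illustrative, not sitepull's internals):

```js
// Illustrative sketch of the SPA-vs-MPA heuristic: probe random paths
// that cannot exist. An MPA 404s them; an SPA's catch-all rewrite
// answers 200 with the app shell.
async function detectSiteType(origin, probes = 3) {
  let spaHits = 0;
  for (let i = 0; i < probes; i++) {
    const path = `/sitepull-probe-${Math.random().toString(36).slice(2)}`;
    const res = await fetch(new URL(path, origin));
    if (res.ok) spaHits++; // 200 for an unknown path smells like an SPA fallback
  }
  return spaHits === probes ? 'spa' : 'mpa';
}
```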
Flags
--out <dir> Output directory (default: ./audits/<host>)
--port <N> Port the generated server.js will listen on (default: 8080)
--cookie <str> Cookie header for auth-walled sites
--user-agent <str> Override the default User-Agent
--max-pages <N> MPA crawler page cap (default: 200)
--max-depth <N> MPA crawler depth cap (default: 4)
--include <regex> Only follow URLs matching this regex
--exclude <regex> Skip URLs matching this regex
--no-respect-robots Ignore robots.txt
--rate-ms <N> Polite delay between requests in ms (default: 50)
--concurrency <N> Parallel HTTP fetches (default: 6)
--force-mode spa|mpa Override auto-detection
--no-probe Skip endpoint probing
--no-fuzz Skip POST-fuzzing
--no-smoke-test Skip the final boot test
--beautify Pretty-print minified .js/.css/.html bundles to *.pretty.* siblings
--browser Use Playwright Chromium instead of fetch (JS-rendered sites)
--storage-state <f> Use Playwright storage-state file (cookies + localStorage) for auth
--auth-flow Open Chromium for interactive login, save state, then exit
(use with --storage-state to point at the file to write)
--diff Re-audit and write DIFF.md against the previous run at --out

v0.2 features
--beautify — readable bundles
Adds a post-vendor pass that line-breaks and indents minified .js/.css/.html files larger than 30 KB. Saved as *.pretty.* siblings of the originals. Zero dependencies (uses a brace-aware walker that respects strings, template literals, regex, and comments).
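As a rough idea of how such a walker works, here is a heavily simplified sketch that tracks only plain strings (the real pass also has to recognize template literals, regex literals, and comments):

```js
// Heavily simplified sketch of a brace-aware pretty-printer: break after
// braces and semicolons, but never inside string literals.
function prettyPrint(src, indent = '  ') {
  let out = '', depth = 0, quote = null;
  for (let i = 0; i < src.length; i++) {
    const ch = src[i];
    if (quote) {                        // inside a string: copy verbatim
      out += ch;
      if (ch === '\\') out += src[++i] ?? '';
      else if (ch === quote) quote = null;
    } else if (ch === '"' || ch === "'") {
      quote = ch; out += ch;
    } else if (ch === '{') {
      out += '{\n' + indent.repeat(++depth);
    } else if (ch === '}') {
      depth = Math.max(0, depth - 1);
      out += '\n' + indent.repeat(depth) + '}';
    } else if (ch === ';') {
      out += ';\n' + indent.repeat(depth);
    } else {
      out += ch;
    }
  }
  return out;
}
```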
sitepull https://example.com --beautify
# → public/assets/index-abc123.js (original, untouched)
# → public/assets/index-abc123.pretty.js (~15 K lines, grep-friendly)

--diff — detect changes since last clone
Each audit writes a .sitepull-manifest.json with SHA-256 of every vendored file. A subsequent run with --diff re-audits, compares manifests, and writes DIFF.md:
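Conceptually the comparison is a plain key/hash diff. A minimal sketch (diffManifests is hypothetical, assuming each manifest maps relative file paths to SHA-256 hex digests):

```js
import { createHash } from 'node:crypto';
import { readFile } from 'node:fs/promises';

// Hash a vendored file the way a manifest entry would record it.
const sha256 = (buf) => createHash('sha256').update(buf).digest('hex');

// Compare two { "path": "sha256hex" } manifests into DIFF.md categories.
async function diffManifests(prevPath, nextPath) {
  const prev = JSON.parse(await readFile(prevPath, 'utf8'));
  const next = JSON.parse(await readFile(nextPath, 'utf8'));
  return {
    added: Object.keys(next).filter((f) => !(f in prev)),
    removed: Object.keys(prev).filter((f) => !(f in next)),
    changed: Object.keys(next).filter((f) => f in prev && next[f] !== prev[f]),
    unchanged: Object.keys(next).filter((f) => next[f] === prev[f]),
  };
}
```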
sitepull https://my-site.com # initial clone
# ... a week later ...
sitepull https://my-site.com --diff # writes DIFF.md
# → 3 added, 1 removed, 7 changed, 42 unchanged

Use this to monitor a site for changes (CSP rotation, asset cache-bust, content edits, removed routes).
--browser — render JS-heavy sites
When a site needs JavaScript to produce content (Cloudflare interstitials, hydrated SSR, infinite-scroll feeds), pass --browser to use Playwright Chromium instead of raw fetch. Playwright is an optional peer dependency — only loaded when this flag is used.
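Internally this amounts to swapping raw fetch() for a rendered-page snapshot. Roughly (a sketch using Playwright's standard API, not sitepull's exact code):

```js
// Sketch of the --browser path: render with Chromium, wait for the network
// to settle, then hand the hydrated HTML to the same pipeline that raw
// fetch() would have fed.
const { chromium } = await import('playwright'); // lazy: only when --browser
const browser = await chromium.launch();
const page = await browser.newPage();
await page.goto('https://complex-spa.example.com', { waitUntil: 'networkidle' });
const html = await page.content(); // post-hydration markup
await browser.close();
```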
# One-time setup:
npm install -g playwright && npx playwright install chromium
sitepull https://complex-spa.example.com --browser

--auth-flow — clone sites behind login
Opens a real Chromium window. You log in interactively. On close, cookies + localStorage are persisted. Re-use that state on subsequent audits with --storage-state.
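In Playwright terms the flow is roughly the following (a sketch; sitepull's window-close handling may differ):

```js
// Sketch of the --auth-flow idea: headed browser, manual login, then
// persist cookies + localStorage for later runs via --storage-state.
const { chromium } = await import('playwright');
const browser = await chromium.launch({ headless: false });
const context = await browser.newContext();
const page = await context.newPage();
await page.goto('https://app.example.com');
await page.waitForEvent('close', { timeout: 0 }); // you log in, then close the tab
await context.storageState({ path: './auth.json' }); // what --storage-state reads back
await browser.close();
```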
sitepull --auth-flow https://app.example.com --storage-state ./auth.json
# (browser opens; you log in; close the window)
sitepull https://app.example.com --browser --storage-state ./auth.json

Or use the simpler --cookie flag if you already have the Cookie header from devtools:
sitepull https://app.example.com --cookie "session=eyJhbGc...; csrf=xyz"

sitepull-mcp — MCP server for AI agents
A second binary, sitepull-mcp, exposes sitepull as MCP tools any compatible AI agent (Claude Desktop, Cursor, Cline, etc.) can call directly. Stdio transport.
Tools exposed:
- web_audit(url, ...options) — run a full audit
- web_audit_diff(url, out) — re-audit and produce DIFF.md
- web_audit_serve(out, port) — boot the local server
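For example, an agent-side call over stdio might look like this (a sketch using the MCP TypeScript SDK; option shapes may differ by SDK version):

```js
import { Client } from '@modelcontextprotocol/sdk/client/index.js';
import { StdioClientTransport } from '@modelcontextprotocol/sdk/client/stdio.js';

// Sketch of an MCP client invoking the web_audit tool over stdio.
const client = new Client({ name: 'demo-agent', version: '1.0.0' });
await client.connect(new StdioClientTransport({ command: 'sitepull-mcp' }));
const result = await client.callTool({
  name: 'web_audit',
  arguments: { url: 'https://example.com' },
});
console.log(result.content);
await client.close();
```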
Add to your agent's MCP config, e.g. for Claude Desktop:
{ "mcpServers": { "sitepull": { "command": "sitepull-mcp" } } }Output layout
audits/<host>/
├── AUDIT.md # Findings report (architecture, endpoints, fuzz, hashes)
└── app/
├── server.js # Generated zero-dep Node server with stubs
├── package.json # `"type": "module"`
├── README.md # How to run + integrity hashes + caveats
└── public/ # Byte-for-byte vendored assets
├── index.html
├── assets/...
        └── (all other files)
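To give a feel for the generated server.js, here is a minimal sketch of its shape (illustrative only; the real file is generated per audit, with one stub branch per discovered endpoint and content types for every vendored extension):

```js
import http from 'node:http';
import { readFile } from 'node:fs/promises';
import { extname, join, normalize } from 'node:path';

const TYPES = { '.html': 'text/html', '.js': 'text/javascript', '.css': 'text/css' };

http.createServer(async (req, res) => {
  const { pathname } = new URL(req.url, 'http://localhost');
  if (pathname.startsWith('/api/')) { // one inert stub branch per discovered endpoint
    res.writeHead(200, { 'content-type': 'application/json' });
    return res.end(JSON.stringify({ stub: true, path: pathname }));
  }
  const file = normalize(join('public', pathname === '/' ? 'index.html' : pathname));
  try {
    const body = await readFile(file);
    res.writeHead(200, { 'content-type': TYPES[extname(file)] ?? 'application/octet-stream' });
    res.end(body);
  } catch {
    res.writeHead(404);
    res.end('not found');
  }
}).listen(8080);
```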
Auth (cloning sites behind a login)

# Open the site in your browser, log in, copy the Cookie header from devtools,
# then pass it to sitepull:
sitepull https://app.example.com --cookie "session=eyJhbGc...; csrf=xyz"

The cookie is attached to every request the crawler/probe makes. Same-origin only — never sent to external CDNs.
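The gate is a one-liner in spirit (requestHeaders is hypothetical, for illustration):

```js
// Attach the --cookie value only when the request stays on the audit
// target's origin; fetches to external CDNs go out without it.
function requestHeaders(targetUrl, auditOrigin, cookie) {
  const headers = { 'user-agent': 'sitepull' };
  if (cookie && new URL(targetUrl).origin === auditOrigin) {
    headers.cookie = cookie;
  }
  return headers;
}
```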
Safety / responsible use
- Polite by default: 50 ms delay between requests, max 6 concurrent fetches, respects robots.txt. Tune with --rate-ms and --concurrency (see the sketch after this list).
- Read-only: never POSTs except during endpoint probing (and only with safe payloads to a fixed list of known API paths).
- No upstream forwarding: generated stubs never forward to the real third-party APIs (Stripe, OpenAI, etc.). They return placeholders so the local frontend can boot.
- No secret theft: any API keys spotted in bundles are NOT copied into the generated server. Stubs are inert.
- Don't audit what you don't have permission to: this tool is for your own sites, sites you've been hired to test, OSS frontends, and educational reverse-engineering of public static pages. Don't use it to scrape sites that prohibit it in their ToS.
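For reference, the politeness defaults amount to something like this worker-pool throttle (a sketch, not util.js verbatim):

```js
// Sketch of the politeness defaults: a 50 ms pause between requests and
// at most 6 fetches in flight at once.
const sleep = (ms) => new Promise((resolve) => setTimeout(resolve, ms));

async function politeFetchAll(urls, { rateMs = 50, concurrency = 6 } = {}) {
  const results = [];
  let next = 0;
  const worker = async () => {
    while (next < urls.length) {
      const url = urls[next++];
      results.push({ url, res: await fetch(url) });
      await sleep(rateMs);
    }
  };
  await Promise.all(Array.from({ length: concurrency }, worker));
  return results;
}
```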
Architecture
bin/sitepull.mjs ─ CLI entry, argv parser, error handling
lib/
├── audit.js ─ Phases 1–6 orchestrator
├── recon.js ─ Phase 1: fetch /, extract asset URLs from HTML
├── detect.js ─ Phase 2: SPA/MPA classifier
├── probe.js ─ Phase 3: endpoint sweep + POST fuzzer
├── crawl.js ─ Phase 4 (MPA): BFS crawler with robots support
├── vendor.js ─ Phase 4 (SPA + MPA assets): downloads to disk
├── stubs.js ─ Phase 5a: synthesize safe stub branches
├── server-template.js ─ Phase 5b: builds server.js + package.json
├── report.js ─ Phase 6: writes AUDIT.md + README.md
  └── util.js ─ shared (logging, fetch, hashing, throttle)

License
MIT
