pagepocket

v0.4.0

Published

3 months ago

PagePocket

PagePocket is a webpage snapshot tool. Given a URL, it loads the page in a headless browser, records network activity, and rewrites remote resources to local files so the page can be viewed offline.

Highlights

Captures the final HTML after the page settles.
Records fetch/XHR request and response data for offline replay.
Downloads static assets (scripts, styles, images, fonts, etc.).
Rewrites resource links to local files or inlined Data URLs.
Injects a replay script so the snapshot can run without a network connection.
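
As a rough illustration of the link-rewriting highlight, the sketch below rewrites a srcset value so each candidate points into a local assets folder. The helper names (localPathFor, rewriteSrcset) and the file-naming scheme are assumptions made for this example, not PagePocket's actual implementation.

```typescript
// Hypothetical helper: map a remote URL to a path inside the assets folder.
// The naming scheme (last path segment, query and hash dropped) is an assumption.
function localPathFor(url: string, assetsDir: string): string {
  const withoutQuery = url.split(/[?#]/)[0] ?? url;
  const name = withoutQuery.split("/").filter(Boolean).pop() ?? "index";
  return `${assetsDir}/${name}`;
}

// Rewrite every candidate in a srcset value, keeping width/density descriptors.
function rewriteSrcset(srcset: string, assetsDir: string): string {
  return srcset
    .split(",")
    .map((candidate) => {
      const parts = candidate.trim().split(/\s+/);
      const url = parts[0] ?? "";
      const descriptors = parts.slice(1);
      return [localPathFor(url, assetsDir), ...descriptors].join(" ");
    })
    .join(", ");
}

const rewritten = rewriteSrcset(
  "https://cdn.example.com/img/hero.png 1x, https://cdn.example.com/img/hero@2x.png?v=3 2x",
  "example_files",
);
console.log(rewritten);
```

Splitting on commas keeps the sketch short, but it would mishandle URLs that themselves contain commas (for example Data URLs), so a real rewriter needs a more careful srcset parser.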

How it works

Page load: Uses Puppeteer to launch a headless browser and open the target URL.
Request interception and recording: Injects src/preload.ts into the page to wrap fetch and XMLHttpRequest, capturing request/response data in memory. In Node, it also listens to network responses to capture response bodies.
Resource capture and rewrite: Parses the HTML with Cheerio, extracts resource URLs (script, link, img, srcset, etc.), downloads them into a local folder, and rewrites HTML references to local paths.
Replay script injection: Injects a replay script into the output HTML that swaps remote requests for local or recorded data during offline viewing.
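
The record-then-replay idea in the steps above can be sketched in a few lines. This is a simplified model, not the contents of src/preload.ts: FetchLike, SimpleResponse, recordCalls, replayFrom, and fakeNetwork are all hypothetical names, and responses are keyed by URL only, whereas a real recorder would likely also consider the method and request body.

```typescript
// Minimal stand-ins for fetch types so the sketch runs without DOM typings.
type SimpleResponse = { status: number; body: string };
type FetchLike = (url: string) => Promise<SimpleResponse>;

// Wrap a fetch-like function so each completed call is recorded in memory.
function recordCalls(inner: FetchLike, log: Map<string, SimpleResponse>): FetchLike {
  return async (url) => {
    const response = await inner(url);
    log.set(url, response);
    return response;
  };
}

// Build a replacement that answers only from the recorded log (no network).
function replayFrom(log: Map<string, SimpleResponse>): FetchLike {
  return async (url) => {
    const recorded = log.get(url);
    if (recorded === undefined) {
      throw new Error(`No recorded response for ${url}`);
    }
    return recorded;
  };
}

// Fake "network" used only for this demo.
const fakeNetwork: FetchLike = async (url) => ({ status: 200, body: `data for ${url}` });

async function demo(): Promise<string> {
  const log = new Map<string, SimpleResponse>();

  // Capture phase: calls go through the recording wrapper.
  const recording = recordCalls(fakeNetwork, log);
  await recording("https://example.com/api/items");

  // Offline phase: the same URL is served from the log.
  const offline = replayFrom(log);
  const replayed = await offline("https://example.com/api/items");
  return replayed.body;
}

demo().then((body) => console.log(body));
```

In this model, a request that was never recorded fails loudly during replay, which makes gaps in a capture easy to spot.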

Install

Install globally so the pp CLI is available in your shell:

npm i -g pagepocket

Usage

pp https://example.com
pp https://example.com -o ./snapshots

Output

Snapshots are written to the current directory by default.

Use --output to choose a different directory. Either way, filenames derive from the page title, and each snapshot consists of:

*.html: offline snapshot page
*.requests.json: recorded requests/responses
*_files/: downloaded static assets

Example output paths, first for the default directory and then for -o ./snapshots:

example.html
example.requests.json
example_files/
snapshots/example.html
snapshots/example.requests.json
snapshots/example_files/
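
To make the naming pattern concrete, here is a small sketch that builds the three output paths from a base name and an optional output directory. How PagePocket derives the base name from the page title is not described here, so the sketch takes it as an input; snapshotPaths and SnapshotPaths are hypothetical names.

```typescript
type SnapshotPaths = { html: string; requests: string; assetsDir: string };

// baseName stands in for whatever name PagePocket derives from the page title.
function snapshotPaths(baseName: string, outputDir = "."): SnapshotPaths {
  // Drop trailing slashes so "snapshots/" and "snapshots" behave the same.
  const dir = outputDir.replace(/\/+$/, "");
  // The current directory gets no prefix, matching the default examples.
  const prefix = dir === "." ? "" : `${dir}/`;
  return {
    html: `${prefix}${baseName}.html`,
    requests: `${prefix}${baseName}.requests.json`,
    assetsDir: `${prefix}${baseName}_files/`,
  };
}

console.log(snapshotPaths("example"));
console.log(snapshotPaths("example", "snapshots"));
```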

Configuration

These environment variables control timeouts:

PAGEPOCKET_NAV_TIMEOUT_MS: navigation timeout for the initial page load (default: 60000)
PAGEPOCKET_PENDING_TIMEOUT_MS: time to wait for tracked fetch/XHR activity to settle (default: 40000)
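
One plausible way to read these variables with their documented defaults is sketched below. The fallback behavior for empty or non-numeric values is an assumption, as are the names timeoutFromEnv and loadTimeouts; the environment is passed in as a plain object so the sketch does not depend on Node typings.

```typescript
type Env = Record<string, string | undefined>;

// Read a positive number from env, falling back to the default when the
// variable is unset, empty, or not a valid positive number (assumed behavior).
function timeoutFromEnv(env: Env, name: string, fallbackMs: number): number {
  const raw = env[name];
  if (raw === undefined || raw.trim() === "") return fallbackMs;
  const parsed = Number(raw);
  return Number.isFinite(parsed) && parsed > 0 ? parsed : fallbackMs;
}

function loadTimeouts(env: Env): { navTimeoutMs: number; pendingTimeoutMs: number } {
  return {
    navTimeoutMs: timeoutFromEnv(env, "PAGEPOCKET_NAV_TIMEOUT_MS", 60000),
    pendingTimeoutMs: timeoutFromEnv(env, "PAGEPOCKET_PENDING_TIMEOUT_MS", 40000),
  };
}

// Only the navigation timeout is overridden; the pending timeout keeps its default.
const timeouts = loadTimeouts({ PAGEPOCKET_NAV_TIMEOUT_MS: "90000" });
console.log(timeouts);
```

In a Node program, process.env could be passed as the env argument.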

Notes and limitations

PagePocket records fetch/XHR traffic and DOM content, but it does not guarantee capture of every dynamic request if a site continuously streams data.
Some sites require authentication or enforce a strict Content Security Policy (CSP); you may need to load the page in a logged-in session or adjust your capture approach.
Snapshots are intended for offline viewing and debugging, not for producing a perfect archival copy of every runtime behavior.
