wayback-grab
v0.1.1
Download a full Wayback Machine snapshot of a domain, rewrite internal links, and ship a working offline mirror.
wayback-grab queries the Internet Archive's CDX index for every capture of a domain, downloads one snapshot per URL in parallel, lays the files out on disk in a navigable directory tree, and rewrites the HTML and CSS so internal links resolve locally. Open the resulting index.html in a browser and the site works offline.
Install

    npm install -g wayback-grab

Requires Node.js 18 or newer.
Quick start

    wayback-grab spacejam.com

This downloads the most recent capture of every URL the Wayback Machine has for spacejam.com into ./wayback-archive/ and prints the path to the entry page when it's done.
Usage

    wayback-grab <domain> [options]

Common workflows
Pull a specific era of a site, keeping the latest snapshot per URL within the range:

    wayback-grab spacejam.com -o ./spacejam-1996 --from 19961101 --to 19991231

See what's archived before downloading anything:

    wayback-grab spacejam.com --dry-run

Pull more cautiously to avoid hammering archive.org:

    wayback-grab spacejam.com --concurrency 2

Skip the link-rewriting pass (you just want the raw files):

    wayback-grab spacejam.com --no-rewrite

Options
| Flag | Description | Default |
| --- | --- | --- |
| -o, --output <dir> | Output directory | ./wayback-archive |
| --from <YYYYMMDD> | Only include snapshots on or after this date | unbounded |
| --to <YYYYMMDD> | Only include snapshots on or before this date | unbounded |
| --pick <first\|last> | Which snapshot to keep per unique URL | last |
| --include-subs | Include archived subdomains | off |
| -c, --concurrency <n> | Parallel downloads | 5 |
| --no-rewrite | Skip rewriting internal links to local paths | off |
| --dry-run | List URLs without downloading | off |
| -h, --help | Show help | — |
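
To make the interaction between --from, --to, and --pick concrete, here is a minimal TypeScript sketch of the selection rule the table describes. It is not wayback-grab's actual code: the Capture type and the selectCaptures function are hypothetical names. It assumes Wayback-style 14-digit timestamps (YYYYMMDDhhmmss), which sort correctly as plain strings, and it treats both date bounds as inclusive of the whole day.

```typescript
// Hypothetical shape of one archived capture; not wayback-grab's real type.
interface Capture {
  url: string;
  timestamp: string; // Wayback-style YYYYMMDDhhmmss
}

type PickMode = "first" | "last";

// Keep one capture per unique URL inside an inclusive [from, to] date range.
function selectCaptures(
  captures: Capture[],
  pick: PickMode = "last",
  from?: string, // YYYYMMDD, inclusive
  to?: string // YYYYMMDD, inclusive
): Capture[] {
  const chosen = new Map<string, Capture>();
  for (const c of captures) {
    // Compare on the date part only, so --to covers the whole final day.
    const day = c.timestamp.slice(0, 8);
    if (from !== undefined && day < from) continue;
    if (to !== undefined && day > to) continue;

    const current = chosen.get(c.url);
    const better =
      current === undefined ||
      (pick === "last"
        ? c.timestamp > current.timestamp
        : c.timestamp < current.timestamp);
    if (better) chosen.set(c.url, c);
  }
  return Array.from(chosen.values());
}
```

With the default of "last", a URL captured in both 1996 and 2010 resolves to the 2010 capture unless --to excludes it, which is why narrowing the date range changes which version of each page ends up in the mirror.
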
How it works
- Index. Query the Wayback CDX API for every capture matching the domain.
- Dedupe. Group by URL and keep one snapshot per URL — by default the most recent within your date range.
- Download. Fetch each capture from web.archive.org/web/<timestamp>id_/<url> (the id_ flag asks the Wayback Machine for the original bytes, without its toolbar injection).
- Lay out. Map each URL to a path on disk based on its hostname and pathname, with query strings folded into the filename so distinct query variants don't collide.
- Rewrite. Walk every HTML and CSS file and replace absolute and web.archive.org URLs with relative paths into the local mirror.
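
The Index, Download, and Lay out steps above can be sketched as three small TypeScript helpers. This is an illustration under stated assumptions, not wayback-grab's implementation: cdxQueryUrl, rawCaptureUrl, and localPath are hypothetical names, the choice of CDX parameters (matchType, fl) is a guess at a reasonable query rather than what the tool actually sends, and the way the query string is folded into the filename is one possible scheme among many.

```typescript
// Illustrative sketch only: these helpers are hypothetical, not wayback-grab's internals.

const CDX_ENDPOINT = "https://web.archive.org/cdx/search/cdx";

// Index: build a CDX query URL for a domain. The parameter choices are assumptions.
function cdxQueryUrl(
  domain: string,
  includeSubs = false,
  from?: string, // YYYYMMDD
  to?: string // YYYYMMDD
): string {
  const params: [string, string][] = [
    ["url", domain],
    ["matchType", includeSubs ? "domain" : "host"], // "domain" also matches subdomains
    ["output", "json"],
    ["fl", "timestamp,original"], // only the fields this sketch needs
  ];
  if (from) params.push(["from", from]);
  if (to) params.push(["to", to]);
  const query = params
    .map(([key, value]) => `${key}=${encodeURIComponent(value)}`)
    .join("&");
  return `${CDX_ENDPOINT}?${query}`;
}

// Download: the id_ suffix after the timestamp requests the original bytes.
function rawCaptureUrl(timestamp: string, originalUrl: string): string {
  return `https://web.archive.org/web/${timestamp}id_/${originalUrl}`;
}

// Lay out: map a URL to a relative path made of hostname + pathname,
// with the query string folded into the filename before the extension.
function localPath(originalUrl: string): string {
  const u = new URL(originalUrl);
  // Directory-style URLs get an index.html so they open in a browser.
  let path = u.pathname.endsWith("/") ? u.pathname + "index.html" : u.pathname;
  if (u.search.length > 1) {
    // Replace characters that are awkward in filenames with underscores.
    const suffix = "_" + u.search.slice(1).replace(/[^A-Za-z0-9._-]/g, "_");
    const dot = path.lastIndexOf(".");
    const slash = path.lastIndexOf("/");
    path =
      dot > slash
        ? path.slice(0, dot) + suffix + path.slice(dot)
        : path + suffix;
  }
  return u.hostname + path;
}
```

In this sketch, localPath("http://www.spacejam.com/cmp/lineup.html?team=tune") yields www.spacejam.com/cmp/lineup_team_tune.html. Replacing characters with underscores can, in rare cases, map two different query strings to the same filename; a real implementation might append a short hash to rule that out.
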
Tips
- Start with --dry-run to see what's actually in the archive. Old domains often have hundreds of parked-page snapshots from after the site went down, so constrain --from/--to to the era you care about.
- The CDX API can be slow on busy days. If a run stalls on the index step, give it a minute before retrying.
- Re-running into the same output directory is safe: already-downloaded assets are skipped.
- The mirror is meant to be browsed locally via file://. Some sites with absolute paths to assets that were never archived will have broken images; that's an upstream gap, not something wayback-grab can fill.
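
Browsing over file:// is the reason links have to become document-relative: a root-relative link such as /img/logo.gif would resolve against the filesystem root, not the mirror directory. The TypeScript sketch below shows one way to compute such a link between two files in the mirror. The relativeLink function is a hypothetical helper, not part of wayback-grab, and it assumes both arguments are forward-slash paths relative to the output directory.

```typescript
// Hypothetical helper: compute the relative link from one mirrored file to another.
// Both paths are assumed to be relative to the mirror root and to use "/" separators.
function relativeLink(fromFile: string, toFile: string): string {
  const fromDir = fromFile.split("/").slice(0, -1); // directory of the linking page
  const target = toFile.split("/");

  // Count the leading directory segments the two paths have in common.
  let shared = 0;
  while (
    shared < fromDir.length &&
    shared < target.length - 1 &&
    fromDir[shared] === target[shared]
  ) {
    shared++;
  }

  // Climb out of the non-shared part of fromDir, then descend to the target.
  const up = fromDir.slice(shared).map(() => "..");
  return up.concat(target.slice(shared)).join("/");
}
```

For example, a page at spacejam.com/cmp/lineup.html that references spacejam.com/img/logo.gif would link to ../img/logo.gif, which resolves correctly wherever the mirror directory lives on disk.
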
Limitations
- Dynamic content (XHR, JS-rendered pages) is preserved only to the extent the Wayback Machine archived the resulting HTML or the underlying endpoints.
- Forms, search boxes, and anything server-side won't work — it's a static snapshot.
- The Wayback Machine doesn't archive everything. Missing assets are missing assets.
