# wayback-recover

Recover a website from the Wayback Machine.

```shell
npx wayback-recover example.com
```

Downloads pages and assets, rewrites links to local paths, and produces a self-contained static copy. Works with WordPress, Jekyll, static sites, and other platforms.
## Why this over a bulk downloader?
Tools like wayback-dl mirror everything the Wayback Machine has for a URL. That works for archival, but recovering a site that went offline requires more care:
- Cutoff detection. The Wayback Machine often has captures from after a site was compromised or expired. wayback-recover detects when a site went down and only uses captures from before that point.
- Link rewriting. All asset references and internal links are rewritten to local relative paths, including CDN URLs, versioned assets (`style.css?ver=4.2`), and protocol-relative URLs. Wayback Machine toolbar injections are removed.
- Deduplication. One copy of each URL (the latest good version) rather than thousands of near-identical captures.
- Checkpoint/resume. Interrupted downloads can be resumed by running the same command again. Rate limiting defaults to 15 requests/minute with exponential backoff.
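To make the link-rewriting step above concrete, here is a rough sketch of the general approach. This is an illustration, not the package's actual implementation; the `rewriteLinks` function, its regex, and the directory-to-`index.html` convention are assumptions.

```javascript
// Illustrative sketch of link rewriting -- not wayback-recover's actual code.
// Maps absolute, protocol-relative, and versioned asset URLs for a given
// domain onto local relative paths.
function rewriteLinks(html, domain) {
  // Match http://, https://, and protocol-relative (//) URLs for the domain,
  // with an optional path ending at a quote, whitespace, or closing paren.
  const urlPattern = new RegExp(
    `(?:https?:)?//(?:www\\.)?${domain.replace(/\./g, "\\.")}(/[^"'\\s)]*)?`,
    "g"
  );
  return html.replace(urlPattern, (match, path) => {
    let local = path || "/";
    local = local.split("?")[0];                     // drop version query strings like ?ver=4.2
    if (local.endsWith("/")) local += "index.html";  // directory URLs -> local index.html
    return "." + local;                              // make the path relative
  });
}

const input =
  '<link href="https://example.com/style.css?ver=4.2">' +
  '<a href="//example.com/about/">About</a>';
console.log(rewriteLinks(input, "example.com"));
// -> <link href="./style.css"><a href="./about/index.html">About</a>
```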
## Usage

```shell
npx wayback-recover example.com                    # Auto-detects everything
npx wayback-recover example.com --before 20170101  # Manual cutoff date
npx wayback-recover example.com --dry-run          # Preview what would be downloaded
npx wayback-recover example.com --no-assets        # HTML pages only
npx wayback-recover example.com -o ./my-backup     # Custom output directory
```

| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output <dir>` | Output directory | `./<domain>-recovered` |
| `--before <YYYYMMDD>` | Only use captures before this date | Auto-detected |
| `--no-assets` | Skip CSS, JS, images | |
| `--dry-run` | Show what would be downloaded | |
| `--rate-limit <n>` | Max requests per minute | `15` |
| `--no-resume` | Start fresh, ignore checkpoint | |
| `--verbose` | Verbose logging | |
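One plausible way the auto-detected `--before` cutoff could work is to scan the capture timeline for the last capture with a healthy HTTP status. This is an assumption about the approach, not the package's documented algorithm; the `detectCutoff` function and the record shape are hypothetical, loosely modeled on what the Wayback Machine CDX API returns (timestamp plus status code).

```javascript
// Hypothetical sketch of cutoff auto-detection -- not wayback-recover's
// documented algorithm. Given capture records (timestamp + HTTP status),
// return the timestamp of the last successful capture.
function detectCutoff(captures) {
  // Sort oldest-first; YYYYMMDDhhmmss timestamps sort correctly as strings.
  const sorted = [...captures].sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  let cutoff = null;
  for (const c of sorted) {
    if (c.status >= 200 && c.status < 400) {
      cutoff = c.timestamp; // site still healthy at this capture
    }
  }
  return cutoff; // null if no healthy capture exists
}

const timeline = [
  { timestamp: "20150301120000", status: 200 },
  { timestamp: "20161110093000", status: 200 },
  { timestamp: "20170215000000", status: 404 }, // site gone
  { timestamp: "20180101000000", status: 404 },
];
console.log(detectCutoff(timeline)); // -> "20161110093000"
```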
## Deployment
The output is a static site. Any static hosting works:
```shell
# GitHub Pages
cd example.com-recovered
git init && git add -A && git commit -m "Recovered site"
gh repo create my-site --public --source=. --push
# Enable Pages in repo Settings → Pages → Deploy from branch (main)
```

## Limitations
- Only recovers what the Wayback Machine captured. If a page or image was never crawled, it cannot be recovered.
- External embeds (YouTube, third-party widgets) are not included.
## License
MIT
