# wayback-recover

Recover a website from the Wayback Machine.

```shell
npx wayback-recover example.com
```

Downloads pages and assets, rewrites links to local paths, and produces a self-contained static copy. Works with WordPress, Jekyll, static sites, and other platforms.
## Why this over a bulk downloader?
Tools like wayback-dl mirror everything the Wayback Machine has for a URL. That works for archival, but recovering a site that went offline requires more care:
- Cutoff detection. The Wayback Machine often has captures from after a site was compromised or expired. wayback-recover detects when a site went down and only uses captures from before that point.
- Link rewriting. All asset references and internal links are rewritten to local relative paths, including CDN URLs, versioned assets (`style.css?ver=4.2`), and protocol-relative URLs. Wayback Machine toolbar injections are removed.
- Deduplication. One copy of each URL (the latest good version) rather than thousands of near-identical captures.
- Checkpoint/resume. Interrupted downloads can be resumed by running the same command again. Rate limiting defaults to 15 requests/minute with exponential backoff.
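To make the link-rewriting step above concrete, here is a rough sketch of the general approach. This is an illustration, not the package's actual implementation; the `rewriteLinks` function, its regex, and the directory-to-`index.html` convention are assumptions.

```javascript
// Illustrative sketch of link rewriting -- not wayback-recover's actual code.
// Maps absolute, protocol-relative, and versioned asset URLs for a given
// domain onto local relative paths.
function rewriteLinks(html, domain) {
  // Match http://, https://, and protocol-relative (//) URLs for the domain,
  // with an optional path ending at a quote, whitespace, or closing paren.
  const urlPattern = new RegExp(
    `(?:https?:)?//(?:www\\.)?${domain.replace(/\./g, "\\.")}(/[^"'\\s)]*)?`,
    "g"
  );
  return html.replace(urlPattern, (match, path) => {
    let local = path || "/";
    local = local.split("?")[0];                     // drop version query strings like ?ver=4.2
    if (local.endsWith("/")) local += "index.html";  // directory URLs -> local index.html
    return "." + local;                              // make the path relative
  });
}

const input =
  '<link href="https://example.com/style.css?ver=4.2">' +
  '<a href="//example.com/about/">About</a>';
console.log(rewriteLinks(input, "example.com"));
// -> <link href="./style.css"><a href="./about/index.html">About</a>
```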
## Usage

```shell
npx wayback-recover example.com                    # Auto-detects everything
npx wayback-recover example.com --before 20170101  # Manual cutoff date
npx wayback-recover example.com --dry-run          # Preview what would be downloaded
npx wayback-recover example.com --no-assets        # HTML pages only
npx wayback-recover example.com -o ./my-backup     # Custom output directory
```

| Option | Description | Default |
|--------|-------------|---------|
| `-o, --output <dir>` | Output directory | `./<domain>-recovered` |
| `--before <YYYYMMDD>` | Only use captures before this date | Auto-detected |
| `--no-assets` | Skip CSS, JS, images | |
| `--dry-run` | Show what would be downloaded | |
| `--rate-limit <n>` | Max requests per minute | `15` |
| `--no-resume` | Start fresh, ignore checkpoint | |
| `--verbose` | Verbose logging | |
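One plausible way the auto-detected `--before` cutoff could work is to scan the capture timeline for the last capture with a healthy HTTP status. This is an assumption about the approach, not the package's documented algorithm; the `detectCutoff` function and the record shape are hypothetical, loosely modeled on what the Wayback Machine CDX API returns (timestamp plus status code).

```javascript
// Hypothetical sketch of cutoff auto-detection -- not wayback-recover's
// documented algorithm. Given capture records (timestamp + HTTP status),
// return the timestamp of the last successful capture.
function detectCutoff(captures) {
  // Sort oldest-first; YYYYMMDDhhmmss timestamps sort correctly as strings.
  const sorted = [...captures].sort((a, b) => a.timestamp.localeCompare(b.timestamp));
  let cutoff = null;
  for (const c of sorted) {
    if (c.status >= 200 && c.status < 400) {
      cutoff = c.timestamp; // site still healthy at this capture
    }
  }
  return cutoff; // null if no healthy capture exists
}

const timeline = [
  { timestamp: "20150301120000", status: 200 },
  { timestamp: "20161110093000", status: 200 },
  { timestamp: "20170215000000", status: 404 }, // site gone
  { timestamp: "20180101000000", status: 404 },
];
console.log(detectCutoff(timeline)); // -> "20161110093000"
```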
## Deployment
The output is a static site. Any static hosting works:
```shell
# GitHub Pages
cd example.com-recovered
git init && git add -A && git commit -m "Recovered site"
gh repo create my-site --public --source=. --push
# Enable Pages in repo Settings → Pages → Deploy from branch (main)
```

## Limitations
- Only recovers what the Wayback Machine captured. If a page or image was never crawled, it cannot be recovered.
- External embeds (YouTube, third-party widgets) are not included.
## License
MIT
