wayback-site-rescue-test

v1.0.3

Published

19 days ago

Download and locally replay Wayback Machine captures with resumable state and typed APIs.

furpan

archiving cdx crawler internet-archive wayback web-archiving web-crawler web-scraping

wayback-site-rescue

Download and replay archived websites from the Internet Archive Wayback Machine with a typed API and CLI.

This package helps you:

- fetch snapshots for a URL
- download pages and requisites (assets)
- rewrite links for local replay
- resume interrupted runs with state
- apply cleanup/SEO transforms (robots/sitemap/redirect helpers)
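To make "rewrite links for local replay" concrete, here is a minimal sketch of one way such a rewrite could work. It is not this package's implementation: `parseWaybackUrl`, `toLocalPath`, and `rewriteLink` are hypothetical names, and the output layout (`<host>/<path>`, with `index.html` for directory-like URLs, query strings ignored) is an assumption made for illustration. Only the replay URL shape, `https://web.archive.org/web/<timestamp><modifier>/<original>`, comes from the Wayback Machine itself.

```typescript
// Illustrative sketch; every name below is hypothetical, not the package API.

interface WaybackParts {
  timestamp: string; // capture time, up to 14 digits (YYYYMMDDhhmmss)
  modifier: string;  // replay modifier such as "id_", "im_", "js_", "cs_", or ""
  original: string;  // the URL that was originally captured
}

const WAYBACK_RE =
  /^https?:\/\/web\.archive\.org\/web\/(\d{1,14})([a-z]{2}_)?\/(.+)$/;

// Split a Wayback replay URL into its parts, or return null for other URLs.
function parseWaybackUrl(url: string): WaybackParts | null {
  const m = WAYBACK_RE.exec(url);
  if (!m) return null;
  const [, timestamp = "", modifier = "", original = ""] = m;
  return { timestamp, modifier, original };
}

// Map an original URL to a relative file path (assumed layout: host/path).
// Query strings are ignored here to keep the sketch short.
function toLocalPath(original: string): string {
  const u = new URL(original);
  let path = u.pathname;
  const last = path.split("/").pop() ?? "";
  if (path.endsWith("/")) {
    path += "index.html";
  } else if (!/\.[A-Za-z0-9]+$/.test(last)) {
    path += "/index.html"; // extensionless pages become directories
  }
  return u.hostname + path;
}

// Rewrite a link found in a captured page; non-Wayback links pass through.
function rewriteLink(href: string): string {
  const parts = parseWaybackUrl(href);
  return parts ? "/" + toLocalPath(parts.original) : href;
}
```

Under these assumptions, `rewriteLink("https://web.archive.org/web/20200101000000im_/https://example.com/logo.png")` yields `/example.com/logo.png`, a path a local static server can resolve.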

Why this package?

Typical use cases:

- Site recovery / migration: Recover old pages and map them to a new domain.
- Archive-based static backup: Create a local copy for documentation or legal/ops needs.
- SEO-safe legacy handling: Generate sitemap.xml, robots.txt, and optional archived-404 redirects.
- Research and audits: Programmatically inspect historical captures.
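For the SEO-safe legacy handling use case, the sketch below shows what generating a sitemap.xml and a robots.txt can look like. It only illustrates the two file formats (the Sitemaps protocol and RFC 9309) and is not the package's own generator: `buildSitemap` and `buildRobots` are hypothetical helpers, and the blanket `Disallow: /` is an assumed reading of "restrictive".

```typescript
// Illustrative sketch; helper names are hypothetical, not the package API.

// Escape the five characters that must not appear raw in XML text.
function escapeXml(s: string): string {
  return s
    .replace(/&/g, "&amp;")
    .replace(/</g, "&lt;")
    .replace(/>/g, "&gt;")
    .replace(/"/g, "&quot;")
    .replace(/'/g, "&apos;");
}

// Build a minimal sitemap with one <url><loc> entry per recovered page.
function buildSitemap(urls: string[]): string {
  const entries = urls.map((u) => `  <url><loc>${escapeXml(u)}</loc></url>`);
  return [
    '<?xml version="1.0" encoding="UTF-8"?>',
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">',
    ...entries,
    "</urlset>",
    "",
  ].join("\n");
}

// Build a robots.txt that either allows everything or blocks all crawlers.
// Blocking every user agent is an assumption made for this sketch.
function buildRobots(sitemapUrl: string, blockAll: boolean): string {
  return [
    "User-agent: *",
    blockAll ? "Disallow: /" : "Allow: /",
    `Sitemap: ${sitemapUrl}`,
    "",
  ].join("\n");
}
```

Escaping matters even for plain page lists, because query strings with `&` would otherwise produce an invalid sitemap.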

Install

Package usage (consumer)

npm install wayback-site-rescue

bun add wayback-site-rescue

CLI usage

npx wayback-site-rescue --url https://example.com --list-only

Quick start (API)

import { runDownloader } from "wayback-site-rescue";

const result = await runDownloader({
  url: "https://example.com",
  listOnly: true,
});

console.log(result);
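The shape of `result` is not documented here, so the following sketch shows the underlying idea instead: listing captures by querying the public Wayback CDX endpoint directly. This is an assumption about how a list-only run might work, not a description of `runDownloader`. `buildCdxUrl`, `parseCdxJson`, and the `Capture` type are hypothetical names. The endpoint and its `url`, `output`, `matchType`, `from`, and `to` parameters belong to the CDX server.

```typescript
// Illustrative sketch; names are hypothetical and independent of the package.

interface Capture {
  timestamp: string;  // YYYYMMDDhhmmss
  original: string;   // captured URL
  mimetype: string;
  statuscode: string; // kept as a string because CDX may report "-"
}

// Build a CDX query URL. Prefix matching is requested unless exact is set.
function buildCdxUrl(
  target: string,
  opts: { from?: string; to?: string; exact?: boolean } = {},
): string {
  const params = new URLSearchParams({ url: target, output: "json" });
  if (!opts.exact) params.set("matchType", "prefix");
  if (opts.from) params.set("from", opts.from);
  if (opts.to) params.set("to", opts.to);
  return `https://web.archive.org/cdx/search/cdx?${params.toString()}`;
}

// With output=json the server returns an array of rows whose first row
// is the header (field names). Map the remaining rows by header position.
function parseCdxJson(rows: string[][]): Capture[] {
  const [header, ...data] = rows;
  if (!header) return [];
  const t = header.indexOf("timestamp");
  const o = header.indexOf("original");
  const m = header.indexOf("mimetype");
  const s = header.indexOf("statuscode");
  return data.map((r) => ({
    timestamp: r[t] ?? "",
    original: r[o] ?? "",
    mimetype: r[m] ?? "",
    statuscode: r[s] ?? "",
  }));
}
```

A caller would fetch `buildCdxUrl("https://example.com")`, parse the body as JSON, and pass it to `parseCdxJson`. Keeping URL building and parsing as pure functions makes them testable without network access.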

Quick start (CLI)

wayback-site-rescue --url https://example.com --directory ./downloads

Interactive prompt mode:

wayback-site-rescue --interactive --list-only

Common options

--url <url>                  target URL
--directory <path>           output directory (default ./downloads)
--from <timestamp>           start of the capture range (YYYYMMDDhhmmss)
--to <timestamp>             end of the capture range (YYYYMMDDhhmmss)
--list-only                  query and list captures without downloading
--exact-url                  use exact URL matching in CDX
--capture-concurrency <n>    number of concurrent capture workers
--rate-limit-per-second <n>  global request pacing (requests per second)
--recovery-domain <domain>   rewrite internal links/canonical/meta to a new domain
--create-sitemap             write sitemap.xml after the run
--block-scrapers-in-robots   generate a restrictive robots.txt
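The `--from` and `--to` values use the 14-digit Wayback timestamp format, which is expressed in UTC. A small helper pair like the one below can convert between JavaScript `Date` objects and that format. The helper names are hypothetical and not part of this package, and whether the CLI also accepts shorter, partial timestamps is not stated here.

```typescript
// Illustrative helpers; not exported by the package.

// Format a Date as YYYYMMDDhhmmss in UTC.
function toWaybackTimestamp(date: Date): string {
  const pad = (n: number, width = 2): string => String(n).padStart(width, "0");
  return (
    pad(date.getUTCFullYear(), 4) +
    pad(date.getUTCMonth() + 1) + // getUTCMonth() is zero-based
    pad(date.getUTCDate()) +
    pad(date.getUTCHours()) +
    pad(date.getUTCMinutes()) +
    pad(date.getUTCSeconds())
  );
}

// Parse a full 14-digit timestamp back into a Date (UTC).
function parseWaybackTimestamp(ts: string): Date {
  if (!/^\d{14}$/.test(ts)) {
    throw new Error(`Expected 14 digits (YYYYMMDDhhmmss), got "${ts}"`);
  }
  const n = (start: number, end: number): number => Number(ts.slice(start, end));
  return new Date(
    Date.UTC(n(0, 4), n(4, 6) - 1, n(6, 8), n(8, 10), n(10, 12), n(12, 14)),
  );
}
```

For example, a range covering all of 2020 could be passed as `--from 20200101000000 --to 20201231235959`.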

For the complete list of options, run:

wayback-site-rescue --help

Examples

See the examples/ directory for practical variants:

examples/list-only.ts
examples/download-with-rewrite.ts
examples/seo-and-cleanup.ts

Run one with:

bun examples/list-only.ts

Development (repo)

This repository uses Bun for local tooling.

bun install
bun run check

Individual tasks:

bun run lint
bun run typecheck
bun run test:ci
bun run build

Credits, references, docs, and specs

This project builds on great open-source work and public standards. Credit where it’s due:

Internet Archive / Wayback Machine
- Wayback Machine: https://web.archive.org/
- Internet Archive org/repositories: https://github.com/internetarchive
CDX and archival replay context
- Wayback CDX Server implementation: https://github.com/internetarchive/wayback/tree/master/wayback-cdx-server
Core libraries used by this package
- Commander: https://github.com/tj/commander.js
- Inquirer prompts: https://github.com/SBoudrias/Inquirer.js
- Axios: https://github.com/axios/axios
- Cheerio: https://github.com/cheeriojs/cheerio
- PQueue: https://github.com/sindresorhus/p-queue
Tooling and registry workflows
- Bun docs: https://bun.sh/docs
- Bun package publishing: https://bun.sh/docs/cli/publish
- npm registry docs: https://docs.npmjs.com/
Specs referenced by generated outputs/behavior
- Memento protocol (RFC 7089): https://www.rfc-editor.org/rfc/rfc7089
- Robots Exclusion Protocol (RFC 9309): https://www.rfc-editor.org/rfc/rfc9309
- Sitemaps protocol: https://www.sitemaps.org/protocol.html

License

MIT