wayback-machine-downloader

v0.5.0

Published 4 months ago

Interactive Wayback Machine downloader for archiving websites locally.

Vulnerabilities: 0 high, 0 medium, 0 low

Maintainer: sigman78

Keywords: wayback, archive, downloader, wayback-machine

wayback-machine-downloader

Downloads archived snapshots of a website from the Wayback Machine and saves them locally.

Requirements

Node.js 18 or later.
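
If a wrapper script needs to confirm the Node.js version before invoking the tool, a check along these lines could be used. This is a minimal sketch; `meetsMinimumNode` is a hypothetical helper and not part of the package.

```javascript
// Hypothetical helper (not part of wayback-machine-downloader):
// returns true when a version string such as "18.19.0" has a major
// component at or above the documented minimum of 18.
function meetsMinimumNode(version, minimumMajor = 18) {
  const major = Number.parseInt(version.split(".")[0], 10);
  return Number.isInteger(major) && major >= minimumMajor;
}

// process.versions.node holds the running version without a leading "v".
if (!meetsMinimumNode(process.versions.node)) {
  console.warn(`Node.js 18 or later is required, found ${process.versions.node}`);
}
```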

Installation

npm install -g wayback-machine-downloader

Or run directly from a local clone:

npm install
node cli.js [url] [options]

Usage

Interactive mode

Run without arguments to be guided through all options via prompts:

wayback-machine-downloader

Non-interactive mode

Pass a URL (domain or full URL) directly on the command line:

wayback-machine-downloader example.com [options]
wayback-machine-downloader --url example.com [options]

If both a positional URL and --url are given, --url takes precedence.
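
The precedence rule can be illustrated with a small sketch. The function and field names below are hypothetical and only model the documented behavior; they are not taken from the tool's source.

```javascript
// Hypothetical model of the documented rule: --url wins over the
// positional argument; with neither, there is no URL and the tool
// would fall back to interactive mode.
function resolveTargetUrl({ positional, urlOption } = {}) {
  const chosen = urlOption ?? positional ?? null;
  return chosen === null ? null : chosen.trim();
}

console.log(resolveTargetUrl({ positional: "a.example", urlOption: "b.example" }));
// prints "b.example"
```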

Options

Arguments:
  url                     Domain or URL to archive (same as --url)

Options:
  --url <url>             Domain or URL to archive
  --from <timestamp>      Start timestamp YYYYMMDDhhmmss (default: none)
  --to <timestamp>        End timestamp YYYYMMDDhhmmss (default: none)
  --threads <n>           Concurrent download threads (default: 3)
  --directory <path>      Output directory (default: websites/<host>/)
  --rewrite-links         Rewrite page links to relative paths
  --canonical <action>    Canonical tag handling: keep|remove (default: keep)
  --exact-url             Download only the exact URL, no wildcard /*
  --external-assets       Also download off-site (external) assets
  --debug                 Enable verbose debug logging
  -h, --help              Show this help and exit
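
The --from and --to values use the 14-digit YYYYMMDDhhmmss form. A short sketch of producing such a value from a JavaScript Date follows; `toWaybackTimestamp` is a hypothetical helper, and it assumes the timestamps are interpreted as UTC.

```javascript
// Hypothetical helper (not part of the package): format a Date as
// YYYYMMDDhhmmss. Assumption: timestamps are interpreted as UTC.
function toWaybackTimestamp(date) {
  const pad = (n, width = 2) => String(n).padStart(width, "0");
  return (
    pad(date.getUTCFullYear(), 4) +
    pad(date.getUTCMonth() + 1) + // getUTCMonth() is zero-based
    pad(date.getUTCDate()) +
    pad(date.getUTCHours()) +
    pad(date.getUTCMinutes()) +
    pad(date.getUTCSeconds())
  );
}

const from = toWaybackTimestamp(new Date(Date.UTC(2020, 0, 1, 0, 0, 0)));
const to = toWaybackTimestamp(new Date(Date.UTC(2020, 11, 31, 23, 59, 59)));
console.log(`--from ${from} --to ${to}`);
// prints "--from 20200101000000 --to 20201231235959"
```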

Examples

# Archive everything from example.com
wayback-machine-downloader example.com

# Archive snapshots from a specific year
wayback-machine-downloader example.com --from 20200101000000 --to 20201231235959

# Rewrite links for offline browsing; strip canonical tags
wayback-machine-downloader example.com --rewrite-links --canonical remove

# Download only the exact URL (no wildcard crawl) with 8 threads
wayback-machine-downloader https://example.com/blog/ --exact-url --threads 8

# Save to a custom directory
wayback-machine-downloader example.com --directory ./archive/example

Programmatic API

import { WaybackMachineDownloader, setDebugMode } from "wayback-machine-downloader";
import { normalizeBaseUrlInput } from "wayback-machine-downloader/lib/utils.js";

const base = normalizeBaseUrlInput("example.com");

const dl = new WaybackMachineDownloader({
  base_url: base.canonicalUrl,
  normalized_base: base,
  from_timestamp: 0,
  to_timestamp: 0,
  threads_count: 3,
  rewrite_mode: "as-is",   // "as-is" | "relative"
  canonical_action: "keep", // "keep" | "remove"
  exact_url: false,
  download_external_assets: false,
  directory: null,          // null = default websites/<host>/
});

await dl.download_files();
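
When a script accepts CLI-style settings and forwards them to the constructor, a mapping like the one below may help. This is a sketch under stated assumptions: `toDownloaderOptions` and the `flags` field names are hypothetical, the link between --rewrite-links and `rewrite_mode: "relative"` is inferred from the comments in the example above, and 0 is assumed to mean "no bound" because that is the value the example uses.

```javascript
// Hypothetical helper (not part of the package): build the constructor
// options shown above from CLI-style flags.
function toDownloaderOptions(base, flags = {}) {
  return {
    base_url: base.canonicalUrl,
    normalized_base: base,
    // Assumption: 0 means "no bound", mirroring the example above.
    from_timestamp: flags.from ?? 0,
    to_timestamp: flags.to ?? 0,
    threads_count: flags.threads ?? 3,
    // Assumption: --rewrite-links corresponds to the "relative" mode.
    rewrite_mode: flags.rewriteLinks ? "relative" : "as-is",
    canonical_action: flags.canonical ?? "keep",
    exact_url: Boolean(flags.exactUrl),
    download_external_assets: Boolean(flags.externalAssets),
    directory: flags.directory ?? null,
  };
}

// Stand-in for the object returned by normalizeBaseUrlInput; only
// canonicalUrl is read here, and its value is a placeholder.
const base = { canonicalUrl: "https://example.com" };
const options = toDownloaderOptions(base, {
  rewriteLinks: true,
  canonical: "remove",
  threads: 8,
});
console.log(options.rewrite_mode, options.canonical_action, options.threads_count);
// prints "relative remove 8"
```

The resulting object would then be passed to `new WaybackMachineDownloader(options)` as in the example above.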

Output

Files are saved under websites/<host>/ by default. Each snapshot is stored at the path it had on the original site.
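
As a rough illustration of that layout, the sketch below maps an original URL to a path under websites/<host>/. `localPathFor` is a hypothetical helper, and the index.html fallback for directory-like URLs is an assumption that this README does not confirm.

```javascript
// Hypothetical helper (not part of the package): estimate where a file
// from the original site would be saved under the default layout.
function localPathFor(fileUrl, rootDir = "websites") {
  const { hostname, pathname } = new URL(fileUrl);
  // Assumption: URLs ending in "/" are stored as index.html.
  const relative = pathname.endsWith("/") ? `${pathname}index.html` : pathname;
  return `${rootDir}/${hostname}${relative}`;
}

console.log(localPathFor("https://example.com/blog/post.html"));
// prints "websites/example.com/blog/post.html"
```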
