websitetor

v1.0.1

Published

a month ago

Download full websites with all assets, including HTML, CSS, JS, JSON, TXT, and archived snapshots from the Wayback Machine.

Vulnerabilities: 0 high, 0 medium, 0 low

onantis

website downloader scraper wayback archive crawler offline

Websitetor

Introduction

Websitetor is a Node.js package for downloading complete websites to disk. It handles recursive crawling, resolves relative and absolute links, collects all referenced assets, and preserves the original folder structure. It supports both live sites and archival snapshots via the Wayback Machine.
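To make the link resolution and folder mapping described above concrete, here is a minimal sketch built on the standard `URL` class. The helpers `resolveLink` and `toLocalPath` are hypothetical names chosen for illustration; they are not part of the Websitetor API, and the package may handle these steps differently (for example, query strings are simply ignored here).

```typescript
// Hypothetical helper, not part of the websitetor API.
// Resolves a link found on a page against that page's URL.
function resolveLink(href: string, pageUrl: string): string | null {
  try {
    const resolved = new URL(href, pageUrl);
    // Only http(s) resources can be fetched; skip mailto:, javascript:, data:, etc.
    if (resolved.protocol !== "http:" && resolved.protocol !== "https:") {
      return null;
    }
    resolved.hash = ""; // a fragment points inside a file, not to another file
    return resolved.href;
  } catch {
    return null; // malformed URL
  }
}

// Hypothetical helper, not part of the websitetor API.
// Maps an absolute URL to a relative path on disk that mirrors the site's
// folder structure. Directory-style URLs (ending in "/") get an index.html.
function toLocalPath(url: string): string {
  const { pathname } = new URL(url);
  const withIndex = pathname.endsWith("/") ? pathname + "index.html" : pathname;
  return withIndex.replace(/^\/+/, "");
}

console.log(resolveLink("../css/site.css", "https://example.com/blog/post.html"));
console.log(toLocalPath("https://example.com/blog/"));
```

Because `new URL(href, base)` normalizes `..` segments, the mapped path in this sketch stays inside the destination directory.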

Features

Downloads HTML, CSS, JS, JSON, TXT, images, fonts, and media
Archived website support via the Wayback Machine
Recursive link crawling with configurable depth
Parallel downloads with configurable concurrency
Validates downloaded sites for broken internal links and missing assets
CLI and programmatic API
Returns a structured result with file count and error details
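The "parallel downloads with configurable concurrency" feature can be pictured as a small worker pool. The sketch below is one common way to cap the number of in-flight tasks; `mapWithConcurrency` is a hypothetical helper, not a Websitetor export, and the package's internal scheduler may differ.

```typescript
// Hypothetical helper, not part of the websitetor API.
// Runs `worker` over `items` with at most `limit` tasks in flight.
// Results are returned in the same order as the input.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T) => Promise<R>,
): Promise<R[]> {
  if (!Number.isInteger(limit) || limit < 1) {
    throw new RangeError("limit must be a positive integer");
  }
  const results: R[] = new Array(items.length);
  let next = 0; // index of the next unclaimed item

  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++; // claimed synchronously, so runners never share an item
      results[index] = await worker(items[index]);
    }
  }

  // Start up to `limit` runners; each pulls items until none are left.
  const runners = Array.from({ length: Math.min(limit, items.length) }, () => runner());
  await Promise.all(runners);
  return results;
}
```

In this sketch a rejected task rejects the whole call. A downloader that reports per-file errors, as `DownloadResult.errors` suggests, would more likely catch failures inside `worker` and record them instead.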

Installation

npm install websitetor

Usage

Basic Usage

import { download } from "websitetor";

download("https://example.com", "./example-site")
  .then(() => console.log("Download completed."))
  .catch(console.error);

Archived Version

import { download } from "websitetor";

download("https://example.com", "./example-archive", { wayback: true })
  .then(() => console.log("Archived download completed."))
  .catch(console.error);
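For context on what fetching an archived version involves, the sketch below builds a Wayback Machine snapshot URL by hand. It relies on the public `https://web.archive.org/web/<timestamp>/<url>` address format, where the timestamp is 14 UTC digits (`YYYYMMDDhhmmss`), the archive serves the capture closest to it, and the `id_` modifier requests the original bytes without rewritten links. The helper names are hypothetical, and how Websitetor selects a snapshot when `wayback: true` is set is not documented here.

```typescript
// Hypothetical helper, not part of the websitetor API.
// Formats a Date as the 14-digit UTC timestamp used in Wayback Machine URLs.
function waybackTimestamp(date: Date): string {
  const pad = (n: number, width = 2): string => String(n).padStart(width, "0");
  return (
    pad(date.getUTCFullYear(), 4) +
    pad(date.getUTCMonth() + 1) +
    pad(date.getUTCDate()) +
    pad(date.getUTCHours()) +
    pad(date.getUTCMinutes()) +
    pad(date.getUTCSeconds())
  );
}

// Hypothetical helper, not part of the websitetor API.
// Builds a snapshot URL; `raw` adds the id_ modifier for unmodified content.
function waybackUrl(target: string, date: Date, raw = false): string {
  const modifier = raw ? "id_" : "";
  return `https://web.archive.org/web/${waybackTimestamp(date)}${modifier}/${target}`;
}

console.log(waybackUrl("https://example.com/", new Date(Date.UTC(2020, 0, 2, 3, 4, 5))));
```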

Validate a Downloaded Site

import { validate } from "websitetor";

const result = validate("./example-site");

if (result.valid) {
  console.log("Site is complete. No broken links or missing assets.");
} else {
  console.log(`Broken links: ${result.brokenLinks.length}`);
  console.log(`Missing assets: ${result.missingAssets.length}`);
}

CLI

# Download a live website
websitetor download https://example.com ./example-site

# Download from the Wayback Machine
websitetor download https://example.com ./example-archive --wayback

# Set recursion depth
websitetor download https://example.com ./example-site --depth 3

# Set concurrency
websitetor download https://example.com ./example-site --concurrency 5

# Validate a downloaded site
websitetor validate ./example-site

# Validate and output JSON
websitetor validate ./example-site --json

# List available commands
websitetor list

API

`download(url, destination, options?)`

Downloads a website and saves all resources to the destination directory.

download(url: string, destination: string, options?: DownloadOptions): Promise<DownloadResult>

DownloadOptions

interface DownloadOptions {
  wayback?: boolean;      // Fetch from the Wayback Machine. Default: false
  depth?: number;         // Maximum recursion depth. Default: 5
  concurrency?: number;   // Parallel downloads. Default: 3
}
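
As an illustration of how the documented defaults combine with caller-supplied options, here is a minimal sketch. `normalizeOptions` is a hypothetical helper, and the validation rules (non-negative integer depth, concurrency of at least 1) are assumptions rather than documented behavior.

```typescript
interface DownloadOptions {
  wayback?: boolean;
  depth?: number;
  concurrency?: number;
}

// Defaults as stated in the DownloadOptions comments.
const DEFAULT_OPTIONS: Required<DownloadOptions> = {
  wayback: false,
  depth: 5,
  concurrency: 3,
};

// Hypothetical helper, not part of the websitetor API.
// Fills in missing options and rejects values that cannot work.
function normalizeOptions(options: DownloadOptions = {}): Required<DownloadOptions> {
  const merged: Required<DownloadOptions> = {
    wayback: options.wayback ?? DEFAULT_OPTIONS.wayback,
    depth: options.depth ?? DEFAULT_OPTIONS.depth,
    concurrency: options.concurrency ?? DEFAULT_OPTIONS.concurrency,
  };
  // Assumed rules: depth 0 means "start page only"; at least one download slot.
  if (!Number.isInteger(merged.depth) || merged.depth < 0) {
    throw new RangeError("depth must be a non-negative integer");
  }
  if (!Number.isInteger(merged.concurrency) || merged.concurrency < 1) {
    throw new RangeError("concurrency must be a positive integer");
  }
  return merged;
}

console.log(normalizeOptions({ depth: 2 }));
```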

DownloadResult

interface DownloadResult {
  success: boolean;
  filesDownloaded: number;
  errors: Array<{
    url: string;
    message: string;
  }>;
  archived: boolean;
}

Example:

{
  "success": true,
  "filesDownloaded": 42,
  "errors": [
    {
      "url": "https://example.com/missing.js",
      "message": "Resource not found"
    }
  ],
  "archived": false
}
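
Note that the example pairs `"success": true` with a non-empty `errors` array, so callers may want to inspect both fields. The sketch below turns a result into a short report; `summarizeDownload` is a hypothetical helper, and the interface is repeated only to keep the snippet self-contained.

```typescript
interface DownloadResult {
  success: boolean;
  filesDownloaded: number;
  errors: Array<{ url: string; message: string }>;
  archived: boolean;
}

// Hypothetical helper, not part of the websitetor API.
// Produces a one-line summary followed by one line per failed resource.
function summarizeDownload(result: DownloadResult): string {
  const source = result.archived ? "Wayback Machine" : "live site";
  const status = result.success ? "succeeded" : "failed";
  const lines = [
    `Download ${status}: ${result.filesDownloaded} file(s) from the ${source}, ` +
      `${result.errors.length} error(s).`,
    ...result.errors.map((e) => `  - ${e.url}: ${e.message}`),
  ];
  return lines.join("\n");
}

const example: DownloadResult = {
  success: true,
  filesDownloaded: 42,
  errors: [{ url: "https://example.com/missing.js", message: "Resource not found" }],
  archived: false,
};
console.log(summarizeDownload(example));
```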

`validate(sitePath)`

Scans a previously downloaded site directory for broken internal links and missing assets. Parses every HTML file and checks that all referenced resources exist on disk.

validate(sitePath: string): ValidationResult
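
The check described above (parse each HTML file, then confirm every referenced resource exists) can be sketched as follows. This is a deliberately simplified illustration, not the package's validator: `extractReferences` and `findMissing` are hypothetical helpers, a regular expression stands in for a real HTML parser, the file system is replaced by an in-memory set of paths, and every internal reference is treated as root-relative.

```typescript
// Hypothetical helper, not part of the websitetor API.
// Collects quoted href and src attribute values from an HTML string.
function extractReferences(html: string): { hrefs: string[]; srcs: string[] } {
  const hrefs: string[] = [];
  const srcs: string[] = [];
  // Requires whitespace before the attribute name so data-src is not matched.
  const attr = /(?:^|\s)(href|src)\s*=\s*(?:"([^"]*)"|'([^']*)')/gi;
  let match: RegExpExecArray | null;
  while ((match = attr.exec(html)) !== null) {
    const value = match[2] ?? match[3] ?? "";
    (match[1].toLowerCase() === "href" ? hrefs : srcs).push(value);
  }
  return { hrefs, srcs };
}

// Hypothetical helper, not part of the websitetor API.
// Returns the internal references that have no matching file in `filesOnDisk`.
function findMissing(refs: string[], filesOnDisk: Set<string>): string[] {
  return refs.filter((ref) => {
    // Skip absolute URLs (any scheme), protocol-relative URLs, and pure fragments.
    if (/^(?:[a-z][a-z0-9+.-]*:|\/\/|#)/i.test(ref)) return false;
    const path = ref.split(/[?#]/)[0].replace(/^\/+/, "");
    return !filesOnDisk.has(path);
  });
}

const html =
  `<a href="/about.html">About</a><img src='/images/logo.png'>` +
  `<a href="https://other.org/">Other</a><script src="/app.js"></script>`;
const onDisk = new Set(["index.html", "app.js"]);
const refs = extractReferences(html);
console.log(findMissing(refs.hrefs, onDisk), findMissing(refs.srcs, onDisk));
```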

ValidationResult

interface ValidationResult {
  valid: boolean;
  htmlFiles: number;
  brokenLinks: Array<{
    file: string;
    href: string;
    reason: string;
  }>;
  missingAssets: Array<{
    file: string;
    src: string;
    reason: string;
  }>;
}

Example:

{
  "valid": false,
  "htmlFiles": 5,
  "brokenLinks": [
    {
      "file": "index.html",
      "href": "/about.html",
      "reason": "File not found on disk"
    }
  ],
  "missingAssets": [
    {
      "file": "index.html",
      "src": "/images/logo.png",
      "reason": "Asset not found on disk"
    }
  ]
}
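
When a site has many problems, it can help to see which pages are affected most. The following sketch counts reported problems per HTML file; `problemsByFile` is a hypothetical helper, and the interface is repeated only so the snippet stands alone.

```typescript
interface ValidationResult {
  valid: boolean;
  htmlFiles: number;
  brokenLinks: Array<{ file: string; href: string; reason: string }>;
  missingAssets: Array<{ file: string; src: string; reason: string }>;
}

// Hypothetical helper, not part of the websitetor API.
// Counts broken links and missing assets per HTML file.
function problemsByFile(result: ValidationResult): Map<string, number> {
  const counts = new Map<string, number>();
  for (const { file } of [...result.brokenLinks, ...result.missingAssets]) {
    counts.set(file, (counts.get(file) ?? 0) + 1);
  }
  return counts;
}

const report: ValidationResult = {
  valid: false,
  htmlFiles: 5,
  brokenLinks: [{ file: "index.html", href: "/about.html", reason: "File not found on disk" }],
  missingAssets: [{ file: "index.html", src: "/images/logo.png", reason: "Asset not found on disk" }],
};
for (const [file, count] of problemsByFile(report)) {
  console.log(`${file}: ${count} problem(s)`);
}
```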

CLI Commands

| Command | Description |
| --- | --- |
| `websitetor list` | List all available commands and options |
| `websitetor download <url> <dest>` | Download a website to a local directory |
| `websitetor validate <path>` | Scan a downloaded site for broken links and missing assets |

`download` options

| Option | Description | Default |
| --- | --- | --- |
| `--wayback` | Use the Wayback Machine instead of the live site | false |
| `--depth <number>` | Maximum link recursion depth | 5 |
| `--concurrency <number>` | Number of files downloaded in parallel | 3 |

`validate` options

| Option | Description | Default |
| --- | --- | --- |
| `--json` | Output the full result as JSON | false |

Project Structure

websitetor/
├── src/
│   ├── downloader/       # Core crawl and download engine
│   ├── resolver/         # HTML and CSS link extraction
│   ├── archiver/         # Wayback Machine integration
│   ├── validator/        # Broken link and missing asset scanner
│   ├── utils/            # URL helpers and path mapping
│   ├── cli/              # CLI entry point
│   └── index.ts          # Public API exports
├── dist/                 # Compiled output
├── package.json
├── tsconfig.json
└── README.md
