webstract

v1.1.0

Published

5 months ago

Webstract extracts a website's complete HTML DOM along with all related CSS, JavaScript, and image assets, representing everything as a structured tree.


ninejuan

web scraper offline cli download

Webstract


CLI to snapshot a web page for offline use: downloads HTML plus CSS/JS/images on the same site, rewrites references, and saves everything locally.

Install

npm install -g webstract          # after publishing
# or run without installing
npx webstract <url> <outputDir>

Quick start

yarn start <url> <outputDir> [--concurrency <n>] [--timeout <ms>]
yarn start https://example.com ./dump

Build once for distribution:

yarn build
node dist/cli.js <url> <outputDir>

What it does

Follows redirects; the final URL is the base for asset rewriting.
Saves index.html under <outputDir>/<domain>/ and rewrites references to point to downloaded files.
Downloads assets on the same registrable domain or same root label (e.g., daum.net → daumcdn.net), not just strict origin.
Collects linked CSS (link[rel=stylesheet]), JS (script[src]), images (img[src], img[srcset], source[src], source[srcset], icons), inline CSS in <style> blocks and style attributes, and meta images (OG/Twitter).
Parses downloaded CSS for @import and url(...) references on the same domain/root label.
External origins remain absolute; skipped items are listed in missing-assets.json. Use --download-external to force-download other domains (saved under a hostname-prefixed path).
Writes _WST.md summary with request/final URLs and download/skip/fail counts.
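
The CSS step above can be sketched roughly as follows. This is a minimal illustration, not webstract's actual code: the helper names (registrableDomain, isRelatedHost, relatedCssRefs) are hypothetical, the registrable domain is approximated by the last two hostname labels, and "same root label" is assumed to mean that one root label is a prefix of the other, which is one reading that makes daum.net and daumcdn.net match.

```typescript
// Hypothetical sketch; these helpers are not part of webstract's public API.

// Naive registrable domain: the last two labels of the hostname.
// Assumption: fine for .com/.net, wrong for suffixes like co.uk
// (a real tool would consult the Public Suffix List).
function registrableDomain(hostname: string): string {
  return hostname.toLowerCase().split(".").slice(-2).join(".");
}

// Assumed reading of "same root label": one root label is a prefix of the
// other, so "daum" (daum.net) and "daumcdn" (daumcdn.net) count as related.
function isRelatedHost(pageHost: string, assetHost: string): boolean {
  const page = registrableDomain(pageHost);
  const asset = registrableDomain(assetHost);
  if (page === asset) return true;
  const a = page.split(".")[0] ?? "";
  const b = asset.split(".")[0] ?? "";
  return a.length > 0 && b.length > 0 && (a.startsWith(b) || b.startsWith(a));
}

// Collect url(...) and @import "..." targets, resolve them against the
// stylesheet URL, and keep only http(s) references on a related host.
function relatedCssRefs(css: string, stylesheetUrl: string): string[] {
  const patterns = [
    /url\(\s*(['"]?)(.*?)\1\s*\)/gi, // url(x), url('x'), url("x")
    /@import\s+(['"])(.*?)\1/gi, // @import "x" (without url())
  ];
  const base = new URL(stylesheetUrl);
  const refs: string[] = [];
  for (const pattern of patterns) {
    let match: RegExpExecArray | null;
    while ((match = pattern.exec(css)) !== null) {
      const raw = (match[2] ?? "").trim();
      if (raw === "" || raw.startsWith("data:")) continue; // skip inline data URIs
      let resolved: URL;
      try {
        resolved = new URL(raw, base);
      } catch {
        continue; // ignore references that are not valid URLs
      }
      if (resolved.protocol !== "http:" && resolved.protocol !== "https:") continue;
      if (!isRelatedHost(base.hostname, resolved.hostname)) continue;
      if (refs.indexOf(resolved.href) === -1) refs.push(resolved.href);
    }
  }
  return refs;
}
```

A prefix match is deliberately loose and can over-match unrelated hosts with short names, so treat it as an illustration of the idea rather than a rule to copy.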

CLI options

| Option | Description | Default |
| --- | --- | --- |
| -c, --concurrency <n> | Concurrent downloads | WEBSTRACT_CONCURRENCY or 5 |
| -t, --timeout <ms> | Request timeout in ms | WEBSTRACT_TIMEOUT_MS or 15000 |
| -r, --retries <n> | Retry attempts per request | WEBSTRACT_MAX_RETRIES or 3 |
| --retry-delay <ms> | Delay between retries in ms (exponential backoff) | WEBSTRACT_RETRY_DELAY_MS or 1000 |
| --user-agent <ua> | Custom User-Agent string | WEBSTRACT_USER_AGENT |
| --no-follow-redirects | Do not follow HTTP redirects | redirects followed |
| --insecure | Allow insecure TLS (self-signed) | off |
| --download-external | Force download of external-domain assets (prefixed by hostname) | off |
| --no-css-parse | Skip CSS @import/url() parsing | CSS parsing on |
| --no-meta | Skip meta (OG/Twitter) image discovery | meta discovery on |
| --summary-format <md\|json> | _WST summary format | md |
| --output-name <name> | Override output folder name | derived from domain |
| --quiet / --verbose | Control log verbosity | normal |
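
To make the retry options concrete, here is a small sketch of retrying with exponential backoff. It is not taken from webstract's source: the names backoffDelays and withRetries are hypothetical, the delay is assumed to double on each retry, and --retries is assumed to count attempts made after the first one.

```typescript
// Hypothetical sketch; doubling per retry is an assumption, not confirmed behavior.

// Delay before each retry: base, 2x base, 4x base, ...
function backoffDelays(retries: number, baseDelayMs: number): number[] {
  return Array.from({ length: retries }, (_, attempt) => baseDelayMs * 2 ** attempt);
}

// Run a task, retrying up to `retries` times after the initial attempt.
async function withRetries<T>(
  task: () => Promise<T>,
  retries = 3,
  baseDelayMs = 1000,
): Promise<T> {
  const delays = backoffDelays(retries, baseDelayMs);
  let lastError: unknown;
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await task();
    } catch (error) {
      lastError = error;
      const delay = delays[attempt];
      if (delay === undefined) break; // no retries left
      await new Promise((resolve) => setTimeout(resolve, delay));
    }
  }
  throw lastError;
}
```

Under these assumptions, the defaults (3 retries, 1000 ms) would wait 1000, 2000, and 4000 ms before the successive retries.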

Environment variables: WEBSTRACT_CONCURRENCY, WEBSTRACT_TIMEOUT_MS, WEBSTRACT_MAX_RETRIES, WEBSTRACT_RETRY_DELAY_MS, WEBSTRACT_USER_AGENT.
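
The "flag, then environment variable, then built-in default" fallback implied by the Default column can be sketched like this. The helper name resolveNumber is hypothetical, and both the precedence order and the handling of invalid values (ignored, falling through to the next source) are assumptions rather than documented behavior.

```typescript
// Hypothetical sketch of option resolution; not webstract's actual code.
// Assumed precedence: CLI flag value, then environment variable, then default.
function resolveNumber(
  flagValue: string | undefined,
  envValue: string | undefined,
  fallback: number,
): number {
  for (const candidate of [flagValue, envValue]) {
    if (candidate === undefined || candidate.trim() === "") continue;
    const parsed = Number(candidate);
    // Assumption: non-numeric or non-positive values are ignored.
    if (Number.isFinite(parsed) && parsed > 0) return parsed;
  }
  return fallback;
}

// Example call in Node, where `cliConcurrency` stands for the raw --concurrency value:
//   resolveNumber(cliConcurrency, process.env.WEBSTRACT_CONCURRENCY, 5)
```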

Output layout

<outputDir>/<domain>/
├─ index.html
├─ _WST.md                  # summary
├─ missing-assets.json      # only if something was skipped
├─ css/...
├─ js/...
└─ images/...               # file tree mirrors remote paths

Open index.html in a browser for the offline copy. Check _WST.md for a quick summary and missing-assets.json to see which external assets stayed remote.

Programmatic use

import { webstract } from "webstract";

await webstract("https://example.com", "./dump/example.com");

Project layout

src/webstract.ts: Orchestrates extraction and options.
src/lib/: Shared utilities (HTTP client, logger).
src/extract/: Core extraction logic (collector, CSS parsing, downloader, rewriter, output).
CLI entry: src/cli.ts.

Environment variables are loaded via dotenv (quiet mode); keep your .env out of version control.
