webstract
v1.1.0
Published
Webstract extracts a website's complete HTML DOM along with all related CSS, JavaScript, and image assets, representing everything as a structured tree.
Webstract
CLI to snapshot a web page for offline use: downloads HTML plus CSS/JS/images on the same site, rewrites references, and saves everything locally.
Install
npm install -g webstract # after publishing
# or run without installing
npx webstract <url> <outputDir>
Quick start
yarn start <url> <outputDir> [--concurrency <n>] [--timeout <ms>]
yarn start https://example.com ./dump
Build once for distribution:
yarn build
node dist/cli.js <url> <outputDir>
What it does
- Follows redirects; the final URL is the base for asset rewriting.
- Saves index.html under <outputDir>/<domain>/ and rewrites references to point to downloaded files.
- Downloads assets on the same registrable domain or same root label (e.g., daum.net → daumcdn.net), not just strict origin.
- Collects linked CSS (link[rel=stylesheet]), JS (script[src]), images (img/srcset, img[src], source[src|srcset], icons), inline CSS in <style>/style=, and meta images (OG/Twitter).
- Parses downloaded CSS for @import and url(...) references on the same domain/root label.
- External origins remain absolute; skipped items are listed in missing-assets.json. Use --download-external to force-download other domains (saved under a hostname-prefixed path).
- Writes a _WST.md summary with request/final URLs and download/skip/fail counts.
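The CSS parsing step above can be illustrated with a minimal sketch. The function name `extractCssRefs` and the regex-based approach are assumptions for illustration, not webstract's actual implementation; the same-domain/root-label filter described above is deliberately left out.

```typescript
// Hypothetical helper: collects absolute URLs referenced by @import and url(...)
// in a stylesheet. Not part of webstract's documented API.
function extractCssRefs(css: string, baseUrl: string): string[] {
  const refs = new Set<string>();
  // Drop comments so commented-out rules are not collected.
  const clean = css.replace(/\/\*[\s\S]*?\*\//g, "");
  // Matches @import "x.css" and @import 'x.css' (the url(...) form is caught below).
  const importRe = /@import\s+(["'])(.*?)\1/g;
  // Matches url(x), url("x") and url('x').
  const urlRe = /url\(\s*(["']?)(.*?)\1\s*\)/g;
  for (const re of [importRe, urlRe]) {
    let m: RegExpExecArray | null;
    while ((m = re.exec(clean)) !== null) {
      const raw = (m[2] ?? "").trim();
      // Skip empty references and inline data URIs; there is nothing to download.
      if (raw === "" || raw.startsWith("data:")) continue;
      try {
        // Resolve relative references against the stylesheet's own URL.
        refs.add(new URL(raw, baseUrl).href);
      } catch (_err) {
        // Unparseable reference: ignore it in this sketch.
      }
    }
  }
  return Array.from(refs);
}

const sampleCss = `
@import "base.css";
/* url(ignored.png) */
.hero { background: url('../img/bg.png'); }
.dot  { background: url(data:image/png;base64,AAAA); }
@import url(theme/dark.css);
`;
const found = extractCssRefs(sampleCss, "https://example.com/static/css/site.css");
console.log(found);
```

Resolving against the stylesheet URL rather than the page URL matters because relative paths in CSS are relative to the CSS file itself.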
CLI options
| Option | Description | Default |
| --- | --- | --- |
| -c, --concurrency <n> | Concurrent downloads | WEBSTRACT_CONCURRENCY or 5 |
| -t, --timeout <ms> | Request timeout in ms | WEBSTRACT_TIMEOUT_MS or 15000 |
| -r, --retries <n> | Retry attempts per request | WEBSTRACT_MAX_RETRIES or 3 |
| --retry-delay <ms> | Delay between retries (exponential backoff) | WEBSTRACT_RETRY_DELAY_MS or 1000 |
| --user-agent <ua> | Custom User-Agent string | WEBSTRACT_USER_AGENT |
| --no-follow-redirects | Do not follow HTTP redirects | follow redirects |
| --insecure | Allow insecure TLS (self-signed) | off |
| --download-external | Force download of external-domain assets (prefixed by hostname) | off |
| --no-css-parse | Skip CSS @import/url() parsing | CSS parsing enabled |
| --no-meta | Skip meta (OG/Twitter) image discovery | meta discovery enabled |
| --summary-format <md\|json> | _WST summary format | md |
| --output-name <name> | Override output folder name | derived from domain |
| --quiet / --verbose | Control log verbosity | normal |
Environment variables: WEBSTRACT_CONCURRENCY, WEBSTRACT_TIMEOUT_MS, WEBSTRACT_MAX_RETRIES, WEBSTRACT_RETRY_DELAY_MS, WEBSTRACT_USER_AGENT.
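The defaults column suggests a precedence of CLI flag, then environment variable, then built-in default. The sketch below shows one way that resolution could work; `resolveNumber`, `resolveOptions`, and `backoffDelays` are hypothetical names, and both the handling of invalid values and the doubling factor for backoff are assumptions rather than documented behavior.

```typescript
type Env = Record<string, string | undefined>;

interface Flags {
  concurrency?: string;
  timeout?: string;
  retries?: string;
  retryDelay?: string;
}

// Hypothetical helper: the first valid non-negative number wins, else the fallback.
function resolveNumber(
  cliValue: string | undefined,
  envValue: string | undefined,
  fallback: number
): number {
  for (const candidate of [cliValue, envValue]) {
    if (candidate === undefined || candidate.trim() === "") continue;
    const n = Number(candidate);
    if (Number.isFinite(n) && n >= 0) return n;
  }
  return fallback;
}

// Fallback numbers are taken from the options table above.
function resolveOptions(flags: Flags, env: Env) {
  return {
    concurrency: resolveNumber(flags.concurrency, env.WEBSTRACT_CONCURRENCY, 5),
    timeoutMs: resolveNumber(flags.timeout, env.WEBSTRACT_TIMEOUT_MS, 15000),
    retries: resolveNumber(flags.retries, env.WEBSTRACT_MAX_RETRIES, 3),
    retryDelayMs: resolveNumber(flags.retryDelay, env.WEBSTRACT_RETRY_DELAY_MS, 1000),
  };
}

// Assumed backoff schedule: the base delay doubles on every retry.
function backoffDelays(retries: number, baseMs: number): number[] {
  return Array.from({ length: retries }, (_, i) => baseMs * 2 ** i);
}

const opts = resolveOptions(
  { concurrency: "8" },
  { WEBSTRACT_CONCURRENCY: "2", WEBSTRACT_TIMEOUT_MS: "3000" }
);
console.log(opts, backoffDelays(opts.retries, opts.retryDelayMs));
```

Passing the environment in as a plain object, instead of reading `process.env` inside the helper, keeps the sketch easy to test.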
Output layout
<outputDir>/<domain>/
├─ index.html
├─ _WST.md # summary
├─ missing-assets.json # only if something was skipped
├─ css/...
├─ js/...
└─ images/... # file tree mirrors remote paths
Open index.html in a browser for the offline copy. Check _WST.md for a quick summary and missing-assets.json to see which external assets stayed remote.
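To make "file tree mirrors remote paths" concrete, here is a rough sketch of how a remote asset URL might map to a path inside the output folder. The function `localPathFor`, the extension-to-folder rules, and the choice to drop query strings are all assumptions for illustration; webstract's real mapping may differ.

```typescript
// Hypothetical mapping from a remote asset URL to a relative local path.
// Only the css/, js/ and images/ folder names come from the layout above.
function localPathFor(assetUrl: string): string {
  const url = new URL(assetUrl);
  // Keep non-empty path segments; the query string and hash are dropped.
  const segments = url.pathname.split("/").filter((s) => s !== "");
  const fileName = segments[segments.length - 1] ?? "";
  const dot = fileName.lastIndexOf(".");
  const ext = dot >= 0 ? fileName.slice(dot + 1).toLowerCase() : "";

  let folder: string;
  if (ext === "css") {
    folder = "css";
  } else if (ext === "js" || ext === "mjs") {
    folder = "js";
  } else {
    // Assumption: everything else is treated as an image in this sketch.
    folder = "images";
  }
  // A bare "/" path has no segments, so use a placeholder file name.
  const rest = segments.length > 0 ? segments.join("/") : "index";
  return `${folder}/${rest}`;
}

console.log(localPathFor("https://example.com/static/site.css?v=3"));
console.log(localPathFor("https://example.com/assets/app.min.js"));
console.log(localPathFor("https://example.com/img/logo.PNG"));
```

Keeping the remote directory structure under each folder avoids name collisions between files such as `a/logo.png` and `b/logo.png`.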
Programmatic use
import { webstract } from "webstract";
await webstract("https://example.com", "./dump/example.com");
Project layout
- src/webstract.ts: Orchestrates extraction and options.
- src/lib/: Shared utilities (HTTP client, logger).
- src/extract/: Core extraction logic (collector, CSS parsing, downloader, rewriter, output).
- CLI entry: src/cli.ts.
Environment variables are loaded via dotenv (quiet mode); keep your .env out of version control.
