
@link-assistant/web-capture

v1.7.26


CLI and microservice to render web pages as HTML, Markdown, or PNG


web-capture (JavaScript/Node.js)


A CLI and microservice to fetch URLs and render them as:

  • Markdown: Converted from HTML with image extraction (default)
  • HTML: Rendered page content
  • PNG/JPEG screenshot: Viewport or full-page capture with theme support
  • ZIP archive: Markdown + locally downloaded images
  • PDF: Print-quality document with embedded images
  • DOCX: Word document with embedded images

Installation

From npm

npm install -g @link-assistant/web-capture

From Source

cd js
npm install

Quick Start

CLI Usage

# Capture a URL as Markdown (default format)
# Output auto-derived to ./data/web-capture/<host>/<path>/document.md
web-capture https://example.com

# Capture as Markdown to a specific file
web-capture https://example.com -o page.md

# Write to stdout explicitly
web-capture https://example.com -o -

# Capture as HTML
web-capture https://example.com --format html -o page.html

# Take a PNG screenshot
web-capture https://example.com --format png -o screenshot.png

# Take a JPEG screenshot with custom quality
web-capture https://example.com --format jpeg --quality 60 -o screenshot.jpg

# Dark theme full-page screenshot
web-capture https://example.com --format png --theme dark --fullPage -o dark.png

# Custom viewport size
web-capture https://example.com --format png --width 1920 --height 1080 -o wide.png

# Download as PDF
web-capture https://example.com --format pdf -o page.pdf

# Download as DOCX
web-capture https://example.com --format docx -o page.docx

# Create a ZIP archive (default: zip format)
web-capture https://example.com --archive

# Create a ZIP archive with specific format
web-capture https://example.com --archive zip -o site.zip
web-capture https://example.com --archive tar.gz -o site.tar.gz

# Keep images inline as base64 (opt-in)
web-capture https://example.com --embed-images -o page.md

# Custom images directory
web-capture https://example.com --images-dir assets -o page.md

# Disable specific features
web-capture https://example.com --no-extract-latex --no-post-process -o page.md

# Start as API server
web-capture --serve

# Start server on custom port
web-capture --serve --port 8080

Google Docs

# Capture the live editor model as Markdown (default --capture browser)
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture browser

# Capture via the public export endpoint
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture api

# Capture via the Google Docs REST API
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture api --apiToken YOUR_TOKEN

Browser-model capture waits until DOCS_modelChunk data stops changing before parsing. For very large or slow documents, tune the wait with WEB_CAPTURE_GDOCS_STABILITY_MS (default 1500) and WEB_CAPTURE_GDOCS_MAX_WAIT_MS (default 30000).
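The stability wait described above can be pictured as a small polling loop. This is an illustrative sketch only (the helper name and signature are hypothetical, not the package's actual code); the two timing knobs mirror the documented WEB_CAPTURE_GDOCS_STABILITY_MS and WEB_CAPTURE_GDOCS_MAX_WAIT_MS defaults:

```javascript
// Hypothetical sketch of a stability wait: poll `getValue` until its
// result stops changing for `stabilityMs`, giving up after `maxWaitMs`.
async function waitForStable(getValue, { stabilityMs = 1500, maxWaitMs = 30000, pollMs = 100 } = {}) {
  const start = Date.now();
  let last = await getValue();
  let lastChange = Date.now();
  while (Date.now() - start < maxWaitMs) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
    const current = await getValue();
    if (current !== last) {
      // Value is still changing: reset the stability window.
      last = current;
      lastChange = Date.now();
    } else if (Date.now() - lastChange >= stabilityMs) {
      return last; // stable long enough; safe to parse
    }
  }
  return last; // max wait reached; proceed with what we have
}
```

A larger `stabilityMs` tolerates slower chunk delivery at the cost of a longer capture.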

API Endpoints (Server Mode)

Start the server with web-capture --serve and use the endpoints below.

HTML Endpoint

GET /html?url=<URL>&engine=<ENGINE>

Returns the raw HTML content of the specified URL.

| Parameter | Required | Description                          | Default   |
| --------- | -------- | ------------------------------------ | --------- |
| url       | Yes      | URL to fetch                         | -         |
| engine    | No       | Browser engine: puppeteer/playwright | puppeteer |

Markdown Endpoint

GET /markdown?url=<URL>

Converts the HTML content of the specified URL to Markdown format. By default, original remote image URLs are preserved and base64 data URIs are stripped (clean single-file output). Use keepOriginalLinks=false to strip all images, or embedImages=true to keep base64 images inline.

| Parameter         | Required | Description                             | Default |
| ----------------- | -------- | --------------------------------------- | ------- |
| url               | Yes      | URL to fetch                            | -       |
| embedImages       | No       | Keep base64 images inline (true/false)  | false   |
| keepOriginalLinks | No       | Keep original remote URLs, strip base64 | true    |

Image Endpoint

GET /image?url=<URL>&format=png&theme=dark&width=1920&height=1080&fullPage=true

Returns a screenshot of the specified URL.

| Parameter     | Required | Description                          | Default         |
| ------------- | -------- | ------------------------------------ | --------------- |
| url           | Yes      | URL to capture                       | -               |
| engine        | No       | Browser engine: puppeteer/playwright | puppeteer       |
| format        | No       | Image format: png (lossless) or jpeg | png             |
| quality       | No       | JPEG quality 0-100 (ignored for PNG) | 80              |
| width         | No       | Viewport width in pixels             | 1280            |
| height        | No       | Viewport height in pixels            | 800             |
| fullPage      | No       | Capture full scrollable page         | false           |
| theme         | No       | Color scheme: light, dark            | browser default |
| dismissPopups | No       | Auto-close cookie/consent popups     | true            |

Archive Endpoint

GET /archive?url=<URL>&localImages=true&documentFormat=markdown

Returns a ZIP archive containing either document.md or document.html and asset directories (images/, css/).

| Parameter         | Required | Description                                  | Default  |
| ----------------- | -------- | -------------------------------------------- | -------- |
| url               | Yes      | URL to archive                               | -        |
| localImages       | No       | Download images locally into the archive     | true     |
| documentFormat    | No       | Document format: markdown or html            | markdown |
| embedImages       | No       | Keep base64 images inline in the document    | false    |
| keepOriginalLinks | No       | Keep original remote URLs (skip downloading) | false    |

Archive structure (with localImages=true):

archive.zip
├── document.md       # or document.html when documentFormat=html
├── images/
│   ├── image-1.png
│   ├── image-2.jpg
│   └── ...
└── css/              # only when documentFormat=html
    ├── style-1.css
    └── ...

PDF Endpoint

GET /pdf?url=<URL>&theme=light

Returns a PDF document rendered by the browser engine (all images embedded).

| Parameter     | Required | Description                     | Default         |
| ------------- | -------- | ------------------------------- | --------------- |
| url           | Yes      | URL to convert                  | -               |
| engine        | No       | Browser engine                  | puppeteer       |
| width         | No       | Viewport width in pixels        | 1280            |
| height        | No       | Viewport height in pixels       | 800             |
| theme         | No       | Color scheme: light, dark       | browser default |
| dismissPopups | No       | Auto-close popups before export | true            |

DOCX Endpoint

GET /docx?url=<URL>

Returns a DOCX document with embedded images.

| Parameter | Required | Description    | Default |
| --------- | -------- | -------------- | ------- |
| url       | Yes      | URL to convert | -       |

Fetch/Stream Endpoints

GET /fetch?url=<URL>
GET /stream?url=<URL>

Proxy endpoints: /fetch proxies the fetched response, /stream proxies it as a stream.

CLI Reference

Server Mode

web-capture --serve [--port <port>]

| Option  | Short | Description              | Default            |
| ------- | ----- | ------------------------ | ------------------ |
| --serve | -s    | Start as HTTP API server | -                  |
| --port  | -p    | Port to listen on        | 3000 (or PORT env) |

Capture Mode

web-capture <url> [options]

| Option                    | Short | Description                                  | Default                           |
| ------------------------- | ----- | -------------------------------------------- | --------------------------------- |
| --format                  | -f    | Output format (see below)                    | markdown                          |
| --output                  | -o    | Output file path. Use -o - for stdout        | auto-derived from URL             |
| --data-dir                |       | Base directory for auto-derived output paths | ./data/web-capture                |
| --engine                  | -e    | Browser engine: puppeteer, playwright        | puppeteer (or BROWSER_ENGINE env) |
| --theme                   | -t    | Color scheme: light, dark, no-preference     | browser default                   |
| --width                   |       | Viewport width in pixels                     | 1280                              |
| --height                  |       | Viewport height in pixels                    | 800                               |
| --quality                 |       | JPEG quality 0-100                           | 80                                |
| --fullPage                |       | Capture full scrollable page                 | false                             |
| --embed-images            |       | Keep images as inline base64 data URIs       | false                             |
| --no-extract-images       |       | Alias for --embed-images                     | false                             |
| --keep-original-links     |       | Keep original remote URLs, strip base64      | false                             |
| --images-dir              |       | Subdirectory name for extracted images       | images                            |
| --archive                 |       | Create archive: zip, 7z, tar.gz, tar         | -                                 |
| --document-format         |       | Document format in archive: markdown, html   | markdown                          |
| --localImages             |       | Download images locally in archive mode      | true                              |
| --extract-latex           |       | Extract LaTeX formulas                       | true                              |
| --no-extract-latex        |       | Disable LaTeX extraction                     | -                                 |
| --extract-metadata        |       | Extract article metadata                     | true                              |
| --no-extract-metadata     |       | Disable metadata extraction                  | -                                 |
| --post-process            |       | Apply post-processing                        | true                              |
| --no-post-process         |       | Disable post-processing                      | -                                 |
| --detect-code-language    |       | Detect code block languages                  | true                              |
| --no-detect-code-language |       | Disable code language detection              | -                                 |

Supported formats:

| Format        | Description                              |
| ------------- | ---------------------------------------- |
| markdown / md | Markdown conversion (default)            |
| html          | Rendered HTML                            |
| image / png   | PNG screenshot (lossless)                |
| jpeg          | JPEG screenshot (configurable quality)   |
| pdf           | PDF with embedded images                 |
| docx          | Word document with embedded images       |
| archive / zip | ZIP archive with markdown + local images |

Configuration

Configuration values are resolved with the following priority (highest to lowest):

  1. CLI arguments: --port 8080
  2. Environment variables: PORT=8080
  3. Default .lenv file: .lenv in the project root
  4. Built-in defaults
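The priority order can be expressed as a simple lookup. This is an illustrative sketch (the helper name is hypothetical, not the package's actual implementation):

```javascript
// Hypothetical sketch: resolve one setting using the documented priority
// order -- CLI arguments, then environment variables, then .lenv values,
// then built-in defaults. The first source that defines the key wins.
function resolveConfig(name, { cli = {}, env = {}, lenv = {}, defaults = {} } = {}) {
  for (const source of [cli, env, lenv, defaults]) {
    if (source[name] !== undefined) return source[name];
  }
  return undefined;
}
```

For example, with `--port 8080` on the CLI and `PORT=3000` in the environment, the CLI value wins.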

Environment Variables

| Variable                         | Description                         | Default            |
| -------------------------------- | ----------------------------------- | ------------------ |
| PORT                             | Server port                         | 3000               |
| BROWSER_ENGINE                   | Browser engine                      | puppeteer          |
| API_TOKEN                        | API token for authenticated capture | -                  |
| WEB_CAPTURE_DATA_DIR             | Base directory for output           | ./data/web-capture |
| WEB_CAPTURE_EMBED_IMAGES         | 0/1 — keep images inline            | 0                  |
| WEB_CAPTURE_KEEP_ORIGINAL_LINKS  | 0/1 — keep original remote URLs     | 0                  |
| WEB_CAPTURE_IMAGES_DIR           | Subdirectory for extracted images   | images             |
| WEB_CAPTURE_EXTRACT_LATEX        | 0/1 — extract LaTeX                 | 1                  |
| WEB_CAPTURE_EXTRACT_METADATA     | 0/1 — extract metadata              | 1                  |
| WEB_CAPTURE_POST_PROCESS         | 0/1 — post-processing               | 1                  |
| WEB_CAPTURE_DETECT_CODE_LANGUAGE | 0/1 — detect code langs             | 1                  |
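Any of the variables above can also be placed in the project-root .lenv file. Assuming .lenv follows the common KEY=VALUE dotenv convention (the exact syntax is not specified here), a minimal file might look like:

```
# .lenv — example values (KEY=VALUE format assumed)
PORT=8080
BROWSER_ENGINE=playwright
WEB_CAPTURE_DATA_DIR=./captures
```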

Browser Engine Support

The service supports both Puppeteer and Playwright browser engines:

  • Puppeteer: Default engine, mature and well-tested
  • Playwright: Alternative engine with similar capabilities

Supported engine values:

  • puppeteer or pptr - Use Puppeteer
  • playwright or pw - Use Playwright
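Mapping the aliases onto their canonical engines can be sketched as follows (a hypothetical helper for illustration, not the package's actual code):

```javascript
// Hypothetical sketch: normalize the documented engine aliases.
function normalizeEngine(value) {
  const aliases = {
    puppeteer: 'puppeteer',
    pptr: 'puppeteer',
    playwright: 'playwright',
    pw: 'playwright',
  };
  const engine = aliases[String(value).toLowerCase()];
  if (!engine) throw new Error(`Unsupported engine: ${value}`);
  return engine;
}
```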

Popup/Modal Dismissal

By default, the image, PDF, and archive endpoints automatically dismiss common popups before capture:

  • Google Funding Choices consent dialogs
  • Cookie consent banners
  • Generic modals and overlays
  • Fixed-position overlays covering content

Set dismissPopups=false to disable this behavior.

Docker

# Build and run using Docker Compose
docker compose up -d

# Or manually
docker build -t web-capture-js .
docker run -p 3000:3000 web-capture-js

Development

Available Commands

  • npm run dev - Start the development server with hot reloading
  • npm run start - Start the service using Docker Compose
  • npm test - Run all unit tests
  • npm run lint - Check code with ESLint
  • npm run format - Format code with Prettier

Testing

npm test                    # Run unit and integration tests
npm run test:e2e            # Run end-to-end tests
npm run test:all            # Run all tests including build

Built With

  • Express.js for the web server
  • Puppeteer and Playwright for headless browser automation
  • Turndown for HTML to Markdown conversion
  • archiver for ZIP creation
  • docx for DOCX generation
  • Jest for testing

Library Usage

You can also use web-capture as a Node.js library:

import {
  fetchHtml,
  convertHtmlToMarkdown,
} from '@link-assistant/web-capture/src/lib.js';
import { createBrowser } from '@link-assistant/web-capture/src/browser.js';
import { dismissPopups } from '@link-assistant/web-capture/src/popups.js';

// Fetch and convert to markdown
const html = await fetchHtml('https://habr.com/en/articles/895896/');
const markdown = convertHtmlToMarkdown(
  html,
  'https://habr.com/en/articles/895896/'
);

// Take a themed screenshot
const browser = await createBrowser('playwright', { colorScheme: 'dark' });
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
await page.goto('https://habr.com/en/articles/895896/', {
  waitUntil: 'networkidle0',
});
await dismissPopups(page);
const buffer = await page.screenshot({ type: 'png', fullPage: true });
await browser.close();

License

Unlicense — This is free and unencumbered software released into the public domain. You are free to copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial, and by any means. See https://unlicense.org for details.