
@link-assistant/web-capture

v1.7.26


CLI and microservice to render web pages as HTML, Markdown, or PNG


web-capture (JavaScript/Node.js)


A CLI and microservice to fetch URLs and render them as:

  • Markdown: Converted from HTML with image extraction (default)
  • HTML: Rendered page content
  • PNG/JPEG screenshot: Viewport or full-page capture with theme support
  • ZIP archive: Markdown + locally downloaded images
  • PDF: Print-quality document with embedded images
  • DOCX: Word document with embedded images

Installation

From npm

npm install -g @link-assistant/web-capture

From Source

cd js
npm install

Quick Start

CLI Usage

# Capture a URL as Markdown (default format)
# Output auto-derived to ./data/web-capture/<host>/<path>/document.md
web-capture https://example.com

# Capture as Markdown to a specific file
web-capture https://example.com -o page.md

# Write to stdout explicitly
web-capture https://example.com -o -

# Capture as HTML
web-capture https://example.com --format html -o page.html

# Take a PNG screenshot
web-capture https://example.com --format png -o screenshot.png

# Take a JPEG screenshot with custom quality
web-capture https://example.com --format jpeg --quality 60 -o screenshot.jpg

# Dark theme full-page screenshot
web-capture https://example.com --format png --theme dark --fullPage -o dark.png

# Custom viewport size
web-capture https://example.com --format png --width 1920 --height 1080 -o wide.png

# Download as PDF
web-capture https://example.com --format pdf -o page.pdf

# Download as DOCX
web-capture https://example.com --format docx -o page.docx

# Create a ZIP archive (default: zip format)
web-capture https://example.com --archive

# Create a ZIP archive with specific format
web-capture https://example.com --archive zip -o site.zip
web-capture https://example.com --archive tar.gz -o site.tar.gz

# Keep images inline as base64 (opt-in)
web-capture https://example.com --embed-images -o page.md

# Custom images directory
web-capture https://example.com --images-dir assets -o page.md

# Disable specific features
web-capture https://example.com --no-extract-latex --no-post-process -o page.md

# Start as API server
web-capture --serve

# Start server on custom port
web-capture --serve --port 8080

Google Docs

# Capture the live editor model as Markdown (default --capture browser)
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture browser

# Capture via the public export endpoint
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture api

# Capture via the Google Docs REST API
web-capture https://docs.google.com/document/d/DOC_ID/edit --capture api --apiToken YOUR_TOKEN

Browser-model capture waits until DOCS_modelChunk data stops changing before parsing. For very large or slow documents, tune the wait with WEB_CAPTURE_GDOCS_STABILITY_MS (default 1500) and WEB_CAPTURE_GDOCS_MAX_WAIT_MS (default 30000).
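The stability wait described above can be pictured as a small polling loop. This is an illustrative sketch only (the helper name and signature are hypothetical, not the package's actual code); the two timing knobs mirror the documented WEB_CAPTURE_GDOCS_STABILITY_MS and WEB_CAPTURE_GDOCS_MAX_WAIT_MS defaults:

```javascript
// Hypothetical sketch of a stability wait: poll `getValue` until its
// result stops changing for `stabilityMs`, giving up after `maxWaitMs`.
async function waitForStable(getValue, { stabilityMs = 1500, maxWaitMs = 30000, pollMs = 100 } = {}) {
  const start = Date.now();
  let last = await getValue();
  let lastChange = Date.now();
  while (Date.now() - start < maxWaitMs) {
    await new Promise((resolve) => setTimeout(resolve, pollMs));
    const current = await getValue();
    if (current !== last) {
      // Value is still changing: reset the stability window.
      last = current;
      lastChange = Date.now();
    } else if (Date.now() - lastChange >= stabilityMs) {
      return last; // stable long enough; safe to parse
    }
  }
  return last; // max wait reached; proceed with what we have
}
```

A larger `stabilityMs` tolerates slower chunk delivery at the cost of a longer capture.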

API Endpoints (Server Mode)

Start the server with web-capture --serve and use the endpoints below.

HTML Endpoint

GET /html?url=<URL>&engine=<ENGINE>

Returns the raw HTML content of the specified URL.

| Parameter | Required | Description                          | Default   |
| --------- | -------- | ------------------------------------ | --------- |
| url       | Yes      | URL to fetch                         | -         |
| engine    | No       | Browser engine: puppeteer/playwright | puppeteer |

Markdown Endpoint

GET /markdown?url=<URL>

Converts the HTML content of the specified URL to Markdown format. By default, original remote image URLs are preserved and base64 data URIs are stripped (clean single-file output). Use keepOriginalLinks=false to strip all images, or embedImages=true to keep base64 images inline.

| Parameter         | Required | Description                             | Default |
| ----------------- | -------- | --------------------------------------- | ------- |
| url               | Yes      | URL to fetch                            | -       |
| embedImages       | No       | Keep base64 images inline (true/false)  | false   |
| keepOriginalLinks | No       | Keep original remote URLs, strip base64 | true    |

Image Endpoint

GET /image?url=<URL>&format=png&theme=dark&width=1920&height=1080&fullPage=true

Returns a screenshot of the specified URL.

| Parameter     | Required | Description                          | Default         |
| ------------- | -------- | ------------------------------------ | --------------- |
| url           | Yes      | URL to capture                       | -               |
| engine        | No       | Browser engine: puppeteer/playwright | puppeteer       |
| format        | No       | Image format: png (lossless) or jpeg | png             |
| quality       | No       | JPEG quality 0-100 (ignored for PNG) | 80              |
| width         | No       | Viewport width in pixels             | 1280            |
| height        | No       | Viewport height in pixels            | 800             |
| fullPage      | No       | Capture full scrollable page         | false           |
| theme         | No       | Color scheme: light, dark            | browser default |
| dismissPopups | No       | Auto-close cookie/consent popups     | true            |

Archive Endpoint

GET /archive?url=<URL>&localImages=true&documentFormat=markdown

Returns a ZIP archive containing either document.md or document.html and asset directories (images/, css/).

| Parameter         | Required | Description                                  | Default  |
| ----------------- | -------- | -------------------------------------------- | -------- |
| url               | Yes      | URL to archive                               | -        |
| localImages       | No       | Download images locally into the archive     | true     |
| documentFormat    | No       | Document format: markdown or html            | markdown |
| embedImages       | No       | Keep base64 images inline in the document    | false    |
| keepOriginalLinks | No       | Keep original remote URLs (skip downloading) | false    |

Archive structure (with localImages=true):

archive.zip
├── document.md       # or document.html when documentFormat=html
├── images/
│   ├── image-1.png
│   ├── image-2.jpg
│   └── ...
└── css/              # only when documentFormat=html
    ├── style-1.css
    └── ...

PDF Endpoint

GET /pdf?url=<URL>&theme=light

Returns a PDF document rendered by the browser engine (all images embedded).

| Parameter     | Required | Description                     | Default         |
| ------------- | -------- | ------------------------------- | --------------- |
| url           | Yes      | URL to convert                  | -               |
| engine        | No       | Browser engine                  | puppeteer       |
| width         | No       | Viewport width in pixels        | 1280            |
| height        | No       | Viewport height in pixels       | 800             |
| theme         | No       | Color scheme: light, dark       | browser default |
| dismissPopups | No       | Auto-close popups before export | true            |

DOCX Endpoint

GET /docx?url=<URL>

Returns a DOCX document with embedded images.

| Parameter | Required | Description    | Default |
| --------- | -------- | -------------- | ------- |
| url       | Yes      | URL to convert | -       |

Fetch/Stream Endpoints

GET /fetch?url=<URL>
GET /stream?url=<URL>

Proxy endpoints: /fetch proxies the fetched response, /stream proxies it as a stream.

CLI Reference

Server Mode

web-capture --serve [--port <port>]

| Option  | Short | Description              | Default            |
| ------- | ----- | ------------------------ | ------------------ |
| --serve | -s    | Start as HTTP API server | -                  |
| --port  | -p    | Port to listen on        | 3000 (or PORT env) |

Capture Mode

web-capture <url> [options]

| Option                    | Short | Description                                  | Default                           |
| ------------------------- | ----- | -------------------------------------------- | --------------------------------- |
| --format                  | -f    | Output format (see below)                    | markdown                          |
| --output                  | -o    | Output file path. Use -o - for stdout        | auto-derived from URL             |
| --data-dir                |       | Base directory for auto-derived output paths | ./data/web-capture                |
| --engine                  | -e    | Browser engine: puppeteer, playwright        | puppeteer (or BROWSER_ENGINE env) |
| --theme                   | -t    | Color scheme: light, dark, no-preference     | browser default                   |
| --width                   |       | Viewport width in pixels                     | 1280                              |
| --height                  |       | Viewport height in pixels                    | 800                               |
| --quality                 |       | JPEG quality 0-100                           | 80                                |
| --fullPage                |       | Capture full scrollable page                 | false                             |
| --embed-images            |       | Keep images as inline base64 data URIs       | false                             |
| --no-extract-images       |       | Alias for --embed-images                     | false                             |
| --keep-original-links     |       | Keep original remote URLs, strip base64      | false                             |
| --images-dir              |       | Subdirectory name for extracted images       | images                            |
| --archive                 |       | Create archive: zip, 7z, tar.gz, tar         | -                                 |
| --document-format         |       | Document format in archive: markdown, html   | markdown                          |
| --localImages             |       | Download images locally in archive mode      | true                              |
| --extract-latex           |       | Extract LaTeX formulas                       | true                              |
| --no-extract-latex        |       | Disable LaTeX extraction                     | -                                 |
| --extract-metadata        |       | Extract article metadata                     | true                              |
| --no-extract-metadata     |       | Disable metadata extraction                  | -                                 |
| --post-process            |       | Apply post-processing                        | true                              |
| --no-post-process         |       | Disable post-processing                      | -                                 |
| --detect-code-language    |       | Detect code block languages                  | true                              |
| --no-detect-code-language |       | Disable code language detection              | -                                 |

Supported formats:

| Format        | Description                              |
| ------------- | ---------------------------------------- |
| markdown / md | Markdown conversion (default)            |
| html          | Rendered HTML                            |
| image / png   | PNG screenshot (lossless)                |
| jpeg          | JPEG screenshot (configurable quality)   |
| pdf           | PDF with embedded images                 |
| docx          | Word document with embedded images       |
| archive / zip | ZIP archive with markdown + local images |

Configuration

Configuration values are resolved with the following priority (highest to lowest):

  1. CLI arguments: --port 8080
  2. Environment variables: PORT=8080
  3. Default .lenv file: .lenv in the project root
  4. Built-in defaults
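The priority order can be expressed as a simple lookup. This is an illustrative sketch (the helper name is hypothetical, not the package's actual implementation):

```javascript
// Hypothetical sketch: resolve one setting using the documented priority
// order -- CLI arguments, then environment variables, then .lenv values,
// then built-in defaults. The first source that defines the key wins.
function resolveConfig(name, { cli = {}, env = {}, lenv = {}, defaults = {} } = {}) {
  for (const source of [cli, env, lenv, defaults]) {
    if (source[name] !== undefined) return source[name];
  }
  return undefined;
}
```

For example, with `--port 8080` on the CLI and `PORT=3000` in the environment, the CLI value wins.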

Environment Variables

| Variable                         | Description                         | Default            |
| -------------------------------- | ----------------------------------- | ------------------ |
| PORT                             | Server port                         | 3000               |
| BROWSER_ENGINE                   | Browser engine                      | puppeteer          |
| API_TOKEN                        | API token for authenticated capture | -                  |
| WEB_CAPTURE_DATA_DIR             | Base directory for output           | ./data/web-capture |
| WEB_CAPTURE_EMBED_IMAGES         | 0/1 — keep images inline            | 0                  |
| WEB_CAPTURE_KEEP_ORIGINAL_LINKS  | 0/1 — keep original remote URLs     | 0                  |
| WEB_CAPTURE_IMAGES_DIR           | Subdirectory for extracted images   | images             |
| WEB_CAPTURE_EXTRACT_LATEX        | 0/1 — extract LaTeX                 | 1                  |
| WEB_CAPTURE_EXTRACT_METADATA     | 0/1 — extract metadata              | 1                  |
| WEB_CAPTURE_POST_PROCESS         | 0/1 — post-processing               | 1                  |
| WEB_CAPTURE_DETECT_CODE_LANGUAGE | 0/1 — detect code langs             | 1                  |
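Any of the variables above can also be placed in the project-root .lenv file. Assuming .lenv follows the common KEY=VALUE dotenv convention (the exact syntax is not specified here), a minimal file might look like:

```
# .lenv — example values (KEY=VALUE format assumed)
PORT=8080
BROWSER_ENGINE=playwright
WEB_CAPTURE_DATA_DIR=./captures
```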

Browser Engine Support

The service supports both Puppeteer and Playwright browser engines:

  • Puppeteer: Default engine, mature and well-tested
  • Playwright: Alternative engine with similar capabilities

Supported engine values:

  • puppeteer or pptr - Use Puppeteer
  • playwright or pw - Use Playwright
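Mapping the aliases onto their canonical engines can be sketched as follows (a hypothetical helper for illustration, not the package's actual code):

```javascript
// Hypothetical sketch: normalize the documented engine aliases.
function normalizeEngine(value) {
  const aliases = {
    puppeteer: 'puppeteer',
    pptr: 'puppeteer',
    playwright: 'playwright',
    pw: 'playwright',
  };
  const engine = aliases[String(value).toLowerCase()];
  if (!engine) throw new Error(`Unsupported engine: ${value}`);
  return engine;
}
```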

Popup/Modal Dismissal

By default, the image, PDF, and archive endpoints automatically dismiss common popups before capture:

  • Google Funding Choices consent dialogs
  • Cookie consent banners
  • Generic modals and overlays
  • Fixed-position overlays covering content

Set dismissPopups=false to disable this behavior.

Docker

# Build and run using Docker Compose
docker compose up -d

# Or manually
docker build -t web-capture-js .
docker run -p 3000:3000 web-capture-js

Development

Available Commands

  • npm run dev - Start the development server with hot reloading
  • npm run start - Start the service using Docker Compose
  • npm test - Run all unit tests
  • npm run lint - Check code with ESLint
  • npm run format - Format code with Prettier

Testing

npm test                    # Run unit and integration tests
npm run test:e2e            # Run end-to-end tests
npm run test:all            # Run all tests including build

Built With

  • Express.js for the web server
  • Puppeteer and Playwright for headless browser automation
  • Turndown for HTML to Markdown conversion
  • archiver for ZIP creation
  • docx for DOCX generation
  • Jest for testing

Library Usage

You can also use web-capture as a Node.js library:

import {
  fetchHtml,
  convertHtmlToMarkdown,
} from '@link-assistant/web-capture/src/lib.js';
import { createBrowser } from '@link-assistant/web-capture/src/browser.js';
import { dismissPopups } from '@link-assistant/web-capture/src/popups.js';

// Fetch and convert to markdown
const html = await fetchHtml('https://habr.com/en/articles/895896/');
const markdown = convertHtmlToMarkdown(
  html,
  'https://habr.com/en/articles/895896/'
);

// Take a themed screenshot
const browser = await createBrowser('playwright', { colorScheme: 'dark' });
const page = await browser.newPage();
await page.setViewport({ width: 1920, height: 1080 });
await page.goto('https://habr.com/en/articles/895896/', {
  waitUntil: 'networkidle0',
});
await dismissPopups(page);
const buffer = await page.screenshot({ type: 'png', fullPage: true });
await browser.close();

License

Unlicense — This is free and unencumbered software released into the public domain. You are free to copy, modify, publish, use, compile, sell, or distribute this software for any purpose, commercial or non-commercial, and by any means. See https://unlicense.org for details.