site2pdf-cli

v0.1.14

Published

22 days ago

Generate comprehensive PDFs of entire websites, ideal for RAG.

laiso

crawler PDF

site2pdf

Generate a single PDF containing all pages of a website. Ideal for AI-based Retrieval-Augmented Generation (RAG) and Question Answering (QA) tasks.

Features

- Portability: Combine multiple pages into a single shareable PDF
- AI Integration: Works with Google NotebookLM, ChatGPT GPTs, and other AI tools
- Visual Preservation: Maintains images and formatting for multimodal models
- Concurrent Processing: Processes multiple pages in parallel for faster generation

Quick Start

npx site2pdf-cli https://example.com

Output is saved to ./out/<slugified-url>.pdf, where the file name is derived from the starting URL.

Installation (from source)

To install the tool globally on your machine from source, run:

git clone https://github.com/laiso/site2pdf.git
cd site2pdf
npm install
npm run build
npm link

After installation, you can run the tool directly using the site2pdf command from anywhere:

site2pdf <main_url> [url_pattern]

Prerequisites

Node.js (v18 or later recommended)

Linux Dependencies

Puppeteer requires these system libraries:

sudo apt-get update
sudo apt-get install -y libnss3 libatk1.0-0 libatk-bridge2.0-0 libcups2 \
  libdrm2 libxkbcommon0 libxcomposite1 libxdamage1 libxfixes3 libxrandr2 \
  libgbm1 libasound2

Note: On newer Ubuntu versions (24.04+), use libasound2t64 instead of libasound2.

Usage

npx site2pdf-cli <main_url> [url_pattern]

| Argument | Description |
|----------|-------------|
| <main_url> | The starting URL to crawl and convert |
| [url_pattern] | Optional regex to filter which links to include (defaults to same domain) |

URL Pattern Formats

- Plain string: 'https://example.com/docs' - matches URLs containing this string
- Regex literal: '/https:\/\/example\.com\/docs/i' - full regex with flags
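
The two pattern formats above, together with the same-domain default, could be handled along the following lines. This is a minimal sketch and not the tool's actual implementation: `buildLinkFilter` and `LinkFilter` are hypothetical names, and treating "same domain" as "same hostname" is an assumption.

```typescript
// Hypothetical helper, not taken from the site2pdf source.
type LinkFilter = (url: string) => boolean;

function buildLinkFilter(mainUrl: string, pattern?: string): LinkFilter {
  if (pattern === undefined || pattern === "") {
    // Assumed default: keep links whose hostname equals that of the main URL.
    const host = new URL(mainUrl).hostname;
    return (url) => {
      try {
        return new URL(url).hostname === host;
      } catch {
        return false; // not an absolute URL
      }
    };
  }

  // Regex literal form: /body/flags
  const literal = /^\/(.+)\/([a-z]*)$/.exec(pattern);
  if (literal !== null) {
    // Drop the stateful g and y flags so repeated test() calls stay independent.
    const flags = literal[2].replace(/[gy]/g, "");
    const regex = new RegExp(literal[1], flags);
    return (url) => regex.test(url);
  }

  // Plain string form: substring match.
  return (url) => url.includes(pattern);
}

// Example: keep only links under /docs/handbook/2/.
const handbookOnly = buildLinkFilter(
  "https://www.typescriptlang.org/docs/handbook/",
  "https://www.typescriptlang.org/docs/handbook/2/",
);
```

One consequence of this design is that a plain string that both starts and ends with a slash would be read as a regex literal, so quoting the full URL is the safer choice for plain-string patterns.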

Examples

Basic usage (captures all same-domain links):

npx site2pdf-cli https://docs.example.com

Filter to a specific section:

npx site2pdf-cli "https://www.typescriptlang.org/docs/handbook/" "https://www.typescriptlang.org/docs/handbook/2/"

Environment Variables

| Variable | Description |
|----------|-------------|
| CHROME_PATH | Path to a custom Chrome/Chromium executable |
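
As an illustration of how such a variable can be mapped onto browser launch settings, here is a small sketch. It assumes the value is forwarded as Puppeteer's `executablePath` launch option; `launchOptionsFromEnv` and the local `LaunchOptions` interface are hypothetical and only mirror the two option names used here.

```typescript
// Local, simplified shape; mirrors the `headless` and `executablePath`
// launch options that Puppeteer accepts.
interface LaunchOptions {
  headless: boolean;
  executablePath?: string;
}

// Hypothetical helper: builds launch options from an environment map.
function launchOptionsFromEnv(
  env: Record<string, string | undefined>,
): LaunchOptions {
  const options: LaunchOptions = { headless: true };
  const chromePath = env.CHROME_PATH?.trim();
  if (chromePath) {
    // Only set the path when the variable is present and non-empty,
    // so the bundled browser stays the default.
    options.executablePath = chromePath;
  }
  return options;
}
```

In a Node.js program, the result of `launchOptionsFromEnv(process.env)` could then be passed to `puppeteer.launch(...)`.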

Troubleshooting

Windows: Sandbox Errors

Grant the ALL APPLICATION PACKAGES group (SID S-1-15-2-1) read and execute access to the Puppeteer Chrome cache directory:

icacls %USERPROFILE%/.cache/puppeteer/chrome /grant *S-1-15-2-1:(OI)(CI)(RX)

See Puppeteer Windows troubleshooting.

ARM64 Linux: Not Supported

Chrome for Testing does not publish ARM64 builds for Linux, so the x64 binary that Puppeteer downloads cannot run on ARM64 hosts. You'll see errors like:

"Failed to launch the browser process!"
"chrome-linux64/chrome: 1: Syntax error: "(" unexpected"

See Chrome for Testing ARM64 Support Issue.

How It Works

1. Launches headless Chrome via Puppeteer
2. Navigates to the main URL and extracts all matching links
3. Generates a PDF for each page concurrently
4. Merges all PDFs into a single document using pdf-lib
5. Saves to ./out/<slugified-url>.pdf
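
The flow above can be sketched with the browser and PDF steps abstracted behind an interface, which keeps the example free of dependencies. Everything named here (`PdfSteps`, `mapWithLimit`, `siteToPdf`) is hypothetical, and the default concurrency of 4, the de-duplication of links, and placing the main page first are assumptions rather than documented behavior. Launching the browser and writing the file to disk are left out.

```typescript
// Hypothetical interface: the real tool uses Puppeteer and pdf-lib here.
interface PdfSteps {
  extractLinks(mainUrl: string): Promise<string[]>; // step 2
  renderPdf(url: string): Promise<Uint8Array>; // step 3
  mergePdfs(parts: Uint8Array[]): Promise<Uint8Array>; // step 4
}

// Runs `task` over `items` with at most `limit` calls in flight,
// returning results in input order.
async function mapWithLimit<T, R>(
  items: readonly T[],
  limit: number,
  task: (item: T) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;
  const worker = async (): Promise<void> => {
    while (next < items.length) {
      const index = next++; // claim the next unprocessed item
      results[index] = await task(items[index]);
    }
  };
  const workerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: workerCount }, worker));
  return results;
}

async function siteToPdf(
  steps: PdfSteps,
  mainUrl: string,
  keep: (url: string) => boolean,
  concurrency = 4, // assumed default
): Promise<Uint8Array> {
  const links = await steps.extractLinks(mainUrl);
  // Assumption: main page first, then matching links without duplicates.
  const urls = Array.from(new Set([mainUrl, ...links.filter(keep)]));
  const parts = await mapWithLimit(urls, concurrency, (url) =>
    steps.renderPdf(url),
  );
  return steps.mergePdfs(parts);
}
```

Bounding the number of pages rendered at once is a common way to keep a headless browser's memory use predictable on large sites, while index-based result storage keeps the merged pages in crawl order.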

Development

git clone https://github.com/laiso/site2pdf.git
cd site2pdf
npm install

| Command | Description |
|---------|-------------|
| npm run dev -- <main_url> [url_pattern] | Run in development mode with watch |
| npm run build | Compile TypeScript |
| npm test | Run tests |
| npx biome lint | Check for lint issues |
| npx biome format | Format code |

Contributing

Issues and pull requests are welcome. Please follow the existing code style and include tests for new features.

License

MIT