# docs-scraper

v1.0.3
A CLI tool and SDK that scrapes documents from various sources (Notion, DocSend, PDFs, etc.) and saves them as local PDF files. Uses Playwright for browser automation.
## Installation

```bash
bun install
```

## Usage

```bash
# Scrape a document (auto-starts the daemon if it is not running)
docs-scraper scrape https://docsend.com/view/xxx

# Use a named profile for session persistence
docs-scraper scrape https://notion.so/xxx -p myprofile

# Pre-fill data for the scraper (e.g., email for DocSend)
docs-scraper scrape https://docsend.com/view/xxx -D [email protected]

# Direct mode without the daemon (single-shot)
docs-scraper scrape https://example.com/doc.pdf --no-daemon
```

## Handling Authentication
When a scrape is blocked (requires login, email, or passcode):

```bash
# Initial scrape returns a job ID
docs-scraper scrape https://docsend.com/view/xxx
# Output: Scrape blocked
#         Job ID: abc123

# Retry with the required data
docs-scraper update abc123 -D [email protected] -D password=1234
```

## Daemon Management
The daemon runs in the background to keep browser instances warm:

```bash
docs-scraper daemon start   # Start daemon
docs-scraper daemon stop    # Stop daemon
docs-scraper daemon status  # Show PID, uptime, pending jobs
```

## Profiles & Jobs

```bash
docs-scraper profiles list   # List saved profiles
docs-scraper profiles clear  # Clear all profiles
docs-scraper jobs list       # List pending blocked jobs
```

## Cleanup
PDFs are stored in `~/.docs-scraper/output/`. The daemon automatically deletes files older than one hour.

```bash
docs-scraper cleanup                  # Delete all PDFs
docs-scraper cleanup --older-than 1h  # Delete PDFs older than 1 hour
```

## Supported Sources
- [x] Direct PDF links
- [x] Notion pages
- [x] DocSend documents
- [ ] Pitch.com
- [ ] Google Slides
- [ ] Google Drive
- [x] LLM fallback (any webpage via Claude)
## How It Works
1. User runs `docs-scraper scrape <url>` with an optional profile name
2. The CLI connects to the daemon via a Unix socket (auto-starting it if needed)
3. The daemon launches a Playwright browser and applies saved session cookies
4. The scraper registry routes the URL to the appropriate handler (Notion, DocSend, PDF, or LLM fallback)
5. If blocked (auth required), the daemon returns a job ID so the scrape can be retried with credentials
6. On success, the PDF is saved to `~/.docs-scraper/output/`
7. Session cookies are saved to the profile for future scrapes
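The CLI-to-daemon exchange above can be sketched as follows. Note that the wire format (newline-delimited JSON over the Unix socket) and the field names here are assumptions for illustration only; the project's actual protocol lives in `src/daemon/client.ts` and `src/daemon/server.ts`:

```typescript
// Hypothetical wire format: one JSON object per line (assumption —
// the real protocol is not documented in this README).
interface ScrapeRequest {
  cmd: "scrape";
  url: string;
  profile?: string;                // -p flag
  data?: Record<string, string>;   // -D key=value pairs
}

interface ScrapeResponse {
  ok: boolean;
  pdfPath?: string;  // set on success
  jobId?: string;    // set when the scrape is blocked
  reason?: string;   // e.g. "email required"
}

// Frame a request as a single newline-terminated JSON line.
function encodeRequest(req: ScrapeRequest): string {
  return JSON.stringify(req) + "\n";
}

// Parse a response line back into an object.
function decodeResponse(line: string): ScrapeResponse {
  return JSON.parse(line) as ScrapeResponse;
}
```

Over a real socket, the client would write `encodeRequest(...)` to a `net.createConnection({ path: ... })` connection and decode each line read back; the socket path is managed by the daemon.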
## Tech Stack
- TypeScript + Commander.js (CLI)
- Playwright (browser automation)
- Anthropic Claude API (LLM fallback scraper)
- pdf-lib (PDF manipulation)
- Pino (logging)
- tsup (bundling)
## Configuration

Copy `.env.example` to `.env` and configure:

```bash
# Anthropic (optional, for the LLM fallback scraper)
ANTHROPIC_API_KEY=

# Browser
BROWSER_HEADLESS=true  # false for debugging
BROWSER_POOL_SIZE=3
```

## Development
### Quick Start

```bash
bun install    # Install dependencies
bun run build  # Build to dist/
bun link       # Link globally for testing
```

### Scripts

```bash
bun run dev        # Run CLI in watch mode
bun run cli        # Run CLI directly from source
bun run daemon     # Run daemon directly from source
bun run typecheck  # Type-check only
bun run test       # Run tests
bun run build      # Build for distribution
```

### Local Testing
To test the CLI locally as if it were installed globally:
```bash
# 1. Build the project
bun run build

# 2. Link globally
bun link

# 3. Now you can use the CLI directly
docs-scraper daemon start
docs-scraper scrape https://example.com/doc.pdf
docs-scraper daemon stop

# 4. Unlink when done (optional)
bun unlink
```

## Project Structure
```
src/
├── cli.ts             # CLI entry point
├── sdk.ts             # Programmatic SDK
├── daemon/
│   ├── server.ts      # Background daemon server
│   └── client.ts      # CLI-to-daemon communication
├── scrapers/
│   ├── index.ts       # Scraper registry
│   ├── base/          # Base scraper class
│   ├── pdf/           # Direct PDF scraper
│   ├── notion/        # Notion scraper
│   ├── docsend/       # DocSend scraper
│   └── llm/           # LLM fallback scraper
├── profiles/          # Session profile management
├── jobs/              # Blocked job tracking
├── services/
│   └── browser/       # Playwright browser manager
├── config/            # Environment config
├── types/             # TypeScript types
└── utils/             # Logging, errors
```

## Adding a New Scraper
1. Create `src/scrapers/newsource/NewScraper.ts` extending `BaseScraper`
2. Implement static `getCapabilities()` with URL patterns
3. Implement static `canHandle(url): boolean`
4. Implement `scrape(url): Promise<ScraperResult>`
5. Register in `src/scrapers/index.ts`: `scraperRegistry.register(NewScraper, 30)`
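Put together, a new scraper might look like the sketch below. The shapes of `BaseScraper` and `ScraperResult` are stand-ins inferred from the method names above, not the project's real definitions, and the source domain is hypothetical:

```typescript
// Stand-in types (assumptions — the real ones live in src/scrapers/base/
// and src/types/).
interface ScraperResult {
  ok: boolean;
  pdfPath?: string;       // set on success
  blockedReason?: string; // set when auth is required
}

abstract class BaseScraper {
  abstract scrape(url: string): Promise<ScraperResult>;
}

class NewScraper extends BaseScraper {
  // URL patterns this scraper claims (hypothetical source domain).
  static getCapabilities() {
    return { patterns: [/^https:\/\/newsource\.example\/doc\//] };
  }

  static canHandle(url: string): boolean {
    return NewScraper.getCapabilities().patterns.some((p) => p.test(url));
  }

  async scrape(url: string): Promise<ScraperResult> {
    // A real implementation would drive the shared Playwright browser for
    // the given url and render the document to PDF; this sketch only
    // shapes the result.
    return { ok: true, pdfPath: "~/.docs-scraper/output/doc.pdf" };
  }
}
```

Registration then follows the README's own example, `scraperRegistry.register(NewScraper, 30)`, where the second argument appears to be a registry priority.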
