# docs-scraper

v1.0.3
A CLI tool and SDK that scrapes documents from various sources (Notion, DocSend, PDFs, etc.) and saves them as local PDF files. Uses Playwright for browser automation.
## Installation

```bash
bun install
```

## Usage

```bash
# Scrape a document (auto-starts the daemon if it is not running)
docs-scraper scrape https://docsend.com/view/xxx

# Use a named profile for session persistence
docs-scraper scrape https://notion.so/xxx -p myprofile

# Pre-fill data for the scraper (e.g., email for DocSend)
docs-scraper scrape https://docsend.com/view/xxx -D [email protected]

# Direct mode without the daemon (single-shot)
docs-scraper scrape https://example.com/doc.pdf --no-daemon
```

## Handling Authentication
When a scrape is blocked (requires login, email, or passcode):

```bash
# Initial scrape returns a job ID
docs-scraper scrape https://docsend.com/view/xxx
# Output: Scrape blocked
#         Job ID: abc123

# Retry with the required data
docs-scraper update abc123 -D [email protected] -D password=1234
```

## Daemon Management
The daemon runs in the background to keep browser instances warm:

```bash
docs-scraper daemon start   # Start daemon
docs-scraper daemon stop    # Stop daemon
docs-scraper daemon status  # Show PID, uptime, pending jobs
```

## Profiles & Jobs

```bash
docs-scraper profiles list   # List saved profiles
docs-scraper profiles clear  # Clear all profiles
docs-scraper jobs list       # List pending blocked jobs
```

## Cleanup
PDFs are stored in `~/.docs-scraper/output/`. The daemon automatically deletes files older than one hour.

```bash
docs-scraper cleanup                  # Delete all PDFs
docs-scraper cleanup --older-than 1h  # Delete PDFs older than 1 hour
```

## Supported Sources
- [x] Direct PDF links
- [x] Notion pages
- [x] DocSend documents
- [ ] Pitch.com
- [ ] Google Slides
- [ ] Google Drive
- [x] LLM fallback (any webpage via Claude)
## How It Works
1. User runs `docs-scraper scrape <url>` with an optional profile name
2. The CLI connects to the daemon via a Unix socket (auto-starting it if needed)
3. The daemon launches a Playwright browser and applies saved session cookies
4. The scraper registry routes the URL to the appropriate handler (Notion, DocSend, PDF, or LLM fallback)
5. If blocked (auth required), the daemon returns a job ID so the scrape can be retried with credentials
6. On success, the PDF is saved to `~/.docs-scraper/output/`
7. Session cookies are saved to the profile for future scrapes
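The CLI-to-daemon exchange above can be sketched as follows. Note that the wire format (newline-delimited JSON over the Unix socket) and the field names here are assumptions for illustration only; the project's actual protocol lives in `src/daemon/client.ts` and `src/daemon/server.ts`:

```typescript
// Hypothetical wire format: one JSON object per line (assumption —
// the real protocol is not documented in this README).
interface ScrapeRequest {
  cmd: "scrape";
  url: string;
  profile?: string;                // -p flag
  data?: Record<string, string>;   // -D key=value pairs
}

interface ScrapeResponse {
  ok: boolean;
  pdfPath?: string;  // set on success
  jobId?: string;    // set when the scrape is blocked
  reason?: string;   // e.g. "email required"
}

// Frame a request as a single newline-terminated JSON line.
function encodeRequest(req: ScrapeRequest): string {
  return JSON.stringify(req) + "\n";
}

// Parse a response line back into an object.
function decodeResponse(line: string): ScrapeResponse {
  return JSON.parse(line) as ScrapeResponse;
}
```

Over a real socket, the client would write `encodeRequest(...)` to a `net.createConnection({ path: ... })` connection and decode each line read back; the socket path is managed by the daemon.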
## Tech Stack
- TypeScript + Commander.js (CLI)
- Playwright (browser automation)
- Anthropic Claude API (LLM fallback scraper)
- pdf-lib (PDF manipulation)
- Pino (logging)
- tsup (bundling)
## Configuration

Copy `.env.example` to `.env` and configure:

```bash
# Anthropic (optional, for the LLM fallback scraper)
ANTHROPIC_API_KEY=

# Browser
BROWSER_HEADLESS=true  # false for debugging
BROWSER_POOL_SIZE=3
```

## Development
### Quick Start

```bash
bun install    # Install dependencies
bun run build  # Build to dist/
bun link       # Link globally for testing
```

### Scripts

```bash
bun run dev        # Run CLI in watch mode
bun run cli        # Run CLI directly from source
bun run daemon     # Run daemon directly from source
bun run typecheck  # Type-check only
bun run test       # Run tests
bun run build      # Build for distribution
```

### Local Testing
To test the CLI locally as if it were installed globally:
```bash
# 1. Build the project
bun run build

# 2. Link globally
bun link

# 3. Now you can use the CLI directly
docs-scraper daemon start
docs-scraper scrape https://example.com/doc.pdf
docs-scraper daemon stop

# 4. Unlink when done (optional)
bun unlink
```

## Project Structure
```
src/
├── cli.ts             # CLI entry point
├── sdk.ts             # Programmatic SDK
├── daemon/
│   ├── server.ts      # Background daemon server
│   └── client.ts      # CLI-to-daemon communication
├── scrapers/
│   ├── index.ts       # Scraper registry
│   ├── base/          # Base scraper class
│   ├── pdf/           # Direct PDF scraper
│   ├── notion/        # Notion scraper
│   ├── docsend/       # DocSend scraper
│   └── llm/           # LLM fallback scraper
├── profiles/          # Session profile management
├── jobs/              # Blocked job tracking
├── services/
│   └── browser/       # Playwright browser manager
├── config/            # Environment config
├── types/             # TypeScript types
└── utils/             # Logging, errors
```

## Adding a New Scraper
1. Create `src/scrapers/newsource/NewScraper.ts` extending `BaseScraper`
2. Implement static `getCapabilities()` with URL patterns
3. Implement static `canHandle(url): boolean`
4. Implement `scrape(url): Promise<ScraperResult>`
5. Register in `src/scrapers/index.ts`: `scraperRegistry.register(NewScraper, 30)`
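Put together, a new scraper might look like the sketch below. The shapes of `BaseScraper` and `ScraperResult` are stand-ins inferred from the method names above, not the project's real definitions, and the source domain is hypothetical:

```typescript
// Stand-in types (assumptions — the real ones live in src/scrapers/base/
// and src/types/).
interface ScraperResult {
  ok: boolean;
  pdfPath?: string;       // set on success
  blockedReason?: string; // set when auth is required
}

abstract class BaseScraper {
  abstract scrape(url: string): Promise<ScraperResult>;
}

class NewScraper extends BaseScraper {
  // URL patterns this scraper claims (hypothetical source domain).
  static getCapabilities() {
    return { patterns: [/^https:\/\/newsource\.example\/doc\//] };
  }

  static canHandle(url: string): boolean {
    return NewScraper.getCapabilities().patterns.some((p) => p.test(url));
  }

  async scrape(url: string): Promise<ScraperResult> {
    // A real implementation would drive the shared Playwright browser for
    // the given url and render the document to PDF; this sketch only
    // shapes the result.
    return { ok: true, pdfPath: "~/.docs-scraper/output/doc.pdf" };
  }
}
```

Registration then follows the README's own example, `scraperRegistry.register(NewScraper, 30)`, where the second argument appears to be a registry priority.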
