`@vericontext/pagewise` v0.3.9
# Pagewise

URL to QA in one CLI — scrape, chunk, embed, and query any webpage.
https://github.com/user-attachments/assets/015e190a-06a3-43d6-a7de-ec2fee5e5e66
## Install

### Local install (recommended)

```sh
npm init -y
npm install @vericontext/pagewise@latest
```

### One-shot (no install)

```sh
npx @vericontext/pagewise md https://example.com
```

### Global install

```sh
npm install -g @vericontext/pagewise
```

### From source (development)

```sh
git clone https://github.com/vericontext/pagewise.git
cd pagewise
npm install
npx playwright install chromium
```

### OpenAI API key

```sh
cp .env.example .env
# Edit .env with your key
```

## Usage
### Global Options

All commands support the `--output` flag:

```sh
pagewise <command> --output json   # Structured JSON output
pagewise <command> --output text   # Human-readable output (default in TTY)
```

Non-TTY pipes automatically default to JSON. Override with the `OUTPUT_FORMAT=json` environment variable.
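The precedence between the flag, the environment variable, and TTY detection can be sketched as follows. This is a hypothetical re-implementation for illustration, not pagewise's actual source:

```typescript
// Sketch of output-format resolution: explicit flag beats the
// OUTPUT_FORMAT env var, which beats TTY auto-detection.
type OutputFormat = "json" | "text";

function resolveOutputFormat(
  flag: OutputFormat | undefined,  // value of --output, if given
  envOverride: string | undefined, // OUTPUT_FORMAT environment variable
  isTTY: boolean                   // whether stdout is a terminal
): OutputFormat {
  if (flag) return flag;
  if (envOverride === "json" || envOverride === "text") return envOverride;
  return isTTY ? "text" : "json"; // non-TTY pipes default to JSON
}
```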
### `pagewise md <url>`

Output clean markdown to stdout. No API key needed (unless `--describe-images` is used). Pipe-friendly.

```sh
pagewise md https://example.com
pagewise md https://example.com | pbcopy
pagewise md https://example.com --output json
pagewise md https://example.com --describe-images   # Include image captions
```

Options:

- `--describe-images` — Generate image captions using GPT-4o vision and insert them into the markdown (requires `OPENAI_API_KEY`)
### `pagewise ingest <url>`

Scrape, chunk, embed, and store a page locally. Chunks include contextual metadata (domain, page title, section headers) for improved retrieval.

```sh
pagewise ingest https://example.com
pagewise ingest https://example.com --dry-run                # Preview without storing
pagewise ingest https://example.com --dry-run --output json
```

Options:

- `--dry-run` — Preview what would be ingested without storing anything
- `--describe-images` — Generate image captions using GPT-4o vision and include them in chunks
- `--max-images <n>` — Max images to describe per page (default: 10)
### `pagewise pages`

List stored pages and their chunk counts. No API key needed.

```sh
pagewise pages
pagewise pages --output json
pagewise pages --domain docs.example.com
pagewise pages --url https://example.com --output json
```

Options:

- `--domain <domain>` — Filter by domain
- `--url <url>` — Show details for a specific page (includes markdown content)
### `pagewise retrieve "<query>"`

Search ingested pages and return raw chunks without an LLM call. Ideal for AI agents that have their own LLM.

```sh
pagewise retrieve "what is the pricing?"
pagewise retrieve "main content" --top-k 3 --output json
pagewise retrieve "code examples" --chunk-type code --output json
pagewise retrieve "pricing" --no-rerank --merge --output json
```

Options:

- `--url <url>` — Limit search to a specific URL
- `--top-k <n>` — Number of chunks to retrieve (default: 5)
- `--domain <domain>` — Filter by domain (e.g. `docs.github.com`)
- `--chunk-type <type>` — Filter by chunk type (`prose`, `code`, `table`, `list`)
- `--no-rerank` — Disable Cohere reranking
- `--merge` — Merge consecutive chunks from the same page
### `pagewise ask "<question>"`

Hybrid RAG Q&A (BM25 + vector search + RRF) over ingested pages. Automatically uses context-stuffing for small single-page queries.

```sh
pagewise ask "What is this page about?"
pagewise ask "What are the pricing tiers?" --url https://example.com
pagewise ask "Compare features" --top-k 10
pagewise ask "Show setup code" --chunk-type code
pagewise ask "API reference" --domain docs.example.com
```

Options:

- `--url <url>` — Limit search to a specific URL
- `--top-k <n>` — Number of chunks to retrieve (default: 5)
- `--domain <domain>` — Filter by domain (e.g. `docs.github.com`)
- `--chunk-type <type>` — Filter by chunk type (`prose`, `code`, `table`, `list`)
### `pagewise crawl <url>`

Crawl a website (BFS) and ingest all discovered pages.

```sh
pagewise crawl https://docs.example.com
pagewise crawl https://docs.example.com --depth 1 --max-pages 5
pagewise crawl https://docs.example.com --include "/docs/**"
pagewise crawl https://docs.example.com --dry-run                    # Discover links only
pagewise crawl https://docs.example.com --include-static --dry-run   # Include images in links
```

Options:

- `--depth <n>` — Maximum crawl depth (default: 2)
- `--max-pages <n>` — Maximum pages to crawl (default: 50)
- `--include <patterns>` — Comma-separated glob patterns for paths to include
- `--exclude <patterns>` — Comma-separated glob patterns for paths to exclude
- `--delay <ms>` — Delay between requests in ms (default: 1000)
- `--concurrency <n>` — Number of pages to scrape in parallel (default: 3)
- `--follow-links <scope>` — Link scope: `same-domain` (default) or `all` (cross-domain)
- `--dry-run` — Discover links on the root page without crawling
- `--include-static` — Include static resources (images, CSS, JS, etc.) in discovered links (default: off)
- `--describe-images` — Generate image captions using GPT-4o vision and include them in chunks
- `--max-images <n>` — Max images to describe per page (default: 10)
- `--image-detail <level>` — Vision detail level: `low` (default) or `high`
### Image Metadata

Image references are automatically extracted from each crawled page and provided as structured data. This works regardless of the `--include-static` flag.

```sh
# View the image list per page in JSON output
pagewise crawl https://example.com --max-pages 1 --output json | jq '.pages[0].images'
# -> [{ "url": "https://example.com/diagram.png", "alt": "Architecture diagram" }, ...]
```

Each page object in the JSON output includes an `images` array:

```json
{
  "pages": [
    {
      "url": "https://example.com",
      "chunks": 12,
      "images": [
        { "url": "https://example.com/logo.svg", "alt": "Logo" },
        { "url": "https://example.com/chart.png", "alt": "Revenue chart" }
      ]
    }
  ]
}
```

`--include-static` controls whether image URLs are added to the crawl queue. Image metadata (the `images` field) is always collected.
### Image Description (`--describe-images`)

When `--describe-images` is enabled, GPT-4o vision analyzes each image and generates a caption, which is inserted as a blockquote into the markdown. This converts visual content (infographics, charts, diagrams) into searchable text for RAG retrieval and the `ask` command.

```sh
# View image descriptions in md output
pagewise md https://example.com --describe-images --output json | jq '.markdown'

# Crawl with image descriptions
pagewise crawl https://example.com --describe-images --max-images 3 --max-pages 1
```

Output format:

> **[Image description]** A bar chart showing quarterly revenue growth from Q1 to Q4.

Cost management:

- `--image-detail low` (default) — Minimal tokens per image
- `--max-images 10` (default) — Up to 10 images per page
- SVGs are automatically skipped (not supported by the vision API)
- Concurrency is limited to 3 to avoid rate limits
- Individual image failures are skipped; the remaining images continue processing
### `pagewise summary <url>`

Generate a one-page summary. Auto-ingests the page if it is not already stored.

```sh
pagewise summary https://example.com
```

### `pagewise compare <url1> <url2>`

Compare two web pages. Use `--aspect` to focus the comparison.

```sh
pagewise compare https://a.com https://b.com
pagewise compare https://a.com https://b.com --aspect pricing
```

### `pagewise schema [command]`

Output the CLI schema as JSON for agent/tool discovery.

```sh
pagewise schema       # All commands
pagewise schema ask   # Single command details
```

## AI Agent Integration
Pagewise is designed to work as a tool for AI agents (Claude Code, Cursor, Cline, etc.). See AGENTS.md for the full agent guide.
### Agent Workflow (Embedding API only)

Agents with their own LLM can use pagewise purely as an embedding + retrieval layer — no chat model cost:

```sh
pagewise ingest <url>                               # 1. Store content (embedding)
pagewise pages --output json                        # 2. Check stored pages
pagewise retrieve "query" --top-k 5 --output json   # 3. Search relevant chunks
# 4. Agent injects chunks into its own LLM context
```

Required: `OPENAI_API_KEY` (embedding only). No chat model call is needed.
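Step 4 can be sketched as a small prompt-assembly helper. The chunk shape used here (`{ url, text }`) is an assumption for illustration; inspect the actual `pagewise retrieve ... --output json` output for the real field names:

```typescript
// Fold retrieved chunks into a single prompt for the agent's own LLM.
// Chunk fields are hypothetical; adapt to pagewise's real JSON shape.
interface Chunk {
  url: string;
  text: string;
}

function buildContextPrompt(question: string, chunks: Chunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.url})\n${c.text}`)
    .join("\n\n");
  return `Answer using only the context below.\n\n${context}\n\nQuestion: ${question}`;
}
```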
### Key Features for Agents

- `pages` / `retrieve` — Composable, LLM-free commands for agent-controlled workflows
- `--output json` — Structured, parseable output on all commands
- `--dry-run` — Safe preview of mutating operations (`ingest`, `crawl`)
- `pagewise schema` — Runtime discovery of commands and options
- Meaningful exit codes — `0` success, `2` input error, `3` network error, `4` auth error
- Auto-detection — Non-TTY pipes default to JSON, spinners suppressed
## Architecture

### Hybrid Search

Queries run through two parallel search paths, and the results are merged via Reciprocal Rank Fusion (RRF):

- Vector search — semantic similarity via OpenAI embeddings + sqlite-vec
- BM25 search — keyword matching via SQLite FTS5
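An RRF merge of the two ranked lists can be sketched as below. The constant `k = 60` is the value commonly used in the RRF literature; pagewise's exact parameters are not documented here, so treat this as illustrative:

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank + 1)
// to a document's score; documents found by both paths rise to the top.
function rrfMerge(vectorRanked: string[], bm25Ranked: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, bm25Ranked]) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```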
### Smart Context Strategy

- Small pages (< 30K chars) with a `--url` filter: the full markdown is passed directly to the LLM (context-stuffing)
- Large corpora: hybrid RAG pipeline with token budget management (70% of the 400K context window)
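The decision rule above can be sketched as follows. The 30K-character threshold and the 70%-of-400K token budget come from the text; the function itself is a hypothetical simplification:

```typescript
// Choose context-stuffing for small, single-URL queries; otherwise RAG.
const STUFF_LIMIT_CHARS = 30_000;
const CONTEXT_WINDOW_TOKENS = 400_000;
const TOKEN_BUDGET = Math.floor(CONTEXT_WINDOW_TOKENS * 0.7); // 280K tokens

function chooseStrategy(pageChars: number, urlFilter?: string): "stuff" | "rag" {
  if (urlFilter !== undefined && pageChars < STUFF_LIMIT_CHARS) {
    return "stuff"; // pass the whole page to the LLM
  }
  return "rag"; // hybrid retrieval under the token budget
}
```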
### Contextual Retrieval

Each chunk is prefixed with structural metadata (`From {domain}, page "{title}", section "{header}":`) before embedding, improving retrieval accuracy at zero LLM cost.
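A sketch of that prefixing step, using the template shown above (the metadata field names are illustrative):

```typescript
// Prepend structural context to a chunk before it is embedded,
// mirroring the template described in the text.
function contextualize(
  chunk: string,
  meta: { domain: string; title: string; header: string }
): string {
  return `From ${meta.domain}, page "${meta.title}", section "${meta.header}": ${chunk}`;
}
```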
### Re-ranking (optional)

Set `COHERE_API_KEY` in `.env` to enable Cohere Rerank v3.5 for additional retrieval quality. Without it, RRF scores are used directly.
## Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Input validation error (bad URL, missing argument) |
| 3 | Network/scraping error |
| 4 | API key missing or authentication error |
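An agent wrapping the CLI might branch on these codes. The retry policy below (retrying only network errors) is a suggestion for callers, not pagewise behavior:

```typescript
// Map pagewise exit codes (from the table above) to handling decisions.
function describeExit(exitCode: number): string {
  const meanings: Record<number, string> = {
    0: "success",
    1: "general error",
    2: "input validation error",
    3: "network/scraping error",
    4: "API key missing or authentication error",
  };
  return meanings[exitCode] ?? `unknown exit code ${exitCode}`;
}

// Network errors are often transient, so they are the only retry candidates.
function shouldRetry(exitCode: number): boolean {
  return exitCode === 3;
}
```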
## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes (for `ingest`/`retrieve`/`ask`/`summary`/`compare`) | OpenAI API key for embeddings and chat |
| `COHERE_API_KEY` | No | Cohere API key for reranking (improves `ask` and `retrieve` quality) |
| `OUTPUT_FORMAT` | No | Set to `json` or `text` to override auto-detection |
## Data

Pages and embeddings are stored in `.pagewise/pagewise.db` (SQLite) under the current project directory.

## Test

```sh
npm test
```