`@vericontext/pagewise` v0.3.9
# Pagewise

URL to QA in one CLI — scrape, chunk, embed, and query any webpage.
https://github.com/user-attachments/assets/015e190a-06a3-43d6-a7de-ec2fee5e5e66
## Install

### Local install (recommended)

```sh
npm init -y
npm install @vericontext/pagewise@latest
```

### One-shot (no install)

```sh
npx @vericontext/pagewise md https://example.com
```

### Global install

```sh
npm install -g @vericontext/pagewise
```

### From source (development)

```sh
git clone https://github.com/vericontext/pagewise.git
cd pagewise
npm install
npx playwright install chromium
```

### OpenAI API key

```sh
cp .env.example .env
# Edit .env with your key
```

## Usage
### Global Options

All commands support the `--output` flag:

```sh
pagewise <command> --output json   # Structured JSON output
pagewise <command> --output text   # Human-readable output (default in TTY)
```

Non-TTY pipes automatically default to JSON. Override with the `OUTPUT_FORMAT=json` environment variable.
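The precedence between the flag, the environment variable, and TTY detection can be sketched as follows. This is a hypothetical re-implementation for illustration, not pagewise's actual source:

```typescript
// Sketch of output-format resolution: explicit flag beats the
// OUTPUT_FORMAT env var, which beats TTY auto-detection.
type OutputFormat = "json" | "text";

function resolveOutputFormat(
  flag: OutputFormat | undefined,  // value of --output, if given
  envOverride: string | undefined, // OUTPUT_FORMAT environment variable
  isTTY: boolean                   // whether stdout is a terminal
): OutputFormat {
  if (flag) return flag;
  if (envOverride === "json" || envOverride === "text") return envOverride;
  return isTTY ? "text" : "json"; // non-TTY pipes default to JSON
}
```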
### `pagewise md <url>`

Output clean markdown to stdout. No API key needed (unless `--describe-images` is used). Pipe-friendly.

```sh
pagewise md https://example.com
pagewise md https://example.com | pbcopy
pagewise md https://example.com --output json
pagewise md https://example.com --describe-images   # Include image captions
```

Options:

- `--describe-images` — Generate image captions using GPT-4o vision and insert them into the markdown (requires `OPENAI_API_KEY`)
### `pagewise ingest <url>`

Scrape, chunk, embed, and store a page locally. Chunks include contextual metadata (domain, page title, section headers) for improved retrieval.

```sh
pagewise ingest https://example.com
pagewise ingest https://example.com --dry-run                # Preview without storing
pagewise ingest https://example.com --dry-run --output json
```

Options:

- `--dry-run` — Preview what would be ingested without storing anything
- `--describe-images` — Generate image captions using GPT-4o vision and include them in chunks
- `--max-images <n>` — Max images to describe per page (default: 10)
### `pagewise pages`

List stored pages and their chunk counts. No API key needed.

```sh
pagewise pages
pagewise pages --output json
pagewise pages --domain docs.example.com
pagewise pages --url https://example.com --output json
```

Options:

- `--domain <domain>` — Filter by domain
- `--url <url>` — Show details for a specific page (includes markdown content)
### `pagewise retrieve "<query>"`

Search ingested pages and return raw chunks without an LLM call. Ideal for AI agents that have their own LLM.

```sh
pagewise retrieve "what is the pricing?"
pagewise retrieve "main content" --top-k 3 --output json
pagewise retrieve "code examples" --chunk-type code --output json
pagewise retrieve "pricing" --no-rerank --merge --output json
```

Options:

- `--url <url>` — Limit search to a specific URL
- `--top-k <n>` — Number of chunks to retrieve (default: 5)
- `--domain <domain>` — Filter by domain (e.g. `docs.github.com`)
- `--chunk-type <type>` — Filter by chunk type (`prose`, `code`, `table`, `list`)
- `--no-rerank` — Disable Cohere reranking
- `--merge` — Merge consecutive chunks from the same page
### `pagewise ask "<question>"`

Hybrid RAG Q&A (BM25 + vector search + RRF) over ingested pages. Automatically uses context-stuffing for small single-page queries.

```sh
pagewise ask "What is this page about?"
pagewise ask "What are the pricing tiers?" --url https://example.com
pagewise ask "Compare features" --top-k 10
pagewise ask "Show setup code" --chunk-type code
pagewise ask "API reference" --domain docs.example.com
```

Options:

- `--url <url>` — Limit search to a specific URL
- `--top-k <n>` — Number of chunks to retrieve (default: 5)
- `--domain <domain>` — Filter by domain (e.g. `docs.github.com`)
- `--chunk-type <type>` — Filter by chunk type (`prose`, `code`, `table`, `list`)
### `pagewise crawl <url>`

Crawl a website (BFS) and ingest all discovered pages.

```sh
pagewise crawl https://docs.example.com
pagewise crawl https://docs.example.com --depth 1 --max-pages 5
pagewise crawl https://docs.example.com --include "/docs/**"
pagewise crawl https://docs.example.com --dry-run                    # Discover links only
pagewise crawl https://docs.example.com --include-static --dry-run   # Include images in links
```

Options:

- `--depth <n>` — Maximum crawl depth (default: 2)
- `--max-pages <n>` — Maximum pages to crawl (default: 50)
- `--include <patterns>` — Comma-separated glob patterns for paths to include
- `--exclude <patterns>` — Comma-separated glob patterns for paths to exclude
- `--delay <ms>` — Delay between requests in ms (default: 1000)
- `--concurrency <n>` — Number of pages to scrape in parallel (default: 3)
- `--follow-links <scope>` — Link scope: `same-domain` (default) or `all` (cross-domain)
- `--dry-run` — Discover links on the root page without crawling
- `--include-static` — Include static resources (images, CSS, JS, etc.) in discovered links (default: off)
- `--describe-images` — Generate image captions using GPT-4o vision and include them in chunks
- `--max-images <n>` — Max images to describe per page (default: 10)
- `--image-detail <level>` — Vision detail level: `low` (default) or `high`
### Image Metadata

Image references are automatically extracted from each crawled page and provided as structured data. This works regardless of the `--include-static` flag.

```sh
# View the image list per page in JSON output
pagewise crawl https://example.com --max-pages 1 --output json | jq '.pages[0].images'
# -> [{ "url": "https://example.com/diagram.png", "alt": "Architecture diagram" }, ...]
```

Each page object in the JSON output includes an `images` array:

```json
{
  "pages": [
    {
      "url": "https://example.com",
      "chunks": 12,
      "images": [
        { "url": "https://example.com/logo.svg", "alt": "Logo" },
        { "url": "https://example.com/chart.png", "alt": "Revenue chart" }
      ]
    }
  ]
}
```

`--include-static` controls whether image URLs are added to the crawl queue. Image metadata (the `images` field) is always collected.
### Image Description (`--describe-images`)

When `--describe-images` is enabled, GPT-4o vision analyzes each image and generates a caption, which is inserted as a blockquote into the markdown. This converts visual content (infographics, charts, diagrams) into searchable text for RAG retrieval and the `ask` command.

```sh
# View image descriptions in md output
pagewise md https://example.com --describe-images --output json | jq '.markdown'

# Crawl with image descriptions
pagewise crawl https://example.com --describe-images --max-images 3 --max-pages 1
```

Output format:

> **[Image description]** A bar chart showing quarterly revenue growth from Q1 to Q4.

Cost management:

- `--image-detail low` (default) — Minimal tokens per image
- `--max-images 10` (default) — Up to 10 images per page
- SVGs are automatically skipped (not supported by the vision API)
- Concurrency is limited to 3 to avoid rate limits
- Individual image failures are skipped; the remaining images continue processing
### `pagewise summary <url>`

Generate a one-page summary. Auto-ingests the page if it is not already stored.

```sh
pagewise summary https://example.com
```

### `pagewise compare <url1> <url2>`

Compare two web pages. Use `--aspect` to focus the comparison.

```sh
pagewise compare https://a.com https://b.com
pagewise compare https://a.com https://b.com --aspect pricing
```

### `pagewise schema [command]`

Output the CLI schema as JSON for agent/tool discovery.

```sh
pagewise schema       # All commands
pagewise schema ask   # Single command details
```

## AI Agent Integration
Pagewise is designed to work as a tool for AI agents (Claude Code, Cursor, Cline, etc.). See AGENTS.md for the full agent guide.
### Agent Workflow (Embedding API only)

Agents with their own LLM can use pagewise purely as an embedding + retrieval layer — no chat model cost:

```sh
pagewise ingest <url>                               # 1. Store content (embedding)
pagewise pages --output json                        # 2. Check stored pages
pagewise retrieve "query" --top-k 5 --output json   # 3. Search relevant chunks
# 4. Agent injects chunks into its own LLM context
```

Required: `OPENAI_API_KEY` (embedding only). No chat model call is needed.
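Step 4 can be sketched as a small prompt-assembly helper. The chunk shape used here (`{ url, text }`) is an assumption for illustration; inspect the actual `pagewise retrieve ... --output json` output for the real field names:

```typescript
// Fold retrieved chunks into a single prompt for the agent's own LLM.
// Chunk fields are hypothetical; adapt to pagewise's real JSON shape.
interface Chunk {
  url: string;
  text: string;
}

function buildContextPrompt(question: string, chunks: Chunk[]): string {
  const context = chunks
    .map((c, i) => `[${i + 1}] (${c.url})\n${c.text}`)
    .join("\n\n");
  return `Answer using only the context below.\n\n${context}\n\nQuestion: ${question}`;
}
```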
### Key Features for Agents

- `pages` / `retrieve` — Composable, LLM-free commands for agent-controlled workflows
- `--output json` — Structured, parseable output on all commands
- `--dry-run` — Safe preview of mutating operations (`ingest`, `crawl`)
- `pagewise schema` — Runtime discovery of commands and options
- Meaningful exit codes — `0` success, `2` input error, `3` network error, `4` auth error
- Auto-detection — Non-TTY pipes default to JSON, spinners suppressed
## Architecture

### Hybrid Search

Queries run through two parallel search paths, and the results are merged via Reciprocal Rank Fusion (RRF):

- Vector search — semantic similarity via OpenAI embeddings + sqlite-vec
- BM25 search — keyword matching via SQLite FTS5
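An RRF merge of the two ranked lists can be sketched as below. The constant `k = 60` is the value commonly used in the RRF literature; pagewise's exact parameters are not documented here, so treat this as illustrative:

```typescript
// Reciprocal Rank Fusion: each list contributes 1 / (k + rank + 1)
// to a document's score; documents found by both paths rise to the top.
function rrfMerge(vectorRanked: string[], bm25Ranked: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, bm25Ranked]) {
    list.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```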
### Smart Context Strategy

- Small pages (< 30K chars) with a `--url` filter: the full markdown is passed directly to the LLM (context-stuffing)
- Large corpora: hybrid RAG pipeline with token budget management (70% of the 400K context window)
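The decision rule above can be sketched as follows. The 30K-character threshold and the 70%-of-400K token budget come from the text; the function itself is a hypothetical simplification:

```typescript
// Choose context-stuffing for small, single-URL queries; otherwise RAG.
const STUFF_LIMIT_CHARS = 30_000;
const CONTEXT_WINDOW_TOKENS = 400_000;
const TOKEN_BUDGET = Math.floor(CONTEXT_WINDOW_TOKENS * 0.7); // 280K tokens

function chooseStrategy(pageChars: number, urlFilter?: string): "stuff" | "rag" {
  if (urlFilter !== undefined && pageChars < STUFF_LIMIT_CHARS) {
    return "stuff"; // pass the whole page to the LLM
  }
  return "rag"; // hybrid retrieval under the token budget
}
```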
### Contextual Retrieval

Each chunk is prefixed with structural metadata (`From {domain}, page "{title}", section "{header}":`) before embedding, improving retrieval accuracy at zero LLM cost.
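A sketch of that prefixing step, using the template shown above (the metadata field names are illustrative):

```typescript
// Prepend structural context to a chunk before it is embedded,
// mirroring the template described in the text.
function contextualize(
  chunk: string,
  meta: { domain: string; title: string; header: string }
): string {
  return `From ${meta.domain}, page "${meta.title}", section "${meta.header}": ${chunk}`;
}
```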
### Re-ranking (optional)

Set `COHERE_API_KEY` in `.env` to enable Cohere Rerank v3.5 for additional retrieval quality. Without it, RRF scores are used directly.
## Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Input validation error (bad URL, missing argument) |
| 3 | Network/scraping error |
| 4 | API key missing or authentication error |
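An agent wrapping the CLI might branch on these codes. The retry policy below (retrying only network errors) is a suggestion for callers, not pagewise behavior:

```typescript
// Map pagewise exit codes (from the table above) to handling decisions.
function describeExit(exitCode: number): string {
  const meanings: Record<number, string> = {
    0: "success",
    1: "general error",
    2: "input validation error",
    3: "network/scraping error",
    4: "API key missing or authentication error",
  };
  return meanings[exitCode] ?? `unknown exit code ${exitCode}`;
}

// Network errors are often transient, so they are the only retry candidates.
function shouldRetry(exitCode: number): boolean {
  return exitCode === 3;
}
```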
## Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| `OPENAI_API_KEY` | Yes (for `ingest`/`retrieve`/`ask`/`summary`/`compare`) | OpenAI API key for embeddings and chat |
| `COHERE_API_KEY` | No | Cohere API key for reranking (improves `ask` and `retrieve` quality) |
| `OUTPUT_FORMAT` | No | Set to `json` or `text` to override auto-detection |
## Data

Pages and embeddings are stored in `.pagewise/pagewise.db` (SQLite) under the current project directory.

## Test

```sh
npm test
```