
@vericontext/pagewise v0.3.9 (1,257 downloads)

Pagewise

URL to QA in one CLI — scrape, chunk, embed, and query any webpage.

https://github.com/user-attachments/assets/015e190a-06a3-43d6-a7de-ec2fee5e5e66

Install

Local install (recommended)

npm init -y
npm install @vericontext/pagewise@latest

One-shot (no install)

npx @vericontext/pagewise md https://example.com

Global install

npm install -g @vericontext/pagewise

From source (development)

git clone https://github.com/vericontext/pagewise.git
cd pagewise
npm install
npx playwright install chromium

OpenAI API key

cp .env.example .env
# Edit .env with your key

Usage

Global Options

All commands support the --output flag:

pagewise <command> --output json   # Structured JSON output
pagewise <command> --output text   # Human-readable output (default in TTY)

Non-TTY pipes automatically default to JSON. Override auto-detection with the OUTPUT_FORMAT environment variable (json or text).
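The resolution order can be sketched like this (an assumption about precedence, not the actual source: explicit flag, then env var, then TTY detection):

```typescript
type Format = "json" | "text";

// Resolve the output format: --output flag wins, then OUTPUT_FORMAT,
// then TTY auto-detection (non-TTY pipes default to JSON).
function resolveFormat(
  flag: Format | undefined,
  env: string | undefined,
  isTTY: boolean
): Format {
  if (flag) return flag;
  if (env === "json" || env === "text") return env;
  return isTTY ? "text" : "json";
}
```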

pagewise md <url>

Output clean markdown to stdout. No API key needed (unless --describe-images is used). Pipe-friendly.

pagewise md https://example.com
pagewise md https://example.com | pbcopy
pagewise md https://example.com --output json
pagewise md https://example.com --describe-images  # Include image captions

Options:

  • --describe-images — Generate image captions using GPT-4o vision and insert them into markdown (requires OPENAI_API_KEY)

pagewise ingest <url>

Scrape, chunk, embed, and store a page locally. Chunks include contextual metadata (domain, page title, section headers) for improved retrieval.

pagewise ingest https://example.com
pagewise ingest https://example.com --dry-run     # Preview without storing
pagewise ingest https://example.com --dry-run --output json

Options:

  • --dry-run — Preview what would be ingested without storing anything
  • --describe-images — Generate image captions using GPT-4o vision and include them in chunks
  • --max-images <n> — Max images to describe per page (default: 10)

pagewise pages

List stored pages and their chunk counts. No API key needed.

pagewise pages
pagewise pages --output json
pagewise pages --domain docs.example.com
pagewise pages --url https://example.com --output json

Options:

  • --domain <domain> — Filter by domain
  • --url <url> — Show details for a specific page (includes markdown content)

pagewise retrieve "<query>"

Search ingested pages and return raw chunks without an LLM call. Ideal for AI agents that bring their own LLM.

pagewise retrieve "what is the pricing?"
pagewise retrieve "main content" --top-k 3 --output json
pagewise retrieve "code examples" --chunk-type code --output json
pagewise retrieve "pricing" --no-rerank --merge --output json

Options:

  • --url <url> — Limit search to a specific URL
  • --top-k <n> — Number of chunks to retrieve (default: 5)
  • --domain <domain> — Filter by domain (e.g. docs.github.com)
  • --chunk-type <type> — Filter by chunk type (prose, code, table, list)
  • --no-rerank — Disable Cohere reranking
  • --merge — Merge consecutive chunks from the same page

pagewise ask "<question>"

Hybrid RAG Q&A (BM25 + vector search + RRF) over ingested pages. Automatically uses context-stuffing for small single-page queries.

pagewise ask "What is this page about?"
pagewise ask "What are the pricing tiers?" --url https://example.com
pagewise ask "Compare features" --top-k 10
pagewise ask "Show setup code" --chunk-type code
pagewise ask "API reference" --domain docs.example.com

Options:

  • --url <url> — Limit search to a specific URL
  • --top-k <n> — Number of chunks to retrieve (default: 5)
  • --domain <domain> — Filter by domain (e.g. docs.github.com)
  • --chunk-type <type> — Filter by chunk type (prose, code, table, list)

pagewise crawl <url>

Crawl a website (BFS) and ingest all discovered pages.

pagewise crawl https://docs.example.com
pagewise crawl https://docs.example.com --depth 1 --max-pages 5
pagewise crawl https://docs.example.com --include "/docs/**"
pagewise crawl https://docs.example.com --dry-run  # Discover links only
pagewise crawl https://docs.example.com --include-static --dry-run  # Include images in links

Options:

  • --depth <n> — Maximum crawl depth (default: 2)
  • --max-pages <n> — Maximum pages to crawl (default: 50)
  • --include <patterns> — Comma-separated glob patterns for paths to include
  • --exclude <patterns> — Comma-separated glob patterns for paths to exclude
  • --delay <ms> — Delay between requests in ms (default: 1000)
  • --concurrency <n> — Number of pages to scrape in parallel (default: 3)
  • --follow-links <scope> — Link scope: same-domain (default) or all (cross-domain)
  • --dry-run — Discover links on root page without crawling
  • --include-static — Include static resources (images, CSS, JS, etc.) in discovered links (default: off)
  • --describe-images — Generate image captions using GPT-4o vision and include them in chunks
  • --max-images <n> — Max images to describe per page (default: 10)
  • --image-detail <level> — Vision detail level: low (default) or high

Image Metadata

Image references (![alt](url)) are automatically extracted from each crawled page and provided as structured data. This works regardless of the --include-static flag.

# View image list per page in JSON output
pagewise crawl https://example.com --max-pages 1 --output json | jq '.pages[0].images'
# -> [{ "url": "https://example.com/diagram.png", "alt": "Architecture diagram" }, ...]

Each page object in the JSON output includes an images array:

{
  "pages": [
    {
      "url": "https://example.com",
      "chunks": 12,
      "images": [
        { "url": "https://example.com/logo.svg", "alt": "Logo" },
        { "url": "https://example.com/chart.png", "alt": "Revenue chart" }
      ]
    }
  ]
}

--include-static controls whether image URLs are added to the crawl queue. Image metadata (the images field) is always collected.

Image Description (--describe-images)

When --describe-images is enabled, GPT-4o vision analyzes each image and generates a caption, which is inserted as a blockquote in the markdown. This converts visual content (infographics, charts, diagrams) into searchable text for RAG retrieval and the ask command.

# View image descriptions in md output
pagewise md https://example.com --describe-images --output json | jq '.markdown'

# Crawl with image descriptions
pagewise crawl https://example.com --describe-images --max-images 3 --max-pages 1

Output format:

![chart](https://example.com/chart.png)
> **[Image description]** A bar chart showing quarterly revenue growth from Q1 to Q4.

Cost management:

  • --image-detail low (default) — Minimal tokens per image
  • --max-images 10 (default) — Up to 10 images per page
  • SVGs are automatically skipped (not supported by vision API)
  • Concurrency limited to 3 to avoid rate limits
  • Individual image failures are skipped; remaining images continue processing

pagewise summary <url>

Generate a one-page summary. Auto-ingests the URL if it isn't already stored.

pagewise summary https://example.com

pagewise compare <url1> <url2>

Compare two web pages. Use --aspect to focus the comparison.

pagewise compare https://a.com https://b.com
pagewise compare https://a.com https://b.com --aspect pricing

pagewise schema [command]

Output CLI schema as JSON for agent/tool discovery.

pagewise schema            # All commands
pagewise schema ask        # Single command details

AI Agent Integration

Pagewise is designed to work as a tool for AI agents (Claude Code, Cursor, Cline, etc.). See AGENTS.md for the full agent guide.

Agent Workflow (Embedding API only)

Agents with their own LLM can use pagewise purely as an embedding + retrieval layer — no chat model cost:

pagewise ingest <url>                              # 1. Store content (embedding)
pagewise pages --output json                       # 2. Check stored pages
pagewise retrieve "query" --top-k 5 --output json  # 3. Search relevant chunks
# 4. Agent injects chunks into its own LLM context

Required: OPENAI_API_KEY (embedding only). No chat model call needed.

Key Features for Agents

  • pages / retrieve — Composable, LLM-free commands for agent-controlled workflows
  • --output json — Structured, parseable output on all commands
  • --dry-run — Safe preview of mutating operations (ingest, crawl)
  • pagewise schema — Runtime discovery of commands and options
  • Meaningful exit codes — 0 success, 2 input error, 3 network error, 4 auth error
  • Auto-detection — Non-TTY pipes default to JSON, spinners suppressed

Architecture

Hybrid Search

Queries run through two parallel search paths and are merged via Reciprocal Rank Fusion (RRF):

  1. Vector search — semantic similarity via OpenAI embeddings + sqlite-vec
  2. BM25 search — keyword matching via SQLite FTS5
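The RRF merge step can be sketched as follows; the damping constant k = 60 is the conventional choice from the RRF literature (an assumption here, pagewise's actual constant isn't documented):

```typescript
// Reciprocal Rank Fusion: each document scores 1/(k + rank) per list
// it appears in; summed scores produce the merged ranking.
function rrfMerge(vectorRanked: string[], bm25Ranked: string[], k = 60): string[] {
  const scores = new Map<string, number>();
  for (const list of [vectorRanked, bm25Ranked]) {
    list.forEach((id, i) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + i + 1));
    });
  }
  return [...scores.entries()]
    .sort((a, b) => b[1] - a[1])
    .map(([id]) => id);
}
```

A document ranked highly by both search paths outscores one that appears in only a single list, which is why RRF needs no score normalization between BM25 and cosine similarity.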

Smart Context Strategy

  • Small pages (< 30K chars) with a --url filter: full markdown is passed directly to the LLM (context-stuffing)
  • Large corpora: hybrid RAG pipeline with token budget management (70% of 400K context window)
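The selection logic above amounts to a simple threshold check (a sketch using the numbers from this README; function and constant names are illustrative):

```typescript
const CONTEXT_WINDOW = 400_000;            // model context window, in tokens
const TOKEN_BUDGET = CONTEXT_WINDOW * 0.7; // 70% reserved for retrieved content
const STUFF_LIMIT = 30_000;                // chars: below this, stuff the whole page

// Context-stuff only when a single page is targeted and it fits comfortably;
// otherwise fall back to the hybrid RAG pipeline.
function chooseStrategy(
  urlFilter: string | undefined,
  pageChars: number
): "stuff" | "rag" {
  return urlFilter !== undefined && pageChars < STUFF_LIMIT ? "stuff" : "rag";
}
```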

Contextual Retrieval

Each chunk is prefixed with structural metadata (From {domain}, page "{title}", section "{header}":) before embedding, improving retrieval accuracy at zero LLM cost.
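Building that prefix from the format string above is straightforward (a minimal sketch; the function name is illustrative):

```typescript
// Prepend structural context to a chunk before embedding, following
// the format: From {domain}, page "{title}", section "{header}":
function contextPrefix(domain: string, title: string, header?: string): string {
  const base = `From ${domain}, page "${title}"`;
  return header ? `${base}, section "${header}": ` : `${base}: `;
}
```

Because the prefix is deterministic string assembly rather than an LLM call, it adds retrieval context at zero inference cost.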

Re-ranking (optional)

Set COHERE_API_KEY in .env to enable Cohere Rerank v3.5 for additional retrieval quality. Without it, RRF scores are used directly.

Exit Codes

| Code | Meaning |
|------|---------|
| 0 | Success |
| 1 | General error |
| 2 | Input validation error (bad URL, missing argument) |
| 3 | Network/scraping error |
| 4 | API key missing or authentication error |

Environment Variables

| Variable | Required | Description |
|----------|----------|-------------|
| OPENAI_API_KEY | Yes (for ingest/retrieve/ask/summary/compare) | OpenAI API key for embeddings and chat |
| COHERE_API_KEY | No | Cohere API key for reranking (improves ask and retrieve quality) |
| OUTPUT_FORMAT | No | Set to json or text to override auto-detection |

Data

Pages and embeddings are stored in .pagewise/pagewise.db (SQLite) under the current project directory.

Test

npm test