cyte

v1.0.4

Recursive Web to Markdown CLI for AI Agents

cyte is a TypeScript CLI for extracting website content into Markdown, discovering links, and crawling internal pages recursively.

What It Does

  • Extract a page into clean Markdown.
  • Discover all links on a page with internal/external classification.
  • Recursively crawl internal pages and save content by domain + route.
  • Output JSON for automation/agent workflows.

Install

Global install (recommended for regular usage):

pnpm add -g cyte
cyte --help

No-install one-off usage:

npx cyte --help
npx cyte https://example.com

pnpm one-off alternative:

pnpm dlx cyte --help

For local development of this repo:

pnpm install
pnpm build

Quickstart

Single page extraction (stdout):

cyte vercel.com
npx cyte vercel.com

Links only:

cyte links vercel.com --json

Deep crawl + file output:

cyte vercel.com --deep --depth 2

Command Reference

cyte <url>

Extract a single page into Markdown.

Default behavior:

  • Prints Markdown to stdout.
  • Does not save files unless --deep is enabled.

Examples:

cyte https://example.com
cyte example.com
cyte example.com --json

Options:

  • --deep: enable recursive internal crawl and file output.
  • --depth <number>: max crawl depth, default 1.
  • --delay <number>: delay between requests in ms, default 150.
  • --concurrency <number>: max parallel crawl requests, default 3.
  • --output <path>: output directory for deep crawl, default ./cyte.
  • --clean: remove target domain output directory before deep crawl.
  • --sitemap: seed crawl from sitemap URLs (including robots sitemap entries).
  • --no-respect-robots: ignore robots.txt rules during deep crawl.
  • --json: return structured JSON instead of human output.
  • --format <type>: json or jsonl (used with --json), default json.
  • --download-media: reserved flag (not active yet).

cyte links <url>

Return links found in a page.

Examples:

cyte links https://example.com
cyte links example.com --json
cyte links example.com --internal
cyte links example.com --external --match docs

Options:

  • --internal: only internal links.
  • --external: only external links.
  • --match <pattern>: filter by title or URL substring.
  • --json: output JSON array.
  • --format <type>: json or jsonl (used with --json), default json.

URL Handling

  • Bare domains are accepted: vercel.com -> https://vercel.com/.
  • For failed https:// requests, cyte retries with http://.
  • URLs are normalized for crawl deduplication:
    • hash fragments removed
    • query string removed in crawl/link normalization
    • trailing slashes normalized
  • Skips unsupported link protocols:
    • mailto:
    • javascript:
    • tel:
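
The normalization rules above can be sketched with the standard `URL` class. This is a minimal illustration, not cyte's actual implementation: `normalizeForCrawl` is a hypothetical name, and the choice to drop trailing slashes everywhere except the root path is an assumption, since the README only says they are "normalized".

```typescript
// Hypothetical helper illustrating the rules listed above.
// It is NOT taken from cyte's source code.
const SKIPPED_PROTOCOLS = ["mailto:", "javascript:", "tel:"];

function normalizeForCrawl(input: string): string | null {
  const raw = input.trim();

  // Unsupported link protocols are skipped entirely.
  const lower = raw.toLowerCase();
  if (SKIPPED_PROTOCOLS.some((protocol) => lower.startsWith(protocol))) {
    return null;
  }

  // Bare domains such as "vercel.com" get an https:// prefix.
  const withScheme = /^https?:\/\//i.test(raw) ? raw : `https://${raw}`;
  const url = new URL(withScheme);

  // Hash fragments and query strings do not identify distinct pages here.
  url.hash = "";
  url.search = "";

  // Assumption: trailing slashes are removed except on the root path.
  if (url.pathname.length > 1 && url.pathname.endsWith("/")) {
    url.pathname = url.pathname.replace(/\/+$/, "");
  }

  return url.toString();
}

console.log(normalizeForCrawl("vercel.com")); // https://vercel.com/
```

Two URLs that normalize to the same string can then be treated as one page when deduplicating a crawl queue.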

Output Behavior

Single page mode (cyte <url>)

  • Returns extracted markdown to stdout.
  • Also prints success metadata (source URL, links discovered).
  • Does not write files.

Links mode (cyte links <url>)

  • Returns links table or JSON.
  • Does not write files.

Deep mode (cyte <url> --deep)

  • Crawls internal links only.
  • Writes markdown files grouped by domain and route:
    cyte/
      example.com/
        index.md
        docs/
          index.md
          intro/
            index.md
  • Existing output files at the same path are overwritten.
  • Missing directories are created automatically.
  • Deep crawl ensures .gitignore contains cyte/.
  • If crawl failures occur, an error report is written to:
    • cyte/<domain>/_errors.json
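
To locate the file for a given page after a deep crawl, the layout above can be reproduced in a few lines. This is a sketch under assumptions: `outputPathFor` is a hypothetical helper, and using the URL's hostname as the domain directory is a guess based on the example tree rather than something the README states.

```typescript
import { posix } from "node:path";

// Hypothetical helper: maps a page URL to the directory layout shown above.
// The layout comes from the README; the function is only an illustration.
function outputPathFor(pageUrl: string, outputDir = "cyte"): string {
  const { hostname, pathname } = new URL(pageUrl);

  // Split the route into segments and drop empty ones ("/docs/" -> ["docs"]).
  const segments = pathname.split("/").filter((segment) => segment.length > 0);

  // Every route becomes a directory that holds a single index.md file.
  return posix.join(outputDir, hostname, ...segments, "index.md");
}

console.log(outputPathFor("https://example.com/docs/intro"));
// cyte/example.com/docs/intro/index.md
```

`posix.join` is used so the result has forward slashes on every platform, which keeps the example output stable.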

JSON Output Contracts

Extract JSON (cyte <url> --json)

{
  "url": "https://example.com/",
  "title": "Example Domain",
  "markdown": "# Example Domain\n...",
  "links": [
    {
      "title": "Learn more",
      "url": "https://iana.org/domains/example",
      "type": "external"
    }
  ]
}

Links JSON (cyte links <url> --json)

[
  {
    "title": "Docs",
    "url": "https://example.com/docs",
    "type": "internal"
  }
]

Crawl JSON (cyte <url> --deep --json)

{
  "startUrl": "https://example.com/",
  "pagesVisited": 10,
  "pagesSucceeded": 10,
  "pagesFailed": 0,
  "pages": []
}

JSONL output

Use --format jsonl with --json.

Examples:

cyte links docs.example.com --json --format jsonl
cyte docs.example.com --deep --json --format jsonl

Notes:

  • links emits one link object per line.
  • extract emits one object line.
  • deep emits:
    • one summary line
    • one page line per crawled page
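
Consuming JSONL output means parsing each non-empty line as its own JSON value. The sketch below shows one way to do that; `parseJsonl` and the `LinkLine` type are illustrative names, and the sample lines are made up to match the link shape documented above.

```typescript
// Hypothetical helper: parses JSONL text, one JSON value per non-empty line.
function parseJsonl<T>(text: string): T[] {
  return text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0)
    .map((line) => JSON.parse(line) as T);
}

// Shape of one link line, following the Links JSON contract.
type LinkLine = { title: string; url: string; type: string };

// Sample input standing in for `cyte links <url> --json --format jsonl`.
const sampleOutput = [
  '{"title":"Docs","url":"https://example.com/docs","type":"internal"}',
  '{"title":"Learn more","url":"https://iana.org/domains/example","type":"external"}',
].join("\n");

const parsedLinks = parseJsonl<LinkLine>(sampleOutput);
console.log(parsedLinks.length); // 2
```

For deep crawls the same parser applies, but the README does not state where the summary line appears, so identify it by its fields (for example `startUrl`) rather than by position.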

AI Agent Usage

cyte is agent-friendly by default: deterministic CLI, URL normalization, and machine-readable --json output.

Core agent workflows

  1. Discover routes, then fetch selected pages:
     cyte links https://docs.example.com --json
     cyte https://docs.example.com/authentication --json
  2. Filter internal links by topic:
     cyte links https://docs.example.com --internal --match auth --json
  3. Build a knowledge snapshot for RAG:
     cyte https://docs.example.com --deep --depth 2 --json

Recommended decision loop for agents

  1. Run links --json on the seed page.
  2. Keep only internal links and apply topic filters (--match).
  3. Fetch top candidate pages with cyte <url> --json.
  4. Escalate to deep crawl if coverage is insufficient.
  5. Index markdown + metadata for retrieval.
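
The first three steps of this loop could look roughly like the sketch below. Everything here is illustrative: `pickCandidates`, `gatherPages`, and the injected `RunCyte` runner are hypothetical names. The client-side filter only approximates `--internal` and `--match`, its case-insensitive matching is an assumption, and taking the first matches in page order stands in for whatever ranking "top candidate" means in your pipeline.

```typescript
type Link = { title: string; url: string; type: string };

// Step 2: keep internal links whose title or URL mentions the topic.
// Assumption: matching is case-insensitive; duplicates are dropped by URL.
function pickCandidates(links: Link[], topic: string, limit: number): Link[] {
  const needle = topic.toLowerCase();
  const seen = new Set<string>();
  const picked: Link[] = [];
  for (const link of links) {
    if (picked.length >= limit) break;
    if (link.type !== "internal" || seen.has(link.url)) continue;
    const haystack = `${link.title} ${link.url}`.toLowerCase();
    if (!haystack.includes(needle)) continue;
    seen.add(link.url);
    picked.push(link);
  }
  return picked;
}

// The runner is injected so the loop can be exercised without the CLI.
// In practice it would wrap execFile("cyte", args) and resolve with stdout.
type RunCyte = (args: string[]) => Promise<string>;

async function gatherPages(
  seed: string,
  topic: string,
  run: RunCyte,
  limit = 5,
): Promise<unknown[]> {
  // Step 1: discover links on the seed page.
  const links = JSON.parse(await run(["links", seed, "--json"])) as Link[];

  // Step 3: fetch each selected candidate page as JSON.
  const pages: unknown[] = [];
  for (const link of pickCandidates(links, topic, limit)) {
    pages.push(JSON.parse(await run([link.url, "--json"])));
  }

  // Step 4 is left to the caller: escalate to --deep if this is too little.
  return pages;
}
```

Fetching pages one at a time keeps request pressure low; a real pipeline might add bounded parallelism once the target site is known to tolerate it.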

Contracts agents can rely on

  • cyte links <url> --json returns an array of:
    • { title, url, type }
  • cyte <url> --json returns:
    • { url, title, markdown, links }
  • cyte <url> --deep --json returns a summary:
    • { startUrl, pagesVisited, pagesSucceeded, pagesFailed, pages }
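
These contracts translate directly into TypeScript types. The sketch below is an assumption-laden illustration: the interface names are invented, `type` is narrowed to "internal" | "external" because those are the only values shown in the examples, and `pages` is left as `unknown[]` because the README does not document the shape of its entries.

```typescript
// Illustrative types derived from the documented JSON shapes.
type LinkType = "internal" | "external";

interface LinkItem {
  title: string;
  url: string;
  type: LinkType;
}

interface ExtractResult {
  url: string;
  title: string;
  markdown: string;
  links: LinkItem[];
}

interface CrawlSummary {
  startUrl: string;
  pagesVisited: number;
  pagesSucceeded: number;
  pagesFailed: number;
  pages: unknown[]; // entry shape is not documented in the README
}

// Runtime guard so malformed output fails loudly instead of silently.
function isLinkItem(value: unknown): value is LinkItem {
  if (typeof value !== "object" || value === null) return false;
  const record = value as Record<string, unknown>;
  return (
    typeof record["title"] === "string" &&
    typeof record["url"] === "string" &&
    (record["type"] === "internal" || record["type"] === "external")
  );
}

// Parses stdout from `cyte links <url> --json` and validates every entry.
function parseLinks(stdout: string): LinkItem[] {
  const parsed: unknown = JSON.parse(stdout);
  if (!Array.isArray(parsed) || !parsed.every(isLinkItem)) {
    throw new Error("Unexpected output shape from `cyte links --json`");
  }
  return parsed;
}
```

Validating at the boundary means the rest of an agent pipeline can rely on the types without further checks.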

Production notes for agent pipelines

  • Use --json in automation paths.
  • Start conservative on crawling:
    • --depth 1 --concurrency 2 --delay 200
  • Treat page-level failures as partial success and continue.
  • Re-crawls overwrite existing files by output path.
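
One way to encode these notes in a pipeline is a small argument builder plus a classifier for the crawl summary. This is a sketch only: `conservativeCrawlArgs`, `classifyCrawl`, and the "ok" / "partial" / "failed" labels are hypothetical. The flags and the default values 1, 2, and 200 come from the options and notes above.

```typescript
interface CrawlOptions {
  depth?: number;
  concurrency?: number;
  delay?: number;
}

// Builds the argument list for a conservative deep crawl with JSON output.
function conservativeCrawlArgs(url: string, options: CrawlOptions = {}): string[] {
  const { depth = 1, concurrency = 2, delay = 200 } = options;
  return [
    url,
    "--deep",
    "--json",
    "--depth", String(depth),
    "--concurrency", String(concurrency),
    "--delay", String(delay),
  ];
}

// Subset of the documented crawl summary needed to judge the outcome.
interface CrawlCounts {
  pagesVisited: number;
  pagesSucceeded: number;
  pagesFailed: number;
}

type CrawlOutcome = "ok" | "partial" | "failed";

// Page-level failures count as partial success, not as a failed crawl.
function classifyCrawl(counts: CrawlCounts): CrawlOutcome {
  if (counts.pagesSucceeded === 0) return "failed";
  return counts.pagesFailed > 0 ? "partial" : "ok";
}

console.log(conservativeCrawlArgs("https://docs.example.com").join(" "));
// https://docs.example.com --deep --json --depth 1 --concurrency 2 --delay 200
```

A pipeline would continue indexing on "partial" and only alert or retry on "failed".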

Node.js tool wrapper example

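import { execFile } from "node:child_process";
import { promisify } from "node:util";

// Promise-based execFile; runs the binary directly without a shell.
const execFileAsync = promisify(execFile);

// Runs `cyte links <url> --json` and parses the JSON array from stdout.
// Requires the cyte binary to be available on PATH.
async function discoverLinks(url: string) {
  const { stdout } = await execFileAsync("cyte", ["links", url, "--json"]);
  return JSON.parse(stdout) as Array<{
    title: string;
    url: string;
    type: string;
  }>;
}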

Extraction Details

  • Uses Readability, with a fallback extractor for landing pages where the Readability result is too thin.
  • Preserves headings, lists, tables, code blocks, blockquotes.
  • Converts relative media URLs to absolute URLs:
    • /logo.png -> https://domain.com/logo.png
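
Relative-to-absolute conversion of this kind is what the standard `URL` constructor does when given a base. The sketch below shows the general technique and is not cyte's own code; `toAbsoluteUrl` is a hypothetical name.

```typescript
// Hypothetical helper: resolves a media URL against the page it appeared on.
function toAbsoluteUrl(src: string, pageUrl: string): string {
  // new URL(relative, base) applies standard URL resolution rules.
  return new URL(src, pageUrl).toString();
}

console.log(toAbsoluteUrl("/logo.png", "https://domain.com/blog/post"));
// https://domain.com/logo.png
```

URLs that are already absolute pass through unchanged, so the helper can be applied to every media reference without checking first.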

Development

Run in dev mode:

pnpm dev -- --help

Build:

pnpm build

Tests:

pnpm test
pnpm test:watch

Tech Stack

  • Node.js + TypeScript
  • Commander
  • Undici
  • JSDOM + Readability
  • Turndown + GFM plugin
  • Cheerio
  • p-limit
  • fs-extra

Troubleshooting

  • If output seems too thin on a landing page, rerun with --json to inspect the extracted content and links.
  • If a site blocks requests, try a lower concurrency and add delay:
    • --concurrency 1 --delay 400
  • If deep crawl seems incomplete, increase --depth.
  • If crawl coverage is still low, try --sitemap.
  • If pages are skipped unexpectedly, verify robots rules or use --no-respect-robots.

Releases

  • Changelog: see CHANGELOG.md.
  • Versioning: semantic versioning (major.minor.patch).
  • Publish flow:
    1. update CHANGELOG.md
    2. bump version (pnpm version patch|minor|major)
    3. publish (pnpm publish --access public)

License

MIT. See LICENSE.