cyte

v1.0.4

Published

3 months ago

Recursive Web to Markdown CLI for AI Agents

0High
0Medium
0Low

cyte

cyte is a TypeScript CLI for extracting website content into Markdown, discovering links, and crawling internal pages recursively.

What It Does

Extract a page into clean Markdown.
Discover all links on a page with internal/external classification.
Recursively crawl internal pages and save content by domain + route.
Output JSON for automation/agent workflows.

Install

Global install (recommended for regular usage):

pnpm add -g cyte
cyte --help

No-install one-off usage:

npx cyte --help
npx cyte https://example.com

PNPM one-off alternative:

pnpm dlx cyte --help

For local development of this repo:

pnpm install
pnpm build

Quickstart

Single page extraction (stdout):

cyte vercel.com
npx cyte vercel.com

Links only:

cyte links vercel.com --json

Deep crawl + file output:

cyte vercel.com --deep --depth 2

Command Reference

`cyte <url>`

Extract a single page into Markdown.

Default behavior:

Prints Markdown to stdout.
Does not save files unless --deep is enabled.

Examples:

cyte https://example.com
cyte example.com
cyte example.com --json

Options:

--deep: enable recursive internal crawl and file output.
--depth <number>: max crawl depth, default 1.
--delay <number>: delay between requests in ms, default 150.
--concurrency <number>: max parallel crawl requests, default 3.
--output <path>: output directory for deep crawl, default ./cyte.
--clean: remove target domain output directory before deep crawl.
--sitemap: seed crawl from sitemap URLs (including robots sitemap entries).
--no-respect-robots: ignore robots.txt rules during deep crawl.
--json: return structured JSON instead of human output.
--format <type>: json or jsonl (used with --json), default json.
--download-media: reserved flag (not active yet).

`cyte links <url>`

Return links found in a page.

Examples:

cyte links https://example.com
cyte links example.com --json
cyte links example.com --internal
cyte links example.com --external --match docs

Options:

--internal: only internal links.
--external: only external links.
--match <pattern>: filter by title or URL substring.
--json: output JSON array.
--format <type>: json or jsonl (used with --json), default json.

URL Handling

Bare domains are accepted: vercel.com -> https://vercel.com/.
For failed https:// requests, cyte retries with http://.
URLs are normalized for crawl deduplication:
- hash fragments removed
- query string removed in crawl/link normalization
- trailing slashes normalized
Skips unsupported link protocols:
- mailto:
- javascript:
- tel:

Output Behavior

Single page mode (`cyte <url>`)

Returns extracted markdown to stdout.
Also prints success metadata (source URL, links discovered).
Does not write files.

Links mode (`cyte links <url>`)

Returns links table or JSON.
Does not write files.

Deep mode (`cyte <url> --deep`)

Crawls internal links only.
Writes markdown files grouped by domain and route:

cyte/
  example.com/
    index.md
    docs/
      index.md
      intro/
        index.md

Existing output files at the same path are overwritten.
Missing directories are created automatically.
Deep crawl ensures .gitignore contains cyte/.
If crawl failures occur, an error report is written to:
- cyte/<domain>/_errors.json

JSON Output Contracts

Extract JSON (`cyte <url> --json`)

{
  "url": "https://example.com/",
  "title": "Example Domain",
  "markdown": "# Example Domain\n...",
  "links": [
    {
      "title": "Learn more",
      "url": "https://iana.org/domains/example",
      "type": "external"
    }
  ]
}

Links JSON (`cyte links <url> --json`)

[
  {
    "title": "Docs",
    "url": "https://example.com/docs",
    "type": "internal"
  }
]

Crawl JSON (`cyte <url> --deep --json`)

{
  "startUrl": "https://example.com/",
  "pagesVisited": 10,
  "pagesSucceeded": 10,
  "pagesFailed": 0,
  "pages": []
}

JSONL output

Use --format jsonl with --json.

Examples:

cyte links docs.example.com --json --format jsonl
cyte docs.example.com --deep --json --format jsonl

Notes:

links emits one link object per line.
extract emits one object line.
deep emits:
- one summary line
- one page line per crawled page

AI Agent Usage

cyte is agent-friendly by default: deterministic CLI, URL normalization, and machine-readable --json output.

Core agent workflows

Discover routes, then fetch selected pages:

cyte links https://docs.example.com --json
cyte https://docs.example.com/authentication --json

Filter internal links by topic:

cyte links https://docs.example.com --internal --match auth --json

Build a knowledge snapshot for RAG:

cyte https://docs.example.com --deep --depth 2 --json

Recommended decision loop for agents

Run links --json on the seed page.
Keep only internal links and apply topic filters (--match).
Fetch top candidate pages with cyte <url> --json.
Escalate to deep crawl if coverage is insufficient.
Index markdown + metadata for retrieval.

Contracts agents can rely on

cyte links <url> --json returns an array of:
- { title, url, type }
cyte <url> --json returns:
- { url, title, markdown, links }
cyte <url> --deep --json returns summary:
- { startUrl, pagesVisited, pagesSucceeded, pagesFailed, pages }

Production notes for agent pipelines

Use --json in automation paths.
Start conservative on crawling:
- --depth 1 --concurrency 2 --delay 200
Treat page-level failures as partial success and continue.
Re-crawls overwrite existing files by output path.

Node.js tool wrapper example

import { execFile } from "node:child_process";
import { promisify } from "node:util";

const execFileAsync = promisify(execFile);

async function discoverLinks(url: string) {
  const { stdout } = await execFileAsync("cyte", ["links", url, "--json"]);
  return JSON.parse(stdout) as Array<{
    title: string;
    url: string;
    type: string;
  }>;
}

Extraction Details

Uses Readability + fallback extraction for landing pages where Readability is too thin.
Preserves headings, lists, tables, code blocks, blockquotes.
Converts relative media URLs to absolute URLs:
- /logo.png -> https://domain.com/logo.png

Development

Run in dev mode:

pnpm dev -- --help

Build:

pnpm build

Tests:

pnpm test
pnpm test:watch

Tech Stack

Node.js + TypeScript
Commander
Undici
JSDOM + Readability
Turndown + GFM plugin
Cheerio
p-limit
fs-extra

Troubleshooting

If output seems too thin on a landing page, rerun and compare --json output to inspect extracted content and links.
If a site blocks requests, try a lower concurrency and add delay:
- --concurrency 1 --delay 400
If deep crawl seems incomplete, increase --depth.
If crawl coverage is still low, try --sitemap.
If pages are skipped unexpectedly, verify robots rules or use --no-respect-robots.

Releases

Changelog: see CHANGELOG.md.
Versioning: semantic versioning (major.minor.patch).
Publish flow:
1. update CHANGELOG.md
2. bump version (pnpm version patch|minor|major)
3. publish (pnpm publish --access public)

License

MIT. See LICENSE.

Published

Vulnerabilities

Links

Maintainers

Keywords

Readme

cyte

What It Does

Install

Quickstart

Command Reference

cyte <url>

cyte links <url>

URL Handling

Output Behavior

Single page mode (cyte <url>)

Links mode (cyte links <url>)

Deep mode (cyte <url> --deep)

JSON Output Contracts

Extract JSON (cyte <url> --json)

Links JSON (cyte links <url> --json)

Crawl JSON (cyte <url> --deep --json)

JSONL output

AI Agent Usage

Core agent workflows

Recommended decision loop for agents

Contracts agents can rely on

Production notes for agent pipelines

Node.js tool wrapper example

Extraction Details

Development

Tech Stack

Troubleshooting

Releases

License

`cyte <url>`

`cyte links <url>`

Single page mode (`cyte <url>`)

Links mode (`cyte links <url>`)

Deep mode (`cyte <url> --deep`)

Extract JSON (`cyte <url> --json`)

Links JSON (`cyte links <url> --json`)

Crawl JSON (`cyte <url> --deep --json`)