
site-to-md

v0.1.1


site-to-md 🤖

Make any website AI-agent-readable. Generates /llms.txt + clean markdown for every page.

robots.txt told search engines what to crawl. llms.txt tells AI agents what to read. This tool generates both automatically.

Quick Start

npx site-to-md https://mysite.com

That's it. Zero config. You'll get:

site-to-md-output/
├── llms.txt           # Index file per llmstxt.org spec
├── llms-ctx.txt       # All content inline (for single-prompt ingestion)
├── index.html.md      # Homepage as markdown
├── docs/
│   ├── getting-started.html.md
│   └── api-reference.html.md
└── blog/
    ├── hello-world.html.md
    └── release-notes.html.md

What It Does

  1. Crawls your site — follows links or uses sitemap.xml if available
  2. Extracts content — strips nav, footer, ads, scripts using Mozilla Readability (same as Firefox Reader View)
  3. Converts to markdown — clean, structured markdown via Turndown
  4. Generates /llms.txt — per the llmstxt.org spec
  5. Generates per-page .html.md files — per the spec convention
  6. Generates /llms-ctx.txt — all content inline for single-prompt ingestion
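To make step 1 concrete, here is a minimal sketch of pulling page URLs out of a sitemap.xml string. This is illustrative only, not site-to-md's actual crawler (which also follows links and handles malformed sitemaps); the `extractSitemapUrls` helper is hypothetical.

```javascript
// Hypothetical sketch of step 1: extract page URLs from a sitemap.xml
// string. A regex is enough for well-formed <loc> entries; the real
// crawler is presumably more robust than this.
function extractSitemapUrls(xml) {
  const urls = [];
  const locPattern = /<loc>\s*([^<]+?)\s*<\/loc>/g;
  let match;
  // Collect every <loc>…</loc> value in document order
  while ((match = locPattern.exec(xml)) !== null) {
    urls.push(match[1]);
  }
  return urls;
}

const sitemap = `<?xml version="1.0"?>
<urlset>
  <url><loc>https://mysite.com/</loc></url>
  <url><loc>https://mysite.com/docs/getting-started</loc></url>
</urlset>`;

console.log(extractSitemapUrls(sitemap));
// [ 'https://mysite.com/', 'https://mysite.com/docs/getting-started' ]
```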

Install

# Use directly with npx (no install needed)
npx site-to-md https://mysite.com

# Or install globally
npm install -g site-to-md

# Or as a project dependency
npm install site-to-md

CLI Usage

# Crawl a live website
site-to-md https://docs.mysite.com

# Process local build output
site-to-md ./dist

# Customize output
site-to-md https://mysite.com \
  --out ./public \
  --title "My Product" \
  --desc "Developer documentation for My Product"

# Filter pages
site-to-md https://mysite.com \
  --include "/docs/**" \
  --include "/blog/**" \
  --exclude "/admin/**"

# Skip context file
site-to-md https://mysite.com --no-ctx

Options

| Flag | Description | Default |
|------|-------------|---------|
| --out <dir> | Output directory | ./site-to-md-output |
| --title <name> | Site title for llms.txt | Auto-detected |
| --desc <text> | Site description | Auto-detected |
| --include <glob> | Include only matching paths (repeatable) | All |
| --exclude <glob> | Exclude matching paths (repeatable) | None |
| --no-ctx | Skip generating llms-ctx.txt | — |
| --no-sitemap | Don't use sitemap.xml for crawling | — |
| --max-depth <n> | Max crawl depth | 3 |
| --concurrency <n> | Parallel requests | 5 |
| --strip <selector> | CSS selectors to strip (repeatable) | — |
| --config <path> | Config file path | Auto-detect |
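A sketch of how --include/--exclude filtering behaves: a path is kept when it matches at least one include pattern (or no includes are given) and no exclude pattern. The glob-to-regex conversion and both helper names here are assumptions for illustration; site-to-md's real matcher may support a richer glob syntax.

```javascript
// Hypothetical sketch of include/exclude path filtering.
// '**' matches across path segments, '*' within a single segment.
function globToRegExp(glob) {
  const escaped = glob.replace(/[.+^${}()|[\]\\]/g, '\\$&');
  const pattern = escaped
    .replace(/\*\*/g, '\u0000')  // placeholder so '*' rules don't clobber '**'
    .replace(/\*/g, '[^/]*')
    .replace(/\u0000/g, '.*');
  return new RegExp(`^${pattern}$`);
}

function shouldKeep(path, include, exclude) {
  const included =
    include.length === 0 || include.some((g) => globToRegExp(g).test(path));
  const excluded = exclude.some((g) => globToRegExp(g).test(path));
  return included && !excluded;
}

console.log(shouldKeep('/docs/api', ['/docs/**'], ['/admin/**']));    // true
console.log(shouldKeep('/admin/users', ['/docs/**'], ['/admin/**'])); // false
```

Repeating --include thus widens the kept set, while each --exclude narrows it.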

Programmatic API

import { agentReady } from 'site-to-md';

const result = await agentReady({
  url: 'https://mysite.com',
  outDir: './public',
  title: 'My Product',
  description: 'Developer documentation',
  include: ['/docs/**'],
});

console.log(`Generated ${result.pages.length} pages`);
console.log(result.llmsTxt); // Contents of llms.txt

Config File

Create site-to-md.config.js in your project root:

export default {
  url: 'https://mysite.com',
  outDir: './public',
  title: 'My Product',
  description: 'A brief description for agents',

  include: ['/docs/**', '/blog/**'],
  exclude: ['/admin/**'],

  sections: {
    'Documentation': '/docs/**',
    'Blog': '/blog/**',
    'API Reference': '/api-docs/**',
  },

  maxDepth: 3,
  concurrency: 5,
  stripSelectors: ['.cookie-banner', '.ad-wrapper'],
};

Build Pipeline

{
  "scripts": {
    "build": "next build && site-to-md ./out --out ./out"
  }
}

Output Format

/llms.txt

Per the llmstxt.org spec:

# My Product

> Developer documentation for building with My Product

## Documentation

- [Getting Started](/docs/getting-started.html.md): Quick start guide
- [API Reference](/docs/api-reference.html.md): Complete API docs

## Blog

- [Hello World](/blog/hello-world.html.md): Our launch announcement

Per-page .html.md

Clean markdown extracted from each page — no nav, footer, ads, or scripts.

/llms-ctx.txt

All page content concatenated in a single file for one-shot ingestion by AI agents.

What is llms.txt?

llms.txt is a proposed standard (by Jeremy Howard) for making websites readable by AI agents. Think of it like robots.txt but for LLMs:

  • /llms.txt — A markdown index file listing your site's key pages with descriptions. AI agents read this first to understand what's on your site.
  • *.html.md — Clean markdown versions of each page (same URL + .md). No nav, no footer, no JavaScript — just the content.
  • /llms-ctx.txt — All content concatenated in one file for single-prompt ingestion.
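The index format above is plain markdown, so assembling one is mostly string building. The sketch below produces a minimal llms.txt matching the layout shown in the Output Format section; the `buildLlmsTxt` function and its input shape (title, description, sections) are illustrative, not site-to-md's API.

```javascript
// Hypothetical sketch: assemble a minimal llms.txt index from page
// metadata, following the llmstxt.org layout (H1 title, blockquote
// description, one H2 per section with a link list).
function buildLlmsTxt(site) {
  const lines = [`# ${site.title}`, '', `> ${site.description}`];
  for (const [section, pages] of Object.entries(site.sections)) {
    lines.push('', `## ${section}`, '');
    for (const page of pages) {
      lines.push(`- [${page.title}](${page.path}): ${page.summary}`);
    }
  }
  return lines.join('\n') + '\n';
}

const txt = buildLlmsTxt({
  title: 'My Product',
  description: 'Developer documentation for building with My Product',
  sections: {
    Documentation: [
      {
        title: 'Getting Started',
        path: '/docs/getting-started.html.md',
        summary: 'Quick start guide',
      },
    ],
  },
});
console.log(txt);
```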

Sites like Anthropic, Cloudflare, and Stripe already have /llms.txt files. site-to-md generates yours automatically.

License

MIT © Stratus Labs