llmfood

v1.0.2

Published

a month ago

Generate LLM-friendly Markdown from Docusaurus HTML builds

smol-ninja

ai docusaurus llm llmstxt markdown

llmfood

Generate LLM-friendly Markdown from Docusaurus HTML builds, implementing the llms.txt convention.

Overview

llmfood converts a Docusaurus static HTML build into clean Markdown files optimized for LLM consumption. It:

- Discovers all pages in a Docusaurus build directory
- Resolves client-side content that doesn't exist in static HTML (GitHub code references, remote content, mermaid diagrams)
- Converts each HTML page to Markdown, stripping Docusaurus chrome (breadcrumbs, pagination, TOC, footers)
- Generates llms.txt, a structured index linking to all converted .md files
- Generates custom files: aggregated Markdown files matching URL patterns (e.g., llms-full.txt)

Installation

npm install llmfood
# or
bun add llmfood

Usage

Docusaurus Plugin (recommended)

Add llmfood as a Docusaurus plugin for zero-config integration. It runs automatically after docusaurus build:

// docusaurus.config.js
module.exports = {
  plugins: [
    [
      "llmfood/docusaurus",
      {
        sectionOrder: ["guides", "api", "concepts"],
        sectionLabels: { guides: "Guides", api: "API Reference" },
        customFiles: [
          {
            filename: "llms-full.txt",
            title: "Full Documentation",
            description: "Complete documentation in a single file",
            includePatterns: [/.*/],
          },
        ],
      },
    ],
  ],
};

The plugin automatically derives baseUrl, buildDir, siteTitle, and siteDescription from your Docusaurus config. It also sets docsDir to {siteDir}/docs by default, enabling source file scanning for mermaid diagrams and remote content resolution.

Standalone

import { generateLlmsMarkdown } from "llmfood";

await generateLlmsMarkdown({
  baseUrl: "https://docs.example.com",
  buildDir: "./build",
  siteTitle: "My Docs",
  siteDescription: "Documentation for my project",
  docsDir: "./docs", // optional: enables source file scanning
  sectionOrder: ["guides", "api", "concepts"],
  sectionLabels: { guides: "Guides", api: "API Reference" },
  ignorePatterns: [/\/blog\//],
  customFiles: [
    {
      filename: "llms-full.txt",
      title: "Full Documentation",
      description: "Complete documentation in a single file",
      includePatterns: [/.*/],
    },
  ],
});

Standalone HTML to Markdown

You can also use the converter directly:

import { htmlToMarkdown } from "llmfood";

const markdown = htmlToMarkdown(docusaurusHtmlString);

Content Resolution

Some Docusaurus plugins render content client-side, so the static HTML contains placeholders instead of real content. When docsDir is set, llmfood scans MDX source files and resolves these automatically:

| Pattern | Source detection | Resolution |
| --- | --- | --- |
| GitHub code references | CodeBlock JSX, fenced ```lang reference, and children/src/srcUrl/source attributes | Fetches code from raw.githubusercontent.com with line ranges |
| Remote content | url="..." or url={expr} in MDX | Fetches remote markdown (JSX expressions via resolveRemoteUrl) |
| Mermaid diagrams | ```mermaid blocks in MDX | Injects mermaid source into HTML (client-side renders leave none) |
| YouTube embeds | <iframe> with YouTube URL in HTML | Converts to [title](youtube-url) markdown link |
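
As an illustration of the remote content row, here is a minimal sketch of a resolveRemoteUrl hook. It assumes the hook receives the raw JSX expression text; the argument format, the `results` name, and the example.com URL layout are all hypothetical and not taken from llmfood itself.

```typescript
// Hypothetical resolver: assumes the hook receives the raw JSX expression text,
// e.g. getBenchmarkURL("results"). The argument format and the URL layout
// below are illustrative assumptions, not llmfood behavior.
function resolveRemoteUrl(expr: string): string {
  const match = /^getBenchmarkURL\(\s*["']([^"']+)["']\s*\)$/.exec(expr.trim());
  if (!match) {
    // Fail loudly so an unresolved expression is not silently fetched.
    throw new Error(`Unsupported remote URL expression: ${expr}`);
  }
  return `https://example.com/benchmarks/${encodeURIComponent(match[1])}.md`;
}
```

A function like this could be passed as the resolveRemoteUrl option. Whether throwing is the right failure mode depends on how the library handles hook errors, which the README does not specify.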

Source scanning also resolves imported MDX snippets (import Foo from "./_snippet.mdx"), substitutes ${props.x} expressions using caller prop values, and matches files by frontmatter id when the slug differs from the filename.

All external fetches run in parallel with a concurrency limit of 6.
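
The concurrency limit can be pictured with a small worker-pool sketch. This is not llmfood's implementation; `mapWithConcurrency` is a hypothetical helper that shows one common way to cap the number of in-flight async tasks while preserving result order.

```typescript
// Hypothetical helper, not part of llmfood's public API: runs `worker` over
// `items` with at most `limit` calls in flight, keeping results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each runner claims the next unprocessed index until none remain.
  async function runner(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index], index);
    }
  }

  const runnerCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: runnerCount }, () => runner()));
  return results;
}

// Usage sketch (not executed here): at most 6 requests in flight.
// const bodies = await mapWithConcurrency(urls, 6, (url) =>
//   fetch(url).then((response) => response.text()),
// );
```

A fixed pool of runners keeps the code short and avoids starting more work than the limit allows, at the cost of not supporting per-task priorities.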

API

`generateLlmsMarkdown(config)`

Processes an entire Docusaurus build and generates llms.txt plus any custom files.

`LlmfoodConfig`

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| baseUrl | string | Yes | Base URL for generated links (e.g., https://docs.example.com) |
| buildDir | string | Yes | Path to the Docusaurus build output directory |
| customFiles | CustomLlmFile[] | No | Custom aggregated output files to generate |
| docsDir | string | No | Path to docs source directory (enables mermaid + remote content resolution) |
| ignorePatterns | RegExp[] | No | URL patterns to exclude (root / is always excluded) |
| postProcessHtml | (html, context) => string | No | Hook to transform HTML before markdown conversion |
| postProcessMarkdown | (md, context) => string | No | Hook to transform markdown after conversion |
| resolveRemoteUrl | (expr) => string | No | Resolve JSX expressions (e.g., getBenchmarkURL(...)) to fetch URLs |
| rootContent | string | No | Additional content to include at the top of llms.txt |
| sectionLabels | Record<string, string> | No | Custom display labels for URL sections |
| sectionOrder | string[] | No | Ordering for sections in llms.txt |
| siteDescription | string | No | Site description shown in llms.txt |
| siteTitle | string | No | Site title shown in llms.txt |
| verbose | boolean | No | Log individual skipped pages with reasons |
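
To make sectionOrder and sectionLabels more concrete, here is a rough sketch of how pages might be grouped into labeled sections. It rests on assumptions the README does not confirm: that a section is the first URL path segment, that unlisted sections sort alphabetically after the listed ones, and that a section without a label falls back to its key. `groupIntoSections` and `DocSection` are hypothetical names.

```typescript
interface DocSection {
  key: string;
  label: string;
  urlPaths: string[];
}

// Hypothetical grouping: assumes a page's section is its first URL path segment.
function groupIntoSections(
  urlPaths: readonly string[],
  sectionOrder: readonly string[] = [],
  sectionLabels: Readonly<Record<string, string>> = {},
): DocSection[] {
  const groups = new Map<string, string[]>();
  for (const path of urlPaths) {
    const key = path.split("/").filter(Boolean)[0];
    if (key === undefined) continue; // the root page "/" has no section
    const group = groups.get(key) ?? [];
    group.push(path);
    groups.set(key, group);
  }

  // Listed sections come first in the given order; the rest sort alphabetically.
  const rank = (key: string): number => {
    const index = sectionOrder.indexOf(key);
    return index === -1 ? sectionOrder.length : index;
  };

  return Array.from(groups.keys())
    .sort((a, b) => rank(a) - rank(b) || a.localeCompare(b))
    .map((key) => ({
      key,
      label: sectionLabels[key] ?? key,
      urlPaths: groups.get(key) ?? [],
    }));
}
```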

Both hooks (postProcessHtml and postProcessMarkdown) receive a ProcessContext of shape { urlPath: string } as their second argument. Each may return its result directly or as a Promise.
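
The following sketch shows what a pair of hooks could look like. The hook signatures follow the table above; the `ProcessContext` interface is redeclared locally to keep the example self-contained, and the function names, the "Source:" line, and the example.com base URL are illustrative assumptions.

```typescript
// Declared locally so the sketch is self-contained; the README only states
// that the context carries { urlPath: string }.
interface ProcessContext {
  urlPath: string;
}

// Hypothetical postProcessMarkdown hook: prepend a source line to each page.
// The base URL is an illustrative placeholder.
function addSourceLine(md: string, context: ProcessContext): string {
  const source = `Source: https://docs.example.com${context.urlPath}`;
  return `${source}\n\n${md.trim()}\n`;
}

// Hypothetical postProcessHtml hook: drop <script> elements before conversion.
function stripScripts(html: string, _context: ProcessContext): string {
  return html.replace(/<script\b[^>]*>[\s\S]*?<\/script>/gi, "");
}
```

Functions like these would be passed as `postProcessMarkdown: addSourceLine` and `postProcessHtml: stripScripts` in the config object.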

`CustomLlmFile`

| Property | Type | Required | Description |
| --- | --- | --- | --- |
| filename | string | Yes | Output filename (e.g., llms-full.txt) |
| includePatterns | RegExp[] | Yes | URL patterns to include in this file |
| description | string | No | Description shown at the top of the file |
| title | string | No | Title shown at the top of the file |
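
As a rough picture of how RegExp patterns select pages, the sketch below filters URL paths with include and ignore lists. It assumes a page is kept when any include pattern matches and no ignore pattern does; llmfood's actual matching rules may differ, and `selectPages` is a hypothetical helper.

```typescript
// Hypothetical helper illustrating RegExp-based URL selection; llmfood's
// internal matching rules may differ.
function selectPages(
  urlPaths: readonly string[],
  includePatterns: readonly RegExp[],
  ignorePatterns: readonly RegExp[] = [],
): string[] {
  const matchesAny = (patterns: readonly RegExp[], path: string): boolean =>
    patterns.some((pattern) => {
      pattern.lastIndex = 0; // guard against stateful /g or /y patterns
      return pattern.test(path);
    });

  return urlPaths.filter(
    (path) => !matchesAny(ignorePatterns, path) && matchesAny(includePatterns, path),
  );
}

const pages = ["/guides/intro", "/api/client", "/blog/2024/launch"];
const apiOnly = selectPages(pages, [/^\/api\//]);
const everythingButBlog = selectPages(pages, [/.*/], [/\/blog\//]);
```

Anchoring a pattern with `^` (as in `/^\/api\//`) avoids accidentally matching the same segment deeper in a path.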

`htmlToMarkdown(html)`

Converts a Docusaurus HTML string to clean Markdown. Expects the content to be wrapped in an <article> tag.

Returns an empty string if no <article> element is found.
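
Because the converter expects an `<article>` wrapper, a caller working with HTML fragments may want to add one first. The helper below is a hypothetical convenience, not part of llmfood, and it uses a simple pattern check instead of a real HTML parser.

```typescript
// Hypothetical helper: wraps a fragment in <article> when none is present,
// so a converter that requires <article> has something to work with.
function ensureArticle(html: string): string {
  return /<article[\s>]/i.test(html) ? html : `<article>${html}</article>`;
}

// Usage sketch (not executed here):
// const markdown = htmlToMarkdown(ensureArticle("<h1>Title</h1><p>Body</p>"));
```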

Supported Docusaurus Elements

The converter handles these Docusaurus-specific elements:

- Prism code blocks: preserves language and syntax highlighting structure
- Admonitions: converts to :::type [title] syntax (tip, warning, info, caution, danger, note, important)
- Tabs: renders each tab panel with its label as a bold heading
- Details/Summary: preserves as HTML <details> elements
- KaTeX math: converts to $$...$$ (block) and $...$ (inline) syntax
- Images: converts to standard Markdown, skipping data URIs
- Tables: converts to GFM table syntax with alignment support (:---:, ---:)
- Strikethrough: converts <del> and <s> to ~~text~~
- YouTube iframes: converts to markdown links with video title
- Mermaid code blocks: preserves as fenced mermaid code blocks (when source is available)

Pages that can't be converted are tracked and summarized. Set verbose: true to see individual skipped pages with reasons (redirects, empty pages, missing files, errors).

License

MIT
