payload-plugin-scrape-ai

v0.2.2

Published

a month ago

Payload CMS plugin that auto-generates AI-friendly content (llms.txt, markdown, JSON-LD)

payload payloadcms payload-plugin cms ai llms-txt llms.txt markdown json-ld structured-data seo content-api rag ai-agents web-scraping

Make any Payload CMS website instantly accessible to AI agents and LLMs.

Auto-generates AI-friendly content mirrors of your entire site — llms.txt, structured markdown, JSON-LD, content relationship graphs, and a context search API. Everything stays in sync as content changes.

What It Generates

| Endpoint | Description |
|----------|-------------|
| /llms.txt | Curated content index (llms.txt standard) |
| /llms-full.txt | Complete content listing |
| /ai/:collection/:slug.md | Per-page markdown with YAML frontmatter |
| /ai/sitemap.json | Content relationship graph |
| /ai/structured/:collection/:slug | JSON-LD structured data |
| /ai/context?query=... | Relevance-scored search API |
| /.well-known/ai-plugin.json | Discovery manifest |
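
As a rough illustration of how a client might address these endpoints, the sketch below builds the URLs from the table. Only the paths come from this README; the helper name `aiEndpoints` and its field names are hypothetical, and since the response shapes are not documented here the sketch stops at URL construction.

```typescript
// Hypothetical helper: builds the endpoint URLs listed in the table above.
// Only the paths come from the README; function and field names are illustrative.
interface AiEndpoints {
  llmsTxt: string
  llmsFullTxt: string
  sitemap: string
  manifest: string
  markdown: (collection: string, slug: string) => string
  structured: (collection: string, slug: string) => string
  context: (query: string) => string
}

function aiEndpoints(siteUrl: string): AiEndpoints {
  // Drop trailing slashes so joined paths never contain "//".
  const base = siteUrl.replace(/\/+$/, '')
  const seg = (value: string): string => encodeURIComponent(value)
  return {
    llmsTxt: `${base}/llms.txt`,
    llmsFullTxt: `${base}/llms-full.txt`,
    sitemap: `${base}/ai/sitemap.json`,
    manifest: `${base}/.well-known/ai-plugin.json`,
    markdown: (collection, slug) => `${base}/ai/${seg(collection)}/${seg(slug)}.md`,
    structured: (collection, slug) => `${base}/ai/structured/${seg(collection)}/${seg(slug)}`,
    context: (query) => `${base}/ai/context?query=${seg(query)}`,
  }
}
```

For example, `aiEndpoints('https://your-website.com/').markdown('posts', 'hello-world')` yields `https://your-website.com/ai/posts/hello-world.md`, which a script could then fetch like any other static text resource.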

Quick Start

1. Install

npm install payload-plugin-scrape-ai

2. Add to Payload config

// payload.config.ts
import { buildConfig } from 'payload'
import { scrapeAiPlugin } from 'payload-plugin-scrape-ai'

export default buildConfig({
  plugins: [
    scrapeAiPlugin({
      siteUrl: process.env.NEXT_PUBLIC_SERVER_URL || 'https://your-website.com',
      siteName: 'My Website',
      siteDescription: 'A brief description for the llms.txt header',
    }),
  ],
})

3. Wrap your Next.js config

// next.config.mjs
import { withScrapeAi } from 'payload-plugin-scrape-ai/next'

export default withScrapeAi({
  // ... your existing Next.js config
})

This serves AI content at root-level URLs (/llms.txt, /ai/*, /.well-known/ai-plugin.json) and adds HTTP Link headers for discovery on every page.
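
To check that the discovery headers are actually present on a page, a client could read the `Link` response header and parse it. The sketch below is a minimal, generic parser in the spirit of RFC 8288; it does not assume any particular `rel` value, because the exact header the plugin emits is not shown in this README. The function name `parseLinkHeader` and the sample header in the usage note are hypothetical.

```typescript
// Minimal Link header parser (generic; not taken from the plugin).
// Limitation: does not handle commas or semicolons inside quoted parameter values.
interface LinkEntry {
  url: string
  params: Record<string, string>
}

function parseLinkHeader(header: string): LinkEntry[] {
  const entries: LinkEntry[] = []
  // Entries are comma-separated and each starts with "<url>".
  for (const part of header.split(/,\s*(?=<)/)) {
    const match = part.trim().match(/^<([^>]*)>(.*)$/)
    if (!match) continue
    const [, url = '', rest = ''] = match
    const params: Record<string, string> = {}
    // Parameters look like: ; key="value" or ; key=value
    for (const raw of rest.split(';')) {
      const eq = raw.indexOf('=')
      if (eq === -1) continue
      const key = raw.slice(0, eq).trim().toLowerCase()
      const value = raw.slice(eq + 1).trim().replace(/^"(.*)"$/, '$1')
      if (key) params[key] = value
    }
    entries.push({ url, params })
  }
  return entries
}
```

In practice the input would be something like `response.headers.get('link')` from a request to any page on the site. A made-up value such as `</llms.txt>; rel="alternate"; type="text/plain"` parses to one entry with `url` set to `/llms.txt`.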

4. Add discovery metadata (recommended)

For Next.js App Router — spread into your layout metadata:

// app/layout.tsx
import { generateAiMetadata } from 'payload-plugin-scrape-ai'

export const metadata = {
  ...generateAiMetadata('https://your-website.com'),
  // ... your other metadata
}

Or use the React components for Pages Router / manual control:

import { ScrapeAiMeta, ScrapeAiFooterTag } from 'payload-plugin-scrape-ai/discovery'

<head>
  <ScrapeAiMeta siteUrl="https://your-website.com" siteName="My Website" />
</head>
<body>
  {children}
  <ScrapeAiFooterTag siteUrl="https://your-website.com" />
</body>

5. Visit `/admin/scrape-ai`

The dashboard gives you full control: collection toggles, content preview, llms.txt ordering, AI settings, endpoint testing, and a dead letter queue for failed entries.

Configuration

scrapeAiPlugin({
  siteUrl: 'https://your-website.com',       // required
  siteName: 'My Website',                     // llms.txt header
  siteDescription: 'What this site is about', // llms.txt header

  // Collection control (auto-detected if omitted)
  collections: ['pages', 'posts', 'products'],
  exclude: ['users', 'media'],

  // Draft handling
  drafts: 'published-only', // default — or 'include-drafts'

  // AI enrichment (optional — plugin works fully without it)
  ai: {
    provider: 'openai',    // or 'anthropic'
    apiKey: process.env.AI_API_KEY,
    model: 'gpt-4.1-nano', // use the built-in token estimator to pick
  },

  // Sync tuning
  sync: {
    debounceMs: 30000,
    initialSyncConcurrency: 5,
    rateLimitPerMinute: 60,
  },

  // Collection overrides (advanced — customize generated collections)
  aiContentOverrides: { access: { read: () => true } },
  aiSyncQueueOverrides: {},
  aiAggregatesOverrides: {},

  enabled: false, // false disables the plugin at runtime but keeps its DB schema
})
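
The option names in the example above can be summarized as a type, which makes it easier to catch obvious mistakes before the config reaches Payload. This is a sketch only: `ScrapeAiOptionsSketch` and `checkOptions` are hypothetical names, the plugin's real exported types may differ, and the override options are left out.

```typescript
// Hypothetical type mirroring the option names in the example above.
// The plugin's real exported types may differ.
interface ScrapeAiOptionsSketch {
  siteUrl: string
  siteName?: string
  siteDescription?: string
  collections?: string[]
  exclude?: string[]
  drafts?: 'published-only' | 'include-drafts'
  ai?: { provider: 'openai' | 'anthropic'; apiKey?: string; model: string }
  sync?: { debounceMs?: number; initialSyncConcurrency?: number; rateLimitPerMinute?: number }
  enabled?: boolean
}

// Returns a list of problems; an empty list means nothing obviously wrong was found.
function checkOptions(opts: ScrapeAiOptionsSketch): string[] {
  const problems: string[] = []
  // siteUrl is the only option marked "required" in the example.
  if (!/^https?:\/\/[^/\s]+/.test(opts.siteUrl)) {
    problems.push('siteUrl must be an absolute http(s) URL')
  }
  // An ai block without a key usually means a missing environment variable.
  if (opts.ai && !opts.ai.apiKey) {
    problems.push('ai.apiKey is empty')
  }
  // A slug in both lists is ambiguous, so flag it rather than guess which wins.
  const excluded = opts.exclude ?? []
  const overlap = (opts.collections ?? []).filter((slug) => excluded.includes(slug))
  if (overlap.length > 0) {
    problems.push(`listed in both collections and exclude: ${overlap.join(', ')}`)
  }
  return problems
}
```

Running `checkOptions` on the options object at startup and logging any returned messages is one cheap way to surface a missing `AI_API_KEY` early.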

AI Enrichment

Works without any AI provider. When configured, it adds per-document:

- Summary: 1-2 sentence description
- Topics & Entities: key themes, named entities, category
- Semantic Chunks: content split for RAG pipelines

Enrichment uses a single batched API call per document. The built-in Token Estimator (AI Settings tab) shows the estimated cost before you enable anything.

| Provider | Budget Model | Standard Model |
|----------|--------------|----------------|
| OpenAI | gpt-4.1-nano | gpt-4.1-mini |
| Anthropic | claude-haiku-4-5 | claude-sonnet-4-6 |
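
For a back-of-the-envelope figure outside the dashboard, the arithmetic behind a cost estimate can be sketched as below. This is not the plugin's Token Estimator: the 4-characters-per-token heuristic is a common rule of thumb for English text, and the function names, the `Pricing` shape, and any prices you pass in are assumptions to be replaced with your provider's current rates.

```typescript
// Rough cost estimate; an illustration, not the plugin's Token Estimator.
// Prices are supplied by the caller in USD per million tokens.
interface Pricing {
  inputPerMTok: number
  outputPerMTok: number
}

// Rule of thumb: roughly 4 characters per token for English text.
function estimateTokens(text: string): number {
  return Math.ceil(text.length / 4)
}

function estimateCostUsd(docs: string[], outputTokensPerDoc: number, pricing: Pricing): number {
  // One call per document: the input is the document text,
  // the output is the summary, topics, and chunk boundaries.
  const inputTokens = docs.reduce((sum, doc) => sum + estimateTokens(doc), 0)
  const outputTokens = docs.length * outputTokensPerDoc
  return (inputTokens * pricing.inputPerMTok + outputTokens * pricing.outputPerMTok) / 1_000_000
}
```

With two documents of 4,000 and 2,000 characters, 300 output tokens each, and placeholder prices of 1 and 2 USD per million tokens, the estimate is (1,500 × 1 + 600 × 2) / 1,000,000 = 0.0027 USD.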

Architecture

Content Change → afterChange Hook
                    │
                    ├─ Stage 1: Extract (Lexical/Slate → Markdown)
                    ├─ Stage 2: Structure (Frontmatter, JSON-LD, Hierarchy)
                    ├─ Upsert to ai-content collection
                    │
                    ├─ Queue: AI enrichment (async, never blocks save)
                    └─ Queue: Aggregate rebuild (deduplicated)
                                │
                          Scheduler → llms.txt, sitemap, llms-full.txt

Three collections are created: ai-content (document mirrors), ai-aggregates (llms.txt, sitemap cache), ai-sync-queue (job queue). All hidden from the admin sidebar.
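
The "deduplicated" queue in the diagram can be pictured with a small in-memory sketch: several saves in quick succession should produce one aggregate rebuild, not one per save. The plugin itself stores jobs in the ai-sync-queue collection; the `DedupQueue` class, the `Job` shape, and the job kinds below are hypothetical and only illustrate the idea.

```typescript
// Hypothetical in-memory sketch of a deduplicated job queue.
// The real plugin persists jobs in a Payload collection instead.
type Job = { kind: 'enrich' | 'rebuild-aggregates'; key: string }

class DedupQueue {
  private pending = new Map<string, Job>()

  // Returns true if the job was added, false if an identical job was already waiting.
  enqueue(job: Job): boolean {
    const id = `${job.kind}:${job.key}`
    if (this.pending.has(id)) return false
    this.pending.set(id, job)
    return true
  }

  // Hands back all waiting jobs in insertion order and empties the queue.
  drain(): Job[] {
    const jobs = Array.from(this.pending.values())
    this.pending.clear()
    return jobs
  }

  get size(): number {
    return this.pending.size
  }
}
```

A scheduler tick would call `drain()` and process whatever is waiting, so ten edits between ticks still cost a single llms.txt rebuild.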

Compatibility

- Payload CMS: v3.0.0+
- Database: any (MongoDB, Postgres, SQLite)
- Rich Text: Lexical and Slate
- Hosting: long-lived Node.js process. On Vercel or other serverless hosts the scheduler won't fire between invocations, but content still syncs on every save via hooks.

Local Development

When developing locally with pnpm, install the plugin from a tarball to avoid duplicate @payloadcms/ui context issues:

# In the plugin repo
npm pack

# In your consuming project
pnpm install /path/to/payload-plugin-scrape-ai-0.2.2.tgz

A direct `pnpm install /path/to/plugin` creates a symlink, which causes React context duplication with @payloadcms/ui.

License

MIT

Built by Leonardo Zambaiti / Zepoch
