@casoon/astro-crawler-policy
v0.1.2
Published
Policy-first crawler control for Astro — generates robots.txt and llms.txt with presets, per-bot rules, AI crawler registry, and build-time audits.
Maintainers
Readme
@casoon/astro-crawler-policy
Policy-first crawler control for Astro. Generates robots.txt (and optionally llms.txt) from a typed configuration at build time.
What it does
- Generates
robots.txtfrom a typed configuration — no manual file editing required - Applies one of five built-in presets covering the most common use cases
- Supports content signals (
search,ai-input,ai-train) for newer crawler directives - Includes a bot registry with 13 known crawlers for per-bot and group-based rules
- Merges the generated output with an existing
public/robots.txt(replace / prepend / append) - Runs build-time audits that warn about common misconfigurations
- Optionally generates
llms.txt— a markdown summary of the AI content policy - Supports environment-specific overrides (e.g. lockdown on staging)
This plugin renders crawler policy. It does not enforce blocking at the network, WAF, or edge layer.
Installation
npm install @casoon/astro-crawler-policyQuick start
// astro.config.ts
import { defineConfig } from 'astro/config';
import crawlerPolicy from '@casoon/astro-crawler-policy';
export default defineConfig({
site: 'https://example.com',
integrations: [
crawlerPolicy({
preset: 'citationFriendly',
sitemaps: ['/sitemap-index.xml']
})
]
});The plugin hooks into astro:build:done and writes dist/robots.txt. With just these two options you get sensible defaults: search engines allowed, verified AI bots allowed for citation, AI training bots blocked.
Presets
Presets are the primary way to express intent. Each preset sets default content signals and group-level rules.
| Preset | Search | AI citation | AI training | Unknown AI |
|---|---|---|---|---|
| seoOnly | allow | disallow | disallow | disallow |
| citationFriendly (default) | allow | allow | disallow | disallow |
| openToAi | allow | allow | allow | allow |
| blockTraining | allow | allow | disallow | disallow |
| lockdown | disallow | disallow | disallow | disallow |
citationFriendly allows bots that do citation or summarization but blocks bots whose only purpose is training data collection (GPTBot, Google-Extended, CCBot, Bytespider, Applebot-Extended). Bots with mixed roles like ClaudeBot are allowed.
blockTraining goes further and blocks every bot with any training category, including mixed bots like ClaudeBot and meta-externalagent.
lockdown adds a global User-agent: * / Disallow: / rule, overriding everything.
Content signals
Content signals are non-standard directives appended to the wildcard User-agent: * block:
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /They communicate intent to crawlers that support them. The three signals map to:
| Signal | Meaning |
|---|---|
| search | Indexing for traditional search engines |
| aiInput | Using content as input for AI responses (citation, summarization) |
| aiTrain | Using content as AI training data |
The directive name and signal keys follow the contentsignals.org specification (proposed IETF aipref standard). Google Search Console may flag them as unrecognised directives — the audit system emits an info message when they are present.
Each preset sets default values for all three signals. You can override them individually:
crawlerPolicy({
preset: 'citationFriendly',
contentSignals: {
aiTrain: true // override just this one; search and aiInput come from the preset
}
})Groups and per-bot rules
Rules are resolved in layers, from least to most specific:
- Preset — sets group-level defaults
groups— overrides for entire bot categoriesbots— overrides for individual bots by registry ID
A bot's final action is the most specific rule that applies to it. An explicit entry in bots always wins over a groups setting.
crawlerPolicy({
preset: 'citationFriendly',
// Override an entire group
groups: {
searchEngines: 'allow', // default
verifiedAi: 'allow', // default
unknownAi: 'disallow' // default
},
// Override individual bots (takes precedence over groups)
bots: {
GPTBot: 'disallow', // blocks this bot even if verifiedAi is 'allow'
ClaudeBot: 'allow' // allows this bot even if verifiedAi were 'disallow'
}
})The three groups are:
searchEngines— bots with categorysearch(Googlebot, Bingbot)verifiedAi— verified bots with AI categories (ai-search,ai-input,ai-training)unknownAi— unverified bots or bots with categoryunknown-ai
When a bot's action resolves to 'inherit' (no group or preset covers it), the bot is omitted from the output.
Custom rules
For anything not covered by the preset or registry, use rules to add raw robots.txt directives:
crawlerPolicy({
rules: [
{
userAgent: '*',
disallow: ['/admin/', '/internal/'],
crawlDelay: 2
},
{
userAgent: 'Slurp',
disallow: ['/']
}
]
})A userAgent: '*' rule in rules is merged with the wildcard block that the preset generates — it does not create a second User-agent: * section.
Available fields per rule:
| Field | Type | Description |
|---|---|---|
| userAgent | string \| string[] | One or more User-agent values |
| allow | string[] | Paths to allow |
| disallow | string[] | Paths to disallow |
| crawlDelay | number | Crawl-delay in seconds |
| comment | string | Inline comment above the rule |
Merge strategy
When a public/robots.txt already exists, the merge strategy controls how it is combined with the generated output.
| Strategy | Result |
|---|---|
| prepend (default) | Generated output first, then existing file |
| append | Existing file first, then generated output |
| replace | Generated output only, existing file ignored |
crawlerPolicy({
mergeStrategy: 'prepend'
})Use prepend to let the generated policy take precedence. Use append to keep hand-written rules at the top. Use replace when you want full control from config and no manual overrides.
Environment overrides
The plugin detects the current environment from these variables, in order:
CONTEXT(Netlify)DEPLOYMENT_ENVIRONMENTNODE_ENV- Falls back to
'production'
Use env to apply different settings per environment:
crawlerPolicy({
preset: 'citationFriendly',
env: {
staging: { preset: 'lockdown' },
preview: { preset: 'lockdown' }
}
})Any option can be overridden per environment. Nested objects (contentSignals, bots, groups) are merged — not replaced — with the base config.
Output files
crawlerPolicy({
output: {
robotsTxt: true, // default — writes dist/robots.txt
llmsTxt: true // opt-in — writes dist/llms.txt
}
})llms.txt
When output.llmsTxt: true is set, the plugin generates dist/llms.txt alongside robots.txt. The file is a Markdown summary of the AI content policy — which crawlers are allowed or blocked, what signals are active, and where the sitemap is:
# example.com
> AI content access policy for example.com.
> Generated by @casoon/astro-crawler-policy (preset: citationFriendly).
## Content Policy
- Search indexing: allowed
- AI citation and summarization: allowed
- AI training data collection: not allowed
## AI Systems
### Allowed
- OAI-SearchBot (OpenAI)
- ClaudeBot (Anthropic)
- claude-web (Anthropic)
- PerplexityBot (Perplexity)
- meta-externalagent (Meta)
- Amazonbot (Amazon)
- Googlebot (Google)
- Bingbot (Microsoft)
### Blocked
- GPTBot (OpenAI)
- Google-Extended (Google)
- CCBot (Common Crawl)
- Bytespider (ByteDance)
- Applebot-Extended (Apple)
## Sitemaps
- https://example.com/sitemap-index.xmlDebug mode
Set debug: true to print the resolved configuration to the build log:
crawlerPolicy({ debug: true })Build output:
[@casoon/astro-crawler-policy] [debug] registry version: 2026-04-09
[@casoon/astro-crawler-policy] [debug] environment: production
[@casoon/astro-crawler-policy] [debug] preset: citationFriendly
[@casoon/astro-crawler-policy] [debug] content signals: search=yes, aiInput=yes, aiTrain=no
[@casoon/astro-crawler-policy] [debug] bot: GPTBot → disallow
[@casoon/astro-crawler-policy] [debug] bot: OAI-SearchBot → allow
...
[@casoon/astro-crawler-policy] [debug] sitemap: https://example.com/sitemap-index.xmlBot registry
The following bots are known and can be referenced by ID in bots: {}:
| ID | Provider | Categories | Group |
|---|---|---|---|
| GPTBot | OpenAI | ai-training | verifiedAi |
| OAI-SearchBot | OpenAI | ai-search, ai-input | verifiedAi |
| ClaudeBot | Anthropic | ai-input, ai-training | verifiedAi |
| claude-web | Anthropic | ai-input | verifiedAi |
| Google-Extended | Google | ai-training | verifiedAi |
| CCBot | Common Crawl | ai-training | verifiedAi |
| PerplexityBot | Perplexity | ai-search, ai-input | verifiedAi |
| Bytespider | ByteDance | ai-training | verifiedAi |
| meta-externalagent | Meta | ai-input, ai-training | verifiedAi |
| Amazonbot | Amazon | ai-search, ai-input | verifiedAi |
| Applebot-Extended | Apple | ai-training | verifiedAi |
| Googlebot | Google | search | searchEngines |
| Bingbot | Microsoft | search | searchEngines |
Extending the registry
The built-in registry covers the most common crawlers. To support bots not yet listed, use extraBots:
crawlerPolicy({
extraBots: [
{
id: 'MyCustomBot',
provider: 'Acme Corp',
userAgents: ['MyCustomBot/1.0'],
categories: ['ai-training'],
verified: true
}
],
bots: {
MyCustomBot: 'disallow'
}
})Extra bots participate in group rules, per-bot overrides, audit checks, and llms.txt output — the same as built-in bots.
Keeping the registry up to date: The registry ships as part of the package. As new crawlers emerge, updates are released as patch versions. Run npm update @casoon/astro-crawler-policy to get the latest bot data. The REGISTRY_VERSION export contains the date of the last registry update.
Audit warnings
The plugin emits warnings and info messages during the build:
| Code | Level | Condition |
|---|---|---|
| MISSING_SITE_URL | warn | No site set in Astro config |
| NO_SITEMAP | info | No sitemaps configured |
| DUPLICATE_USER_AGENT_RULE | warn | Two rules share the same User-agent |
| UNLOCKED_NON_PRODUCTION_ENVIRONMENT | warn | Staging/preview not globally blocked |
| NON_STANDARD_DIRECTIVES | info | Content signals may trigger GSC syntax warnings |
| AI_INPUT_WITHOUT_ALLOWED_BOTS | warn | aiInput enabled but all AI bots blocked |
| UNKNOWN_BOT_ID | warn | A bot ID in bots: {} is not in the registry |
| GROUP_BOT_OVERRIDE_CONFLICT | info | Bot override contradicts its group rule |
Audit settings:
crawlerPolicy({
audit: {
warnOnMissingSitemap: true, // default
warnOnConflicts: true // default
}
})Programmatic usage
The core modules are exported for use outside of the Astro integration:
import {
compilePolicy,
renderRobotsTxt,
renderLlmsTxt,
auditPolicy,
defaultRegistry,
REGISTRY_VERSION
} from '@casoon/astro-crawler-policy';
const policy = compilePolicy({
options: { preset: 'citationFriendly', sitemaps: ['/sitemap-index.xml'] },
site: 'https://example.com',
environment: 'production'
});
const robotsTxt = renderRobotsTxt(policy);
const llmsTxt = renderLlmsTxt(policy, 'https://example.com');
const issues = auditPolicy(policy, { site: 'https://example.com', registry: defaultRegistry });Generated output examples
citationFriendly (default)
crawlerPolicy({
preset: 'citationFriendly',
sitemaps: ['/sitemap-index.xml']
})# Generated by @casoon/astro-crawler-policy
# preset: citationFriendly
User-agent: *
Content-Signal: search=yes, ai-input=yes, ai-train=no
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Allow: /
User-agent: ClaudeBot
Allow: /
User-agent: claude-web
Allow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Allow: /
User-agent: Bytespider
Disallow: /
User-agent: meta-externalagent
Allow: /
User-agent: Amazonbot
Allow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /
Sitemap: https://example.com/sitemap-index.xmlseoOnly
crawlerPolicy({ preset: 'seoOnly' })# Generated by @casoon/astro-crawler-policy
# preset: seoOnly
User-agent: *
Content-Signal: search=yes, ai-input=no, ai-train=no
Allow: /
User-agent: GPTBot
Disallow: /
User-agent: OAI-SearchBot
Disallow: /
User-agent: ClaudeBot
Disallow: /
User-agent: claude-web
Disallow: /
User-agent: Google-Extended
Disallow: /
User-agent: CCBot
Disallow: /
User-agent: PerplexityBot
Disallow: /
User-agent: Bytespider
Disallow: /
User-agent: meta-externalagent
Disallow: /
User-agent: Amazonbot
Disallow: /
User-agent: Applebot-Extended
Disallow: /
User-agent: Googlebot
Allow: /
User-agent: Bingbot
Allow: /lockdown (staging/preview)
crawlerPolicy({
env: {
staging: { preset: 'lockdown' },
preview: { preset: 'lockdown' }
}
})When CONTEXT=staging or NODE_ENV=staging:
# Generated by @casoon/astro-crawler-policy
# preset: lockdown
User-agent: *
Content-Signal: search=no, ai-input=no, ai-train=no
Disallow: /This tool only works for crawlers and AI bots that actually respect robots.txt. Respect, however, is rare these days.
