robotstxt-parser — Robots.txt Parser
A comprehensive JavaScript library for parsing, validating, and generating robots.txt files with ease.
✨ Features
- 📖 Parse robots.txt files with full directive support
- ✅ Validate robots.txt content and catch common mistakes
- 🔨 Generate robots.txt files programmatically
- 🎯 Check URL permissions for specific user agents
- ⏱️ Extract crawl delays and sitemap URLs
- 🔗 Fluent API with method chaining support
📦 Installation
bun install @xiaozhu2007/robotstxt-parser
🚀 Quick Start
Parsing a robots.txt file
import { RobotsParser } from "@xiaozhu2007/robotstxt-parser";
const parser = new RobotsParser();
const content = `
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
`;
parser.parse(content);
// Check if a URL is allowed
console.log(parser.isAllowed("Googlebot", "/admin/public/page.html")); // true
console.log(parser.isAllowed("Googlebot", "/admin/secret.html")); // false
// Get crawl delay
console.log(parser.getCrawlDelay("Googlebot")); // 10
// Get sitemaps
console.log(parser.getSitemaps()); // ['https://example.com/sitemap.xml']
Building a robots.txt file
import { RobotsBuilder } from "@xiaozhu2007/robotstxt-parser";
const robots = new RobotsBuilder()
  .userAgent("*")
  .disallow("/admin/")
  .disallow("/private/")
  .allow("/admin/public/")
  .crawlDelay(5)
  .userAgent("Googlebot")
  .disallow("/temp/")
  .sitemap("https://example.com/sitemap.xml")
  .build();
console.log(robots);
Output:
# Robots.txt file generated by RobotsParser
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/
Crawl-delay: 5
User-agent: Googlebot
Disallow: /temp/
# Sitemaps
Sitemap: https://example.com/sitemap.xml
📚 API Reference
RobotsParser
parse(content)
Parse robots.txt content and extract all directives.
Parameters:
content (string): The robots.txt file content
Returns: RobotsParser - The parser instance, for chaining
isAllowed(userAgent, url)
Check if a URL is allowed for a specific user agent.
Parameters:
userAgent (string): The user agent string
url (string): The URL path to check
Returns: boolean - True if allowed, false if disallowed
getCrawlDelay(userAgent)
Get the crawl delay for a specific user agent.
Parameters:
userAgent (string): The user agent string
Returns: number|null - Crawl delay in seconds or null if not set
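For example, a minimal sketch continuing the Quick Start parser that handles the null case ("SomeBot" is a hypothetical agent name with no crawl-delay rule of its own):
// "SomeBot" is a hypothetical agent name used for illustration
const delay = parser.getCrawlDelay("SomeBot");
if (delay !== null) {
  console.log(`Wait ${delay} seconds between requests`);
} else {
  console.log("No crawl delay specified for this agent");
}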
getSitemaps()
Get all sitemap URLs from the robots.txt file.
Returns: string[] - Array of sitemap URLs
validate()
Validate the parsed robots.txt content.
Returns: Object with validation results:
{
  valid: boolean,
  errors: Array<{type: string, message: string, line?: number}>,
  warnings: Array<{type: string, message: string}>
}
generate(options)
Generate a robots.txt file from the current rules.
Parameters:
options (object): Generation options
  includeComments (boolean): Include header comments (default: true)
  sortUserAgents (boolean): Sort user agents alphabetically (default: false)
Returns: string - Generated robots.txt content
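For example, a brief sketch (reusing the Quick Start content) that re-emits the parsed rules without the header comment and with user agents sorted:
const parser = new RobotsParser();
parser.parse(content);
// Both options are optional; the defaults are listed above
const output = parser.generate({
  includeComments: false,
  sortUserAgents: true,
});
console.log(output);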
getSummary()
Get a summary of the parsed robots.txt file.
Returns: Object with summary information:
{
  userAgents: string[],
  totalRules: number,
  sitemapCount: number,
  commentCount: number
}
RobotsBuilder
userAgent(userAgent)
Set the current user agent for subsequent rules.
Parameters:
userAgent (string): User agent name (e.g., '*', 'Googlebot')
Returns: RobotsBuilder - The builder instance, for chaining
disallow(path)
Add a disallow rule for the current user agent.
Parameters:
path (string): Path to disallow
Returns: RobotsBuilder - The builder instance, for chaining
allow(path)
Add an allow rule for the current user agent.
Parameters:
path (string): Path to allow
Returns: RobotsBuilder - The builder instance, for chaining
crawlDelay(seconds)
Set crawl delay for the current user agent.
Parameters:
seconds (number): Delay in seconds
Returns: RobotsBuilder - The builder instance, for chaining
sitemap(url)
Add a sitemap URL.
Parameters:
url (string): Sitemap URL
Returns: RobotsBuilder - The builder instance, for chaining
build(options)
Build and return the robots.txt content.
Parameters:
options (object): Same as the RobotsParser.generate() options
Returns: string - Generated robots.txt content
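For instance, a minimal sketch that reuses part of the Quick Start builder chain and skips the generated header comment:
const robots = new RobotsBuilder()
  .userAgent("*")
  .disallow("/admin/")
  .sitemap("https://example.com/sitemap.xml")
  // build() accepts the same options as generate()
  .build({ includeComments: false });
console.log(robots);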
🔍 Advanced Examples
Validating a robots.txt file
const parser = new RobotsParser();
parser.parse(content);
const validation = parser.validate();
if (!validation.valid) {
  console.error("Validation errors:", validation.errors);
}
if (validation.warnings.length > 0) {
  console.warn("Warnings:", validation.warnings);
}
Handling wildcards and patterns
const parser = new RobotsParser();
parser.parse(`
User-agent: *
Disallow: /*.json$
Allow: /api/*.json$
Disallow: /temp*
`);
console.log(parser.isAllowed("*", "/data.json")); // false
console.log(parser.isAllowed("*", "/api/users.json")); // true
console.log(parser.isAllowed("*", "/temporary")); // falseGetting detailed summaries
const parser = new RobotsParser();
parser.parse(content);
const summary = parser.getSummary();
console.log(
  `Found ${summary.totalRules} rules for ${summary.userAgents.length} user agents`,
);
console.log(`Sitemaps: ${summary.sitemapCount}`);
🎯 Supported Directives
- ✅ User-agent: Specify target crawler
- ✅ Disallow: Block access to paths
- ✅ Allow: Explicitly allow access to paths
- ✅ Crawl-delay: Set delay between requests
- ✅ Sitemap: Specify sitemap locations
- ✅ Pattern matching with * and $
- ✅ Comments with #
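A short hypothetical robots.txt exercising all of these directives:
# Keep crawlers out of JSON endpoints, except the public API
User-agent: *
Disallow: /*.json$
Allow: /api/*.json$
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml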
🤝 Contributing
To install dependencies:
bun install
To run:
bun run index.ts
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
MIT License - feel free to use this library in your projects!
🐛 Bug Reports
If you discover any bugs, please create an issue with detailed information about the problem.
