robotstxt-parser — Robots.txt Parser
A comprehensive JavaScript library for parsing, validating, and generating robots.txt files with ease.
✨ Features
- 📖 Parse robots.txt files with full directive support
- ✅ Validate robots.txt content and catch common mistakes
- 🔨 Generate robots.txt files programmatically
- 🎯 Check URL permissions for specific user agents
- ⏱️ Extract crawl delays and sitemap URLs
- 🔗 Fluent API with method chaining support
📦 Installation
bun install @xiaozhu2007/robotstxt-parser
🚀 Quick Start
Parsing a robots.txt file
import { RobotsParser } from "@xiaozhu2007/robotstxt-parser";
const parser = new RobotsParser();
const content = `
User-agent: *
Disallow: /admin/
Allow: /admin/public/
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml
`;
parser.parse(content);
// Check if a URL is allowed
console.log(parser.isAllowed("Googlebot", "/admin/public/page.html")); // true
console.log(parser.isAllowed("Googlebot", "/admin/secret.html")); // false
// Get crawl delay
console.log(parser.getCrawlDelay("Googlebot")); // 10
// Get sitemaps
console.log(parser.getSitemaps()); // ['https://example.com/sitemap.xml']
Building a robots.txt file
import { RobotsBuilder } from "@xiaozhu2007/robotstxt-parser";
const robots = new RobotsBuilder()
  .userAgent("*")
  .disallow("/admin/")
  .disallow("/private/")
  .allow("/admin/public/")
  .crawlDelay(5)
  .userAgent("Googlebot")
  .disallow("/temp/")
  .sitemap("https://example.com/sitemap.xml")
  .build();
console.log(robots);
Output:
# Robots.txt file generated by RobotsParser
User-agent: *
Disallow: /admin/
Disallow: /private/
Allow: /admin/public/
Crawl-delay: 5
User-agent: Googlebot
Disallow: /temp/
# Sitemaps
Sitemap: https://example.com/sitemap.xml
📚 API Reference
RobotsParser
parse(content)
Parse robots.txt content and extract all directives.
Parameters:
content (string): The robots.txt file content
Returns: RobotsParser - The parser instance, for chaining
isAllowed(userAgent, url)
Check if a URL is allowed for a specific user agent.
Parameters:
userAgent (string): The user agent string
url (string): The URL path to check
Returns: boolean - True if allowed, false if disallowed
getCrawlDelay(userAgent)
Get the crawl delay for a specific user agent.
Parameters:
userAgent (string): The user agent string
Returns: number|null - Crawl delay in seconds or null if not set
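For example, a minimal sketch continuing the Quick Start parser that handles the null case ("SomeBot" is a hypothetical agent name with no crawl-delay rule of its own):
// "SomeBot" is a hypothetical agent name used for illustration
const delay = parser.getCrawlDelay("SomeBot");
if (delay !== null) {
  console.log(`Wait ${delay} seconds between requests`);
} else {
  console.log("No crawl delay specified for this agent");
}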
getSitemaps()
Get all sitemap URLs from the robots.txt file.
Returns: string[] - Array of sitemap URLs
validate()
Validate the parsed robots.txt content.
Returns: Object with validation results:
{
  valid: boolean,
  errors: Array<{type: string, message: string, line?: number}>,
  warnings: Array<{type: string, message: string}>
}
generate(options)
Generate a robots.txt file from the current rules.
Parameters:
options (object): Generation options
  includeComments (boolean): Include header comments (default: true)
  sortUserAgents (boolean): Sort user agents alphabetically (default: false)
Returns: string - Generated robots.txt content
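For example, a brief sketch (reusing the Quick Start content) that re-emits the parsed rules without the header comment and with user agents sorted:
const parser = new RobotsParser();
parser.parse(content);
// Both options are optional; the defaults are listed above
const output = parser.generate({
  includeComments: false,
  sortUserAgents: true,
});
console.log(output);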
getSummary()
Get a summary of the parsed robots.txt file.
Returns: Object with summary information:
{
  userAgents: string[],
  totalRules: number,
  sitemapCount: number,
  commentCount: number
}
RobotsBuilder
userAgent(userAgent)
Set the current user agent for subsequent rules.
Parameters:
userAgent (string): User agent name (e.g., '*', 'Googlebot')
Returns: RobotsBuilder - The builder instance, for chaining
disallow(path)
Add a disallow rule for the current user agent.
Parameters:
path (string): Path to disallow
Returns: RobotsBuilder - The builder instance, for chaining
allow(path)
Add an allow rule for the current user agent.
Parameters:
path (string): Path to allow
Returns: RobotsBuilder - The builder instance, for chaining
crawlDelay(seconds)
Set crawl delay for the current user agent.
Parameters:
seconds (number): Delay in seconds
Returns: RobotsBuilder - The builder instance, for chaining
sitemap(url)
Add a sitemap URL.
Parameters:
url (string): Sitemap URL
Returns: RobotsBuilder - The builder instance, for chaining
build(options)
Build and return the robots.txt content.
Parameters:
options (object): Same as the RobotsParser.generate() options
Returns: string - Generated robots.txt content
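For instance, a minimal sketch that reuses part of the Quick Start builder chain and skips the generated header comment:
const robots = new RobotsBuilder()
  .userAgent("*")
  .disallow("/admin/")
  .sitemap("https://example.com/sitemap.xml")
  // build() accepts the same options as generate()
  .build({ includeComments: false });
console.log(robots);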
🔍 Advanced Examples
Validating a robots.txt file
const parser = new RobotsParser();
parser.parse(content);
const validation = parser.validate();
if (!validation.valid) {
  console.error("Validation errors:", validation.errors);
}
if (validation.warnings.length > 0) {
  console.warn("Warnings:", validation.warnings);
}
Handling wildcards and patterns
const parser = new RobotsParser();
parser.parse(`
User-agent: *
Disallow: /*.json$
Allow: /api/*.json$
Disallow: /temp*
`);
console.log(parser.isAllowed("*", "/data.json")); // false
console.log(parser.isAllowed("*", "/api/users.json")); // true
console.log(parser.isAllowed("*", "/temporary")); // falseGetting detailed summaries
const parser = new RobotsParser();
parser.parse(content);
const summary = parser.getSummary();
console.log(
  `Found ${summary.totalRules} rules for ${summary.userAgents.length} user agents`,
);
console.log(`Sitemaps: ${summary.sitemapCount}`);
🎯 Supported Directives
- ✅ User-agent: Specify target crawler
- ✅ Disallow: Block access to paths
- ✅ Allow: Explicitly allow access to paths
- ✅ Crawl-delay: Set delay between requests
- ✅ Sitemap: Specify sitemap locations
- ✅ Pattern matching with * and $
- ✅ Comments with #
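A short hypothetical robots.txt exercising all of these directives:
# Keep crawlers out of JSON endpoints, except the public API
User-agent: *
Disallow: /*.json$
Allow: /api/*.json$
Crawl-delay: 10
Sitemap: https://example.com/sitemap.xml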
🤝 Contributing
To install dependencies:
bun install
To run:
bun run index.ts
Contributions are welcome! Please feel free to submit a Pull Request.
📄 License
MIT License - feel free to use this library in your projects!
🐛 Bug Reports
If you discover any bugs, please create an issue with detailed information about the problem.
