@playfulsparkle/robotstxt-js v1.0.10
robotstxt.js
robotstxt.js is a lightweight JavaScript library for parsing robots.txt files. It provides a standards-compliant parser that works in both browser and Node.js environments.
Directives
- Clean-param
- Host
- Sitemap
- User-agent
- Allow
- Disallow
- Crawl-delay
- Cache-delay
- Comment
- NoIndex
- Request-rate
- Robot-version
- Visit-time
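As a quick illustration, the hypothetical robots.txt below combines several of the directives listed above; the getters used are the ones documented in the API section further down, and the commented return values are assumptions about this particular input rather than guaranteed output formats.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

// Hypothetical robots.txt mixing standard and extension directives.
const parser = robotstxt(`
User-agent: *
Disallow: /cgi-bin/
Crawl-delay: 10
Visit-time: 0600-0845

Host: example.com
Clean-param: utm_source&utm_medium /articles/
Sitemap: https://example.com/sitemap.xml
`);

console.log(parser.getHost());        // e.g. "example.com"
console.log(parser.getCleanParams()); // e.g. ["utm_source&utm_medium /articles/"]
console.log(parser.getSitemaps());    // ["https://example.com/sitemap.xml"]

const everyone = parser.getGroup("*");
if (everyone) {
  console.log(everyone.getCrawlDelay()); // 10
  console.log(everyone.getVisitTime());  // e.g. "0600-0845"
}
```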
Benefits
- Accurately parse and interpret robots.txt rules.
- Ensure compliance with robots.txt standards to avoid accidental blocking of legitimate bots.
- Easily check URL permissions for different user agents programmatically.
- Simplify the process of working with robots.txt in JavaScript applications.
Usage
Here's how to use robotstxt.js to analyze robots.txt content and check crawler permissions.
Node.js
const { robotstxt } = require("@playfulsparkle/robotstxt-js")
...
JavaScript
```javascript
// Parse robots.txt content
const robotsTxtContent = `
User-Agent: GoogleBot
Allow: /public
Disallow: /private
Crawl-Delay: 5
Sitemap: https://example.com/sitemap.xml
`;
const parser = robotstxt(robotsTxtContent);
// Check URL permissions
console.log(parser.isAllowed("/public/data", "GoogleBot")); // true
console.log(parser.isDisallowed("/private/admin", "GoogleBot")); // true
// Get specific user agent group
const googleBotGroup = parser.getGroup("googlebot"); // Case-insensitive
if (googleBotGroup) {
console.log("Crawl Delay:", googleBotGroup.getCrawlDelay()); // 5
console.log("Rules:", googleBotGroup.getRules().map(rule =>
`${rule.type}: ${rule.path}`
)); // ["allow: /public", "disallow: /private"]
}
// Get all sitemaps
console.log("Sitemaps:", parser.getSitemaps()); // ["https://example.com/sitemap.xml"]
// Check default rules (wildcard *)
console.log(parser.isAllowed("/protected", "*")); // true (if no wildcard rules exist)
```
Installation
NPM
npm i @playfulsparkle/robotstxt-js
Yarn
yarn add @playfulsparkle/robotstxt-js
Bower (deprecated)
bower install playfulsparkle/robotstxt.js
API Documentation
Core Methods
- `robotstxt(content: string): RobotsTxtParser` - Creates a new parser instance with the provided robots.txt content.
- `getReports(): string[]` - Get an array of parsing errors, warnings, and other reports.
- `isAllowed(url: string, userAgent: string): boolean` - Check whether a URL is allowed for the specified user agent (throws if either parameter is missing).
- `isDisallowed(url: string, userAgent: string): boolean` - Check whether a URL is disallowed for the specified user agent (throws if either parameter is missing).
- `getGroup(userAgent: string): Group | undefined` - Get the rules group for a specific user agent (case-insensitive match).
- `getSitemaps(): string[]` - Get an array of sitemap URLs discovered from Sitemap directives.
- `getCleanParams(): string[]` - Retrieve Clean-param directives for URL parameter sanitization.
- `getHost(): string | undefined` - Get the canonical host declaration for domain normalization.
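A minimal sketch combining these core methods; the content of the report strings is not specified here, so the example simply prints whatever the parser collected.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: *
Disallow: /admin
`);

// Surface any parsing errors or warnings before trusting the results.
for (const report of parser.getReports()) {
  console.warn("robots.txt report:", report);
}

// isAllowed/isDisallowed throw when the URL or user agent argument is missing.
try {
  console.log(parser.isAllowed("/admin/settings", "ExampleBot")); // false
  console.log(parser.isDisallowed("/blog/latest", "ExampleBot")); // false
} catch (err) {
  console.error("Missing parameter:", err.message);
}
```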
Group Methods (via getGroup() result)
User Agent Info
- `getName(): string` - User agent name for this group.
- `getComment(): string[]` - Associated comments from the Comment directive.
- `getRobotVersion(): string | undefined` - Robots.txt specification version.
- `getVisitTime(): string | undefined` - Recommended crawl time window.
Crawl Management
- `getCacheDelay(): number | undefined` - Cache delay in seconds.
- `getCrawlDelay(): number | undefined` - Crawl delay in seconds.
- `getRequestRates(): string[]` - Request rate limitations.
Rule Access
- `getRules(): Rule[]` - All rules (allow/disallow/noindex) for this group.
- `addRule(type: string, path: string): void` - Add a rule to the group (throws if type or path is missing).
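A sketch exercising the group accessors above; the robots.txt content, the ExampleBot name, and the commented values are illustrative assumptions.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: ExampleBot
Allow: /public
Disallow: /internal
Crawl-delay: 2
`);

const group = parser.getGroup("examplebot"); // case-insensitive lookup
if (group) {
  console.log(group.getName());       // e.g. "ExampleBot"
  console.log(group.getCrawlDelay()); // 2
  console.log(group.getCacheDelay()); // undefined (no Cache-delay directive)

  // Inspect the parsed allow/disallow rules.
  for (const rule of group.getRules()) {
    console.log(`${rule.type}: ${rule.path}`); // "allow: /public", "disallow: /internal"
  }

  // Rules can also be added programmatically.
  group.addRule("disallow", "/tmp");
}
```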
Specification Support
Full Support
- User-agent groups and inheritance
- Allow/Disallow directives
- Wildcard pattern matching (`*`)
- End-of-path matching (`$`)
- Crawl-delay directives
- Sitemap discovery
- Case-insensitive matching
- Default user-agent (`*`) handling
- Multiple user-agent declarations
- Rule precedence by specificity (see the sketch below)
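The sketch below shows how wildcard (`*`) and end-of-path (`$`) patterns interact with rule precedence; the robots.txt content is illustrative, and the commented results assume the specificity-based precedence listed above.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

// Illustrative rules using wildcard (*) and end-of-path ($) patterns.
const parser = robotstxt(`
User-agent: *
Disallow: /*.pdf$
Allow: /downloads/
Disallow: /downloads/private*
`);

// "$" anchors the pattern to the end of the URL path.
console.log(parser.isAllowed("/docs/report.pdf", "ExampleBot"));  // expected: false
console.log(parser.isAllowed("/docs/report.pdfx", "ExampleBot")); // expected: true (no end-of-path match)

// The longer, more specific Disallow outweighs the shorter Allow.
console.log(parser.isAllowed("/downloads/guide.html", "ExampleBot"));    // expected: true
console.log(parser.isAllowed("/downloads/private/a.zip", "ExampleBot")); // expected: false
```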
Support
Node.js
robotstxt.js runs on Node.js 6.x and later, including all actively maintained Node.js versions.
Browser Support
This library is written using modern JavaScript ES2015 (ES6) features. It is expected to work in the following browser versions and later:
| Browser | Minimum Supported Version |
|---------|---------------------------|
| Desktop Browsers | |
| Chrome | 49 |
| Edge | 13 |
| Firefox | 45 |
| Opera | 36 |
| Safari | 14.1 |
| Mobile Browsers | |
| Chrome Android | 49 |
| Firefox for Android | 45 |
| Opera Android | 36 |
| Safari on iOS | 14.5 |
| Samsung Internet | 5.0 |
| WebView Android | 49 |
| WebView on iOS | 14.5 |
| Other | |
| Node.js | 6.13.0 |
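For browser projects built with a bundler, an ES-module style import of the same named export should presumably work as well; the import form below is an assumption (the documented entry point is the CommonJS require shown in the Usage section) and relies on the bundler's CommonJS interop.

```javascript
// Assumption: a bundler (webpack, Rollup, Vite, etc.) resolves the package and
// handles CommonJS/ESM interop; the documented form is the CommonJS require.
import { robotstxt } from "@playfulsparkle/robotstxt-js";

const parser = robotstxt("User-agent: *\nDisallow: /private");
console.log(parser.isAllowed("/private/settings", "*")); // false
```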
Specifications
- Google robots.txt specifications
- Yandex robots.txt specifications
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
- Martijn Koster: "A Standard for Robot Exclusion"
- RFC 7231, ~~2616~~
- RFC 7230, ~~2616~~
- RFC 5322, ~~2822~~, ~~822~~
- RFC 3986, ~~1808~~
- RFC 1945
- RFC 1738
- RFC 952
License
robotstxt.js is licensed under the terms of the BSD 3-Clause License.
