@playfulsparkle/robotstxt-js v1.0.10
robotstxt.js
robotstxt.js is a lightweight JavaScript library for parsing robots.txt files. It provides a standards-compliant parser that works in both browser and Node.js environments.
Directives
- Clean-param
- Host
- Sitemap
- User-agent
- Allow
- Disallow
- Crawl-delay
- Cache-delay
- Comment
- NoIndex
- Request-rate
- Robot-version
- Visit-time
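As a quick illustration, the hypothetical robots.txt below combines several of the directives listed above; the getters used are the ones documented in the API section further down, and the commented return values are assumptions about this particular input rather than guaranteed output formats.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

// Hypothetical robots.txt mixing standard and extension directives.
const parser = robotstxt(`
User-agent: *
Disallow: /cgi-bin/
Crawl-delay: 10
Visit-time: 0600-0845

Host: example.com
Clean-param: utm_source&utm_medium /articles/
Sitemap: https://example.com/sitemap.xml
`);

console.log(parser.getHost());        // e.g. "example.com"
console.log(parser.getCleanParams()); // e.g. ["utm_source&utm_medium /articles/"]
console.log(parser.getSitemaps());    // ["https://example.com/sitemap.xml"]

const everyone = parser.getGroup("*");
if (everyone) {
  console.log(everyone.getCrawlDelay()); // 10
  console.log(everyone.getVisitTime());  // e.g. "0600-0845"
}
```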
Benefits
- Accurately parse and interpret robots.txt rules.
- Ensure compliance with robots.txt standards to avoid accidental blocking of legitimate bots.
- Easily check URL permissions for different user agents programmatically.
- Simplify the process of working with robots.txt in JavaScript applications.
Usage
Here's how to use robotstxt.js to analyze robots.txt content and check crawler permissions.
Node.js
const { robotstxt } = require("@playfulsparkle/robotstxt-js")
...
JavaScript
```javascript
// Parse robots.txt content
const robotsTxtContent = `
User-Agent: GoogleBot
Allow: /public
Disallow: /private
Crawl-Delay: 5
Sitemap: https://example.com/sitemap.xml
`;
const parser = robotstxt(robotsTxtContent);
// Check URL permissions
console.log(parser.isAllowed("/public/data", "GoogleBot")); // true
console.log(parser.isDisallowed("/private/admin", "GoogleBot")); // true
// Get specific user agent group
const googleBotGroup = parser.getGroup("googlebot"); // Case-insensitive
if (googleBotGroup) {
console.log("Crawl Delay:", googleBotGroup.getCrawlDelay()); // 5
console.log("Rules:", googleBotGroup.getRules().map(rule =>
`${rule.type}: ${rule.path}`
)); // ["allow: /public", "disallow: /private"]
}
// Get all sitemaps
console.log("Sitemaps:", parser.getSitemaps()); // ["https://example.com/sitemap.xml"]
// Check default rules (wildcard *)
console.log(parser.isAllowed("/protected", "*")); // true (if no wildcard rules exist)
```
Installation
NPM
npm i @playfulsparkle/robotstxt-js
Yarn
yarn add @playfulsparkle/robotstxt-js
Bower (deprecated)
bower install playfulsparkle/robotstxt.js
API Documentation
Core Methods
- `robotstxt(content: string): RobotsTxtParser` - Creates a new parser instance with the provided robots.txt content.
- `getReports(): string[]` - Get an array of parsing errors, warnings, and other reports.
- `isAllowed(url: string, userAgent: string): boolean` - Check whether a URL is allowed for the specified user agent (throws if either parameter is missing).
- `isDisallowed(url: string, userAgent: string): boolean` - Check whether a URL is disallowed for the specified user agent (throws if either parameter is missing).
- `getGroup(userAgent: string): Group | undefined` - Get the rules group for a specific user agent (case-insensitive match).
- `getSitemaps(): string[]` - Get an array of sitemap URLs discovered from Sitemap directives.
- `getCleanParams(): string[]` - Retrieve Clean-param directives for URL parameter sanitization.
- `getHost(): string | undefined` - Get the canonical host declaration for domain normalization.
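A minimal sketch combining these core methods; the content of the report strings is not specified here, so the example simply prints whatever the parser collected.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: *
Disallow: /admin
`);

// Surface any parsing errors or warnings before trusting the results.
for (const report of parser.getReports()) {
  console.warn("robots.txt report:", report);
}

// isAllowed/isDisallowed throw when the URL or user agent argument is missing.
try {
  console.log(parser.isAllowed("/admin/settings", "ExampleBot")); // false
  console.log(parser.isDisallowed("/blog/latest", "ExampleBot")); // false
} catch (err) {
  console.error("Missing parameter:", err.message);
}
```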
Group Methods (via getGroup() result)
User Agent Info
- `getName(): string` - User agent name for this group.
- `getComment(): string[]` - Associated comments from the Comment directive.
- `getRobotVersion(): string | undefined` - Robots.txt specification version.
- `getVisitTime(): string | undefined` - Recommended crawl time window.
Crawl Management
- `getCacheDelay(): number | undefined` - Cache delay in seconds.
- `getCrawlDelay(): number | undefined` - Crawl delay in seconds.
- `getRequestRates(): string[]` - Request rate limitations.
Rule Access
- `getRules(): Rule[]` - All rules (allow/disallow/noindex) for this group.
- `addRule(type: string, path: string): void` - Add a rule to the group (throws if type or path is missing).
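A sketch exercising the group accessors above; the robots.txt content, the ExampleBot name, and the commented values are illustrative assumptions.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

const parser = robotstxt(`
User-agent: ExampleBot
Allow: /public
Disallow: /internal
Crawl-delay: 2
`);

const group = parser.getGroup("examplebot"); // case-insensitive lookup
if (group) {
  console.log(group.getName());       // e.g. "ExampleBot"
  console.log(group.getCrawlDelay()); // 2
  console.log(group.getCacheDelay()); // undefined (no Cache-delay directive)

  // Inspect the parsed allow/disallow rules.
  for (const rule of group.getRules()) {
    console.log(`${rule.type}: ${rule.path}`); // "allow: /public", "disallow: /internal"
  }

  // Rules can also be added programmatically.
  group.addRule("disallow", "/tmp");
}
```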
Specification Support
Full Support
- User-agent groups and inheritance
- Allow/Disallow directives
- Wildcard pattern matching (`*`)
- End-of-path matching (`$`)
- Crawl-delay directives
- Sitemap discovery
- Case-insensitive matching
- Default user-agent (`*`) handling
- Multiple user-agent declarations
- Rule precedence by specificity (see the sketch below)
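The sketch below shows how wildcard (`*`) and end-of-path (`$`) patterns interact with rule precedence; the robots.txt content is illustrative, and the commented results assume the specificity-based precedence listed above.

```javascript
const { robotstxt } = require("@playfulsparkle/robotstxt-js");

// Illustrative rules using wildcard (*) and end-of-path ($) patterns.
const parser = robotstxt(`
User-agent: *
Disallow: /*.pdf$
Allow: /downloads/
Disallow: /downloads/private*
`);

// "$" anchors the pattern to the end of the URL path.
console.log(parser.isAllowed("/docs/report.pdf", "ExampleBot"));  // expected: false
console.log(parser.isAllowed("/docs/report.pdfx", "ExampleBot")); // expected: true (no end-of-path match)

// The longer, more specific Disallow outweighs the shorter Allow.
console.log(parser.isAllowed("/downloads/guide.html", "ExampleBot"));    // expected: true
console.log(parser.isAllowed("/downloads/private/a.zip", "ExampleBot")); // expected: false
```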
Support
Node.js
robotstxt.js runs on Node.js 6.x and later, including all actively maintained Node.js versions.
Browser Support
This library is written using modern JavaScript ES2015 (ES6) features. It is expected to work in the following browser versions and later:
| Browser | Minimum Supported Version |
|---------|---------------------------|
| Desktop Browsers | |
| Chrome | 49 |
| Edge | 13 |
| Firefox | 45 |
| Opera | 36 |
| Safari | 14.1 |
| Mobile Browsers | |
| Chrome Android | 49 |
| Firefox for Android | 45 |
| Opera Android | 36 |
| Safari on iOS | 14.5 |
| Samsung Internet | 5.0 |
| WebView Android | 49 |
| WebView on iOS | 14.5 |
| Other | |
| Node.js | 6.13.0 |
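For browser projects built with a bundler, an ES-module style import of the same named export should presumably work as well; the import form below is an assumption (the documented entry point is the CommonJS require shown in the Usage section) and relies on the bundler's CommonJS interop.

```javascript
// Assumption: a bundler (webpack, Rollup, Vite, etc.) resolves the package and
// handles CommonJS/ESM interop; the documented form is the CommonJS require.
import { robotstxt } from "@playfulsparkle/robotstxt-js";

const parser = robotstxt("User-agent: *\nDisallow: /private");
console.log(parser.isAllowed("/private/settings", "*")); // false
```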
Specifications
- Google robots.txt specifications
- Yandex robots.txt specifications
- Sean Conner: "An Extended Standard for Robot Exclusion"
- Martijn Koster: "A Method for Web Robots Control"
- Martijn Koster: "A Standard for Robot Exclusion"
- RFC 7231, ~~2616~~
- RFC 7230, ~~2616~~
- RFC 5322, ~~2822~~, ~~822~~
- RFC 3986, ~~1808~~
- RFC 1945
- RFC 1738
- RFC 952
License
robotstxt.js is licensed under the terms of the BSD 3-Clause License.
