google-robotstxt-parser

v1.2.0

Published

2 days ago

Pure JavaScript port of Google's robots.txt parser (google/robotstxt). Works in Node.js and the browser.


vorticoncmdr

robots.txt robotstxt googlebot google parser matcher seo crawler rep rfc9309

google-robotstxt-parser

Live Demo

A pure JavaScript port of Google's official robotstxt C++ library. Runs in both Node.js and the browser with no dependencies.

Implements the same parsing rules, typo tolerance, and URL-matching logic that Google's own crawler uses to evaluate robots.txt files.

Installation

npm install google-robotstxt-parser

Usage

import { RobotsMatcher } from 'google-robotstxt-parser';

const matcher = new RobotsMatcher();
const robotsContent = `
User-agent: *
Dissallow: /secret/   # Typo accepted by Google!
`;

const isAllowed = matcher.allowedByRobots(robotsContent, ['Googlebot'], 'https://example.com/secret/page');
console.log(isAllowed); // false

Check a single user-agent

const allowed = matcher.oneAgentAllowedByRobots(robotsContent, 'Googlebot', 'https://example.com/public/');
console.log(allowed); // true

Check multiple user-agents at once

allowedByRobots accepts an array of user-agents and, as documented in the API table below, returns true if the URL is accessible to at least one of them.

const allowed = matcher.allowedByRobots(robotsContent, ['Googlebot', 'Bingbot'], 'https://example.com/page');

API

`RobotsMatcher`

| Method | Description |
|---|---|
| `allowedByRobots(robotsTxt, userAgents, url)` | Returns `true` if the URL is accessible to at least one of the given user-agents. |
| `oneAgentAllowedByRobots(robotsTxt, userAgent, url)` | Convenience wrapper for a single user-agent string. |
| `disallow()` | Returns the raw disallow decision after a parse (useful after calling `allowedByRobots`). |
| `everSeenSpecificAgent()` | `true` if the parsed file contained a rule group for the queried agent specifically. |
| `matchingLine()` | Line number of the winning allow/disallow rule, or `0` if none matched. |

`parseRobotsTxt(robotsBody, handler)`

Low-level parser. Pass a RobotsParseHandler subclass to react to individual directives without running the full matcher.

import { parseRobotsTxt, RobotsParseHandler } from 'google-robotstxt-parser';

class MyHandler extends RobotsParseHandler {
  handleDisallow(lineNum, value) {
    console.log(`Line ${lineNum}: Disallow ${value}`);
  }
}

parseRobotsTxt(robotsContent, new MyHandler());
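To make the handler model more concrete, here is a minimal, self-contained sketch of a callback-driven line parser. It is not the library's implementation: `sketchParse`, `KEY_ALIASES`, and the handler method names `onUserAgent`, `onAllow`, and `onDisallow` are hypothetical names chosen for illustration, and the alias table covers only the misspellings named in this README.

```javascript
// Hypothetical sketch, not the library's parser: dispatch each directive
// line of a robots.txt body to an optional handler callback.
const KEY_ALIASES = new Map([
  ['user-agent', 'user-agent'],
  ['user agent', 'user-agent'], // typo variant mentioned in this README
  ['allow', 'allow'],
  ['disallow', 'disallow'],
  ['dissallow', 'disallow'],    // typo variant mentioned in this README
  ['disalow', 'disallow'],      // typo variant mentioned in this README
]);

function sketchParse(body, handler) {
  // Strip a leading UTF-8 BOM, then split on any common line ending.
  const lines = body.replace(/^\uFEFF/, '').split(/\r\n|\r|\n/);
  lines.forEach((raw, index) => {
    const lineNum = index + 1; // 1-based line numbers
    const hash = raw.indexOf('#');
    const line = (hash === -1 ? raw : raw.slice(0, hash)).trim();
    const colon = line.indexOf(':');
    if (colon === -1) return; // blank line, comment, or not a directive
    const key = KEY_ALIASES.get(line.slice(0, colon).trim().toLowerCase());
    const value = line.slice(colon + 1).trim();
    if (key === 'user-agent') handler.onUserAgent?.(lineNum, value);
    else if (key === 'allow') handler.onAllow?.(lineNum, value);
    else if (key === 'disallow') handler.onDisallow?.(lineNum, value);
  });
}

// Usage: collect only the Disallow directives, typos included.
const seen = [];
sketchParse('User agent: *\nDissallow: /secret/  # typo\nAllow: /', {
  onDisallow: (lineNum, value) => seen.push(`${lineNum}:${value}`),
});
console.log(seen); // [ '2:/secret/' ]
```

The point of the callback design is that a caller can inspect directives (for linting or reporting, say) without paying for URL matching.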

Compatibility with Google's parser

This library matches Google's behaviour in several ways that differ from a naive implementation:

- Typo tolerance: common misspellings such as Dissallow, Disalow, and User agent are accepted.
- Pattern priority: the longest matching pattern wins, regardless of the order in which the rules appear.
- Specific agent beats wildcard: if the robots.txt contains a group for the queried agent, the User-agent: * group is ignored entirely for that agent.
- URL normalisation: non-ASCII characters in allow/disallow patterns are percent-encoded to match Google's canonicalisation.
- UTF-8 BOM: silently stripped at the start of the file.
- Line length cap: lines longer than ~16 KB are truncated, matching the C++ implementation.
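The pattern-priority rule can be illustrated with a small standalone sketch. This is a simplified illustration rather than the library's code: `patternToRegExp` and `sketchIsAllowed` are hypothetical helpers, rules are assumed to be already filtered to the relevant user-agent group, and the sketch assumes that Allow wins when an Allow and a Disallow pattern of equal length both match.

```javascript
// Hypothetical sketch of longest-match rule selection; not the library's code.
// Converts a robots.txt path pattern to a RegExp: '*' matches any run of
// characters and a trailing '$' anchors the end of the path.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

// rules: [{ type: 'allow' | 'disallow', pattern: string }]
function sketchIsAllowed(rules, path) {
  let bestAllow = -1;
  let bestDisallow = -1;
  for (const { type, pattern } of rules) {
    // Empty patterns match nothing in this sketch.
    if (pattern === '' || !patternToRegExp(pattern).test(path)) continue;
    if (type === 'allow') bestAllow = Math.max(bestAllow, pattern.length);
    else bestDisallow = Math.max(bestDisallow, pattern.length);
  }
  // Longest matching pattern wins; a tie (or no match at all) means allowed.
  return bestAllow >= bestDisallow;
}

// Rule order is irrelevant: only the length of each matching pattern counts.
const rules = [
  { type: 'disallow', pattern: '/' },
  { type: 'allow', pattern: '/public/' },
  { type: 'disallow', pattern: '/public/drafts/' },
];
console.log(sketchIsAllowed(rules, '/public/page'));     // true
console.log(sketchIsAllowed(rules, '/public/drafts/a')); // false
console.log(sketchIsAllowed(rules, '/private'));         // false
```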

`/index.html` and `/index.htm` normalisation

When an Allow pattern ends in /index.html or /index.htm but does not match the requested URL, Google's parser applies a Google-specific fallback: the pattern is re-evaluated as the parent directory path anchored with $ — i.e. /dir/index.html is re-tried as /dir/$. This means the rule grants access to the exact directory URL (/dir/) but not to arbitrary paths beneath it or to /dir/index.htm (without the trailing l). It is therefore more precise than a plain Allow: /dir/ prefix match. This behaviour is inherited directly from the upstream C++ implementation (robots.cc) and is verified by the GoogleOnly_IndexHTMLisDirectory test.
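The fallback described above can be sketched as follows. This is an illustration under simplifying assumptions, not the library's or upstream's code: `simpleMatch` and `allowMatchesWithFallback` are hypothetical helpers, and the matcher deliberately omits `*` wildcards so that only the retry logic is on display.

```javascript
// Hypothetical sketch of the index.htm(l) fallback; not the library's code.
// Simplified matcher: a trailing '$' means exact match, otherwise prefix
// match. Wildcards ('*') are deliberately left out of this sketch.
function simpleMatch(pattern, path) {
  return pattern.endsWith('$')
    ? path === pattern.slice(0, -1)
    : path.startsWith(pattern);
}

function allowMatchesWithFallback(pattern, path) {
  if (simpleMatch(pattern, path)) return true;
  // Fallback: '/dir/index.html' (or '/dir/index.htm') is retried as '/dir/$'.
  const slash = pattern.lastIndexOf('/');
  const tail = pattern.slice(slash);
  if (slash !== -1 && (tail === '/index.html' || tail === '/index.htm')) {
    return simpleMatch(pattern.slice(0, slash + 1) + '$', path);
  }
  return false;
}

const pattern = '/allowed-slash/index.html';
console.log(allowMatchesWithFallback(pattern, '/allowed-slash/'));           // true
console.log(allowMatchesWithFallback(pattern, '/allowed-slash/index.html')); // true
console.log(allowMatchesWithFallback(pattern, '/allowed-slash/index.htm'));  // false
console.log(allowMatchesWithFallback(pattern, '/allowed-slash/other'));      // false
```

The second call succeeds on the direct match; only the first relies on the `$`-anchored retry.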

Browser usage

The library is a standard ES module with no Node.js-specific APIs, so it works directly in the browser:

<script type="module">
  import { RobotsMatcher } from './robots.js';

  const matcher = new RobotsMatcher();
  console.log(matcher.oneAgentAllowedByRobots('User-agent: *\nDisallow: /', 'MyBot', 'https://example.com/'));
</script>

License

Apache 2.0 — same as the upstream google/robotstxt repository.
