google-robotstxt-parser
v1.2.0
Pure JavaScript port of Google's robots.txt parser (google/robotstxt). Works in Node.js and the browser.
google-robotstxt-parser
A pure JavaScript port of Google's official robotstxt C++ library. Runs in both Node.js and the browser with no dependencies.
Implements the same parsing rules, typo tolerance, and URL-matching logic that Google's own crawler uses to evaluate robots.txt files.
Installation
npm install google-robotstxt-parser
Usage
import { RobotsMatcher } from 'google-robotstxt-parser';
const matcher = new RobotsMatcher();
const robotsContent = `
User-agent: *
Dissallow: /secret/ # Typo accepted by Google!
`;
const isAllowed = matcher.allowedByRobots(robotsContent, ['Googlebot'], 'https://example.com/secret/page');
console.log(isAllowed); // false
Check a single user-agent
const allowed = matcher.oneAgentAllowedByRobots(robotsContent, 'Googlebot', 'https://example.com/public/');
console.log(allowed); // true
Check multiple user-agents at once
allowedByRobots accepts an array of user-agents and returns true if the URL may be fetched by at least one of them.
const allowed = matcher.allowedByRobots(robotsContent, ['Googlebot', 'Bingbot'], 'https://example.com/page');
API
RobotsMatcher
| Method | Description |
|---|---|
| allowedByRobots(robotsTxt, userAgents, url) | Returns true if the URL is accessible to at least one of the given user-agents. |
| oneAgentAllowedByRobots(robotsTxt, userAgent, url) | Convenience wrapper for a single user-agent string. |
| disallow() | Returns the raw disallow decision after a parse (useful after calling allowedByRobots). |
| everSeenSpecificAgent() | true if the parsed file contained a rule group for the queried agent specifically. |
| matchingLine() | Line number of the winning allow/disallow rule, or 0 if none matched. |
parseRobotsTxt(robotsBody, handler)
Low-level parser. Pass a RobotsParseHandler subclass to react to individual directives without running the full matcher.
import { parseRobotsTxt, RobotsParseHandler } from 'google-robotstxt-parser';
class MyHandler extends RobotsParseHandler {
handleDisallow(lineNum, value) {
console.log(`Line ${lineNum}: Disallow ${value}`);
}
}
parseRobotsTxt(robotsContent, new MyHandler());
Compatibility with Google's parser
This library matches Google's behaviour in several ways that differ from a naive implementation:
- Typo tolerance: common misspellings such as Dissallow, Disalow, and User agent are accepted.
- Pattern priority: longer patterns win over shorter ones, regardless of order.
- Specific agent beats wildcard: if the robots.txt contains a group for the queried agent, the User-agent: * group is ignored entirely for that agent.
- URL normalisation: non-ASCII characters in allow/disallow patterns are percent-encoded to match Google's canonicalisation.
- UTF-8 BOM: silently stripped at the start of the file.
- Line length cap: lines longer than ~16 KB are truncated, matching the C++ implementation.
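To make the pattern-priority rule concrete, here is a minimal standalone sketch of longest-match selection. It does not use this library and is not its implementation: the helper names `patternToRegExp` and `isPathAllowed`, the rule object shape, and the choice to skip empty patterns and to let Allow win ties are all assumptions made for illustration.

```javascript
// Illustrative only: a standalone sketch, not this library's implementation.

// Convert a robots.txt path pattern to a RegExp.
// '*' matches any run of characters; a trailing '$' anchors the end.
function patternToRegExp(pattern) {
  const anchored = pattern.endsWith('$');
  const body = anchored ? pattern.slice(0, -1) : pattern;
  const escaped = body
    .split('*')
    .map((part) => part.replace(/[.*+?^${}()|[\]\\]/g, '\\$&'))
    .join('.*');
  return new RegExp('^' + escaped + (anchored ? '$' : ''));
}

// rules: [{ type: 'allow' | 'disallow', pattern: string }] (hypothetical shape)
// Returns true when the path may be fetched.
function isPathAllowed(rules, path) {
  let bestAllow = -1;
  let bestDisallow = -1;
  for (const { type, pattern } of rules) {
    // Assumption: empty patterns are ignored.
    if (pattern === '' || !patternToRegExp(pattern).test(path)) continue;
    if (type === 'allow') bestAllow = Math.max(bestAllow, pattern.length);
    else bestDisallow = Math.max(bestDisallow, pattern.length);
  }
  // Disallow wins only when its longest match is strictly longer.
  return bestDisallow <= bestAllow;
}

const rules = [
  { type: 'disallow', pattern: '/shop/' },
  { type: 'allow', pattern: '/shop/public/' },
];
console.log(isPathAllowed(rules, '/shop/public/item')); // true
console.log(isPathAllowed(rules, '/shop/cart'));        // false
```

Reversing the order of the two rules gives the same results, since only the length of the matching pattern decides the outcome.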
/index.html and /index.htm normalisation
When an Allow pattern ends in /index.html or /index.htm but does not match the requested URL, Google's parser applies a Google-specific fallback: the pattern is re-evaluated as the parent directory path anchored with $, so /dir/index.html is retried as /dir/$. For example, Allow: /dir/index.html grants access to /dir/index.html itself and to the exact directory URL /dir/, but not to arbitrary paths beneath it or to /dir/index.htm (without the trailing l). The fallback is therefore narrower than a plain Allow: /dir/ prefix match. This behaviour is inherited directly from the upstream C++ implementation (robots.cc) and is verified by the GoogleOnly_IndexHTMLisDirectory test.
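The fallback can be sketched in a few lines of standalone JavaScript. This is an illustration under stated assumptions, not the library's code: `simpleMatch` and `allowMatches` are hypothetical helpers, wildcards are omitted, and the check on the final path segment is modelled on the description above.

```javascript
// Illustrative only: a standalone sketch, not this library's implementation.

// Simplified matcher: prefix match, with a trailing '$' meaning "end of path".
// Wildcards are deliberately left out to keep the sketch short.
function simpleMatch(pattern, path) {
  if (pattern.endsWith('$')) return path === pattern.slice(0, -1);
  return path.startsWith(pattern);
}

// Returns true when an Allow pattern covers the path, including the
// /index.htm(l) -> directory fallback.
function allowMatches(pattern, path) {
  if (simpleMatch(pattern, path)) return true;
  const slash = pattern.lastIndexOf('/');
  if (slash !== -1 && pattern.slice(slash).startsWith('/index.htm')) {
    // Retry as the parent directory anchored with '$', e.g. '/dir/$'.
    return simpleMatch(pattern.slice(0, slash + 1) + '$', path);
  }
  return false;
}

console.log(allowMatches('/dir/index.html', '/dir/'));           // true
console.log(allowMatches('/dir/index.html', '/dir/index.html')); // true
console.log(allowMatches('/dir/index.html', '/dir/index.htm'));  // false
console.log(allowMatches('/dir/index.html', '/dir/other'));      // false
```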
Browser usage
The library is a standard ES module with no Node.js-specific APIs, so it works directly in the browser:
<script type="module">
import { RobotsMatcher } from './robots.js';
const matcher = new RobotsMatcher();
console.log(matcher.oneAgentAllowedByRobots('User-agent: *\nDisallow: /', 'MyBot', 'https://example.com/'));
</script>
License
Apache 2.0 — same as the upstream google/robotstxt repository.
