
@trybyte/robotstxt-parser

v1.2.0

Published

Google's robots.txt parser ported to TypeScript - RFC 9309 compliant

Readme

robotstxt-parser

A TypeScript port of Google's official C++ robots.txt parser, fully compliant with RFC 9309 (Robots Exclusion Protocol).

Features

  • RFC 9309 Compliant: Implements the official Robots Exclusion Protocol specification
  • Google-Compatible: Matches Google's crawler behavior, including handling of edge cases and typos
  • Zero Dependencies: Pure TypeScript implementation with no runtime dependencies
  • Type-Safe: Full TypeScript support with comprehensive type definitions
  • Pattern Matching: Supports wildcards (*) and end anchors ($) in patterns
  • Typo Tolerance: Accepts common typos like disalow, useragent, site-map
  • Bulk Checking: Parse once, check many URLs efficiently with ParsedRobots

Installation

# Using npm
npm install robotstxt-parser

# Using bun
bun add robotstxt-parser

# Using pnpm
pnpm add robotstxt-parser

Quick Start

import { RobotsMatcher } from "robotstxt-parser";

const robotsTxt = `
User-agent: *
Disallow: /private/
Allow: /public/

User-agent: Googlebot
Allow: /
`;

const matcher = new RobotsMatcher();

// Check if a URL is allowed for a specific user agent
const isAllowed = matcher.oneAgentAllowedByRobots(
	robotsTxt,
	"MyBot",
	"https://example.com/public/page.html",
);
console.log(isAllowed); // true

// Check with multiple user agents
const allowed = matcher.allowedByRobots(
	robotsTxt,
	["Googlebot", "MyBot"],
	"https://example.com/private/secret.html",
);
console.log(allowed); // true (Googlebot is allowed everywhere)

Bulk Checking

For checking many URLs against the same robots.txt, use ParsedRobots to avoid re-parsing:

import { ParsedRobots } from "robotstxt-parser";

const robotsTxt = `
User-agent: *
Disallow: /private/
Allow: /public/
`;

// Parse once
const parsed = ParsedRobots.parse(robotsTxt);

// Check many URLs efficiently
const urls = [
	"https://example.com/public/page1.html",
	"https://example.com/private/secret.html",
	"https://example.com/about",
];

const results = parsed.checkUrls("MyBot", urls);
for (const result of results) {
	console.log(`${result.url}: ${result.allowed ? "allowed" : "blocked"}`);
}
// Output:
// https://example.com/public/page1.html: allowed
// https://example.com/private/secret.html: blocked
// https://example.com/about: allowed

API Reference

RobotsMatcher

The main class for checking URL access against robots.txt rules.

import { RobotsMatcher } from "robotstxt-parser";

const matcher = new RobotsMatcher();

Methods

| Method | Description |
| --- | --- |
| oneAgentAllowedByRobots(robotsTxt, userAgent, url) | Check if URL is allowed for a single user agent |
| allowedByRobots(robotsTxt, userAgents[], url) | Check if URL is allowed for any of the user agents |
| disallow() | Returns true if URL is disallowed (after calling allowedByRobots) |
| disallowIgnoreGlobal() | Same as disallow() but ignores * rules |
| everSeenSpecificAgent() | Returns true if robots.txt contained rules for the specified agent |
| matchingLine() | Returns the line number that matched, or 0 |
| static isValidUserAgentToObey(userAgent) | Validates user agent format (only [a-zA-Z_-] allowed) |
| static parse(robotsTxt) | Returns a ParsedRobots instance for bulk URL checking |
| static batchCheck(robotsTxt, userAgent, urls[]) | Convenience method for bulk checking (parses + checks) |
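The isValidUserAgentToObey rule from the table (only [a-zA-Z_-] characters) can be sketched as a standalone check. This is an illustration of the documented constraint, not the library's implementation:

```typescript
// Sketch of the documented user-agent validity rule: a product token is
// obeyed only if it consists solely of [a-zA-Z_-] characters.
// Illustrative only -- the library ships its own static method.
function isValidUserAgentToObey(userAgent: string): boolean {
	return userAgent.length > 0 && /^[a-zA-Z_-]+$/.test(userAgent);
}

console.log(isValidUserAgentToObey("Googlebot")); // true
console.log(isValidUserAgentToObey("my-bot")); // true
console.log(isValidUserAgentToObey("MyBot/1.0")); // false (slash and digits)
console.log(isValidUserAgentToObey("")); // false
```

In practice this means you should pass the bare product token (e.g. "MyBot"), not a full HTTP User-Agent header with version and URL, when matching against robots.txt groups.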

ParsedRobots

Efficient bulk URL checking by separating parsing from matching. Parse once, check many URLs.

import { ParsedRobots } from "robotstxt-parser";

const parsed = ParsedRobots.parse(robotsTxt);

// Check multiple URLs
const results = parsed.checkUrls("Googlebot", urls);

// Check a single URL
const result = parsed.checkUrl("Googlebot", "https://example.com/page");

Methods

| Method | Description |
| --- | --- |
| static parse(robotsTxt) | Parse robots.txt and return a ParsedRobots instance |
| checkUrls(userAgent, urls[]) | Check multiple URLs, returns UrlCheckResult[] |
| checkUrl(userAgent, url) | Check a single URL, returns UrlCheckResult |
| hasSpecificAgent(userAgent) | Returns true if robots.txt has rules for this agent |
| getExplicitAgents() | Returns array of user-agents explicitly mentioned |

UrlCheckResult

interface UrlCheckResult {
	url: string; // The URL that was checked
	allowed: boolean; // Whether crawling is allowed
	matchingLine: number; // Line number of matching rule (0 if none)
	matchedPattern: string; // The pattern that matched
	matchedRuleType: "allow" | "disallow" | "none";
}

parseRobotsTxt

Low-level parsing function for custom handling.

import { parseRobotsTxt, RobotsParseHandler } from "robotstxt-parser";

class MyHandler extends RobotsParseHandler {
	handleRobotsStart(): void {
		/* ... */
	}
	handleRobotsEnd(): void {
		/* ... */
	}
	handleUserAgent(lineNum: number, value: string): void {
		/* ... */
	}
	handleAllow(lineNum: number, value: string): void {
		/* ... */
	}
	handleDisallow(lineNum: number, value: string): void {
		/* ... */
	}
	handleSitemap(lineNum: number, value: string): void {
		/* ... */
	}
	handleUnknownAction(lineNum: number, action: string, value: string): void {
		/* ... */
	}
}

parseRobotsTxt(robotsTxtContent, new MyHandler());

RobotsParsingReporter

A parse handler that collects detailed information about each line.

import {
	parseRobotsTxt,
	RobotsParsingReporter,
	RobotsTagName,
} from "robotstxt-parser";

const reporter = new RobotsParsingReporter();
parseRobotsTxt(robotsTxt, reporter);

console.log(reporter.validDirectives()); // Count of valid directives
console.log(reporter.unusedDirectives()); // Count of unrecognized tags
console.log(reporter.lastLineSeen()); // Last line number parsed
console.log(reporter.parseResults()); // Array of RobotsParsedLine objects

RobotsMatchStrategy

Interface for implementing custom matching strategies.

import {
	RobotsMatchStrategy,
	LongestMatchRobotsMatchStrategy,
	matches,
} from "robotstxt-parser";

// Default implementation uses the longest-match strategy
const strategy = new LongestMatchRobotsMatchStrategy();

// Custom implementation: return the match priority
// (pattern length on match, -1 on no match)
class MyStrategy implements RobotsMatchStrategy {
	matchAllow(path: string, pattern: string): number {
		return matches(path, pattern) ? pattern.length : -1;
	}
	matchDisallow(path: string, pattern: string): number {
		return matches(path, pattern) ? pattern.length : -1;
	}
}

Types

import {
	KeyType, // Enum: USER_AGENT, SITEMAP, ALLOW, DISALLOW, UNKNOWN
	RobotsTagName, // Enum: Unknown, UserAgent, Allow, Disallow, Sitemap, Unused
	LineMetadata, // Interface for line parsing metadata
	RobotsParsedLine, // Interface for complete parsed line info
} from "robotstxt-parser";

Utility Functions

import {
	getPathParamsQuery, // Extract path from URL
	maybeEscapePattern, // Normalize percent-encoding
	matches, // Check if path matches pattern
} from "robotstxt-parser";

// Extract path from URL
getPathParamsQuery("https://example.com/path?query=1"); // '/path?query=1'

// Check pattern matching
matches("/foo/bar", "/foo/*"); // true
matches("/foo/bar", "/baz"); // false

Pattern Matching

The parser supports standard robots.txt pattern syntax:

| Pattern | Matches |
| --- | --- |
| /path | Any URL starting with /path |
| /path* | Same as /path (implicit trailing wildcard) |
| *.php | Any URL containing .php |
| /path$ | Exactly /path (end anchor) |
| /fish*.php | /fish.php, /fish123.php, etc. |

Priority: When both Allow and Disallow match, the longer pattern wins.
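These semantics can be modeled in a few lines (a standalone sketch of the rules above, not the library's pattern-matcher; ties go to Allow, matching Google's behavior):

```typescript
// Sketch of robots.txt pattern semantics: "*" matches any run of
// characters, a trailing "$" anchors the end, and patterns otherwise
// match as path prefixes. Illustrative only.
function patternMatches(path: string, pattern: string): boolean {
	let regex = pattern
		.replace(/[.+?^{}()|[\]\\$]/g, "\\$&") // escape regex metacharacters
		.replace(/\*/g, ".*"); // "*" -> ".*"
	if (regex.endsWith("\\$")) {
		regex = regex.slice(0, -2) + "$"; // trailing "$" becomes an end anchor
	}
	return new RegExp("^" + regex).test(path);
}

// Longest-match priority: the longest matching pattern wins;
// on a tie, Allow wins.
function decide(path: string, allows: string[], disallows: string[]): boolean {
	const best = (patterns: string[]) =>
		Math.max(
			-1,
			...patterns.filter((p) => patternMatches(path, p)).map((p) => p.length),
		);
	const allowScore = best(allows);
	const disallowScore = best(disallows);
	if (disallowScore === -1) return true; // nothing disallows it
	return allowScore >= disallowScore; // longer pattern wins; ties go to Allow
}

console.log(decide("/public/page.html", ["/public/"], ["/"])); // true
console.log(decide("/private/x", ["/public/"], ["/private/"])); // false
```

In the first call, Allow: /public/ (8 characters) beats Disallow: / (1 character), which is exactly why the Quick Start example above reports /public/page.html as allowed.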

Production Usage

This library is designed for correctness and RFC 9309 compliance. When using it in production environments that fetch robots.txt from untrusted sources, consider these safeguards:

File Size Limits

The library does not enforce a file size limit. RFC 9309 requires parsers to handle at least 500 KiB, and Google ignores content beyond that size, so implement a size check before parsing:

const MAX_ROBOTS_SIZE = 500 * 1024; // 500 KiB (per RFC 9309)

async function fetchAndParse(url: string) {
  const response = await fetch(url);
  const contentLength = response.headers.get('content-length');

  if (contentLength && parseInt(contentLength) > MAX_ROBOTS_SIZE) {
    throw new Error('robots.txt too large');
  }

  const text = await response.text();
  if (text.length > MAX_ROBOTS_SIZE) {
    throw new Error('robots.txt too large');
  }

  return ParsedRobots.parse(text);
}

Timeouts

Implement timeouts when fetching robots.txt to prevent hanging requests.
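One way to do this is a generic race-based timeout wrapper (a sketch; the helper name and the timeout values are ours, not part of this library). For fetch specifically, passing signal: AbortSignal.timeout(ms) achieves the same thing and also cancels the underlying request:

```typescript
// Sketch: reject a promise that takes longer than the given budget.
// Caller-side safeguard, not part of robotstxt-parser.
async function withTimeout<T>(promise: Promise<T>, ms: number): Promise<T> {
	let timer: ReturnType<typeof setTimeout>;
	const timeout = new Promise<never>((_, reject) => {
		timer = setTimeout(() => reject(new Error(`timed out after ${ms}ms`)), ms);
	});
	try {
		return await Promise.race([promise, timeout]);
	} finally {
		clearTimeout(timer!); // don't keep the event loop alive
	}
}

// Usage sketch (5s is an arbitrary example budget):
// const text = await withTimeout(
// 	fetch("https://example.com/robots.txt").then((r) => r.text()),
// 	5000,
// );
```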

Google-Specific Behaviors

This library is a port of Google's C++ parser and includes several behaviors that are Google-specific extensions beyond RFC 9309:

| Behavior | Google | RFC 9309 |
| --- | --- | --- |
| Line length limit | Truncates at 16,664 bytes | No limit specified |
| Typo tolerance | Accepts "disalow", "useragent", etc. | "MAY be lenient" (unspecified) |
| index.html normalization | Allow: /path/index.html also allows /path/ | Not specified |
| User-agent * with trailing text | * foo treated as global agent | Not specified |
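The typo-tolerance row amounts to normalizing directive keys before dispatch. A standalone sketch, covering only the variants this README mentions (the library's internal table may recognize more):

```typescript
// Sketch of typo-tolerant directive recognition: map raw keys
// (case-insensitive, common misspellings included) to canonical
// directives. Only the variants named in this README are listed here.
const KEY_ALIASES: Record<string, string> = {
	"user-agent": "user-agent",
	useragent: "user-agent",
	disallow: "disallow",
	disalow: "disallow",
	allow: "allow",
	sitemap: "sitemap",
	"site-map": "sitemap",
};

function normalizeKey(rawKey: string): string | null {
	return KEY_ALIASES[rawKey.trim().toLowerCase()] ?? null;
}

console.log(normalizeKey("Disalow")); // "disallow"
console.log(normalizeKey("Site-Map")); // "sitemap"
console.log(normalizeKey("noindex")); // null (unknown directive)
```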

The core matching behavior (longest-match-wins, case-insensitive user-agent matching, UTF-8 encoding) follows RFC 9309.

Note: This library only handles parsing and matching. HTTP behaviors like redirect following, caching, and status code handling are your responsibility to implement.
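For status codes in particular, RFC 9309 section 2.3.1 distinguishes "unavailable" (4xx: crawling may proceed) from "unreachable" (5xx: assume complete disallow). A caller-side policy sketch (the function and type names are ours, not part of this library):

```typescript
type RobotsPolicy = "parse-body" | "allow-all" | "disallow-all";

// Sketch of RFC 9309 section 2.3.1 status-code handling. This is the
// caller's responsibility, not something robotstxt-parser does.
// Redirects (3xx) are assumed to be followed by the HTTP client;
// RFC 9309 says to follow at least five consecutive redirects.
function policyForStatus(status: number): RobotsPolicy {
	if (status >= 200 && status < 300) return "parse-body"; // parse and obey the rules
	if (status >= 400 && status < 500) return "allow-all"; // "unavailable": crawling may proceed
	if (status >= 500) return "disallow-all"; // "unreachable": assume complete disallow
	return "disallow-all"; // conservative default for anything unhandled
}

console.log(policyForStatus(200)); // "parse-body"
console.log(policyForStatus(404)); // "allow-all"
console.log(policyForStatus(503)); // "disallow-all"
```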

Project Structure

src/
├── index.ts           # Main entry point, re-exports public API
├── matcher.ts         # RobotsMatcher class - URL matching logic
├── parsed-robots.ts   # ParsedRobots class - bulk URL checking
├── parser.ts          # robots.txt parsing engine
├── pattern-matcher.ts # Wildcard pattern matching algorithm
├── match-strategy.ts  # Match priority strategy interface
├── parsed-key.ts      # Directive key recognition (with typo support)
├── reporter.ts        # RobotsParsingReporter for analysis
├── url-utils.ts       # URL path extraction and encoding
├── types.ts           # TypeScript interfaces and enums
└── constants.ts       # Configuration constants

tests/
├── matcher.test.ts    # URL matching tests
├── bulk-check.test.ts # Bulk URL checking tests
├── reporter.test.ts   # Parser reporting tests
└── url-utils.test.ts  # URL utility tests

Development

# Install dependencies
bun install

# Run tests
bun test

# Build for distribution
bun run build

License

Apache-2.0

This is a TypeScript port of Google's robots.txt parser.