# Crawler Detector

`@k0nf/crawler-detector` · v1.3.0
⚡ High-speed crawler detection library for JavaScript using optimized local databases. Detects web crawlers and bots using IP addresses and user-agent patterns with microsecond-level performance.
## Features
- 🚀 Blazing Fast - Optimized detection with early-exit logic (< 0.02ms per check)
- 🌐 IPv4 & IPv6 Support - Full support for both IPv4 and IPv6 addresses using BigInt
- 📦 Zero Runtime Dependencies - No external API calls, all data is local
- 🎯 Dual Detection - Identifies crawlers by IP address and user-agent patterns
- 🔄 Auto-Updated Patterns - Easy rebuild from latest crawler sources
- 📊 Binary Search - Efficient CIDR range lookup using integer comparisons
- 🔌 Framework Middleware - Built-in support for Express, Next.js, Remix, Koa, Fastify, Hapi
- ✅ Well Tested - Comprehensive test suite with 94 passing tests
## Coverage

**IP Database:**
- IPv4: 20.7 million addresses (11,057 exact IPs + 749 CIDR ranges)
- IPv6: 2.6 sextillion addresses (475 exact IPs + 777 CIDR blocks)
- Sources: Googlebot (official API), Yandex, Meta/Facebook, TikTok, plus 27 sources from GoodBots

**User-Agent Database:**
- 602 patterns (503 substrings + 99 regex)
- Source: monperrus/crawler-user-agents
## Installation

```bash
npm install @k0nf/crawler-detector
```

## Usage
### Basic Detection
```js
const { isCrawler } = require('@k0nf/crawler-detector');

// Check both IP and user-agent
const isBot = isCrawler(
  '66.249.64.1',
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
);

console.log(isBot); // true
```

### IP-Only Detection
```js
const { isCrawlerByIP } = require('@k0nf/crawler-detector');

const isFromCrawlerIP = isCrawlerByIP('66.249.64.1');
console.log(isFromCrawlerIP); // true (Googlebot IP range)
```

### User-Agent-Only Detection
```js
const { isCrawlerByUA } = require('@k0nf/crawler-detector');

const hasCrawlerUA = isCrawlerByUA('Mozilla/5.0 (compatible; Googlebot/2.1)');
console.log(hasCrawlerUA); // true
```

### IPv6 Support
```js
const { isCrawlerByIP } = require('@k0nf/crawler-detector');

// Detect Googlebot IPv6
const isGooglebot = isCrawlerByIP('2001:4860:4801:10::1');
console.log(isGooglebot); // true

// Detect Yandex IPv6
const isYandex = isCrawlerByIP('2a02:6b8::1');
console.log(isYandex); // true
```

## Framework Middleware
Built-in middleware is available for popular Node.js frameworks. Every middleware accepts these options:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `block` | `boolean` | `false` | Block crawlers with an HTTP error |
| `blockStatusCode` | `number` | `403` | Status code for blocked requests |
| `blockMessage` | `string` | `'Forbidden'` | Response body for blocked requests |
| `onCrawlerDetected` | `function` | `null` | Custom callback when a crawler is detected |
| `ipHeaders` | `string[]` | `['x-forwarded-for', ...]` | Headers to check for the client IP |
| `ipOnly` | `boolean` | `false` | Only check the IP, skip the user-agent |
| `uaOnly` | `boolean` | `false` | Only check the user-agent, skip the IP |
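To make the option semantics concrete, here is a rough sketch of how such a blocking middleware could be wired up internally. This is illustrative only, not the package's actual source; `makeMiddleware` and `detect` are hypothetical names, and `detect` stands in for any `(ip, userAgent) => boolean` check:

```js
// Sketch: how block / blockStatusCode / blockMessage / ipHeaders could interact.
// `detect` is any (ip, userAgent) => boolean function.
function makeMiddleware(detect, opts = {}) {
  const {
    block = false,
    blockStatusCode = 403,
    blockMessage = 'Forbidden',
    ipHeaders = ['x-forwarded-for'],
  } = opts;

  return (req, res, next) => {
    // Pick the client IP from the first configured header that is present
    const headerName = ipHeaders.find((h) => req.headers[h]);
    const ip = headerName
      ? req.headers[headerName].split(',')[0].trim()
      : req.socket && req.socket.remoteAddress;

    req.isCrawler = detect(ip, req.headers['user-agent']);

    if (req.isCrawler && block) {
      res.statusCode = blockStatusCode;
      return res.end(blockMessage);
    }
    next();
  };
}
```

Note the order: the request is always tagged first, and blocking only short-circuits the chain when `block` is enabled.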
### Express / Connect
```js
const express = require('express');
const crawlerDetector = require('@k0nf/crawler-detector/middleware/express');

const app = express();

// Tag all requests with req.isCrawler (boolean)
app.use(crawlerDetector());

// Block crawlers from specific routes
app.use('/api', crawlerDetector({ block: true }));

// Custom handling
app.use(crawlerDetector({
  onCrawlerDetected: (req, res, next) => {
    console.log(`Bot detected: ${req.headers['user-agent']}`);
    next(); // continue processing
  }
}));

app.get('/', (req, res) => {
  if (req.isCrawler) {
    res.send('Hello bot!');
  } else {
    res.send('Hello human!');
  }
});
```

### Next.js
API Routes (Pages Router):
```js
// pages/api/hello.js
const { withCrawlerDetection } = require('@k0nf/crawler-detector/middleware/nextjs');

function handler(req, res) {
  res.json({ isCrawler: req.isCrawler });
}

module.exports = withCrawlerDetection(handler);

// Or block crawlers:
// module.exports = withCrawlerDetection(handler, { block: true });
```

getServerSideProps:
```js
// pages/index.js
import { isCrawlerRequest } from '@k0nf/crawler-detector/middleware/nextjs';

export async function getServerSideProps(context) {
  const isBot = isCrawlerRequest(context);
  return { props: { isBot } };
}
```

App Router Route Handlers:
```js
// app/api/hello/route.js
import { isCrawlerAppRoute } from '@k0nf/crawler-detector/middleware/nextjs';

export const runtime = 'nodejs'; // Required - not compatible with the Edge runtime

export async function GET(request) {
  const isBot = isCrawlerAppRoute(request);
  return Response.json({ isBot });
}
```

Note: requires the Node.js runtime. Not compatible with the Edge runtime, since the detection databases are loaded from the filesystem.
### Remix / React Router v7
In loaders and actions:
```js
// app/routes/index.tsx
import { json } from '@remix-run/node';
import { isCrawlerRequest } from '@k0nf/crawler-detector/middleware/remix';

export async function loader({ request }) {
  const isBot = isCrawlerRequest(request);
  if (isBot) {
    return json({ content: 'SEO-optimized content', isBot: true });
  }
  return json({ content: 'Full interactive content', isBot: false });
}
```

Block crawlers with middleware:
```js
import { createCrawlerMiddleware } from '@k0nf/crawler-detector/middleware/remix';

const crawlerMiddleware = createCrawlerMiddleware({ block: true });

export function middleware(request) {
  const blocked = crawlerMiddleware(request);
  if (blocked) return blocked;
}
```

### Koa
```js
const Koa = require('koa');
const crawlerDetector = require('@k0nf/crawler-detector/middleware/koa');

const app = new Koa();

// Tag all requests - sets ctx.state.isCrawler
app.use(crawlerDetector());

// Block crawlers
app.use(crawlerDetector({ block: true }));

app.use(async (ctx) => {
  if (ctx.state.isCrawler) {
    ctx.body = 'Hello bot!';
  } else {
    ctx.body = 'Hello human!';
  }
});
```

### Fastify
```js
const fastify = require('fastify')();
const crawlerDetectorPlugin = require('@k0nf/crawler-detector/middleware/fastify');

// Register plugin - decorates request.isCrawler
fastify.register(crawlerDetectorPlugin);

// With options
fastify.register(crawlerDetectorPlugin, { block: true });

fastify.get('/', (request, reply) => {
  if (request.isCrawler) {
    reply.send('Hello bot!');
  } else {
    reply.send('Hello human!');
  }
});
```

### Hapi
```js
const Hapi = require('@hapi/hapi');
const crawlerDetectorPlugin = require('@k0nf/crawler-detector/middleware/hapi');

const init = async () => {
  const server = Hapi.server({ port: 3000 });

  await server.register({
    plugin: crawlerDetectorPlugin,
    options: { block: false }
  });

  server.route({
    method: 'GET',
    path: '/',
    handler: (request, h) => {
      if (request.plugins.crawlerDetector.isCrawler) {
        return 'Hello bot!';
      }
      return 'Hello human!';
    }
  });

  await server.start();
};

init();
```

### Raw Node.js HTTP
```js
const http = require('http');
const withCrawlerDetection = require('@k0nf/crawler-detector/middleware/http');

const server = http.createServer(
  withCrawlerDetection((req, res) => {
    if (req.isCrawler) {
      res.end('Hello bot!');
    } else {
      res.end('Hello human!');
    }
  })
);

// Or block crawlers (pass your own request handler):
// const server = http.createServer(
//   withCrawlerDetection(handler, { block: true })
// );
```

## How It Works
The library uses two optimized local databases:
### 1. User-Agent Patterns
Patterns are sourced from monperrus/crawler-user-agents (raw JSON).
During build time, patterns are:
- Downloaded from the GitHub repository
- Classified into substrings (fast path) or regex patterns (slower path)
- Normalized and lowercased for case-insensitive matching
- Deduplicated and optimized
Detection order (fastest first, early-exit):

1. Substring matching - simple `.includes()` checks
2. Regex pattern matching - pre-compiled patterns
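This two-tier check can be sketched as follows (the pattern lists here are illustrative stand-ins, not the library's generated database):

```js
// Illustrative pattern sets; the real database is generated at build time
const substrings = ['googlebot', 'bingbot', 'slurp'];
const regexes = [/spider(\/|-)?\d/i, /crawl(er|ing)/i];

function matchesCrawlerUA(userAgent) {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase();

  // Fast path: simple substring checks, early exit on first hit
  for (const s of substrings) {
    if (ua.includes(s)) return true;
  }

  // Slower path: pre-compiled regex patterns
  for (const re of regexes) {
    if (re.test(userAgent)) return true;
  }
  return false;
}

console.log(matchesCrawlerUA('Mozilla/5.0 (compatible; Googlebot/2.1)')); // true
console.log(matchesCrawlerUA('Mozilla/5.0 (Windows NT 10.0) Chrome/120.0')); // false
```

Because the vast majority of crawler user-agents match a plain substring, the regex tier is rarely reached in practice.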
### 2. IP Database
IP ranges are manually curated from known crawler sources and stored as:
- **Exact IPs** - direct `Set` lookup (O(1))
- **CIDR ranges** - converted to integer pairs and sorted for binary search (O(log n))
Detection order (fastest first, early-exit):
- Exact IP match
- CIDR range binary search
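The CIDR lookup can be sketched like this for IPv4 (the ranges shown are illustrative; the real database is prebuilt, and IPv6 uses the same idea with `BigInt` instead of 32-bit integers):

```js
// Convert dotted-quad IPv4 to an unsigned 32-bit integer
function ipv4ToInt(ip) {
  return ip.split('.').reduce((acc, octet) => (acc * 256) + Number(octet), 0);
}

// CIDR ranges pre-converted to sorted, non-overlapping [start, end] pairs
const ranges = [
  [ipv4ToInt('66.249.64.0'), ipv4ToInt('66.249.79.255')],  // illustrative Googlebot block
  [ipv4ToInt('157.55.39.0'), ipv4ToInt('157.55.39.255')],  // illustrative Bingbot block
].sort((a, b) => a[0] - b[0]);

// Binary search over the sorted range list
function inCrawlerRange(ip) {
  const n = ipv4ToInt(ip);
  let lo = 0, hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end] = ranges[mid];
    if (n < start) hi = mid - 1;
    else if (n > end) lo = mid + 1;
    else return true;
  }
  return false;
}

console.log(inCrawlerRange('66.249.64.1')); // true
console.log(inCrawlerRange('203.0.113.9')); // false
```

Since the ranges are sorted and non-overlapping, each lookup touches only O(log n) entries regardless of database size.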
## Performance
Expected performance characteristics:
- User-Agent Detection: < 0.1ms average (substring match)
- IP Detection: < 0.01ms average (exact match) or < 0.1ms (CIDR range)
- Combined Detection: < 1ms total
Actual test results (10,000 iterations):

```
Ran 10,000 detections in 110ms
Average: 0.011ms per detection
✅ Well under 1ms target
```

## Updating Databases
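You can reproduce the shape of this benchmark with a self-contained timing loop. Note this times a toy substring check rather than the library itself, so absolute numbers will differ:

```js
// Toy benchmark: time 10,000 substring-based UA checks
const patterns = ['googlebot', 'bingbot', 'yandex'];
const ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'.toLowerCase();

const iterations = 10000;
const start = process.hrtime.bigint();
let hits = 0;
for (let i = 0; i < iterations; i++) {
  if (patterns.some((p) => ua.includes(p))) hits++;
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`Ran ${iterations} checks in ${elapsedMs.toFixed(1)}ms`);
console.log(`Average: ${(elapsedMs / iterations).toFixed(4)}ms per check`);
```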
### Rebuild All Databases

```bash
npm run build
```

This will:
- Fetch the latest user-agent patterns from GitHub
- Build the optimized user-agent database
- Build the IP database from seed file
- Validate all generated data
### Update Only User-Agent Patterns

```bash
npm run build:ua
```

### Update Only IP Database

```bash
npm run build:ip
```

### Validate Databases

```bash
npm run validate
```

## Adding Custom Crawler IPs
Edit `scripts/seed-crawler-ips.txt` and add IPs or CIDR ranges (one per line):

```
# Custom crawler IPs
203.0.113.5
198.51.100.0/24
```

Then rebuild:

```bash
npm run build:ip
```

## Testing
Run the test suite:

```bash
npm test
```

Tests cover:
- ✅ Known crawler detection (Googlebot, Bingbot, etc.)
- ✅ Regular browser exclusion (Chrome, Firefox, Safari)
- ✅ Edge cases (null, undefined, empty strings)
- ✅ IP conversion and binary search logic
- ✅ Performance benchmarks
## Architecture
```
crawler-detector/
├── index.js                      # Main API entry point
├── lib/
│   ├── ip-detector.js            # IP detection logic
│   └── ua-detector.js            # User-agent detection logic
├── data/
│   ├── crawler-ips.json          # Built IP database
│   └── crawler-ua-patterns.json  # Built UA patterns database
├── scripts/
│   ├── fetch-ua-patterns.js      # Fetch patterns from GitHub
│   ├── build-ip-database.js      # Build IP database
│   ├── validate-data.js          # Validate databases
│   ├── build-all.js              # Build all databases
│   └── seed-crawler-ips.txt      # Source IP list
└── test/
    └── test.js                   # Test suite
```

### Detection Flow
```
isCrawler(ip, userAgent)
 │
 ├─> Check IP (if provided)
 │    ├─> Exact match?     → return true
 │    └─> In CIDR range?   → return true
 │
 └─> Check User-Agent (if provided)
      ├─> Substring match? → return true
      └─> Regex match?     → return true

 return false
```

## Data Sources
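The flow above is effectively a short-circuiting OR over the two detectors. A minimal sketch, with stub checkers standing in for the real IP and UA detectors:

```js
// Stub detectors standing in for the library's real IP and UA checks
const isCrawlerByIP = (ip) => ip === '66.249.64.1';
const isCrawlerByUA = (ua) => /googlebot/i.test(ua || '');

function isCrawler(ip, userAgent) {
  // Early exit: cheapest check first; the UA is never inspected on an IP hit
  if (ip && isCrawlerByIP(ip)) return true;
  if (userAgent && isCrawlerByUA(userAgent)) return true;
  return false;
}

console.log(isCrawler('66.249.64.1', null));          // true (IP hit)
console.log(isCrawler(null, 'Googlebot/2.1'));        // true (UA hit)
console.log(isCrawler('203.0.113.9', 'Chrome/120'));  // false
```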
- **User-Agent Patterns**: monperrus/crawler-user-agents
  - Direct JSON source: crawler-user-agents.json
  - 602 patterns maintained by the community
- **IP Ranges (IPv4 & IPv6)**:
  - Googlebot: official IPv4/IPv6 ranges from the Google API (142 IPv6 /64 blocks automatically fetched)
  - 27 bot sources: GoodBots repository (includes Bingbot, Yandex, Facebook, Twitter, Telegram, Ahrefs, and more)
  - Manual additions:
    - Yandex: 15 IPv4 ranges + `2a02:6b8::/29` (IPv6)
    - Meta/Facebook: 7 IPv4 ranges (AS32934)
    - TikTok: 67 IPv4 ranges (AS138699, AS137775, AS396986)
## API Reference
### isCrawler(ip, userAgent)

Detects whether either the IP or the user-agent belongs to a known crawler.

Parameters:
- `ip` (string|null) - IP address to check
- `userAgent` (string|null) - User-Agent string to check

Returns: `boolean` - `true` if a crawler is detected
### isCrawlerByIP(ip)

Detects whether the IP address belongs to a known crawler.

Parameters:
- `ip` (string) - IP address to check

Returns: `boolean` - `true` if a crawler IP is detected
### isCrawlerByUA(userAgent)

Detects whether the user-agent belongs to a known crawler.

Parameters:
- `userAgent` (string) - User-Agent string to check

Returns: `boolean` - `true` if a crawler user-agent is detected
## License
MIT
## Contributing
Contributions welcome! Please:

- Add tests for new functionality
- Update documentation
- Run `npm test` before submitting
## Credits
Special thanks to monperrus/crawler-user-agents for maintaining the comprehensive crawler user-agent database.
