Crawler Detector
⚡ High-speed crawler detection library for JavaScript using optimized local databases. Detects web crawlers and bots using IP addresses and user-agent patterns with microsecond-level performance.
Features
- 🚀 Blazing Fast - Optimized detection with early-exit logic (< 0.02ms per check)
- 🌐 IPv4 & IPv6 Support - Full support for both IPv4 and IPv6 addresses using BigInt
- 📦 Zero Runtime Dependencies - No external API calls, all data is local
- 🎯 Dual Detection - Identifies crawlers by IP address and user-agent patterns
- 🔄 Auto-Updated Patterns - Easy rebuild from latest crawler sources
- 📊 Binary Search - Efficient CIDR range lookup using integer comparisons
- ✅ Well Tested - Comprehensive test suite with 39 passing tests
Coverage
IP Database:
- IPv4: 20.7 million addresses (11,057 exact IPs + 749 CIDR ranges)
- IPv6: 2.6 sextillion addresses (475 exact IPs + 777 CIDR blocks)
- Sources: Googlebot (official API), Yandex, Meta/Facebook, TikTok, 27 sources from GoodBots
User-Agent Database:
- 602 patterns (503 substrings + 99 regex)
- Source: monperrus/crawler-user-agents
Installation
npm install @k0nf/crawler-detector
Usage
Basic Detection
const { isCrawler } = require('@k0nf/crawler-detector');
// Check both IP and user-agent
const isBot = isCrawler(
  '66.249.64.1',
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
);
console.log(isBot); // true
IP-Only Detection
const { isCrawlerByIP } = require('@k0nf/crawler-detector');
const isFromCrawlerIP = isCrawlerByIP('66.249.64.1');
console.log(isFromCrawlerIP); // true (Googlebot IP range)
User-Agent-Only Detection
const { isCrawlerByUA } = require('@k0nf/crawler-detector');
const hasCrawlerUA = isCrawlerByUA('Mozilla/5.0 (compatible; Googlebot/2.1)');
console.log(hasCrawlerUA); // true
IPv6 Support
const { isCrawlerByIP } = require('@k0nf/crawler-detector');
// Detect Googlebot IPv6
const isGooglebot = isCrawlerByIP('2001:4860:4801:10::1');
console.log(isGooglebot); // true
// Detect Yandex IPv6
const isYandex = isCrawlerByIP('2a02:6b8::1');
console.log(isYandex); // true
How It Works
The library uses two optimized local databases:
1. User-Agent Patterns
Patterns are sourced from monperrus/crawler-user-agents (raw JSON).
During build time, patterns are:
- Downloaded from the GitHub repository
- Classified into substrings (fast path) or regex patterns (slower path; see the sketch after this list)
- Normalized and lowercased for case-insensitive matching
- Deduplicated and optimized
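As a sketch, the substring/regex split can be decided by scanning for regex metacharacters; the build script's actual heuristic may differ:
// Illustrative classifier: patterns containing regex metacharacters take
// the regex path, everything else takes the substring fast path.
function classify(pattern) {
  return /[\\^$.*+?()[\]{}|]/.test(pattern) ? 'regex' : 'substring';
}
console.log(classify('googlebot')); // 'substring'
console.log(classify('bot|crawl')); // 'regex'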
Detection order (fastest first, early-exit; see the sketch below):
- Substring matching - simple .includes() checks
- Regex pattern matching - pre-compiled patterns
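A condensed sketch of this two-tier check, with illustrative sample patterns standing in for the shipped database:
const substrings = ['googlebot', 'bingbot', 'ahrefsbot']; // sample data
const regexes = [/crawl(er|ing)/, /spider/];              // sample data

function matchUA(userAgent) {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase(); // patterns are pre-lowercased
  // Fast path: plain substring checks, early exit on first hit
  for (const s of substrings) {
    if (ua.includes(s)) return true;
  }
  // Slow path: pre-compiled regex patterns
  return regexes.some((re) => re.test(ua));
}

console.log(matchUA('Mozilla/5.0 (compatible; Googlebot/2.1)')); // true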
2. IP Database
IP ranges are manually curated from known crawler sources and stored as:
- Exact IPs - Direct Set lookup (O(1))
- CIDR ranges - Converted to integer pairs and sorted for binary search (O(log n))
Detection order (fastest first, early-exit; see the sketch below):
- Exact IP match
- CIDR range binary search
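A condensed sketch of that lookup path, with one illustrative range in place of the built database (the real library stores IPv4 and IPv6 separately; BigInt lets the same comparison logic cover both address sizes):
const exactIPs = new Set(['66.249.64.1']); // sample data
// CIDR ranges are pre-converted to sorted [start, end] integer pairs,
// e.g. 66.249.64.0/27 -> [0x42f94000n, 0x42f9401fn]
const ranges = [[0x42f94000n, 0x42f9401fn]];

function ipv4ToBigInt(ip) {
  return ip.split('.').reduce((acc, octet) => (acc << 8n) + BigInt(octet), 0n);
}

function matchIP(ip) {
  if (exactIPs.has(ip)) return true; // O(1) exact match
  const n = ipv4ToBigInt(ip);
  let lo = 0, hi = ranges.length - 1;
  while (lo <= hi) { // O(log n) binary search over sorted ranges
    const mid = (lo + hi) >> 1;
    if (n < ranges[mid][0]) hi = mid - 1;
    else if (n > ranges[mid][1]) lo = mid + 1;
    else return true;
  }
  return false;
}

console.log(matchIP('66.249.64.20')); // true (inside the /27)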
Performance
Expected performance characteristics:
- User-Agent Detection: < 0.1ms average (substring match)
- IP Detection: < 0.01ms average (exact match) or < 0.1ms (CIDR range)
- Combined Detection: < 1ms total
Actual test results (10,000 iterations):
Ran 10,000 detections in 110ms
Average: 0.011ms per detection
✅ Well under 1ms target
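To reproduce a comparable measurement locally, a minimal benchmark sketch (absolute numbers will vary by machine):
const { isCrawler } = require('@k0nf/crawler-detector');

const ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)';
const start = process.hrtime.bigint();
for (let i = 0; i < 10000; i++) {
  isCrawler('66.249.64.1', ua); // one combined IP + UA detection
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;
console.log(`Ran 10,000 detections in ${elapsedMs.toFixed(0)}ms`);
console.log(`Average: ${(elapsedMs / 10000).toFixed(3)}ms per detection`);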
Updating Databases
Rebuild All Databases
npm run build
This will:
- Fetch the latest user-agent patterns from GitHub
- Build the optimized user-agent database
- Build the IP database from seed file
- Validate all generated data
Update Only User-Agent Patterns
npm run build:ua
Update Only IP Database
npm run build:ip
Validate Databases
npm run validate
Adding Custom Crawler IPs
Edit scripts/seed-crawler-ips.txt and add IPs or CIDR ranges (one per line):
# Custom crawler IPs
203.0.113.5
198.51.100.0/24
Then rebuild:
npm run build:ip
Testing
Run the test suite:
npm test
Tests cover (see the sketch after this list):
- ✅ Known crawler detection (Googlebot, Bingbot, etc.)
- ✅ Regular browser exclusion (Chrome, Firefox, Safari)
- ✅ Edge cases (null, undefined, empty strings)
- ✅ IP conversion and binary search logic
- ✅ Performance benchmarks
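A few standalone assertions in the same spirit, using only the documented API (expected values follow the behavior described above):
const assert = require('assert');
const { isCrawler, isCrawlerByUA } = require('@k0nf/crawler-detector');

// Known crawler is detected, a regular browser is not
assert.strictEqual(isCrawlerByUA('Mozilla/5.0 (compatible; Googlebot/2.1)'), true);
assert.strictEqual(isCrawlerByUA('Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/120.0'), false);
// Edge case: nothing matches when no inputs are given
assert.strictEqual(isCrawler(null, null), false);
console.log('all assertions passed');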
Architecture
crawler-detector/
├── index.js                     # Main API entry point
├── lib/
│   ├── ip-detector.js           # IP detection logic
│   └── ua-detector.js           # User-agent detection logic
├── data/
│   ├── crawler-ips.json         # Built IP database
│   └── crawler-ua-patterns.json # Built UA patterns database
├── scripts/
│   ├── fetch-ua-patterns.js     # Fetch patterns from GitHub
│   ├── build-ip-database.js     # Build IP database
│   ├── validate-data.js         # Validate databases
│   ├── build-all.js             # Build all databases
│   └── seed-crawler-ips.txt     # Source IP list
└── test/
    └── test.js                  # Test suite
Detection Flow
isCrawler(ip, userAgent)
│
├─> Check IP (if provided)
│   ├─> Exact match? → return true
│   └─> In CIDR range? → return true
│
└─> Check User-Agent (if provided)
    ├─> Substring match? → return true
    └─> Regex match? → return true

return false
Data Sources
- User-Agent Patterns: monperrus/crawler-user-agents
  - Direct JSON source: crawler-user-agents.json
  - 602 patterns maintained by the community
- IP Ranges (IPv4 & IPv6):
  - Googlebot: Official IPv4/IPv6 ranges from the Google API (142 IPv6 /64 blocks automatically fetched)
  - 27 Bot Sources: GoodBots Repository (includes Bingbot, Yandex, Facebook, Twitter, Telegram, Ahrefs, and more)
  - Manual Additions:
    - Yandex: 15 IPv4 ranges + 2a02:6b8::/29 IPv6
    - Meta/Facebook: 7 IPv4 ranges (AS32934)
    - TikTok: 67 IPv4 ranges (AS138699, AS137775, AS396986)
API Reference
isCrawler(ip, userAgent)
Detects if either the IP or user-agent belongs to a known crawler.
Parameters:
- ip (string|null) - IP address to check
- userAgent (string|null) - User-Agent string to check
Returns: boolean - true if crawler detected
isCrawlerByIP(ip)
Detects if the IP address belongs to a known crawler.
Parameters:
- ip (string) - IP address to check
Returns: boolean - true if crawler IP detected
isCrawlerByUA(userAgent)
Detects if the user-agent belongs to a known crawler.
Parameters:
- userAgent (string) - User-Agent string to check
Returns: boolean - true if crawler user-agent detected
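As a usage illustration, the detector slots naturally into server middleware. A minimal Express sketch (Express itself and the 403 policy are assumptions, not part of this package):
const express = require('express');
const { isCrawler } = require('@k0nf/crawler-detector');

const app = express();

// Reject known crawlers before requests reach the route handlers
app.use((req, res, next) => {
  if (isCrawler(req.ip, req.get('user-agent'))) {
    return res.status(403).send('Crawlers are not allowed');
  }
  next();
});

app.get('/', (req, res) => res.send('Hello, human!'));
app.listen(3000);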
License
MIT
Contributing
Contributions welcome! Please:
- Add tests for new functionality
- Update documentation
- Run npm test before submitting
Credits
Special thanks to monperrus/crawler-user-agents for maintaining the comprehensive crawler user-agent database.
