# Crawler Detector

`@k0nf/crawler-detector` · v1.3.0
⚡ High-speed crawler detection library for JavaScript using optimized local databases. Detects web crawlers and bots using IP addresses and user-agent patterns with microsecond-level performance.
## Features
- 🚀 Blazing Fast - Optimized detection with early-exit logic (< 0.02ms per check)
- 🌐 IPv4 & IPv6 Support - Full support for both IPv4 and IPv6 addresses using BigInt
- 📦 Zero Runtime Dependencies - No external API calls, all data is local
- 🎯 Dual Detection - Identifies crawlers by IP address and user-agent patterns
- 🔄 Auto-Updated Patterns - Easy rebuild from latest crawler sources
- 📊 Binary Search - Efficient CIDR range lookup using integer comparisons
- 🔌 Framework Middleware - Built-in support for Express, Next.js, Remix, Koa, Fastify, Hapi
- ✅ Well Tested - Comprehensive test suite with 94 passing tests
## Coverage

**IP Database:**
- IPv4: 20.7 million addresses (11,057 exact IPs + 749 CIDR ranges)
- IPv6: 2.6 sextillion addresses (475 exact IPs + 777 CIDR blocks)
- Sources: Googlebot (official API), Yandex, Meta/Facebook, TikTok, plus 27 sources from GoodBots

**User-Agent Database:**
- 602 patterns (503 substrings + 99 regex)
- Source: monperrus/crawler-user-agents
## Installation

```bash
npm install @k0nf/crawler-detector
```

## Usage
### Basic Detection
```js
const { isCrawler } = require('@k0nf/crawler-detector');

// Check both IP and user-agent
const isBot = isCrawler(
  '66.249.64.1',
  'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'
);

console.log(isBot); // true
```

### IP-Only Detection
```js
const { isCrawlerByIP } = require('@k0nf/crawler-detector');

const isFromCrawlerIP = isCrawlerByIP('66.249.64.1');
console.log(isFromCrawlerIP); // true (Googlebot IP range)
```

### User-Agent-Only Detection
```js
const { isCrawlerByUA } = require('@k0nf/crawler-detector');

const hasCrawlerUA = isCrawlerByUA('Mozilla/5.0 (compatible; Googlebot/2.1)');
console.log(hasCrawlerUA); // true
```

### IPv6 Support
```js
const { isCrawlerByIP } = require('@k0nf/crawler-detector');

// Detect Googlebot IPv6
const isGooglebot = isCrawlerByIP('2001:4860:4801:10::1');
console.log(isGooglebot); // true

// Detect Yandex IPv6
const isYandex = isCrawlerByIP('2a02:6b8::1');
console.log(isYandex); // true
```

## Framework Middleware
Built-in middleware is available for popular Node.js frameworks. Every middleware accepts these options:
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| `block` | `boolean` | `false` | Block crawlers with an HTTP error |
| `blockStatusCode` | `number` | `403` | Status code for blocked requests |
| `blockMessage` | `string` | `'Forbidden'` | Response body for blocked requests |
| `onCrawlerDetected` | `function` | `null` | Custom callback when a crawler is detected |
| `ipHeaders` | `string[]` | `['x-forwarded-for', ...]` | Headers to check for the client IP |
| `ipOnly` | `boolean` | `false` | Only check the IP, skip the user-agent |
| `uaOnly` | `boolean` | `false` | Only check the user-agent, skip the IP |
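To make the option semantics concrete, here is a rough sketch of how such a blocking middleware could be wired up internally. This is illustrative only, not the package's actual source; `makeMiddleware` and `detect` are hypothetical names, and `detect` stands in for any `(ip, userAgent) => boolean` check:

```js
// Sketch: how block / blockStatusCode / blockMessage / ipHeaders could interact.
// `detect` is any (ip, userAgent) => boolean function.
function makeMiddleware(detect, opts = {}) {
  const {
    block = false,
    blockStatusCode = 403,
    blockMessage = 'Forbidden',
    ipHeaders = ['x-forwarded-for'],
  } = opts;

  return (req, res, next) => {
    // Pick the client IP from the first configured header that is present
    const headerName = ipHeaders.find((h) => req.headers[h]);
    const ip = headerName
      ? req.headers[headerName].split(',')[0].trim()
      : req.socket && req.socket.remoteAddress;

    req.isCrawler = detect(ip, req.headers['user-agent']);

    if (req.isCrawler && block) {
      res.statusCode = blockStatusCode;
      return res.end(blockMessage);
    }
    next();
  };
}
```

Note the order: the request is always tagged first, and blocking only short-circuits the chain when `block` is enabled.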
### Express / Connect
```js
const express = require('express');
const crawlerDetector = require('@k0nf/crawler-detector/middleware/express');

const app = express();

// Tag all requests with req.isCrawler (boolean)
app.use(crawlerDetector());

// Block crawlers from specific routes
app.use('/api', crawlerDetector({ block: true }));

// Custom handling
app.use(crawlerDetector({
  onCrawlerDetected: (req, res, next) => {
    console.log(`Bot detected: ${req.headers['user-agent']}`);
    next(); // continue processing
  }
}));

app.get('/', (req, res) => {
  if (req.isCrawler) {
    res.send('Hello bot!');
  } else {
    res.send('Hello human!');
  }
});
```

### Next.js
API Routes (Pages Router):
```js
// pages/api/hello.js
const { withCrawlerDetection } = require('@k0nf/crawler-detector/middleware/nextjs');

function handler(req, res) {
  res.json({ isCrawler: req.isCrawler });
}

module.exports = withCrawlerDetection(handler);

// Or block crawlers:
// module.exports = withCrawlerDetection(handler, { block: true });
```

getServerSideProps:
```js
// pages/index.js
import { isCrawlerRequest } from '@k0nf/crawler-detector/middleware/nextjs';

export async function getServerSideProps(context) {
  const isBot = isCrawlerRequest(context);
  return { props: { isBot } };
}
```

App Router Route Handlers:
```js
// app/api/hello/route.js
import { isCrawlerAppRoute } from '@k0nf/crawler-detector/middleware/nextjs';

export const runtime = 'nodejs'; // Required - not compatible with the Edge runtime

export async function GET(request) {
  const isBot = isCrawlerAppRoute(request);
  return Response.json({ isBot });
}
```

Note: requires the Node.js runtime. Not compatible with the Edge runtime, since the detection databases are loaded from the filesystem.
### Remix / React Router v7
In loaders and actions:
```js
// app/routes/index.tsx
import { json } from '@remix-run/node';
import { isCrawlerRequest } from '@k0nf/crawler-detector/middleware/remix';

export async function loader({ request }) {
  const isBot = isCrawlerRequest(request);
  if (isBot) {
    return json({ content: 'SEO-optimized content', isBot: true });
  }
  return json({ content: 'Full interactive content', isBot: false });
}
```

Block crawlers with middleware:
```js
import { createCrawlerMiddleware } from '@k0nf/crawler-detector/middleware/remix';

const crawlerMiddleware = createCrawlerMiddleware({ block: true });

export function middleware(request) {
  const blocked = crawlerMiddleware(request);
  if (blocked) return blocked;
}
```

### Koa
```js
const Koa = require('koa');
const crawlerDetector = require('@k0nf/crawler-detector/middleware/koa');

const app = new Koa();

// Tag all requests - sets ctx.state.isCrawler
app.use(crawlerDetector());

// Block crawlers
app.use(crawlerDetector({ block: true }));

app.use(async (ctx) => {
  if (ctx.state.isCrawler) {
    ctx.body = 'Hello bot!';
  } else {
    ctx.body = 'Hello human!';
  }
});
```

### Fastify
```js
const fastify = require('fastify')();
const crawlerDetectorPlugin = require('@k0nf/crawler-detector/middleware/fastify');

// Register plugin - decorates request.isCrawler
fastify.register(crawlerDetectorPlugin);

// With options
fastify.register(crawlerDetectorPlugin, { block: true });

fastify.get('/', (request, reply) => {
  if (request.isCrawler) {
    reply.send('Hello bot!');
  } else {
    reply.send('Hello human!');
  }
});
```

### Hapi
```js
const Hapi = require('@hapi/hapi');
const crawlerDetectorPlugin = require('@k0nf/crawler-detector/middleware/hapi');

const init = async () => {
  const server = Hapi.server({ port: 3000 });

  await server.register({
    plugin: crawlerDetectorPlugin,
    options: { block: false }
  });

  server.route({
    method: 'GET',
    path: '/',
    handler: (request, h) => {
      if (request.plugins.crawlerDetector.isCrawler) {
        return 'Hello bot!';
      }
      return 'Hello human!';
    }
  });

  await server.start();
};

init();
```

### Raw Node.js HTTP
```js
const http = require('http');
const withCrawlerDetection = require('@k0nf/crawler-detector/middleware/http');

const server = http.createServer(
  withCrawlerDetection((req, res) => {
    if (req.isCrawler) {
      res.end('Hello bot!');
    } else {
      res.end('Hello human!');
    }
  })
);

// Or block crawlers (pass your own request handler):
// const server = http.createServer(
//   withCrawlerDetection(handler, { block: true })
// );
```

## How It Works
The library uses two optimized local databases:
### 1. User-Agent Patterns
Patterns are sourced from monperrus/crawler-user-agents (raw JSON).
During build time, patterns are:
- Downloaded from the GitHub repository
- Classified into substrings (fast path) or regex patterns (slower path)
- Normalized and lowercased for case-insensitive matching
- Deduplicated and optimized
Detection order (fastest first, early-exit):

1. Substring matching - simple `.includes()` checks
2. Regex pattern matching - pre-compiled patterns
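This two-tier check can be sketched as follows (the pattern lists here are illustrative stand-ins, not the library's generated database):

```js
// Illustrative pattern sets; the real database is generated at build time
const substrings = ['googlebot', 'bingbot', 'slurp'];
const regexes = [/spider(\/|-)?\d/i, /crawl(er|ing)/i];

function matchesCrawlerUA(userAgent) {
  if (!userAgent) return false;
  const ua = userAgent.toLowerCase();

  // Fast path: simple substring checks, early exit on first hit
  for (const s of substrings) {
    if (ua.includes(s)) return true;
  }

  // Slower path: pre-compiled regex patterns
  for (const re of regexes) {
    if (re.test(userAgent)) return true;
  }
  return false;
}

console.log(matchesCrawlerUA('Mozilla/5.0 (compatible; Googlebot/2.1)')); // true
console.log(matchesCrawlerUA('Mozilla/5.0 (Windows NT 10.0) Chrome/120.0')); // false
```

Because the vast majority of crawler user-agents match a plain substring, the regex tier is rarely reached in practice.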
### 2. IP Database
IP ranges are manually curated from known crawler sources and stored as:
- **Exact IPs** - direct `Set` lookup (O(1))
- **CIDR ranges** - converted to integer pairs and sorted for binary search (O(log n))
Detection order (fastest first, early-exit):
- Exact IP match
- CIDR range binary search
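The CIDR lookup can be sketched like this for IPv4 (the ranges shown are illustrative; the real database is prebuilt, and IPv6 uses the same idea with `BigInt` instead of 32-bit integers):

```js
// Convert dotted-quad IPv4 to an unsigned 32-bit integer
function ipv4ToInt(ip) {
  return ip.split('.').reduce((acc, octet) => (acc * 256) + Number(octet), 0);
}

// CIDR ranges pre-converted to sorted, non-overlapping [start, end] pairs
const ranges = [
  [ipv4ToInt('66.249.64.0'), ipv4ToInt('66.249.79.255')],  // illustrative Googlebot block
  [ipv4ToInt('157.55.39.0'), ipv4ToInt('157.55.39.255')],  // illustrative Bingbot block
].sort((a, b) => a[0] - b[0]);

// Binary search over the sorted range list
function inCrawlerRange(ip) {
  const n = ipv4ToInt(ip);
  let lo = 0, hi = ranges.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    const [start, end] = ranges[mid];
    if (n < start) hi = mid - 1;
    else if (n > end) lo = mid + 1;
    else return true;
  }
  return false;
}

console.log(inCrawlerRange('66.249.64.1')); // true
console.log(inCrawlerRange('203.0.113.9')); // false
```

Since the ranges are sorted and non-overlapping, each lookup touches only O(log n) entries regardless of database size.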
## Performance
Expected performance characteristics:
- User-Agent Detection: < 0.1ms average (substring match)
- IP Detection: < 0.01ms average (exact match) or < 0.1ms (CIDR range)
- Combined Detection: < 1ms total
Actual test results (10,000 iterations):

```
Ran 10,000 detections in 110ms
Average: 0.011ms per detection
✅ Well under 1ms target
```

## Updating Databases
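You can reproduce the shape of this benchmark with a self-contained timing loop. Note this times a toy substring check rather than the library itself, so absolute numbers will differ:

```js
// Toy benchmark: time 10,000 substring-based UA checks
const patterns = ['googlebot', 'bingbot', 'yandex'];
const ua = 'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)'.toLowerCase();

const iterations = 10000;
const start = process.hrtime.bigint();
let hits = 0;
for (let i = 0; i < iterations; i++) {
  if (patterns.some((p) => ua.includes(p))) hits++;
}
const elapsedMs = Number(process.hrtime.bigint() - start) / 1e6;

console.log(`Ran ${iterations} checks in ${elapsedMs.toFixed(1)}ms`);
console.log(`Average: ${(elapsedMs / iterations).toFixed(4)}ms per check`);
```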
### Rebuild All Databases

```bash
npm run build
```

This will:
- Fetch the latest user-agent patterns from GitHub
- Build the optimized user-agent database
- Build the IP database from seed file
- Validate all generated data
### Update Only User-Agent Patterns

```bash
npm run build:ua
```

### Update Only IP Database

```bash
npm run build:ip
```

### Validate Databases

```bash
npm run validate
```

## Adding Custom Crawler IPs
Edit `scripts/seed-crawler-ips.txt` and add IPs or CIDR ranges (one per line):

```
# Custom crawler IPs
203.0.113.5
198.51.100.0/24
```

Then rebuild:

```bash
npm run build:ip
```

## Testing
Run the test suite:

```bash
npm test
```

Tests cover:
- ✅ Known crawler detection (Googlebot, Bingbot, etc.)
- ✅ Regular browser exclusion (Chrome, Firefox, Safari)
- ✅ Edge cases (null, undefined, empty strings)
- ✅ IP conversion and binary search logic
- ✅ Performance benchmarks
## Architecture
```
crawler-detector/
├── index.js                      # Main API entry point
├── lib/
│   ├── ip-detector.js            # IP detection logic
│   └── ua-detector.js            # User-agent detection logic
├── data/
│   ├── crawler-ips.json          # Built IP database
│   └── crawler-ua-patterns.json  # Built UA patterns database
├── scripts/
│   ├── fetch-ua-patterns.js      # Fetch patterns from GitHub
│   ├── build-ip-database.js      # Build IP database
│   ├── validate-data.js          # Validate databases
│   ├── build-all.js              # Build all databases
│   └── seed-crawler-ips.txt      # Source IP list
└── test/
    └── test.js                   # Test suite
```

### Detection Flow
```
isCrawler(ip, userAgent)
 │
 ├─> Check IP (if provided)
 │    ├─> Exact match?     → return true
 │    └─> In CIDR range?   → return true
 │
 └─> Check User-Agent (if provided)
      ├─> Substring match? → return true
      └─> Regex match?     → return true

 return false
```

## Data Sources
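The flow above is effectively a short-circuiting OR over the two detectors. A minimal sketch, with stub checkers standing in for the real IP and UA detectors:

```js
// Stub detectors standing in for the library's real IP and UA checks
const isCrawlerByIP = (ip) => ip === '66.249.64.1';
const isCrawlerByUA = (ua) => /googlebot/i.test(ua || '');

function isCrawler(ip, userAgent) {
  // Early exit: cheapest check first; the UA is never inspected on an IP hit
  if (ip && isCrawlerByIP(ip)) return true;
  if (userAgent && isCrawlerByUA(userAgent)) return true;
  return false;
}

console.log(isCrawler('66.249.64.1', null));          // true (IP hit)
console.log(isCrawler(null, 'Googlebot/2.1'));        // true (UA hit)
console.log(isCrawler('203.0.113.9', 'Chrome/120'));  // false
```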
- **User-Agent Patterns**: monperrus/crawler-user-agents
  - Direct JSON source: crawler-user-agents.json
  - 602 patterns maintained by the community
- **IP Ranges (IPv4 & IPv6)**:
  - Googlebot: official IPv4/IPv6 ranges from the Google API (142 IPv6 /64 blocks automatically fetched)
  - 27 bot sources: GoodBots repository (includes Bingbot, Yandex, Facebook, Twitter, Telegram, Ahrefs, and more)
  - Manual additions:
    - Yandex: 15 IPv4 ranges + `2a02:6b8::/29` (IPv6)
    - Meta/Facebook: 7 IPv4 ranges (AS32934)
    - TikTok: 67 IPv4 ranges (AS138699, AS137775, AS396986)
## API Reference
### isCrawler(ip, userAgent)

Detects whether either the IP or the user-agent belongs to a known crawler.

Parameters:
- `ip` (string|null) - IP address to check
- `userAgent` (string|null) - User-Agent string to check

Returns: `boolean` - `true` if a crawler is detected
### isCrawlerByIP(ip)

Detects whether the IP address belongs to a known crawler.

Parameters:
- `ip` (string) - IP address to check

Returns: `boolean` - `true` if a crawler IP is detected
### isCrawlerByUA(userAgent)

Detects whether the user-agent belongs to a known crawler.

Parameters:
- `userAgent` (string) - User-Agent string to check

Returns: `boolean` - `true` if a crawler user-agent is detected
## License
MIT
## Contributing
Contributions welcome! Please:

- Add tests for new functionality
- Update documentation
- Run `npm test` before submitting
## Credits
Special thanks to monperrus/crawler-user-agents for maintaining the comprehensive crawler user-agent database.
