# agent-traffic-classifier (v0.2.0)
Classify web server traffic into humans, bots, AI agents, and programmatic clients. Built for analyzing access logs and HTTP header signals to understand how AI agents interact with your site.
Status: Early development (0.x). This library is in early iteration and makes no API stability guarantees; expect breaking changes between minor versions.
## What it does
Given access log entries (and optionally HTTP header signals), the library:
- Classifies each request by user-agent into a category: human, AI crawler, AI assistant, AI search, coding agent, search crawler, SEO bot, monitoring, social preview, feed reader, programmatic client, or unknown
- Detects AI agents that use standard browser user-agents (Claude Code, Cursor, Kiro, Gemini CLI) via HTTP header heuristics and signal attribution
- Identifies agent frameworks by Accept header patterns, missing browser security headers, and conversation-tracking headers, even when the specific agent is unknown
- Attributes country-level intelligence to unidentified agents using IP ranges from Regional Internet Registries, enabling suspected identification (e.g., "Kimi / Doubao / DeepSeek (suspected)" for Chinese AI assistants)
- Clusters sessions by correlating signal data with access logs to reclassify traffic that would otherwise look human
- Cross-references programmatic traffic with signal data to upgrade HTTP client requests (httpx, undici, etc.) to agent when they share IPs with known agent signals
- Detects proxy-based agents (like Cursor) via a duplicate-request heuristic: same path + UA from different IPs within a short window
- Aggregates classified entries into daily summary documents with category breakdowns, top paths, referrers, bot/agent/programmatic stats, and status codes
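The duplicate-request heuristic in the next-to-last point can be sketched in a few lines: flag pairs of requests that share a path and user-agent but arrive from different IPs within a short window. This is an illustrative simplification (the `Req` shape and function name here are hypothetical); the library's `detectDuplicateRequestAgents` additionally consults session profiles to suppress false positives.

```typescript
interface Req { ip: string; path: string; userAgent: string; timestamp: number; }

// Simplified sketch: two requests for the same path with the same UA
// from different IPs within `windowSeconds` suggest a proxy-based agent.
function findDuplicatePairs(reqs: Req[], windowSeconds = 2): Array<[Req, Req]> {
  const pairs: Array<[Req, Req]> = [];
  const sorted = [...reqs].sort((a, b) => a.timestamp - b.timestamp);
  for (let i = 0; i < sorted.length; i++) {
    for (let j = i + 1; j < sorted.length; j++) {
      const a = sorted[i], b = sorted[j];
      if (b.timestamp - a.timestamp > windowSeconds) break; // outside the window
      if (a.ip !== b.ip && a.path === b.path && a.userAgent === b.userAgent) {
        pairs.push([a, b]);
      }
    }
  }
  return pairs;
}
```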
## Install

```bash
npm install agent-traffic-classifier
```

## Quick start

### Access logs only
The simplest usage: parse Apache access logs, classify, and aggregate.
```ts
import {
  parseLine,
  createClassifier,
  createFilter,
  reclassifyEntries,
  aggregate,
} from 'agent-traffic-classifier';
import type { LogEntry } from 'agent-traffic-classifier';

const classify = createClassifier();
const shouldSkip = createFilter();

// Parse log lines into entries
const lines = [
  '1.1.1.1 - - [04/Apr/2026:10:00:00 -0700] "GET /about/ HTTP/1.1" 200 5000 "https://google.com" "Mozilla/5.0 Chrome/120"',
  '2.2.2.2 - - [04/Apr/2026:10:00:01 -0700] "GET /about/ HTTP/1.1" 200 5000 "-" "GPTBot/1.0"',
  '3.3.3.3 - - [04/Apr/2026:10:00:02 -0700] "GET / HTTP/1.1" 200 3000 "-" "curl/7.68.0"',
];
const entries: LogEntry[] = lines
  .map((line) => parseLine(line))
  .filter((e): e is LogEntry => e !== null);

// Classify (no signal data, so pass null for seeds)
const classified = reclassifyEntries(entries, null, classify);

// Aggregate into daily summaries
const docs = aggregate(classified, {
  domain: 'example.com',
  tzOffsetMinutes: -420, // PDT
  shouldSkip,
});
// docs[0].summary.byCategory => { human: {...}, 'ai-crawler': {...}, programmatic: {...} }
```

### With signal data and IP intelligence
For deeper agent detection, capture HTTP headers from requests that exhibit agent-like behavior (content negotiation, llms.txt requests, etc.) and feed them as signal entries. Combined with IP intelligence, this lets the library identify agents that use standard browser user-agents and attribute unidentified traffic to suspected services.
```ts
import {
  parseLine,
  createClassifier,
  createFilter,
  createSignalClassifier,
  createCountryLookup,
  createCloudProviderLookup,
  buildSessionProfiles,
  buildAgentSeeds,
  reclassifyEntries,
  detectDuplicateRequestAgents,
  crossReferenceSignalIps,
  aggregate,
} from 'agent-traffic-classifier';
import type { LogEntry, SignalEntry } from 'agent-traffic-classifier';

const classify = createClassifier();
const shouldSkip = createFilter();

// Initialize IP intelligence (async, fetches public range data once)
const countryLookup = await createCountryLookup(['CN']); // Only fetch ranges for countries you need
const cloudLookup = await createCloudProviderLookup(); // Google, AWS, Cloudflare ranges

const { classifySignalEntry, getSignalSummary } = createSignalClassifier({
  ipLookup: (ip) => {
    const info: Record<string, string> = {};
    const country = countryLookup(ip);
    if (country) info.country = country;
    const provider = cloudLookup(ip);
    if (provider) info.cloudProvider = provider;
    return info;
  },
});

// Parse access log entries
const entries: LogEntry[] = logLines
  .map((line) => parseLine(line))
  .filter((e): e is LogEntry => e !== null);

// Signal entries from your capture mechanism (middleware, edge function, etc.)
const signalEntries: SignalEntry[] = [
  {
    ip: '5.5.5.5',
    timestamp: 1743789604, // Unix epoch seconds
    domain: 'example.com',
    headers: { 'User-Agent': 'Claude-User/1.0' },
    trigger: 'content-negotiation',
  },
];

// Build agent seeds from signals and reclassify access log entries
const seeds = buildAgentSeeds(signalEntries, classifySignalEntry);
const domainSeeds = seeds.get('example.com') ?? null;
const classified = reclassifyEntries(entries, domainSeeds, classify);

// Build per-IP session profiles (static assets + self-site referrers)
// to suppress false-positive "Cursor (suspected)" during traffic spikes
const sessionProfiles = buildSessionProfiles(classified, 'example.com');

// Detect proxy-based agents (e.g., Cursor's duplicate-request pattern)
const withDuplicates = detectDuplicateRequestAgents(classified, { sessionProfiles });

// Upgrade programmatic clients that share IPs with known agent signals
const withCrossRef = crossReferenceSignalIps(
  withDuplicates,
  signalEntries,
  'example.com',
  classifySignalEntry,
);

// Aggregate with signal summary
const docs = aggregate(withCrossRef, {
  domain: 'example.com',
  tzOffsetMinutes: -420,
  shouldSkip,
  signalEntries,
  classifySignalEntry,
  getSignalSummary,
});
```

## Custom log formats
The library operates on format-agnostic LogEntry and SignalEntry interfaces. The Apache parser is a convenience adapter; you can construct entries from any source:
```ts
import type { LogEntry, SignalEntry } from 'agent-traffic-classifier';

// From nginx JSON logs, CDN exports, middleware, etc.
const entry: LogEntry = {
  ip: '1.2.3.4',
  timestamp: 1743789600, // Unix epoch seconds (UTC)
  method: 'GET',
  path: '/about/',
  status: 200,
  size: 5000,
  referrer: null,
  userAgent: 'Mozilla/5.0 ...',
};
```

## Categories
Every request is classified into one of these categories:
| Category | Description |
| ---------------- | --------------------------------------------------------------------------------------------------- |
| human | Regular browser traffic |
| agent | AI coding agents (Claude Code, Claude Agent, Cursor, Kiro, GitHub Copilot, Gemini CLI, MCP clients) |
| ai-crawler | AI training data crawlers (GPTBot, ClaudeBot, etc.) |
| ai-assistant | AI assistants fetching live content (ChatGPT-User, GoogleAgent-URLContext) |
| ai-search | AI-powered search engines (PerplexityBot, OAI-SearchBot, Kagibot) |
| search-crawler | Traditional search engines (Googlebot, Bingbot) |
| seo-bot | SEO/marketing bots (AhrefsBot, SemrushBot) |
| monitoring | Uptime monitors (UptimeRobot, Pingdom) |
| social-preview | Link preview fetchers (Twitterbot, Slackbot, Mastodon, WhatsApp) |
| feed-reader | Feed readers and news apps (FreshRSS, Feedly, HackerNews app) |
| programmatic | HTTP clients (curl, axios, python-requests, httpx, trafilatura) |
| other-bot | Bots detected by isbot but not in the curated list |
| unknown | Empty or missing user-agent |
Classification priority: curated bot list > programmatic client heuristic > isbot fallback > human.
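The priority chain can be illustrated with a stripped-down sketch. Everything here is illustrative (a two-entry stand-in for the curated database, with the isbot step omitted to stay dependency-free); the real `createClassifier` is configurable and uses the full bot list.

```typescript
type Category = string;

// First-match-wins sketch of the priority ordering:
// curated bot list > programmatic client heuristic > (isbot fallback) > human
function classifySketch(userAgent: string): Category {
  const curated: Array<[RegExp, Category]> = [
    [/GPTBot/i, 'ai-crawler'],
    [/Googlebot/i, 'search-crawler'],
  ];
  for (const [pattern, category] of curated) {
    if (pattern.test(userAgent)) return category; // 1. curated bot list
  }
  if (/^(curl|axios|python-requests|httpx)\b/i.test(userAgent)) {
    return 'programmatic';                        // 2. programmatic heuristic
  }
  // 3. isbot fallback would go here (omitted in this sketch)
  if (userAgent.trim() === '') return 'unknown';
  return 'human';                                 // 4. default
}
```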
## Signal heuristics
When HTTP header signals are available, the library applies a chain of heuristics to identify agents that use standard browser user-agents. The chain is ordered by specificity (first match wins):
- Known agent UAs: Claude Code (`Claude-User`), Claude Agent (`Claude-Agent`), Gemini CLI (`Google-Gemini-CLI`), markdown.new
- Dev tool exclusion: curl and other known developer tools are excluded from agent classification
- Chrome 122 / macOS 14.7.2: Frozen browser fingerprint used by Chinese AI assistant services. With a CN country IP, returns "Kimi / Doubao / DeepSeek (suspected)"
- Cursor (Sentry Baggage): Definitively identifies Cursor via its Sentry org credentials leaked in the `Baggage` header
- Cursor (Traceparent): Generic Chrome UA with OpenTelemetry tracing headers, excluding VS Code
- Conversation tracking headers: `X-Conversation-Id` or `X-Conversation-Request-Id` are definitively agent headers
- `text/x-markdown` Accept: The unofficial markdown MIME type is only sent by purpose-built agents
- Accept header taxonomy: Known Accept preference patterns that identify agent frameworks (axios-pattern, text-first, Cursor, got-pattern, markdown variants)
- Missing browser headers: Chrome UA requesting markdown without `Sec-Ch-Ua` (a header real Chrome always sends)
- Trigger-based fallback: Requests with agent triggers (`content-negotiation`, `llms-txt`) but no heuristic match are classified as "unidentified"
All heuristics are exported individually so you can reorder, replace, or extend the chain.
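A first-match-wins chain is straightforward to compose yourself. The sketch below mirrors the `(entry, ipInfo) => result | null` heuristic shape shown in the configuration section; the type and function names are illustrative, not the library's internals.

```typescript
type HeuristicResult = { isAgent: boolean; name: string; company: string } | null;
type SignalLike = { headers?: Record<string, string> };
type Heuristic = (entry: SignalLike, ipInfo: Record<string, string>) => HeuristicResult;

// Run an ordered chain of heuristics; the first non-null result wins.
function runChain(
  heuristics: Heuristic[],
  entry: SignalLike,
  ipInfo: Record<string, string> = {},
): HeuristicResult {
  for (const h of heuristics) {
    const result = h(entry, ipInfo);
    if (result) return result; // first match wins
  }
  return null;
}

// Example heuristic: conversation-tracking headers are a definitive agent signal
const conversationHeader: Heuristic = (entry) =>
  entry.headers?.['X-Conversation-Id']
    ? { isAgent: true, name: 'Unidentified agent', company: 'Unknown' }
    : null;
```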
## IP intelligence
The library includes adapters for IP-to-country and IP-to-cloud-provider lookups. These are optional, async-init, sync-lookup: you call the async factory once at startup, and it returns a synchronous lookup function.
```ts
import {
  createCountryLookup,
  createCloudProviderLookup,
  createIpLookup,
  buildCidrIndex,
} from 'agent-traffic-classifier';

// Country lookup from RIR delegation data (fetches from APNIC, RIPE, etc.)
const countryLookup = await createCountryLookup(['CN', 'RU']);

// Cloud provider lookup (fetches published ranges from Google, AWS, Cloudflare)
const cloudLookup = await createCloudProviderLookup();

// Combined convenience factory
const ipLookup = await createIpLookup({
  countries: ['CN'],
  cloudProviders: true,
});

// Or build your own CIDR index for custom ranges
const customIndex = buildCidrIndex([
  { cidr: '10.0.0.0/8', tag: 'internal' },
  { cidr: '172.16.0.0/12', tag: 'internal' },
]);
const tag = customIndex('10.1.2.3'); // => 'internal'
```

The `IpLookup` interface (`(ip: string) => IpInfo`) can be implemented with any data source. The built-in adapters are convenience layers; pass your own function if you have a different IP intelligence source.
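Under the hood, IPv4 CIDR matching is plain bit arithmetic: convert the address to a 32-bit integer, mask off the host bits, and compare against the masked base address. A self-contained sketch of the idea (illustrative, not the library's actual implementation):

```typescript
// Convert a dotted-quad IPv4 address to an unsigned 32-bit integer.
function ipv4ToInt(ip: string): number {
  return ip.split('.').reduce((acc, octet) => (acc << 8) | parseInt(octet, 10), 0) >>> 0;
}

// True when `ip` falls inside the CIDR range (e.g. '10.0.0.0/8').
function matchesCidrSketch(ip: string, cidr: string): boolean {
  const [base, bitsStr] = cidr.split('/');
  const bits = parseInt(bitsStr, 10);
  const mask = bits === 0 ? 0 : (~0 << (32 - bits)) >>> 0; // network mask
  return ((ipv4ToInt(ip) & mask) >>> 0) === ((ipv4ToInt(base) & mask) >>> 0);
}
```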
## Configuration

Every module uses a factory function that accepts an options object. All options have sensible defaults. Each option replaces (not merges with) its default, so spread the default if you want to extend it.
### Classifier
```ts
import { createClassifier, defaultBotDb, DEFAULT_PROGRAMMATIC } from 'agent-traffic-classifier';

const classify = createClassifier({
  // Prepend custom bots (checked first due to priority ordering)
  bots: [
    { pattern: 'MyBot', name: 'MyBot', company: 'Me', category: 'ai-crawler' },
    ...defaultBotDb.bots,
  ],
  // Add custom programmatic client patterns
  programmaticClients: ['my-http-lib', ...DEFAULT_PROGRAMMATIC],
});
```

### Filter
The filter determines which requests are counted in aggregation. Requests matching skip patterns are excluded from all stats (category counts, top paths, referrers, status codes).
```ts
import { createFilter, DEFAULT_SKIP_SUBSTRINGS } from 'agent-traffic-classifier';

const shouldSkip = createFilter({
  // Extend default substrings with site-specific patterns
  skipSubstrings: [...DEFAULT_SKIP_SUBSTRINGS, '-staging-'],
  // Per-site paths (empty by default)
  siteSkipPaths: ['/old-section/'],
});
```

The default substrings cover common vulnerability scanner probes (wp-admin, phpinfo, .git/, .ssh/, .aws/, xmlrpc, _profiler, etc.) using substring matching, so they catch all prefix variants at once (e.g., /wp/wp-admin/, /blog/wp-admin/, /old/wp-admin/).
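The substring check itself is simple; a minimal sketch of the matching rule (illustrative, and deliberately partial: the real `createFilter` also applies extension, exact-path, and prefix rules):

```typescript
// Build a path filter that skips any path containing one of the substrings.
function makeSkipFilter(skipSubstrings: string[]): (path: string) => boolean {
  return (path) => skipSubstrings.some((s) => path.includes(s));
}

// Substring matching catches all prefix variants at once.
const shouldSkipPath = makeSkipFilter(['wp-admin', 'xmlrpc', '.git/']);
```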
### Signal classifier
```ts
import {
  createSignalClassifier,
  DEFAULT_KNOWN_AGENTS,
  DEFAULT_HEURISTICS,
} from 'agent-traffic-classifier';

const { classifySignalEntry, getSignalSummary } = createSignalClassifier({
  // Add a new known agent UA pattern
  knownAgents: [{ pattern: 'MyAgent', name: 'My Agent', company: 'Me' }, ...DEFAULT_KNOWN_AGENTS],
  // Add a custom header-based heuristic
  heuristics: [
    (entry, ipInfo) => {
      if (entry.headers?.['X-My-Agent']) {
        return { isAgent: true, name: 'MyAgent', company: 'Me' };
      }
      return null;
    },
    ...DEFAULT_HEURISTICS,
  ],
  // Optional IP intelligence for country/cloud attribution
  ipLookup: (ip) => ({ country: 'US' }),
});
```

### Session options
```ts
import { buildSessionProfiles, detectDuplicateRequestAgents } from 'agent-traffic-classifier';

// Build session profiles from the full (unfiltered) classified entries.
// This checks each IP for static asset requests and self-site referrers,
// which are strong indicators of real browser sessions.
const sessionProfiles = buildSessionProfiles(classified, 'example.com');

const result = detectDuplicateRequestAgents(classified, {
  windowSeconds: 120, // Signal seed matching window (default: 60)
  proxyWindowSeconds: 5, // Duplicate-request pairing window (default: 2)
  proxyAgent: {
    // Override the default Cursor identity
    name: 'Windsurf',
    company: 'Codeium',
    suspectedName: 'Windsurf (suspected)',
  },
  // Suppress false-positive "suspected" labels when both IPs in a
  // duplicate pair have browser-like sessions (static assets + self-referrers)
  sessionProfiles,
});
```

### Aggregation
```ts
import { aggregate } from 'agent-traffic-classifier';

const docs = aggregate(classified, {
  domain: 'example.com',
  tzOffsetMinutes: -420,
  shouldSkip, // Entries matching this filter are excluded from all stats
  topPathsLimit: 100, // Max top paths per day (default: 50)
  topItemPathsLimit: 20, // Max top paths per bot/agent (default: 10)
  topReferrersLimit: 50, // Max referrers per day (default: 30)
  topPathsSkipCategories: ['seo-bot', 'other-bot'], // Categories excluded from top paths
  normalizePath: (raw) => ({
    // Custom path normalization
    path: raw.toLowerCase(),
    utmSource: null,
  }),
});
```

## API
### Adapters
- `parseLine(line)` -- Parse an Apache Combined Log Format line into a `LogEntry`
- `readLogFiles(dir)` -- Read and parse `.log` and `.log.gz` files from a directory
- `parseApacheTs(raw)` -- Convert an Apache timestamp string to Unix epoch seconds
- `parseApacheTzOffset(raw)` -- Extract timezone offset in minutes from an Apache timestamp
- `parseSignalLog(dir)` -- Parse JSONL signal log files from a directory into `SignalEntry[]`
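For reference, the conversion `parseApacheTs` performs (Apache timestamp string to Unix epoch seconds) looks roughly like this. The sketch is illustrative, not the library's code:

```typescript
const MONTHS: Record<string, number> = {
  Jan: 0, Feb: 1, Mar: 2, Apr: 3, May: 4, Jun: 5,
  Jul: 6, Aug: 7, Sep: 8, Oct: 9, Nov: 10, Dec: 11,
};

// '04/Apr/2026:10:00:00 -0700' -> Unix epoch seconds (or null on bad input)
function apacheTsToEpoch(raw: string): number | null {
  const m = raw.match(/(\d{2})\/(\w{3})\/(\d{4}):(\d{2}):(\d{2}):(\d{2}) ([+-])(\d{2})(\d{2})/);
  if (!m) return null;
  const [, day, mon, year, hh, mm, ss, sign, offH, offM] = m;
  const offsetMinutes = (sign === '-' ? -1 : 1) * (Number(offH) * 60 + Number(offM));
  const utcMs = Date.UTC(Number(year), MONTHS[mon], Number(day), Number(hh), Number(mm), Number(ss));
  // The timestamp is local time; subtracting the offset recovers UTC.
  return utcMs / 1000 - offsetMinutes * 60;
}
```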
### Core
- `createClassifier(options?)` -- Returns `(userAgent: string) => ClassifyResult`
- `createFilter(options?)` -- Returns `(entry: LogEntry) => boolean` (true = skip)
- `createSignalClassifier(options?)` -- Returns `{ classifySignalEntry, getSignalSummary }`
### Sessions
- `buildSessionProfiles(entries, domain)` -- Build per-IP session profiles (static assets, self-site referrers) for false-positive suppression
- `buildAgentSeeds(signalEntries, classifySignalEntry)` -- Build agent seeds grouped by domain
- `reclassifyEntries(entries, domainSeeds, classifyFn, options?)` -- Reclassify access log entries using signal seeds
- `detectDuplicateRequestAgents(entries, options?)` -- Detect proxy-based agents via the duplicate-request heuristic
- `crossReferenceSignalIps(entries, signalEntries, domain, classifySignalEntry)` -- Upgrade programmatic entries to agent when their IP appears in signal data
### IP intelligence
- `createIpLookup(options?)` -- Combined country + cloud provider lookup factory
- `createCountryLookup(countries)` -- Country lookup from RIR delegation data
- `createCloudProviderLookup(options?)` -- Cloud provider lookup from published ranges
- `buildCidrIndex(entries)` -- Build a CIDR lookup index from custom ranges
- `parseIpv4(ip)`, `parseCidr(cidr)`, `matchesCidr(ip, cidr)` -- Low-level IPv4 utilities
### Aggregation
- `aggregate(entries, options)` -- Aggregate classified entries into `DaySummary[]`
- `normalizePath(rawPath)` -- Normalize URL paths (trailing slashes, utm_source extraction)
- `extractDateKey(epochSeconds, tzOffsetMinutes)` -- Convert epoch + offset to a `YYYY-MM-DD` date string
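The date bucketing behind daily summaries is an offset shift before formatting. A sketch of what `extractDateKey` computes (illustrative, not the library's implementation):

```typescript
// Shift the epoch by the timezone offset, then format the UTC date parts.
function extractDateKeySketch(epochSeconds: number, tzOffsetMinutes: number): string {
  const shifted = new Date((epochSeconds + tzOffsetMinutes * 60) * 1000);
  const y = shifted.getUTCFullYear();
  const m = String(shifted.getUTCMonth() + 1).padStart(2, '0');
  const d = String(shifted.getUTCDate()).padStart(2, '0');
  return `${y}-${m}-${d}`;
}
```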
## Defaults
All defaults are exported so you can extend them:
```ts
import {
  // Bot database
  defaultBotDb,
  // Programmatic clients
  DEFAULT_PROGRAMMATIC,
  DEFAULT_EXACT_PROGRAMMATIC,
  // Agent detection
  DEFAULT_KNOWN_AGENTS,
  DEFAULT_DEV_TOOLS,
  DEFAULT_AGENT_TRIGGERS,
  DEFAULT_HEURISTICS,
  SUSPECTED_AGENTS,
  DEFAULT_ACCEPT_TAXONOMY,
  // Individual heuristics
  sentryBaggageHeuristic,
  cursorHeuristic,
  chrome122Heuristic,
  conversationTrackingHeuristic,
  markdownMimeHeuristic,
  acceptTaxonomyHeuristic,
  missingBrowserHeadersHeuristic,
  // Skip patterns
  DEFAULT_SKIP_EXTENSIONS,
  DEFAULT_SKIP_PATHS,
  DEFAULT_SKIP_PREFIXES,
  DEFAULT_SKIP_SUBSTRINGS,
  // Session config
  DEFAULT_WINDOW_SECONDS,
  DEFAULT_PROXY_WINDOW_SECONDS,
  CURSOR_PROXY_AGENT,
  // Aggregation
  DEFAULT_TOP_PATHS_SKIP_CATEGORIES,
  DEFAULT_TOP_ITEM_PATHS_LIMIT,
  // Category constants
  CATEGORY_HUMAN,
  CATEGORY_AGENT,
  CATEGORY_FEED_READER,
  CATEGORY_PROGRAMMATIC,
  CATEGORY_OTHER_BOT,
  CATEGORY_UNKNOWN,
  AI_CATEGORY_PREFIX,
  UNIDENTIFIED_AGENT,
} from 'agent-traffic-classifier';
```

## License
MIT
