# @vectus/agentkit-next

Drop-in Next.js middleware helper that logs crawler traffic, protects `/_v` routes, and serves robots/sitemap responses.
Turn one middleware import into a crawler-aware safety net:

- log every visit under `/_v/` (or any prefixes you specify)
- rewrite/redirect crawlers to a safe page while letting humans through
- serve custom `robots.txt` + `sitemap.xml` directly from middleware (no route files)
This repo contains the package source plus a minimal loader snippet, which you can ignore.
## Installation

```bash
npm install @vectus/agentkit-next
```

The package expects `next` as a peer dependency (Next.js 13.4 or newer). It ships ESM output and type definitions.
## 5-line usage

```ts
// proxy.ts (Next.js 16+ literal config requirement)
import { vectusAgentkit } from '@vectus/agentkit-next';

export const config = {
  matcher: ['/((?!_next|assets|favicon\\.ico).*)'],
};

export default vectusAgentkit({
  remoteConfigUrl: 'https://api.example.com/agentkit/config.json',
  remoteConfigHeaders: { Authorization: `Bearer ${process.env.AGENTKIT_TOKEN}` },
  remoteSitemapUrl: 'https://api.example.com/sitemap.xml',
  remoteSitemapCrawlerOnly: true,
  devMode: true,
});
```

Next.js 16+ insists that `export const config` be a static object literal in the file, made of raw strings. Inline the matcher array (copy the string above or add your own patterns). The middleware itself can be the default export; `config` is optional if you want it to run for every route.
Your remote config response can look like:

```json
{
  "logEndpoint": "https://api.example.com/log-visit",
  "redirectMap": {
    "/_v/facts": "/",
    "/_v/pricing": "/pricing",
    "/_v/changelog": "/changelog"
  }
}
```

The middleware caches this for 60 seconds, uses the map to build `sitemap.xml`, and rewrites crawlers on a per-path basis while POSTing every visit to the configured `logEndpoint`.
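Given that `redirectMap`, the generated `/sitemap.xml` would plausibly look like the following (illustrative output; the exact markup may differ):

```xml
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://app.example.com/_v/facts</loc></url>
  <url><loc>https://app.example.com/_v/pricing</loc></url>
  <url><loc>https://app.example.com/_v/changelog</loc></url>
</urlset>
```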
Older projects (Next.js 13–15) can keep the helper if they prefer:

```ts
// middleware.ts (legacy helper pattern)
import { vectusAgentkit, vectusAgentkitConfig } from '@vectus/agentkit-next';

export const config = vectusAgentkitConfig();

export default vectusAgentkit({
  redirectTo: '/',
});
```

What you get automatically:
- Any request to `/robots.txt` returns a robots file that disallows your protected prefixes and links to `/sitemap.xml` (see the sample below).
- Any request to `/sitemap.xml` returns a tiny, cache-friendly sitemap containing the URLs you list.
- Middleware watches the prefixes you mark (default: `/_v/`). Whenever a crawler hits them, you optionally log the visit and rewrite or redirect to a safe page while tagging the original path as `?p=`.
- Real users keep the original experience (pass-through or 404, your choice).
- Optionally proxy `/sitemap.xml` directly from your backend for crawler requests so you can keep LLM-focused URLs up-to-date without redeploying.
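With the default `blockPrefixes`, the generated `robots.txt` would plausibly read (illustrative; the exact directives depend on your options):

```txt
User-agent: *
Disallow: /_v/
Sitemap: https://app.example.com/sitemap.xml
```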
With `logAllPaths` enabled (default), every page visit, protected or not, gets POSTed to your API endpoint with a rich payload (see below).
## Options

| Option | Default | Description |
| --- | --- | --- |
| `blockPrefixes` | `['/_v/']` | Path prefixes that trigger crawler handling + logging. |
| `redirectTo` | `'/'` | Destination path/URL for crawler visits. Absolute URLs are honored. |
| `originalPathParam` | `'p'` | Query param appended to the destination so you can read the source path. Set `false` to disable. |
| `respondWith` | `'rewrite'` | `'rewrite'` keeps the original URL; `'redirect'` issues a redirect (`redirectStatus` controls the code). |
| `redirectStatus` | `307` | HTTP status used when `respondWith === 'redirect'`. |
| `logEndpoint` | `undefined` | When set, middleware fires a fetch with visit metadata. |
| `logMethod` | `'POST'` | HTTP method for the log call. |
| `logPayload` | `undefined` | Extra object or `(ctx) => ({ ... })` merged into the log payload. |
| `llmSitemapPaths` | `['/']` | URLs included in the generated sitemap (fallback when no redirect map is supplied). |
| `llmRedirectMap` | derived from `llmSitemapPaths` + `redirectTo` | Dictionary where each sitemap path maps to a redirect target. Provide manually or via remote config. |
| `disallowInRobots` | same as `blockPrefixes` | Explicit list of `Disallow:` directives. |
| `crawlerMatcher` | regex for popular bots (GPTBot, ChatGPT-User, Perplexity, Claude, major search engines) | `(ua) => boolean`; override if you have your own signature checks (example below). |
| `nonCrawlerStrategy` | `'next'` | `'next'` lets humans through; `'block'` returns a 404 for protected paths. |
| `sitemapCacheControl` | `public, s-maxage=300, stale-while-revalidate=86400` | Cache header for the sitemap response. |
| `robotsCacheControl` | same | Cache header for the robots response. |
| `extraRobotsLines` | `[]` | Additional lines appended to `robots.txt`. |
| `devMode` | `false` | When `true`, logs crawler/human handling decisions to the console. |
| `logAllPaths` | `true` | Log every page request (set to `false` if you only care about protected prefixes). |
| `remoteSitemapUrl` | `undefined` | When set, `/sitemap.xml` gets proxied from this backend endpoint. |
| `remoteSitemapHeaders` | `{}` | Extra headers merged into the remote sitemap fetch (auth tokens, etc.). |
| `remoteSitemapCrawlerOnly` | `true` | Only pull from `remoteSitemapUrl` when the requester is a crawler; set `false` to proxy for everyone. |
| `remoteConfigUrl` | `undefined` | Optional JSON endpoint that returns `{ logEndpoint, redirectMap }`. Overrides local values. |
| `remoteConfigHeaders` | `{}` | Extra headers sent with the remote config fetch (bearer tokens, etc.). |
| `remoteConfigTtlMs` | `60000` | How long to cache the remote config before refreshing. |
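For instance, overriding `crawlerMatcher` to extend the built-in detection might look like this (a sketch; `AcmeInternalBot` is a made-up UA token):

```ts
import { vectusAgentkit, DEFAULT_CRAWLER_REGEX } from '@vectus/agentkit-next';

export default vectusAgentkit({
  // Treat anything the built-in regex matches, plus our hypothetical in-house bot, as a crawler.
  crawlerMatcher: (ua) => DEFAULT_CRAWLER_REGEX.test(ua) || ua.includes('AcmeInternalBot'),
});
```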
## Log payload structure

Every log contains path info, the full URL, HTTP method, crawler status, query params, IP metadata (country/city/timezone), request IDs, device hints, fetch metadata, and any extra fields you add via `logPayload`.

The default payload looks like this:

```json
{
  "path": "/_v/secret",
  "pathname": "/_v/secret",
  "search": "?p=foo",
  "query": { "p": "foo" },
  "fullUrl": "https://app.example.com/_v/secret?p=foo",
  "origin": "https://app.example.com",
  "protocol": "https",
  "ts": 1714000000000,
  "isoTimestamp": "2024-04-24T12:00:00.000Z",
  "method": "GET",
  "host": "app.example.com",
  "userAgent": "GPTBot",
  "crawler": true,
  "scope": "protected",
  "event": "crawler-blocked",
  "referer": "https://example.com/about",
  "ip": "203.0.113.10",
  "country": "US",
  "region": "CA",
  "city": "San Francisco",
  "device": "\"Googlebot/2.1\"",
  "acceptLanguage": "en-US,en;q=0.9",
  "secFetchSite": "same-origin",
  "requestId": "req_123",
  "cfRay": "abcd1234-SFO",
  "contentType": "text/html",
  "sitemapSource": "remote",
  "target": "https://app.example.com/"
}
```

Attach your own fields via `logPayload`:
```ts
vectusAgentkit({
  logEndpoint: 'https://logs.example.com',
  logPayload: ({ request }) => ({
    country: request.headers.get('cf-ipcountry') ?? 'unknown'
  })
});
```
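On the receiving end, any endpoint that accepts the POST works; a minimal sketch (hypothetical route, assuming a Next.js route handler on your API):

```ts
// app/log-visit/route.ts (hypothetical collector)
export async function POST(req: Request) {
  const visit = await req.json(); // the payload shown above
  console.log(`[visit] ${visit.method} ${visit.fullUrl} crawler=${visit.crawler}`);
  return new Response(null, { status: 204 });
}
```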
More recipes:

```ts
// While debugging locally
vectusAgentkit({ devMode: true });

// Log every visit, even outside /_v/
vectusAgentkit({ logAllPaths: true });

// Proxy sitemap.xml from a backend just for crawlers
vectusAgentkit({
  remoteSitemapUrl: 'https://api.example.com/sitemap.xml',
  remoteSitemapCrawlerOnly: true,
});

// Fetch log endpoint + redirect map from your API, refreshed every two minutes
vectusAgentkit({
  remoteConfigUrl: 'https://api.example.com/agentkit/config.json',
  remoteConfigHeaders: { Authorization: `Bearer ${process.env.AGENTKIT_TOKEN}` },
  remoteConfigTtlMs: 120000,
});
```
### Remote sitemap passthrough
- When `remoteSitemapUrl` is provided, `/sitemap.xml` is fetched from that endpoint. By default only crawler UAs trigger the proxy; non-crawler requests fall through to your existing Next.js route or static file. Set `remoteSitemapCrawlerOnly: false` if you want every request to proxy. A sketch of a matching backend endpoint follows this list.
- `remoteSitemapHeaders` lets you forward API keys or feature flags to the backend, and the middleware automatically forwards the current `Host` and `User-Agent` headers.
- If the remote call fails, the middleware falls back to the static sitemap and logs the `sitemap-remote-fallback` event so your logging pipeline can alert you.
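On the backend, the endpoint only needs to answer with XML; a minimal sketch (hypothetical Next.js route handler, with hard-coded URLs standing in for your real data source):

```ts
// app/sitemap.xml/route.ts on the backend (hypothetical)
export async function GET() {
  const urls = ['/', '/pricing', '/changelog']; // in practice, pulled from your CMS or database
  const body =
    '<?xml version="1.0" encoding="UTF-8"?>' +
    '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">' +
    urls.map((u) => `<url><loc>https://app.example.com${u}</loc></url>`).join('') +
    '</urlset>';
  return new Response(body, { headers: { 'Content-Type': 'application/xml' } });
}
```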
### Remote config + per-path redirects
- Point `remoteConfigUrl` at a JSON endpoint that returns the log collector URL (`logEndpoint`) and a dictionary (`redirectMap`) mapping each sitemap path (e.g., `/_v/facts`) to its safe redirect target (e.g., `/facts`).
- The middleware caches this object for `remoteConfigTtlMs` (default 60s) and merges it with any inline options. Remote values win.
- Every crawler rewrite consults the dictionary so different `/_v/*` pages can flow to different canonical routes, and the same dictionary is used to build `sitemap.xml`. No redeploy is needed to add/remove entries: just update the API response (a sketch of such an endpoint follows).
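A backend endpoint serving that config could look like this (a sketch with hypothetical paths, assuming a Next.js route handler and the bearer token from `remoteConfigHeaders`):

```ts
// app/agentkit/config.json/route.ts on your API (hypothetical)
import { NextRequest, NextResponse } from 'next/server';

export async function GET(req: NextRequest) {
  // Reject callers that don't present the token configured in remoteConfigHeaders.
  if (req.headers.get('authorization') !== `Bearer ${process.env.AGENTKIT_TOKEN}`) {
    return new NextResponse('Unauthorized', { status: 401 });
  }
  return NextResponse.json({
    logEndpoint: 'https://api.example.com/log-visit',
    redirectMap: {
      '/_v/facts': '/',
      '/_v/pricing': '/pricing',
    },
  });
}
```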
### Middleware matcher helper

`vectusAgentkitConfig()` still exists if you want the helper to merge custom matchers (useful on Next.js 13–15). Do not call it inside `export const config` on Next.js 16+; inline the literal object instead. `vectusAgentkitMatcher(array)` plus `DEFAULT_CRAWLER_REGEX` are exported for convenience (tests, custom tooling, older Next versions). On Next.js 16+ they cannot be referenced from `config` because the compiler requires raw literals.
```ts
// Literal (Next.js 16+)
export const config = {
  matcher: ['/((?!_next|assets|favicon\\.ico).*)', '/robots.txt'],
};
```

```ts
// Helper (older Next.js)
export const config = vectusAgentkitConfig({
  matcher: ['/((?!_next).*)', '/robots.txt', '/sitemap.xml']
});
```

## Development scripts

- `npm run build` – emits `build/` with `.js` + `.d.ts`
- `npm run dev` – watch-mode TypeScript build (for local development)
- `npm run lint` – type-check only
## Release checklist

- Update version in `package.json`.
- `npm run build`
- `npm publish --access public`

That's it: you now have a shareable, documented middleware helper.
