@exellix/narrix-web-scoper

v2.0.0

Published

2 months ago

CNI-aware web context planner and mapper for Exellix. Uses @exellix/search-adapter for web retrieval.


exellix

narrix web-scoper search-adapter cni

`@exellix/narrix-web-scoper`

An independent TypeScript library for generating web-search plans from an entity and mapping @exellix/search-adapter results into a stable WebContext shape.

It is designed to be embeddable in Narrix, but it does not require the Narrix runner/engine to be useful: you can call it directly anywhere you can provide a SearchAdapter instance.

What it does:

Builds web search queries from an entity (or from a provided WebScopingMap)
Calls a SearchAdapter from @exellix/search-adapter (prefers searchMany(...) when available, with fallback to execute(...) / search(...))
Maps the result into a stable WebContext shape for Narrix to consume
Supports a second “gap-driven” search mode (scopeForGaps) when you have gapHints indicating missing context (Narrix uses this with detectGaps, but you can supply gapHints yourself)
Supports multi-question “question packs” (scopeQuestionPack) to produce a stable multi-scope web context artifact
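
The adapter method preference described above can be sketched as a small selector. This is a minimal illustration, not the package's code: the AdapterSketch shape and the pickSearchMethod helper are hypothetical, and the relative order of execute and search in the fallback is an assumption.

```typescript
// Hypothetical, simplified adapter shape; the real SearchAdapterLike type
// comes from @exellix/search-adapter.
type SearchMethodName = "searchMany" | "execute" | "search";

interface AdapterSketch {
  searchMany?: (...args: unknown[]) => unknown;
  execute?: (...args: unknown[]) => unknown;
  search?: (...args: unknown[]) => unknown;
}

// Return the name of the first available method, preferring the batch call.
// Assumption: execute is tried before search when searchMany is absent.
function pickSearchMethod(adapter: AdapterSketch): SearchMethodName | undefined {
  const order: SearchMethodName[] = ["searchMany", "execute", "search"];
  return order.find((name) => typeof adapter[name] === "function");
}
```

An adapter that exposes none of the three methods yields undefined here, which a caller could treat the same way as a missing adapter.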

For detailed adapter usage (how to configure createSearchAdapter, what request/response shapes are supported, and integration patterns), see the @exellix/search-adapter docs.

Status

This repo currently implements:

Eligibility gating (allowlist by datasetId / entityKind, or custom function)
Query building (buildQueries, buildGapQueries)
Search adapter integration (scope, scopeForGaps, scopeGeneric, scopeForGapsGeneric, scopeQuestionPack) via SearchAdapterLike.searchMany(...) (preferred) with fallback to execute(...) / search(...)
Result mapping from SearchExecutionResult into a stable WebContext shape
Deterministic caps for web context size (maxFindings, maxSources, and snippet caps when enabled)
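
As a rough illustration of the eligibility gating listed above, the sketch below checks allowlists and an optional custom function. The names and the combination rules are assumptions (a custom function decides alone, and every configured allowlist must match); the package's own isEligible may behave differently.

```typescript
// Hypothetical shapes mirroring the eligibility fields named in this README.
interface EligibilityInput {
  datasetId: string;
  entityKind?: string;
}

interface EligibilityConfigSketch {
  datasetIds?: string[];
  entityKinds?: string[];
  isEligible?: (args: EligibilityInput) => boolean;
}

function isEligibleSketch(
  config: EligibilityConfigSketch | undefined,
  input: EligibilityInput,
): boolean {
  // Assumption: no gating configured means every input passes.
  if (!config) return true;
  // Assumption: a custom function, when present, decides alone.
  if (config.isEligible) return config.isEligible(input);
  // Assumption: each configured allowlist must contain the matching value.
  if (config.datasetIds && !config.datasetIds.includes(input.datasetId)) return false;
  if (config.entityKinds) {
    if (input.entityKind === undefined) return false;
    if (!config.entityKinds.includes(input.entityKind)) return false;
  }
  return true;
}
```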

It does not currently implement memory caching, per-call TTL support, or pack/runner integration described in docs/narrix-web-scoper-plan.md. See Gap analysis.

For the verified source-reading target pipeline (fetch → extract → claim-attributed reasoning), see docs/web-scoping-roadmap.md. The package exports stable contract types (WebAttributedClaim, DisciplinedReasoningInput, SourceContentFetcher, …) for orchestrators and future fetch packages—see Public API.

Requirements

Node: >=18 (see package.json)
Registry access: this package depends on @exellix/search-adapter from GitHub Packages

Environment variables

GITHUB_TOKEN: required to install dependencies from GitHub Packages (see .npmrc)
TAVILY_API_KEY: only required if you want to run the optional “real web” integration test (see tests/integration/scope.with-tavily.test.ts)

To get started quickly:

cp .env.example .env
# edit .env and set values

Install

1) Configure GitHub Packages auth

This repo installs its scoped dependencies from GitHub Packages. The included .npmrc expects GITHUB_TOKEN to be set:

//npm.pkg.github.com/:_authToken=${GITHUB_TOKEN}

Set GITHUB_TOKEN to a GitHub token with permission to read the relevant packages (and authorized for SSO if the org requires it).

If you don’t want to commit a real .npmrc, copy the template:

cp .npmrc.example .npmrc

2) Install dependencies

npm ci

Build & test

# Build to dist/ (types + sourcemaps)
npm run build

# Run tests once
npm test

# Watch mode
npm run test:watch

Live Tavily integration (tests/integration/scope.with-tavily.test.ts) runs whenever TAVILY_API_KEY is set to a non-placeholder value (including CI secrets). Set RUN_LIVE_TAVILY=0 to skip those tests while keeping a key in .env.

npm run test:integration — same integration file with Vitest’s verbose reporter (easier to see what ran vs skipped).
Console lines prefixed with [live-tavily] report gate status, timing, and result counts when live tests actually execute.
If Tavily returns unauthorized, the first live test fails by default (it used to pass silently). Set TAVILY_LIVE_ALLOW_UNAUTHORIZED_PASS=1 only if you intentionally want the test to pass without proving that the Tavily call succeeded.
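
The gate described above (run when a real key is set, skip when RUN_LIVE_TAVILY=0) might look roughly like the following. The helper name and the placeholder list are hypothetical; the actual test file may detect placeholders differently.

```typescript
// Assumption: illustrative placeholder values, not the list used by the real tests.
const PLACEHOLDER_KEYS = new Set(["", "changeme", "your-tavily-api-key"]);

// Decide whether live Tavily tests should run for a given environment map.
function shouldRunLiveTavily(env: Record<string, string | undefined>): boolean {
  // An explicit opt-out wins even when a key is present.
  if (env.RUN_LIVE_TAVILY === "0") return false;
  const key = (env.TAVILY_API_KEY ?? "").trim();
  return !PLACEHOLDER_KEYS.has(key.toLowerCase());
}
```

In a test file this could be called with the process environment to choose between running and skipping the live suite.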

Public API

The package exports a single factory plus types:

createWebScoper(config: WebScoperConfig): NarrixWebScoper
NarrixWebScoper methods: scope, scopeForGaps, scopeGeneric, scopeForGapsGeneric, scopeQuestionPack
Search / context types: WebContext, WebFinding, WebSource, WebSourceContentSource, WebSourceExcerptFrom, WebSourceRetrievalStage, WebScoperResult, GapSearchResult, GapHints, WebScopingMap, NarrixScope
Question pack types: WebScopeQuestion, WebScopePackInput, WebScopePackResult, WebScopeQuestionOutcome, WebContextScopes
Web-scoped persistence types (from @xronoces/xmemory-scoper): WebScopedDataDoc, WebScopedEntityRef
Verified-pipeline contracts (no I/O in this package): WebAttributedClaim, WebClaimFreshness, DisciplinedReasoningInput, SourceFetchRequest, SourceFetchResult, SourceFetchOk, SourceFetchErr, SourceContentFetcher

Entry point: src/index.ts. Roadmap: docs/web-scoping-roadmap.md.

Usage

Basic enrichment search (`scope`)

If enabled is false, scope() returns { available: false, reason: "disabled" }.

If enabled but no searchAdapter is provided, scope() returns { available: true, context: empty, cached: false } (a stubbed empty context with queriesUsed populated).

import { createWebScoper } from "@exellix/narrix-web-scoper";
import { createSearchAdapter } from "@exellix/search-adapter";

const adapter = createSearchAdapter({
  tavily: {
    apiKey: process.env.TAVILY_API_KEY!,
    maxResults: 3,
    includeAnswer: true,
  },
});

const scoper = createWebScoper({
  enabled: true,
  eligibility: { datasetIds: ["acme.vulnerabilities"] },
  searchAdapter: adapter,
  scoping: { maxQueries: 3 },
});

const result = await scoper.scope({
  datasetId: "acme.vulnerabilities",
  subjectId: "CVE-2024-9999",
  entityKind: "vulnerability",
  entity: { cveId: "CVE-2024-9999" },
  cni: {}, // passed through to planning and execution hints
});

if (result.available) {
  // result.context is the stable output shape for your app (and for Narrix, if embedded there)
  console.log(result.context.summary);
  console.log(result.context.findings);
} else {
  console.log(result.reason, result.error);
}

Gap-driven search (`scopeForGaps`)

This mode builds different queries based on gapHints (e.g. unknown dataset, missing schema, empty stories).

const gapResult = await scoper.scopeForGaps({
  datasetId: "acme.vulnerabilities",
  subjectId: "CVE-2024-9999",
  entityKind: "vulnerability",
  entity: { cveId: "CVE-2024-9999" },
  cni: {},
  gapHints: { missingSchema: true },
});

if (gapResult.found) {
  console.log(gapResult.gapType, gapResult.context.queriesUsed);
} else {
  console.log(gapResult.gapType, gapResult.reason);
}

Question packs (multi-scope web context)

Use scopeQuestionPack to run multiple simple web-scoping questions in one call. You get:

results: a map keyed by each question’s id with one outcome per entry: property_resolved, db_hit, web_fetch, or miss.
context.scopes: legacy-compatible per-scope entries (plus context.summary / findings / sources / queriesUsed from the primary scope).

Simple questions rule

Each question string must be a plain, direct English question someone would type into a search engine—short, not a mash-up of entity IDs, taxonomy labels, product names, or several sub-questions in one line. Cover more ground with more entries in the questions array, not longer composed strings. This package runs questions as-is and does not validate that rule at runtime; keeping questions simple is a caller responsibility (documentation and review).
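
Because the package does not enforce this rule, a caller could run a lightweight lint before building a pack. The sketch below is a heuristic only: lintQuestion, its word limit, and its separator checks are hypothetical choices, not part of this package.

```typescript
// Returns a list of human-readable problems; an empty list means the
// question passed these (arbitrary, illustrative) checks.
function lintQuestion(question: string, maxWords = 16): string[] {
  const problems: string[] = [];
  const text = question.trim();
  const words = text.split(/\s+/).filter(Boolean);

  if (words.length === 0) problems.push("empty question");
  if (words.length > maxWords) problems.push(`more than ${maxWords} words`);
  if (!text.endsWith("?")) problems.push("does not end with a question mark");
  // More than one question mark suggests several sub-questions in one line.
  if ((text.match(/\?/g) ?? []).length > 1) problems.push("several sub-questions in one line");
  // Separators often indicate a mash-up of labels rather than a plain question.
  if (/[|;>]|::/.test(text)) problems.push("contains list or taxonomy separators");

  return problems;
}
```

A review step could log the returned problems and still submit the pack, since the checks are advisory.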

Behaviour summary

Empty questions: no-op (available: true, empty results and scopes).
mappedProperty: optional dot-path into xmemorySnapshot. If it resolves to a non-empty value, web search and DB lookup are skipped for that question (unless forceWeb is true).
DB layer (optional): set config.webScopedData.getWebScopedData / saveWebScopedData (for example methods from createWebScopedDataApi in @xronoces/xmemory-scoper). Lookup uses the question text plus linkedEntities (or subjectId + entityKind when both are set). If callbacks are omitted or persistence fails, the pack still runs (web-only where applicable).
forceWeb: skips property resolution and getWebScopedData, but still calls saveWebScopedData after a successful web path when configured.
Parallel search, deduped URL fetch: all pack searches run concurrently (subject to concurrency). When page fetch is used, each normalized URL (scheme + host + path, tracking params stripped) is downloaded at most once; content is reused for every question that referenced that URL. Per-URL fetch failures are non-fatal.
Persisted shape: successful web answers are built as WebScopedDataDoc (imported from @xronoces/xmemory-scoper; linkedEntities, rawData.sources, and synthesizedData are always populated for new web rows).
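
The per-question ordering above (property resolution, then DB lookup, then web search, with forceWeb short-circuiting the first two) can be sketched as follows. All names here are hypothetical, and the definition of "non-empty" is an assumption.

```typescript
type SnapshotSketch = Record<string, unknown>;

interface QuestionSketch {
  id: string;
  question: string;
  mappedProperty?: string;
  forceWeb?: boolean;
}

// Walk a dot-path such as "vuln.exploit.status" through nested objects.
function resolveDotPath(snapshot: SnapshotSketch | undefined, path: string): unknown {
  let current: unknown = snapshot;
  for (const key of path.split(".")) {
    if (current === null || typeof current !== "object") return undefined;
    current = (current as Record<string, unknown>)[key];
  }
  return current;
}

// Assumption: "non-empty" excludes null/undefined, blank strings, and empty arrays.
function isNonEmpty(value: unknown): boolean {
  if (value === null || value === undefined) return false;
  if (typeof value === "string") return value.trim().length > 0;
  if (Array.isArray(value)) return value.length > 0;
  return true;
}

type FirstStep = "property_resolved" | "db_lookup" | "web_search";

// Decide which layer a question should try first.
function firstStepFor(
  q: QuestionSketch,
  snapshot: SnapshotSketch | undefined,
  hasDbLookup: boolean,
): FirstStep {
  // forceWeb skips property resolution and the DB lookup.
  if (q.forceWeb) return "web_search";
  if (q.mappedProperty && isNonEmpty(resolveDotPath(snapshot, q.mappedProperty))) {
    return "property_resolved";
  }
  return hasDbLookup ? "db_lookup" : "web_search";
}
```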

De-dupe of identical question text is supported (dedupe: "normalized" by default). Failures are lenient: one scope can miss while others succeed.

const pack = await scoper.scopeQuestionPack({
  subject: "CVE-2024-9999",
  xmemorySnapshot: graphSnapshot, // optional: for `mappedProperty` on questions
  questions: [
    {
      id: "exploitationReality",
      purpose: "In-the-wild exploitation",
      question: "Is CVE-2024-9999 exploited in the wild?",
    },
    {
      id: "exploitCode",
      purpose: "Public exploit material",
      question: "What public exploit code exists for CVE-2024-9999?",
    },
  ],
  concurrency: 3,
});

if (pack.available) {
  console.log(pack.results.exploitationReality?.status);
  console.log(pack.context.summary); // primary scope (legacy)
  console.log(pack.context.scopes.exploitationReality?.context?.findings);
}
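
The deduped URL fetch mentioned in the behaviour summary relies on normalizing URLs before grouping. A minimal sketch, assuming an illustrative tracking-parameter list and hypothetical helper names (normalizeUrl, groupByNormalizedUrl), could look like this:

```typescript
// Assumption: an illustrative tracking-parameter pattern; the package's
// actual list is not documented here.
const TRACKING_PARAMS = /^(utm_|fbclid$|gclid$)/i;

function normalizeUrl(raw: string): string {
  const url = new URL(raw.trim());
  url.hash = ""; // drop the fragment
  const keys: string[] = [];
  url.searchParams.forEach((_value, key) => keys.push(key));
  for (const key of keys) {
    if (TRACKING_PARAMS.test(key)) url.searchParams.delete(key);
  }
  return url.toString();
}

// Map each normalized URL to the ids of the questions that referenced it,
// so a fetcher needs to download each key only once.
function groupByNormalizedUrl(
  refs: Array<{ questionId: string; url: string }>,
): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const ref of refs) {
    let key: string;
    try {
      key = normalizeUrl(ref.url);
    } catch {
      continue; // assumption: skip unparsable URLs rather than fail the pack
    }
    const ids = groups.get(key) ?? [];
    if (!ids.includes(ref.questionId)) ids.push(ref.questionId);
    groups.set(key, ids);
  }
  return groups;
}
```

Fetching once per map key and fanning the content out to the listed question ids gives the "at most once" behaviour.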

Configuration

`WebScoperConfig`

Key fields used by the current implementation:

enabled?: boolean
searchAdapter?: SearchAdapterLike (from @exellix/search-adapter, expected to provide searchMany or search / execute at runtime; fetchUrlContent is used for deduped page fetch in question packs when present)
webScopedData?: { getWebScopedData?, saveWebScopedData? } — optional hooks aligned with @xronoces/xmemory-scoper / WebScopedDataDoc persistence
eligibility?: { datasetIds?: string[]; entityKinds?: string[]; isEligible?: (args) => boolean }
scoping?: an object with the following optional fields (content / rawContent are deprecated aliases for the adapter’s providerContent / providerRawContent):
- maxQueries?: number
- freshnessDays?: number
- maxFindings?: number
- maxSources?: number
- includeSourceSnippets?: boolean
- maxSnippetCharsPerSource?: number
- maxTotalWebContextChars?: number
- snippetIncludeRawContent?: boolean | "markdown" | "text"
- sourceExcerptFrom?: "providerContent" | "providerRawContent" | "content" | "rawContent"
- fetchPages?: boolean
- fetchTopK?: number
- ...

Note: other fields exist in types (e.g. memory, cache) but are not wired for runtime caching in this package yet.

Source body fields (`WebSource`)

When scoping.includeSourceSnippets is true, each WebContext.sources[] entry may include provider-layer text from @exellix/search-adapter (normalized SearchSource.snippet, providerContent, providerRawContent; legacy clients may still see content / rawContent on the wire—those are the same roles under older names). This is discovery-time material from the search provider, not a guarantee of full-page or live-site truth unless a later fetch stage exists.

providerContent / providerRawContent: first-class copies of the adapter’s bounded excerpt and raw payload (when requested). Prefer these over legacy names.
content / rawContent: deprecated mirrors of providerContent / providerRawContent for older consumers.
snippet: primary excerpt for this source, chosen via scoping.sourceExcerptFrom (default providerContent, aliases content / rawContent):
- providerContent (default): providerContent → snippet.
- providerRawContent: providerRawContent → providerContent → snippet. If snippetIncludeRawContent is omitted, includeRawContent defaults to true so raw is requested.
snippetCharCount: code-point length of snippet when set.
contentOrigin, retrievalStage, matchedQueries: passed through from the adapter when present (provenance for trust and debugging).
contentSource: when contentOrigin is a known WebSourceContentSource, it is copied here; otherwise, if contentOrigin is absent, web-scoper infers search_api_raw_content, search_api_content, or search_api_snippet from which provider fields were populated (raw wins over bounded content over display snippet).
score / rank: passed through from the adapter when present.
URLs are normalized (trimmed, fragment stripped) before domain / url are set.
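
The snippet fallback chains and the contentSource inference described in this list can be sketched as below. The type and function names are hypothetical; only the field names and the ordering come from this README.

```typescript
type ExcerptFrom = "providerContent" | "providerRawContent" | "content" | "rawContent";

interface ProviderFields {
  snippet?: string;
  providerContent?: string;
  providerRawContent?: string;
}

// Deprecated aliases map onto the provider-prefixed names.
function canonicalExcerptFrom(
  value: ExcerptFrom | undefined,
): "providerContent" | "providerRawContent" {
  return value === "providerRawContent" || value === "rawContent"
    ? "providerRawContent"
    : "providerContent";
}

function firstNonEmpty(...candidates: Array<string | undefined>): string | undefined {
  return candidates.find((text) => text !== undefined && text.length > 0);
}

// Choose the primary excerpt following the documented fallback chains.
function pickSnippet(source: ProviderFields, excerptFrom?: ExcerptFrom): string | undefined {
  return canonicalExcerptFrom(excerptFrom) === "providerRawContent"
    ? firstNonEmpty(source.providerRawContent, source.providerContent, source.snippet)
    : firstNonEmpty(source.providerContent, source.snippet);
}

// Used only when the adapter supplied no contentOrigin:
// raw beats bounded content, which beats the display snippet.
function inferContentSource(source: ProviderFields): string | undefined {
  if (source.providerRawContent) return "search_api_raw_content";
  if (source.providerContent) return "search_api_content";
  if (source.snippet) return "search_api_snippet";
  return undefined;
}
```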

WebContext.summary / summaryOrigin / summaryIsProvider: top-level summary from the adapter; summaryOrigin labels synthesis (e.g. provider_answer); summaryIsProvider is true when that origin is provider_answer. Optional merge counters (discoveredSourceCount, etc.) are copied when the adapter supplies them.

WebFinding: isProviderDerived is set for provider_answer / provider_snippet kinds; isStrongEvidence is true when any linked source has retrievalStage fetched or extracted. Relevance is down-ranked for provider-answer findings before confidence is applied.
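
A rough sketch of how the two finding flags could be derived is shown below. The FindingSketch and SourceSketch shapes, including the sourceIds link between findings and sources, are assumptions made for illustration.

```typescript
// Hypothetical shapes; how real findings reference their sources may differ.
interface FindingSketch {
  kind: string;
  sourceIds: string[];
}

interface SourceSketch {
  id: string;
  retrievalStage?: string;
}

function findingFlags(
  finding: FindingSketch,
  sources: SourceSketch[],
): { isProviderDerived: boolean; isStrongEvidence: boolean } {
  const byId = new Map<string, SourceSketch>();
  for (const source of sources) byId.set(source.id, source);

  // Provider-derived kinds come straight from the search provider's answer or snippet.
  const isProviderDerived =
    finding.kind === "provider_answer" || finding.kind === "provider_snippet";

  // Strong evidence needs at least one linked source that was fetched or extracted.
  const isStrongEvidence = finding.sourceIds.some((id) => {
    const stage = byId.get(id)?.retrievalStage;
    return stage === "fetched" || stage === "extracted";
  });

  return { isProviderDerived, isStrongEvidence };
}
```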

Defaults are backward compatible: includeSourceSnippets defaults to false, so these fields are omitted unless you opt in.

Output caps:

maxFindings / maxSources: caps WebContext.findings and WebContext.sources. Resolution order:
- input.cni.answerShapeHints.maxFindings/maxSources (if set)
- otherwise config.scoping.maxFindings/maxSources (defaults)
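
The resolution order above amounts to "per-call hint first, config default second". A minimal sketch, with hypothetical helper names, might be:

```typescript
interface CapsSketch {
  maxFindings?: number;
  maxSources?: number;
}

// Per-call hints (input.cni.answerShapeHints) take precedence over config.scoping.
function resolveCaps(
  hints: CapsSketch | undefined,
  scoping: CapsSketch | undefined,
): CapsSketch {
  return {
    maxFindings: hints?.maxFindings ?? scoping?.maxFindings,
    maxSources: hints?.maxSources ?? scoping?.maxSources,
  };
}

// Keep the first `cap` items; an undefined cap leaves the list untouched.
function applyCap<T>(items: T[], cap: number | undefined): T[] {
  return cap === undefined ? items : items.slice(0, Math.max(0, cap));
}
```

Slicing from the front keeps the output deterministic for a given input order.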

Snippet/text caps (only apply when snippets are enabled):

maxSnippetCharsPerSource: Unicode code-point cap applied per source to providerContent, providerRawContent (and legacy content / rawContent mirrors), and the text chosen for snippet (after sourceExcerptFrom). When set to a positive number, it is also forwarded as snippetMaxChars on the shared search request so the adapter can normalize earlier.
maxTotalWebContextChars: additional budget applied only to WebSource.snippet, across sources in array order (after each snippet’s per-source cap). It does not shrink stored providerContent / providerRawContent.
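
To make the two snippet caps concrete, the sketch below applies a per-source code-point cap and then a running total budget in array order. The names are hypothetical, and emptying snippets once the budget is spent (rather than dropping the source) is an assumption.

```typescript
// Truncate by Unicode code points so surrogate pairs are never split.
function capCodePoints(text: string, max: number): string {
  const points = Array.from(text);
  return points.length <= max ? text : points.slice(0, max).join("");
}

interface SnippetSource {
  url: string;
  snippet?: string;
  snippetCharCount?: number;
}

function applySnippetBudgets(
  sources: SnippetSource[],
  perSource?: number,
  total?: number,
): SnippetSource[] {
  let remaining = total ?? Infinity;
  return sources.map((source) => {
    if (source.snippet === undefined) return source;
    // Per-source cap first, then whatever is left of the total budget.
    let snippet =
      perSource !== undefined && perSource > 0
        ? capCodePoints(source.snippet, perSource)
        : source.snippet;
    snippet = capCodePoints(snippet, remaining);
    const count = Array.from(snippet).length;
    remaining -= count;
    return { ...source, snippet, snippetCharCount: count };
  });
}
```

Only snippet is touched here, which matches the note that the total budget does not shrink stored providerContent / providerRawContent.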

To request raw body text from the provider, set scoping.snippetIncludeRawContent (e.g. true or "markdown"); it is forwarded as includeRawContent (boolean true may be sent as "markdown" for SDK compatibility—see search-adapter docs).

WebFinding.support: when the adapter attaches support metadata (e.g. for provider_snippet findings), web-scoper preserves it on the mapped finding.

Query building

`buildQueries` (enrichment)

Order of precedence:

If scopingMap.queries is provided: templates are interpolated from entity, sorted by (weight ?? 1) descending, de-duped, and capped by maxQueries.
Otherwise: an “auto” strategy picks a primary identifier from common fields (id, cveId, name, hostname, productName, entityKey, identifier, key, …). It emits up to 3 queries:
- The identifier alone
- Identifier + "{entityKind} context"
- Identifier + "{entityKind} details"
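
The precedence above can be sketched as follows. This is not the package's buildQueries: the {field} placeholder syntax, the case-insensitive de-duplication, and the identifier field priority are assumptions, and all names are hypothetical.

```typescript
interface QueryTemplate {
  template: string;
  weight?: number;
}

type EntityRecord = Record<string, unknown>;

// Assumption: templates use {field} placeholders filled from the entity.
function interpolate(template: string, entity: EntityRecord): string {
  return template
    .replace(/\{(\w+)\}/g, (_match: string, field: string) => {
      const value = entity[field];
      return value === undefined || value === null ? "" : String(value);
    })
    .replace(/\s+/g, " ")
    .trim();
}

// Drop empty and repeated queries (case-insensitive), then cap the list.
function dedupeAndCap(queries: string[], maxQueries: number): string[] {
  const seen = new Set<string>();
  const unique: string[] = [];
  for (const query of queries) {
    const key = query.toLowerCase();
    if (query.length > 0 && !seen.has(key)) {
      seen.add(key);
      unique.push(query);
    }
  }
  return unique.slice(0, maxQueries);
}

const ID_FIELDS = ["id", "cveId", "name", "hostname", "productName", "entityKey", "identifier", "key"];

function buildQueriesSketch(
  entity: EntityRecord,
  entityKind: string,
  maxQueries: number,
  templates?: QueryTemplate[],
): string[] {
  if (templates && templates.length > 0) {
    // Higher weight first; a missing weight counts as 1.
    const sorted = [...templates].sort((a, b) => (b.weight ?? 1) - (a.weight ?? 1));
    return dedupeAndCap(sorted.map((t) => interpolate(t.template, entity)), maxQueries);
  }
  // Auto strategy: first common identifier field holding a non-empty string.
  const idField = ID_FIELDS.find((f) => typeof entity[f] === "string" && entity[f] !== "");
  if (!idField) return [];
  const id = String(entity[idField]);
  return dedupeAndCap([id, `${id} ${entityKind} context`, `${id} ${entityKind} details`], maxQueries);
}
```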

`buildGapQueries` (gap-driven)

Builds up to 5 queries depending on gapHints:

unknownDataset: dataset/entity-kind discovery queries
missingSchema: schema/documentation/example queries
processorNotMatched: “what is this input” + context queries
emptyStories: broader “analysis context” queries
If no hints are set: falls back to a generic context query
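
The hint-to-query mapping above might be sketched like this. The query strings are purely illustrative stand-ins, not the text the package generates, and the function name and parameters are hypothetical.

```typescript
interface GapHintsSketch {
  unknownDataset?: boolean;
  missingSchema?: boolean;
  processorNotMatched?: boolean;
  emptyStories?: boolean;
}

// Collect queries per active hint, fall back to a generic query, cap at `max`.
function buildGapQueriesSketch(
  datasetId: string,
  entityKind: string,
  hints: GapHintsSketch,
  max = 5,
): string[] {
  const queries: string[] = [];
  if (hints.unknownDataset) {
    queries.push(`what is ${datasetId} dataset`, `${entityKind} data overview`);
  }
  if (hints.missingSchema) {
    queries.push(`${entityKind} schema documentation`, `${entityKind} record example`);
  }
  if (hints.processorNotMatched) {
    queries.push(`what is this input ${entityKind}`);
  }
  if (hints.emptyStories) {
    queries.push(`${entityKind} analysis context`);
  }
  // No hints set: generic context query.
  if (queries.length === 0) {
    queries.push(`${entityKind} context`);
  }
  return queries.slice(0, max);
}
```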

Execution adapter integration & mapping

This package is adapter-centric at the type level: it expects a SearchAdapterLike (typically created via createSearchAdapter from @exellix/search-adapter) that provides searchMany(...) at runtime, or execute(request: SearchExecutionRequest) / search(...) as a fallback. The adapter is responsible for talking to Tavily or other providers and returning a SearchExecutionResult that narrix-web-scoper maps into WebContext.

Repo layout

src/index.ts: main factory + search-adapter mapping
src/query.ts: enrichment + gap query builders
src/eligibility.ts: eligibility checker
src/types.ts: public types (including adapter-facing types re-exported from @exellix/search-adapter)
tests/: unit + integration tests (mock adapter + real Tavily-backed adapter)
docs/narrix-web-scoper-plan.md: design/spec document (ahead of implementation)
docs/nx-gap-analysis.md: notes on Nx + workspace alignment

Security notes

Do not commit tokens. Use environment variables (this repo uses GITHUB_TOKEN via .npmrc).
Treat orchestrator outputs as untrusted input if you surface them outside internal systems.

Gap analysis

Implemented in code

Core API exists: createWebScoper(), scope(), scopeForGaps(), scopeGeneric(), scopeForGapsGeneric(), scopeQuestionPack(), buildQueries(), buildGapQueries(), isEligible()
Query building: auto + from-map interpolation/weighting/deduping
Search adapter mapping: supports the SearchExecutionResult shape from @exellix/search-adapter
Tests: unit tests for eligibility/querying and integration tests using both a mock adapter and a real Tavily-backed adapter via @exellix/search-adapter

Missing vs `docs/narrix-web-scoper-plan.md` (high priority)

Memory caching (config.memory, config.cache) is defined in types but not implemented
- No ttlSeconds, staleCutoffSeconds, cniHashPolicy, datasetId.webContext namespacing, or cached/ageSeconds logic
Config defaults from the plan are not enforced (e.g. enabled default false, cache defaults, freshness/maxEvidence, etc.)
Query strategy config (scoping.queryStrategy: "auto" | "fromMap") exists in types but is not used; current code always uses:
- from-map if scopingMap is provided to buildQueries()
- otherwise auto
Runner integration and _webContext CNI enrichment are not present (this is currently a library-only package)
Domain controls (focusDomains, excludeDomains, maxEvidence, freshnessDays) exist only in the planning doc / types but are not wired into orchestrator calls
Gap search caching policy (“not cached by default”) is not implemented because caching is not implemented at all

Repo/package hygiene gaps (recommended)

License: UNLICENSED with empty author (decide the intended licensing model)
Publishing metadata: consider adding repository, homepage, bugs, and files in package.json
Exports map: consider adding "exports" for NodeNext consumers (ESM/CJS clarity)
CI: no GitHub Actions/workflow included for npm test / npm run build
Formatting/linting: no formatter/linter config (Prettier/ESLint) or npm run lint
Release process: no changelog/versioning guidance (Changesets or similar)

Usability gaps (what’s unclear / what would make adoption smoother)

Config validation: there’s no runtime validation (e.g., enabled: true but missing searchAdapter)—today this silently returns an “empty context” stub; this is convenient for tests but surprising in production unless documented.
Deterministic IDs / caching hooks: outputs always include cached: false; if you intend real caching, the API should document cache keys and how subjectId is expected to be chosen.
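
Until runtime validation exists, a caller can guard against the silent empty-stub case with a small pre-flight check. The sketch below is hypothetical (validateConfigSketch is not part of this package), and the maxQueries rule is an assumption added for illustration.

```typescript
// Hypothetical subset of WebScoperConfig, enough for the checks below.
interface ConfigSketch {
  enabled?: boolean;
  searchAdapter?: unknown;
  scoping?: { maxQueries?: number };
}

// Returns a list of problems; an empty list means the config looks usable.
function validateConfigSketch(config: ConfigSketch): string[] {
  const problems: string[] = [];
  if (config.enabled && !config.searchAdapter) {
    problems.push("enabled is true but searchAdapter is missing; scope() would return an empty stub context");
  }
  // Assumption: maxQueries, when set, should be a positive integer.
  const maxQueries = config.scoping?.maxQueries;
  if (maxQueries !== undefined && (!Number.isInteger(maxQueries) || maxQueries < 1)) {
    problems.push("scoping.maxQueries must be a positive integer");
  }
  return problems;
}
```

Failing fast on a non-empty result in production, while allowing the stub in tests, keeps both uses explicit.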