@x12i/search-adapter

v1.5.1

Published

2 months ago

Tavily-backed web search adapter for Narrix and TypeScript apps.

x12i

narrix search tavily adapter

`@x12i/search-adapter`

Tavily-backed web search adapter for Narrix and other TypeScript/Node applications.

This package separates two ideas that used to be conflated:

Discovery — URLs and text the search provider returned (Tavily).
Evidence — URLs this adapter actually fetched over HTTP and extracted text from (optional second stage).

SearchResult exposes discoveredSources and evidenceSources separately so downstream code can distinguish what the provider found from what the adapter actually fetched.

Environment

TAVILY_API_KEY=your_tavily_token_here

Override per adapter: createSearchAdapter({ tavily: { apiKey: "…" } }).

Install & build

npm install
npm run build
npm test

Integration tests call Tavily only when TAVILY_API_KEY is set.
npm run test:integration:proof — with SEARCH_ADAPTER_TEST_PROOF=1, logs discovery/evidence previews.
npm run proof:search -- "query" — live stats for discovery fields; set SEARCH_ADAPTER_FETCH=1 to also run fetchPages on the top URLs (see script header).

Public API

import {
  createSearchAdapter,
  type SearchAdapter,
  type SearchAdapterConfig,
  type SearchRequest,
  type SearchManyRequest,
  type SearchResult,
  type DiscoveredSource,
  type EvidenceSource,
  type ProviderFinding,
} from "@x12i/search-adapter";

`createSearchAdapter(config?: SearchAdapterConfig)`

Top-level `config`

| Option | Default | Role |
|--------|---------|------|
| includeSourceSnippets | true | When false, every DiscoveredSource has snippet: "" and no providerContent / providerRawContent (overridable per request). |
| redactQuery | (none) | (query) => string run on the query after validation and before Tavily. The returned string is what leaves your process for search. Must not be empty/whitespace-only. |

`config.tavily`

| Option | Default | Role |
|--------|---------|------|
| apiKey | process.env.TAVILY_API_KEY | Tavily API key. |
| apiBaseUrl | (SDK default) | Passed as Tavily apiBaseURL. |
| timeoutMs | 15000 | Client timeout budget → Tavily timeout (seconds). |
| maxRetries | 1 | Retries for retryable provider failures. |
| defaultTopic / defaultSearchDepth / includeAnswer / includeRawContent / snippetMaxChars / maxResults | see src/config.ts | Request defaults. |

`config.fetch` (second-stage HTTP fetch)

Fetch runs only when config.fetch.enabled === true and the request sets fetchPages: true.

| Option | Default | Role |
|--------|---------|------|
| enabled | false | Master switch for evidence fetching. |
| topK | 5 | Max URLs to fetch per search (also capped by request.fetchTopK). |
| concurrency | 4 | Max concurrent GETs while fetching the batch from a single search() (bounded pool). |
| maxAttempts | 3 | Total HTTP GET attempts per URL (evidence fetch + fetchUrlContent). Retries use exponential backoff with jitter on 408, 429, 502, 503 and transient network errors; honors Retry-After when parseable. |
| timeoutMs | 12000 | Per-attempt fetch timeout (each retry gets a fresh budget). |
| maxContentChars | 500000 | Max extracted text per URL. |
| userAgent | (package string) | User-Agent header. |
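
The retry behaviour described for maxAttempts can be pictured with a small delay calculator. This is a minimal sketch, not the adapter's internal code: the helper names (isRetryableStatus, parseRetryAfterMs, backoffDelayMs), the 500 ms base, the 8 s cap, and the "full jitter" strategy are assumptions made for illustration. Only the retryable status list and the preference for a parseable Retry-After come from the table above.

```typescript
// Hypothetical helpers; not part of the @x12i/search-adapter public API.
const RETRYABLE_STATUSES = new Set<number>([408, 429, 502, 503]);

function isRetryableStatus(status: number): boolean {
  return RETRYABLE_STATUSES.has(status);
}

// Retry-After may be delta-seconds ("120") or an HTTP date.
// Returns undefined when the header is missing or not parseable.
function parseRetryAfterMs(
  header: string | null | undefined,
  nowMs: number = Date.now(),
): number | undefined {
  if (!header) return undefined;
  const trimmed = header.trim();
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000;
  const dateMs = Date.parse(trimmed);
  if (Number.isNaN(dateMs)) return undefined;
  return Math.max(0, dateMs - nowMs);
}

// attempt is 1-based: pass 1 to get the delay between attempt 1 and attempt 2.
function backoffDelayMs(
  attempt: number,
  retryAfter?: string | null,
  baseMs = 500, // assumed base delay
  capMs = 8000, // assumed ceiling for any single wait
  random: () => number = Math.random,
): number {
  const hinted = parseRetryAfterMs(retryAfter);
  if (hinted !== undefined) return Math.min(hinted, capMs);
  const exponential = Math.min(capMs, baseMs * Math.pow(2, attempt - 1));
  // "Full jitter": pick uniformly in [0, exponential).
  return Math.floor(random() * exponential);
}
```

With maxAttempts: 3 there are at most two waits per URL, so under these assumed numbers the worst-case added latency from backoff alone is bounded by two capped delays, on top of the per-attempt timeoutMs.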

Adapter methods

search(request) — Resolves defaults, validates ResolvedSearchRequest, applies redactQuery when configured, runs Tavily → mapTavilyDiscovery (discovery-only), optionally fetchEvidenceSources, then assembles SearchResult (discovery + evidence layers).
searchMany(request) — Same concurrency / stopOnError behavior; merges discoveredSources, evidenceSources, providerFindings, and findings in separate maps.
fetchUrlContent(url, options?) — Fetches one URL with the same rules as evidence GETs (timeouts, byte caps, maxAttempts / backoff). Never throws; always returns an EvidenceSource (check fetchOk). Does not require fetch.enabled.
healthCheck() — API key configured (no network).

Parallel URL fetches and rate limits

Inside one search({ fetchPages: true }) call: at most fetch.concurrency requests are in flight at any time; each URL is still tried up to fetch.maxAttempts times with backoff between attempts. There is no separate global QPS limit—saturating many hosts in parallel is your tradeoff.
Many parallel fetchUrlContent calls: each invocation runs its own retry loop with no cross-call throttling. To stay polite to a single origin, cap parallelism yourself (e.g. a pool of size ≤ fetch.concurrency) or serialize fetches.
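
Capping parallelism yourself, as suggested above, can be done with a small bounded pool. The helper below is a generic sketch under stated assumptions: mapWithConcurrency is a hypothetical name, and the worker argument stands in for a call such as (url) => adapter.fetchUrlContent(url).

```typescript
// Hypothetical helper: runs `worker` over `items` with at most `limit` calls in flight.
// Results keep the input order regardless of completion order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each lane pulls the next unclaimed index until the list is exhausted.
  async function lane(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await worker(items[i], i);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => lane()));
  return results;
}

// Intended use (adapter and urls come from your own code):
// const rows = await mapWithConcurrency(urls, 4, (u) => adapter.fetchUrlContent(u));
```

Because fetchUrlContent is documented as never throwing, a failed URL surfaces as a row with fetchOk false rather than rejecting the whole batch. A worker that can throw would need its own try/catch.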

PII / sensitive tokens in queries

Prefer redactQuery on the adapter config to strip or replace patterns (hostnames, ticket IDs, bearer fragments) before runTavilySearch.
Keep replacements non-empty so validation still passes, e.g. replace internal-host with [REDACTED_HOST] rather than deleting the whole string.
Per-request override is not exposed; build different adapter instances if you need different redactors.
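
A redactor that covers several pattern families can be composed from a rule list. This is a sketch only: the patterns, the OPS/SEC ticket prefixes, and the placeholder strings are hypothetical and should be replaced with ones that match your environment. Every replacement is a non-empty placeholder, so a non-empty query stays non-empty.

```typescript
type Redactor = (query: string) => string;

// Hypothetical rules: [pattern, non-empty placeholder].
const RULES: Array<[RegExp, string]> = [
  [/\b[a-z0-9-]+\.internal\.company\b/gi, "[REDACTED_HOST]"],
  // Ticket keys are limited to assumed project prefixes so public IDs
  // such as CVE-2024-9999 are left alone.
  [/\b(?:OPS|SEC)-\d+\b/g, "[REDACTED_TICKET]"],
  [/\bBearer\s+[A-Za-z0-9._~+\/-]+=*/g, "[REDACTED_TOKEN]"],
];

// Applies every rule in order and returns the rewritten query.
const redactQuery: Redactor = (query) =>
  RULES.reduce((q, [pattern, placeholder]) => q.replace(pattern, placeholder), query);
```

The resulting function has the (query) => string shape that the redactQuery option expects, so it can be passed as createSearchAdapter({ redactQuery }).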

Tavily include_raw_content: the TavilyIncludeRawContent type is boolean | "markdown" | "text" (see the Tavily API docs). A request value of true is sent to the SDK as "markdown".

`DiscoveredSource` (Tavily / search API)

What the provider returned for a hit—not proof you fetched the live page yourself.

| Field | Notes |
|--------|--------|
| url, normalizedUrl, domain, title, publishedAt | Normalized URL used for dedupe in searchMany. |
| snippet | Always present (string). Usable excerpt (Tavily snippet → content → raw, truncated by snippetMaxChars). Empty string when the provider sent nothing or when includeSourceSnippets is false. |
| providerContent | From Tavily’s content only, truncated. |
| providerRawContent | From raw_content / rawContent when requested; not capped by snippetMaxChars. |
| snippetKind | "snippet" \| "provider_content" \| "provider_raw_content". |
| providerScore, rank | From Tavily when present. |
| matchedQueries | Which queries returned this URL (filled/merged in searchMany). |
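
The snippet fallback order (snippet → content → raw) can be read as a first-non-empty selection followed by truncation. The sketch below illustrates that reading. ProviderHit and pickSnippet are hypothetical names, the input shape is assumed, and the adapter's real mapper may differ in details such as whitespace handling.

```typescript
type SnippetKind = "snippet" | "provider_content" | "provider_raw_content";

// Assumed, simplified shape of one provider hit.
interface ProviderHit {
  snippet?: string;
  content?: string;
  rawContent?: string;
}

// Returns the first non-empty candidate, truncated to snippetMaxChars,
// or an empty snippet when nothing is usable or snippets are disabled.
function pickSnippet(
  hit: ProviderHit,
  snippetMaxChars: number,
  includeSourceSnippets = true,
): { snippet: string; snippetKind?: SnippetKind } {
  if (!includeSourceSnippets) return { snippet: "" };
  const candidates: Array<[SnippetKind, string | undefined]> = [
    ["snippet", hit.snippet],
    ["provider_content", hit.content],
    ["provider_raw_content", hit.rawContent],
  ];
  for (const [kind, text] of candidates) {
    const trimmed = text ? text.trim() : "";
    if (trimmed.length > 0) {
      return { snippet: trimmed.slice(0, snippetMaxChars), snippetKind: kind };
    }
  }
  return { snippet: "" };
}
```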

`EvidenceSource` (HTTP fetch)

What this process requested and optionally extracted. Check fetchOk before treating extractedText as reliable.

| Field | Notes |
|--------|--------|
| fetchOk, httpStatus, fetchError | Outcome of the GET. |
| origin | fetched_html \| fetched_text \| fetched_json \| fetched_pdf (PDF not extracted yet). |
| extractedText | Plain-ish text (HTML stripped heuristically). |
| authorityScore, freshnessScore, qualityScore | Simple heuristics (gov/CVE/vendor domains, publishedAt age, length/success). |
| derivedFromDiscoveredSourceIds | Discovery row IDs this fetch came from. |
| matchedQueries | Carried from discovery. |
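
Downstream code will usually want only the rows that fetched successfully and produced enough text to be useful. The filter below is a sketch: EvidenceLike is a local structural type with assumed field types (the package's EvidenceSource is the authoritative definition), and the 200-character minimum and the sort by qualityScore are illustrative choices.

```typescript
// Assumed subset of EvidenceSource fields; types are guesses for illustration.
interface EvidenceLike {
  url: string;
  fetchOk: boolean;
  httpStatus?: number;
  fetchError?: string;
  extractedText?: string;
  qualityScore?: number;
}

// Keeps rows with fetchOk and at least minChars of text, best quality first.
// filter() returns a new array, so the caller's array is not reordered.
function usableEvidence(rows: readonly EvidenceLike[], minChars = 200): EvidenceLike[] {
  return rows
    .filter((row) => row.fetchOk && (row.extractedText || "").trim().length >= minChars)
    .sort((a, b) => (b.qualityScore || 0) - (a.qualityScore || 0));
}
```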

Tavily mapper (`mapTavilyDiscovery`)

Returns TavilyDiscoveryResult only—never a full SearchResult:

discoveredSources
providerSummary / providerSummaryOrigin (from Tavily’s answer)
providerFindings (answer + top snippet hints)

That keeps the Tavily step honest: it is discovery-stage output, not fetched evidence.

`SearchResult`

interface SearchResult {
  ok: boolean;
  provider: "tavily";
  query: string;
  providerSummary?: string;
  providerSummaryOrigin?: "provider_answer";
  providerFindings: ProviderFinding[];
  findings: SearchFinding[]; // evidence-backed / merged; empty until you add that layer
  discoveredSources: DiscoveredSource[];
  evidenceSources: EvidenceSource[];
  request: ResolvedSearchRequest;
  timing: SearchTiming;
  error?: SearchError;
  raw?: { providerResponse?: unknown };
}

`ProviderFinding` (from discovery only)

provider_answer — Tavily’s answer; sourceIds is empty (do not treat as grounded in every URL).
provider_hint — Short rows from top discovery snippets when there is no answer—hints, not verified claims.

`SearchFinding` (evidence / merge layer)

Reserved for source_claim, derived, cross_source_consensus, etc. The adapter currently returns findings: []; populate when you merge fetched text or rank evidence.

`SearchManyResult.merged`

providerFindings — Deduped provider hints across queries.
findings — Deduped evidence-backed findings (usually empty until implemented).
discoveredSources / evidenceSources — Merged separately by normalized URL.
queriesUsed — Sub-query strings in order.
totalDiscoveredSources / totalEvidenceSources — Counts after merge.
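
Merging by normalized URL amounts to keeping one row per normalizedUrl and accumulating the queries that returned it. The sketch below shows that idea for discovery rows. DiscoveredLike and mergeDiscovered are hypothetical names, and keeping the first-seen row is an assumption, not necessarily how searchMany resolves conflicts.

```typescript
// Assumed subset of DiscoveredSource fields.
interface DiscoveredLike {
  url: string;
  normalizedUrl: string;
  matchedQueries: string[];
}

// One output row per normalizedUrl; matchedQueries lists every query that hit it.
function mergeDiscovered(
  batches: ReadonlyArray<{ query: string; sources: readonly DiscoveredLike[] }>,
): DiscoveredLike[] {
  const byUrl = new Map<string, DiscoveredLike>();
  for (const { query, sources } of batches) {
    for (const source of sources) {
      const existing = byUrl.get(source.normalizedUrl);
      if (!existing) {
        // Copy the row so the caller's objects are not mutated.
        byUrl.set(source.normalizedUrl, { ...source, matchedQueries: [query] });
      } else if (existing.matchedQueries.indexOf(query) === -1) {
        existing.matchedQueries.push(query);
      }
    }
  }
  // Map preserves insertion order, so rows come back in first-seen order.
  return Array.from(byUrl.values());
}
```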

Example: discovery only (default)

const adapter = createSearchAdapter();

const result = await adapter.search({
  query: "CVE-2024-9999",
  maxResults: 5,
  includeAnswer: true,
});

if (result.ok) {
  console.log(result.discoveredSources.length, result.evidenceSources.length);
  console.log(result.providerSummary, result.providerSummaryOrigin);
  console.log(result.providerFindings.length, result.findings.length);
}

Example: discovery + evidence fetch

const adapter = createSearchAdapter({
  fetch: { enabled: true, topK: 3, timeoutMs: 15000 },
});

const result = await adapter.search({
  query: "CVE-2024-9999 advisory",
  maxResults: 5,
  fetchPages: true,
  fetchTopK: 2,
});

if (result.ok) {
  for (const e of result.evidenceSources) {
    if (e.fetchOk) console.log(e.url, e.extractedText?.slice(0, 500));
  }
}

Example: per-URL content after deduplication

const adapter = createSearchAdapter({
  fetch: { maxAttempts: 3, timeoutMs: 15000, maxContentChars: 200_000 },
});

const row = await adapter.fetchUrlContent("https://example.com/doc");
if (row.fetchOk) {
  console.log(row.extractedText?.slice(0, 500));
} else {
  console.warn(row.fetchError, row.httpStatus);
}

Example: query redaction

const adapter = createSearchAdapter({
  tavily: { apiKey: process.env.TAVILY_API_KEY },
  redactQuery: (q) =>
    q.replace(/\b[a-z0-9-]+\.internal\.company\b/gi, "[internal-host]"),
});

Errors

SearchError includes optional context: { stage?, query?, provider? } with stage among validate | provider_call | map | fetch.
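
The stage field makes it possible to route failures without parsing messages. The sketch below is one way to do that. SearchErrorLike is a local stand-in whose message field is an assumption, and the suggested reaction for each stage is an illustrative policy, not behaviour prescribed by the package.

```typescript
type SearchStage = "validate" | "provider_call" | "map" | "fetch";

// Local stand-in for SearchError; only `context` is described in this README.
interface SearchErrorLike {
  message?: string; // assumed field
  context?: { stage?: SearchStage; query?: string; provider?: string };
}

type NextStep = "fix_request" | "retry_later" | "report_bug" | "use_discovery_only" | "unknown";

// Maps the failing stage to a suggested reaction (hypothetical policy).
function nextStep(error: SearchErrorLike): NextStep {
  const stage = error.context ? error.context.stage : undefined;
  switch (stage) {
    case "validate":
      return "fix_request"; // bad input; retrying unchanged will not help
    case "provider_call":
      return "retry_later"; // provider or network trouble
    case "map":
      return "report_bug"; // unexpected provider response shape
    case "fetch":
      return "use_discovery_only"; // evidence stage failed
    default:
      return "unknown";
  }
}
```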

Migration

sources / SearchSource → discoveredSources + evidenceSources.
summary → providerSummary; provider-only rows → providerFindings (not findings).
findings on SearchResult is now for evidence-backed claims only (often empty until you add merge logic).
