@vertana/context-web
v0.2.0
Web context gathering for Vertana — fetch and extract content from linked pages to provide additional context for translation.
Features
The recommended way to give the translator access to web context is to expose passive sources, which the translator only invokes when it decides it actually needs them:
fetchWebPage: A passive context source that fetches a single URL and extracts the main content using Mozilla's Readability algorithm. The LLM calls it on demand with a specific URL.
searchWeb: A passive context source that performs a web search (DuckDuckGo Lite) and returns a list of results (title, URL, snippet).
A required helper is also provided for short, trusted link sets where you want links fetched up-front:
fetchLinkedPages: A required context source factory that extracts links from the source text and fetches their content before translation begins. By default it fetches up to ten links (configurable via maxLinks). This is a convenience helper; see the warning below before using it on large or untrusted documents.
Plus a low-level utility:
extractLinks: Extracts URLs from text in various formats (plain text, Markdown, HTML).
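To make the idea of link extraction concrete, here is a minimal sketch of pulling URLs out of plain text, Markdown, and HTML with one regular expression. The function name extractUrls and its signature are hypothetical; this is not the library's extractLinks implementation, whose exact signature and behavior are not shown in this README.

```typescript
// Hypothetical sketch: NOT the library's extractLinks implementation.
// Collects unique http(s) URLs from plain text, Markdown, or HTML.
function extractUrls(text: string): string[] {
  // Match up to whitespace or a delimiter that commonly wraps links in
  // Markdown ("(", ")", "[", "]") and HTML ("<", ">", quotes).
  const pattern = /https?:\/\/[^\s<>"'()\[\]]+/g;
  const matches = text.match(pattern) || [];
  const seen = new Set<string>();
  for (const raw of matches) {
    // Drop trailing sentence punctuation, which is rarely part of a URL.
    seen.add(raw.replace(/[.,;:!?]+$/, ""));
  }
  // A Set preserves insertion order, so URLs come back in order of
  // first appearance with duplicates removed.
  return Array.from(seen);
}
```

One limitation of this simplification is that URLs containing parentheses are cut short at the first parenthesis; a real extractor would likely parse Markdown and HTML properly instead of relying on a single pattern.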
Installation
Deno
deno add jsr:@vertana/context-web
npm
npm add @vertana/context-web
pnpm
pnpm add @vertana/context-web
Usage
The recommended pattern uses passive sources, so the translator decides which URLs (if any) are worth fetching:
import { translate } from "@vertana/facade";
import { fetchWebPage, searchWeb } from "@vertana/context-web";
import { openai } from "@ai-sdk/openai";
const text = `
Check out this article: https://example.com/article
It explains the concept in detail.
`;
const result = await translate(openai("gpt-4o"), "ko", text, {
  contextSources: [
    // The translator may fetch a specific URL when it needs more context.
    fetchWebPage,
    // The translator may run a web search when it needs more context.
    searchWeb,
  ],
});
Eagerly fetching linked pages
If you have a short, trusted set of links and you want them pulled in before
translation begins, fetchLinkedPages does that. It fetches up to ten links by
default; set maxLinks to raise or lower the cap:
import { translate } from "@vertana/facade";
import { fetchLinkedPages } from "@vertana/context-web";
import { openai } from "@ai-sdk/openai";
const text = "Check out https://example.com/article for details.";
const result = await translate(openai("gpt-4o"), "ko", text, {
  contextSources: [
    fetchLinkedPages({ text, mediaType: "text/plain" }),
  ],
});
Use character budgets to keep fetched reference material smaller than the source text:
fetchLinkedPages({
  text,
  mediaType: "text/plain",
  maxCharsPerPage: 2000,
  maxTotalChars: 6000,
});
For longer pages, you can summarize each fetched page with an explicit model:
const summarizerModel = openai("gpt-4o");
fetchLinkedPages({
  text,
  mediaType: "text/plain",
  summarize: { model: summarizerModel, maxChars: 800 },
});
Warning: Pulling many large pages into required context can confuse the translator: when the combined reference material is much larger than the source text, and especially when it is in the target language, the model may echo a fetched page back instead of translating the actual input. For large or untrusted link sets, prefer the passive fetchWebPage source above so the translator only fetches what it actually needs.
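To illustrate how a per-page budget and a total budget like maxCharsPerPage and maxTotalChars could interact, here is a minimal sketch. The helper applyCharBudgets is hypothetical, and the clipping strategy (truncate each page, then stop once the total is spent) is an assumption; the README does not say how the library enforces these options.

```typescript
// Hypothetical sketch: one plausible way to enforce character budgets.
// The library's actual truncation strategy may differ.
function applyCharBudgets(
  pages: string[],
  maxCharsPerPage: number,
  maxTotalChars: number,
): string[] {
  const kept: string[] = [];
  let remaining = maxTotalChars;
  for (const page of pages) {
    // Stop once the total budget is spent; later pages are dropped.
    if (remaining <= 0) break;
    // Each page gets at most the per-page cap, and never more than
    // what is left of the total budget.
    const clipped = page.slice(0, Math.min(maxCharsPerPage, remaining));
    kept.push(clipped);
    remaining -= clipped.length;
  }
  return kept;
}
```

Under these assumptions, a budget of 2000 characters per page and 6000 in total keeps at most three full-sized pages, which bounds the reference material regardless of how many links the source text contains.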
