website-profile-extractor
v0.1.0
Fetch a website and use an LLM to extract structured profile data against a user-supplied schema. Pluggable Gemini / OpenAI / Anthropic backends. Lead enrichment without Clearbit.
Fetch any website, hand it to an LLM with a JSON schema, get back structured profile data. BYO LLM (Gemini, OpenAI, Anthropic — or any function that returns JSON).
Lead enrichment without Clearbit. Profile completion without a vendor.
npm install website-profile-extractor
Why
LLMs are great at extracting structured fields from messy HTML. But every team writes the same boilerplate: fetch, strip tags, build a prompt from a schema, parse fences, merge with existing data, decide which fields to skip.
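To make the "strip tags" step in that list concrete, here is a minimal sketch of turning raw HTML into prompt-sized text. The helper name htmlToText is hypothetical and this is not the package's actual cleaning code; the 60,000-character cap simply mirrors the documented maxHtmlChars default.

```typescript
// Hypothetical helper, shown only to illustrate the boilerplate described above.
function htmlToText(html: string, maxChars = 60000): string {
  const text = html
    // Drop script/style/noscript blocks together with their contents.
    .replace(/<(script|style|noscript)\b[^>]*>[\s\S]*?<\/\1\s*>/gi, " ")
    // Drop HTML comments.
    .replace(/<!--[\s\S]*?-->/g, " ")
    // Replace every remaining tag with a space so adjacent words stay separated.
    .replace(/<[^>]+>/g, " ")
    // Decode a handful of common entities (a real cleaner would cover more).
    .replace(/&nbsp;/g, " ")
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;/g, "'")
    .replace(/&amp;/g, "&")
    // Collapse runs of whitespace into single spaces.
    .replace(/\s+/g, " ")
    .trim();
  // Truncate so the text fits comfortably inside a prompt.
  return text.slice(0, maxChars);
}

const sampleHtml = `<html><head><style>p { color: red }</style></head>
<body><h1>Acme Kennels</h1><script>track()</script><p>Call &amp; visit</p></body></html>`;

console.log(htmlToText(sampleHtml)); // Acme Kennels Call & visit
```

Regex-based stripping is good enough for feeding an LLM, which tolerates noise, but it is not a safe HTML sanitizer.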
This package gives you extractProfile({ url, schema, provider }) and provider adapters for the major APIs. The schema is JSON-schema-ish; the result is typed.
Quick start
import { extractProfile } from "website-profile-extractor";
import { geminiProvider } from "website-profile-extractor/providers/gemini";
const provider = geminiProvider({ apiKey: process.env.GEMINI_API_KEY! });
const result = await extractProfile({
  url: "https://acme-kennels.de",
  provider,
  schema: {
    kennel_name: { type: "string", description: "Business name" },
    phone: { type: "string", description: "Primary phone number", strict: true },
    email: { type: "string" },
    address: { type: "string" },
    breeds: { type: "array", items: { type: "string" }, description: "Dog breeds raised" },
    club_memberships: { type: "array", items: { type: "string" } },
  },
});
console.log(result.data);
// {
//   kennel_name: "Acme Kennels",
//   phone: "+49 123 456 789",
//   email: "[email protected]",
//   address: "Musterstraße 1, 10115 Berlin",
//   breeds: ["Labrador Retriever", "Golden Retriever"],
//   club_memberships: ["VDH", "DRC"]
// }
Provider adapters
import { geminiProvider } from "website-profile-extractor/providers/gemini";
import { openAIProvider } from "website-profile-extractor/providers/openai";
import { anthropicProvider } from "website-profile-extractor/providers/anthropic";
const a = geminiProvider({ apiKey, model: "gemini-1.5-flash" });
const b = openAIProvider({ apiKey, model: "gpt-4o-mini" });
const c = anthropicProvider({ apiKey, model: "claude-haiku-4-5-20251001" });
A provider is anything that implements { generate(prompt) => Promise<{ json: string }> }, so you can write your own for local models, Ollama, Vertex, Bedrock, etc.
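A custom provider could look roughly like the sketch below. The Provider type is written out from the shape described above, with the assumption that prompt is a plain string; functionProvider is a hypothetical helper, not something the package exports.

```typescript
// Provider shape as described in this README (prompt typed as string by assumption).
type Provider = {
  generate(prompt: string): Promise<{ json: string }>;
};

// Hypothetical helper: wraps any async text-completion function so that it
// satisfies the Provider shape.
function functionProvider(
  complete: (prompt: string) => Promise<string>
): Provider {
  return {
    async generate(prompt: string) {
      const json = await complete(prompt);
      return { json };
    },
  };
}

// Stand-in for a real model call; replace the body with a request to
// Ollama, Vertex, Bedrock, or any other backend that returns JSON text.
const cannedProvider = functionProvider(async () =>
  JSON.stringify({ kennel_name: "Acme Kennels" })
);

cannedProvider
  .generate("Extract the profile")
  .then((res) => console.log(res.json));
```

A canned provider like this is also handy in unit tests, since it exercises the extraction flow without network access or API keys.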
Filling missing fields only
Pass an existing object and the extractor will only fill empty/null/missing keys. This makes it safe to re-run on a record without overwriting human edits.
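The fill-only rule can be pictured with the sketch below. It is an illustration under stated assumptions, not the package's code: fillMissing and isEmpty are hypothetical names, and "empty" is assumed to mean undefined, null, an empty string, or an empty array.

```typescript
type Profile = Record<string, unknown>;

// Assumption: these four cases count as "empty".
function isEmpty(value: unknown): boolean {
  return (
    value === undefined ||
    value === null ||
    value === "" ||
    (Array.isArray(value) && value.length === 0)
  );
}

// Hypothetical merge step: an extracted value is copied only when the
// existing value is empty and the extracted one is not.
function fillMissing(existing: Profile, extracted: Profile) {
  const data: Profile = { ...existing };
  const filled: string[] = [];
  const skipped: string[] = [];
  for (const [key, value] of Object.entries(extracted)) {
    if (isEmpty(existing[key]) && !isEmpty(value)) {
      data[key] = value;
      filled.push(key);
    } else {
      skipped.push(key);
    }
  }
  return { data, filled, skipped };
}

const merged = fillMissing(
  { kennel_name: "Acme Kennels (edited by hand)", phone: null },
  { kennel_name: "Acme Kennels", phone: "+49 123 456 789", email: null }
);

console.log(merged.filled);  // ["phone"]
console.log(merged.skipped); // ["kennel_name", "email"]
```

With the package itself, the same behavior is requested through the existing option: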
const result = await extractProfile({
  url,
  schema,
  provider,
  existing: profile, // already-known fields
});
// result.filled → keys that were extracted from the page
// result.skipped → keys that were already populated or unknown
Schema fields
type FieldSchema = {
  type: "string" | "number" | "boolean" | "array" | "object";
  description?: string; // becomes part of the prompt
  items?: FieldSchema; // for arrays
  properties?: Record<string, FieldSchema>; // for objects
  enum?: Array<string | number>;
  strict?: boolean; // refuse to invent values
};
The package builds a single, well-structured prompt from your schema, asks for JSON, and parses the response (handles ```json fences, surrounding prose, etc.).
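As a rough idea of what tolerant response parsing involves, the sketch below pulls a JSON object out of a reply that may be wrapped in a code fence or surrounded by prose. parseJsonResponse is a hypothetical name and the approach is an assumption, not the package's parser; it only handles top-level objects, not top-level arrays.

```typescript
// Build the triple-backtick marker programmatically so this snippet can
// itself live inside a fenced block.
const FENCE = "`".repeat(3);

// Hypothetical parser illustrating the idea described above.
function parseJsonResponse(raw: string): unknown {
  // Prefer the contents of a fenced block (with or without a "json" tag).
  const fencePattern = new RegExp(
    FENCE + "(?:json)?\\s*([\\s\\S]*?)" + FENCE,
    "i"
  );
  const fenced = raw.match(fencePattern);
  const candidate = fenced ? fenced[1] : raw;

  // Take the outermost {...} span, which ignores any surrounding prose.
  const start = candidate.indexOf("{");
  const end = candidate.lastIndexOf("}");
  if (start === -1 || end <= start) {
    throw new Error("No JSON object found in model response");
  }
  return JSON.parse(candidate.slice(start, end + 1));
}

const fencedReply =
  "Here is the profile:\n" +
  FENCE + "json\n" +
  '{"phone": "+49 123 456 789"}\n' +
  FENCE + "\nLet me know if you need more.";

const proseReply = 'Sure! {"email": null} Hope that helps.';

console.log(parseJsonResponse(fencedReply));
console.log(parseJsonResponse(proseReply));
```

JSON.parse still throws on malformed JSON, so a caller would typically wrap this in a try/catch and decide whether to retry the model call.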
Options
extractProfile({
  url,            // page to fetch
  schema,         // see above
  provider,       // LLM
  existing?,      // already-known fields to skip
  instructions?,  // appended to system prompt
  maxHtmlChars?,  // truncate cleaned text (default 60,000)
  userAgent?,     // override UA header
  fetch?,         // inject fetch impl
});
License
MIT
