@sschepis/brand-ingestor

v1.0.0

Published

13 days ago

LLM-powered brand intelligence library. Point it at a URL, get back a comprehensive brand profile — company info, brand identity, full product catalog, taxonomy, and metadata.


lonestar108

brand ingestor scraper shopify llm ai marketing product-catalog brand-identity

brand-ingestor

LLM-powered brand intelligence library. Point it at a URL, pass an LLM adapter, get back a comprehensive brand profile — company info, brand identity, full product catalog, taxonomy, metadata schema, and site content.

Designed to give an AI marketing system everything it needs to create ads and run campaigns for a brand.

Install

npm install
npx playwright install chromium

Usage

import { ingestBrand, type LLMProvider, type BrandProfile } from 'brand-ingestor';

// myLLMAdapter is any object implementing LLMProvider (see "LLM Adapter").
const profile: BrandProfile = await ingestBrand('https://philosophy.com', {
  llmProvider: myLLMAdapter,
  maxPages: 20,      // optional, default 50
  concurrency: 2,    // optional, default 2
});

LLM Adapter

You provide an object implementing LLMProvider:

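import type { ZodType } from 'zod';

interface LLMProvider {
  // Receives the prompt and a Zod schema; resolves to an object matching the schema.
  generateObject: <T>(prompt: string, schema: ZodType<T>) => Promise<T>;
}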

The library sends a prompt and a Zod schema. Your adapter returns a parsed object matching the schema. How you call your LLM is up to you.

Example with Vercel AI SDK + LM Studio:

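import { generateObject } from 'ai';
import { createOpenAI } from '@ai-sdk/openai';
import type { LLMProvider } from 'brand-ingestor';

// LM Studio's local OpenAI-compatible server; the API key is only a placeholder.
const lm = createOpenAI({ baseURL: 'http://localhost:1234/v1', apiKey: 'not-needed' });
const model = lm('openai/gpt-oss-20b');

const llmProvider: LLMProvider = {
  // Inside the function body, generateObject refers to the AI SDK import above.
  generateObject: async (prompt, schema) => {
    const { object } = await generateObject({ model, schema, prompt, mode: 'json' });
    return object;
  }
};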
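If your LLM client only returns raw text, the adapter has to parse and validate the reply itself. The following is a minimal sketch, not the library's own code: `completeText`, `parseLLMJson`, and `makeTextAdapter` are hypothetical names, and `SchemaLike` is a structural stand-in for Zod's `ZodType`, which exposes a compatible `parse` method.

```typescript
// Structural stand-in for Zod's ZodType (assumption): anything whose parse()
// validates unknown data and returns a typed value, or throws.
type SchemaLike<T> = { parse: (data: unknown) => T };

// Hypothetical: any function that sends a prompt to your LLM and returns raw text.
type CompleteText = (prompt: string) => Promise<string>;

// Pull the outermost {...} span out of a reply that may include prose or
// code fences, then validate it against the schema.
function parseLLMJson<T>(raw: string, schema: SchemaLike<T>): T {
  const start = raw.indexOf('{');
  const end = raw.lastIndexOf('}');
  if (start === -1 || end <= start) {
    throw new Error('LLM response did not contain a JSON object');
  }
  // JSON.parse throws on malformed JSON; schema.parse throws on a shape mismatch.
  return schema.parse(JSON.parse(raw.slice(start, end + 1)));
}

// Wrap a text-only completion function in an object with a generateObject method.
function makeTextAdapter(completeText: CompleteText) {
  return {
    generateObject: async <T>(prompt: string, schema: SchemaLike<T>): Promise<T> => {
      const raw = await completeText(`${prompt}\n\nRespond with a single JSON object only.`);
      return parseLLMJson(raw, schema);
    },
  };
}
```

Validating inside the adapter means a malformed reply fails early with a clear error instead of surfacing later as a partially filled profile.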

What It Returns

ingestBrand() returns a BrandProfile with these sections:

`company` — Corporate & Business Info

| Field | Description |
|-------|-------------|
| `name` | Brand name |
| `legalName` | Legal/registered name |
| `description` | What the company does |
| `tagline` | Primary slogan |
| `foundedYear` | Year founded |
| `headquarters` | HQ location |
| `parentCompany` | Parent company, if any |
| `industry` | Line of business |
| `contactEmail` | Public contact email |
| `contactPhone` | Public phone number |
| `contactAddress` | Physical address |
| `socialProfiles[]` | Platform, URL, and handle for each social account |

`brand` — Brand Identity

| Field | Description |
|-------|-------------|
| `logos[]` | Logo URLs with context (favicon, header, etc.) |
| `colors[]` | Brand colors as hex codes with usage context |
| `fonts[]` | Font families used on the site |
| `taglines[]` | Slogans/taglines found in site content |
| `voiceTone` | Brand voice description (e.g. "warm and aspirational") |
| `brandValues[]` | Core values (e.g. "self-care", "simplicity") |
| `targetDemographic` | Target audience description |
| `brandPersonality` | Brand personality description |

`taxonomy` — Product Taxonomy

| Field | Description |
|-------|-------------|
| `collections[]` | All collections/categories with title, handle, description, product count |
| `productTypes[]` | All unique product types |
| `tags[]` | All unique tags across products |
| `vendors[]` | All unique vendors |
| `priceRange` | Min, max, average price and currency |
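To make `priceRange` concrete, here is one way such a summary could be derived from the variant prices in the product catalog. This is a sketch under assumptions, not the library's implementation: the property names `min`, `max`, `average`, and `currency`, the `ProductLike` shape, and the rounding to two decimals are all illustrative.

```typescript
// Assumed shapes: prices may arrive as strings (e.g. "24.00") or numbers.
interface VariantLike { price: string | number }
interface ProductLike { variants: VariantLike[] }
interface PriceRange { min: number; max: number; average: number; currency: string }

function computePriceRange(products: ProductLike[], currency: string): PriceRange | null {
  // Flatten every variant price into one numeric list, dropping unparseable values.
  const prices: number[] = [];
  for (const product of products) {
    for (const variant of product.variants) {
      const value = parseFloat(String(variant.price));
      if (Number.isFinite(value)) prices.push(value);
    }
  }
  if (prices.length === 0) return null;

  const sum = prices.reduce((acc, p) => acc + p, 0);
  return {
    min: prices.reduce((a, b) => Math.min(a, b)),
    max: prices.reduce((a, b) => Math.max(a, b)),
    // Round to two decimals so the average reads like a price.
    average: Math.round((sum / prices.length) * 100) / 100,
    currency,
  };
}
```

Averaging over variants rather than products weights multi-size items more heavily; averaging each product's lowest price would be an equally defensible choice.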

`products[]` — Full Product Catalog

Each product includes:

| Field | Description |
|-------|-------------|
| `id`, `handle`, `url` | Identifiers |
| `name`, `description`, `descriptionHtml` | Content |
| `productType`, `vendor`, `tags[]` | Classification |
| `variants[]` | Each variant: id, title, SKU, price, compareAtPrice, available, weight, option values |
| `options[]` | Option definitions (e.g. `Size: ["4oz", "8oz", "16oz"]`) |
| `images[]` | Image URL, alt text, dimensions, position |
| `publishedAt`, `createdAt`, `updatedAt` | Timestamps |

`metadata` — Product Metadata Schema

Computed from the product data so you understand the structure:

| Field | Description |
|-------|-------------|
| `optionDefinitions[]` | Every option name, all values seen across products, how many products use it |
| `tagCloud[]` | Every tag with frequency count |
| `fieldPopulation[]` | For each product field: how many products have it populated (fill rate %) |
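The two simpler aggregates, `tagCloud[]` and `fieldPopulation[]`, can be illustrated with a short sketch. The function names, the output property names (`tag`, `count`, `field`, `populated`, `fillRate`), and the definition of "populated" below are assumptions for illustration; the library may compute these differently.

```typescript
type ProductRecord = Record<string, unknown>;

// Count how many products carry each tag (each tag counted once per product).
function buildTagCloud(products: ProductRecord[]): { tag: string; count: number }[] {
  const counts = new Map<string, number>();
  products.forEach((product) => {
    const tags = Array.isArray(product.tags) ? (product.tags as string[]) : [];
    Array.from(new Set(tags)).forEach((tag) => {
      counts.set(tag, (counts.get(tag) ?? 0) + 1);
    });
  });
  // Most frequent first; ties broken alphabetically for stable output.
  return Array.from(counts.entries())
    .map(([tag, count]) => ({ tag, count }))
    .sort((a, b) => b.count - a.count || a.tag.localeCompare(b.tag));
}

// Assumed definition: null, undefined, blank strings, and empty arrays are unpopulated.
function isPopulated(value: unknown): boolean {
  if (value === null || value === undefined) return false;
  if (typeof value === 'string') return value.trim().length > 0;
  if (Array.isArray(value)) return value.length > 0;
  return true;
}

// For each field, report how many products populate it and the fill rate in percent.
function buildFieldPopulation(products: ProductRecord[], fields: string[]) {
  return fields.map((field) => {
    const populated = products.filter((p) => isPopulated(p[field])).length;
    const fillRate = products.length === 0 ? 0 : Math.round((populated / products.length) * 100);
    return { field, populated, fillRate };
  });
}
```

A low fill rate for a field such as `vendor` is a signal that downstream ad templates should not depend on it.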

`content` — Site Content

| Field | Description |
|-------|-------------|
| `pages[]` | Static pages (about, FAQ, policies) with title, handle, and text content |
| `metaDescription` | Site meta description |
| `metaKeywords[]` | Meta keywords |
| `ogImage` | Open Graph image URL |
| `favicon` | Favicon URL |

Top-level metadata

| Field | Description |
|-------|-------------|
| `platform` | Detected platform ("shopify" or "generic") |
| `sourceUrl` | The URL that was ingested |
| `ingestedAt` | ISO timestamp |
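Since the profile is meant to feed an AI marketing system, a typical consumer condenses it into prompt context. The sketch below shows one possible way to do that using a reduced, assumed subset of the profile (`ProfileLike`); `buildAdBrief` is a hypothetical helper, not part of the library, and the exact optionality of each field is a guess.

```typescript
// Assumed subset of BrandProfile, limited to fields named in the tables above.
interface ProfileLike {
  company: { name: string; tagline?: string; industry?: string };
  brand: { voiceTone?: string; brandValues?: string[]; targetDemographic?: string };
  products: { name: string }[];
}

// Build a short plain-text brief, skipping any field the profile does not populate.
function buildAdBrief(profile: ProfileLike, maxProducts = 3): string {
  const { company, brand, products } = profile;
  const values = brand.brandValues ?? [];
  const featured = products.slice(0, maxProducts).map((p) => p.name);

  const lines: (string | undefined)[] = [
    `Brand: ${company.name}`,
    company.tagline ? `Tagline: ${company.tagline}` : undefined,
    company.industry ? `Industry: ${company.industry}` : undefined,
    brand.voiceTone ? `Voice: ${brand.voiceTone}` : undefined,
    brand.targetDemographic ? `Audience: ${brand.targetDemographic}` : undefined,
    values.length > 0 ? `Values: ${values.join(', ')}` : undefined,
    featured.length > 0 ? `Featured products: ${featured.join(', ')}` : undefined,
  ];
  return lines.filter((line): line is string => line !== undefined).join('\n');
}
```

Keeping the brief short and omitting empty fields avoids handing the downstream model placeholders it might echo into ad copy.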

How It Works

1. Platform detection: Probes `/products.json` to detect Shopify. More platform detectors can be added.
2. Shopify path: Fetches products, collections, and pages from Shopify's public JSON APIs (no auth needed) and paginates automatically. Renders the homepage with Playwright for brand identity extraction.
3. Generic path: Crawls the site with Playwright (which renders JavaScript), then uses the LLM to categorize URLs, extract products from page text, and derive corporate info.
4. Both paths: Scrape HTML for logos, colors, fonts, social links, meta tags, and JSON-LD structured data, then use the LLM to derive brand voice, values, personality, and target demographic.
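The detection and pagination steps above can be sketched as follows. This is an illustration under assumptions rather than the library's code: `detectPlatform`, `fetchAllProducts`, and the `FetchJson` type are hypothetical, and the sketch assumes the store's `/products.json` endpoint accepts `limit` and `page` query parameters and returns `{ products: [...] }`.

```typescript
// Minimal fetch-like signature so the logic can be exercised without a network.
type FetchJson = (url: string) => Promise<{ ok: boolean; json: () => Promise<unknown> }>;

// Treat the site as Shopify if /products.json answers with a products array.
async function detectPlatform(siteUrl: string, fetchFn: FetchJson): Promise<'shopify' | 'generic'> {
  try {
    const res = await fetchFn(new URL('/products.json?limit=1', siteUrl).toString());
    if (!res.ok) return 'generic';
    const body = (await res.json()) as { products?: unknown } | null;
    return Array.isArray(body?.products) ? 'shopify' : 'generic';
  } catch {
    // Network errors or non-JSON responses fall back to the generic path.
    return 'generic';
  }
}

// Request pages until one comes back short, with a hard cap as a safety net.
async function fetchAllProducts(
  siteUrl: string,
  fetchFn: FetchJson,
  pageSize = 250,
  maxPages = 100,
): Promise<unknown[]> {
  const all: unknown[] = [];
  for (let page = 1; page <= maxPages; page++) {
    const url = new URL(`/products.json?limit=${pageSize}&page=${page}`, siteUrl).toString();
    const res = await fetchFn(url);
    if (!res.ok) break;
    const body = (await res.json()) as { products?: unknown[] } | null;
    const batch = Array.isArray(body?.products) ? body.products : [];
    all.push(...batch);
    if (batch.length < pageSize) break; // a short page means the catalog is exhausted
  }
  return all;
}
```

In Node 18+ the global `fetch` satisfies `FetchJson`, so `detectPlatform('https://philosophy.com', fetch)` would run against a live site; passing a stub instead keeps the logic testable offline.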

CLI (for testing)

npx tsx src/cli.ts https://philosophy.com

Env vars: LLM_BASE_URL, LLM_API_KEY, LLM_MODEL, MAX_PAGES.

Writes output to brand-profile-{hostname}.json.
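As an illustration of the CLI conventions above, the sketch below reads the four environment variables and derives the output file name from the URL. `readCliConfig` and `outputFileName` are hypothetical helpers, not the contents of `src/cli.ts`; the fallback of 50 pages mirrors the documented `maxPages` default, and whether the real CLI normalizes the hostname (for example by stripping `www.`) is not stated, so this sketch leaves it untouched.

```typescript
interface CliConfig {
  baseUrl?: string;
  apiKey?: string;
  model?: string;
  maxPages: number;
}

// Read settings from an env-like object; in a real CLI this would be process.env.
function readCliConfig(env: Record<string, string | undefined>): CliConfig {
  const parsed = Number.parseInt(env.MAX_PAGES ?? '', 10);
  return {
    baseUrl: env.LLM_BASE_URL,
    apiKey: env.LLM_API_KEY,
    model: env.LLM_MODEL,
    // Fall back to the documented default when MAX_PAGES is missing or invalid.
    maxPages: Number.isInteger(parsed) && parsed > 0 ? parsed : 50,
  };
}

// Build the brand-profile-{hostname}.json name from the ingested URL.
function outputFileName(siteUrl: string): string {
  return `brand-profile-${new URL(siteUrl).hostname}.json`;
}
```

Under these assumptions, `outputFileName('https://philosophy.com')` yields `brand-profile-philosophy.com.json`.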
