llm-quote-extractor

v0.1.0

Published

22 days ago

Parse any AI response (ChatGPT, Claude, Gemini, Perplexity, AIO) for brand mentions, cited URLs, ranked recommendations, and platform fingerprint. Pure CPU regex, no LLM call.

ravirdp

ai-search llm brand-monitoring chatgpt claude gemini perplexity ai-overview geo aeo citation-tracking

llm-quote-extractor

Parse any AI response (ChatGPT, Claude, Gemini, Perplexity, Google AI Overview) for brand mentions, cited URLs, ranked recommendations, and platform fingerprint. Pure CPU regex. No LLM call. Zero network.

A focused TypeScript library that takes an LLM-generated text response and returns a structured extraction — what brands the model named, what URLs it cited, what it recommended in what order, and which platform the response likely came from. Useful for AI search visibility tracking, brand monitoring research, citation analysis, and answer-engine optimization (AEO) workflows.

The same parsing pipeline powers the free Citare LLM Quote Extractor web tool. This package is the open-source extraction core.

Install

npm install llm-quote-extractor
# or
pnpm add llm-quote-extractor
# or
yarn add llm-quote-extractor

Quick start

import { extractFromLlmText } from "llm-quote-extractor";

const pasted = `
I'd be happy to help! For project management with AI workflows, here are my top picks:

1. **Linear** — fast, keyboard-driven, beloved by engineers (https://linear.app)
2. **Notion** — flexible knowledge base + project tracker (https://notion.so)
3. **Asana** — broad PM tool with strong integrations (https://asana.com)

Each has its strengths depending on team size and workflow style.
`;

const result = extractFromLlmText(pasted);

console.log(result.detectedPlatform);    // → "claude" (matches the "I'd be happy to help!" signature)
console.log(result.brandsNamed[0]);      // → { brand: "Linear", count: 1, ... }
console.log(result.citedUrls.length);    // → 3
console.log(result.rankedRecommendations);
// → [{ position: 1, text: "Linear — fast, keyboard-driven..." }, ...]

What it returns

type LlmQuoteExtractionResult = {
  // Platform fingerprint based on signature phrases
  detectedPlatform: "chatgpt" | "claude" | "gemini" | "perplexity" | "aio" | "unknown";

  // Brand candidates — Title-Case tokens, ranked by mention count
  brandsNamed: Array<{
    brand: string;
    count: number;
    firstSeenIndex: number;
    context: string;       // ~140-char snippet around first mention
  }>;
  topBrand: string | null; // shorthand for brandsNamed[0]?.brand

  // URLs the model cited
  citedUrls: Array<{
    url: string;
    domain: string;
    contextSnippet: string;
  }>;

  // Numbered list items if the model produced a ranking
  rankedRecommendations: Array<{
    position: number;       // the number the LLM gave (1, 2, 3...)
    text: string;
  }>;

  // Stats
  inputWordCount: number;
  sentenceCount: number;
  brandDiversity: number;   // count of distinct brand candidates
};
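To make the shape of `citedUrls` and `rankedRecommendations` more concrete, here is a minimal sketch of how such fields could be filled with plain regexes. It is not the library's actual implementation: the helper names `sketchCitedUrls` and `sketchRankedRecommendations`, the regexes, and the 60-character context window are all assumptions made for illustration.

```typescript
type SketchCitedUrl = { url: string; domain: string; contextSnippet: string };
type SketchRankedItem = { position: number; text: string };

// Hypothetical helper: find http(s) URLs and record the domain plus nearby text.
function sketchCitedUrls(text: string, window = 60): SketchCitedUrl[] {
  const urlPattern = /https?:\/\/[^\s)\]>"']+/g;
  const found: SketchCitedUrl[] = [];
  let match: RegExpExecArray | null;
  while ((match = urlPattern.exec(text)) !== null) {
    // Trailing punctuation usually belongs to the sentence, not the URL.
    const url = match[0].replace(/[.,;:!?]+$/, "");
    // Host is whatever sits between the scheme and the first /, ?, # or :.
    const domain = url.replace(/^https?:\/\//, "").split(/[/?#:]/)[0].toLowerCase();
    const start = Math.max(0, match.index - window);
    const end = Math.min(text.length, match.index + url.length + window);
    found.push({ url, domain, contextSnippet: text.slice(start, end).trim() });
  }
  return found;
}

// Hypothetical helper: read "1. item" or "2) item" lines as ranked items.
function sketchRankedRecommendations(text: string): SketchRankedItem[] {
  const items: SketchRankedItem[] = [];
  for (const line of text.split("\n")) {
    const m = /^\s*(\d{1,2})[.)]\s+(.+)$/.exec(line);
    if (m) {
      // Strip markdown bold markers so the item text reads cleanly.
      items.push({ position: Number(m[1]), text: m[2].replace(/\*\*/g, "").trim() });
    }
  }
  return items;
}
```

Keeping `position` as the number the model wrote, rather than the array index, preserves gaps or restarts in the model's own numbering.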

What it does NOT do

No LLM call. This is pure regex + heuristics. If you want LLM-powered disambiguation (resolving "Apple the company" vs "apple the fruit"), pair this with a separate LLM step downstream.
No sentiment scoring. Returns mentions; doesn't classify them as positive / negative.
No brand normalization beyond case. "GitHub" and "Github" are grouped as one brand by case-insensitive matching, but "Linear" and "linear.app" are not yet merged.
No network requests. Doesn't fetch URLs to verify them, doesn't enrich domains.

These intentional scoping choices keep the library fast (<10ms for typical responses) and deterministic.
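Because brand names and cited domains are not merged, a caller who needs that link can add it downstream. The following is a minimal sketch under a simple assumption: a domain belongs to a brand when its first label equals the lowercased brand name. The function `sketchLinkBrandsToDomains` and the types `BrandLike` and `UrlLike` are hypothetical and not part of this package.

```typescript
// Minimal local shapes; only the fields this sketch needs.
type BrandLike = { brand: string };
type UrlLike = { domain: string };

// Hypothetical downstream step: attach matching cited domains to each brand.
function sketchLinkBrandsToDomains(
  brands: BrandLike[],
  urls: UrlLike[]
): Array<{ brand: string; domains: string[] }> {
  // Normalize and de-duplicate domains once, dropping a leading "www.".
  const domains = urls
    .map((u) => u.domain.toLowerCase().replace(/^www\./, ""))
    .filter((d, i, all) => all.indexOf(d) === i);

  return brands.map(({ brand }) => {
    // "Linear" -> "linear"; punctuation and spaces are removed.
    const key = brand.toLowerCase().replace(/[^a-z0-9]/g, "");
    // Match on the first label only: "linear.app" -> "linear".
    return { brand, domains: domains.filter((d) => d.split(".")[0] === key) };
  });
}
```

This rule is intentionally naive: it will miss brands whose domain differs from their name, so treat the result as a hint rather than a confirmed mapping.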

Platform fingerprint heuristics

The detector looks for signature phrases that are characteristic of each platform's response style:

| Platform | Signature signals |
|---|---|
| chatgpt | "Certainly!", "Here's...", structured markdown with **bold headers** |
| claude | "I'd be happy to help", "Let me", measured first-person tone |
| gemini | "Here are some", strong Google-style enumeration |
| perplexity | [1] [2] numbered footnote citations |
| aio | "Generative AI is experimental", concise paragraph form |
| unknown | Returned when no signature matches |

The fingerprint is heuristic — false positives are possible (especially on short responses).
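As an illustration of how signature-phrase matching of this kind can work, here is a minimal sketch that scores each platform by the number of phrase patterns found in the text. It is not the package's detector: the `SIGNALS` table, the scoring rule, and the name `sketchDetectPlatform` are assumptions, and only the phrase-based signals from the table are covered (tone and layout signals are left out).

```typescript
type SketchPlatform = "chatgpt" | "claude" | "gemini" | "perplexity" | "aio" | "unknown";

// Assumed patterns, taken from the phrases listed in the table above.
// None use the "g" flag, so RegExp.test stays stateless between calls.
const SIGNALS: Array<{ platform: SketchPlatform; patterns: RegExp[] }> = [
  { platform: "aio", patterns: [/Generative AI is experimental/i] },
  { platform: "perplexity", patterns: [/\[\d+\]/] },
  { platform: "claude", patterns: [/I['’]d be happy to help/i, /\bLet me\b/] },
  { platform: "gemini", patterns: [/\bHere are some\b/i] },
  { platform: "chatgpt", patterns: [/\bCertainly!/, /\bHere's\b/] },
];

// Returns the platform with the most matching patterns, or "unknown" if none match.
// On a tie, the platform listed earlier in SIGNALS wins.
function sketchDetectPlatform(text: string): SketchPlatform {
  let best: SketchPlatform = "unknown";
  let bestScore = 0;
  for (const { platform, patterns } of SIGNALS) {
    const score = patterns.filter((p) => p.test(text)).length;
    if (score > bestScore) {
      best = platform;
      bestScore = score;
    }
  }
  return best;
}
```

A single matching phrase is enough to win here, which is one reason short responses are prone to false positives.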

Brand candidate extraction

The parser extracts Title-Case tokens from the answer body and filters them against a stopword list of ~80 common non-brand words (Tuesday, January, Today, etc.). It's deliberately permissive — false positives are easier to filter downstream than false negatives are to recover.
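The approach described above can be sketched as follows. This is an approximation for illustration, not the library's source: the token regex, the seven-word `STOPWORDS` stand-in for the real list of roughly 80 words, and the function name `sketchBrandCandidates` are all assumptions.

```typescript
type SketchBrandCandidate = {
  brand: string;
  count: number;
  firstSeenIndex: number;
  context: string;
};

// Tiny stand-in for the stopword list, stored lowercase.
const STOPWORDS = new Set(["the", "each", "for", "here", "today", "tuesday", "january"]);

function sketchBrandCandidates(text: string, contextChars = 140): SketchBrandCandidate[] {
  // Capitalized tokens of two or more characters, e.g. "Linear" or "GitHub".
  const tokenPattern = /\b[A-Z][a-zA-Z0-9]+\b/g;
  const byKey = new Map<string, SketchBrandCandidate>();
  let match: RegExpExecArray | null;

  while ((match = tokenPattern.exec(text)) !== null) {
    const token = match[0];
    const key = token.toLowerCase(); // case-insensitive grouping
    if (STOPWORDS.has(key)) continue;

    const existing = byKey.get(key);
    if (existing) {
      existing.count += 1;
      continue;
    }

    // First mention: keep the original spelling and a snippet around it.
    const half = Math.floor(contextChars / 2);
    const start = Math.max(0, match.index - half);
    const end = Math.min(text.length, match.index + token.length + half);
    byKey.set(key, {
      brand: token,
      count: 1,
      firstSeenIndex: match.index,
      context: text.slice(start, end).trim(),
    });
  }

  // Most-mentioned first; earlier first mention breaks ties.
  return Array.from(byKey.values()).sort(
    (a, b) => b.count - a.count || a.firstSeenIndex - b.firstSeenIndex
  );
}
```

Every sentence-initial word that is not in the stopword list becomes a candidate, which shows why the output is a candidate list rather than a confirmed set of brands.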

For production brand attribution at scale, you'll want a downstream LLM disambiguation step that takes this candidate list and confirms which entries are real brand names and which are incidental Title-Case usage. That's the trade-off of being LLM-free: speed and zero cost in exchange for imperfect precision.

Where it came from

This library is the open-source extraction core of Citare, an AI search intelligence platform. The same parsing logic powers the free public tool at citare.ai/tools/llm-quote-extractor and serves as the parsing layer inside Citare's Brand Radar (5-engine weekly brand visibility measurement).

If you find this useful and want richer measurement — disambiguated brand attribution, cross-platform 50-cell weekly dispatches, persona-anchored measurement — the Citare free tier covers one project with weekly dispatches at no cost.

Contributing

Issues and PRs welcome at github.com/ravirdp/llm-quote-extractor. The library is intentionally small and focused — major feature additions should ship as separate packages that build on this one.

License

MIT. See LICENSE.
