llm-quote-extractor
v0.1.0
Parse any AI response (ChatGPT, Claude, Gemini, Perplexity, Google AI Overview) for brand mentions, cited URLs, ranked recommendations, and platform fingerprint. Pure CPU regex. No LLM call. Zero network.
A focused TypeScript library that takes an LLM-generated text response and returns a structured extraction — what brands the model named, what URLs it cited, what it recommended in what order, and which platform the response likely came from. Useful for AI search visibility tracking, brand monitoring research, citation analysis, and answer-engine optimization (AEO) workflows.
The same parsing pipeline powers the free Citare LLM Quote Extractor web tool; this package is its open-source extraction core.
Install
npm install llm-quote-extractor
# or
pnpm add llm-quote-extractor
# or
yarn add llm-quote-extractor
Quick start
import { extractFromLlmText } from "llm-quote-extractor";
const pasted = `
I'd be happy to help! For project management with AI workflows, here are my top picks:
1. **Linear** — fast, keyboard-driven, beloved by engineers (https://linear.app)
2. **Notion** — flexible knowledge base + project tracker (https://notion.so)
3. **Asana** — broad PM tool with strong integrations (https://asana.com)
Each has its strengths depending on team size and workflow style.
`;
const result = extractFromLlmText(pasted);
console.log(result.detectedPlatform); // → "claude" (matches the "I'd be happy to help!" signature)
console.log(result.brandsNamed[0]); // → { brand: "Linear", count: 1, ... }
console.log(result.citedUrls.length); // → 3
console.log(result.rankedRecommendations);
// → [{ position: 1, text: "Linear — fast, keyboard-driven..." }, ...]
What it returns
type LlmQuoteExtractionResult = {
// Platform fingerprint based on signature phrases
detectedPlatform: "chatgpt" | "claude" | "gemini" | "perplexity" | "aio" | "unknown";
// Brand candidates — Title-Case tokens, ranked by mention count
brandsNamed: Array<{
brand: string;
count: number;
firstSeenIndex: number;
context: string; // ~140-char snippet around first mention
}>;
topBrand: string | null; // shorthand for brandsNamed[0]?.brand
// URLs the model cited
citedUrls: Array<{
url: string;
domain: string;
contextSnippet: string;
}>;
// Numbered list items if the model produced a ranking
rankedRecommendations: Array<{
position: number; // the number the LLM gave (1, 2, 3...)
text: string;
}>;
// Stats
inputWordCount: number;
sentenceCount: number;
brandDiversity: number; // count of distinct brand candidates
};
What it does NOT do
- No LLM call. This is pure regex + heuristics. If you want LLM-powered disambiguation (resolving "Apple the company" vs "apple the fruit"), pair this with a separate LLM step downstream.
- No sentiment scoring. Returns mentions; doesn't classify them as positive / negative.
- No brand normalization beyond case-insensitive grouping. "GitHub" and "Github" are counted as the same brand, but "Linear" and "linear.app" are not yet merged.
- No network requests. Doesn't fetch URLs to verify them, doesn't enrich domains.
These intentional scoping choices keep the library fast (<10ms for typical responses) and deterministic.
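The "Linear" versus "linear.app" gap noted above can be narrowed in your own code after extraction. The following is a minimal sketch, not part of the package: `mergeBrandsWithDomains` is a hypothetical helper, and the matching rule (the first label of the domain equals the brand name once lowercased and stripped of non-alphanumerics) is an assumption that will miss brands whose domain differs from their name. The two types redeclare a subset of the result shape so the sketch stands alone.

```typescript
// Subset of the documented result shape, redeclared so the sketch is self-contained.
type BrandMention = { brand: string; count: number };
type CitedUrl = { url: string; domain: string };
type MergedBrand = BrandMention & { domains: string[] };

// Hypothetical helper (not exported by llm-quote-extractor).
// Attaches a cited domain to a brand when the domain's first label, after an
// optional "www." prefix, equals the brand name with non-alphanumerics removed.
function mergeBrandsWithDomains(
  brands: BrandMention[],
  urls: CitedUrl[],
): MergedBrand[] {
  const normalize = (s: string): string =>
    s.toLowerCase().replace(/[^a-z0-9]/g, "");

  return brands.map((b) => {
    const key = normalize(b.brand);
    const matches = urls
      .map((u) => u.domain.toLowerCase().replace(/^www\./, ""))
      .filter((d) => normalize(d.split(".")[0] ?? "") === key);
    // De-duplicate while keeping first-seen order.
    const domains = matches.filter((d, i) => matches.indexOf(d) === i);
    return { ...b, domains };
  });
}

const merged = mergeBrandsWithDomains(
  [
    { brand: "Linear", count: 1 },
    { brand: "Notion", count: 1 },
    { brand: "Engineers", count: 1 },
  ],
  [
    { url: "https://linear.app", domain: "linear.app" },
    { url: "https://notion.so", domain: "notion.so" },
  ],
);
```

A candidate with no matching domain, such as "Engineers" here, keeps an empty `domains` array, which can also serve as a weak signal that it is not a brand.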
Platform fingerprint heuristics
The detector looks for signature phrases characteristic of each platform's response style:
| Platform | Signature signals |
|---|---|
| chatgpt | "Certainly!", "Here's...", structured markdown with **bold headers** |
| claude | "I'd be happy to help", "Let me", measured first-person tone |
| gemini | "Here are some", strong Google-style enumeration |
| perplexity | [1] [2] numbered footnote citations |
| aio | "Generative AI is experimental", concise paragraph form |
| unknown | Returned when no signature matches |
The fingerprint is heuristic — false positives are possible (especially on short responses).
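One way to picture this kind of detector is a table of regular expressions scored per platform. The sketch below is illustrative only: the patterns paraphrase the table above, and the scoring rule (one point per matching pattern, highest score wins, earlier rows win ties) is an assumption, since the package's real signature list and weighting are not described here. `SIGNATURES` and `fingerprint` are hypothetical names.

```typescript
type Platform = "chatgpt" | "claude" | "gemini" | "perplexity" | "aio" | "unknown";

// Illustrative patterns paraphrased from the table above; not the library's actual list.
const SIGNATURES: Array<{ platform: Platform; patterns: RegExp[] }> = [
  { platform: "chatgpt", patterns: [/\bCertainly!/, /^Here's\b/m] },
  { platform: "claude", patterns: [/\bI'd be happy to help\b/i, /\bLet me\b/] },
  { platform: "gemini", patterns: [/\bHere are some\b/i] },
  { platform: "perplexity", patterns: [/\[\d+\]/] },
  { platform: "aio", patterns: [/Generative AI is experimental/i] },
];

// Hypothetical scorer: one point per matching pattern, highest score wins,
// earlier table rows win ties, and no match at all yields "unknown".
function fingerprint(text: string): Platform {
  let best: Platform = "unknown";
  let bestScore = 0;
  for (const { platform, patterns } of SIGNATURES) {
    const score = patterns.filter((p) => p.test(text)).length;
    if (score > bestScore) {
      best = platform;
      bestScore = score;
    }
  }
  return best;
}

const claudeLike = fingerprint("I'd be happy to help! Let me compare the options.");
const perplexityLike = fingerprint("Linear is widely used [1] and Notion is flexible [2].");
const noSignal = fingerprint("ok");
```

Because a single incidental phrase is enough to score a point in a scheme like this, short inputs are the most likely to be misattributed.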
Brand candidate extraction
The parser extracts Title-Case tokens from the answer body and filters them against a stopword list of ~80 common non-brand words (Tuesday, January, Today, etc.). It's deliberately permissive — false positives are easier to filter downstream than false negatives are to recover.
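The approach described above can be sketched in a few lines. This is a hypothetical re-implementation for illustration, not the package's source: it assumes a "Title-Case token" is an uppercase letter followed by at least one letter or digit, uses a tiny stand-in stopword set rather than the ~80-word list, and groups candidates case-insensitively.

```typescript
type BrandCandidate = { brand: string; count: number; firstSeenIndex: number };

// Tiny illustrative stopword set; the real list is described as about 80 words.
const STOPWORDS = new Set(["tuesday", "january", "today", "each", "for", "the"]);

// Hypothetical extractor: scans for Title-Case tokens, drops stopwords,
// and groups the rest case-insensitively.
function extractBrandCandidates(text: string): BrandCandidate[] {
  const byKey = new Map<string, BrandCandidate>();
  const tokenPattern = /\b[A-Z][A-Za-z0-9]+\b/g;
  let match: RegExpExecArray | null;
  while ((match = tokenPattern.exec(text)) !== null) {
    const token = match[0];
    const key = token.toLowerCase(); // "GitHub" and "Github" share one key
    if (STOPWORDS.has(key)) continue;
    const existing = byKey.get(key);
    if (existing) {
      existing.count += 1;
    } else {
      // The first spelling seen becomes the display name.
      byKey.set(key, { brand: token, count: 1, firstSeenIndex: match.index });
    }
  }
  // Rank by mention count, breaking ties by earliest appearance.
  return Array.from(byKey.values()).sort(
    (x, y) => y.count - x.count || x.firstSeenIndex - y.firstSeenIndex,
  );
}

const candidates = extractBrandCandidates(
  "Today GitHub and Linear lead. Github is popular. Each has strengths.",
);
// candidates: GitHub (count 2, index 6), then Linear (count 1, index 17)
```

Any capitalized word that is not in the stopword set survives this pass, which is the permissive behavior the paragraph above describes.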
For production brand attribution at scale, you'll want a downstream LLM disambiguation step that takes this candidate list and confirms which entries are real brand names and which are incidental Title-Case usage. That's the trade-off of being LLM-free: speed and zero cost in exchange for imperfect precision.
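The shape of such a downstream step might look like the sketch below. Everything in it is hypothetical: `Classifier` and `confirmBrands` are illustrative names, and a static lookup stands in for the LLM call so the example runs offline. A real classifier would most likely be asynchronous and would use the `context` snippet to decide.

```typescript
// Subset of the documented brandsNamed entry, redeclared so the sketch is self-contained.
type Candidate = { brand: string; count: number; context: string };
type Verdict = "brand" | "not_brand";

// Hypothetical classifier signature. In practice this would wrap an LLM call;
// it is synchronous here only to keep the sketch runnable without a network.
type Classifier = (candidate: Candidate) => Verdict;

// Keeps only the candidates the classifier confirms as brands.
function confirmBrands(candidates: Candidate[], classify: Classifier): Candidate[] {
  return candidates.filter((c) => classify(c) === "brand");
}

// Stand-in classifier backed by a fixed allowlist.
const knownBrands = new Set(["linear", "notion", "asana"]);
const lookupClassifier: Classifier = (c) =>
  knownBrands.has(c.brand.toLowerCase()) ? "brand" : "not_brand";

const confirmed = confirmBrands(
  [
    { brand: "Linear", count: 1, context: "1. Linear: fast, keyboard-driven" },
    { brand: "Engineers", count: 1, context: "beloved by Engineers" },
    { brand: "Asana", count: 1, context: "3. Asana: broad PM tool" },
  ],
  lookupClassifier,
);
```

Keeping the classifier behind a function type means the extraction step stays deterministic while the confirmation step can be swapped or cached independently.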
Where it came from
This library is the open-source extraction core of Citare, an AI search intelligence platform. The same parsing logic runs the free public tool at citare.ai/tools/llm-quote-extractor and the parsing layer inside Citare's Brand Radar (5-engine weekly brand visibility measurement).
If you find this useful and want richer measurement — disambiguated brand attribution, cross-platform 50-cell weekly dispatches, persona-anchored measurement — the Citare free tier covers one project with weekly dispatches at no cost.
Contributing
Issues and PRs welcome at github.com/ravirdp/llm-quote-extractor. The library is intentionally small and focused — major feature additions should ship as separate packages that build on this one.
License
MIT. See LICENSE.
