@veldica/prose-analyzer
v1.3.0
Published
Deterministic prose style metrics for measuring variety, density, repetition, and narrative texture. Designed for AI editorial pipelines and writing tools.
Maintainers
Readme
@veldica/prose-analyzer
Deterministic prose style metrics for measuring variety, density, repetition, and narrative texture. While readability (via @veldica/readability) measures accessibility, prose analysis measures the sophistication and vividness of writing.
Designed for high-precision editorial pipelines, creative writing tools, and content auditing systems where deterministic, explainable signals are required.
Why Prose Analysis?
In an era of AI-generated content, understanding the "texture" of prose is critical:
- Measuring "AI-ness": Detect the flat lexical fingerprints of LLMs by measuring moving-average diversity and repetition ratios.
- "Show, Don't Tell" Auditing: Use Scene Density and Sensory Language proxies to mathematically evaluate the vividness of narrative sections.
- Deterministic Consistency: Unlike LLM-based critiques, these scores are 100% repeatable, local-first, and require zero network calls.
- Length-Independent Metrics: Implements an $O(N)$ MATTR (Moving Average TTR) algorithm to accurately and efficiently measure vocabulary richness in texts of any length, from single paragraphs to full-length novels.
Features
- High-Performance Lexical Analysis: Optimized $O(N)$ MATTR algorithm and
Map-based frequency tracking for massive datasets. - Narrative Texture Heuristics: Deterministic proxies for "active" vs. "passive" writing styles.
- Linguistic Signal Detection: Categorization of sensory and abstract language usage with "Sanitize Once" performance.
- Resilient Logic: Hardened against regex backtracking and synchronous blocking on large inputs.
- Tree-Shakable: ESM-first design with modular exports for lexical, narrative, and categorization tools.
Core Capabilities
1. Lexical Variety & Density
Measures how "rich" the vocabulary is and the ratio of content-carrying words vs. structural filler (stopwords).
- TTR (Type-Token Ratio): Basic unique word ratio.
- MATTR (Moving-Average TTR): A more robust measure of diversity that isn't biased by text length.
- Lexical Density: The percentage of content words (nouns, verbs, adjectives) using high-precision filtering.
2. Repetition Identification
Automatically detects overused content words while ignoring common stopwords. Provides both raw counts and frequency ratios.
3. Narrative Texture
Heuristic-based signals for analyzing the "pacing" of prose:
- Dialogue Ratio: The balance between spoken word and narration.
- Scene Density: A proxy for "active" writing, calculated by correlating dialogue frequency, high sentence length variation, and short sentence clusters.
- Exposition Density: Detects "wall of text" explanation patterns based on paragraph length and sentence complexity.
4. Linguistic Signal Detection
- Sensory Language: Detection of words related to sight, sound, smell, taste, and touch.
- Abstract Concepts: Identification of non-concrete terminology (e.g., "policy", "concept", "theory").
Installation
npm install @veldica/prose-analyzerQuick Start
Comprehensive Analysis (Recommended)
The easiest way to get a full stylistic profile is using the analyzeDocument bridge with the Veldica tokenizer.
import { tokenize } from "@veldica/prose-tokenizer";
import { analyzeDocument } from "@veldica/prose-analyzer";
const text = `The sun was a bright, heavy disk. "It is time," he said.
The sky turned from blue to black as the shadows grew long.`;
// 1. Prepare text with the Veldica tokenizer
const doc = tokenize(text);
// 2. Run combined analysis using the bridge
const results = analyzeDocument(doc);
console.log(results.lexical.lexical_diversity_mattr); // 0.85 (High variety)
console.log(results.narrative.scene_density_proxy); // 0.45 (Active pacing)
console.log(results.narrative.sensory_term_density); // 0.23 (Vivid imagery)Advanced Usage (Manual Mapping)
If you are using a custom tokenizer, you can call analyzeProse directly by providing the required counts.
import { analyzeProse } from "@veldica/prose-analyzer";
const results = analyzeProse(
sentences,
words,
sentenceWordCounts,
paragraphWordCounts,
{
lexical: { windowSize: 100 }
}
);Advanced Lexical Options
You can tune the analysis (e.g., changing the MATTR sliding window) to suit specific content types.
import { analyzeLexical } from "@veldica/prose-analyzer";
const lexical = analyzeLexical(words, {
windowSize: 100, // Larger window for long-form prose
topRepeatedCount: 5, // Return fewer, more significant repetitions
});Signal Detection
Use the underlying categorization tools for custom filtering logic.
import { isSensory, isAbstract } from "@veldica/prose-analyzer";
isSensory("bright"); // true
isAbstract("policy"); // trueAPI Reference
analyzeDocument(doc, options?): ProseAnalysisResults
The recommended entry point for users of @veldica/prose-tokenizer. Automatically maps nested counts and runs a full analysis.
doc: ATokenizedDocumentfrom the Veldica tokenizer.options: (Optional) Object containinglexicalandnarrativeconfigurations.
analyzeProse(sentences, words, sentenceWordCounts, paragraphWordCounts, options?): ProseAnalysisResults
Low-level combined entry point. Useful if you are using a custom tokenizer.
options: (Optional) Object containinglexicalandnarrativeconfigurations.
analyzeLexical(words: string[], options?: LexicalAnalysisOptions): LexicalBase
Performs frequency and diversity analysis.
windowSize: (Default: 50) The sliding window for MATTR calculation.topRepeatedCount: (Default: 10) Number of repeated words to return.isStopword: (Optional) Custom function to override the default stopword detector.
analyzeNarrative(sentences, words, sentenceWordCounts, paragraphWordCounts, options?): FictionMetrics
Calculates narrative pacing and texture.
options.isSensory: (Optional) Custom function for sensory signal detection.options.isAbstract: (Optional) Custom function for abstract concept detection.
isSensory(word: string, isNormalized?: boolean): boolean
Checks if a word belongs to the sensory language category. Set isNormalized to true if the word is already lowercased and stripped of non-alpha characters.
isAbstract(word: string, isNormalized?: boolean): boolean
Checks if a word belongs to the abstract concepts category. Set isNormalized to true if the word is already lowercased and stripped of non-alpha characters.
Practical Use Cases
- AI Editorial Pipelines: Score AI-generated prose for "human-like" lexical variety.
- Creative Writing Tools: Provide real-time feedback on "show vs. tell" using scene density proxies.
- Content Quality Assurance: Identify repetitive or exposition-heavy sections in technical documentation.
- Style Consistency: Ensure multiple authors maintain a consistent narrative texture across a large project.
Limitations
- Language Support: Heuristics and categorizations (sensory/abstract) are currently optimized for English prose.
- Dialogue Detection: Relies on common quotation mark heuristics. Complex formatting (e.g., non-standard punctuation in experimental fiction) may require custom options.
- Contextual Nuance: Like all deterministic metrics, these scores measure "signals" rather than "meaning." They should be used to support, not replace, human editorial judgment.
Philosophy
We believe prose analysis should be deterministic and explainable. Unlike LLM-based scores which are opaque and probabilistic, @veldica/prose-analyzer provides raw linguistic signals that are:
- Fast: Executed in milliseconds without network calls.
- Private: No data ever leaves your machine.
- Consistent: The same text always yields the same result.
- Signal-Focused: We measure the mechanics of the prose (how it is built) rather than attempting to interpret its meaning.
The Veldica Suite
This package is part of the Veldica ecosystem of modular, high-performance linguistic tools:
@veldica/prose-tokenizer: High-precision text segmentation.@veldica/readability: Industry-standard readability formulas.@veldica/prose-analyzer: Stylistic variety and narrative texture.
Contributing
Contributions are welcome! Please follow our established technical mandates:
- Fork the Project
- Create your Feature Branch (
git checkout -b feature/NewMetric) - Ensure all changes pass the unit test suite (
npm test). - Push to the Branch (
git push origin feature/NewMetric) - Open a Pull Request
Ownership & Authority
This package is maintained by Veldica as a core part of our writing analysis platform. It is built for production environments that demand mathematical precision and deterministic reliability.
- Full Documentation: veldica.com/prose-analyzer
- Veldica Platform: veldica.com
- Report Bugs: GitHub Issues
License
MIT © Veldica
