@grouchoab/glossary
Extract glossaries and phrases from Markdown. Recursively scans a directory for .md/.mdx, tokenizes, and returns a mergeable glossary of terms (unigrams) and phrases (n‑grams) mapped to the files where they occur.
- Unigrams + phrases: surface concepts like event sourcing or bounded context.
- Noise controls: stopwords, min token length, min/max document frequency, edge stopword trimming for phrases.
- Mergeable: combine glossaries from multiple runs/repos without double counting.
- Pure TS, zero deps.
Install
npm i -D @grouchoab/glossary
# or
pnpm add -D @grouchoab/glossary

Usage
import { extractGlossary, mergeGlossaries } from '@grouchoab/glossary';
const g1 = await extractGlossary('./docs', {
includePhrases: true,
ngramMin: 2,
ngramMax: 3,
maxDocFreqRatio: 0.8,
});
const g2 = await extractGlossary('./more-docs', {
includePhrases: false,
});
const combined = mergeGlossaries([g1, g2]);
console.log(JSON.stringify(combined, null, 2));

API
extractGlossary(rootPath: string, options?: ExtractOptions): Promise<Glossary>
Walks rootPath, reads every .md/.mdx, strips Markdown decorations, tokenizes, and returns a JSON‑friendly glossary.
Options (selected):
- exts (string[]): file extensions to include (default [".md", ".mdx"]).
- includeHidden (boolean): include dotfiles/directories (default false).
- stopwords (Iterable<string>): extra stopwords (case-insensitive unless caseSensitive).
- minLen (number): minimum token length (default 2).
- minDocFreq (number): minimum number of documents containing a term (default 1).
- maxDocFreqRatio (number): drop overly common terms (default 0.85).
- includePhrases (boolean): include n‑grams (default false).
- ngramMin / ngramMax (number): n‑gram range (defaults 2..3).
- dropPhraseEdgesIfStopword (boolean): drop phrases whose first/last token is a stopword (default true).
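For intuition, the two document-frequency options gate terms roughly like the predicate below. This is an illustrative sketch of the documented semantics, not the library's code, and keepTerm is a hypothetical helper:

```typescript
// Sketch: how minDocFreq and maxDocFreqRatio gate a term.
// A term survives if it appears in at least `minDocFreq` documents
// and in at most `maxDocFreqRatio * totalDocs` documents.
function keepTerm(
  docFreq: number,
  totalDocs: number,
  minDocFreq = 1,
  maxDocFreqRatio = 0.85,
): boolean {
  return docFreq >= minDocFreq && docFreq / totalDocs <= maxDocFreqRatio;
}

// With 10 docs and the defaults, a term in 9 of them is dropped (0.9 > 0.85):
keepTerm(9, 10); // false
keepTerm(3, 10); // true
```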
Return type:
interface Glossary {
meta: { version: number; createdAt: string; totalDocs: number; rootPaths: string[] };
terms: Array<{ term: string; docFreq: number; files: string[]; isPhrase?: boolean }>;
}

mergeGlossaries(glossaries: Glossary[]): Glossary
Merges glossaries by unioning file sets per term and recomputing docFreq.
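That merge behaviour can be sketched as follows. This is illustrative only, assuming the term shape from the Glossary interface above; mergeTerms is a hypothetical stand-in, not the library's implementation:

```typescript
interface Term { term: string; docFreq: number; files: string[]; isPhrase?: boolean }

// Sketch: union file sets per term, then recompute docFreq from the union.
function mergeTerms(lists: Term[][]): Term[] {
  const byTerm = new Map<string, Set<string>>();
  for (const list of lists) {
    for (const t of list) {
      const files = byTerm.get(t.term) ?? new Set<string>();
      for (const f of t.files) files.add(f);
      byTerm.set(t.term, files);
    }
  }
  return [...byTerm].map(([term, files]) => ({
    term,
    files: [...files].sort(),
    docFreq: files.size, // recomputed, so a shared file is not double counted
  }));
}

// "cqrs" appears in a.md in both runs; the merged docFreq is 2, not 3.
mergeTerms([
  [{ term: "cqrs", docFreq: 2, files: ["a.md", "b.md"] }],
  [{ term: "cqrs", docFreq: 1, files: ["a.md"] }],
]);
```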
CLI
A small CLI is included:
# index ./docs with phrases (bigrams+trigrams), drop ultra-common terms >80%
npx groucho-glossary ./docs --phrases --max=0.8 > glossary.json
# Or after local install
groucho-glossary ./docs --phrases --max=0.8 > glossary.json
# knobs
--min=<n> # min document frequency (default 1)
--max=<ratio> # max document frequency ratio (default 0.85)
--phrases # include n-grams
--nmin=<n> # min n for n-grams (default 2)
--nmax=<n>    # max n for n-grams (default 3)

Design Notes
- Tokenizer: strips fenced/inline code, HTML, frontmatter, and Markdown punctuation; diacritics are removed for stable matching.
- Stopwords: a lightweight English set is built in; extend it via options.stopwords.
- Phrases: sequential n‑grams over normalized tokens; the optional edge-stopword drop reduces noise like "of the".
- Combinability: paths in results are normalized to forward slashes.
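Diacritic removal of the kind described here is commonly done with Unicode NFD decomposition. The sketch below shows the general technique with a hypothetical normalizeToken helper; it is not the package's tokenizer:

```typescript
// Sketch: lowercase, decompose to NFD, then strip combining marks
// so "Café" and "cafe" normalize to the same token.
function normalizeToken(raw: string): string {
  return raw
    .toLowerCase()
    .normalize("NFD")
    .replace(/\p{M}/gu, ""); // drop combining diacritical marks
}

normalizeToken("Café");  // "cafe"
normalizeToken("naïve"); // "naive"
```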
License
MIT
