@grouchoab/glossary
Extract glossaries and phrases from Markdown. Recursively scans a directory for .md/.mdx, tokenizes, and returns a mergeable glossary of terms (unigrams) and phrases (n‑grams) mapped to the files where they occur.
- Unigrams + phrases: surface concepts like event sourcing or bounded context.
- Noise controls: stopwords, min token length, min/max document frequency, edge stopword trimming for phrases.
- Mergeable: combine glossaries from multiple runs/repos without double counting.
- Pure TS, zero deps.
Install
npm i -D @grouchoab/glossary
# or
pnpm add -D @grouchoab/glossary

Usage
import { extractGlossary, mergeGlossaries } from '@grouchoab/glossary';
const g1 = await extractGlossary('./docs', {
includePhrases: true,
ngramMin: 2,
ngramMax: 3,
maxDocFreqRatio: 0.8,
});
const g2 = await extractGlossary('./more-docs', {
includePhrases: false,
});
const combined = mergeGlossaries([g1, g2]);
console.log(JSON.stringify(combined, null, 2));

API
extractGlossary(rootPath: string, options?: ExtractOptions): Promise<Glossary>
Walks rootPath, reads every .md/.mdx, strips Markdown decorations, tokenizes, and returns a JSON‑friendly glossary.
Options (selected):
- exts (string[]): file extensions to include (default [".md", ".mdx"]).
- includeHidden (boolean): include dotfiles/directories (default false).
- stopwords (Iterable<string>): extra stopwords (case-insensitive unless caseSensitive).
- minLen (number): minimum token length (default 2).
- minDocFreq (number): minimum number of documents containing a term (default 1).
- maxDocFreqRatio (number): drop overly common terms (default 0.85).
- includePhrases (boolean): include n‑grams (default false).
- ngramMin / ngramMax (number): n‑gram range (defaults 2..3).
- dropPhraseEdgesIfStopword (boolean): drop phrases whose first/last token is a stopword (default true).
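For intuition, the two document-frequency options gate terms roughly like the predicate below. This is an illustrative sketch of the documented semantics, not the library's code, and keepTerm is a hypothetical helper:

```typescript
// Sketch: how minDocFreq and maxDocFreqRatio gate a term.
// A term survives if it appears in at least `minDocFreq` documents
// and in at most `maxDocFreqRatio * totalDocs` documents.
function keepTerm(
  docFreq: number,
  totalDocs: number,
  minDocFreq = 1,
  maxDocFreqRatio = 0.85,
): boolean {
  return docFreq >= minDocFreq && docFreq / totalDocs <= maxDocFreqRatio;
}

// With 10 docs and the defaults, a term in 9 of them is dropped (0.9 > 0.85):
keepTerm(9, 10); // false
keepTerm(3, 10); // true
```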
Return type:
interface Glossary {
meta: { version: number; createdAt: string; totalDocs: number; rootPaths: string[] };
terms: Array<{ term: string; docFreq: number; files: string[]; isPhrase?: boolean }>;
}

mergeGlossaries(glossaries: Glossary[]): Glossary
Merges glossaries by unioning file sets per term and recomputing docFreq.
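That merge behaviour can be sketched as follows. This is illustrative only, assuming the term shape from the Glossary interface above; mergeTerms is a hypothetical stand-in, not the library's implementation:

```typescript
interface Term { term: string; docFreq: number; files: string[]; isPhrase?: boolean }

// Sketch: union file sets per term, then recompute docFreq from the union.
function mergeTerms(lists: Term[][]): Term[] {
  const byTerm = new Map<string, Set<string>>();
  for (const list of lists) {
    for (const t of list) {
      const files = byTerm.get(t.term) ?? new Set<string>();
      for (const f of t.files) files.add(f);
      byTerm.set(t.term, files);
    }
  }
  return [...byTerm].map(([term, files]) => ({
    term,
    files: [...files].sort(),
    docFreq: files.size, // recomputed, so a shared file is not double counted
  }));
}

// "cqrs" appears in a.md in both runs; the merged docFreq is 2, not 3.
mergeTerms([
  [{ term: "cqrs", docFreq: 2, files: ["a.md", "b.md"] }],
  [{ term: "cqrs", docFreq: 1, files: ["a.md"] }],
]);
```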
CLI
A small CLI is included:
# index ./docs with phrases (bigrams+trigrams), drop ultra-common terms >80%
npx groucho-glossary ./docs --phrases --max=0.8 > glossary.json
# Or after local install
groucho-glossary ./docs --phrases --max=0.8 > glossary.json
# knobs
--min=<n> # min document frequency (default 1)
--max=<ratio> # max document frequency ratio (default 0.85)
--phrases # include n-grams
--nmin=<n> # min n for n-grams (default 2)
--nmax=<n>    # max n for n-grams (default 3)

Design Notes
- Tokenizer: strips fenced/inline code, HTML, frontmatter, and Markdown punctuation; diacritics are removed for stable matching.
- Stopwords: a lightweight English set is built in; extend it via options.stopwords.
- Phrases: sequential n‑grams over normalized tokens; the optional edge-stopword drop reduces noise like "of the".
- Combinability: paths in results are normalized to forward slashes.
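Diacritic removal of the kind described here is commonly done with Unicode NFD decomposition. The sketch below shows the general technique with a hypothetical normalizeToken helper; it is not the package's tokenizer:

```typescript
// Sketch: lowercase, decompose to NFD, then strip combining marks
// so "Café" and "cafe" normalize to the same token.
function normalizeToken(raw: string): string {
  return raw
    .toLowerCase()
    .normalize("NFD")
    .replace(/\p{M}/gu, ""); // drop combining diacritical marks
}

normalizeToken("Café");  // "cafe"
normalizeToken("naïve"); // "naive"
```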
License
MIT
