diverse-lemmas
v1.0.0
Offline multilingual lemmatization for browser and Node.js. 50+ languages, on-demand downloads, local caching.
50+ languages • On-demand downloads • Local caching • Zero runtime dependencies
Features
- 🌍 50+ languages - Hebrew, Korean, Spanish, German, Russian, and many more
- 📦 Tiny core package - Just 15KB, language data downloaded on demand
- 💾 Smart caching - IndexedDB (browser) or filesystem (Node.js)
- ⚡ Fast lookups - Sub-microsecond after initial load
- 🔌 Works everywhere - Browser, Node.js, Deno, Bun
Installation
npm install @hg0428/diverse-lemmas
Quick Start
import { loadLanguage, lemmatize, LANGUAGES } from '@hg0428/diverse-lemmas';
// Load Spanish (downloads ~15MB on first use, then cached)
const es = await loadLanguage('es');
// Lemmatize words
es.lemmatizeWord('hablando');
// => { lemmas: ['hablar'], method: 'direct' }
es.lemmatizeWord('niños');
// => { lemmas: ['niño'], method: 'direct' }
// Or use the quick API
const result = await lemmatize('es', 'corriendo');
// => { lemmas: ['correr'], method: 'direct' }
API
Loading Languages
import { loadLanguage, loadLanguages, getLemmatizer } from '@hg0428/diverse-lemmas';
// Load a single language
const de = await loadLanguage('de');
// Load multiple languages in parallel
const langs = await loadLanguages(['en', 'es', 'fr']);
// => Map { 'en' => Lemmatizer, 'es' => Lemmatizer, 'fr' => Lemmatizer }
// Get already-loaded lemmatizer (null if not loaded)
const cached = getLemmatizer('de');
Lemmatization
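The result objects in this section share the shape { word, lemmas, method }, so they are easy to post-process. As a minimal sketch, the hypothetical helper countLemmas below (not part of the package) tallies lemma frequencies from such results. Taking the first entry of lemmas is an arbitrary choice made for illustration; the README does not say how multiple candidates are ordered.

```javascript
// Count how often each lemma occurs in a list of result objects.
// Assumes each result looks like { word, lemmas: string[], method }.
// countLemmas is a hypothetical helper, not part of diverse-lemmas.
function countLemmas(results) {
  const counts = new Map();
  for (const { lemmas } of results) {
    // A word may have several candidate lemmas; this sketch takes the first.
    const lemma = lemmas[0];
    if (lemma === undefined) continue; // skip words with no lemma
    counts.set(lemma, (counts.get(lemma) ?? 0) + 1);
  }
  return counts;
}

// Hand-written sample in the documented result shape
const sample = [
  { word: 'dogs', lemmas: ['dog'], method: 'direct' },
  { word: 'dog', lemmas: ['dog'], method: 'direct' },
  { word: 'running', lemmas: ['run'], method: 'direct' },
];
const counts = countLemmas(sample);
console.log(counts.get('dog'), counts.get('run')); // prints: 2 1
```

In real use the input would presumably be the array returned by lemmatizeWords or lemmatizeText.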
// Use a loaded language, e.g. English
const lemmatizer = await loadLanguage('en');
// Single word
lemmatizer.lemmatizeWord('running');
// => { lemmas: ['run'], method: 'direct' }
// Multiple words
lemmatizer.lemmatizeWords(['dogs', 'running', 'quickly']);
// => [{ word: 'dogs', lemmas: ['dog'], method: 'direct' }, ...]
// Full text (auto-tokenizes)
lemmatizer.lemmatizeText('The dogs were running quickly');
// => [{ word: 'The', lemmas: ['the'], method: 'direct' }, ...]
Cache Management
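Since language files can be large, an application may want to keep the cache under a size budget. The sketch below is one possible approach, assuming getCacheInfo returns { cachedAt, sizeKB } as in this section's example. pickEvictions and trimCache are hypothetical helpers, and the cache functions are passed in as an api object so the code runs without the package installed.

```javascript
// Decide which cached languages to drop so the total stays within budgetKB.
// Oldest entries (smallest cachedAt) are evicted first.
function pickEvictions(entries, budgetKB) {
  let total = entries.reduce((sum, e) => sum + e.sizeKB, 0);
  const oldestFirst = [...entries].sort((a, b) => a.cachedAt - b.cachedAt);
  const evict = [];
  for (const entry of oldestFirst) {
    if (total <= budgetKB) break;
    evict.push(entry.code);
    total -= entry.sizeKB;
  }
  return evict;
}

// `api` is expected to expose getCachedLanguages, getCacheInfo and
// removeLanguage; passing it in keeps the sketch independent of the package.
async function trimCache(api, budgetKB) {
  const codes = await api.getCachedLanguages();
  const entries = [];
  for (const code of codes) {
    const info = await api.getCacheInfo(code);
    if (info) entries.push({ code, ...info }); // skip entries without info
  }
  const evicted = pickEvictions(entries, budgetKB);
  for (const code of evicted) await api.removeLanguage(code);
  return evicted; // codes that were removed
}
```

With the package imported as a namespace (import * as lemmas from '@hg0428/diverse-lemmas'), a call might look like await trimCache(lemmas, 100 * 1024) for a 100MB budget.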
import {
getCachedLanguages,
isLanguageCached,
removeLanguage,
clearCache,
getCacheInfo
} from '@hg0428/diverse-lemmas';
// List cached languages
await getCachedLanguages();
// => ['en', 'es', 'de']
// Check if cached
await isLanguageCached('fr');
// => false
// Get cache info
await getCacheInfo('es');
// => { cachedAt: 1704067200000, sizeKB: 15360 }
// Remove from cache
await removeLanguage('de');
// Clear everything
await clearCache();
Language Info
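Language metadata can be used to ask the user before a large download starts. The following is a rough sketch with hypothetical helpers (shouldConfirmDownload, formatSizeMB). It assumes an info object with a sizeKB field and, optionally, the large flag shown on LANGUAGES entries; whether getLanguageInfo also returns large is an assumption, so the helper falls back to a size threshold.

```javascript
// Returns true when a download is big enough to ask the user first.
// `info` is assumed to carry sizeKB and possibly a boolean `large` flag.
function shouldConfirmDownload(info, alreadyCached, limitKB = 50 * 1024) {
  if (alreadyCached) return false; // nothing will be downloaded
  return info.large === true || info.sizeKB > limitKB;
}

// Format a size in KB as megabytes, e.g. for a confirmation prompt.
function formatSizeMB(sizeKB) {
  return `${(sizeKB / 1024).toFixed(1)} MB`;
}

console.log(formatSizeMB(15360)); // prints: 15.0 MB
```

A caller could combine this with getLanguageInfo(code) and await isLanguageCached(code) before calling loadLanguage(code).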
import { LANGUAGES, getSupportedLanguages, getLanguageInfo } from '@hg0428/diverse-lemmas';
// All supported languages
getSupportedLanguages();
// => ['he', 'ko', 'es', 'fr', 'it', 'en', 'de', ...]
// Get language metadata
getLanguageInfo('es');
// => { name: 'Spanish', source: 'simplemma', hasAmbiguity: true, sizeKB: 15360 }
// Large languages (>50MB) - consider warning users
LANGUAGES.fi.large // => true (Finnish: 97MB)
LANGUAGES.pl.large // => true (Polish: 97MB)
LANGUAGES.sw.large // => true (Swahili: 111MB)
Configuration
import { setCDN, getCDN } from '@hg0428/diverse-lemmas';
// Use a custom CDN or self-hosted files
setCDN('https://my-cdn.com/lemmas');
// Get current CDN
getCDN();
// => 'https://my-cdn.com/lemmas'
Supported Languages
| Category | Languages |
|----------|-----------|
| Full support (with ambiguity) | Hebrew, Korean, Spanish, French, Italian |
| Major European | English, German, Portuguese, Dutch, Russian, Polish, Swedish |
| Nordic | Danish, Norwegian (Bokmål/Nynorsk), Finnish, Icelandic |
| Eastern European | Czech, Slovak, Hungarian, Romanian, Bulgarian, Ukrainian |
| Celtic | Irish, Welsh, Scottish Gaelic |
| Other | Turkish, Indonesian, Latin, and 20+ more |
See LANGUAGES.md for the complete list with sizes.
How It Works
- Core package (@hg0428/diverse-lemmas) is tiny (~15KB)
- Language files are hosted on CDN (jsDelivr by default)
- On first use, the language is downloaded and cached locally
- Subsequent uses load instantly from local cache
- Browser: Uses IndexedDB for persistent storage
- Node.js: Uses filesystem (~/.diverse-lemmas/cache/)
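The download-then-cache flow above follows a common cache-aside pattern. The sketch below illustrates that pattern in general terms and is not the package's actual implementation. createLoader, store, and download are hypothetical names: store stands in for IndexedDB or the filesystem, and download stands in for the CDN fetch.

```javascript
// Create a loader that checks memory, then a persistent store, then the network.
// `store` needs async get(key) (undefined on a miss) and async set(key, value);
// `download` is an async function that fetches language data.
// All three names are hypothetical stand-ins for illustration.
function createLoader(store, download) {
  const memory = new Map(); // languages already loaded in this session
  return async function load(code) {
    if (memory.has(code)) return memory.get(code);
    let data = await store.get(code);
    if (data === undefined) {
      // First use: fetch the data and persist it for later sessions
      data = await download(code);
      await store.set(code, data);
    }
    memory.set(code, data);
    return data;
  };
}
```

The in-memory Map is what would make repeat lookups within a session cheap, while the persistent store is what avoids a second download after a restart.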
Data Sources
- Simplemma - 49 languages
- Hebrew - Custom dictionary + Stanza + verb conjugations
- Korean - Universal Dependencies Korean-GSD and Korean-Kaist
License
MIT
