ua-word-stress-wasm
v0.5.2
Ukrainian word stress engine — dictionary lookup, IPA transcription, morphology. Rust/WASM, no init() required.
Ukrainian word stress lookup + full IPA phonetic transcription, compiled to WebAssembly from Rust.
The dictionary is embedded in the WASM binary — no separate data file to host or fetch.
Works in browsers (ESM) and Node.js.
| Feature | ua-word-stress (TS trie) | ua-word-stress-wasm (this) |
|---|---|---|
| Stress lookup | ✓ | ✓ |
| Full IPA transcription | — | ✓ |
| Syllabification | — | ✓ |
| Morphology (POS, UD features, lemma) | — | ✓ |
| Data file to serve | 9.4 MB .ctrie.gz | none (embedded) |
| WASM binary size | — | ~14 MB |
Database statistics
| Metric | Value |
|---|---|
| Word forms | 3,008,723 |
| Binary format | bzip2-compressed binary (V2) |
| Phonetic pipeline passes | 6 |
| IPA standard | IPA (Steriopolo 2012, Savchenko 2014) |
Installation
# pnpm (recommended)
pnpm add ua-word-stress-wasm
# npm
npm install ua-word-stress-wasm
# yarn
yarn add ua-word-stress-wasm
Quick start
import { lookup, mark, stressIndex, stressIndexBatch, wordCount } from 'ua-word-stress-wasm';
// No init() needed — the WASM is loaded automatically by your bundler.
// Stress-marked word
mark('університет'); // → 'університе́т'
mark('замок'); // → 'за́мок' (first reading)
// Stress index (0-based index of the stressed syllable, counted by vowels; -1 = unknown)
stressIndex('мама'); // → 0
stressIndex('університет'); // → 4
// Full lookup with IPA, syllables and morphology
const r = lookup('замок');
r.readings[0].stressedForm; // → 'за́мок'
r.readings[0].ipa; // → 'zɑmɔk'
r.readings[0].ipaSyllables; // → ['ˈzɑ', 'mɔk']
r.readings[0].morph[0].pos; // → ['NOUN']
// Batch lookup — much faster than calling stressIndex() in a loop
const indices = stressIndexBatch(['мама', 'тато', 'xyz']);
// → Int32Array [0, 0, -1]
// Dictionary size
wordCount(); // → 3008723
Vite / webpack / Rollup
No configuration needed. Vite, webpack 5, and Rollup all handle .wasm imports natively:
import { mark } from 'ua-word-stress-wasm';
mark('університет'); // → 'університе́т'
Node.js (ESM, Node 20+)
import { mark } from 'ua-word-stress-wasm';
console.log(mark('привіт')); // → 'приві́т'
API reference
mark(word: string): string
Returns the word with a combining acute accent (U+0301) placed over the stressed vowel.
Returns the word unchanged if it is not in the dictionary.
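For a word that is not in the dictionary but whose stress position is known from elsewhere, the same kind of marking can be approximated by hand. The sketch below is not part of the package: `applyStress` and `UA_VOWELS` are hypothetical names, and it assumes that every Ukrainian vowel letter forms exactly one syllable, so a syllable index can be found by counting vowels.

```typescript
// Ukrainian vowel letters. Assumption: one vowel letter = one syllable nucleus.
const UA_VOWELS = 'аеєиіїоуюяАЕЄИІЇОУЮЯ';
const COMBINING_ACUTE = '\u0301';

// Hypothetical helper, not exported by ua-word-stress-wasm.
// Inserts U+0301 after the vowel of the syllable at `syllableIndex` (0-based).
function applyStress(word: string, syllableIndex: number): string {
  if (syllableIndex < 0) return word; // -1 means "unknown": leave the word as-is
  let vowelsSeen = 0;
  for (let i = 0; i < word.length; i++) {
    if (UA_VOWELS.indexOf(word.charAt(i)) === -1) continue; // skip consonants
    if (vowelsSeen === syllableIndex) {
      return word.slice(0, i + 1) + COMBINING_ACUTE + word.slice(i + 1);
    }
    vowelsSeen++;
  }
  return word; // index points past the last vowel: nothing to mark
}
```

Paired with an index from `stressIndex()` this should reproduce the documented behaviour for simple words, though the package may normalise input in ways this sketch does not.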
mark('мама'); // → 'ма́ма'
mark('університет'); // → 'університе́т'
mark('xyz'); // → 'xyz' (unknown: returned as-is)
markBatch(words: Array<string>): Array<string>
Batch variant of mark. Takes an array of words and returns an array of stress-marked strings.
Words not in the dictionary are returned unchanged.
Significantly faster than calling mark() in a loop — ideal for processing pasted text blocks.
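One way to apply this to a pasted text block is to extract the words, mark them in a single batch call, and write the results back in place. The following is a minimal sketch under stated assumptions: `markText`, `MarkBatchFn` and `WORD_PATTERN` are hypothetical names, the character class used to find words is a guess that may need adjusting for your input, and the batch function is passed in as a parameter so the helper does not depend on how the package is imported.

```typescript
// Same shape as markBatch as documented here: string[] in, string[] out.
type MarkBatchFn = (words: string[]) => string[];

// Runs of Cyrillic letters, optionally joined by word-internal apostrophes.
// Assumption: this class is good enough for plain Ukrainian prose.
const WORD_PATTERN = /[А-Яа-яІіЇїЄєҐґ]+(?:['ʼ’][А-Яа-яІіЇїЄєҐґ]+)*/g;

// Hypothetical helper: marks every word of `text` with one batch call.
function markText(text: string, markBatch: MarkBatchFn): string {
  const words = text.match(WORD_PATTERN) || [];
  if (words.length === 0) return text; // nothing to look up
  const marked = markBatch(words);
  if (marked.length !== words.length) {
    throw new Error('batch function must return one result per word');
  }
  let next = 0;
  // replace() visits matches in the same order that match() returned them.
  return text.replace(WORD_PATTERN, () => marked[next++]);
}
```

With the package installed, usage would be `markText(pastedText, markBatch)`. Punctuation, digits and spacing are left untouched because only the matched spans are replaced.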
markBatch(['мама', 'тато', 'університет', 'xyz']);
// → ['ма́ма', 'та́то', 'університе́т', 'xyz']
stressIndex(word: string): number
Returns the 0-based syllable index of the stressed syllable, or -1 if the word is not in the dictionary.
This is the lowest-overhead call: it allocates no result object. The value equals readings[0].syllableIndex from lookup() and can be used directly as a syllable position (e.g. wordSyllables[syllableIndex] is the stressed syllable).
stressIndex('мама'); // → 0 (first syllable: ма́-ма)
stressIndex('університет'); // → 4 (fifth syllable: у-ні-вер-си-те́т)
stressIndex('xyz'); // → -1 (unknown)
stressIndexBatch(words: Array<string>): Int32Array
Batch variant of stressIndex. Takes a JS Array of strings and returns an Int32Array of 0-based syllable indices (one per word, -1 = unknown).
Significantly faster than looping over stressIndex() because the JS↔WASM encoding overhead is amortised.
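Because the result is a plain `Int32Array` aligned with the input array, post-processing needs no further calls into WASM. As an illustration, the sketch below collects the out-of-dictionary words and computes dictionary coverage; `unknownWords` and `knownShare` are hypothetical helper names, not package exports.

```typescript
// Hypothetical helper: words whose batch result is -1 (not in the dictionary).
function unknownWords(words: string[], indices: Int32Array): string[] {
  const missing: string[] = [];
  for (let i = 0; i < indices.length && i < words.length; i++) {
    if (indices[i] === -1) missing.push(words[i]);
  }
  return missing;
}

// Hypothetical helper: share of words with a known stress position, in [0, 1].
function knownShare(indices: Int32Array): number {
  if (indices.length === 0) return 0; // avoid dividing by zero
  let known = 0;
  for (let i = 0; i < indices.length; i++) {
    if (indices[i] !== -1) known++;
  }
  return known / indices.length;
}
```

The unknown words could then be routed to a fallback such as `transcribe()` with a predicted stress position.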
stressIndexBatch(['мама', 'тато', 'дитина', 'xyz']);
// → Int32Array [0, 0, 1, -1]
lookupBatch(words: Array<string>): Array<LookupResult>
Batch variant of lookup. Returns an array of full result objects, one per word.
Ideal for processing large texts — avoids per-call JS↔WASM overhead.
const results = lookupBatch(['мама', 'замок', 'xyz']);
results[0].readings[0].stressedForm; // → 'ма́ма'
results[1].readings[0].ipa; // → 'zɑmɔk'
results[2].readings; // → [] (unknown)
lookup(word: string): LookupResult
Returns a full result object with all stress readings, IPA, syllabification, tokens and morphology.
readings is an empty array when the word is not in the dictionary.
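When a word has several readings, callers usually need to pick one. A simple first filter is part of speech, sketched below. The sketch relies only on the `morph` and `pos` fields of the result shape documented for this function; `pickReadingByPos`, `ReadingLike` and `MorphLike` are hypothetical names, and falling back to the first reading is this sketch's choice rather than documented package behaviour.

```typescript
// Minimal structural subsets of the documented Reading / MorphReading shapes.
interface MorphLike { pos: string[]; }
interface ReadingLike { stressedForm: string; morph: MorphLike[]; }

// Hypothetical helper: first reading that carries the requested POS tag,
// otherwise the first reading, otherwise undefined for unknown words.
function pickReadingByPos<R extends ReadingLike>(readings: R[], pos: string): R | undefined {
  for (let i = 0; i < readings.length; i++) {
    const morph = readings[i].morph;
    for (let j = 0; j < morph.length; j++) {
      if (morph[j].pos.indexOf(pos) !== -1) return readings[i];
    }
  }
  return readings.length > 0 ? readings[0] : undefined;
}
```

Part of speech alone cannot separate heteronyms that share it; those cases need the finer-grained `feats` or sentence context.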
interface LookupResult {
form: string; // normalised input word
readings: Reading[];
}
interface Reading {
syllableIndex: number; // 0-based syllable index of the stressed syllable
stressFromEnd: number; // syllables from the end (1 = ultima, 2 = penult, …)
syllableCount: number;
form: string; // normalised form (same as input for most words)
stressedForm: string; // word with U+0301 accent on stressed vowel
wordSyllables: string[]; // Cyrillic syllables, e.g. ['за', 'мок']
ipa: string; // full IPA string, e.g. 'zɑmɔk'
ipaSyllables: string[]; // IPA per syllable with stress mark, e.g. ['ˈzɑ', 'mɔk']
tokens: Token[];
morph: MorphReading[];
confidence: string | null;
}
interface Token {
ipa: string; // IPA of this token, e.g. 'm', 'ɑ', 'tʲ'
source: string; // source grapheme(s)
type: string; // 'Vowel' | 'Consonant' | 'Glide' | 'Separator'
vowelIndex: number; // -1 for consonants; 0-based vowel position for vowels
stressed: boolean;
palatalized: boolean;
}
interface MorphReading {
pos: string[]; // e.g. ['NOUN'], ['VERB']
feats: Record<string, string[]>; // UD morphological features
lemma: string | null;
definition: string | null;
}
Example (heteronym):
const r = lookup('замок');
// r.readings[0] → { stressedForm: 'за́мок', ipa: 'zɑmɔk', … }
// r.readings[1] → { stressedForm: 'замо́к', ipa: 'zɑmɔk', … }
// → use morph[0].feats to disambiguate by context
Example (IPA transcription):
const r = lookup('правда');
r.readings[0].ipa; // → 'prɑwdɑ'
r.readings[0].ipaSyllables; // → ['ˈprɑw', 'dɑ']
r.readings[0].tokens.map(t => t.ipa);
// → ['p', 'r', 'ɑ', 'w', 'd', 'ɑ']
// (в realised as [w]: post-vocalic before a consonant)
transcribe(word: string, syllableIndex: number): TranscriptionResult
Low-level IPA transcription for a word when you already know the syllable stress position.
Bypasses the dictionary lookup entirely — useful for OOV words or ML-predicted stress.
interface TranscriptionResult {
word: string;
stressIndex: number; // same as input syllableIndex
ipa: string;
ipaSyllables: string[];
wordSyllables: string[];
syllableCount: number;
tokens: Token[];
}
transcribe('слово', 0);
// → { ipa: 'slɔwɔ', ipaSyllables: ['ˈslɔ', 'wɔ'], … }
wordCount(): number
Returns the total number of word forms in the embedded dictionary.
wordCount(); // → 3008723
Phonetic pipeline
The IPA transcription runs 6 sequential passes over the tokenised word, plus two sub-passes (1.5 and 3b) that run between the numbered ones:
| Pass | Module | Description |
|------|--------|-------------|
| 1 | Tokenizer | Cyrillic graphemes → phoneme tokens |
| 1.5 | Geminates | Merge adjacent identical consonants into tː |
| 2 | Palatalization | Regressive dental softening (ь, і, я, є, ю propagation) |
| 3 | Voicing assimilation | Obstruent voicing/devoicing before voiced/voiceless clusters |
| 3b | Place assimilation | Sibilant + affricate place assimilation |
| 4 | Vowel allophones | Stressed vowel marking; unstressed vowel reduction |
| 5 | /в/ allophones | Positional [w] / [u̯] / [ʋ] selection |
| 6 | Syllabifier | Sonority-based boundary placement (Savchenko 2014 rules) |
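The pipeline structure in the table can be pictured as a list of functions, each taking the token list and returning a new one. The sketch below is an illustration in TypeScript, not the Rust implementation: `Pass`, `runPasses`, `mergeGeminates` and `VOWEL_TOKENS` are hypothetical names, tokens are simplified to bare IPA strings, and the six-vowel inventory is an assumption. Only pass 1.5 is shown.

```typescript
// A pass maps a token list to a new token list (simplified to IPA strings).
type Pass = (tokens: string[]) => string[];

// Assumption: the six Ukrainian vowel phonemes in IPA.
const VOWEL_TOKENS = ['ɑ', 'ɛ', 'i', 'ɪ', 'ɔ', 'u'];
const LENGTH_MARK = '\u02D0'; // IPA length mark "ː"

// Sketch of pass 1.5: merge two adjacent identical consonants into a long one.
const mergeGeminates: Pass = function (tokens) {
  const out: string[] = [];
  for (let i = 0; i < tokens.length; i++) {
    const current = tokens[i];
    const isConsonant = VOWEL_TOKENS.indexOf(current) === -1;
    if (isConsonant && i + 1 < tokens.length && tokens[i + 1] === current) {
      out.push(current + LENGTH_MARK);
      i++; // skip the second half of the pair
    } else {
      out.push(current);
    }
  }
  return out;
};

// Applies the passes in order, feeding each result into the next pass.
function runPasses(tokens: string[], passes: Pass[]): string[] {
  let current = tokens;
  for (let i = 0; i < passes.length; i++) current = passes[i](current);
  return current;
}
```

Keeping each pass a pure function of the token list makes the ordering constraints in the table explicit: later passes such as the syllabifier see the output of every assimilation pass before them.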
/в/ allophone rules
| Context | Allophone | Example |
|---------|-----------|---------|
| Post-vocalic + pre-consonantal | [w] | правда → prɑwdɑ |
| Post-vocalic + word-final | [u̯] | кров → krɔu̯ |
| Word-initial (default) | [ʋ] | він → ʋin |
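The three rows above can be written as a small decision function. This is a sketch working on Cyrillic letters rather than the engine's tokens; `vAllophone` and `UA_VOWEL_LETTERS` are hypothetical names. It covers only the contexts listed in the table and returns `null` elsewhere, since the engine evidently applies further rules (the `transcribe('слово', 0)` example in this README yields [w] between vowels, a context the table does not list).

```typescript
const UA_VOWEL_LETTERS = 'аеєиіїоуюя';

function isVowelLetter(ch: string): boolean {
  // The empty-string check matters: indexOf('') would report a match.
  return ch !== '' && UA_VOWEL_LETTERS.indexOf(ch) !== -1;
}

// Hypothetical helper: allophone of the letter "в" at `index`, per the table.
// Returns null for contexts the summary table does not cover.
function vAllophone(word: string, index: number): string | null {
  const lower = word.toLowerCase();
  if (lower.charAt(index) !== 'в') {
    throw new Error('no "в" at position ' + index);
  }
  const prev = lower.charAt(index - 1); // '' when "в" is word-initial
  const next = lower.charAt(index + 1); // '' when "в" is word-final
  const postVocalic = isVowelLetter(prev);
  if (postVocalic && next === '') return 'u\u032F';       // [u̯]: after a vowel, word-final
  if (postVocalic && !isVowelLetter(next)) return 'w';    // [w]: after a vowel, before a consonant
  if (index === 0) return 'ʋ';                            // [ʋ]: word-initial
  return null;                                            // not covered by the table
}
```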
Syllabification rules (summary)
The syllabifier follows Savchenko (2014) §20, applied after all assimilation passes:
| Rule | Condition | Result |
|------|-----------|--------|
| 1 | Single intervocalic consonant | Onset of next syllable |
| 2a | Both obstruents, both voiceless | Both to next syllable |
| 2b | Both obstruents, both voiced, same manner | Both to next syllable |
| 2c | Obstruent + sonorant | Both to next syllable |
| 2α | Both sonorants | Split: first in coda |
| 2β | Sonorant first, any second | Sonorant in coda |
| 2γ | Voiced + voiceless obstruent | Split between them |
| 2δ | Voiced fricative + voiced stop | Split between them |
| 3a | First consonant is sonorant | Sonorant in coda, rest to next |
| 3b | Obstruents + sonorant | All to next syllable |
| 3c | All voiceless obstruents | All to next syllable |
| 3d | Voiced + voiceless(es) + sonorant | Split after voiced |
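Rules 1 and 2a to 2δ are compact enough to sketch directly. The code below is an illustrative TypeScript reading of the summary table, not the Rust implementation: all names are hypothetical, the consonant inventory is a small assumed subset without palatalised or affricate tokens, clusters of three or more consonants (rules 3a to 3d) are rejected, and no stress mark is added to the output.

```typescript
type Manner = 'stop' | 'fricative' | 'sonorant';
interface ConsonantClass { voiced: boolean; manner: Manner; }

const NUCLEI = ['ɑ', 'ɛ', 'i', 'ɪ', 'ɔ', 'u']; // assumed vowel inventory
const obs = (voiced: boolean, manner: Manner): ConsonantClass => ({ voiced, manner });
const SON: ConsonantClass = { voiced: true, manner: 'sonorant' };
// Assumed, deliberately incomplete inventory.
const CONSONANTS: { [ipa: string]: ConsonantClass } = {
  'p': obs(false, 'stop'), 'b': obs(true, 'stop'), 't': obs(false, 'stop'),
  'd': obs(true, 'stop'), 'k': obs(false, 'stop'), 'ɡ': obs(true, 'stop'),
  'f': obs(false, 'fricative'), 's': obs(false, 'fricative'), 'z': obs(true, 'fricative'),
  'ʃ': obs(false, 'fricative'), 'ʒ': obs(true, 'fricative'), 'x': obs(false, 'fricative'),
  'ɦ': obs(true, 'fricative'),
  'm': SON, 'n': SON, 'l': SON, 'r': SON, 'j': SON, 'w': SON, 'ʋ': SON,
};

function classify(token: string): ConsonantClass {
  if (!Object.prototype.hasOwnProperty.call(CONSONANTS, token)) {
    throw new Error('consonant not in the sketch inventory: ' + token);
  }
  return CONSONANTS[token];
}

// How many consonants of a two-consonant cluster stay in the coda (0 or 1).
function codaSize(c1: ConsonantClass, c2: ConsonantClass): number {
  if (c1.manner === 'sonorant') return 1;               // 2α, 2β
  if (c2.manner === 'sonorant') return 0;               // 2c
  if (!c1.voiced && !c2.voiced) return 0;               // 2a
  if (c1.voiced && c2.voiced) {
    if (c1.manner === c2.manner) return 0;              // 2b
    if (c1.manner === 'fricative' && c2.manner === 'stop') return 1; // 2δ
  }
  if (c1.voiced && !c2.voiced) return 1;                // 2γ
  throw new Error('cluster not covered by the summary table');
}

function syllabify(tokens: string[]): string[] {
  const vowelAt: number[] = [];
  for (let i = 0; i < tokens.length; i++) {
    if (NUCLEI.indexOf(tokens[i]) !== -1) vowelAt.push(i);
  }
  if (vowelAt.length < 2) return [tokens.join('')]; // at most one syllable
  const syllables: string[] = [];
  let start = 0;
  for (let v = 0; v + 1 < vowelAt.length; v++) {
    const first = vowelAt[v] + 1;          // first token after the vowel
    const size = vowelAt[v + 1] - first;   // consonants between the two vowels
    let coda = 0;                          // rule 1 (and hiatus): nothing stays behind
    if (size === 2) coda = codaSize(classify(tokens[first]), classify(tokens[first + 1]));
    if (size > 2) throw new Error('rules 3a-3d are not covered in this sketch');
    syllables.push(tokens.slice(start, first + coda).join(''));
    start = first + coda;
  }
  syllables.push(tokens.slice(start).join('')); // last syllable keeps trailing consonants
  return syllables;
}
```

On the token lists shown elsewhere in this README, this reading agrees with the documented syllable splits for замок, правда and слово, apart from the stress mark.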
Data sources
The embedded dictionary aggregates four Ukrainian lexical databases:
| Source | Records | License |
|--------|---------|---------|
| kaikki.org Wiktionary extract | ~2 M word forms | CC BY-SA 4.0 |
| lang-uk trie stress dictionary | ~2.9 M word forms | MIT |
| lang-uk plain-text stress dictionary | ~2.9 M word forms | MIT / ULIF |
| ua_variative_stressed_words (manual curation) | ~150 lemmas | — |
Duplicates are resolved by a lossless merge that preserves all unique readings.
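The merge procedure itself is not documented here. As a rough illustration of what "lossless" means in this context, the sketch below merges several dictionaries so that every distinct reading of a word survives exactly once. It assumes, purely for illustration, that a reading is identified by its stress index alone; `StressReadings` and `mergeStressDictionaries` are hypothetical names.

```typescript
// Assumption: word → list of stress indices, one per reading.
type StressReadings = { [word: string]: number[] };

// Hypothetical sketch: union of all sources, keeping each distinct reading once
// and preserving the order in which readings are first seen.
function mergeStressDictionaries(sources: StressReadings[]): StressReadings {
  const merged: StressReadings = Object.create(null); // no inherited keys
  for (let s = 0; s < sources.length; s++) {
    const words = Object.keys(sources[s]);
    for (let w = 0; w < words.length; w++) {
      const word = words[w];
      const target = merged[word] || (merged[word] = []);
      const readings = sources[s][word];
      for (let r = 0; r < readings.length; r++) {
        if (target.indexOf(readings[r]) === -1) target.push(readings[r]);
      }
    }
  }
  return merged;
}
```

A word that appears with stress index 0 in one source and with indices 1 and 0 in another ends up with both readings, which is how heteronyms remain available to `lookup()` as separate entries in `readings`.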
Academic sources
The IPA transcription pipeline is implemented according to:
| Source | Scope |
|--------|-------|
| Стеріополо О. (2012). «Українська фонетична система у парадигмі МФА». Науковий вісник УжНУ. Філологія. Соціальні комунікації 27: 51–58. | IPA consonant/vowel system, transcription rules, allophony |
| Савченко І. С. (2014). Фонетика, орфоепія і графіка сучасної української мови. | Syllabification rules (§19–20), sonority scale, phonological/phonetic level |
| Касьянова (2015). [/в/ allophones in standard Ukrainian] | Positional allophony of /в/: [w], [u̯], [ʋ], [ʋʲ] rules |
Relation to ua-word-stress
ua-word-stress and ua-word-stress-wasm share the same underlying stress dictionary data but serve different use cases:
- ua-word-stress: pure TypeScript, ~9 MB compressed trie served as a separate asset. Best for applications that only need stress lookup and want zero native code.
- ua-word-stress-wasm: Rust/WASM, ~14 MB self-contained binary. Best for applications that need full IPA output, syllabification, morphology, or prefer a zero-fetch dependency.
Both packages are published from the same repository: ua-stress-engine.
License
AGPL-3.0-or-later — see LICENSE.
The dictionary data retains the licenses of its contributing sources (see Data sources above). Wiktionary-derived data is CC BY-SA 4.0; attribution to Wiktionary contributors is required when redistributing.
