ua-word-stress-wasm

v0.5.2

Published

a month ago

Ukrainian word stress engine — dictionary lookup, IPA transcription, morphology. Rust/WASM, no init() required.

tilitronic

ukrainian stress nlp ipa wasm

Ukrainian word stress lookup + full IPA phonetic transcription, compiled to WebAssembly from Rust.

The dictionary is embedded in the WASM binary — no separate data file to host or fetch.
Works in browsers (ESM) and Node.js.

| Feature | ua-word-stress (TS trie) | ua-word-stress-wasm (this) |
|---|---|---|
| Stress lookup | ✓ | ✓ |
| Full IPA transcription | — | ✓ |
| Syllabification | — | ✓ |
| Morphology (POS, UD features, lemma) | — | ✓ |
| Data file to serve | 9.4 MB .ctrie.gz | none (embedded) |
| WASM binary size | — | ~14 MB |

Database statistics

| Metric | Value |
|---|---|
| Word forms | 3,008,723 |
| Binary format | bzip2-compressed binary (V2) |
| Phonetic pipeline passes | 6 |
| IPA standard | IPA (Steriopolo 2012, Savchenko 2014) |

Installation

# pnpm (recommended)
pnpm add ua-word-stress-wasm

# npm
npm install ua-word-stress-wasm

# yarn
yarn add ua-word-stress-wasm

Quick start

import { lookup, mark, stressIndex, stressIndexBatch, wordCount } from 'ua-word-stress-wasm';

// No init() needed — the WASM is loaded automatically by your bundler.

// Stress-marked word
mark('університет');         // → 'університе́т'
mark('замок');               // → 'за́мок'  (first reading)

// Stress index (0-based index of the stressed syllable; -1 = unknown)
stressIndex('мама');         // → 0
stressIndex('університет'); // → 4

// Full lookup with IPA, syllables and morphology
const r = lookup('замок');
r.readings[0].stressedForm;  // → 'за́мок'
r.readings[0].ipa;           // → 'zɑmɔk'
r.readings[0].ipaSyllables;  // → ['ˈzɑ', 'mɔk']
r.readings[0].morph[0].pos;  // → ['NOUN']

// Batch lookup — much faster than calling stressIndex() in a loop
const indices = stressIndexBatch(['мама', 'тато', 'xyz']);
// → Int32Array [0, 0, -1]

// Dictionary size
wordCount(); // → 3008723

Vite / webpack / Rollup

No configuration needed. Vite, webpack 5, and Rollup all handle .wasm imports natively:

import { mark } from 'ua-word-stress-wasm';

mark('університет'); // → 'університе́т'

Node.js (ESM, Node 20+)

import { mark } from 'ua-word-stress-wasm';

console.log(mark('привіт')); // → 'приві́т'

API reference

`mark(word: string): string`

Returns the word with a combining acute accent (U+0301) placed over the stressed vowel.
Returns the word unchanged if it is not in the dictionary.

mark('мама');         // → 'ма́ма'
mark('університет'); // → 'університе́т'
mark('xyz');          // → 'xyz'  (unknown — returned as-is)

`markBatch(words: Array<string>): Array<string>`

Batch variant of mark. Takes an array of words and returns an array of stress-marked strings.
Words not in the dictionary are returned unchanged.
Significantly faster than calling mark() in a loop — ideal for processing pasted text blocks.

markBatch(['мама', 'тато', 'університет', 'xyz']);
// → ['ма́ма', 'та́то', 'університе́т', 'xyz']
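
A minimal sketch of how a pasted text block might be stress-marked with a single batch call. The batch function is passed in as a parameter so the snippet stays self-contained; `markText`, the word-matching regular expression and the stub are assumptions made for illustration and are not part of the package.

```typescript
// Hypothetical helper, not part of ua-word-stress-wasm.
type MarkBatchFn = (words: string[]) => string[];

// Assumption: a "word" is a run of Cyrillic letters, optionally with apostrophes.
const UA_WORD = /[А-Яа-яІіЇїЄєҐґ'ʼ’]+/g;

function markText(text: string, markBatchFn: MarkBatchFn): string {
  const words: string[] = text.match(UA_WORD) ?? [];
  const marked = markBatchFn(words); // one batch call for the whole text
  let i = 0;
  // Replace each matched word, in order, with its stress-marked counterpart.
  return text.replace(UA_WORD, (w) => marked[i++] ?? w);
}

// Stand-in for the real markBatch so the sketch runs without the package.
const stubMarkBatch: MarkBatchFn = (words) =>
  words.map((w) => (w === 'мама' ? 'ма\u0301ма' : w));

console.log(markText('мама, xyz тато!', stubMarkBatch));
```

With the real package, the call would presumably be `markText(text, markBatch)`; punctuation, whitespace and non-Cyrillic tokens pass through untouched.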

`stressIndex(word: string): number`

Returns the 0-based syllable index of the stressed syllable, or -1 if the word is not in the dictionary.

This is the minimal-overhead call: no object is allocated. The value equals readings[0].syllableIndex from lookup() and can be used directly as a syllable position (e.g. wordSyllables[syllableIndex] gives the stressed syllable).

stressIndex('мама');         // → 0  (first syllable: ма́-ма)
stressIndex('університет'); // → 4  (fifth syllable: у-ні-вер-си-те́т)
stressIndex('xyz');          // → -1 (unknown)

`stressIndexBatch(words: Array<string>): Int32Array`

Batch variant of stressIndex. Takes a JS Array of strings and returns an Int32Array of 0-based syllable indices (one per word, -1 = unknown).
Significantly faster than looping over stressIndex() because the JS↔WASM encoding overhead is amortised.

stressIndexBatch(['мама', 'тато', 'дитина', 'xyz']);
// → Int32Array [0, 0, 1, -1]
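
Because unknown words come back as -1, the result of a batch call usually needs to be split before further processing. The sketch below shows one way to do that; `splitByKnown` is a hypothetical helper, and the Int32Array is written by hand to mirror the example above rather than produced by the library.

```typescript
// Hypothetical helper, not part of ua-word-stress-wasm.
interface KnownStress { word: string; syllableIndex: number }

function splitByKnown(words: string[], indices: Int32Array) {
  const known: KnownStress[] = [];
  const unknown: string[] = [];
  words.forEach((word, i) => {
    const idx = indices[i];
    // -1 (or a missing slot) means the word was not found in the dictionary.
    if (idx === undefined || idx < 0) unknown.push(word);
    else known.push({ word, syllableIndex: idx });
  });
  return { known, unknown };
}

// Hand-written indices matching the documented example output.
const splitSample = splitByKnown(
  ['мама', 'тато', 'дитина', 'xyz'],
  new Int32Array([0, 0, 1, -1]),
);
console.log(splitSample.unknown);
```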

`lookupBatch(words: Array<string>): Array<LookupResult>`

Batch variant of lookup. Returns an array of full result objects, one per word.
Ideal for processing large texts — avoids per-call JS↔WASM overhead.

const results = lookupBatch(['мама', 'замок', 'xyz']);
results[0].readings[0].stressedForm; // → 'ма́ма'
results[1].readings[0].ipa;          // → 'zɑmɔk'
results[2].readings;                 // → [] (unknown)
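
One use of the batch result is to flag words that have more than one stress reading and therefore need disambiguation. A minimal sketch follows; `findHeteronyms` and the trimmed-down `BatchItemLite` shape are assumptions, and the sample array is hand-written mock data shaped like the documented output.

```typescript
// Trimmed-down shape: only the fields this sketch reads.
interface BatchItemLite { form: string; readings: unknown[] }

// Hypothetical helper: forms with more than one stress reading.
function findHeteronyms(results: BatchItemLite[]): string[] {
  return results.filter((r) => r.readings.length > 1).map((r) => r.form);
}

// Hand-written mock shaped like lookupBatch() output, not real library data.
const mockBatch: BatchItemLite[] = [
  { form: 'мама', readings: [{}] },
  { form: 'замок', readings: [{}, {}] },
  { form: 'xyz', readings: [] },
];
const heteronyms = findHeteronyms(mockBatch);
console.log(heteronyms);
```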

`lookup(word: string): LookupResult`

Returns a full result object with all stress readings, IPA, syllabification, tokens and morphology.
readings is an empty array when the word is not in the dictionary.

interface LookupResult {
  form: string;        // normalised input word
  readings: Reading[];
}

interface Reading {
  syllableIndex: number; // 0-based syllable index of the stressed syllable
  stressFromEnd: number; // syllables from the end (1 = ultima, 2 = penult, …)
  syllableCount: number;
  form: string;          // normalised form (same as input for most words)
  stressedForm: string;  // word with U+0301 accent on stressed vowel

  wordSyllables: string[];  // Cyrillic syllables, e.g. ['за', 'мок']
  ipa: string;              // full IPA string, e.g. 'zɑmɔk'
  ipaSyllables: string[];   // IPA per syllable with stress mark, e.g. ['ˈzɑ', 'mɔk']

  tokens: Token[];
  morph: MorphReading[];
  confidence: string | null;
}

interface Token {
  ipa: string;        // IPA of this token, e.g. 'm', 'ɑ', 'tʲ'
  source: string;     // source grapheme(s)
  type: string;       // 'Vowel' | 'Consonant' | 'Glide' | 'Separator'
  vowelIndex: number; // -1 for consonants; 0-based vowel position for vowels
  stressed: boolean;
  palatalized: boolean;
}

interface MorphReading {
  pos: string[];                       // e.g. ['NOUN'], ['VERB']
  feats: Record<string, string[]>;     // UD morphological features
  lemma: string | null;
  definition: string | null;
}
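
The numeric fields of Reading are related. Going by the field comments (0-based syllableIndex, 1 = ultima), stressFromEnd should equal syllableCount minus syllableIndex. The helper below is a hypothetical illustration of that arithmetic, not a function exported by the package.

```typescript
// Hypothetical helper; follows from the field comments (0-based index, 1 = ultima).
function stressFromEndOf(syllableIndex: number, syllableCount: number): number {
  if (syllableIndex < 0 || syllableIndex >= syllableCount) {
    throw new RangeError('syllableIndex must be in [0, syllableCount)');
  }
  return syllableCount - syllableIndex;
}

console.log(stressFromEndOf(4, 5)); // університет: 5 syllables, stress on the last one
console.log(stressFromEndOf(0, 2)); // мама: 2 syllables, stress on the first one
```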

Example — heteronym:

const r = lookup('замок');
// r.readings[0] → { stressedForm: 'за́мок', ipa: 'zɑmɔk', … }
// r.readings[1] → { stressedForm: 'замо́к', ipa: 'zɑmɔk', … }
// → use morph[0].feats to disambiguate by context
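
A sketch of feature-based disambiguation, assuming the caller already knows which UD feature value the context requires. `filterByFeature` is a hypothetical helper, the interfaces are trimmed-down local copies of the documented shapes, and the readings are hand-written mock data (the feature keys and values follow UD conventions but are not confirmed library output).

```typescript
// Trimmed-down local copies of the documented shapes.
interface FeatMorphLite { pos: string[]; feats: Record<string, string[]> }
interface FeatReadingLite { stressedForm: string; morph: FeatMorphLite[] }

// Hypothetical helper: keep readings where any morph entry lists `value` under `key`.
function filterByFeature(
  readings: FeatReadingLite[],
  key: string,
  value: string,
): FeatReadingLite[] {
  return readings.filter((r) =>
    r.morph.some((m) => (m.feats[key] ?? []).indexOf(value) !== -1));
}

// Hand-written mock data, not actual library output.
const mockReadings: FeatReadingLite[] = [
  { stressedForm: 'ру\u0301ки', morph: [{ pos: ['NOUN'], feats: { Case: ['Nom'], Number: ['Plur'] } }] },
  { stressedForm: 'руки\u0301', morph: [{ pos: ['NOUN'], feats: { Case: ['Gen'], Number: ['Sing'] } }] },
];
const genitive = filterByFeature(mockReadings, 'Case', 'Gen');
console.log(genitive.map((r) => r.stressedForm));
```

Deciding which feature value the sentence calls for is a separate problem (for example a tagger or parser) and is outside the scope of this sketch.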

Example — IPA transcription:

const r = lookup('правда');
r.readings[0].ipa;          // → 'prɑwdɑ'
r.readings[0].ipaSyllables; // → ['ˈprɑw', 'dɑ']
r.readings[0].tokens.map(t => t.ipa);
// → ['p', 'r', 'ɑ', 'w', 'd', 'ɑ']
// (в realised as [w] — post-vocalic before consonant)

`transcribe(word: string, syllableIndex: number): TranscriptionResult`

Low-level IPA transcription for a word when you already know the syllable stress position.
Bypasses the dictionary lookup entirely, which is useful for out-of-vocabulary (OOV) words or ML-predicted stress.

interface TranscriptionResult {
  word: string;
  stressIndex: number;   // same as input syllableIndex
  ipa: string;
  ipaSyllables: string[];
  wordSyllables: string[];
  syllableCount: number;
  tokens: Token[];
}

transcribe('слово', 0);
// → { ipa: 'slɔwɔ', ipaSyllables: ['ˈslɔ', 'wɔ'], … }
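
A common pattern is to try the dictionary first and fall back to transcribe() with an externally predicted stress. The sketch below injects both functions so it runs without the package; `ipaWithFallback`, `EngineLite`, the `predictStress` callback and the stub engine with its placeholder strings are all assumptions made for illustration.

```typescript
// Trimmed-down local copies of the documented shapes (only the fields used here).
interface IpaLite { ipa: string; ipaSyllables: string[] }
interface EngineLite {
  lookup(word: string): { readings: IpaLite[] };
  transcribe(word: string, syllableIndex: number): IpaLite;
}

// Hypothetical helper: prefer the dictionary reading, else transcribe with a predicted stress.
function ipaWithFallback(
  engine: EngineLite,
  word: string,
  predictStress: (word: string) => number, // assumption: supplied by the caller
): IpaLite {
  const first = engine.lookup(word).readings[0];
  return first !== undefined ? first : engine.transcribe(word, predictStress(word));
}

// Stub engine so the sketch runs without the package; the IPA strings are placeholders.
const stubEngine: EngineLite = {
  lookup: (word) => ({ readings: word === 'known' ? [{ ipa: 'DICT', ipaSyllables: [] }] : [] }),
  transcribe: (word, i) => ({ ipa: `RULES:${word}:${i}`, ipaSyllables: [] }),
};

console.log(ipaWithFallback(stubEngine, 'oov', () => 1).ipa);
```

With the real package, the engine argument would presumably be `{ lookup, transcribe }` as imported from ua-word-stress-wasm.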

`wordCount(): number`

Returns the total number of word forms in the embedded dictionary.

wordCount(); // → 3008723

Phonetic pipeline

The IPA transcription runs six numbered passes in sequence over the tokenised word, plus two sub-passes (1.5 and 3b):

| Pass | Module | Description |
|------|--------|-------------|
| 1 | Tokenizer | Cyrillic graphemes → phoneme tokens |
| 1.5 | Geminates | Merge adjacent identical consonants into tː |
| 2 | Palatalization | Regressive dental softening (ь, і, я, є, ю propagation) |
| 3 | Voicing assimilation | Obstruent voicing/devoicing before voiced/voiceless clusters |
| 3b | Place assimilation | Sibilant + affricate place assimilation |
| 4 | Vowel allophones | Stressed vowel marking; unstressed vowel reduction |
| 5 | /в/ allophones | Positional [w] / [u̯] / [ʋ] selection |
| 6 | Syllabifier | Sonority-based boundary placement (Savchenko 2014 rules) |
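
The pass structure above can be pictured as a list of functions applied in order, each consuming the previous pass's output. The sketch below is a simplified TypeScript illustration, not the Rust implementation: tokens are plain strings, the vowel list is an assumption, and `mergeGeminates` is only a stand-in for pass 1.5.

```typescript
type Pass = (tokens: string[]) => string[];

// Assumption: vowel symbols that the merge must skip.
const IPA_VOWELS = ['ɑ', 'ɛ', 'ɪ', 'i', 'ɔ', 'u'];

// Simplified stand-in for pass 1.5: two adjacent identical consonant tokens
// become one long token (suffix U+02D0, the IPA length mark).
const mergeGeminates: Pass = (tokens) => {
  const out: string[] = [];
  for (const t of tokens) {
    const last = out[out.length - 1];
    if (last === t && IPA_VOWELS.indexOf(t) === -1) out[out.length - 1] = t + '\u02D0';
    else out.push(t);
  }
  return out;
};

// Apply the passes left to right.
function runPasses(tokens: string[], passes: Pass[]): string[] {
  return passes.reduce((acc, pass) => pass(acc), tokens);
}

const geminated = runPasses(['ɑ', 'n', 'n', 'ɑ'], [mergeGeminates]);
console.log(geminated.join(' '));
```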

/в/ allophone rules

| Context | Allophone | Example |
|---------|-----------|---------|
| Post-vocalic + pre-consonantal | [w] | правда → prɑwdɑ |
| Post-vocalic + word-final | [u̯] | кров → krɔu̯ |
| Word-initial (default) | [ʋ] | він → ʋin |
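
The three contexts in the table can be sketched as a small decision function. This is a simplified reading that works on the Cyrillic spelling and treats every non-vowel letter as a consonant; the real pass runs on phoneme tokens after assimilation and also handles cases not listed here. `vAllophone` and the vowel-letter list are assumptions made for illustration.

```typescript
// Assumption: Ukrainian vowel letters, used to decide "post-vocalic" on the spelling.
const UA_VOWEL_LETTERS = 'аеєиіїоуюя';

const isVowelLetter = (ch: string | undefined): boolean =>
  ch !== undefined && UA_VOWEL_LETTERS.indexOf(ch) !== -1;

// Hypothetical helper: allophone for the letter в at `index`, using only the three table rows.
function vAllophone(word: string, index: number): string {
  const chars = word.toLowerCase().split('');
  const prev: string | undefined = index > 0 ? chars[index - 1] : undefined;
  const next: string | undefined = chars[index + 1];
  if (isVowelLetter(prev)) {
    if (next === undefined) return 'u\u032F'; // post-vocalic + word-final
    if (!isVowelLetter(next)) return 'w';     // post-vocalic + pre-consonantal
  }
  return 'ʋ'; // default, including word-initial
}

console.log(vAllophone('правда', 3), vAllophone('кров', 3), vAllophone('він', 0));
```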

Syllabification rules (summary)

The syllabifier follows Savchenko (2014) §20, applied after all assimilation passes:

| Rule | Condition | Result |
|------|-----------|--------|
| 1 | Single intervocalic consonant | Onset of next syllable |
| 2a | Both obstruents, both voiceless | Both to next syllable |
| 2b | Both obstruents, both voiced, same manner | Both to next syllable |
| 2c | Obstruent + sonorant | Both to next syllable |
| 2α | Both sonorants | Split: first in coda |
| 2β | Sonorant first, any second | Sonorant in coda |
| 2γ | Voiced + voiceless obstruent | Split between them |
| 2δ | Voiced fricative + voiced stop | Split between them |
| 3a | First consonant is sonorant | Sonorant in coda, rest to next |
| 3b | Obstruents + sonorant | All to next syllable |
| 3c | All voiceless obstruents | All to next syllable |
| 3d | Voiced + voiceless(es) + sonorant | Split after voiced |
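
The two-consonant rules (2a to 2δ) can be read as a decision function over consonant classes. The sketch below encodes only what the summary table states, on abstract classes rather than concrete phonemes; combinations the table does not list return 'unspecified'. All type and function names are assumptions, and the actual syllabifier may order or refine these checks differently.

```typescript
type Manner = 'stop' | 'fricative' | 'affricate';
type Consonant =
  | { kind: 'sonorant' }
  | { kind: 'obstruent'; voiced: boolean; manner: Manner };
type ClusterSplit = 'both-to-next' | 'split' | 'unspecified';

// Hypothetical helper: where the syllable boundary falls in a V-C1-C2-V cluster.
function splitTwoConsonants(c1: Consonant, c2: Consonant): ClusterSplit {
  if (c1.kind === 'sonorant') return 'split';           // 2α, 2β: sonorant stays in coda
  if (c2.kind === 'sonorant') return 'both-to-next';    // 2c: obstruent + sonorant
  if (!c1.voiced && !c2.voiced) return 'both-to-next';  // 2a: both voiceless
  if (c1.voiced && !c2.voiced) return 'split';          // 2γ: voiced + voiceless
  if (c1.voiced && c2.voiced) {
    if (c1.manner === c2.manner) return 'both-to-next'; // 2b: same manner
    if (c1.manner === 'fricative' && c2.manner === 'stop') return 'split'; // 2δ
  }
  return 'unspecified'; // not covered by the summary table
}

// Abstract sample classes for trying the function out.
const sonorantC: Consonant = { kind: 'sonorant' };
const voicelessStop: Consonant = { kind: 'obstruent', voiced: false, manner: 'stop' };
const voicelessFricative: Consonant = { kind: 'obstruent', voiced: false, manner: 'fricative' };
const voicedStop: Consonant = { kind: 'obstruent', voiced: true, manner: 'stop' };
const voicedFricative: Consonant = { kind: 'obstruent', voiced: true, manner: 'fricative' };

console.log(splitTwoConsonants(voicedFricative, voicedStop));
```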

Data sources

The embedded dictionary aggregates four Ukrainian lexical databases:

| Source | Records | License |
|--------|---------|---------|
| kaikki.org Wiktionary extract | ~2 M word forms | CC BY-SA 4.0 |
| lang-uk trie stress dictionary | ~2.9 M word forms | MIT |
| lang-uk plain-text stress dictionary | ~2.9 M word forms | MIT / ULIF |
| ua_variative_stressed_words (manual curation) | ~150 lemmas | — |

Duplicates are resolved by a lossless merge that preserves all unique readings.
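
As an illustration of what a lossless merge could look like, the sketch below takes the union of stress readings per word form across several sources, keeping each distinct reading once. This is an assumption about the general idea only: the entry shape, `mergeSources` and the sample data are hypothetical, and the real build step may merge richer records.

```typescript
interface SourceEntry { form: string; syllableIndex: number }

// Hypothetical merge: union of stress readings per form, in first-seen order.
function mergeSources(sources: SourceEntry[][]): Map<string, number[]> {
  const byForm = new Map<string, number[]>();
  for (const source of sources) {
    for (const entry of source) {
      const readings = byForm.get(entry.form) ?? [];
      // Keep a reading only once, no matter how many sources report it.
      if (readings.indexOf(entry.syllableIndex) === -1) readings.push(entry.syllableIndex);
      byForm.set(entry.form, readings);
    }
  }
  return byForm;
}

// Hand-written sample entries, not taken from the real sources.
const mergedDict = mergeSources([
  [{ form: 'замок', syllableIndex: 0 }, { form: 'мама', syllableIndex: 0 }],
  [{ form: 'замок', syllableIndex: 0 }, { form: 'замок', syllableIndex: 1 }],
]);
console.log(mergedDict.get('замок'));
```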

Academic sources

The IPA transcription pipeline is implemented according to:

| Source | Scope |
|--------|-------|
| Стеріополо О. (2012). «Українська фонетична система у парадигмі МФА». Науковий вісник УжНУ. Філологія. Соціальні комунікації 27: 51–58. | IPA consonant/vowel system, transcription rules, allophony |
| Савченко І. С. (2014). Фонетика, орфоепія і графіка сучасної української мови. | Syllabification rules (§19–20), sonority scale, phonological/phonetic level |
| Касьянова (2015). [/в/ allophones in standard Ukrainian] | Positional allophony of /в/: [w], [u̯], [ʋ], [ʋʲ] rules |

Relation to `ua-word-stress`

ua-word-stress and ua-word-stress-wasm share the same underlying stress dictionary data but serve different use cases:

ua-word-stress — pure TypeScript, ~9 MB compressed trie served as a separate asset. Best for applications that only need stress lookup and want zero native code.
ua-word-stress-wasm — Rust/WASM, ~14 MB self-contained binary. Best for applications that need full IPA output, syllabification, morphology, or prefer a zero-fetch dependency.

Both packages are published from the same repository: ua-stress-engine.

License

AGPL-3.0-or-later — see LICENSE.

The dictionary data retains the licenses of its contributing sources (see Data sources above). Wiktionary-derived data is CC BY-SA 4.0; attribution to Wiktionary contributors is required when redistributing.
