ukrainian-ml-optimizer
v3.0.0
Ukrainian text pre-processing library for ML optimization. Removes noise (URLs, emails, mentions, numbers, special symbols, stop words) and applies Ukrainian stemming to produce clean, normalized text suitable for machine learning pipelines.
Installation
npm install ukrainian-ml-optimizer
Quick Start
import { optimizeText } from 'ukrainian-ml-optimizer';
const raw = 'аби я @mention [email protected] спеціальний символ 😝';
const clean = optimizeText(raw);
// Returns: 'аб я спеціальн символ'
What It Does
The full pipeline (applied by optimizeText) in order:
- Convert text to lowercase
- Remove URLs
- Remove email addresses
- Remove @mentions
- Remove special symbols (keep only Latin, Cyrillic, digits)
- Remove numbers
- Remove extra spaces
- Remove Ukrainian stop words
- Normalize mixed Latin/Cyrillic words
- Apply Ukrainian stemming
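The ordered steps above can be sketched as a list of string-to-string functions folded over the input. This is a minimal illustration, not the library's source: the step bodies are simplified stand-ins, and the Step type and runPipeline helper are hypothetical names. The email, stop-word, mixed-script, and stemming steps are left out because they depend on patterns and data files not shown in this README.

```typescript
// A text-to-text transformation step (hypothetical type name).
type Step = (text: string) => string;

// Simplified stand-ins for the documented steps (assumptions, not library code).
const toLowerCase: Step = (t) => t.toLowerCase();
const removeUrl: Step = (t) => t.replace(/https?:\/\/\S+/g, ' ');
const removeMention: Step = (t) => t.replace(/@\D_?[^ ]+/g, ' ');
// Keep only Latin letters, Cyrillic letters, digits, and whitespace.
const removeSpecialSymbols: Step = (t) => t.replace(/[^a-z\u0400-\u04ff\d\s]/gi, ' ');
const removeNumber: Step = (t) => t.replace(/\d+/g, ' ');
const removeExtraSpaces: Step = (t) => t.trim().replace(/\s+/g, ' ');

// Order matters: URLs and mentions must go before punctuation is stripped,
// otherwise '://' and '@' would already be gone when those steps run.
const steps: Step[] = [
  toLowerCase,
  removeUrl,
  removeMention,
  removeSpecialSymbols,
  removeNumber,
  removeExtraSpaces,
];

// Apply each step to the output of the previous one.
function runPipeline(text: string, pipeline: Step[] = steps): string {
  return pipeline.reduce((acc, step) => step(acc), text);
}

console.log(runPipeline('Visit https://example.com, @user! Ціна 100 грн.'));
// prints: visit ціна грн
```

Keeping the steps in an array makes it easy to drop or reorder one of them when a task needs, for example, the numbers preserved.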
API Reference
optimizeText(text: string): string
Applies the full pre-processing pipeline. This is the main function for ML optimization.
optimizeText('пролeтіла pакета і вийшов вибyх приїхав тaнк');
// Returns: 'пролетіл ракет і вийшов вибух приїхав танк'
removeUrl(text: string): string
Removes HTTP and HTTPS URLs from the text.
removeUrl('visit https://example.com today');
// Returns: 'visit today'
removeEmail(text: string): string
Removes email addresses from the text.
removeEmail('contact [email protected] now');
// Returns: 'contact now'
removeMention(text: string): string
Removes @mention patterns from the text.
removeMention('hello @username world');
// Returns: 'hello world'
removeSpecialSymbols(text: string): string
Removes all characters that are not Latin letters, Cyrillic letters, or digits.
removeSpecialSymbols('hello, world!');
// Returns: 'hello world '
removeNumber(text: string): string
Removes all digit sequences from the text.
removeNumber('abc 123 def');
// Returns: 'abc def'
removeExtraSpaces(text: string): string
Trims the text and collapses multiple consecutive spaces into one.
removeExtraSpaces(' hello world ');
// Returns: 'hello world'
removeStopWords(text: string): string
Removes Ukrainian stop words. The function converts the text to lowercase, filters out empty tokens, and trims the result, so leading, trailing, and consecutive spaces are normalized as a side effect.
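The lowercase, split, filter, and join behavior described here can be sketched as follows. The one-word stop list and the removeStopWordsSketch name are hypothetical and used only for illustration; the library reads its list from the bundled JSON file.

```typescript
// Hypothetical mini stop list; the real list ships in src/data/stopwords_ua_list.json.
const stopWords = new Set<string>(['би']);

function removeStopWordsSketch(text: string): string {
  return text
    .toLowerCase()
    .split(' ')
    // Dropping empty tokens also removes leading, trailing, and repeated spaces.
    .filter((token) => token.length > 0 && !stopWords.has(token))
    .join(' ');
}

console.log(removeStopWordsSketch('  Я би   здивувався '));
// prints: я здивувався
```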
removeStopWords('аби я побачив це, я би здивувався');
// Returns: 'аби я побачив це, я здивувався'
stemText(text: string): string
Applies Ukrainian stemming to each word via the ukrstemmer library. Empty tokens are filtered out, so consecutive spaces don't produce empty stems. Words in stem-whitelist.json are left unchanged.
stemText('весна міський здивувався');
// Returns: 'весн міськ здивував'
stemText('щось'); // Returns: 'щось' (whitelisted)
replaceLatinWithCyrillic(text: string): string
Replaces Latin characters with their Cyrillic equivalents using a lookup table.
Multi-character sequences (lj, nj, dž and their case variants) are processed before single characters.
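One way to guarantee that multi-character sequences win over single characters is to sort the lookup keys longest-first and build a single alternation from them. The sketch below assumes a tiny, hypothetical excerpt of the table: the single-letter entries follow the 'hello' example in this README, while the digraph targets are a guess and may not match the library's actual mapping. Case variants are omitted.

```typescript
// Hypothetical excerpt of a lookup table; the library's real table may differ.
const lookupTable: Record<string, string> = {
  lj: '\u0459',     // assumed target: Cyrillic lje
  nj: '\u045a',     // assumed target: Cyrillic nje
  'd\u017e': '\u045f', // 'dž' -> assumed target: Cyrillic dzhe
  h: '\u0445',      // х
  e: '\u0435',      // е
  l: '\u043b',      // л
  o: '\u043e',      // о
};

// Longest keys first, so 'lj' is tried before 'l'.
const sortedKeys = Object.keys(lookupTable).sort((a, b) => b.length - a.length);
const lookupPattern = new RegExp(sortedKeys.join('|'), 'g');

function replaceSketch(text: string): string {
  // Characters without a table entry never match and are left as they are.
  return text.replace(lookupPattern, (match) => lookupTable[match] ?? match);
}

console.log(replaceSketch('hello')); // prints: хелло
console.log(replaceSketch('lj'));    // one character, not 'л' followed by 'j'
```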
replaceLatinWithCyrillic('hello');
// Returns: 'хелло'
removeLatinPartialLetters(text: string): string
For each word, if Latin characters do not outnumber Cyrillic ones (Latin-to-Cyrillic ratio ≤ 1), the Latin characters are replaced with their Cyrillic equivalents. Words with more Latin than Cyrillic characters, including purely Latin words, are left unchanged.
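The per-word ratio check can be sketched as below. This is an illustration under stated assumptions rather than the library's code: the four-entry lookalike map, the fixWord helper, and the removeLatinPartialLettersSketch name are hypothetical, and the sketch only handles lowercase replacements.

```typescript
// Hypothetical partial map of Latin letters to look-alike Cyrillic letters.
const lookalikes: Record<string, string> = {
  a: '\u0430', // Cyrillic а
  e: '\u0435', // Cyrillic е
  p: '\u0440', // Cyrillic р
  y: '\u0443', // Cyrillic у
};

function fixWord(word: string): string {
  const latin = (word.match(/[a-z]/gi) ?? []).length;
  const cyrillic = (word.match(/[\u0400-\u04ff]/g) ?? []).length;
  // 0/0 is NaN and n/0 is Infinity; neither satisfies "<= 1",
  // so words without Cyrillic letters are returned untouched.
  if (!(latin / cyrillic <= 1)) return word;
  return word.replace(/[a-z]/gi, (ch) => lookalikes[ch.toLowerCase()] ?? ch);
}

function removeLatinPartialLettersSketch(text: string): string {
  return text.split(' ').map(fixWord).join(' ');
}

// '\u0065' is a Latin 'e' hidden inside an otherwise Cyrillic word.
console.log(removeLatinPartialLettersSketch('прол\u0065тіла test'));
// prints: пролетіла test
```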
removeLatinPartialLetters('пролeтіла'); // Returns: 'пролетіла'
removeLatinPartialLetters('test test'); // Returns: 'test test'
Regex Constants
The following regex patterns are exported for use in custom pipelines:
| Export | Pattern | Description |
| --------------- | ------------------- | ------------------------- |
| numberRegexp | /\d+/g | Matches digit sequences |
| mentionRegexp | /@\D_?[^ ]+/g | Matches @mention patterns |
| urlRegexp | /https?:\/\/\S+/g | Matches HTTP/HTTPS URLs |
| emailRegexp | (lookbehind-based) | Matches email addresses |
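As an illustration of a custom pipeline, the sketch below strips URLs and mentions while keeping numbers, then extracts the numbers separately. To stay runnable on its own it declares local copies of the three patterns listed in the table; in a real project they would presumably be imported from the package instead. The stripLinksAndMentions and extractNumbers helpers are hypothetical names.

```typescript
// Local copies of the documented patterns, so the sketch runs standalone.
const numberRegexp = /\d+/g;
const mentionRegexp = /@\D_?[^ ]+/g;
const urlRegexp = /https?:\/\/\S+/g;

// Hypothetical custom step: remove URLs and mentions but keep numbers.
function stripLinksAndMentions(text: string): string {
  return text
    .replace(urlRegexp, '')
    .replace(mentionRegexp, '')
    .replace(/ +/g, ' ') // collapse the gaps left behind
    .trim();
}

// Hypothetical helper: collect every digit sequence as a string.
function extractNumbers(text: string): string[] {
  return text.match(numberRegexp) ?? [];
}

console.log(stripLinksAndMentions('ціна 100 грн @shop https://example.com/item'));
// prints: ціна 100 грн
console.log(extractNumbers('ціна 100 грн, знижка 15'));
// prints: [ '100', '15' ]
```

Because the patterns carry the g flag, replace and match act on every occurrence. The same flag makes RegExp.prototype.test stateful through lastIndex, so replace or match is the safer choice when reusing these shared constants.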
Stop Words & Stem Whitelist
- Stop words: A comprehensive list of Ukrainian stop words is bundled in src/data/stopwords_ua_list.json.
- Stem whitelist: Words in src/data/stem-whitelist.json are excluded from stemming (e.g., щось). You can inspect this file to see which words are protected.
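How a whitelist can shield words from stemming is sketched below. The stemmer here is a deliberately crude stand-in that drops the last character of a word; it is not the ukrstemmer algorithm. The one-word whitelist and the stemWithWhitelist name are also hypothetical.

```typescript
// Hypothetical whitelist; the real one lives in src/data/stem-whitelist.json.
const whitelist = new Set<string>(['щось']);

// Toy stand-in for a real stemmer: drops the final character of a word.
const toyStem = (word: string): string => word.slice(0, -1);

function stemWithWhitelist(
  text: string,
  stem: (word: string) => string = toyStem,
): string {
  return text
    .split(' ')
    .filter((token) => token.length > 0) // skip gaps from repeated spaces
    .map((token) => (whitelist.has(token) ? token : stem(token)))
    .join(' ');
}

console.log(stemWithWhitelist('весна  щось'));
// prints: весн щось
```

Passing the stemmer as a parameter keeps the whitelist check independent of the stemming algorithm, so a real stemmer can be swapped in without touching the filter.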
License
ISC
