ukrainian-ml-optimizer

v3.0.0

Published

4 months ago

Ukrainian text pre-processing library for ML optimization. Removes noise (URLs, emails, mentions, numbers, special symbols, stop words) and applies Ukrainian stemming to produce clean, normalized text suitable for machine learning pipelines.

Vulnerabilities: 0 High, 0 Medium, 0 Low

drsmile

ukrainian nlp text-processing machine-learning stemming text-normalization ukrainian-language preprocessing


Installation

npm install ukrainian-ml-optimizer

Quick Start

import { optimizeText } from 'ukrainian-ml-optimizer';

const raw = 'аби я @mention user@example.com спеціальний символ 😝';
const clean = optimizeText(raw);
// Returns: 'аб я спеціальн символ'

What It Does

The full pipeline (applied by optimizeText) in order:

1. Convert text to lowercase
2. Remove URLs
3. Remove email addresses
4. Remove @mentions
5. Remove special symbols (keep only Latin, Cyrillic, digits)
6. Remove numbers
7. Remove extra spaces
8. Remove Ukrainian stop words
9. Normalize mixed Latin/Cyrillic words
10. Apply Ukrainian stemming
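
The ordered steps above amount to function composition: each step receives the previous step's output. The sketch below illustrates that idea under stated assumptions. `Step`, `toLower`, `stripUrls`, `stripMentions`, `stripNumbers`, `collapseSpaces`, and `runPipeline` are hypothetical local stand-ins built on the patterns listed under Regex Constants. They are not the library's source, and the email, special-symbol, stop-word, Latin-normalization, and stemming steps are left out.

```typescript
// Hypothetical stand-ins for a few pipeline steps (not the library's code).
type Step = (text: string) => string;

const toLower: Step = (t) => t.toLowerCase();
const stripUrls: Step = (t) => t.replace(/https?:\/\/\S+/g, '');
const stripMentions: Step = (t) => t.replace(/@\D_?[^ ]+/g, '');
const stripNumbers: Step = (t) => t.replace(/\d+/g, '');
const collapseSpaces: Step = (t) => t.trim().replace(/ {2,}/g, ' ');

// Apply the steps left to right, feeding each result into the next step.
const runPipeline = (steps: Step[], text: string): string =>
  steps.reduce((acc, step) => step(acc), text);

const partial = runPipeline(
  [toLower, stripUrls, stripMentions, stripNumbers, collapseSpaces],
  'Visit https://example.com @user 42 TIMES',
);
// partial === 'visit times'
```

Collapsing spaces only after the removal steps matters in this sketch, because each removal leaves its surrounding spaces behind.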

API Reference

`optimizeText(text: string): string`

Applies the full pre-processing pipeline. This is the main function for ML optimization.

optimizeText('пролeтіла pакета і вийшов вибyх приїхав тaнк');
// Returns: 'пролетіл ракет і вийшов вибух приїхав танк'

`removeUrl(text: string): string`

Removes HTTP and HTTPS URLs from the text.

removeUrl('visit https://example.com today');
// Returns: 'visit  today'

`removeEmail(text: string): string`

Removes email addresses from the text.

removeEmail('contact user@example.com now');
// Returns: 'contact  now'

`removeMention(text: string): string`

Removes @mention patterns from the text.

removeMention('hello @username world');
// Returns: 'hello  world'

`removeSpecialSymbols(text: string): string`

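Removes every character that is not a Latin letter, a Cyrillic letter, a digit, or whitespace. As the example shows, each removed character leaves a space in its place, so the output may contain double or trailing spaces.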

removeSpecialSymbols('hello, world!');
// Returns: 'hello  world '
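
One way to reproduce the output shown above is a negated character class whose matches are replaced with a space. The sketch below is an assumption about how such a step could work, not the library's actual pattern. `notAllowed`, `removeSpecialSymbolsSketch`, and the choice of the `\u0400-\u04FF` range for Cyrillic are illustrative.

```typescript
// Assumed pattern: anything that is not a Latin letter, a Cyrillic letter
// (Unicode block U+0400 to U+04FF), a digit, or whitespace.
const notAllowed = /[^a-zA-Z\u0400-\u04FF0-9\s]/g;

function removeSpecialSymbolsSketch(text: string): string {
  // Substitute a space instead of '' so neighboring words are not glued together.
  return text.replace(notAllowed, ' ');
}

const stripped = removeSpecialSymbolsSketch('hello, world!');
// stripped === 'hello  world '
```

Because punctuation becomes a space, a later call to `removeExtraSpaces` is what tidies the result.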

`removeNumber(text: string): string`

Removes all digit sequences from the text.

removeNumber('abc 123 def');
// Returns: 'abc  def'

`removeExtraSpaces(text: string): string`

Trims the text and collapses multiple consecutive spaces into one.

removeExtraSpaces('  hello   world  ');
// Returns: 'hello world'

`removeStopWords(text: string): string`

Removes Ukrainian stop words. The function lowercases the text, drops empty tokens, and trims the result, so leading, trailing, and consecutive spaces are normalized as a side effect.

removeStopWords('аби я побачив це, я би здивувався');
// Returns: 'аби я побачив це, я здивувався'
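
The filtering described above can be sketched as a split, filter, and join over space-separated tokens. This is a minimal illustration, not the library's implementation. `removeStopWordsSketch` and `sampleStopWords` are hypothetical names, and the one-word sample set contains only `би`, the stop word that the example above confirms. The real list lives in the bundled JSON file.

```typescript
// Sample set for illustration only; the bundled list is much larger.
const sampleStopWords = new Set<string>(['би']);

function removeStopWordsSketch(text: string, stopWords: Set<string>): string {
  return text
    .toLowerCase()
    .split(' ')
    // Dropping empty tokens also removes leading, trailing, and repeated spaces.
    .filter((token) => token !== '' && !stopWords.has(token))
    .join(' ');
}

const filtered = removeStopWordsSketch('аби я побачив це, я би здивувався', sampleStopWords);
// filtered === 'аби я побачив це, я здивувався'
```

Tokens are compared whole, so `це,` with its trailing comma would not match a stop word `це`. That is one reason the full pipeline strips special symbols before this step.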

`stemText(text: string): string`

Applies Ukrainian stemming to each word via the ukrstemmer library. Empty tokens are filtered out so consecutive spaces don't produce empty stems. Words in stem-whitelist.json are left unchanged.

stemText('весна міський здивувався');
// Returns: 'весн міськ здивував'

stemText('щось'); // Returns: 'щось' (whitelisted)
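
The whitelist behavior can be sketched with the stemmer passed in as a parameter, which keeps the example independent of any particular stemming package. `Stemmer`, `stemTextSketch`, and `dropLastChar` are hypothetical names. `dropLastChar` is a toy stand-in, not a real Ukrainian stemmer, and the sketch makes no claim about how the library calls ukrstemmer.

```typescript
type Stemmer = (word: string) => string;

function stemTextSketch(text: string, stem: Stemmer, whitelist: Set<string>): string {
  return text
    .split(' ')
    // Skip empty tokens so repeated spaces do not produce empty stems.
    .filter((token) => token !== '')
    // Whitelisted words pass through untouched; everything else is stemmed.
    .map((token) => (whitelist.has(token) ? token : stem(token)))
    .join(' ');
}

// Toy stemmer for demonstration only: drops one trailing character.
const dropLastChar: Stemmer = (word) => word.slice(0, -1);

const stemmed = stemTextSketch('весна  щось', dropLastChar, new Set(['щось']));
// stemmed === 'весн щось'
```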

`replaceLatinWithCyrillic(text: string): string`

Replaces Latin characters with their Cyrillic equivalents using a lookup table. Multi-character sequences (lj, nj, dž and their case variants) are processed before single characters.

replaceLatinWithCyrillic('hello');
// Returns: 'хелло'
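
To see why multi-character sequences have to be handled first, consider a lookup-table replacement in which longer keys are tried before shorter ones. The sketch below is illustrative only. `sampleTable` and `replaceWithTable` are hypothetical, the single-letter entries are the ones implied by the `hello` example, and the digraph targets (`љ`, `њ`) are guesses that follow Serbian Latin conventions rather than confirmed library mappings.

```typescript
// Partial, illustrative table. Digraph targets are assumptions.
const sampleTable: Record<string, string> = {
  lj: 'љ',
  nj: 'њ',
  h: 'х',
  e: 'е',
  l: 'л',
  o: 'о',
};

function replaceWithTable(text: string, table: Record<string, string>): string {
  // Longest keys first, so 'lj' is matched as a unit before 'l' can claim its first letter.
  const keys = Object.keys(table).sort((a, b) => b.length - a.length);
  // Keys here are plain letters, so no regex escaping is needed.
  const pattern = new RegExp(keys.join('|'), 'g');
  return text.replace(pattern, (match) => table[match] ?? match);
}

const converted = replaceWithTable('hello', sampleTable);
// converted === 'хелло'
```

If the single letters were tried first, `lj` would come out as `л` followed by an unconverted `j` instead of one Cyrillic letter.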

`removeLatinPartialLetters(text: string): string`

Processes each word separately. If the ratio of Latin to Cyrillic characters in a word is ≤ 1 (the word contains at least as many Cyrillic characters as Latin ones), its Latin characters are replaced with their Cyrillic equivalents. Predominantly Latin words are left unchanged.

removeLatinPartialLetters('пролeтіла'); // Returns: 'пролетіла'
removeLatinPartialLetters('test test'); // Returns: 'test test'
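
The per-word ratio check can be sketched as follows. This is an illustration under stated assumptions, not the library's code. `lookalikes`, `fixMixedWord`, and `fixMixedText` are hypothetical names, the table holds only the four letter pairs that the `optimizeText` example confirms, and treating a word with no Cyrillic characters as "leave unchanged" is an assumption chosen to match the `test test` example. The table uses escape sequences because the Latin and Cyrillic letters look identical on screen.

```typescript
// Latin letter -> visually identical Cyrillic letter (partial, illustrative).
const lookalikes: Record<string, string> = {
  e: '\u0435', // Cyrillic е
  p: '\u0440', // Cyrillic р
  y: '\u0443', // Cyrillic у
  a: '\u0430', // Cyrillic а
};

function fixMixedWord(word: string): string {
  const latin = (word.match(/[a-zA-Z]/g) ?? []).length;
  const cyrillic = (word.match(/[\u0400-\u04FF]/g) ?? []).length;
  // Assumption: no Cyrillic at all means the word is Latin and is skipped.
  if (cyrillic === 0 || latin / cyrillic > 1) return word;
  // Lowercase input is assumed, as in the pipeline, so only a-z is replaced.
  return word.replace(/[a-z]/g, (ch) => lookalikes[ch] ?? ch);
}

const fixMixedText = (text: string): string =>
  text.split(' ').map(fixMixedWord).join(' ');

// The first word hides one Latin "e" among eight Cyrillic letters (ratio 1/8).
const fixedText = fixMixedText('пролeтіла test');
```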

Regex Constants

The following regex patterns are exported for use in custom pipelines:

| Export | Pattern | Description |
| --------------- | ------------------- | ------------------------- |
| `numberRegexp` | `/\d+/g` | Matches digit sequences |
| `mentionRegexp` | `/@\D_?[^ ]+/g` | Matches @mention patterns |
| `urlRegexp` | `/https?:\/\/\S+/g` | Matches HTTP/HTTPS URLs |
| `emailRegexp` | (lookbehind-based) | Matches email addresses |
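
As an example of a custom pipeline, the patterns can replace matches with placeholder tokens instead of deleting them. The sketch below declares local copies of three documented patterns so that it runs on its own. In a real project they would presumably be imported from the package. `maskEntities` and the placeholder strings are hypothetical, and `emailRegexp` is omitted because its pattern is not spelled out here.

```typescript
// Local copies of the documented patterns, for a self-contained example.
const numberRegexp = /\d+/g;
const mentionRegexp = /@\D_?[^ ]+/g;
const urlRegexp = /https?:\/\/\S+/g;

// Mask entities with placeholder tokens rather than removing them.
function maskEntities(text: string): string {
  return text
    .replace(urlRegexp, '<url>') // URLs first, so digits inside a URL are not masked separately
    .replace(mentionRegexp, '<mention>')
    .replace(numberRegexp, '<num>');
}

const masked = maskEntities('see https://example.com/page2 @anna 15 times');
// masked === 'see <url> <mention> <num> times'
```

All listed patterns carry the `g` flag. `String.prototype.replace` resets a global regex's `lastIndex`, so reusing them this way is safe, whereas repeated `test()` or `exec()` calls on the same global regex object keep state between calls.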

Stop Words & Stem Whitelist

Stop words: A comprehensive list of Ukrainian stop words is bundled in src/data/stopwords_ua_list.json.
Stem whitelist: Words in src/data/stem-whitelist.json are excluded from stemming (e.g., щось). You can inspect this file to understand which words are protected.

License

ISC
