ukrainian-ml-optimizer

v3.0.0

Ukrainian text pre-processing library for ML optimization. Removes noise (URLs, emails, mentions, numbers, special symbols, stop words) and applies Ukrainian stemming to produce clean, normalized text suitable for machine learning pipelines.

Installation

npm install ukrainian-ml-optimizer

Quick Start

import { optimizeText } from 'ukrainian-ml-optimizer';

const raw = 'аби я @mention user@example.com спеціальний символ 😝';
const clean = optimizeText(raw);
// Returns: 'аб я спеціальн символ'

What It Does

The full pipeline (applied by optimizeText) in order:

  1. Convert text to lowercase
  2. Remove URLs
  3. Remove email addresses
  4. Remove @mentions
  5. Remove special symbols (keep only Latin, Cyrillic, digits)
  6. Remove numbers
  7. Remove extra spaces
  8. Remove Ukrainian stop words
  9. Normalize mixed Latin/Cyrillic words
  10. Apply Ukrainian stemming
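
The ordering above can be sketched as a simple left-to-right composition of string functions. The snippet below is a minimal illustration, not the package's implementation: the step functions are local stand-ins built from the regex patterns documented in this README, the names (toLowerCase, stripUrls, runPipeline and so on) are hypothetical, and the stop-word, Latin/Cyrillic and stemming steps are left out.

```javascript
// Local stand-ins for a few pipeline steps (assumed, not the package's code).
const toLowerCase = (text) => text.toLowerCase();
const stripUrls = (text) => text.replace(/https?:\/\/\S+/g, '');
const stripMentions = (text) => text.replace(/@\D_?[^ ]+/g, '');
const stripNumbers = (text) => text.replace(/\d+/g, '');
const collapseSpaces = (text) => text.trim().replace(/ {2,}/g, ' ');

// Feed the output of each step into the next one, left to right.
const runPipeline = (steps, text) => steps.reduce((acc, step) => step(acc), text);

const steps = [toLowerCase, stripUrls, stripMentions, stripNumbers, collapseSpaces];
const result = runPipeline(steps, 'Visit https://example.com @user 2024  Today');
console.log(result); // 'visit today'
```

Order matters in a chain like this: a step that strips punctuation would break up URLs and drop the @ of a mention, so those patterns have to be removed first, and the space-collapsing step only makes sense after the removals that leave gaps behind.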

API Reference

optimizeText(text: string): string

Applies the full pre-processing pipeline. This is the main function for ML optimization.

optimizeText('пролeтіла pакета і вийшов вибyх приїхав тaнк');
// Returns: 'пролетіл ракет і вийшов вибух приїхав танк'

removeUrl(text: string): string

Removes HTTP and HTTPS URLs from the text.

removeUrl('visit https://example.com today');
// Returns: 'visit  today'

removeEmail(text: string): string

Removes email addresses from the text.

removeEmail('contact user@example.com now');
// Returns: 'contact  now'

removeMention(text: string): string

Removes @mention patterns from the text.

removeMention('hello @username world');
// Returns: 'hello  world'

removeSpecialSymbols(text: string): string

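Removes every character that is not a Latin letter, a Cyrillic letter, or a digit. As the example shows, whitespace is kept and each removed character leaves a space in its place.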

removeSpecialSymbols('hello, world!');
// Returns: 'hello  world '
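
One way to get this behavior is a negated character class with Unicode script properties. This is only a sketch under that assumption: the pattern and the name keepLettersAndDigits are hypothetical, and the package may use a different expression.

```javascript
// Replace anything that is not Latin, Cyrillic, an ASCII digit, or whitespace
// with a single space. The "u" flag enables \p{...} and treats an emoji as one
// code point, so it becomes exactly one space.
const keepLettersAndDigits = (text) =>
  text.replace(/[^\p{Script=Latin}\p{Script=Cyrillic}\d\s]/gu, ' ');

console.log(JSON.stringify(keepLettersAndDigits('hello, world!'))); // "hello  world "
```

Substituting a space instead of deleting the character keeps words such as "a,b" from fusing into "ab"; the doubled spaces that result are the reason a space-collapsing step follows later in the pipeline.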

removeNumber(text: string): string

Removes all digit sequences from the text.

removeNumber('abc 123 def');
// Returns: 'abc  def'

removeExtraSpaces(text: string): string

Trims the text and collapses multiple consecutive spaces into one.

removeExtraSpaces('  hello   world  ');
// Returns: 'hello world'

removeStopWords(text: string): string

Removes Ukrainian stop words. The function lowercases the text, filters out empty tokens, and trims the result, so leading, trailing, and consecutive spaces are normalized automatically.

removeStopWords('аби я побачив це, я би здивувався');
// Returns: 'аби я побачив це, я здивувався'
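
The behavior described above (lowercase, split, drop empty tokens and stop words, rejoin) can be sketched as follows. The two-word stop list and the name dropStopWords are hypothetical; the real list is the JSON file bundled with the package.

```javascript
// Hypothetical stop list for illustration only.
const STOP_WORDS = new Set(['би', 'же']);

const dropStopWords = (text) =>
  text
    .toLowerCase()
    .split(' ')
    // Empty tokens come from leading, trailing, or repeated spaces.
    .filter((token) => token !== '' && !STOP_WORDS.has(token))
    .join(' ');

console.log(dropStopWords('аби я побачив це, я би здивувався'));
// 'аби я побачив це, я здивувався'
```

In this sketch matching is done on whole tokens, so "аби" survives even though it ends in "би", and a token with punctuation attached (for example "би,") would not match. That is one reason to strip special symbols before removing stop words.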

stemText(text: string): string

Applies Ukrainian stemming to each word via the ukrstemmer library. Empty tokens are filtered out so consecutive spaces don't produce empty stems. Words in stem-whitelist.json are left unchanged.

stemText('весна міський здивувався');
// Returns: 'весн міськ здивував'

stemText('щось'); // Returns: 'щось' (whitelisted)
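
The whitelist logic can be illustrated independently of the stemmer. In the sketch below the stemmer is passed in as a parameter, and toyStem is a deliberately crude placeholder (it only drops a trailing "а"); it is not ukrstemmer, and stemWords is a hypothetical name.

```javascript
// Stem every non-empty token unless it is whitelisted.
const stemWords = (text, stemWord, whitelist) =>
  text
    .split(' ')
    .filter((token) => token !== '') // repeated spaces yield empty tokens
    .map((token) => (whitelist.has(token) ? token : stemWord(token)))
    .join(' ');

// Placeholder stemmer for demonstration only.
const toyStem = (word) => word.replace(/а$/, '');

console.log(stemWords('весна  щось', toyStem, new Set(['щось'])));
// 'весн щось'
```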

replaceLatinWithCyrillic(text: string): string

Replaces Latin characters with their Cyrillic equivalents using a lookup table. Multi-character sequences (lj, nj, and their case variants) are processed before single characters.

replaceLatinWithCyrillic('hello');
// Returns: 'хелло'
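
Processing multi-character sequences first can be done by sorting the lookup keys by length and building one alternation from them. The table below is a small hypothetical subset (in particular, the targets chosen for lj and nj are assumptions), it covers lowercase only, and transliterate is not the package's function.

```javascript
// Hypothetical subset of a Latin-to-Cyrillic lookup table.
const TABLE = { lj: 'љ', nj: 'њ', h: 'х', e: 'е', l: 'л', o: 'о', n: 'н' };

// Longer keys go first so that "nj" wins over "n" at the same position.
const keys = Object.keys(TABLE).sort((a, b) => b.length - a.length);
const pattern = new RegExp(keys.join('|'), 'g');

const transliterate = (text) => text.replace(pattern, (match) => TABLE[match]);

console.log(transliterate('hello')); // 'хелло'
console.log(transliterate('nje'));   // 'ње'
```

If the single-character entries were tried first, "nje" would be split into "n", "j", "e" and the digraph mapping would never fire.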

removeLatinPartialLetters(text: string): string

For each word, if the ratio of Latin to Cyrillic characters is ≤ 1 (the word has at least as many Cyrillic characters as Latin ones), replaces the Latin characters with their Cyrillic equivalents. Words that are predominantly Latin are left unchanged.

removeLatinPartialLetters('пролeтіла'); // Returns: 'пролетіла'
removeLatinPartialLetters('test test'); // Returns: 'test test'
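
The per-word decision can be sketched by counting characters of each script. Everything here is an assumption made for illustration: the look-alike table is a small hypothetical subset, only lowercase Latin letters are replaced, and fixMixedWords is not the package's function.

```javascript
// Hypothetical Latin -> Cyrillic look-alike table (keys are Latin letters).
const LOOKALIKES = { a: 'а', e: 'е', o: 'о', p: 'р', c: 'с', y: 'у', x: 'х', i: 'і' };

const fixWord = (word) => {
  const latin = (word.match(/[a-z]/gi) || []).length;
  const cyrillic = (word.match(/\p{Script=Cyrillic}/gu) || []).length;
  // Leave the word alone if it has no Latin letters or more Latin than Cyrillic.
  if (latin === 0 || latin > cyrillic) return word;
  return word.replace(/[a-z]/g, (ch) => LOOKALIKES[ch] || ch);
};

const fixMixedWords = (text) => text.split(' ').map(fixWord).join(' ');

// The middle "e" below is Latin; the rest of the word is Cyrillic.
console.log(fixMixedWords('прол' + 'e' + 'тіла')); // all-Cyrillic 'пролетіла'
console.log(fixMixedWords('test test'));            // 'test test'
```

A purely Latin word has zero Cyrillic characters, so its ratio is unbounded and it is never rewritten, which matches the second documented example.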

Regex Constants

The following regex patterns are exported for use in custom pipelines:

| Export | Pattern | Description |
| --------------- | ------------------- | ------------------------- |
| numberRegexp | /\d+/g | Matches digit sequences |
| mentionRegexp | /@\D_?[^ ]+/g | Matches @mention patterns |
| urlRegexp | /https?:\/\/\S+/g | Matches HTTP/HTTPS URLs |
| emailRegexp | (lookbehind-based) | Matches email addresses |
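
Besides removal, the same patterns can be used to extract what would be removed, for example to keep mentions as separate features. The sketch below redefines three of the patterns locally exactly as documented in the table so that it runs without the package; in real code they would be imported. The extract helper is hypothetical, and emailRegexp is omitted because its source is not shown here.

```javascript
// Patterns copied from the Regex Constants table.
const numberRegexp = /\d+/g;
const mentionRegexp = /@\D_?[^ ]+/g;
const urlRegexp = /https?:\/\/\S+/g;

// With a global regex, String.prototype.match returns every match or null.
const extract = (regexp, text) => text.match(regexp) || [];

const sample = 'ping @user_1 at https://example.com 3 times';
console.log(extract(mentionRegexp, sample)); // [ '@user_1' ]
console.log(extract(urlRegexp, sample));     // [ 'https://example.com' ]
console.log(extract(numberRegexp, sample));  // [ '1', '3' ]
```

The last result shows why mentions are removed before numbers in the pipeline: the digit inside @user_1 is matched by numberRegexp on its own. Note also that patterns with the g flag keep state in lastIndex when used with test() or exec(), so match() and replace() are the safer choice when sharing one regex object.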

Stop Words & Stem Whitelist

  • Stop words: A comprehensive list of Ukrainian stop words is bundled in src/data/stopwords_ua_list.json.
  • Stem whitelist: Words in src/data/stem-whitelist.json are excluded from stemming (e.g., щось). You can inspect this file to understand which words are protected.

License

ISC