text-prep-lite

v0.0.3

Published 10 months ago

Lightweight text preprocessing utilities for NLP in TypeScript.

Vulnerabilities: 0 High, 0 Medium, 0 Low

cavani21

nlp text preprocessing tokenize typescript normalize stemming lemmatization stopwords contractions

text-prep-lite

Lightweight text preprocessing utilities for Natural Language Processing (NLP) written in TypeScript.

text-prep-lite provides two core helpers:

normalizeText – clean & normalise raw text into a predictable representation.
tokenize – break text into lowercase word tokens.

The library is intentionally dependency-free and suitable for browsers, Node.js, and serverless environments.

Why?

Natural-language data is messy. Before tokenisation or feeding text into an NLP model you often need to:

normalise case & whitespace
expand contractions ("can't" → "cannot")
strip punctuation / emojis

text-prep-lite does those common steps with zero runtime dependencies.

Installation

npm install text-prep-lite
# or
yarn add text-prep-lite

Usage

import { normalizeText, tokenize } from "text-prep-lite";

const raw = "  I can't believe it's not butter! 🧈  ";

const cleaned = normalizeText(raw, {
  expandContractions: true,
  removePunctuation: true,
  removeEmojis: true,
});
// → "i cannot believe it is not butter"

const tokens = tokenize(raw);
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]

API

`normalizeText(input: string, options?: NormalizeOptions): string`

Returns a cleaned version of input.

NormalizeOptions:

| Option | Default | Description |
|--------|---------|-------------|
| expandContractions | false | Expand contractions for the selected locale. |
| removePunctuation | false | Strip punctuation characters. |
| removeEmojis | false | Remove Unicode emoji characters. |
| locale | 'en' | BCP-47 language tag for locale-specific rules (see Supported locales). |
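For readers who prefer types to tables, the options above could be written roughly as follows. This is a sketch inferred from the table only: `NormalizeOptionsSketch`, `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names, and the type actually exported by the package may differ.

```typescript
// Hypothetical shape inferred from the options table; not the package's real type.
interface NormalizeOptionsSketch {
  expandContractions?: boolean;
  removePunctuation?: boolean;
  removeEmojis?: boolean;
  locale?: string; // BCP-47 tag such as "en" or "fr"
}

// Defaults as documented in the table.
const DEFAULT_OPTIONS: Required<NormalizeOptionsSketch> = {
  expandContractions: false,
  removePunctuation: false,
  removeEmojis: false,
  locale: "en",
};

// Merge caller-supplied options over the documented defaults.
function resolveOptions(
  options: NormalizeOptionsSketch = {},
): Required<NormalizeOptionsSketch> {
  return { ...DEFAULT_OPTIONS, ...options };
}

const resolved = resolveOptions({ removeEmojis: true, locale: "fr" });
console.log(resolved);
```

Every field is optional, so calling with no options at all behaves like passing the defaults.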

Supported locales

en – English (default)
sq – Albanian
fr – French
de – German
he – Hebrew
es – Spanish
zh – Chinese (Mandarin)
yue – Chinese (Cantonese)

// French example
normalizeText("C'est incroyable!", { expandContractions: true, locale: "fr" });
// → "ce est incroyable!"  (punctuation kept in this call)
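To make the idea of locale-specific rules concrete, here is a minimal sketch of how contraction expansion could be keyed by locale. It is not the package's source: `CONTRACTIONS` and `expandContractionsSketch` are hypothetical, and the tables hold only the handful of entries needed for the examples in this README.

```typescript
// Illustrative contraction tables keyed by locale.
// The entries are assumptions for this sketch, not the package's real data.
const CONTRACTIONS = new Map<string, Map<string, string>>([
  ["en", new Map([["can't", "cannot"], ["it's", "it is"]])],
  ["fr", new Map([["c'est", "ce est"]])],
]);

function expandContractionsSketch(input: string, locale: string = "en"): string {
  // Lowercase and map the curly apostrophe to a straight one before lookup.
  const lowered = input.toLowerCase().replace(/\u2019/g, "'");
  const table = CONTRACTIONS.get(locale);
  if (table === undefined) return lowered; // unknown locale: nothing to expand

  // Treat a word as a run of characters other than whitespace and common
  // punctuation, and swap in the expansion when the table has one.
  return lowered.replace(/[^\s.,!?;:"()]+/g, (word) => table.get(word) ?? word);
}

console.log(expandContractionsSketch("C'est incroyable!", "fr"));
// → "ce est incroyable!"
console.log(expandContractionsSketch("I can't believe it's not butter!"));
// → "i cannot believe it is not butter!"
```

Keeping one table per locale means adding a language only requires adding data, which is one plausible reason to expose locale as an option.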

`tokenize(input: string): string[]`

Converts text to lowercase.
Removes punctuation & emojis.
Splits by whitespace / word boundaries.

Returns an array of tokens.

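tokenize takes no options. Punctuation acts as a token separator, so "can't" becomes "can" and "t" (see Usage). If you would rather get "cannot", run the text through normalizeText with expandContractions: true before tokenizing.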
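The behaviour described above can be approximated in a few lines. This is a sketch of one way to get the documented output, not the package's implementation; `tokenizeSketch` and `NON_WORD` are hypothetical names.

```typescript
// Matches any run of characters that is not a Unicode letter or digit, so
// punctuation, emojis and whitespace all count as separators.
// Built with the RegExp constructor so the Unicode property escapes do not
// depend on the TypeScript compile target.
const NON_WORD = new RegExp("[^\\p{L}\\p{N}]+", "u");

function tokenizeSketch(input: string): string[] {
  return input
    .toLowerCase()
    .split(NON_WORD)
    .filter((token) => token.length > 0); // drop empty strings at the edges
}

console.log(tokenizeSketch("  I can't believe it's not butter! 🧈  "));
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]
```

Because the apostrophe is a separator here, contractions are split rather than expanded. Text without spaces between words, such as Chinese, would come back as a single token under this approach.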

🔗 Related

👉 Need word embeddings for semantic analysis?
Check out wink-embeddings-small-en-50d
👉 Need a simple and robust PDF text extraction utility with a quality interface? Check out [pdf-worker-package](https://www.npmjs.com/package/pdf-worker-package)

Development

# run tests
npm test

# build library
npm run build

License

Contributing

Fork & clone the repo
npm i
npm test – run lint & unit tests
Submit pull-request 🚀

Please add tests for any new feature or bug-fix.
