text-prep-lite
v0.0.3
Lightweight text preprocessing utilities for Natural Language Processing (NLP) written in TypeScript.
text-prep-lite provides two core helpers:
- normalizeText – clean & normalise raw text into a predictable representation.
- tokenize – break text into lowercase word tokens.
The library is intentionally dependency-free and suitable for browsers, Node.js, and serverless environments.
Why?
Natural-language data is messy. Before tokenising text or feeding it into an NLP model, you often need to:
- normalise case & whitespace
- expand contractions ("can't" → "cannot")
- strip punctuation / emojis
text-prep-lite does those common steps with zero runtime dependencies.
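As a rough illustration of the case, whitespace, and emoji steps, the same clean-up can be approximated with built-in string methods and a Unicode property escape. This is a minimal sketch and not the package's implementation: `basicClean` and `PICTOGRAPHIC` are hypothetical names, and the pattern removes only base pictographic characters, so modifiers such as variation selectors would need extra handling.

```typescript
// Hypothetical helper for illustration; not part of text-prep-lite.
// Matches characters that carry the Unicode Extended_Pictographic
// property, which covers most emoji.
const PICTOGRAPHIC = new RegExp("\\p{Extended_Pictographic}", "gu");

function basicClean(text: string): string {
  return text
    .replace(PICTOGRAPHIC, "") // drop emoji-like characters
    .replace(/\s+/g, " ")      // collapse runs of whitespace
    .trim()                    // remove leading/trailing whitespace
    .toLowerCase();            // normalise case
}

console.log(basicClean("  Hello   WORLD 🧈 "));
// → "hello world"
```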
Installation
npm install text-prep-lite
# or
yarn add text-prep-lite
Usage
import { normalizeText, tokenize } from "text-prep-lite";
const raw = " I can't believe it's not butter! 🧈 ";
const cleaned = normalizeText(raw, {
expandContractions: true,
removePunctuation: true,
removeEmojis: true,
});
// → "i cannot believe it is not butter"
const tokens = tokenize(raw);
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]
API
normalizeText(input: string, options?: NormalizeOptions): string
Returns a cleaned version of input.
NormalizeOptions:
| Option | Default | Description |
|--------|---------|-------------|
| expandContractions | false | Expand contractions for the selected locale. |
| removePunctuation | false | Strip punctuation characters. |
| removeEmojis | false | Remove Unicode emoji characters. |
| locale | 'en' | BCP-47 language tag selecting locale-specific rules (see Supported locales). |
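Read as a TypeScript type, the table suggests an options object in which every field is optional. The sketch below is reconstructed from the table alone; the package's actual type declarations may differ, and `DEFAULT_OPTIONS` and `withDefaults` are hypothetical names used only to show how the documented defaults would apply.

```typescript
// Shape inferred from the options table; an assumption, not the
// package's published typings.
interface NormalizeOptions {
  expandContractions?: boolean;
  removePunctuation?: boolean;
  removeEmojis?: boolean;
  locale?: string;
}

// Defaults as documented in the table.
const DEFAULT_OPTIONS: Required<NormalizeOptions> = {
  expandContractions: false,
  removePunctuation: false,
  removeEmojis: false,
  locale: "en",
};

// Hypothetical helper: caller-supplied fields override the defaults.
function withDefaults(options: NormalizeOptions = {}): Required<NormalizeOptions> {
  return { ...DEFAULT_OPTIONS, ...options };
}

console.log(withDefaults({ removeEmojis: true }));
```

Under these assumptions, omitting the options argument entirely behaves the same as passing `{}`: nothing is expanded or stripped, and the locale is `"en"`.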
Supported locales
- en – English (default)
- sq – Albanian
- fr – French
- de – German
- he – Hebrew
- es – Spanish
- zh – Chinese (Mandarin)
- yue – Chinese (Cantonese)
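One plausible way to organise locale-specific contraction rules is a lookup table keyed by locale. The sketch below is an assumption about structure, not the package's real data or code: `CONTRACTIONS` and `expandContractionsSketch` are hypothetical names, and the entries are limited to the contractions that appear in this README's examples.

```typescript
// Illustrative tables only: entries come from this README's examples,
// and the structure is an assumption, not the package's real data.
const CONTRACTIONS: Record<string, Record<string, string>> = {
  en: { "can't": "cannot", "it's": "it is" },
  fr: { "c'est": "ce est" },
};

// Hypothetical helper; lowercases first so lookups are case-insensitive.
function expandContractionsSketch(text: string, locale: string = "en"): string {
  // Unknown locales fall back to an empty table, leaving the text unexpanded.
  const table = CONTRACTIONS[locale] || {};
  let result = text.toLowerCase();
  for (const contraction of Object.keys(table)) {
    // split/join replaces every occurrence without regex escaping.
    result = result.split(contraction).join(table[contraction]);
  }
  return result;
}

console.log(expandContractionsSketch("C'est incroyable!", "fr"));
// → "ce est incroyable!"
```

This plain substring replacement keeps the sketch short. A production version would likely match on word boundaries and also handle typographic apostrophes (’).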
// French example
normalizeText("C'est incroyable!", { expandContractions: true, locale: "fr" });
// → "ce est incroyable!" (punctuation kept in this call)
tokenize(input: string): string[]
- Converts text to lowercase.
- Removes punctuation & emojis.
- Splits by whitespace / word boundaries.
Returns an array of tokens.
tokenize has no options – it always lowercases, strips punctuation & emojis, and splits on whitespace.
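The three steps listed for tokenize can be approximated in a few lines. This is a minimal sketch of the described behaviour, not the package's source: `sketchTokenize` and `NON_WORD` are hypothetical names, and it assumes punctuation and emojis are replaced by spaces, which is consistent with "can't" becoming "can" and "t" in the Usage example.

```typescript
// Hypothetical stand-in for illustration; not the package's source.
// Matches anything that is not a letter, a digit, or whitespace,
// which covers both punctuation and emoji.
const NON_WORD = new RegExp("[^\\p{L}\\p{N}\\s]", "gu");

function sketchTokenize(input: string): string[] {
  return input
    .toLowerCase()                        // 1. lowercase
    .replace(NON_WORD, " ")               // 2. punctuation & emoji become spaces
    .split(/\s+/)                         // 3. split on whitespace
    .filter((token) => token.length > 0); // drop empty edge tokens
}

console.log(sketchTokenize(" I can't believe it's not butter! 🧈 "));
// → ["i", "can", "t", "believe", "it", "s", "not", "butter"]
```

Because the pattern keeps only letters and digits, scripts that rely on combining marks would need `\p{M}` added to the character class.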
🔗 Related
👉 Need word embeddings for semantic analysis? Check out wink-embeddings-small-en-50d
👉 Need a simple and robust PDF text extraction utility with a quality interface? Check out [pdf-worker-package](https://www.npmjs.com/package/pdf-worker-package)
Development
# run tests
npm test
# build library
npm run build
License
MIT © Cavani21/thegreatbey
Contributing
- Fork & clone the repo
- npm i
- npm test – run lint & unit tests
- Submit a pull request 🚀
Please add tests for any new feature or bug-fix.
