multilingual-tokenizer

v1.0.0

Published

a year ago

A Node.js library for tokenizing text in Thai, English, Japanese, and Korean using regex

burblanks

tokenizer nlp thai english japanese korean multilingual regex

Multilingual Tokenizer

A Node.js library for tokenizing text in multiple languages (Thai, English, Japanese, and Korean) using regex-based approaches.

Features

- Support for tokenizing text in:
  - English
  - Thai
  - Japanese
  - Korean
- Automatic language detection
- Token classification (word, number, punctuation, etc.)
- Normalization options
- Whitespace preservation options
- Simple, lightweight implementation using regular expressions

Installation

npm install multilingual-tokenizer

Usage

const {
  MultilingualTokenizer,
  TOKEN_TYPES,
} = require("multilingual-tokenizer");

// Create a new tokenizer instance
const tokenizer = new MultilingualTokenizer({
  preserveWhitespace: true, // Keep whitespace tokens
  normalizeText: true, // Apply Unicode normalization
});

// Tokenize English text
const englishText = "Hello, world!";
const englishTokens = tokenizer.tokenize(englishText);
console.log(englishTokens);

// Tokenize Thai text
const thaiText = "สวัสดีครับ";
const thaiTokens = tokenizer.tokenize(thaiText);
console.log(thaiTokens);

// Tokenize Japanese text
const japaneseText = "こんにちは、世界！";
const japaneseTokens = tokenizer.tokenize(japaneseText);
console.log(japaneseTokens);

// Tokenize Korean text
const koreanText = "안녕하세요, 세계!";
const koreanTokens = tokenizer.tokenize(koreanText);
console.log(koreanTokens);

// Force a specific language tokenizer, skipping automatic detection
const forcedTokens = tokenizer.tokenize(englishText, "japanese");

// Extract only word tokens
const words = tokenizer.extractWords(englishTokens);

Token Structure

Each token is represented as an object with two properties:

{
  type: 'WORD',  // One of the values from TOKEN_TYPES
  value: 'Hello' // The actual token text
}

The available token types are:

- WORD: Words and word-like constructs
- NUMBER: Numeric values
- SPACE: Whitespace (spaces, tabs, newlines)
- PUNCTUATION: Punctuation marks
- SYMBOL: Symbols (#, $, %, etc.)
- OTHER: Unclassified characters
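
Because every token carries a `type`, a token array can be summarized without the library itself. The following is a minimal sketch that counts tokens per type. It assumes the `type` values are the plain strings listed above; `countByType` is a hypothetical helper, not part of the package.

```javascript
// Hand-written tokens in the { type, value } shape described above.
// Assumption: type values are the plain strings 'WORD', 'SPACE', etc.
const sampleTokens = [
  { type: "WORD", value: "Hello" },
  { type: "PUNCTUATION", value: "," },
  { type: "SPACE", value: " " },
  { type: "WORD", value: "world" },
  { type: "PUNCTUATION", value: "!" },
];

// Hypothetical helper: count how many tokens of each type appear.
function countByType(tokenList) {
  const counts = {};
  for (const { type } of tokenList) {
    counts[type] = (counts[type] || 0) + 1;
  }
  return counts;
}

const typeCounts = countByType(sampleTokens);
console.log(typeCounts);
```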

API Reference

Constructor

const tokenizer = new MultilingualTokenizer(options);

Options:

- preserveWhitespace (default: false): Whether to include whitespace tokens in the output
- normalizeText (default: true): Whether to apply Unicode normalization before tokenization
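
To make the two options concrete, here is a small standalone sketch of what they plausibly control. It does not call the library. The normalization form (NFC) is an assumption chosen for illustration, since the form used internally is not documented here.

```javascript
// Sketch only; this is not the library's code.
// Assumption: NFC is used purely as an example normalization form.
const decomposed = "e\u0301"; // "e" followed by a combining acute accent
const composed = "\u00e9"; // precomposed "é"

// Without normalization the two spellings compare as different strings.
const sameBefore = decomposed === composed;
// After normalization they become identical.
const sameAfter = decomposed.normalize("NFC") === composed;

// Dropping SPACE tokens, which is what preserveWhitespace: false is described as doing.
const withSpaces = [
  { type: "WORD", value: "Hello" },
  { type: "SPACE", value: " " },
  { type: "WORD", value: "world" },
];
const withoutSpaces = withSpaces.filter((token) => token.type !== "SPACE");

console.log(sameBefore, sameAfter, withoutSpaces.length);
```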

Methods

`tokenize(text, language = null)`

Tokenizes the input text. If language is not provided, it will be automatically detected.

- text (string): The text to tokenize
- language (string, optional): Force a specific language tokenizer ('english', 'thai', 'japanese', 'korean')
- Returns: Array of token objects
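
For readers curious what regex-based tokenization looks like in general, the sketch below classifies text with a single pattern built from Unicode property escapes. It is an illustration only: `TOKEN_PATTERN` and `sketchTokenize` are hypothetical names, and the library's actual patterns may differ.

```javascript
// Hypothetical sketch; the library's real patterns may differ.
// Each named group corresponds to one of the token types listed earlier.
const TOKEN_PATTERN =
  /(?<NUMBER>\p{N}+)|(?<WORD>[\p{L}\p{M}]+)|(?<SPACE>\s+)|(?<PUNCTUATION>\p{P})|(?<SYMBOL>\p{S})|(?<OTHER>.)/gsu;

function sketchTokenize(text) {
  const tokens = [];
  for (const match of text.matchAll(TOKEN_PATTERN)) {
    // Exactly one named group is defined for any given match.
    const [type] = Object.entries(match.groups).find(
      ([, value]) => value !== undefined
    );
    tokens.push({ type, value: match[0] });
  }
  return tokens;
}

const sketchTokens = sketchTokenize("Hello, world! 42");
console.log(sketchTokens);
```

The final `OTHER` alternative matches any single code point, so every character of the input ends up in some token.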

`detectLanguage(text)`

Detects the dominant language in the text.

- text (string): The text to analyze
- Returns: Language name as a string ('english', 'thai', 'japanese', or 'korean')
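
One common way to detect a dominant language among these four is to count characters per Unicode script and pick the largest count. The sketch below shows that idea. It is not the library's algorithm; `SCRIPT_PATTERNS` and `sketchDetectLanguage` are hypothetical names, and falling back to 'english' is an assumption.

```javascript
// Hypothetical sketch of script-based detection; not the library's algorithm.
const SCRIPT_PATTERNS = {
  thai: /\p{Script=Thai}/gu,
  // Han characters are shared with Chinese, so this is only a rough signal.
  japanese: /[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]/gu,
  korean: /\p{Script=Hangul}/gu,
  english: /\p{Script=Latin}/gu,
};

function sketchDetectLanguage(text) {
  let best = "english"; // Assumed fallback when no script matches
  let bestCount = 0;
  for (const [language, pattern] of Object.entries(SCRIPT_PATTERNS)) {
    // Count the characters belonging to this language's script(s).
    const count = (text.match(pattern) || []).length;
    if (count > bestCount) {
      best = language;
      bestCount = count;
    }
  }
  return best;
}

console.log(sketchDetectLanguage("안녕하세요, 세계!"));
```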

`extractWords(tokens)`

Extracts only the word tokens from an array of tokens.

- tokens (array): Array of token objects
- Returns: Array of strings (word values)
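
Given the token structure, the described behavior amounts to a filter followed by a map. A minimal equivalent, assuming the `type` of word tokens is the string 'WORD' (`sketchExtractWords` is a hypothetical stand-in, not the library's source):

```javascript
// Hypothetical stand-in for the described behavior of extractWords.
function sketchExtractWords(tokens) {
  return tokens
    .filter((token) => token.type === "WORD") // keep word tokens only
    .map((token) => token.value); // return their text
}

const extractedWords = sketchExtractWords([
  { type: "WORD", value: "Hello" },
  { type: "PUNCTUATION", value: "," },
  { type: "WORD", value: "world" },
]);
console.log(extractedWords);
```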

`detokenize(tokens)`

Converts tokens back to text.

- tokens (array): Array of token objects
- Returns: String of reconstructed text
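
The simplest way to rebuild text from tokens is to concatenate their values, as sketched below. This is an assumption about the general approach, not a statement of how `detokenize` is implemented. Under this approach the original spacing can only be recovered if whitespace tokens were kept during tokenization.

```javascript
// Hypothetical sketch: rebuild text by concatenating token values.
function sketchDetokenize(tokens) {
  return tokens.map((token) => token.value).join("");
}

const keptSpaces = [
  { type: "WORD", value: "Hello" },
  { type: "PUNCTUATION", value: "," },
  { type: "SPACE", value: " " },
  { type: "WORD", value: "world" },
  { type: "PUNCTUATION", value: "!" },
];
// The same tokens with whitespace removed.
const droppedSpaces = keptSpaces.filter((token) => token.type !== "SPACE");

const roundTrip = sketchDetokenize(keptSpaces);
const lossy = sketchDetokenize(droppedSpaces);
console.log(roundTrip, "|", lossy);
```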

Important Notes

This library uses regex-based tokenization, which is a simplified approach. For production use in applications requiring high accuracy in specific languages:

- Thai: Consider using dictionary-based approaches (e.g., thai-tokenizer)
- Japanese: Consider using morphological analyzers (e.g., kuromoji)
- Korean: Consider using more sophisticated tokenizers (e.g., node-mecab-ya)

This library is intended for basic tokenization needs or cases where a lightweight solution is required.
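
The limitation is easiest to see with Thai, which is written without spaces between words. A script-based regex can find a run of Thai characters, but it has no way to locate word boundaries inside that run. The pattern below is an illustrative assumption, not the library's own.

```javascript
// "สวัสดีครับ" consists of two words: "สวัสดี" (hello) and "ครับ" (a polite particle).
const thaiGreeting = "สวัสดีครับ";

// Illustrative pattern: match consecutive characters from the Thai script.
const thaiRuns = thaiGreeting.match(/\p{Script=Thai}+/gu);

// The whole phrase comes back as a single run, not as two words.
console.log(thaiRuns.length);
```

Splitting such a run into words requires a dictionary or a statistical model, which is why the dedicated tools listed above are suggested for higher accuracy.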

License

MIT
