multilingual-tokenizer
v1.0.0
Multilingual Tokenizer
A Node.js library for tokenizing text in multiple languages (Thai, English, Japanese, and Korean) using regex-based approaches.
Features
- Support for tokenizing text in:
  - English
  - Thai
  - Japanese
  - Korean
- Automatic language detection
- Token classification (word, number, punctuation, etc.)
- Normalization options
- Whitespace preservation options
- Simple, lightweight implementation using regular expressions
Installation
npm install multilingual-tokenizer
Usage
const {
  MultilingualTokenizer,
  TOKEN_TYPES,
} = require("multilingual-tokenizer");
// Create a new tokenizer instance
const tokenizer = new MultilingualTokenizer({
  preserveWhitespace: true, // Keep whitespace tokens
  normalizeText: true, // Apply Unicode normalization
});
// Tokenize English text
const englishText = "Hello, world!";
const englishTokens = tokenizer.tokenize(englishText);
console.log(englishTokens);
// Tokenize Thai text
const thaiText = "สวัสดีครับ";
const thaiTokens = tokenizer.tokenize(thaiText);
console.log(thaiTokens);
// Tokenize Japanese text
const japaneseText = "こんにちは、世界!";
const japaneseTokens = tokenizer.tokenize(japaneseText);
console.log(japaneseTokens);
// Tokenize Korean text
const koreanText = "안녕하세요, 세계!";
const koreanTokens = tokenizer.tokenize(koreanText);
console.log(koreanTokens);
// Force language selection
const forcedTokens = tokenizer.tokenize(englishText, "japanese");
// Extract only word tokens
const words = tokenizer.extractWords(englishTokens);
Token Structure
Each token is represented as an object with two properties:
{
  type: 'WORD', // One of the values from TOKEN_TYPES
  value: 'Hello' // The actual token text
}
The available token types are:
- WORD: Words and word-like constructs
- NUMBER: Numeric values
- SPACE: Whitespace (spaces, tabs, newlines)
- PUNCTUATION: Punctuation marks
- SYMBOL: Symbols (#, $, %, etc.)
- OTHER: Unclassified characters
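To illustrate how a regex-based classifier can produce tokens of this shape, here is a minimal standalone sketch. It is not the library's implementation: the `classify` function, the `TOKEN_PATTERN` regex, and the locally defined `TOKEN_TYPES` object are assumptions made for this example. It relies on Unicode general categories, which draw the punctuation/symbol line differently from the list above (for instance, `#` and `%` are punctuation in Unicode terms), so the library's own results may differ.

```javascript
// Hypothetical sketch of regex-based token classification.
// Names and patterns here are illustrative, not taken from the library.
const TOKEN_TYPES = {
  WORD: "WORD",
  NUMBER: "NUMBER",
  SPACE: "SPACE",
  PUNCTUATION: "PUNCTUATION",
  SYMBOL: "SYMBOL",
  OTHER: "OTHER",
};

// Ordered alternatives: the named group that matched determines the type.
const TOKEN_PATTERN =
  /(?<NUMBER>\p{N}+)|(?<WORD>[\p{L}\p{M}]+)|(?<SPACE>\s+)|(?<PUNCTUATION>\p{P})|(?<SYMBOL>\p{S})|(?<OTHER>[\s\S])/gu;

function classify(text) {
  const tokens = [];
  for (const match of text.matchAll(TOKEN_PATTERN)) {
    // Exactly one named group is defined for each match.
    const type = Object.keys(match.groups).find(
      (name) => match.groups[name] !== undefined
    );
    tokens.push({ type: TOKEN_TYPES[type], value: match[0] });
  }
  return tokens;
}

console.log(classify("Hello, world!"));
```

Putting the catch-all `OTHER` group last ensures every character ends up in some token, so no input is silently dropped.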
API Reference
Constructor
const tokenizer = new MultilingualTokenizer(options);
Options:
- preserveWhitespace (default: false): Whether to include whitespace tokens in the output
- normalizeText (default: true): Whether to apply Unicode normalization before tokenization
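As a small illustration of why Unicode normalization matters before tokenization, the snippet below uses JavaScript's built-in `String.prototype.normalize`. The README does not say which normalization form the library applies; NFC is assumed here purely for the example.

```javascript
// Two visually identical strings can differ at the code-point level.
const decomposed = "cafe\u0301"; // "e" followed by U+0301 COMBINING ACUTE ACCENT
const precomposed = "caf\u00e9"; // single code point U+00E9

// Assumption: NFC is used for illustration; the library's form is not documented.
const normalized = decomposed.normalize("NFC");

console.log(decomposed === precomposed); // false
console.log(normalized === precomposed); // true
console.log(decomposed.length, normalized.length); // 5 4
```

Without normalization, a pattern that expects the precomposed character could split or misclassify the decomposed sequence.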
Methods
tokenize(text, language = null)
Tokenizes the input text. If language is not provided, it will be automatically detected.
- text (string): The text to tokenize
- language (string, optional): Force a specific language tokenizer ('english', 'thai', 'japanese', 'korean')
- Returns: Array of token objects
detectLanguage(text)
Detects the dominant language in the text.
- text (string): The text to analyze
- Returns: String naming the detected language ('english', 'thai', 'japanese', 'korean')
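One plausible way to find the dominant language is to count characters per Unicode script and pick the largest count. The sketch below is an assumption about how such detection could work, not the library's actual logic. `detectDominantLanguage` and `SCRIPT_PATTERNS` are hypothetical names, and the fallback to 'english' when nothing matches is also an assumption.

```javascript
// Hypothetical script-counting detector; not the library's implementation.
const SCRIPT_PATTERNS = {
  thai: /\p{Script=Thai}/gu,
  // Han characters are attributed to Japanese here, so Chinese text
  // would also be reported as 'japanese' by this sketch.
  japanese: /[\p{Script=Hiragana}\p{Script=Katakana}\p{Script=Han}]/gu,
  korean: /\p{Script=Hangul}/gu,
  english: /\p{Script=Latin}/gu,
};

function detectDominantLanguage(text) {
  let best = "english"; // assumed fallback when no script matches
  let bestCount = 0;
  for (const [language, pattern] of Object.entries(SCRIPT_PATTERNS)) {
    const count = (text.match(pattern) || []).length;
    if (count > bestCount) {
      best = language;
      bestCount = count;
    }
  }
  return best;
}

console.log(detectDominantLanguage("こんにちは、世界!"));
```

Counting scripts is cheap, but it cannot tell apart languages that share a script, which is one reason the language can also be forced explicitly.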
extractWords(tokens)
Extracts only the word tokens from an array of tokens.
- tokens (array): Array of token objects
- Returns: Array of strings (word values)
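Given the token structure described earlier, the documented behavior amounts to a filter followed by a map. The standalone function below, `extractWordValues`, is a hypothetical equivalent written for illustration; it operates on hand-written token objects rather than calling the library.

```javascript
// Hypothetical equivalent of the documented behavior, for illustration only.
function extractWordValues(tokens) {
  return tokens
    .filter((token) => token.type === "WORD")
    .map((token) => token.value);
}

// Hand-written tokens in the { type, value } shape described above.
const sampleTokens = [
  { type: "WORD", value: "Hello" },
  { type: "PUNCTUATION", value: "," },
  { type: "SPACE", value: " " },
  { type: "WORD", value: "world" },
  { type: "PUNCTUATION", value: "!" },
];

console.log(extractWordValues(sampleTokens)); // [ 'Hello', 'world' ]
```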
detokenize(tokens)
Converts tokens back to text.
- tokens (array): Array of token objects
- Returns: String of reconstructed text
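Whether the original text can be reconstructed exactly depends on whether whitespace tokens were kept. The sketch below assumes the simplest possible reconstruction, plain concatenation of token values; the library's `detokenize` may behave differently (for example, by re-inserting spaces), and `joinTokenValues` is a hypothetical helper.

```javascript
// Assumption: reconstruction by plain concatenation of token values.
function joinTokenValues(tokens) {
  return tokens.map((token) => token.value).join("");
}

const tokensWithSpaces = [
  { type: "WORD", value: "Hello" },
  { type: "PUNCTUATION", value: "," },
  { type: "SPACE", value: " " },
  { type: "WORD", value: "world" },
  { type: "PUNCTUATION", value: "!" },
];

// Simulates output produced without whitespace tokens.
const tokensWithoutSpaces = tokensWithSpaces.filter(
  (token) => token.type !== "SPACE"
);

console.log(joinTokenValues(tokensWithSpaces)); // Hello, world!
console.log(joinTokenValues(tokensWithoutSpaces)); // Hello,world!
```

If a lossless round trip matters, creating the tokenizer with `preserveWhitespace: true` is the safer choice.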
Important Notes
This library uses regex-based tokenization, which is a simplified approach. For production use in applications requiring high accuracy in specific languages:
- Thai: Consider using dictionary-based approaches (e.g., thai-tokenizer)
- Japanese: Consider using morphological analyzers (e.g., kuromoji)
- Korean: Consider using more sophisticated tokenizers (e.g., node-mecab-ya)
This library is intended for basic tokenization needs or cases where a lightweight solution is required.
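The limitation is easiest to see with Thai, which does not separate words with spaces. The sketch below uses a generic script-run pattern (an assumption for illustration, not the library's own regex) and shows that it returns a two-word greeting as a single run.

```javascript
// Illustrative pattern: one or more consecutive Thai-script characters.
const THAI_RUN = /\p{Script=Thai}+/gu;

function thaiRuns(text) {
  return text.match(THAI_RUN) || [];
}

// "สวัสดี" (hello) + "ครับ" (polite particle) are two words,
// but there is no space or other boundary for a regex to split on.
const runs = thaiRuns("สวัสดีครับ");
console.log(runs.length); // 1
```

Finding the boundary between the two words requires a dictionary or a statistical model, which is what the alternatives listed above provide.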
License
MIT
