# gs-tokenizer
A powerful and lightweight multilingual tokenizer library providing natural language tokenization for English, Chinese, Japanese, and Korean.
## Features
- Language Support: English, Chinese, Japanese, Korean
- Intelligent Tokenization:
  - English: word boundary-based tokenization
  - CJK (Chinese, Japanese, Korean): natural word segmentation using the browser's `Intl.Segmenter` (see the sketch after this list)
  - Dates: special handling for date patterns
  - Punctuation: consecutive punctuation marks are merged into a single token
- Custom Dictionary: add custom words with a priority and a name
- Auto Language Detection: automatically detects the language of the input text
- Multiple Output Formats: get detailed token information or plain word lists
- Lightweight: minimal dependencies, designed for browser environments
- Quick Use API: convenient static methods for easy integration
- `tokenizeAll`: new core-module feature that returns all possible tokens at each position
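For CJK text, the segmentation primitive is the platform's `Intl.Segmenter`. As a rough illustration of what that browser API (not gs-tokenizer's internal code) produces for a Chinese sentence:

```js
// Illustration of the underlying browser API, not gs-tokenizer itself:
// Intl.Segmenter splits CJK text into natural words.
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
for (const { segment, isWordLike } of segmenter.segment('我爱北京天安门')) {
  if (isWordLike) console.log(segment);
  // e.g. 我 / 爱 / 北京 / 天安门 (exact splits depend on the engine's ICU data)
}
```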
## Module Comparison
| Module | Stability | Speed | Tokenization Accuracy | New Features |
|--------|-----------|-------|-----------------------|--------------|
| old | ✅ More stable | ⚡️ Slower | ✅ More accurate | ❌ No new features |
| core | ⚠️ Less stable | ⚡️ Faster | ⚠️ May be less accurate | ✅ tokenizeAll, Stage-based architecture |
## Installation
```bash
yarn add gs-tokenizer
```

### Alternative Installation

```bash
npm install gs-tokenizer
```

## Usage
### Quick Use (Recommended)
The quick module provides convenient static methods for easy integration:
```js
import { tokenize, tokenizeText, addCustomDictionary } from 'gs-tokenizer';

// Direct tokenization without creating an instance
const text = 'Hello world! 我爱北京天安门。';
const tokens = tokenize(text);
const words = tokenizeText(text);
console.log(words);

// Add a custom dictionary
addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
```

### Advanced Usage
#### Load Custom Dictionary with Quick Module
```js
import { tokenize, addCustomDictionary } from 'gs-tokenizer';

// Load multiple custom dictionaries for different languages
addCustomDictionary(['人工智能', '机器学习'], 'tech', 10, 'zh');
addCustomDictionary(['Web3', 'Blockchain'], 'crypto', 10, 'en');
addCustomDictionary(['アーティフィシャル・インテリジェンス'], 'tech-ja', 10, 'ja');

// Tokenize with the custom dictionaries applied
const text = '人工智能和Web3是未来的重要技术。アーティフィシャル・インテリジェンスも重要です。';
const tokens = tokenize(text);
console.log(tokens.filter(token => token.src === 'tech'));
```

#### Without Built-in Lexicon
```js
import { MultilingualTokenizer } from 'gs-tokenizer';

// Create a tokenizer without the built-in lexicon
const tokenizer = new MultilingualTokenizer({
  customDictionaries: {
    'zh': [{ priority: 10, data: new Set(['自定义词']), name: 'custom', lang: 'zh' }]
  }
});

// Tokenize using only the custom dictionary
const text = '这是一个自定义词的示例。';
const tokens = tokenizer.tokenize(text, 'zh');
console.log(tokens);
```

#### Custom Dictionary
```js
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Add custom words with a name, priority, and language
tokenizer.addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
tokenizer.addCustomDictionary(['Python', 'JavaScript'], 'programming', 5, 'en');

const text = '我爱人工智能技术和Python编程';
const tokens = tokenizer.tokenize(text);
const words = tokenizer.tokenizeText(text);
console.log(words); // Should include '人工智能', 'Python'

// Remove a custom word
tokenizer.removeCustomWord('Python', 'en', 'programming');
```

#### Advanced Options
```js
import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// Tokenize text
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);

// Get all possible tokens at each position (core module only)
const allTokens = tokenizer.tokenizeAll(text);
```

#### Using Old Module
```js
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Tokenize text (old is more stable but slower)
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);
```

## API Reference
### MultilingualTokenizer
Main tokenizer class that handles multilingual text processing.
#### Constructor
```js
import { MultilingualTokenizer, TokenizerOptions } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer(options);
```

Options:

- `customDictionaries: Record<string, LexiconEntry[]>` - Custom dictionaries for each language
- `defaultLanguage: string` - Default language code (default: `'en'`)
#### Methods
| Method | Description |
|--------|-------------|
| `tokenize(text: string): Token[]` | Tokenizes the input text and returns detailed token information |
| `tokenizeAll(text: string): Token[]` | Returns all possible tokens at each position (core module only) |
| `tokenizeText(text: string): string[]` | Tokenizes the input text and returns only word tokens |
| `tokenizeTextAll(text: string): string[]` | Returns all possible word tokens at each position (core module only) |
| `addCustomDictionary(words: string[], name: string, priority?: number, language?: string): void` | Adds custom words to the tokenizer |
| `removeCustomWord(word: string, language?: string, lexiconName?: string): void` | Removes a custom word from the tokenizer |
| `addStage(stage: ITokenizerStage): void` | Adds a custom tokenization stage (core module only) |
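The `*All` variants are only exposed by the core module. A minimal sketch of the difference; the splits in the comments are illustrative, since actual output depends on the loaded lexicons:

```js
import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// Best single segmentation of the input
console.log(tokenizer.tokenizeText('北京大学'));
// e.g. ['北京大学'] (illustrative)

// Every candidate word at each position, including overlapping matches
console.log(tokenizer.tokenizeTextAll('北京大学'));
// e.g. ['北京', '北京大学', '大学'] (illustrative)
```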
### `createTokenizer(options?: TokenizerOptions): MultilingualTokenizer`
Factory function to create a new MultilingualTokenizer instance with optional configuration.
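For example (a minimal sketch; the option shown comes from the `TokenizerOptions` interface below):

```js
import { createTokenizer } from 'gs-tokenizer';

// Same as constructing MultilingualTokenizer directly, with options applied
const tokenizer = createTokenizer({ defaultLanguage: 'zh' });
const tokens = tokenizer.tokenize('我爱北京天安门');
```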
### Quick Use API
The quick module provides convenient static methods:
```ts
import { Token } from 'gs-tokenizer';

// Quick Use API type definition
type QuickUseAPI = {
  // Tokenize text
  tokenize: (text: string, language?: string) => Token[];
  // Tokenize to text only
  tokenizeText: (text: string, language?: string) => string[];
  // Add custom dictionary
  addCustomDictionary: (words: string[], name: string, priority?: number, language?: string) => void;
  // Remove custom word
  removeCustomWord: (word: string, language?: string, lexiconName?: string) => void;
  // Set default languages for lexicon loading
  setDefaultLanguages: (languages: string[]) => void;
  // Set default types for lexicon loading
  setDefaultTypes: (types: string[]) => void;
};

// Import the quick use API
import { tokenize, tokenizeText, addCustomDictionary, removeCustomWord, setDefaultLanguages, setDefaultTypes } from 'gs-tokenizer';
```
## Types

### Token Interface
```ts
interface Token {
  txt: string;   // Token text content
  type: 'word' | 'punctuation' | 'space' | 'other' | 'emoji' | 'date' | 'host' | 'ip' | 'number' | 'hashtag' | 'mention';
  lang?: string; // Language code
  src?: string;  // Source (e.g., custom dictionary name)
}
```

### ITokenizerStage Interface (core module only)
```ts
interface ITokenizerStage {
  order: number;
  priority: number;
  tokenize(text: string, start: number): IStageBestResult;
  all(text: string): IToken[];
}
```
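A custom stage plugged in via `addStage` might look roughly like the following. Note that `IStageBestResult` and `IToken` are not defined in this README, so the shapes used below (`{ token, end }` and a `Token`-like object) are assumptions for illustration only:

```js
// Hypothetical sketch only: the real IStageBestResult / IToken shapes
// are not documented here and may differ.
const hashtagStage = {
  order: 50,    // assumed: where this stage runs in the pipeline
  priority: 5,  // assumed: tie-breaking weight against other stages
  tokenize(text, start) {
    const match = /^#\w+/.exec(text.slice(start));
    // Assumed result shape: best token found at `start`, plus end offset
    return match
      ? { token: { txt: match[0], type: 'hashtag' }, end: start + match[0].length }
      : null;
  },
  all(text) {
    // Assumed: every candidate token this stage can produce
    return [...text.matchAll(/#\w+/g)].map(m => ({ txt: m[0], type: 'hashtag' }));
  },
};

// tokenizer.addStage(hashtagStage); // core module only
```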
### TokenizerOptions Interface

```ts
import { LexiconEntry } from 'gs-tokenizer';

interface TokenizerOptions {
customDictionaries?: Record<string, LexiconEntry[]>;
granularity?: 'word' | 'grapheme' | 'sentence';
defaultLanguage?: string;
}
```
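The `granularity` option mirrors `Intl.Segmenter`'s granularities. A minimal sketch; behavior beyond the default word granularity is assumed from the type above, not shown elsewhere in this README:

```js
import { MultilingualTokenizer } from 'gs-tokenizer';

// Assumption: 'sentence' granularity yields sentence-sized tokens,
// mirroring Intl.Segmenter's granularity option.
const sentenceTokenizer = new MultilingualTokenizer({ granularity: 'sentence' });
console.log(sentenceTokenizer.tokenizeText('今天天气很好。我们去公园吧。'));
```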
## Browser Compatibility

- Chrome/Edge: 87+
- Firefox: 86+
- Safari: 14.1+
Note: uses `Intl.Segmenter` for CJK languages, which requires modern browser support.
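If you need to guard against older runtimes, a simple feature check (this uses only the standard platform API, not gs-tokenizer):

```js
// Detect whether this runtime supports Intl.Segmenter at all
const canSegmentCJK =
  typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function';

if (!canSegmentCJK) {
  console.warn('Intl.Segmenter unavailable; CJK segmentation may degrade.');
}
```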
## Development

### Build

```bash
npm run build
```

### Run Tests
```bash
npm run test          # Run all tests
npm run test:base     # Run base tests
npm run test:english  # Run English-specific tests
npm run test:cjk      # Run CJK-specific tests
npm run test:mixed    # Run mixed language tests
```

## License
MIT
