# gs-tokenizer
A powerful and lightweight multilingual tokenizer library providing natural language tokenization for English, Chinese, Japanese, and Korean.
## Features
- Language Support: English, Chinese, Japanese, Korean
- Intelligent Tokenization:
  - English: word boundary-based tokenization
  - CJK (Chinese, Japanese, Korean): natural word segmentation using the browser's `Intl.Segmenter` (see the sketch after this list)
  - Dates: special handling for date patterns
  - Punctuation: consecutive punctuation marks are merged into a single token
- Custom Dictionary: add custom words with a priority and a name
- Auto Language Detection: automatically detects the language of the input text
- Multiple Output Formats: get detailed token information or plain word lists
- Lightweight: minimal dependencies, designed for browser environments
- Quick Use API: convenient static methods for easy integration
- `tokenizeAll`: new core-module feature that returns all possible tokens at each position
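For CJK text, the segmentation primitive is the platform's `Intl.Segmenter`. As a rough illustration of what that browser API (not gs-tokenizer's internal code) produces for a Chinese sentence:

```js
// Illustration of the underlying browser API, not gs-tokenizer itself:
// Intl.Segmenter splits CJK text into natural words.
const segmenter = new Intl.Segmenter('zh', { granularity: 'word' });
for (const { segment, isWordLike } of segmenter.segment('我爱北京天安门')) {
  if (isWordLike) console.log(segment);
  // e.g. 我 / 爱 / 北京 / 天安门 (exact splits depend on the engine's ICU data)
}
```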
## Module Comparison
| Module | Stability | Speed | Tokenization Accuracy | New Features |
|--------|-----------|-------|-----------------------|--------------|
| old | ✅ More stable | ⚡️ Slower | ✅ More accurate | ❌ No new features |
| core | ⚠️ Less stable | ⚡️ Faster | ⚠️ May be less accurate | ✅ tokenizeAll, Stage-based architecture |
## Installation
```bash
yarn add gs-tokenizer
```

### Alternative Installation

```bash
npm install gs-tokenizer
```

## Usage
### Quick Use (Recommended)
The quick module provides convenient static methods for easy integration:
```js
import { tokenize, tokenizeText, addCustomDictionary } from 'gs-tokenizer';

// Direct tokenization without creating an instance
const text = 'Hello world! 我爱北京天安门。';
const tokens = tokenize(text);
const words = tokenizeText(text);
console.log(words);

// Add a custom dictionary
addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
```

### Advanced Usage
#### Load Custom Dictionary with Quick Module
```js
import { tokenize, addCustomDictionary } from 'gs-tokenizer';

// Load multiple custom dictionaries for different languages
addCustomDictionary(['人工智能', '机器学习'], 'tech', 10, 'zh');
addCustomDictionary(['Web3', 'Blockchain'], 'crypto', 10, 'en');
addCustomDictionary(['アーティフィシャル・インテリジェンス'], 'tech-ja', 10, 'ja');

// Tokenize with the custom dictionaries applied
const text = '人工智能和Web3是未来的重要技术。アーティフィシャル・インテリジェンスも重要です。';
const tokens = tokenize(text);
console.log(tokens.filter(token => token.src === 'tech'));
```

#### Without Built-in Lexicon
```js
import { MultilingualTokenizer } from 'gs-tokenizer';

// Create a tokenizer without the built-in lexicon
const tokenizer = new MultilingualTokenizer({
  customDictionaries: {
    'zh': [{ priority: 10, data: new Set(['自定义词']), name: 'custom', lang: 'zh' }]
  }
});

// Tokenize using only the custom dictionary
const text = '这是一个自定义词的示例。';
const tokens = tokenizer.tokenize(text, 'zh');
console.log(tokens);
```

#### Custom Dictionary
```js
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Add custom words with a name, priority, and language
tokenizer.addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
tokenizer.addCustomDictionary(['Python', 'JavaScript'], 'programming', 5, 'en');

const text = '我爱人工智能技术和Python编程';
const tokens = tokenizer.tokenize(text);
const words = tokenizer.tokenizeText(text);
console.log(words); // Should include '人工智能', 'Python'

// Remove a custom word
tokenizer.removeCustomWord('Python', 'en', 'programming');
```

#### Advanced Options
```js
import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// Tokenize text
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);

// Get all possible tokens at each position (core module only)
const allTokens = tokenizer.tokenizeAll(text);
```

#### Using Old Module
```js
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Tokenize text (old is more stable but slower)
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);
```

## API Reference
### MultilingualTokenizer
Main tokenizer class that handles multilingual text processing.
#### Constructor
```js
import { MultilingualTokenizer, TokenizerOptions } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer(options);
```

Options:

- `customDictionaries: Record<string, LexiconEntry[]>` - Custom dictionaries for each language
- `defaultLanguage: string` - Default language code (default: `'en'`)
#### Methods
| Method | Description |
|--------|-------------|
| `tokenize(text: string): Token[]` | Tokenizes the input text and returns detailed token information |
| `tokenizeAll(text: string): Token[]` | Returns all possible tokens at each position (core module only) |
| `tokenizeText(text: string): string[]` | Tokenizes the input text and returns only word tokens |
| `tokenizeTextAll(text: string): string[]` | Returns all possible word tokens at each position (core module only) |
| `addCustomDictionary(words: string[], name: string, priority?: number, language?: string): void` | Adds custom words to the tokenizer |
| `removeCustomWord(word: string, language?: string, lexiconName?: string): void` | Removes a custom word from the tokenizer |
| `addStage(stage: ITokenizerStage): void` | Adds a custom tokenization stage (core module only) |
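The `*All` variants are only exposed by the core module. A minimal sketch of the difference; the splits in the comments are illustrative, since actual output depends on the loaded lexicons:

```js
import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// Best single segmentation of the input
console.log(tokenizer.tokenizeText('北京大学'));
// e.g. ['北京大学'] (illustrative)

// Every candidate word at each position, including overlapping matches
console.log(tokenizer.tokenizeTextAll('北京大学'));
// e.g. ['北京', '北京大学', '大学'] (illustrative)
```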
### `createTokenizer(options?: TokenizerOptions): MultilingualTokenizer`
Factory function to create a new MultilingualTokenizer instance with optional configuration.
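For example (a minimal sketch; the option shown comes from the `TokenizerOptions` interface below):

```js
import { createTokenizer } from 'gs-tokenizer';

// Same as constructing MultilingualTokenizer directly, with options applied
const tokenizer = createTokenizer({ defaultLanguage: 'zh' });
const tokens = tokenizer.tokenize('我爱北京天安门');
```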
### Quick Use API
The quick module provides convenient static methods:
```ts
import { Token } from 'gs-tokenizer';

// Quick Use API type definition
type QuickUseAPI = {
  // Tokenize text
  tokenize: (text: string, language?: string) => Token[];
  // Tokenize to text only
  tokenizeText: (text: string, language?: string) => string[];
  // Add custom dictionary
  addCustomDictionary: (words: string[], name: string, priority?: number, language?: string) => void;
  // Remove custom word
  removeCustomWord: (word: string, language?: string, lexiconName?: string) => void;
  // Set default languages for lexicon loading
  setDefaultLanguages: (languages: string[]) => void;
  // Set default types for lexicon loading
  setDefaultTypes: (types: string[]) => void;
};

// Import the quick use API
import { tokenize, tokenizeText, addCustomDictionary, removeCustomWord, setDefaultLanguages, setDefaultTypes } from 'gs-tokenizer';
```
## Types

### Token Interface
```ts
interface Token {
  txt: string;   // Token text content
  type: 'word' | 'punctuation' | 'space' | 'other' | 'emoji' | 'date' | 'host' | 'ip' | 'number' | 'hashtag' | 'mention';
  lang?: string; // Language code
  src?: string;  // Source (e.g., custom dictionary name)
}
```

### ITokenizerStage Interface (core module only)
```ts
interface ITokenizerStage {
  order: number;
  priority: number;
  tokenize(text: string, start: number): IStageBestResult;
  all(text: string): IToken[];
}
```
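A custom stage plugged in via `addStage` might look roughly like the following. Note that `IStageBestResult` and `IToken` are not defined in this README, so the shapes used below (`{ token, end }` and a `Token`-like object) are assumptions for illustration only:

```js
// Hypothetical sketch only: the real IStageBestResult / IToken shapes
// are not documented here and may differ.
const hashtagStage = {
  order: 50,    // assumed: where this stage runs in the pipeline
  priority: 5,  // assumed: tie-breaking weight against other stages
  tokenize(text, start) {
    const match = /^#\w+/.exec(text.slice(start));
    // Assumed result shape: best token found at `start`, plus end offset
    return match
      ? { token: { txt: match[0], type: 'hashtag' }, end: start + match[0].length }
      : null;
  },
  all(text) {
    // Assumed: every candidate token this stage can produce
    return [...text.matchAll(/#\w+/g)].map(m => ({ txt: m[0], type: 'hashtag' }));
  },
};

// tokenizer.addStage(hashtagStage); // core module only
```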
### TokenizerOptions Interface

```ts
import { LexiconEntry } from 'gs-tokenizer';

interface TokenizerOptions {
customDictionaries?: Record<string, LexiconEntry[]>;
granularity?: 'word' | 'grapheme' | 'sentence';
defaultLanguage?: string;
}
```
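The `granularity` option mirrors `Intl.Segmenter`'s granularities. A minimal sketch; behavior beyond the default word granularity is assumed from the type above, not shown elsewhere in this README:

```js
import { MultilingualTokenizer } from 'gs-tokenizer';

// Assumption: 'sentence' granularity yields sentence-sized tokens,
// mirroring Intl.Segmenter's granularity option.
const sentenceTokenizer = new MultilingualTokenizer({ granularity: 'sentence' });
console.log(sentenceTokenizer.tokenizeText('今天天气很好。我们去公园吧。'));
```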
## Browser Compatibility

- Chrome/Edge: 87+
- Firefox: 86+
- Safari: 14.1+
Note: uses `Intl.Segmenter` for CJK languages, which requires modern browser support.
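If you need to guard against older runtimes, a simple feature check (this uses only the standard platform API, not gs-tokenizer):

```js
// Detect whether this runtime supports Intl.Segmenter at all
const canSegmentCJK =
  typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function';

if (!canSegmentCJK) {
  console.warn('Intl.Segmenter unavailable; CJK segmentation may degrade.');
}
```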
## Development

### Build

```bash
npm run build
```

### Run Tests
```bash
npm run test          # Run all tests
npm run test:base     # Run base tests
npm run test:english  # Run English-specific tests
npm run test:cjk      # Run CJK-specific tests
npm run test:mixed    # Run mixed language tests
```

## License
MIT
