gs-tokenizer

A powerful and lightweight multilingual tokenizer library that provides natural language processing capabilities for multiple languages including English, Chinese, Japanese, and Korean.

Documentation

Features

  • Language Support: English, Chinese, Japanese, Korean
  • Intelligent Tokenization:
    • English: Word boundary-based tokenization
    • CJK (Chinese, Japanese, Korean): Natural word segmentation using browser's Intl.Segmenter
    • Date: Special handling for date patterns
    • Punctuation: Consecutive punctuation marks are merged into a single token
  • Custom Dictionary: Support for adding custom words with priority and name
  • Auto Language Detection: Automatically detects the language of input text
  • Multiple Output Formats: Get detailed token information or just word lists
  • Lightweight: Minimal dependencies, designed for browser environments
  • Quick Use API: Convenient static methods for easy integration
  • tokenizeAll: New feature in the core module that returns all possible tokens at each position

Module Comparison

| Module | Stability | Speed | Tokenization Accuracy | New Features |
|--------|-----------|-------|-----------------------|--------------|
| old | ✅ More stable | ⚡️ Slower | ✅ More accurate | ❌ No new features |
| core | ⚠️ Less stable | ⚡️ Faster | ⚠️ May be less accurate | ✅ tokenizeAll, Stage-based architecture |
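
For context, a minimal sketch of how the two modules might be used side by side, based on the import paths shown in the examples below (the core tokenizer from the package root, the old tokenizer from the /old subpath):

// Core module: faster, adds tokenizeAll and a stage-based architecture.
import { MultilingualTokenizer } from 'gs-tokenizer';
// Old module: more stable and more accurate, but slower.
import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const coreTokenizer = new MultilingualTokenizer();
const oldTokenizer = new OldMultilingualTokenizer();

// Same input through both engines; pick the trade-off that fits your use case.
const text = '我爱北京天安门';
console.log(coreTokenizer.tokenizeText(text));
console.log(oldTokenizer.tokenizeText(text));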

Installation

yarn add gs-tokenizer

Alternative Installation

npm install gs-tokenizer

Usage

Quick Use (Recommended)

The quick module provides convenient static methods for easy integration:

import { tokenize, tokenizeText, addCustomDictionary } from 'gs-tokenizer';

// Direct tokenization without creating an instance
const text = 'Hello world! 我爱北京天安门。';
const tokens = tokenize(text);
const words = tokenizeText(text);
console.log(words);

// Add custom dictionary
addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');

Advanced Usage

Load Custom Dictionary with Quick Module

import { tokenize, addCustomDictionary } from 'gs-tokenizer';

// Load multiple custom dictionaries for different languages
addCustomDictionary(['人工智能', '机器学习'], 'tech', 10, 'zh');
addCustomDictionary(['Web3', 'Blockchain'], 'crypto', 10, 'en');
addCustomDictionary(['アーティフィシャル・インテリジェンス'], 'tech-ja', 10, 'ja');

// Tokenize with custom dictionaries applied
const text = '人工智能和Web3是未来的重要技术。アーティフィシャル・インテリジェンスも重要です。';
const tokens = tokenize(text);
console.log(tokens.filter(token => token.src === 'tech'));

Without Built-in Lexicon

import { MultilingualTokenizer } from 'gs-tokenizer';

// Create tokenizer without using built-in lexicon
const tokenizer = new MultilingualTokenizer({
  customDictionaries: {
    'zh': [{ priority: 10, data: new Set(['自定义词']), name: 'custom', lang: 'zh' }]
  }
});

// Tokenize using only custom dictionary
const text = '这是一个自定义词的示例。';
const tokens = tokenizer.tokenize(text, 'zh');
console.log(tokens);

Custom Dictionary

import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Add custom words with name, priority, and language
tokenizer.addCustomDictionary(['人工智能', '技术'], 'tech', 10, 'zh');
tokenizer.addCustomDictionary(['Python', 'JavaScript'], 'programming', 5, 'en');

const text = '我爱人工智能技术和Python编程';
const tokens = tokenizer.tokenize(text);
const words = tokenizer.tokenizeText(text);
console.log(words); // Should include '人工智能', 'Python'

// Remove custom word
tokenizer.removeCustomWord('Python', 'en', 'programming');

Advanced Options

import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();

// Tokenize text
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);

// Get all possible tokens (core module only)
const allTokens = tokenizer.tokenizeAll(text);

Using Old Module

import { OldMultilingualTokenizer } from 'gs-tokenizer/old';

const tokenizer = new OldMultilingualTokenizer();

// Tokenize text (old is more stable but slower)
const text = '我爱北京天安门';
const tokens = tokenizer.tokenize(text);

API Reference

MultilingualTokenizer

Main tokenizer class that handles multilingual text processing.

Constructor

import { MultilingualTokenizer, TokenizerOptions } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer(options)

Options:

  • customDictionaries: Record<string, LexiconEntry[]> - Custom dictionaries for each language
  • defaultLanguage: string - Default language code (default: 'en')

Methods

| Method | Description |
|--------|-------------|
| tokenize(text: string): Token[] | Tokenizes the input text and returns detailed token information |
| tokenizeAll(text: string): Token[] | Returns all possible tokens at each position (core module only) |
| tokenizeText(text: string): string[] | Tokenizes the input text and returns only word tokens |
| tokenizeTextAll(text: string): string[] | Returns all possible word tokens at each position (core module only) |
| addCustomDictionary(words: string[], name: string, priority?: number, language?: string): void | Adds custom words to the tokenizer |
| removeCustomWord(word: string, language?: string, lexiconName?: string): void | Removes a custom word from the tokenizer |
| addStage(stage: ITokenizerStage): void | Adds a custom tokenization stage (core module only) |
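
A small sketch combining several of these methods; the exact output depends on the loaded lexicons:

import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer();
tokenizer.addCustomDictionary(['深度学习'], 'tech', 10, 'zh');

const text = '深度学习很有趣';
const tokens = tokenizer.tokenize(text);      // detailed Token objects
const words = tokenizer.tokenizeText(text);   // plain word list

// Core module only: every candidate word at each position.
const allWords = tokenizer.tokenizeTextAll(text);

// Remove the custom entry when it is no longer needed.
tokenizer.removeCustomWord('深度学习', 'zh', 'tech');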

createTokenizer(options?: TokenizerOptions): MultilingualTokenizer

Factory function to create a new MultilingualTokenizer instance with optional configuration.
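
A minimal sketch of the factory, assuming it is exported from the package root and accepts the same options as the constructor above:

import { createTokenizer } from 'gs-tokenizer';

const tokenizer = createTokenizer({ defaultLanguage: 'zh' });
const tokens = tokenizer.tokenize('我爱北京天安门');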

Quick Use API

The quick module provides convenient static methods:

import { Token } from 'gs-tokenizer';

// Quick Use API type definition
type QuickUseAPI = {
  // Tokenize text
  tokenize: (text: string, language?: string) => Token[];
  // Tokenize to text only
  tokenizeText: (text: string, language?: string) => string[];
  // Add custom dictionary
  addCustomDictionary: (words: string[], name: string, priority?: number, language?: string) => void;
  // Remove custom word
  removeCustomWord: (word: string, language?: string, lexiconName?: string) => void;
  // Set default languages for lexicon loading
  setDefaultLanguages: (languages: string[]) => void;
  // Set default types for lexicon loading
  setDefaultTypes: (types: string[]) => void;
};

// Import quick use API
import { tokenize, tokenizeText, addCustomDictionary, removeCustomWord, setDefaultLanguages, setDefaultTypes } from 'gs-tokenizer';
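
A hedged sketch of setting the lexicon defaults before tokenizing; the accepted type names are not documented here, so 'word' below is a hypothetical placeholder:

import { setDefaultLanguages, setDefaultTypes, tokenize } from 'gs-tokenizer';

// Restrict which built-in lexicons are loaded (assumed behavior per the descriptions above).
setDefaultLanguages(['zh', 'en']);
setDefaultTypes(['word']); // hypothetical type name, for illustration only

const tokens = tokenize('Hello 世界');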

Types

Token Interface

interface Token {
  txt: string;              // Token text content
  type: 'word' | 'punctuation' | 'space' | 'other' | 'emoji' | 'date' | 'host' | 'ip' | 'number' | 'hashtag' | 'mention';
  lang?: string;            // Language code
  src?: string;             // Source (e.g., custom dictionary name)
}
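
For illustration, tokens can be filtered or grouped by these fields; a small sketch (the exact tokens depend on language detection and the loaded lexicons):

import { tokenize } from 'gs-tokenizer';

const tokens = tokenize('Hello, 世界!');

// Keep only word tokens, dropping punctuation and spaces.
const words = tokens.filter(t => t.type === 'word').map(t => t.txt);

// Group token text by detected language code.
const byLang = {};
for (const t of tokens) {
  const lang = t.lang ?? 'unknown';
  (byLang[lang] = byLang[lang] || []).push(t.txt);
}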

ITokenizerStage Interface (core module only)

interface ITokenizerStage {
  order: number;
  priority: number;
  tokenize(text: string, start: number): IStageBestResult;
  all(text: string): IToken[];
}

TokenizerOptions Interface

import { LexiconEntry } from 'gs-tokenizer';

interface TokenizerOptions {
  customDictionaries?: Record<string, LexiconEntry[]>;
  granularity?: 'word' | 'grapheme' | 'sentence';
  defaultLanguage?: string;
}
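
A sketch of passing these options to the constructor; the custom dictionary entry follows the LexiconEntry shape shown in the "Without Built-in Lexicon" example above, and the granularity values mirror Intl.Segmenter's:

import { MultilingualTokenizer } from 'gs-tokenizer';

const tokenizer = new MultilingualTokenizer({
  defaultLanguage: 'zh',
  granularity: 'word',
  customDictionaries: {
    zh: [{ priority: 10, data: new Set(['自定义词']), name: 'custom', lang: 'zh' }]
  }
});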

Browser Compatibility

  • Chrome/Edge: 87+
  • Firefox: 86+
  • Safari: 14.1+

Note: Uses Intl.Segmenter for CJK languages, which requires modern browser support.
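
If you need to guard against older environments, a simple feature check before relying on CJK segmentation could look like this (a sketch; any built-in fallback behavior of the library is not documented here):

const hasSegmenter = typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function';

if (!hasSegmenter) {
  // Fall back to English-only tokenization or load an Intl.Segmenter polyfill here.
  console.warn('Intl.Segmenter is unavailable; CJK segmentation may not work as expected.');
}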

Development

Build

npm run build

Run Tests

npm run test          # Run all tests
npm run test:base     # Run base tests
npm run test:english  # Run English-specific tests
npm run test:cjk      # Run CJK-specific tests
npm run test:mixed    # Run mixed language tests

License

MIT

GitHub Repository