nlpo3-newmm-typescript

v1.1.0

Published

17 days ago

Pure TypeScript/ES2016 implementation of nlpO3 NewMM Thai word tokenizer

korarit

thai nlp tokenizer word-segmentation newmm pythainlp nlpo3

nlpo3-newmm-typescript

Pure TypeScript / ES2016 implementation of the NewMM (New Maximum Matching) Thai word tokenizer.

Ported from the Rust implementation in nlpO3, a Thai natural language processing library by PyThaiNLP.

No native bindings, no build tools required — pure JavaScript that works in both ESM and CommonJS.

This project was vibe-coded using DeepSeek 4 Pro.

Features

NewMM algorithm: dictionary-based maximal matching with Thai Character Cluster (TCC) boundary constraints
TrieChar dictionary: character-level trie with O(k) prefix lookups, where k is the length of the prefix
BFS path resolution: shortest-path graph search over candidate split positions
Path explosion protection: a visited set and MAX_GRAPH_SIZE=50 prevent exponential blowup
Safe mode: sliding-window heuristic for long texts with many ambiguities
Default dictionary: bundled words_th.txt (~62k words, 1.6 MB) from nlpO3
Dual ESM/CJS: works with import and require()
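
To make the trie feature concrete, here is a minimal sketch of a character-level trie with prefix lookup. The names `CharTrie`, `TrieNode`, and `prefixes` are hypothetical and chosen for illustration; the package's own TrieChar class may be organised differently.

```typescript
// Illustrative character-level trie; NOT the package's actual TrieChar class.
class TrieNode {
  children = new Map<string, TrieNode>();
  isWord = false;
}

class CharTrie {
  private root = new TrieNode();

  constructor(words: string[] = []) {
    for (const w of words) this.add(w);
  }

  // Inserts a word one UTF-16 code unit at a time.
  add(word: string): void {
    if (word.length === 0) return;
    let node = this.root;
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i);
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isWord = true;
  }

  // Unmarks the word; empty branches are left in place for simplicity.
  remove(word: string): void {
    let node: TrieNode | undefined = this.root;
    for (let i = 0; i < word.length; i++) {
      node = node.children.get(word.charAt(i));
      if (!node) return; // word was never in the trie
    }
    node.isWord = false;
  }

  // Returns every dictionary word that starts at `start` in `text`.
  // The walk stops at the first character with no matching child,
  // so the cost is bounded by the longest matching prefix.
  prefixes(text: string, start = 0): string[] {
    const found: string[] = [];
    let node: TrieNode | undefined = this.root;
    for (let i = start; i < text.length; i++) {
      node = node.children.get(text.charAt(i));
      if (!node) break;
      if (node.isWord) found.push(text.slice(start, i + 1));
    }
    return found;
  }
}
```

A single walk from the root yields all candidate words at a position, which is what a maximal-matching tokenizer needs when it builds the set of possible splits.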

Installation

npm install nlpo3-newmm-typescript

Usage

TypeScript / ESM

import { NewmmTokenizer } from 'nlpo3-newmm-typescript';

// Default dictionary only
const tok = new NewmmTokenizer();
tok.segment('ภาษาไทยเป็นภาษาที่มีโครงสร้างซับซ้อน');
// ['ภาษา', 'ไทย', 'เป็น', 'ภาษา', 'ที่', 'มี', 'โครงสร้าง', 'ซับซ้อน']

CommonJS

const { NewmmTokenizer } = require('nlpo3-newmm-typescript');

const tok = new NewmmTokenizer();
const tokens = tok.segment('สวัสดีชาวโลก');

Default dictionary + custom words

const tok = new NewmmTokenizer(['คำศัพท์เฉพาะทาง', 'nlpo3']);
tok.segment('nlpo3เป็นคำศัพท์เฉพาะทาง');
// ['nlpo3', 'เป็น', 'คำศัพท์เฉพาะทาง']

Isolated word list (no defaults)

const tok = NewmmTokenizer.fromWordList(['สวัสดี', 'ชาว', 'โลก']);
tok.segment('สวัสดีชาวโลก');
// ['สวัสดี', 'ชาว', 'โลก']

Add / remove words dynamically

const tok = new NewmmTokenizer();
tok.addWord('นิวซีแลนด์');
tok.removeWord('ที่ไม่ต้องการ');
tok.segment('นิวซีแลนด์');

Safe mode (for long texts)

tok.segment(longText, true);  // second arg = safe mode

API

`new NewmmTokenizer(customWords?: string[])`

Create a tokenizer with the built-in ~62k word dictionary. Optionally merge custom words on top.

`NewmmTokenizer.fromWordList(words: string[])`

Create a tokenizer using only the given word list. No default dictionary.

`segment(text: string, safe?: boolean): string[]`

Tokenize text into words.

safe — enable safe mode (default false). Uses a sliding window to avoid long run times on highly ambiguous input. Recommended for texts longer than ~140 characters.
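
If callers do not want to decide about safe mode themselves, the length guidance above can be wrapped in a small helper. This is only a sketch: `segmentAuto`, `Segmenter`, and `SAFE_MODE_THRESHOLD` are hypothetical names, not part of the package, and the threshold of 140 is simply the figure quoted above.

```typescript
// Minimal shape of a tokenizer that this helper relies on.
// It matches the documented segment(text, safe?) signature.
interface Segmenter {
  segment(text: string, safe?: boolean): string[];
}

// Assumed threshold, taken from the "~140 characters" recommendation.
const SAFE_MODE_THRESHOLD = 140;

// Hypothetical helper: enables safe mode only for long inputs.
// Length is measured in UTF-16 code units, which equals the number of
// characters for Thai text in the Basic Multilingual Plane.
function segmentAuto(
  tok: Segmenter,
  text: string,
  threshold: number = SAFE_MODE_THRESHOLD,
): string[] {
  return tok.segment(text, text.length > threshold);
}
```

With the real tokenizer this would be called as `segmentAuto(new NewmmTokenizer(), text)`.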

`segmentWithOptions(text: string, safe: boolean, parallelChunkSize?: number): string[]`

Full-options entry point. parallelChunkSize is accepted for API parity with the Rust version but has no effect in this single-threaded implementation.

`addWord(...words: string[]): void`

Add one or more words to the dictionary.

`removeWord(...words: string[]): void`

Remove one or more words from the dictionary.

How it works

Input text
    ↓
TCC (Thai Character Cluster) — compute valid split positions
    ↓
Main loop (min-heap of candidate positions):
  ├─ dictionary prefix lookup at current position
  ├─ build graph: position → position + word_length
  ├─ when only 1 candidate → BFS shortest path → extract tokens
  └─ when 0 candidates → non-Thai pattern match / forward scan
    ↓
Word tokens
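
The path-resolution step in the diagram can be illustrated with a toy version. The sketch below is a simplification under stated assumptions: it ignores TCC constraints, has no fallback for unknown text, and uses a plain `Set` as the dictionary. `shortestSplit` is a hypothetical name, not a function exported by the package.

```typescript
// Toy path resolution: find the split of `text` into dictionary words that
// uses the fewest tokens, via breadth-first search over character positions.
function shortestSplit(text: string, words: Set<string>): string[] | null {
  const maxLen = Math.max(0, ...Array.from(words, (w) => w.length));
  const end = text.length;
  // prev[p] = position we came from when p was first reached (-1 = unreached).
  const prev: number[] = new Array<number>(end + 1).fill(-1);
  const visited = new Set<number>([0]);
  const queue: number[] = [0];

  for (let head = 0; head < queue.length; head++) {
    const pos = queue[head];
    if (pos === end) break;
    // Edges: pos -> pos + len for every dictionary word starting at pos.
    for (let len = 1; len <= maxLen && pos + len <= end; len++) {
      const next = pos + len;
      if (visited.has(next) || !words.has(text.slice(pos, next))) continue;
      visited.add(next); // each position is queued once, so work stays bounded
      prev[next] = pos;
      queue.push(next);
    }
  }

  if (!visited.has(end)) return null; // the dictionary cannot cover the text

  // Walk back from the end to recover the tokens in order.
  const tokens: string[] = [];
  for (let p = end; p > 0; p = prev[p]) {
    tokens.unshift(text.slice(prev[p], p));
  }
  return tokens;
}
```

Because every edge consumes exactly one word, the first time BFS reaches the end of the text it has found a split with the minimum number of tokens. For example, `shortestSplit('abcd', new Set(['a', 'ab', 'b', 'c', 'cd', 'd']))` returns `['ab', 'cd']` rather than a three- or four-token split.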

The algorithm is the same dictionary-based maximal matching used by PyThaiNLP's newmm tokenizer, with:

TCC rules from Theeramunkong et al. 2000
BFS path resolution with visited-set cycle prevention
Non-Thai text detection (English, numbers, whitespace)
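
Non-Thai detection of the kind listed above can be sketched with one anchored regular expression. The pattern below is an assumption made for illustration (Latin words, numbers with `,` or `.` separators, runs of spaces or tabs, a single line break); the rules actually used by the package may differ. `matchNonThai` is a hypothetical helper name.

```typescript
// Assumed pattern, for illustration only. The sticky flag (y) makes the
// regex match exactly at lastIndex instead of searching forward.
const NON_THAI = /[-a-zA-Z]+|\d+(?:[,.]\d+)*|[ \t]+|\r?\n/y;

// Returns the non-Thai token starting exactly at `pos`, or null if the
// text at `pos` is not covered by the pattern (for example, Thai script).
function matchNonThai(text: string, pos: number): string | null {
  NON_THAI.lastIndex = pos;
  const m = NON_THAI.exec(text);
  return m ? m[0] : null;
}
```

A tokenizer loop could try a check like this when the dictionary lookup yields no candidate at the current position, then advance by the length of the returned token.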

Benchmarks

Run with: npm run test:perf

Accuracy

| Dataset | Sentences | Text Match | Boundary F1 |
|---------|-----------|------------|-------------|
| LST20 | 300 | 99.3% | 88.6% |
| thai_wordseg_menu | 109 | 100.0% | 68.8% |

Results by difficulty level (the sentence counts sum to 109, the size of thai_wordseg_menu):

| Difficulty | Sentences | Text Match | Boundary F1 |
|------------|-----------|------------|-------------|
| easy | 20 | 100% | 35.0% |
| medium | 20 | 100% | 77.4% |
| hard | 20 | 100% | 81.2% |
| very_hard | 20 | 100% | 88.9% |
| noisy | 29 | 100% | 63.7% |
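
This README does not spell out how Boundary F1 is computed. For readers unfamiliar with the metric, the sketch below shows one common definition: compare the sets of internal word-boundary offsets in the reference and predicted segmentations. The function names are hypothetical, and the benchmark script may use a different formula.

```typescript
// Character offsets at which a token ends, excluding the end of the text.
function boundaries(tokens: string[]): Set<number> {
  const cuts = new Set<number>();
  let offset = 0;
  for (const tok of tokens.slice(0, -1)) {
    offset += tok.length;
    cuts.add(offset);
  }
  return cuts;
}

// Boundary-level F1 between a reference and a predicted segmentation
// of the same text.
function boundaryF1(reference: string[], predicted: string[]): number {
  const ref = boundaries(reference);
  const pred = boundaries(predicted);
  if (ref.size === 0 && pred.size === 0) return 1; // both are a single token
  let hits = 0;
  pred.forEach((cut) => {
    if (ref.has(cut)) hits++;
  });
  if (hits === 0) return 0;
  const precision = hits / pred.size;
  const recall = hits / ref.size;
  return (2 * precision * recall) / (precision + recall);
}
```

For example, with reference `['ab', 'cd', 'e']` and prediction `['ab', 'c', 'de']`, one of two predicted boundaries is correct and one of two reference boundaries is found, so precision, recall, and F1 are all 0.5.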

Speed (LST20)

| Throughput | Avg per sentence |
|------------|------------------|
| ~492 sent/s | ~2.0 ms |

Memory

| Metric | Value |
|--------|-------|
| Tokenizer idle overhead | ~22 MB |
| RSS after dict load | ~182 MB |
| Active segment (300 LST20) | negative heap churn |

Negative heap churn means that heap usage was lower after the run than before it: the garbage collector reclaimed at least as much memory as the segment calls allocated, so repeated segmentation causes no net heap growth.

Tests

npm test

Build

npm run build

Output:

dist/index.js — ESM bundle (target es2020)
dist/index.cjs — CommonJS bundle (target es2016)
dist/index.d.ts + dist/index.d.cts — TypeScript declarations
dist/words_th.txt — bundled dictionary

Project structure

src/
  index.ts          — Public exports
  newmm.ts          — NewmmTokenizer (main algorithm)
  trie_char.ts      — Character-based trie dictionary
  tcc_rules.ts      — TCC regex patterns (24 rules)
  tcc_tokenizer.ts  — TCC position computation
  default_dict.ts   — Default dictionary loader
  words_th.txt      — Bundled ~62k word dictionary
test/
  newmm.test.ts     — Test suite (15 tests)
scripts/
  fix-cjs.mjs       — Post-build CJS compat patching
tsup.config.ts      — Build config (dual ESM/CJS)

Credits

Algorithm: Korakot Chaovavanich, Jakkrit TeCho, Wittawat Jitkrittum, Thanathip Suntorntip
Rust implementation: nlpO3 by PyThaiNLP
TCC rules: Theeramunkong et al. 2000 — "Learning-based Thai Word Boundary"
Thai dictionary: PyThaiNLP project (words_th.txt)

License

Apache-2.0 (matching nlpO3)
