nlpo3-newmm-typescript

v1.1.0

Pure TypeScript/ES2016 implementation of nlpO3 NewMM Thai word tokenizer

Pure TypeScript / ES2016 implementation of the NewMM (New Maximum Matching) Thai word tokenizer.

Ported from the Rust version of nlpO3, a Thai natural language processing library by PyThaiNLP.

No native bindings, no build tools required — pure JavaScript that works in both ESM and CommonJS.

This project was vibe-coded using deepseek 4 Pro.

Features

  • NewMM algorithm — dictionary-based maximal matching with Thai Character Cluster (TCC) boundary constraints
  • TrieChar dictionary — character-level trie for O(k) prefix lookups
  • BFS path resolution — shortest-path graph search over candidate split positions
  • Path explosion protection — visited set + MAX_GRAPH_SIZE=50 prevent exponential blowup
  • Safe mode — sliding-window heuristic for long texts with many ambiguities
  • Default dictionary — bundled words_th.txt (~62k words, 1.6 MB) from nlpO3
  • Dual ESM/CJS — works with import and require()
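To make the O(k) prefix lookup concrete, here is a minimal sketch of a character-level trie. The names `TrieNode`, `CharTrie`, `add`, and `prefixes` are hypothetical and chosen for illustration; the package's own TrieChar may be structured differently.

```typescript
// Hypothetical character-level trie (illustration only, not the package's TrieChar).
class TrieNode {
  children = new Map<string, TrieNode>();
  isWord = false;
}

class CharTrie {
  private root = new TrieNode();

  // Insert a word one character at a time, creating nodes as needed.
  add(word: string): void {
    let node = this.root;
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i);
      let next = node.children.get(ch);
      if (!next) {
        next = new TrieNode();
        node.children.set(ch, next);
      }
      node = next;
    }
    node.isWord = true;
  }

  // Return every dictionary word that begins at `start` in `text`.
  // The walk stops at the first character with no matching child,
  // so the cost is bounded by the longest candidate, not the dictionary size.
  prefixes(text: string, start: number): string[] {
    const found: string[] = [];
    let node = this.root;
    for (let i = start; i < text.length; i++) {
      const next = node.children.get(text.charAt(i));
      if (!next) break;
      node = next;
      if (node.isWord) found.push(text.slice(start, i + 1));
    }
    return found;
  }
}

const trie = new CharTrie();
for (const w of ['ภาษา', 'ภาษาไทย', 'ไทย']) trie.add(w);
const matches = trie.prefixes('ภาษาไทยเป็น', 0);
// ['ภาษา', 'ภาษาไทย']
```

A single walk yields all candidate words at a position, which is what a maximal-matching tokenizer needs when it builds its graph of split positions.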

Installation

npm install nlpo3-newmm-typescript

Usage

TypeScript / ESM

import { NewmmTokenizer } from 'nlpo3-newmm-typescript';

// Default dictionary only
const tok = new NewmmTokenizer();
tok.segment('ภาษาไทยเป็นภาษาที่มีโครงสร้างซับซ้อน');
// ['ภาษา', 'ไทย', 'เป็น', 'ภาษา', 'ที่', 'มี', 'โครงสร้าง', 'ซับซ้อน']

CommonJS

const { NewmmTokenizer } = require('nlpo3-newmm-typescript');

const tok = new NewmmTokenizer();
const tokens = tok.segment('สวัสดีชาวโลก');

Default dictionary + custom words

const tok = new NewmmTokenizer(['คำศัพท์เฉพาะทาง', 'nlpo3']);
tok.segment('nlpo3เป็นคำศัพท์เฉพาะทาง');
// ['nlpo3', 'เป็น', 'คำศัพท์เฉพาะทาง']

Isolated word list (no defaults)

const tok = NewmmTokenizer.fromWordList(['สวัสดี', 'ชาว', 'โลก']);
tok.segment('สวัสดีชาวโลก');
// ['สวัสดี', 'ชาว', 'โลก']

Add / remove words dynamically

const tok = new NewmmTokenizer();
tok.addWord('นิวซีแลนด์');
tok.removeWord('ที่ไม่ต้องการ');
tok.segment('นิวซีแลนด์');

Safe mode (for long texts)

tok.segment(longText, true);  // second arg = safe mode

API

new NewmmTokenizer(customWords?: string[])

Create a tokenizer with the built-in ~62k word dictionary. Optionally merge custom words on top.

NewmmTokenizer.fromWordList(words: string[])

Create a tokenizer using only the given word list. No default dictionary.

segment(text: string, safe?: boolean): string[]

Tokenize text into words.

  • safe — enable safe mode (default false). Uses a sliding window to avoid long run times on highly ambiguous input. Recommended for texts longer than ~140 characters.
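As a rough illustration of what a sliding-window safe mode can look like, the sketch below cuts long input into chunks of at most 140 characters and segments each chunk independently. It is an assumption-laden sketch, not this package's code: `segmentSafe`, `ChunkSegmenter`, `WINDOW_BEGIN`, and the cut heuristic are hypothetical, and only the ~140-character figure comes from the text above.

```typescript
// Hypothetical callback standing in for the normal (non-safe) segmenter.
type ChunkSegmenter = (chunk: string) => string[];

const WINDOW_BEGIN = 100; // assumed: earliest offset at which a cut may happen
const WINDOW_END = 140;   // taken from the ~140-character guidance above

function segmentSafe(text: string, segmentChunk: ChunkSegmenter): string[] {
  const tokens: string[] = [];
  let rest = text;
  while (rest.length >= WINDOW_END) {
    const windowText = rest.slice(WINDOW_BEGIN, WINDOW_END);
    const space = windowText.lastIndexOf(' ');
    let cut = WINDOW_BEGIN;
    if (space >= 0) {
      // Prefer cutting right after the last space in the scan window.
      cut += space + 1;
    } else {
      // Assumed heuristic: cut just before the longest token in the window,
      // so the most word-like token is not split across two chunks.
      let offset = 0;
      let longest = -1;
      let longestStart = 0;
      for (const t of segmentChunk(windowText)) {
        if (t.length >= longest) {
          longest = t.length;
          longestStart = offset;
        }
        offset += t.length;
      }
      cut += longestStart;
    }
    tokens.push(...segmentChunk(rest.slice(0, cut)));
    rest = rest.slice(cut);
  }
  if (rest.length > 0) tokens.push(...segmentChunk(rest));
  return tokens;
}
```

Because every chunk is bounded, the number of ambiguous paths the main algorithm can explore per call is bounded too. The cost is that a word straddling a cut point may be split differently than in normal mode.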

segmentWithOptions(text: string, safe: boolean, parallelChunkSize?: number): string[]

Full-options entry point. parallelChunkSize is accepted for API parity with the Rust version but has no effect in this single-threaded implementation.

addWord(...words: string[]): void

Add one or more words to the dictionary.

removeWord(...words: string[]): void

Remove one or more words from the dictionary.

How it works

Input text
    ↓
TCC (Thai Character Cluster) — compute valid split positions
    ↓
Main loop (min-heap of candidate positions):
  ├─ dictionary prefix lookup at current position
  ├─ build graph: position → position + word_length
  ├─ when only 1 candidate → BFS shortest path → extract tokens
  └─ when 0 candidates → non-Thai pattern match / forward scan
    ↓
Word tokens
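The graph-building and BFS steps in the flow above can be sketched as follows. This is a deliberately simplified illustration: a `Set` stands in for the trie, TCC constraints and the min-heap are omitted, and unmatched characters fall back to single-character tokens. The function name `segmentByBfs` and the `maxWordLen` parameter are hypothetical.

```typescript
// Simplified sketch: dictionary graph + BFS for the path with the fewest tokens.
function segmentByBfs(text: string, words: Set<string>, maxWordLen = 20): string[] {
  const n = text.length;

  // graph[i] lists every position reachable from i by consuming one token.
  const graph: number[][] = [];
  for (let i = 0; i < n; i++) {
    const ends: number[] = [];
    for (let j = i + 1; j <= Math.min(n, i + maxWordLen); j++) {
      if (words.has(text.slice(i, j))) ends.push(j);
    }
    // Assumed fallback: an unmatched character becomes its own token.
    if (ends.length === 0) ends.push(i + 1);
    graph.push(ends);
  }

  // BFS from 0 to n. The first visit to a position uses the fewest tokens,
  // and the visited set keeps each position from being expanded twice.
  const prev: number[] = new Array<number>(n + 1).fill(-1);
  const visited = new Set<number>([0]);
  const queue: number[] = [0];
  for (let head = 0; head < queue.length; head++) {
    const pos = queue[head];
    if (pos === n) break;
    for (const next of graph[pos]) {
      if (!visited.has(next)) {
        visited.add(next);
        prev[next] = pos;
        queue.push(next);
      }
    }
  }

  // Walk the predecessor chain backwards to recover the tokens.
  const tokens: string[] = [];
  for (let end = n; end > 0; end = prev[end]) {
    tokens.unshift(text.slice(prev[end], end));
  }
  return tokens;
}

const demo = segmentByBfs('สวัสดีชาวโลก', new Set(['สวัสดี', 'ชาว', 'โลก']));
// ['สวัสดี', 'ชาว', 'โลก']
```

Because all edges point forward, the graph has no cycles; the visited set matters for cost, since it stops the search from re-expanding positions that several overlapping words lead to.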

The algorithm is the same dictionary-based maximal matching used by PyThaiNLP's newmm tokenizer, with:

  • TCC rules from Theeramunkong et al. 2000
  • BFS path resolution with visited-set cycle prevention
  • Non-Thai text detection (English, numbers, whitespace)
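For the non-Thai case, a pattern match at the current position is enough to pull out a whole English word, number, or whitespace run as one token. The sketch below shows one way to do that; the regular expression and the name `matchNonThai` are assumptions for illustration, and the patterns this package actually uses are not shown here.

```typescript
// Assumed pattern: a run of Latin letters, a number with optional , or . groups,
// a run of spaces/tabs, or a single line break. Anchored at the start of the input.
const NON_THAI = /^(?:[A-Za-z]+|\d+(?:[,.]\d+)*|[ \t]+|\r?\n)/;

// Return the non-Thai token starting at `pos`, or null if none starts there.
function matchNonThai(text: string, pos: number): string | null {
  const m = NON_THAI.exec(text.slice(pos));
  return m ? m[0] : null;
}

const sample = 'hello 123,456.7 ไทย';
matchNonThai(sample, 0);  // 'hello'
matchNonThai(sample, 6);  // '123,456.7'
matchNonThai(sample, 16); // null (Thai text starts here)
```

Slicing the input keeps the sketch stateless; a sticky (`y`) regular expression with `lastIndex` would avoid the copy on long inputs.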

Benchmarks

Run with: npm run test:perf

Accuracy

| Dataset | Sentences | Text Match | Boundary F1 |
|---------|-----------|------------|-------------|
| LST20 | 300 | 99.3% | 88.6% |
| thai_wordseg_menu | 109 | 100.0% | 68.8% |

| Difficulty | Sentences | Text Match | Boundary F1 |
|------------|-----------|------------|-------------|
| easy | 20 | 100% | 35.0% |
| medium | 20 | 100% | 77.4% |
| hard | 20 | 100% | 81.2% |
| very_hard | 20 | 100% | 88.9% |
| noisy | 29 | 100% | 63.7% |
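The tables do not define their metrics. One common reading, assumed here, is that Boundary F1 is the harmonic mean of precision and recall over token boundary offsets compared against a gold segmentation. The sketch below computes it under that assumption; `boundaries` and `boundaryF1` are hypothetical names, and the `test:perf` script may calculate the figure differently.

```typescript
// Character offsets at which a token ends, excluding the end of the text
// (assumed convention: the final boundary is trivially shared by both sides).
function boundaries(tokens: string[]): Set<number> {
  const cuts = new Set<number>();
  let offset = 0;
  for (const t of tokens.slice(0, -1)) {
    offset += t.length;
    cuts.add(offset);
  }
  return cuts;
}

function boundaryF1(predicted: string[], gold: string[]): number {
  const p = boundaries(predicted);
  const g = boundaries(gold);
  if (p.size === 0 && g.size === 0) return 1; // both are a single token
  let hits = 0;
  p.forEach((cut) => {
    if (g.has(cut)) hits++;
  });
  if (hits === 0) return 0;
  const precision = hits / p.size;
  const recall = hits / g.size;
  return (2 * precision * recall) / (precision + recall);
}

// Gold has cuts at offsets 4 and 7; the prediction finds only the cut at 7.
const f1 = boundaryF1(['ภาษาไทย', 'เป็น'], ['ภาษา', 'ไทย', 'เป็น']);
// precision = 1, recall = 0.5, so f1 is 2/3
```

Under this reading, a tokenizer can score 100% on a text-level check (the tokens concatenate back to the input) and still score low on Boundary F1 when its split points disagree with the gold annotation.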

Speed (LST20)

| Throughput | Avg per sentence |
|------------|------------------|
| ~492 sent/s | ~2.0 ms |

Memory

| Metric | Value |
|--------|-------|
| Tokenizer idle overhead | ~22 MB |
| RSS after dict load | ~182 MB |
| Active segment (300 LST20) | negative heap churn |

Negative heap churn means heap usage was lower after the 300 segment calls than before them: the garbage collector reclaimed at least as much memory as the calls allocated, so segmentation causes no net heap growth.

Tests

npm test

Build

npm run build

Output:

  • dist/index.js — ESM bundle (target es2020)
  • dist/index.cjs — CommonJS bundle (target es2016)
  • dist/index.d.ts + dist/index.d.cts — TypeScript declarations
  • dist/words_th.txt — bundled dictionary

Project structure

src/
  index.ts          — Public exports
  newmm.ts          — NewmmTokenizer (main algorithm)
  trie_char.ts      — Character-based trie dictionary
  tcc_rules.ts      — TCC regex patterns (24 rules)
  tcc_tokenizer.ts  — TCC position computation
  default_dict.ts   — Default dictionary loader
  words_th.txt      — Bundled ~62k word dictionary
test/
  newmm.test.ts     — Test suite (15 tests)
scripts/
  fix-cjs.mjs       — Post-build CJS compat patching
tsup.config.ts      — Build config (dual ESM/CJS)

Credits

  • Algorithm: Korakot Chaovavanich, Jakkrit TeCho, Wittawat Jitkrittum, Thanathip Suntorntip
  • Rust implementation: nlpO3 by PyThaiNLP
  • TCC rules: Theeramunkong et al. 2000 — "Learning-based Thai Word Boundary"
  • Thai dictionary: PyThaiNLP project (words_th.txt)

License

Apache-2.0 (matching nlpO3)