lao-segmenter
v0.1.2
Lao word segmenter using maximal matching with a 34k-word dictionary — works in Node.js and browsers

A Lao word segmenter for Node.js and the browser. It splits unsegmented Lao text into words using a dictionary of 34,000 Lao words and a maximal matching algorithm — the same technique used by PyThaiNLP for Thai text.
import { segment } from 'lao-segmenter'
segment('ຂ້ອຍຮຽນພາສາລາວ')
// → ['ຂ້ອຍ', 'ຮຽນ', 'ພາສາລາວ']
// → ['I', 'study', 'Lao language']
Why this package exists
Lao script has no spaces between words — just like Thai or Khmer. This makes it hard for computers to know where one word ends and the next begins. Until now, no Lao word segmenter existed for JavaScript. This package fills that gap.
Install
npm install lao-segmenter
Works with Node.js 16+, Bun, and Deno. Zero runtime dependencies.
Quick start
import { segment } from 'lao-segmenter'
// Basic segmentation
segment('ສະບາຍດີ')
// → ['ສະບາຍດີ'] (one dictionary word)
segment('ຄົນລາວ')
// → ['ຄົນ', 'ລາວ'] (two words: "person" + "Lao")
segment('ຂ້ອຍໄປຮຽນທີ່ໂຮງຮຽນ')
// → ['ຂ້ອຍ', 'ໄປ', 'ຮຽນ', 'ທີ່', 'ໂຮງຮຽນ']
// → ['I', 'go', 'study', 'at', 'school']
Mixed Lao and English:
segment('ພາສາລາວ hello world')
// → ['ພາສາລາວ', ' ', 'hello', ' ', 'world']
Numbers and prices:
segment('ລາຄາ 1000 ກີບ')
// → ['ລາຄາ', ' ', '1000', ' ', 'ກີບ']
// → ['price', ' ', '1000', ' ', 'kip']
API
segment(text, options?)
Splits a string into an array of tokens.
segment(text: string, options?: SegmentOptions): string[]
Options:
| Option | Type | Default | Description |
|---|---|---|---|
| keepWhitespace | boolean | true | Include space tokens in the result |
| customWords | string[] | [] | Extra words to add to the dictionary |
| trie | Trie | — | Bring your own pre-built Trie |
Examples:
// Remove spaces from the output
segment('ຂ້ອຍ ຮຽນ ພາສາ', { keepWhitespace: false })
// → ['ຂ້ອຍ', 'ຮຽນ', 'ພາສາ']
// Add custom words not in the default dictionary
segment('ໂຄ້ດດິ້ງລາວ', { customWords: ['ໂຄ້ດດິ້ງ'] })
// → ['ໂຄ້ດດິ້ງ', 'ລາວ']
new Segmenter(options?)
A reusable class that builds the dictionary index once and reuses it across many calls. Faster when you segment a lot of text.
import { Segmenter } from 'lao-segmenter'
const seg = new Segmenter({ customWords: ['ຊາວໜຸ່ມ'] })
seg.segment('ຊາວໜຸ່ມລາວ')
// → ['ຊາວໜຸ່ມ', 'ລາວ']
seg.segment('ຂ້ອຍຮຽນ')
// → ['ຂ້ອຍ', 'ຮຽນ']
splitLGC(text)
A lower-level function that splits text into Lao Grapheme Clusters — the smallest atomic units of Lao script (roughly one syllable per cluster). Useful when you need character-level control.
import { splitLGC } from 'lao-segmenter'
splitLGC('ເກາະ')
// → ['ເກາະ'] (one cluster: leading vowel + consonant + trailing vowel)
How it works
- Trie lookup — the 34,000-word dictionary is loaded into a prefix tree (trie) for fast lookups.
- Maximal matching — at each position, the algorithm finds the longest word that matches the dictionary.
- LGC fallback — if no dictionary match is found, the segmenter advances one Lao Grapheme Cluster so it never gets stuck on unknown words.
- ໆ absorption — the Lao repetition mark ໆ is always merged with the word before it (e.g. ຕ່າງໆ stays as one token).
This is the same algorithm family as PyThaiNLP's newmm tokenizer, adapted for Lao Unicode.
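The steps above can be sketched in a few lines of plain JavaScript. This is only an illustration under stated assumptions, not the package's actual source: the names `buildTrie`, `longestMatch`, `sketchSegment`, and `demoTrie` are hypothetical, the dictionary is a six-word toy list, and the fallback advances a single character as a stand-in for the real Lao Grapheme Cluster step.

```javascript
// Illustrative sketch only. None of these names are exported by lao-segmenter.

const REPEAT_MARK = '\u0EC6' // ໆ, the Lao repetition mark

// Build a prefix tree: each node maps one character to a child node.
// Lao characters are in the Basic Multilingual Plane, so indexing by
// UTF-16 code unit is enough for this sketch.
function buildTrie(words) {
  const root = { children: new Map(), isWord: false }
  for (const word of words) {
    let node = root
    for (let i = 0; i < word.length; i++) {
      const ch = word[i]
      if (!node.children.has(ch)) {
        node.children.set(ch, { children: new Map(), isWord: false })
      }
      node = node.children.get(ch)
    }
    node.isWord = true
  }
  return root
}

// Length of the longest dictionary word starting at `start`, or 0 if none.
function longestMatch(trie, text, start) {
  let node = trie
  let best = 0
  for (let i = start; i < text.length; i++) {
    node = node.children.get(text[i])
    if (!node) break
    if (node.isWord) best = i - start + 1
  }
  return best
}

function sketchSegment(trie, text) {
  const tokens = []
  let pos = 0
  while (pos < text.length) {
    // Maximal matching, with a one-character fallback for unknown text.
    // (Assumption: the real fallback steps by grapheme cluster instead,
    // which keeps vowel and tone marks attached to their consonant.)
    const len = longestMatch(trie, text, pos) || 1
    let end = pos + len
    // Absorb any repetition marks into the token that precedes them.
    while (end < text.length && text[end] === REPEAT_MARK) end++
    tokens.push(text.slice(pos, end))
    pos = end
  }
  return tokens
}

const demoTrie = buildTrie(['ຂ້ອຍ', 'ຮຽນ', 'ພາສາ', 'ພາສາລາວ', 'ລາວ', 'ຕ່າງ'])

sketchSegment(demoTrie, 'ຂ້ອຍຮຽນພາສາລາວ')
// → ['ຂ້ອຍ', 'ຮຽນ', 'ພາສາລາວ']  (ພາສາລາວ wins over the shorter ພາສາ)
sketchSegment(demoTrie, 'ຕ່າງໆ')
// → ['ຕ່າງໆ']
```

Because the trie walk keeps going after the first complete word, the longer entry is preferred whenever both a word and one of its prefixes are in the dictionary.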
Dictionary sources
The built-in dictionary combines three open-source word lists:
| Source | Words | License |
|---|---|---|
| Lao Dictionary by Brian Wilson | ~11,000 | BSD 3-Clause |
| Wiktionary Lao snapshot 2021 | ~13,000 | CC-BY-SA 3.0 |
| Google Language Resources spell-check | ~21,000 | Apache 2.0 |
After deduplication: 33,996 unique words, sorted longest-first for best matching performance.
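As a rough illustration of that merge step, the sketch below deduplicates several word lists and sorts the result longest-first. It is an assumption about how such a step could look, not the contents of the package's build script: `mergeWordLists` is a hypothetical name, and trimming whitespace and dropping empty entries are guesses about reasonable clean-up.

```javascript
// Illustrative sketch only. mergeWordLists is a hypothetical helper.
function mergeWordLists(...lists) {
  const unique = new Set()
  for (const list of lists) {
    for (const raw of list) {
      const word = raw.trim() // assumed clean-up: strip stray whitespace
      if (word) unique.add(word) // skip empty lines
    }
  }
  // Longest first; ties fall back to code-unit order so output is stable.
  return [...unique].sort(
    (a, b) => b.length - a.length || (a < b ? -1 : a > b ? 1 : 0)
  )
}

mergeWordLists(['ລາວ', 'ພາສາລາວ'], ['ລາວ ', 'ຮຽນ'], [''])
// → ['ພາສາລາວ', 'ລາວ', 'ຮຽນ']
```

A `Set` removes entries that appear in more than one source, which is why the three lists above shrink to 33,996 words rather than adding up to their combined size.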
CommonJS usage
const { segment } = require('lao-segmenter')
segment('ສະບາຍດີ')
// → ['ສະບາຍດີ']
TypeScript
This package ships with full TypeScript types.
import { segment, Segmenter, SegmentOptions } from 'lao-segmenter'
const options: SegmentOptions = { keepWhitespace: false }
const tokens: string[] = segment('ຂ້ອຍຮຽນ', options)
Rebuild the dictionary
If you want to update the dictionary from the original sources:
node scripts/build-dict.mjs
This downloads the latest word lists and regenerates data/lao-words.txt.
Related projects
- LaoNLP — Lao NLP library for Python
- PyThaiNLP — Thai NLP library (inspiration for the algorithm)
- Awesome Lao NLP — curated list of Lao language resources
License
MIT © Xaypanya Phongsa
The bundled dictionary files have separate licenses — see Dictionary sources above.
