tltk-js
v0.1.4
Pure JS port of TLTK (Thai Language Toolkit)
TLTK-JS
A JavaScript/TypeScript port of the Python TLTK library for Thai text processing.
Installation
npm install tltk-js
Usage
import { g2p, th2roman } from 'tltk-js';
// Convert Thai to IPA
const ipa = g2p("สวัสดี");
// Output: สวัส~ดี<tr/>sa1'wat1~dii0|<s/>
// Convert Thai to Romanized form
const roman = th2roman("สวัสดี", { stripTags: true });
// Output: sawatdi
API
g2p(input: string, options?: TLTKOptions): string
Converts Thai text to IPA transcription.
th2roman(input: string, options?: TLTKOptions): string
Converts Thai text to Romanized form (RTGS approximation).
Options
- stripTags?: boolean - If true, removes XML-like tags and separators from output. Default: false.
- fallbackHeuristics?: boolean - If true, enables heuristic IPA generation for unknown syllables AND consonant shifting logic for clusters. Default: false (matches Python TLTK behavior).
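The two options and their documented defaults can be sketched as a type plus a small resolver. The interface mirrors the option names listed above; `DEFAULT_OPTIONS` and `resolveOptions` are hypothetical names used only for illustration and are not claimed to exist in the package.

```typescript
// Shape of the options object as described in this README.
interface TLTKOptions {
  stripTags?: boolean;
  fallbackHeuristics?: boolean;
}

// Documented defaults: both flags are off unless the caller opts in.
const DEFAULT_OPTIONS: Required<TLTKOptions> = {
  stripTags: false,
  fallbackHeuristics: false,
};

// Hypothetical helper: fills in any option the caller left undefined.
function resolveOptions(options: TLTKOptions = {}): Required<TLTKOptions> {
  return {
    stripTags: options.stripTags ?? DEFAULT_OPTIONS.stripTags,
    fallbackHeuristics:
      options.fallbackHeuristics ?? DEFAULT_OPTIONS.fallbackHeuristics,
  };
}

console.log(resolveOptions());
console.log(resolveOptions({ stripTags: true }));
```

Leaving `fallbackHeuristics` unset keeps the behavior described as matching Python TLTK; callers that want unknown syllables preserved would set it to true explicitly.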
Architecture & Design Rationale
Core Pipeline
Input -> preprocess -> sylparse -> wordparse -> selectPhones -> Output
- preprocess: Handles mixed Thai/English, spacing, and special characters.
- sylparse: Syllable segmentation using regex patterns and trigram probabilities.
- wordparse: Word segmentation using dictionary lookup (TDICT) and chart parsing.
- selectPhones: Selects the best pronunciation for each syllable from tltk_data.json.
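The data flow through the four stages can be sketched as plain function composition. Everything below is a placeholder under stated assumptions: the stage bodies are toy stand-ins (whitespace splitting instead of regex and trigram segmentation, a two-entry `Map` instead of tltk_data.json), and the signatures are guesses made for illustration, not the package's internals. The two lexicon entries are taken from the g2p example output shown in Usage.

```typescript
type Syllables = string[];
type Words = Syllables[];

// Placeholder stage: normalizes whitespace only.
const preprocess = (input: string): string =>
  input.trim().replace(/\s+/g, " ");

// Placeholder stage: treats each space-separated chunk as one syllable.
const sylparse = (text: string): Syllables =>
  text === "" ? [] : text.split(" ");

// Placeholder stage: groups all syllables into a single word.
const wordparse = (syllables: Syllables): Words =>
  syllables.length > 0 ? [syllables] : [];

// Placeholder stage: looks each syllable up in a toy lexicon,
// marking unknown syllables with "?".
const selectPhones = (words: Words, lexicon: Map<string, string>): string =>
  words
    .map((word) => word.map((syl) => lexicon.get(syl) ?? "?").join("~"))
    .join(" ");

// Input -> preprocess -> sylparse -> wordparse -> selectPhones -> Output
function transcribe(input: string, lexicon: Map<string, string>): string {
  return selectPhones(wordparse(sylparse(preprocess(input))), lexicon);
}

const toyLexicon = new Map([
  ["สวัส", "sa1'wat1"],
  ["ดี", "dii0"],
]);
console.log(transcribe("  สวัส   ดี ", toyLexicon)); // sa1'wat1~dii0
```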
Heuristic Fallback (src/heuristics.ts)
The Problem with Python TLTK
Python TLTK relies entirely on its dictionary (tltk_data.json) for pronunciation lookups. When it encounters a syllable not in the dictionary, it:
- Silently drops the syllable from the output, OR
- Truncates the remaining text after the unknown syllable.
Example:
Input: "โอม มฤกกึกกึย" (Mantra with rare/nonsense syllables)
Python: "om marue" (The "กกึ", "กกึย" parts are lost!)
This is problematic for applications that need to handle:
- Religious/mantra texts with non-standard combinations
- User-generated content with typos
- Rare or archaic Thai words not in the dictionary
- Invented words or names
Our Solution: Heuristic IPA Generation
Instead of silent failure, we implemented guessIPA() in src/heuristics.ts:
- When invoked: Only when selectPhones() finds zero pronunciations for a syllable.
- What it does: Analyzes Thai graphemes and constructs a plausible IPA string using:
  - Consonant mappings (initial vs. final position)
  - Vowel mappings
  - Implicit vowel insertion rules (e.g., กก → kok)
- Result: The syllable is preserved with an approximate pronunciation instead of being dropped.
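The fallback flow above can be sketched as a dictionary lookup that hands off to a guesser only when no pronunciation is found. This is a toy under stated assumptions, not the code in src/heuristics.ts: the mapping tables cover a handful of letters, tones are ignored, and `guessSyllable` and `pronounce` are hypothetical names. The `fallbackHeuristics` flag follows the option described under Options.

```typescript
// Toy tables for illustration only; the real mappings are not shown in this README.
const INITIAL = new Map<string, string>([["ก", "k"], ["ม", "m"], ["ด", "d"], ["น", "n"]]);
const FINAL = new Map<string, string>([["ก", "k"], ["ม", "m"], ["ด", "t"], ["น", "n"]]);
const VOWEL = new Map<string, string>([
  ["\u0E32", "aa"], // sara aa
  ["\u0E35", "ii"], // sara ii
]);

// Handles only: consonant + optional written vowel + optional final consonant.
function guessSyllable(syllable: string): string | null {
  const chars = Array.from(syllable);
  const initial = INITIAL.get(chars[0] ?? "");
  if (initial === undefined) return null;

  let rest = chars.slice(1);
  const writtenVowel = VOWEL.get(rest[0] ?? "");
  if (writtenVowel !== undefined) rest = rest.slice(1);
  if (rest.length > 1) return null; // more structure than this toy handles

  let final = "";
  if (rest.length === 1) {
    const mapped = FINAL.get(rest[0] ?? "");
    if (mapped === undefined) return null;
    final = mapped;
  }
  // A bare consonant with neither vowel nor final is out of scope here.
  if (writtenVowel === undefined && final === "") return null;

  // Implicit vowel insertion: no written vowel between two consonants gives "o".
  return initial + (writtenVowel ?? "o") + final;
}

// Dictionary first; the guesser runs only when there are zero candidates
// and the caller enabled the fallback.
function pronounce(
  syllable: string,
  dictionary: Map<string, string[]>,
  fallbackHeuristics: boolean,
): string | null {
  const candidates = dictionary.get(syllable) ?? [];
  if (candidates.length > 0) return candidates[0] ?? null; // toy: take the first
  return fallbackHeuristics ? guessSyllable(syllable) : null;
}

const toyDictionary = new Map<string, string[]>([["ดี", ["dii0"]]]);
console.log(pronounce("ดี", toyDictionary, true));  // dii0 (dictionary hit)
console.log(pronounce("กก", toyDictionary, true));  // kok (heuristic guess)
console.log(pronounce("กก", toyDictionary, false)); // null (syllable would be lost)
```

Keeping the guesser behind the zero-candidate check is what lets dictionary words pass through untouched.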
Same example with JS TLTK:
Input: "โอม มฤกกึกกึย"
JS: "om maruek kuek kuei" (All syllables preserved!)
Key Guarantee: Standard Thai is Unaffected
For valid, standard Thai phrases, heuristics.ts is NEVER invoked.
The dictionary lookup in selectPhones handles all known words. This ensures:
- ✅ 100% parity with Python TLTK for compliant inputs (verified by verify_parity.js)
- ✅ Graceful degradation for unknown inputs (tested by test_deviations.mjs)
Why Not Just Add Words to the Dictionary?
Adding every possible syllable combination to tltk_data.json is impractical because:
- The dictionary is already ~17MB
- Nonsense and mantra words come in unlimited variations
- Typos and invented words cannot be pre-enumerated
A heuristic approach provides reasonable coverage without bloating the data file.
Consonant Shifting in th2roman
When a segment starts with a double consonant (e.g., kkue) and the previous segment ends with a vowel (e.g., marue), the shifting logic moves the first consonant to close the previous syllable:
marue + kkue -> maruek + kue
This is applied via regex:
tran.replace(/([aeiou])\s+([bcdfghjklmnpqrstvwxyz])\2/g, "$1$2 $2");
Test Suites
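Before running the suites, the consonant-shifting expression quoted above can be checked in isolation. The pattern and replacement string are copied from this README; the wrapper name `shiftDoubledConsonant` is hypothetical, and the unshifted input strings are assumed intermediate forms, chosen to match the examples given earlier.

```typescript
// Moves the first of two identical consonants back onto a preceding
// vowel-final segment: "marue kkue" -> "maruek kue".
function shiftDoubledConsonant(tran: string): string {
  return tran.replace(
    /([aeiou])\s+([bcdfghjklmnpqrstvwxyz])\2/g,
    "$1$2 $2",
  );
}

console.log(shiftDoubledConsonant("marue kkue"));          // maruek kue
console.log(shiftDoubledConsonant("om marue kkue kkuei")); // om maruek kuek kuei
console.log(shiftDoubledConsonant("sawat dii"));           // sawat dii (unchanged)
```

The backreference `\2` is what restricts the rule to doubled consonants, so an ordinary segment boundary such as "sawat dii" is left alone.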
1. verify_parity.js - Parity Tests
Purpose: Ensure JS output matches Python TLTK for standard inputs.
Data Source: ground_truth.json generated from Python TLTK.
Usage:
node verify_parity.js
Expected: 100% pass rate. Any failure indicates a regression.
2. test_deviations.mjs - Enhancement Tests
Purpose: Test enhanced behavior for edge cases where we intentionally deviate from Python TLTK.
Rationale: Python TLTK truncates/silently drops unknown syllables. Our JS port provides a heuristic fallback instead. This test suite validates that fallback produces reasonable output.
Example:
{
input: "โอม มฤกกึกกึย",
// Python TLTK: "om marue" (truncated)
// JS TLTK: "om maruek kuek kuei" (enhanced)
expectedRomKeywords: ["om", "maruek", "kue", "kuei"]
}
Usage:
node test_deviations.mjs
3. test_dist.js - Smoke Test
Purpose: Quick sanity check that the built distribution is importable and functional.
Usage:
node test_dist.js
Summary Table
| Test File | Purpose | Expected Behavior |
|---------------------|-----------------------------|----------------------------------------|
| verify_parity.js | Standard Thai parity | 100% match with Python TLTK |
| test_deviations.mjs | Unknown syllable handling | Enhanced output (no truncation) |
| test_dist.js | Build smoke test | No errors, basic output |
License
MIT
