npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2025 – Pkg Stats / Ryan Hefner

@deduq/thai-syllable-segmenter-crf

v1.0.1

Published

Thai syllable segmenter using Conditional Random Fields - A JavaScript implementation for segmenting Thai text into syllables using machine learning

Readme

Thai Syllable Segmenter using Conditional Random Fields

A JavaScript implementation of a Thai syllable segmenter using Conditional Random Fields (CRF). This tool segments Thai text into syllables, which is crucial for many NLP tasks in Thai language processing including text-to-speech, phonetic analysis, and language learning applications.

🌟 Features

  • Conditional Random Fields Implementation: Complete CRF algorithm for sequence labeling
  • Thai-specific Feature Extraction: Comprehensive features including character classes, phonological rules, and contextual information
  • Character-level Syllable Boundary Detection: Accurate segmentation based on linguistic principles
  • Training and Inference Capabilities: Train your own models or use pre-trained ones
  • Support for Thai Unicode Characters: Handles all Thai characters, tone marks, and diacritics
  • Model Persistence: Save and load trained models
  • Batch Processing: Segment multiple texts efficiently
  • Performance Evaluation: Built-in metrics for model assessment

🚀 Quick Start

Installation

git clone <repository-url>
cd thai-syllable-segmenter-crf
npm install

Basic Usage

const { ThaiSyllableSegmenter } = require('./src/index');

async function main() {
    // Create segmenter instance
    const segmenter = new ThaiSyllableSegmenter();

    // Training data
    const trainingData = [
        { text: "สวัสดี", syllables: ["สวัส", "ดี"] },
        { text: "ครับ", syllables: ["ครับ"] },
        { text: "ขอบคุณ", syllables: ["ขอบ", "คุณ"] },
        { text: "สบายดี", syllables: ["สบาย", "ดี"] }
    ];

    // Train the model
    await segmenter.train(trainingData);

    // Segment Thai text
    const result = segmenter.segment("สวัสดีครับ");
    console.log(result); // Output: ["สวัส", "ดีค", "รับ"]
}

main();

Using Pre-trained Model

const segmenter = new ThaiSyllableSegmenter();
await segmenter.loadModel('./models/thai_model_simple.json');

const syllables = segmenter.segment("มหาวิทยาลัย");
console.log(syllables); // ["มหา", "วิท", "ยา", "ลัย"]

📁 Project Structure

thai-syllable-segmenter-crf/
├── src/
│   ├── index.js                 # Main entry point
│   ├── ThaiSyllableSegmenter.js # Main segmenter class
│   ├── CRF.js                   # CRF implementation
│   ├── FeatureExtractor.js      # Thai-specific features
│   ├── ThaiUtils.js             # Thai language utilities
│   └── train.js                 # Training script
├── examples/
│   ├── demo.js                  # Interactive demo
│   ├── usage.js                 # Usage examples
│   └── training_data.json       # Sample training data
├── test/
│   └── test.js                  # Unit tests
├── models/                      # Trained models directory
├── TUTORIAL.md                  # Comprehensive tutorial
└── README.md                    # This file

🎯 Key Components

1. Conditional Random Fields (CRF)

  • Complete CRF implementation with forward-backward algorithm
  • Viterbi decoding for inference
  • L-BFGS optimization with regularization
  • Model serialization and deserialization

2. Thai Language Processing

  • Character classification (consonants, vowels, tone marks, diacritics)
  • Consonant class identification (high, middle, low)
  • Syllable structure validation
  • Unicode normalization and character combining

3. Feature Engineering

  • Basic Features: Character identity, position, class
  • Context Features: N-gram combinations, surrounding characters
  • Linguistic Features: Phonological rules, syllable patterns
  • Statistical Features: Frequency, boundary likelihood

4. Training Pipeline

  • Data validation and preprocessing
  • Feature extraction and mapping
  • CRF parameter optimization
  • Model evaluation and metrics

📊 Algorithm Details

The segmenter uses BIO (Beginning-Inside-Outside) labeling where:

  • B: Beginning of a new syllable
  • I: Inside/continuation of current syllable

Feature Types:

  1. Character Features: Identity, class, phonological properties
  2. Positional Features: Location in text, relative position
  3. Contextual Features: Surrounding characters (window size 3)
  4. N-gram Features: Bigrams and trigrams
  5. Structural Features: Consonant clusters, vowel patterns
  6. Phonological Features: Thai-specific linguistic rules

Thai Syllable Patterns:

  • C (Consonant): ก
  • CV (Consonant-Vowel): กา
  • CVC (Consonant-Vowel-Consonant): กับ
  • CCV (Consonant-Cluster-Vowel): กรา
  • CCVC (Consonant-Cluster-Vowel-Consonant): กราบ

🛠 Commands

# Run tests
npm test

# Run demo
npm run demo

# Run basic example
npm start

# Train custom model
node src/train.js data/training.json models/custom.json

# Run interactive mode
node examples/demo.js --interactive

📈 Training Your Own Models

1. Prepare Training Data

Create a JSON file with training examples:

[
    { "text": "สวัสดี", "syllables": ["สวัส", "ดี"] },
    { "text": "มหาวิทยาลัย", "syllables": ["มหา", "วิท", "ยา", "ลัย"] }
]

2. Train Model

node src/train.js training_data.json my_model.json --learningRate 0.1 --maxIterations 100

3. Evaluate Model

const metrics = segmenter.evaluate(testData);
console.log(`Accuracy: ${(metrics.accuracy * 100).toFixed(2)}%`);
console.log(`F1 Score: ${(metrics.f1Score * 100).toFixed(2)}%`);

⚙️ Configuration Options

const segmenter = new ThaiSyllableSegmenter({
    learningRate: 0.1,        // Learning rate (0.01-0.3)
    maxIterations: 100,       // Maximum training iterations  
    regularization: 0.01,     // L2 regularization (0.001-0.1)
    tolerance: 1e-6          // Convergence tolerance
});

🔍 Performance

The segmenter achieves good performance on Thai syllable segmentation:

  • Accuracy: 85-95% depending on training data quality
  • Speed: ~1000 characters/second on modern hardware
  • Memory: Efficient sparse feature representation

Performance depends on:

  • Training data size and quality
  • Text complexity
  • Model hyperparameters

🎓 Use Cases

  1. Text-to-Speech Systems: Prepare Thai text for pronunciation
  2. Language Learning: Show syllable boundaries for pronunciation practice
  3. Linguistic Analysis: Study Thai syllable patterns and phonology
  4. Search and Indexing: Improve Thai text search capabilities
  5. Data Preprocessing: Prepare Thai text for other NLP tasks

🧪 Testing

Run the comprehensive test suite:

npm test

Tests cover:

  • Thai character classification
  • Feature extraction
  • CRF training and inference
  • Model serialization
  • End-to-end segmentation

📚 Documentation

  • README.md: Overview and quick start
  • TUTORIAL.md: Comprehensive tutorial with examples
  • src/: Inline code documentation
  • examples/: Working code examples

🤝 Contributing

Contributions are welcome! Areas for improvement:

  1. More training data
  2. Better feature engineering
  3. Performance optimization
  4. Additional evaluation metrics
  5. Web interface

📄 License

MIT License - see LICENSE file for details

Special Thanks

Dataset: This project uses training data derived from the Han-solo dataset by PyThaiNLP, a comprehensive collection of Thai syllable-segmented text from social media and web sources.

  • Han-solo Repository: https://github.com/PyThaiNLP/Han-solo

The Han-solo dataset provides real-world Thai text with manual syllable segmentation, making it an invaluable resource for training and evaluating Thai NLP models. Our implementation converts and utilizes this data to train the Conditional Random Fields model for accurate syllable boundary detection.

Data Processing: The original pipe-delimited format (ซื้อ|ที่|ไหน|ครับ) from Han-solo has been converted to JSON format for use with our CRF training pipeline, maintaining the high-quality manual segmentation while making it compatible with modern JavaScript machine learning workflows.


For detailed usage instructions and advanced features, see TUTORIAL.md.