sudachi-ts

v0.1.22

Published

3 months ago

TypeScript port of Sudachi morphological analyzer for Japanese text


gstamp

morphological-analyzer nlp japanese tokenization natural-language-processing text-processing

Sudachi-TS

TypeScript port of Sudachi Japanese morphological analyzer.

Warning: Dictionary files are required for Sudachi-TS to function. Please download them from the Sudachi releases page before using this library.

Features

Full Tokenization Support: A/B/C split modes for different granularities
Binary Dictionary Compatibility: Load and use pre-built Sudachi dictionaries
Dynamic Plugin System: Extensible architecture with runtime plugin loading
Dictionary Building: Complete CSV to binary dictionary conversion
Sentence Detection: Multi-sentence text processing
UTF-8 Handling: Proper Japanese text normalization and character encoding
POS Matching: Flexible part-of-speech filtering and matching
Counter Alias Recovery: Resolves numeric kana counters such as 1こ to the canonical counter lattice before best-path selection

Requirements

Node.js: >= 18.0.0
TypeScript: >= 5.0.0 (peer dependency)

Installation

npm install sudachi-ts

Or using yarn:

yarn add sudachi-ts

Quick Start

import { SplitMode } from 'sudachi-ts';
import { BinaryDictionary } from 'sudachi-ts/dictionary/binaryDictionary.js';

// Load dictionary
const dict = await BinaryDictionary.loadSystem('./path/to/system.dic');

// Create tokenizer
const tokenizer = dict.create();

// Tokenize text
const result = tokenizer.tokenize('東京都に行きました。');

// Access morphemes
for (const morpheme of result) {
  console.log(morpheme.surface()); // Surface form
  console.log(morpheme.readingForm()); // Reading
  console.log(morpheme.partOfSpeech()); // POS tags
  console.log(morpheme.normalizedForm()); // Normalized form
}

// Use different split modes
const modeAResult = tokenizer.tokenize(SplitMode.A, '京都に行きました');
const modeBResult = tokenizer.tokenize(SplitMode.B, '京都に行きました');
const modeCResult = tokenizer.tokenize(SplitMode.C, '京都に行きました');

// Clean up
await dict.close();
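Because the dictionary has to be closed explicitly, it can help to wrap its lifetime so that `close()` also runs when tokenization throws. The following is a minimal sketch, not part of sudachi-ts: `Closable` and `withResource` are hypothetical names, and the only thing assumed about the dictionary is the `close()` method shown above.

```typescript
// Hypothetical helper, not a sudachi-ts API. It relies only on the
// close() method used in the Quick Start; the return type is an assumption.
interface Closable {
  close(): Promise<void> | void;
}

async function withResource<R extends Closable, T>(
  open: () => Promise<R>,
  use: (resource: R) => Promise<T> | T,
): Promise<T> {
  const resource = await open();
  try {
    return await use(resource);
  } finally {
    // Runs on success and on failure, so the resource is always released.
    await resource.close();
  }
}
```

With the Quick Start names, a call might look like `await withResource(() => BinaryDictionary.loadSystem('./path/to/system.dic'), (dict) => dict.create().tokenize('東京都に行きました。'))`.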

Dictionary Files

Important: This package does not include dictionary files. You need to provide your own:

System Dictionary: Download from Sudachi releases
User Dictionary: Build your own using the CLI tools or provide existing .dic files

Example dictionary paths:

system_core.dic - Core system dictionary
system_full.dic - Full system dictionary
user.dic - User dictionary (optional)

Split Modes

Sudachi provides three tokenization modes:

Mode A: Shortest possible segmentation (most granular)
Mode B: Medium segmentation (balanced)
Mode C: Longest possible segmentation (least granular)

Example with "京都に行きました":

Mode A: 京都|に|行き|まし|た
Mode B: 京都|に|行きました
Mode C: 京都|に行きました
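Whatever the mode, the pieces cover the same input text; only the number of boundaries changes. The sketch below parses the pipe notation from the example above and reports the piece count and rejoined surface per mode. It works on plain strings and does not call the tokenizer; `parseSegmentation` is a hypothetical helper.

```typescript
// Splits the "a|b|c" notation used in the example into its pieces.
function parseSegmentation(notation: string): string[] {
  return notation.split('|');
}

// Segmentations copied from the example above, keyed by mode name.
const segmentationExamples: Record<string, string> = {
  A: '京都|に|行き|まし|た',
  B: '京都|に|行きました',
  C: '京都|に行きました',
};

const segmentationSummary = Object.entries(segmentationExamples).map(
  ([mode, notation]) => {
    const pieces = parseSegmentation(notation);
    // Joining the pieces must give back the original text in every mode.
    return { mode, count: pieces.length, surface: pieces.join('') };
  },
);

for (const { mode, count, surface } of segmentationSummary) {
  console.log(`Mode ${mode}: ${count} pieces, surface = ${surface}`);
}
```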

Configuration

Load configuration from a JSON file:

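import { loadConfig } from 'sudachi-ts/config/config.js';
import { Dictionary } from 'sudachi-ts/core/dictionary.js';

const config = await loadConfig('./sudachi.json');
// Note: this snippet does not pass `config` to the dictionary. To build a
// dictionary directly from a config file path, see DictionaryFactory under
// "Public Dictionary Access".
const dict = Dictionary.create();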

Example sudachi.json:

{
  "systemDict": "system_core.dic",
  "userDicts": ["user.dic"],
  "characterDefinitionFile": "char.def",
  "plugins": [
    {
      "className": "sudachi-ts/plugins/inputText/defaultInputTextPlugin.js",
      "settings": {
        "normalize": true
      }
    }
  ]
}

For non-absolute file references in config (dictionary files, plugin module paths, and built-in plugin file settings), Sudachi-TS tries paths relative to the config file first, then relative to the current working directory.
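The lookup order described above can be sketched as follows. This is an illustration of the documented behavior, not the library's implementation: `resolveReference` and its parameters are hypothetical, and returning absolute references unchanged is an assumption.

```typescript
import { existsSync } from 'node:fs';
import { dirname, isAbsolute, resolve } from 'node:path';

// Hypothetical helper illustrating the documented lookup order.
function resolveReference(
  reference: string,
  configFile: string,
  exists: (path: string) => boolean = existsSync,
  cwd: string = process.cwd(),
): string | undefined {
  // Assumption: absolute references are used as given.
  if (isAbsolute(reference)) return reference;
  const candidates = [
    resolve(dirname(configFile), reference), // 1. relative to the config file
    resolve(cwd, reference), // 2. relative to the current working directory
  ];
  // The first candidate that exists wins; undefined means neither was found.
  return candidates.find((candidate) => exists(candidate));
}
```

The `exists` and `cwd` parameters are injectable so the order can be checked without touching the file system.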

By default, Sudachi-TS enables a built-in compound-particle lexicon ("enableDefaultCompoundParticles": true) so forms such as かも, のか, and だから are tokenized as single morphemes. Set it to false to disable:

{
  "enableDefaultCompoundParticles": false
}
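For callers that read sudachi.json themselves, a typed view of the fields shown so far might look like the sketch below. The field names come from the examples above; the `SudachiConfig` interface and `withDefaults` are illustrative assumptions and not the library's own types.

```typescript
// Field names are taken from the sudachi.json examples above; the interface
// and withDefaults() are illustrative, not sudachi-ts exports.
interface SudachiConfig {
  systemDict?: string;
  userDicts?: string[];
  characterDefinitionFile?: string;
  enableDefaultCompoundParticles?: boolean;
}

function withDefaults(config: SudachiConfig): SudachiConfig {
  return {
    ...config,
    // Documented default: compound particles stay enabled unless set to false.
    enableDefaultCompoundParticles: config.enableDefaultCompoundParticles ?? true,
  };
}

const parsedConfig: SudachiConfig = JSON.parse('{"systemDict": "system_core.dic"}');
console.log(withDefaults(parsedConfig).enableDefaultCompoundParticles); // true
```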

The default OOV plugin stack also injects counter aliases in numeric contexts. A sentence with a kana counter, such as りんごを1こください。, is therefore analyzed as りんご / を / 1 / こ / ください / 。 with the counter こ normalized to 個, instead of falling through to unrelated dictionary entries.
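The effect of counter aliasing can be illustrated on a plain list of surfaces. This is a simplified sketch under stated assumptions: the library works on the lattice before best-path selection, whereas this code post-processes strings, and only the こ to 個 pair is taken from the text above. `COUNTER_ALIASES` and `normalizeCounters` are hypothetical names.

```typescript
// Only the こ → 個 pair comes from the documentation above; a real alias
// table would be larger.
const COUNTER_ALIASES: ReadonlyMap<string, string> = new Map([['こ', '個']]);

// ASCII or full-width digits only; kanji numerals are ignored in this sketch.
const NUMERAL = /^[0-9０-９]+$/;

function normalizeCounters(surfaces: readonly string[]): string[] {
  return surfaces.map((surface, index) => {
    const previous = index > 0 ? surfaces[index - 1] : undefined;
    // "Numeric context" here simply means: directly preceded by a numeral.
    const inNumericContext = previous !== undefined && NUMERAL.test(previous);
    return inNumericContext ? (COUNTER_ALIASES.get(surface) ?? surface) : surface;
  });
}
```

Outside a numeric context the same kana is left untouched, which mirrors why the alias is only injected after numerals.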

Working with Morphemes

Access detailed morpheme information:

const morpheme = result[0];

// Surface form
console.log(morpheme.surface());

// Word forms
console.log(morpheme.dictionaryForm()); // Dictionary form
console.log(morpheme.normalizedForm()); // Normalized form
console.log(morpheme.readingForm()); // Reading form

// Part of speech
console.log(morpheme.partOfSpeech()); // e.g., ["名詞", "固有名詞", "地名", "一般"]

// Word ID and dictionary
console.log(morpheme.wordId());
console.log(morpheme.dictionaryId());

// Morpheme bounds
console.log(morpheme.begin());
console.log(morpheme.end());
console.log(morpheme.length());

// Check morpheme properties
console.log(morpheme.isOov()); // True if out-of-vocabulary
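For logging, these accessors can be combined into one line per morpheme. The sketch below uses a structural type, `MorphemeLike`, that lists only methods shown above; the return types are assumptions inferred from the examples, and `formatMorpheme` is a hypothetical helper.

```typescript
// Structural type covering only accessors shown above. Return types are
// assumptions inferred from the examples, not taken from API.md.
interface MorphemeLike {
  surface(): string;
  readingForm(): string;
  partOfSpeech(): string[];
  begin(): number;
  end(): number;
  isOov(): boolean;
}

// Formats one morpheme as "begin-end<TAB>surface<TAB>reading<TAB>pos".
function formatMorpheme(m: MorphemeLike): string {
  const pos = m.partOfSpeech().join('-');
  const oov = m.isOov() ? ' (OOV)' : '';
  return `${m.begin()}-${m.end()}\t${m.surface()}\t${m.readingForm()}\t${pos}${oov}`;
}
```

Any object with these methods satisfies the type, so the helper can be exercised with a stub before a dictionary is available.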

Public Dictionary Access

DictionaryFactory returns a public Dictionary that exposes stable dictionary metadata APIs, so no internal imports are required.

import { DictionaryFactory } from 'sudachi-ts';

const dictionary = await new DictionaryFactory().create('./sudachi.json');

const grammar = dictionary.getGrammar();
const lexicon = dictionary.getLexicon();

const kyotoId = lexicon.getWordId('京都', 3, 'キョウト');
const kyotoInfo = lexicon.getWordInfo(kyotoId);

console.log(grammar.getPartOfSpeechString(kyotoInfo.getPOSId()));
console.log(kyotoInfo.getSynonymGroupIds());

When user dictionaries are configured, dictionary.getLexicon() exposes the merged lexicon view used by tokenization, so downstream plugins can look up both system and user dictionary entries through the same public API.

Splitting Morphemes

Use the split method to change granularity:

// Tokenize in Mode C so that "東京都" comes back as a single morpheme
const result = tokenizer.tokenize(SplitMode.C, '東京都に行きました');
const morpheme = result[0]; // "東京都"

// Split to different modes
const modeAList = morpheme.split(SplitMode.A);
const modeBList = morpheme.split(SplitMode.B);
const modeCList = morpheme.split(SplitMode.C);

Sentence Detection

Process multi-sentence text:

// `tokenizer` is the instance created in Quick Start

const sentences = tokenizer.tokenizeSentences('東京都は日本の首都です。大阪は商業都市です。');

for (const sentence of sentences) {
  console.log('--- Sentence ---');
  for (const morpheme of sentence) {
    console.log(morpheme.surface());
  }
}

tokenizeSentences(...) treats standalone quoted dialogue endings (for example 「...！」) as sentence boundaries, but keeps quoted speech attached to following reporting clauses such as 「...。」と言いました。. It also skips leading inter-sentence whitespace such as newlines before tokenization.
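The boundary rules above can be approximated on raw text. The sketch below is a naive illustration and not the library's detector: it splits on 。！？ outside 「」 quotes, ends a sentence at a closing quote that follows a terminator unless と comes next, and drops leading whitespace. `naiveSplitSentences` is a hypothetical name, and treating a bare と as the only reporting marker is a simplifying assumption.

```typescript
// Naive illustration only; the real detector in sudachi-ts is more involved.
const TERMINATORS = new Set(['。', '！', '？']);

function naiveSplitSentences(text: string): string[] {
  const chars = Array.from(text);
  const sentences: string[] = [];
  let current = '';
  let depth = 0; // nesting depth inside 「」 quotes

  const flush = (): void => {
    const sentence = current.trimStart(); // skip leading whitespace and newlines
    if (sentence.length > 0) sentences.push(sentence);
    current = '';
  };

  chars.forEach((ch, i) => {
    current += ch;
    if (ch === '「') {
      depth += 1;
    } else if (ch === '」') {
      depth = Math.max(0, depth - 1);
      const endsWithTerminator = TERMINATORS.has(chars[i - 1] ?? '');
      const reportingFollows = chars[i + 1] === 'と';
      // A closed quote such as 「...！」 ends the sentence unless と follows.
      if (depth === 0 && endsWithTerminator && !reportingFollows) flush();
    } else if (depth === 0 && TERMINATORS.has(ch)) {
      flush();
    }
  });
  flush(); // keep trailing text that has no terminator
  return sentences;
}
```

Terminators inside an open quote do not split, which is what keeps 「...。」と言いました。 together as one sentence.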

Lazy sentence processing for streaming:

async function* streamSentences(textStream: ReadableStream<string>) {
  for await (const sentences of tokenizer.lazyTokenizeSentences(textStream)) {
    for (const morphemes of sentences) {
      yield morphemes;
    }
  }
}

Part of Speech Matching

Filter morphemes by POS:

import { Dictionary } from 'sudachi-ts/core/dictionary.js';

const dict = await Dictionary.loadSystem();

// Create matcher for specific POS
const nounMatcher = dict.posMatcher(pos => pos[0] === '名詞');

// Find words matching POS pattern
const tokenizer = dict.create(); // tokenizer created as in Quick Start
const result = tokenizer.tokenize('東京都に行きました');
for (const morpheme of result) {
  if (nounMatcher.matches(morpheme.partOfSpeech())) {
    console.log('Noun:', morpheme.surface());
  }
}

// Create matcher from partial POS list
const properNounMatcher = dict.posMatcherFromList([
  ['名詞', '固有名詞', '*', '*']
]);
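To make the partial-list form concrete, the sketch below shows one way such a pattern could be compared against a POS tuple. It assumes that `*` matches any value at its position and that a pattern shorter than the tuple matches on the leading elements only; `matchesPattern` and `matchesAny` are hypothetical helpers, not the implementation behind `posMatcherFromList`.

```typescript
// Assumption: '*' is a wildcard for one position, and a pattern shorter
// than the POS tuple constrains only the leading elements.
function matchesPattern(pos: readonly string[], pattern: readonly string[]): boolean {
  return pattern.every((want, i) => want === '*' || pos[i] === want);
}

// A POS tuple matches a list if it matches at least one pattern in it.
function matchesAny(
  pos: readonly string[],
  patterns: readonly (readonly string[])[],
): boolean {
  return patterns.some((pattern) => matchesPattern(pos, pattern));
}
```

For example, `['名詞', '固有名詞', '地名', '一般']` satisfies `['名詞', '固有名詞', '*', '*']`, while `['名詞', '普通名詞', '一般', '*']` does not.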

Plugin Development

Create custom plugins to extend functionality:

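import { InputTextPlugin } from 'sudachi-ts/plugins/index.js';
// Assumed import locations, inferred from the Architecture layout;
// check API.md for the actual exports.
import type { InputText } from 'sudachi-ts/core/inputText.js';
import type { Settings } from 'sudachi-ts/config/settings.js';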

export class MyCustomPlugin implements InputTextPlugin {
  setSettings(settings: Settings): void {
    // Configure plugin
  }

  rewrite(input: InputText): InputText {
    // Transform input text before tokenization
    return input;
  }
}

Load plugins dynamically:

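import { PluginLoader } from 'sudachi-ts/plugins/loader.js';
// Assumed import location, inferred from the Architecture layout;
// check API.md for the actual export.
import { Settings } from 'sudachi-ts/config/settings.js';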

const loader = new PluginLoader();
const plugin = await loader.loadInputTextPlugin(
  './myCustomPlugin.js',
  new Settings({ option: 'value' })
);

See PLUGINS.md for detailed plugin development guide.

The core tokenizer also rewrites sentence-ending ambiguities such as ね | こと | ね into ねこ | と | ね when the lattice supports that path.

Dictionary Building

Build binary dictionaries from CSV source:

import { systemBuilder } from 'sudachi-ts/dictionary-build';

const builder = systemBuilder();

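// Load the connection matrix, then add lexicon entries from CSV.
// matrixDefContents and lexiconCsvContents hold the matrix definition and
// lexicon CSV contents, loaded by the caller.
await builder.matrix(matrixDefContents);
await builder.lexicon(lexiconCsvContents, 'lexicon.csv');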

// Build binary dictionary
const { buffer } = await builder.build();

CSV format:

東京都,4,4,3816,京都,-1,-1,東京都,名詞,固有名詞,地名,一般,*,*,東京都,トウキョウト,東京
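When inspecting lexicon rows before building, the leading columns can be pulled out with a small parser. The sketch below assumes the first four columns are surface, left connection ID, right connection ID, and cost, as in the upstream Sudachi dictionary source format; the remaining columns are kept unparsed. `parseRowHead` is a hypothetical helper and does not handle quoted fields that contain commas.

```typescript
// Assumption: columns 1-4 are surface, left ID, right ID, cost, as in the
// upstream Sudachi lexicon CSV. Everything else is left unparsed in `rest`.
interface LexiconRowHead {
  surface: string;
  leftId: number;
  rightId: number;
  cost: number;
  rest: string[];
}

function parseRowHead(line: string): LexiconRowHead {
  // Plain comma split: quoted fields containing commas are not supported.
  const [surface = '', leftId = '', rightId = '', cost = '', ...rest] = line.split(',');
  return {
    surface,
    leftId: Number(leftId),
    rightId: Number(rightId),
    cost: Number(cost),
    rest,
  };
}
```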

Debug and Inspection

Dump internal structures for debugging:

// Set output stream for lattice dumps
const output = new WritableStream({
  write(chunk) {
    console.log(chunk);
  }
});
tokenizer.setDumpOutput(output);

// Get lattice as JSON
const latticeJson = tokenizer.dumpInternalStructures('東京都');
console.log(latticeJson);

API Reference

See API.md for complete API documentation.

Configuration

See CONFIG.md for detailed configuration options.

Development

# Clone repository
git clone https://github.com/your-org/sudachi-ts.git
cd sudachi-ts

# Install dependencies
npm install

# Type check
npm run typecheck

# Run tests
npm test

# Lint
npm run check:fix

Architecture

sudachi-ts/
├── core/              # Tokenization engine
│   ├── tokenizer.ts   # Tokenizer interface and SplitMode
│   ├── dictionary.ts  # Dictionary and tokenizer factory
│   ├── morpheme.ts    # Morpheme interface and implementation
│   ├── lattice.ts     # Lattice graph implementation
│   └── inputText.ts   # Input text handling
├── dictionary/        # Dictionary system
│   ├── binaryDictionary.ts    # Binary dictionary loading
│   ├── grammar.ts             # Grammar and POS data
│   ├── lexicon.ts             # Lexicon interface
│   ├── doubleArrayLexicon.ts  # Double array trie lookup
│   └── characterCategory.ts   # Character categories
├── plugins/          # Plugin system
│   ├── base.ts       # Plugin base classes
│   ├── inputText/    # Input text plugins
│   ├── oov/          # OOV provider plugins
│   ├── pathRewrite/  # Path rewrite plugins
│   ├── connection/   # Connection edit plugins
│   └── loader.ts     # Dynamic plugin loader
├── dictionary-build/ # Dictionary builder
│   ├── csvLexicon.ts         # CSV parsing
│   ├── doubleArrayBuilder.ts # Double array construction
│   ├── connectionMatrix.ts   # Connection cost matrix
│   └── dicBuilder.ts         # Builder API
├── sentdetect/       # Sentence detection
│   └── sentenceDetector.ts
├── utils/           # Utilities
│   ├── wordId.ts    # Word ID encoding
│   ├── wordMask.ts  # OOV tracking
│   └── numericParser.ts # Japanese numeral parsing
└── config/          # Configuration
    ├── config.ts    # Config management
    ├── settings.ts  # Settings parsing
    └── pathAnchor.ts # Path resolution
