sudachi-ts
v0.1.22
TypeScript port of Sudachi morphological analyzer for Japanese text
Sudachi-TS
TypeScript port of Sudachi Japanese morphological analyzer.
Warning: Dictionary files are required for Sudachi-TS to function. Please download them from the Sudachi releases page before using this library.
Features
- Full Tokenization Support: A/B/C split modes for different granularities
- Binary Dictionary Compatibility: Load and use pre-built Sudachi dictionaries
- Dynamic Plugin System: Extensible architecture with runtime plugin loading
- Dictionary Building: Complete CSV to binary dictionary conversion
- Sentence Detection: Multi-sentence text processing
- UTF-8 Handling: Proper Japanese text normalization and character encoding
- POS Matching: Flexible part-of-speech filtering and matching
- Counter Alias Recovery: Resolves numeric kana counters such as 1こ to the canonical counter lattice before best-path selection
Requirements
- Node.js: >= 18.0.0
- TypeScript: >= 5.0.0 (peer dependency)
Installation
npm install sudachi-ts
Or using yarn:
yarn add sudachi-ts
Quick Start
import { Dictionary, SplitMode } from 'sudachi-ts';
import { BinaryDictionary } from 'sudachi-ts/dictionary/binaryDictionary.js';
// Load dictionary
const dict = await BinaryDictionary.loadSystem('./path/to/system.dic');
// Create tokenizer
const tokenizer = dict.create();
// Tokenize text
const result = tokenizer.tokenize('東京都に行きました。');
// Access morphemes
for (const morpheme of result) {
  console.log(morpheme.surface()); // Surface form
  console.log(morpheme.readingForm()); // Reading
  console.log(morpheme.partOfSpeech()); // POS tags
  console.log(morpheme.normalizedForm()); // Normalized form
}
// Use different split modes
const modeAResult = tokenizer.tokenize(SplitMode.A, '京都に行きました');
const modeBResult = tokenizer.tokenize(SplitMode.B, '京都に行きました');
const modeCResult = tokenizer.tokenize(SplitMode.C, '京都に行きました');
// Clean up
await dict.close();
Dictionary Files
Important: This package does not include dictionary files. You need to provide your own:
- System Dictionary: Download from Sudachi releases
- User Dictionary: Build your own using the CLI tools or provide existing .dic files
Example dictionary paths:
- system_core.dic - Core system dictionary
- system_full.dic - Full system dictionary
- user.dic - User dictionary (optional)
Split Modes
Sudachi provides three tokenization modes:
- Mode A: Shortest possible segmentation (most granular)
- Mode B: Medium segmentation (balanced)
- Mode C: Longest possible segmentation (least granular)
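As a rough illustration of how the three modes relate, the sketch below works on hand-written segmentations of 京都に行きました (the ones shown in this section's example) instead of calling the tokenizer. The `segmentations` table and `parse` helper are illustrative names and are not part of Sudachi-TS.

```typescript
// Hand-written segmentations copied from the README example;
// they are NOT produced by calling the tokenizer.
const segmentations: Record<'A' | 'B' | 'C', string> = {
  A: '京都|に|行き|まし|た',
  B: '京都|に|行きました',
  C: '京都|に行きました',
};

// Split a "|"-delimited segmentation into its surface forms.
function parse(segmentation: string): string[] {
  return segmentation.split('|');
}

const text = '京都に行きました';
for (const mode of ['A', 'B', 'C'] as const) {
  const surfaces = parse(segmentations[mode]);
  // Every mode covers the whole input; only the boundaries differ.
  console.log(mode, surfaces.length, surfaces.join('') === text);
}
```

The unit count shrinks from A to C while the concatenated surfaces always reproduce the input, which is the property to rely on when switching modes.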
Example with "京都に行きました":
Mode A: 京都|に|行き|まし|た
Mode B: 京都|に|行きました
Mode C: 京都|に行きました
Configuration
Load configuration from a JSON file:
import { loadConfig } from 'sudachi-ts/config/config.js';
import { Dictionary } from 'sudachi-ts/core/dictionary.js';
const config = await loadConfig('./sudachi.json');
const dict = Dictionary.create();
Example sudachi.json:
{
  "systemDict": "system_core.dic",
  "userDicts": ["user.dic"],
  "characterDefinitionFile": "char.def",
  "plugins": [
    {
      "className": "sudachi-ts/plugins/inputText/defaultInputTextPlugin.js",
      "settings": {
        "normalize": true
      }
    }
  ]
}
For non-absolute file references in config (dictionary files, plugin module paths, and built-in plugin file settings), Sudachi-TS tries paths relative to the config file first, then relative to the current working directory.
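The lookup order described above can be sketched as follows. This is a minimal illustration and not the library's own resolver: `resolveReference` and the injected `exists` predicate are hypothetical names, returning absolute references unchanged is an assumption, and `path.posix` is used only so the example behaves the same on every platform.

```typescript
import * as path from 'node:path';

// Hypothetical helper (not a Sudachi-TS API): returns the first candidate
// path for which `exists` is true, or undefined if none matches.
function resolveReference(
  ref: string,
  configFile: string,
  cwd: string,
  exists: (candidate: string) => boolean,
): string | undefined {
  // Assumption: absolute references are used as-is.
  if (path.posix.isAbsolute(ref)) return ref;
  const candidates = [
    path.posix.resolve(path.posix.dirname(configFile), ref), // 1. relative to the config file
    path.posix.resolve(cwd, ref), // 2. relative to the current working directory
  ];
  return candidates.find(exists);
}

// Usage with an in-memory "file system" so the example has no side effects.
const files = new Set(['/etc/sudachi/system_core.dic', '/work/user.dic']);
const exists = (candidate: string): boolean => files.has(candidate);
console.log(resolveReference('system_core.dic', '/etc/sudachi/sudachi.json', '/work', exists));
console.log(resolveReference('user.dic', '/etc/sudachi/sudachi.json', '/work', exists));
```

Passing the existence check in as a parameter keeps the sketch testable without touching the real file system.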
By default, Sudachi-TS enables a built-in compound-particle lexicon
("enableDefaultCompoundParticles": true) so forms such as かも, のか,
and だから are tokenized as single morphemes. Set it to false to disable:
{
  "enableDefaultCompoundParticles": false
}
The default OOV plugin stack also injects counter aliases in numeric contexts,
so kana counters such as りんごを1こください。 are analyzed as
りんご / を / 1 / こ / ください / 。 with the counter normalized to 個
instead of falling through to unrelated dictionary entries.
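To make the intended effect concrete, here is a toy sketch that normalizes a kana counter when it directly follows a numeric token. It is a simplification: the real plugin works on the lattice before best-path selection, whereas this runs on an already segmented list. `Token`, `COUNTER_ALIASES`, and `normalizeCounters` are hypothetical names, and only the こ → 個 pair comes from this README.

```typescript
interface Token {
  surface: string;
  normalized: string;
}

// Illustrative alias table; only こ -> 個 is taken from the README.
const COUNTER_ALIASES: Record<string, string> = { 'こ': '個' };

// Post-hoc approximation: rewrite the normalized form of a kana counter
// only when the previous token consists solely of digits.
function normalizeCounters(surfaces: string[]): Token[] {
  return surfaces.map((surface, i) => {
    const prev = surfaces[i - 1];
    const afterNumber = prev !== undefined && /^[0-9０-９]+$/.test(prev);
    const alias = afterNumber ? COUNTER_ALIASES[surface] : undefined;
    return { surface, normalized: alias ?? surface };
  });
}

const tokens = normalizeCounters(['りんご', 'を', '1', 'こ', 'ください', '。']);
console.log(tokens.map((t) => t.normalized).join(' / '));
// prints "りんご / を / 1 / 個 / ください / 。"
```

Outside a numeric context the same surface is left untouched, which mirrors the "numeric contexts" restriction described above.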
Working with Morphemes
Access detailed morpheme information:
const morpheme = result[0];
// Surface form
console.log(morpheme.surface());
// Word forms
console.log(morpheme.dictionaryForm()); // Dictionary form
console.log(morpheme.normalizedForm()); // Normalized form
console.log(morpheme.readingForm()); // Reading form
// Part of speech
console.log(morpheme.partOfSpeech()); // e.g., ["名詞", "固有名詞", "地名", "一般"]
// Word ID and dictionary
console.log(morpheme.wordId());
console.log(morpheme.dictionaryId());
// Morpheme bounds
console.log(morpheme.begin());
console.log(morpheme.end());
console.log(morpheme.length());
// Check morpheme properties
console.log(morpheme.isOov()); // True if out-of-vocabulary
Public Dictionary Access
DictionaryFactory returns a public Dictionary that now exposes stable
dictionary metadata APIs without requiring internal imports.
import { DictionaryFactory } from 'sudachi-ts';
const dictionary = await new DictionaryFactory().create('./sudachi.json');
const grammar = dictionary.getGrammar();
const lexicon = dictionary.getLexicon();
const kyotoId = lexicon.getWordId('京都', 3, 'キョウト');
const kyotoInfo = lexicon.getWordInfo(kyotoId);
console.log(grammar.getPartOfSpeechString(kyotoInfo.getPOSId()));
console.log(kyotoInfo.getSynonymGroupIds());
When user dictionaries are configured, dictionary.getLexicon() exposes the
merged lexicon view used by tokenization, so downstream plugins can look up both
system and user dictionary entries through the same public API.
Splitting Morphemes
Use the split method to change granularity:
const result = tokenizer.tokenize(SplitMode.C, '東京都に行きました');
const morpheme = result[0]; // "東京都" (Mode C keeps the longest unit)
// Split to different modes
const modeAList = morpheme.split(SplitMode.A);
const modeBList = morpheme.split(SplitMode.B);
const modeCList = morpheme.split(SplitMode.C);
Sentence Detection
Process multi-sentence text:
import { SentenceDetector } from 'sudachi-ts/sentdetect/sentenceDetector.js';
const sentences = tokenizer.tokenizeSentences('東京都は日本の首都です。大阪は商業都市です。');
for (const sentence of sentences) {
  console.log('--- Sentence ---');
  for (const morpheme of sentence) {
    console.log(morpheme.surface());
  }
}
tokenizeSentences(...) treats standalone quoted dialogue endings (for example
「...!」) as sentence boundaries, but keeps quoted speech attached to following
reporting clauses such as 「...。」と言いました。. It also skips leading
inter-sentence whitespace such as newlines before tokenization.
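The boundary rules above can be approximated with a small character-level splitter. This is a toy sketch and not the library's sentence detector: `splitSentences` and `TERMINATORS` are hypothetical names, the terminator set is an assumption, and treating a following と as the only reporting-clause marker is a deliberate simplification.

```typescript
// Assumed set of sentence-ending characters for this sketch.
const TERMINATORS = new Set(['。', '!', '?', '!', '?']);

function splitSentences(text: string): string[] {
  const chars = Array.from(text);
  const sentences: string[] = [];
  let current = '';
  let depth = 0; // nesting depth of 「…」 quotes
  chars.forEach((ch, i) => {
    // Skip whitespace (including newlines) before a sentence starts.
    if (current === '' && /\s/.test(ch)) return;
    current += ch;
    const prev = chars[i - 1] ?? '';
    const next = chars[i + 1] ?? '';
    if (ch === '「') depth += 1;
    let boundary = false;
    if (ch === '」' && depth > 0) {
      depth -= 1;
      // A quote that ends with a terminator closes the sentence,
      // unless a reporting clause (と…) follows.
      boundary = depth === 0 && TERMINATORS.has(prev) && next !== 'と';
    } else if (depth === 0 && TERMINATORS.has(ch)) {
      // Outside quotes, break after the last of a run of terminators.
      boundary = !TERMINATORS.has(next);
    }
    if (boundary) {
      sentences.push(current);
      current = '';
    }
  });
  if (current !== '') sentences.push(current);
  return sentences;
}

console.log(splitSentences('「行くよ!」彼は走った。')); // two sentences
console.log(splitSentences('「行きます。」と言いました。')); // one sentence
```

Terminators inside an open quote never split on their own; the decision is deferred to the closing 」, which is what keeps quoted speech attached to a following reporting clause.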
Lazy sentence processing for streaming:
async function* streamSentences(textStream: ReadableStream<string>) {
  for await (const sentences of tokenizer.lazyTokenizeSentences(textStream)) {
    for (const morphemes of sentences) {
      yield morphemes;
    }
  }
}
Part of Speech Matching
Filter morphemes by POS:
import { Dictionary } from 'sudachi-ts/core/dictionary.js';
const dict = await Dictionary.loadSystem();
// Create matcher for specific POS
const nounMatcher = dict.posMatcher(pos => pos[0] === '名詞');
// Find words matching POS pattern
const result = tokenizer.tokenize('東京都に行きました');
for (const morpheme of result) {
  if (nounMatcher.matches(morpheme.partOfSpeech())) {
    console.log('Noun:', morpheme.surface());
  }
}
// Create matcher from partial POS list
const properNounMatcher = dict.posMatcherFromList([
  ['名詞', '固有名詞', '*', '*']
]);
Plugin Development
Create custom plugins to extend functionality:
import { InputTextPlugin } from 'sudachi-ts/plugins/index.js';
export class MyCustomPlugin implements InputTextPlugin {
  setSettings(settings: Settings): void {
    // Configure plugin
  }
  rewrite(input: InputText): InputText {
    // Transform input text before tokenization
    return input;
  }
}
Load plugins dynamically:
import { PluginLoader } from 'sudachi-ts/plugins/loader.js';
const loader = new PluginLoader();
const plugin = await loader.loadInputTextPlugin(
  './myCustomPlugin.js',
  new Settings({ option: 'value' })
);
See PLUGINS.md for detailed plugin development guide.
The core tokenizer also rewrites sentence-ending ambiguities such as
ね | こと | ね into ねこ | と | ね when the lattice supports that path.
Dictionary Building
Build binary dictionaries from CSV source:
import { systemBuilder } from 'sudachi-ts/dictionary-build';
const builder = systemBuilder();
// Add the connection matrix, then lexicon entries from CSV
await builder.matrix(matrixDefContents);
await builder.lexicon(lexiconCsvContents, 'lexicon.csv');
// Build binary dictionary
const { buffer } = await builder.build();
CSV format:
東京都,4,4,3816,京都,-1,-1,東京都,名詞,固有名詞,地名,一般,*,*,東京都,トウキョウト,東京
Debug and Inspection
Dump internal structures for debugging:
// Set output stream for lattice dumps
const output = new WritableStream({
  write(chunk) {
    console.log(chunk);
  }
});
tokenizer.setDumpOutput(output);
// Get lattice as JSON
const latticeJson = tokenizer.dumpInternalStructures('東京都');
console.log(latticeJson);
API Reference
See API.md for complete API documentation.
Configuration
See CONFIG.md for detailed configuration options.
Development
# Clone repository
git clone https://github.com/your-org/sudachi-ts.git
cd sudachi-ts
# Install dependencies
npm install
# Type check
npm run typecheck
# Run tests
npm test
# Lint
npm run check:fix
Architecture
sudachi-ts/
├── core/                         # Tokenization engine
│   ├── tokenizer.ts              # Tokenizer interface and SplitMode
│   ├── dictionary.ts             # Dictionary and tokenizer factory
│   ├── morpheme.ts               # Morpheme interface and implementation
│   ├── lattice.ts                # Lattice graph implementation
│   └── inputText.ts              # Input text handling
├── dictionary/                   # Dictionary system
│   ├── binaryDictionary.ts       # Binary dictionary loading
│   ├── grammar.ts                # Grammar and POS data
│   ├── lexicon.ts                # Lexicon interface
│   ├── doubleArrayLexicon.ts     # Double array trie lookup
│   └── characterCategory.ts      # Character categories
├── plugins/                      # Plugin system
│   ├── base.ts                   # Plugin base classes
│   ├── inputText/                # Input text plugins
│   ├── oov/                      # OOV provider plugins
│   ├── pathRewrite/              # Path rewrite plugins
│   ├── connection/               # Connection edit plugins
│   └── loader.ts                 # Dynamic plugin loader
├── dictionary-build/             # Dictionary builder
│   ├── csvLexicon.ts             # CSV parsing
│   ├── doubleArrayBuilder.ts     # Double array construction
│   ├── connectionMatrix.ts       # Connection cost matrix
│   └── dicBuilder.ts             # Builder API
├── sentdetect/                   # Sentence detection
│   └── sentenceDetector.ts
├── utils/                        # Utilities
│   ├── wordId.ts                 # Word ID encoding
│   ├── wordMask.ts               # OOV tracking
│   └── numericParser.ts          # Japanese numeral parsing
└── config/                       # Configuration
    ├── config.ts                 # Config management
    ├── settings.ts               # Settings parsing
    └── pathAnchor.ts             # Path resolution
License
Apache License 2.0
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
