sudachi-ts

v0.1.22

TypeScript port of Sudachi morphological analyzer for Japanese text

Sudachi-TS

A TypeScript port of the Sudachi Japanese morphological analyzer.

Warning: Dictionary files are required for Sudachi-TS to function. Please download them from the Sudachi releases page before using this library.

Features

  • Full Tokenization Support: A/B/C split modes for different granularities
  • Binary Dictionary Compatibility: Load and use pre-built Sudachi dictionaries
  • Dynamic Plugin System: Extensible architecture with runtime plugin loading
  • Dictionary Building: Complete CSV to binary dictionary conversion
  • Sentence Detection: Multi-sentence text processing
  • UTF-8 Handling: Proper Japanese text normalization and character encoding
  • POS Matching: Flexible part-of-speech filtering and matching
  • Counter Alias Recovery: Resolves numeric kana counters such as 1こ to the canonical counter lattice before best-path selection

Requirements

  • Node.js: >= 18.0.0
  • TypeScript: >= 5.0.0 (peer dependency)

Installation

npm install sudachi-ts

Or using yarn:

yarn add sudachi-ts

Quick Start

import { SplitMode } from 'sudachi-ts';
import { BinaryDictionary } from 'sudachi-ts/dictionary/binaryDictionary.js';

// Load dictionary
const dict = await BinaryDictionary.loadSystem('./path/to/system.dic');

// Create tokenizer
const tokenizer = dict.create();

// Tokenize text
const result = tokenizer.tokenize('東京都に行きました。');

// Access morphemes
for (const morpheme of result) {
  console.log(morpheme.surface()); // Surface form
  console.log(morpheme.readingForm()); // Reading
  console.log(morpheme.partOfSpeech()); // POS tags
  console.log(morpheme.normalizedForm()); // Normalized form
}

// Use different split modes
const modeAResult = tokenizer.tokenize(SplitMode.A, '京都に行きました');
const modeBResult = tokenizer.tokenize(SplitMode.B, '京都に行きました');
const modeCResult = tokenizer.tokenize(SplitMode.C, '京都に行きました');

// Clean up
await dict.close();
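
Because the Quick Start ends with an explicit close() call, it can help to guarantee that call even when tokenization throws. The helper below is a minimal sketch and not part of sudachi-ts: Closable and withClosable are hypothetical names, and the only thing assumed about the dictionary is the close() method shown above.

```typescript
// Hypothetical helper (not part of sudachi-ts): runs `use` and always
// calls close() afterwards, even if `use` throws or rejects.
interface Closable {
  close(): Promise<void> | void;
}

async function withClosable<T extends Closable, R>(
  resource: T,
  use: (resource: T) => Promise<R> | R,
): Promise<R> {
  try {
    return await use(resource);
  } finally {
    // Runs on both the success and the error path.
    await resource.close();
  }
}
```

Any morpheme access should probably happen inside the callback, since the README does not say whether results stay valid after close().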

Dictionary Files

Important: This package does not include dictionary files. You need to provide your own:

  • System Dictionary: Download from Sudachi releases
  • User Dictionary: Build your own using the CLI tools or provide existing .dic files

Example dictionary paths:

  • system_core.dic - Core system dictionary
  • system_full.dic - Full system dictionary
  • user.dic - User dictionary (optional)

Split Modes

Sudachi provides three tokenization modes:

  • Mode A: Shortest possible segmentation (most granular)
  • Mode B: Medium segmentation (balanced)
  • Mode C: Longest possible segmentation (least granular)

Example with "京都に行きました":

Mode A: 京都|に|行き|まし|た
Mode B: 京都|に|行きました
Mode C: 京都|に行きました
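
To print segmentations in the pipe-delimited form used above, the surface forms can be joined by hand. This is a small sketch that assumes only the surface() accessor from the Quick Start; HasSurface and formatSegmentation are hypothetical names, and the data is a stand-in so the snippet runs without a dictionary.

```typescript
// Minimal shape assumed from the README: each morpheme exposes surface().
interface HasSurface {
  surface(): string;
}

// Joins the surface forms of a tokenization result with "|".
function formatSegmentation(morphemes: Iterable<HasSurface>): string {
  const surfaces: string[] = [];
  for (const m of morphemes) {
    surfaces.push(m.surface());
  }
  return surfaces.join("|");
}

// Stand-in data mirroring the Mode A line above (not real tokenizer output).
const fakeResult: HasSurface[] = ["京都", "に", "行き", "まし", "た"].map(
  (s) => ({ surface: () => s }),
);
console.log(formatSegmentation(fakeResult)); // 京都|に|行き|まし|た
```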

Configuration

Load configuration from a JSON file:

import { loadConfig } from 'sudachi-ts/config/config.js';
import { Dictionary } from 'sudachi-ts/core/dictionary.js';

const config = await loadConfig('./sudachi.json');
const dict = Dictionary.create();

Example sudachi.json:

{
  "systemDict": "system_core.dic",
  "userDicts": ["user.dic"],
  "characterDefinitionFile": "char.def",
  "plugins": [
    {
      "className": "sudachi-ts/plugins/inputText/defaultInputTextPlugin.js",
      "settings": {
        "normalize": true
      }
    }
  ]
}

For non-absolute file references in config (dictionary files, plugin module paths, and built-in plugin file settings), Sudachi-TS tries paths relative to the config file first, then relative to the current working directory.
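
The lookup order described above can be sketched as follows. This is an illustration of the documented order, not the library's actual code: resolveConfigReference is a hypothetical name, the exists callback is injected so the function can be exercised without touching the disk, and returning undefined on a miss is an assumption.

```typescript
import * as path from "node:path";

// Sketch of the documented lookup order for non-absolute references.
// `exists` could be fs.existsSync in real use.
function resolveConfigReference(
  ref: string,
  configFile: string,
  cwd: string,
  exists: (candidate: string) => boolean,
): string | undefined {
  // Absolute references are used as-is.
  if (path.isAbsolute(ref)) {
    return exists(ref) ? ref : undefined;
  }
  const candidates = [
    path.resolve(path.dirname(configFile), ref), // 1. relative to the config file
    path.resolve(cwd, ref), // 2. relative to the working directory
  ];
  // First candidate that exists wins; undefined if none does (assumption).
  return candidates.find((candidate) => exists(candidate));
}
```

For example, with a config at /proj/conf/sudachi.json and "systemDict": "system_core.dic", the sketch checks /proj/conf/system_core.dic before falling back to the working directory.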

By default, Sudachi-TS enables a built-in compound-particle lexicon ("enableDefaultCompoundParticles": true) so forms such as かも, のか, and だから are tokenized as single morphemes. Set it to false to disable:

{
  "enableDefaultCompoundParticles": false
}

The default OOV plugin stack also injects counter aliases in numeric contexts, so a sentence with a kana counter such as りんごを1こください。 is analyzed as りんご / を / 1 / こ / ください / 。, with the counter normalized to its canonical form instead of falling through to unrelated dictionary entries.

Working with Morphemes

Access detailed morpheme information:

const morpheme = result[0];

// Surface form
console.log(morpheme.surface());

// Word forms
console.log(morpheme.dictionaryForm()); // Dictionary form
console.log(morpheme.normalizedForm()); // Normalized form
console.log(morpheme.readingForm()); // Reading form

// Part of speech
console.log(morpheme.partOfSpeech()); // e.g., ["名詞", "固有名詞", "地名", "一般"]

// Word ID and dictionary
console.log(morpheme.wordId());
console.log(morpheme.dictionaryId());

// Morpheme bounds
console.log(morpheme.begin());
console.log(morpheme.end());
console.log(morpheme.length());

// Check morpheme properties
console.log(morpheme.isOov()); // True if out-of-vocabulary
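
For logging or JSON output, the accessors above can be collected into a plain object. This is a sketch under stated assumptions: MorphemeLike, MorphemeRecord, and toRecord are hypothetical names, and the return types (a string array for partOfSpeech(), numbers for begin() and end()) are inferred from the comments above rather than taken from the library's type definitions.

```typescript
// Subset of the accessors listed above; return types are assumptions.
interface MorphemeLike {
  surface(): string;
  dictionaryForm(): string;
  readingForm(): string;
  partOfSpeech(): string[];
  begin(): number;
  end(): number;
  isOov(): boolean;
}

// Plain, serializable view of one morpheme.
interface MorphemeRecord {
  surface: string;
  lemma: string;
  reading: string;
  pos: string;
  span: [number, number];
  oov: boolean;
}

function toRecord(m: MorphemeLike): MorphemeRecord {
  return {
    surface: m.surface(),
    lemma: m.dictionaryForm(),
    reading: m.readingForm(),
    pos: m.partOfSpeech().join("-"), // e.g. "名詞-固有名詞-地名-一般"
    span: [m.begin(), m.end()],
    oov: m.isOov(),
  };
}
```

A tokenization result could then be serialized with something like JSON.stringify(Array.from(result, toRecord)), assuming the result is iterable as in the Quick Start loop.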

Public Dictionary Access

DictionaryFactory returns a public Dictionary that now exposes stable dictionary metadata APIs without requiring internal imports.

import { DictionaryFactory } from 'sudachi-ts';

const dictionary = await new DictionaryFactory().create('./sudachi.json');

const grammar = dictionary.getGrammar();
const lexicon = dictionary.getLexicon();

const kyotoId = lexicon.getWordId('京都', 3, 'キョウト');
const kyotoInfo = lexicon.getWordInfo(kyotoId);

console.log(grammar.getPartOfSpeechString(kyotoInfo.getPOSId()));
console.log(kyotoInfo.getSynonymGroupIds());

When user dictionaries are configured, dictionary.getLexicon() exposes the merged lexicon view used by tokenization, so downstream plugins can look up both system and user dictionary entries through the same public API.

Splitting Morphemes

Use the split method to change granularity:

const result = tokenizer.tokenize(SplitMode.A, '東京都に行きました');
const morpheme = result[0]; // "東京都"

// Split to different modes
const modeAList = morpheme.split(SplitMode.A);
const modeBList = morpheme.split(SplitMode.B);
const modeCList = morpheme.split(SplitMode.C);

Sentence Detection

Process multi-sentence text:

// tokenizeSentences is called on the tokenizer created earlier, so no extra import is needed here
const sentences = tokenizer.tokenizeSentences('東京都は日本の首都です。大阪は商業都市です。');

for (const sentence of sentences) {
  console.log('--- Sentence ---');
  for (const morpheme of sentence) {
    console.log(morpheme.surface());
  }
}

tokenizeSentences(...) treats standalone quoted dialogue endings (for example 「...!」) as sentence boundaries, but keeps quoted speech attached to following reporting clauses such as 「...。」と言いました。. It also skips leading inter-sentence whitespace such as newlines before tokenization.

Lazy sentence processing for streaming:

async function* streamSentences(textStream: ReadableStream<string>) {
  for await (const sentences of tokenizer.lazyTokenizeSentences(textStream)) {
    for (const morphemes of sentences) {
      yield morphemes;
    }
  }
}

Part of Speech Matching

Filter morphemes by POS:

import { Dictionary } from 'sudachi-ts/core/dictionary.js';

const dict = await Dictionary.loadSystem();

// Create matcher for specific POS
const nounMatcher = dict.posMatcher(pos => pos[0] === '名詞');

// Find words matching POS pattern
const result = tokenizer.tokenize('東京都に行きました');
for (const morpheme of result) {
  if (nounMatcher.matches(morpheme.partOfSpeech())) {
    console.log('Noun:', morpheme.surface());
  }
}

// Create matcher from partial POS list
const properNounMatcher = dict.posMatcherFromList([
  ['名詞', '固有名詞', '*', '*']
]);
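
The partial POS list above relies on '*' entries. How the library interprets them is not spelled out here, so the sketch below assumes '*' means "match any value at this position". matchesPattern and matchesAny are hypothetical names, not sudachi-ts APIs.

```typescript
// Sketch of wildcard matching over POS arrays. Treating "*" as
// "match any value" is an assumption based on the pattern shown above.
type Pos = readonly string[];

// True when every non-wildcard pattern entry equals the POS entry
// at the same position.
function matchesPattern(pos: Pos, pattern: Pos): boolean {
  return pattern.every((want, i) => want === "*" || pos[i] === want);
}

// True when at least one pattern in the list matches.
function matchesAny(pos: Pos, patterns: readonly Pos[]): boolean {
  return patterns.some((pattern) => matchesPattern(pos, pattern));
}

const properNoun: Pos[] = [["名詞", "固有名詞", "*", "*"]];
console.log(matchesAny(["名詞", "固有名詞", "地名", "一般"], properNoun)); // true
console.log(matchesAny(["名詞", "普通名詞", "一般", "*"], properNoun)); // false
```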

Plugin Development

Create custom plugins to extend functionality:

import { InputTextPlugin } from 'sudachi-ts/plugins/index.js';

export class MyCustomPlugin implements InputTextPlugin {
  setSettings(settings: Settings): void {
    // Configure plugin
  }

  rewrite(input: InputText): InputText {
    // Transform input text before tokenization
    return input;
  }
}

Load plugins dynamically:

import { PluginLoader } from 'sudachi-ts/plugins/loader.js';

const loader = new PluginLoader();
const plugin = await loader.loadInputTextPlugin(
  './myCustomPlugin.js',
  new Settings({ option: 'value' })
);

See PLUGINS.md for a detailed plugin development guide.

The core tokenizer also rewrites sentence-ending ambiguities such as ね | こと | ね into ねこ | と | ね when the lattice supports that path.

Dictionary Building

Build binary dictionaries from CSV source:

import { systemBuilder } from 'sudachi-ts/dictionary-build';

const builder = systemBuilder();

// Load the connection matrix, then add lexicon entries from CSV
await builder.matrix(matrixDefContents);
await builder.lexicon(lexiconCsvContents, 'lexicon.csv');

// Build binary dictionary
const { buffer } = await builder.build();

CSV format:

東京都,4,4,3816,京都,-1,-1,東京都,名詞,固有名詞,地名,一般,*,*,東京都,トウキョウト,東京

Debug and Inspection

Dump internal structures for debugging:

// Set output stream for lattice dumps
const output = new WritableStream({
  write(chunk) {
    console.log(chunk);
  }
});
tokenizer.setDumpOutput(output);

// Get lattice as JSON
const latticeJson = tokenizer.dumpInternalStructures('東京都');
console.log(latticeJson);

API Reference

See API.md for complete API documentation.

Configuration

See CONFIG.md for detailed configuration options.

Development

# Clone repository
git clone https://github.com/your-org/sudachi-ts.git
cd sudachi-ts

# Install dependencies
npm install

# Type check
npm run typecheck

# Run tests
npm test

# Lint
npm run check:fix

Architecture

sudachi-ts/
├── core/              # Tokenization engine
│   ├── tokenizer.ts   # Tokenizer interface and SplitMode
│   ├── dictionary.ts  # Dictionary and tokenizer factory
│   ├── morpheme.ts    # Morpheme interface and implementation
│   ├── lattice.ts     # Lattice graph implementation
│   └── inputText.ts   # Input text handling
├── dictionary/        # Dictionary system
│   ├── binaryDictionary.ts    # Binary dictionary loading
│   ├── grammar.ts             # Grammar and POS data
│   ├── lexicon.ts             # Lexicon interface
│   ├── doubleArrayLexicon.ts  # Double array trie lookup
│   └── characterCategory.ts   # Character categories
├── plugins/          # Plugin system
│   ├── base.ts       # Plugin base classes
│   ├── inputText/    # Input text plugins
│   ├── oov/          # OOV provider plugins
│   ├── pathRewrite/  # Path rewrite plugins
│   ├── connection/   # Connection edit plugins
│   └── loader.ts     # Dynamic plugin loader
├── dictionary-build/ # Dictionary builder
│   ├── csvLexicon.ts         # CSV parsing
│   ├── doubleArrayBuilder.ts # Double array construction
│   ├── connectionMatrix.ts   # Connection cost matrix
│   └── dicBuilder.ts         # Builder API
├── sentdetect/       # Sentence detection
│   └── sentenceDetector.ts
├── utils/           # Utilities
│   ├── wordId.ts    # Word ID encoding
│   ├── wordMask.ts  # OOV tracking
│   └── numericParser.ts # Japanese numeral parsing
└── config/          # Configuration
    ├── config.ts    # Config management
    ├── settings.ts  # Settings parsing
    └── pathAnchor.ts # Path resolution

License

Apache License 2.0

Contributing

Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
