kusamoji
v1.1.1
Japanese morphological analyzer for Node.js — Viterbi tokenizer with mmap dict loading and pluggable POS-source strategy
Kusamoji 草文字
Segments Japanese text into morphemes and attaches part of speech, reading, and pronunciation metadata.
Features
- Viterbi tokenization with IPADIC/NEologd dictionary support
- Custom dictionary — bring your own IPADIC/NEologd .dat files
- OS-level native dict loading — loads dictionary via memory-mapped I/O for near-instant boot (~1s vs ~4s) and OS-managed page cache
- Automatic memory management — lets the OS handle page caching; no manual tuning needed
- Viterbi length bonus — prevents short dictionary fragments from stealing prefixes of longer correct matches
- Zero-copy TypedArray access to binary dictionary data
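The length bonus idea can be illustrated with a toy lattice search. This is only a sketch and not kusamoji's implementation: it ignores connection costs entirely, and the `segment` function, the dictionary costs, and the bonus value are all made up for illustration.

```javascript
// Toy minimum-cost segmentation (Viterbi-style dynamic programming).
// ASSUMPTION: word costs and the bonus are invented; real analyzers also
// add connection costs between adjacent tokens, which this sketch omits.
function segment(text, dict, lengthBonus = 0) {
  const chars = Array.from(text);
  const n = chars.length;
  const best = new Array(n + 1).fill(Infinity); // best[i] = cheapest cost to reach position i
  const back = new Array(n + 1).fill(-1);       // back[i] = start of the word that ends at i
  best[0] = 0;

  for (let end = 1; end <= n; end++) {
    for (let start = 0; start < end; start++) {
      if (best[start] === Infinity) continue;
      const word = chars.slice(start, end).join('');
      if (!dict.has(word)) continue;
      // Each character beyond the first makes the candidate cheaper.
      const cost = dict.get(word) - lengthBonus * (end - start - 1);
      if (best[start] + cost < best[end]) {
        best[end] = best[start] + cost;
        back[end] = start;
      }
    }
  }

  if (best[n] === Infinity) return null; // no segmentation covers the whole text

  const words = [];
  for (let pos = n; pos > 0; pos = back[pos]) {
    words.unshift(chars.slice(back[pos], pos).join(''));
  }
  return words;
}

const toyDict = new Map([
  ['東', 4], ['東京', 5], ['都', 3], ['京都', 5], ['東京都', 9],
]);

console.log(segment('東京都', toyDict, 0)); // [ '東京', '都' ]
console.log(segment('東京都', toyDict, 2)); // [ '東京都' ]
```

Without the bonus, the two short entries together (5 + 3) undercut the long entry (9). With a bonus of 2 per extra character the long entry drops to 5 and wins over the split path at 6.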
Install
pnpm install kusamoji
# or
npm add kusamoji
How the native mmap addon works
kusamoji ships pre-compiled mmap binaries for common platforms. The addon is optional — kusamoji works without it, just with slower boot and higher RAM.
You don't need to do anything. On first use, kusamoji automatically:
- Finds the matching prebuilt binary inside the package (src/native/prebuilds/)
- Copies it to ~/.kusamoji/ for persistence across reinstalls
- Loads it — mmap dict loading is now active
If no prebuilt matches your platform, kusamoji silently falls back to fs.readFile. Everything works — the mmap addon is a performance optimization, not a requirement.
Shipped prebuilts
| Platform | Architecture          | Status                     |
| -------- | --------------------- | -------------------------- |
| macOS    | Apple Silicon (arm64) | ✅ Shipped                 |
| macOS    | Intel (x64)           | Compile from source        |
| Linux    | x64 (Intel/AMD)       | ✅ Shipped                 |
| Linux    | arm64 (Graviton, RPi) | ✅ Shipped                 |
| Windows  | any                   | Not supported (POSIX only) |
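To see which row of the table applies to the current machine, the table can be transcribed into a small lookup keyed on Node's `process.platform` and `process.arch`. This is a sketch: the `prebuiltStatus` helper, the key strings, and the status labels are my own and are not necessarily the names used inside src/native/prebuilds/.

```javascript
// Support matrix transcribed from the "Shipped prebuilts" table.
// ASSUMPTION: key format and status labels are illustrative only.
const PREBUILT_STATUS = {
  'darwin-arm64': 'shipped',
  'darwin-x64': 'compile-from-source',
  'linux-x64': 'shipped',
  'linux-arm64': 'shipped',
};

function prebuiltStatus(platform = process.platform, arch = process.arch) {
  if (platform === 'win32') return 'unsupported'; // the addon is POSIX only
  return PREBUILT_STATUS[`${platform}-${arch}`] ?? 'not-listed';
}

console.log(`${process.platform}-${process.arch}:`, prebuiltStatus());
```

Anything other than `shipped` means kusamoji would run on the fs.readFile fallback unless the addon is compiled from source.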
Troubleshooting the native addon
"I installed kusamoji but I'm not sure if mmap is active"
node -e "
const path = require('path');
const loader = require(path.join(require.resolve('kusamoji'), '..', 'native', 'loader.js'));
const addon = loader.loadMmapAddon();
console.log(addon ? 'mmap is ACTIVE' : 'mmap is NOT active (using fs.readFile fallback)');
"
"pnpm install didn't set up the addon"
This is normal. pnpm may skip postinstall scripts for security. The addon is loaded lazily on first use — no manual setup needed. If you want to pre-warm the cache:
pnpm rebuild kusamoji
"I want to compile the addon from source"
For platforms without a shipped prebuilt, or if you want to rebuild:
npx kusamoji rebuild-native
Requires: C compiler (gcc/clang), Python 3. The compiled binary is cached at ~/.kusamoji/ and persists across pnpm install cycles.
"I'm on an unsupported platform"
kusamoji falls back to fs.readFile automatically. Dictionary loading still works — boot is ~3-4s instead of ~1s, and RAM is higher (~2.5 GB vs ~1.4 GB for NEologd). No action needed.
Binary cache directory (~/.kusamoji/)
The native addon binary is cached at ~/.kusamoji/ along with a config.json metadata file. This cache:
- Survives pnpm install / npm install cycles
- Is validated against your Node.js N-API version on each load
- Is automatically refreshed when you upgrade Node.js to a new major version
- Can be safely deleted — it will be recreated on next use
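The validation step described above might look roughly like the following. This is a guess at the shape of the check, not kusamoji's code: the layout of config.json is not documented here, so the `napiVersion` field name and both helper functions are hypothetical. Only `process.versions.napi` is a real Node.js property.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Read the cache metadata; a missing or unreadable file is treated as "no cache".
function readCacheConfig(cacheDir = path.join(os.homedir(), '.kusamoji')) {
  try {
    return JSON.parse(fs.readFileSync(path.join(cacheDir, 'config.json'), 'utf8'));
  } catch {
    return null;
  }
}

// ASSUMPTION: `napiVersion` is a hypothetical field name.
// The cache counts as fresh only if it was written for the running N-API version.
function isCacheFresh(config, currentNapi = process.versions.napi) {
  return Boolean(config) && String(config.napiVersion) === String(currentNapi);
}

console.log('N-API version of this Node.js:', process.versions.napi);
console.log('cache fresh:', isCacheFresh(readCacheConfig()));
```

Because a stale or missing cache is simply recreated, a check like this can fail closed: any doubt means "refresh".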
Quick Start
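const kusamoji = require('kusamoji')

// `await` is not allowed at the top level of a CommonJS script, so wrap the calls in an async function
async function main() {
  const tokenizer = await kusamoji.builder({ dicPath: '/path/to/dict' }).buildAsync()
  const tokens = tokenizer.tokenize('大谷翔平がロサンゼルス・ドジャースで3本塁打を放った')
  for (const token of tokens) {
    console.log(token.surface_form, token.reading, token.pos)
  }
}
main().catch(console.error)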
// 大谷翔平 オオタニショウヘイ 名詞
// が ガ 助詞
// ロサンゼルス ロサンゼルス 名詞
// ・ ・ 記号
// ドジャース ドジャース 名詞
// で デ 助詞
// 3 サン 名詞
// 本塁打 ホンルイダ 名詞
// を ヲ 助詞
// 放っ ハナッ 動詞
// た タ 助動詞
More examples
Dates, counters, and proper nouns are resolved natively from the dictionary — no preprocessing needed:
tokenizer.tokenize('2026年4月9日、川崎市の製鉄所で作業員が転落する事故が発生した')
// 2026年 ニセンニジュウロクネン 名詞 ← full year reading
// 4月9日 シガツココノカ 名詞 ← month + day as one token
// 、 、 記号
// 川崎市 カワサキシ 名詞 ← place name
// の ノ 助詞
// 製鉄所 セイテツジョ 名詞 ← rendaku: 所(ショ→ジョ)
// で デ 助詞
// 作業員 サギョウイン 名詞
// が ガ 助詞
// 転落 テンラク 名詞
// する スル 動詞
// 事故 ジコ 名詞
// が ガ 助詞
// 発生 ハッセイ 名詞
// し シ 動詞
// た タ 助動詞
tokenizer.tokenize('藤井聡太名人は第84期将棋名人戦で圧倒的な強さを見せた')
// 藤井聡太 フジイソウタ 名詞 ← NEologd proper noun
// 名人 メイジン 名詞
// は ハ 助詞
// 第 ダイ 接頭詞
// 84期 ハチジュウヨンキ 名詞 ← digit+counter compound
// 将棋 ショウギ 名詞
// 名人戦 メイジンセン 名詞
// で デ 助詞
// 圧倒的 アットウテキ 名詞
// な ナ 助動詞
// 強 ツヨ 形容詞
// さ サ 名詞
// を ヲ 助詞
// 見せ ミセ 動詞
// た タ 助動詞
Benchmarks
All numbers measured on Apple M1 Pro, Node.js 22, NEologd dictionary (6.1M entries, 1.4 GB uncompressed). Methodology: 700 real-world Japanese news snippets × 9 conversion variants = 6,300 HTTP calls end-to-end through an Express service.
Cold start
| Mode         | Boot time | Ready for first query                               |
| ------------ | --------: | --------------------------------------------------- |
| kusamoji     |     1.0 s | Dictionary memory-mapped, OS demand-pages on access |
| kuromoji.js  |    8–12 s | gunzip + parse all 12 .dat.gz files                 |
Runtime memory (RSS)
| Mode         |   Idle RSS | Under load (700 concurrent) |     Peak |
| ------------ | ---------: | --------------------------: | -------: |
| kusamoji     |     1.4 GB |                      2.2 GB |   3.1 GB |
| kuromoji.js  |     6–8 GB |                       8+ GB | OOM risk |
With mmap, the ~1.4 GB dictionary sits in the OS page cache, not V8 heap. Under memory pressure the OS evicts cold pages automatically. V8's garbage collector never sees the dictionary data.
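One way to check this split in your own process is to compare the fields of `process.memoryUsage()` before and after building the tokenizer. The `memorySnapshotMB` helper below is an illustrative name, not part of kusamoji, and how mapped pages are counted toward RSS depends on the operating system.

```javascript
// Snapshot of process memory in whole megabytes.
// rss          : resident set size of the whole process
// heapUsed     : live objects on the V8 heap
// external     : memory tied to V8 objects but allocated outside the heap
// arrayBuffers : memory for ArrayBuffers and Buffers (included in `external`)
function memorySnapshotMB() {
  const { rss, heapUsed, external, arrayBuffers } = process.memoryUsage();
  const toMB = (bytes) => Math.round(bytes / (1024 * 1024));
  return {
    rss: toMB(rss),
    heapUsed: toMB(heapUsed),
    external: toMB(external),
    arrayBuffers: toMB(arrayBuffers),
  };
}

console.log(memorySnapshotMB());
```

If the dictionary really lives outside the V8 heap, `heapUsed` should grow far less than `rss` once the tokenizer is ready.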
Tokenization throughput
| Input                      | Tokens/call | Latency (p50) |    Throughput |
| -------------------------- | ----------: | ------------: | ------------: |
| Short sentence (10 chars)  |          ~5 |        0.3 ms | 3,300 calls/s |
| News headline (50 chars)   |         ~20 |        0.8 ms | 1,250 calls/s |
| News article (500 chars)   |        ~150 |          5 ms |   200 calls/s |
| Long article (2,000 chars) |        ~600 |         18 ms |    55 calls/s |
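The Throughput column appears consistent with 1000 / p50 (for example, 1000 / 0.3 ms ≈ 3,300 calls/s), that is, sequential calls on a single thread. A minimal sketch for reproducing that kind of measurement on your own inputs follows; `measureP50` is an illustrative helper and not the harness used for the numbers above.

```javascript
const { performance } = require('node:perf_hooks');

// Time `fn` repeatedly and report the median latency plus the sequential
// throughput implied by it (Infinity if the median rounds down to 0 ms).
function measureP50(fn, iterations = 1000) {
  const samples = [];
  for (let i = 0; i < iterations; i++) {
    const start = performance.now();
    fn();
    samples.push(performance.now() - start);
  }
  samples.sort((a, b) => a - b);
  const p50Ms = samples[Math.floor(samples.length / 2)];
  return { p50Ms, callsPerSecond: 1000 / p50Ms };
}

// Usage with a real tokenizer (requires a built dictionary):
//   measureP50(() => tokenizer.tokenize('大谷翔平がロサンゼルス・ドジャースで3本塁打を放った'))
```

Running a few hundred warm-up calls before measuring avoids counting JIT compilation and first-touch page faults in the median.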
Accuracy (6,300-call harness)
700 real-world news snippets from Yahoo News Japan, NHK, and Mainichi — mixed content with ASCII brand names, URLs, numbers, brackets, and quoted English.
The snippets fed to the harness are published as "Kusamoji Test News Snippets".
| Metric                              | Score                                    |
| ----------------------------------- | ---------------------------------------- |
| Romaji conversion (5 systems × 700) | 99.0% kanji-free output                  |
| Kana conversion (4 modes × 700)     | 99.0% kanji-free output                  |
| Jukujikun (熟字訓) accuracy         | 48 / 49 tested compounds                 |
| Proper noun accuracy (NEologd)      | 10 / 10 (大谷翔平, 宮崎駿, etc.)         |
| Place name accuracy                 | 10 / 10 (東京, 鹿児島, 秋葉原, etc.)     |
| File descriptor leaks               | 0 after 6,300 calls                      |
vs. alternatives
| Feature | kusamoji | kuromoji.js | MeCab (C++) | Sudachi (Java/Rust) |
| -------------------- | ----------------------- | -------------- | --------------- | ------------------- |
| Runtime | Node.js | Node.js | Native binary | JVM / Native |
| Dict loading | mmap (zero-copy) | gunzip to heap | mmap | mmap (Rust) |
| Boot time (NEologd) | ~1 s | ~10 s | ~0 s | ~0.2 s |
| RSS (NEologd) | ~1.4 GB | ~6-8 GB | ~0.5 GB | ~0.2 GB |
| Viterbi optimization | Length bonus | None | Cost estimation | CowArray |
| POS source strategy | Pluggable (3 modes) | In-heap only | mmap | mmap |
| NEologd support | ✅ | ✅ | ✅ | ✅ (built-in) |
| Node.js native | ✅ | ✅ | FFI required | FFI required |
| npm install | ✅ npm i kusamoji | ✅ | ❌ | ❌ |
| Zero native deps | ✅ (optional mmap) | ✅ | N/A | N/A |
Note: MeCab and Sudachi achieve lower RSS because they are implemented in compiled languages with direct memory management. kusamoji's mmap addon brings Node.js RSS to within 4× of native C++ (~1.4 GB vs ~0.5 GB for MeCab with NEologd), the closest any pure-npm Japanese tokenizer has gotten.
API
kusamoji.builder(options)
Returns a TokenizerBuilder.
| Option | Type | Required | Description |
| --------- | -------- | -------- | ---------------------------------------------------- |
| dicPath | string | Yes | Path to the directory containing the 12 .dat files |
builder.buildAsync() → Promise<Tokenizer>
Loads the dictionary and returns a Tokenizer instance.
builder.build(callback)
Callback-style variant: callback(err, tokenizer).
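Since loading takes about a second and maps a large dictionary, a service will usually want to build the tokenizer once and share it. The sketch below memoizes the promise returned by any build function. `createTokenizerLoader` is an illustrative helper, not part of kusamoji's API.

```javascript
// Wrap a build function so the dictionary is loaded at most once per process.
// Concurrent callers share the same in-flight promise.
function createTokenizerLoader(buildFn) {
  let pending = null;
  return function getTokenizer() {
    if (!pending) {
      pending = Promise.resolve()
        .then(buildFn)
        .catch((err) => {
          pending = null; // allow a retry after a failed load
          throw err;
        });
    }
    return pending;
  };
}

// Usage with kusamoji (requires a built dictionary):
//   const kusamoji = require('kusamoji')
//   const getTokenizer = createTokenizerLoader(() =>
//     kusamoji.builder({ dicPath: '/path/to/dict' }).buildAsync())
//   const tokenizer = await getTokenizer()
```

Caching the promise rather than the resolved tokenizer means requests that arrive during startup wait for the same load instead of starting their own.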
tokenizer.tokenize(text) → Token[]
Tokenizes input text. Returns an array of tokens:
{
surface_form: "東京", // as it appears in the text
pos: "名詞", // part of speech
pos_detail_1: "固有名詞", // POS subcategory 1
pos_detail_2: "地域", // POS subcategory 2
pos_detail_3: "一般", // POS subcategory 3
conjugated_type: "*", // conjugation type
conjugated_form: "*", // conjugated form
basic_form: "東京", // dictionary form
reading: "トウキョウ", // reading in katakana
pronunciation: "トーキョー", // pronunciation in katakana
word_type: "KNOWN", // "KNOWN" or "UNKNOWN"
}
Returns [] for null, undefined, or empty string input.
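The token fields above are enough for simple post-processing. The two helpers below are illustrative and not part of kusamoji: one joins readings into a katakana string, the other collects nouns. The fallback to `surface_form` assumes that a token without a dictionary reading carries `undefined` or "*" in `reading`, which this README does not confirm.

```javascript
// Join token readings; fall back to the surface form when no reading exists.
// ASSUMPTION: a missing reading shows up as undefined or "*".
function toReading(tokens) {
  return tokens
    .map((t) => (t.reading && t.reading !== '*' ? t.reading : t.surface_form))
    .join('');
}

// Surface forms of all tokens whose top-level part of speech is 名詞 (noun).
function nouns(tokens) {
  return tokens.filter((t) => t.pos === '名詞').map((t) => t.surface_form);
}

// Hand-written tokens in the documented shape (first two match the Quick Start output).
const sample = [
  { surface_form: '大谷翔平', reading: 'オオタニショウヘイ', pos: '名詞', word_type: 'KNOWN' },
  { surface_form: 'が', reading: 'ガ', pos: '助詞', word_type: 'KNOWN' },
  { surface_form: 'ABC', pos: '名詞', word_type: 'UNKNOWN' },
];

console.log(toReading(sample)); // オオタニショウヘイガABC
console.log(nouns(sample));     // [ '大谷翔平', 'ABC' ]
```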
Dictionary Files
kusamoji does NOT bundle a dictionary. You need 12 uncompressed .dat files compiled from IPADIC (or IPADIC-format compatible) CSV sources:
base.dat, check.dat, cc.dat, tid.dat, tid_map.dat, tid_pos.dat,
unk.dat, unk_char.dat, unk_compat.dat, unk_invoke.dat, unk_map.dat, unk_pos.dat
Building a dictionary
Use the included build script with IPADIC CSV sources:
node node_modules/kusamoji/dict-source/build.mjs \
--source /path/to/csv-sources \
  --output /path/to/output
The source directory must contain:
- ipadic/ — base IPADIC CSV files + matrix.def, char.def, unk.def
- custom/ — (optional) your own override entries
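After a build, it can be worth confirming that the output directory holds all 12 files before passing it as `dicPath`. A minimal sketch follows: the file names are the ones listed in this section, while `missingDictFiles` is an illustrative helper and not something kusamoji exports.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// The 12 uncompressed dictionary files named in this README.
const REQUIRED_DAT_FILES = [
  'base.dat', 'check.dat', 'cc.dat', 'tid.dat', 'tid_map.dat', 'tid_pos.dat',
  'unk.dat', 'unk_char.dat', 'unk_compat.dat', 'unk_invoke.dat', 'unk_map.dat', 'unk_pos.dat',
];

// Return the names of required files that are absent from `dicPath`.
function missingDictFiles(dicPath) {
  return REQUIRED_DAT_FILES.filter((name) => !fs.existsSync(path.join(dicPath, name)));
}

// Usage:
//   const missing = missingDictFiles('/path/to/output')
//   if (missing.length > 0) throw new Error(`Missing dictionary files: ${missing.join(', ')}`)
```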
License
BSL 1.1 — free for personal and non-commercial use. Commercial use requires a license. Change date: 4 years from release.
