kusamoji

v1.1.1


Japanese morphological analyzer for Node.js — Viterbi tokenizer with mmap dict loading and pluggable POS-source strategy


Kusamoji 草文字

Segments Japanese text into morphemes and attaches part of speech, reading, and pronunciation metadata.

Features

  • Viterbi tokenization with IPADIC/NEologd dictionary support
  • Custom dictionary — bring your own IPADIC/NEologd .dat files
  • OS-level native dict loading — loads dictionary via memory-mapped I/O for near-instant boot (~1s vs ~4s) and OS-managed page cache
  • Automatic memory management — lets the OS handle page caching; no manual tuning needed
  • Viterbi length bonus — prevents short dictionary fragments from stealing prefixes of longer correct matches
  • Zero-copy TypedArray access to binary dictionary data
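To illustrate the length bonus, here is a toy sketch, not kusamoji's actual scoring code. It assumes a made-up three-entry dictionary with invented word costs, ignores connection costs entirely, and uses a hypothetical `lengthBonus` parameter that discounts each character beyond the first.

```javascript
// Toy dictionary: surface form -> word cost (lower is better).
// All entries and costs are invented for illustration.
const DICT = new Map([
    ['名人', 50],
    ['戦', 40],
    ['名人戦', 100],
]);

// Minimum-cost segmentation by dynamic programming (a simplified Viterbi
// without connection costs). `lengthBonus` is a hypothetical knob that
// subtracts a fixed amount for every character after the first.
function segment(text, lengthBonus = 0) {
    const chars = Array.from(text);
    const n = chars.length;
    // best[i] = cheapest known cost of segmenting the first i characters
    const best = new Array(n + 1).fill(Infinity);
    // prev[i] = start index of the last word in that cheapest segmentation
    const prev = new Array(n + 1).fill(-1);
    best[0] = 0;

    for (let end = 1; end <= n; end++) {
        for (let start = 0; start < end; start++) {
            if (best[start] === Infinity) continue;
            const word = chars.slice(start, end).join('');
            if (!DICT.has(word)) continue;
            const cost = DICT.get(word) - lengthBonus * (end - start - 1);
            if (best[start] + cost < best[end]) {
                best[end] = best[start] + cost;
                prev[end] = start;
            }
        }
    }

    if (best[n] === Infinity) return null; // no segmentation covers the text

    // Walk the back-pointers from the end to recover the words in order.
    const words = [];
    for (let end = n; end > 0; end = prev[end]) {
        words.unshift(chars.slice(prev[end], end).join(''));
    }
    return words;
}

console.log(segment('名人戦', 0));  // [ '名人', '戦' ]  (50 + 40 = 90 beats 100)
console.log(segment('名人戦', 15)); // [ '名人戦' ]      (100 - 30 = 70 beats 35 + 40 = 75)
```

In this toy setup the two short entries win on raw cost, and the bonus is what tips the choice toward the longer match.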

Install

pnpm add kusamoji
# or
npm install kusamoji

How the native mmap addon works

kusamoji ships pre-compiled mmap binaries for common platforms. The addon is optional: kusamoji works without it, just with slower boot and higher RAM usage.

You don't need to do anything. On first use, kusamoji automatically:

  1. Finds the matching prebuilt binary inside the package (src/native/prebuilds/)
  2. Copies it to ~/.kusamoji/ for persistence across reinstalls
  3. Loads it — mmap dict loading is now active

If no prebuilt matches your platform, kusamoji silently falls back to fs.readFile. Everything works — the mmap addon is a performance optimization, not a requirement.
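The find, copy, load, and fall-back flow described above can be sketched roughly as follows. This is not the package's real loader: the `<platform>-<arch>/addon.node` layout, the function names, and the `addon.map()` call are all assumptions made for illustration.

```javascript
const fs = require('node:fs');
const os = require('node:os');
const path = require('node:path');

// Hypothetical layout: <prebuildsDir>/<platform>-<arch>/addon.node
function findPrebuilt(prebuildsDir, platform = process.platform, arch = process.arch) {
    const candidate = path.join(prebuildsDir, `${platform}-${arch}`, 'addon.node');
    return fs.existsSync(candidate) ? candidate : null;
}

// Returns the loaded addon, or null when anything goes wrong.
// A null result means "use the fs.readFile fallback", never an error.
function loadAddonOrNull(prebuildsDir, cacheDir = path.join(os.homedir(), '.kusamoji')) {
    const prebuilt = findPrebuilt(prebuildsDir);
    if (!prebuilt) return null;
    try {
        fs.mkdirSync(cacheDir, { recursive: true });
        const cached = path.join(cacheDir, path.basename(prebuilt));
        fs.copyFileSync(prebuilt, cached); // persist outside node_modules
        return require(cached);            // Node loads .node files as native addons
    } catch {
        return null;
    }
}

// Reads one dictionary file through the addon if present, else through fs.
// `addon.map` is a hypothetical method name, not a documented API.
function readDictFile(addon, file) {
    return addon ? addon.map(file) : fs.readFileSync(file);
}
```

The design point is that every failure path collapses to `null`, so callers only ever branch on whether an addon exists.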

Shipped prebuilts

| Platform | Architecture          | Status                     |
| -------- | --------------------- | -------------------------- |
| macOS    | Apple Silicon (arm64) | ✅ Shipped                 |
| macOS    | Intel (x64)           | Compile from source        |
| Linux    | x64 (Intel/AMD)       | ✅ Shipped                 |
| Linux    | arm64 (Graviton, RPi) | ✅ Shipped                 |
| Windows  | any                   | Not supported (POSIX only) |

Troubleshooting the native addon

"I installed kusamoji but I'm not sure if mmap is active"

node -e "
  const path = require('path');
  const loader = require(path.join(require.resolve('kusamoji'), '..', 'native', 'loader.js'));
  const addon = loader.loadMmapAddon();
  console.log(addon ? 'mmap is ACTIVE' : 'mmap is NOT active (using fs.readFile fallback)');
"

"pnpm install didn't set up the addon"

This is normal. pnpm may skip postinstall scripts for security. The addon is loaded lazily on first use — no manual setup needed. If you want to pre-warm the cache:

pnpm rebuild kusamoji

"I want to compile the addon from source"

For platforms without a shipped prebuilt, or if you want to rebuild:

npx kusamoji rebuild-native

Requires: C compiler (gcc/clang), Python 3. The compiled binary is cached at ~/.kusamoji/ and persists across pnpm install cycles.

"I'm on an unsupported platform"

kusamoji falls back to fs.readFile automatically. Dictionary loading still works — boot is ~3-4s instead of ~1s, and RAM is higher (~2.5 GB vs ~1.4 GB for NEologd). No action needed.

Binary cache directory (~/.kusamoji/)

The native addon binary is cached at ~/.kusamoji/ along with a config.json metadata file. This cache:

  • Survives pnpm install / npm install cycles
  • Is validated against your Node.js N-API version on each load
  • Is automatically refreshed when you upgrade Node.js to a new major version
  • Can be safely deleted — it will be recreated on next use
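A cache check along these lines might look like the sketch below. The README does not document the contents of `config.json`, so the `napi` and `node` field names are assumptions, as are the function names.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// Reads <cacheDir>/config.json; returns null if it is missing or malformed.
function loadCacheConfig(cacheDir) {
    try {
        return JSON.parse(fs.readFileSync(path.join(cacheDir, 'config.json'), 'utf8'));
    } catch {
        return null;
    }
}

// Hypothetical validity rule: same N-API version and same Node.js major version.
// `config.napi` and `config.node` are assumed field names.
function isCacheValid(config, versions = process.versions) {
    if (!config || typeof config !== 'object') return false;
    const major = (v) => String(v).split('.')[0];
    const sameNapi = String(config.napi) === String(versions.napi);
    const sameMajor = major(config.node) === major(versions.node);
    return sameNapi && sameMajor;
}
```

When the check fails, the cached binary would simply be discarded and recreated, which matches the "safe to delete" behavior listed above.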

Quick Start

const kusamoji = require('kusamoji')

// Top-level await is not available in CommonJS, so run inside an async function.
async function main() {
    const tokenizer = await kusamoji.builder({ dicPath: '/path/to/dict' }).buildAsync()

    const tokens = tokenizer.tokenize('大谷翔平がロサンゼルス・ドジャースで3本塁打を放った')

    for (const token of tokens) {
        console.log(token.surface_form, token.reading, token.pos)
    }
}

main().catch(console.error)
// 大谷翔平      オオタニショウヘイ  名詞
// が            ガ                助詞
// ロサンゼルス   ロサンゼルス       名詞
// ・            ・                記号
// ドジャース     ドジャース         名詞
// で            デ                助詞
// 3             サン              名詞
// 本塁打        ホンルイダ         名詞
// を            ヲ                助詞
// 放っ          ハナッ             動詞
// た            タ                助動詞
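A common next step is to turn the token list into a single reading string. The helper below is a small sketch and not part of kusamoji. It assumes that a token without a usable reading has `reading` either missing or set to `'*'`, which is how IPADIC-style output usually marks absent fields.

```javascript
// Joins the katakana readings of a token array into one string.
// Falls back to the surface form when a token has no usable reading
// (assumed to be undefined or '*').
function toReading(tokens) {
    return tokens
        .map((t) => (t.reading && t.reading !== '*' ? t.reading : t.surface_form))
        .join('');
}

// Usage with hand-written tokens shaped like the output above:
const sample = [
    { surface_form: '放っ', reading: 'ハナッ' },
    { surface_form: 'た', reading: 'タ' },
];
console.log(toReading(sample)); // ハナッタ
```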

More examples

Dates, counters, and proper nouns are resolved natively from the dictionary — no preprocessing needed:

tokenizer.tokenize('2026年4月9日、川崎市の製鉄所で作業員が転落する事故が発生した')
// 2026年      ニセンニジュウロクネン  名詞    ← full year reading
// 4月9日      シガツココノカ        名詞    ← month + day as one token
// 、          、                記号
// 川崎市      カワサキシ           名詞    ← place name
// の          ノ                助詞
// 製鉄所      セイテツジョ          名詞    ← rendaku: 所(ショ→ジョ)
// で          デ                助詞
// 作業員      サギョウイン          名詞
// が          ガ                助詞
// 転落        テンラク            名詞
// する        スル               動詞
// 事故        ジコ               名詞
// が          ガ                助詞
// 発生        ハッセイ            名詞
// し          シ                動詞
// た          タ                助動詞

tokenizer.tokenize('藤井聡太名人は第84期将棋名人戦で圧倒的な強さを見せた')
// 藤井聡太    フジイソウタ          名詞    ← NEologd proper noun
// 名人        メイジン            名詞
// は          ハ                助詞
// 第          ダイ               接頭詞
// 84期       ハチジュウヨンキ       名詞    ← digit+counter compound
// 将棋        ショウギ            名詞
// 名人戦      メイジンセン          名詞
// で          デ                助詞
// 圧倒的      アットウテキ          名詞
// な          ナ                助動詞
// 強          ツヨ               形容詞
// さ          サ                名詞
// を          ヲ                助詞
// 見せ        ミセ               動詞
// た          タ                助動詞

Benchmarks

All numbers measured on Apple M1 Pro, Node.js 22, NEologd dictionary (6.1M entries, 1.4 GB uncompressed). Methodology: 700 real-world Japanese news snippets × 9 conversion variants = 6,300 HTTP calls end-to-end through an Express service.

Cold start

| Mode        | Boot time | Ready for first query                               |
| ----------- | --------: | --------------------------------------------------- |
| kusamoji    |     1.0 s | Dictionary memory-mapped, OS demand-pages on access |
| kuromoji.js |    8–12 s | gunzip + parse all 12 .dat.gz files                 |

Runtime memory (RSS)

| Mode        | Idle RSS | Under load (700 concurrent) |     Peak |
| ----------- | -------: | --------------------------: | -------: |
| kusamoji    |   1.4 GB |                      2.2 GB |   3.1 GB |
| kuromoji.js |   6–8 GB |                       8+ GB | OOM risk |

With mmap, the ~1.4 GB dictionary sits in the OS page cache, not V8 heap. Under memory pressure the OS evicts cold pages automatically. V8's garbage collector never sees the dictionary data.
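One way to observe this split yourself is to compare process RSS with V8 heap usage before and after building the tokenizer. The sketch below uses only Node's built-in `process.memoryUsage()`; the helper name is made up, and how mapped pages are counted in RSS depends on the operating system.

```javascript
// Returns current memory figures in whole megabytes.
function memorySnapshotMB() {
    const { rss, heapUsed, external, arrayBuffers } = process.memoryUsage();
    const toMB = (bytes) => Math.round(bytes / (1024 * 1024));
    return {
        rss: toMB(rss),                   // resident set size of the whole process
        heapUsed: toMB(heapUsed),         // live objects on the V8 heap
        external: toMB(external),         // memory of C++ objects bound to JS objects
        arrayBuffers: toMB(arrayBuffers), // ArrayBuffers and Buffers
    };
}

console.log(memorySnapshotMB());
```

If the dictionary is memory-mapped, `heapUsed` should stay roughly flat after loading while `rss` grows as pages are touched. With the `fs.readFile` fallback, expect `arrayBuffers` and `external` to grow as well.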

Tokenization throughput

| Input                      | Tokens/call | Latency (p50) |    Throughput |
| -------------------------- | ----------: | ------------: | ------------: |
| Short sentence (10 chars)  |          ~5 |        0.3 ms | 3,300 calls/s |
| News headline (50 chars)   |         ~20 |        0.8 ms | 1,250 calls/s |
| News article (500 chars)   |        ~150 |          5 ms |   200 calls/s |
| Long article (2,000 chars) |        ~600 |         18 ms |    55 calls/s |
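The throughput figures are consistent with sequential calls on a single thread, where throughput is roughly the inverse of latency. The source does not say how the column was derived, so the sketch below only shows that arithmetic as an assumption.

```javascript
// Sequential throughput implied by a per-call latency in milliseconds.
function callsPerSecond(p50LatencyMs) {
    return 1000 / p50LatencyMs;
}

for (const latencyMs of [0.3, 0.8, 5, 18]) {
    console.log(`${latencyMs} ms -> ~${Math.round(callsPerSecond(latencyMs))} calls/s`);
}
// 0.3 ms -> ~3333 calls/s
// 0.8 ms -> ~1250 calls/s
// 5 ms -> ~200 calls/s
// 18 ms -> ~56 calls/s
```

The table's 3,300 and 55 look like these values rounded down.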

Accuracy (6,300-call harness)

700 real-world news snippets from Yahoo News Japan, NHK, and Mainichi — mixed content with ASCII brand names, URLs, numbers, brackets, and quoted English.

The news snippets used as input are available here: Kusamoji Test News Snippets

| Metric                              | Score                               |
| ----------------------------------- | ----------------------------------- |
| Romaji conversion (5 systems × 700) | 99.0% kanji-free output             |
| Kana conversion (4 modes × 700)     | 99.0% kanji-free output             |
| Jukujikun (熟字訓) accuracy         | 48 / 49 tested compounds            |
| Proper noun accuracy (NEologd)      | 10 / 10 (大谷翔平, 宮崎駿, etc.)    |
| Place name accuracy                 | 10 / 10 (東京, 鹿児島, 秋葉原, etc.) |
| File descriptor leaks               | 0 after 6,300 calls                 |

vs. alternatives

| Feature              | kusamoji            | kuromoji.js    | MeCab (C++)     | Sudachi (Java/Rust) |
| -------------------- | ------------------- | -------------- | --------------- | ------------------- |
| Runtime              | Node.js             | Node.js        | Native binary   | JVM / Native        |
| Dict loading         | mmap (zero-copy)    | gunzip to heap | mmap            | mmap (Rust)         |
| Boot time (NEologd)  | ~1 s                | ~10 s          | ~0 s            | ~0.2 s              |
| RSS (NEologd)        | ~1.4 GB             | ~6-8 GB        | ~0.5 GB         | ~0.2 GB             |
| Viterbi optimization | Length bonus        | None           | Cost estimation | CowArray            |
| POS source strategy  | Pluggable (3 modes) | In-heap only   | mmap            | mmap                |
| NEologd support      | ✅                  | ✅             | ✅              | ✅ (built-in)       |
| Node.js native       | ✅                  | ✅             | FFI required    | FFI required        |
| npm install          | ✅ npm i kusamoji   | ✅             | ❌              | ❌                  |
| Zero native deps     | ✅ (optional mmap)  | ✅             | N/A             | N/A                 |

Note: MeCab and Sudachi achieve lower RSS because they are implemented in compiled languages with direct memory management. kusamoji's mmap addon brings Node.js RSS within 4× of native C++ (MeCab), the closest any pure-npm Japanese tokenizer has gotten.

API

kusamoji.builder(options)

Returns a TokenizerBuilder.

| Option  | Type   | Required | Description                                        |
| ------- | ------ | -------- | -------------------------------------------------- |
| dicPath | string | Yes      | Path to the directory containing the 12 .dat files |

builder.buildAsync() → Promise<Tokenizer>

Loads the dictionary and returns a Tokenizer instance.

builder.build(callback)

Callback-style variant: callback(err, tokenizer).
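The two variants are interchangeable from the caller's side. As an illustration only, here is how the callback form could be wrapped in a Promise by hand. The source does not show how `buildAsync()` is implemented, and `buildAsPromise` is a hypothetical helper name.

```javascript
// Wraps any object exposing build(callback) with the callback(err, tokenizer)
// convention into a Promise.
function buildAsPromise(builder) {
    return new Promise((resolve, reject) => {
        builder.build((err, tokenizer) => {
            if (err) reject(err);
            else resolve(tokenizer);
        });
    });
}

// Usage with a stand-in builder (a real one would come from kusamoji.builder):
const fakeBuilder = { build: (callback) => callback(null, { tokenize: () => [] }) };
buildAsPromise(fakeBuilder).then((tokenizer) => {
    console.log(tokenizer.tokenize('').length); // 0
});
```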

tokenizer.tokenize(text) → Token[]

Tokenizes input text. Returns an array of tokens:

{
    surface_form: "東京",       // as it appears in the text
    pos: "名詞",                // part of speech
    pos_detail_1: "固有名詞",   // POS subcategory 1
    pos_detail_2: "地域",       // POS subcategory 2
    pos_detail_3: "一般",       // POS subcategory 3
    conjugated_type: "*",       // conjugation type
    conjugated_form: "*",       // conjugated form
    basic_form: "東京",         // dictionary form
    reading: "トウキョウ",      // reading in katakana
    pronunciation: "トーキョー", // pronunciation in katakana
    word_type: "KNOWN",         // "KNOWN" or "UNKNOWN"
}

Returns [] for null, undefined, or empty string input.
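The token fields make simple filtering straightforward. The sketch below collects dictionary forms of known nouns. The helper is illustrative rather than part of the API, and it assumes an unavailable `basic_form` is reported as `'*'`.

```javascript
// Returns the dictionary form of every known noun in a token array.
// Falls back to surface_form when basic_form is missing or '*' (assumed marker).
function knownNouns(tokens) {
    return tokens
        .filter((t) => t.pos === '名詞' && t.word_type === 'KNOWN')
        .map((t) => (t.basic_form && t.basic_form !== '*' ? t.basic_form : t.surface_form));
}

// Usage with hand-written tokens in the shape documented above:
const exampleTokens = [
    { surface_form: '東京', pos: '名詞', basic_form: '東京', word_type: 'KNOWN' },
    { surface_form: 'で', pos: '助詞', basic_form: 'で', word_type: 'KNOWN' },
    { surface_form: 'Foo', pos: '名詞', basic_form: '*', word_type: 'UNKNOWN' },
];
console.log(knownNouns(exampleTokens)); // [ '東京' ]
```

Because `tokenize` returns `[]` for null, undefined, or empty input, the helper can be chained onto it without a separate guard.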

Dictionary Files

kusamoji does NOT bundle a dictionary. You need 12 uncompressed .dat files compiled from IPADIC (or IPADIC-format compatible) CSV sources:

base.dat, check.dat, cc.dat, tid.dat, tid_map.dat, tid_pos.dat,
unk.dat, unk_char.dat, unk_compat.dat, unk_invoke.dat, unk_map.dat, unk_pos.dat
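Before calling the builder, it can help to confirm that the directory really contains all 12 files. The check below is a sketch using only Node's standard library; the function name is made up for illustration.

```javascript
const fs = require('node:fs');
const path = require('node:path');

// The 12 file names listed above.
const REQUIRED_DAT_FILES = [
    'base.dat', 'check.dat', 'cc.dat', 'tid.dat', 'tid_map.dat', 'tid_pos.dat',
    'unk.dat', 'unk_char.dat', 'unk_compat.dat', 'unk_invoke.dat', 'unk_map.dat', 'unk_pos.dat',
];

// Returns the names of required files that are absent from dicPath.
function missingDictFiles(dicPath) {
    return REQUIRED_DAT_FILES.filter((name) => !fs.existsSync(path.join(dicPath, name)));
}

const missing = missingDictFiles('/path/to/dict');
if (missing.length > 0) {
    console.error(`Dictionary is incomplete, missing: ${missing.join(', ')}`);
}
```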

Building a dictionary

Use the included build script with IPADIC CSV sources:

node node_modules/kusamoji/dict-source/build.mjs \
    --source /path/to/csv-sources \
    --output /path/to/output

The source directory must contain:

  • ipadic/ — base IPADIC CSV files + matrix.def, char.def, unk.def
  • custom/ — (optional) your own override entries

License

BSL 1.1 — free for personal and non-commercial use. Commercial use requires a license. Change date: 4 years from release.