kuromoji-ko

v1.0.8

Published

2 months ago

Pure TypeScript Korean Morphological Analyzer - serverless compatible, based on kuromoji.js and mecab-ko-dic

elfsmelf

korean morphological-analysis tokenizer nlp mecab mecab-ko hangul kuromoji serverless nextjs vercel

kuromoji-ko

Pure JavaScript Korean Morphological Analyzer

A port of kuromoji.js adapted for Korean language processing using mecab-ko-dic.

Features

🚀 Pure JavaScript - runs in Node.js, browsers, and serverless (Vercel, Cloudflare Workers)
📦 No native dependencies - no compilation required
🇰🇷 Korean-optimized - uses mecab-ko-dic with Sejong tagset
⚡ Viterbi algorithm - accurate morphological analysis
🔧 Simple API - tokenize Korean text in a few lines

Installation

npm install kuromoji-ko

Quick Start

napi-mecab Compatible API (Recommended)

import { MeCab } from 'kuromoji-ko';

const mecab = await MeCab.create({ engine: 'ko', dictPath: './dict' });
const tokens = mecab.parse('안녕하세요');

for (const token of tokens) {
  console.log(token.surface, token.pos, token.lemma);
}
// 안녕 ['NNG'] 안녕
// 하 ['XSV'] 하다
// 세요 ['EF'] 세요

Classic API

import kuromoji from 'kuromoji-ko';

const tokenizer = await kuromoji.builder({
  dicPath: './dict'
}).build();

const tokens = tokenizer.tokenize('안녕하세요');

for (const token of tokens) {
  console.log(token.surface_form, token.pos, token.posDescription);
}
// 안녕 NNG 일반 명사
// 하 XSV 동사 파생 접미사
// 세요 EF 종결 어미

Building the Dictionary

Before using kuromoji-ko, you need to build the dictionary files from mecab-ko-dic:

# Download mecab-ko-dic
git clone https://bitbucket.org/eunjeon/mecab-ko-dic.git

# Build dictionary
npm run build:dict -- ./mecab-ko-dic ./dict

This creates binary dictionary files in the ./dict directory.

API

MeCab API (napi-mecab compatible)

`MeCab.create(options)`

Create a MeCab instance asynchronously.

import { MeCab } from 'kuromoji-ko';

const mecab = await MeCab.create({
  engine: 'ko',      // Only 'ko' is supported
  dictPath: './dict' // Path to dictionary directory
});

`mecab.parse(text)`

Parse text into an array of Token objects.

const tokens = mecab.parse('아버지가방에들어가신다');
tokens.forEach(t => console.log(t.surface, t.pos));
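
Because `pos` is an array of tags, selecting tokens by part of speech comes down to a membership check. The sketch below is one way to do it. `pickByPos` is a hypothetical helper, not part of kuromoji-ko, and `sampleTokens` is hand-written data that mirrors the Quick Start output rather than a live call to `mecab.parse()`.

```javascript
// Hand-written sample mirroring the Quick Start output. In real use this
// array would come from mecab.parse('안녕하세요').
const sampleTokens = [
  { surface: '안녕', pos: ['NNG'], lemma: '안녕' },
  { surface: '하', pos: ['XSV'], lemma: '하다' },
  { surface: '세요', pos: ['EF'], lemma: '세요' },
];

// Hypothetical helper: keep tokens that carry at least one of the wanted tags.
function pickByPos(tokens, wantedTags) {
  const wanted = new Set(wantedTags);
  return tokens.filter((token) => token.pos.some((tag) => wanted.has(tag)));
}

// Select general and proper nouns only.
const nouns = pickByPos(sampleTokens, ['NNG', 'NNP']);
console.log(nouns.map((token) => token.surface));
```

Checking with `some` rather than comparing `pos[0]` keeps the helper usable for tokens whose `pos` array holds several tags.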

Token Object (napi-mecab compatible)

| Property | Type | Description |
|----------|------|-------------|
| surface | string | How the token looks in the input text |
| pos | string[] | Parts of speech as array (split by "+") |
| lemma | string | Dictionary headword (adds "다" for verbs) |
| pronunciation | string \| null | How the token is pronounced |
| hasBatchim | boolean \| null | Whether token has final consonant (받침) |
| hasJongseong | boolean \| null | Alias for hasBatchim |
| semanticClass | string \| null | Semantic word class or category |
| type | string \| null | Token type (Inflect/Compound/Preanalysis) |
| expression | ExpressionToken[] \| null | Breakdown of compound/inflected tokens |
| features | string | Raw features string (comma-separated) |
| raw | string | Raw MeCab output format (surface\tfeatures) |
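
The `hasBatchim` flag matters because particle pairs such as 이/가 depend on whether the preceding word ends in a final consonant. The sketch below shows how that property follows from the Unicode layout of precomposed Hangul syllables and how it could drive particle choice. `endsWithBatchim` and `subjectParticle` are illustrative helpers, not kuromoji-ko functions. The library presumably reports the value recorded in the dictionary rather than computing it this way.

```javascript
const HANGUL_BASE = 0xac00; // '가', first precomposed syllable
const HANGUL_LAST = 0xd7a3; // '힣', last precomposed syllable
const JONGSEONG_COUNT = 28; // 27 final consonants plus "no final consonant"

// Returns true or false when the last character is a Hangul syllable,
// and null otherwise (mirroring the boolean | null type of hasBatchim).
function endsWithBatchim(text) {
  if (text.length === 0) return null;
  const code = text.charCodeAt(text.length - 1);
  if (code < HANGUL_BASE || code > HANGUL_LAST) return null;
  // Syllables are laid out so that the final-consonant index is the
  // remainder modulo 28; index 0 means there is no final consonant.
  return (code - HANGUL_BASE) % JONGSEONG_COUNT !== 0;
}

// Illustrative use: choose the subject particle for a noun.
function subjectParticle(noun) {
  const batchim = endsWithBatchim(noun);
  if (batchim === null) return null;
  return batchim ? '이' : '가';
}

console.log(subjectParticle('가방'), subjectParticle('아버지'));
```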

ExpressionToken Object

For compound or inflected words, the expression property is an array of ExpressionToken objects:

| Property | Type | Description |
|----------|------|-------------|
| morpheme | string | The normalized token |
| pos | string | Part of speech |
| lemma | string | Dictionary form (adds "다" for verbs) |
| semanticClass | string \| null | Semantic category |

Classic API

`kuromoji.builder(options)`

Create a tokenizer builder.

const builder = kuromoji.builder({
  dicPath: './dict',      // Path to dictionary directory
  loader: customLoader    // Optional custom file loader
});

`builder.build()`

Build the tokenizer. Returns a Promise, so await the result.

const tokenizer = await builder.build();

`tokenizer.tokenize(text)`

Tokenize Korean text into morphemes.

const tokens = tokenizer.tokenize('한국어 형태소 분석');

`tokenizer.wakati(text)`

Get just the surface forms as an array.

const words = tokenizer.wakati('한국어 형태소 분석');
// ['한국어', '형태소', '분석']

`tokenizer.wakatiString(text)`

Get space-separated surface forms.

const str = tokenizer.wakatiString('한국어 형태소 분석');
// '한국어 형태소 분석'

KoreanToken Object (Classic API)

Each token from tokenizer.tokenize() has the following properties:

| Property | Description | Example |
|----------|-------------|---------|
| surface_form | Surface text | '한국어' |
| word_position | Position in text (1-indexed) | 1 |
| word_id | Dictionary word ID | 12345 |
| word_type | KNOWN or UNKNOWN | 'KNOWN' |
| pos | POS tag (Sejong tagset) | 'NNG' |
| posDescription | POS description | '일반 명사' |
| semantic_class | Semantic category | '*' |
| has_final_consonant | Ends with 받침? (T/F/*) | 'F' |
| reading | Pronunciation | '한국어' |
| type | Inflect/Compound/Preanalysis | 'Compound' |
| first_pos | First POS (compounds) | 'NNG' |
| last_pos | Last POS (compounds) | 'NNG' |
| expression | Decomposition | '한국/NNG/*+어/NNG/*' |
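
In the Classic API, `expression` is a plain string, so it has to be split by hand if the individual parts are needed. The sketch below assumes the layout suggested by the example in the table: parts joined by `+`, each part written as `morpheme/POS/semanticClass`, with `*` standing for "no value". `parseExpression` is a hypothetical helper, not part of kuromoji-ko, and this naive split would mis-handle a morpheme that itself contains `/` or `+`.

```javascript
// Hypothetical helper: split a Classic API expression string into parts.
function parseExpression(expression) {
  // Treat a missing value or the '*' placeholder as "no decomposition".
  if (!expression || expression === '*') return [];
  return expression.split('+').map((part) => {
    const [morpheme, pos, semanticClass] = part.split('/');
    return {
      morpheme,
      pos,
      // Map the '*' placeholder (or an absent field) to null.
      semanticClass:
        semanticClass === undefined || semanticClass === '*' ? null : semanticClass,
    };
  });
}

const parts = parseExpression('한국/NNG/*+어/NNG/*');
console.log(parts.map((part) => `${part.morpheme} (${part.pos})`));
```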

Korean POS Tags (Sejong Tagset)

체언 (Substantives)

| Tag | Description |
|-----|-------------|
| NNG | 일반 명사 (General noun) |
| NNP | 고유 명사 (Proper noun) |
| NNB | 의존 명사 (Dependent noun) |
| NR | 수사 (Numeral) |
| NP | 대명사 (Pronoun) |

용언 (Predicates)

| Tag | Description |
|-----|-------------|
| VV | 동사 (Verb) |
| VA | 형용사 (Adjective) |
| VX | 보조 용언 (Auxiliary) |
| VCP | 긍정 지정사 (Copula 이다) |
| VCN | 부정 지정사 (Negative 아니다) |

조사 (Particles)

| Tag | Description |
|-----|-------------|
| JKS | 주격 조사 (Subject) |
| JKO | 목적격 조사 (Object) |
| JKB | 부사격 조사 (Adverbial) |
| JX | 보조사 (Auxiliary particle) |

어미 (Endings)

| Tag | Description |
|-----|-------------|
| EP | 선어말 어미 (Pre-final) |
| EF | 종결 어미 (Final) |
| EC | 연결 어미 (Connective) |
| ETN | 명사형 전성 어미 (Nominalizing) |
| ETM | 관형형 전성 어미 (Adnominalizing) |

기타 (Others)

| Tag | Description |
|-----|-------------|
| SL | 외국어 (Foreign) |
| SH | 한자 (Chinese characters) |
| SN | 숫자 (Numbers) |
| SW | 기타 기호 (Symbols) |
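
A common use of these tags is to drop particles and endings and keep only content words. The sketch below builds a lookup from the five groups listed above. `POS_GROUPS`, `posGroup`, and `contentTags` are illustrative names, not part of kuromoji-ko. The tables cover only part of the tagset (the Quick Start output also shows `XSV`, for instance), so any tag not listed here falls into an `'unlisted'` bucket.

```javascript
// Tag groups copied from the tables above; not an exhaustive tagset.
const POS_GROUPS = {
  substantive: ['NNG', 'NNP', 'NNB', 'NR', 'NP'],
  predicate: ['VV', 'VA', 'VX', 'VCP', 'VCN'],
  particle: ['JKS', 'JKO', 'JKB', 'JX'],
  ending: ['EP', 'EF', 'EC', 'ETN', 'ETM'],
  other: ['SL', 'SH', 'SN', 'SW'],
};

// Invert the grouping into a tag -> group lookup.
const GROUP_BY_TAG = new Map();
for (const [group, tags] of Object.entries(POS_GROUPS)) {
  for (const tag of tags) GROUP_BY_TAG.set(tag, group);
}

// Illustrative helper: name the group of a tag, or 'unlisted' if unknown.
function posGroup(tag) {
  return GROUP_BY_TAG.has(tag) ? GROUP_BY_TAG.get(tag) : 'unlisted';
}

// Illustrative helper: keep only substantive and predicate tags.
function contentTags(tags) {
  return tags.filter((tag) => ['substantive', 'predicate'].includes(posGroup(tag)));
}

console.log(contentTags(['NNG', 'JKS', 'VV', 'EF']));
```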

Browser Usage

<script type="module">
import kuromoji from 'https://cdn.jsdelivr.net/npm/kuromoji-ko/dist/index.mjs';

const tokenizer = await kuromoji.builder({
  dicPath: 'https://cdn.jsdelivr.net/npm/kuromoji-ko/dict/'
}).build();

console.log(tokenizer.tokenize('안녕하세요'));
</script>

Serverless (Vercel) Usage

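kuromoji-ko has no native dependencies, so it can be deployed to serverless platforms without a compilation step. The handler below caches the tokenizer promise at module scope so that warm invocations reuse the loaded dictionary: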

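// api/tokenize.js
import kuromoji from 'kuromoji-ko';

// Cached at module scope so warm invocations reuse the loaded dictionary.
let tokenizerPromise = null;

function getTokenizer() {
  if (!tokenizerPromise) {
    tokenizerPromise = kuromoji.builder({
      dicPath: './dict'
    }).build().catch((err) => {
      // Drop the cached rejection so the next request can retry the load.
      tokenizerPromise = null;
      throw err;
    });
  }
  return tokenizerPromise;
}

export default async function handler(req, res) {
  const text = req.body?.text;
  if (typeof text !== 'string') {
    // Reject requests that do not carry a string to tokenize.
    res.status(400).json({ error: 'Expected a JSON body with a string "text" field' });
    return;
  }
  const tokenizer = await getTokenizer();
  res.json(tokenizer.tokenize(text));
}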

How It Works

kuromoji-ko implements morphological analysis using:

Double-Array TRIE - Efficient dictionary lookup for surface forms
Viterbi Algorithm - Dynamic programming to find the optimal segmentation
Connection Costs - Bigram model for morpheme transitions
Unknown Word Handling - Character-type based POS estimation
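
The way word costs and connection costs combine can be seen in a toy version of the lattice search. The sketch below is not kuromoji-ko's implementation. The dictionary is a plain array instead of a double-array trie, there is no unknown-word handling, and every cost is made up for illustration. It only shows how a Viterbi pass picks the cheapest segmentation when two readings compete (아버지 + 가 + 방 + 에 versus 아버지 + 가방 + 에).

```javascript
// Toy dictionary: surface form, POS tag, and a made-up word cost.
const ENTRIES = [
  { surface: '아버지', pos: 'NNG', cost: 100 },
  { surface: '가', pos: 'JKS', cost: 50 },
  { surface: '방', pos: 'NNG', cost: 150 },
  { surface: '가방', pos: 'NNG', cost: 120 },
  { surface: '에', pos: 'JKB', cost: 50 },
];

// Made-up bigram connection costs between adjacent POS tags.
const CONNECTION = {
  'BOS>NNG': 0,
  'NNG>JKS': 10,
  'NNG>NNG': 200,
  'JKS>NNG': 30,
  'NNG>JKB': 10,
};
const DEFAULT_CONNECTION = 500;

function connectionCost(left, right) {
  const key = `${left}>${right}`;
  return key in CONNECTION ? CONNECTION[key] : DEFAULT_CONNECTION;
}

function viterbi(text) {
  // best[i] maps a POS tag to the cheapest node that ends at offset i.
  const best = Array.from({ length: text.length + 1 }, () => new Map());
  best[0].set('BOS', { cost: 0, prev: null, entry: null });

  for (let start = 0; start < text.length; start++) {
    if (best[start].size === 0) continue; // no path reaches this offset
    for (const entry of ENTRIES) {
      if (!text.startsWith(entry.surface, start)) continue;
      const end = start + entry.surface.length;
      for (const [leftPos, leftNode] of best[start]) {
        const cost = leftNode.cost + connectionCost(leftPos, entry.pos) + entry.cost;
        const current = best[end].get(entry.pos);
        if (!current || cost < current.cost) {
          best[end].set(entry.pos, { cost, prev: leftNode, entry });
        }
      }
    }
  }

  // Pick the cheapest node that covers the whole input.
  let last = null;
  for (const node of best[text.length].values()) {
    if (!last || node.cost < last.cost) last = node;
  }
  if (!last) return null; // the toy dictionary cannot cover the input

  // Follow back-pointers to recover the segmentation.
  const path = [];
  for (let node = last; node && node.entry; node = node.prev) path.unshift(node.entry);
  return { cost: last.cost, path };
}

const result = viterbi('아버지가방에');
console.log(result.path.map((entry) => `${entry.surface}/${entry.pos}`).join(' '));
```

With these costs the noun-noun transition is expensive, so the path through the subject particle 가 wins even though 가방 has a lower word cost than 방.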

Credits

kuromoji.js - Original Japanese implementation
mecab-ko-dic - Korean dictionary
MeCab - Original C++ morphological analyzer

License

Apache-2.0

Dictionary files (mecab-ko-dic) are also Apache-2.0 licensed.
