@2ji-han/kuromoji.js

v0.1.8

Published

5 months ago

JavaScript implementation of Japanese morphological analyzer


am4_02

japanese morphological analyzer nlp pos pos tagger tokenizer

kuromoji.js

English・日本語

This is a fork of @takuyaa/kuromoji.js.

I was also inspired by the following repository:

@MijinkoSD/kuromoji.ts

Once again, I would like to thank you!

Features

[x] Tests Pass 100% :partying_face:
[x] async to promise/await :partying_face:
[x] Support and build for browser :partying_face:
[x] Asynchronous init functions (e.g. await kuromoji.builder()) :partying_face:
[ ] Support Stream
[ ] kuromoji-server
[ ] Support user dictionary
[ ] Search mode
[ ] Output of N-best solution
[ ] Support NAIST-jdic, Unidic
[ ] Smaller dictionary size (use FST?)

About

A JavaScript implementation of a Japanese morphological analyzer. This is a pure JavaScript port of Kuromoji.

You can see how kuromoji.js works on the demo site.

Usage

Install with a package manager:

npm install kuromoji.js
pnpm install kuromoji.js
bun install kuromoji.js

Load this library as follows:

// ES modules
import kuromoji from "kuromoji.js";
// CommonJS
const kuromoji = require("kuromoji.js").default;
// Browser
import kuromoji from 'https://cdn.jsdelivr.net/npm/kuromoji.js/dist/browser/index.min.js';

You can tokenize sentences with only 5 lines of code. For working examples, see the files under the demo or example directory.

import kuromoji from "kuromoji.js";

kuromoji.builder().build((err, tokenizer) => {
    // tokenizer is ready
    const path = tokenizer.tokenize("すもももももももものうち");
    console.log(path);
});

Loading with top-level await is also supported:

import kuromoji from "kuromoji.js/promise";

const tokenizer = await kuromoji.builder().build();

const path = tokenizer.tokenize("すもももももももものうち");
console.log(path.length);
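
Building a tokenizer is presumably the expensive step, since it has to load the dictionary, so an application will usually want to build it once and reuse it. Below is a minimal sketch of that pattern. `createTokenizerCache` is a hypothetical helper, not part of kuromoji.js, and `fakeBuild` is a stand-in for `() => kuromoji.builder().build()` so that the snippet runs without the library installed.

```javascript
// Hypothetical helper: wraps an async build function so that it runs at
// most once; every caller shares the same pending or resolved promise.
function createTokenizerCache(build) {
    let cached = null;
    return function getTokenizer() {
        if (cached === null) {
            cached = build().catch((err) => {
                cached = null; // allow a retry after a failed build
                throw err;
            });
        }
        return cached;
    };
}

// Stand-in for `() => kuromoji.builder().build()` (assumption: the promise
// entry point shown above). It only counts how often it is called and
// returns a dummy object that splits text into characters.
let buildCount = 0;
const fakeBuild = async () => {
    buildCount += 1;
    return { tokenize: (text) => Array.from(text) };
};

const getTokenizer = createTokenizerCache(fakeBuild);

async function demo() {
    // Two concurrent requests should trigger a single build.
    const [a, b] = await Promise.all([getTokenizer(), getTokenizer()]);
    return { same: a === b, buildCount };
}

demo().then((result) => console.log(result));
```

In real code you would pass the library's own build call instead of `fakeBuild`. Resetting the cache on failure is a design choice that lets a later call retry instead of returning the same rejected promise forever.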

Build Dictionary

We currently use mecab-ipadic for our dictionaries, but you can build and use your own dictionary as long as it is compatible with mecab-ipadic.

bun build-dict <output path> <dict input path>

API

The tokenize() function returns an array of token objects like this:

[ {
    // word id in dictionary
    word_id: 509800,
    // word type (KNOWN for words in the dictionary, UNKNOWN for unknown words)
    word_type: 'KNOWN',
    // word start position
    word_position: 1,
    // surface form
    surface_form: '黒文字',
    // part of speech
    pos: '名詞',
    // Part-of-Speech Subdivision 1
    pos_detail_1: '一般',
    // Part-of-Speech Subdivision 2
    pos_detail_2: '*',
    // Part-of-Speech Subdivision 3
    pos_detail_3: '*',
    // conjugated type
    conjugated_type: '*',
    // conjugated form
    conjugated_form: '*',
    // basic form
    basic_form: '黒文字',
    // reading
    reading: 'クロモジ',
    // pronunciation
    pronunciation: 'クロモジ'
} ]

(This is defined in src/_core/util/IpadicFormatter.ts)
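
As a rough illustration of consuming this output, the sketch below works on a hand-written array in the shape shown above instead of on a live tokenizer. The helper names `filterByPos` and `joinReadings` are hypothetical and not part of kuromoji.js. Treating a missing or `'*'` reading as "no reading" is also an assumption.

```javascript
// Sample data copied from the token shape documented above (abridged).
const tokens = [
    {
        word_type: "KNOWN",
        word_position: 1,
        surface_form: "黒文字",
        pos: "名詞",
        pos_detail_1: "一般",
        basic_form: "黒文字",
        reading: "クロモジ",
        pronunciation: "クロモジ",
    },
];

// Hypothetical helper: keep only tokens whose part of speech matches.
function filterByPos(tokenList, pos) {
    return tokenList.filter((token) => token.pos === pos);
}

// Hypothetical helper: concatenate readings, falling back to the surface
// form when a token has no usable reading (assumed: undefined or "*").
function joinReadings(tokenList) {
    return tokenList
        .map((t) => (t.reading && t.reading !== "*" ? t.reading : t.surface_form))
        .join("");
}

console.log(filterByPos(tokens, "名詞").map((t) => t.surface_form));
console.log(joinReadings(tokens));
```

With a real tokenizer, the array returned by `tokenizer.tokenize(...)` would take the place of the sample `tokens` constant.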
