ve-japanese
v1.0.3
A Japanese language parser based on Ve, using kuromoji.
ve-japanese
A Japanese language parser, ported from the original ve project.
This package groups the tokens of an inflected word (such as a conjugated verb or adjective) into a single word, while preserving its dictionary form (lemma).
It uses kuromoji.js as its underlying tokenizer, so it does not require a native MeCab installation.
Usage
First, import the parse function from the package.
const { parse } = require('./dist/index.js'); // Adjust the path if needed
async function main() {
  const text = 'これ食べました';
  const words = await parse(text);
  // The result is an array of Word objects
  console.log(words);
}
main();
Getting the Dictionary Form (Lemma)
Each object in the returned array represents a word. The .word property gives you the surface form from the text, while the .lemma property gives you its basic, or dictionary, form.
This is especially useful for conjugated verbs.
const { parse } = require('./dist/index.js');
async function findLemmas() {
  const text = 'これ食べました';
  const words = await parse(text);
  // words[0] is これ; words[1] is the conjugated verb
  const verb = words[1];
  console.log(`Surface form: ${verb.word}`); // Outputs: 食べました
  console.log(`Dictionary form: ${verb.lemma}`); // Outputs: 食べる
}
findLemmas();
Example Word Object
A Word object for 食べました will look like this:
{
  "word": "食べました",
  "lemma": "食べる",
  "part_of_speech": "verb",
  "tokens": [
    { "surface_form": "食べ", "pos": "動詞", ... },
    { "surface_form": "まし", "pos": "助動詞", ... },
    { "surface_form": "た", "pos": "助動詞", ... }
  ],
  "extra": {
    "reading": "タベマシタ",
    "transcription": "タベマシタ"
  }
}
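Because each Word object carries its own part_of_speech and lemma, you can select words by type instead of relying on a fixed array index such as words[1]. The following is a minimal sketch, not part of the package: wordsByPartOfSpeech and toLemmaPairs are hypothetical helper names, and sampleWords is hand-written data in the shape shown above rather than real parser output. The part_of_speech value used for これ is an assumption made for illustration only.

```javascript
// Hypothetical helpers (not provided by ve-japanese). They rely only on the
// Word fields shown above: word, lemma, and part_of_speech.

// Return every Word whose part_of_speech matches the given value, e.g. 'verb'.
function wordsByPartOfSpeech(words, partOfSpeech) {
  return words.filter((w) => w.part_of_speech === partOfSpeech);
}

// Pair each surface form with its dictionary form.
function toLemmaPairs(words) {
  return words.map((w) => ({ surface: w.word, lemma: w.lemma }));
}

// Hand-written sample data, not real parser output.
const sampleWords = [
  // The part_of_speech value here is assumed for illustration.
  { word: 'これ', lemma: 'これ', part_of_speech: 'pronoun' },
  // These values are taken from the example Word object above.
  { word: '食べました', lemma: '食べる', part_of_speech: 'verb' },
];

const verbs = wordsByPartOfSpeech(sampleWords, 'verb');
console.log(toLemmaPairs(verbs));
```

In real use, you would pass the array returned by await parse(text) in place of sampleWords.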