lindera-nodejs

v3.0.5

Node.js bindings for Lindera morphological analysis engine

Node.js binding for Lindera, a Japanese morphological analysis engine.

Overview

lindera-nodejs provides a Node.js interface to the Lindera morphological analysis engine, supporting Japanese, Korean, and Chinese text analysis. Its main features are:

  • Multi-language Support: Japanese (IPADIC, IPADIC-NEologd, UniDic), Korean (ko-dic), Chinese (CC-CEDICT, Jieba)
  • Character Filters: Text preprocessing with mapping, regex, Unicode normalization, and Japanese iteration mark handling
  • Token Filters: Post-processing filters including lowercase, length filtering, stop words, and Japanese-specific filters
  • Flexible Configuration: Configurable tokenization modes and penalty settings
  • Metadata Support: Complete dictionary schema and metadata management
  • TypeScript Support: Full type definitions included out of the box

Features

Core Components

  • TokenizerBuilder: Fluent API for building customized tokenizers
  • Tokenizer: High-performance text tokenization with integrated filtering
  • CharacterFilter: Pre-processing filters for text normalization
  • TokenFilter: Post-processing filters for token refinement
  • Metadata & Schema: Dictionary structure and configuration management
  • Training & Export (optional): Train custom morphological analysis models from corpus data

Supported Dictionaries

  • Japanese: IPADIC, IPADIC-NEologd, UniDic
  • Korean: ko-dic
  • Chinese: CC-CEDICT, Jieba
  • Custom: User dictionary support

Pre-built dictionaries are available from GitHub Releases. Download a dictionary archive (e.g. lindera-ipadic-*.zip) and specify the extracted path when loading.

Filter Types

Character Filters:

  • Mapping filter (character replacement)
  • Regex filter (pattern-based replacement)
  • Unicode normalization (NFKC, etc.)
  • Japanese iteration mark normalization

Token Filters:

  • Text case transformation (lowercase, uppercase)
  • Length filtering (min/max character length)
  • Stop words filtering
  • Japanese-specific filters (base form, reading form, etc.)
  • Korean-specific filters

Install project dependencies

Setup repository

# Clone lindera project repository
git clone git@github.com:lindera/lindera.git
cd lindera

Install lindera-nodejs

The following commands build the library with development settings (debug build).

cd lindera-nodejs
npm install
npm run build

Quick Start

Basic Tokenization

const { loadDictionary, Tokenizer } = require("lindera-nodejs");

// Load a dictionary from a local path (download it from GitHub Releases)
const dictionary = loadDictionary("/path/to/ipadic");

// Create a tokenizer
const tokenizer = new Tokenizer(dictionary, "normal");

// Tokenize Japanese text
const text = "すもももももももものうち";
const tokens = tokenizer.tokenize(text);

for (const token of tokens) {
  console.log(`Text: ${token.surface}, Position: ${token.byteStart}-${token.byteEnd}`);
}
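
The byteStart and byteEnd values printed above are named as byte offsets, which differ from JavaScript string indices for non-ASCII text. The following is a minimal sketch, assuming they are UTF-8 byte offsets into the input text (an assumption worth checking against the bundled type definitions). sliceByBytes is a hypothetical helper and not part of lindera-nodejs.

```javascript
// Hypothetical helper, not part of lindera-nodejs.
// ASSUMPTION: the offsets are UTF-8 byte offsets into the original text.
function sliceByBytes(text, byteStart, byteEnd) {
  const bytes = Buffer.from(text, "utf8");
  return bytes.subarray(byteStart, byteEnd).toString("utf8");
}

const sampleText = "すもももももももものうち";

// Each hiragana character occupies 3 bytes in UTF-8, so byte offsets 0-9
// cover the first three characters, not the first nine.
console.log(sliceByBytes(sampleText, 0, 9)); // すもも
console.log(Buffer.byteLength(sampleText, "utf8")); // 36
```

If the offsets turn out to be something other than UTF-8 bytes, use token.surface directly instead of slicing.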

Using Character Filters

const { TokenizerBuilder } = require("lindera-nodejs");

// Create tokenizer builder
const builder = new TokenizerBuilder();
builder.setMode("normal");
builder.setDictionary("/path/to/ipadic");

// Add character filters
builder.appendCharacterFilter("mapping", { mapping: { "ー": "-" } });
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });

// Build tokenizer with filters
const tokenizer = builder.build();
const text = "テストー123";
const tokens = tokenizer.tokenize(text); // Will apply filters automatically
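
To make the effect of these two filters concrete, here is a rough plain-JavaScript approximation of what a mapping filter followed by NFKC normalization does to the text before tokenization. It does not call lindera-nodejs; applyMapping and normalizeNfkc are hypothetical helpers, and the library's exact matching rules may differ.

```javascript
// Plain JavaScript approximation of the two character filters configured
// above. Hypothetical helpers for illustration; not lindera-nodejs code.
function applyMapping(text, mapping) {
  // Try longer keys first so overlapping keys resolve to the longest match.
  const keys = Object.keys(mapping).sort((a, b) => b.length - a.length);
  if (keys.length === 0) return text;
  const escaped = keys.map((key) => key.replace(/[.*+?^${}()|[\]\\]/g, "\\$&"));
  const pattern = new RegExp(escaped.join("|"), "g");
  return text.replace(pattern, (match) => mapping[match]);
}

function normalizeNfkc(text) {
  // String.prototype.normalize is built into JavaScript.
  return text.normalize("NFKC");
}

// "\u30FC" is the prolonged sound mark (ー); "\uFF11\uFF12\uFF13" are
// full-width digits, chosen so that NFKC has a visible effect.
const input = "テスト\u30FC\uFF11\uFF12\uFF13";
const mapped = applyMapping(input, { "\u30FC": "-" });
const normalized = normalizeNfkc(mapped);

console.log(normalized); // テスト-123
```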

Using Token Filters

const { TokenizerBuilder } = require("lindera-nodejs");

// Create tokenizer builder
const builder = new TokenizerBuilder();
builder.setMode("normal");
builder.setDictionary("/path/to/ipadic");

// Add token filters
builder.appendTokenFilter("lowercase");
builder.appendTokenFilter("length", { min: 2, max: 10 });
builder.appendTokenFilter("japanese_stop_tags", { tags: ["助詞", "助動詞"] });

// Build tokenizer with filters
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("テキストの解析");
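
As an illustration of what the length and stop-tag filters do conceptually, the sketch below applies equivalent logic to hand-written token objects. The { surface, partOfSpeech } shape and both helper functions are assumptions made for this example; they are not the library's Token class or its filter implementation.

```javascript
// Hand-written stand-ins for tokenizer output. The { surface, partOfSpeech }
// shape is an assumption for this sketch, not the library's Token class.
const sampleTokens = [
  { surface: "テキスト", partOfSpeech: "名詞" },
  { surface: "の", partOfSpeech: "助詞" },
  { surface: "解析", partOfSpeech: "名詞" },
];

// Keep tokens whose length in code points lies within [min, max].
function lengthFilter(tokens, { min = 0, max = Infinity } = {}) {
  return tokens.filter((token) => {
    const length = Array.from(token.surface).length;
    return length >= min && length <= max;
  });
}

// Drop tokens whose part-of-speech tag is listed in `tags`.
function stopTagsFilter(tokens, { tags = [] } = {}) {
  const stopTags = new Set(tags);
  return tokens.filter((token) => !stopTags.has(token.partOfSpeech));
}

const kept = stopTagsFilter(
  lengthFilter(sampleTokens, { min: 2, max: 10 }),
  { tags: ["助詞", "助動詞"] }
);
console.log(kept.map((token) => token.surface)); // [ 'テキスト', '解析' ]
```

In this sample the particle の is removed by either filter on its own: it is shorter than the minimum length and its tag is on the stop list.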

Integrated Pipeline

const { TokenizerBuilder } = require("lindera-nodejs");

// Build tokenizer with integrated filters
const builder = new TokenizerBuilder();
builder.setMode("normal");
builder.setDictionary("/path/to/ipadic");

// Add character filters
builder.appendCharacterFilter("mapping", { mapping: { "ー": "-" } });
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });

// Add token filters
builder.appendTokenFilter("lowercase");
builder.appendTokenFilter("japanese_base_form");

// Build and use
const tokenizer = builder.build();
const tokens = tokenizer.tokenize("コーヒーショップ");
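
The order of operations in such a pipeline follows from the roles described earlier: character filters pre-process the text, the tokenizer segments it, and token filters post-process the result. A minimal sketch of that ordering, with a whitespace splitter standing in for dictionary-based segmentation (runPipeline is hypothetical and not part of lindera-nodejs):

```javascript
// Hypothetical illustration of the processing order; not lindera-nodejs code.
function runPipeline(text, { characterFilters, segment, tokenFilters }) {
  // 1. Character filters rewrite the raw text.
  const filteredText = characterFilters.reduce((current, filter) => filter(current), text);
  // 2. The segmenter splits the filtered text into tokens.
  const tokens = segment(filteredText);
  // 3. Token filters transform or drop tokens.
  return tokenFilters.reduce((current, filter) => filter(current), tokens);
}

// Full-width "Hello World" written with escapes, followed by a short token.
const rawText = "\uFF28\uFF45\uFF4C\uFF4C\uFF4F \uFF37\uFF4F\uFF52\uFF4C\uFF44 a";

const result = runPipeline(rawText, {
  characterFilters: [(text) => text.normalize("NFKC")],
  // Whitespace splitting stands in for dictionary-based segmentation.
  segment: (text) => text.split(/\s+/).filter(Boolean),
  tokenFilters: [
    (tokens) => tokens.map((token) => token.toLowerCase()),
    (tokens) => tokens.filter((token) => token.length >= 2),
  ],
});
console.log(result); // [ 'hello', 'world' ]
```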

Working with Metadata

const { Metadata } = require("lindera-nodejs");

// Create metadata with default values
const metadata = new Metadata();
console.log(`Name: ${metadata.name}`);
console.log(`Encoding: ${metadata.encoding}`);

// Create metadata from a JSON file
const loaded = Metadata.fromJsonFile("metadata.json");
console.log(loaded.toObject());

Advanced Usage

Filter Configuration Examples

Character filters and token filters accept their configuration as an object argument:

const { TokenizerBuilder } = require("lindera-nodejs");

const builder = new TokenizerBuilder();
builder.setDictionary("/path/to/ipadic");

// Character filters with object configuration
builder.appendCharacterFilter("unicode_normalize", { kind: "nfkc" });
builder.appendCharacterFilter("japanese_iteration_mark", {
  normalize_kanji: true,
  normalize_kana: true,
});
builder.appendCharacterFilter("mapping", {
  mapping: { "リンデラ": "lindera", "トウキョウ": "東京" },
});

// Token filters with object configuration
builder.appendTokenFilter("japanese_katakana_stem", { min: 3 });
builder.appendTokenFilter("length", { min: 2, max: 10 });
builder.appendTokenFilter("japanese_stop_tags", {
  tags: ["助詞", "助動詞", "記号"],
});

// Filters without configuration can omit the object
builder.appendTokenFilter("lowercase");
builder.appendTokenFilter("japanese_base_form");

const tokenizer = builder.build();

See the examples/ directory for complete examples, including:

  • tokenize.js: Basic tokenization
  • tokenize_with_filters.js: Using character and token filters
  • tokenize_with_userdict.js: Custom user dictionary
  • train_and_export.js: Train and export custom dictionaries (requires the train feature)
  • tokenize_with_decompose.js: Decompose mode tokenization

Dictionary Support

Japanese

  • IPADIC: Default Japanese dictionary, good for general text
  • UniDic: Academic dictionary with detailed morphological information

Korean

  • ko-dic: Standard Korean dictionary for morphological analysis

Chinese

  • CC-CEDICT: Community-maintained Chinese-English dictionary

Custom Dictionaries

  • User dictionary support for domain-specific terms
  • CSV format for easy customization
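
A user dictionary file can be generated from structured data rather than edited by hand. The sketch below assumes a simple three-column row layout of surface, part of speech, and reading. That layout is an assumption made for this example, so check it against examples/tokenize_with_userdict.js before relying on it. toUserDictionaryCsv is a hypothetical helper.

```javascript
// ASSUMPTION: one entry per row as "surface,part-of-speech,reading".
// Verify the column layout against examples/tokenize_with_userdict.js.
function toUserDictionaryCsv(entries) {
  const rows = entries.map(({ surface, partOfSpeech, reading }) => {
    const fields = [surface, partOfSpeech, reading];
    for (const field of fields) {
      // This sketch does not implement CSV quoting, so reject such fields.
      if (/[",\r\n]/.test(field)) {
        throw new Error(`Field needs CSV quoting, which this sketch omits: ${field}`);
      }
    }
    return fields.join(",");
  });
  return rows.join("\n") + "\n";
}

const csv = toUserDictionaryCsv([
  { surface: "東京スカイツリー", partOfSpeech: "カスタム名詞", reading: "トウキョウスカイツリー" },
]);
console.log(csv); // 東京スカイツリー,カスタム名詞,トウキョウスカイツリー
```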

Dictionary Training (Experimental)

lindera-nodejs supports training custom morphological analysis models from annotated corpus data when built with the train feature.

Building with Training Support

npm run build -- --features train

Training a Model

const { train } = require("lindera-nodejs");

// Train a model from corpus
train({
  seed: "path/to/seed.csv",
  corpus: "path/to/corpus.txt",
  charDef: "path/to/char.def",
  unkDef: "path/to/unk.def",
  featureDef: "path/to/feature.def",
  rewriteDef: "path/to/rewrite.def",
  output: "model.dat",
  lambda: 0.01,
  maxIter: 100,
});

Exporting Dictionary Files

const { exportModel } = require("lindera-nodejs");

// Export trained model to dictionary files
exportModel({
  model: "model.dat",
  output: "exported_dict/",
  metadata: "metadata.json",
});

This creates the following files in the output directory:

  • lex.csv: Lexicon file
  • matrix.def: Connection cost matrix
  • unk.def: Unknown word definitions
  • char.def: Character definitions
  • metadata.json: Dictionary metadata (if provided)
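
After an export it can be useful to confirm that the files listed above are actually present. A minimal sketch using only Node.js built-ins; missingExportFiles is a hypothetical helper, and the demo runs against a temporary directory rather than real export output.

```javascript
const fs = require("fs");
const os = require("os");
const path = require("path");

// File names taken from the list above; metadata.json is only expected
// when a metadata file was passed to the export step.
const REQUIRED_FILES = ["lex.csv", "matrix.def", "unk.def", "char.def"];

// Hypothetical helper: returns the expected files that are missing from `dir`.
function missingExportFiles(dir, { expectMetadata = false } = {}) {
  const expected = expectMetadata ? [...REQUIRED_FILES, "metadata.json"] : REQUIRED_FILES;
  return expected.filter((name) => !fs.existsSync(path.join(dir, name)));
}

// Demo against a temporary directory that contains only two of the files.
const demoDir = fs.mkdtempSync(path.join(os.tmpdir(), "exported_dict-"));
fs.writeFileSync(path.join(demoDir, "lex.csv"), "");
fs.writeFileSync(path.join(demoDir, "char.def"), "");
console.log(missingExportFiles(demoDir)); // [ 'matrix.def', 'unk.def' ]
```

An empty result means every expected file exists; it says nothing about whether the contents are valid.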

See examples/train_and_export.js for a complete example.

API Reference

Core Classes

  • TokenizerBuilder: Fluent builder for tokenizer configuration
  • Tokenizer: Main tokenization engine
  • Token: Individual token with text, position, and linguistic features
  • Metadata: Dictionary metadata and configuration
  • Schema: Dictionary schema definition

Training Functions (require the train feature)

  • train(): Train a morphological analysis model from corpus
  • exportModel(): Export trained model to dictionary files