@ioris/tokenizer-kuromoji

v0.3.3

A specialized tokenizer library for Japanese lyrics analysis that provides intelligent text segmentation using the Kuromoji morphological analyzer.

Overview

@ioris/tokenizer-kuromoji integrates with the @ioris/core framework to provide advanced lyrics tokenization capabilities. The library focuses on natural phrase breaks and proper handling of mixed Japanese/English content, making it ideal for karaoke applications, music apps, and lyrics analysis tools.

Features

  • Intelligent Segmentation: Advanced rule-based system for natural phrase breaks
  • Mixed Language Support: Seamless handling of Japanese and English text
  • Lyrics-Optimized Rules: Specialized processing for parentheses, quotes, and repetitive patterns
  • Timeline Preservation: Maintains temporal relationships while adding logical segmentation
  • Part-of-Speech Analysis: Leverages Kuromoji's morphological analysis for accurate breaks
  • Extensible Rule System: Customizable rules for specific use cases

Installation

npm install @ioris/tokenizer-kuromoji @ioris/core kuromoji
# or
yarn add @ioris/tokenizer-kuromoji @ioris/core kuromoji

Basic Usage

import path from "path";
import { createParagraph } from "@ioris/core";
import { builder } from "kuromoji";
import { LineArgsTokenizer } from "@ioris/tokenizer-kuromoji";

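// Initialize kuromoji tokenizer
// (dicPath must point at kuromoji's dictionary directory; adjust it if this
// file does not sit next to node_modules)
const kuromojiBuilder = builder({
  dicPath: path.resolve(__dirname, "node_modules/kuromoji/dict")
});

// Wrap kuromoji's callback-style build() in a Promise
const getTokenizer = () => new Promise((resolve, reject) => {
  kuromojiBuilder.build((err, tokenizer) => {
    if (err) {
      // Stop here so resolve() is not also called on failure
      reject(err);
      return;
    }
    resolve(tokenizer);
  });
});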

// Usage example
async function example() {
  // Get kuromoji tokenizer instance
  const tokenizer = await getTokenizer();

  // Prepare lyrics data with timeline information
  const lyricData = {
    position: 1,
    timelines: [
      {
        wordID: "",
        begin: 1,
        end: 5,
        text: "あの花が咲いたのは、そこに種が落ちたからで"
      }
    ]
  };

  // Create paragraph with custom tokenizer
  const paragraph = await createParagraph({
    ...lyricData,
    lineTokenizer: (lineArgs) => LineArgsTokenizer({
      lineArgs,
      tokenizer
    })
  });

  // Get processing results with natural breaks
  const lines = paragraph.lines;
  const lineText = lines[0].words
    .map(word => {
      let text = word.timeline.text;
      if (word.timeline.hasNewLine) text += '\n';
      return text;
    })
    .join('');
  
  console.log(lineText);
  // Output: "あの花が\n咲いたのは、\nそこに\n種が落ちたからで"
}

example();
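
The join step at the end of the example can be pulled out into a small helper. The following is a minimal sketch that assumes only the two fields read above (timeline.text and timeline.hasNewLine); WordLike, renderLine, and sampleWords are hypothetical names for illustration and are not part of @ioris/core, and the sample data is hand-written rather than real tokenizer output.

```typescript
// Hypothetical minimal shape: only the two fields read in the example above.
interface WordLike {
  timeline: { text: string; hasNewLine?: boolean };
}

// Concatenate word texts, appending "\n" wherever a break was marked.
function renderLine(words: WordLike[]): string {
  return words
    .map((word) => word.timeline.text + (word.timeline.hasNewLine ? "\n" : ""))
    .join("");
}

// Hand-written mock data standing in for paragraph.lines[0].words.
const sampleWords: WordLike[] = [
  { timeline: { text: "あの花が", hasNewLine: true } },
  { timeline: { text: "咲いたのは、", hasNewLine: true } },
  { timeline: { text: "そこに", hasNewLine: true } },
  { timeline: { text: "種が落ちたからで" } },
];

// Splitting on "\n" yields one entry per display row.
console.log(renderLine(sampleWords).split("\n"));
```

Splitting the rendered string on "\n" is one way to get separate rows for a lyrics display.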

How It Works

The tokenizer runs each line of lyrics through Kuromoji and applies rule-based checks to the resulting tokens to decide where natural phrase breaks belong:

Intelligent Break Detection

  • Part-of-Speech Analysis: Uses Kuromoji's morphological analysis to identify grammatical boundaries
  • Context Awareness: Considers before/current/after token relationships for accurate segmentation
  • Length Optimization: Balances phrase length for optimal readability and singing
  • Mixed Language Handling: Seamlessly processes Japanese-English transitions
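
To make the idea of a Japanese-English transition concrete, here is a rough sketch of how such boundaries could be located in a string. It is an illustration only and not the library's implementation: scriptOf and scriptTransitions are hypothetical helpers, and the character ranges are an approximation covering hiragana, katakana, and common kanji in the Basic Multilingual Plane.

```typescript
type ScriptKind = "japanese" | "latin" | "other";

// Rough classification by code-point range (assumption: BMP characters only).
function scriptOf(ch: string): ScriptKind {
  if (/[\u3040-\u30ff\u4e00-\u9fff]/.test(ch)) return "japanese";
  if (/[A-Za-z]/.test(ch)) return "latin";
  return "other"; // spaces, digits, punctuation, everything else
}

// Return the string indices at which the script switches between
// Japanese and Latin, skipping over "other" characters in between.
function scriptTransitions(text: string): number[] {
  const indices: number[] = [];
  let previous: ScriptKind = "other";
  for (let i = 0; i < text.length; i++) {
    const kind = scriptOf(text.charAt(i));
    if (kind === "other") continue;
    if (previous !== "other" && kind !== previous) indices.push(i);
    previous = kind;
  }
  return indices;
}

console.log(scriptTransitions("Baby Baby Baby 君を抱きしめていたい")); // [15]
```

A real segmenter would combine a signal like this with part-of-speech information; the sketch only shows where the script changes.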

Special Lyrics Processing

  • Parentheses & Quotes: Preserves phrases enclosed in brackets, parentheses, or quotation marks
  • Repetitive Patterns: Handles repetitive expressions like "Baby Baby Baby" intelligently
  • Punctuation Sensitivity: Respects natural pauses indicated by punctuation marks
  • Timeline Preservation: Maintains original timing information while adding segmentation
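
As a toy illustration of the repetitive-pattern case, the sketch below breaks around immediately repeated words and leaves other words joined by spaces. breakRepeats is a hypothetical function written for this README, not the library's rule set; it ignores punctuation, parentheses, and part-of-speech information entirely.

```typescript
// Join whitespace-separated words, inserting "\n" around immediate
// repetitions and a plain space everywhere else.
function breakRepeats(text: string): string {
  const words = text.split(/\s+/).filter((w) => w.length > 0);
  let out = "";
  words.forEach((word, i) => {
    out += word;
    if (i === words.length - 1) return; // no separator after the last word
    const repeatsPrev = i > 0 && words[i - 1] === word;
    const repeatsNext = words[i + 1] === word;
    out += repeatsPrev || repeatsNext ? "\n" : " ";
  });
  return out;
}

console.log(JSON.stringify(breakRepeats("Baby Baby Baby 君を抱きしめていたい")));
// "Baby\nBaby\nBaby\n君を抱きしめていたい"
```

The comparison is case-sensitive and exact, so "Baby" and "baby" would not count as a repetition in this sketch.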

Example Transformations

Input:  "あの花が咲いたのは、そこに種が落ちたからで"
Output: "あの花が\n咲いたのは、\nそこに\n種が落ちたからで"

Input:  "Baby Baby Baby 君を抱きしめていたい"
Output: "Baby\nBaby\nBaby\n君を抱きしめていたい"

Input:  "Oh, I can't help falling in love with you"
Output: "Oh,\nI can't help falling in love with you"

API Reference

LineArgsTokenizer

The main tokenization function that processes timeline data with intelligent segmentation.

function LineArgsTokenizer(options: {
  lineArgs: CreateLineArgs;
  tokenizer: Tokenizer<IpadicFeatures>;
  brakeRules?: TokenizeRule[];
  whitespaceRules?: TokenizeRule[];
}): Promise<Map<number, CreateLineArgs>>

Parameters

  • lineArgs: Input timeline data containing text and timing information
  • tokenizer: Kuromoji tokenizer instance for morphological analysis
  • brakeRules: Optional custom rules for line breaks (defaults to DEFAULT_BRAKE_RULES)
  • whitespaceRules: Optional custom rules for whitespace handling (defaults to DEFAULT_WHITESPACE_RULES)

Returns

A Promise that resolves to a Map, keyed by number, containing segmented line data with natural break points and preserved timing information.
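
One way to consume a result of this shape is to read the entries in ascending key order. The sketch below assumes the numeric keys are meant to order the lines, which this README does not state; LineArgsLike is a hypothetical stand-in for CreateLineArgs that contains only the fields shown in Basic Usage, and segmentedMock is hand-written data, not real tokenizer output.

```typescript
// Hypothetical stand-in for CreateLineArgs (fields taken from Basic Usage).
interface LineArgsLike {
  position: number;
  timelines: { wordID: string; begin: number; end: number; text: string }[];
}

// Return the Map's values sorted by ascending numeric key.
function linesInOrder(segmented: Map<number, LineArgsLike>): LineArgsLike[] {
  return Array.from(segmented.entries())
    .sort(([a], [b]) => a - b)
    .map(([, line]) => line);
}

// Mock result, deliberately inserted out of order.
const segmentedMock = new Map<number, LineArgsLike>([
  [2, { position: 2, timelines: [{ wordID: "", begin: 3, end: 5, text: "咲いたのは、" }] }],
  [1, { position: 1, timelines: [{ wordID: "", begin: 1, end: 3, text: "あの花が" }] }],
]);

for (const line of linesInOrder(segmentedMock)) {
  console.log(line.position, line.timelines.map((t) => t.text).join(""));
}
```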

Custom Rules

You can extend the tokenizer with custom break point rules:

import { LineArgsTokenizer, DEFAULT_BRAKE_RULES } from "@ioris/tokenizer-kuromoji";

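// Define custom rules: keep the defaults and append one more
const customRules = [
  ...DEFAULT_BRAKE_RULES,
  {
    // Break after specific patterns
    current: {
      // "特定の文字列" means "specific string"; it is a placeholder,
      // so substitute the text you want to match
      surface_form: [/^(.*特定の文字列).*$/]
    },
    after: {
      // "名詞" is the IPADIC part-of-speech tag for nouns
      pos: [["名詞", false]]
    }
  }
];

// Apply custom rules
// (lineArgs and tokenizer are obtained as in Basic Usage; run this inside
// an async function or a module that supports top-level await)
const result = await LineArgsTokenizer({
  lineArgs,
  tokenizer,
  brakeRules: customRules
});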

Rule Structure

Rules use the TokenizeRule interface with conditions for:

  • before: Conditions for the previous token
  • current: Conditions for the current token
  • after: Conditions for the next token
  • length: Length-based constraints
  • insert: Where to insert the break ("before" or "current")
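
The before/current/after conditions can be pictured as a sliding three-token window. The following sketch shows that idea with deliberately simplified types: TokenLike, ConditionLike, RuleLike, meets, and matchRule are all hypothetical and are not the library's TokenizeRule or its matching logic. In particular, pos is a plain list of strings here, whereas the library's rules use pairs such as ["名詞", false], and the length and insert fields are left out.

```typescript
// Simplified stand-ins; NOT the library's TokenizeRule or Kuromoji's token type.
interface TokenLike { surface_form: string; pos: string }
interface ConditionLike { surface_form?: RegExp[]; pos?: string[] }
interface RuleLike { before?: ConditionLike; current?: ConditionLike; after?: ConditionLike }

// A condition holds when every field it lists has at least one matching entry.
// (Use regexes without the "g" flag, since test() is stateful with it.)
function meets(token: TokenLike | undefined, cond?: ConditionLike): boolean {
  if (!cond) return true;   // no condition given: always satisfied
  if (!token) return false; // condition given, but there is no neighbor token
  const surfaceOk = !cond.surface_form || cond.surface_form.some((re) => re.test(token.surface_form));
  const posOk = !cond.pos || cond.pos.indexOf(token.pos) !== -1;
  return surfaceOk && posOk;
}

// Indices of tokens at which all three conditions of the rule hold.
function matchRule(tokens: TokenLike[], rule: RuleLike): number[] {
  const hits: number[] = [];
  tokens.forEach((token, i) => {
    const ok = meets(tokens[i - 1], rule.before) && meets(token, rule.current) && meets(tokens[i + 1], rule.after);
    if (ok) hits.push(i);
  });
  return hits;
}

// Hand-written mock tokens for "花が咲いた" (noun, particle, verb, auxiliary verb).
const mockTokens: TokenLike[] = [
  { surface_form: "花", pos: "名詞" },
  { surface_form: "が", pos: "助詞" },
  { surface_form: "咲い", pos: "動詞" },
  { surface_form: "た", pos: "助動詞" },
];

// Fire on a particle that is followed by a verb.
const particleBeforeVerb: RuleLike = { current: { pos: ["助詞"] }, after: { pos: ["動詞"] } };
console.log(matchRule(mockTokens, particleBeforeVerb)); // [1]
```

Treating a missing condition as always satisfied keeps rules short: a rule only needs to name the neighbors it cares about.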

Development

Building

npm run build        # Full build process
npm run build:types  # TypeScript declarations only
npm run build:esbuild # ESBuild compilation only

Testing

npm test            # Run all tests
npm run test -- --watch  # Watch mode

Code Quality

npm run lint        # Check code quality
npm run format      # Auto-fix formatting

Use Cases

  • Karaoke Applications: Generate natural phrase breaks for synchronized lyrics display
  • Music Apps: Improve lyrics readability with intelligent segmentation
  • Lyrics Analysis: Analyze song structure and linguistic patterns
  • Subtitle Generation: Create well-formatted subtitles for music videos
  • Language Learning: Study Japanese lyrics with proper phrase boundaries

Requirements

  • Node.js 16.0 or higher
  • TypeScript 5.0 or higher (for development)
  • @ioris/core ^0.3.2
  • kuromoji ^0.1.2

License

MIT