@ioris/tokenizer-kuromoji

A specialized tokenizer library for Japanese lyrics analysis that provides intelligent text segmentation using the Kuromoji morphological analyzer.

Overview

@ioris/tokenizer-kuromoji integrates with the @ioris/core framework to provide advanced lyrics tokenization capabilities. The library focuses on natural phrase breaks and proper handling of mixed Japanese/English content, making it ideal for:

  • Karaoke Applications - Generate natural phrase breaks for synchronized lyrics display
  • Music Apps - Improve lyrics readability through intelligent segmentation
  • Lyrics Analysis - Analyze song structure and linguistic patterns
  • Subtitle Generation - Create formatted subtitles for music videos
  • Language Learning - Study Japanese lyrics with proper phrase boundaries

Features

🎯 Intelligent Segmentation

  • Advanced rule-based system for natural phrase breaks
  • Part-of-speech analysis for accurate break placement
  • Configurable boundary rules with score-based strength evaluation

🌏 Mixed Language Support

  • Seamless processing of Japanese and English text
  • Script type detection (Japanese/Latin/Number)
  • Script change boundary detection

🎵 Lyrics-Optimized Rules

  • Specialized handling of parentheses, quotation marks, and repetition patterns
  • Timeline preservation (maintains temporal relationships while adding logical segmentation)
  • Whitespace break detection

🔧 Extensible Rule System

  • Customizable boundary rules (an illustrative rule shape is sketched after this list)
  • Token-based and position-based rule conditions
  • Multiple break strength levels (Strong/Medium/Weak/None)
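
The actual rule definitions live in src/rules.ts. As a purely illustrative sketch of the ideas above (token-based and position-based conditions, score-based strength), a boundary rule might be shaped roughly like this; the interface is hypothetical, not the package's real API:

// Hypothetical shape, for illustration only; see src/rules.ts for the real definitions.
type BreakStrength = 'Strong' | 'Medium' | 'Weak' | 'None';

interface IllustrativeBoundaryRule {
  name: string;
  // Token-based condition: inspect the current token (POS, surface form, reading, ...)
  matchToken?: (token: { pos: string; surface_form: string }) => boolean;
  // Position-based condition: inspect the character position in the original text
  matchPosition?: (text: string, index: number) => boolean;
  // Score contributed when the rule matches; accumulated scores map to a break strength
  score: number;
}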

Installation

npm install @ioris/tokenizer-kuromoji

Usage

Basic Usage

import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

const text = '桜の花が咲いている Beautiful spring day';
const result = await LineArgsTokenizer({ text });

console.log(result.phrases);
// Output: Array of Phrase objects with intelligent segmentation

Advanced Usage with Custom Rules

import {
  generateBreaksOnTokens,
  parseWithKuromoji,
  segmentByBreakAfter,
} from '@ioris/tokenizer-kuromoji';

const text = '桜の花が咲いている Beautiful spring day';

// Parse text with Kuromoji to get morphological tokens
const tokens = await parseWithKuromoji(text);

// Generate breaks based on rules
const tokensWithBreaks = generateBreaksOnTokens(tokens);

// Segment into phrases
const phrases = segmentByBreakAfter(tokensWithBreaks, text);

Processing Flow

The tokenization process follows this flow (shown below as a Mermaid flowchart):

flowchart TD
    Start([Input Text]) --> Parse[Parse with Kuromoji]
    Parse --> Tokens[Morphological Tokens]
    Tokens --> Script[Detect Script Types]
    Script --> Rules[Apply Boundary Rules]

    Rules --> CheckRules{Evaluate Rules}
    CheckRules -->|Token-based| TokenRule[Check POS, surface, etc.]
    CheckRules -->|Position-based| PosRule[Check text position]

    TokenRule --> Score[Calculate Break Score]
    PosRule --> Score

    Score --> Strength[Map to Break Strength]
    Strength -->|Strong| StrongBreak[Strong Break]
    Strength -->|Medium| MediumBreak[Medium Break]
    Strength -->|Weak| WeakBreak[Weak Break]
    Strength -->|None| NoBreak[No Break]

    StrongBreak --> Segment
    MediumBreak --> Segment
    WeakBreak --> Segment
    NoBreak --> Segment

    Segment[Segment by Breaks] --> BuildPhrase[Build Phrases]
    BuildPhrase --> Timeline[Apply Timeline]
    Timeline --> Result([Output Phrases])

    style Start fill:#e1f5ff
    style Result fill:#e1f5ff
    style Parse fill:#fff4e1
    style Rules fill:#fff4e1
    style Segment fill:#fff4e1
    style BuildPhrase fill:#fff4e1

Key Processing Steps

  1. Morphological Analysis: Parse input text using Kuromoji to get tokens with part-of-speech information
  2. Script Detection: Identify script types (Japanese/Latin/Number) for each token
  3. Rule Application: Evaluate boundary rules based on:
    • Token properties (POS, surface form, reading)
    • Position in text (brackets, quotes, whitespace)
    • Script changes between tokens
  4. Break Scoring: Calculate a break score from the matched rules
  5. Strength Mapping: Convert scores to break strength levels (Strong/Medium/Weak/None); see the sketch after this list
  6. Segmentation: Split tokens into phrases based on break points
  7. Phrase Building: Construct phrase objects with proper text and metadata
  8. Timeline Application: Apply temporal information (startTime/endTime) to phrases
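
As a rough illustration of steps 4-5, the sketch below maps an accumulated break score to a strength level. The thresholds are made up for the example; the actual scoring rules are defined in src/rules.ts.

type BreakStrength = 'Strong' | 'Medium' | 'Weak' | 'None';

// Hypothetical thresholds, for illustration only.
function toBreakStrength(score: number): BreakStrength {
  if (score >= 3) return 'Strong';
  if (score >= 2) return 'Medium';
  if (score >= 1) return 'Weak';
  return 'None';
}

console.log(toBreakStrength(2.5)); // 'Medium'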

API Reference

Main Functions

LineArgsTokenizer(args: LineArgsTokenizerArgs): Promise<TokenizeResult>

Main tokenizer function that processes text and returns segmented phrases.

Parameters:

  • args.text (string) - The text to tokenize
  • args.startTime (number, optional) - Start timestamp
  • args.endTime (number, optional) - End timestamp

Returns: a Promise resolving to a TokenizeResult containing the phrases array
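
For example, a lyric line that spans a known time range can be tokenized like this. Note that the unit of startTime/endTime (e.g. seconds vs. milliseconds) is an assumption here; the README does not specify it.

import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

// Assumption: timestamps are in seconds; use whatever unit your timeline uses.
const result = await LineArgsTokenizer({
  text: '君と見た景色 forever in my heart',
  startTime: 12.4,
  endTime: 17.9,
});

console.log(result.phrases); // phrases with the applied timeline metadata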

parseWithKuromoji(text: string): Promise<IpadicFeatures[]>

Parse text using Kuromoji morphological analyzer.

generateBreaksOnTokens(tokens: IpadicFeatures[]): TokenWithBreak[]

Apply boundary rules to tokens and generate break information.

segmentByBreakAfter(tokens: TokenWithBreak[], text: string): Phrase[]

Segment tokens into phrases based on break information.
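
Putting the three lower-level functions together with the signatures above (a sketch; it assumes TokenWithBreak and Phrase are exported from the package entry point and that IpadicFeatures comes from kuromoji's type definitions):

// Type imports are assumptions for illustration; adjust to the actual exports.
import type { IpadicFeatures } from 'kuromoji';
import {
  generateBreaksOnTokens,
  parseWithKuromoji,
  segmentByBreakAfter,
  type Phrase,
  type TokenWithBreak,
} from '@ioris/tokenizer-kuromoji';

const text = '夜空に咲く花火 shining bright';

const tokens: IpadicFeatures[] = await parseWithKuromoji(text);
const tokensWithBreaks: TokenWithBreak[] = generateBreaksOnTokens(tokens);
const phrases: Phrase[] = segmentByBreakAfter(tokensWithBreaks, text);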

Development

Setup

# Install dependencies
npm install

# Run tests
npm test

# Run tests with coverage
npm run test:coverage

# Run tests in watch mode
npm run test:watch

# Build the library
npm run build

# Run linter
npm run lint

# Format code
npm run format

Project Structure

.
├── src/
│   ├── Tokenizer.Kuromoji.ts        # Main tokenizer implementation
│   ├── types.ts                     # Type definitions
│   ├── rules.ts                     # Boundary rule definitions
│   ├── constants.ts                 # Constants
│   ├── index.ts                     # Entry point
│   ├── *.test.ts                    # Integration tests
│   └── *.unit.test.ts               # Unit tests
├── dist/                            # Build output
└── coverage/                        # Test coverage reports

Testing

The project uses Vitest for testing with comprehensive test coverage:

  • Unit tests for individual functions
  • Integration tests for complete tokenization flows
  • Coverage reporting with @vitest/coverage-v8

Run npm run test:coverage to generate coverage reports.
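
A minimal Vitest sketch against the public API shown above (the assertion is illustrative; the actual tests live in src/*.test.ts):

import { describe, expect, it } from 'vitest';
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

describe('LineArgsTokenizer', () => {
  it('segments mixed Japanese/English text into phrases', async () => {
    const result = await LineArgsTokenizer({ text: '桜の花が咲いている Beautiful spring day' });
    expect(result.phrases.length).toBeGreaterThan(0);
  });
});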

Technical Details

Dependency Requirements

  • Node.js >= 16.0
  • TypeScript >= 5.0

Key Dependencies

  • @ioris/core - Ioris framework core
  • kuromoji - Japanese morphological analyzer

Build Tools

  • Build: esbuild + TypeScript compiler
  • Testing: Vitest
  • Linting/Formatting: Biome
  • Task Runner: npm-run-all

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Repository

https://github.com/8beeeaaat/ioris_tokenizer_kuromoji

Issues

https://github.com/8beeeaaat/ioris_tokenizer_kuromoji/issues