@ioris/tokenizer-kuromoji

v0.4.0

Published

2 months ago

A specialized tokenizer library for Japanese lyrics analysis that provides intelligent text segmentation using the Kuromoji morphological analyzer.

Maintainer: 8beeeaaat

Keywords: music lyric sync iori

Overview

@ioris/tokenizer-kuromoji integrates with the @ioris/core framework to provide advanced lyrics tokenization capabilities. The library focuses on natural phrase breaks and proper handling of mixed Japanese/English content, making it ideal for:

Karaoke Applications - Generate natural phrase breaks for synchronized lyrics display
Music Apps - Improve lyrics readability through intelligent segmentation
Lyrics Analysis - Analyze song structure and linguistic patterns
Subtitle Generation - Create formatted subtitles for music videos
Language Learning - Study Japanese lyrics with proper phrase boundaries

Features

🎯 Intelligent Segmentation

Advanced rule-based system for natural phrase breaks
Part-of-speech analysis for accurate break placement
Configurable boundary rules with score-based strength evaluation

🌏 Mixed Language Support

Seamless processing of Japanese and English text
Script type detection (Japanese/Latin/Number)
Script change boundary detection
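
The script detection listed above could look roughly like the following. This is a minimal sketch, not the library's implementation: the names `ScriptType`, `detectScript`, and `findScriptChanges` are hypothetical, the `Other` category is an addition for punctuation and whitespace, and each token is classified by its first character only.

```typescript
// Hypothetical names for illustration; not taken from the library's source.
type ScriptType = "Japanese" | "Latin" | "Number" | "Other";

// Classify a token surface by its first character.
// Ranges cover hiragana, katakana, and the common CJK ideograph block.
function detectScript(surface: string): ScriptType {
  const ch = surface.charAt(0);
  if (/[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]/.test(ch)) return "Japanese";
  if (/[A-Za-z]/.test(ch)) return "Latin";
  if (/[0-9]/.test(ch)) return "Number";
  return "Other";
}

// Return each index i where token i and token i + 1 use different scripts.
// Pairs involving "Other" (punctuation, whitespace) are ignored in this sketch.
function findScriptChanges(tokenSurfaces: string[]): number[] {
  const changes: number[] = [];
  for (let i = 0; i + 1 < tokenSurfaces.length; i++) {
    const a = detectScript(tokenSurfaces[i] ?? "");
    const b = detectScript(tokenSurfaces[i + 1] ?? "");
    if (a !== "Other" && b !== "Other" && a !== b) changes.push(i);
  }
  return changes;
}

const surfaces = ["桜", "の", "花", "Beautiful", "spring", "2024"];
// Changes after index 2 (花 | Beautiful) and index 4 (spring | 2024).
console.log(findScriptChanges(surfaces));
```

A script change found this way would be one input to the boundary rules, alongside part-of-speech and position checks.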

🎵 Lyrics-Optimized Rules

Specialized handling of parentheses, quotation marks, and repetition patterns
Timeline preservation (maintains temporal relationships while adding logical segmentation)
Whitespace break detection
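
To illustrate how parentheses handling and whitespace break detection might interact, here is a small position-based sketch. The helper names (`closerFor`, `insideBrackets`, `whitespaceBreakCandidates`), the bracket list, and the policy of ignoring whitespace inside brackets are all assumptions made for this example, not behavior confirmed by the library.

```typescript
// Hypothetical helpers; the bracket list and the "no break inside brackets"
// policy are assumptions made for this sketch.
function closerFor(ch: string): string | undefined {
  switch (ch) {
    case "(": return ")";
    case "（": return "）";
    case "「": return "」";
    case "『": return "』";
    default: return undefined;
  }
}

// True when the character at `index` sits inside an unclosed bracket or quote.
function insideBrackets(text: string, index: number): boolean {
  const expectedClosers: string[] = [];
  for (let i = 0; i < index; i++) {
    const ch = text.charAt(i);
    const closer = closerFor(ch);
    if (closer !== undefined) {
      expectedClosers.push(closer);
    } else if (
      expectedClosers.length > 0 &&
      expectedClosers[expectedClosers.length - 1] === ch
    ) {
      expectedClosers.pop();
    }
  }
  return expectedClosers.length > 0;
}

// Indices of whitespace characters that are not enclosed in brackets.
function whitespaceBreakCandidates(text: string): number[] {
  const candidates: number[] = [];
  for (let i = 0; i < text.length; i++) {
    if (/\s/.test(text.charAt(i)) && !insideBrackets(text, i)) {
      candidates.push(i);
    }
  }
  return candidates;
}

// The spaces inside "(la la la)" are skipped; indices 3 and 14 remain.
console.log(whitespaceBreakCandidates("君の声 (la la la) 聞こえる"));
```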

🔧 Extensible Rule System

Customizable boundary rules
Token-based and position-based rule conditions
Multiple break strength levels (Strong/Medium/Weak/None)
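
The sketch below shows one way score-based rules could map onto the Strong/Medium/Weak/None levels. The rule names, scores, and thresholds are invented for illustration and are not the library's actual values; only the part-of-speech strings (such as `助詞` for particles) follow the IPADIC output that Kuromoji produces.

```typescript
// Assumption: rule names, scores, and thresholds are illustrative only.
type BreakStrength = "Strong" | "Medium" | "Weak" | "None";

interface SketchToken {
  surface: string;
  pos: string; // part of speech, e.g. "助詞" (particle) in IPADIC output
}

interface SketchRule {
  name: string;
  score: number;
  // Returns true when this rule supports a break after `current`.
  matches: (current: SketchToken, next: SketchToken | undefined) => boolean;
}

const sketchRules: SketchRule[] = [
  { name: "after-particle", score: 1, matches: (current) => current.pos === "助詞" },
  {
    name: "before-whitespace",
    score: 2,
    matches: (_current, next) => next !== undefined && next.surface.trim() === "",
  },
  {
    name: "after-closing-bracket",
    score: 3,
    matches: (current) => current.surface === ")" || current.surface === "）",
  },
];

// Sum the scores of every rule that matches the token pair.
function scoreBreakAfter(current: SketchToken, next: SketchToken | undefined): number {
  let total = 0;
  for (const rule of sketchRules) {
    if (rule.matches(current, next)) total += rule.score;
  }
  return total;
}

// Map a numeric score to a break strength level.
function toBreakStrength(score: number): BreakStrength {
  if (score >= 3) return "Strong";
  if (score >= 2) return "Medium";
  if (score >= 1) return "Weak";
  return "None";
}

const particle: SketchToken = { surface: "が", pos: "助詞" };
const verb: SketchToken = { surface: "咲い", pos: "動詞" };
const space: SketchToken = { surface: " ", pos: "記号" };

console.log(toBreakStrength(scoreBreakAfter(particle, verb)));  // Weak (1)
console.log(toBreakStrength(scoreBreakAfter(particle, space))); // Strong (1 + 2)
console.log(toBreakStrength(scoreBreakAfter(verb, undefined))); // None (0)
```

Keeping rules as data (a name, a score, and a predicate) is what makes a system like this extensible: adding a rule means appending one entry rather than editing the scoring loop.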

Installation

npm install @ioris/tokenizer-kuromoji

Usage

Basic Usage

import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';

const text = '桜の花が咲いている Beautiful spring day';
const result = await LineArgsTokenizer({ text });

console.log(result.phrases);
// Output: Array of Phrase objects with intelligent segmentation

Advanced Usage with Custom Rules

import { parseWithKuromoji, generateBreaksOnTokens, segmentByBreakAfter } from '@ioris/tokenizer-kuromoji';

const text = '桜の花が咲いている Beautiful spring day';

// Parse text with Kuromoji to get morphological tokens
const tokens = await parseWithKuromoji(text);

// Apply boundary rules and attach break information to each token
const tokensWithBreaks = generateBreaksOnTokens(tokens);

// Split the tokens into phrases at the generated break points
const phrases = segmentByBreakAfter(tokensWithBreaks, text);

Processing Flow

The tokenization process follows this flow:

flowchart TD
    Start([Input Text]) --> Parse[Parse with Kuromoji]
    Parse --> Tokens[Morphological Tokens]
    Tokens --> Script[Detect Script Types]
    Script --> Rules[Apply Boundary Rules]

    Rules --> CheckRules{Evaluate Rules}
    CheckRules -->|Token-based| TokenRule[Check POS, surface, etc.]
    CheckRules -->|Position-based| PosRule[Check text position]

    TokenRule --> Score[Calculate Break Score]
    PosRule --> Score

    Score --> Strength[Map to Break Strength]
    Strength -->|Strong| StrongBreak[Strong Break]
    Strength -->|Medium| MediumBreak[Medium Break]
    Strength -->|Weak| WeakBreak[Weak Break]
    Strength -->|None| NoBreak[No Break]

    StrongBreak --> Segment
    MediumBreak --> Segment
    WeakBreak --> Segment
    NoBreak --> Segment

    Segment[Segment by Breaks] --> BuildPhrase[Build Phrases]
    BuildPhrase --> Timeline[Apply Timeline]
    Timeline --> Result([Output Phrases])

    style Start fill:#e1f5ff
    style Result fill:#e1f5ff
    style Parse fill:#fff4e1
    style Rules fill:#fff4e1
    style Segment fill:#fff4e1
    style BuildPhrase fill:#fff4e1

Key Processing Steps

1. Morphological Analysis: Parse input text using Kuromoji to get tokens with part-of-speech information
2. Script Detection: Identify script types (Japanese/Latin/Number) for each token
3. Rule Application: Evaluate boundary rules based on:
   - Token properties (POS, surface form, reading)
   - Position in text (brackets, quotes, whitespace)
   - Script changes between tokens
4. Break Scoring: Calculate break strength score from matched rules
5. Strength Mapping: Convert scores to break strength levels (Strong/Medium/Weak/None)
6. Segmentation: Split tokens into phrases based on break points
7. Phrase Building: Construct phrase objects with proper text and metadata
8. Timeline Application: Apply temporal information (startTime/endTime) to phrases
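
The steps above do not say how temporal information is distributed across phrases. One plausible approach, shown here purely as an assumption, is to split the line's `startTime` to `endTime` interval in proportion to each phrase's character count. `TimedPhrase` and `applyTimelineSketch` are hypothetical names, and the real `Phrase` objects from `@ioris/core` may be shaped differently.

```typescript
// Assumption: proportional-by-length timing is a guess at the general idea,
// not the library's documented behavior.
interface TimedPhrase {
  text: string;
  startTime: number;
  endTime: number;
}

function applyTimelineSketch(
  phrases: string[],
  startTime: number,
  endTime: number,
): TimedPhrase[] {
  const totalChars = phrases.reduce((sum, phrase) => sum + phrase.length, 0);
  const duration = endTime - startTime;
  const result: TimedPhrase[] = [];
  let consumed = 0;

  for (const text of phrases) {
    // With no characters at all, fall back to the full interval.
    const phraseStart =
      totalChars === 0 ? startTime : startTime + (duration * consumed) / totalChars;
    consumed += text.length;
    const phraseEnd =
      totalChars === 0 ? endTime : startTime + (duration * consumed) / totalChars;
    result.push({ text, startTime: phraseStart, endTime: phraseEnd });
  }
  return result;
}

// 4 and 6 characters over a 10-unit window:
// the first phrase spans 10 to 14, the second 14 to 20.
const timed = applyTimelineSketch(["桜の花が", "咲いているよ"], 10, 20);
console.log(timed);
```

Because each phrase starts where the previous one ends, the phrases stay in order and together cover the original interval, which is the property "timeline preservation" would need to hold.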

API Reference

Main Functions

`LineArgsTokenizer(args: LineArgsTokenizerArgs): Promise<TokenizeResult>`

Main tokenizer function that processes text and returns segmented phrases.

Parameters:

args.text (string) - The text to tokenize
args.startTime (number, optional) - Start timestamp
args.endTime (number, optional) - End timestamp

Returns: Promise resolving to TokenizeResult containing phrases array

`parseWithKuromoji(text: string): Promise<IpadicFeatures[]>`

Parse text using Kuromoji morphological analyzer.

`generateBreaksOnTokens(tokens: IpadicFeatures[]): TokenWithBreak[]`

Apply boundary rules to tokens and generate break information.

`segmentByBreakAfter(tokens: TokenWithBreak[], text: string): Phrase[]`

Segment tokens into phrases based on break information.
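
As a rough picture of what segmenting by break information involves, the following self-contained sketch groups tokens into phrase strings wherever a token is flagged with a break after it. `FlaggedToken` and `segmentSketch` are hypothetical; the real `TokenWithBreak` and `Phrase` types carry more information, and the real function also receives the original `text`, which this sketch ignores.

```typescript
// Hypothetical token shape: a surface string plus a "break after me" flag.
interface FlaggedToken {
  surface: string;
  breakAfter: boolean;
}

// Concatenate surfaces until a flagged token closes the current phrase.
// Whitespace-only phrases are dropped and surrounding whitespace is trimmed.
function segmentSketch(tokens: FlaggedToken[]): string[] {
  const phrases: string[] = [];
  let current = "";
  for (const token of tokens) {
    current += token.surface;
    if (token.breakAfter) {
      if (current.trim() !== "") phrases.push(current.trim());
      current = "";
    }
  }
  // Flush whatever remains after the last break.
  if (current.trim() !== "") phrases.push(current.trim());
  return phrases;
}

const flagged: FlaggedToken[] = [
  { surface: "桜", breakAfter: false },
  { surface: "の", breakAfter: false },
  { surface: "花", breakAfter: false },
  { surface: "が", breakAfter: true },
  { surface: "咲い", breakAfter: false },
  { surface: "て", breakAfter: false },
  { surface: "いる", breakAfter: true },
  { surface: " ", breakAfter: false },
  { surface: "Beautiful", breakAfter: false },
  { surface: " ", breakAfter: false },
  { surface: "spring", breakAfter: false },
  { surface: " ", breakAfter: false },
  { surface: "day", breakAfter: false },
];

// Three phrases: "桜の花が", "咲いている", "Beautiful spring day"
console.log(segmentSketch(flagged));
```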

Development

Setup

# Install dependencies
npm install

# Run tests
npm test

# Run tests with coverage
npm run test:coverage

# Run tests in watch mode
npm run test:watch

# Build the library
npm run build

# Run linter
npm run lint

# Format code
npm run format

Project Structure

.
├── src/
│   ├── Tokenizer.Kuromoji.ts        # Main tokenizer implementation
│   ├── types.ts                     # Type definitions
│   ├── rules.ts                     # Boundary rule definitions
│   ├── constants.ts                 # Constants
│   ├── index.ts                     # Entry point
│   ├── *.test.ts                    # Integration tests
│   └── *.unit.test.ts              # Unit tests
├── dist/                            # Build output
└── coverage/                        # Test coverage reports

Testing

The project is tested with Vitest. The test suite includes:

Unit tests for individual functions
Integration tests for complete tokenization flows
Coverage reporting with @vitest/coverage-v8

Run `npm run test:coverage` to generate coverage reports.

Technical Details

Dependency Requirements

Node.js >= 16.0
TypeScript >= 5.0

Key Dependencies

@ioris/core - Ioris framework core
kuromoji - Japanese morphological analyzer

Build Tools

Build: esbuild + TypeScript compiler
Testing: Vitest
Linting/Formatting: Biome
Task Runner: npm-run-all

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.

Repository

https://github.com/8beeeaaat/ioris_tokenizer_kuromoji

Issues

https://github.com/8beeeaaat/ioris_tokenizer_kuromoji/issues
