@ioris/tokenizer-kuromoji
v0.4.0
A specialized tokenizer library for Japanese lyrics analysis that provides intelligent text segmentation using the Kuromoji morphological analyzer.
Overview
@ioris/tokenizer-kuromoji integrates with the @ioris/core framework to provide advanced lyrics tokenization capabilities. The library focuses on natural phrase breaks and proper handling of mixed Japanese/English content, making it ideal for:
- Karaoke Applications - Generate natural phrase breaks for synchronized lyrics display
- Music Apps - Improve lyrics readability through intelligent segmentation
- Lyrics Analysis - Analyze song structure and linguistic patterns
- Subtitle Generation - Create formatted subtitles for music videos
- Language Learning - Study Japanese lyrics with proper phrase boundaries
Features
🎯 Intelligent Segmentation
- Advanced rule-based system for natural phrase breaks
- Part-of-speech analysis for accurate break placement
- Configurable boundary rules with score-based strength evaluation
🌏 Mixed Language Support
- Seamless processing of Japanese and English text
- Script type detection (Japanese/Latin/Number)
- Script change boundary detection
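As a rough illustration of script type detection and script change boundaries, the sketch below classifies a token's surface string by Unicode range and reports where neighbouring tokens switch script. The names `ScriptType`, `detectScript`, and `scriptChangeIndices` are hypothetical and are not taken from the library's exports; the character ranges are an assumption about what counts as each script.

```typescript
// Hypothetical sketch: these names are illustrative, not the library's API.
type ScriptType = "japanese" | "latin" | "number" | "other";

// Hiragana, Katakana (including the prolonged sound mark), common CJK ideographs.
const JAPANESE = /[\u3040-\u309F\u30A0-\u30FF\u4E00-\u9FFF]/;
const LATIN = /[A-Za-z]/;
const NUMBER = /[0-9\uFF10-\uFF19]/; // ASCII and full-width digits

// Classify a surface string by the first category it matches, in priority order.
function detectScript(surface: string): ScriptType {
  if (JAPANESE.test(surface)) return "japanese";
  if (LATIN.test(surface)) return "latin";
  if (NUMBER.test(surface)) return "number";
  return "other";
}

// Return every index i where token i and token i + 1 use different scripts.
// Tokens classified as "other" (whitespace, punctuation) never count as a change.
function scriptChangeIndices(surfaces: string[]): number[] {
  const scripts = surfaces.map(detectScript);
  const changes: number[] = [];
  for (let i = 0; i + 1 < scripts.length; i++) {
    const a = scripts[i];
    const b = scripts[i + 1];
    if (a !== "other" && b !== "other" && a !== b) changes.push(i);
  }
  return changes;
}
```

A script change index found this way could then feed a boundary rule, so that a line such as `桜の花 Beautiful day` is more likely to break between the Japanese and English parts.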
🎵 Lyrics-Optimized Rules
- Specialized handling of parentheses, quotation marks, and repetition patterns
- Timeline preservation (maintains temporal relationships while adding logical segmentation)
- Whitespace break detection
🔧 Extensible Rule System
- Customizable boundary rules
- Token-based and position-based rule conditions
- Multiple break strength levels (Strong/Medium/Weak/None)
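To make the idea of score-based rules concrete, here is a minimal sketch of how token-based and position-based conditions might each contribute a score that is then mapped to a break strength. Everything in it is an assumption for illustration: the type names, the two example rules, their scores, and the thresholds are hypothetical and do not describe the rules shipped in `rules.ts`.

```typescript
// Hypothetical sketch: names, scores, and thresholds are illustrative only.
type BreakStrength = "strong" | "medium" | "weak" | "none";

interface SketchToken {
  surface: string; // text of the token
  pos: string;     // part of speech, e.g. "助詞" (particle)
  offset: number;  // 0-based start offset of the token in the original text
}

interface RuleContext {
  text: string;
  tokens: SketchToken[];
  index: number; // the candidate break is evaluated AFTER tokens[index]
}

interface BoundaryRule {
  name: string;
  score: number;
  applies: (ctx: RuleContext) => boolean;
}

const sketchRules: BoundaryRule[] = [
  // Token-based condition: looks only at the token's part of speech.
  {
    name: "after-particle",
    score: 1,
    applies: ({ tokens, index }) => tokens[index]?.pos === "助詞",
  },
  // Position-based condition: looks at the character following the token in the text.
  {
    name: "whitespace-follows",
    score: 3,
    applies: ({ text, tokens, index }) => {
      const token = tokens[index];
      if (!token) return false;
      return /\s/.test(text.charAt(token.offset + token.surface.length));
    },
  },
];

// Sum the scores of all rules that match at this position.
function scoreBreakAfter(ctx: RuleContext, rules: BoundaryRule[] = sketchRules): number {
  return rules.filter((rule) => rule.applies(ctx)).reduce((sum, rule) => sum + rule.score, 0);
}

// Map a numeric score to a strength level (thresholds are assumed).
function toStrength(score: number): BreakStrength {
  if (score >= 3) return "strong";
  if (score >= 2) return "medium";
  if (score >= 1) return "weak";
  return "none";
}
```

Keeping each rule as a small object with its own score makes it straightforward to add, remove, or re-weight rules without touching the scoring loop.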
Installation
npm install @ioris/tokenizer-kuromoji
Usage
Basic Usage
import { LineArgsTokenizer } from '@ioris/tokenizer-kuromoji';
const text = '桜の花が咲いている Beautiful spring day';
const result = await LineArgsTokenizer({ text });
console.log(result.phrases);
// Output: Array of Phrase objects with intelligent segmentation
Advanced Usage with Custom Rules
import { LineArgsTokenizer, parseWithKuromoji, generateBreaksOnTokens, segmentByBreakAfter } from '@ioris/tokenizer-kuromoji';
// Input line to segment
const text = '桜の花が咲いている Beautiful spring day';
// Parse text with Kuromoji
const tokens = await parseWithKuromoji(text);
// Generate breaks based on rules
const tokensWithBreaks = generateBreaksOnTokens(tokens);
// Segment into phrases
const phrases = segmentByBreakAfter(tokensWithBreaks, text);
Processing Flow
The tokenization process follows this flow:
flowchart TD
Start([Input Text]) --> Parse[Parse with Kuromoji]
Parse --> Tokens[Morphological Tokens]
Tokens --> Script[Detect Script Types]
Script --> Rules[Apply Boundary Rules]
Rules --> CheckRules{Evaluate Rules}
CheckRules -->|Token-based| TokenRule[Check POS, surface, etc.]
CheckRules -->|Position-based| PosRule[Check text position]
TokenRule --> Score[Calculate Break Score]
PosRule --> Score
Score --> Strength[Map to Break Strength]
Strength -->|Strong| StrongBreak[Strong Break]
Strength -->|Medium| MediumBreak[Medium Break]
Strength -->|Weak| WeakBreak[Weak Break]
Strength -->|None| NoBreak[No Break]
StrongBreak --> Segment
MediumBreak --> Segment
WeakBreak --> Segment
NoBreak --> Segment
Segment[Segment by Breaks] --> BuildPhrase[Build Phrases]
BuildPhrase --> Timeline[Apply Timeline]
Timeline --> Result([Output Phrases])
style Start fill:#e1f5ff
style Result fill:#e1f5ff
style Parse fill:#fff4e1
style Rules fill:#fff4e1
style Segment fill:#fff4e1
style BuildPhrase fill:#fff4e1
Key Processing Steps
- Morphological Analysis: Parse input text using Kuromoji to get tokens with part-of-speech information
- Script Detection: Identify script types (Japanese/Latin/Number) for each token
- Rule Application: Evaluate boundary rules based on:
  - Token properties (POS, surface form, reading)
  - Position in text (brackets, quotes, whitespace)
  - Script changes between tokens
- Break Scoring: Calculate break strength score from matched rules
- Strength Mapping: Convert scores to break strength levels (Strong/Medium/Weak/None)
- Segmentation: Split tokens into phrases based on break points
- Phrase Building: Construct phrase objects with proper text and metadata
- Timeline Application: Apply temporal information (startTime/endTime) to phrases
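The segmentation and phrase-building steps above can be pictured with a small sketch. It assumes each token already carries a boolean break decision; the `BrokenToken` shape and the `segmentSketch` function are hypothetical and much simpler than the library's `TokenWithBreak` and `segmentByBreakAfter`, whose internals are not shown here.

```typescript
// Hypothetical sketch: a simplified stand-in for break-based segmentation.
interface BrokenToken {
  surface: string;     // text of the token
  breakAfter: boolean; // true if a phrase boundary follows this token
}

// Concatenate token surfaces and cut a new phrase after every token marked breakAfter.
// Surrounding whitespace is trimmed and empty phrases are dropped.
function segmentSketch(tokens: BrokenToken[]): string[] {
  const phrases: string[] = [];
  let current = "";
  const flush = () => {
    const trimmed = current.trim();
    if (trimmed !== "") phrases.push(trimmed);
    current = "";
  };
  for (const token of tokens) {
    current += token.surface;
    if (token.breakAfter) flush();
  }
  flush(); // emit whatever remains after the last break
  return phrases;
}
```

In this sketch only a yes/no break decision is used; a fuller version might decide per strength level which breaks are strong enough to split on.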
API Reference
Main Functions
LineArgsTokenizer(args: LineArgsTokenizerArgs): Promise<TokenizeResult>
Main tokenizer function that processes text and returns segmented phrases.
Parameters:
- args.text (string) - The text to tokenize
- args.startTime (number, optional) - Start timestamp
- args.endTime (number, optional) - End timestamp
Returns: Promise resolving to TokenizeResult containing phrases array
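How the optional startTime and endTime are distributed across the resulting phrases is not specified in this README. As one possible scheme, the sketch below splits the line's time range across phrases in proportion to their text length. The `TimedPhrase` type and `applyTimelineSketch` function are hypothetical, and the proportional allocation is an assumption rather than the library's documented behaviour.

```typescript
// Hypothetical sketch: proportional time allocation is an assumed scheme.
interface TimedPhrase {
  text: string;
  startTime: number;
  endTime: number;
}

// Split [startTime, endTime] across phrases in proportion to their length
// (measured in UTF-16 code units), keeping phrases contiguous and in order.
function applyTimelineSketch(phrases: string[], startTime: number, endTime: number): TimedPhrase[] {
  const total = phrases.reduce((sum, phrase) => sum + phrase.length, 0);
  if (total === 0) return [];
  const duration = endTime - startTime;
  let consumed = 0;
  return phrases.map((text) => {
    const phraseStart = startTime + (duration * consumed) / total;
    consumed += text.length;
    const phraseEnd = startTime + (duration * consumed) / total;
    return { text, startTime: phraseStart, endTime: phraseEnd };
  });
}
```

Because each phrase starts where the previous one ends, the phrases stay in order and cover the original range without gaps, which is one way to read "timeline preservation".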
parseWithKuromoji(text: string): Promise<IpadicFeatures[]>
Parse text using Kuromoji morphological analyzer.
generateBreaksOnTokens(tokens: IpadicFeatures[]): TokenWithBreak[]
Apply boundary rules to tokens and generate break information.
segmentByBreakAfter(tokens: TokenWithBreak[], text: string): Phrase[]
Segment tokens into phrases based on break information.
Development
Setup
# Install dependencies
npm install
# Run tests
npm test
# Run tests with coverage
npm run test:coverage
# Run tests in watch mode
npm run test:watch
# Build the library
npm run build
# Run linter
npm run lint
# Format code
npm run format
Project Structure
.
├── src/
│ ├── Tokenizer.Kuromoji.ts # Main tokenizer implementation
│ ├── types.ts # Type definitions
│ ├── rules.ts # Boundary rule definitions
│ ├── constants.ts # Constants
│ ├── index.ts # Entry point
│ ├── *.test.ts # Integration tests
│ └── *.unit.test.ts # Unit tests
├── dist/ # Build output
└── coverage/ # Test coverage reports
Testing
The project uses Vitest for testing with comprehensive test coverage:
- Unit tests for individual functions
- Integration tests for complete tokenization flows
- Coverage reporting with @vitest/coverage-v8
Run npm run test:coverage to generate coverage reports.
Technical Details
Dependency Requirements
- Node.js >= 16.0
- TypeScript >= 5.0
Key Dependencies
- @ioris/core - Ioris framework core
- kuromoji - Japanese morphological analyzer
Build Tools
- Build: esbuild + TypeScript compiler
- Testing: Vitest
- Linting/Formatting: Biome
- Task Runner: npm-run-all
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
Repository
https://github.com/8beeeaaat/ioris_tokenizer_kuromoji
Issues
https://github.com/8beeeaaat/ioris_tokenizer_kuromoji/issues
