@bunkojp/text-segmentation v0.2.0
# text-segmentation

Split text into semantic or structural chunks using purely algorithmic strategies. No LLM or external API dependencies. Supports mixed Japanese/English text.
## Install

```sh
npm install @bunkojp/text-segmentation
```

## Quick Start
```ts
import { segmentByNcdTfidf } from "@bunkojp/text-segmentation";

const text = `Today the weather is nice. I went for a walk. I saw flowers in the park.
Yesterday I went to the supermarket. I bought vegetables and meat. I cooked dinner.`;

const segments = segmentByNcdTfidf(text, {
  targetChunkSize: 80,
  minChunkSize: 20,
  maxChunkSize: 300,
  windowSize: 2,
});

for (const seg of segments) {
  console.log(`[${seg.start}:${seg.end}]`, text.slice(seg.start, seg.end));
}
```

## Strategies
Four segmentation strategies are provided, listed from fastest/simplest to most semantically accurate.
### Punctuation

Accumulates sentences up to a target size and splits at sentence boundaries. The fastest strategy.

```ts
import { segmentByPunctuation } from "@bunkojp/text-segmentation";

const segments = segmentByPunctuation(text, {
  targetChunkSize: 500, // Target chunk size in characters
  minChunkSize: 100,    // Minimum chunk size
  maxChunkSize: 2000,   // Hard limit on chunk size
});
```

### Compression (NCD)
Computes the Normalized Compression Distance (NCD) between adjacent sentence windows to detect semantic boundaries.

```ts
import { segmentByCompression } from "@bunkojp/text-segmentation";

const segments = segmentByCompression(text, {
  targetChunkSize: 500,
  minChunkSize: 100,
  maxChunkSize: 2000,
  ncdThreshold: 0.4,  // Boundary detection threshold (higher = fewer splits)
  windowSize: 3,      // Number of sentences per window
  adaptive: false,    // Set true for percentile-based automatic thresholding
  ncdPercentile: 0.2, // Percentile used in adaptive mode
});
```

### TF-IDF
Uses TF-IDF cosine distance between adjacent sentence windows. Strong at detecting lexical topic shifts.
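As a minimal sketch of the underlying signal, cosine distance between term-frequency vectors of two text windows can be computed as below. This is illustrative only: `termCounts` and `cosineDistance` are hypothetical names, not package exports, and the library's actual tokenization (especially for Japanese) and IDF weighting are not shown.

```ts
// Hypothetical helpers illustrating the distance measure; not part of
// the @bunkojp/text-segmentation API. The naive \W+ tokenizer shown
// here only suits space-delimited languages.
function termCounts(text: string): Map<string, number> {
  const counts = new Map<string, number>();
  for (const token of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    counts.set(token, (counts.get(token) ?? 0) + 1);
  }
  return counts;
}

function cosineDistance(a: Map<string, number>, b: Map<string, number>): number {
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (const [term, wa] of a) {
    normA += wa * wa;
    dot += wa * (b.get(term) ?? 0);
  }
  for (const wb of b.values()) normB += wb * wb;
  if (normA === 0 || normB === 0) return 1; // Empty window: maximal distance
  return 1 - dot / Math.sqrt(normA * normB); // 0 = identical, 1 = disjoint
}
```

A low distance means the two windows share vocabulary; a spike in distance between adjacent windows suggests a topic shift.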
```ts
import { segmentByTfidf } from "@bunkojp/text-segmentation";

const segments = segmentByTfidf(text, {
  targetChunkSize: 500,
  minChunkSize: 100,
  maxChunkSize: 2000,
  tfidfThreshold: 0.45,
  windowSize: 3,
  adaptive: false,
  tfidfPercentile: 0.2,
});
```

### NCD + TF-IDF
Weighted combination of compression distance and TF-IDF cosine distance. The most robust strategy.
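A plausible reading of the `ncdWeight`/`tfidfWeight` options is a weighted sum of the two divergence signals. The sketch below is an assumption inferred from the option names, not the library's verified internals:

```ts
// Assumed combination of the two divergence signals (illustrative;
// combinedDivergence is a hypothetical name, not a package export).
function combinedDivergence(
  ncdScore: number,
  tfidfDistance: number,
  ncdWeight = 0.5,
  tfidfWeight = 0.5,
): number {
  return ncdWeight * ncdScore + tfidfWeight * tfidfDistance;
}
```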
```ts
import { segmentByNcdTfidf } from "@bunkojp/text-segmentation";

const segments = segmentByNcdTfidf(text, {
  targetChunkSize: 500,
  minChunkSize: 100,
  maxChunkSize: 2000,
  ncdTfidfThreshold: 0.42,
  windowSize: 3,
  ncdWeight: 0.5,   // Weight for the NCD component
  tfidfWeight: 0.5, // Weight for the TF-IDF component
  adaptive: false,
  ncdTfidfPercentile: 0.2,
});
```

## Streaming
All strategies provide an AsyncGenerator-based streaming API.
```ts
import { streamSegmentByCompression } from "@bunkojp/text-segmentation";

for await (const event of streamSegmentByCompression(text)) {
  if (event.type === "segment") {
    console.log(`Segment #${event.index}:`, event.point);
  }
  if (event.type === "done") {
    console.log("Total segments:", event.points.length);
  }
}
```

## Types
```ts
type SegmentPoint = {
  start: number; // Start position (inclusive)
  end: number;   // End position (exclusive)
  type: "heading" | "section" | "paragraph";
};
```

Use `text.slice(segment.start, segment.end)` to extract a segment's text. Segments are contiguous, with no gaps, and together cover the entire input.

## Utilities
The sentence splitter is also available as a standalone utility.
```ts
import { splitIntoSentences } from "@bunkojp/text-segmentation";

const sentences = splitIntoSentences("First sentence. Second sentence.");
// [{ index: 1, text: "First sentence.", start: 0, end: 16 }, ...]
```

## Example Results
Pre-generated segmentation results for Akutagawa Ryunosuke's "Rashomon" (5,839 characters) are included in `spec/fixtures/results/`. Each JSON file contains the strategy name, the configuration used, and the full segment list with positions and text content.
Punctuation uses a fixed targetChunkSize=500. Semantic strategies use adaptive mode with a wide size range (min=100, max=3000) so that semantic boundaries dominate over size constraints.
| Strategy    | Segments | Avg Length | Config                              |
|-------------|----------|------------|-------------------------------------|
| Punctuation | 11       | 531 chars  | target=500, min=100, max=2000       |
| Compression | 24       | 243 chars  | adaptive, percentile=0.25, window=3 |
| TF-IDF      | 28       | 209 chars  | adaptive, percentile=0.25, window=3 |
| NCD+TF-IDF  | 27       | 216 chars  | adaptive, percentile=0.25, window=3 |
To regenerate these results:
```sh
bun spec/fixtures/generate-results.ts
```

## How It Works
The semantic strategies (Compression, TF-IDF, NCD+TF-IDF) share a common window-based algorithm:
1. Split the text into sentences
2. Create sliding windows of N sentences
3. Compute the divergence between adjacent windows
4. Select local maxima as boundary candidates
5. Apply size constraints (min/max/target)
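Steps 2–4 can be sketched end-to-end. The following is an illustrative re-implementation using gzip-based NCD as the divergence measure, not the library's actual code; `ncd`, `windowDivergences`, and `localMaxima` are hypothetical names, and sentence splitting and size constraints (steps 1 and 5) are omitted.

```ts
import { gzipSync } from "node:zlib";

// NCD(x, y) = (C(xy) - min(C(x), C(y))) / max(C(x), C(y)),
// where C is the compressed size in bytes. Near 0 for similar texts,
// near 1 for unrelated texts.
function ncd(x: string, y: string): number {
  const c = (s: string) => gzipSync(s).length;
  const cx = c(x);
  const cy = c(y);
  return (c(x + y) - Math.min(cx, cy)) / Math.max(cx, cy);
}

// Divergence between the window of `size` sentences before each
// candidate boundary and the window after it.
function windowDivergences(sentences: string[], size: number): number[] {
  const scores: number[] = [];
  for (let i = size; i + size <= sentences.length; i++) {
    const before = sentences.slice(i - size, i).join(" ");
    const after = sentences.slice(i, i + size).join(" ");
    scores.push(ncd(before, after)); // High score = likely topic shift
  }
  return scores;
}

// Score indices that are local maxima above a threshold; score index k
// corresponds to the boundary before sentence k + size.
function localMaxima(scores: number[], threshold: number): number[] {
  const peaks: number[] = [];
  for (let i = 0; i < scores.length; i++) {
    const prev = scores[i - 1] ?? -Infinity;
    const next = scores[i + 1] ?? -Infinity;
    if (scores[i] >= threshold && scores[i] >= prev && scores[i] > next) {
      peaks.push(i);
    }
  }
  return peaks;
}
```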
Adaptive mode replaces fixed thresholds with percentile-based automatic threshold selection and recursively subdivides oversized chunks.
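A minimal sketch of the percentile idea, assuming the percentile option marks roughly the top fraction of divergence scores as boundaries (an interpretation of the option, not confirmed internals):

```ts
// Pick the threshold at the (1 - percentile) quantile of the observed
// scores, so roughly the top `percentile` fraction become boundaries.
// Hypothetical helper, not a package export.
function percentileThreshold(scores: number[], percentile: number): number {
  const sorted = [...scores].sort((a, b) => a - b);
  const rank = Math.min(
    sorted.length - 1,
    Math.floor((1 - percentile) * sorted.length),
  );
  return sorted[rank];
}
```

The advantage over a fixed threshold: a uniformly similar or uniformly diverse document shifts the whole score distribution, and a threshold derived from that distribution shifts with it.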
