@diplodoc/sentenizer

Version: v0.0.10
Published: 17 days ago
Downloads: 7,373
Description: text segmentation into sentences

Rule-based NLP library for sentence segmentation with Russian language support. Splits text into sentences using hand-crafted rules that handle Russian-specific cases like abbreviations, initials, and punctuation.

Features

Rule-based segmentation — Hand-crafted rules optimized for Russian text
Russian-specific handling — Correctly handles abbreviations (e.g., "и т. д.", "т. п."), initials (e.g., "И. В. Иванов"), quotations, and brackets
Functional programming — Uses Ramda for clean, composable code
High accuracy — Tested on various Russian text samples
Lightweight — No external NLP libraries, fast and self-contained

Installation

npm install @diplodoc/sentenizer

Usage

Basic Example

const {sentenize} = require('@diplodoc/sentenizer');

const text = 'Он купил фрукты - яблоки, бананы, и т. д. все были очень рады угощению. Вот такой он добродушный наш родственник И. В. Иванов.';

const sentences = sentenize(text);
// sentences:
// [
//  'Он купил фрукты - яблоки, бананы, и т. д. все были очень рады угощению.',
//  'Вот такой он добродушный наш родственник И. В. Иванов.'
// ]

ES Modules

import {sentenize} from '@diplodoc/sentenizer';

const sentences = sentenize('Первое предложение. Второе предложение!');
// ['Первое предложение.', ' Второе предложение!']
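The sample output above keeps the leading space on the second sentence. If downstream code expects trimmed sentences, a small post-processing step is one option. The sketch below assumes only that `sentenize` returns an array of strings. `normalizeSentences` is a hypothetical helper, not part of the package, and the input array is copied from the example output rather than produced by a live call.

```javascript
// Hypothetical helper, NOT part of @diplodoc/sentenizer.
// Trims surrounding whitespace and drops segments that end up empty.
function normalizeSentences(sentences) {
  return sentences
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}

// Sample input copied from the ES Modules example output.
const raw = ['Первое предложение.', ' Второе предложение!'];
const cleaned = normalizeSentences(raw);

console.log(cleaned);
```

Trimming discards the original spacing, so keep the raw array if the segments need to be concatenated back into the source text later.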

API

`sentenize(text: string): string[]`

Splits text into sentences.

Parameters:

text (string) — Text to segment

Returns:

string[] — Array of segmented sentences

Type signature:

sentenize :: string -> string[]

How It Works

The library uses a rule-based approach. Adjacent chunks of text are checked against two types of conditions:

Join Conditions (chunks should be merged):

Abbreviations (e.g., "и т. д.", "т. п.")
Initials (e.g., "И. В. Иванов")
Quotations and brackets
Lowercase continuation
Delimiters and spaces

Break Conditions (chunks should be split):

Hard breaks (double newlines)
Uppercase start after sentence end
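The join and break conditions listed above can be illustrated with a heavily simplified sketch. This is not the library's implementation: it uses plain regular expressions instead of Ramda, covers only a few of the rules, and every name in it (`sentenizeSketch`, `shouldJoin`, `ABBREVIATIONS`) is hypothetical. The abbreviation list is an assumed three-item sample.

```javascript
// Illustrative sketch only: NOT the actual @diplodoc/sentenizer implementation.
// All names below are hypothetical.

// Tiny, assumed list of abbreviations that never end a sentence.
const ABBREVIATIONS = new Set(['т.', 'ул.', 'им.']);

// Join: the left chunk ends with a single uppercase letter and a period ("И.").
const endsWithInitial = (left) => /(^|\s)\p{Lu}\.$/u.test(left);

// Join: the last word of the left chunk is a known abbreviation ("т.").
const endsWithAbbreviation = (left) =>
  ABBREVIATIONS.has(left.trim().split(/\s+/).pop());

// Join: the right chunk continues with a lowercase letter.
const startsWithLowercase = (right) => /^\s*\p{Ll}/u.test(right);

const shouldJoin = (left, right) =>
  endsWithInitial(left) || endsWithAbbreviation(left) || startsWithLowercase(right);

function sentenizeSketch(text) {
  const sentences = [];
  // Break: a double newline is a hard break, so each paragraph is handled separately.
  for (const paragraph of text.split(/\n{2,}/)) {
    // Candidate chunks end at '.', '!' or '?'; a trailing remainder is kept too.
    const chunks = paragraph.match(/[^.!?]+[.!?]+|[^.!?]+$/g) || [];
    let current = '';
    for (const chunk of chunks) {
      // Break: no join rule applies (for example, an uppercase start follows).
      if (current !== '' && !shouldJoin(current, chunk)) {
        sentences.push(current.trim());
        current = '';
      }
      current += chunk;
    }
    if (current.trim() !== '') {
      sentences.push(current.trim());
    }
  }
  return sentences;
}

const sample =
  'Он купил фрукты - яблоки, бананы, и т. д. все были очень рады угощению. ' +
  'Вот такой он добродушный наш родственник И. В. Иванов.';
console.log(sentenizeSketch(sample));
```

On this sample the sketch returns the same two sentences as the Basic Example: "т." is joined by the abbreviation rule, "д. все" by lowercase continuation, and "И. В. Иванов" by the initials rule. Quotations, brackets, and most abbreviations are deliberately left out, so it should not be treated as a substitute for the library.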

Integration

This package is used by @diplodoc/translation for text segmentation during translation.

Development

See AGENTS.md for detailed development guidelines.

License

MIT
