
sentencex v1.0.13


sentence segmentation library


Sentence segmenter


A sentence segmentation library written in Rust, with wide language support, optimized for speed and utility.

Bindings

Besides native Rust, bindings are available for the following languages:

  • Python
  • Node.js
  • WebAssembly (browser)

Approach

The base heuristic is simple:

  • If it is a period, it ends a sentence.
  • If the preceding token is in a hand-compiled list of abbreviations, it does not end a sentence.

However, the sentence-ending punctuation is not a period in many languages. So we use a list of known punctuation marks that can cause a sentence break, covering as many languages as possible.

We also collect a list of known, popular abbreviations in as many languages as possible.
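As an illustration only, and not the library's actual implementation, the combined heuristic amounts to something like this toy Rust sketch (the punctuation and abbreviation lists here are made up for the example):

// Toy sketch of the heuristic; not the library's real code.
fn ends_sentence(token: &str, breakers: &[char], abbreviations: &[&str]) -> bool {
    match token.chars().last() {
        // The token ends in a known sentence-breaking punctuation mark...
        Some(last) if breakers.contains(&last) => {
            // ...but a known abbreviation suppresses the break.
            !abbreviations.contains(&token)
        }
        _ => false,
    }
}

fn main() {
    let breakers = ['.', '!', '?', '。'];
    let abbreviations = ["Dr.", "U.S.", "etc."];
    assert!(ends_sentence("yesterday.", &breakers, &abbreviations));
    assert!(!ends_sentence("Dr.", &breakers, &abbreviations));
    println!("heuristic behaves as expected");
}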

Sometimes it is very hard to get the segmentation correct. In such cases, this library is opinionated: it prefers not segmenting over wrong segmentation. If two sentences occasionally stay joined, that is acceptable; it is better than a sentence being split in the middle. We avoid over-engineering to get everything linguistically 100% accurate.

This approach is suitable for applications like text-to-speech and machine translation.

Consider this example: "We make a good team, you and I. Did you see Albert I. Jones yesterday?"

The accurate splitting of this text is ["We make a good team, you and I.", "Did you see Albert I. Jones yesterday?"]

However, achieving this level of precision requires complex rules, and those rules can create side effects elsewhere. Instead, if we simply do not segment between "I." and "Did", the result is acceptable for most downstream applications.

The sentence segmentation in this library is non-destructive: if the segments are concatenated, the original text can be reconstructed exactly. Line breaks, punctuation, and whitespace are preserved in the output.
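A minimal sketch of this round-trip property, assuming (as the Usage section below suggests) that segment returns the list of segments as strings:

use sentencex::segment;

fn main() {
    let text = "We make a good team, you and I. Did you see Albert I. Jones yesterday?";
    let sentences = segment("en", text);
    // Non-destructive: concatenating the segments reproduces the input exactly,
    // because punctuation and whitespace are preserved.
    assert_eq!(sentences.concat(), text);
}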

Usage

Rust

Install the library using

cargo add sentencex

Then, any text can be segmented as follows.

use sentencex::segment;

fn main() {
    let text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";
    let sentences = segment("en", text);

    for (i, sentence) in sentences.iter().enumerate() {
        println!("{}. {}", i + 1, sentence);
    }
}

The first argument is the language code and the second is the text to segment. The segment function returns the list of identified sentences.

Python

Install from PyPI:

pip install sentencex

Then, any text can be segmented as follows.

import sentencex

text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development."

# Segment text into sentences
sentences = sentencex.segment("en", text)
for i, sentence in enumerate(sentences, 1):
    print(f"{i}. {sentence}")

# Get sentence boundaries with indices
boundaries = sentencex.get_sentence_boundaries("en", text)
for boundary in boundaries:
    print(f"Sentence: '{boundary['text']}' (indices: {boundary['start_index']}-{boundary['end_index']})")

See bindings/python/example.py for more examples.

Node.js

Install from npm:

npm install sentencex

Then, any text can be segmented as follows.

import { segment, get_sentence_boundaries } from 'sentencex';

const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";

// Segment text into sentences
const sentences = segment("en", text);
sentences.forEach((sentence, i) => {
    console.log(`${i + 1}. ${sentence}`);
});

// Get sentence boundaries with indices
const boundaries = get_sentence_boundaries("en", text);
boundaries.forEach(boundary => {
    console.log(`Sentence: '${boundary.text}' (indices: ${boundary.start_index}-${boundary.end_index})`);
});

For CommonJS usage:

const { segment, get_sentence_boundaries } = require('sentencex');

See bindings/nodejs/example.js for more examples.

WebAssembly (Browser)

Install from npm:

npm install sentencex-wasm

or use a CDN like https://esm.sh/sentencex-wasm

import init, { segment, get_sentence_boundaries } from 'https://esm.sh/sentencex-wasm';

async function main() {
    // Initialize the WASM module
    await init();

    const text = "The James Webb Space Telescope (JWST) is a space telescope specifically designed to conduct infrared astronomy. The U.S. National Aeronautics and Space Administration (NASA) led Webb's design and development.";

    // Segment text into sentences
    const sentences = segment("en", text);
    sentences.forEach((sentence, i) => {
        console.log(`${i + 1}. ${sentence}`);
    });

    // Get sentence boundaries with indices
    const boundaries = get_sentence_boundaries("en", text);
    boundaries.forEach(boundary => {
        console.log(`Sentence: '${boundary.text}' (indices: ${boundary.start_index}-${boundary.end_index})`);
    });
}

main();

Language support

The aim is to support all languages that have a Wikipedia. Instead of falling back on English for languages not explicitly defined in the library, a fallback chain is used: the closest language that is defined in the library is chosen. Fallbacks are defined for roughly 244 languages.
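A hedged sketch of what this means in practice (the concrete fallback pairs live inside the library; the language code below is only a hypothetical illustration):

use sentencex::segment;

fn main() {
    // Hypothetical: a code without dedicated rules still segments,
    // resolving through the fallback chain to the closest defined language.
    let sentences = segment("pt-BR", "Olá mundo. Tudo bem?");
    for (i, sentence) in sentences.iter().enumerate() {
        println!("{}. {}", i + 1, sentence);
    }
}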

Performance

The following is a sample output of sentence-segmenting The Complete Works of William Shakespeare. The file is 5.29 MB; as shown below, segmentation took about half a second.

$ curl https://www.gutenberg.org/files/100/100-0.txt | ./target/release/sentencex -l en > /dev/null
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100 5295k  100 5295k    0     0   630k      0  0:00:08  0:00:08 --:--:-- 1061k
Found 40923 paragraphs
Processing 540 chunks
Time taken for segment(): 521.071603ms
Total sentences: 153736

Accuracy is measured on the Golden Rule Set (GRS) for English. List items are exempted (e.g., "1. sentence 2. another sentence").

The following libraries are used for benchmarking:

| Tokenizer Library    | English Golden Rule Set score | Speed (avg over 100 runs) in seconds |
| -------------------- | ----------------------------- | ------------------------------------ |
| sentencex            | 74.36                         | 0.1357                               |
| mwtokenizer_tokenize | 30.77                         | 1.54                                 |
| blingfire_tokenize   | 89.74                         | 0.27                                 |
| nltk_tokenize        | 66.67                         | 1.86                                 |
| pysbd_tokenize       | 97.44                         | 10.57                                |
| spacy_tokenize       | 61.54                         | 2.45                                 |
| spacy_dep_tokenize   | 74.36                         | 138.93                               |
| stanza_tokenize      | 87.18                         | 107.51                               |
| syntok_tokenize      | 79.49                         | 4.72                                 |


License

MIT license. See License.txt