cldr-segmentation

v2.2.1

CLDR text segmentation for JavaScript

Downloads

21,390

cldr-segmentation

Text segmentation library for JavaScript.

What is this thing?

This library provides CLDR-based text segmentation capabilities in JavaScript. Text segmentation is the process of identifying word, sentence, and other boundaries in a text. The segmentation rules are published by the Unicode consortium as part of the Common Locale Data Repository, or CLDR, and made freely available to the public.

Why not just split on spaces or periods?

Good question. Most of the time, that'll probably work fine. However, it's not always obvious where words or sentences should start or end. Consider this sentence:

I like Mrs. Murphy. She's nice.

Splitting only on periods will give you ["I like Mrs. ", "Murphy. ", "She's nice."], which probably isn't what you wanted: the period after "Mrs" doesn't mark the end of the sentence.
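To see the problem for yourself, here is a minimal sketch of the naive approach in plain JavaScript. It does not use this library; naiveSentenceSplit is a hypothetical helper written only for illustration.

```javascript
// Naive approach: end a sentence at every period, keeping any
// whitespace that follows the period attached to that sentence.
// Text after the last period (if any) is ignored by this sketch.
function naiveSentenceSplit(text) {
  return text.match(/[^.]*\.\s*/g) || [];
}

const naiveResult = naiveSentenceSplit("I like Mrs. Murphy. She's nice.");
console.log(naiveResult);
// => ["I like Mrs. ", "Murphy. ", "She's nice."]
```

The abbreviation "Mrs." is indistinguishable from a sentence ending here, which is exactly the case the segmentation rules and suppressions are meant to handle.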

In addition, other languages use different segmentation rules than English. For example, identifying sentence boundaries in Japanese is a little more difficult because sentences tend to end with \u3002 (the ideographic full stop) rather than a period. The CLDR contains support for hundreds of languages, so you don't have to work out the rules for every language yourself when dealing with international text.
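As a rough illustration of the Japanese case, the sketch below uses only built-in string methods, not this library. The sample text is made up for the example.

```javascript
// Sample text: two Japanese sentences, each ending in U+3002
// (the ideographic full stop).
const japanese = "猫が好きです\u3002犬も好きです\u3002";

// Splitting on an ASCII period finds no boundaries at all.
const byPeriod = japanese.split(".");

// Splitting after U+3002 finds both sentences, but this rule is
// specific to this one character and says nothing about other scripts.
const byFullStop = japanese.match(/[^\u3002]*\u3002/g) || [];

console.log(byPeriod.length);   // 1
console.log(byFullStop.length); // 2
```

Hand-written rules like this pile up quickly across languages, which is the motivation for relying on the CLDR data instead.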

Examples

cldr-segmentation is published as both a UMD module and an ES6 module, meaning it should work in Node via require or import, and in the browser via a <script> tag. In the browser, use window.cldrSegmentation to access the library's functionality.

UMD module:

const cldrSegmentation = require("cldr-segmentation");

ES6 module:

import * as cldrSegmentation from 'cldr-segmentation'

Sentence Segmentation

cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.");
// => ["I like Mrs. ", "Murphy. ", "She's nice."]

You'll notice that Mrs. was treated as the end of a sentence. To avoid this, use the suppressions for the language you care about. Suppressions are essentially arrays of strings. Each string represents a series of characters after which there should not be a break. Using the English suppressions for the example above yields better results:

var supp = cldrSegmentation.suppressions.en;
cldrSegmentation.sentenceSplit("I like Mrs. Murphy. She's nice.", supp);
// => ["I like Mrs. Murphy. ", "She's nice."]
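To build some intuition for what a suppression list does, here is a conceptual sketch in plain JavaScript. It is not the library's actual algorithm (the real rules come from the CLDR); splitWithSuppressions is a hypothetical helper that breaks after periods unless the text so far ends with one of the suppressed strings.

```javascript
// Conceptual illustration only; not how cldr-segmentation is implemented.
function splitWithSuppressions(text, suppressions) {
  // Candidate pieces: text up to and including each period plus trailing
  // whitespace, and any leftover text after the last period.
  const pieces = text.match(/[^.]*\.\s*|[^.]+$/g) || [];
  const sentences = [];
  let pending = "";

  for (const piece of pieces) {
    pending += piece;
    const trimmed = pending.trimEnd();
    // A break is suppressed when the text so far ends with a listed string.
    const suppressed = suppressions.some((s) => trimmed.endsWith(s));
    if (!suppressed) {
      sentences.push(pending);
      pending = "";
    }
  }
  if (pending) sentences.push(pending);
  return sentences;
}

const suppressedResult = splitWithSuppressions(
  "I like Mrs. Murphy. She's nice.",
  ["Mrs.", "Dr."]
);
console.log(suppressedResult);
// => ["I like Mrs. Murphy. ", "She's nice."]
```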

If you'd like to iterate over each sentence instead of splitting, use a BreakIterator:

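// supp is the English suppression set from the previous example
var breakIter = new cldrSegmentation.BreakIterator(supp);
var str = "I like Mrs. Murphy. She's nice.";

breakIter.eachSentence(str, (sentence, start, stop) => {
  // called once per sentence, along with its start and stop positions
});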
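The sketch below shows one way a (segment, start, stop) callback can relate to the text. It is plain JavaScript with a hypothetical eachSegment helper, and it assumes that start and stop are offsets with stop exclusive, so that slicing the input between them gives back the segment. The README does not spell out the library's exact offset semantics, so treat that as an assumption.

```javascript
// Hypothetical helper: walk an array of adjacent segments and report
// each one with its [start, stop) offsets into the joined text.
function eachSegment(segments, callback) {
  let start = 0;
  for (const segment of segments) {
    const stop = start + segment.length;
    callback(segment, start, stop);
    start = stop;
  }
}

const sampleText = "I like Mrs. Murphy. She's nice.";
const sampleSentences = ["I like Mrs. Murphy. ", "She's nice."];
const seen = [];

eachSegment(sampleSentences, (sentence, start, stop) => {
  // Under the assumption above, slicing recovers the sentence.
  seen.push({ sentence, start, stop, matches: sampleText.slice(start, stop) === sentence });
});

console.log(seen);
```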

Suppressions for all languages are available via cldrSegmentation.suppressions.all.

Other Types of Segmentation

Word, line, and grapheme cluster segmentation are supported:

cldrSegmentation.wordSplit("I like Mrs. Murphy. She's nice.");
// => ["I", " ", "like", " ", "Mrs", ".", " ", "Murphy", ".", " ", "She's", " ", "nice", "."]

Also available are the lineSplit and graphemeSplit functions.
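Because word segmentation returns whitespace and punctuation as segments of their own, you will often want to filter the result. The sketch below works on a hand-written array shaped like the wordSplit output above, so it runs without the library; onlyWords is a hypothetical helper.

```javascript
// Hand-written sample in the shape of the wordSplit output shown above.
const wordSegments = [
  "I", " ", "like", " ", "Mrs", ".", " ", "Murphy", ".", " ",
  "She's", " ", "nice", "."
];

// Keep only segments that contain at least one letter or digit.
function onlyWords(segments) {
  return segments.filter((segment) => /[\p{L}\p{N}]/u.test(segment));
}

const words = onlyWords(wordSegments);
console.log(words);
// => ["I", "like", "Mrs", "Murphy", "She's", "nice"]

// Joining every segment back together reproduces the original text.
console.log(wordSegments.join("") === "I like Mrs. Murphy. She's nice."); // true
```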

When using a break iterator:

var breakIter = new cldrSegmentation.BreakIterator(supp);
var str = "I like Mrs. Murphy, she's nice.";

breakIter.eachWord(str, (word, start, stop) => {
  // do something
});

Also available are the eachLine and eachGraphemeCluster functions.

Custom Suppressions

Suppressions are just strings after which a break should not occur. This library comes with a set of common suppressions for a variety of languages, but you may want to add your own. Suppression objects can be merged. For example, here's how to add "Dr." to the set of English suppressions:

var customSupps = cldrSegmentation.Suppressions.create(['Dr.']);
var supps = cldrSegmentation.suppressions.en.merge(customSupps);
cldrSegmentation.sentenceSplit("We love Dr. Strange. He's cool.", supps);

Custom Suppression Objects

Suppression objects are just plain ol' JavaScript objects with a single shouldBreak function that returns a boolean. The function is passed a cursor object positioned at the index of the proposed break. Cursors deal exclusively with Unicode code points, meaning your custom suppression logic will need to be implemented in those terms. For example, let's create a custom suppression function that doesn't allow breaks after sentences that end with the letter 't'.

class TeeSuppression {
  shouldBreak(cursor) {
    var position = cursor.logicalPosition;
    var cp; // declared outside the loop so the while condition can see it

    // skip backwards past spaces (32) and periods (46)
    do {
      cp = cursor.getCodePoint(position);
      position --;
    } while (cp === 32 || cp === 46);

    // we skipped one too many in the loop
    position ++;

    // if the ending character is 't' (116), return false;
    // otherwise return true
    return cursor.getCodePoint(position) !== 116;
  }
}

Note that you don't have to use ES6 classes. It's equally valid to create a simple object:

let teeSuppression = {
  shouldBreak: (cursor) => {
    // logic here; return true to allow the break, false to suppress it
  }
};
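The numbers compared against in the TeeSuppression example (32, 46 and 116) are Unicode code points. The sketch below uses only built-in string methods to show where they come from, and why code points are not the same as JavaScript string indices once characters outside the Basic Multilingual Plane are involved. The sample string is made up for the example.

```javascript
// Code points used in the suppression example.
const SPACE = " ".codePointAt(0);   // 32
const PERIOD = ".".codePointAt(0);  // 46
const LOWER_T = "t".codePointAt(0); // 116

// Strings are indexed by UTF-16 code units, so a single emoji
// occupies two indices but is one code point.
const mixed = "a\u{1F600}b";
const codePoints = Array.from(mixed, (ch) => ch.codePointAt(0));

console.log(mixed.length);      // 4 (code units)
console.log(codePoints.length); // 3 (code points)
console.log(codePoints);        // [97, 128512, 98]
```

Naming the constants, as above, also makes custom shouldBreak logic easier to read than bare numeric literals.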

Running Tests

Tests are written in Jasmine and can be executed via jasmine-node:

  1. npm install -g jasmine-node
  2. jasmine-node spec

Authors

Written and maintained by Cameron C. Dutro (@camertron).

License

Copyright 2017 Cameron Dutro, licensed under the MIT license.