@jrc03c/js-text-tools (v0.0.106)

a few small tools for working with text in js

Introduction

js-text-tools is just a little collection of tools for modifying text. It has both programmatic and command line APIs.

Installation

To use in the command line:

git clone https://github.com/jrc03c/js-text-tools.js
cd js-text-tools
npm link

Or to use in Node or the browser:

npm install --save https://github.com/jrc03c/js-text-tools.js

Usage

In Node / bundlers:

const {
  camelify,
  collapseWhitespace,
  convertObjectToTypedArray,
  convertTypedArrayToObject,
  damerauLevenshteinDistance,
  EmailAddressStandardizer,
  extractEmailAddresses,
  fuzzyFind,
  fuzzyFindScore,
  getCharCounts,
  getChars,
  getCharSet,
  getIDFScore,
  getNGramCounts,
  getNGrams,
  getNGramSet,
  getStats,
  getTFIDFScore,
  getTFScore,
  indent,
  isEmailAddress,
  isNumberString,
  kebabify,
  levenshteinDistance,
  parse,
  pascalify,
  punctuation,
  removeDiacriticalMarks,
  screamify,
  screamingSnakeify,
  snakeify,
  spongeify,
  standardizeEmailAddress,
  StringCounter,
  stringify,
  strip,
  TextObject,
  TextStats,
  unindent,
  urlPathJoin,
  wrap,
} = require("@jrc03c/js-text-tools")

In the browser (standalone):

<script src="path/to/dist/js-text-tools.js"></script>
<script>
  // import functions individually
  const {
    camelify,
    collapseWhitespace,
    indent,
    // etc.
  } = JSTextTools

  // or dump everything into the global scope
  JSTextTools.dump()
</script>

In the command line (where all results are written out to stdout):

camelify "Hello, world!"
# helloWorld

kebabify "Hello, world!"
# hello-world

snakeify "Hello, world!"
# hello_world

# indent the lines of somefile.txt by two spaces
indent somefile.txt "  "

# unindent the lines of somefile.txt
unindent somefile.txt

# wrap the lines in somefile.txt at 80 characters and show the output in stdout
wrap somefile.txt

# wrap the lines in somefile.txt at 40 characters and save the wrapped text
# back into somefile.txt
wrap -m 40 -s somefile.txt

# wrap the lines in somefile.txt to 80 characters and save the wrapped text
# into a new file called somewrappedfile.txt
wrap -o somewrappedfile.txt somefile.txt

API

camelify(text)

Returns the text in camel-case.

camelify("Hello, world!")
// helloWorld

collapseWhitespace(text)

Returns the text with all whitespace reduced to single spaces and all leading and trailing whitespace removed.

collapseWhitespace("\n\t   \n\r \t Hello,    \t world!\n\n \t   ")
// Hello, world!

damerauLevenshteinDistance(text1, text2)

Returns the Damerau-Levenshtein distance between two strings. This metric differs from the traditional Levenshtein distance in that it considers transpositions between adjacent characters to be a single operation (as opposed to two substitution operations).

const a = "brat"
const b = "bart"

damerauLevenshteinDistance(a, b)
// 1

levenshteinDistance(a, b)
// 2

EmailAddressStandardizer

EmailAddressStandardizer(options) (constructor)

The constructor can accept an options object with any of these properties (corresponding to the instance properties described below):

  • shouldConvertDomainToPunycode
  • shouldRemoveDiacriticalMarksInUsername
  • shouldRemovePeriodsInUsername
  • shouldRemoveTagsInUsername

Properties

shouldConvertDomainToPunycode

A boolean indicating whether or not the domain name part of the email address (e.g., the "example.com" part of "someone@example.com") should be converted to its Punycode equivalent. The default is false.

For example:

const standardizer = new EmailAddressStandardizer()
console.log(standardizer.standardize("sömeoné@exåmple.com"))
// someone@exåmple.com

standardizer.shouldConvertDomainToPunycode = true
console.log(standardizer.standardize("sömeoné@exåmple.com"))
// someone@xn--exmple-jua.com

shouldRemoveDiacriticalMarksInUsername

A boolean indicating whether or not diacritical marks on characters in the username part of the email address (e.g., the "someone" in "someone@example.com") should be removed. The default is true.

For example:

const standardizer = new EmailAddressStandardizer()
console.log(standardizer.standardize("sömeoné@exåmple.com"))
// someone@exåmple.com

standardizer.shouldRemoveDiacriticalMarksInUsername = false
console.log(standardizer.standardize("sömeoné@exåmple.com"))
// sömeoné@exåmple.com

shouldRemovePeriodsInUsername

A boolean indicating whether or not periods in the username part of the email address (e.g., the "someone" in "someone@example.com") should be removed. The default is false.

For example:

const standardizer = new EmailAddressStandardizer()
console.log(standardizer.standardize("some.one@example.com"))
// some.one@example.com

standardizer.shouldRemovePeriodsInUsername = true
console.log(standardizer.standardize("some.one@example.com"))
// someone@example.com

shouldRemoveTagsInUsername

A boolean indicating whether or not tags in the username part of the email address (e.g., the "+whatevs" in "someone+whatevs@example.com") should be removed. The default is false.

For example:

const standardizer = new EmailAddressStandardizer()
console.log(standardizer.standardize("someone+whatevs@example.com"))
// someone+whatevs@example.com

standardizer.shouldRemoveTagsInUsername = true
console.log(standardizer.standardize("someone+whatevs@example.com"))
// someone@example.com

Methods

standardize(emailAddress)

Returns the standardized form of emailAddress.

transform(emailAddress)

Is an alias for the standardize method.

extractEmailAddresses(x)

Returns an array of email addresses present in x. Note that x can have any data type.

const x = "Hi! I'm Josh, and you can email me at josh@example.com any time!"
console.log(extractEmailAddresses(x))
// ["josh@example.com"]

const y = { name: "Josh", email: "josh@example.com" }
console.log(extractEmailAddresses(y))
// ["josh@example.com"]

fuzzyFind(query, docs, resultsCount=1, maxNGramLength=1)

Returns the best-scoring result(s) from fuzzyFindScore. (See that function's documentation below for more information about the returned results.) If resultsCount is greater than 1, then an array of results will be returned; otherwise only a single result (not in an array) will be returned.

const query = "cat"
const docs = ["I like bars", "I like cars", "I like cats"]

// return a single result:
console.log(fuzzyFind(query, docs))
// { matches: [ 'cats' ], score: 1, doc: 'I like cats' }

// return multiple results:
console.log(fuzzyFind(query, docs, 3))
// [
//   { matches: [ 'cats' ], score: 1, doc: 'I like cats' },
//   { matches: [ 'cars' ], score: 0.75, doc: 'I like cars' },
//   { matches: [ 'bars' ], score: 0.4375, doc: 'I like bars' }
// ]

fuzzyFindScore(query, docs, maxNGramLength=1, shouldOmitDocsFromResults=false)

Returns an array of objects (one per item in docs) with these properties:

  • doc = the string that was searched (included if shouldOmitDocsFromResults is not true)
  • matches = the part of the string (doc) that best matched query
  • score = how well the string matched the query (between [0, 1], with higher scores corresponding to better matches)

const query = "cat"
const docs = ["I like bars", "I like cars", "I like cats"]
console.log(fuzzyFindScore(query, docs))
// [
//   { matches: [ 'bars' ], score: 0.4375, doc: 'I like bars' },
//   { matches: [ 'cars' ], score: 0.75, doc: 'I like cars' },
//   { matches: [ 'cats' ], score: 1, doc: 'I like cats' }
// ]

Note that, by default, the function will not search over all of a document's (n > 1)-grams; it will only search over individual words. This means, for example, that if you search for "foo bar" and one document contains "foo bar" and another document contains "bar foo", then both of those documents will receive the same score.

const query = "foo bar"
const docs = ["foo bar", "bar foo"]
console.log(fuzzyFindScore(query, docs))
// [
//   { matches: [ 'foo', 'bar' ], score: 1 },
//   { matches: [ 'foo', 'bar' ], score: 1 }
// ]

That's because the function's default behavior is to take the mean over all of the comparison scores between all of the search terms and all of the terms in each document; and in the example above, both documents contain both search terms, and thus receive the same scores.

But the documents and their scores can be differentiated by passing in a higher maxNGramLength value. In the example above, if we pass in a maxNGramLength of 2 (meaning that not only will our search terms "foo" and "bar" be compared to the documents but also that the 2-gram "foo bar" will be compared with all of the 1- and 2-grams in all of the documents), the document containing the 2-gram "foo bar" will receive a better score than the document that merely contains the same words but in the wrong order.

const query = "foo bar"
const docs = ["foo bar", "bar foo"]
const maxNGramLength = 2
console.log(fuzzyFindScore(query, docs, maxNGramLength))
// [
//   { matches: [ 'foo', 'foo bar' ], score: 1 },
//   { matches: [ 'bar foo', 'bar' ], score: 0.891156462585034 }
// ]

The reason to avoid increasing the maxNGramLength value, though, is that it adds significantly more execution time.

getCharCounts(raw)

Returns a dictionary containing the number of times each character appears in raw.

const raw = "a ab abc abcd abcde"
console.log(getCharCounts(raw))
// { a: 5, ' ': 4, b: 4, c: 3, d: 2, e: 1 }

getChars(raw)

Returns the list of all characters in raw. Can include duplicates. (To get the list of unique characters in raw, use getCharSet.)

const raw = "a ab abc abcd abcde"
console.log(getChars(raw))
// [
//   'a', ' ', 'a', 'b', ' ',
//   'a', 'b', 'c', ' ', 'a',
//   'b', 'c', 'd', ' ', 'a',
//   'b', 'c', 'd', 'e'
// ]

getCharSet(raw)

Returns the list of unique characters in raw. (To get the full list of characters that potentially includes duplicates, use getChars.)

const raw = "a ab abc abcd abcde"
console.log(getCharSet(raw))
// [ 'a', ' ', 'b', 'c', 'd', 'e' ]

getIDFScore(term, allDocStats)

Returns the inverse document frequency (IDF) score of a string called term given an array of TextStats instances (i.e., the things returned from the getStats function).

import { getIDFScore, getStats } from "@jrc03c/js-text-tools"
import fs from "node:fs"
import path from "node:path"

const dir = "path/to/files"
const files = fs.readdirSync(dir)
const docs = files.map(f => fs.readFileSync(path.join(dir, f), "utf8"))
const maxNGramLength = 1
const allDocStats = docs.map(d => getStats(d, maxNGramLength))
console.log(getIDFScore("hello", allDocStats))

Note that the formula used to compute the inverse document frequency score is my own variant and is kind of a hodge-podge of the formulae for computing inverse document frequency on Wikipedia. Using Wikipedia's notation, my formula is something like:

$\text{idf}(t, D) = 1 - (\dfrac{n_t}{N})^{0.25}$

Where:

  • $t$ = the term
  • $D$ = the corpus of documents
  • $N$ = the number of documents in $D$
  • $n_t$ = the number of documents in $D$ in which $t$ appears

The traditional inverse document frequency formula is $-\log(\dfrac{n_t}{N})$, which grows without bound as $\dfrac{n_t}{N}$ approaches 0.

I suppose I've always found it a bit odd that the traditional IDF score was allowed to be arbitrarily large. Maybe there's good reason for it. But my formula returns a maximum possible value of 1.

getNGramCounts(raw, maxNGramLength=Infinity)

Returns a dictionary containing the number of times each n-gram appears in raw.

const raw = "a a b a b c a b c d a b c d e"
console.log(getNGramCounts(raw, 3))
// {
//   a: 5,
//   'a a': 1,
//   'a a b': 1,
//   'a b': 4,
//   'a b a': 1,
//   b: 4,
//   'b a': 1,
//   'b a b': 1,
//   'a b c': 3,
//   'b c': 3,
//   'b c a': 1,
//   c: 3,
//   'c a': 1,
//   'c a b': 1,
//   'b c d': 2,
//   'c d': 2,
//   'c d a': 1,
//   d: 2,
//   'd a': 1,
//   'd a b': 1,
//   'c d e': 1,
//   'd e': 1,
//   e: 1
// }

getNGrams(raw, maxNGramLength=Infinity)

Gets the list of n-grams in raw. Is allowed to include duplicates. (To get the set of unique n-grams, use getNGramSet.)

NOTE: Be aware that this function doesn't do any cleaning of the text, which means that (e.g.) "Hello" is treated as a different n-gram than "Hello!".

const raw = "It was the best of times..."
console.log(getNGrams(raw))
// [
//   'It',
//   'It was',
//   'It was the',
//   'It was the best',
//   'It was the best of',
//   'It was the best of times...',
//   'was',
//   'was the',
//   'was the best',
//   'was the best of',
//   'was the best of times...',
//   'the',
//   'the best',
//   'the best of',
//   'the best of times...',
//   'best',
//   'best of',
//   'best of times...',
//   'of',
//   'of times...',
//   'times...'
// ]

console.log(getNGrams(raw, 3))
// [
//   'It',          'It was',
//   'It was the',  'was',
//   'was the',     'was the best',
//   'the',         'the best',
//   'the best of', 'best',
//   'best of',     'best of times...',
//   'of',          'of times...',
//   'times...'
// ]

getNGramSet(raw, maxNGramLength=Infinity)

Returns the set of unique n-grams in raw. (To get the full list of n-grams that potentially includes duplicates, use getNGrams.)

NOTE: Be aware that this function doesn't do any cleaning of the text, which means that (e.g.) "Hello" is treated as a different n-gram than "Hello!".

const raw = "i came i saw i conquered"
console.log(getNGramSet(raw))
// [
//   'i',
//   'i came',
//   'i came i',
//   'i came i saw',
//   'i came i saw i',
//   'i came i saw i conquered',
//   'came',
//   'came i',
//   'came i saw',
//   'came i saw i',
//   'came i saw i conquered',
//   'i saw',
//   'i saw i',
//   'i saw i conquered',
//   'saw',
//   'saw i',
//   'saw i conquered',
//   'i conquered',
//   'conquered'
// ]

console.log(getNGramSet(raw, 1))
// [ 'i', 'came', 'saw', 'conquered' ]

getStats(raw, maxNGramLength=Infinity)

Returns a TextStats instance.

const raw = "It was the best of times..."
console.log(getStats(raw))
// {
//   charCounts: { ... },
//   chars: [ ... ],
//   charSet: [ ... ],
//   leastFrequentChars: [ ... ],
//   leastFrequentNGrams: [ ... ],
//   mostFrequentChars: [ ... ],
//   mostFrequentNGrams: [ ... ],
//   nGramCounts: { ... },
//   nGrams: [ ... ],
//   nGramSet: [ ... ],
// }

getTFScore(term, docStats)

Returns the term frequency (TF) score of a string called term given a TextStats instance (i.e., a thing returned from the getStats function).

const doc = fs.readFileSync("path/to/some-file.txt", "utf8")
const maxNGramLength = 1
const docStats = getStats(doc, maxNGramLength)
console.log(getTFScore("hello", docStats))

Note that the formula used to compute the text frequency score is my own variant and is kind of a hodge-podge of the formulae for computing text frequency on Wikipedia. Using Wikipedia's notation, my formula is something like:

$\text{tf}(t, d) = \left(\dfrac{f_{t,d}}{\max_{t' \in d} f_{t',d}}\right)^{0.25}$

Where:

  • $t$ = the term
  • $d$ = the document
  • $f_{t,d}$ = the number of times the term appears in the document
  • $\max_{t' \in d} f_{t',d}$ = the maximum number of times any term appears in the document

So, it's sort of like the traditional term frequency score fraction but modified to use the augmented term frequency (as in the double normalization $K$ variant) and then taken to the power of 0.25.

Traditionally, term frequency and augmented term frequency scores are linear, meaning that the scores approach 1 as quickly as the frequency of the term in the document approaches 100%. But my variant takes the augmented frequency to the 0.25 power so that each additional appearance of a term yields less of a boost to the score than the one before it.

getTFIDFScore(term, docStats, allDocStats)

Returns the TF-IDF score of a string called term relative to a particular TextStats instance given an array of TextStats instances. (TextStats instances are the things returned from the getStats function.) It's just computed as the product of the results from the getTFScore and getIDFScore functions.

import { getTFIDFScore, getStats } from "@jrc03c/js-text-tools"
import fs from "node:fs"
import path from "node:path"

const dir = "path/to/files"
const files = fs.readdirSync(dir)
const docs = files.map(f => fs.readFileSync(path.join(dir, f), "utf8"))
const maxNGramLength = 1
const allDocStats = docs.map(d => getStats(d, maxNGramLength))
console.log(getTFIDFScore("hello", allDocStats[0], allDocStats))

Please see the documentation for the getTFScore and getIDFScore functions to learn more about their inner workings. Here, though, I'll just say that I wanted a couple extra criteria not found in the TF and IDF formulae on Wikipedia:

  • I wanted the TF and IDF formulae to be as analogous to one another as possible. Unless you use the logarithmic variants, no TF formula seems conceptually similar to any IDF formula except for the fact that most of them are built on frequency percentages.
  • I wanted them to return values in the range [0, 1].

So, I ended up with the two formulae above, whose score curves are vertically symmetrical (mirror images of one another).

indent(text, chars="")

Returns the text with all lines indented by chars. By default, chars is an empty string.

indent("Hello, world!", "\t\t")
// \t\tHello, world!

isEmailAddress(text)

Returns a boolean indicating whether or not the given value is an email address. Note that the function neither trims whitespace nor performs any other kind of processing of the given value before evaluating whether or not it is an email address.

console.log(isEmailAddress("someone@example.com"))
// true

console.log(isEmailAddress("   someone@example.com   "))
// false

console.log(isEmailAddress("Hello, world!"))
// false

console.log(isEmailAddress(true))
// false

isNumberString(text)

Returns a boolean indicating whether or not the given value is a number in string form (e.g., "23.45"). Forms recognized as number strings include:

  • positive and negative integers (e.g., "234", "+234", and "-234")
  • positive and negative BigInt values (e.g., "234n", "+234n", and "-234n")
  • positive and negative floats (e.g., "23.45", "+23.45", and "-23.45")
  • scientific notations (e.g., "23.45e67", "+23.45e+67", and "-23.45e-67")
  • positive and negative infinity (e.g., "Infinity", "+Infinity", "-Infinity", "∞", "+∞", "-∞")
  • not-a-number values (e.g., "NaN")

kebabify(text)

Returns the text in kebab-case.

kebabify("Hello, world!")
// hello-world

levenshteinDistance(text1, text2)

Returns the Levenshtein distance between two strings.

levenshteinDistance("cat", "hat")
// 1

parse(text)

Returns the value represented by the string text; it's the inverse of the stringify function described below. For security reasons, function strings are not parsed (i.e., they are never evaluated back into functions).

pascalify(text)

Returns the text in Pascal case.

pascalify("Hello, world!")
// HelloWorld

removeDiacriticalMarks(text)

Returns a version of the text in which characters containing diacritical marks are replaced with characters that do not contain diacritical marks.

removeDiacriticalMarks("Å")
// A

removeDiacriticalMarks("ü")
// u

removeDiacriticalMarks("ß")
// ß
// (not changed)

screamify(text)

Returns the text in screaming snake case.

screamify("Hello, world!")
// HELLO_WORLD

screamingSnakeify(text)

Identical to screamify.

snakeify(text)

Returns the text in snake case.

snakeify("Hello, world!")
// hello_world

spongeify(text)

Returns the text in "spongecase" (AKA "SpongeBob case" or "alternating caps").

spongeify("It was the best of times...")
// It WaS tHe BeSt Of TiMeS...

standardizeEmailAddress(emailAddress, options)

A standalone function that is equivalent to doing this:

new EmailAddressStandardizer(options).standardize(emailAddress)

StringCounter (class)

A little utility class to help with string counting.

const counter = new StringCounter()
const raw = "aaaaabbbbcccdde"

for (const char of raw.split("")) {
  counter.increment(char)
}

console.log(counter.counts)
// { a: 5, b: 4, c: 3, d: 2, e: 1 }

console.log(counter.countsSorted)
// [
//   { count: 1, value: 'e' },
//   { count: 2, value: 'd' },
//   { count: 3, value: 'c' },
//   { count: 4, value: 'b' },
//   { count: 5, value: 'a' }
// ]

console.log(counter.leastFrequentValues)
// [ 'e' ]

console.log(counter.mostFrequentValues)
// [ 'a' ]

console.log(counter.getCount("a"))
// 5

StringCounter(data) (constructor)

The data object can include these properties (all of which are optional):

  • counts = corresponds to the counts property (described below)

Properties

counts

A dictionary that maps a string to the number of times that string has been counted.

countsSorted (getter)

An array of objects (each with count and value properties) sorted by count.

leastFrequentValues (getter)

An array of values with the lowest count.

mostFrequentValues (getter)

An array of values with the highest count.

values (getter)

An array of all the values that have been counted.

Methods

getCount(value)

Returns the number of times value has been counted.

increment(value)

Increases the number of times value has been counted by 1 and then returns the new count.

stringify(value, [indentation])

Returns value converted to a string. If a string is passed as indentation, then that string is used to indent each line: passing "  " (two spaces) will use two spaces for each indentation level of each line, and passing "\t" will use a tab for each indentation level of each line. If no value or an empty string is passed as indentation, then items in lists and key-value pairs in objects won't be placed on new lines and indented. In that way, its functionality is somewhat similar to JSON.stringify.

This function automatically handles cyclic references by replacing each cyclic reference with the string <reference to "/some/path"> where "/some/path" represents the path down through the root object to the original referent. Consider this object:

const myObj = {
  this: {
    is: {
      deeply: {
        nested: "yep!",
      },
    },
  },
}

We could add a circular reference to it:

myObj.this.is.deeply.circular = myObj.this.is

Now, when we inspect the object, we see:

const util = require("util")
console.log(util.inspect(myObj, { depth: null, colors: true }))
// {
//   this: {
//     is: <ref *1> {
//       deeply: { nested: 'yep!', circular: [Circular *1] }
//     }
//   }
// }

Since the circular reference points to myObj.this.is, the stringify function will replace the circular reference with "<reference to \"/this/is\">":

const { stringify } = require("@jrc03c/js-text-tools")
console.log(stringify(myObj, null, 2))
// {
//   "this": {
//     "is": {
//       "deeply": {
//         "nested": "yep!",
//         "circular": "<reference to \"/this/is\">"
//       }
//     }
//   }
// }

The gist is that the value to be stringified is first copied in such a way that cyclic references are replaced with string descriptions, and then the safe copy is actually what gets stringified.

Finally, note that the built-in typed arrays (e.g., Float64Array) are stringified in a special way: they're converted to objects and then stringified. The objects to which they're converted have these properties:

  • constructor = A string representing the name of the class to which the array belongs (e.g., a Float64Array would have a constructor value of "Float64Array").
  • flag = The string "FLAG_TYPED_ARRAY".
  • values = A new, non-typed array containing the values from the original typed array.

The reason for this additional stringification step is that typed arrays can't be stringified by JSON.stringify and then reinstantiated automatically in their original type by JSON.parse. So, the stringify and parse functions in this library are designed to handle those and a few other edge cases — though they otherwise function mostly like JSON.stringify and JSON.parse.

strip(text, options={})

Returns a lower-cased version of the text with all punctuation removed and all whitespace collapsed. The options object is optional and can include these properties:

  • exclude = a string of characters, a regular expression, a function, or an array of those things that indicate which characters should not be present in the return value; the default is a string containing all punctuation marks
  • include = a string of characters, a regular expression, a function, or an array of those things that indicate which characters should be present in the return value; the default is undefined
  • shouldPreserveCase = a boolean indicating whether or not character case (upper or lower) should be preserved; the default is false
  • shouldPreserveDiacriticalMarks = a boolean indicating whether or not diacritical marks should be preserved; the default is true
  • shouldPreserveWhitespace = a boolean indicating whether or not non-single-space whitespace characters (e.g., newlines, carriage returns, tabs, etc.) should be preserved; the default is false

NOTE: The exclude and include properties are mutually exclusive and cannot be used at the same time!

strip("Hello, world!")
// hello world

strip("Hello, world!", { shouldPreserveCase: true })
// Hello world

TextObject (class)

This is a helper class that can be used to give structure to otherwise unstructured text. It's intentionally minimalistic with the goal of providing maximum flexibility. The basic idea is that virtually every chunk of text:

  • can be thought of as having children (e.g., a paragraph's children are its sentences, a word's children are its characters, etc.)
  • has an identifier (e.g., a chapter has a title, a book has an ISBN number, etc.)
  • can have references to other chunks of text (e.g., footnotes, endnotes, asides, etc.)
  • can combine its children by use of a separator to form a total value (e.g., a chapter whose children are paragraphs can join those paragraphs with newlines to form its total value, a word whose children are characters can join those characters with empty strings to form its total value, etc.)

An example use case might be taking a book in plaintext from gutenberg.org, breaking it into parts, chapters, sections, paragraphs, sentences, and words; and then using this class to encode all of that structure and store it on disk.

TextObject(data) (constructor)

The data options object can have any or all of the properties described below. The only exception is that children and value properties are mutually exclusive, and a children property will be preferred over a value property if both are present.

Properties

children

An array of strings and/or TextObject instances.

id

A string.

references

An array of strings and/or TextObject instances.

separator

A string.

value (getter / setter)

A string made of the instance's children joined by its separator.

Methods

toObject()

Returns a plain JS object with the same basic structure as the instance but ready to be stringified and written to disk.

TextStats (class)

TextStats(data) (constructor)

The data options object can have any or all of the properties described below.

Properties

charCounts

A dictionary containing the number of times each character appears in a given string. Equivalent to the value returned from getCharCounts.

chars

A list of all of the characters in a given string. Equivalent to the value returned from getChars.

charSet

A list of all unique characters in a given string. Equivalent to the value returned from getCharSet.

leastFrequentChars

A list of characters that appear least often in a given string.

leastFrequentNGrams

A list of n-grams that appear least often in a given string.

mostFrequentChars

A list of characters that appear most often in a given string.

mostFrequentNGrams

A list of n-grams that appear most often in a given string.

nGramCounts

A dictionary containing the number of times each n-gram appears in a given string. Equivalent to the value returned from getNGramCounts.

nGrams

A list of all of the n-grams in a given string. Equivalent to the value returned from getNGrams.

nGramSet

A list of all unique n-grams in a given string. Equivalent to the value returned from getNGramSet.

Methods

compute(raw, maxNGramLength) (static)

Returns a TextStats instance.

unindent(text)

Returns the text with all lines unindented by the same number of characters. For example, if the smallest amount of indentation is 4 spaces, then each line will be unindented by 4 spaces.

For example, suppose I have a file called message.txt with this content:

    Hello, world!
      My name is Josh.
        What's your name?

The smallest amount of indentation in the file is 4 spaces. So, unindenting it will move all lines to the left by 4 spaces.

const { unindent } = require("@jrc03c/js-text-tools")
const fs = require("fs")
const message = fs.readFileSync("message.txt", "utf8")
const unindentedMessage = unindent(message)
fs.writeFileSync("unindented-message.txt", unindentedMessage, "utf8")

The contents of unindented-message.txt would be:

Hello, world!
  My name is Josh.
    What's your name?

NOTE: The unindent function does not pay attention to whether indentation consists of spaces or tabs. It only cares whether or not a character is a whitespace character. It also makes no attempt to make the whitespace characters consistent (i.e., it doesn't try to begin each line with all spaces or all tabs); it merely removes the minimum number of whitespace characters from each line and returns the result.

urlPathJoin(a, b, c, ...)

Returns parts of a URL joined by forward-slashes. Collapses multiple forward-slashes but preserves the colon-double-forward-slash (://) that indicates a protocol.

urlPathJoin("foo", "bar", "baz")
// "foo/bar/baz"

urlPathJoin("https://example.com", "path", "to", "image.png")
// "https://example.com/path/to/image.png"

urlPathJoin("///foo///", "//bar///", "/////baz//////")
// "/foo/bar/baz/"

urlPathJoin("path", "to", "image.png", "https://example.com")
// "path/to/image.png/https://example.com"

As illustrated by that last example, this function isn't very intelligent and isn't concerned with whether or not the value it returns is a valid URL. Its purpose is simply to concatenate strings with forward-slashes while removing duplicate forward-slashes where possible.

wrap(text, maxLineLength=80, wrappedLinePrefix="")

Returns the text with all lines wrapped to a maximum length of maxLineLength. By default, the maxLineLength is 80 in the browser or the minimum of 80 and the number of stdout columns in the command line. Note that this function only wraps at spaces; it does not wrap mid-word, and it does not attempt to hyphenate words. The wrapping does preserve indentation, though. Wrapped lines can optionally be prefixed with a specific string value.

const text =
  "Lorem ipsum dolor sit amet, consectetur adipiscing elit. Nam mollis tellus eu mi condimentum, a congue ipsum luctus. Donec vel suscipit dolor, vitae faucibus massa. Curabitur rhoncus semper tortor et mattis. Nullam laoreet lobortis nibh eget viverra. Nam molestie risus vitae ante placerat convallis. Pellentesque quis tristique dui. Vivamus efficitur mi erat, nec gravida felis posuere at. Donec sapien ipsum, viverra et aliquam quis, posuere ac ligula. Aenean egestas tincidunt mauris, in hendrerit tortor malesuada id. Proin viverra sodales ex eu fermentum. Aenean nisl ipsum, tristique venenatis massa eget, tempor facilisis felis. Praesent aliquam sem vitae arcu porta commodo. Aliquam tempor sollicitudin dapibus. Nulla ullamcorper orci eu ultricies cursus."

wrap(text, 20, ">> ")

/*
Lorem ipsum dolor
>> sit amet,
>> consectetur
>> adipiscing elit.
>> Nam mollis
>> tellus eu mi
>> condimentum, a
>> congue ipsum
>> luctus. Donec
>> vel suscipit
>> dolor, vitae
>> faucibus massa.
>> Curabitur
>> rhoncus semper
>> tortor et
>> mattis. Nullam
>> laoreet lobortis
>> nibh eget
>> viverra. Nam
>> molestie risus
>> vitae ante
>> placerat
>> convallis.
>> Pellentesque
>> quis tristique
>> dui. Vivamus
>> efficitur mi
>> erat, nec
>> gravida felis
>> posuere at.
>> Donec sapien
>> ipsum, viverra
>> et aliquam quis,
>> posuere ac
>> ligula. Aenean
>> egestas
>> tincidunt
>> mauris, in
>> hendrerit tortor
>> malesuada id.
>> Proin viverra
>> sodales ex eu
>> fermentum.
>> Aenean nisl
>> ipsum, tristique
>> venenatis massa
>> eget, tempor
>> facilisis felis.
>> Praesent aliquam
>> sem vitae arcu
>> porta commodo.
>> Aliquam tempor
>> sollicitudin
>> dapibus. Nulla
>> ullamcorper orci
>> eu ultricies
>> cursus.
*/