@twocaretcat/tally-ts

v2.0.0

Published

2 months ago

A TypeScript word counting library. Count the number of characters, words, sentences, paragraphs, and lines in your text instantly with tally-ts.

twocaretcat

character counter word counter sentence counter paragraph counter line counter text analysis text analyzer text statistics library typescript

👋 About

[!NOTE] We use the terms graphemes and characters interchangeably in this README, although technically we are counting Unicode grapheme clusters rather than Unicode characters.
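To illustrate the distinction in the note above, here is a minimal sketch that compares three ways of measuring the same string. It uses only the built-in `Intl.Segmenter`; the helper name `countGraphemeClusters` is made up for this example and is not part of tally-ts.

```typescript
// Hypothetical helper (not part of tally-ts): counts grapheme clusters
// using the runtime's built-in Intl.Segmenter.
function countGraphemeClusters(text: string, locale: string = 'en'): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  return Array.from(segmenter.segment(text)).length;
}

// "e" followed by a combining acute accent, rendered as a single "é"
const accented = 'e\u0301';
// Two regional indicator symbols, rendered as a single flag
const flag = '\u{1F1E8}\u{1F1E6}';

console.log(accented.length);                 // UTF-16 code units: 2
console.log(Array.from(accented).length);     // Unicode code points: 2
console.log(countGraphemeClusters(accented)); // grapheme clusters: 1

console.log(flag.length);                     // UTF-16 code units: 4
console.log(Array.from(flag).length);         // Unicode code points: 2
console.log(countGraphemeClusters(flag));     // grapheme clusters: 1
```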

tally-ts is a TypeScript library that uses modern APIs like Intl.Segmenter to count the number of characters, words, sentences, paragraphs, and lines in the input. It can also show breakdowns for different types of characters like letters, digits, spaces, punctuation, and symbols/special characters.

Features

🧮 View text metrics: Count the number of characters, words, sentences, paragraphs, and lines in your text.
📊 View character composition: View the number of spaces, digits, letters, punctuation, and symbols/special characters in the input.
🌍 Multilingual support: Uses Intl.Segmenter for accurate word and character segmentation across many languages and scripts.
👨🏻‍💻 Open-source: Know how to code? Help make tally-ts better by contributing to the project on GitHub, or copy it and make your own version!

Use Cases

📚 Students & Educators: Check essay lengths and assignment limits quickly and accurately.
✍️ Writers & Bloggers: Track writing progress and optimize structure for readability.
📄 Legal & Business Professionals: Ensure documents meet required character or word counts.
📱 Social Media Managers: Stay within platform limits for tweets, posts, and bios.
🧪 Developers & Testers: Analyze input strings and view line counts for code and data.
🌐 SEO Specialists: Optimize content length for meta descriptions, headings, and body text.

📦 Installation

[!TIP] JSR has some advantages if you're using TypeScript or Deno:
It ships typed, modern ESM code by default
No need for separate type declarations
Faster, leaner installs without extraneous files
You can use JSR with your favorite package manager.

This package is available on both JSR and npm. Install it using your preferred package manager:

deno add jsr:@twocaretcat/tally-ts     # JSR (recommended)

deno add npm:@twocaretcat/tally-ts     # npm

bunx jsr add @twocaretcat/tally-ts     # JSR

bun add @twocaretcat/tally-ts          # npm

npx jsr add @twocaretcat/tally-ts      # JSR

npm install @twocaretcat/tally-ts      # npm

pnpm i jsr:@twocaretcat/tally-ts       # JSR

pnpm add @twocaretcat/tally-ts         # npm

yarn add jsr:@twocaretcat/tally-ts     # JSR

yarn add @twocaretcat/tally-ts         # npm

vlt install jsr:@twocaretcat/tally-ts  # JSR

vlt install @twocaretcat/tally-ts      # npm

🕹️ Usage

[!WARNING] Some Caveats:
This library relies on the Intl.Segmenter API (or a compatible replacement) to split the input into graphemes, words, and sentences. Thus, the exact behavior and reproducibility of output counts depend on the JavaScript runtime used. Results may vary between browsers, Node versions, or polyfills.
There may be slight variations between the counts generated by tally-ts and other libraries due to differences in how they are implemented.
Languages like Chinese that do not have clearly defined words may have inaccurate word counts due to the segmentation algorithm used. If you need consistent or linguistically precise segmentation for these languages, use a dedicated tool instead. For Chinese, see Jieba, Stanford Segmenter, or pkuseg.
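Because results depend on the runtime, it can help to check what the current environment offers before relying on it. The following is a rough sketch that uses only standard `Intl` APIs; the function names are hypothetical and tally-ts may perform its own checks differently.

```typescript
// Hypothetical helper: true when the runtime exposes a native Intl.Segmenter.
function hasNativeSegmenter(): boolean {
  return typeof Intl !== 'undefined' && typeof Intl.Segmenter === 'function';
}

// Hypothetical helper: returns the subset of the requested locales that the
// native segmenter reports as supported (empty if there is no segmenter).
function supportedSegmenterLocales(locales: string[]): string[] {
  return hasNativeSegmenter() ? Intl.Segmenter.supportedLocalesOf(locales) : [];
}

console.log(hasNativeSegmenter());
console.log(supportedSegmenterLocales(['en', 'zh', 'th']));
```

The output is intentionally not shown because it varies between browsers, Node versions, and ICU builds.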

Getting Started

To get started, import the Tally class and create a new instance of it. I recommend setting the locale like so:

import { Tally } from '@twocaretcat/tally-ts';

const tally = new Tally({ locales: 'en' });

Counting Sentences & Words

Use individual methods to get counts for sentences and words:

tally.countWords('How are you?');
// → { total: 3 }

tally.countSentences('¿Cómo estás?');
// → { total: 1 }

Counting Graphemes

You can get the number of graphemes (characters) the same way:

tally.countGraphemes('Hello world!');
// → {
//     total: 12,
//     by: {
//       spaces: { total: 1 },
//       letters: { total: 10 },
//       digits: { total: 0 },
//       punctuation: { total: 1 },
//       symbols: { total: 0 },
//     },
//     related: {
//       paragraphs: { total: 1 },
//       lines: { total: 1 },
//     }
//   }

This method has some extra features. You can access breakdown counts of the graphemes by type:

const result = tally.countGraphemes('Hi there!');

console.debug(result.by);
// → {
//     spaces: { total: 1 },
//     letters: { total: 7 },
//     digits: { total: 0 },
//     punctuation: { total: 1 },
//     symbols: { total: 0 }
//   }

You can also access related counts that were computed at the same time:

console.debug(result.related);
// → {
//     paragraphs: { total: 1 },
//     lines: { total: 1 }
//   }

Kitchen Sink

To get all counts at once, use the countAll() method:

const all = tally.countAll(`Hello world!\n\nThis is a test.`);

console.debug(all);
/* →
{
  graphemes: {
    total: 27,
    by: {
      spaces: { total: 4 },
      letters: { total: 21 },
      digits: { total: 0 },
      punctuation: { total: 2 },
      symbols: { total: 0 },
    },
    related: {
      paragraphs: { total: 2 },
      lines: { total: 3 },
    }
  },
  words: { total: 6 },
  sentences: { total: 2 },
  paragraphs: { total: 2 },
  lines: { total: 3 }
}
*/

🤖 Advanced Usage

Setting a Locale

You can pass a locale (or an array of locales) via the locales option. This value is forwarded directly to Intl.Segmenter and determines how the input string is split into graphemes, words, and sentences:

// Single locale
new Tally({ locales: 'en' });

// Multiple locales (preference order)
new Tally({ locales: ['fr-CA', 'fr'] });

If locales is not provided, Intl.Segmenter will resolve the runtime's best locale automatically.
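As a rough illustration of what forwarding the option means, the sketch below passes a `locales` value straight to `Intl.Segmenter` at each granularity. The `createSegmenters` function is an assumption made for this example and is not taken from the tally-ts source.

```typescript
// Hypothetical helper (not tally-ts source): builds one segmenter per
// granularity, forwarding the locales value unchanged.
function createSegmenters(locales?: string | string[]) {
  return {
    grapheme: new Intl.Segmenter(locales, { granularity: 'grapheme' }),
    word: new Intl.Segmenter(locales, { granularity: 'word' }),
    sentence: new Intl.Segmenter(locales, { granularity: 'sentence' }),
  };
}

// Preference order: Canadian French first, then generic French
const segmenters = createSegmenters(['fr-CA', 'fr']);

const frenchWords = Array.from(segmenters.word.segment('Bonjour tout le monde'))
  .filter((part) => part.isWordLike)
  .map((part) => part.segment);

console.log(frenchWords);
// → [ 'Bonjour', 'tout', 'le', 'monde' ]
```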

Getting the Resolved Locale

[!NOTE] Even if you provide a locale, the resolved locale may be different if Intl.Segmenter doesn't support the one you've provided. In this case, another locale may be picked automatically.

If you didn't provide a locale, you might want to know which locale was actually used by Intl.Segmenter. You can get it like so:

const tally = new Tally();

console.debug(tally.getResolvedLocale());
// → "en-US"
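For comparison, the standard way to ask `Intl.Segmenter` itself which locale it settled on is `resolvedOptions().locale`. The sketch below assumes nothing about how `getResolvedLocale()` is implemented; the helper name is hypothetical.

```typescript
// Hypothetical helper: asks a native segmenter which locale it resolved to.
function resolveSegmenterLocale(locales?: string | string[]): string {
  const segmenter = new Intl.Segmenter(locales, { granularity: 'word' });
  return segmenter.resolvedOptions().locale;
}

// With an explicit request, the result is the best supported match
console.log(resolveSegmenterLocale(['fr-CA', 'fr']));

// With no request, the runtime's default locale is used
console.log(resolveSegmenterLocale());
```

The exact strings printed depend on the runtime and its locale data, so no expected output is shown.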

Using a Custom `Segmenter` Implementation

If your environment doesn't support Intl.Segmenter (or the exact locale you want to use), you can provide a custom implementation or polyfill instead:

new Tally({ Segmenter: SomeSegmenter });

This is also useful if you want to get consistent results across different runtimes. If you don't provide a segmenter, we will try to use the native Intl.Segmenter implementation.

Internally, we will call the constructor of Segmenter to create segmenters of different granularities.
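As a sketch of what such a replacement might look like, the class below wraps the native segmenter and pins it to a single locale so that every granularity uses the same rules. It assumes that tally-ts only needs a constructor with the same `(locales, options)` signature as `Intl.Segmenter`; the class name `PinnedLocaleSegmenter` is hypothetical, and a real polyfill would supply its own segmentation logic.

```typescript
// Hypothetical example class: behaves like Intl.Segmenter but ignores the
// requested locale and always uses a fixed one.
const PINNED_LOCALE = 'en';

class PinnedLocaleSegmenter extends Intl.Segmenter {
  constructor(_locales?: string | string[], options?: Intl.SegmenterOptions) {
    // Forward the granularity, but replace the locale with the pinned one
    super(PINNED_LOCALE, options);
  }
}

const pinned = new PinnedLocaleSegmenter('ja', { granularity: 'word' });

console.log(pinned.resolvedOptions().locale);
// → "en"

const pinnedWordCount = Array.from(pinned.segment('Hello world'))
  .filter((part) => part.isWordLike).length;

console.log(pinnedWordCount);
// → 2
```

Under the assumption above, a class like this could be passed as the `Segmenter` option in place of `SomeSegmenter`.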

⚠️ Usage (legacy)

[!WARNING] Deprecated: The legacy implementation is no longer maintained and it has limited support for languages other than English. Use the class-based Tally API instead if possible.

The legacy implementation exposes a single function, getCounts(), that can be used to get the number of characters, words, sentences, paragraphs, lines, spaces, letters, digits, and symbols at once:

import { getCounts } from '@twocaretcat/tally-ts/legacy';

const counts = await getCounts(`Hello world!\n\nThis is a test.`);

console.debug(counts);
/* →
{
  characters: 27,
  words: 6,
  sentences: 2,
  paragraphs: 2,
  lines: 3,
  spaces: 4,
  letters: 21,
  digits: 0,
  symbols: 2
}
*/

You can provide an optional locale to improve segmentation accuracy for non-English text:

const counts = await getCounts(`Hello world!\n\nThis is a test.`, 'de-DE');

Note that this only affects the segmentation of characters. If your language doesn't use spaces to separate words or uses letters outside of the ASCII range, for example, you will still not get accurate results. For multilingual counting, use the class-based Tally API instead.

🧠 Implementation Details

[!NOTE] In this section, we refer to words, graphemes, spaces, lines, etc. as tokens for simplicity.

Here are some more details about how tally-ts does its magic.

Algorithm

The class-based implementation uses Intl.Segmenter for locale-aware text segmentation at three granularities:

grapheme with countGraphemes()
word with countWords()
sentence with countSentences()

Each segmenter operates independently, and the results are combined when using countAll().
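The word and sentence granularities can be sketched with the native API alone. The functions below are illustrative stand-ins, not the library's code. In particular, skipping whitespace-only sentence segments is an assumption about what "non-empty" means in practice.

```typescript
// Hypothetical stand-in for word counting: counts segments that the
// segmenter flags as word-like (punctuation and spaces are skipped).
function countWordLike(text: string, locale: string = 'en'): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'word' });
  let total = 0;
  for (const part of segmenter.segment(text)) {
    if (part.isWordLike) total += 1;
  }
  return total;
}

// Hypothetical stand-in for sentence counting: counts segments that contain
// something other than whitespace (an assumption made for this sketch).
function countSentenceSegments(text: string, locale: string = 'en'): number {
  const segmenter = new Intl.Segmenter(locale, { granularity: 'sentence' });
  let total = 0;
  for (const part of segmenter.segment(text)) {
    if (part.segment.trim().length > 0) total += 1;
  }
  return total;
}

console.log(countWordLike('How are you?'));
// → 3

console.log(countSentenceSegments('Hello world! This is a test.'));
// → 2
```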

The counting functions are implemented as single-pass parsers for performance reasons. Each grapheme in the input string is classified using Unicode General Categories (e.g., \p{L}, \p{Nd}, \p{Zs}), providing accurate results for all languages and scripts supported by the platform’s ICU data.
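A minimal sketch of this kind of classification is shown below. It is not the library's implementation: the names `classifyGrapheme` and `tallyKinds` are hypothetical, and classifying a cluster by its first code point is a simplifying assumption.

```typescript
type GraphemeKind = 'space' | 'letter' | 'digit' | 'punctuation' | 'symbol' | 'other';

// Classifies a grapheme cluster by the Unicode General Category of its
// first code point (a simplification chosen for this sketch).
function classifyGrapheme(grapheme: string): GraphemeKind {
  if (/^\p{Zs}/u.test(grapheme)) return 'space';
  if (/^\p{L}/u.test(grapheme)) return 'letter';
  if (/^\p{Nd}/u.test(grapheme)) return 'digit';
  if (/^\p{P}/u.test(grapheme)) return 'punctuation';
  if (/^\p{S}/u.test(grapheme)) return 'symbol';
  return 'other';
}

// Walks the input once, one grapheme cluster at a time, and tallies each kind.
function tallyKinds(text: string, locale: string = 'en'): Record<GraphemeKind, number> {
  const counts: Record<GraphemeKind, number> = {
    space: 0, letter: 0, digit: 0, punctuation: 0, symbol: 0, other: 0,
  };
  const segmenter = new Intl.Segmenter(locale, { granularity: 'grapheme' });
  for (const { segment } of segmenter.segment(text)) {
    counts[classifyGrapheme(segment)] += 1;
  }
  return counts;
}

console.log(tallyKinds('Hi there! 42 + 1'));
// → { space: 4, letter: 7, digit: 3, punctuation: 1, symbol: 1, other: 0 }
```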

Here’s how counts are determined for each token type:

| Count Type | Description |
| --- | --- |
| grapheme | A user-perceived character as defined by `Intl.Segmenter` with `granularity: "grapheme"`. Multi-codepoint characters (e.g., emojis, accented letters, combined scripts) are counted as one. Examples: `a`, `é`, `😊`, `👩‍🚀`, `貓`. |
| word | Counted using `Intl.Segmenter` with `granularity: "word"`. Each segment where `isWordLike` is `true` increments the word count. This is locale-aware and works for non-Latin scripts (e.g., Chinese, Arabic). Examples: `"Hello world"` → 2, `"你好世界"` → 1. |
| sentence | Counted using `Intl.Segmenter` with `granularity: "sentence"`. Each non-empty segment increments the sentence count. Works for punctuation and locale rules (e.g., handling `¿` and `！`). |
| space | A grapheme that matches the Unicode Space Separator category (`\p{Zs}`). Includes ordinary spaces and non-breaking spaces. Examples: `' '`, `\u00A0`. |
| letter | A grapheme in the Unicode Letter category (`\p{L}`). Includes characters from all alphabets. Examples: `A`, `ß`, `д`, `あ`, `م`. |
| digit | A grapheme in the Unicode Decimal Digit category (`\p{Nd}`). Works across scripts (e.g., Arabic-Indic, Devanagari). Examples: `0`, `९`, `٢`. |
| punctuation | A grapheme in the Unicode Punctuation category (`\p{P}`). Examples: `.`, `,`, `!`, `¿`, `“”`. |
| symbol | A grapheme in the Unicode Symbol category (`\p{S}`). Includes math, currency, emoji, and miscellaneous symbols. Examples: `+`, `$`, `©`, `🔥`, `™`. |
| line | Determined by newline graphemes (`'\n'`). Each newline increments the line count. A final line is counted even if the text doesn’t end with a newline, unless the input is empty, in which case the line count is 0. |
| paragraph | A non-empty, non-newline string, separated from other paragraphs by one or more newline characters. A trailing paragraph is counted even if the text doesn’t end with a newline, unless the input is empty, in which case the paragraph count is 0. Example: `"Hello\n\nWorld"` → 2 paragraphs. |
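The line and paragraph rules can be sketched without any segmenter. The functions below follow one literal reading of those rules and are not the library's code. How a trailing newline is treated (it ends the last line rather than starting a new empty one) is an assumption of this sketch.

```typescript
// Hypothetical sketch: each "\n" ends a line, and text after the last
// newline counts as one more line. Empty input has zero lines.
function countLinesSketch(text: string): number {
  if (text.length === 0) return 0;
  let newlines = 0;
  for (const ch of text) {
    if (ch === '\n') newlines += 1;
  }
  return text.endsWith('\n') ? newlines : newlines + 1;
}

// Hypothetical sketch: paragraphs are the non-empty chunks left after
// splitting on runs of one or more newlines.
function countParagraphsSketch(text: string): number {
  return text.split(/\n+/).filter((chunk) => chunk.length > 0).length;
}

const sampleText = 'Hello world!\n\nThis is a test.';

console.log(countLinesSketch(sampleText));
// → 3

console.log(countParagraphsSketch(sampleText));
// → 2
```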

Legacy

Algorithm

The counting function is implemented as a single-pass parser for performance reasons. State transitions (sentence terminator → letter, letter → space, etc.) are used to determine when to increment the counts for each token type.

The following characters are used to separate tokens:

Space: ' '
Newline: \n
End Mark: ., !, ?

End of Input can also be considered a separator because words, sentences, paragraphs, and lines at the end of the input are counted even if not specifically terminated. For example, Something is counted as a word, sentence, paragraph, and line.
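To make the state-transition idea concrete, here is a simplified single-pass sketch based on the separators listed above. It is not the legacy implementation. The function name and the choice to ignore characters that are not ASCII letters, digits, or separators are assumptions made for illustration.

```typescript
interface LegacyLikeCounts {
  words: number;
  sentences: number;
  paragraphs: number;
  lines: number;
}

// Hypothetical single-pass counter that treats space, newline, end marks,
// and end of input as separators.
function countLegacyLike(text: string): LegacyLikeCounts {
  const counts: LegacyLikeCounts = { words: 0, sentences: 0, paragraphs: 0, lines: 0 };
  let inWord = false;            // inside a run of ASCII letters/digits
  let wordsInSentence = 0;       // words seen since the last sentence ended
  let sentencesInParagraph = 0;  // sentences seen since the last paragraph ended

  const closeWord = (): void => {
    if (inWord) { counts.words += 1; wordsInSentence += 1; inWord = false; }
  };
  const closeSentence = (): void => {
    closeWord();
    if (wordsInSentence > 0) { counts.sentences += 1; sentencesInParagraph += 1; wordsInSentence = 0; }
  };
  const closeParagraph = (): void => {
    closeSentence();
    if (sentencesInParagraph > 0) { counts.paragraphs += 1; sentencesInParagraph = 0; }
  };

  for (const ch of text) {
    if (/[A-Za-z0-9]/.test(ch)) inWord = true;
    else if (ch === ' ') closeWord();
    else if (ch === '.' || ch === '!' || ch === '?') closeSentence();
    else if (ch === '\n') { closeParagraph(); counts.lines += 1; }
    // Any other character is ignored in this sketch (an assumption).
  }

  // End of input acts as a final separator for every token type
  closeParagraph();
  if (text.length > 0 && !text.endsWith('\n')) counts.lines += 1;
  return counts;
}

console.log(countLegacyLike('Something'));
// → { words: 1, sentences: 1, paragraphs: 1, lines: 1 }

console.log(countLegacyLike('Hello world!\n\nThis is a test.'));
// → { words: 6, sentences: 2, paragraphs: 2, lines: 3 }
```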

Here is an overview of how we determine the counts for each token type:

| Count Type | Description |
| --- | --- |
| character | A Unicode grapheme cluster (user-perceived character), as determined by `Intl.Segmenter`. Using this method, emojis and other multi-codepoint characters are counted as a single character. Examples: `a`, `2`, `!`, `🔥`, `貓`. |
| word | A contiguous sequence of one or more letters or digits followed by a space, end mark, or newline. Symbols by themselves are not considered words. Examples: `space `, `Whoa!`, `newline\n`, `42`. |
| sentence | A contiguous sequence of one or more words followed by an end mark. Examples: `Hello, world!`, `20 93.`. |
| paragraph | A contiguous sequence of one or more sentences followed by a newline. Examples: `The quick brown cat jumps over the lazy dog\n`, `Hello world! Bye world!\n`, `42\n`. |
| space | A literal space character (`' '`). Other whitespace (e.g., tabs, newlines) is not included. |
| letter | A character in the ASCII ranges A–Z or a–z. Examples: `A`, `j`, `z`. |
| digit | A character in the ASCII range 0–9. Examples: `0`, `5`, `9`. |
| symbol | A non-letter, non-digit, non-space, non-newline character. This includes emojis, symbols, punctuation, and most whitespace. Examples: `,`, `%`, `#`, `😊`, `貓`, `\t`. |
| line | A literal newline character (`\n`). |

🤝 Contributing

Pull requests, bug reports, feature requests, and other kinds of contributions are welcome. See the contribution guide for more details.

🧾 License

This project is licensed under the MIT license. See the license for more details.

🖇️ Related

Used By

Notable projects that depend on this one:

👤 Tally: A free online tool to count the number of characters, words, paragraphs, and lines in your text. Tally uses this library to compute counts.

Alternatives

Similar projects you might want to use instead:

🌐 Alfaaz: An alternative multilingual word counting library with fewer features, but faster execution.

💕 Funding

Find this project useful? Sponsoring me will help me cover costs and commit more time to open-source.

If you can't donate but still want to contribute, don't worry. There are many other ways to help out, like:

📢 reporting (submitting feature requests & bug reports)
👨‍💻 coding (implementing features & fixing bugs)
📝 writing (documenting & translating)
💬 spreading the word
⭐ starring the project

I appreciate the support!
