token-estimator

v1.0.2

Published 7 months ago

A Unicode-aware token segmentation and counting library

Vulnerabilities: 0 high, 0 medium, 0 low

zhanghongfeng

Keywords: token, segmentation, unicode, text, tokenizer

Token Estimator

A fast, Unicode-aware token segmentation and counting library for JavaScript and TypeScript. Its tokenization is tuned so that results align with those of Gemini and GPT models.

Features

Unicode-aware: Handles international text, emoji sequences, and various scripts
Robust tokenization: Uses Intl.Segmenter when the runtime provides it, and falls back to regex-based segmentation otherwise
Multiple token categories: word, number, whitespace, punctuation, emoji, and other
Token counting: Count tokens with configurable options
Text truncation: Truncate text by token count without splitting grapheme clusters
TypeScript support: Full TypeScript definitions included
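
The "Intl.Segmenter with regex fallback" approach listed above can be illustrated with a minimal sketch. This is not the library's source: the function name `splitIntoSegments`, the constant `FALLBACK_PATTERN`, and the fallback regex itself are assumptions chosen for illustration. Only `Intl.Segmenter` and its `granularity: "word"` option are standard platform APIs.

```typescript
// Hypothetical sketch; names and the fallback pattern are assumptions,
// not taken from the library's source.

// Feature-detect Intl.Segmenter. The `any` cast keeps this compiling even when
// the TypeScript `lib` setting predates the ES2022 Intl typings.
const SegmenterCtor: any =
  typeof Intl !== "undefined" ? (Intl as any).Segmenter : undefined;

// Fallback: a run of letters/marks/digits, a run of whitespace,
// or any single remaining code point (punctuation, symbols, emoji parts).
const FALLBACK_PATTERN = /[\p{L}\p{M}\p{N}]+|\s+|[^\p{L}\p{M}\p{N}\s]/gu;

function splitIntoSegments(text: string, locale: string = "en"): string[] {
  if (typeof SegmenterCtor === "function") {
    // Preferred path: locale-aware word segmentation from the platform.
    const segmenter = new SegmenterCtor(locale, { granularity: "word" });
    return Array.from(
      segmenter.segment(text) as Iterable<{ segment: string }>,
      (part) => part.segment
    );
  }
  // Fallback path: every character of the input lands in exactly one match.
  return text.match(FALLBACK_PATTERN) ?? [];
}

console.log(splitIntoSegments("Hello world! 😊"));
```

Both paths return segments that concatenate back to the original string, so start and end offsets can be derived by accumulating segment lengths. The fallback is coarser than the platform segmenter: for example, it splits a multi-code-point emoji sequence into separate pieces.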

Installation

npm install token-estimator

Usage

import { segmentIntoTokens, countTokens, truncateByTokenCount } from 'token-estimator'

// Segment text into tokens
const tokens = segmentIntoTokens("Hello world! 😊")
// Returns an array of Token objects, each with text, startIndex, endIndex, and category

// Count tokens
const tokenCount = countTokens("Hello world! 😊")
// Returns: 4

// Truncate by token count
const { truncatedText, truncatedTokenCount } = truncateByTokenCount(
  "Hello world! 😊 How are you?",
  3
)
// Returns: { truncatedText: "Hello world! 😊", truncatedTokenCount: 3 }

API

`segmentIntoTokens(sourceText, options?)`

Segments text into tokens with detailed information.

Parameters:

sourceText: The text to segment
options.keepWhitespace: Include whitespace tokens (default: true)
options.requestedLocale: Locale hint for segmentation (default: 'en')
options.maxTokensLimit: Safety cap on the number of tokens (default: 100000)

Returns: Array of Token objects with properties:

text: The token text
startIndex: Start position in original string
endIndex: End position in original string
category: Token category ('word', 'number', 'whitespace', 'punctuation', 'emoji', 'other')
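
To make the Token shape concrete, here is a small sketch of how a segment might be turned into such an object. The property names and category values come from the list above. Everything else is an assumption for illustration: the names `TokenSketch`, `classifySegment`, and `toToken`, the classification rules, and the convention that `endIndex` is exclusive and measured in UTF-16 code units. The library's actual rules may differ.

```typescript
// Hypothetical sketch: property and category names follow the README;
// the classification rules and the exclusive endIndex are assumptions.
type TokenCategory =
  | "word" | "number" | "whitespace" | "punctuation" | "emoji" | "other";

interface TokenSketch {
  text: string;
  startIndex: number; // assumed: UTF-16 offset of the token's first code unit
  endIndex: number;   // assumed exclusive: startIndex + text.length
  category: TokenCategory;
}

function classifySegment(segment: string): TokenCategory {
  if (/^\s+$/u.test(segment)) return "whitespace";
  if (/^\p{N}+$/u.test(segment)) return "number";
  // Extended_Pictographic also covers a few symbols such as the copyright sign.
  if (/\p{Extended_Pictographic}/u.test(segment)) return "emoji";
  if (/^[\p{L}\p{M}\p{N}]+$/u.test(segment)) return "word";
  if (/^\p{P}+$/u.test(segment)) return "punctuation";
  return "other";
}

function toToken(segment: string, startIndex: number): TokenSketch {
  return {
    text: segment,
    startIndex,
    endIndex: startIndex + segment.length,
    category: classifySegment(segment),
  };
}

console.log(toToken("😊", 13));
```

The order of the checks matters in this sketch: digits are tested before letters so that "42" is a number rather than a word, while a mixed segment such as "3rd" falls through to the word rule.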

`countTokens(sourceText, options?)`

Counts tokens in the source text.

Parameters: Same as segmentIntoTokens

Returns: Number of tokens

`truncateByTokenCount(sourceText, tokenLimit, options?)`

Truncates text to at most the specified number of tokens, without splitting grapheme clusters.

Parameters:

sourceText: Text to truncate
tokenLimit: Maximum number of tokens
options: Same as segmentIntoTokens

Returns: Object with truncatedText and truncatedTokenCount
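
The idea of cutting only at segment boundaries, rather than at arbitrary string offsets, can be sketched as follows. This is an illustration under stated assumptions, not the library's implementation: the names `TruncationSketch` and `truncateBySegments` are hypothetical, and the counting rule (each non-whitespace segment counts as one token) is an assumption, so the counts it produces may not match those of `truncateByTokenCount`. It requires a runtime that provides `Intl.Segmenter`.

```typescript
// Hypothetical sketch; the counting rule below is an assumption and may
// not match the counts produced by truncateByTokenCount.
interface TruncationSketch {
  truncatedText: string;
  truncatedTokenCount: number;
}

function truncateBySegments(
  text: string,
  tokenLimit: number,
  locale: string = "en"
): TruncationSketch {
  // The `any` cast sidesteps missing ES2022 Intl typings.
  const segmenter = new (Intl as any).Segmenter(locale, { granularity: "word" });
  const parts = Array.from(
    segmenter.segment(text) as Iterable<{ segment: string; index: number }>
  );

  let cutOffset = 0;
  let counted = 0;
  for (const part of parts) {
    if (part.segment.trim() === "") continue; // whitespace is not counted
    if (counted >= tokenLimit) break;
    counted += 1;
    // Cut only at the end of a segment; segment boundaries never fall
    // inside a surrogate pair, so an emoji is kept or dropped as a whole.
    cutOffset = part.index + part.segment.length;
  }
  return { truncatedText: text.slice(0, cutOffset), truncatedTokenCount: counted };
}

console.log(truncateBySegments("Hello world! 😊 How are you?", 4));
```

Because the cut offset always coincides with the end of a counted segment, trailing whitespace is dropped and the result never ends in half of a character, which a plain `text.slice(0, n)` cannot guarantee.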

License

MIT
