token-estimator
v1.0.2
A Unicode-aware token segmentation and counting library
Token Estimator
A fast, Unicode-aware token segmentation and counting library for JavaScript/TypeScript, tuned so that its tokenization results align with those of Gemini and GPT models.
Features
- Unicode-aware: Handles international text, emoji sequences, and various scripts
- Robust tokenization: Uses Intl.Segmenter when available for best results, with a regex fallback
- Multiple token categories: word, number, whitespace, punctuation, emoji, and other
- Token counting: Count tokens with configurable options
- Text truncation: Truncate text by token count without splitting grapheme clusters
- TypeScript support: Full TypeScript definitions included
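The "Intl.Segmenter with regex fallback" idea from the feature list can be sketched roughly as follows. This is an illustrative sketch only, not the library's actual implementation: the function name `splitIntoPieces`, the helper types, and the fallback pattern are all assumptions made for this example.

```typescript
// Hypothetical sketch: feature-detect Intl.Segmenter, else fall back to a regex.
// Minimal structural types so this compiles even if the TS `lib` setting
// predates the Intl.Segmenter typings.
interface WordSegment { segment: string; index: number }
interface WordSegmenterLike { segment(input: string): Iterable<WordSegment> }
type WordSegmenterCtor = new (
  locale: string,
  options: { granularity: 'word' }
) => WordSegmenterLike;

function splitIntoPieces(text: string, locale: string = 'en'): string[] {
  // Undefined on runtimes that do not ship Intl.Segmenter.
  const SegmenterCtor: WordSegmenterCtor | undefined = (Intl as any).Segmenter;

  if (typeof SegmenterCtor === 'function') {
    const segmenter = new SegmenterCtor(locale, { granularity: 'word' });
    return Array.from(segmenter.segment(text), (piece) => piece.segment);
  }

  // Assumed fallback pattern: runs of letters/digits, runs of whitespace,
  // or any single remaining code point. Unlike Intl.Segmenter, this can
  // break multi-code-point emoji sequences apart.
  return text.match(/[\p{L}\p{N}]+|\s+|[^\p{L}\p{N}\s]/gu) ?? [];
}

const pieces = splitIntoPieces('Hello world!');
console.log(pieces);
```

Both branches are lossless: joining the returned pieces reproduces the input string, which is a useful property to check when comparing a fallback against the native segmenter.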
Installation
npm install token-estimator
Usage
import { segmentIntoTokens, countTokens, truncateByTokenCount } from 'token-estimator'
// Segment text into tokens
const tokens = segmentIntoTokens("Hello world! 😊")
// Returns tokens with text, positions, and categories
// Count tokens
const tokenCount = countTokens("Hello world! 😊")
// Returns: 4
// Truncate by token count
const { truncatedText, truncatedTokenCount } = truncateByTokenCount(
"Hello world! 😊 How are you?",
3
)
// Returns: { truncatedText: "Hello world! 😊", truncatedTokenCount: 3 }
API
segmentIntoTokens(sourceText, options?)
Segments text into tokens with detailed information.
Parameters:
- sourceText: The text to segment
- options.keepWhitespace: Include whitespace tokens (default: true)
- options.requestedLocale: Locale hint for segmentation (default: 'en')
- options.maxTokensLimit: Safety limit (default: 100000)
Returns: Array of Token objects with properties:
- text: The token text
- startIndex: Start position in original string
- endIndex: End position in original string
- category: Token category ('word', 'number', 'whitespace', 'punctuation', 'emoji', 'other')
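To make the returned shape concrete, here is a small sketch of how the Token properties listed above might be typed and consumed. The `Token` and `TokenCategory` declarations are reconstructed from this README rather than copied from the package's type definitions, `tallyByCategory` is a hypothetical helper, and the sample data is hand-written (it also assumes `endIndex` is exclusive, which the README does not state).

```typescript
// Assumed shape, reconstructed from the property list in this README.
type TokenCategory =
  | 'word' | 'number' | 'whitespace' | 'punctuation' | 'emoji' | 'other';

interface Token {
  text: string;
  startIndex: number;
  endIndex: number; // assumed exclusive for this sketch
  category: TokenCategory;
}

// Hypothetical helper: count how many tokens fall into each category.
function tallyByCategory(tokens: readonly Token[]): Record<TokenCategory, number> {
  const tally: Record<TokenCategory, number> = {
    word: 0, number: 0, whitespace: 0, punctuation: 0, emoji: 0, other: 0,
  };
  for (const token of tokens) {
    tally[token.category] += 1;
  }
  return tally;
}

// Hand-written sample for "Hi 42!", not real library output.
const sampleTokens: Token[] = [
  { text: 'Hi', startIndex: 0, endIndex: 2, category: 'word' },
  { text: ' ', startIndex: 2, endIndex: 3, category: 'whitespace' },
  { text: '42', startIndex: 3, endIndex: 5, category: 'number' },
  { text: '!', startIndex: 5, endIndex: 6, category: 'punctuation' },
];

console.log(tallyByCategory(sampleTokens));
```

In real use, the array would come from segmentIntoTokens rather than being written by hand.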
countTokens(sourceText, options?)
Counts tokens in the source text.
Parameters: Same as segmentIntoTokens
Returns: Number of tokens
truncateByTokenCount(sourceText, tokenLimit, options?)
Truncates text to at most the specified number of tokens, without splitting grapheme clusters.
Parameters:
- sourceText: Text to truncate
- tokenLimit: Maximum number of tokens
- options: Same as segmentIntoTokens
Returns: Object with truncatedText and truncatedTokenCount
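The grapheme-cluster guarantee is easiest to see in isolation. The sketch below cuts a string after a given number of grapheme clusters rather than tokens, so it illustrates only the "never split a cluster" part and is not how truncateByTokenCount itself is implemented. The name `truncateByGraphemes` and the helper types are assumptions for this example.

```typescript
// Hypothetical sketch: cut text on grapheme-cluster boundaries only.
interface GraphemeSegment { segment: string; index: number }
interface GraphemeSegmenterLike { segment(input: string): Iterable<GraphemeSegment> }
type GraphemeSegmenterCtor = new (
  locale: string,
  options: { granularity: 'grapheme' }
) => GraphemeSegmenterLike;

function truncateByGraphemes(text: string, maxGraphemes: number): string {
  // Undefined on runtimes that do not ship Intl.Segmenter.
  const SegmenterCtor: GraphemeSegmenterCtor | undefined = (Intl as any).Segmenter;

  const units: string[] =
    typeof SegmenterCtor === 'function'
      ? Array.from(
          new SegmenterCtor('en', { granularity: 'grapheme' }).segment(text),
          (piece) => piece.segment
        )
      : // Fallback: split by code point. This keeps surrogate pairs intact
        // but can still split multi-code-point clusters such as ZWJ emoji.
        Array.from(text);

  return units.slice(0, Math.max(0, maxGraphemes)).join('');
}

console.log(truncateByGraphemes('a😊b', 2));
```

Slicing the raw string with `text.slice(0, n)` counts UTF-16 code units instead, which is what can leave half of an emoji at the end of the result.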
License
MIT
