tiktoken-bundle
v0.0.1
An offline-capable ESM module for cl100k_base tokenization in the browser.
This lightweight JavaScript/TypeScript library provides tokenization functionality using the cl100k_base encoding (the same encoding used by GPT-3.5 and GPT-4 models) and is specifically designed to work in browser environments without network requests.
Installation
npm install tiktoken-bundle
Usage
import {
TokensOfText,
TextFromTokens,
NumberOfTokensInText,
TokenizationOfText
} from 'tiktoken-bundle';
// Convert text to token IDs
const tokens = TokensOfText('Hello, world!');
console.log(tokens); // [9906, 11, 4435, 0]
// Convert token IDs back to text
const text = TextFromTokens([9906, 11, 4435, 0]);
console.log(text); // 'Hello, world!'
// Count tokens in text
const count = NumberOfTokensInText('Hello, world!');
console.log(count); // 4
// Get token ID-string pairs
const tokenization = TokenizationOfText('Hello, world!');
console.log(tokenization);
// [
// [9906, 'Hello'],
// [11, ', '],
// [4435, 'world'],
// [0, '!']
// ]
Features
- Offline-capable: Works without internet connection or API calls
- Browser-compatible: Designed to work in modern browsers
- cl100k_base encoding: Uses the same encoding as GPT-3.5 and GPT-4
- TypeScript support: Includes TypeScript type definitions
- Simple API: Just four functions to handle common tokenization tasks
- Unicode support: Properly handles special characters and emoji
API Reference
TokensOfText(text: string): number[]
Converts text to an array of token IDs.
const tokens = TokensOfText('Hello, world!');
// [9906, 11, 4435, 0]
TextFromTokens(tokenList: number[]): string
Converts an array of token IDs back to text.
const text = TextFromTokens([9906, 11, 4435, 0]);
// 'Hello, world!'
NumberOfTokensInText(text: string): number
Counts the number of tokens in a text string.
const count = NumberOfTokensInText('Hello, world!');
// 4
TokenizationOfText(text: string): [number, string][]
Returns an array of token ID and token string pairs.
const tokenization = TokenizationOfText('Hello, world!');
// [
// [9906, 'Hello'],
// [11, ', '],
// [4435, 'world'],
// [0, '!']
// ]
What is tokenization?
Tokenization is the process of breaking text into smaller units called tokens. In the context of large language models like GPT-3.5 and GPT-4, tokens are the basic units of text that the model processes.
The cl100k_base encoding (used by this library) is specifically designed for modern LLMs. It breaks text into tokens in a way that balances efficiency and semantic meaning. A token can be as short as a single character or as long as a full word.
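The idea can be illustrated with a toy greedy longest-match tokenizer. This is a deliberate simplification: the real cl100k_base encoding uses byte-pair encoding over a vocabulary of roughly 100,000 tokens, and `toyVocab` and `toyTokenize` below are invented for illustration, not part of this library.

```typescript
// Toy illustration of subword tokenization (NOT the real cl100k_base
// algorithm). A greedy longest-match over a tiny hand-made vocabulary
// shows how text splits into word- and character-sized pieces.
const toyVocab = ["Hello", " world", "token", "ization", ",", "!"];

function toyTokenize(text: string, vocab: string[]): string[] {
  const tokens: string[] = [];
  let i = 0;
  while (i < text.length) {
    // Find the longest vocabulary entry matching at position i.
    let best = "";
    for (const piece of vocab) {
      if (text.startsWith(piece, i) && piece.length > best.length) {
        best = piece;
      }
    }
    // Fall back to a single character for anything not in the vocabulary.
    if (best === "") best = text[i];
    tokens.push(best);
    i += best.length;
  }
  return tokens;
}

console.log(toyTokenize("Hello, world!", toyVocab));
// ["Hello", ",", " world", "!"]
console.log(toyTokenize("tokenization", toyVocab));
// ["token", "ization"]
```

Note how "tokenization" splits into two subword pieces while common words survive intact; a BPE vocabulary produces the same effect at much larger scale.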
Why use this library?
- Token counting: Accurately count tokens for API requests to stay within limits
- Offline use: Perform tokenization without relying on external services
- Debugging: Understand how text is tokenized to optimize prompts
- Educational purposes: Learn about how text is processed by language models
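For the token-counting use case, a pre-flight check might look like the following sketch. `fitsWithinLimit` and `maxTokens` are illustrative names, not part of this library's API; the counter function is injected so the sketch stays self-contained, but in practice you would pass this library's NumberOfTokensInText.

```typescript
// Sketch: check a prompt against a model's context limit before sending it.
// The token counter is injected; with tiktoken-bundle you would pass
// NumberOfTokensInText as `countTokens`.
function fitsWithinLimit(
  text: string,
  maxTokens: number,
  countTokens: (text: string) => number
): boolean {
  return countTokens(text) <= maxTokens;
}

// Stand-in counter for demonstration: whitespace-separated words.
// (A real BPE counter usually reports more tokens than words.)
const wordCount = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

console.log(fitsWithinLimit("Hello, world!", 4, wordCount)); // true
console.log(fitsWithinLimit("one two three four five", 4, wordCount)); // false
```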
License
MIT
