tiktoken-bundle
v0.0.1
An offline-capable ESM module for cl100k_base tokenization in the browser.
This lightweight JavaScript/TypeScript library provides tokenization functionality using the cl100k_base encoding (the same encoding used by GPT-3.5 and GPT-4 models) and is specifically designed to work in browser environments without network requests.
Installation
npm install tiktoken-bundle
Usage
import {
TokensOfText,
TextFromTokens,
NumberOfTokensInText,
TokenizationOfText
} from 'tiktoken-bundle';
// Convert text to token IDs
const tokens = TokensOfText('Hello, world!');
console.log(tokens); // [9906, 11, 4435, 0]
// Convert token IDs back to text
const text = TextFromTokens([9906, 11, 4435, 0]);
console.log(text); // 'Hello, world!'
// Count tokens in text
const count = NumberOfTokensInText('Hello, world!');
console.log(count); // 4
// Get token ID-string pairs
const tokenization = TokenizationOfText('Hello, world!');
console.log(tokenization);
// [
// [9906, 'Hello'],
// [11, ', '],
// [4435, 'world'],
// [0, '!']
// ]
Features
- Offline-capable: Works without internet connection or API calls
- Browser-compatible: Designed to work in modern browsers
- cl100k_base encoding: Uses the same encoding as GPT-3.5 and GPT-4
- TypeScript support: Includes TypeScript type definitions
- Simple API: Just four functions to handle common tokenization tasks
- Unicode support: Properly handles special characters and emoji
API Reference
TokensOfText(text: string): number[]
Converts text to an array of token IDs.
const tokens = TokensOfText('Hello, world!');
// [9906, 11, 4435, 0]
TextFromTokens(tokenList: number[]): string
Converts an array of token IDs back to text.
const text = TextFromTokens([9906, 11, 4435, 0]);
// 'Hello, world!'
NumberOfTokensInText(text: string): number
Counts the number of tokens in a text string.
const count = NumberOfTokensInText('Hello, world!');
// 4
TokenizationOfText(text: string): [number, string][]
Returns an array of token ID and token string pairs.
const tokenization = TokenizationOfText('Hello, world!');
// [
// [9906, 'Hello'],
// [11, ', '],
// [4435, 'world'],
// [0, '!']
// ]
What is tokenization?
Tokenization is the process of breaking text into smaller units called tokens. In the context of large language models like GPT-3.5 and GPT-4, tokens are the basic units of text that the model processes.
The cl100k_base encoding (used by this library) is specifically designed for modern LLMs. It breaks text into tokens in a way that balances efficiency and semantic meaning. A token can be as short as a single character or as long as a full word.
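The idea can be illustrated with a toy greedy longest-match tokenizer. This is a deliberate simplification: the real cl100k_base encoding uses byte-pair encoding over a vocabulary of roughly 100,000 tokens, and `toyVocab` and `toyTokenize` below are invented for illustration, not part of this library.

```typescript
// Toy illustration of subword tokenization (NOT the real cl100k_base
// algorithm). A greedy longest-match over a tiny hand-made vocabulary
// shows how text splits into word- and character-sized pieces.
const toyVocab = ["Hello", " world", "token", "ization", ",", "!"];

function toyTokenize(text: string, vocab: string[]): string[] {
  const tokens: string[] = [];
  let i = 0;
  while (i < text.length) {
    // Find the longest vocabulary entry matching at position i.
    let best = "";
    for (const piece of vocab) {
      if (text.startsWith(piece, i) && piece.length > best.length) {
        best = piece;
      }
    }
    // Fall back to a single character for anything not in the vocabulary.
    if (best === "") best = text[i];
    tokens.push(best);
    i += best.length;
  }
  return tokens;
}

console.log(toyTokenize("Hello, world!", toyVocab));
// ["Hello", ",", " world", "!"]
console.log(toyTokenize("tokenization", toyVocab));
// ["token", "ization"]
```

Note how "tokenization" splits into two subword pieces while common words survive intact; a BPE vocabulary produces the same effect at much larger scale.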
Why use this library?
- Token counting: Accurately count tokens for API requests to stay within limits
- Offline use: Perform tokenization without relying on external services
- Debugging: Understand how text is tokenized to optimize prompts
- Educational purposes: Learn about how text is processed by language models
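For the token-counting use case, a pre-flight check might look like the following sketch. `fitsWithinLimit` and `maxTokens` are illustrative names, not part of this library's API; the counter function is injected so the sketch stays self-contained, but in practice you would pass this library's NumberOfTokensInText.

```typescript
// Sketch: check a prompt against a model's context limit before sending it.
// The token counter is injected; with tiktoken-bundle you would pass
// NumberOfTokensInText as `countTokens`.
function fitsWithinLimit(
  text: string,
  maxTokens: number,
  countTokens: (text: string) => number
): boolean {
  return countTokens(text) <= maxTokens;
}

// Stand-in counter for demonstration: whitespace-separated words.
// (A real BPE counter usually reports more tokens than words.)
const wordCount = (text: string): number =>
  text.split(/\s+/).filter(Boolean).length;

console.log(fitsWithinLimit("Hello, world!", 4, wordCount)); // true
console.log(fitsWithinLimit("one two three four five", 4, wordCount)); // false
```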
License
MIT
