ts-tokenizer
v0.1.0
A TypeScript tokenizer library scaffold.
TypeScript Tokenizer MVP
This repository is an MVP implementation of a learning-oriented BPE tokenizer in TypeScript.
It demonstrates the core tokenizer loop:
text -> initial token ids -> pair statistics -> merge rules -> encode/decode
Current Capabilities
- Train a small BPE model from text with trainBpe.
- Learn multiple merge rules with maxMerges.
- Encode text with a trained model.
- Decode token ids back to text.
- Fall back to UTF-8 byte tokens for characters not seen during training.
- Create a convenient tokenizer object with createTokenizer.
- Export and import models with JSON-compatible data.
- Validate serialized model version, mode, vocabulary, and merge rules on import.
- Train, encode, and decode files from the CLI.
- Run tests with Vitest.
- Keep the package entry separate from the runnable demo.
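The UTF-8 byte fallback listed above can be illustrated with a minimal sketch. It is not the code in src/utf8.ts. The id layout (ids 0 to 255 reserved for raw bytes, ids from 256 upward for characters seen in training) and the names buildCharVocab, toInitialIds, and fromInitialIds are assumptions made for illustration only.

```typescript
// Assumed layout: ids 0..255 are raw UTF-8 bytes, ids >= 256 are characters seen in training.
const BYTE_VOCAB_SIZE = 256;

// Assign one id per distinct character (code point) in the training text.
function buildCharVocab(trainingText: string): Map<string, number> {
  const vocab = new Map<string, number>();
  for (const ch of trainingText) {
    if (!vocab.has(ch)) vocab.set(ch, BYTE_VOCAB_SIZE + vocab.size);
  }
  return vocab;
}

// Map text to initial token ids, falling back to UTF-8 bytes for unseen characters.
function toInitialIds(text: string, vocab: Map<string, number>): number[] {
  const encoder = new TextEncoder();
  const ids: number[] = [];
  for (const ch of text) {
    const id = vocab.get(ch);
    if (id !== undefined) {
      ids.push(id);
    } else {
      for (const byte of encoder.encode(ch)) ids.push(byte);
    }
  }
  return ids;
}

// Rebuild the text by collecting every token's UTF-8 bytes, then decoding once.
function fromInitialIds(ids: number[], vocab: Map<string, number>): string {
  const idToChar = new Map<number, string>();
  for (const [ch, id] of vocab) idToChar.set(id, ch);
  const encoder = new TextEncoder();
  const bytes: number[] = [];
  for (const id of ids) {
    if (id < BYTE_VOCAB_SIZE) {
      bytes.push(id);
    } else {
      const ch = idToChar.get(id);
      if (ch === undefined) throw new Error(`Unknown token id: ${id}`);
      bytes.push(...encoder.encode(ch));
    }
  }
  return new TextDecoder().decode(new Uint8Array(bytes));
}
```

Decoding through a single byte buffer means a multi-byte character that was split into several byte tokens is reassembled correctly, so text containing unseen characters should still round-trip.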
Commands
Install dependencies:
npm install
Run the demo:
npm run dev
Run the demo with a custom merge count:
npm run dev -- --max-merges=20
Run tests:
npm test
Typecheck and build:
npm run typecheck
npm run build
CLI
Train a model from a text file:
npm run cli -- train --input corpus.txt --output tokenizer.json --max-merges 100
# After package installation:
tokenize train --input corpus.txt --output tokenizer.json --max-merges 100
Encode a text file into token ids:
npm run cli -- encode --model tokenizer.json --input input.txt --output ids.json
# After package installation:
tokenize encode --model tokenizer.json --input input.txt --output ids.json
Decode token ids back to text:
npm run cli -- decode --model tokenizer.json --input ids.json --output decoded.txt
# After package installation:
tokenize decode --model tokenizer.json --input ids.json --output decoded.txt
Project Structure
- src/bpe.ts: core BPE training, encode/decode, and tokenizer facade.
- src/types.ts: shared tokenizer, model, token, and training result types.
- src/utf8.ts: UTF-8 byte fallback helpers and 256-byte base vocabulary constants.
- src/pairs.ts: pair keys, pair statistics, most-frequent-pair selection, and pair merging.
- src/model.ts: vocabulary lookup plus model import/export.
- src/cli.ts: command-line train/encode/decode entry point.
- src/demo.ts: runnable learning/demo script.
- tests/: Vitest test suite for public entry points and BPE behavior.
- ROADMAP.md: follow-up feature directions and architectural next steps.
Example API
import { createTokenizer, trainBpe } from "ts-tokenizer";
const result = trainBpe("banana bandana banana", { maxMerges: 5 });
const tokenizer = createTokenizer(result.model);
const ids = tokenizer.encode("banana");
const text = tokenizer.decode(ids);
const count = tokenizer.count("banana");
const json = tokenizer.toJSON();
MVP Limitations
- This is a teaching-oriented BPE implementation, not a GPT/tiktoken-compatible tokenizer.
- Pair statistics are recomputed from scratch after every merge.
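To make the recompute-per-merge limitation concrete, here is a minimal sketch of a BPE training loop that recounts every adjacent pair on each iteration. It is an illustration rather than the code in src/bpe.ts or src/pairs.ts. The names countPairs, mergePair, and trainSketch, the "left,right" key format, the first-seen tie-breaking, and the rule of stopping once no pair occurs at least twice are all assumptions.

```typescript
type Pair = [number, number];

// Count adjacent id pairs; keys are "left,right" strings.
function countPairs(ids: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + 1 < ids.length; i++) {
    const key = `${ids[i]},${ids[i + 1]}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Replace each non-overlapping occurrence of `pair` with `newId`, scanning left to right.
function mergePair(ids: number[], pair: Pair, newId: number): number[] {
  const out: number[] = [];
  let i = 0;
  while (i < ids.length) {
    if (i + 1 < ids.length && ids[i] === pair[0] && ids[i + 1] === pair[1]) {
      out.push(newId);
      i += 2;
    } else {
      out.push(ids[i]);
      i += 1;
    }
  }
  return out;
}

// Learn up to `maxMerges` merge rules, recounting all pairs before every merge.
function trainSketch(ids: number[], maxMerges: number, firstNewId: number) {
  const merges: { pair: Pair; newId: number }[] = [];
  let current = ids;
  for (let step = 0; step < maxMerges; step++) {
    const counts = countPairs(current); // full pass over the sequence on every merge
    let bestKey: string | undefined;
    let bestCount = 1; // assumption: only merge pairs that occur at least twice
    for (const [key, count] of counts) {
      if (count > bestCount) {
        bestKey = key;
        bestCount = count;
      }
    }
    if (bestKey === undefined) break;
    const [left, right] = bestKey.split(",").map(Number);
    const newId = firstNewId + step;
    current = mergePair(current, [left, right], newId);
    merges.push({ pair: [left, right], newId });
  }
  return { ids: current, merges };
}
```

Each merge costs a full pass over the sequence, so training time grows roughly with sequence length times the number of merges. That is acceptable for small teaching corpora, and it is the usual reason larger implementations update pair counts incrementally instead.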
See ROADMAP.md for planned directions and .note/ for deferred design concerns discovered during development.
