@rajan_kush/tokenizer
v1.0.0
Published
A Simple Byte-Pair Encoding (BPE) tokenizer built from scratch.
Readme
BPE Tokenizer JS (@rajan-kush/tokenizer)
A lightweight, zero-dependency Byte-Pair Encoding (BPE) tokenizer written in pure JavaScript. This package allows you to train your own vocabulary from a text corpus and use it to encode text into tokens or decode tokens back into text.
It was built from scratch to demonstrate the core concepts behind the tokenization algorithms that power modern Large Language Models.
Features
- Train from Scratch: Generate a BPE vocabulary from any
.txtcorpus file. - Encode & Decode: Seamlessly convert text to token IDs and back.
- Pure JavaScript: Runs in both Node.js and modern browsers. No external dependencies.
- Visualization App: Comes with a companion Vite + React app to visualize the tokenization process in real-time.
- Well-Documented: Clear and easy-to-understand API.
Installation
Install the package from NPM using your favorite package manager.
npm install @rajan-kush/tokenizer