bpe-tokenizer-encoding
v2.0.0
Published
A simple JavaScript implementation of Byte Pair Encoding (BPE)
Downloads
18
Maintainers
Readme
bpe-tokenizer
🔤 A simple JavaScript implementation of Byte Pair Encoding (BPE) for tokenizing text into subword units.
🚀 Install
npm install bpe-tokenizer📆 Usage
In Node.js:
const { bytePairEncoding } = require('bpe-tokenizer');
const tokens = bytePairEncoding("low_low_lower", 5, true);
console.log("Tokens:", tokens);Output:
Step 1: Merged 'l o' -> lo w _ lo w _ lo w e r
Step 2: Merged 'lo w' -> low _ low _ low e r
Step 3: Merged 'low _' -> low_ low_ low e r
Step 4: Merged 'e r' -> low_ low_ low er
Step 5: Merged 'low_ low_' -> low_low_ low er
Tokens: [ 'low_low_', 'low', 'er' ]🔧 CLI Usage
If you installed globally or linked it locally:
npx bpe-tokenizer "low_low_lower" 5Parameters:
- First argument: input string (e.g.
"low_low_lower") - Second argument (optional): number of merges (default = 10)
📚 API
bytePairEncoding(text: string, numMerges: number = 10, verbose: boolean = false): string[]
- text: Input string
- numMerges: Number of BPE merge iterations
- verbose: Whether to log each merge step
🛠️ Development
Run locally:
npm install
node cli.js "low_low_lower" 5📃 License
MIT © 2025 MOHD RAZA
