ts-tokenizer

v0.1.0

A TypeScript tokenizer library scaffold.

TypeScript Tokenizer MVP

This repository is an MVP implementation of a learning-oriented BPE tokenizer in TypeScript.

It demonstrates the core tokenizer loop:

text -> initial token ids -> pair statistics -> merge rules -> encode/decode
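
The loop above can be sketched in a few self-contained functions. This is a minimal illustration, not the package's actual code: the names countPairs, mergePair, and trainSketch are hypothetical, initial ids are assigned per character in order of first appearance, and training stops once no pair occurs at least twice. All of these are assumptions made for the example.

```typescript
type Pair = [number, number];

interface MergeRule {
  pair: Pair; // the two ids that were merged
  id: number; // the new id that replaces them
}

// Count how often each adjacent pair of ids occurs (key format "a,b").
function countPairs(ids: number[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (let i = 0; i + 1 < ids.length; i++) {
    const key = `${ids[i]},${ids[i + 1]}`;
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  return counts;
}

// Replace non-overlapping occurrences of `pair` with `newId`, scanning left to right.
function mergePair(ids: number[], pair: Pair, newId: number): number[] {
  const out: number[] = [];
  let i = 0;
  while (i < ids.length) {
    if (i + 1 < ids.length && ids[i] === pair[0] && ids[i + 1] === pair[1]) {
      out.push(newId);
      i += 2;
    } else {
      out.push(ids[i]);
      i += 1;
    }
  }
  return out;
}

// Hypothetical training loop: text -> initial ids -> pair statistics -> merge rules.
function trainSketch(text: string, maxMerges: number): { ids: number[]; merges: MergeRule[] } {
  // Assumption: one initial id per distinct character, in order of first appearance.
  const charIds = new Map<string, number>();
  let ids: number[] = [];
  for (const ch of Array.from(text)) {
    if (!charIds.has(ch)) charIds.set(ch, charIds.size);
    ids.push(charIds.get(ch)!);
  }

  const merges: MergeRule[] = [];
  let nextId = charIds.size;
  for (let step = 0; step < maxMerges; step++) {
    // Pick the most frequent pair; ties go to the pair seen first.
    let bestKey: string | undefined;
    let bestCount = 1; // assumption: only merge pairs that occur at least twice
    for (const [key, count] of Array.from(countPairs(ids).entries())) {
      if (count > bestCount) {
        bestKey = key;
        bestCount = count;
      }
    }
    if (bestKey === undefined) break;

    const [a, b] = bestKey.split(",").map(Number);
    const pair: Pair = [a, b];
    ids = mergePair(ids, pair, nextId);
    merges.push({ pair, id: nextId });
    nextId += 1;
  }
  return { ids, merges };
}
```

For example, trainSketch("abab", 5) learns a single rule that merges the ids for "a" and "b", then stops early because no remaining pair occurs twice.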

Current Capabilities

  • Train a small BPE model from text with trainBpe.
  • Learn multiple merge rules with maxMerges.
  • Encode text with a trained model.
  • Decode token ids back to text.
  • Fall back to UTF-8 byte tokens for characters not seen during training.
  • Create a convenient tokenizer object with createTokenizer.
  • Export and import models with JSON-compatible data.
  • Validate serialized model version, mode, vocabulary, and merge rules on import.
  • Train, encode, and decode files from the CLI.
  • Run tests with Vitest.
  • Keep the package entry separate from the runnable demo.
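
The byte fallback mentioned in the list above could work roughly as follows. This is a sketch under stated assumptions, not the package's implementation: it assumes ids 0 to 255 are reserved for raw bytes and that known characters are looked up in a map, and the names utf8Bytes and toInitialIds are hypothetical.

```typescript
// Encode one Unicode code point as its UTF-8 byte sequence.
function utf8Bytes(codePoint: number): number[] {
  if (codePoint < 0x80) return [codePoint];
  if (codePoint < 0x800) {
    return [0xc0 | (codePoint >> 6), 0x80 | (codePoint & 0x3f)];
  }
  if (codePoint < 0x10000) {
    return [
      0xe0 | (codePoint >> 12),
      0x80 | ((codePoint >> 6) & 0x3f),
      0x80 | (codePoint & 0x3f),
    ];
  }
  return [
    0xf0 | (codePoint >> 18),
    0x80 | ((codePoint >> 12) & 0x3f),
    0x80 | ((codePoint >> 6) & 0x3f),
    0x80 | (codePoint & 0x3f),
  ];
}

// Map text to initial token ids.
// Assumption: ids 0..255 are byte tokens, and characters seen during
// training have ids >= 256 stored in `charIds`.
function toInitialIds(text: string, charIds: Map<string, number>): number[] {
  const ids: number[] = [];
  for (const ch of Array.from(text)) {
    const known = charIds.get(ch);
    if (known !== undefined) {
      ids.push(known);
    } else {
      // Unseen character: emit one byte token per UTF-8 byte.
      ids.push(...utf8Bytes(ch.codePointAt(0)!));
    }
  }
  return ids;
}
```

Because every byte value has its own token in this layout, any input can be encoded, and decoding can reassemble the bytes into the original character.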

Commands

Install dependencies:

npm install

Run the demo:

npm run dev

Run the demo with a custom merge count:

npm run dev -- --max-merges=20

Run tests:

npm test

Typecheck and build:

npm run typecheck
npm run build

CLI

Train a model from a text file:

npm run cli -- train --input corpus.txt --output tokenizer.json --max-merges 100
# After package installation:
tokenize train --input corpus.txt --output tokenizer.json --max-merges 100

Encode a text file into token ids:

npm run cli -- encode --model tokenizer.json --input input.txt --output ids.json
# After package installation:
tokenize encode --model tokenizer.json --input input.txt --output ids.json

Decode token ids back to text:

npm run cli -- decode --model tokenizer.json --input ids.json --output decoded.txt
# After package installation:
tokenize decode --model tokenizer.json --input ids.json --output decoded.txt

Project Structure

  • src/bpe.ts: core BPE training, encode/decode, and tokenizer facade.
  • src/types.ts: shared tokenizer, model, token, and training result types.
  • src/utf8.ts: UTF-8 byte fallback helpers and 256-byte base vocabulary constants.
  • src/pairs.ts: pair keys, pair statistics, most-frequent-pair selection, and pair merging.
  • src/model.ts: vocabulary lookup plus model import/export.
  • src/cli.ts: command-line train/encode/decode entry point.
  • src/demo.ts: runnable learning/demo script.
  • tests/: Vitest test suite for public entry points and BPE behavior.
  • ROADMAP.md: follow-up feature directions and architectural next steps.
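
Import-time validation of the kind handled in src/model.ts (version, mode, vocabulary, and merge rules) might look like the sketch below. The serialized shape, the field names, the expected values 1 and "bpe", and the function name validateModelSketch are all assumptions made for illustration; the package's real JSON layout may differ.

```typescript
// Hypothetical serialized shape, not the package's actual format.
interface SerializedModelSketch {
  version: number;
  mode: string;
  vocab: string[]; // index = token id, value = token text
  merges: Array<[number, number]>; // each rule merges two existing token ids
}

// Check untrusted JSON data before treating it as a model.
function validateModelSketch(data: unknown): SerializedModelSketch {
  if (typeof data !== "object" || data === null) {
    throw new Error("model must be an object");
  }
  const m = data as Record<string, unknown>;
  const version = m["version"];
  const mode = m["mode"];
  const vocab = m["vocab"];
  const merges = m["merges"];

  if (version !== 1) throw new Error(`unsupported version: ${String(version)}`);
  if (mode !== "bpe") throw new Error(`unsupported mode: ${String(mode)}`);
  if (!Array.isArray(vocab) || !vocab.every((t) => typeof t === "string")) {
    throw new Error("vocab must be an array of strings");
  }
  if (!Array.isArray(merges)) throw new Error("merges must be an array");

  // Every merge rule must be a pair of ids that exist in the vocabulary.
  const vocabSize = vocab.length;
  merges.forEach((rule, i) => {
    const ok =
      Array.isArray(rule) &&
      rule.length === 2 &&
      rule.every((id) => Number.isInteger(id) && id >= 0 && id < vocabSize);
    if (!ok) throw new Error(`invalid merge rule at index ${i}`);
  });

  return {
    version: 1,
    mode: "bpe",
    vocab: vocab as string[],
    merges: merges as Array<[number, number]>,
  };
}
```

Rejecting bad data at import time keeps encode and decode from failing later on ids that point outside the vocabulary.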

Example API

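import { createTokenizer, trainBpe } from "ts-tokenizer";

// Train a small BPE model, learning at most 5 merge rules.
const result = trainBpe("banana bandana banana", { maxMerges: 5 });
// Wrap the trained model in a tokenizer object.
const tokenizer = createTokenizer(result.model);

const ids = tokenizer.encode("banana"); // text -> token ids
const text = tokenizer.decode(ids); // token ids -> text
const count = tokenizer.count("banana"); // token count for the text
const json = tokenizer.toJSON(); // JSON-compatible model data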

MVP Limitations

  • This is a teaching-oriented BPE implementation, not a GPT/tiktoken-compatible tokenizer.
  • Pair statistics are recomputed from scratch after every merge, so training time grows roughly with corpus length times the number of merges.

See ROADMAP.md for planned directions and .note/ for deferred design concerns discovered during development.