token-vocabs

v0.6.0

Count and inspect token IDs across several modern tokenizer families offline.

Downloads

1,174

Count tokens or inspect token IDs across several modern tokenizer families from one local, offline-friendly package.

Supported models

  • GPT → o200k_base
  • Gemma 4 31B it
  • Qwen 3.6 27B
  • Kimi K2.7 Code
  • DeepSeek V4 Pro
  • MiMo V2.5 Pro
  • Stable Diffusion XL
  • GLM 5.1
  • MiniMax M3
  • Hy3 Preview
  • Step 3.7 Flash

Highlights

  • offline at runtime once the vendored assets are present
  • browser-friendly once bundled
  • exact golden outputs for the core sample fixture
  • one Brotli-compressed MessagePack asset bundle per model
  • browser Brotli decompression with a bundled JS fallback where native stream support is missing
  • Rolldown browser builds that emit binary vocabulary bundles, shared chunks and the required WASM asset
  • async auto-loading API plus loaded-only sync helpers
  • one small single-model API for counts, token IDs and byte offsets
  • generated tokenizer assets via bun run fetch
  • publish-ready browser dist/ builds that keep vocabularies outside the JavaScript entry, emit the required WASM files and include package metadata plus declarations

Usage

import tokenize from 'token-vocabs'

console.dir(await tokenize('mind goblin', 'gpt'))

import {count} from 'token-vocabs'

console.dir(await count(new TextEncoder().encode('mind goblin'), {model: 'gpt'}))

import {load, tokenizeLoaded} from 'token-vocabs'

await load(['gpt', 'deepseek'])
console.dir(tokenizeLoaded('mind goblin', 'gpt'))

Example output

await count('mind goblin', 'gpt')
// 3

await tokenize('mind goblin', 'gpt')
// {
//   offsets: [4, 8],
//   tokens: [77021, 18778, 4724],
// }

API

async count(textOrBytes, optionsOrModel)

Returns the token count for exactly one model and loads the required vocabulary bundle on demand.

Uint8Array input is decoded as UTF-8.

await count('mind goblin', 'sdxl')
await count('mind goblin', {model: 'gpt'})
await count(new TextEncoder().encode('mind goblin'), 'gpt')

countLoaded(textOrBytes, optionsOrModel)

Synchronous count helper that uses the existing in-memory tokenizer state and throws if the requested vocabulary is not loaded yet.

This is useful after await load() or after a previous await count() / await tokenize() call has already loaded the model.
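
A common way to combine the sync and async helpers is to try the sync path first and fall back to the on-demand loader. The following is a minimal sketch, not part of token-vocabs: countPreferLoaded, fakeCountLoaded and fakeCount are hypothetical names, the function types are simplified stand-ins for countLoaded() and count(), and it assumes that any throw from the sync helper means the vocabulary is not loaded yet.

```typescript
// Signatures modeled on the README's countLoaded() and count(), narrowed to strings for brevity.
type SyncCount = (input: string, model: string) => number
type AsyncCount = (input: string, model: string) => Promise<number>

// Hypothetical helper (not part of token-vocabs): prefer the sync path, fall back to the async one.
async function countPreferLoaded(
  countLoaded: SyncCount,
  count: AsyncCount,
  input: string,
  model: string,
): Promise<number> {
  try {
    return countLoaded(input, model)
  } catch {
    // Assumption: a throw here means "vocabulary not loaded yet". Real code may want to inspect the error.
    return count(input, model)
  }
}

// Stand-ins so the sketch runs without the package; they count characters, not real tokens.
const loadedModels = new Set<string>()
const fakeCountLoaded: SyncCount = (input, model) => {
  if (!loadedModels.has(model)) throw new Error(`model not loaded: ${model}`)
  return input.length
}
const fakeCount: AsyncCount = async (input, model) => {
  loadedModels.add(model) // mimics loading the vocabulary on demand
  return fakeCountLoaded(input, model)
}

countPreferLoaded(fakeCountLoaded, fakeCount, 'mind goblin', 'gpt').then((n) => console.log(n))
```

With the real package, the two stand-ins would be replaced by the imported countLoaded and count, possibly wrapped in small arrow functions if the parameter types differ.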

async tokenize(textOrBytes, optionsOrModel)

Returns a RawTokenizeResult for exactly one model and loads the required vocabulary bundle on demand.

await tokenize('mind goblin', 'gpt')
await tokenize('mind goblin', {model: 'gpt'})

tokenizeLoaded(textOrBytes, optionsOrModel)

Synchronous tokenization helper that reuses already loaded vocabularies and throws if the requested model is not in memory yet.

The result shape is:

type RawTokenizeResult = {
  offsets: number[]
  tokens: number[]
  processedInput?: string | Uint8Array
}

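offsets holds the byte offset at which each token after the first starts. The first token’s implicit start at byte 0 is omitted to save one array slot, so offsets has one entry fewer than tokens.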

If a tokenizer normalizes or otherwise preprocesses the input, processedInput contains the effective tokenizer input. Its type matches the input kind – string in, string out; Uint8Array in, Uint8Array out.
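
To make the offsets convention concrete, here is a small sketch that turns a result into one [start, end) byte span per token. tokenSpans and TokenSpan are hypothetical names, not part of token-vocabs. The sketch assumes the last token ends at the end of the input bytes and does not handle the case where processedInput differs from the original input.

```typescript
// Shape copied from the README's RawTokenizeResult.
type RawTokenizeResult = {
  offsets: number[]
  tokens: number[]
  processedInput?: string | Uint8Array
}

// Hypothetical helper type (not part of token-vocabs): one [start, end) byte span per token.
type TokenSpan = {token: number; start: number; end: number}

function tokenSpans(result: RawTokenizeResult, inputByteLength: number): TokenSpan[] {
  // Re-add the implicit 0 that `offsets` leaves out for the first token.
  const starts = result.tokens.length > 0 ? [0, ...result.offsets] : []
  return starts.map((start, index) => ({
    token: result.tokens[index],
    start,
    // A token ends where the next one starts; the last one is assumed to end at the input's end.
    end: index + 1 < starts.length ? starts[index + 1] : inputByteLength,
  }))
}

// Values taken from the README's example output for 'mind goblin' with the 'gpt' model.
const sample: RawTokenizeResult = {offsets: [4, 8], tokens: [77021, 18778, 4724]}
const bytes = new TextEncoder().encode('mind goblin')
const spans = tokenSpans(sample, bytes.length)

// Slice the input bytes per span to see which text each token covers.
const pieces = spans.map(({start, end}) => new TextDecoder().decode(bytes.subarray(start, end)))
console.log(spans, pieces)
```

Decoding each slice separately only gives readable text when token boundaries fall on UTF-8 character boundaries, which holds for this ASCII sample but is not guaranteed in general.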

If you need results for several models, call count() or tokenize() once per model and combine the results yourself.
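
One possible way to combine per-model results is sketched below. countPerModel and wordCount are hypothetical names, not part of token-vocabs. The count function is passed in as a parameter with a simplified signature, so the sketch runs on its own with a stand-in counter.

```typescript
// Signature modeled on the README's count(textOrBytes, optionsOrModel), narrowed to a model ID string.
type CountFn = (input: string | Uint8Array, model: string) => Promise<number>

// Hypothetical helper (not part of token-vocabs): one count() call per model, merged into one object.
async function countPerModel(
  count: CountFn,
  input: string | Uint8Array,
  models: readonly string[],
): Promise<Record<string, number>> {
  // Start all calls together; Promise.all keeps the results in the same order as `models`.
  const counts = await Promise.all(models.map((model) => count(input, model)))
  const result: Record<string, number> = {}
  models.forEach((model, index) => {
    result[model] = counts[index]
  })
  return result
}

// Stand-in counter so the sketch runs without the package.
// It counts whitespace-separated words, not real tokens.
const wordCount: CountFn = async (input) => {
  const text = typeof input === 'string' ? input : new TextDecoder().decode(input)
  return text.split(/\s+/).filter(Boolean).length
}

countPerModel(wordCount, 'mind goblin', ['gpt', 'deepseek']).then((counts) => console.log(counts))
```

With the real package, the imported count would take the place of wordCount, wrapped in an arrow function if its model parameter is typed more narrowly than string.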

async load(modelSelection?)

Preloads one or more model vocabularies into memory.

  • await load('gpt') → resolves to 'gpt'
  • await load(['gpt', 'deepseek']) → resolves to ['gpt', 'deepseek']
  • await load() → loads every supported model and resolves to modelIds

free(modelId?)

Releases a loaded model from memory, or every loaded model if no argument is provided.

modelIds

Exports the supported model IDs in stable default order.

models

Exports model metadata, including the original upstream source URLs used by bun run fetch.

token-vocabs/browser

Browser entry with the same count(), countLoaded(), tokenize(), tokenizeLoaded(), load() and free() API as the default token-vocabs entry.

It loads the .bin asset bundles via fetch().

token-vocabs/browser/all

Eager browser entry that runs await load() at module initialization time so countLoaded() and tokenizeLoaded() work immediately after import.

Distribution layout

The published browser package exposes both token-vocabs and token-vocabs/browser as the lazy entry, backed by main.js, and token-vocabs/browser/all as the eager entry, backed by all.js.

It also contains:

  • one Brotli-compressed MessagePack asset bundle per model at the package root, plus shared chunks and the required WASM asset
  • package.json, README.md, LICENSE and declaration files so the folder can be published on its own

Example lazy browser usage from the published package:

import {countLoaded, load} from 'token-vocabs/browser'

await load(['gpt', 'deepseek'])
console.dir(countLoaded('mind goblin', 'deepseek'))

Notes

  • sdxl intentionally implements the shared CLIP BPE core used by SDXL without auto-adding BOS/EOS tokens.
  • GPT uses tiktoken lite plus a vendored o200k_base model string, so the browser WASM stays lean and the vocabulary still lives in the regular per-model asset bundle.
  • Structured tokenizer payloads are stored inside per-model .bin bundles and decompressed after loading.
  • Tokenizer assets are large. That is inherent to exact offline tokenization.