@codecai/maps-cli

v0.5.0

The tsc --declaration for LLM token vocabularies.

Generate Codec tokenizer dialect maps from HuggingFace tokenizer.json files. Maps are content-addressed, immutable JSON files that any @codecai/web client can use to encode/decode token streams.

Install

npm install -g @codecai/maps-cli

Or run without installing:

npx @codecai/maps-cli build Qwen/Qwen2.5-7B-Instruct --id=qwen/qwen2

CLI

build — fetch from HuggingFace and convert

codecai-maps build <hf-model> [--id=<id>] [--out=<path>] [--token=<hf-token>]

Fetches tokenizer.json from https://huggingface.co/<hf-model>, converts to a Codec TokenizerMap, writes JSON to disk, and prints the canonical sha256 hash.

$ codecai-maps build Qwen/Qwen2.5-7B-Instruct --id=qwen/qwen2
▶ fetching Qwen/Qwen2.5-7B-Instruct from HuggingFace…
✓ written  qwen_qwen2.json
  id           qwen/qwen2
  vocab_size   151665
  encoder      byte_level
  merges       151387
  hash         sha256:c73972f7a580…

For gated models (Llama, Gemma) pass a HuggingFace access token: --token=hf_xxx.

convert — local file in, map out

codecai-maps convert ./tokenizer.json --id=my-org/my-model --out=./my-model.json

validate — schema check

codecai-maps validate ./qwen_qwen2.json

hash — print canonical sha256

codecai-maps hash ./qwen_qwen2.json
# → sha256:c73972f7a580936d724ffd8df9df2ce546d255c543e9d09b6d75e5bf69b1a64d

Use this value when pinning a map: loadMap({ url, hash }) rejects any map whose hash doesn't match.
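The pinning idea can be sketched with Node's built-in crypto module. This is a minimal illustration, not the loadMap implementation: sha256Tag and assertPinned are hypothetical helpers, and the digest here is taken over the bytes as given, whereas the CLI's canonical hash may normalise the JSON first.

```typescript
import { createHash } from "node:crypto";

// Format a digest the way the CLI prints it: "sha256:<64 hex chars>".
function sha256Tag(body: string): string {
  return "sha256:" + createHash("sha256").update(body).digest("hex");
}

// Hypothetical helper: throw when the body does not match the pinned hash.
// Assumption: the digest covers the body exactly as received.
function assertPinned(body: string, pinned: string): void {
  const actual = sha256Tag(body);
  if (actual !== pinned) {
    throw new Error("hash mismatch: expected " + pinned + ", got " + actual);
  }
}

const body = '{"id":"demo/map"}';
const pin = sha256Tag(body);
assertPinned(body, pin); // passes silently

let rejected = false;
try {
  assertPinned(body + " ", pin); // one extra byte changes the digest
} catch (_err) {
  rejected = true;
}
```

Because the pin is derived from content, a host can move or mirror the file freely; only a change to the bytes invalidates it.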

preview — sanity check round-trip

codecai-maps preview ./qwen_qwen2.json --text="Explain entropy."
# map:           qwen/qwen2
# tokenizer:     BPETokenizer
# input:         "Explain entropy."
# token IDs:     [840, 20772, 47502, 13]
# token count:   4
# round-trip:    "Explain entropy."
# exact match:   YES

translate — cross-vocab token stream conversion

Pipe one tokenizer's IDs through another's vocab with streaming-safe word-boundary buffering. Useful for previewing what an agent-to-agent handoff actually emits at the token level.

codecai-maps translate --from=qwen2.json --to=llama-3.json \
  --text="The quick brown fox."

# from:    qwen/qwen2
# to:      meta-llama/llama-3
# input:   "The quick brown fox."
# src ids: [785, 4937, 13876, 38835, 13]   (5 tokens, qwen-2)
# dst ids: [791, 4062, 14198, 39935, 13]   (5 tokens, llama-3)
# decoded: "The quick brown fox."          (round-trip via llama-3 detok)

Or with raw IDs:

codecai-maps translate --from=qwen2.json --to=llama-3.json --ids=785,4937
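To give a feel for what word-boundary buffering means in a streaming setting, here is a small sketch. WordBoundaryBuffer is a hypothetical class, not part of @codecai/maps-cli, and it assumes a plain space is the only boundary that matters. It holds back the last, possibly incomplete word so that only whole words reach the target encoder.

```typescript
// Hypothetical helper: holds back the trailing, possibly incomplete word
// of a decoded text stream.
class WordBoundaryBuffer {
  private pending = "";

  // Append a decoded chunk; return the text that is safe to re-encode now.
  push(chunk: string): string {
    this.pending += chunk;
    // Everything before the last space consists of complete words. The space
    // stays with the next word, matching " word"-style vocab entries.
    const cut = this.pending.lastIndexOf(" ");
    if (cut <= 0) return "";
    const ready = this.pending.slice(0, cut);
    this.pending = this.pending.slice(cut);
    return ready;
  }

  // End of stream: release whatever is still buffered.
  flush(): string {
    const rest = this.pending;
    this.pending = "";
    return rest;
  }
}

const buf = new WordBoundaryBuffer();
const emitted: string[] = [];
for (const chunk of ["The qu", "ick bro", "wn fox."]) {
  const ready = buf.push(chunk);
  if (ready !== "") emitted.push(ready);
}
emitted.push(buf.flush());
// emitted is ["The", " quick", " brown", " fox."]
```

Each emitted piece can be encoded with the target vocab on its own, and the pieces concatenate back to the original input.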

translation-table — context-free V_A → V_B[] lookup

codecai-maps translation-table --from=qwen2.json --to=llama-3.json \
  --out=qwen-to-llama.json

Emits a JSON file that maps every non-special source ID to the sequence of target IDs that its rendered text encodes to. The table is context-free, whereas real BPE merges depend on the neighbouring text, so prefer the streaming translate command at runtime and keep the static table for analysis (vocab overlap, cost estimation).
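A toy example shows why a per-ID table can disagree with encoding the full text. The encoder below is a greedy longest-match stand-in, not real BPE, and both vocabularies are invented for illustration.

```typescript
type Vocab = Record<string, number>;

function has(vocab: Vocab, piece: string): boolean {
  return Object.prototype.hasOwnProperty.call(vocab, piece);
}

// Toy greedy longest-match encoder. It stands in for a real BPE encoder
// only to show the effect; it is not how Codec maps tokenize.
function encode(text: string, vocab: Vocab): number[] {
  const ids: number[] = [];
  let pos = 0;
  while (pos < text.length) {
    let end = text.length;
    while (end > pos && !has(vocab, text.slice(pos, end))) end--;
    if (end === pos) throw new Error("no token for: " + text.charAt(pos));
    ids.push(vocab[text.slice(pos, end)]);
    pos = end;
  }
  return ids;
}

// Build the static table: each source ID maps to the target IDs for its
// text taken in isolation.
function buildTable(source: Vocab, target: Vocab): Record<number, number[]> {
  const table: Record<number, number[]> = {};
  for (const piece of Object.keys(source)) {
    table[source[piece]] = encode(piece, target);
  }
  return table;
}

const vocabA: Vocab = { ab: 0, c: 1 };
const vocabB: Vocab = { a: 10, b: 11, c: 12, ab: 13, abc: 14 };

const table = buildTable(vocabA, vocabB);
const srcIds = encode("abc", vocabA); // [0, 1]
const viaTable = srcIds.reduce<number[]>((out, id) => out.concat(table[id]), []);
const direct = encode("abc", vocabB);
// viaTable is [13, 12] (2 tokens); direct is [14] (1 token)
```

The table answers "what does this one token become?", which is enough for overlap and cost estimates but can overcount tokens compared with re-encoding the text in context.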

Programmatic API

import { readFile } from 'node:fs/promises';
import { convertHFTokenizer, fetchAndConvert, hashMap } from '@codecai/maps-cli/convert';

// From a parsed tokenizer.json object
const hfJson = JSON.parse(await readFile('./tokenizer.json', 'utf8'));
const localMap = convertHFTokenizer(hfJson, { id: 'my-org/my-model' });

// Or fetch from HuggingFace directly
const fetchedMap = await fetchAndConvert({
  hfModel: 'Qwen/Qwen2.5-7B-Instruct',
  id: 'qwen/qwen2',
});

// Compute the hash for pinning
const hash = await hashMap(fetchedMap);

What gets generated

The output is a JSON file matching the TokenizerMap schema from @codecai/web (v2.1):

{
  "id": "qwen/qwen2",
  "version": "2",
  "vocab_size": 151665,
  "vocab": { "Hello": 9707, "Ġworld": 1879, "...": 0 },
  "encoder": "byte_level",
  "merges": ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "..."],
  "pre_tokenizer_pattern": "(?i:'s|'t|'re|...)| ?\\p{L}+|...",
  "pre_tokenizer_program": {
    "version": 1,
    "ops": [
      { "op": "literals_ci", "patterns": ["'s","'t","'re","'ve","'m","'ll","'d"] },
      { "op": "letters",     "lead_other": true },
      { "op": "numbers",     "max_run": 1 },
      { "op": "punct_run",   "lead_space": true, "trailing_newlines": true },
      { "op": "newline_block" },
      { "op": "trailing_ws" },
      { "op": "ws_run" }
    ]
  },
  "special_tokens": {
    "<|endoftext|>": 151643,
    "<|im_start|>": 151644
  },
  "published_at": "2026-05-06T12:00:00.000Z"
}

The schema covers three tokenizer families that span ~95% of open models:

  • byte_level — GPT-2 byte→unicode BPE (Llama-3, Qwen, Phi-3, Mistral-Nemo, DeepSeek-V3, …).
  • metaspace-prefix BPE with byte fallback (Llama-2, Mistral-v3, Mixtral, Gemma).
  • identity — vocab-only tokenizers without merges (canonical-IR / closed vocabs).
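For a quick structural sanity check before running validate, the sample above suggests a minimal shape. The interface and checkShape below are a sketch inferred from that sample only; the authoritative schema lives in @codecai/web and may require more or fewer fields.

```typescript
// Fields inferred from the sample output; this is not the official schema.
interface TokenizerMapSketch {
  id: string;
  version: string;
  vocab_size: number;
  vocab: Record<string, number>;
  encoder: string;
  merges?: string[]; // absent for vocab-only (identity) maps
  special_tokens?: Record<string, number>;
}

// Hypothetical structural check: returns a list of problems (empty if none).
function checkShape(value: unknown): string[] {
  if (typeof value !== "object" || value === null) return ["not an object"];
  const m = value as Record<string, unknown>;
  const problems: string[] = [];
  if (typeof m["id"] !== "string") problems.push("id must be a string");
  if (typeof m["version"] !== "string") problems.push("version must be a string");
  if (typeof m["vocab_size"] !== "number") problems.push("vocab_size must be a number");
  if (typeof m["encoder"] !== "string") problems.push("encoder must be a string");
  if (typeof m["vocab"] !== "object" || m["vocab"] === null) {
    problems.push("vocab must be an object");
  }
  return problems;
}

const sampleMap: TokenizerMapSketch = {
  id: "qwen/qwen2",
  version: "2",
  vocab_size: 151665,
  vocab: { Hello: 9707 },
  encoder: "byte_level",
};
const sampleProblems = checkShape(sampleMap); // empty list
```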

Pre-tokenizer program (v2.1, additive)

Both pre_tokenizer_pattern and pre_tokenizer_program describe the same splitter. The program is the regex compiled into a named-op list; runtimes prefer it when present so they can encode without a Unicode regex engine. The CLI emits it automatically for any pre-tokenizer regex it recognises (currently the GPT-2-family canonical form used by Llama-3, Qwen, Phi-3, DeepSeek-V3, Mistral-Nemo, Falcon, SmolLM2, Codestral byte_level). Maps with unrecognised regexes still build normally — pre_tokenizer_program is just omitted, and runtimes fall back to the regex string.
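The preference order described here can be sketched as a small selection function. chooseSplitter and its types are hypothetical; only the two field names come from the map format, and actual runtimes may structure this differently.

```typescript
interface PreTokenizerProgram {
  version: number;
  ops: Array<{ op: string; [key: string]: unknown }>;
}

interface SplitterFields {
  pre_tokenizer_pattern?: string;
  pre_tokenizer_program?: PreTokenizerProgram;
}

type SplitterChoice =
  | { kind: "program"; program: PreTokenizerProgram }
  | { kind: "regex"; pattern: string }
  | { kind: "none" };

// Hypothetical selection logic: prefer the compiled program, fall back to
// the regex string, and report "none" when the map carries neither.
function chooseSplitter(map: SplitterFields): SplitterChoice {
  if (map.pre_tokenizer_program !== undefined) {
    return { kind: "program", program: map.pre_tokenizer_program };
  }
  if (map.pre_tokenizer_pattern !== undefined) {
    return { kind: "regex", pattern: map.pre_tokenizer_pattern };
  }
  return { kind: "none" };
}

const withProgram = chooseSplitter({
  pre_tokenizer_pattern: " ?\\p{L}+",
  pre_tokenizer_program: { version: 1, ops: [{ op: "letters", lead_other: true }] },
});
const regexOnly = chooseSplitter({ pre_tokenizer_pattern: " ?\\p{L}+" });
// withProgram.kind is "program"; regexOnly.kind is "regex"
```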

See spec/PRETOKENIZER_PROGRAM.md for the full op set and equivalence rules.

Hosting your map

Once generated, host the JSON anywhere static:

  • GitHub + jsDelivr (free CDN): commit to a public repo, then
    https://cdn.jsdelivr.net/gh/<user>/<repo>/path/to/map.json
  • Hugging Face: push to a Space or alongside your model weights.
  • S3 / Cloudflare R2: standard static hosting.
  • Codec community registry: contribute via PR to codec-maps.

Then any client can pin against your hash:

import { loadMap } from '@codecai/web';

const map = await loadMap({
  url: 'https://your-host/your-model.json',
  hash: 'sha256:abcd1234…',
});

well-known — publish for .well-known/codec/ discovery

Generate the static directory tree clients need to find your map by (origin, id) alone, so consumers don't have to hard-code your CDN URL:

codecai-maps well-known --map=./qwen_qwen2.json \
  --url=https://cdn.example/qwen2.json \
  --out-dir=./public

This writes:

public/.well-known/codec/maps/qwen/qwen2.json   ← pointer { id, url, hash }
public/.well-known/codec/index.json             ← directory of all your maps

Drop ./public onto any static host (GitHub Pages, S3, Vercel) under the origin you control, and any client can do:

import { discoverMap } from '@codecai/web';
const map = await discoverMap({ origin: 'https://qwen.io', id: 'qwen/qwen2' });

Pass --inline instead of --url to embed the full map at the well-known location (skips the CDN indirection — recommended only for small maps). Re-running with the same id replaces the existing index entry. See spec/WELL_KNOWN_DISCOVERY.md for the publishing contract.
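Going by the file listing above, the pointer location can be derived from (origin, id) alone. pointerUrl is a hypothetical helper, and it assumes the id is used verbatim as the path under .well-known/codec/maps/; the publishing contract in the spec is the authority here.

```typescript
// Pointer document written at the well-known path, per the listing above.
interface MapPointer {
  id: string;
  url: string;
  hash: string;
}

// Hypothetical helper: where a client would look for the pointer.
function pointerUrl(origin: string, id: string): string {
  const base = origin.replace(/\/+$/, ""); // drop trailing slashes
  return base + "/.well-known/codec/maps/" + id + ".json";
}

const where = pointerUrl("https://qwen.io/", "qwen/qwen2");
// where is "https://qwen.io/.well-known/codec/maps/qwen/qwen2.json"

// Example pointer body, using the values from the commands above.
const pointer: MapPointer = {
  id: "qwen/qwen2",
  url: "https://cdn.example/qwen2.json",
  hash: "sha256:c73972f7a580…",
};
```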

policies-* — safety-policy descriptor lifecycle (v0.4)

The v0.4 safety-policy negotiation spec ships four CLI subcommands that mirror the tokenizer-map shape exactly; v0.5 adds a fifth, policies-enumerate:

# Validate that an operator-internal policy is well-formed.
codecai-maps policies-validate ./internal-config.json

# Strip internal-only fields (banned_token_ids, regex_patterns,
# grammar_constraints, multi_token_patterns, classifier thresholds /
# weights) and emit the publishable descriptor — what the world sees
# at .well-known/codec/policies/<id>.json. Internal-field counts
# survive as rules_summary.* for auditors.
codecai-maps policies-sanitize --internal=./internal-config.json \
  --out=./acme-strict-v3.policy.json

# Canonical sha256 over the sanitized descriptor — bit-identical
# across @codecai/web, codecai (Python), codec-rs, Codec.Net,
# codec (Java), libcodec, and codec-supervisor.
codecai-maps policies-hash ./acme-strict-v3.policy.json

# Emit both .well-known/codec/policies/<id>.json (mutable pointer or
# inline) AND .well-known/codec/policies/sha256/<hex>.json (immutable
# content-addressed sibling) so clients that received a hash in READY
# can fetch + verify without a redirect hop.
codecai-maps policies-well-known --descriptor=./acme-strict-v3.policy.json \
  --inline --out-dir=./public

# v0.5 (resolves v0.4-OQ4): productize the offline enumerator scripts.
# Reads a JSON array of literal strings, generates surface variants
# (verbatim / leading-space / leading-newline / lowercase / titlecase /
# uppercase / trimmed), tokenizes each variant through the supplied
# tokenizer map, deduplicates by token sequence, and writes a JSON file
# ready to paste into your internal policy's 'multi_token_patterns'
# field. The output pins the tokenizer-map sha256 so the enumeration is
# verifiably tied to the exact map bytes the runtime will tokenize
# against.
codecai-maps policies-enumerate --map=./qwen_qwen2.json \
  --literals=./adversarial-strings.json \
  --out=./enumerated-patterns.json
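The variant-generation step of policies-enumerate can be approximated as follows. This is a sketch under assumptions: titleCase is one plausible reading of "titlecase", and duplicates are dropped by string equality as a stand-in for the CLI's deduplication by token sequence, which needs a tokenizer map.

```typescript
// Assumption: "titlecase" capitalises the first letter of each
// whitespace-separated word; the CLI may define it differently.
function titleCase(s: string): string {
  return s.replace(/(^|\s)(\S)/g, (_m, lead: string, ch: string) => lead + ch.toUpperCase());
}

// Produce the seven variants named above, then drop exact duplicates.
function surfaceVariants(literal: string): string[] {
  const candidates = [
    literal,               // verbatim
    " " + literal,         // leading-space
    "\n" + literal,        // leading-newline
    literal.toLowerCase(), // lowercase
    titleCase(literal),    // titlecase
    literal.toUpperCase(), // uppercase
    literal.trim(),        // trimmed
  ];
  const unique: string[] = [];
  for (const candidate of candidates) {
    if (unique.indexOf(candidate) === -1) unique.push(candidate);
  }
  return unique;
}

const variants = surfaceVariants("Hello world");
// 6 unique strings: "trimmed" equals "verbatim" for this input
```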

The descriptor never contains operator-internal contents — that's the disclosure-boundary contract (an attacker who can fetch the .well-known page learns the shape of enforcement, not the contents of banned-token lists or classifier thresholds). The internal-config side lives in codec-supervisor under policies_dir/ and is never published.
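The disclosure boundary can be illustrated with a small sanitizer. This is a hypothetical sketch, not the policies-sanitize implementation: the "<field>_count" key names under rules_summary are assumptions, and classifier thresholds and weights are left out for brevity.

```typescript
// Internal-only list fields named in the section above.
const INTERNAL_LIST_FIELDS = [
  "banned_token_ids",
  "regex_patterns",
  "grammar_constraints",
  "multi_token_patterns",
];

type PolicyDoc = Record<string, unknown>;

// Hypothetical sanitizer: copies public fields, drops internal lists, and
// records only their lengths so auditors can see the shape of enforcement.
function sanitize(internal: PolicyDoc): PolicyDoc {
  const descriptor: PolicyDoc = {};
  const summary: Record<string, number> = {};
  for (const key of Object.keys(internal)) {
    const value = internal[key];
    if (INTERNAL_LIST_FIELDS.indexOf(key) !== -1) {
      summary[key + "_count"] = Array.isArray(value) ? value.length : 0;
    } else {
      descriptor[key] = value;
    }
  }
  descriptor["rules_summary"] = summary;
  return descriptor;
}

const published = sanitize({
  id: "acme-strict-v3",
  banned_token_ids: [11, 42, 977],
  regex_patterns: ["secret-\\d+"],
});
// published keeps id and rules_summary, but none of the banned IDs
```

A reader of the published descriptor learns that three token IDs are banned and one regex is in force, but not which ones.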

License

MIT.