@codecai/maps-cli
v0.5.0
Generate Codec tokenizer dialect maps from HuggingFace tokenizer.json files. The 'tsc --declaration' for LLM token vocabularies.
The tsc --declaration for LLM token vocabularies.
Generate Codec tokenizer dialect maps from HuggingFace tokenizer.json files. Maps are content-addressed, immutable JSON files that any @codecai/web client can use to encode/decode token streams.
Install
npm install -g @codecai/maps-cli
Or run without installing:
npx @codecai/maps-cli build Qwen/Qwen2.5-7B-Instruct --id=qwen/qwen2
CLI
build — fetch from HuggingFace and convert
codecai-maps build <hf-model> [--id=<id>] [--out=<path>] [--token=<hf-token>]
Fetches tokenizer.json from https://huggingface.co/<hf-model>, converts it to a Codec TokenizerMap, writes the JSON to disk, and prints the canonical sha256 hash.
$ codecai-maps build Qwen/Qwen2.5-7B-Instruct --id=qwen/qwen2
▶ fetching Qwen/Qwen2.5-7B-Instruct from HuggingFace…
✓ written qwen_qwen2.json
id qwen/qwen2
vocab_size 151665
encoder byte_level
merges 151387
hash sha256:c73972f7a580…
For gated models (Llama, Gemma), pass a HuggingFace access token: --token=hf_xxx.
convert — local file in, map out
codecai-maps convert ./tokenizer.json --id=my-org/my-model --out=./my-model.json
validate — schema check
codecai-maps validate ./qwen_qwen2.json
hash — print canonical sha256
codecai-maps hash ./qwen_qwen2.json
# → sha256:c73972f7a580936d724ffd8df9df2ce546d255c543e9d09b6d75e5bf69b1a64d
Use this value when pinning a map: loadMap({ url, hash }) rejects any map whose hash does not match.
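To make the pinning idea concrete, here is a minimal sketch of a hash check. The canonical form used by Codec is not specified in this README, so the sketch assumes one (JSON with sorted object keys and no extra whitespace); `canonicalize`, `sketchHash`, and `verifyPinned` are hypothetical names, and the digests this produces should not be expected to match the output of codecai-maps hash.

```typescript
import { createHash } from "node:crypto";

type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

// ASSUMPTION: "canonical" means sorted object keys and no insignificant
// whitespace. The real Codec canonicalization may differ.
function canonicalize(value: Json): string {
  if (value === null || typeof value !== "object") return JSON.stringify(value);
  if (Array.isArray(value)) return "[" + value.map(canonicalize).join(",") + "]";
  const keys = Object.keys(value).sort();
  return "{" + keys.map((k) => JSON.stringify(k) + ":" + canonicalize(value[k])).join(",") + "}";
}

// Hash the canonical bytes and prefix the digest the way the CLI output does.
function sketchHash(value: Json): string {
  const digest = createHash("sha256").update(canonicalize(value), "utf8").digest("hex");
  return "sha256:" + digest;
}

// Throw if the content does not match the pinned hash.
function verifyPinned(value: Json, pinned: string): void {
  const actual = sketchHash(value);
  if (actual !== pinned) {
    throw new Error(`hash mismatch: expected ${pinned}, got ${actual}`);
  }
}
```

Because the hash is computed over a canonical serialization rather than the raw file bytes, two files that differ only in key order or whitespace would pin to the same value under this assumption.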
preview — sanity check round-trip
codecai-maps preview ./qwen_qwen2.json --text="Explain entropy."
# map: qwen/qwen2
# tokenizer: BPETokenizer
# input: "Explain entropy."
# token IDs: [840, 20772, 47502, 13]
# token count: 4
# round-trip: "Explain entropy."
# exact match: YES
translate — cross-vocab token stream conversion
Pipe one tokenizer's IDs through another's vocab with streaming-safe word-boundary buffering. Useful for previewing what an agent-to-agent handoff actually emits at the token level.
codecai-maps translate --from=qwen2.json --to=llama-3.json \
--text="The quick brown fox."
# from: qwen/qwen2
# to: meta-llama/llama-3
# input: "The quick brown fox."
# src ids: [785, 4937, 13876, 38835, 13] (5 tokens, qwen-2)
# dst ids: [791, 4062, 14198, 39935, 13] (5 tokens, llama-3)
# decoded: "The quick brown fox." (round-trip via llama-3 detok)
Or with raw IDs:
codecai-maps translate --from=qwen2.json --to=llama-3.json --ids=785,4937
translation-table — context-free V_A → V_B[] lookup
codecai-maps translation-table --from=qwen2.json --to=llama-3.json \
--out=qwen-to-llama.json
Emits a JSON file mapping every non-special source ID to the sequence
of target IDs that its rendered text encodes to. The table is
context-free, whereas BPE merges depend on neighbouring text, so a
per-ID lookup can differ from re-encoding the full string. Prefer the
streaming translate command for runtime use; the static table is for
analysis (vocab overlap, cost estimation).
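As an illustration of the analysis use case, the sketch below summarizes a translation table. The on-disk layout of the table is not documented here, so the shape (an object keyed by source ID, each value an array of target IDs) is an assumption, and `TranslationTable` and `summarizeTable` are hypothetical names.

```typescript
// ASSUMPTION: the table is an object keyed by source-ID string, each value
// being the list of target IDs. The real file layout may differ.
type TranslationTable = Record<string, number[]>;

interface TableStats {
  sourceIds: number;        // number of source IDs covered by the table
  oneToOne: number;         // source IDs that map to exactly one target ID
  meanTargetLength: number; // average target tokens per source token
}

function summarizeTable(table: TranslationTable): TableStats {
  const rows = Object.values(table);
  const total = rows.reduce((sum, ids) => sum + ids.length, 0);
  return {
    sourceIds: rows.length,
    oneToOne: rows.filter((ids) => ids.length === 1).length,
    meanTargetLength: rows.length === 0 ? 0 : total / rows.length,
  };
}

// Made-up toy data, not real vocabulary IDs.
const toyTable: TranslationTable = { "1": [10], "2": [20, 21], "3": [30] };
const toyStats = summarizeTable(toyTable);
```

A mean target length above 1 suggests the target vocabulary needs more tokens for the same text; the one-to-one count is a rough proxy for vocab overlap.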
Programmatic API
import { convertHFTokenizer, fetchAndConvert, hashMap } from '@codecai/maps-cli/convert';
// From a parsed tokenizer.json object
const localMap = convertHFTokenizer(hfJson, { id: 'my-org/my-model' });
// Or fetch from HuggingFace directly
const map = await fetchAndConvert({
hfModel: 'Qwen/Qwen2.5-7B-Instruct',
id: 'qwen/qwen2',
});
// Compute the hash for pinning
const hash = await hashMap(map);
What gets generated
The output is a JSON file matching the TokenizerMap schema from @codecai/web (v2.1):
{
"id": "qwen/qwen2",
"version": "2",
"vocab_size": 151665,
"vocab": { "Hello": 9707, "Ġworld": 1879, "...": 0 },
"encoder": "byte_level",
"merges": ["Ġ Ġ", "ĠĠ ĠĠ", "i n", "..."],
"pre_tokenizer_pattern": "(?i:'s|'t|'re|...)| ?\\p{L}+|...",
"pre_tokenizer_program": {
"version": 1,
"ops": [
{ "op": "literals_ci", "patterns": ["'s","'t","'re","'ve","'m","'ll","'d"] },
{ "op": "letters", "lead_other": true },
{ "op": "numbers", "max_run": 1 },
{ "op": "punct_run", "lead_space": true, "trailing_newlines": true },
{ "op": "newline_block" },
{ "op": "trailing_ws" },
{ "op": "ws_run" }
]
},
"special_tokens": {
"<|endoftext|>": 151643,
"<|im_start|>": 151644
},
"published_at": "2026-05-06T12:00:00.000Z"
}
The schema covers three tokenizer families that span ~95% of open models:
- byte_level — GPT-2 byte→unicode BPE (Llama-3, Qwen, Phi-3, Mistral-Nemo, DeepSeek-V3, …).
- metaspace — ▁-prefix BPE with byte fallback (Llama-2, Mistral-v3, Mixtral, Gemma).
- identity — vocab-only tokenizers without merges (canonical-IR / closed vocabs).
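The byte_level family explains vocab keys such as "Ġworld" in the example map above. The sketch below reproduces the GPT-2 byte→unicode table as it is commonly implemented; `bytesToUnicode` and `toByteLevel` are illustrative names and are not part of this package's API.

```typescript
// GPT-2 style byte→unicode table: printable Latin-1 bytes map to themselves,
// every other byte is shifted to code points starting at U+0100.
function bytesToUnicode(): Map<number, string> {
  const keep = new Set<number>();
  for (let b = 0x21; b <= 0x7e; b++) keep.add(b); // '!' .. '~'
  for (let b = 0xa1; b <= 0xac; b++) keep.add(b); // '¡' .. '¬'
  for (let b = 0xae; b <= 0xff; b++) keep.add(b); // '®' .. 'ÿ'

  const table = new Map<number, string>();
  let next = 0;
  for (let b = 0; b < 256; b++) {
    if (keep.has(b)) {
      table.set(b, String.fromCodePoint(b));
    } else {
      table.set(b, String.fromCodePoint(256 + next));
      next++;
    }
  }
  return table;
}

// Render text the way byte-level vocab keys are written.
function toByteLevel(text: string): string {
  const table = bytesToUnicode();
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes, (b) => table.get(b)!).join("");
}
```

Under this table a space (byte 0x20) becomes "Ġ" (U+0120), so the text " world" is spelled "Ġworld" in the vocab.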
Pre-tokenizer program (v2.1, additive)
Both pre_tokenizer_pattern and pre_tokenizer_program describe the
same splitter. The program is the regex compiled into a named-op list;
runtimes prefer it when present so they can encode without a Unicode
regex engine. The CLI emits it automatically for any pre-tokenizer
regex it recognises (currently the GPT-2-family canonical form used by
Llama-3, Qwen, Phi-3, DeepSeek-V3, Mistral-Nemo, Falcon, SmolLM2,
Codestral byte_level). Maps with unrecognised regexes still build
normally — pre_tokenizer_program is just omitted, and runtimes fall
back to the regex string.
See spec/PRETOKENIZER_PROGRAM.md
for the full op set and equivalence rules.
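The preference order described above (program first, regex string as fallback) might look like the following on the runtime side. This is a sketch only: the type names, `chooseSplitter`, and the rule of falling back when an op name is unknown are assumptions, and the op list is copied from the example map rather than from the spec.

```typescript
interface PreTokenizerOp {
  op: string;
  [param: string]: unknown;
}

interface PreTokenizerProgram {
  version: number;
  ops: PreTokenizerOp[];
}

interface SplitterFields {
  pre_tokenizer_pattern?: string;
  pre_tokenizer_program?: PreTokenizerProgram;
}

type SplitterChoice =
  | { kind: "program"; program: PreTokenizerProgram }
  | { kind: "regex"; pattern: string }
  | { kind: "none" };

// Op names taken from the example map above; the full set lives in the spec.
const KNOWN_OPS = new Set([
  "literals_ci", "letters", "numbers", "punct_run",
  "newline_block", "trailing_ws", "ws_run",
]);

// ASSUMPTION: a runtime uses the program only if it understands the program
// version and every op; otherwise it falls back to the regex string.
function chooseSplitter(map: SplitterFields): SplitterChoice {
  const program = map.pre_tokenizer_program;
  if (program && program.version === 1 && program.ops.every((o) => KNOWN_OPS.has(o.op))) {
    return { kind: "program", program };
  }
  if (map.pre_tokenizer_pattern !== undefined) {
    return { kind: "regex", pattern: map.pre_tokenizer_pattern };
  }
  return { kind: "none" };
}
```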
Hosting your map
Once generated, host the JSON anywhere static:
- GitHub + jsDelivr (free CDN): commit to a public repo, then
https://cdn.jsdelivr.net/gh/<user>/<repo>/path/to/map.json
- Hugging Face: push to a Space or alongside your model weights.
- S3 / Cloudflare R2: standard static hosting.
- Codec community registry: contribute via PR to
codec-maps.
Then any client can pin against your hash:
import { loadMap } from '@codecai/web';
const map = await loadMap({
url: 'https://your-host/your-model.json',
hash: 'sha256:abcd1234…',
});
well-known — publish for .well-known/codec/ discovery
Generate the static directory tree clients need to find your map by (origin, id) alone, so consumers don't have to hard-code your CDN URL:
codecai-maps well-known --map=./qwen_qwen2.json \
--url=https://cdn.example/qwen2.json \
--out-dir=./public
This writes:
public/.well-known/codec/maps/qwen/qwen2.json ← pointer { id, url, hash }
public/.well-known/codec/index.json ← directory of all your maps
Drop ./public onto any static host (GitHub Pages, S3, Vercel) under the origin you control, and any client can do:
import { discoverMap } from '@codecai/web';
const map = await discoverMap({ origin: 'https://qwen.io', id: 'qwen/qwen2' });
Pass --inline instead of --url to embed the full map at the well-known location (this skips the CDN indirection and is recommended only for small maps). Re-running with the same id replaces the existing index entry. See spec/WELL_KNOWN_DISCOVERY.md for the publishing contract.
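For orientation, the sketch below shows how a client could derive the pointer URL from (origin, id), following the directory layout listed above. `wellKnownMapUrl` and `parsePointer` are hypothetical helpers, not exports of @codecai/web; the pointer fields { id, url, hash } come from the listing above, and the --inline case (a full map at that location) is not handled.

```typescript
interface MapPointer {
  id: string;
  url: string;
  hash: string;
}

// Build the well-known pointer URL for a map ID under a given origin.
function wellKnownMapUrl(origin: string, id: string): string {
  return new URL(`/.well-known/codec/maps/${id}.json`, origin).toString();
}

// Validate an already-parsed pointer document before trusting it.
function parsePointer(raw: unknown, expectedId: string): MapPointer {
  if (typeof raw !== "object" || raw === null) {
    throw new Error("pointer must be a JSON object");
  }
  const { id, url, hash } = raw as Record<string, unknown>;
  if (typeof id !== "string" || typeof url !== "string" || typeof hash !== "string") {
    throw new Error("pointer must have string fields id, url, hash");
  }
  if (id !== expectedId) {
    throw new Error(`pointer id ${id} does not match requested id ${expectedId}`);
  }
  return { id, url, hash };
}
```

A client would then fetch pointer.url and verify the content against pointer.hash, which is the same check loadMap({ url, hash }) performs.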
policies-* — safety-policy descriptor lifecycle (v0.4)
The v0.4 safety-policy negotiation spec ships four CLI subcommands that mirror the tokenizer-map commands; v0.5 adds a fifth, policies-enumerate:
# Validate that an operator-internal policy is well-formed.
codecai-maps policies-validate ./internal-config.json
# Strip internal-only fields (banned_token_ids, regex_patterns,
# grammar_constraints, multi_token_patterns, classifier thresholds /
# weights) and emit the publishable descriptor — what the world sees
# at .well-known/codec/policies/<id>.json. Internal-field counts
# survive as rules_summary.* for auditors.
codecai-maps policies-sanitize --internal=./internal-config.json \
--out=./acme-strict-v3.policy.json
# Canonical sha256 over the sanitized descriptor — bit-identical
# across @codecai/web, codecai (Python), codec-rs, Codec.Net,
# codec (Java), libcodec, and codec-supervisor.
codecai-maps policies-hash ./acme-strict-v3.policy.json
# Emit both .well-known/codec/policies/<id>.json (mutable pointer or
# inline) AND .well-known/codec/policies/sha256/<hex>.json (immutable
# content-addressed sibling) so clients that received a hash in READY
# can fetch + verify without a redirect hop.
codecai-maps policies-well-known --descriptor=./acme-strict-v3.policy.json \
--inline --out-dir=./public
# v0.5 (resolves v0.4-OQ4): productize the offline enumerator scripts.
# Reads a JSON array of literal strings, generates surface variants
# (verbatim / leading-space / leading-newline / lowercase / titlecase /
# uppercase / trimmed), tokenizes each variant through the supplied
# tokenizer map, deduplicates by token sequence, and writes a JSON file
# ready to paste into your internal policy's 'multi_token_patterns'
# field. The output pins the tokenizer-map sha256 so the enumeration is
# verifiably tied to the exact map bytes the runtime will tokenize
# against.
codecai-maps policies-enumerate --map=./qwen_qwen2.json \
--literals=./adversarial-strings.json \
--out=./enumerated-patterns.json
The descriptor never contains operator-internal contents. That is the
disclosure-boundary contract: an attacker who can fetch the
.well-known page learns the shape of enforcement, not the contents
of banned-token lists or classifier thresholds. The internal-config
side lives in codec-supervisor
under policies_dir/ and is never published.
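The disclosure boundary can be pictured as a field filter that keeps counts but drops contents. The sketch below is an assumption-heavy illustration: the internal field names come from the policies-sanitize comment above, but the `<field>_count` naming inside rules_summary and the `sanitizeSketch` function are hypothetical, and classifier thresholds and weights are not handled.

```typescript
type PolicyDoc = Record<string, unknown>;

// Internal-only list fields named in the policies-sanitize description.
const INTERNAL_FIELDS = new Set([
  "banned_token_ids",
  "regex_patterns",
  "grammar_constraints",
  "multi_token_patterns",
]);

// ASSUMPTION: counts are published as rules_summary.<field>_count.
// The real descriptor schema may name these differently.
function sanitizeSketch(internal: PolicyDoc): PolicyDoc {
  const descriptor: PolicyDoc = {};
  const summary: Record<string, number> = {};
  for (const [key, value] of Object.entries(internal)) {
    if (INTERNAL_FIELDS.has(key)) {
      // Keep only how many entries there were, never the entries themselves.
      summary[`${key}_count`] = Array.isArray(value) ? value.length : 1;
    } else {
      descriptor[key] = value;
    }
  }
  descriptor.rules_summary = summary;
  return descriptor;
}
```

An auditor reading the published descriptor can see that, say, three token IDs are banned without learning which ones.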
License
MIT.
