@huggingface/tokenizers
v0.1.3
🤗 Tokenizers.js: A pure JS/TS implementation of today's most used tokenizers
Run today's most used tokenizers directly in your browser or Node.js application. No heavy dependencies, no server required. Just fast, client-side tokenization compatible with thousands of models on the Hugging Face Hub. These tokenizers are also used in 🤗 Transformers.js.
Features
- Lightweight (~8.3 kB gzipped)
- Zero dependencies
- Works in browsers and Node.js
Installation
npm install @huggingface/tokenizers
Alternatively, you can use it via a CDN as follows:
<script type="module">
import { Tokenizer } from "https://cdn.jsdelivr.net/npm/@huggingface/tokenizers";
</script>
Usage
import { Tokenizer } from "@huggingface/tokenizers";
// Load files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer.json`).then((res) => res.json());
const tokenizerConfig = await fetch(`https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`).then((res) => res.json());
// Create tokenizer
const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);
// Tokenize text
const tokens = tokenizer.tokenize("Hello World"); // ['Hello', 'ĠWorld']
const encoded = tokenizer.encode("Hello World"); // { ids: [9906, 4435], tokens: ['Hello', 'ĠWorld'], attention_mask: [1, 1] }
const decoded = tokenizer.decode(encoded.ids); // 'Hello World'
Requirements
This library expects two files from Hugging Face models:
- tokenizer.json: Contains the tokenizer configuration
- tokenizer_config.json: Contains additional metadata
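Since both files are required, it can help to fetch them together and fail early if either one is missing. The following is a minimal sketch, not part of the library: the helper names `tokenizerFileUrl`, `fetchJson`, and `loadTokenizerFiles`, the `revision` parameter, and the injectable `fetchImpl` argument are all assumptions made for illustration. It relies only on the standard `fetch` API.

```javascript
// Hypothetical helper: builds the Hub URL for one tokenizer file.
// The `revision` parameter is an assumption; the usage example only uses "main".
function tokenizerFileUrl(modelId, fileName, revision = "main") {
  return `https://huggingface.co/${modelId}/resolve/${revision}/${fileName}`;
}

// Hypothetical helper: fetches and parses one JSON file,
// throwing on HTTP errors instead of trying to parse an error page.
async function fetchJson(url, fetchImpl = fetch) {
  const res = await fetchImpl(url);
  if (!res.ok) {
    throw new Error(`Failed to fetch ${url}: HTTP ${res.status}`);
  }
  return res.json();
}

// Hypothetical helper: loads both required files in parallel.
// `fetchImpl` can be swapped out (for example in tests or behind a cache).
async function loadTokenizerFiles(modelId, fetchImpl = fetch) {
  const [tokenizerJson, tokenizerConfig] = await Promise.all([
    fetchJson(tokenizerFileUrl(modelId, "tokenizer.json"), fetchImpl),
    fetchJson(tokenizerFileUrl(modelId, "tokenizer_config.json"), fetchImpl),
  ]);
  return { tokenizerJson, tokenizerConfig };
}
```

The two returned objects can then be passed to `new Tokenizer(tokenizerJson, tokenizerConfig)` as in the usage example.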
Components
Tokenizers.js supports the following Hugging Face tokenizer components:
Normalizers
- NFD
- NFKC
- NFC
- NFKD
- Lowercase
- Strip
- StripAccents
- Replace
- BERT Normalizer
- Precompiled
- Sequence
Pre-tokenizers
- BERT
- ByteLevel
- Whitespace
- WhitespaceSplit
- Metaspace
- CharDelimiterSplit
- Split
- Punctuation
- Digits
Models
- BPE (Byte-Pair Encoding)
- WordPiece
- Unigram
- Legacy
Post-processors
- ByteLevel
- TemplateProcessing
- RobertaProcessing
- BertProcessing
- Sequence
Decoders
- ByteLevel
- WordPiece
- Metaspace
- BPE
- CTC
- Replace
- Fuse
- Strip
- ByteFallback
- Sequence
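To check which of the components above a given model relies on, you can inspect its parsed tokenizer.json. The sketch below is illustrative only and not part of the library: `componentTypes` and `summarizeComponents` are hypothetical helpers, and the key names (`normalizer`, `pre_tokenizer`, `model`, `post_processor`, `decoder`, plus the child arrays of `Sequence` wrappers) are assumed from the way tokenizer.json files are commonly serialized.

```javascript
// Assumed names of the child arrays inside "Sequence" components.
const SEQUENCE_CHILD_KEYS = ["normalizers", "pretokenizers", "processors", "decoders"];

// Hypothetical helper: returns the type names of a component,
// expanding "Sequence" wrappers into the types of their children.
function componentTypes(component) {
  if (component === null || component === undefined) return [];
  if (component.type === "Sequence") {
    const key = SEQUENCE_CHILD_KEYS.find((k) => Array.isArray(component[k]));
    const children = key ? component[key] : [];
    return children.flatMap((child) => componentTypes(child));
  }
  // Some files may omit `type`; report that explicitly rather than guessing.
  return [component.type ?? "(unspecified)"];
}

// Hypothetical helper: summarizes the pipeline stages of a parsed tokenizer.json.
function summarizeComponents(tokenizerJson) {
  return {
    normalizer: componentTypes(tokenizerJson.normalizer),
    preTokenizer: componentTypes(tokenizerJson.pre_tokenizer),
    model: componentTypes(tokenizerJson.model),
    postProcessor: componentTypes(tokenizerJson.post_processor),
    decoder: componentTypes(tokenizerJson.decoder),
  };
}

// Made-up input for illustration, not taken from a real model.
const exampleTokenizerJson = {
  normalizer: null,
  pre_tokenizer: {
    type: "Sequence",
    pretokenizers: [{ type: "Split" }, { type: "ByteLevel" }],
  },
  model: { type: "BPE" },
  post_processor: { type: "ByteLevel" },
  decoder: { type: "ByteLevel" },
};

console.log(summarizeComponents(exampleTokenizerJson));
```

Comparing the resulting type names against the lists above gives a quick indication of whether a tokenizer is likely to be supported.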
