@chonkiejs/token
v0.0.3

HuggingFace tokenizer support for Chonkie - extends @chonkiejs/core with real tokenization.
Features
- 🤗 HuggingFace Integration - Use any HuggingFace tokenizer model
- 🔌 Optional Plugin - Install only when you need real tokenization
- 📦 Zero Config - Works automatically with @chonkiejs/core
- ⚡ Progressive Enhancement - Core works without it, better with it
Installation
Install with npm:

```sh
npm i @chonkiejs/token @chonkiejs/core
```

Install with pnpm:

```sh
pnpm add @chonkiejs/token @chonkiejs/core
```

Install with yarn:

```sh
yarn add @chonkiejs/token @chonkiejs/core
```

Install with bun:

```sh
bun add @chonkiejs/token @chonkiejs/core
```

Quick Start
Simply install this package alongside @chonkiejs/core, then use tokenizer names:
```ts
import { RecursiveChunker } from '@chonkiejs/core';

// Use GPT-2 tokenization (automatically uses @chonkiejs/token)
const chunker = await RecursiveChunker.create({
  tokenizer: 'Xenova/gpt2',
  chunkSize: 512
});

const chunks = await chunker.chunk('Your text here...');
```

Supported Models
Any HuggingFace model from transformers.js:
- Xenova/gpt2
- Xenova/gpt-4
- bert-base-uncased
- google-bert/bert-base-multilingual-cased
- And many more!
See: https://huggingface.co/models?library=transformers.js
Usage Examples
With RecursiveChunker
```ts
import { RecursiveChunker } from '@chonkiejs/core';

const chunker = await RecursiveChunker.create({
  tokenizer: 'Xenova/gpt2',
  chunkSize: 512
});

const chunks = await chunker.chunk('Your document...');
```

With TokenChunker
```ts
import { TokenChunker } from '@chonkiejs/core';

const chunker = await TokenChunker.create({
  tokenizer: 'bert-base-uncased',
  chunkSize: 256,
  chunkOverlap: 50
});

const chunks = await chunker.chunk('Your text...');
```

Direct Tokenizer Usage
```ts
import { HuggingFaceTokenizer } from '@chonkiejs/token';

const tokenizer = await HuggingFaceTokenizer.create('Xenova/gpt2');

const count = tokenizer.countTokens('Hello world!');
const tokens = tokenizer.encode('Hello world!');
const text = tokenizer.decode(tokens);

console.log(`Token count: ${count}`);
```

How It Works
When you call Tokenizer.create('gpt2') in @chonkiejs/core:

- Core tries to dynamically import @chonkiejs/token
- If installed: uses HuggingFaceTokenizer
- If not installed: shows a helpful error message
This keeps core lightweight while allowing advanced tokenization when needed!
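The optional-dependency flow above can be sketched as a small standalone pattern. This is an illustrative sketch, not the actual @chonkiejs/core source, and the `loadOptional` helper name is hypothetical:

```ts
// Illustrative optional-plugin loader (loadOptional is a hypothetical name,
// not part of the @chonkiejs API).
async function loadOptional(moduleName: string): Promise<unknown | null> {
  try {
    // Dynamic import resolves only if the package is actually installed.
    return await import(moduleName);
  } catch {
    return null; // plugin missing - caller can fall back or report an error
  }
}

const plugin = await loadOptional('@chonkiejs/token');
console.log(
  plugin
    ? 'Using HuggingFaceTokenizer from @chonkiejs/token'
    : 'Install @chonkiejs/token for real tokenization'
);
```

Because the `import()` happens at runtime and failures are caught, the pattern adds no hard dependency: bundlers and Node simply skip the plugin when it isn't present.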
Contributing
Want to help grow Chonkie? Check out CONTRIBUTING.md to get started! Whether you're fixing bugs, adding features, improving docs, or simply leaving a ⭐️ on the repo, every contribution helps make Chonkie a better CHONK for everyone.
Remember: No contribution is too small for this tiny hippo!
