polytokenizer

v1.0.23

Published

9 days ago

A lightweight, multi-provider Node.js library for text tokenization, embedding, and context management

blevinstein

tokenizer embedding openai anthropic google gemini ai llm nlp

PolyTokenizer

A lightweight, multi-provider Node.js library for text tokenization, embedding, and context management across different AI service providers (OpenAI, Anthropic, Google Gemini, Vertex AI).

Features

Easily get embeddings for different models and providers.
Easily count tokens for different models and providers.
Simple token-aware methods for context management.

Installation

npm install polytokenizer

Quick Start

import { embedText, countTokens, splitTextMaxTokens, trimMessages } from 'polytokenizer';

// Configure API keys (see Configuration section)
process.env.OPENAI_API_KEY = 'your-openai-key';
process.env.ANTHROPIC_API_KEY = 'your-anthropic-key';
process.env.GEMINI_API_KEY = 'your-gemini-key';

// Generate embeddings
const embedding = await embedText('openai/text-embedding-3-small', 'Hello world');
console.log(embedding.vector); // [0.1, -0.2, 0.3, ...]

// Try different providers
const vertexEmbedding = await embedText('vertex/text-embedding-005', 'Hello world');
const googleEmbedding = await embedText('google/gemini-embedding-001', 'Hello world');

// Use configurable dimensions with gemini-embedding-001
const embedding768 = await embedText('google/gemini-embedding-001', 'Hello world', 768);
const embedding1536 = await embedText('google/gemini-embedding-001', 'Hello world', 1536);
const embedding3072 = await embedText('google/gemini-embedding-001', 'Hello world', 3072); // default

// Count tokens
const tokens = await countTokens('anthropic/claude-sonnet-4-5', 'This is a test message');
console.log(tokens); // 6

// Split text to fit model context (longText is any string you want to chunk)
const chunks = await splitTextMaxTokens(longText, 'openai/gpt-5', 1000);
console.log(chunks); // ['chunk1...', 'chunk2...']

Configuration

To use the library, you'll need to configure API keys for the providers you want to use.

Setting API Keys

The easiest way is to set environment variables:

# OpenAI
export OPENAI_API_KEY="sk-..."

# Anthropic  
export ANTHROPIC_API_KEY="sk-ant-..."

# Google Gemini
export GEMINI_API_KEY="AIza..."

# Vertex AI
export VERTEX_PROJECT_ID="your-gcp-project"
export VERTEX_LOCATION="us-central1"
export VERTEX_CREDENTIALS='{"type":"service_account","project_id":"your-project","private_key":"-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n","client_email":"vertex-ai-embeddings@your-project.iam.gserviceaccount.com",...}'

Programmatic Configuration

import { configure } from 'polytokenizer';

configure({
  openai: { 
    apiKey: 'sk-...',
    baseURL: 'https://api.openai.com/v1' // optional
  },
  anthropic: { 
    apiKey: 'sk-ant-...',
    baseURL: 'https://api.anthropic.com' // optional
  },
  google: { 
    apiKey: 'AIza...' 
  },
  vertex: {
    projectId: 'your-gcp-project',
    location: 'us-central1', // optional, defaults to us-central1
    credentials: {
      type: 'service_account',
      project_id: 'your-project',
      private_key: '-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n',
      client_email: 'vertex-ai-embeddings@your-project.iam.gserviceaccount.com',
      // ... other service account fields
    }
  }
});
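If you prefer to keep secrets in the environment but still call `configure` explicitly, one option is to build the configuration object from environment variables yourself. The sketch below assumes the object shape shown in the `configure` call above; the `configFromEnv` helper and its type names are hypothetical and not part of the library.

```typescript
// Hypothetical types mirroring the object passed to configure() above.
interface ProviderConfig {
  openai?: { apiKey: string };
  anthropic?: { apiKey: string };
  google?: { apiKey: string };
  vertex?: {
    projectId: string;
    location: string;
    credentials: Record<string, unknown>;
  };
}

type Env = Record<string, string | undefined>;

// Build a config object from environment-style key/value pairs.
// Providers whose variables are missing are left out.
function configFromEnv(env: Env): ProviderConfig {
  const config: ProviderConfig = {};

  const openaiKey = env['OPENAI_API_KEY'];
  if (openaiKey) config.openai = { apiKey: openaiKey };

  const anthropicKey = env['ANTHROPIC_API_KEY'];
  if (anthropicKey) config.anthropic = { apiKey: anthropicKey };

  const geminiKey = env['GEMINI_API_KEY'];
  if (geminiKey) config.google = { apiKey: geminiKey };

  const projectId = env['VERTEX_PROJECT_ID'];
  const rawCredentials = env['VERTEX_CREDENTIALS'];
  if (projectId && rawCredentials) {
    // VERTEX_CREDENTIALS holds the service account key as a JSON string.
    let credentials: Record<string, unknown>;
    try {
      credentials = JSON.parse(rawCredentials);
    } catch {
      throw new Error('VERTEX_CREDENTIALS is not valid JSON');
    }
    config.vertex = {
      projectId,
      // Same fallback region the configuration example documents.
      location: env['VERTEX_LOCATION'] ?? 'us-central1',
      credentials,
    };
  }

  return config;
}
```

Assuming that shape is accepted, the result could be passed along as `configure(configFromEnv(process.env))`.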

Vertex AI Setup

Automated Setup (Recommended)

The easiest way to set up Vertex AI is using our Terraform automation:

# Navigate to infrastructure directory
cd infrastructure/terraform

# One-command setup (replace with your project ID)
./setup.sh setup --project your-gcp-project-id

# For a different region
./setup.sh setup --project your-gcp-project --region europe-west1

# Get credentials for your application
export VERTEX_PROJECT_ID="$(terraform output -raw project_id)"
export VERTEX_LOCATION="$(terraform output -raw region)"
export VERTEX_CREDENTIALS="$(terraform output -raw service_account_key_json)"

This will automatically:

✅ Create a GCP service account with proper permissions
✅ Enable required APIs (Vertex AI, IAM, Compute)
✅ Generate and output JSON credentials
✅ Set up all infrastructure with security best practices

See infrastructure/terraform/README.md for detailed setup instructions.

Manual Vertex AI Setup

If you prefer manual setup:

Enable APIs in your GCP project:

gcloud services enable aiplatform.googleapis.com iam.googleapis.com

Create service account:

gcloud iam service-accounts create vertex-ai-embeddings \
  --display-name="Vertex AI Embeddings"

Grant permissions:

gcloud projects add-iam-policy-binding YOUR_PROJECT_ID \
  --member="serviceAccount:vertex-ai-embeddings@YOUR_PROJECT_ID.iam.gserviceaccount.com" \
  --role="roles/aiplatform.user"

Create and download key:

gcloud iam service-accounts keys create key.json \
  --iam-account=vertex-ai-embeddings@YOUR_PROJECT_ID.iam.gserviceaccount.com

Usage Examples

API Reference

Text Embedding

`embedText(model, text, dimensions?)`

Generate embeddings for text using the specified model.

// OpenAI embeddings
const openaiResult = await embedText('openai/text-embedding-3-small', 'Hello world');
// Returns: { vector: number[], model: string, usage: {...} }

// Vertex AI embeddings (very cost-effective)
const vertexResult = await embedText('vertex/text-embedding-005', 'Hello world');
// Returns: { vector: number[], model: string, usage: { tokens: 2, cost: 0.00004 } }

// Google Gemini embeddings with configurable dimensions
const geminiResult = await embedText('google/gemini-embedding-001', 'Hello world');
// Returns: { vector: number[] (3072 dimensions by default), model: string, usage: { tokens: -1 } }

const geminiResult768 = await embedText('google/gemini-embedding-001', 'Hello world', 768);
// Returns: { vector: number[] (768 dimensions), model: string, usage: { tokens: -1 } }

Parameters:

model (string): Model identifier in format provider/model (see Supported Models)
text (string): Text to embed (up to model's context limit)
dimensions (number, optional): Output dimensionality
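To make the `provider/model` identifier format concrete, here is a minimal sketch of how such a string could be split and validated. The `parseModelId` helper is hypothetical and is not the library's own parser.

```typescript
interface ModelRef {
  provider: string;
  model: string;
}

// Split "provider/model" at the first slash and reject malformed input.
function parseModelId(id: string): ModelRef {
  const slash = id.indexOf('/');
  if (slash <= 0 || slash === id.length - 1) {
    throw new Error(`Expected "provider/model", got "${id}"`);
  }
  return { provider: id.slice(0, slash), model: id.slice(slash + 1) };
}
```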

Returns: Promise<EmbeddingResult> with:

vector: Array of numbers representing the text embedding
model: The model used for embedding
usage: Token count and cost information
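A common next step with the returned `vector` is comparing two embeddings. The sketch below shows plain cosine similarity over two number arrays; `cosineSimilarity` is an illustrative helper, not a function exported by the library.

```typescript
// Cosine similarity: dot(a, b) / (|a| * |b|), in the range [-1, 1].
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) {
    throw new Error(`Dimension mismatch: ${a.length} vs ${b.length}`);
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  // A zero vector has no direction; treat similarity as 0.
  if (normA === 0 || normB === 0) return 0;
  return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}
```

Both vectors need the same dimensionality, so embeddings requested with different `dimensions` values cannot be compared directly.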

Token Counting

`countTokens(model, text)`

Count tokens in text for the specified model.

const openaiCount = await countTokens('openai/gpt-5', 'Hello world');
const anthropicCount = await countTokens('anthropic/claude-sonnet-4-5', 'Hello world');
const geminiCount = await countTokens('google/gemini-2.5-pro', 'Hello world');

Parameters:

model (string): Model identifier in format provider/model
text (string): Text to count tokens for

Returns: Promise<number> - Token count
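A token count is most useful when checked against a model's context window. The following sketch computes how much room is left after a prompt; the `CONTEXT_WINDOW` table and `remainingTokens` helper are hypothetical, and the two limits are copied from the Supported Models section, so they may be out of date.

```typescript
// Context window sizes (in tokens) taken from the Supported Models section.
const CONTEXT_WINDOW: Record<string, number> = {
  'openai/gpt-5': 400000,
  'anthropic/claude-sonnet-4-5': 200000,
};

// Tokens still available after the prompt and a reserved output budget.
// A negative result means the prompt does not fit.
function remainingTokens(model: string, promptTokens: number, reservedForOutput = 0): number {
  const limit = CONTEXT_WINDOW[model];
  if (limit === undefined) {
    throw new Error(`Unknown context window for ${model}`);
  }
  return limit - promptTokens - reservedForOutput;
}
```

In practice `promptTokens` would be the value resolved from `countTokens` for the same model.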

Text Splitting

`splitTextMaxTokens(text, model, maxTokens, options?)`

Split text into chunks that fit within the specified token limit.

const chunks = await splitTextMaxTokens(longText, 'openai/gpt-5', 1000, {
  preserveSentences: true,  // default: true
  preserveWords: true       // default: true
});
// Returns: string[] - Array of text chunks

Parameters:

text (string): Text to split
model (string): Model identifier (for accurate token counting)
maxTokens (number): Maximum tokens per chunk
options (object, optional): Splitting preferences

Features:

Preserves sentence boundaries when possible
Falls back to word boundaries if sentences are too long
Smart handling of paragraphs and line breaks
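The sentence-first, word-fallback behavior described above can be sketched as a greedy packer. This is not the library's implementation: `splitBySentences` is a hypothetical helper, the sentence detection is a naive regex, and `countWords` is a stand-in counter (one token per whitespace-separated word) used only to keep the example self-contained.

```typescript
type TokenCounter = (text: string) => number;

// Stand-in counter: one token per whitespace-separated word. Real tokenizers differ.
const countWords: TokenCounter = (text) => {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
};

function splitBySentences(
  text: string,
  maxTokens: number,
  count: TokenCounter = countWords,
): string[] {
  if (maxTokens < 1) throw new RangeError('maxTokens must be at least 1');

  // Naive sentence detection: text up to and including ., ! or ?
  const sentences = (text.match(/[^.!?]+[.!?]*/g) ?? [])
    .map((s) => s.trim())
    .filter((s) => s !== '');

  const chunks: string[] = [];
  let current = '';

  const flush = (): void => {
    if (current !== '') {
      chunks.push(current);
      current = '';
    }
  };

  // Append a piece to the current chunk, starting a new chunk when it would overflow.
  // A single piece that exceeds the limit on its own is emitted as its own chunk.
  const add = (piece: string): void => {
    const candidate = current === '' ? piece : `${current} ${piece}`;
    if (count(candidate) <= maxTokens) {
      current = candidate;
    } else {
      flush();
      current = piece;
    }
  };

  for (const sentence of sentences) {
    if (count(sentence) <= maxTokens) {
      add(sentence);
    } else {
      // The sentence alone is too long: fall back to word boundaries.
      for (const word of sentence.split(/\s+/)) add(word);
    }
  }
  flush();
  return chunks;
}
```

Passing the counter in as a parameter keeps the packing logic independent of any particular tokenizer.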

Context Management

`trimMessages(messages, model, maxTokens, options?)`

Intelligently trim conversation messages to fit within token limits.

const messages = [
  { role: 'system', content: 'You are a helpful assistant.' },
  { role: 'user', content: 'Hello!' },
  { role: 'assistant', content: 'Hi there!' },
  // ... more messages
];

const trimmed = await trimMessages(messages, 'openai/gpt-5', 4000, {
  strategy: 'early',           // 'early' (default) | 'late'
  preserveSystem: true,        // default: true
  extraTokensPerMessage: 4,    // optional: tokens added per message (default: 4 for OpenAI models, 0 for others)
  extraTokensTotal: 2          // optional: tokens added for conversation overhead (default: 2 for OpenAI models, 0 for others)
});

Parameters:

messages (array): Array of message objects with role and content
model (string): Model identifier for accurate token counting
maxTokens (number): Maximum total tokens allowed
options (object, optional): Trimming options

Options:

strategy: Trimming strategy
- 'early': Remove oldest non-system messages first (default)
- 'late': Remove newer messages first
preserveSystem: Keep system messages (default: true)
extraTokensPerMessage: Additional tokens per message for chat formatting overhead (default: 4 for OpenAI models, 0 for others)
extraTokensTotal: Additional tokens for entire conversation overhead (default: 2 for OpenAI models, 0 for others)

Chat Format Overhead: OpenAI models add extra tokens for chat formatting:

4 tokens per message for role boundaries and structure
2 tokens total for priming the assistant response
These defaults are automatically applied for OpenAI models but can be overridden

Returns: Promise<Message[]> - Trimmed messages that fit within token limit
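To illustrate how the 'early' strategy and the chat format overhead interact, here is a minimal sketch under stated assumptions. It is not the library's implementation: `trimEarly` is a hypothetical helper, it is synchronous, and `countWords` is a stand-in counter (one token per whitespace-separated word) rather than a real tokenizer.

```typescript
interface Message {
  role: string;
  content: string;
}

interface TrimOptions {
  preserveSystem?: boolean;
  extraTokensPerMessage?: number;
  extraTokensTotal?: number;
}

type TokenCounter = (text: string) => number;

// Stand-in counter: one token per whitespace-separated word. Real tokenizers differ.
const countWords: TokenCounter = (text) => {
  const trimmed = text.trim();
  return trimmed === '' ? 0 : trimmed.split(/\s+/).length;
};

function trimEarly(
  messages: Message[],
  maxTokens: number,
  count: TokenCounter = countWords,
  options: TrimOptions = {},
): Message[] {
  const {
    preserveSystem = true,
    extraTokensPerMessage = 4,
    extraTokensTotal = 2,
  } = options;

  // Total = per-conversation overhead + (content tokens + per-message overhead) for each message.
  const total = (list: Message[]): number =>
    list.reduce(
      (sum, m) => sum + count(m.content) + extraTokensPerMessage,
      extraTokensTotal,
    );

  const kept = [...messages];
  while (kept.length > 0 && total(kept) > maxTokens) {
    // Oldest message that is allowed to be removed.
    const index = kept.findIndex((m) => !(preserveSystem && m.role === 'system'));
    if (index === -1) break; // only protected system messages remain
    kept.splice(index, 1);
  }
  return kept;
}
```

If only protected system messages remain and they still exceed the limit, this sketch returns them as-is rather than dropping them.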

Supported Models

Note: Model availability and specifications change frequently. Refer to the official documentation links below for the most current information.

OpenAI Models

Official Documentation: OpenAI Models | Changelog

GPT-5 Series (Current - o200k_base tokenizer):

openai/gpt-5.2 - Latest flagship model (400K context) - $1.25/MTok input
openai/gpt-5.1 - Previous GPT-5 version (400K context)
openai/gpt-5 - Released August 2025 (400K context)
openai/gpt-5-mini - Faster, cost-efficient (400K context) - $0.25/MTok input
openai/gpt-5-nano - Most efficient variant (400K context) - $0.05/MTok input

O-Series Reasoning Models (o200k_base tokenizer):

openai/o3 - O3 reasoning model (200K context)
openai/o1 - O1 reasoning model (200K context)
openai/o1-mini - O1 mini (128K context)

Embedding Models:

openai/text-embedding-3-small - 1536 dimensions (8K context) - $0.02/MTok
openai/text-embedding-3-large - 3072 dimensions (8K context) - $0.13/MTok
openai/text-embedding-ada-002 - 1536 dimensions (8K context) - $0.10/MTok
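As a worked example of the per-MTok pricing, the sketch below converts a token count into an approximate cost. The `EMBEDDING_PRICE_PER_MTOK` table and `estimateEmbeddingCost` helper are hypothetical; the prices are copied from the list above and will drift over time.

```typescript
// USD per million tokens, copied from the embedding model list above.
const EMBEDDING_PRICE_PER_MTOK: Record<string, number> = {
  'openai/text-embedding-3-small': 0.02,
  'openai/text-embedding-3-large': 0.13,
  'openai/text-embedding-ada-002': 0.1,
};

// cost = (tokens / 1,000,000) * price per MTok
function estimateEmbeddingCost(model: string, tokens: number): number {
  const price = EMBEDDING_PRICE_PER_MTOK[model];
  if (price === undefined) {
    throw new Error(`No price listed for ${model}`);
  }
  return (tokens / 1e6) * price;
}
```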

Anthropic Models

Official Documentation: Anthropic Models Overview

Claude 4.5 Series (Current):

anthropic/claude-opus-4-5 - Claude 4.5 Opus (200K context)
anthropic/claude-sonnet-4-5 - Claude 4.5 Sonnet (200K context)
anthropic/claude-haiku-4-5 - Claude 4.5 Haiku (200K context)

Claude 4 Series (Legacy):

anthropic/claude-opus-4-1 - Claude 4.1 Opus (200K context)
anthropic/claude-opus-4-0 - Claude 4 Opus (200K context)
anthropic/claude-sonnet-4-0 - Claude 4 Sonnet (200K context)

Claude 3 Series (Legacy):

anthropic/claude-3-7-sonnet-latest - Claude 3.7 Sonnet (200K context)
anthropic/claude-3-5-haiku-latest - Claude 3.5 Haiku (200K context)

Note: Anthropic models support tokenization only (no embedding capabilities)

Google Models

Official Documentation: Gemini API Models | Changelog

Chat Models (Tokenization Support):

Gemini 2.5 Series (Current):

google/gemini-2.5-pro - Gemini 2.5 Pro (1M context)
google/gemini-2.5-flash - Gemini 2.5 Flash (1M context)
google/gemini-2.5-flash-lite - Gemini 2.5 Flash Lite (1M context) - cost efficient

Embedding Models:

google/gemini-embedding-001 - configurable dimensions (3072 default) (2K tokens context)

Vertex AI Models

Official Documentation: Vertex AI Embeddings | Text Embeddings API

Embedding Models:

vertex/gemini-embedding-001 - configurable dimensions (3072 default) (2K tokens context) - $0.15/1M tokens
vertex/text-embedding-005 - 768 dimensions (2K tokens context) - $0.025/1M chars - Latest specialized model, English/code optimized
vertex/text-multilingual-embedding-002 - 768 dimensions (2K tokens context) - $0.025/1M chars - Multilingual support (100+ languages)

Note: Vertex AI models support embeddings only (no tokenization capabilities). Vertex AI provides the most cost-effective embedding options.
