BPE Tokenizer

Build your own vocabulary from application-specific corpus using Byte pair encoding (BPE) algorithm.

Background

This algorithm was first described in 1994 by Philip Gage as a data compression technique that repeatedly replaces the most frequent pair of bytes with a new, unused byte. It was later adopted by OpenAI as the tokenization scheme for the GPT models.

Motivation

Instead of using an oversized vocabulary built from a generic dataset, you can build a smaller vocabulary tailored to your application.

Package Overview

This package provides two BPE tokenizers: the in-memory BPETokenizer and the sqlite-backed BPETokenizerDB.

The in-memory implementation merges tokens faster. Its incremental merges can be persisted and re-applied to resume the merging progress after a restart.

After the merge iterations are finished, the complete token table in the tokenizer can be persisted and restored after a restart for encoding and decoding.

The sqlite-backed implementation stores the characters and occurrences of each token in the database. The required tables are created automatically if they do not exist.

To facilitate applications built on top of this package, the token table in BPETokenizer is a public property, and the tokens in BPETokenizerDB are accessible via a typed proxy array.

The entity-relationship diagram (ERD) of BPETokenizerDB is documented in erd.txt.
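
For example, a quick look at the learned vocabulary through the public token table might look like this (a sketch, assuming a tokenizer that has already been trained on a corpus):

// inspect the learned vocabulary via the public token table
for (let token of tokenizer.token_table) {
  console.log(token.index, JSON.stringify(token.chars), token.weight)
}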

Feature Highlights

  • BPETokenizer
    • in-memory
    • zero-dependencies
    • cross-platform
      • works in both nodejs and browser
    • encode/decode to token
      • as token object
      • as vector (integer index of token)
      • as binary string (compact format)
    • faster at merging tokens than BPETokenizerDB
    • supports continuous merging after restart
      • can export tokens and merges as json (to be persisted)
      • can resume from a full snapshot or incremental merges
      • needs to re-add content to corpus
  • BPETokenizerDB
    • uses sqlite as backend
    • works in nodejs (not in browser)
    • encode/decode to token
      • as token object (proxy to the row in database)
      • as vector (integer id of token)
      • as binary string (compact format)
    • creates tables automatically if they do not exist
    • supports a larger corpus than BPETokenizer
      • can handle a corpus larger than the RAM can hold
    • easy to obtain statistics on tokens
    • supports importing a snapshot exported from BPETokenizer
    • supports continuous merging after restart
      • can resume merging without extra steps
      • no need to re-add content to corpus
    • hot data are cached in memory
      • fast encoding / decoding

Installation

npm install bpe-tokenizer

This package has two optional dependencies, which are only needed by BPETokenizerDB. They can be omitted when you only use the in-memory BPETokenizer; both npm and pnpm provide installation options to skip optional dependencies.
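
For example, if you only need the in-memory BPETokenizer, a recent npm can skip the optional dependencies with something like:

npm install bpe-tokenizer --omit=optional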

Exports Overview

require('bpe-tokenizer')
{
  /* main class */
  BPETokenizer,

  /* constants */
  FS: '\x1C',
  EOF: '\x04',
  LF: '\n',
  CR: '\r',

  /* helper functions */
  fileContentToCorpus,
  linesToCorpus,
  linesTrimmedToCorpus,
  compactMerge,
}

require('bpe-tokenizer/db')
{
  /* main class */
  BPETokenizerDB,

  /* helper functions */
  resetBPETokenizerDB,
  connectDB,
}

Usage Example

import { BPETokenizer } from 'bpe-tokenizer'
import fs from 'fs'

let tokenizer = new BPETokenizer()

let content = 'aaabdaaabac'

// you can call this method multiple times to add more samples from your application-specific corpus
tokenizer.addToCorpus(content)

// you can set a higher threshold for the minimum number of occurrences
tokenizer.mergeUntil({ min_weight: 2 })

// persist the tokenizer; you can restore it later with tokenizer.fromJSON()
fs.writeFileSync('tokenizer.json', JSON.stringify(tokenizer))

// encode into object array for extended usage
let tokens = tokenizer.encodeToTokens(content)

// encode into integer array for numeric operations
let vector = tokenizer.encodeToVector(content)

// you can decode from token or vector
let decoded_content = tokenizer.decodeTokens(tokens)
decoded_content = tokenizer.decodeVector(vector)

console.log({
  content: content.length,
  vector: vector.length,
  ratio: vector.length / content.length,
  segments: tokens.map(token => token.chars),
  match: content == decoded_content,
})
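
To restore the persisted tokenizer after a restart, a minimal sketch (assuming the tokenizer.json file written above) could look like:

// after restart, restore the persisted tokenizer for encoding/decoding
let restored = new BPETokenizer()
restored.fromJSON(JSON.parse(fs.readFileSync('tokenizer.json', 'utf8')))

let restored_vector = restored.encodeToVector('aaabdaaabac')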

For more usage examples, see core-test.ts and the example folder.

TypeScript Signatures

Type signatures of BPETokenizer, helper functions, types and constants in bpe-tokenizer:

The main class BPETokenizer:

export class BPETokenizer {
  /** @description token.index -> Token */
  token_table: Token[]

  /**
   * @description export token tables and merge list.
   * The json can be used to restore after restart, or to populate database with BPETokenizerDB.
   */
  toJSON(): BPETokenizerJSON

  /** @description restore from json (after restart) */
  fromJSON(json: BPETokenizerJSON): void

  /**
   * @description add new content to corpus.
   * Token weights are updated when adding content.
   */
  addToCorpus(content: string): void

  /**
   * @description called by `mergeUntil()`.
   * Can be used to implement custom iteration conditions.
   */
  findNextMerge(options?: {
    /** @default 2 */
    min_weight?: number
    /** @default unlimited */
    max_length?: number
  }): MergeToken | null
  /**
   * @description called by `mergeUntil()`.
   * Can be used to implement custom iteration conditions.
   */
  applyMerge(merge: MergeToken): void

  /**
   * @description call `findNextMerge()` and `applyMerge()` in loop
   */
  mergeUntil(options?: {
    /** @default 2 */
    min_weight?: number
    /** @default unlimited */
    max_length?: number
    /** @default unlimited */
    max_iterations?: number
  }): void

  encodeToTokens(content: string): Token[]
  encodeToVector(content: string): number[]

  decodeTokens(tokens: Token[]): string
  decodeVector(vector: number[]): string

  /* for restore */

  /**
   * @description restore content to corpus (after restart) for continuous merging.
   * Token weights are not updated when restoring content.
   */
  restoreToCorpus(content: string): void

  /**
   * @description restore merge produced from `compactMerge(this.findNextMerge())`.
   * To be used after restart for continuous merging.
   */
  restoreMerge(compactMerge: CompactMerge): void

  /* internal methods */

  /**
   * @description encode to binary string.
   * Used by:
   *   - `restoreToCorpus()`
   *   - `encodeToTokens()`
   *   - `encodeToVector()`
   */
  encodeToCode(content: string): string
}
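
findNextMerge() and applyMerge() can be used to implement custom iteration conditions. For example, a merge loop that stops once the vocabulary reaches a target size could be sketched like this (target_vocab_size is a made-up limit for illustration, not part of the package API):

import { BPETokenizer } from 'bpe-tokenizer'

let tokenizer = new BPETokenizer()
tokenizer.addToCorpus('aaabdaaabac')

const target_vocab_size = 300 // hypothetical stopping condition

let merge = tokenizer.findNextMerge({ min_weight: 2 })
while (merge && tokenizer.token_table.length < target_vocab_size) {
  tokenizer.applyMerge(merge)
  merge = tokenizer.findNextMerge({ min_weight: 2 })
}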

Object Types:

/**
 * @description a + b -> c, e.g. "app" + "le" -> "apple"
 */
export type MergeToken = [a: Token, b: Token, c: Token]

/**
 * @description to be stored to file for restoring
 */
export type CompactMerge = [a_code: string, b_code: string, c_weight: number]

/** @description for BPETokenizer.fromJSON() */
export type BPETokenizerJSON = {
  version: 2
  char_count: number
  token_table: [chars: string, weight: number, original_weight: number][]
  merge_codes: [a_code: string, b_code: string, c_code: string][]
}

Helper functions:

/** @description wrap with FS and EOF */
export function fileContentToCorpus(content: string | Buffer): string

/** @description split into lines, wrap with \r and \n */
export function linesToCorpus(text: string): string[]

/** @description split into lines, trim spaces, wrap with \r and \n */
export function linesTrimmedToCorpus(text: string): string[]

/**
 * @description to store MergeToken in compact format
 */
export function compactMerge(merge: MergeToken): CompactMerge

export type Token = {
  chars: string
  /** @description the weight after merge */
  weight: number
  /** @description the weight before merge */
  original_weight: number
  code: string
  /** @description including zero-weight tokens in token_table */
  index: number
}
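
For example, the corpus helpers can be used to feed whole files or individual lines into the tokenizer; a sketch (the file names are placeholders):

import fs from 'fs'
import { BPETokenizer, fileContentToCorpus, linesTrimmedToCorpus } from 'bpe-tokenizer'

let tokenizer = new BPETokenizer()

// add an entire file as one corpus sample (wrapped with FS and EOF by the helper)
tokenizer.addToCorpus(fileContentToCorpus(fs.readFileSync('notes.txt')))

// or add each trimmed line as its own sample
for (let line of linesTrimmedToCorpus(fs.readFileSync('chat-log.txt', 'utf8'))) {
  tokenizer.addToCorpus(line)
}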

Constants:

/** @description file separator */
export let FS: string

/** @description end of file */
export let EOF: string

/** @description "\n" line feed, new line */
export let LF: string

/** @description "\r" carriage return */
export let CR: string
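
To show how compactMerge, restoreToCorpus, and restoreMerge fit together for continuous merging, here is a rough sketch (the file names and the merge-log format are assumptions for illustration):

import fs from 'fs'
import { BPETokenizer, compactMerge, CompactMerge } from 'bpe-tokenizer'

let content = 'aaabdaaabac'

// first run: log each merge while building the vocabulary
let tokenizer = new BPETokenizer()
tokenizer.addToCorpus(content)

let merge_log: CompactMerge[] = []
let merge = tokenizer.findNextMerge({ min_weight: 2 })
while (merge) {
  tokenizer.applyMerge(merge)
  merge_log.push(compactMerge(merge))
  merge = tokenizer.findNextMerge({ min_weight: 2 })
}
fs.writeFileSync('merges.json', JSON.stringify(merge_log))

// after restart: re-add the same content and replay the logged merges
let restored = new BPETokenizer()
restored.restoreToCorpus(content)
for (let m of JSON.parse(fs.readFileSync('merges.json', 'utf8')) as CompactMerge[]) {
  restored.restoreMerge(m)
}

// merging can now continue from where it stopped
restored.mergeUntil({ min_weight: 2 })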

Type signatures of BPETokenizerDB, helper functions, and object types in bpe-tokenizer/db:

The main class BPETokenizerDB:

import { BetterSqlite3Helper } from '@beenotung/better-sqlite3-helper'
import { DBProxy, Token } from './proxy'
import { BPETokenizerJSON } from '../core'

export class BPETokenizerDB {
  db: BetterSqlite3Helper.DBInstance
  proxy: DBProxy

  constructor(options: { db: BetterSqlite3Helper.DBInstance })

  /** @description delete all tokens and corpus from database, called by fromJSON() */
  reset(): void

  /** @description for in-memory BPETokenizer */
  toJSON(): BPETokenizerJSON

  /** @description delete all existing tokens and corpus, then import tokens from the json */
  fromJSON(json: BPETokenizerJSON): void

  /** @description to enable adding more corpus without duplication */
  getLastCorpusId(): number | null

  hasCorpus(id: number): boolean

  /**
   * @description add new content to corpus.
   * Token weights are updated when adding content.
   */
  addToCorpus(id: number, content: string): void

  /**
   * @description restore content to corpus (after import tokens with fromJSON()) for continuous merging.
   * Token weights are not updated when restoring content.
   */
  restoreToCorpus(id: number, content: string): void

  /**
   * @description called by `mergeUntil()`.
   * Can be used to implement custom iteration conditions.
   */
  findNextMerge(options?: {
    /** @default 2 */
    min_weight?: number
    /** @default unlimited */
    max_length?: number
  }): MergeToken | null

  /**
   * @description called by `mergeUntil()`.
   * Can be used to implement custom iteration conditions.
   */
  applyMerge(merge: MergeToken): void

  /**
   * @description call `findNextMerge()` and `applyMerge()` in loop
   */
  mergeUntil(options?: {
    /** @default 2 */
    min_weight?: number
    /** @default unlimited */
    max_length?: number
    /** @default unlimited */
    max_iterations?: number
  }): void

  encodeToTokens(content: string): Token[]
  encodeToVector(content: string): number[]

  decodeTokens(tokens: Token[]): string
  decodeVector(vector: number[]): string

  /**
   * @description encode to binary string.
   * Used by:
   *   - `restoreToCorpus()`
   *   - `encodeToTokens()`
   *   - `encodeToVector()`
   */
  encodeToCode(content: string): string
}

Object Types:

/**
 * @description a + b -> c, e.g. "app" + "le" -> "apple"
 */
export type MergeToken = [a: Token, b: Token, c: Token]

Helper functions:

export function connectDB(path: string): BetterSqlite3Helper.DBInstance

export function resetBPETokenizerDB(db: BetterSqlite3Helper.DBInstance): void
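
A minimal sketch of using the sqlite-backed tokenizer (the database file name is a placeholder):

import { connectDB, BPETokenizerDB } from 'bpe-tokenizer/db'

let db = connectDB('tokenizer.sqlite3')
let tokenizer = new BPETokenizerDB({ db })

// continue numbering corpus entries from where the last run stopped
let corpus_id = (tokenizer.getLastCorpusId() ?? 0) + 1

tokenizer.addToCorpus(corpus_id, 'aaabdaaabac')
tokenizer.mergeUntil({ min_weight: 2 })

let vector = tokenizer.encodeToVector('aaabdaaabac')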

Type signatures of the tables in bpe-tokenizer/db/proxy:

import { BetterSqlite3Helper } from '@beenotung/better-sqlite3-helper'
import { ProxySchemaOptions } from 'better-sqlite3-proxy'

export type Corpus = {
  id?: null | number
  content_code: string
}
export type Token = {
  id?: null | number
  chars: string
  weight: number
  original_weight: number
  code: string
}
export type CharToken = {
  id?: null | number
  token?: Token
}
export type Merge = {
  id?: null | number
  a_id: number
  a?: Token
  b_id: number
  b?: Token
  c_id: number
  c?: Token
}

export type DBProxy = {
  corpus: Corpus[]
  token: Token[]
  char_token: CharToken[]
  merge: Merge[]
}

export let tableFields: ProxySchemaOptions<DBProxy>['tableFields']

export function createProxy(options: {
  db: BetterSqlite3Helper.DBInstance
  auto_update_timestamp?: boolean | undefined
}): DBProxy

License

This project is licensed under BSD-2-Clause.

This is free, libre, and open-source software. It comes down to four essential freedoms [ref]:

  • The freedom to run the program as you wish, for any purpose
  • The freedom to study how the program works, and change it so it does your computing as you wish
  • The freedom to redistribute copies so you can help others
  • The freedom to distribute copies of your modified versions to others