tokengeex

v0.6.2

Published

2 years ago

luojiesi

NLP tokenizers

TokenGeeX - Efficient Tokenizer for CodeGeeX

This repository holds the code for the TokenGeeX Rust crate and Python package. TokenGeeX is a tokenizer for CodeGeeX (https://github.com/THUDM/Codegeex2) designed for source code and Chinese text. It is based on UnigramLM (Taku Kudo 2018) and TokenMonster.

Python

You can install the PyPI TokenGeeX package through pip.

pip install tokengeex

Example usage:

import tokengeex

tokenizer = tokengeex.load("code-32k-strict.json")

# Vocab
print(tokenizer.vocab_size()) # 32768
print(tokenizer.token_to_id(b"token")) # 13513
print(tokenizer.id_to_token(13513)) # (b"token", -13.322)

# Encode
ids = tokenizer.encode("def main(): print(\"Hello world!\")")
print(ids) # [68, 437, 12747, 58, 14653, 2807, 1735, 10120]

# Decode
print(tokenizer.decode(ids, include_special_tokens=False)) # "def main(): print(\"Hello world!\")"

# Byte fallbacks
print([tokenizer.id_to_token(token_id) for token_id in tokenizer.encode("电脑")]) # ["电", "<0xe8>", "<0x84>", "<0x91>"]
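
The byte fallback shown in the last line can be illustrated without the library. The following is a minimal sketch, assuming a toy character-level vocabulary; the helper names and `toy_vocab` are hypothetical and are not part of the TokenGeeX API. Any character missing from the vocabulary is emitted as one `<0x..>` token per UTF-8 byte, so decoding stays lossless.

```python
def encode_with_byte_fallback(text, vocab):
    """Keep characters found in `vocab`; spell out all others as UTF-8 byte tokens."""
    tokens = []
    for ch in text:
        if ch in vocab:
            tokens.append(ch)
        else:
            # One "<0x..>" token per byte of the character's UTF-8 encoding.
            tokens.extend(f"<0x{b:02x}>" for b in ch.encode("utf-8"))
    return tokens


def decode_with_byte_fallback(tokens):
    """Reassemble the original string from regular tokens and byte tokens."""
    out = bytearray()
    for tok in tokens:
        if len(tok) == 6 and tok.startswith("<0x") and tok.endswith(">"):
            out.append(int(tok[3:5], 16))
        else:
            out.extend(tok.encode("utf-8"))
    return out.decode("utf-8")


toy_vocab = {"电"}  # hypothetical vocabulary that lacks "脑"
tokens = encode_with_byte_fallback("电脑", toy_vocab)
print(tokens)  # ['电', '<0xe8>', '<0x84>', '<0x91>']
print(decode_with_byte_fallback(tokens))  # 电脑
```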

Rust

You can add the Rust library crate to your project through cargo.

cargo add tokengeex

Example usage:

fn main() {
    let tokenizer = tokengeex::load("code-32k-strict.json").unwrap();

    // Vocab
    println!("{}", tokenizer.vocab_size());
    println!("{}", tokenizer.token_to_id("token").unwrap());
    println!("{:?}", tokenizer.id_to_token(13513).unwrap());

    // Encode
    let ids = tokenizer.encode("def main(): print(\"Hello world!\")");
    println!("{:?}", ids); // [68, 437, 12747, 58, 14653, 2807, 1735, 10120]

    // Decode
    println!("{:?}", tokenizer.decode(ids, false)); // "def main(): print(\"Hello world!\")"

    // Byte fallbacks (assumes `encode` returns a Vec of token IDs)
    println!("{:?}", tokenizer.encode("电脑").iter().map(|id| tokenizer.id_to_token(*id)).collect::<Vec<_>>()); // ["电", "<0xe8>", "<0x84>", "<0x91>"]
}

CLI

Train

You can install the Rust binary crate through cargo.

cargo install tokengeex --features cli

Here's the full command used to train base vocabularies.

RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex train \
    --model 'unigram' \
    --output 'base-131k.json' \
    --logfile 'base-131k.log' \
    --vocab-size 131072 \
    --processor 'nfc' \
    --processor 'crlf' \
    --initial-vocab-max-token-length 32 \
    --initial-vocab-size 10000000 \
    --initial-vocab-insert-probability 0.01 \
    --initial-vocab-allow "$(cat data/base.regex)" \
    --unigram-shrinking-factor 0.8 \
    --unigram-num-sub-iterations 2 \
    --unigram-sample-regularization 'log' \
    --added-tokens-file './hub/tokens/base/added.json' \
    --suggested-tokens-file './hub/tokens/base/suggested.json' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin --suggested-tokens-file ./hub/tokens/base/suggested-${lang}.json "; done)
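
To get a feel for how the sizes in the command above relate, here is a rough sketch of a pruning schedule. It assumes that `--unigram-shrinking-factor 0.8` means "keep 80% of the vocabulary per pruning round", as in the usual UnigramLM training loop. That reading is an assumption and has not been checked against the TokenGeeX source; the function name is hypothetical.

```python
def pruning_schedule(initial_size, target_size, keep_num=4, keep_den=5):
    """List the vocabulary size after each pruning round.

    Each round keeps keep_num/keep_den of the tokens (4/5 = 0.8 here),
    never going below target_size. Integer arithmetic avoids float rounding.
    """
    sizes = []
    size = initial_size
    while size > target_size:
        size = max(target_size, size * keep_num // keep_den)
        sizes.append(size)
    return sizes


# Values taken from the command: --initial-vocab-size and --vocab-size.
base = pruning_schedule(10_000_000, 131_072)
print(len(base), base[:3], base[-1])  # 20 [8000000, 6400000, 5120000] 131072
```

Under this assumption, shrinking from 10,000,000 candidate tokens to 131,072 takes 20 rounds. If `--unigram-num-sub-iterations 2` is the number of EM sub-iterations per round (also an assumption), each round would run two of them.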

Here's the full command used to train capcode vocabularies.

RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex train \
    --model 'unigram' \
    --output 'capcode-65k.json' \
    --logfile 'capcode-65k.log' \
    --vocab-size 65536 \
    --processor 'nfc' \
    --processor 'crlf' \
    --processor 'capcode' \
    --initial-vocab-max-token-length 32 \
    --initial-vocab-size 10000000 \
    --initial-vocab-insert-probability 0.01 \
    --initial-vocab-allow "$(cat data/capcode.regex)" \
    --unigram-shrinking-factor 0.8 \
    --unigram-num-sub-iterations 2 \
    --unigram-sample-regularization 'log' \
    --added-tokens-file './hub/tokens/capcode/added.json' \
    --suggested-tokens-file './hub/tokens/capcode/suggested.json' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin --suggested-tokens-file ./hub/tokens/capcode/suggested-${lang}.json "; done)
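
The three `--processor` flags above form a preprocessing pipeline. The sketch below shows what such a pipeline could look like, under several assumptions: processors run in the order given, `crlf` rewrites `\r\n` to `\n`, and `capcode` lowercases text while marking uppercase letters. The capcode here is a deliberately simplified stand-in with a hypothetical marker character; it is not the scheme TokenGeeX or TokenMonster actually uses.

```python
import unicodedata

MARKER = "\ue000"  # hypothetical marker, assumed absent from the input text


def nfc(text):
    """Unicode NFC normalization (composes base character + combining mark)."""
    return unicodedata.normalize("NFC", text)


def crlf(text):
    """Normalize Windows line endings to Unix line endings (assumed behaviour)."""
    return text.replace("\r\n", "\n")


def capcode_encode(text):
    """Simplified capcode: replace each uppercase letter with MARKER + lowercase."""
    out = []
    for ch in text:
        lower = ch.lower()
        # Only transform characters whose case mapping round-trips exactly.
        if ch != lower and len(lower) == 1 and lower.upper() == ch:
            out.append(MARKER + lower)
        else:
            out.append(ch)
    return "".join(out)


def capcode_decode(text):
    """Invert capcode_encode: uppercase the character following each MARKER."""
    out = []
    chars = iter(text)
    for ch in chars:
        out.append(next(chars, "").upper() if ch == MARKER else ch)
    return "".join(out)


def preprocess(text):
    """Apply the processors in the same order as the command-line flags."""
    for step in (nfc, crlf, capcode_encode):
        text = step(text)
    return text


sample = "Cafe\u0301 Main\r\n"  # decomposed accent, capitals, CRLF line ending
encoded = preprocess(sample)
print(repr(encoded))
print(repr(capcode_decode(encoded)))  # 'Café Main\n'
```

The point of a capcode-style step is that `Main` and `main` share the same lowercase token, with case carried by a separate marker, so the vocabulary does not need one entry per capitalization.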

Extend with BPE

RUST_LOG=debug RAYON_NUM_THREADS=120 tokengeex bpe \
    --output ./capcode-131k-extended.json \
    --vocab ./capcode-131k.json \
    --num-merges 1000 \
    --step 10 \
    --score-scale-factor 0.75 \
    --max-merge-length 12 \
    --ignore '^$' \
    $(for lang in infilling assembly cuda hcl kotlin php shell xml c-sharp dart html powershell sql yaml c diff java lua python swift zig chinese-markdown dockerfile javascript makefile r tex cmake elixir json markdown ruby toml cpp go jsx pascal rust typescript css haskell julia perl scala vue; do echo "--train ${lang}:./hub/data/train/${lang}.bin --test ${lang}:./hub/data/test/${lang}.bin "; done)
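
For readers unfamiliar with BPE, the following is a generic sketch of the merge loop that a command like the one above builds on. It is plain textbook BPE over already-tokenized sequences, not the `tokengeex bpe` implementation: `--step` and `--score-scale-factor` are not modelled, and measuring `max_merge_length` in characters is an assumption. All function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences, max_merge_length):
    """Return the most frequent adjacent token pair, or None if there is none."""
    counts = Counter()
    for seq in sequences:
        for left, right in zip(seq, seq[1:]):
            if len(left) + len(right) <= max_merge_length:
                counts[(left, right)] += 1
    return counts.most_common(1)[0][0] if counts else None


def apply_merge(seq, pair):
    """Replace every non-overlapping occurrence of `pair` with the merged token."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


def extend_vocab(sequences, num_merges, max_merge_length=12):
    """Greedily add up to `num_merges` merged tokens on top of an existing tokenization."""
    new_tokens = []
    for _ in range(num_merges):
        pair = most_frequent_pair(sequences, max_merge_length)
        if pair is None:
            break
        new_tokens.append(pair[0] + pair[1])
        sequences = [apply_merge(seq, pair) for seq in sequences]
    return new_tokens, sequences


# Toy corpus, already split into tokens by a base vocabulary.
corpus = [["self", ".", "x"], ["self", ".", "y"], ["self", ".", "x"]]
new_tokens, merged = extend_vocab(corpus, num_merges=2)
print(new_tokens)  # ['self.', 'self.x']
print(merged)      # [['self.x'], ['self.', 'y'], ['self.x']]
```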
