chinese-tokenizer-module

v1.0.1, published 3 years ago

Simple algorithm to tokenize Chinese texts into words using CC-CEDICT.

Vulnerabilities: 0 high, 0 medium, 0 low

Maintainer: kimeiga

Keywords: chinese, text, tokenizer, words, language

chinese-tokenizer-module

This is the chinese-tokenizer npm package (documented below), converted to an ES module with ES6 syntax, with the dictionary included in the project as a JS file.

The hope is that I can use it in my SvelteKit project without the server routes returning 500 because they cannot find the binary file I'm trying to include (the .u8 file).

So now the syntax for using this is just:

import tokenize from 'chinese-tokenizer-module'

console.log(JSON.stringify(tokenize('我是中国人。'), null, '  '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, '  '))

This means that the src/dict.js file occasionally needs to be updated with the latest cedict_ts.u8 contents.
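One way to regenerate the bundled dictionary is to wrap the raw CC-CEDICT text in a module that exports it as a string. This is only a sketch: the helper name `toDictModule` is hypothetical, and the assumption that src/dict.js default-exports the file contents as a single string is not confirmed by this README, so check the actual file before using it.

```javascript
// Hypothetical helper: wraps raw CC-CEDICT text in ES module source
// that default-exports the text as one string (assumed shape of src/dict.js).
function toDictModule(cedictText) {
  // JSON.stringify escapes quotes, backslashes, and newlines safely.
  return `export default ${JSON.stringify(cedictText)}\n`
}

// In a real update script the text would come from the downloaded file, e.g.:
//   const fs = require('fs')
//   const text = fs.readFileSync('cedict_ts.u8', 'utf8')
//   fs.writeFileSync('src/dict.js', toDictModule(text))
const sampleDict = '中國 中国 [Zhong1 guo2] /China/\n'
console.log(toDictModule(sampleDict))
```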

Thanks to Yichuan Shen for making this in the first place!

chinese-tokenizer

Simple algorithm to tokenize Chinese texts into words using CC-CEDICT. You can try it out at the demo page. The code for the demo page can be found in the gh-pages branch of this repository.

How this works

This tokenizer uses a simple greedy algorithm: starting at the current position in the input, it always looks for the longest word in the CC-CEDICT dictionary that matches, takes it as the next token, and repeats from the end of that word.
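The greedy longest-match idea can be sketched as follows. This is an illustration, not the package's actual source: the tiny `words` set stands in for CC-CEDICT, the function name `greedyTokenize` is made up for this example, and it returns plain strings rather than the token objects described under API.

```javascript
// Toy dictionary standing in for CC-CEDICT (hypothetical sample data).
const words = new Set(['我', '是', '中', '中国', '中国人', '人'])
const maxLength = Math.max(...[...words].map(w => [...w].length))

function greedyTokenize(text) {
  const chars = [...text] // split by code point, not by UTF-16 unit
  const tokens = []
  let i = 0

  while (i < chars.length) {
    let match = chars[i] // fall back to a single character
    // Try the longest candidate first and shrink until a dictionary word is found.
    for (let len = Math.min(maxLength, chars.length - i); len > 1; len--) {
      const candidate = chars.slice(i, i + len).join('')
      if (words.has(candidate)) {
        match = candidate
        break
      }
    }
    tokens.push(match)
    i += [...match].length
  }

  return tokens
}

console.log(greedyTokenize('我是中国人。')) // [ '我', '是', '中国人', '。' ]
```

Because the longest candidate always wins, 中国人 is kept as one token even though 中, 中国 and 人 are also in the toy dictionary.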

Installation

Use npm to install:

npm install chinese-tokenizer --save

Usage

Make sure to provide the CC-CEDICT data.

const tokenize = require('chinese-tokenizer').loadFile('./cedict_ts.u8')

console.log(JSON.stringify(tokenize('我是中国人。'), null, '  '))
console.log(JSON.stringify(tokenize('我是中國人。'), null, '  '))

Output (shown for the second call, with the traditional-character input):

[
  {
    "text": "我",
    "traditional": "我",
    "simplified": "我",
    "position": { "offset": 0, "line": 1, "column": 1 },
    "matches": [
      {
        "pinyin": "wo3",
        "pinyinPretty": "wǒ",
        "english": "I/me/my"
      }
    ]
  },
  {
    "text": "是",
    "traditional": "是",
    "simplified": "是",
    "position": { "offset": 1, "line": 1, "column": 2 },
    "matches": [
      {
        "pinyin": "shi4",
        "pinyinPretty": "shì",
        "english": "is/are/am/yes/to be"
      }
    ]
  },
  {
    "text": "中國人",
    "traditional": "中國人",
    "simplified": "中国人",
    "position": { "offset": 2, "line": 1, "column": 3 },
    "matches": [
      {
        "pinyin": "Zhong1 guo2 ren2",
        "pinyinPretty": "Zhōng guó rén",
        "english": "Chinese person"
      }
    ]
  },
  {
    "text": "。",
    "traditional": "。",
    "simplified": "。",
    "position": { "offset": 5, "line": 1, "column": 6 },
    "matches": []
  }
]

API

`chineseTokenizer.loadFile(path)`

Reads the CC-CEDICT file from the given path and returns a tokenize function based on the dictionary.

`chineseTokenizer.load(content)`

Parses the CC-CEDICT data passed as the string content and returns a tokenize function based on the dictionary.
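For orientation, each CC-CEDICT entry is one line of the form `Traditional Simplified [pin1 yin1] /gloss/gloss/`, and lines starting with `#` are comments. The sketch below shows how such content might be parsed into entries. It is not the library's parser: `parseCedict` and the `ENTRY` pattern are names invented for this example.

```javascript
// Matches "Traditional Simplified [pinyin] /gloss/gloss/".
const ENTRY = /^(\S+)\s+(\S+)\s+\[([^\]]*)\]\s+\/(.*)\/\s*$/

// Hypothetical parser: turns CC-CEDICT text into an array of entry objects.
function parseCedict(content) {
  const entries = []
  for (const line of content.split(/\r?\n/)) {
    if (line.trim() === '' || line.startsWith('#')) continue // skip blanks and comments
    const m = ENTRY.exec(line)
    if (m === null) continue // ignore malformed lines
    const [, traditional, simplified, pinyin, english] = m
    entries.push({ traditional, simplified, pinyin, english })
  }
  return entries
}

const sampleContent = [
  '# CC-CEDICT sample',
  '中國人 中国人 [Zhong1 guo2 ren2] /Chinese person/',
  '我 我 [wo3] /I/me/my/'
].join('\n')

console.log(parseCedict(sampleContent))
```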

`tokenize(text)`

Tokenizes the given text string and returns an array with tokens of the following form:

{
  "text": <string>,
  "traditional": <string>,
  "simplified": <string>,
  "position": { "offset": <number>, "line": <number>, "column": <number> },
  "matches": [
    {
      "pinyin": <string>,
      "pinyinPretty": <string>,
      "english": <string>
    },
    ...
  ]
}
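As a usage illustration, the token array can be turned into pinyin-annotated text. The sample tokens below are copied from the Output section above (trimmed to three entries); the `annotate` helper is hypothetical and simply picks the first dictionary match for each token.

```javascript
// Sample tokens in the documented shape, copied from the Output section.
const sampleTokens = [
  {
    text: '我', traditional: '我', simplified: '我',
    position: { offset: 0, line: 1, column: 1 },
    matches: [{ pinyin: 'wo3', pinyinPretty: 'wǒ', english: 'I/me/my' }]
  },
  {
    text: '中國人', traditional: '中國人', simplified: '中国人',
    position: { offset: 2, line: 1, column: 3 },
    matches: [{ pinyin: 'Zhong1 guo2 ren2', pinyinPretty: 'Zhōng guó rén', english: 'Chinese person' }]
  },
  {
    text: '。', traditional: '。', simplified: '。',
    position: { offset: 5, line: 1, column: 6 },
    matches: []
  }
]

// Hypothetical helper: appends the first match's pinyin to each token's text.
function annotate(tokens) {
  return tokens
    .map(token => {
      // Tokens without dictionary matches (e.g. punctuation) are passed through.
      if (token.matches.length === 0) return token.text
      return `${token.text}(${token.matches[0].pinyinPretty})`
    })
    .join(' ')
}

console.log(annotate(sampleTokens))
```

A token can carry several matches when the dictionary lists more than one reading, so real code may want to show all of them instead of only the first.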
