@ig3/node-jieba-js
A Chinese word segmentation tool in pure JavaScript, based on the Python Jieba package.
This package segments Chinese text using essentially the same algorithm as the Python Jieba cut function, without the Hidden Markov Model and without Paddle.
It is compatible with dictionary files used with Python Jieba.
Install
npm install @ig3/node-jieba-js
Usage
import jiebaFactory from '@ig3/node-jieba-js';

jiebaFactory({
  cacheFile: '/path/to/jieba-dictionary-cache.json',
})
.then(jiebaInstance => {
  const segments = jiebaInstance.cut("我爸新学会了一项解决日常烦闷的活动,就是把以前的照片抱回办公室扫描保存,弄成电子版的。更无法接受的是,还居然放到网上来,时不时给我两张。\n这些积尘的化石居然突然重现,简直是招架不住。这个怀旧的阀门一旦打开,那就直到意识模糊都没停下来。");
  console.log(segments);
});

Or, from a CJS script, use dynamic import:
import('@ig3/node-jieba-js')
.then(module => {
  module.jiebaFactory({
    cacheFile: '/path/to/jieba-dictionary-cache.json',
  })
  .then(jiebaInstance => {
    const segments = jiebaInstance.cut("我爸新学会了一项解决日常烦闷的活动,就是把以前的照片抱回办公室扫描保存,弄成电子版的。更无法接受的是,还居然放到网上来,时不时给我两张。\n这些积尘的化石居然突然重现,简直是招架不住。这个怀旧的阀门一旦打开,那就直到意识模糊都没停下来。");
    console.log(segments);
  });
});

Exports
The @ig3/node-jieba-js package is an ESM module providing the following exports:
- default: jiebaFactory
- jiebaFactory
- jiebaFactorySync
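For example, the default export and the named exports can be imported together from an ESM script:

import jiebaFactory, { jiebaFactorySync } from '@ig3/node-jieba-js';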
jiebaFactory([options])
Creates an instance of the jieba object.
- options <Object>
  - cacheFile <string> Path to the dictionary cache file.
  - dictionaryFiles <string[]> Paths to dictionary files.
  - dictionaryEntries <[<string>, <integer>, <string>][]> Additional dictionary entries.
- Returns: a Promise that resolves to an instance of the jieba object.
The dictionaryEntries option is for adding a small number of additional dictionary entries. For larger numbers of entries it is better to put them into a dictionary file and load the file.
Each dictionary entry must be provided in 'internal' format: an array of three elements: the word, the 'frequency' as an integer number of occurrences per 100 million words (n.b. a number, not text) and the part of speech as text. The value of the dictionaryEntries option must be an array of such arrays.
For example:
jiebaFactory({
  cacheFile: '/path/to/jieba-dictionary-cache.json',
  dictionaryEntries: [
    ['一', 217830, 'm'],
    ['一一二', 11, 'm']
  ]
})
.then(jiebaInstance => {
});

jiebaFactorySync([options])
Like jiebaFactory except that it returns the jieba instance object
directly, rather than returning a Promise.
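For example, a minimal sketch, assuming the synchronous factory performs initialization itself (the cache file path is a placeholder):

import { jiebaFactorySync } from '@ig3/node-jieba-js';

const jiebaInstance = jiebaFactorySync({
  cacheFile: '/path/to/jieba-dictionary-cache.json',
});
console.log(jiebaInstance.cut('这是一个测试'));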
jieba Instance Methods
cut(text[,options][,cb])
- text <string> The text to be segmented.
- options <Object>
  - HMM <Boolean> If true, use the Hidden Markov Model (not implemented)
  - cutAll <Boolean> If true, return all matching words except single characters that are part of another word
- cb <Function> An optional callback to be called with the resulting array of words.
- Returns: <string[]> The segmentation of the given text.
Segment the text into an array of words.
cutAll
The cutAll mode (sometimes referred to as 'Full Mode' in Python Jieba documentation) returns all multi-character words found and single characters that aren't part of any multi-character word. The returned words may be overlapping.
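For example, a minimal sketch; the exact words returned depend on the dictionaries loaded into jiebaInstance:

const words = jiebaInstance.cut('研究生命起源');
console.log(words);     // default mode: one non-overlapping segmentation

const allWords = jiebaInstance.cut('研究生命起源', { cutAll: true });
console.log(allWords);  // cutAll mode: all matching words, possibly overlapping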
getDict()
- Returns: <Array> The loaded dictionary data.
getDictFiles()
- Returns: <Array> The paths to dictionary files to be loaded.
init()
Returns a promise that resolves after initialization has completed.
initSync()
Returns after initialization has completed.
useDict(dict)
- dict <string>|<Array>|<Function>
- Returns: the jieba instance object
If dict is a string, it is appended to the list of dictionary files to be
loaded. This must be done before initialization.
If dict is an array, it is appended to the loaded dictionary data. Each
element must be an array with three elements: word, frequency (occurrences
per 100 million words) and part of speech: the same fields as in the
dictionary text files, except split into separate array elements and with
the frequency as a number rather than text.
If dict is a function it is called with this set to the jieba instance
object and two arguments: the loaded dictionary array and the jieba
instance object. The return value is passed to useDict, unless it is a
Promise, in which case the value it resolves to is passed to useDict.
Where dict is a string, an array of strings, or a function that returns a
string, an array of strings, or a Promise that resolves to one of these,
the strings must be paths to dictionary files. They are appended to the list of dictionary files to be loaded.
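For example, a minimal sketch based on the descriptions above. It assumes that jiebaFactorySync can be called without options and that the instance is not initialized until init() is called; the dictionary path is a placeholder:

import { jiebaFactorySync } from '@ig3/node-jieba-js';

const jiebaInstance = jiebaFactorySync();

jiebaInstance
  .useDict('/path/to/extra-dictionary.txt')  // a path: must be added before initialization
  .init()
  .then(() => {
    // an array of entries: appended to the loaded dictionary data
    jiebaInstance.useDict([['一一二', 11, 'm']]);
    console.log(jiebaInstance.cut('一一二'));
  });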
Notes
Generators vs Arrays from cut
In Python Jieba, the cut method is a 'generator' but in @ig3/node-jieba-js
it returns an array of substrings. Why not return a generator function in
@ig3/node-jieba-js?
A generator would be advantageous if it were possible to process the input sentence incrementally, but this is not possible. The algorithm for segmenting the sentence is to determine the best path through the entire sentence, which requires processing the entire sentence before the best segmentation of any part of it can be determined.
Trie vs prefix dictionary
Prior to commit 51df778 on 2014-10-19, Python Jieba generated a trie from the
dictionary, used the trie to produce the DAG, then used the DAG to produce
the possible and 'best' routes to segment the sentence. Commit 51df778
changed this. Rather than generating a trie, pfdict is a set (set()) and
FREQ is a dictionary ({}). pfdict has all prefixes of words in the
dictionary, including the full words. Both pfdict and FREQ are used to
generate the DAG: lookup in FREQ is used to identify words and failed
lookup in pfdict is used to terminate the search loop. Both FREQ and
pfdict are large indexes. The commit offers no explanation of why this change
was made.
Subsequently the prefix dictionary was merged into FREQ: FREQ contains all prefixes, with a 'frequency' of 0, and real words have a non-zero frequency. When generating the DAG, the 'word' must exist in FREQ but it is added to the DAG only if it has a non-zero frequency, so that prefixes are not added. This avoids having two large indexes, but FREQ becomes larger because it includes all the prefixes of words.
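The following is a minimal JavaScript sketch of the merged index and DAG generation described above (not the package's actual code; names and structure are illustrative):

// Build the merged index: every word maps to its count, and every
// prefix of a word that is not itself a word maps to 0.
function buildFreq(entries) {
  const freq = new Map();
  for (const [word, count] of entries) {
    freq.set(word, count);
    for (let i = 1; i < word.length; i++) {
      const prefix = word.slice(0, i);
      if (!freq.has(prefix)) freq.set(prefix, 0);
    }
  }
  return freq;
}

// Generate the DAG: from each start index, extend the fragment while it
// is still a known prefix, and record an end index only when the
// fragment is a real word (non-zero frequency).
function getDAG(sentence, freq) {
  const dag = {};
  for (let k = 0; k < sentence.length; k++) {
    const ends = [];
    let i = k;
    let frag = sentence[k];
    while (i < sentence.length && freq.has(frag)) {
      if (freq.get(frag) > 0) ends.push(i);
      i += 1;
      frag = sentence.slice(k, i + 1);
    }
    if (ends.length === 0) ends.push(k);
    dag[k] = ends;
  }
  return dag;
}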
What are the advantages of the large index versus the trie? More or less memory? More or less CPU? More or less time? There is nothing in the Python jieba commit logs to indicate why the implementation was changed.
But there is https://github.com/fxsjy/jieba/pull/187
Translation of the initial comment:
For the get_DAG() function, employing a Trie data structure, particularly within a Python environment, results in excessive memory consumption. Experiments indicate that constructing a prefix set resolves this issue.
This set stores words and their prefixes, e.g.
set(['数', '数据', '数据结', '数据结构']): the word 数据结构 ('data structure') and all of its prefixes. When searching for words in a sentence, a forward lookup is performed within the prefix set until the fragment is not found in the prefix set or the search exceeds the sentence's boundaries. This approach increases the entry count by approximately 40% compared to the original lexicon.
This version passed all tests, yielding identical segmentation results to the original version. Test: A 5.7MB novel, using default dictionary, 64-bit Ubuntu, Python 2.7.6.
Trie: Initial load 2.8 seconds, cached load 1.1 seconds; memory 277.4MB, average rate 724kB/s
Prefix dictionary: Initial load 2.1 seconds, cached load 0.4 seconds; memory 99.0MB, average rate 781kB/s
This approach resolves Trie's low space efficiency in pure Python implementations.
Simultaneously refined code details, adhered to PEP8 formatting, and optimised several logical checks.
Added main.py, enabling direct word segmentation via
python -m jieba.
At least in Python then, using the larger index reduced memory consumption and processing time. Might the same be true in JavaScript?
Dictionary files
The input dictionary format is:
- One word per line
- Each line has three fields, separated by single spaces:
  - The word
  - Frequency
  - Part of speech
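For example, the entries shown earlier for the dictionaryEntries option would appear in a dictionary file as:

一 217830 m
一一二 11 m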
In response to fxsjy/jieba issue #3, fxsjy describes the 'frequency' number in the dictionary:
That number indicates how many times the word appears in my corpus.
And in response to fxsjy/jieba issue #7, fxsjy describes the sources:
@feriely, the sources are primarily twofold: one is the segmented corpus from the 1998 People's Daily available for download online, along with the MSR segmented corpus. The other consists of some text novels I collected myself, which I segmented using ICTCLAS (though there might be some inaccuracies). Then I used a Python script to count word frequencies.
It isn't certain that this describes the sources for the dictionaries as the issue was regarding probabilities in 'finalseg/prob_*.py', but it seems likely that the same corpus would be used.
I have not found the corpus used to build the dictionaries. It seems it is not published, and certainly it is not part of the Python Jieba source. It is some collection of texts, and the numbers in the dictionaries (small, medium and big) are counts of occurrences in that unknown body of text. In particular, the total numbers of characters and words in the corpus are unknown.
Comparing dict.txt.big and dict.txt.small, the 'frequency' numbers for a selection of common words are the same in both dictionaries, with the exception of the frequency for 的: 318825 in dict.txt.big and 3188252 in dict.txt.small.
Checking dict.txt, it is 318825 there too. So perhaps 318825 is correct and 3188252 is an error.
Otherwise, the frequency was the same in dict.txt.big and dict.txt.small for every word checked.
I checked dict.txt.small in the Python Jieba source and it is 3188252 there, so it is not an error I introduced.
的 is one of the most common words in every corpus I find. Checking another source, the frequency is about 4,000,000 per 100 million words. In another source it is 236,106 per million words, i.e. 23,610,600 per 100 million. So the values are quite variable. Perhaps all that really matters is that it is one of the most frequent words.
Presumably the numbers would be the same for all words in both files, and the difference is that the dictionary files contain different subsets of the total set of words in the corpus. The dictionary dict.txt.big must be the most complete, but whether that includes all words in the corpus or only a subset is unknown. Assuming it is a large subset of the total words, then the sum of the 'frequency' numbers in dict.txt.big will be somewhat less than the total number of words in the corpus. Also, as some words contain other words (e.g. '的' is contained in '目的', '真的', '的话', etc.), it is not clear whether the count for '的' is a count of '的' alone (i.e. not part of any larger word) or the much larger count of how many times '的' appears in the corpus regardless of context (i.e. including occurrences in longer words).
According to Dictionary Formats, describing the dictionary formats for jieba-php, the parts of speech are:
- m = Numeral (数词)
- n = Noun (名词)
- v = Verb (动词)
- a = Adjective (形容词)
- d = Adverb (副词)
According to Custom Dictionary Format, describing custom dictionaries for jieba-php, the frequency is optional with a default of 1.
I see no provision for this default frequency in the code that reads the
main dictionary in the current Python version of jieba. It appears that if
the frequency is missing, a ValueError will result, reporting the 'invalid'
dictionary entry. However, in the calculation of the route it does use a
default of 1 when a word is not in the FREQ lookup index:
log(self.FREQ.get(sentence[idx:x + 1]) or 1).
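For reference, a minimal JavaScript sketch of the route calculation that the quoted expression belongs to (not the package's actual code; it assumes the DAG and freq index from the earlier sketch and a total word count):

// Work backwards through the sentence: at each index, choose the word
// end that maximises the log-frequency of the word (defaulting to 1 for
// unknown words, as in the quoted Python expression) plus the score of
// the best route from the following index.
function calcRoute(sentence, dag, freq, total) {
  const logTotal = Math.log(total);
  const route = { [sentence.length]: [0, 0] };
  for (let idx = sentence.length - 1; idx >= 0; idx--) {
    let best = [-Infinity, idx];
    for (const x of dag[idx]) {
      const word = sentence.slice(idx, x + 1);
      const score = Math.log(freq.get(word) || 1) - logTotal + route[x + 1][0];
      if (score > best[0]) best = [score, x];
    }
    route[idx] = best;
  }
  return route;
}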
In current Python jieba, when loading a user dictionary, the frequency and
part of speech may be omitted. The default frequency isn't necessarily 1:
there is a function that determines the default (suggest_freq).
In this implementation, the frequency in the dictionaries is scaled to occurrences per 100 million words. The scaling assumes that the sum of frequencies in dict.txt.big is a good approximation of the total words in the corpus.
This allows words from other sources to be added, as long as occurrences per 100 million words can be determined.
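A minimal sketch of that scaling (not the package's actual code):

// Convert raw corpus counts to occurrences per 100 million words,
// using the sum of all counts as an estimate of the corpus size.
function scaleFrequencies(entries) {
  const total = entries.reduce((sum, [, count]) => sum + count, 0);
  return entries.map(([word, count, pos]) =>
    [word, Math.round((count / total) * 1e8), pos]);
}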
Credits
This began as a fork of bluelovers/jieba-js as at 2025-09-20. It has been substantially rewritten.
Some details of the algorithms and dictionaries are derived from fxsjy/jieba.
