# jieba-node

Pure JavaScript implementation of jieba Chinese word segmentation.
## Features

✅ **Core Segmentation**
- Accurate mode (default)
- Full mode
- Search-engine mode
- HMM-based recognition of unknown words

✅ **POS Tagging**
- Part-of-speech tagging using an HMM
- Support for custom word tags

✅ **Keyword Extraction**
- TF-IDF algorithm
- TextRank algorithm

✅ **Dictionary Management**
- Custom dictionary support
- User dictionary loading
- Dynamic word addition/deletion
- Word frequency tuning
- Dictionary caching for faster startup

✅ **Advanced Features**
- Position tracking (`tokenize`)
- Parallel processing with Worker Threads
- CLI tool
- Logging control
- Pure JavaScript (no native dependencies)
## Installation

```bash
npm install jieba-node
```

## Quick Start
```javascript
import jieba from 'jieba-node';

// Basic segmentation
const words = jieba.lcut('我来到北京清华大学');
console.log(words);
// ['我', '来到', '北京', '清华大学']

// Search-engine mode
const searchWords = jieba.lcutForSearch('小明硕士毕业于中国科学院计算所');
console.log(searchWords);
// ['小明', '硕士', '毕业', '于', '中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']

// POS tagging
const pairs = jieba.posseg.lcut('我爱北京天安门');
console.log(pairs.map(p => p.toString()).join(' '));
// '我/r 爱/v 北京/ns 天安门/ns'

// Keyword extraction
const keywords = jieba.analyse.tfidf('文本内容...', 10);
console.log(keywords);
```

## API Reference
### Basic Segmentation

#### cut(sentence, cutAll = false, HMM = true)

Segments a sentence into words; returns a generator.

```javascript
for (const word of jieba.cut('我来到北京清华大学')) {
  console.log(word);
}
```

#### lcut(sentence, cutAll = false, HMM = true)

Segments a sentence and returns an array.

```javascript
const words = jieba.lcut('我来到北京清华大学');
```

#### cutForSearch(sentence, HMM = true)

Segments for search engines; returns a generator.

#### lcutForSearch(sentence, HMM = true)

Segments for search engines and returns an array.
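Search-engine mode builds on accurate-mode output. As a rough, self-contained sketch of the idea (not jieba-node's actual code; `dictionary` here is a toy stand-in for the real word dictionary), each long word is additionally scanned for 2- and 3-character substrings that are themselves dictionary words:

```javascript
// Toy dictionary standing in for the real one.
const dictionary = new Set(['中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']);

// Expand accurate-mode words the way search-engine mode does:
// emit known 2-char and 3-char sub-words before each long word.
function searchModeExpand(words) {
  const result = [];
  for (const word of words) {
    if (word.length > 2) {
      for (const n of [2, 3]) {
        if (word.length <= n) continue;
        for (let i = 0; i + n <= word.length; i++) {
          const gram = word.slice(i, i + n);
          if (dictionary.has(gram)) result.push(gram);
        }
      }
    }
    result.push(word); // the full word always comes last
  }
  return result;
}

console.log(searchModeExpand(['中国科学院', '计算所']));
// ['中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']
```

This is why `lcutForSearch` in the Quick Start returns overlapping sub-words such as '中国', '科学院', and '中国科学院' for a single compound.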
### Dictionary Management

#### addWord(word, freq = null)

Adds a word to the dictionary.

```javascript
jieba.addWord('石墨烯');
```

#### delWord(word)

Deletes a word from the dictionary.

```javascript
jieba.delWord('石墨烯');
```

#### loadUserDict(filePath)

Loads a user dictionary from a file. Each line has the format `word freq pos_tag`, where `freq` and `pos_tag` are optional.

```javascript
jieba.loadUserDict('./userdict.txt');
```

#### suggestFreq(segment, tune = false)

Suggests a word frequency; when `tune` is `true`, the suggested frequency is applied to the dictionary.

```javascript
jieba.suggestFreq('台中', true);
```

#### setDictionary(dictPath)

Sets a custom main dictionary path.

```javascript
jieba.setDictionary('./custom_dict.txt');
```

### Position Tracking
#### tokenize(sentence, mode = 'default', HMM = true)

Tokenizes the sentence and yields each word with its start and end positions; returns a generator.

```javascript
for (const [word, start, end] of jieba.tokenize('永和服装饰品有限公司')) {
  console.log(`${word}: ${start}-${end}`);
}
// 永和: 0-2
// 服装: 2-4
// 饰品: 4-6
// 有限公司: 6-10
```

### POS Tagging
#### posseg.cut(sentence, HMM = true)

Segments with POS tags; returns a generator.

```javascript
for (const pair of jieba.posseg.cut('我爱北京天安门')) {
  console.log(pair.word, pair.flag);
}
```

#### posseg.lcut(sentence, HMM = true)

Segments with POS tags and returns an array.

```javascript
const pairs = jieba.posseg.lcut('我爱北京天安门');
for (const pair of pairs) {
  const [word, flag] = pair; // each Pair is iterable
  console.log(`${word}/${flag}`);
}
```

### Keyword Extraction
#### analyse.tfidf(text, topK = 20, withWeight = false)

Extracts keywords using TF-IDF.

```javascript
const keywords = jieba.analyse.tfidf('文本内容...', 10);
// ['关键词1', '关键词2', ...]

const keywordsWithWeight = jieba.analyse.tfidf('文本内容...', 10, true);
// [['关键词1', 0.5], ['关键词2', 0.3], ...]
```

#### analyse.textrank(text, topK = 20, withWeight = false)

Extracts keywords using TextRank.

```javascript
const keywords = jieba.analyse.textrank('文本内容...', 10);
```

### Utilities
#### initialize()

Manually initializes the tokenizer (it is lazily loaded by default).

```javascript
jieba.initialize();
```

#### getFreq(word, defaultValue = 0)

Gets a word's frequency from the dictionary.

```javascript
const freq = jieba.getFreq('北京');
```

#### setLogLevel(level)

Sets the log level: `'DEBUG'`, `'INFO'`, `'WARN'`, `'ERROR'`, or `'SILENT'`.

```javascript
jieba.setLogLevel('SILENT');
```

### Parallel Processing
#### enableParallel(processNum = null)

Enables parallel processing using Worker Threads.

```javascript
// Use one worker per CPU core
jieba.enableParallel();

// Use a specific number of workers
jieba.enableParallel(4);
```

#### disableParallel()

Disables parallel processing.

```javascript
await jieba.disableParallel();
```

#### parallel.parallelProcess(text, type, cutAll, HMM)

Processes multi-line text in parallel.

```javascript
const text = `
第一行文本
第二行文本
第三行文本
`;

// Parallel cut
const words = await jieba.parallel.parallelProcess(text, 'cut', false, true);

// Parallel search mode
const searchWords = await jieba.parallel.parallelProcess(text, 'cutForSearch', false, true);
```

**Note:** Parallel processing is most beneficial for large, multi-line texts. For small texts, the overhead of creating workers may outweigh the benefits.
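Since work is divided across the input's lines, a balanced per-worker partition is the key step. The sketch below is illustrative only (`partitionLines` is not part of jieba-node's API):

```javascript
// Illustrative only: split multi-line text into per-worker chunks.
function partitionLines(text, workerCount) {
  const lines = text.split('\n').filter((l) => l.trim().length > 0);
  const chunks = Array.from({ length: workerCount }, () => []);
  // Round-robin assignment keeps chunk sizes balanced.
  lines.forEach((line, i) => chunks[i % workerCount].push(line));
  return chunks.filter((c) => c.length > 0);
}

console.log(partitionLines('第一行\n第二行\n第三行\n第四行\n第五行', 2));
// [['第一行', '第三行', '第五行'], ['第二行', '第四行']]
```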
## CLI Usage

```bash
# Install globally
npm install -g jieba-node

# Basic usage
echo "我爱北京天安门" | jieba
# 我 / 爱 / 北京 / 天安门

# From a file
jieba input.txt

# Custom delimiter
jieba -d " " input.txt
# 我 爱 北京 天安门

# POS tagging
jieba -p input.txt
# 我_r / 爱_v / 北京_ns / 天安门_ns

# Full mode
jieba -a input.txt

# Disable HMM
jieba -n input.txt

# Custom dictionaries
jieba -D custom_dict.txt -u user_dict.txt input.txt

# Quiet mode (no loading messages)
jieba -q input.txt

# Help
jieba --help
```

## Performance
All tests pass, with performance comparable to the original implementation:

- ✔ 32 tests passed
- ✔ Basic segmentation
- ✔ POS tagging
- ✔ Keyword extraction (TF-IDF & TextRank)
- ✔ User dictionary
- ✔ Position tracking
- ✔ Parallel processing

### Dictionary Caching
The dictionary is automatically cached after the first load, significantly improving startup time:

- First load: ~1.0s (builds the cache)
- Subsequent loads: ~0.3s (loads from the cache)

Cache files are stored in the system temp directory and are automatically invalidated when the dictionary file changes.
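The invalidation rule can be sketched as follows (illustrative only; jieba-node's actual cache layout may differ — the idea is that a cache entry records the dictionary file's size and modification time and is reused only while both still match):

```javascript
// Illustrative: reuse the cache only if the dictionary file is
// unchanged (same size and mtime) since the cache was built.
function cacheIsValid(cacheEntry, dictStat) {
  return (
    cacheEntry != null &&
    cacheEntry.size === dictStat.size &&
    cacheEntry.mtimeMs === dictStat.mtimeMs
  );
}

// Unchanged dictionary: fast path (~0.3s load from cache).
console.log(cacheIsValid({ size: 5000000, mtimeMs: 1700000000000 },
                         { size: 5000000, mtimeMs: 1700000000000 })); // true

// Edited dictionary (new mtime): cache is rebuilt (~1.0s).
console.log(cacheIsValid({ size: 5000000, mtimeMs: 1700000000000 },
                         { size: 5000000, mtimeMs: 1700000005000 })); // false
```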
### Parallel Processing

Parallel processing with Worker Threads can significantly speed up processing of large texts:

```javascript
jieba.enableParallel(4);
const largeText = '...'; // 100+ lines of text
const words = await jieba.parallel.parallelProcess(largeText, 'cut');
await jieba.disableParallel();
```

Best for:
- Processing large documents
- Batch-processing multiple texts
- Server-side applications with high throughput
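For server-side batch workloads, a bounded-concurrency wrapper is a common pattern. This is a generic sketch, not part of jieba-node's API; `segment` stands in for any async segmentation call such as `jieba.parallel.parallelProcess`:

```javascript
// Generic pattern: segment many texts with at most `concurrency`
// jobs in flight at once, preserving input order in the results.
async function batchSegment(texts, segment, concurrency = 4) {
  const results = new Array(texts.length);
  let next = 0;
  async function worker() {
    while (next < texts.length) {
      const i = next++; // claim the next unprocessed index
      results[i] = await segment(texts[i]);
    }
  }
  const n = Math.min(concurrency, texts.length);
  await Promise.all(Array.from({ length: n }, () => worker()));
  return results;
}
```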
## Differences from the Original jieba

**Not implemented:**
- Paddle mode (deep learning; requires PaddlePaddle)

**Implemented:**
- ✅ Core segmentation (accurate, full, and search modes)
- ✅ HMM-based recognition of unknown words
- ✅ POS tagging (`posseg` module)
- ✅ Keyword extraction (TF-IDF & TextRank)
- ✅ User dictionary
- ✅ Dictionary caching (JSON format instead of marshal)
- ✅ Parallel processing (Worker Threads instead of multiprocessing)
- ✅ CLI tool
- ✅ Logging control

**Advantages:**
- Pure JavaScript (no native dependencies)
- Works in Node.js and in browsers (with a bundler)
- ESM module support
- Modern async/await patterns
- TypeScript-friendly
## License

MIT

## Credits

Based on the original jieba by fxsjy.
