# jieba-node

Pure JavaScript implementation of jieba Chinese word segmentation.
## Features

✅ **Core Segmentation**
- Accurate mode (default)
- Full mode
- Search-engine mode
- HMM-based recognition of unknown words

✅ **POS Tagging**
- Part-of-speech tagging using an HMM
- Support for custom word tags

✅ **Keyword Extraction**
- TF-IDF algorithm
- TextRank algorithm

✅ **Dictionary Management**
- Custom dictionary support
- User dictionary loading
- Dynamic word addition/deletion
- Word frequency tuning
- Dictionary caching for faster startup

✅ **Advanced Features**
- Position tracking (`tokenize`)
- Parallel processing with Worker Threads
- CLI tool
- Logging control
- Pure JavaScript (no native dependencies)
## Installation

```bash
npm install jieba-node
```

## Quick Start
```javascript
import jieba from 'jieba-node';

// Basic segmentation
const words = jieba.lcut('我来到北京清华大学');
console.log(words);
// ['我', '来到', '北京', '清华大学']

// Search-engine mode
const searchWords = jieba.lcutForSearch('小明硕士毕业于中国科学院计算所');
console.log(searchWords);
// ['小明', '硕士', '毕业', '于', '中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']

// POS tagging
const pairs = jieba.posseg.lcut('我爱北京天安门');
console.log(pairs.map(p => p.toString()).join(' '));
// '我/r 爱/v 北京/ns 天安门/ns'

// Keyword extraction
const keywords = jieba.analyse.tfidf('文本内容...', 10);
console.log(keywords);
```

## API Reference
### Basic Segmentation

#### cut(sentence, cutAll = false, HMM = true)

Segments a sentence into words; returns a generator.

```javascript
for (const word of jieba.cut('我来到北京清华大学')) {
  console.log(word);
}
```

#### lcut(sentence, cutAll = false, HMM = true)

Segments a sentence and returns an array.

```javascript
const words = jieba.lcut('我来到北京清华大学');
```

#### cutForSearch(sentence, HMM = true)

Segments for search engines; returns a generator.

#### lcutForSearch(sentence, HMM = true)

Segments for search engines and returns an array.
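Search-engine mode builds on accurate-mode output. As a rough, self-contained sketch of the idea (not jieba-node's actual code; `dictionary` here is a toy stand-in for the real word dictionary), each long word is additionally scanned for 2- and 3-character substrings that are themselves dictionary words:

```javascript
// Toy dictionary standing in for the real one.
const dictionary = new Set(['中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']);

// Expand accurate-mode words the way search-engine mode does:
// emit known 2-char and 3-char sub-words before each long word.
function searchModeExpand(words) {
  const result = [];
  for (const word of words) {
    if (word.length > 2) {
      for (const n of [2, 3]) {
        if (word.length <= n) continue;
        for (let i = 0; i + n <= word.length; i++) {
          const gram = word.slice(i, i + n);
          if (dictionary.has(gram)) result.push(gram);
        }
      }
    }
    result.push(word); // the full word always comes last
  }
  return result;
}

console.log(searchModeExpand(['中国科学院', '计算所']));
// ['中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']
```

This is why `lcutForSearch` in the Quick Start returns overlapping sub-words such as '中国', '科学院', and '中国科学院' for a single compound.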
### Dictionary Management

#### addWord(word, freq = null)

Adds a word to the dictionary.

```javascript
jieba.addWord('石墨烯');
```

#### delWord(word)

Deletes a word from the dictionary.

```javascript
jieba.delWord('石墨烯');
```

#### loadUserDict(filePath)

Loads a user dictionary from a file. Each line has the format `word freq pos_tag`, where `freq` and `pos_tag` are optional.

```javascript
jieba.loadUserDict('./userdict.txt');
```

#### suggestFreq(segment, tune = false)

Suggests a word frequency; when `tune` is `true`, the suggested frequency is applied to the dictionary.

```javascript
jieba.suggestFreq('台中', true);
```

#### setDictionary(dictPath)

Sets a custom main dictionary path.

```javascript
jieba.setDictionary('./custom_dict.txt');
```

### Position Tracking
#### tokenize(sentence, mode = 'default', HMM = true)

Tokenizes the sentence and yields each word with its start and end positions; returns a generator.

```javascript
for (const [word, start, end] of jieba.tokenize('永和服装饰品有限公司')) {
  console.log(`${word}: ${start}-${end}`);
}
// 永和: 0-2
// 服装: 2-4
// 饰品: 4-6
// 有限公司: 6-10
```

### POS Tagging
#### posseg.cut(sentence, HMM = true)

Segments with POS tags; returns a generator.

```javascript
for (const pair of jieba.posseg.cut('我爱北京天安门')) {
  console.log(pair.word, pair.flag);
}
```

#### posseg.lcut(sentence, HMM = true)

Segments with POS tags and returns an array.

```javascript
const pairs = jieba.posseg.lcut('我爱北京天安门');
for (const pair of pairs) {
  const [word, flag] = pair; // each Pair is iterable
  console.log(`${word}/${flag}`);
}
```

### Keyword Extraction
#### analyse.tfidf(text, topK = 20, withWeight = false)

Extracts keywords using TF-IDF.

```javascript
const keywords = jieba.analyse.tfidf('文本内容...', 10);
// ['关键词1', '关键词2', ...]

const keywordsWithWeight = jieba.analyse.tfidf('文本内容...', 10, true);
// [['关键词1', 0.5], ['关键词2', 0.3], ...]
```

#### analyse.textrank(text, topK = 20, withWeight = false)

Extracts keywords using TextRank.

```javascript
const keywords = jieba.analyse.textrank('文本内容...', 10);
```

### Utilities
#### initialize()

Manually initializes the tokenizer (it is lazily loaded by default).

```javascript
jieba.initialize();
```

#### getFreq(word, defaultValue = 0)

Gets a word's frequency from the dictionary.

```javascript
const freq = jieba.getFreq('北京');
```

#### setLogLevel(level)

Sets the log level: `'DEBUG'`, `'INFO'`, `'WARN'`, `'ERROR'`, or `'SILENT'`.

```javascript
jieba.setLogLevel('SILENT');
```

### Parallel Processing
#### enableParallel(processNum = null)

Enables parallel processing using Worker Threads.

```javascript
// Use one worker per CPU core
jieba.enableParallel();

// Use a specific number of workers
jieba.enableParallel(4);
```

#### disableParallel()

Disables parallel processing.

```javascript
await jieba.disableParallel();
```

#### parallel.parallelProcess(text, type, cutAll, HMM)

Processes multi-line text in parallel.

```javascript
const text = `
第一行文本
第二行文本
第三行文本
`;

// Parallel cut
const words = await jieba.parallel.parallelProcess(text, 'cut', false, true);

// Parallel search mode
const searchWords = await jieba.parallel.parallelProcess(text, 'cutForSearch', false, true);
```

**Note:** Parallel processing is most beneficial for large, multi-line texts. For small texts, the overhead of creating workers may outweigh the benefits.
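Since work is divided across the input's lines, a balanced per-worker partition is the key step. The sketch below is illustrative only (`partitionLines` is not part of jieba-node's API):

```javascript
// Illustrative only: split multi-line text into per-worker chunks.
function partitionLines(text, workerCount) {
  const lines = text.split('\n').filter((l) => l.trim().length > 0);
  const chunks = Array.from({ length: workerCount }, () => []);
  // Round-robin assignment keeps chunk sizes balanced.
  lines.forEach((line, i) => chunks[i % workerCount].push(line));
  return chunks.filter((c) => c.length > 0);
}

console.log(partitionLines('第一行\n第二行\n第三行\n第四行\n第五行', 2));
// [['第一行', '第三行', '第五行'], ['第二行', '第四行']]
```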
## CLI Usage

```bash
# Install globally
npm install -g jieba-node

# Basic usage
echo "我爱北京天安门" | jieba
# 我 / 爱 / 北京 / 天安门

# From a file
jieba input.txt

# Custom delimiter
jieba -d " " input.txt
# 我 爱 北京 天安门

# POS tagging
jieba -p input.txt
# 我_r / 爱_v / 北京_ns / 天安门_ns

# Full mode
jieba -a input.txt

# Disable HMM
jieba -n input.txt

# Custom dictionaries
jieba -D custom_dict.txt -u user_dict.txt input.txt

# Quiet mode (no loading messages)
jieba -q input.txt

# Help
jieba --help
```

## Performance
All tests pass, with performance comparable to the original implementation:

- ✔ 32 tests passed
- ✔ Basic segmentation
- ✔ POS tagging
- ✔ Keyword extraction (TF-IDF & TextRank)
- ✔ User dictionary
- ✔ Position tracking
- ✔ Parallel processing

### Dictionary Caching
The dictionary is automatically cached after the first load, significantly improving startup time:

- First load: ~1.0s (builds the cache)
- Subsequent loads: ~0.3s (loads from the cache)

Cache files are stored in the system temp directory and are automatically invalidated when the dictionary file changes.
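The invalidation rule can be sketched as follows (illustrative only; jieba-node's actual cache layout may differ — the idea is that a cache entry records the dictionary file's size and modification time and is reused only while both still match):

```javascript
// Illustrative: reuse the cache only if the dictionary file is
// unchanged (same size and mtime) since the cache was built.
function cacheIsValid(cacheEntry, dictStat) {
  return (
    cacheEntry != null &&
    cacheEntry.size === dictStat.size &&
    cacheEntry.mtimeMs === dictStat.mtimeMs
  );
}

// Unchanged dictionary: fast path (~0.3s load from cache).
console.log(cacheIsValid({ size: 5000000, mtimeMs: 1700000000000 },
                         { size: 5000000, mtimeMs: 1700000000000 })); // true

// Edited dictionary (new mtime): cache is rebuilt (~1.0s).
console.log(cacheIsValid({ size: 5000000, mtimeMs: 1700000000000 },
                         { size: 5000000, mtimeMs: 1700000005000 })); // false
```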
### Parallel Processing

Parallel processing with Worker Threads can significantly speed up processing of large texts:

```javascript
jieba.enableParallel(4);
const largeText = '...'; // 100+ lines of text
const words = await jieba.parallel.parallelProcess(largeText, 'cut');
await jieba.disableParallel();
```

Best for:
- Processing large documents
- Batch-processing multiple texts
- Server-side applications with high throughput
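For server-side batch workloads, a bounded-concurrency wrapper is a common pattern. This is a generic sketch, not part of jieba-node's API; `segment` stands in for any async segmentation call such as `jieba.parallel.parallelProcess`:

```javascript
// Generic pattern: segment many texts with at most `concurrency`
// jobs in flight at once, preserving input order in the results.
async function batchSegment(texts, segment, concurrency = 4) {
  const results = new Array(texts.length);
  let next = 0;
  async function worker() {
    while (next < texts.length) {
      const i = next++; // claim the next unprocessed index
      results[i] = await segment(texts[i]);
    }
  }
  const n = Math.min(concurrency, texts.length);
  await Promise.all(Array.from({ length: n }, () => worker()));
  return results;
}
```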
## Differences from the Original jieba

**Not implemented:**
- Paddle mode (deep learning; requires PaddlePaddle)

**Implemented:**
- ✅ Core segmentation (accurate, full, and search modes)
- ✅ HMM-based recognition of unknown words
- ✅ POS tagging (`posseg` module)
- ✅ Keyword extraction (TF-IDF & TextRank)
- ✅ User dictionary
- ✅ Dictionary caching (JSON format instead of marshal)
- ✅ Parallel processing (Worker Threads instead of multiprocessing)
- ✅ CLI tool
- ✅ Logging control

**Advantages:**
- Pure JavaScript (no native dependencies)
- Works in Node.js and in browsers (with a bundler)
- ESM module support
- Modern async/await patterns
- TypeScript-friendly
## License

MIT

## Credits

Based on the original jieba by fxsjy.
