
jieba-node

v1.0.1

Pure JavaScript implementation of jieba Chinese word segmentation.

Features

Core Segmentation

  • Accurate mode (default)
  • Full mode
  • Search engine mode
  • HMM-based unknown word recognition

POS Tagging (词性标注)

  • Part-of-speech tagging using HMM
  • Support for custom word tags

Keyword Extraction

  • TF-IDF algorithm
  • TextRank algorithm

Dictionary Management

  • Custom dictionary support
  • User dictionary loading
  • Dynamic word addition/deletion
  • Word frequency tuning
  • Dictionary caching for faster startup

Advanced Features

  • Position tracking (tokenize)
  • Parallel processing with Worker Threads
  • CLI tool
  • Logging control
  • Pure JavaScript (no native dependencies)

Installation

npm install jieba-node

Quick Start

import jieba from 'jieba-node';

// Basic segmentation
const words = jieba.lcut('我来到北京清华大学');
console.log(words);
// ['我', '来到', '北京', '清华大学']

// Search engine mode
const searchWords = jieba.lcutForSearch('小明硕士毕业于中国科学院计算所');
console.log(searchWords);
// ['小明', '硕士', '毕业', '于', '中国', '科学', '学院', '科学院', '中国科学院', '计算', '计算所']

// POS tagging
const pairs = jieba.posseg.lcut('我爱北京天安门');
console.log(pairs.map(p => p.toString()).join(' '));
// '我/r 爱/v 北京/ns 天安门/ns'

// Keyword extraction
const keywords = jieba.analyse.tfidf('文本内容...', 10);
console.log(keywords);

API Reference

Basic Segmentation

cut(sentence, cutAll = false, HMM = true)

Segment sentence into words (returns generator).

for (const word of jieba.cut('我来到北京清华大学')) {
  console.log(word);
}

lcut(sentence, cutAll = false, HMM = true)

Segment sentence and return as array.

const words = jieba.lcut('我来到北京清华大学');

cutForSearch(sentence, HMM = true)

Segment for search engines (returns generator).

lcutForSearch(sentence, HMM = true)

Segment for search engines and return as array.

Dictionary Management

addWord(word, freq = null)

Add word to dictionary.

jieba.addWord('石墨烯');

delWord(word)

Delete word from dictionary.

jieba.delWord('石墨烯');

loadUserDict(filePath)

Load user dictionary from file.

Format: one entry per line, word freq pos_tag (freq and pos_tag are optional)

jieba.loadUserDict('./userdict.txt');
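
For illustration, a userdict.txt along these lines would be accepted (the entries are examples, not part of the shipped dictionary; each line is a word, an optional frequency, and an optional POS tag):

```
云计算 5
创新办 3 i
凯特琳 nz
台中
```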

suggestFreq(segment, tune = false)

Suggest the frequency a word needs so that it can (or cannot) be segmented as a single word; with tune = true, the suggested frequency is applied to the dictionary.

jieba.suggestFreq('台中', true);

setDictionary(dictPath)

Set custom dictionary path.

jieba.setDictionary('./custom_dict.txt');

Position Tracking

tokenize(sentence, mode = 'default', HMM = true)

Tokenize and return words with positions (returns generator).

for (const [word, start, end] of jieba.tokenize('永和服装饰品有限公司')) {
  console.log(`${word}: ${start}-${end}`);
}
// 永和: 0-2
// 服装: 2-4
// 饰品: 4-6
// 有限公司: 6-10
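
The offsets are end-exclusive character indices into the original string, so each triple slices cleanly out of the input and the slices concatenate back to the sentence. A self-contained check using the output shown above (no jieba-node required):

```javascript
// Token triples copied from the tokenize() output above.
const sentence = '永和服装饰品有限公司';
const tokens = [
  ['永和', 0, 2],
  ['服装', 2, 4],
  ['饰品', 4, 6],
  ['有限公司', 6, 10],
];

// Each [word, start, end] slices cleanly out of the sentence...
for (const [word, start, end] of tokens) {
  console.assert(sentence.slice(start, end) === word);
}

// ...and the words concatenate back to the original input.
const rebuilt = tokens.map(([word]) => word).join('');
console.log(rebuilt === sentence); // true
```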

POS Tagging

posseg.cut(sentence, HMM = true)

Segment with POS tags (returns generator).

for (const pair of jieba.posseg.cut('我爱北京天安门')) {
  console.log(pair.word, pair.flag);
}

posseg.lcut(sentence, HMM = true)

Segment with POS tags and return as array.

const pairs = jieba.posseg.lcut('我爱北京天安门');
for (const pair of pairs) {
  const [word, flag] = pair; // Pair is iterable
  console.log(`${word}/${flag}`);
}

Keyword Extraction

analyse.tfidf(text, topK = 20, withWeight = false)

Extract keywords using TF-IDF.

const keywords = jieba.analyse.tfidf('文本内容...', 10);
// ['关键词1', '关键词2', ...]

const keywordsWithWeight = jieba.analyse.tfidf('文本内容...', 10, true);
// [['关键词1', 0.5], ['关键词2', 0.3], ...]
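
For intuition, here is a tiny self-contained sketch of the TF-IDF idea (illustrative only, not the library's implementation — the corpus and tokens below are made up): each term of a target document is scored by its term frequency times the inverse document frequency across a corpus, and terms that are common in the document but rare in the corpus rank highest.

```javascript
// Toy pre-tokenized corpus; the first document is the one we score.
const docs = [
  ['机器', '学习', '模型'],
  ['深度', '学习', '网络'],
  ['机器', '翻译'],
];
const target = docs[0];

// Inverse document frequency: rarer across the corpus => larger value.
const idf = (term) => {
  const df = docs.filter((d) => d.includes(term)).length;
  return Math.log(docs.length / df);
};

// Term frequency within the target document.
const tf = new Map();
for (const term of target) tf.set(term, (tf.get(term) ?? 0) + 1);

// Score = tf * idf, sorted descending.
const ranked = [...tf.entries()]
  .map(([term, count]) => [term, (count / target.length) * idf(term)])
  .sort((a, b) => b[1] - a[1]);

console.log(ranked.map(([term]) => term));
// '模型' ranks first: it appears only in the target document.
```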

analyse.textrank(text, topK = 20, withWeight = false)

Extract keywords using TextRank.

const keywords = jieba.analyse.textrank('文本内容...', 10);

Utilities

initialize()

Manually initialize the tokenizer (by default it is initialized lazily, on first use).

jieba.initialize();

getFreq(word, defaultValue = 0)

Get word frequency from dictionary.

const freq = jieba.getFreq('北京');

setLogLevel(level)

Set log level: 'DEBUG', 'INFO', 'WARN', 'ERROR', or 'SILENT'.

jieba.setLogLevel('SILENT');

Parallel Processing

enableParallel(processNum = null)

Enable parallel processing using Worker Threads.

// Use CPU count
jieba.enableParallel();

// Use specific number of workers
jieba.enableParallel(4);

disableParallel()

Disable parallel processing.

await jieba.disableParallel();

parallel.parallelProcess(text, type, cutAll, HMM)

Process multi-line text in parallel.

const text = `
第一行文本
第二行文本
第三行文本
`;

// Parallel cut
const words = await jieba.parallel.parallelProcess(text, 'cut', false, true);

// Parallel search mode
const searchWords = await jieba.parallel.parallelProcess(text, 'cutForSearch', false, true);

Note: Parallel processing is most beneficial for large texts with multiple lines. For small texts, the overhead of creating workers may outweigh the benefits.

CLI Usage

# Install globally
npm install -g jieba-node

# Basic usage
echo "我爱北京天安门" | jieba
# 我 / 爱 / 北京 / 天安门

# From file
jieba input.txt

# Custom delimiter
jieba -d " " input.txt
# 我 爱 北京 天安门

# POS tagging
jieba -p input.txt
# 我_r / 爱_v / 北京_ns / 天安门_ns

# Full mode
jieba -a input.txt

# Disable HMM
jieba -n input.txt

# Custom dictionary
jieba -D custom_dict.txt -u user_dict.txt input.txt

# Quiet mode (no loading messages)
jieba -q input.txt

# Help
jieba --help

Performance

All tests pass, with performance comparable to the original Python implementation:

✔ 32 tests passed
✔ Basic segmentation
✔ POS tagging
✔ Keyword extraction (TF-IDF & TextRank)
✔ User dictionary
✔ Position tracking
✔ Parallel processing

Dictionary Caching

The dictionary is automatically cached after the first load, which significantly improves startup time on subsequent runs:

  • First load: ~1.0s (builds cache)
  • Subsequent loads: ~0.3s (loads from cache)

Cache files are stored in the system temp directory and automatically invalidated when the dictionary file changes.

Parallel Processing

Parallel processing using Worker Threads can significantly speed up processing of large texts:

jieba.enableParallel(4);

const largeText = '...'; // 100+ lines of text
const words = await jieba.parallel.parallelProcess(largeText, 'cut');

await jieba.disableParallel();

Best for:

  • Processing large documents
  • Batch processing multiple texts
  • Server-side applications with high throughput

Differences from Original jieba

Not Implemented:

  • Paddle mode (deep learning, requires PaddlePaddle)

Implemented:

  • ✅ Core segmentation (accurate, full, search modes)
  • ✅ HMM-based unknown word recognition
  • ✅ POS tagging (posseg module)
  • ✅ Keyword extraction (TF-IDF & TextRank)
  • ✅ User dictionary
  • ✅ Dictionary caching (JSON format instead of marshal)
  • ✅ Parallel processing (Worker Threads instead of multiprocessing)
  • ✅ CLI tool
  • ✅ Logging control

Advantages:

  • Pure JavaScript (no native dependencies)
  • Works in Node.js and in browsers (with a bundler)
  • ESM module support
  • Modern async/await patterns
  • TypeScript-friendly

License

MIT

Credits

Based on the original jieba by fxsjy.