@bunkojp/text-sentiment
v0.1.0
Published
Multilingual sentiment analysis with Scunthorpe-safe tokenization.
Readme
@bunkojp/text-sentiment
Multilingual sentiment analysis with Scunthorpe-safe tokenization.
- Multiple strategies — Naive Bayes (primary), TF-IDF, NCD, and weighted ensemble
- Toxic content detection — offensive word matching that never flags safe compound words
- No server required — runs entirely in-process (Node, Bun, or browser)
- Data-driven — all lexicons loaded from pre-built compressed binaries, no hardcoded word lists in source
Overview
Architecture
analyze(text, language, options?)
|
+-- tokenizeText(text, language) <- Single Source of Truth
| |
| +-- JA: mikan.js-style segmenter (character-class boundaries)
| +-- EN: whitespace split + punctuation strip
|
+-- tokens -> classifyByNaiveBayes() <- pre-trained model from binary
+-- tokens -> classifyByTfidf()
+-- tokens -> classifyByNcd()
+-- tokens -> toxic detection (exact token match)
+-- tokens -> category breakdown (lexicon lookup)Scunthorpe Problem Prevention
The Japanese segmenter splits text at character-class transitions (kanji / hiragana / katakana / latin). Katakana compound words stay as single tokens, so offensive substrings embedded within them are never produced as independent tokens.
For English, standard whitespace tokenization already keeps compound words like "Scunthorpe" and "cocktail" intact.
Data Format
All lexicon data is stored as compressed MessagePack binaries in src/data/:
| Optimization | Effect | |---|---| | Columnar layout | words[], scores[], categories[] stored separately | | Score delta encoding | sorted scores have small consecutive differences | | NB palette indexing | log-likelihood triplets compressed to palette + uint8 index | | Category enum | string categories mapped to uint8 | | deflate level 9 | maximum compression |
Supported Languages
| Language | Sentiment Lexicon | Toxic Lexicon | |---|---|---| | Japanese (ja) | 11,293 words (Tohoku Univ. via oseti) | 748 words | | English (en) | 8,219 words (AFINN + VADER) | 1,540 words (cuss) |
Adding a new language requires only running the build script with new corpus URLs — no code changes.
Getting Started
Quick Start (Node / Bun)
import { readFileSync } from "node:fs";
import { analyze, registerLexiconFromBinary, registerToxicLexiconFromBinary } from "@bunkojp/text-sentiment";
// Load lexicon binaries (once at startup)
registerLexiconFromBinary("ja", new Uint8Array(readFileSync("node_modules/@bunkojp/text-sentiment/src/data/sentiment-ja.bin")));
registerLexiconFromBinary("en", new Uint8Array(readFileSync("node_modules/@bunkojp/text-sentiment/src/data/sentiment-en.bin")));
registerToxicLexiconFromBinary("ja", new Uint8Array(readFileSync("node_modules/@bunkojp/text-sentiment/src/data/toxic-ja.bin")));
registerToxicLexiconFromBinary("en", new Uint8Array(readFileSync("node_modules/@bunkojp/text-sentiment/src/data/toxic-en.bin")));
// Analyze
const result = analyze("This movie is absolutely wonderful.", "en");
console.log(result.sentiment.label); // "positive"
console.log(result.sentiment.confidence); // 0.95Quick Start (Browser)
import { analyze, registerLexiconFromBinary, registerToxicLexiconFromBinary } from "@bunkojp/text-sentiment";
// Fetch and register binaries
const data = await fetch("/sentiment-en.bin").then(r => r.arrayBuffer());
registerLexiconFromBinary("en", new Uint8Array(data));
const result = analyze("Terrible experience.", "en");
console.log(result.sentiment.label); // "negative"Usage
Basic Sentiment Analysis
// Default: Naive Bayes
analyze("素晴らしい作品です", "ja")
// { sentiment: { label: "positive", confidence: 0.94, scores: {...} }, tokens: [...] }
// Select strategy
analyze("Great product", "en", { strategy: "tfidf" })
analyze("Great product", "en", { strategy: "ncd" })Ensemble
Weighted combination of all three strategies:
analyze("素晴らしい作品です", "ja", { ensemble: {} })
// Default weights: naive-bayes 0.6, tfidf 0.25, ncd 0.15
// Custom weights
analyze("text", "en", {
ensemble: {
strategies: ["naive-bayes", "tfidf"],
weights: { "naive-bayes": 0.7, "tfidf": 0.3 },
},
})Toxic Content Detection
const r = analyze("text", "ja", { toxic: true });
r.toxic?.toxic // boolean
r.toxic?.matches // [{ word, severity, category }]Per-Category Breakdown
const r = analyze("text", "ja", { categories: true });
r.categories
// { general: { label, confidence, scores }, quality: {...}, ... }Categories: general, quality, service, price, usability, emotion, appearance
All Options Combined
analyze("text", "ja", {
strategy: "naive-bayes",
ensemble: { weights: { "naive-bayes": 0.6, tfidf: 0.25, ncd: 0.15 } },
categories: true,
toxic: true,
smoothing: 1,
neutralThreshold: 0.05,
})Low-Level API
For direct access to individual classifiers:
import { tokenizeText, classifyByNaiveBayes, getLexicon } from "@bunkojp/text-sentiment";
const tokens = tokenizeText("text", "ja");
const lexicon = getLexicon("ja");
const result = classifyByNaiveBayes(tokens, lexicon);Installation
npm install @bunkojp/text-sentiment
# or
bun add @bunkojp/text-sentimentBuilding from Source
git clone https://github.com/bunko-jp/text-sentiment.git
cd text-sentiment
bun install
bun run build:data # Download corpora and build lexicon binaries
bun run build # Build library
bun run test # Run testsRebuilding Lexicon Data
The lexicon binaries in src/data/ are pre-built and included in the package. To rebuild from external corpora:
bun run build:dataThis downloads from:
- Tohoku University sentiment dictionary (via oseti)
- AFINN-165 + VADER
- inappropriate-words-ja + LDNOOBW V2
- words/cuss
See THIRD-PARTY-LICENSES for full license details.
Demo
bun run demo
# Opens http://localhost:5173 with a React-based interactive demoLicense
CC0-1.0 - see LICENSE for details.
