# Word Counter

Locale-aware word counting powered by the Web API [`Intl.Segmenter`](https://developer.mozilla.org/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter). The script automatically detects the primary writing system for each portion of the input, segments the text with matching BCP 47 locale tags, and reports word totals per locale.
## Quick Start (npx)

Runtime requirement: Node.js `>=22.18.0`.

Run without installing:

```sh
npx @dev-pi2pie/word-counter "Hello 世界 안녕"
```

Pipe stdin:

```sh
echo "こんにちは world مرحبا" | npx @dev-pi2pie/word-counter
```

File input:

```sh
npx @dev-pi2pie/word-counter --path ./examples/yaml-basic.md
```

## Install and Usage Paths
Pick one path based on how often you use it:

- One-off use: `npx @dev-pi2pie/word-counter ...` (no install, best for quick checks and CI snippets).
- Frequent CLI use: `npm install -g @dev-pi2pie/word-counter@latest`, then run `word-counter ...`.
- Library use in code: `npm install @dev-pi2pie/word-counter` and import from your app/scripts.
For local development in this repository:

```sh
git clone https://github.com/dev-pi2pie/word-counter.git
cd word-counter
rustup target add wasm32-unknown-unknown
cargo install wasm-pack --locked
bun install
bun run build
npm link
```

Then:

```sh
word-counter "Hello 世界 안녕"
```

To remove the global link:

```sh
npm unlink --global @dev-pi2pie/word-counter
```

## CLI Usage
Basic text:

```sh
word-counter "Hello 世界 안녕"
```

Hint a language tag for ambiguous Latin text:

```sh
word-counter --latin-language en "Hello world"
word-counter --latin-tag en "Hello world"
```

Add custom Latin hint rules (repeatable) or load from JSON:

```sh
word-counter --latin-hint 'pl=[ąćęłńóśźżĄĆĘŁŃÓŚŹŻ]' "Zażółć gęślą jaźń"
word-counter --latin-hint 'tr=[çğıöşüÇĞİÖŞÜ]' --latin-hint 'ro=[ăâîșțĂÂÎȘȚ]' "șță"
word-counter --latin-hints-file ./examples/latin-hints.json "Zażółć Știință Iğdır"
word-counter --no-default-latin-hints --latin-hint 'pl=[ąćęłńóśźżĄĆĘŁŃÓŚŹŻ]' "Zażółć"
```

`examples/latin-hints.json` format:
```json
[
  { "tag": "pl", "pattern": "[ąćęłńóśźżĄĆĘŁŃÓŚŹŻ]" },
  { "tag": "tr", "pattern": "[çğıöşüÇĞİÖŞÜ]", "priority": 1 }
]
```

Hint a language tag for Han fallback:

```sh
word-counter --han-language zh-Hant "漢字測試"
word-counter --han-tag zh-Hans "汉字测试"
```

Enable the optional WASM detector for ambiguous Latin and Han routes:
```sh
word-counter --detector wasm "This sentence should clearly be detected as English for the wasm detector path."
word-counter --detector wasm "漢字測試需要更多內容才能觸發偵測"
word-counter --detector wasm --content-gate strict "Internationalization documentation remains understandable."
word-counter --detector wasm --content-gate loose "四字成語"
word-counter --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
```

Inspect detector behavior without count output:
```sh
word-counter inspect "こんにちは、世界!これはテストです。"
word-counter inspect --detector wasm --view engine "This sentence should clearly be detected as English for the wasm detector path."
word-counter inspect --detector regex -f json "こんにちは、世界!これはテストです。"
word-counter inspect --detector regex -f json --pretty "こんにちは、世界!これはテストです。"
word-counter inspect --detector wasm --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
word-counter inspect -p ./examples/yaml-basic.md
word-counter inspect -p ./examples/test-case-multi-files-support
word-counter inspect -p ./examples/test-case-multi-files-support --section content -f json --pretty
```

Detector mode notes:
- `--detector regex` is the default behavior.
- `--detector wasm` only runs for ambiguous `und-Latn` and `und-Hani` chunks.
- `--content-gate default|strict|loose|off` configures the shared detector policy mode used by the WASM detector path.
  - `default`: current fixture-backed project policy
  - `strict`: raises detector eligibility thresholds and makes more borderline windows fall back
  - `loose`: lowers detector eligibility thresholds and makes more borderline windows eligible or upgradable
  - `off`: bypasses `contentGate` evaluation only
- mode behavior differs by route:
  - `und-Latn`: `default|strict|loose` affect both eligibility and the Latin prose-style `contentGate`
  - `und-Hani`: `default|strict|loose` affect eligibility only, while `contentGate` still reports `policy=none`
- current Hani behavior:
  - `default`: keeps the current Hani diagnostic-sample threshold
  - `strict`: raises the Hani diagnostic-sample threshold
  - `loose`: uses a short-window Han-focused threshold so idiom-length samples such as 四字成語 can become eligible
  - `off`: keeps the same Hani eligibility thresholds as `default`
- `--detector regex` keeps the original script/regex chunk-first detection path.
- `--detector wasm` uses a detector-oriented ambiguous-window scoring pass before accepted tags are projected back onto the counting chunks.
- In `--detector wasm` mode, Latin hint rules and explicit Latin hint flags are deferred until after detector evaluation and only relabel unresolved `und-Latn` output.
- Very short chunks stay on the original `und-*` fallback.
- Low-confidence or unsupported detector results fall back to `und-*`.
- Technical-noise-heavy Latin windows stay conservative and may remain `und-Latn` even when the detector produces a wrong-but-confident language guess.
- inspect/debug disclosure uses `contentGate` as the canonical gate field.
- legacy debug/evidence payloads still emit `qualityGate` as a compatibility alias derived from `contentGate.passed`.
- for practical verification, use `inspect` to compare direct mode outcomes across `default`, `strict`, `loose`, and `off`; use `--debug --detector-evidence` when you specifically need counting-flow event details or legacy `qualityGate` compatibility

`word-counter inspect` supports:

- positional text input
- one direct `-p, --path <file>` input
- repeated `-p, --path` inputs for batch inspect
- directory inputs in default `--path-mode auto`
- literal file-only path handling in `--path-mode manual`
- `--section all|frontmatter|content`
- batch inspect keeps counting-style path acquisition but not counting aggregation:
  - no inspect `--merged`
  - no inspect `--per-file`
  - no inspect `--jobs`
## Config Files

word-counter supports config files in these canonical names:

- `wc-intl-seg.config.toml`
- `wc-intl-seg.config.json`
- `wc-intl-seg.config.jsonc`

Config precedence is:

```
built-in defaults
  < user config dir / wc-intl-seg.config.{toml|jsonc|json}
  < cwd / wc-intl-seg.config.{toml|jsonc|json}
  < environment variables
  < flag options
```

Same-scope file priority is `toml > jsonc > json`.

If lower-priority sibling config files are ignored, the CLI emits a warning.
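The precedence chain above amounts to "later source wins" merging. A minimal sketch in plain JavaScript, with made-up layer contents (only the ordering mirrors the documented behavior, not the package's actual config schema):

```javascript
// Illustrative "later source wins" merge; keys and values are hypothetical.
const builtinDefaults = { detector: "regex", recursive: true };
const userConfig = { jobs: 2 };           // user config dir file
const cwdConfig = { jobs: 4 };            // ./wc-intl-seg.config.*
const envOverrides = {};                  // e.g. WORD_COUNTER_CONTENT_GATE
const flagOptions = { detector: "wasm" }; // CLI flags win last

const effective = {
  ...builtinDefaults,
  ...userConfig,
  ...cwdConfig,
  ...envOverrides,
  ...flagOptions,
};

console.log(effective); // → { detector: 'wasm', recursive: true, jobs: 4 }
```

Spread order left to right mirrors the precedence list: any key a later layer defines overrides the same key from an earlier layer.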
Detector config notes:

- counting defaults to `regex`; `inspect` also defaults to `regex`
- root `detector` controls normal counting
- optional `inspect.detector` overrides inspect-only behavior
- root `contentGate.mode` controls detector-policy defaults for counting
- optional `inspect.contentGate.mode` overrides inspect-only detector-policy behavior
- `WORD_COUNTER_CONTENT_GATE` overrides config-derived content-gate defaults
- `--content-gate` stays the highest-precedence detector-policy override
- `inspect --detector` only affects the current inspect invocation
Examples:

```sh
word-counter -d wasm "This sentence should clearly be detected as English for the wasm detector path."
word-counter --content-gate strict "Internationalization documentation remains understandable."
word-counter inspect -d regex -f json "こんにちは、世界!これはテストです。"
word-counter inspect --content-gate off "mode: debug\ntee: true\npath: logs\nUse this for testing."
word-counter --path ./examples/test-case-multi-files-support --format json
```

Default-reference config examples live under:

- `examples/wc-config/wc-intl-seg.config.toml`
- `examples/wc-config/wc-intl-seg.config.json`
- `examples/wc-config/wc-intl-seg.config.jsonc`

For full config behavior, platform-specific user config locations, merge rules, and examples, see `docs/config-usage-guide.md`.
## Detector Subpath (`@dev-pi2pie/word-counter/detector`)

Use the detector subpath when you need async detector-aware APIs directly in library code.

```js
import {
  inspectTextWithDetector,
  segmentTextByLocaleWithDetector,
  wordCounterWithDetector,
} from "@dev-pi2pie/word-counter/detector";

const inspectResult = await inspectTextWithDetector("こんにちは、世界!これはテストです。", {
  detector: "wasm",
  view: "pipeline",
});

const countResult = await wordCounterWithDetector(
  "Internationalization documentation remains understandable.",
  {
    detector: "wasm",
    contentGate: { mode: "strict" },
  },
);
```

Detector subpath notes:
- detector entrypoints are async
- use the root package for normal counting when you do not need detector-specific control
- detector-subpath APIs that execute detector policy also accept: `contentGate: { mode: "default" | "strict" | "loose" | "off" }`
- use `detectorDebug` for counting-flow runtime diagnostics
- use `inspectTextWithDetector()` for direct detector diagnosis as structured data
Collect non-words (emoji/symbols/punctuation):

```sh
word-counter --non-words "Hi 👋, world!"
```

Override total composition:

```sh
word-counter --non-words --total-of words "Hi 👋, world!"
word-counter --total-of punctuation --format raw "Hi, world!"
word-counter --total-of words,emoji --format json "Hi 👋, world!"
```

## Batch Counting (`--path`)
Repeat `--path` for mixed inputs (files and/or directories):

```sh
word-counter --path ./docs/a.md --path ./docs --path ./notes.txt
```

Directory scans are recursive by default:

```sh
word-counter --path ./examples/test-case-multi-files-support
word-counter --path ./examples/test-case-multi-files-support --recursive
word-counter --path ./examples/test-case-multi-files-support --no-recursive
```

Show per-file plus merged summary:

```sh
word-counter --path ./examples/test-case-multi-files-support --per-file
```

Progress behavior in standard batch mode:

```sh
word-counter --path ./examples/test-case-multi-files-support
word-counter --path ./examples/test-case-multi-files-support --progress
word-counter --path ./examples/test-case-multi-files-support --no-progress
word-counter --path ./examples/test-case-multi-files-support --keep-progress
```

Progress is transient by default, auto-disabled for single-input runs, and suppressed in `--format raw` and `--format json`.
## Batch Concurrency (`--jobs`)

Use `--jobs` to control batch concurrency:

```sh
word-counter --path ./examples/test-case-multi-files-support --jobs 1
word-counter --path ./examples/test-case-multi-files-support --jobs 4
```

Quick policy:

- no `--jobs` and `--jobs 1` are equivalent baseline behavior.
- `--jobs 1`: async main-thread `load+count` baseline.
- `--jobs > 1`: worker `load+count` with async fallback when workers are unavailable.
- if the requested `--jobs` exceeds the host `suggestedMaxJobs` (from `--print-jobs-limit`), the CLI warns and runs with the suggested limit as a safety cap.
- use `--quiet-warnings` to suppress non-fatal warning lines (for example config discovery notes, the jobs-limit advisory, and the worker-fallback warning).
Inspect host jobs diagnostics:

```sh
word-counter --print-jobs-limit
```

`--print-jobs-limit` must be used alone (no other inputs or runtime flags).
## Doctor (`doctor`)

Use `doctor` to verify whether the current host can run word-counter reliably:

```sh
word-counter doctor
word-counter doctor --format json
word-counter doctor --format json --pretty
```

Doctor scope in v1:

- checks runtime support policy against Node.js `>=22.18.0`
- verifies `Intl.Segmenter` availability plus word/grapheme constructor health
- reports batch jobs host limits using the same heuristics as `--print-jobs-limit`
- reports worker-route preflight signals and the worker-disable env toggle that affects worker availability
Doctor output contract:

- default output is human-readable text
- `--format json` prints compact machine-readable JSON
- `--format json --pretty` prints indented JSON
- doctor exits with code `0` for `ok`/`warn`, `1` for invalid doctor usage, and `2` for runtime `fail`
- doctor does not accept counting inputs, `--path`, `--jobs`, or other counting/debug flags

For a field-by-field explanation of doctor text and JSON output, see `docs/doctor-usage-guide.md`.

For full policy details, JSON parity expectations (`--misc`, `--total-of whitespace,words`), and benchmark standards, see `docs/batch-jobs-usage-guide.md`.
## Stable Path Resolution Contract

- Repeated `--path` values are accepted as mixed inputs (file + directory).
- In `--path-mode auto` (default), directory inputs are expanded to files.
- `--recursive` explicitly enables recursive traversal and overrides non-recursive config/env defaults.
- `--no-recursive` explicitly disables recursive traversal for the current invocation.
- In `--path-mode manual`, `--path` values are treated as literal file inputs; `--path <dir>` is not supported and is skipped as `not a regular file`.
- Extension and regex filters apply only to files discovered from directory expansion.
- Direct file inputs are always considered regardless of `--include-ext`/`--exclude-ext`/`--regex`.
- Overlap dedupe is by resolved absolute file path.
- If the same file is discovered multiple ways (repeated roots, nested roots, explicit file + directory), it is counted once.
- Final processing order is deterministic: resolved files are sorted by absolute path ascending before load/count.
Path mode examples:

```sh
word-counter --path ./examples/test-case-multi-files-support --path-mode auto
word-counter --path ./examples/test-case-multi-files-support --path-mode manual
word-counter --path ./examples/test-case-multi-files-support/a.md --path-mode manual
```

## Extension Filters
Use include/exclude filters for directory scans:

```sh
word-counter --path ./examples/test-case-multi-files-support --include-ext .md,.mdx
word-counter --path ./examples/test-case-multi-files-support --include-ext .md,.txt --exclude-ext .txt
```

Direct file path example (filters do not block explicit file inputs):

```sh
word-counter --path ./examples/test-case-multi-files-support/ignored.js --include-ext .md --exclude-ext .md
```

## Regex Filter (`--regex`)
Use `--regex` to include only directory-scanned files whose root-relative path matches:

```sh
word-counter --path ./examples/test-case-multi-files-support --regex '^a\\.md$'
word-counter --path ./examples/test-case-multi-files-support --regex '^nested/.*\\.md$'
word-counter --path ./examples/test-case-multi-files-support --path ./examples --regex '\\.md$'
```

Regex behavior contract:

- `--regex` applies only to files discovered from `--path <dir>` expansion.
- Matching is against each directory root-relative path.
- The same regex is applied across all provided directory roots.
- Direct file inputs are literal and are not blocked by regex filters.
- In `--path-mode manual`, directories are not expanded, so `--include-ext`, `--exclude-ext`, and `--regex` have no effect.
- `--regex` is single-use; repeated `--regex` flags fail fast with a misuse error.
- Empty regex values are treated as no regex restriction.
For additional usage details and troubleshooting, see `docs/regex-usage-guide.md`.
## Debugging Diagnostics (`--debug`)

Noise policy: default output shows errors + warnings; `--debug` enables diagnostics; `--verbose` enables per-item diagnostics; `--quiet-warnings` suppresses warnings.

`--debug` remains the diagnostics gate and now defaults to compact event volume:

- lifecycle/stage timing events
- resolved/skipped summary events
- dedupe/filter summary counts

Use `--verbose` to include per-file/per-path events:

```sh
word-counter --path ./examples/test-case-multi-files-support --debug --verbose
```

Use `--debug-report [path]` to route debug diagnostics to a JSONL report file:
- no path: writes to the current working directory with pattern `wc-debug-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
- no path with `--detector-evidence`: writes with pattern `wc-detector-evidence-YYYYMMDD-HHmmss-utc-<pid>.jsonl`
- path provided: writes to the specified location
- default-name collision handling: appends a `-<n>` suffix to avoid overwriting existing files
- explicit path validation: existing directories are rejected (explicit paths are treated as file targets)
- compatibility note: the autogenerated filename moved from the older local-time pattern to the new UTC `...-utc-...jsonl` pattern
By default with `--debug-report`, debug lines are file-only (not mirrored to the terminal).

Use `--debug-report-tee` (alias: `--debug-tee`) to mirror to both file and stderr.

Flag dependencies: `--verbose` requires `--debug`; `--detector-evidence` requires `--debug` and `--detector wasm`; `--debug-report` requires `--debug`; `--debug-report-tee`/`--debug-tee` requires `--debug-report`.
Use `--detector-evidence` to add per-window detector evidence onto the same debug stream:

- only meaningful with `--detector wasm`
- compact mode emits bounded single-line previews plus detector decision metadata
- verbose mode emits full raw detector windows and full normalized samples
- evidence remains detector-window based even when the output mode changes to `collector`, `char`, or another counting mode
- fallback evidence reports the post-fallback final tag used by downstream counting output; in rare split-relabel cases it may also include `finalLocales`
Examples:

```sh
word-counter --path ./examples/test-case-multi-files-support --debug --debug-report
word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl
word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-report-tee
word-counter --path ./examples/test-case-multi-files-support --debug --debug-report ./logs/debug.jsonl --debug-tee
word-counter --detector wasm --debug --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
word-counter --detector wasm --debug --verbose --detector-evidence "This sentence should clearly be detected as English for the wasm detector path."
word-counter --detector wasm --debug --detector-evidence --debug-report
```

Skip details stay debug-gated and can be suppressed with `--quiet-skips`.
When `--format json` is combined with `--debug`, debug-only diagnostics are emitted under `debug.*`:

- single input and merged batch may include `debug.detector`
- per-file batch may include `debug.skipped`, `debug.detector`, and per-entry `files[i].debug.detector`
- per-file top-level `skipped` is still emitted temporarily for compatibility
## How It Works

- The runtime inspects each character's Unicode script to infer its likely locale tag (e.g., `und-Latn`, `und-Hani`, `ja`).
- Adjacent characters that share the same locale tag are grouped into a chunk.
- Each chunk is counted with `Intl.Segmenter` at `granularity: "word"`, caching segmenters to avoid re-instantiation.
- Per-locale counts are summed into an overall total and printed to stdout.
- With `--detector wasm`, ambiguous `und-Latn` and `und-Hani` chunks can be relabeled through the optional WASM detector before counting; unresolved `und-Latn` chunks then fall back to the existing Latin hint rules and explicit Latin hint precedence.
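The pipeline above can be sketched with standard `Intl.Segmenter` calls. This is a simplified illustration only: the script-to-tag mapping below is a toy version of the package's much richer heuristics.

```javascript
// Toy script detection via Unicode property escapes; the real package
// covers many more scripts and fallback rules.
function tagFor(ch) {
  if (/\p{Script=Han}/u.test(ch)) return "und-Hani";
  if (/\p{Script=Hangul}/u.test(ch)) return "ko";
  if (/\p{Script=Latin}/u.test(ch)) return "und-Latn";
  return "und";
}

function countWordsPerLocale(text) {
  // 1) group adjacent same-tag characters into chunks
  const chunks = [];
  for (const ch of text) {
    const tag = tagFor(ch);
    const last = chunks[chunks.length - 1];
    if (last && last.tag === tag) last.text += ch;
    else chunks.push({ tag, text: ch });
  }
  // 2) count each chunk at granularity "word", caching segmenters per tag
  const segmenters = new Map();
  const counts = {};
  for (const { tag, text: chunkText } of chunks) {
    if (tag === "und") continue; // spaces/punctuation in this sketch
    if (!segmenters.has(tag)) {
      segmenters.set(tag, new Intl.Segmenter(tag, { granularity: "word" }));
    }
    let n = 0;
    for (const seg of segmenters.get(tag).segment(chunkText)) {
      if (seg.isWordLike) n += 1;
    }
    counts[tag] = (counts[tag] ?? 0) + n;
  }
  return counts;
}

console.log(countWordsPerLocale("Hello 世界 안녕"));
```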
## Locale vs Language Code

- Output keeps the field name `locale` for compatibility.
- In this project, locale values are BCP 47 tags and are often language/script focused (for example: `en`, `und-Latn`, `und-Hani`) rather than region-specific tags (for example: `en-US`, `zh-TW`).
- Default detection prefers language/script tags to avoid incorrect region assumptions.
- You can still provide region-specific locale tags through hint flags when needed.
## Library Usage

The package exports can be used after installing from the npm registry or linking locally with `npm link`.
### ESM

```js
import wordCounter, {
  countCharsForLocale,
  countWordsForLocale,
  countSections,
  parseMarkdown,
  segmentTextByLocale,
  showSingularOrPluralWord,
} from "@dev-pi2pie/word-counter";
import {
  wordCounterWithDetector,
  segmentTextByLocaleWithDetector,
} from "@dev-pi2pie/word-counter/detector";

wordCounter("Hello world", { latinLanguageHint: "en" });
wordCounter("Hello world", { latinTagHint: "en" });
wordCounter("Zażółć gęślą jaźń", {
  latinHintRules: [{ tag: "pl", pattern: "[ąćęłńóśźżĄĆĘŁŃÓŚŹŻ]" }],
});
wordCounter("Über", { useDefaultLatinHints: false });
wordCounter("漢字測試", { hanTagHint: "zh-Hant" });
wordCounter("Hi 👋, world!", { nonWords: true });
wordCounter("Hi 👋, world!", { mode: "char", nonWords: true });
wordCounter("飛鳥 bird 貓 cat", { mode: "char-collector" });
wordCounter("Hi\tthere\n", { nonWords: true, includeWhitespace: true });
countCharsForLocale("👋", "en");

await wordCounterWithDetector(
  "This sentence should clearly be detected as English for the wasm detector path.",
  { detector: "wasm" },
);
await segmentTextByLocaleWithDetector("Hello 世界", { detector: "regex" });
```

Note: `includeWhitespace` only affects results when `nonWords: true` is enabled.
Sample output (with `nonWords: true` and `includeWhitespace: true`):

```jsonc
{
  "total": 4,
  "counts": { "words": 2, "nonWords": 2, "total": 4 },
  "breakdown": {
    "mode": "chunk",
    "items": [
      {
        // ...
        "words": 2,
        "nonWords": {
          "emoji": [],
          "symbols": [],
          "punctuation": [],
          "counts": { "emoji": 0, "symbols": 0, "punctuation": 0, "whitespace": 2 },
          "whitespace": { "spaces": 0, "tabs": 1, "newlines": 1, "other": 0 }
        }
      }
    ]
  }
}
```

### CJS
```js
const wordCounter = require("@dev-pi2pie/word-counter");
const detector = require("@dev-pi2pie/word-counter/detector");

const {
  countCharsForLocale,
  countWordsForLocale,
  countSections,
  parseMarkdown,
  segmentTextByLocale,
  showSingularOrPluralWord,
} = wordCounter;

wordCounter("Hello world", { latinLanguageHint: "en" });
wordCounter("Hello world", { latinTagHint: "en" });
wordCounter("Zażółć gęślą jaźń", {
  latinHintRules: [{ tag: "pl", pattern: "[ąćęłńóśźżĄĆĘŁŃÓŚŹŻ]" }],
});
wordCounter("Über", { useDefaultLatinHints: false });
wordCounter("漢字測試", { hanTagHint: "zh-Hant" });
wordCounter("Hi 👋, world!", { nonWords: true });
wordCounter("Hi 👋, world!", { mode: "char", nonWords: true });
wordCounter("飛鳥 bird 貓 cat", { mode: "char-collector" });
wordCounter("Hi\tthere\n", { nonWords: true, includeWhitespace: true });
countCharsForLocale("👋", "en");

// CJS has no top-level await, so wrap detector calls in an async function.
(async () => {
  await detector.wordCounterWithDetector(
    "This sentence should clearly be detected as English for the wasm detector path.",
    { detector: "wasm" },
  );
})();
```

Note: `includeWhitespace` only affects results when `nonWords: true` is enabled.
Sample output (with `nonWords: true` and `includeWhitespace: true`):

```jsonc
{
  "total": 4,
  "counts": { "words": 2, "nonWords": 2, "total": 4 },
  "breakdown": {
    "mode": "chunk",
    "items": [
      {
        // ...
        "words": 2,
        "nonWords": {
          "emoji": [],
          "symbols": [],
          "punctuation": [],
          "counts": { "emoji": 0, "symbols": 0, "punctuation": 0, "whitespace": 2 },
          "whitespace": { "spaces": 0, "tabs": 1, "newlines": 1, "other": 0 }
        }
      }
    ]
  }
}
```

## Export Summary
### Core API
| Export | Kind | Notes |
| --------------------- | -------- | -------------------------------------------------- |
| default | function | wordCounter(text, options?) -> WordCounterResult |
| wordCounter | function | Alias of the default export. |
| countCharsForLocale | function | Low-level helper for per-locale char counts. |
| countWordsForLocale | function | Low-level helper for per-locale counts. |
| segmentTextByLocale | function | Low-level helper for locale-tag segmentation. |
### Markdown Helpers
| Export | Kind | Notes |
| --------------- | -------- | --------------------------------------------- |
| parseMarkdown | function | Parses Markdown and detects frontmatter. |
| countSections | function | Counts words by frontmatter/content sections. |
### Utility Helpers
| Export | Kind | Notes |
| -------------------------- | -------- | ------------------------------ |
| showSingularOrPluralWord | function | Formats singular/plural words. |
### Detector Subpath

Import from `@dev-pi2pie/word-counter/detector` for the explicit detector-enabled API.
| Export | Kind | Notes |
| ----------------------------- | -------- | ----------------------------------------------- |
| wordCounterWithDetector | function | Async detector-aware counting entrypoint. |
| segmentTextByLocaleWithDetector | function | Async detector-aware locale segmentation. |
| countSectionsWithDetector | function | Async detector-aware section counting. |
| inspectTextWithDetector | function | Async detector-aware inspect entrypoint. |
| DEFAULT_DETECTOR_MODE | value | Current default detector mode (regex). |
| DETECTOR_MODES | value | Supported detector modes. |
### Types
| Export | Kind | Notes |
| ---------------------- | ---- | ------------------------------------------------- |
| WordCounterOptions | type | Options for the wordCounter function. |
| WordCounterResult | type | Returned by wordCounter. |
| WordCounterBreakdown | type | Breakdown payload in WordCounterResult. |
| WordCounterMode | type | "chunk" \| "segments" \| "collector" \| "char" \| "char-collector". |
| NonWordCollection | type | Non-word segments + counts payload. |
## Display Modes

Choose a breakdown style with `--mode` (or `-m`):

- `chunk` (default) – list each contiguous locale block in order of appearance.
- `segments` – show the actual wordlike segments used for counting.
- `collector` – aggregate counts per locale regardless of text position. Keeps per-locale segment lists in memory, so very large corpora can use noticeably more memory than `chunk` mode.
- `char` – count grapheme clusters (user-perceived characters) per locale.
- `char-collector` – aggregate grapheme-cluster counts per locale (collector-style char mode).
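For the `char` modes, grapheme-aware counting is the point: emoji and other combined characters span multiple UTF-16 code units but count as one user-perceived character. A minimal `Intl.Segmenter` illustration:

```javascript
// Grapheme clusters vs. string length: "👋" is one user-perceived
// character but two UTF-16 code units.
const graphemeSeg = new Intl.Segmenter("en", { granularity: "grapheme" });

function countGraphemes(text) {
  return [...graphemeSeg.segment(text)].length;
}

console.log("👋".length);             // 2 (UTF-16 code units)
console.log(countGraphemes("👋"));    // 1 grapheme cluster
console.log(countGraphemes("Hi 👋")); // 4
```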
Aliases are normalized for CLI + API:

- `chunk`, `chunks`
- `segments`, `segment`, `seg`
- `collector`, `collect`, `colle`
- `char`, `chars`, `character`, `characters`
- `char-collector`, `charcollector`, `char-collect`, `collector-char`, `characters-collector`, `colchar`, `charcol`, `char-col`, `char-colle`
Examples:

```sh
# chunk mode (default)
word-counter "飛鳥 bird 貓 cat; how do you do?"

# show captured segments
word-counter --mode segments "飛鳥 bird 貓 cat; how do you do?"

# aggregate per locale
word-counter -m collector "飛鳥 bird 貓 cat; how do you do?"

# grapheme-aware character count
word-counter -m char "Hi 👋, world!"

# aggregate grapheme-aware character counts per locale
word-counter -m char-collector "飛鳥 bird 貓 cat; how do you do?"
```

## Section Modes (Frontmatter)
Use `--section` to control which parts of a Markdown document are counted:

- `all` (default) – count the whole file (fast path, no section split).
- `split` – count frontmatter and content separately.
- `frontmatter` – count frontmatter only.
- `content` – count content only.
- `per-key` – count frontmatter per key (frontmatter only).
- `split-per-key` – per-key frontmatter counts plus a content total.
Supported frontmatter formats:
- YAML fenced with `---`
- TOML fenced with `+++`
- JSON fenced with `;;;` or a top-of-file JSON object (`{ ... }`)
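A toy detector for the three fence styles above (illustrative only; the real parser must also locate the closing fence and parse the frontmatter body):

```javascript
// Toy frontmatter-format detection based on the documented fences.
function frontmatterType(fileText) {
  if (fileText.startsWith("---\n")) return "yaml";
  if (fileText.startsWith("+++\n")) return "toml";
  if (fileText.startsWith(";;;\n") || fileText.trimStart().startsWith("{")) return "json";
  return null;
}

console.log(frontmatterType("---\ntitle: hi\n---\nbody")); // "yaml"
console.log(frontmatterType("+++\ntitle = 'hi'\n+++\n"));  // "toml"
console.log(frontmatterType("# just markdown"));           // null
```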
Examples:

```sh
word-counter --section split -p examples/yaml-basic.md
word-counter --section per-key -p examples/yaml-basic.md
word-counter --section split-per-key -p examples/yaml-basic.md
```

JSON output includes a `source` field (`frontmatter` or `content`) to avoid key collisions:

```sh
word-counter --section split-per-key --format json -p examples/yaml-content-key.md
```

Example (trimmed):
```json
{
  "section": "split-per-key",
  "frontmatterType": "yaml",
  "total": 7,
  "items": [
    { "name": "content", "source": "frontmatter", "result": { "total": 4 } },
    { "name": "content", "source": "frontmatter", "result": { "total": 2 } },
    { "name": "content", "source": "content", "result": { "total": 5 } }
  ]
}
```

## Output Formats
Select how results are printed with `--format`:

- `standard` (default) – total plus per-locale breakdown.
- `raw` – only the total count (single number).
- `json` – machine-readable output; add `--pretty` for indentation.
JSON contract reference: `docs/schemas/json-output-contract.md`
Examples:

```sh
word-counter --format raw "Hello world"
word-counter --format json --pretty "Hello world"
```

## Selective Totals (`--total-of`)
Use `--total-of <parts>` to override how the displayed total is computed.

Supported parts:

- `words`
- `emoji`
- `symbols`
- `punctuation`
- `whitespace`
Examples:

```sh
word-counter --non-words --total-of words "Hi 👋, world!"
word-counter --total-of punctuation --format raw "Hi, world!"
word-counter --total-of words,emoji --format json "Hi 👋, world!"
```

Rules:
- Without `--total-of`, behavior stays unchanged.
- With `--total-of`, `--format raw` prints the override total only.
- In standard output, `Total-of (override: ...)` is shown only when the override total differs from the base total.
- If selected parts require non-word data (for example `emoji` or `punctuation`), non-word collection is enabled internally as needed.
- `--total-of` does not implicitly enable non-word display mode: the base `Total ...` labeling and non-word breakdown visibility still follow explicit flags (`--non-words`, `--include-whitespace`, `--misc`).
- Alias/normalization is tolerant for common variants: `word` -> `words`, `symbol` -> `symbols`, `punction` -> `punctuation`
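Conceptually, the override total is just the sum of the selected category counts. A sketch with made-up counts (the `counts` shape here is illustrative, not the exact JSON contract):

```javascript
// Sketch: computing a --total-of override from per-category counts.
function totalOf(counts, parts) {
  return parts.reduce((sum, part) => sum + (counts[part] ?? 0), 0);
}

const counts = { words: 2, emoji: 1, symbols: 0, punctuation: 2, whitespace: 2 };
console.log(totalOf(counts, ["words", "emoji"])); // 3
console.log(totalOf(counts, ["punctuation"]));    // 2
```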
JSON output adds override metadata when `--total-of` is provided:

- single input and merged batch: `meta.totalOf`, `meta.totalOfOverride`
- per-file batch (`--per-file`):
  - top-level: `meta.totalOf`, `meta.aggregateTotalOfOverride`
  - per entry: `files[i].meta.totalOf`, `files[i].meta.totalOfOverride`
  - applies to both sectioned and non-sectioned per-file JSON results
Example JSON (trimmed):
```json
{
  "total": 5,
  "meta": {
    "totalOf": ["words", "emoji"],
    "totalOfOverride": 3
  }
}
```

## Non-Word Collection
Use `--non-words` (or `nonWords: true` in the API) to collect emoji, symbols, and punctuation as separate categories. When enabled, the total includes both words and non-words.

```sh
word-counter --non-words "Hi 👋, world!"
```

Example: `total = words + emoji + symbols + punctuation` when enabled.

Standard output labels this as `Total count` to reflect the combined total; `--format raw` still prints a single number.

Include whitespace-like characters in the non-words bucket (API: `includeWhitespace: true`):

```sh
word-counter --include-whitespace "Hi\tthere\n"
word-counter --misc "Hi\tthere\n"
```

In the CLI, `--include-whitespace` implies `--non-words` (same behavior as `--misc`). `--non-words` alone does not include whitespace. When enabled, whitespace counts appear under `nonWords.whitespace`, and `total = words + nonWords (emoji + symbols + punctuation + whitespace)`. JSON output also includes top-level `counts` when `nonWords` is enabled. See `docs/schemas/whitespace-categories.md` for how whitespace is categorized.
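The word/non-word split can be approximated with `Intl.Segmenter`'s `isWordLike` flag. This is a sketch only: the package's emoji/symbol/punctuation/whitespace categorization is more detailed than this.

```javascript
// Sketch: separating word-like segments from non-words with isWordLike.
const wordSeg = new Intl.Segmenter("en", { granularity: "word" });

function splitWordsAndNonWords(text) {
  const words = [];
  const nonWords = [];
  for (const { segment, isWordLike } of wordSeg.segment(text)) {
    if (isWordLike) words.push(segment);
    else if (!/^\s+$/.test(segment)) nonWords.push(segment); // drop bare whitespace
  }
  return { words, nonWords };
}

console.log(splitWordsAndNonWords("Hi 👋, world!"));
// words: [ 'Hi', 'world' ]; nonWords include '👋', ',', '!'
```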
Example JSON (trimmed):
```json
{
  "total": 5,
  "counts": { "words": 2, "nonWords": 3, "total": 5 },
  "breakdown": {
    "mode": "chunk",
    "items": [
      {
        "locale": "und-Latn",
        "words": 2,
        "nonWords": {
          "counts": { "emoji": 0, "symbols": 0, "punctuation": 0, "whitespace": 3 },
          "whitespace": { "spaces": 1, "tabs": 1, "newlines": 1, "other": 0 }
        }
      }
    ]
  }
}
```

> [!NOTE]
> Text-default symbols (e.g. ©) count as `symbols` unless explicitly emoji-presented (e.g. ©️ with VS16).
## Locale Tag Detection Notes

- Detection is regex/script based, not statistical language-ID.
- Ambiguous Latin defaults to `und-Latn`; Han fallback defaults to `und-Hani`.
- `--detector wasm` is optional and conservative; it only runs for ambiguous chunks that meet minimum script-bearing length thresholds.
- In `--detector wasm` mode, ambiguous Latin stays on `und-Latn` for detector eligibility first, then built-in/custom Latin rules and explicit Latin hints are applied only if the detector leaves that chunk unresolved.
- The current first WASM engine is `whatlang`, remapped into this package's public tags.
- The npm package ships one portable WASM artifact; users do not install per-OS detector packages.
- Use explicit tag and hint flags when you need deterministic tagging.
- Full notes (built-in heuristics, limitations, and override guidance) are tracked in `docs/locale-tag-detection-notes.md`.
## Breaking Changes Notes

Planned deprecations and migration notes are tracked in `docs/breaking-changes-notes.md`.
## Testing

Run the build before tests so the CJS interop test can load the emitted `dist/cjs/index.cjs` bundle:

```sh
bun run build
bun test
```

## Sample Inputs
Try the following mixed-language phrases to see how detection behaves:
"Hello world 你好世界""Bonjour le monde こんにちは 세계""¡Hola! مرحبا Hello"
Each run prints the total word count plus a per-locale breakdown, helping you understand how multilingual text is segmented.
## License
This project is licensed under the MIT License — see the LICENSE file for details.
