
bun_nltk v0.9.0

High-performance NLP primitives for Bun/Node with Zig native and WASM backends.

Fast NLP primitives in Zig with Bun bindings (Cycle 1).

Implemented in this milestone

  • ASCII token counting
  • ASCII unique-token counting (FreqDist-style cardinality)
  • ASCII n-gram counting
  • ASCII unique n-gram counting
  • Hashed frequency distributions for tokens and n-grams
  • Native token materialization and n-gram materialization APIs
  • Top-K bigram PMI collocation scoring (native, with window_size >= 2)
  • Collision-free token ID frequency distribution API (id <-> token)
  • Native windowed bigram stats API (left_id, right_id, count, pmi)
  • Native Porter stemmer (ASCII, lowercasing)
  • Tokenizer parity layer (wordTokenizeSubset, tweetTokenizeSubset)
  • Sentence tokenizer parity subset (sentenceTokenizeSubset) + Python parity harness
  • Trainable Punkt tokenizer/model APIs (trainPunktModel, sentenceTokenizePunkt)
  • Native Zig Punkt sentence-splitting fast path (sentenceTokenizePunktAsciiNative) with WASM equivalent
  • Native normalization pipeline (ASCII fast path with optional stopword filtering)
  • Unicode normalization fallback pipeline (normalizeTokensUnicode)
  • Native POS regex/heuristic tagger baseline (posTagAsciiNative)
  • Native streaming FreqDist/ConditionalFreqDist builder with JSON export (NativeFreqDistStream)
  • Mini WordNet reader with synset lookup, relation traversal, and morphy-style inflection recovery
  • Native Zig morphy accelerator (wordnetMorphyAsciiNative) with WASM equivalent
  • Packed WordNet corpus pipeline (wordnet:pack) with binary loader (loadWordNetPacked)
  • N-gram language model stack (MLE, Lidstone, Kneser-Ney Interpolated) with Python comparison harness
  • Native/WASM LM ID-evaluation hot loop for batched score + perplexity paths
  • Regexp chunk parser primitives with IOB conversion and Python parity harness
  • Native/WASM chunk IOB hot loop for compiled grammar matching
  • CFG grammar parser + chart parser subset with Python parity harness
  • Earley recognizer/parser API for non-CNF grammar recognition (earleyRecognize, earleyParse, parseTextWithEarley)
  • Lightweight dependency parser API (dependencyParse, dependencyParseText)
  • Naive Bayes text classifier with train/predict/evaluate/serialize APIs and Python parity harness
  • Shared sparse text vectorizer (TextFeatureVectorizer) + sparse batch flattening utility
  • Decision tree text classifier APIs (DecisionTreeTextClassifier)
  • Linear text models (LogisticTextClassifier, LinearSvmTextClassifier) with native sparse scoring fast path
  • Corpus reader framework (CorpusReader) with bundled mini corpora
  • Optional external corpus bundle loader + tagged/chunked corpus readers (parseConllTagged, parseBrownTagged, parseConllChunked)
  • Corpus registry manifest loader/downloader with checksum validation (loadCorpusRegistryManifest, downloadCorpusRegistry)
  • SIMD token counting fast path (x86_64 vectorized path + scalar fallback)
  • Shared Zig perceptron inference core used by both native and WASM runtimes
  • Browser-focused WASM API wrapper with memory pool reuse (WasmNltk)
  • WASM target for browser/runtime usage with parity benchmarks
  • Browser WASM benchmark harness (Chromium/Firefox in CI strict mode)
  • Performance regression gate script + CI workflow
  • SLA gate script (p95 latency + memory delta) and NLTK parity tracker artifacts
  • Global parity suite on PRs across tokenizer, punkt, lm, chunk, wordnet, parser, classifier, and tagger
  • Python baseline comparison on the same dataset
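The top-K collocation scorer in the list above ranks bigrams by pointwise mutual information. As a standalone illustration of the statistic in plain TypeScript (a sketch only, not the native bun_nltk API):

```typescript
// PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ), estimated from a token stream.
// Counts adjacent bigrams only (window size 2).
function bigramPmi(tokens: string[], left: string, right: string): number {
  const n = tokens.length;
  let leftCount = 0, rightCount = 0, pairCount = 0;
  for (let i = 0; i < n; i++) {
    if (tokens[i] === left) leftCount++;
    if (tokens[i] === right) rightCount++;
    if (i + 1 < n && tokens[i] === left && tokens[i + 1] === right) pairCount++;
  }
  const pPair = pairCount / (n - 1); // probability over the n - 1 bigram slots
  const pLeft = leftCount / n;
  const pRight = rightCount / n;
  return Math.log2(pPair / (pLeft * pRight));
}
```

A high PMI means the pair co-occurs far more often than the two tokens' independent frequencies would predict, which is what makes it useful for collocation ranking.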

Benchmark results (64MB synthetic dataset)

All benchmarks below use bench/datasets/synthetic.txt and were measured on a single development machine; absolute timings will vary with hardware.

| Workload | Zig/Bun median sec | Python sec | Faster side | Speedup | Percent faster |
|---|---:|---:|---|---:|---:|
| Token + unique + ngram + unique ngram (bench:compare) | 2.767 | 10.071 | Zig native | 3.64x | 263.93% |
| Top-K PMI collocations (bench:compare:collocations) | 2.090 | 23.945 | Zig native | 11.46x | 1045.90% |
| Porter stemming (bench:compare:porter) | 11.942 | 120.101 | Zig native | 10.06x | 905.70% |
| WASM token/ngram path (bench:compare:wasm) | 4.150 | 13.241 | Zig WASM | 3.19x | 219.06% |
| Native vs Python in wasm suite (bench:compare:wasm) | 1.719 | 13.241 | Zig native | 7.70x | 670.48% |
| Sentence tokenizer subset (bench:compare:sentence) | 1.680 | 16.580 | Zig/Bun subset | 9.87x | 886.70% |
| Perceptron POS tagger (bench:compare:tagger) | 19.880 | 82.849 | Zig native | 4.17x | 316.75% |
| Streaming FreqDist + ConditionalFreqDist (bench:compare:freqdist) | 3.206 | 20.971 | Zig native | 6.54x | 554.17% |

Notes:

  • Sentence tokenizer is a Punkt-compatible subset, not full Punkt parity on arbitrary corpora.
  • Full WordNet corpus parity is not shipped yet; this milestone bundles a mini WordNet dataset. The full corpus can be packed from upstream dict files with bun run wordnet:pack, but the upstream dataset is not bundled in the npm package by default.
  • Fixture parity harnesses are available via bench:parity:sentence and bench:parity:tagger.
  • SIMD fast path benchmark (bench:compare:simd) shows countTokensAscii at 1.22x and normalization no-stopword path at 2.73x over scalar baseline.
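The Speedup and "Percent faster" columns in the benchmark tables appear to be derived from the raw medians as follows (an inferred convention; small discrepancies against the published percentages are presumably rounding of the unrounded medians):

```typescript
// speedup = pythonSec / zigSec; percentFaster = (speedup - 1) * 100
function speedup(pythonSec: number, zigSec: number): number {
  return pythonSec / zigSec;
}

function percentFaster(pythonSec: number, zigSec: number): number {
  return (speedup(pythonSec, zigSec) - 1) * 100;
}
// e.g. the bench:compare row: 10.071s Python vs 2.767s Zig native
// gives roughly a 3.64x speedup, i.e. roughly 264% faster.
```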

Extended benchmark results (8MB gate dataset)

| Workload | Zig/Bun median sec | Python sec | Faster side | Speedup | Percent faster |
|---|---:|---:|---|---:|---:|
| Punkt tokenizer default path (bench:compare:punkt) | 0.0848 | 1.3463 | Zig native | 15.87x | 1487.19% |
| N-gram LM (Kneser-Ney) score+perplexity (bench:compare:lm) | 0.1324 | 2.8661 | Zig/Bun | 21.64x | 2064.19% |
| Regexp chunk parser (bench:compare:chunk) | 0.0024 | 1.5511 | Zig/Bun | 643.08x | 64208.28% |
| WordNet lookup + morphy workload (bench:compare:wordnet) | 0.0009 | 0.0835 | Zig/Bun | 91.55x | 9054.67% |
| CFG chart parser subset (bench:compare:parser) | 0.0088 | 0.3292 | Zig/Bun | 37.51x | 3651.05% |
| Naive Bayes text classifier (bench:compare:classifier) | 0.0081 | 0.0112 | Zig/Bun | 1.38x | 38.40% |
| PCFG Viterbi chart parser (bench:compare:pcfg) | 0.0191 | 0.4153 | Zig/Bun | 21.80x | 2080.00% |
| MaxEnt text classifier (bench:compare:maxent) | 0.0244 | 0.1824 | Zig/Bun | 7.46x | 646.00% |
| Sparse linear logits hot loop (bench:compare:linear) | 0.0024 | 2.0001 | Zig native | 840.54x | 83954.04% |
| Decision tree text classifier (bench:compare:decision-tree) | 0.0725 | 0.5720 | Zig/Bun | 7.89x | 688.55% |
| Earley parser workload (bench:compare:earley) | 0.1149 | 4.6483 | Zig/Bun | 40.47x | 3947.07% |
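The N-gram LM row above exercises the MLE, Lidstone, and Kneser-Ney scorers. As a standalone sketch of the simplest smoothed case (not the bun_nltk API), a Lidstone-smoothed unigram probability, where MLE is the gamma = 0 special case:

```typescript
// P(w) = (count(w) + gamma) / (N + gamma * V)
// counts: observed unigram counts; total: N, the corpus token count;
// vocabSize: V, the vocabulary size used for smoothing.
function lidstoneProb(
  counts: Map<string, number>,
  vocabSize: number,
  total: number,
  word: string,
  gamma: number,
): number {
  return ((counts.get(word) ?? 0) + gamma) / (total + gamma * vocabSize);
}
```

Unseen words get a small nonzero probability instead of zero, which is what keeps perplexity finite on held-out text.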

Build native Zig library

bun run build:zig

Build WASM library

bun run build:wasm

Run tests

bun run test

Generate synthetic dataset

bun run bench:generate

Benchmark vs Python baseline

bun run bench:compare
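The bench:compare workload measures token counting, unique-token counting, n-gram counting, and unique n-gram counting. A plain-TypeScript sketch of those four quantities (the library computes them in native Zig or WASM):

```typescript
// Compute the four bench:compare statistics over a pre-tokenized stream.
function ngramStats(tokens: string[], n: number) {
  const ngrams: string[] = [];
  for (let i = 0; i + n <= tokens.length; i++) {
    ngrams.push(tokens.slice(i, i + n).join(" ")); // sliding n-token window
  }
  return {
    tokenCount: tokens.length,
    uniqueTokens: new Set(tokens).size,
    ngramCount: ngrams.length,
    uniqueNgrams: new Set(ngrams).size,
  };
}
```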

Benchmark collocations vs Python baseline

bun run bench:compare:collocations

Benchmark Porter stemmer vs Python NLTK

bun run bench:compare:porter
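This workload stems every token with a Porter stemmer. As a taste of the suffix rewriting involved, here is step 1a of the classic Porter algorithm (an illustrative fragment only; the full algorithm adds measure-based conditions and several more steps, and bun_nltk runs it natively):

```typescript
// Porter stemmer, step 1a: plural suffix handling.
function porterStep1a(word: string): string {
  if (word.endsWith("sses")) return word.slice(0, -2); // caresses -> caress
  if (word.endsWith("ies")) return word.slice(0, -2);  // ponies -> poni
  if (word.endsWith("ss")) return word;                // caress -> caress
  if (word.endsWith("s")) return word.slice(0, -1);    // cats -> cat
  return word;
}
```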

Benchmark Native vs WASM vs Python

bun run bench:compare:wasm

Benchmark sentence tokenizer vs Python

bun run bench:compare:sentence

Benchmark POS tagger vs Python

bun run bench:compare:tagger

Benchmark streaming FreqDist vs Python

bun run bench:compare:freqdist

Benchmark SIMD fast path vs scalar baseline

bun run bench:compare:simd

Benchmark parser vs Python

bun run bench:compare:parser

Benchmark decision tree classifier vs Python

bun run bench:compare:decision-tree

Benchmark Earley parser vs Python

bun run bench:compare:earley

Benchmark classifier vs Python

bun run bench:compare:classifier

Benchmark sparse linear scorer vs Python

bun run bench:compare:linear

Benchmark linear-model training native scoring vs JS scoring

bun run bench:compare:linear-train

Run parity harnesses

bun run fixtures:import:nltk
bun run bench:parity:sentence
bun run bench:parity:punkt
bun run bench:parity:tokenizer
bun run bench:parity:parser
bun run bench:parity:classifier
bun run bench:parity:pcfg
bun run bench:parity:maxent
bun run bench:parity:decision-tree
bun run bench:parity:earley
bun run bench:parity:corpus-imported
bun run bench:parity:imported
bun run bench:parity:wordnet
bun run bench:parity:tagger
bun run bench:parity:all
bun run parity:report

Benchmark trend tracking

bun run bench:trend:check
bun run bench:trend:record

Browser/WASM checks

bun run wasm:size:check
bun run bench:browser:wasm

Pack WordNet corpus

bun run wordnet:pack

Pack Official WordNet + Verify

bun run wordnet:pack:official
bun run wordnet:verify:pack

Run regression gate

bun run bench:gate

Run SLA gate only

bun run sla:gate

Generate parity tracker

bun run parity:tracker

Release readiness check

bun run release:check

Notes

  • Native library output path is native/bun_nltk.{dll|so|dylib}.
  • npm package ships prebuilt native binaries for linux-x64 and win32-x64, plus native/bun_nltk.wasm.
  • Runtime native loading is prebuilt-first with no implicit local native fallback.
  • No install-time lifecycle scripts are used, so bun pm trust is not required for install.
  • Current tokenizer rule is [A-Za-z0-9']+ (lowercased ASCII).
  • This is the first optimization loop and is intentionally limited in scope.
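Per the note above, the tokenizer rule is [A-Za-z0-9']+ over lowercased ASCII. A JavaScript-regex equivalent of that rule, as a sketch:

```typescript
// Lowercase the input, then take maximal runs of [a-z0-9'].
function tokenizeAscii(text: string): string[] {
  return text.toLowerCase().match(/[a-z0-9']+/g) ?? [];
}
```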