

@nlptools/nlptools

Main NLPTools package - Complete suite of NLP algorithms and utilities

This is the main NLPTools package (@nlptools/nlptools), which re-exports every algorithm and utility in the toolkit. It provides a single entry point to the string distance and similarity algorithms, text splitters, and tokenization utilities.

Features

  • All-in-One: Complete access to all NLPTools algorithms
  • Convenient: Single import for all functionality
  • Text Splitting: Document chunking and text processing utilities
  • Tokenization: Fast text encoding and decoding for LLM models
  • Distance & Similarity: Comprehensive string comparison algorithms
  • Locality-Sensitive Hashing: Fast approximate nearest neighbor search
  • TypeScript First: Full type safety with comprehensive API
  • Easy to Use: Consistent API across all algorithms

Installation

# Install with npm
npm install @nlptools/nlptools

# Install with yarn
yarn add @nlptools/nlptools

# Install with pnpm
pnpm add @nlptools/nlptools

Usage

Basic Setup

import * as nlptools from "@nlptools/nlptools";

// Edit distance
console.log(nlptools.levenshtein("kitten", "sitting")); // 3
console.log(nlptools.levenshteinNormalized("cat", "bat")); // 0.6666666666666666

// Token-based similarity
console.log(nlptools.jaccard("abc", "bcd")); // 0.3333333333333333
console.log(nlptools.cosine("hello", "hallo")); // 0.8
console.log(nlptools.sorensen("abc", "bcd")); // 0.5

Distance vs Similarity

Most algorithms have both distance and normalized versions:

// Distance algorithms (lower is more similar)
const distance = nlptools.levenshtein("cat", "bat"); // 1

// Similarity algorithms (higher is more similar, 0-1 range)
const similarity = nlptools.levenshteinNormalized("cat", "bat"); // 0.6666666666666666

Text Splitting

This package includes text splitters from @nlptools/splitter:

import { RecursiveCharacterTextSplitter } from "@nlptools/nlptools";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const text = "Your long document text here...";
const chunks = await splitter.splitText(text);
console.log(chunks);

Tokenization

This package includes tokenization utilities from @nlptools/tokenizer:

import { Tokenizer } from "@nlptools/nlptools";

// Load tokenizer from HuggingFace Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`,
).then((res) => res.json());
const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`,
).then((res) => res.json());

const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Encode text
const encoded = tokenizer.encode("Hello World");
console.log(encoded.ids); // [9906, 4435]
console.log(encoded.tokens); // ['Hello', 'ĠWorld']

// Get token count
const tokenCount = tokenizer.encode("This is a sentence.").ids.length;
console.log(`Token count: ${tokenCount}`);

Available Algorithm Categories

This package includes all algorithms from @nlptools/distance, @nlptools/splitter, and @nlptools/tokenizer:

Edit Distance

  • levenshtein / levenshteinNormalized - Classic Levenshtein edit distance
  • lcsDistance / lcsNormalized - Longest Common Subsequence distance
  • lcsLength - LCS length
  • lcsPairs - LCS matching index pairs
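
To make the distance/normalized pairing above concrete, here is a minimal self-contained sketch of the classic dynamic-programming (Wagner-Fischer) Levenshtein distance and a 0-1 normalized similarity derived from it. This illustrates the algorithm, not the library's actual implementation:

```typescript
// dp[i][j] = edit distance between the first i chars of a and first j chars of b.
function levenshtein(a: string, b: string): number {
  const dp: number[][] = Array.from({ length: a.length + 1 }, (_, i) =>
    Array.from({ length: b.length + 1 }, (_, j) => (i === 0 ? j : j === 0 ? i : 0)),
  );
  for (let i = 1; i <= a.length; i++) {
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      dp[i][j] = Math.min(
        dp[i - 1][j] + 1, // deletion
        dp[i][j - 1] + 1, // insertion
        dp[i - 1][j - 1] + cost, // substitution (free if chars match)
      );
    }
  }
  return dp[a.length][b.length];
}

// Normalize to a 0-1 similarity: (maxLen - distance) / maxLen.
function levenshteinNormalized(a: string, b: string): number {
  const maxLen = Math.max(a.length, b.length);
  return maxLen === 0 ? 1 : (maxLen - levenshtein(a, b)) / maxLen;
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(levenshteinNormalized("cat", "bat")); // 2/3 ≈ 0.667
```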

Token-based Similarity

  • jaccard / jaccardNgram - Jaccard similarity (character / n-gram)
  • cosine / cosineNgram - Cosine similarity (character / n-gram)
  • sorensen / sorensenNgram - Sorensen-Dice coefficient (character / n-gram)
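
The n-gram variants compare sets of character n-grams rather than individual characters. As a reference for the technique, here is a minimal Jaccard-over-bigrams sketch (an illustration, not the library's implementation; the default n = 2 is an assumption):

```typescript
// Build the set of character n-grams of a string.
function ngrams(s: string, n: number): Set<string> {
  const out = new Set<string>();
  for (let i = 0; i + n <= s.length; i++) out.add(s.slice(i, i + n));
  return out;
}

// Jaccard similarity over n-gram sets: |A ∩ B| / |A ∪ B|.
function jaccardNgram(a: string, b: string, n = 2): number {
  const A = ngrams(a, n);
  const B = ngrams(b, n);
  let inter = 0;
  for (const g of A) if (B.has(g)) inter++;
  const union = A.size + B.size - inter;
  return union === 0 ? 1 : inter / union;
}

// "abc" → {ab, bc}, "bcd" → {bc, cd}: intersection 1, union 3.
console.log(jaccardNgram("abc", "bcd")); // 1/3 ≈ 0.333
```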

Hash-based Algorithms

  • simhash / SimHasher - Locality-sensitive document fingerprinting
  • hammingDistance / hammingSimilarity - Hamming distance for fingerprint comparison
  • MinHash - MinHash estimator for approximate Jaccard similarity
  • LSH - Locality-Sensitive Hashing index for fast approximate nearest neighbor search
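
To show what these hash-based exports compute, here is a minimal self-contained SimHash sketch. The 32-bit fingerprint width and FNV-1a token hashing are assumptions made for the example, not the library's internals:

```typescript
// FNV-1a: a simple 32-bit string hash, used here to hash each token.
function fnv1a(s: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 0x01000193) >>> 0;
  }
  return h >>> 0;
}

// SimHash: each token "votes" on every bit position; the fingerprint
// keeps the majority sign of each bit, so similar token sets yield
// fingerprints that differ in few bits.
function simhash(tokens: string[]): number {
  const votes = new Array(32).fill(0);
  for (const t of tokens) {
    const h = fnv1a(t);
    for (let bit = 0; bit < 32; bit++) {
      votes[bit] += (h >>> bit) & 1 ? 1 : -1;
    }
  }
  let fp = 0;
  for (let bit = 0; bit < 32; bit++) if (votes[bit] > 0) fp |= 1 << bit;
  return fp >>> 0;
}

// Hamming distance: count of differing bits between two fingerprints.
function hammingDistance(a: number, b: number): number {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x) {
    count += x & 1;
    x >>>= 1;
  }
  return count;
}

const d1 = simhash("the quick brown fox".split(" "));
const d2 = simhash("the quick brown dog".split(" "));
const d3 = simhash("completely different words here".split(" "));
// Near-duplicate documents typically differ in fewer fingerprint bits.
console.log(hammingDistance(d1, d2), hammingDistance(d1, d3));
```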

Diff

  • diff - Compute the difference between two sequences
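
As an illustration of what a sequence diff produces, here is a minimal LCS-based sketch that marks elements as kept (`=`), removed (`-`), or added (`+`). The operation shape is an assumption for the example, not the library's actual output format:

```typescript
type DiffOp = { op: "=" | "-" | "+"; value: string };

// Minimal LCS-based diff between two sequences.
function diff(a: string[], b: string[]): DiffOp[] {
  // dp[i][j] = length of the LCS of a[i..] and b[j..].
  const dp: number[][] = Array.from({ length: a.length + 1 }, () =>
    new Array(b.length + 1).fill(0),
  );
  for (let i = a.length - 1; i >= 0; i--)
    for (let j = b.length - 1; j >= 0; j--)
      dp[i][j] =
        a[i] === b[j] ? dp[i + 1][j + 1] + 1 : Math.max(dp[i + 1][j], dp[i][j + 1]);

  // Walk the table, emitting keep/remove/add operations.
  const ops: DiffOp[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ op: "=", value: a[i] });
      i++;
      j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      ops.push({ op: "-", value: a[i++] });
    } else {
      ops.push({ op: "+", value: b[j++] });
    }
  }
  while (i < a.length) ops.push({ op: "-", value: a[i++] });
  while (j < b.length) ops.push({ op: "+", value: b[j++] });
  return ops;
}

const ops = diff(["a", "b", "c"], ["a", "c", "d"]);
console.log(ops.map((o) => o.op + o.value).join(" ")); // "=a -b =c +d"
```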

Text Splitters

  • RecursiveCharacterTextSplitter - Splits text recursively using different separators
  • CharacterTextSplitter - Splits text by character count
  • MarkdownTextSplitter - Specialized splitter for Markdown documents
  • TokenTextSplitter - Splits text by token count
  • LatexTextSplitter - Specialized splitter for LaTeX documents
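
To clarify how the chunkSize and chunkOverlap options used by these splitters interact, here is a minimal fixed-size character splitter with overlap. It is a simplified sketch of the idea, not the library's implementation (the real splitters also respect separators like paragraph and sentence boundaries):

```typescript
// Split text into windows of chunkSize characters, with each window
// re-reading the last chunkOverlap characters of the previous one.
function splitByCharacters(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const step = chunkSize - chunkOverlap;
  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(splitByCharacters("abcdefghij", 4, 2)); // ["abcd", "cdef", "efgh", "ghij"]
```

Each chunk starts chunkSize − chunkOverlap characters after the previous one, so consecutive chunks share chunkOverlap characters of context.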

Tokenization Utilities

  • Tokenizer - Main tokenizer class for encoding and decoding text
  • encode() - Convert text to token IDs and tokens
  • decode() - Convert token IDs back to text
  • tokenize() - Split text into token strings
  • AddedToken - Custom token configuration class

License