# @nlptools/nlptools

v0.0.5

Main NLPTools package - Complete suite of NLP algorithms, text distance, similarity, splitting, and tokenization utilities
This is the main NLPTools package (`@nlptools/nlptools`): it re-exports all algorithms and utilities from the entire toolkit, providing a single entry point to the string distance and similarity algorithms, text splitting, and tokenization utilities.
## Features
- All-in-One: Complete access to all NLPTools algorithms
- Convenient: Single import for all functionality
- Text Splitting: Document chunking and text processing utilities
- Tokenization: Fast text encoding and decoding for LLM models
- Distance & Similarity: Comprehensive string comparison algorithms
- Locality-Sensitive Hashing: Fast approximate nearest neighbor search
- TypeScript First: Full type safety with comprehensive API
- Easy to Use: Consistent API across all algorithms
## Installation

```bash
# Install with npm
npm install @nlptools/nlptools

# Install with yarn
yarn add @nlptools/nlptools

# Install with pnpm
pnpm add @nlptools/nlptools
```

## Usage
### Basic Setup

```ts
import * as nlptools from "@nlptools/nlptools";

// Edit distance
console.log(nlptools.levenshtein("kitten", "sitting")); // 3
console.log(nlptools.levenshteinNormalized("cat", "bat")); // 0.6666666666666666

// Token-based similarity
console.log(nlptools.jaccard("abc", "bcd")); // 0.3333333333333333
console.log(nlptools.cosine("hello", "hallo")); // 0.8
console.log(nlptools.sorensen("abc", "bcd")); // 0.5
```

### Distance vs Similarity
Most algorithms have both distance and normalized versions:

```ts
// Distance algorithms (lower is more similar)
const distance = nlptools.levenshtein("cat", "bat"); // 1

// Similarity algorithms (higher is more similar, 0-1 range)
const similarity = nlptools.levenshteinNormalized("cat", "bat"); // 0.6666666666666666
```

### Text Splitting
This package includes the text splitters from `@nlptools/splitter`:

```ts
import { RecursiveCharacterTextSplitter } from "@nlptools/nlptools";

const splitter = new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 200,
});

const text = "Your long document text here...";
const chunks = await splitter.splitText(text);
console.log(chunks);
```

### Tokenization
This package includes the tokenization utilities from `@nlptools/tokenizer`:

```ts
import { Tokenizer } from "@nlptools/nlptools";

// Load tokenizer files from the Hugging Face Hub
const modelId = "HuggingFaceTB/SmolLM3-3B";
const tokenizerJson = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer.json`,
).then((res) => res.json());
const tokenizerConfig = await fetch(
  `https://huggingface.co/${modelId}/resolve/main/tokenizer_config.json`,
).then((res) => res.json());

const tokenizer = new Tokenizer(tokenizerJson, tokenizerConfig);

// Encode text
const encoded = tokenizer.encode("Hello World");
console.log(encoded.ids); // [9906, 4435]
console.log(encoded.tokens); // ['Hello', 'ĠWorld']

// Get token count
const tokenCount = tokenizer.encode("This is a sentence.").ids.length;
console.log(`Token count: ${tokenCount}`);
```

## Available Algorithm Categories
This package includes all algorithms from `@nlptools/distance`, `@nlptools/splitter`, and `@nlptools/tokenizer`:
### Edit Distance

- `levenshtein` / `levenshteinNormalized` - Classic Levenshtein edit distance
- `lcsDistance` / `lcsNormalized` - Longest Common Subsequence distance
- `lcsLength` - LCS length
- `lcsPairs` - LCS matching index pairs
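The normalized variants rescale a raw distance into a 0-1 similarity. A minimal standalone sketch of the usual convention (an illustration, not the library's implementation; it assumes `similarity = 1 - distance / max(|a|, |b|)`, which matches the outputs shown in the usage examples):

```ts
// Classic single-row dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const dp: number[] = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    let prev = dp[0];
    dp[0] = i;
    for (let j = 1; j <= b.length; j++) {
      const tmp = dp[j];
      dp[j] = Math.min(
        dp[j] + 1, // deletion
        dp[j - 1] + 1, // insertion
        prev + (a[i - 1] === b[j - 1] ? 0 : 1), // substitution (or match)
      );
      prev = tmp;
    }
  }
  return dp[b.length];
}

// Normalized similarity: 1 when identical, 0 when nothing matches.
function levenshteinNormalized(a: string, b: string): number {
  if (a.length === 0 && b.length === 0) return 1;
  return 1 - levenshtein(a, b) / Math.max(a.length, b.length);
}

console.log(levenshtein("kitten", "sitting")); // 3
console.log(levenshteinNormalized("cat", "bat")); // 0.6666666666666666
```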
### Token-based Similarity

- `jaccard` / `jaccardNgram` - Jaccard similarity (character / n-gram)
- `cosine` / `cosineNgram` - Cosine similarity (character / n-gram)
- `sorensen` / `sorensenNgram` - Sorensen-Dice coefficient (character / n-gram)
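As a refresher on what the n-gram variants compute: the score is `|A ∩ B| / |A ∪ B|` over the two strings' n-gram sets. A standalone character-bigram sketch (illustrative only; the library's n-gram extraction details may differ):

```ts
// Collect the set of character n-grams of a string.
function ngrams(s: string, n: number): Set<string> {
  const grams = new Set<string>();
  for (let i = 0; i + n <= s.length; i++) grams.add(s.slice(i, i + n));
  return grams;
}

// Jaccard similarity over n-gram sets: |A ∩ B| / |A ∪ B|.
function jaccardNgram(a: string, b: string, n = 2): number {
  const A = ngrams(a, n);
  const B = ngrams(b, n);
  const intersection = Array.from(A).filter((g) => B.has(g)).length;
  const union = new Set([...Array.from(A), ...Array.from(B)]).size;
  return union === 0 ? 1 : intersection / union;
}

// Bigrams {ni, ig, gh, ht} vs {na, ac, ch, ht}: 1 shared of 7 total.
console.log(jaccardNgram("night", "nacht")); // 0.14285714285714285
```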
### Hash-based Algorithms

- `simhash` / `SimHasher` - Locality-sensitive document fingerprinting
- `hammingDistance` / `hammingSimilarity` - Hamming distance for fingerprint comparison
- `MinHash` - MinHash estimator for approximate Jaccard similarity
- `LSH` - Locality-Sensitive Hashing index for fast approximate nearest neighbor search
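The idea behind MinHash estimation, in a standalone sketch (the technique in general, not this package's `MinHash` class): each of k seeded hash functions keeps only the minimum hash over a set's elements, and the fraction of the k minima two sets share estimates their Jaccard similarity.

```ts
// Simple seeded 32-bit string hash (FNV-1a variant) - illustrative only.
function hash(s: string, seed: number): number {
  let h = (2166136261 ^ seed) >>> 0;
  for (let i = 0; i < s.length; i++) {
    h ^= s.charCodeAt(i);
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// MinHash signature: the minimum hash value per seed.
function signature(items: Set<string>, k: number): number[] {
  return Array.from({ length: k }, (_, seed) =>
    Math.min(...Array.from(items, (x) => hash(x, seed))),
  );
}

// Fraction of matching signature slots approximates Jaccard similarity.
function estimateJaccard(a: Set<string>, b: Set<string>, k = 128): number {
  const sa = signature(a, k);
  const sb = signature(b, k);
  return sa.filter((v, i) => v === sb[i]).length / k;
}

const a = new Set(["apple", "banana", "cherry", "date"]);
const b = new Set(["banana", "cherry", "date", "elderberry"]);
console.log(estimateJaccard(a, b)); // close to the true Jaccard of 3/5
```

The estimate sharpens as k grows; LSH builds on the same signatures by banding them into buckets so that similar sets collide with high probability.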
### Diff

- `diff` - Compute the difference between two sequences
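A classic way to compute such a difference is the LCS-based approach (sketched standalone below; this is the standard technique, not necessarily how this package's `diff` is implemented or shaped): find the longest common subsequence, then report everything outside it as deletions or insertions.

```ts
type Op = { type: "equal" | "delete" | "insert"; value: string };

// LCS-based diff: anything outside the longest common subsequence is an edit.
function diff(a: string[], b: string[]): Op[] {
  // dp[i][j] = LCS length of the suffixes a[i..] and b[j..].
  const dp = Array.from({ length: a.length + 1 }, () =>
    new Array<number>(b.length + 1).fill(0),
  );
  for (let i = a.length - 1; i >= 0; i--)
    for (let j = b.length - 1; j >= 0; j--)
      dp[i][j] =
        a[i] === b[j] ? dp[i + 1][j + 1] + 1 : Math.max(dp[i + 1][j], dp[i][j + 1]);

  // Walk the table, preferring deletions when both moves keep the LCS.
  const ops: Op[] = [];
  let i = 0;
  let j = 0;
  while (i < a.length && j < b.length) {
    if (a[i] === b[j]) {
      ops.push({ type: "equal", value: a[i] });
      i++;
      j++;
    } else if (dp[i + 1][j] >= dp[i][j + 1]) {
      ops.push({ type: "delete", value: a[i++] });
    } else {
      ops.push({ type: "insert", value: b[j++] });
    }
  }
  while (i < a.length) ops.push({ type: "delete", value: a[i++] });
  while (j < b.length) ops.push({ type: "insert", value: b[j++] });
  return ops;
}

console.log(diff(["a", "b", "c"], ["a", "c", "d"]));
// equal "a", delete "b", equal "c", insert "d"
```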
### Text Splitters

- `RecursiveCharacterTextSplitter` - Splits text recursively using different separators
- `CharacterTextSplitter` - Splits text by character count
- `MarkdownTextSplitter` - Specialized splitter for Markdown documents
- `TokenTextSplitter` - Splits text by token count
- `LatexTextSplitter` - Specialized splitter for LaTeX documents
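To make the `chunkSize` / `chunkOverlap` options from the usage example concrete, here is a minimal standalone fixed-size splitter (an illustration of the overlap semantics only, not the library's implementation): each chunk is at most `chunkSize` characters, and consecutive chunks share `chunkOverlap` characters so content spanning a boundary appears in both.

```ts
// Fixed-size splitter: advance by (chunkSize - chunkOverlap) per chunk.
function splitWithOverlap(
  text: string,
  chunkSize: number,
  chunkOverlap: number,
): string[] {
  if (chunkOverlap >= chunkSize) {
    throw new Error("chunkOverlap must be smaller than chunkSize");
  }
  const chunks: string[] = [];
  const step = chunkSize - chunkOverlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + chunkSize));
    if (start + chunkSize >= text.length) break;
  }
  return chunks;
}

console.log(splitWithOverlap("abcdefghij", 4, 2)); // [ 'abcd', 'cdef', 'efgh', 'ghij' ]
```

The recursive and format-aware splitters refine this by preferring natural boundaries (paragraphs, sentences, Markdown or LaTeX structure) over hard character cuts.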
### Tokenization Utilities

- `Tokenizer` - Main tokenizer class for encoding and decoding text
- `encode()` - Convert text to token IDs and tokens
- `decode()` - Convert token IDs back to text
- `tokenize()` - Split text into token strings
- `AddedToken` - Custom token configuration class
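A note on the `Ġ` in the encoded tokens shown earlier (`'ĠWorld'`): byte-level BPE tokenizers in the GPT-2 family mark a leading space by mapping it to the printable character `Ġ` (U+0120) so token strings contain no raw whitespace. A standalone sketch of that pretokenization convention (illustrative; not this package's code):

```ts
// Split on spaces, keeping each space attached to the following word,
// then replace the leading space (U+0020) with the marker 'Ġ' (U+0120).
function byteLevelPretokenize(text: string): string[] {
  const pieces = text.match(/ ?\S+/g) ?? [];
  return pieces.map((p) => p.replace(/^ /, "\u0120"));
}

console.log(byteLevelPretokenize("Hello World")); // [ 'Hello', 'ĠWorld' ]
```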
