
@docen/deduplicate

v0.2.9

Document deduplication and similarity analysis for Tiptap/ProseMirror JSON content, using SimHash screening + Levenshtein verification.

Features

  • Duplicate detection within a single document
  • Cross-document paragraph comparison with bidirectional coverage
  • Sentence-level matching: SimHash for fast screening, Levenshtein for precise verification
  • No false positives from n-gram containment — all matches verified by edit distance
  • Multilingual support (Chinese, English, etc.)

Installation

pnpm add @docen/deduplicate

Quick Start

import { findDuplicates } from "@docen/deduplicate";

const document = {
  type: "doc",
  content: [
    // "Machine learning is an important branch of artificial intelligence."
    { type: "paragraph", content: [{ type: "text", text: "机器学习是人工智能的一个重要分支。" }] },
    // Exact repeat of the paragraph above
    { type: "paragraph", content: [{ type: "text", text: "机器学习是人工智能的一个重要分支。" }] },
    // "Deep learning is a subfield of machine learning."
    { type: "paragraph", content: [{ type: "text", text: "深度学习是机器学习的子领域。" }] },
  ],
};

const duplicates = findDuplicates(document, { threshold: 0.85 });
// [{ index: 0, text: "机器学习是...", duplicates: [1], similarities: [1.0] }]

API Reference

extractParagraphs(doc)

Extracts all paragraph/heading text from a Tiptap JSON document. Consecutive paragraph nodes are merged when the first does not end with sentence-ending punctuation.

import { extractParagraphs } from "@docen/deduplicate";

const paragraphs = extractParagraphs(document);
// ["第一段。", "第二段。"] (i.e. "First paragraph.", "Second paragraph.")
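The traversal and merge rule described above can be sketched as follows. This is an illustration of the behavior, not the package's actual source; the node names (`paragraph`, `heading`, `text`) follow the standard Tiptap/ProseMirror JSON shape.

```typescript
// Hypothetical sketch of Tiptap paragraph extraction (not the package source).
type TiptapNode = { type: string; text?: string; content?: TiptapNode[] };

// Collect the plain text of a node by concatenating its text leaves.
function nodeText(node: TiptapNode): string {
  if (node.type === "text") return node.text ?? "";
  return (node.content ?? []).map(nodeText).join("");
}

const SENTENCE_END = /[。!?;.!?;]$/;

// Walk top-level paragraph/heading nodes, merging a paragraph into the
// previous one when the previous text lacks sentence-ending punctuation.
function extractParagraphsSketch(doc: TiptapNode): string[] {
  const out: string[] = [];
  for (const node of doc.content ?? []) {
    if (node.type !== "paragraph" && node.type !== "heading") continue;
    const text = nodeText(node).trim();
    if (!text) continue;
    if (out.length > 0 && !SENTENCE_END.test(out[out.length - 1])) {
      out[out.length - 1] += text; // continue the unfinished paragraph
    } else {
      out.push(text);
    }
  }
  return out;
}
```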

splitSentences(text)

Splits text into sentences. Chinese-aware: both full-width (。!?;) and half-width (.!?;) sentence terminators are supported.

import { splitSentences } from "@docen/deduplicate";

const sentences = splitSentences("第一句。第二句!第三句?");
// ["第一句。", "第二句!", "第三句?"]
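One way such a terminator-aware splitter might be implemented is a single regex that captures each run of text together with its trailing terminator. This is a sketch of the documented behavior, not the package's implementation:

```typescript
// Hypothetical regex-based sentence splitter. Each match is a run of
// non-terminator characters followed by an optional full-width (。!?;)
// or half-width (.!?;) terminator.
function splitSentencesSketch(text: string): string[] {
  const matches = text.match(/[^。!?;.!?;]+[。!?;.!?;]?/g) ?? [];
  return matches.map((s) => s.trim()).filter((s) => s.length > 0);
}
```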

calculateSimilarity(text1, text2)

Calculates string similarity as a normalized Levenshtein score in [0, 1], where 1.0 means the strings are identical.

import { calculateSimilarity } from "@docen/deduplicate";

calculateSimilarity("你好世界", "你好世界"); // 1.0
calculateSimilarity("你好世界", "你好地球"); // ~0.5
calculateSimilarity("你好", "再见"); // ~0.0
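A score of ~0.5 for two matching characters out of four is consistent with the common normalization 1 − distance / length of the longer string. A minimal sketch under that assumption (an illustration, not the package's code):

```typescript
// Classic single-row dynamic-programming Levenshtein distance.
function levenshtein(a: string, b: string): number {
  const chsA = [...a], chsB = [...b]; // code-point aware
  const m = chsA.length, n = chsB.length;
  const row = Array.from({ length: n + 1 }, (_, j) => j);
  for (let i = 1; i <= m; i++) {
    let prev = row[0]; // value of cell (i-1, j-1)
    row[0] = i;
    for (let j = 1; j <= n; j++) {
      const tmp = row[j];
      row[j] = Math.min(
        row[j] + 1,     // deletion
        row[j - 1] + 1, // insertion
        prev + (chsA[i - 1] === chsB[j - 1] ? 0 : 1), // substitution
      );
      prev = tmp;
    }
  }
  return row[n];
}

// Similarity = 1 - distance / length of the longer string.
function similaritySketch(a: string, b: string): number {
  const maxLen = Math.max([...a].length, [...b].length);
  return maxLen === 0 ? 1 : 1 - levenshtein(a, b) / maxLen;
}
```

With this normalization, "你好世界" vs "你好地球" is 2 edits over 4 characters, giving exactly 0.5.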

findDuplicates(doc, options?)

Finds duplicate/similar paragraphs within a single document.

import { findDuplicates } from "@docen/deduplicate";

const duplicates = findDuplicates(document, {
  threshold: 0.85, // Minimum similarity (0-1), default: 0.6
});
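Conceptually, the duplicate search is a pairwise scan over extracted paragraphs that groups later paragraphs under the first occurrence. The sketch below shows the shape of the result; `similarity` stands in for the package's sentence-level scoring and is a parameter here, not part of the real API:

```typescript
// Hypothetical duplicate-grouping loop illustrating the result shape.
interface DuplicateMatchSketch {
  index: number;
  text: string;
  duplicates: number[];
  similarities: number[];
}

function findDuplicatesSketch(
  paragraphs: string[],
  similarity: (a: string, b: string) => number,
  threshold = 0.6,
): DuplicateMatchSketch[] {
  const results: DuplicateMatchSketch[] = [];
  const claimed = new Set<number>(); // paragraphs already reported as duplicates
  for (let i = 0; i < paragraphs.length; i++) {
    if (claimed.has(i)) continue;
    const duplicates: number[] = [];
    const similarities: number[] = [];
    for (let j = i + 1; j < paragraphs.length; j++) {
      const score = similarity(paragraphs[i], paragraphs[j]);
      if (score >= threshold) {
        duplicates.push(j);
        similarities.push(score);
        claimed.add(j); // report each duplicate only once
      }
    }
    if (duplicates.length > 0) {
      results.push({ index: i, text: paragraphs[i], duplicates, similarities });
    }
  }
  return results;
}
```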

compareDocuments(doc1, doc2, options?)

Compares two documents and returns per-paragraph comparisons with bidirectional sentence-level coverage.

import { compareDocuments } from "@docen/deduplicate";

const result = compareDocuments(doc1, doc2, {
  threshold: 0.6, // Noise floor: matches scoring below this are classified as "none"
  hammingThreshold: 10, // SimHash screening distance
  levenshteinThreshold: 0.6, // Sentence-level verification threshold
});

result.paragraphs.forEach((pc) => {
  console.log(`[${pc.matchKind}] ${(pc.similarity * 100).toFixed(0)}%`);
  console.log(
    `  covA=${(pc.coverage.covA * 100).toFixed(0)}% covB=${(pc.coverage.covB * 100).toFixed(0)}%`,
  );
});

findBestMatch (re-export from @nlptools/distance)

One-shot fuzzy search: find the best matching string from candidates.

import { findBestMatch } from "@docen/deduplicate";

const result = findBestMatch("kitten", ["sitting", "kit", "mitten"]);
// { item: "kit", score: 0.5, index: 1 }

Options

interface DeduplicateOptions {
  /** Minimum similarity threshold (0-1). @default 0.6 */
  threshold?: number;
  /** SimHash hamming distance for candidate screening. @default 10 */
  hammingThreshold?: number;
  /** Levenshtein similarity for sentence verification. @default 0.6 */
  levenshteinThreshold?: number;
  /** Minimum sentence length for SimHash fingerprinting. @default 15 */
  minSentenceLength?: number;
  /** Custom sentence splitter (Chinese & English aware by default). */
  splitter?: (text: string) => string[];
}

Result Types

interface DocumentResult {
  paragraphs: ParagraphComparison[];
  coverage: number; // Average of paragraph covA
}

interface ParagraphComparison {
  fromDoc1: { index: number; text: string };
  fromDoc2: { index: number; text: string } | null;
  coverage: { covA: number; covB: number };
  matchKind: "contained" | "similar" | "weakOverlap" | "none";
  similarity: number; // max(covA, covB)
}

interface DuplicateMatch {
  index: number;
  text: string;
  duplicates: number[];
  similarities: number[];
}

Match Classification

| Kind        | Condition                    | Meaning                                     |
| ----------- | ---------------------------- | ------------------------------------------- |
| contained   | max(covA, covB) >= 0.8       | One paragraph mostly contained in the other |
| similar     | min(covA, covB) >= 0.6       | High bidirectional overlap                  |
| weakOverlap | max(covA, covB) >= threshold | Partial overlap (only when threshold < 0.6) |
| none        | max(covA, covB) < threshold  | No meaningful match                         |
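Read top to bottom, the table above is a small decision function. A sketch (the rows are applied in order; `threshold` is the `DeduplicateOptions` noise floor):

```typescript
type MatchKind = "contained" | "similar" | "weakOverlap" | "none";

// Classify a paragraph pair from its bidirectional coverage scores,
// checking the table's conditions in row order.
function classifyMatch(covA: number, covB: number, threshold = 0.6): MatchKind {
  const hi = Math.max(covA, covB);
  const lo = Math.min(covA, covB);
  if (hi >= 0.8) return "contained";
  if (lo >= 0.6) return "similar";
  if (hi >= threshold) return "weakOverlap"; // mainly reachable when threshold < 0.6
  return "none";
}
```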

How It Works

  1. Extract paragraphs from Tiptap JSON, split into sentences
  2. SimHash fingerprinting for sentences >= minSentenceLength characters
  3. Two-phase matching per paragraph pair:
    • Phase 1: SimHash hamming distance screens candidates (fast)
    • Phase 2: Levenshtein normalized similarity verifies matches (precise)
    • Unmatched short sentences: direct Levenshtein comparison
  4. No containment fallback — eliminates false positives from n-gram coincidence
  5. Noise floor controlled by threshold — matches below this are classified as "none"
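The phase-1 screen in the steps above relies on SimHash fingerprints compared by Hamming distance, so the expensive Levenshtein check only runs on plausible candidates. The following 32-bit sketch over character bigrams illustrates the idea; the package's actual fingerprint width and feature choice may differ:

```typescript
// FNV-1a hash of a string to an unsigned 32-bit integer.
function hash32(s: string): number {
  let h = 2166136261;
  for (const ch of s) {
    h ^= ch.codePointAt(0)!;
    h = Math.imul(h, 16777619);
  }
  return h >>> 0;
}

// SimHash: sum per-bit votes across features, keep bits with positive weight.
function simhash32(text: string): number {
  const chars = [...text];
  const weights = new Array(32).fill(0);
  for (let i = 0; i < chars.length - 1; i++) {
    const h = hash32(chars[i] + chars[i + 1]); // character-bigram feature
    for (let bit = 0; bit < 32; bit++) {
      weights[bit] += (h >>> bit) & 1 ? 1 : -1;
    }
  }
  let fp = 0;
  for (let bit = 0; bit < 32; bit++) {
    if (weights[bit] > 0) fp |= 1 << bit;
  }
  return fp >>> 0;
}

// Phase-1 screen: two sentences are candidates when their fingerprints
// differ in at most `hammingThreshold` bits.
function hamming32(a: number, b: number): number {
  let x = (a ^ b) >>> 0;
  let count = 0;
  while (x) {
    x &= x - 1; // clear lowest set bit
    count++;
  }
  return count;
}
```

Identical sentences always produce identical fingerprints (distance 0), and near-duplicates share most bigram features, so they tend to land within a small Hamming distance of each other.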

License

MIT © Demo Macro