socialgood
v1.0.0
Detect duplicate records using Hamming distance. Zero dependencies.
Works with plain strings or objects (compares selected fields). Returns a similarity score and lets you set a threshold for what counts as a duplicate.
Install
npm install socialgood
Quick Start
const { isDuplicate, findDuplicates, deduplicate } = require('socialgood');
// Compare two strings
isDuplicate('hello world', 'hello worlt');
// { isDuplicate: true, similarity: 0.909091, distance: 1 }
// Compare two records
isDuplicate(
{ name: 'John Smith', email: '[email protected]' },
{ name: 'John Smtih', email: '[email protected]' }
);
// { isDuplicate: true, similarity: ..., distance: 2 } (the swapped "it"/"ti" differs at two positions, assuming the emails match)
// Find all duplicate pairs in a list
findDuplicates([
{ name: 'Alice', email: '[email protected]' },
{ name: 'alice', email: '[email protected]' },
{ name: 'Bob', email: '[email protected]' }
]);
// [{ indexA: 0, indexB: 1, similarity: 1, distance: 0 }]
// Remove duplicates, keeping the first occurrence
deduplicate(['hello', 'hello', 'world']);
// ['hello', 'world']
API
hammingDistance(a, b)
Returns the number of positions where characters differ between two strings. Length differences count as mismatches.
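As a rough illustration of the rule described above, the following sketch counts mismatched positions and treats every position past the end of the shorter string as a mismatch. The name hammingDistanceSketch is hypothetical, and this is not the package's source; it also indexes by UTF-16 code unit, which may differ from how the library handles characters outside the Basic Multilingual Plane.

```javascript
// Illustrative sketch only; not the package's actual implementation.
function hammingDistanceSketch(a, b) {
  const len = Math.max(a.length, b.length);
  let distance = 0;
  for (let i = 0; i < len; i++) {
    // Past the end of the shorter string, a[i] or b[i] is undefined,
    // so the comparison fails and the position counts as a mismatch.
    if (a[i] !== b[i]) distance++;
  }
  return distance;
}

console.log(hammingDistanceSketch('hello world', 'hello worlt')); // 1
console.log(hammingDistanceSketch('abc', 'abcde'));               // 2
```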
similarity(a, b)
Returns a normalized score from 0 (completely different) to 1 (identical).
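The Quick Start result of 0.909091 for one mismatch across 11 characters is consistent with the formula 1 - distance / max(length of a, length of b). The sketch below assumes that formula; the name similaritySketch and the handling of two empty strings are assumptions, not documented behavior.

```javascript
// Illustrative sketch; the formula is inferred from the README's example output.
function similaritySketch(a, b) {
  const len = Math.max(a.length, b.length);
  if (len === 0) return 1; // assumption: two empty strings count as identical
  let distance = 0;
  for (let i = 0; i < len; i++) {
    if (a[i] !== b[i]) distance++;
  }
  return 1 - distance / len;
}

console.log(similaritySketch('hello world', 'hello worlt').toFixed(6)); // 0.909091
console.log(similaritySketch('abc', 'xyz')); // 0
console.log(similaritySketch('abc', 'abc')); // 1
```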
isDuplicate(a, b, [options])
Compare two records (strings or objects) and determine if they are duplicates.
Options:
| Option | Type | Default | Description |
|---|---|---|---|
| threshold | number | 0.85 | Similarity at or above this value is considered a duplicate (0-1). |
| fields | string[] | all keys | Which object keys to compare. |
| caseSensitive | boolean | false | If true, comparisons are case-sensitive. |
Returns: { isDuplicate: boolean, similarity: number, distance: number }
findDuplicates(records, [options])
Find all duplicate pairs in an array. Takes the same options as isDuplicate.
Returns: Array<{ indexA: number, indexB: number, similarity: number, distance: number }>
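One plausible way to produce such pairs is to compare every record with every later record. The sketch below does that for plain strings; sim and findDuplicatesSketch are hypothetical names, and whether the library uses the same exhaustive strategy is an assumption.

```javascript
// Hypothetical helper: Hamming-based similarity on trimmed, lowercased strings.
function sim(a, b) {
  const x = a.trim().toLowerCase();
  const y = b.trim().toLowerCase();
  const len = Math.max(x.length, y.length);
  if (len === 0) return 1;
  let d = 0;
  for (let i = 0; i < len; i++) if (x[i] !== y[i]) d++;
  return 1 - d / len;
}

// Compare every pair (i, j) with i < j and keep those at or above the threshold.
function findDuplicatesSketch(strings, threshold = 0.85) {
  const pairs = [];
  for (let i = 0; i < strings.length; i++) {
    for (let j = i + 1; j < strings.length; j++) {
      const similarity = sim(strings[i], strings[j]);
      if (similarity >= threshold) pairs.push({ indexA: i, indexB: j, similarity });
    }
  }
  return pairs;
}

console.log(findDuplicatesSketch(['Alice', 'alice', 'Bob']));
// [ { indexA: 0, indexB: 1, similarity: 1 } ]
```

An exhaustive pairwise scan performs n(n-1)/2 comparisons for n records, so under this assumption the running time grows quadratically with the size of the list.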
deduplicate(records, [options])
Remove duplicates from an array, keeping the first occurrence. Takes the same options as isDuplicate.
Returns: Array with duplicates removed.
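A minimal sketch of "keep the first occurrence" follows. It takes the duplicate test as a parameter (isDup, a hypothetical name) so the example stays short; a Hamming-threshold predicate could be passed in instead of the case-insensitive equality used here.

```javascript
// Illustrative sketch only; not the package's actual implementation.
function dedupeSketch(items, isDup) {
  const kept = [];
  for (const item of items) {
    // Keep the item only if it does not match anything already kept.
    if (!kept.some((k) => isDup(k, item))) kept.push(item);
  }
  return kept;
}

const sameIgnoringCase = (a, b) => a.toLowerCase() === b.toLowerCase();
console.log(dedupeSketch(['hello', 'Hello', 'world'], sameIgnoringCase));
// [ 'hello', 'world' ]
```

With a fuzzy predicate, "A matches B" and "B matches C" does not imply "A matches C", so comparing against kept items only (as here) can give a different result than comparing against every earlier item. The README does not say which approach the library takes.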
How It Works
Hamming distance counts the number of positions where corresponding characters differ between two strings. This library:
- Serializes object records into strings (using selected fields, joined with a null separator to prevent collisions)
- Normalizes by lowercasing and trimming (unless caseSensitive is set)
- Pads strings to equal length
- Computes Hamming distance and normalizes it to a 0-1 similarity score
- Compares against the threshold to determine duplicates
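The serialization and normalization steps above can be sketched as follows. serializeSketch is a hypothetical name, and two details are assumptions rather than documented behavior: missing fields become empty strings, and trimming is skipped together with lowercasing when caseSensitive is true.

```javascript
// Illustrative sketch of the first two steps; not the package's source.
function serializeSketch(record, { fields, caseSensitive = false } = {}) {
  // Assumption: caseSensitive disables both lowercasing and trimming.
  const normalize = (s) => (caseSensitive ? s : s.trim().toLowerCase());
  if (typeof record === 'string') return normalize(record);
  const keys = fields || Object.keys(record);
  return keys
    .map((key) => normalize(String(record[key] ?? ''))) // assumption: missing -> ''
    .join('\0'); // null separator keeps field boundaries distinct
}

// Without a separator both records would serialize to "abc".
const left = serializeSketch({ first: 'ab', last: 'c' });
const right = serializeSketch({ first: 'a', last: 'bc' });
console.log(left === right); // false

console.log(serializeSketch({ name: ' John Smith ', age: 30 }, { fields: ['name'] }));
// john smith
```

The separator matters because plain concatenation would make { first: 'ab', last: 'c' } and { first: 'a', last: 'bc' } indistinguishable before the distance is ever computed.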
License
Apache 2.0 -- If you use this package, you must give credit by:
- Including the LICENSE and NOTICE files in any redistribution
- Stating any changes you made to the source
- Preserving all copyright and attribution notices
See LICENSE for full terms.
