have-we-met
Identity resolution library for Node.js - Match, deduplicate, and merge records with confidence
have-we-met helps you identify, match, and merge duplicate records across datasets. Built for production use in healthcare, finance, CRM, and any domain where data quality matters.
Features
- 🎯 Three Matching Paradigms: Deterministic rules, probabilistic scoring, and ML-based matching
- 🔀 Multi-Source Consolidation: Match and merge records from multiple databases with different schemas
- ⚡ Blazing Fast: Blocking strategies reduce O(n²) comparisons to near-linear time - process 100k records in seconds
- 🔧 Flexible Configuration: Fluent API with full TypeScript support and type inference
- 💾 Database Native: First-class adapters for Prisma, Drizzle, and TypeORM
- 👥 Human-in-the-Loop: Built-in review queue for ambiguous matches
- 🔄 Golden Record Management: Configurable merge strategies with full provenance tracking
- 🧠 ML Integration: Pre-trained models included, train custom models from your data
- 🔌 Extensible: Plugin architecture for external validation and data enrichment services
- 📊 Production Ready: Comprehensive error handling, metrics, and monitoring
Quick Start
Installation
npm install have-we-met
Basic Usage
import { HaveWeMet } from 'have-we-met'
interface Person {
firstName: string
lastName: string
email: string
dateOfBirth: string
}
// Configure the resolver
const resolver = HaveWeMet.create<Person>()
.schema((schema) =>
schema
.field('firstName', { type: 'name', component: 'first' })
.field('lastName', { type: 'name', component: 'last' })
.field('email', { type: 'email' })
.field('dateOfBirth', { type: 'date' })
)
// Blocking reduces comparisons by 95-99%
.blocking((block) => block.onField('lastName', { transform: 'soundex' }))
// Weighted probabilistic matching
.matching((match) =>
match
.field('email')
.strategy('exact')
.weight(20)
.field('firstName')
.strategy('jaro-winkler')
.weight(10)
.threshold(0.85)
.field('lastName')
.strategy('jaro-winkler')
.weight(10)
.threshold(0.85)
.field('dateOfBirth')
.strategy('exact')
.weight(10)
.thresholds({ noMatch: 20, definiteMatch: 45 })
)
.build()
// Find matches
const results = resolver.resolve(newRecord, existingRecords)
// Three possible outcomes:
// - definite-match: High confidence match (score >= 45)
// - potential-match: Needs human review (score >= 20 and < 45)
// - no-match: New record (score < 20)
results.forEach((result) => {
console.log(result.outcome) // 'definite-match' | 'potential-match' | 'no-match'
console.log(result.score.totalScore) // Numeric score
console.log(result.explanation) // Field-by-field breakdown
})
See full quick start example →
Key Use Cases
1. Real-Time Duplicate Detection
Check for duplicates at the point of entry (e.g., new customer registration):
import { prismaAdapter } from 'have-we-met/adapters/prisma'
const resolver = HaveWeMet.create<Customer>()
.schema((schema) => /* ... */)
.blocking((block) => /* ... */)
.matching((match) => /* ... */)
.adapter(prismaAdapter(prisma, { tableName: 'customers' }))
.build()
// Check database for matches before inserting
const matches = await resolver.resolveWithDatabase(newCustomer)
if (matches[0]?.outcome === 'definite-match') {
return { error: 'Customer already exists', id: matches[0].record.id }
}
// Safe to create new record
await prisma.customer.create({ data: newCustomer })
See database integration example →
2. Batch Deduplication
Clean up legacy data or deduplicate imported datasets:
// Find all duplicates in a dataset
const result = resolver.deduplicateBatch(records)
console.log(`Found ${result.stats.definiteMatchesFound} duplicates`)
console.log(`${result.stats.potentialMatchesFound} need human review`)
// Batch deduplicate from database
const dbResult = await resolver.deduplicateBatchFromDatabase({
batchSize: 1000,
persistResults: true,
})
See batch deduplication example →
3. Human Review Workflow
Queue ambiguous matches for human review:
// Auto-queue potential matches
const results = await resolver.resolve(newRecord, {
autoQueue: true,
queueContext: { source: 'import', userId: 'admin' },
})
// Review queue items
const pending = await resolver.queue.list({ status: 'pending', limit: 10 })
// Make decisions
await resolver.queue.confirm(itemId, {
selectedMatchId: matchId,
notes: 'Verified by phone number',
decidedBy: '[email protected]',
})
// Monitor queue health
const stats = await resolver.queue.stats()
console.log(`Pending: ${stats.byStatus.pending}`)
4. ML-Enhanced Matching
Combine rule-based and ML matching for best accuracy:
const resolver = HaveWeMet.create<Person>()
.schema((schema) => /* ... */)
.blocking((block) => /* ... */)
.matching((match) => /* ... */)
.ml((ml) =>
ml
.usePretrained() // Use built-in model
.mode('hybrid') // Combine ML + probabilistic
.mlWeight(0.4) // 40% ML, 60% probabilistic
)
.build()
// Results include ML predictions
results.forEach(result => {
console.log(result.mlPrediction?.probability) // 0.92 (92% match)
console.log(result.mlPrediction?.confidence) // 'high'
})
5. Multi-Source Consolidation
Match and merge records from multiple databases with different schemas:
// Consolidate customers from 3 product databases
const result = await HaveWeMet.consolidation<UnifiedCustomer>()
.source(
'crm',
(source) =>
source
.adapter(crmAdapter)
.mapping((map) =>
map
.field('email')
.from('email_address')
.field('firstName')
.from('first_name')
.field('lastName')
.from('last_name')
)
.priority(2) // CRM is most trusted
)
.source('billing', (source) =>
source
.adapter(billingAdapter)
.mapping((map) =>
map
.field('email')
.from('contact_email')
.field('firstName')
.from('fname')
.field('lastName')
.from('lname')
)
.priority(1)
)
.source('support', (source) =>
source
.adapter(supportAdapter)
.mapping((map) =>
map
.field('email')
.from('email')
.field('firstName')
.from('first')
.field('lastName')
.from('last')
)
.priority(1)
)
.matchingScope('within-source-first')
.conflictResolution((cr) =>
cr
.useSourcePriority(true)
.defaultStrategy('preferNonNull')
.fieldStrategy('email', 'preferNewer')
)
.outputAdapter(unifiedAdapter)
.build()
.consolidate()
console.log(`Created ${result.stats.goldenRecords} unified records`)
console.log(`Found ${result.stats.crossSourceMatches} cross-source matches`)
Why have-we-met?
The Problem
Every organization accumulates duplicate records over time:
- Multiple customer accounts for the same person
- Patient records split across systems
- Vendor duplicates with slight variations in names
- Legacy data imports with inconsistent formats
Manual deduplication doesn't scale. Simple exact-match queries miss fuzzy duplicates. You need intelligent matching that handles:
- Typos and spelling variations
- The same person using different email addresses
- Formatting differences
- Incomplete data
- Ambiguous cases requiring human judgment
The Solution
have-we-met provides production-grade identity resolution:
✅ Handles Fuzzy Matches: Uses advanced string similarity algorithms (Jaro-Winkler, Levenshtein, phonetic encoding)
✅ Scales to Millions: Blocking strategies reduce O(n²) complexity to near-linear performance
✅ Works with Your Database: Native adapters query your database efficiently without loading everything into memory
✅ Learns from Feedback: ML models improve over time by learning from human review decisions
✅ Production Ready: Built for real-world use with error handling, monitoring, and comprehensive testing
Matching Paradigms
Deterministic Matching
Rules-based matching where specific field combinations definitively identify a match:
// If SSN matches exactly, it's the same person
if (record1.ssn === record2.ssn) {
return 'definite-match'
}
Best for: Unique identifiers, high-confidence business rules
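In practice a deterministic configuration is a set of such rules rather than a single comparison: any one satisfied field combination is enough to declare a definite match. Here is a conceptual sketch in plain TypeScript (not the library's fluent configuration API) of how such a rule set behaves:
// Conceptual sketch only (plain TypeScript), not the library's configuration API.
interface PersonLike {
  ssn?: string
  email: string
  dateOfBirth: string
}
// A deterministic rule is a field combination that identifies a pair with certainty.
type Rule = (a: PersonLike, b: PersonLike) => boolean
const rules: Rule[] = [
  // Rule 1: an identical SSN is sufficient on its own
  (a, b) => a.ssn !== undefined && a.ssn === b.ssn,
  // Rule 2: identical email AND date of birth together are sufficient
  (a, b) => a.email === b.email && a.dateOfBirth === b.dateOfBirth,
]
function deterministicMatch(a: PersonLike, b: PersonLike): 'definite-match' | 'no-match' {
  return rules.some((rule) => rule(a, b)) ? 'definite-match' : 'no-match'
}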
Probabilistic Matching
Weighted scoring across multiple fields based on Fellegi-Sunter theory:
// Each field contributes to total score
email match: +20 points
phone match: +15 points
name fuzzy match: +10 points
address mismatch: -5 points
-------------------------
Total: 40 points (potential match)
Best for: General identity resolution, tunable for your data
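The tally above can be written as a small scoring function: each field adds its weight when it agrees (exactly, or above a fuzzy-similarity threshold) and subtracts a penalty when it disagrees. This is an illustrative sketch of the idea, not the library's internals; compare stands in for whichever similarity algorithm is configured per field.
// Illustrative sketch of weighted field scoring, not the library's internals.
interface FieldRule {
  weight: number                             // points added on agreement
  penalty?: number                           // points subtracted on disagreement
  threshold?: number                         // minimum similarity that counts as agreement
  compare: (a: string, b: string) => number  // 0..1 similarity (exact, Jaro-Winkler, ...)
}
const exact = (a: string, b: string) => (a === b ? 1 : 0)
function totalScore(
  a: Record<string, string>,
  b: Record<string, string>,
  rules: Record<string, FieldRule>
): number {
  let total = 0
  for (const [field, rule] of Object.entries(rules)) {
    const similarity = rule.compare(a[field] ?? '', b[field] ?? '')
    const agrees = similarity >= (rule.threshold ?? 1)
    total += agrees ? rule.weight : -(rule.penalty ?? 0)
  }
  return total
}
// e.g. rules = { email: { weight: 20, compare: exact }, phone: { weight: 15, compare: exact }, ... }
// email +20, phone +15, fuzzy name +10, address mismatch -5  =>  40 (potential match)
The resulting total is then bucketed by the noMatch / definiteMatch thresholds, exactly as in the Quick Start example.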
ML-Based Matching
Machine learning models that learn patterns from data:
// ML model considers complex patterns
ML prediction: 87% match probability
Features: email domain similarity, name nickname patterns,
address component overlap, temporal patterns
Best for: Complex patterns, learning from historical decisions
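In hybrid mode the ML probability is blended with the rule-based score according to mlWeight (0.4 in the earlier example, i.e. 40% ML and 60% probabilistic). How the library combines them internally is not spelled out here, so the following is only a sketch of the idea, assuming the rule score is first normalized to the 0..1 range:
// Sketch of blending an ML match probability with a normalized rule-based score.
// Assumes mlWeight = 0.4 as in the example above; the library's actual logic may differ.
function hybridScore(
  mlProbability: number,   // 0..1 from the model
  ruleScore: number,       // raw weighted score, e.g. 0..45
  maxRuleScore: number,    // score treated as a certain rule-based match
  mlWeight = 0.4
): number {
  const ruleProbability = Math.min(ruleScore / maxRuleScore, 1)
  return mlWeight * mlProbability + (1 - mlWeight) * ruleProbability
}
// hybridScore(0.92, 40, 45) ≈ 0.4 * 0.92 + 0.6 * 0.89 ≈ 0.90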
Blocking Strategies
Blocking is essential for scaling to large datasets. Instead of comparing every record to every other record (O(n²)), blocking groups similar records together:
// Without blocking: 100k records = 5 billion comparisons
// With blocking: 100k records = 50 million comparisons (99% reduction!)
.blocking((block) =>
block
.onField('lastName', { transform: 'soundex' }) // Group by phonetic codes
.onField('dateOfBirth', { transform: 'year' }) // Group by birth year
)
Result: Process 100k records in seconds instead of hours.
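To see where the reduction comes from, here is a minimal sketch of the blocking idea in plain TypeScript: records are grouped by a blocking key (a stand-in for a real transform such as Soundex on lastName), and candidate pairs are only generated within each group.
// Minimal blocking sketch: group records by a key, compare only within groups.
function blockAndPair<T>(records: T[], blockingKey: (r: T) => string): Array<[T, T]> {
  const buckets = new Map<string, T[]>()
  for (const record of records) {
    const key = blockingKey(record)
    const bucket = buckets.get(key) ?? []
    bucket.push(record)
    buckets.set(key, bucket)
  }
  const pairs: Array<[T, T]> = []
  for (const bucket of buckets.values()) {
    for (let i = 0; i < bucket.length; i++) {
      for (let j = i + 1; j < bucket.length; j++) {
        pairs.push([bucket[i], bucket[j]])
      }
    }
  }
  return pairs
}
// 100k records spread across ~100 blocks of ~1,000 records each gives roughly
// 100 * (1,000 * 999 / 2) ≈ 50 million pairs instead of ~5 billion.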
Database Adapters
Work directly with your existing database:
Prisma
import { PrismaClient } from '@prisma/client'
import { prismaAdapter } from 'have-we-met/adapters/prisma'
const prisma = new PrismaClient()
const resolver = HaveWeMet.create<Customer>()
.adapter(prismaAdapter(prisma, { tableName: 'customers' }))
.build()
Drizzle
import { drizzle } from 'drizzle-orm/node-postgres'
import { drizzleAdapter } from 'have-we-met/adapters/drizzle'
const db = drizzle(pool)
const resolver = HaveWeMet.create<Customer>()
.adapter(drizzleAdapter(db, { table: customersTable }))
.build()
TypeORM
import { DataSource } from 'typeorm'
import { typeormAdapter } from 'have-we-met/adapters/typeorm'
const dataSource = new DataSource({...})
const resolver = HaveWeMet.create<Customer>()
.adapter(typeormAdapter(dataSource, { entity: Customer }))
.build()
Database adapter documentation →
Documentation
Getting Started
Matching
- Probabilistic Matching
- Tuning Guide - Configure weights and thresholds
- String Similarity Algorithms
- Examples and Recipes
Blocking
Data Preparation
- Normalizers Overview
- Name Normalizer
- Email Normalizer
- Phone Normalizer
- Address Normalizer
- Date Normalizer
- Custom Normalizers
Database Integration
- Database Adapters
- Prisma Adapter
- Drizzle Adapter
- TypeORM Adapter
- Performance Optimization
- Migration Guide
Human Review
Golden Record
ML Matching
External Services
API Reference
Performance
have-we-met is designed for production scale:
| Dataset Size | Batch Deduplication Time | Memory Usage | Comparison Reduction |
| ------------ | ------------------------ | ------------ | -------------------- |
| 10k records  | ~1 second                | < 100MB      | 97%                  |
| 100k records | ~15 seconds              | < 500MB      | 98%                  |
| 1M records   | ~3 minutes               | < 2GB        | 99%+                 |
- Real-time matching: < 100ms per query
- ML predictions: < 10ms per comparison
- Blocking efficiency: 95-99%+ comparison reduction
Requirements
- Node.js: 18+ (ESM and CommonJS supported)
- TypeScript: 5.0+ (optional, but recommended)
- Database: Optional, but recommended for production use
- Prisma 5+
- Drizzle ORM 0.28+
- TypeORM 0.3+
Contributing
Contributions are welcome! Please see CONTRIBUTING.md for guidelines.
License
MIT © Matt Barrett
Support
Roadmap
Current Version: 0.1.0 (Initial Release)
Completed Features:
- ✅ Core matching engine (deterministic, probabilistic, ML)
- ✅ String similarity algorithms (Levenshtein, Jaro-Winkler, Soundex, Metaphone)
- ✅ Data normalizers (name, email, phone, address, date)
- ✅ Blocking strategies (standard, sorted neighbourhood)
- ✅ Database adapters (Prisma, Drizzle, TypeORM)
- ✅ Review queue with human-in-the-loop workflow
- ✅ Golden record management with provenance
- ✅ External service integration
- ✅ ML matching with pre-trained models
- ✅ Comprehensive documentation and examples
Future Plans:
- Multi-language name handling
- Additional phonetic algorithms for non-English names
- UI components for review queue
- CLI tool for batch operations
- Performance visualization
Acknowledgments
Built with inspiration from:
- Fellegi-Sunter record linkage theory
- Duke (Java deduplication engine)
- Python Record Linkage Toolkit
- Dedupe.io
Made with ❤️ for data quality
