@factorypure/matching-engine
v1.0.8
Robust data matching system for product catalogs with unreliable identifier fields and cross-field matching capabilities.
The Problem
When integrating product data from multiple sources (web scraping, supplier catalogs, internal databases), identifier fields are often:
- Missing: SKU, UPC, model number fields are frequently empty
- Inconsistent: Same value appears in different field names across sources
- Polluted: UPC values stored in SKU fields, model numbers in UPC fields, etc.
Example of the core challenge:
const variant = {
  sku: 'TEST-123',
  upc: null,
  title: 'Acme X200 Motor',
}

const scrapeResult = {
  sku: null,
  upc: 'TEST-123', // SAME VALUE, DIFFERENT FIELD
  title: 'ACME X200 MOTOR',
}

// Traditional field-specific matching would miss this!
// Our cross-field matching finds it.

Solution
This package provides:
- Identifier Extraction - Finds identifiers in structured fields AND unstructured text (titles, descriptions)
- Cross-Field Matching - Compares ALL identifiers from source A against ALL identifiers from source B regardless of field names
- Fuzzy Matching - Handles typos and formatting variations, with identifiers normalized before comparison
- Confidence Scoring - Multi-signal scoring that adapts to available data quality
- Type Classification - Determines if a value is a UPC, SKU, model number, etc.
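To illustrate the cross-field idea from the list above, here is a minimal sketch that pools every identifier-like value from two products and compares all pairs, ignoring field names. It is not the package's implementation: `poolIdentifiers`, `crossFieldMatches`, the `IDENTIFIER_FIELDS` list, and the normalization rule are all assumptions made for this example.

```typescript
// Hypothetical sketch of cross-field matching; not the package's actual code.
type Product = Record<string, unknown>

interface PooledIdentifier {
  field: string      // field the value was found in, e.g. 'sku'
  value: string      // original value
  normalized: string // key used for comparison
}

interface CrossFieldMatch {
  fieldA: string
  fieldB: string
  value: string
  crossField: boolean
}

// Fields treated as identifier-bearing in this sketch (an assumption).
const IDENTIFIER_FIELDS = ['sku', 'upc', 'model_number', 'mpn']

// Lowercase and drop everything except letters and digits.
function normalize(value: string): string {
  return value.toLowerCase().replace(/[^a-z0-9]/g, '')
}

// Collect every non-empty identifier value, remembering which field it came from.
function poolIdentifiers(product: Product): PooledIdentifier[] {
  const pool: PooledIdentifier[] = []
  for (const field of IDENTIFIER_FIELDS) {
    const raw = product[field]
    if (typeof raw !== 'string') continue
    const normalized = normalize(raw)
    if (normalized !== '') pool.push({ field, value: raw, normalized })
  }
  return pool
}

// Compare every identifier in A against every identifier in B.
function crossFieldMatches(a: Product, b: Product): CrossFieldMatch[] {
  const matches: CrossFieldMatch[] = []
  for (const idA of poolIdentifiers(a)) {
    for (const idB of poolIdentifiers(b)) {
      if (idA.normalized === idB.normalized) {
        matches.push({
          fieldA: idA.field,
          fieldB: idB.field,
          value: idA.value,
          crossField: idA.field !== idB.field,
        })
      }
    }
  }
  return matches
}

const internalVariant = { sku: 'TEST-123', upc: null, title: 'Acme X200 Motor' }
const scraped = { sku: null, upc: 'TEST-123', title: 'ACME X200 MOTOR' }
const found = crossFieldMatches(internalVariant, scraped)
console.log(found) // one match: fieldA 'sku', fieldB 'upc', crossField true
```

A field-by-field comparison (sku against sku, upc against upc) would return nothing for these two objects, because each side has the value in a different field.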
Installation
npm install @factorypure/matching-engine

Quick Start
import { findMatches } from '@factorypure/matching-engine'

const sourceProduct = {
  id: 1,
  sku: 'X200-5HP',
  title: 'Acme X200 Motor 5HP',
  vendor: 'Acme Corp',
}

const candidates = [
  {
    id: 2,
    upc: 'X 200 5HP', // Cross-field match (different field name)
    title: 'ACME X200 MOTOR',
    brand: 'Acme Corp',
  },
  {
    id: 3,
    title: 'Different Product',
    brand: 'Other Brand',
  },
]

const result = findMatches(sourceProduct, candidates, {
  minConfidence: 70,
  autoMatchThreshold: 90,
  enableFuzzyMatching: true,
})

console.log(result.candidates)
// [
//   {
//     candidateId: 2,
//     confidence: 87.5,
//     recommendation: 'review',
//     matchReasons: [
//       {
//         type: 'identifier_match',
//         fieldA: 'sku',
//         fieldB: 'upc', // CROSS-FIELD!
//         value: 'X200-5HP',
//         score: 100,
//         details: 'Cross-field match'
//       },
//       { type: 'title_similarity', score: 85 },
//       { type: 'brand_match', score: 100 }
//     ],
//     identifierMatches: [/* detailed match info */]
//   }
// ]

Key Features
1. Identifier Extraction
Extracts identifiers from multiple sources:
import { extractIdentifiers } from '@factorypure/matching-engine'

const product = {
  id: 1,
  sku: 'X200-5HP', // Field SKU
  upc: null,
  title: 'Acme Motor (P/N: ABC-123) 012345678', // Embedded identifiers
}

const pool = extractIdentifiers(product, 'variant')
console.log(pool.identifiers)
// [
//   { value: 'X200-5HP', type: 'sku', source: 'sku', confidence: 0.8 },
//   { value: 'ABC-123', type: 'part_number', source: 'title_parsed', confidence: 0.85 },
//   { value: '012345678', type: 'upc', source: 'title_parsed', confidence: 0.8 }
// ]

2. Cross-Field Matching
Compares ALL identifiers regardless of field name:
import { extractIdentifiers, findIdentifierMatches } from '@factorypure/matching-engine'

// `variant` and `scrapeResult` are the objects from the example in "The Problem"
const variantPool = extractIdentifiers(variant, 'variant')
const resultPool = extractIdentifiers(scrapeResult, 'scrape_result')
const matches = findIdentifierMatches(variantPool, resultPool, true) // true = enableFuzzy

// Finds matches even when:
// - variant.sku === scrapeResult.upc
// - variant.model === scrapeResult.sku
// - Embedded in title vs structured field

3. Fuzzy Matching
Handles variations and typos:
import { normalizeIdentifier } from '@factorypure/matching-engine'
normalizeIdentifier('X-200-5HP') // "x2005hp"
normalizeIdentifier('X 200 5HP') // "x2005hp"
normalizeIdentifier('X_200_5HP') // "x2005hp"
// All normalize to same value → MATCH

4. Confidence Scoring
Adapts to available data:
const result = findMatches(source, candidates, {
  weights: {
    identifierMatch: 0.5, // 50% - Highest priority
    titleSimilarity: 0.25, // 25%
    brandMatch: 0.15, // 15%
    specMatch: 0.1, // 10%
  },
})

// When identifiers are missing, title+brand can still produce matches
// When identifiers are present, confidence is boosted

5. Type Classification
Determines what each value represents:
import { classifyIdentifier } from '@factorypure/matching-engine'
classifyIdentifier('012345678901', 'upc_field')
// { type: 'upc', confidence: 0.95 }
classifyIdentifier('012345678901', 'sku_field') // Wrong field name!
// { type: 'upc', confidence: 0.70 } // Still classifies correctly by pattern
classifyIdentifier('X200-5HP', 'model')
// { type: 'model', confidence: 0.85 }

Configuration
import { DEFAULT_MATCHING_CONFIG } from '@factorypure/matching-engine'

const customConfig = {
  ...DEFAULT_MATCHING_CONFIG,
  minConfidence: 75, // Minimum confidence to include in results
  autoMatchThreshold: 95, // Auto-approve above this
  reviewThreshold: 75, // Manual review between this and autoMatchThreshold
  maxCandidates: 10, // Max results to return
  enableFuzzyMatching: true, // Allow fuzzy identifier matching
  brandRequired: true, // Penalize brand mismatches
  weights: {
    identifierMatch: 0.5,
    titleSimilarity: 0.25,
    brandMatch: 0.15,
    specMatch: 0.1,
  },
}

API Reference
findMatches(sourceProduct, candidates, config?)
Main matching function.
Parameters:
- sourceProduct: Product to find matches for
- candidates: Array of candidate products
- config: Optional configuration (merges with defaults)
Returns: MatchResult with candidates sorted by confidence
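How the signals are combined into a single confidence value is not spelled out in this README. One plausible reading of "adapts to available data" is a weighted average taken only over the signals that could be computed, followed by a threshold check. The sketch below assumes exactly that. `combineSignals`, `recommend`, the default thresholds, and the labels 'auto_match' and 'reject' are hypothetical; only 'review' and the weight names appear in the examples above.

```typescript
// Hypothetical sketch of multi-signal scoring; the package's real formula may differ.
interface Weights {
  identifierMatch: number
  titleSimilarity: number
  brandMatch: number
  specMatch: number
}

// Each signal is a 0-100 score; a missing key means the signal was unavailable.
type Signals = Partial<Record<keyof Weights, number>>

// Same weight values as in the configuration example.
const SKETCH_WEIGHTS: Weights = {
  identifierMatch: 0.5,
  titleSimilarity: 0.25,
  brandMatch: 0.15,
  specMatch: 0.1,
}

// Weighted average over the signals that are present, so missing data
// does not drag the score toward zero.
function combineSignals(signals: Signals, weights: Weights = SKETCH_WEIGHTS): number {
  let weighted = 0
  let totalWeight = 0
  for (const key of Object.keys(weights) as (keyof Weights)[]) {
    const score = signals[key]
    if (score === undefined) continue
    weighted += score * weights[key]
    totalWeight += weights[key]
  }
  return totalWeight === 0 ? 0 : weighted / totalWeight
}

// 'auto_match' and 'reject' are assumed labels for this sketch.
type Recommendation = 'auto_match' | 'review' | 'reject'

function recommend(
  confidence: number,
  autoMatchThreshold = 90, // assumed default
  reviewThreshold = 70,    // assumed default
): Recommendation {
  if (confidence >= autoMatchThreshold) return 'auto_match'
  if (confidence >= reviewThreshold) return 'review'
  return 'reject'
}

// No identifiers or specs available: only title and brand contribute.
const titleAndBrandOnly = combineSignals({ titleSimilarity: 70, brandMatch: 100 })
console.log(titleAndBrandOnly.toFixed(2), recommend(titleAndBrandOnly)) // 81.25 review
```

Renormalizing by the weights actually used is a design choice: it lets a title-plus-brand match still reach the review range, at the cost of being more permissive when little evidence is available.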
extractIdentifiers(product, sourceType)
Extract all identifiers from a product.
Parameters:
- product: Product data object
- sourceType: 'variant' | 'scrape_result' | 'company_product'
Returns: IdentifierPool with all extracted identifiers
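To give a feel for how identifiers embedded in unstructured text might be found, the sketch below pulls labeled part numbers and long digit runs out of a title with regular expressions. The patterns, the `extractFromTitle` name, and the `kind` labels are assumptions for illustration; the rules used by extractIdentifiers are not documented here and are likely more extensive.

```typescript
// Hypothetical sketch of title parsing; not the package's extraction rules.
interface TitleIdentifier {
  value: string
  kind: 'labeled_part_number' | 'digit_run'
}

function extractFromTitle(title: string): TitleIdentifier[] {
  const results: TitleIdentifier[] = []

  // Values introduced by a label such as "P/N: ABC-123" or "MPN ABC-123".
  const labeled = /\b(?:P\/N|MPN)[:\s]*([A-Z0-9][A-Z0-9-]*)/gi
  let m: RegExpExecArray | null
  while ((m = labeled.exec(title)) !== null) {
    results.push({ value: m[1], kind: 'labeled_part_number' })
  }

  // Standalone runs of 8 to 14 digits, the length range of common barcodes.
  const digits = /\b\d{8,14}\b/g
  while ((m = digits.exec(title)) !== null) {
    results.push({ value: m[0], kind: 'digit_run' })
  }

  return results
}

console.log(extractFromTitle('Acme Motor (P/N: ABC-123) 012345678'))
// two entries: 'ABC-123' (labeled_part_number) and '012345678' (digit_run)
```

The sketch deliberately stops at finding candidate values. Deciding whether a digit run is a UPC, and how much confidence to attach, is a separate classification step.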
findIdentifierMatches(sourcePool, candidatePool, enableFuzzy)
Find matching identifiers between two pools (cross-field).
Parameters:
- sourcePool: Source identifier pool
- candidatePool: Candidate identifier pool
- enableFuzzy: Enable fuzzy matching
Returns: Array of IdentifierMatch objects
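What fuzzy matching does beyond normalization is not described in this README. A common approach, assumed here, is to accept normalized identifiers that are within a small edit distance of each other. In the sketch below, `normalizeId`, `editDistance`, `compareIdentifiers`, the `maxDistance` default, and the minimum-length guard are all hypothetical; `normalizeId` merely reproduces the outputs shown for normalizeIdentifier in the Fuzzy Matching section.

```typescript
// Hypothetical sketch of exact vs fuzzy identifier comparison.

// Lowercase and strip separators: 'X-200-5HP' -> 'x2005hp'.
function normalizeId(value: string): string {
  return value.toLowerCase().replace(/[^a-z0-9]/g, '')
}

// Levenshtein distance using two rows of the dynamic-programming table.
function editDistance(a: string, b: string): number {
  let prev: number[] = []
  for (let j = 0; j <= b.length; j++) prev.push(j)
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i]
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
    }
    prev = curr
  }
  return prev[b.length]
}

type MatchKind = 'exact' | 'fuzzy' | 'none'

function compareIdentifiers(
  a: string,
  b: string,
  enableFuzzy: boolean,
  maxDistance = 1, // assumed tolerance
): MatchKind {
  const na = normalizeId(a)
  const nb = normalizeId(b)
  if (na === '' || nb === '') return 'none'
  if (na === nb) return 'exact'
  // Short identifiers one edit apart are often different products,
  // so fuzzy comparison is limited to longer values (assumed guard).
  const longEnough = Math.min(na.length, nb.length) >= 6
  if (enableFuzzy && longEnough && editDistance(na, nb) <= maxDistance) return 'fuzzy'
  return 'none'
}

console.log(compareIdentifiers('X-200-5HP', 'X 200 5HP', false)) // exact
console.log(compareIdentifiers('X200-5HP', 'X200-5HB', true))   // fuzzy
console.log(compareIdentifiers('X200-5HP', 'X200-5HB', false))  // none
```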
Real-World Example
// Scrape result with limited data
const scrapeResult = {
  title: 'ACME X200 MOTOR 5 HP 220V',
  price: 1399.99,
  sku: null, // Missing!
  upc: 'X-200', // Actually a model number, wrong field!
  brand: null, // Missing!
}

// Internal variant with good data
const variant = {
  sku: 'FP-X200-5HP',
  title: 'Acme X200 5HP Motor 220V',
  vendor: 'Acme Corp',
  upc: '012345678901',
  model_number: 'X200',
}

const result = findMatches(scrapeResult, [variant], {
  minConfidence: 65, // Lower threshold for missing data
})

// Matches because:
// 1. scrapeResult.upc ("X-200") matches variant.model_number ("X200") - CROSS-FIELD
// 2. Titles are very similar (token overlap)
// 3. Extracts "X200" and "5HP" from titles
// → Confidence: ~75% → Recommendation: review

Integration with Database
See the fpdash-server integration for database-backed matching with caching:
- product_matching_attempts - Logs all match attempts
- product_matches - Stores confirmed matches
- product_identifiers - Caches extracted identifiers for fast lookups
- matching_rules - Configuration storage
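The fpdash-server schema is not shown in this README, so the following is only an in-memory analogy for why caching extracted identifiers speeds up lookups: indexing products by normalized identifier turns "which products carry this value?" into a single map lookup instead of a scan over every product. The `IdentifierIndex` class and its methods are hypothetical and do not reflect the actual product_identifiers table.

```typescript
// Hypothetical in-memory stand-in for an identifier cache.
class IdentifierIndex {
  // normalized identifier -> ids of products that carry it
  private byKey = new Map<string, Set<number>>()

  private static toKey(identifier: string): string {
    return identifier.toLowerCase().replace(/[^a-z0-9]/g, '')
  }

  // Register every identifier extracted for a product.
  add(productId: number, identifiers: string[]): void {
    for (const identifier of identifiers) {
      const key = IdentifierIndex.toKey(identifier)
      if (key === '') continue
      const bucket = this.byKey.get(key) ?? new Set<number>()
      bucket.add(productId)
      this.byKey.set(key, bucket)
    }
  }

  // Return the ids of all products sharing this identifier, in any field.
  lookup(identifier: string): number[] {
    const bucket = this.byKey.get(IdentifierIndex.toKey(identifier))
    return bucket ? Array.from(bucket) : []
  }
}

const index = new IdentifierIndex()
index.add(1, ['FP-X200-5HP', '012345678901', 'X200'])
index.add(2, ['Y-900'])
console.log(index.lookup('X-200')) // [ 1 ]
```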
Testing
npm test

See src/index.test.ts for comprehensive test coverage including:
- Cross-field matching scenarios
- Missing identifier handling
- Identifier pollution cases
- Fuzzy matching
- Confidence scoring
License
ISC
