@factorypure/matching-engine

v1.0.10

Published

a month ago

Robust data matching system for product catalog with cross-field identifier matching

oliviafactorypure

akokkofactorypure

geofffactorypure

matching fuzzy-matching product-matching identifier-extraction

@factorypure/matching-engine

Robust data matching system for product catalogs with unreliable identifier fields and cross-field matching capabilities.

The Problem

When integrating product data from multiple sources (web scraping, supplier catalogs, internal databases), identifier fields are often:

Missing: SKU, UPC, and model number fields are frequently empty
Inconsistent: The same value appears under different field names across sources
Polluted: UPC values are stored in SKU fields, model numbers in UPC fields, and so on

Example of the core challenge:

const variant = {
    sku: 'TEST-123',
    upc: null,
    title: 'Acme X200 Motor',
}

const scrapeResult = {
    sku: null,
    upc: 'TEST-123', // SAME VALUE, DIFFERENT FIELD
    title: 'ACME X200 MOTOR',
}

// Traditional field-specific matching would miss this!
// Our cross-field matching finds it.

Solution

This package provides:

Identifier Extraction - Finds identifiers in structured fields AND unstructured text (titles, descriptions)
Cross-Field Matching - Compares ALL identifiers from source A against ALL identifiers from source B regardless of field names
Fuzzy Matching - Handles typos, formatting variations, normalization
Confidence Scoring - Multi-signal scoring that adapts to available data quality
Type Classification - Determines if a value is a UPC, SKU, model number, etc.

Installation

npm install @factorypure/matching-engine

Quick Start

import { findMatches } from '@factorypure/matching-engine'

const sourceProduct = {
    id: 1,
    sku: 'X200-5HP',
    title: 'Acme X200 Motor 5HP',
    vendor: 'Acme Corp',
}

const candidates = [
    {
        id: 2,
        upc: 'X 200 5HP', // Cross-field match (different field name)
        title: 'ACME X200 MOTOR',
        brand: 'Acme Corp',
    },
    {
        id: 3,
        title: 'Different Product',
        brand: 'Other Brand',
    },
]

const result = findMatches(sourceProduct, candidates, {
    minConfidence: 70,
    autoMatchThreshold: 90,
    enableFuzzyMatching: true,
})

console.log(result.candidates)
// [
//   {
//     candidateId: 2,
//     confidence: 87.5,
//     recommendation: 'review',
//     matchReasons: [
//       {
//         type: 'identifier_match',
//         fieldA: 'sku',
//         fieldB: 'upc',     // CROSS-FIELD!
//         value: 'X200-5HP',
//         score: 100,
//         details: 'Cross-field match'
//       },
//       { type: 'title_similarity', score: 85 },
//       { type: 'brand_match', score: 100 }
//     ],
//     identifierMatches: [/* detailed match info */]
//   }
// ]

Key Features

1. Identifier Extraction

Extracts identifiers from multiple sources:

import { extractIdentifiers } from '@factorypure/matching-engine'

const product = {
    id: 1,
    sku: 'X200-5HP', // Structured SKU field
    upc: null,
    title: 'Acme Motor (P/N: ABC-123) 012345678', // Embedded identifiers
}

const pool = extractIdentifiers(product, 'variant')

console.log(pool.identifiers)
// [
//   { value: 'X200-5HP', type: 'sku', source: 'sku', confidence: 0.8 },
//   { value: 'ABC-123', type: 'part_number', source: 'title_parsed', confidence: 0.85 },
//   { value: '012345678', type: 'upc', source: 'title_parsed', confidence: 0.8 }
// ]
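
To make the idea of pulling identifiers out of free text concrete, here is a minimal sketch using two regular expressions. It is not the package's implementation: `extractFromTitleSketch`, the `kind` labels, and both patterns are hypothetical, and a production extractor would need many more rules.

```typescript
// Hypothetical sketch: find labelled part numbers and long digit runs in a title.
// The patterns below are illustrative assumptions, not the package's actual rules.
interface ParsedIdentifierSketch {
    value: string
    kind: 'part_number' | 'numeric_code'
    source: 'title_parsed'
}

function extractFromTitleSketch(title: string): ParsedIdentifierSketch[] {
    const found: ParsedIdentifierSketch[] = []
    let match: RegExpExecArray | null

    // Labelled part numbers such as "P/N: ABC-123" or "PN ABC-123"
    const partNumberPattern = /\bP\/?N[:\s]+([A-Z0-9][A-Z0-9-]*)/gi
    while ((match = partNumberPattern.exec(title)) !== null) {
        found.push({ value: match[1], kind: 'part_number', source: 'title_parsed' })
    }

    // Standalone runs of 8 to 14 digits, the length range of common barcode formats
    const digitRunPattern = /\b\d{8,14}\b/g
    while ((match = digitRunPattern.exec(title)) !== null) {
        found.push({ value: match[0], kind: 'numeric_code', source: 'title_parsed' })
    }

    return found
}

const parsedFromTitle = extractFromTitleSketch('Acme Motor (P/N: ABC-123) 012345678')
console.log(parsedFromTitle.map((item) => item.value)) // [ 'ABC-123', '012345678' ]
```

The sketch only locates candidate values; deciding whether a digit run is really a UPC is a separate classification step.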

2. Cross-Field Matching

Compares ALL identifiers regardless of field name:

import { extractIdentifiers, findIdentifierMatches } from '@factorypure/matching-engine'

// variant and scrapeResult are the objects from "The Problem" above
const variantPool = extractIdentifiers(variant, 'variant')
const resultPool = extractIdentifiers(scrapeResult, 'scrape_result')

const matches = findIdentifierMatches(variantPool, resultPool, true)

// Finds matches even when:
// - variant.sku === scrapeResult.upc
// - variant.model === scrapeResult.sku
// - Embedded in title vs structured field
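
The all-against-all comparison can be sketched without the package as a nested loop over identifier fields. Everything here is an assumption made for illustration: the field list `ID_FIELDS`, the helper names, and the simple normalization are hypothetical and stand in for whatever the package does internally.

```typescript
// Hypothetical sketch of cross-field comparison: every identifier on one side
// is compared with every identifier on the other, ignoring field names.
type ProductRecordSketch = Record<string, string | number | null | undefined>

interface CrossFieldHit {
    fieldA: string
    fieldB: string
    value: string
}

// Assumed list of identifier-bearing fields; a real catalog may use others.
const ID_FIELDS = ['sku', 'upc', 'model', 'model_number', 'mpn']

function normalizeForCompare(raw: string): string {
    // Lowercase and drop everything except letters and digits
    return raw.toLowerCase().replace(/[^a-z0-9]/g, '')
}

function crossFieldMatchesSketch(a: ProductRecordSketch, b: ProductRecordSketch): CrossFieldHit[] {
    const hits: CrossFieldHit[] = []
    for (const fieldA of ID_FIELDS) {
        const valueA = a[fieldA]
        if (typeof valueA !== 'string' || valueA === '') continue
        for (const fieldB of ID_FIELDS) {
            const valueB = b[fieldB]
            if (typeof valueB !== 'string' || valueB === '') continue
            if (normalizeForCompare(valueA) === normalizeForCompare(valueB)) {
                hits.push({ fieldA, fieldB, value: valueA })
            }
        }
    }
    return hits
}

const crossFieldHits = crossFieldMatchesSketch(
    { sku: 'TEST-123', upc: null, title: 'Acme X200 Motor' },
    { sku: null, upc: 'TEST-123', title: 'ACME X200 MOTOR' },
)
console.log(crossFieldHits) // [ { fieldA: 'sku', fieldB: 'upc', value: 'TEST-123' } ]
```

A field-by-field comparison (sku against sku, upc against upc) would return nothing for the same input, because each side has the value in a field the other side leaves empty.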

3. Fuzzy Matching

Handles variations and typos:

import { normalizeIdentifier } from '@factorypure/matching-engine'

normalizeIdentifier('X-200-5HP') // "x2005hp"
normalizeIdentifier('X 200 5HP') // "x2005hp"
normalizeIdentifier('X_200_5HP') // "x2005hp"

// All normalize to same value → MATCH
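
Normalization covers formatting differences; typos need a tolerance on top of it. The following sketch combines a normalizer that reproduces the outputs shown above with a Levenshtein edit distance. The function names and the `maxEdits` tolerance are assumptions, and the package may use a different similarity measure.

```typescript
// Hypothetical sketch: normalize, then allow a small edit distance for typos.
function normalizeSketch(raw: string): string {
    // Lowercase and drop everything except letters and digits
    return raw.toLowerCase().replace(/[^a-z0-9]/g, '')
}

// Classic dynamic-programming Levenshtein distance (insert, delete, substitute)
function editDistance(a: string, b: string): number {
    let previous: number[] = []
    for (let j = 0; j <= b.length; j++) previous.push(j)
    for (let i = 1; i <= a.length; i++) {
        const current: number[] = [i]
        for (let j = 1; j <= b.length; j++) {
            const substitution = previous[j - 1] + (a[i - 1] === b[j - 1] ? 0 : 1)
            current.push(Math.min(previous[j] + 1, current[j - 1] + 1, substitution))
        }
        previous = current
    }
    return previous[b.length]
}

// maxEdits is an assumed tolerance, not a documented package setting.
function fuzzyEqualSketch(a: string, b: string, maxEdits = 1): boolean {
    return editDistance(normalizeSketch(a), normalizeSketch(b)) <= maxEdits
}

console.log(normalizeSketch('X 200 5HP')) // "x2005hp"
console.log(fuzzyEqualSketch('X200-5HP', 'X-200-5HP')) // true (identical after normalization)
console.log(fuzzyEqualSketch('X200-5HP', 'X200-5HB')) // true (one substituted character)
console.log(fuzzyEqualSketch('X200-5HP', 'Z900-1KW')) // false
```

A fixed tolerance of one edit is risky for short identifiers, where a single character can distinguish two real products, so a length-dependent threshold may be a better fit in practice.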

4. Confidence Scoring

Adapts to available data:

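import { findMatches } from '@factorypure/matching-engine'

// sourceProduct and candidates as defined in Quick Start
const result = findMatches(sourceProduct, candidates, {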
    weights: {
        identifierMatch: 0.5, // 50% - Highest priority
        titleSimilarity: 0.25, // 25%
        brandMatch: 0.15, // 15%
        specMatch: 0.1, // 10%
    },
})

// When identifiers are missing, title+brand can still produce matches
// When identifiers are present, confidence is boosted
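
One way a score could adapt to missing signals is to take a weighted average over only the signals that are available. The sketch below assumes exactly that; the renormalization step and all names are hypothetical, and the package's real formula may differ.

```typescript
// Hypothetical sketch: weighted average over whichever signals are available.
// Renormalizing by the available weights is an assumption, not documented behavior.
interface SignalWeights {
    identifierMatch: number
    titleSimilarity: number
    brandMatch: number
    specMatch: number
}

// Scores are on a 0-100 scale; a missing signal is simply left out.
type SignalScores = Partial<Record<keyof SignalWeights, number>>

function combineScoresSketch(scores: SignalScores, weights: SignalWeights): number {
    const keys: (keyof SignalWeights)[] = ['identifierMatch', 'titleSimilarity', 'brandMatch', 'specMatch']
    let weightedSum = 0
    let availableWeight = 0
    for (const key of keys) {
        const score = scores[key]
        if (score === undefined) continue
        weightedSum += score * weights[key]
        availableWeight += weights[key]
    }
    return availableWeight === 0 ? 0 : weightedSum / availableWeight
}

const weightsSketch: SignalWeights = {
    identifierMatch: 0.5,
    titleSimilarity: 0.25,
    brandMatch: 0.15,
    specMatch: 0.1,
}

// No identifiers on either side: (70 * 0.25 + 100 * 0.15) / 0.40, about 81.25
console.log(combineScoresSketch({ titleSimilarity: 70, brandMatch: 100 }, weightsSketch))

// Same title and brand plus an exact identifier match: 82.5 / 0.90, about 91.67
console.log(combineScoresSketch({ identifierMatch: 100, titleSimilarity: 70, brandMatch: 100 }, weightsSketch))
```

Under this scheme a candidate is not penalized for data that neither side provides, while an identifier match, carrying half the total weight, pulls the result up sharply.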

5. Type Classification

Determines what each value represents:

import { classifyIdentifier } from '@factorypure/matching-engine'

classifyIdentifier('012345678901', 'upc_field')
// { type: 'upc', confidence: 0.95 }

classifyIdentifier('012345678901', 'sku_field') // Wrong field name!
// { type: 'upc', confidence: 0.70 }  // Still classifies correctly by pattern

classifyIdentifier('X200-5HP', 'model')
// { type: 'model', confidence: 0.85 }
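
The behavior above (pattern decides the type, field name adjusts confidence) can be approximated in a few lines. This is a sketch under stated assumptions: `classifySketch`, its patterns, and its confidence numbers are hypothetical and only mimic the shape of the documented results.

```typescript
// Hypothetical sketch: classify by value pattern first, then let the field name
// adjust confidence. Patterns and confidence values are illustrative assumptions.
interface ClassificationSketch {
    type: 'upc' | 'ean' | 'model' | 'unknown'
    confidence: number
}

function classifySketch(value: string, fieldHint: string): ClassificationSketch {
    const trimmed = value.trim()
    const hint = fieldHint.toLowerCase()
    const hintMentions = (word: string): boolean => hint.indexOf(word) !== -1

    // Exactly 12 digits has the shape of a UPC-A code, whatever the field is called
    if (/^\d{12}$/.test(trimmed)) {
        return { type: 'upc', confidence: hintMentions('upc') ? 0.95 : 0.7 }
    }
    // Exactly 13 digits has the shape of an EAN-13 code
    if (/^\d{13}$/.test(trimmed)) {
        return { type: 'ean', confidence: hintMentions('ean') ? 0.95 : 0.7 }
    }
    // A mix of letters and digits, optionally with separators, looks like a model number
    if (/^(?=.*[A-Za-z])(?=.*\d)[A-Za-z0-9][A-Za-z0-9 _-]*$/.test(trimmed)) {
        return { type: 'model', confidence: hintMentions('model') ? 0.85 : 0.6 }
    }
    return { type: 'unknown', confidence: 0 }
}

console.log(classifySketch('012345678901', 'sku_field')) // { type: 'upc', confidence: 0.7 }
console.log(classifySketch('X200-5HP', 'model')) // { type: 'model', confidence: 0.85 }
```

The sketch cannot tell a SKU from a model number by pattern alone, which is one reason the field name is still worth keeping as a secondary signal.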

Configuration

import { DEFAULT_MATCHING_CONFIG } from '@factorypure/matching-engine'

const customConfig = {
    ...DEFAULT_MATCHING_CONFIG,
    minConfidence: 75, // Minimum confidence to include in results
    autoMatchThreshold: 95, // Auto-approve above this
    reviewThreshold: 75, // Manual review between reviewThreshold and autoMatchThreshold
    maxCandidates: 10, // Max results to return
    enableFuzzyMatching: true, // Allow fuzzy identifier matching
    brandRequired: true, // Penalize brand mismatches
    weights: {
        identifierMatch: 0.5,
        titleSimilarity: 0.25,
        brandMatch: 0.15,
        specMatch: 0.1,
    },
}
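
How the three thresholds could partition a confidence value is sketched below. Only the `'review'` recommendation appears in the examples above; the `'auto_match'` and `'reject'` labels, the inclusive comparisons, and the function itself are assumptions made for illustration.

```typescript
// Hypothetical sketch of how the thresholds could partition confidence.
// 'auto_match' and 'reject' are assumed labels; only 'review' is shown in the README.
interface ThresholdsSketch {
    minConfidence: number
    reviewThreshold: number
    autoMatchThreshold: number
}

type RecommendationSketch = 'auto_match' | 'review' | 'reject'

function recommendSketch(confidence: number, t: ThresholdsSketch): RecommendationSketch | null {
    // Below minConfidence the candidate is dropped from the results entirely
    if (confidence < t.minConfidence) return null
    if (confidence >= t.autoMatchThreshold) return 'auto_match'
    if (confidence >= t.reviewThreshold) return 'review'
    return 'reject'
}

const thresholdsSketch: ThresholdsSketch = { minConfidence: 75, reviewThreshold: 75, autoMatchThreshold: 95 }

console.log(recommendSketch(97, thresholdsSketch)) // 'auto_match'
console.log(recommendSketch(87.5, thresholdsSketch)) // 'review'
console.log(recommendSketch(60, thresholdsSketch)) // null
```

With `minConfidence` equal to `reviewThreshold`, as in the configuration above, every returned candidate is at least a review candidate and the `'reject'` branch is never reached.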

API Reference

`findMatches(sourceProduct, candidates, config?)`

Main matching function.

Parameters:

sourceProduct: Product to find matches for
candidates: Array of candidate products
config: Optional configuration (merges with defaults)

Returns: MatchResult with candidates sorted by confidence
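
Merging a partial config with defaults has one common pitfall: a plain object spread replaces the whole nested `weights` object. The sketch below shows a merge that handles `weights` key by key. It is an assumption about what "merges with defaults" could mean; the types, the placeholder defaults, and `mergeConfigSketch` are hypothetical.

```typescript
// Hypothetical sketch: merge a partial config over defaults, including nested weights.
interface WeightsSketch {
    identifierMatch: number
    titleSimilarity: number
    brandMatch: number
    specMatch: number
}

interface ConfigSketch {
    minConfidence: number
    autoMatchThreshold: number
    enableFuzzyMatching: boolean
    weights: WeightsSketch
}

// Callers may override any top-level option and any individual weight
type ConfigOverridesSketch = Partial<Omit<ConfigSketch, 'weights'>> & { weights?: Partial<WeightsSketch> }

// Placeholder defaults for this sketch; not the package's DEFAULT_MATCHING_CONFIG
const DEFAULTS_SKETCH: ConfigSketch = {
    minConfidence: 70,
    autoMatchThreshold: 90,
    enableFuzzyMatching: true,
    weights: { identifierMatch: 0.5, titleSimilarity: 0.25, brandMatch: 0.15, specMatch: 0.1 },
}

function mergeConfigSketch(overrides: ConfigOverridesSketch = {}): ConfigSketch {
    return {
        ...DEFAULTS_SKETCH,
        ...overrides,
        // Merge weights key by key so a partial override keeps the other defaults
        weights: { ...DEFAULTS_SKETCH.weights, ...overrides.weights },
    }
}

const mergedSketch = mergeConfigSketch({ minConfidence: 65, weights: { titleSimilarity: 0.4 } })
console.log(mergedSketch.minConfidence) // 65
console.log(mergedSketch.weights.identifierMatch) // 0.5
```

If weights are overridden individually, the result may no longer sum to 1, so a scoring step that divides by the total weight is a sensible companion to this kind of merge.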

`extractIdentifiers(product, sourceType)`

Extract all identifiers from a product.

Parameters:

product: Product data object
sourceType: 'variant' | 'scrape_result' | 'company_product'

Returns: IdentifierPool with all extracted identifiers

`findIdentifierMatches(sourcePool, candidatePool, enableFuzzy)`

Find matching identifiers between two pools (cross-field).

Parameters:

sourcePool: Source identifier pool
candidatePool: Candidate identifier pool
enableFuzzy: Enable fuzzy matching

Returns: Array of IdentifierMatch objects

Real-World Example

import { findMatches } from '@factorypure/matching-engine'

// Scrape result with limited data
const scrapeResult = {
    title: 'ACME X200 MOTOR 5 HP 220V',
    price: 1399.99,
    sku: null, // Missing!
    upc: 'X-200', // Actually a model number, wrong field!
    brand: null, // Missing!
}

// Internal variant with good data
const variant = {
    sku: 'FP-X200-5HP',
    title: 'Acme X200 5HP Motor 220V',
    vendor: 'Acme Corp',
    upc: '012345678901',
    model_number: 'X200',
}

const result = findMatches(scrapeResult, [variant], {
    minConfidence: 65, // Lower threshold for missing data
})

// Matches because:
// 1. scrapeResult.upc ("X-200") matches variant.model_number ("X200") after normalization - CROSS-FIELD
// 2. Titles are very similar (token overlap)
// 3. Identifiers "X200" and "5HP" are extracted from the titles
// → Confidence: ~75% → Recommendation: review
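
The "token overlap" signal mentioned in the comments can be illustrated with a Jaccard index over title tokens. This is a sketch of one common way to measure overlap, not necessarily the measure the package uses; both function names are hypothetical.

```typescript
// Hypothetical sketch of token-overlap title similarity using the Jaccard index.
function uniqueTokens(title: string): string[] {
    const tokens = title
        .toLowerCase()
        .split(/[^a-z0-9]+/)
        .filter((token) => token.length > 0)
    // Keep the first occurrence of each token
    return tokens.filter((token, index) => tokens.indexOf(token) === index)
}

function jaccardTitleSimilarity(a: string, b: string): number {
    const tokensA = uniqueTokens(a)
    const tokensB = uniqueTokens(b)
    const shared = tokensA.filter((token) => tokensB.indexOf(token) !== -1).length
    const unionSize = tokensA.length + tokensB.length - shared
    // Shared tokens divided by all distinct tokens across both titles
    return unionSize === 0 ? 0 : shared / unionSize
}

// Shared: acme, x200, motor, 220v. Distinct overall: those four plus 5, hp, 5hp.
console.log(jaccardTitleSimilarity('ACME X200 MOTOR 5 HP 220V', 'Acme X200 5HP Motor 220V')) // 4 / 7, about 0.57
```

Plain token overlap treats "5 HP" and "5HP" as different tokens, which is why identifier extraction and normalization are useful alongside title similarity.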

Integration with Database

See the fpdash-server integration for database-backed matching with caching:

product_matching_attempts - Logs all match attempts
product_matches - Stores confirmed matches
product_identifiers - Caches extracted identifiers for fast lookups
matching_rules - Configuration storage

Testing

npm test

See src/index.test.ts for comprehensive test coverage including:

Cross-field matching scenarios
Missing identifier handling
Identifier pollution cases
Fuzzy matching
Confidence scoring

License

ISC
