@entryscape/entrywarden
v0.9.0
Duplicate detection and case management for RDF entity registries in EntryStore.
EntryWarden
A library for detecting and managing duplicate entities in registries. EntryWarden identifies potential duplicates through configurable similarity detection, presents them for human review, and blocks duplicates from entering the system during harvest.
How it works
During harvest, the harvester sends each entity's RDF graph to the warden. The warden processes it internally:
- Partition matching — the entity is assigned to a partition (e.g., restaurants or hotels). Entities that match no partition are ignored.
- Candidate selection — the similarity index finds entities in the same partition that share indexed tokens, geohash cells, or exact values.
- Scoring — each candidate is scored against the entity across all configured fields. Pairs above the partition's threshold become matches.
- Case management — matches are grouped into cases. New cases are created, existing cases are expanded, and a primary entity is auto-selected as the reference point.
The warden returns a keep or block verdict to the harvester based on whether the entity is blocked in any active case.
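The first three steps above can be sketched in a few lines. This is only an illustration under stated assumptions: the names `HarvestedEntity`, `PartitionRule`, `createWarden`, and `scoreNames` are hypothetical, the index holds name tokens only, and the scorer is a plain token-overlap (Jaccard) stand-in rather than EntryWarden's configured per-field scoring.

```typescript
// Hypothetical shapes for illustration; not EntryWarden's actual API.
type HarvestedEntity = { uri: string; type: string; name: string };
type PartitionRule = { name: string; rdfType: string; threshold: number };
type PairMatch = { existing: string; incoming: string; score: number };

// Lower-cased, de-duplicated name tokens.
const tokenize = (s: string): string[] =>
  Array.from(new Set(s.toLowerCase().split(/\s+/).filter(Boolean)));

// Stand-in scorer: Jaccard overlap of name tokens, in [0, 1].
function scoreNames(a: string, b: string): number {
  const ta = tokenize(a);
  const tb = tokenize(b);
  const shared = ta.filter((t) => tb.indexOf(t) !== -1).length;
  const union = ta.length + tb.length - shared;
  return union === 0 ? 0 : shared / union;
}

function createWarden(partitions: PartitionRule[]) {
  // Inverted index keyed by "<partition>|<token>".
  const index = new Map<string, HarvestedEntity[]>();

  return function submit(entity: HarvestedEntity): PairMatch[] {
    // 1. Partition matching: entities that match no partition are ignored.
    const partition = partitions.find((p) => p.rdfType === entity.type);
    if (!partition) return [];

    // 2. Candidate selection: same partition, at least one shared token.
    const candidates: HarvestedEntity[] = [];
    for (const token of tokenize(entity.name)) {
      const key = `${partition.name}|${token}`;
      const bucket = index.get(key) ?? [];
      for (const c of bucket) {
        if (candidates.indexOf(c) === -1) candidates.push(c);
      }
      index.set(key, [...bucket, entity]); // index the newcomer for later entities
    }

    // 3. Scoring: pairs above the partition's threshold become matches.
    return candidates
      .map((c) => ({
        existing: c.uri,
        incoming: entity.uri,
        score: scoreNames(c.name, entity.name),
      }))
      .filter((m) => m.score > partition.threshold);
  };
}

const warden = createWarden([
  { name: "hotels", rdfType: "Hotel", threshold: 0.8 },
  { name: "restaurants", rdfType: "Restaurant", threshold: 0.8 },
]);
const firstHotel = warden({ uri: "urn:hotel-a", type: "Hotel", name: "Hotel Grand" });
const secondHotel = warden({ uri: "urn:hotel-b", type: "Hotel", name: "Grand Hotel" });
const restaurant = warden({ uri: "urn:restaurant-a", type: "Restaurant", name: "Hotel Grand" });
const unknownType = warden({ uri: "urn:museum-a", type: "Museum", name: "Hotel Grand" });
```

In this sketch only `secondHotel` contains a match (against `urn:hotel-a`). The restaurant shares the name but lives in a different partition, and the museum matches no partition at all.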
Cases start as pending and are presented for human review via EntryStore. A reviewer confirms the duplicates, decides which entity to keep as primary, and which to block. Once a case is set to active, its blocking decisions are enforced during subsequent harvests.
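The relationship between case status and the harvest verdict can be illustrated as follows. The `DuplicateCase` shape and the `verdictFor` function are assumptions made for this sketch, not the library's data model (see Data model for that). The one rule taken from the text is that only active cases are enforced.

```typescript
// Hypothetical case shape for illustration only.
type CaseStatus = "pending" | "active" | "disabled";
type DuplicateCase = {
  status: CaseStatus;
  primary: string;   // entity kept as the reference point
  blocked: string[]; // entities the reviewer decided to block
};

// An entity is blocked only if some *active* case blocks it;
// pending and disabled cases have no effect on the verdict.
function verdictFor(entityUri: string, cases: DuplicateCase[]): "keep" | "block" {
  const blocked = cases.some(
    (c) => c.status === "active" && c.blocked.indexOf(entityUri) !== -1,
  );
  return blocked ? "block" : "keep";
}

const hotelCase: DuplicateCase = {
  status: "pending",
  primary: "urn:hotel-a",
  blocked: ["urn:hotel-b"],
};

const beforeReview = verdictFor("urn:hotel-b", [hotelCase]); // case still pending
hotelCase.status = "active";                                  // reviewer activates the case
const afterReview = verdictFor("urn:hotel-b", [hotelCase]);
const primaryVerdict = verdictFor("urn:hotel-a", [hotelCase]);
```

Here `beforeReview` is `"keep"` because the case is still pending, `afterReview` is `"block"`, and the primary entity is kept throughout.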
Example
Consider a registry with two partitions — hotels and restaurants — each with their own fields and thresholds:
- The hotels partition compares entities by geographic position and name.
- The restaurants partition does the same, but independently.
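Comparing by geographic position usually starts by bucketing coordinates into cells, and candidate selection is described above as using geohash cells. The following is a minimal sketch of the standard geohash encoding. The helper name `geohashEncode` and the default precision of 6 characters are assumptions for illustration and say nothing about how EntryWarden configures its index.

```typescript
// Standard geohash alphabet (base32 without a, i, l, o).
const GEOHASH_ALPHABET = "0123456789bcdefghjkmnpqrstuvwxyz";

// Encode a latitude/longitude pair as a geohash of `precision` characters.
function geohashEncode(lat: number, lon: number, precision = 6): string {
  let latLo = -90, latHi = 90;
  let lonLo = -180, lonHi = 180;
  let hash = "";
  let bits = 0;        // bits collected for the current character
  let value = 0;       // numeric value of those bits
  let useLon = true;   // bits alternate: longitude first, then latitude

  while (hash.length < precision) {
    if (useLon) {
      const mid = (lonLo + lonHi) / 2;
      if (lon >= mid) { value = value * 2 + 1; lonLo = mid; }
      else { value = value * 2; lonHi = mid; }
    } else {
      const mid = (latLo + latHi) / 2;
      if (lat >= mid) { value = value * 2 + 1; latLo = mid; }
      else { value = value * 2; latHi = mid; }
    }
    useLon = !useLon;
    bits += 1;
    if (bits === 5) {
      // Every 5 bits become one base32 character.
      hash += GEOHASH_ALPHABET.charAt(value);
      bits = 0;
      value = 0;
    }
  }
  return hash;
}

const cellA = geohashEncode(48.856, 2.352);    // a point in Paris
const cellB = geohashEncode(48.856, 2.352);    // same coordinates, same cell
const farCell = geohashEncode(40.7128, -74.006); // a point in New York
const coarse = geohashEncode(48.856, 2.352, 3);  // "u09"
```

Entities at identical coordinates always share a cell, so they become candidates for each other. Points close to a cell border can land in different cells, which is why geohash-based indexes commonly also look at neighbouring cells; whether EntryWarden does so is not stated here.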
During harvest, two hotel entities arrive:
| Entity | Type | Name | Location |
|---|---|---|---|
| urn:hotel-a | Hotel | Hotel Grand | 48.856°N, 2.352°E |
| urn:hotel-b | Hotel | Grand Hotel | 48.856°N, 2.352°E |
Both are assigned to the hotels partition. The scoring algorithm finds high similarity — same location, near-identical name (token set ratio handles the word reordering). A case is created linking the two. After human review, urn:hotel-b is blocked in favour of urn:hotel-a.
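To see why word order does not matter for the name comparison, here is a simplified sketch of a token set ratio. It follows the usual construction (compare the shared tokens against each side's full sorted token string and take the best score), but the underlying `ratio` is a normalised Levenshtein similarity chosen for brevity. That choice, and the function names, are assumptions; EntryWarden's own implementation may differ.

```typescript
// Levenshtein edit distance (insert, delete, substitute each cost 1).
function levenshtein(a: string, b: string): number {
  let prev: number[] = [];
  for (let j = 0; j <= b.length; j++) prev.push(j);
  for (let i = 1; i <= a.length; i++) {
    const curr: number[] = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a.charAt(i - 1) === b.charAt(j - 1) ? 0 : 1;
      curr.push(Math.min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost));
    }
    prev = curr;
  }
  return prev[b.length];
}

// Similarity in [0, 1]; two empty strings count as 0 so blank names never match.
function ratio(a: string, b: string): number {
  const longest = Math.max(a.length, b.length);
  return longest === 0 ? 0 : 1 - levenshtein(a, b) / longest;
}

function tokenSetRatio(a: string, b: string): number {
  const tokensOf = (s: string): string[] =>
    Array.from(new Set(s.toLowerCase().split(/\s+/).filter(Boolean)));
  const ta = tokensOf(a);
  const tb = tokensOf(b);

  // Split tokens into the shared part and each side's remainder, all sorted.
  const shared = ta.filter((t) => tb.indexOf(t) !== -1).sort();
  const onlyA = ta.filter((t) => tb.indexOf(t) === -1).sort();
  const onlyB = tb.filter((t) => ta.indexOf(t) === -1).sort();

  const base = shared.join(" ");
  const fullA = shared.concat(onlyA).join(" ");
  const fullB = shared.concat(onlyB).join(" ");

  // The best of the three pairwise comparisons is the score.
  return Math.max(ratio(base, fullA), ratio(base, fullB), ratio(fullA, fullB));
}

const reordered = tokenSetRatio("Hotel Grand", "Grand Hotel");
const unrelated = tokenSetRatio("Hotel Grand", "Pizzeria Roma");
```

Because both names reduce to the same sorted token string, `reordered` is exactly 1, while `unrelated` stays well below 0.5. One property of this measure to keep in mind is that a name whose tokens are a subset of the other's (for example "Hotel Grand" and "Grand Hotel Paris") also scores 1, which is one reason to combine it with other fields such as location.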
Later, a restaurant entity arrives:
| Entity | Type | Name | Location |
|---|---|---|---|
| urn:restaurant-a | Restaurant | Hotel Grand | 48.856°N, 2.352°E |
Despite having the same name and location, this entity is assigned to the restaurants partition. It is only compared against other restaurants — never against hotels. No match is found, so it passes through without a case being created.
Getting started
Run warden seed to build a similarity index and warden report to inspect the duplicate-detection report. See Usage for the full CLI reference and for embedding EntryWarden programmatically.
Key concepts
- Case — a group of entities flagged as potential duplicates, progressing through pending, active, and disabled states
- Partition — a subset of entities compared with each other, based on type (e.g., rdf:type)
- Primary entity — the preferred entity in a case; all evidence is stored as comparisons against it
- Evidence — per-field similarity details explaining why entities were matched
Documentation
- Usage — CLI and programmatic usage
- Configuration — partitions, fields, locale, thresholds, CLI config block
- Rules — scoring, case manager, and enforcement rules
- Terminology — definitions of all key terms
- Data model — structure of cases, entity references, and evidence
- Vocabulary — RDF vocabulary for expressing cases
- Architecture — technical decisions and constraints
Testing
pnpm test # unit tests
pnpm test:integration # integration tests (requires Docker)