@vaicli/vai-workflow-collection-overlap-audit
v1.0.0
Compare two collections to find semantic duplicates. Samples documents from the source collection and searches for near-matches in the target.
vai-workflow-collection-overlap-audit
Organizations often ingest data from multiple sources into separate collections. Over time, the same content ends up in multiple collections, bloating storage costs and producing duplicate search results.
Install
vai workflow install vai-workflow-collection-overlap-audit
How It Works
- Sample source — Retrieve a representative sample from the source collection
- Cross-search — For each sampled document, search the target collection for near-matches
- Filter matches — Filter results above a similarity threshold to identify likely duplicates
- Report — Generate a structured overlap report with recommendations
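The four steps above can be sketched in plain Python. This is only an illustration under stated assumptions, not the workflow's actual implementation: the collections are modeled as in-memory dicts of embedding vectors, the cross-search is a brute-force cosine comparison, and the names `cosine` and `audit_overlap` are hypothetical.

```python
import math
import random

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def audit_overlap(source, target, sample_size, threshold, seed=0):
    """source/target map doc id -> embedding vector (assumed layout)."""
    rng = random.Random(seed)
    # Step 1: sample source
    sample_ids = rng.sample(sorted(source), min(sample_size, len(source)))
    duplicates = []
    for sid in sample_ids:
        # Step 2: cross-search, keeping the best match in the target
        best_id, best_score = None, -1.0
        for tid, tvec in target.items():
            score = cosine(source[sid], tvec)
            if score > best_score:
                best_id, best_score = tid, score
        # Step 3: filter matches at or above the similarity threshold
        if best_id is not None and best_score >= threshold:
            duplicates.append((sid, best_id, round(best_score, 4)))
    # Step 4: structured report (recommendations omitted in this sketch)
    return {
        "sampled": len(sample_ids),
        "duplicates": duplicates,
        "overlap_rate": len(duplicates) / len(sample_ids) if sample_ids else 0.0,
    }

# Toy data: only "a" has a near-match ("x") in the target.
source = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
target = {"x": [0.99, 0.05], "y": [-1.0, 0.2]}
report = audit_overlap(source, target, sample_size=3, threshold=0.90)
```

Because only a sample of the source is searched, the overlap rate is an estimate for the whole collection, and a larger sample tightens it at the cost of more searches.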
Execution Plan
Layer 1: sample_source
Layer 2: cross_search → filter_duplicates
Layer 3: overlap_report
Example Usage
vai workflow run vai-workflow-collection-overlap-audit \
--input source_collection="engineering_docs_v1" \
--input target_collection="engineering_docs_v2" \
--input similarity_threshold=0.90
What This Teaches
- The filter tool enables threshold-based decision making within a workflow
- Cross-collection search is achieved by sampling from one and searching the other
- The similarity_threshold input makes the workflow configurable without editing JSON
- LLM-generated reports transform raw search results into actionable recommendations
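To illustrate why a configurable threshold matters, here is a small sketch of how the number of flagged duplicates shifts as the threshold moves. The scores are made-up best-match similarities, and `count_duplicates` is a hypothetical helper, not part of the workflow.

```python
def count_duplicates(best_scores, threshold):
    """Count sampled documents whose best match meets the threshold."""
    return sum(1 for score in best_scores if score >= threshold)

# Made-up best-match similarity for six sampled documents.
best_scores = [0.99, 0.95, 0.91, 0.88, 0.72, 0.40]

# Same data, three thresholds: stricter values flag fewer documents.
sweep = {t: count_duplicates(best_scores, t) for t in (0.85, 0.90, 0.95)}
```

A lower threshold catches paraphrased or lightly edited copies but risks flagging documents that are merely related, so it is worth spot-checking a few matches near the cutoff before acting on a report.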
License
MIT © 2026 Michael Lynn
