@mineygg/optic-dedupe
v0.1.3
Perceptual image deduplicator - handles static and animated images, cross/within-folder, configurable similarity threshold
Perceptual image deduplicator powered by sharp.
Table of Contents
- Install
- CLI Usage
- Library API
- How It Works
- Threshold Guide
- Cache
- Hash Features
- highest-quality Strategy
- Native Rust Acceleration
- Supported Formats
Install
npm install @mineygg/optic-dedupe # library
npm install -g @mineygg/optic-dedupe # global CLI
Requires Node.js >= 22.
CLI Usage
The primary CLI command is opticdd. The aliases optic-dedupe and @mineygg/optic-dedupe also work.
opticdd scan ./photos
opticdd dedupe ./photos ./archive
opticdd cache
scan
Scans one or more folders and prints a duplicate report. Does not modify any files.
# Scan a single folder
opticdd scan ./photos
# Compare two folders against each other
opticdd scan ./photos ./archive --mode cross --threshold 0.95
# Find duplicates within each subfolder independently
opticdd scan ./photos --mode within
# Skip folders, use 8 workers, show timing breakdown
opticdd scan ./photos -e ./photos/raw,./photos/temp -w 8 --debug
When duplicates are found, you are asked whether to open a browser-based cleanup UI to review and act on the results visually.
dedupe
Scans and then acts on the results. Without --yes it launches an interactive wizard.
# Interactive wizard
opticdd dedupe ./photos ./archive
# Non-interactive: move duplicates, keep the highest-quality original
opticdd dedupe ./photos ./archive \
--mode cross \
--threshold 0.95 \
--strategy highest-quality \
--action move \
--yes
When --action move is used without --move-dir, duplicates are moved into a folder named .hashednamed_<timestamp> inside the first scan root. The original directory structure is mirrored inside that folder so relative paths are preserved. If a filename collision occurs, a numeric suffix is added automatically.
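The default move behaviour described above can be sketched as a pure path computation. This is only an illustration under stated assumptions, not the package's code: `planMoveDestination`, its parameters, and the `_1`, `_2` suffix format are hypothetical (the README says only that a numeric suffix is added), and files are assumed to live under the given scan root.

```typescript
import * as path from "node:path";

// Hypothetical helper: where would a duplicate land inside the move folder?
function planMoveDestination(
  scanRoot: string,
  moveDir: string,
  filePath: string,
  isTaken: (candidate: string) => boolean, // e.g. a wrapper around fs.existsSync
): string {
  // Mirror the file's path relative to the scan root inside the move folder.
  const relative = path.relative(scanRoot, filePath);
  const target = path.join(moveDir, relative);
  if (!isTaken(target)) return target;

  // On collision, try name_1.ext, name_2.ext, ... (suffix format is assumed).
  const { dir, name, ext } = path.parse(target);
  for (let n = 1; ; n++) {
    const candidate = path.join(dir, `${name}_${n}${ext}`);
    if (!isTaken(candidate)) return candidate;
  }
}
```

Passing the existence check as a callback keeps the sketch free of file-system access, so it can be exercised without touching disk.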
cache
opticdd cache # show location, entry count, and size on disk
opticdd cache clear # delete the cache
Flag Reference
All flags work on both scan and dedupe unless noted.
| Flag | Default | Description |
|---|---|---|
| -m, --mode <mode> | cross | cross — compare all files across all folders together. within — only compare files that share the same immediate parent directory. |
| -t, --threshold <n> | 0.8 | Similarity threshold 0.01–1.0. Higher = stricter. See Threshold Guide. |
| -e, --exclude <paths> | — | Comma-separated folder paths to skip entirely. |
| --no-animated | — | Skip animated images (GIF). |
| -v, --videos | false | Include video files. Requires FFmpeg on your PATH. |
| --vo, --videos-only | false | Scan only video files, ignore all images. |
| --no-cache | — | Skip cache entirely — rehash every file and write nothing back. |
| --no-chunk | — | Use a single 32×32 global hash instead of the default 4×4 tiled hashing. Faster but less accurate for images that differ only in a localised region. |
| --no-flip | — | Don't compute flip hashes. Horizontally mirrored duplicates won't be detected. If a full-feature cache entry already exists for a file it is reused — no rehash. |
| --no-colorhash | — | Skip the 4×4 color-grid verification. Removes the false-positive filter. |
| --quick | false | Faster profile: disables flip hashing, uses 3 animated samples (first/middle/last), and 5 video samples (instead of 10). |
| -w, --workers <n> | CPU count | Number of worker threads for parallel hashing. |
| --no-bucket-pair | false | Disable candidate bucketing before pair matching. By default bucketing is enabled for faster matching on large sets. |
| -d, --debug | false | Print a timing breakdown after scanning: time in sharp manipulation vs. hash math, per-file average, bottleneck. |
| -s, --strategy <s> | highest-quality | How to pick the "keep" copy within each duplicate group. Choices: oldest, newest, highest-quality, largest. See highest-quality Strategy. |
| -a, --action <a> | report | dedupe only. report — print only. move — move duplicates to a folder. delete — permanently delete duplicates. |
| --move-dir <dir> | auto .hashednamed_<ts> | dedupe only. Custom destination folder for moved files. |
| -y, --yes | false | dedupe only. Skip all interactive prompts and use flag values directly. |
Library API
import { scan, deduplicate } from "@mineygg/optic-dedupe"; // ESM
const { scan, deduplicate } = require("@mineygg/optic-dedupe"); // CJS
Core Library Usage
Use the core library when you want to call dedupe logic directly from your app or scripts instead of using the CLI.
import { scan, deduplicate } from "@mineygg/optic-dedupe";
// 1) Analyze duplicates only (no file changes)
const scanResult = await scan({
folders: ["./photos", "./archive"],
mode: "cross",
threshold: 0.8,
includeAnimated: true,
cache: "use",
});
// 2) Apply actions (move/delete/report) programmatically
const run = await deduplicate({
folders: ["./photos", "./archive"],
mode: "cross",
threshold: 0.8,
includeAnimated: true,
cache: "use",
originalStrategy: "highest-quality",
action: "move",
moveTargetDir: "./duplicates",
});
console.log(scanResult.groups.length, run.actions.length);
For callback-based progress reporting, see Progress Events. For full option shapes, see Types.
Core Output Shapes
scan() returns a ScanResult object:
{
scanned: number;
groups: DuplicateGroup[];
errors: { path: string; error: string }[];
durationMs: number;
cacheStats: { hits: number; misses: number };
}
deduplicate() returns:
{
scan: ScanResult;
actions: {
original: string;
duplicates: {
path: string;
action: "none" | "moved" | "deleted";
destination?: string;
error?: string;
}[];
}[];
}
Example shape:
const out = await deduplicate(opts);
console.log(out.scan.scanned);
console.log(out.scan.groups.length);
console.log(out.scan.errors.length);
console.log(out.actions[0]?.original);
console.log(out.actions[0]?.duplicates[0]?.action);
scan()
Scans folders and returns duplicate groups. No files are modified.
import { scan } from "@mineygg/optic-dedupe";
const result = await scan({
folders: ["./photos", "./archive"],
exclude: ["./photos/raw"],
mode: "cross", // "cross" | "within"
threshold: 0.95, // 0.01 – 1.0
includeAnimated: true,
includeVideos: false,
includeVideosOnly: false,
cache: "use", // "use" | "ignore"
workers: 4,
hashFeatures: {
chunk: true, // 4×4 tiled hashing
flip: true, // detect horizontally mirrored duplicates
color: true, // color grid false-positive filter
},
hashSampling: {
animatedFrameSamples: 5, // default animated/GIF sample count
videoFrameSamples: 10, // default video sample count
},
});
console.log(result.scanned); // total files processed
console.log(result.groups); // DuplicateGroup[]
console.log(result.errors); // { path: string; error: string }[]
console.log(result.durationMs); // wall-clock time in ms
console.log(result.cacheStats); // { hits: number; misses: number }
deduplicate()
Scans and applies an action to the duplicates found.
import { deduplicate } from "@mineygg/optic-dedupe";
const { scan: scanResult, actions } = await deduplicate({
// all ScanOptions fields above, plus:
originalStrategy: "highest-quality", // "oldest" | "newest" | "highest-quality" | "largest"
action: "move", // "report" | "move" | "delete"
moveTargetDir: "./duplicates", // optional, auto-named if omitted
});
for (const group of actions) {
console.log("original:", group.original);
for (const dup of group.duplicates) {
console.log(dup.action, dup.path, dup.destination ?? "");
}
}
Progress Events
Both functions accept an optional onProgress callback.
import { scan, type ProgressEvent } from "@mineygg/optic-dedupe";
await scan(opts, (evt: ProgressEvent) => {
switch (evt.type) {
case "scan-start":
// evt.total: number of files found
// evt.workers: worker thread count
break;
case "scan-progress":
// evt.done, evt.total, evt.path, evt.cached (boolean)
break;
case "scan-error":
// evt.path, evt.error
break;
case "group-start":
// evt.imageCount: images being compared
break;
case "group-progress":
// evt.done, evt.total (pairs compared)
break;
case "group-done":
// evt.groups: number of duplicate groups found
break;
case "hash-debug":
// evt.filesHashed, evt.manipulationMs, evt.hashingMs,
// evt.totalMs, evt.manipulationPerFileMs, evt.hashingPerFileMs, evt.totalPerFileMs
break;
case "action-start":
// evt.groups: number of groups being acted on
break;
case "action-done":
// evt.results: ActionResult[]
break;
}
});
Types
interface ScanOptions {
folders: string[];
exclude: string[];
mode: "cross" | "within";
threshold: number;
includeAnimated: boolean;
includeVideos?: boolean;
includeVideosOnly?: boolean;
cache?: "use" | "ignore";
workers?: number;
hashFeatures?: Partial<HashFeatures>;
hashSampling?: Partial<HashSampling>;
bucketPair?: boolean;
}
interface HashFeatures {
chunk: boolean;
flip: boolean;
color: boolean;
}
interface HashSampling {
animatedFrameSamples: number;
videoFrameSamples: number;
}
interface ScanResult {
scanned: number;
groups: DuplicateGroup[];
errors: { path: string; error: string }[];
durationMs: number;
cacheStats: { hits: number; misses: number };
}
interface DuplicateGroup {
original: ImageInfo; // the chosen "keep" copy
duplicates: ImageInfo[]; // all other members of the group
similarity: number; // average pairwise similarity, 0–1
}
interface ImageInfo {
path: string;
hash: string; // hex XOR of all tile pHashes
pHash: bigint[]; // per-tile perceptual hashes (16 values when chunk=true, 1 when false)
pHashFlipped?: bigint[]; // same, computed on the horizontally mirrored image
colors: number[]; // 4×4 RGB color grid — 48 values (16 tiles × R,G,B)
width: number;
height: number;
size: number; // file size in bytes
format: string;
mtime: number; // modification time, ms since epoch
frames: number; // >1 means animated
}
// Returned by deduplicate(), one entry per duplicate group
interface ActionResult {
original: string;
duplicates: Array<{
path: string;
action: "reported" | "moved" | "deleted";
destination?: string; // set when action is "moved"
error?: string;
}>;
groupSimilarity: number;
}
How It Works
1. Scanning
Each provided folder is walked recursively. In within mode, files are grouped by their immediate parent directory and only files sharing the same folder are compared against each other.
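As a rough illustration of the within-mode grouping described above, files can be bucketed by their immediate parent directory before any comparison takes place. The function name `groupByParentDir` is hypothetical and the sketch says nothing about how the package itself walks folders.

```typescript
import * as path from "node:path";

// Bucket file paths by immediate parent directory (within-mode sketch).
function groupByParentDir(files: string[]): Map<string, string[]> {
  const buckets = new Map<string, string[]>();
  for (const file of files) {
    const parent = path.dirname(file);
    const bucket = buckets.get(parent);
    if (bucket) bucket.push(file);
    else buckets.set(parent, [file]);
  }
  return buckets;
}
```

Each bucket would then be compared independently, so a file in ./photos/2023 is never matched against one in ./photos/2024.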
2. Hashing (parallel worker threads)
Each image goes through this pipeline in a worker thread:
- Decoded and resized via sharp.
- When chunk=true (default): resized to 128×128 and split into a 4×4 grid of 32×32 tiles. When chunk=false: resized to 32×32 as a single tile.
- For each tile, a 2D DCT is computed. The top-left 8×8 DCT coefficients (excluding DC) are compared against their median to produce a 64-bit perceptual hash.
- When flip=true (default): the same DCT process runs on the horizontally mirrored pixel data for each tile, producing a second set of hashes used for mirror-duplicate detection.
- When color=true (default): the image is downsampled to a 4×4 RGB grid (48 values) stored alongside the hashes and used as a false-positive filter at comparison time.
- For animated images (GIF): 5 frames are sampled at 0%, 25%, 50%, 75%, and 100% of the animation. Per-tile hashes across frames are combined via bitwise majority vote: each output bit is 1 if more than half the frames had that bit set. This avoids XOR cancellation, where an even number of identical frames would zero out the hash.
- For video files: same majority-vote approach with 10 frames sampled from 0% to 95% of duration (capped at 95% to avoid empty frames near the end).
- With --quick: animated sampling is reduced to 3 frames (0%, 50%, 100%) and video sampling is reduced to 5 frames.
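The per-tile DCT hash from the list above can be sketched as follows. This is a plain, unoptimised illustration and not the package's implementation. Two details are assumptions: the median is taken over the 63 non-DC coefficients, and the DC position is simply left as a 0 bit so the result still fits in 64 bits. The names `lowFrequencyDct` and `tileHash` are hypothetical.

```typescript
const TILE = 32; // tile edge in pixels
const LOW = 8; // edge of the low-frequency block that feeds the hash

// Orthonormal DCT-II scale factor for frequency index k.
function alpha(k: number): number {
  return k === 0 ? Math.sqrt(1 / TILE) : Math.sqrt(2 / TILE);
}

// Lowest LOW×LOW coefficients of the 2D DCT-II of a TILE×TILE grayscale tile.
function lowFrequencyDct(tile: Float64Array): Float64Array {
  const out = new Float64Array(LOW * LOW);
  for (let v = 0; v < LOW; v++) {
    for (let u = 0; u < LOW; u++) {
      let sum = 0;
      for (let y = 0; y < TILE; y++) {
        const cy = Math.cos(((2 * y + 1) * v * Math.PI) / (2 * TILE));
        for (let x = 0; x < TILE; x++) {
          const cx = Math.cos(((2 * x + 1) * u * Math.PI) / (2 * TILE));
          sum += tile[y * TILE + x] * cy * cx;
        }
      }
      out[v * LOW + u] = alpha(u) * alpha(v) * sum;
    }
  }
  return out;
}

// 64-bit hash: bit i is set when coefficient i exceeds the median of the AC terms.
function tileHash(tile: Float64Array): bigint {
  const coeffs = lowFrequencyDct(tile);
  // Typed arrays sort numerically by default; index 0 (DC) is excluded.
  const ac = Float64Array.from(coeffs.subarray(1)).sort();
  const median = ac[ac.length >> 1];
  let hash = 0n;
  for (let i = 1; i < coeffs.length; i++) {
    if (coeffs[i] > median) hash |= 1n << BigInt(i);
  }
  return hash;
}
```

Because every bit is a comparison against the median, uniformly scaling the tile's brightness leaves the hash unchanged, which is part of what makes this kind of hash tolerant of re-encoding.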
Cache hits are resolved on the main thread and never sent to workers. Workers are spawned once at pool creation and kept alive until all files are processed, with a task queue to keep all threads saturated.
When the native Rust binary is present, tile extraction + DCT + hash math runs in Rust rather than JavaScript.
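The bitwise majority vote used for animated images and videos can be illustrated with a small function. It is a sketch of the technique as described in the hashing steps, with a hypothetical name (`majorityVote`) and an assumed 64-bit hash width.

```typescript
// Combine per-frame hashes: an output bit is 1 only when more than half
// of the frames have that bit set.
function majorityVote(frameHashes: bigint[], bits = 64): bigint {
  let out = 0n;
  for (let bit = 0; bit < bits; bit++) {
    const mask = 1n << BigInt(bit);
    let ones = 0;
    for (const h of frameHashes) {
      if ((h & mask) !== 0n) ones++;
    }
    if (ones * 2 > frameHashes.length) out |= mask;
  }
  return out;
}
```

For two identical frames with hash 0b1010, XOR would produce 0, while the majority vote returns 0b1010 again.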
3. Grouping
All hashed images are compared pairwise using Hamming distance on their tile hash arrays. For each pair, the minimum distance across normal and mirrored orientations is computed. When color=true, any pair whose 4×4 color grids have an average channel difference > 30 or a max single-tile difference > 45 is rejected regardless of pHash distance — this prevents false positives between structurally similar images (e.g. line art) with completely different color palettes.
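A minimal sketch of the pair check described above, under several assumptions: the per-tile colour difference is taken as the mean absolute R, G, B difference for that tile, and the similarity score is derived as 1 minus the fraction of differing bits. Neither formula is stated in this README, and the function names are hypothetical.

```typescript
// Number of set bits in a non-negative bigint.
function popcount(x: bigint): number {
  let n = 0;
  while (x !== 0n) {
    x &= x - 1n; // clear the lowest set bit
    n++;
  }
  return n;
}

// Total Hamming distance across two equally long tile-hash arrays.
function hammingDistance(a: bigint[], b: bigint[]): number {
  let d = 0;
  for (let i = 0; i < a.length; i++) d += popcount(a[i] ^ b[i]);
  return d;
}

// Minimum distance over the normal and mirrored orientations.
function pairDistance(a: bigint[], b: bigint[], bFlipped?: bigint[]): number {
  const normal = hammingDistance(a, b);
  return bFlipped ? Math.min(normal, hammingDistance(a, bFlipped)) : normal;
}

// Assumed mapping from distance to a 0..1 similarity score.
function similarity(distance: number, tiles: number, bitsPerTile = 64): number {
  return 1 - distance / (tiles * bitsPerTile);
}

// Colour filter: reject when the average difference exceeds 30 or any
// single tile differs by more than 45 (thresholds taken from the text above).
function colorsCompatible(a: number[], b: number[]): boolean {
  let total = 0;
  let maxTile = 0;
  const tiles = a.length / 3; // 48 values = 16 tiles × (R, G, B)
  for (let t = 0; t < a.length; t += 3) {
    const tileDiff =
      (Math.abs(a[t] - b[t]) + Math.abs(a[t + 1] - b[t + 1]) + Math.abs(a[t + 2] - b[t + 2])) / 3;
    total += tileDiff;
    maxTile = Math.max(maxTile, tileDiff);
  }
  return total / tiles <= 30 && maxTile <= 45;
}
```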
Pairs within the threshold are merged using Union-Find for transitive closure: if A matches B and B matches C, all three end up in one group.
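The transitive merge can be illustrated with a textbook Union-Find over image indices. This is a generic sketch of the data structure, not the package's code, and `groupPairs` is a hypothetical name.

```typescript
class UnionFind {
  private parent: number[];

  constructor(n: number) {
    this.parent = Array.from({ length: n }, (_, i) => i);
  }

  // Find the representative of x, halving the path along the way.
  find(x: number): number {
    while (this.parent[x] !== x) {
      this.parent[x] = this.parent[this.parent[x]];
      x = this.parent[x];
    }
    return x;
  }

  union(a: number, b: number): void {
    const ra = this.find(a);
    const rb = this.find(b);
    if (ra !== rb) this.parent[rb] = ra;
  }
}

// Turn matching index pairs into groups; singletons are not duplicates.
function groupPairs(count: number, pairs: Array<[number, number]>): number[][] {
  const uf = new UnionFind(count);
  for (const [a, b] of pairs) uf.union(a, b);

  const byRoot = new Map<number, number[]>();
  for (let i = 0; i < count; i++) {
    const root = uf.find(i);
    const members = byRoot.get(root);
    if (members) members.push(i);
    else byRoot.set(root, [i]);
  }
  return Array.from(byRoot.values()).filter((g) => g.length > 1);
}
```

With five images and matches (0, 1) and (1, 2), images 0, 1, and 2 end up in one group even though 0 and 2 were never matched directly.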
When the native Rust binary is available the entire O(n²) comparison loop runs in Rust.
Unless --no-bucket-pair is passed, a conservative candidate-bucketing pass runs before the exact pair checks. This improves speed on larger datasets at the cost of possible recall loss on borderline matches.
4. Original selection
Within each group one image is kept and the rest are treated as duplicates. See highest-quality Strategy for how that strategy works. The oldest and newest strategies sort purely by mtime, and largest sorts purely by file size.
5. Action
Duplicates are reported, moved, or deleted depending on your configuration. Moved files mirror their original directory structure inside the target folder.
Threshold Guide
| Threshold | What it catches |
|---|---|
| 1.0 | Perceptually identical only |
| 0.95 | Same image, different compression or very minor quality loss |
| 0.90 | Resized, lightly edited, slight crops |
| 0.80 | Moderate edits, heavy compression, more aggressive crops (default) |
| 0.70 | Balanced — good starting point for mixed photo sets |
| 0.50 | Very loose — different subjects can match |
Start at 0.8 and tighten toward 0.95 if you are getting false positives, or loosen toward 0.6 if obvious duplicates are being missed.
Cache
After the first scan, hashes are written to a cache file on disk. On subsequent scans any file whose path + size + mtime are unchanged is read from cache instead of being rehashed.
Smart feature reuse: a cache entry whose features are a superset of what is requested is reused — the extra data is just ignored. Concretely:
- Cached with flip=true, run with --no-flip → cache hit, flip hashes ignored, no rehash.
- Cached with flip=false, run with flip=true → cache miss, rehash only those files.
- Same logic applies to chunk and color.
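The hit rule described above (unchanged size and mtime, plus cached features that cover the requested ones) can be sketched like this. The `CacheEntry` shape and the function names are hypothetical; the on-disk format of the real cache is not documented here.

```typescript
interface HashFeatures {
  chunk: boolean;
  flip: boolean;
  color: boolean;
}

// Hypothetical cache entry: only the fields needed for the hit decision.
interface CacheEntry {
  size: number;
  mtime: number;
  features: HashFeatures;
}

// True when every requested feature is already present in the cached entry.
function coversRequested(cached: HashFeatures, requested: HashFeatures): boolean {
  return (
    (cached.chunk || !requested.chunk) &&
    (cached.flip || !requested.flip) &&
    (cached.color || !requested.color)
  );
}

// A hit needs an unchanged file and a feature superset.
function isCacheHit(
  entry: CacheEntry | undefined,
  stat: { size: number; mtime: number },
  requested: HashFeatures,
): boolean {
  if (!entry) return false;
  if (entry.size !== stat.size || entry.mtime !== stat.mtime) return false;
  return coversRequested(entry.features, requested);
}
```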
Automatic stale-entry eviction: when a folder is scanned, any cache entries for files under that folder that no longer exist on disk are removed and the cache file is rewritten. The cache does not grow unbounded as files are deleted.
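Stale-entry eviction can be pictured as a sweep over the cache keys that fall under the scanned folder. The sketch below assumes the cache is keyed by file path, which this README does not state, and uses a hypothetical `evictStale` helper with an injected existence check.

```typescript
import * as path from "node:path";

// Remove entries under scannedFolder whose files no longer exist.
// Returns the number of evicted entries.
function evictStale<T>(
  cache: Map<string, T>,
  scannedFolder: string,
  exists: (filePath: string) => boolean, // e.g. a wrapper around fs.existsSync
): number {
  // The trailing separator keeps /photos from matching /photos2.
  const root = path.resolve(scannedFolder) + path.sep;
  let removed = 0;
  for (const key of Array.from(cache.keys())) {
    if (path.resolve(key).startsWith(root) && !exists(key)) {
      cache.delete(key);
      removed++;
    }
  }
  return removed;
}
```

Entries outside the scanned folder are left alone, so scanning one folder never drops cached hashes that belong to another.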
Both the superset lookup and eviction run through the native Rust cache layer when available.
opticdd cache # show location, entry count, size on disk
opticdd cache clear # wipe the cache
opticdd scan ./photos --no-cache # skip cache for this run only
Hash Features
All three flags are enabled by default. The real bottleneck is always sharp decode + resize, not the math — disabling flip or color saves very little wall-clock time. Disabling chunk saves more because the resize target drops from 128×128 to 32×32.
| Flag | What it does | Performance impact |
|---|---|---|
| chunk | Hashes each of 16 tiles (4×4 grid) independently. Catches images that differ only in a localised region (e.g. different text on a shared background). Off = single 32×32 global hash. | Moderate — smaller resize target when off |
| flip | Computes a second hash set for the horizontally mirrored image. Catches mirror duplicates. | Small — image is already decoded, this is another DCT pass on the same data |
| color | Stores a 4×4 RGB color grid per image. Used after a pHash match to reject false positives. | Negligible |
highest-quality Strategy
highest-quality is not a simple pixel-count sort. It applies a tiered comparison:
- Resolution dominance — if one image has more than 1.5× the pixel count of the other, it wins outright.
- Format tier — lossless/pristine formats (PNG, TIFF, BMP, RAW) beat lossy ones (JPEG, WebP, AVIF, etc.) regardless of size.
- File size dominance — within the same format tier, if one file is more than 1.25× larger it wins (indicating higher bitrate / less compression).
- Resolution — higher pixel count wins.
- File size — larger file wins on identical resolution.
- Age — older file wins on exact ties (likely the original source).
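The tiers above can be read as a pairwise comparison applied in order. The sketch below follows that reading; it is not the package's code, `compareQuality` and `pickOriginal` are hypothetical names, and the lowercase format strings in `LOSSLESS` are an assumption about how formats are spelled in `ImageInfo.format`.

```typescript
interface Candidate {
  width: number;
  height: number;
  size: number; // bytes
  format: string;
  mtime: number; // ms since epoch
}

// Assumed spelling of the lossless tier.
const LOSSLESS = new Set(["png", "tiff", "bmp", "raw"]);

// Negative result: a should be kept over b. Positive: b over a.
function compareQuality(a: Candidate, b: Candidate): number {
  const pixelsA = a.width * a.height;
  const pixelsB = b.width * b.height;

  // 1. Resolution dominance: more than 1.5× the pixels wins outright.
  if (pixelsA > pixelsB * 1.5) return -1;
  if (pixelsB > pixelsA * 1.5) return 1;

  // 2. Format tier: lossless beats lossy.
  const losslessA = LOSSLESS.has(a.format.toLowerCase());
  const losslessB = LOSSLESS.has(b.format.toLowerCase());
  if (losslessA !== losslessB) return losslessA ? -1 : 1;

  // 3. File size dominance within the same tier: more than 1.25× larger wins.
  if (a.size > b.size * 1.25) return -1;
  if (b.size > a.size * 1.25) return 1;

  // 4. Resolution, 5. file size, 6. age (older wins).
  if (pixelsA !== pixelsB) return pixelsB - pixelsA;
  if (a.size !== b.size) return b.size - a.size;
  return a.mtime - b.mtime;
}

// Threshold-based tiers are not guaranteed to be transitive, so this picks the
// keeper with a linear pass instead of relying on Array.prototype.sort.
function pickOriginal(group: Candidate[]): Candidate {
  return group.reduce((best, next) => (compareQuality(next, best) < 0 ? next : best));
}
```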
Native Rust Acceleration
Prebuilt binaries are included for:
| Platform | Architectures |
|---|---|
| Windows | x64, arm64, ia32 |
| Linux (glibc) | x64, arm64, armv7 |
| Linux (musl) | x64, arm64 |
| macOS | x64, arm64 |
The native module handles three things:
- Hash math: tile extraction, DCT, and pHash generation in Rust.
- Pairwise comparison: the O(n²) Hamming distance loop in compiled Rust with no GC pressure.
- Cache layer: the NDJSON cache backed by a Rust HashMap, including superset-compatible lookup and stale-entry eviction.
If no matching binary is found for your platform the library falls back silently to the pure TypeScript implementation.
Supported Formats
Static images: JPEG, PNG, WebP, AVIF, TIFF, BMP
Animated images: GIF
Video (requires FFmpeg on PATH): MP4, MKV, AVI, MOV, WEBM, FLV, M4V
