@mineygg/optic-dedupe

v0.1.3

Published

12 days ago

Perceptual image deduplicator - handles static and animated images, cross/within-folder, configurable similarity threshold

Perceptual image deduplicator powered by sharp.

Install

npm install @mineygg/optic-dedupe        # library
npm install -g @mineygg/optic-dedupe     # global CLI

Requires Node.js >= 22.

CLI Usage

The primary CLI command is opticdd. The aliases optic-dedupe and @mineygg/optic-dedupe also work.

opticdd scan ./photos
opticdd dedupe ./photos ./archive
opticdd cache

scan

Scans one or more folders and prints a duplicate report. Does not modify any files.

# Scan a single folder
opticdd scan ./photos

# Compare two folders against each other
opticdd scan ./photos ./archive --mode cross --threshold 0.95

# Find duplicates within each subfolder independently
opticdd scan ./photos --mode within

# Skip folders, use 8 workers, show timing breakdown
opticdd scan ./photos -e ./photos/raw,./photos/temp -w 8 --debug

When duplicates are found you will be asked whether to open a browser-based cleanup UI to review and act on the results visually.

dedupe

Scans and then acts on the results. Without --yes it launches an interactive wizard.

# Interactive wizard
opticdd dedupe ./photos ./archive

# Non-interactive: move duplicates, keep the highest-quality original
opticdd dedupe ./photos ./archive \
  --mode cross \
  --threshold 0.95 \
  --strategy highest-quality \
  --action move \
  --yes

When --action move is used without --move-dir, duplicates are moved into a folder named .hashednamed_<timestamp> inside the first scan root. The original directory structure is mirrored inside that folder so relative paths are preserved. If a filename collision occurs a numeric suffix is added automatically.
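
The mirroring and collision rule above can be sketched as a small path planner. This is only an illustration: `planMoveDestination` is a hypothetical helper, not part of the package, and both the `_1` suffix format and the forward-slash path handling are assumptions.

```typescript
// Hypothetical helper, not part of the package API.
// Assumes normalized, forward-slash paths.
function planMoveDestination(
  root: string,
  moveDir: string,
  file: string,
  taken: Set<string>,
): string {
  // Keep the path relative to the scan root so the tree is mirrored.
  const relative = file.startsWith(root + "/")
    ? file.slice(root.length + 1)
    : file.replace(/^\/+/, "");
  const candidate = `${moveDir}/${relative}`;
  if (!taken.has(candidate)) {
    taken.add(candidate);
    return candidate;
  }
  // On collision, insert a numeric suffix before the extension.
  // The suffix format used by the real tool is an assumption here.
  const slash = candidate.lastIndexOf("/");
  const dot = candidate.lastIndexOf(".");
  const hasExt = dot > slash + 1;
  const stem = hasExt ? candidate.slice(0, dot) : candidate;
  const ext = hasExt ? candidate.slice(dot) : "";
  let n = 1;
  while (taken.has(`${stem}_${n}${ext}`)) n++;
  const unique = `${stem}_${n}${ext}`;
  taken.add(unique);
  return unique;
}

const taken = new Set<string>();
const first = planMoveDestination("/pics", "/pics/.dupes", "/pics/2021/a.jpg", taken);
const second = planMoveDestination("/pics", "/pics/.dupes", "/pics/2021/a.jpg", taken);
console.log(first, second);
```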

cache

opticdd cache            # show location, entry count, and size on disk
opticdd cache clear      # delete the cache

Flag Reference

All flags work on both scan and dedupe unless noted.

| Flag | Default | Description |
|---|---|---|
| -m, --mode <mode> | cross | cross: compare all files across all folders together. within: only compare files that share the same immediate parent directory. |
| -t, --threshold <n> | 0.8 | Similarity threshold 0.01–1.0. Higher = stricter. See Threshold Guide. |
| -e, --exclude <paths> | - | Comma-separated folder paths to skip entirely. |
| --no-animated | - | Skip animated images (GIF). |
| -v, --videos | false | Include video files. Requires FFmpeg on your PATH. |
| --vo, --videos-only | false | Scan only video files, ignore all images. |
| --no-cache | - | Skip cache entirely: rehash every file and write nothing back. |
| --no-chunk | - | Use a single 32×32 global hash instead of the default 4×4 tiled hashing. Faster but less accurate for images that differ only in a localised region. |
| --no-flip | - | Don't compute flip hashes. Horizontally mirrored duplicates won't be detected. If a full-feature cache entry already exists for a file it is reused without rehashing. |
| --no-colorhash | - | Skip the 4×4 color-grid verification. Removes the false-positive filter. |
| --quick | false | Faster profile: disables flip hashing, uses 3 animated samples (first/middle/last), and 5 video samples (instead of 10). |
| -w, --workers <n> | CPU count | Number of worker threads for parallel hashing. |
| --no-bucket-pair | false | Disable candidate bucketing before pair matching. By default bucketing is enabled for faster matching on large sets. |
| -d, --debug | false | Print a timing breakdown after scanning: time in sharp manipulation vs. hash math, per-file average, bottleneck. |
| -s, --strategy <s> | highest-quality | How to pick the "keep" copy within each duplicate group. Choices: oldest, newest, highest-quality, largest. See highest-quality Strategy. |
| -a, --action <a> | report | dedupe only. report: print only. move: move duplicates to a folder. delete: permanently delete duplicates. |
| --move-dir <dir> | auto .hashednamed_<ts> | dedupe only. Custom destination folder for moved files. |
| -y, --yes | false | dedupe only. Skip all interactive prompts and use flag values directly. |

Library API

import { scan, deduplicate } from "@mineygg/optic-dedupe";   // ESM
const { scan, deduplicate } = require("@mineygg/optic-dedupe"); // CJS

Core Library Usage

Use the core library when you want to call dedupe logic directly from your app or scripts instead of using the CLI.

import { scan, deduplicate } from "@mineygg/optic-dedupe";

// 1) Analyze duplicates only (no file changes)
const scanResult = await scan({
  folders: ["./photos", "./archive"],
  mode: "cross",
  threshold: 0.8,
  includeAnimated: true,
  cache: "use",
});

// 2) Apply actions (move/delete/report) programmatically
const run = await deduplicate({
  folders: ["./photos", "./archive"],
  mode: "cross",
  threshold: 0.8,
  includeAnimated: true,
  cache: "use",
  originalStrategy: "highest-quality",
  action: "move",
  moveTargetDir: "./duplicates",
});

console.log(scanResult.groups.length, run.actions.length);

For callback-based progress reporting, see Progress Events. For full option shapes, see Types.

Core Output Shapes

scan() returns a ScanResult object:

{
  scanned: number;
  groups: DuplicateGroup[];
  errors: { path: string; error: string }[];
  durationMs: number;
  cacheStats: { hits: number; misses: number };
}

deduplicate() returns:

{
  scan: ScanResult;
  actions: {
    original: string;
    duplicates: {
      path: string;
      action: "none" | "moved" | "deleted";
      destination?: string;
      error?: string;
    }[];
  }[];
}

Example shape:

const out = await deduplicate(opts);

console.log(out.scan.scanned);
console.log(out.scan.groups.length);
console.log(out.scan.errors.length);
console.log(out.actions[0]?.original);
console.log(out.actions[0]?.duplicates[0]?.action);

scan()

Scans folders and returns duplicate groups. No files are modified.

import { scan } from "@mineygg/optic-dedupe";

const result = await scan({
  folders: ["./photos", "./archive"],
  exclude: ["./photos/raw"],
  mode: "cross",           // "cross" | "within"
  threshold: 0.95,         // 0.01 – 1.0
  includeAnimated: true,
  includeVideos: false,
  includeVideosOnly: false,
  cache: "use",            // "use" | "ignore"
  workers: 4,
  hashFeatures: {
    chunk: true,           // 4×4 tiled hashing
    flip: true,            // detect horizontally mirrored duplicates
    color: true,           // color grid false-positive filter
  },
  hashSampling: {
    animatedFrameSamples: 5, // default animated/GIF sample count
    videoFrameSamples: 10,   // default video sample count
  },
});

console.log(result.scanned);      // total files processed
console.log(result.groups);       // DuplicateGroup[]
console.log(result.errors);       // { path: string; error: string }[]
console.log(result.durationMs);   // wall-clock time in ms
console.log(result.cacheStats);   // { hits: number; misses: number }

deduplicate()

Scans and applies an action to the duplicates found.

import { deduplicate } from "@mineygg/optic-dedupe";

const { scan: scanResult, actions } = await deduplicate({
  // all ScanOptions fields above, plus:
  originalStrategy: "highest-quality",  // "oldest" | "newest" | "highest-quality" | "largest"
  action: "move",                        // "report" | "move" | "delete"
  moveTargetDir: "./duplicates",         // optional, auto-named if omitted
});

for (const group of actions) {
  console.log("original:", group.original);
  for (const dup of group.duplicates) {
    console.log(dup.action, dup.path, dup.destination ?? "");
  }
}
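
When deduplicate() runs unattended, it can help to reduce the action list to a short tally. A minimal sketch follows; `summarizeActions` and the local interfaces are hypothetical and only mirror the documented result shape, with `action` typed loosely as a string.

```typescript
// Local mirror of the documented result shape, kept loose on purpose.
interface DuplicateOutcome {
  path: string;
  action: string;
  destination?: string;
  error?: string;
}

interface GroupOutcome {
  original: string;
  duplicates: DuplicateOutcome[];
}

// Hypothetical helper: count outcomes per action and collect failures.
function summarizeActions(groups: GroupOutcome[]) {
  const counts: Record<string, number> = {};
  const failures: { path: string; error: string }[] = [];
  for (const group of groups) {
    for (const dup of group.duplicates) {
      if (dup.error !== undefined) {
        failures.push({ path: dup.path, error: dup.error });
        continue;
      }
      counts[dup.action] = (counts[dup.action] ?? 0) + 1;
    }
  }
  return { counts, failures };
}

// Made-up sample data in the documented shape.
const summary = summarizeActions([
  {
    original: "a.png",
    duplicates: [
      { path: "a-copy.png", action: "moved", destination: "dupes/a-copy.png" },
      { path: "a-small.jpg", action: "moved", destination: "dupes/a-small.jpg" },
    ],
  },
  {
    original: "b.png",
    duplicates: [{ path: "b-copy.png", action: "moved", error: "EACCES" }],
  },
]);
console.log(summary.counts, summary.failures);
```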

Progress Events

Both functions accept an optional onProgress callback.

import { scan, type ProgressEvent } from "@mineygg/optic-dedupe";

await scan(opts, (evt: ProgressEvent) => {
  switch (evt.type) {
    case "scan-start":
      // evt.total: number of files found
      // evt.workers: worker thread count
      break;
    case "scan-progress":
      // evt.done, evt.total, evt.path, evt.cached (boolean)
      break;
    case "scan-error":
      // evt.path, evt.error
      break;
    case "group-start":
      // evt.imageCount: images being compared
      break;
    case "group-progress":
      // evt.done, evt.total (pairs compared)
      break;
    case "group-done":
      // evt.groups: number of duplicate groups found
      break;
    case "hash-debug":
      // evt.filesHashed, evt.manipulationMs, evt.hashingMs,
      // evt.totalMs, evt.manipulationPerFileMs, evt.hashingPerFileMs, evt.totalPerFileMs
      break;
    case "action-start":
      // evt.groups: number of groups being acted on
      break;
    case "action-done":
      // evt.results: ActionResult[]
      break;
  }
});
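
As an illustration of consuming these events, the sketch below turns a few of them into log lines. `ProgressSubset` and `describeProgress` are hypothetical names; the event fields are taken from the comments above, and only a subset of event types is covered.

```typescript
// Subset of the documented events; field names follow the comments above.
type ProgressSubset =
  | { type: "scan-start"; total: number; workers: number }
  | { type: "scan-progress"; done: number; total: number; path: string; cached: boolean }
  | { type: "scan-error"; path: string; error: string }
  | { type: "group-done"; groups: number };

// Hypothetical formatter: one human-readable line per event.
function describeProgress(evt: ProgressSubset): string {
  switch (evt.type) {
    case "scan-start":
      return `hashing ${evt.total} files on ${evt.workers} workers`;
    case "scan-progress": {
      const pct = evt.total === 0 ? 100 : Math.round((evt.done / evt.total) * 100);
      return `${pct}% ${evt.path}${evt.cached ? " (cached)" : ""}`;
    }
    case "scan-error":
      return `failed: ${evt.path}: ${evt.error}`;
    case "group-done":
      return `${evt.groups} duplicate groups`;
  }
}

console.log(describeProgress({ type: "scan-start", total: 120, workers: 4 }));
```

A function like this could be passed as the callback by wrapping it, for example `(evt) => console.log(describeProgress(evt))`, provided the unhandled event types are filtered out first.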

Types

interface ScanOptions {
  folders: string[];
  exclude: string[];
  mode: "cross" | "within";
  threshold: number;
  includeAnimated: boolean;
  includeVideos?: boolean;
  includeVideosOnly?: boolean;
  cache?: "use" | "ignore";
  workers?: number;
  hashFeatures?: Partial<HashFeatures>;
  hashSampling?: Partial<HashSampling>;
  bucketPair?: boolean;
}

interface HashFeatures {
  chunk: boolean;
  flip: boolean;
  color: boolean;
}

interface HashSampling {
  animatedFrameSamples: number;
  videoFrameSamples: number;
}

interface ScanResult {
  scanned: number;
  groups: DuplicateGroup[];
  errors: { path: string; error: string }[];
  durationMs: number;
  cacheStats: { hits: number; misses: number };
}

interface DuplicateGroup {
  original: ImageInfo;      // the chosen "keep" copy
  duplicates: ImageInfo[];  // all other members of the group
  similarity: number;       // average pairwise similarity, 0–1
}

interface ImageInfo {
  path: string;
  hash: string;             // hex XOR of all tile pHashes
  pHash: bigint[];          // per-tile perceptual hashes (16 values when chunk=true, 1 when false)
  pHashFlipped?: bigint[];  // same, computed on the horizontally mirrored image
  colors: number[];         // 4×4 RGB color grid — 48 values (16 tiles × R,G,B)
  width: number;
  height: number;
  size: number;             // file size in bytes
  format: string;
  mtime: number;            // modification time, ms since epoch
  frames: number;           // >1 means animated
}

// Returned by deduplicate(), one entry per duplicate group
interface ActionResult {
  original: string;
  duplicates: Array<{
    path: string;
    action: "reported" | "moved" | "deleted";
    destination?: string;  // set when action is "moved"
    error?: string;
  }>;
  groupSimilarity: number;
}

How It Works

1. Scanning

Each provided folder is walked recursively. In within mode, files are grouped by their immediate parent directory and only files sharing the same folder are compared against each other.
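
The within-mode partitioning can be pictured as a group-by on the parent directory. A minimal sketch, assuming forward-slash paths; `groupByParent` is a hypothetical name, not a package export.

```typescript
// Hypothetical helper: bucket file paths by their immediate parent directory.
// Assumes forward-slash separators.
function groupByParent(paths: string[]): Map<string, string[]> {
  const groups = new Map<string, string[]>();
  for (const p of paths) {
    const cut = p.lastIndexOf("/");
    const parent = cut === -1 ? "." : p.slice(0, cut);
    const bucket = groups.get(parent);
    if (bucket) bucket.push(p);
    else groups.set(parent, [p]);
  }
  return groups;
}

const byFolder = groupByParent([
  "photos/2021/a.jpg",
  "photos/2021/b.jpg",
  "photos/2022/a.jpg",
]);
// Only files inside the same bucket would be compared with each other.
console.log(byFolder);
```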

2. Hashing (parallel worker threads)

Each image goes through this pipeline in a worker thread:

Decoded and resized via sharp.
When chunk=true (default): resized to 128×128 and split into a 4×4 grid of 32×32 tiles. When chunk=false: resized to 32×32 as a single tile.
For each tile, a 2D DCT is computed. The top-left 8×8 DCT coefficients (excluding DC) are compared against their median to produce a 64-bit perceptual hash.
When flip=true (default): the same DCT process runs on the horizontally mirrored pixel data for each tile, producing a second set of hashes used for mirror-duplicate detection.
When color=true (default): the image is downsampled to a 4×4 RGB grid (48 values) stored alongside the hashes and used as a false-positive filter at comparison time.
For animated images (GIF): 5 frames are sampled at 0%, 25%, 50%, 75%, and 100% of the animation. Per-tile hashes across frames are combined via bitwise majority vote — each output bit is 1 if more than half the frames had that bit set. This avoids XOR cancellation where an even number of identical frames would zero out the hash.
For video files: same majority-vote approach with 10 frames sampled from 0% to 95% of duration (capped at 95% to avoid empty frames near the end).
With --quick: animated sampling is reduced to 3 frames (0%, 50%, 100%) and video sampling is reduced to 5 frames.
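
The frame sampling and bitwise majority vote described above can be sketched as follows. The function names are hypothetical, and evenly spaced sample positions are an assumption (the text only gives the endpoints for video).

```typescript
// Evenly spaced sample positions in [0, end], as fractions of the duration.
// Even spacing is an assumption for the video case.
function samplePositions(count: number, end = 1): number[] {
  if (count === 1) return [0];
  return Array.from({ length: count }, (_, i) => (i / (count - 1)) * end);
}

// Bitwise majority vote over 64-bit hashes: an output bit is 1 only if
// more than half of the input hashes have that bit set.
function majorityVote(hashes: bigint[]): bigint {
  let out = 0n;
  for (let bit = 0n; bit < 64n; bit++) {
    let ones = 0;
    for (const h of hashes) {
      if (((h >> bit) & 1n) === 1n) ones++;
    }
    if (ones * 2 > hashes.length) out |= 1n << bit;
  }
  return out;
}

// frames[f][t] is the hash of tile t in sampled frame f.
function combineFrames(frames: bigint[][]): bigint[] {
  const tileCount = frames[0].length;
  const combined: bigint[] = [];
  for (let t = 0; t < tileCount; t++) {
    combined.push(majorityVote(frames.map((frame) => frame[t])));
  }
  return combined;
}

console.log(samplePositions(5));       // animated default
console.log(samplePositions(10, 0.95)); // video default, capped at 95%
```

Two identical frames XOR to zero, while the vote keeps their shared bits, which is the cancellation problem the majority approach avoids.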

Cache hits are resolved on the main thread and never sent to workers. Workers are spawned once at pool creation and kept alive until all files are processed, with a task queue to keep all threads saturated.

When the native Rust binary is present, tile extraction + DCT + hash math runs in Rust rather than JavaScript.
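
For readers who want to see the per-tile hash math in code, here is a rough TypeScript sketch of a median-threshold DCT hash. It is not the package's implementation: the DCT is left unnormalized, the tile is assumed to be a row-major array of 32×32 grayscale values, and placing the DC term at bit 0 while excluding it from the median is one possible reading of the description above.

```typescript
const TILE = 32; // tile edge length in pixels
const LOW = 8;   // edge of the low-frequency block that is kept

// Unnormalized 2D DCT-II, evaluated only for the top-left LOW x LOW block.
function lowFrequencyDct(tile: number[]): number[] {
  const coeffs: number[] = [];
  for (let u = 0; u < LOW; u++) {
    for (let v = 0; v < LOW; v++) {
      let sum = 0;
      for (let y = 0; y < TILE; y++) {
        const cy = Math.cos(((2 * y + 1) * u * Math.PI) / (2 * TILE));
        for (let x = 0; x < TILE; x++) {
          const cx = Math.cos(((2 * x + 1) * v * Math.PI) / (2 * TILE));
          sum += tile[y * TILE + x] * cy * cx;
        }
      }
      coeffs.push(sum);
    }
  }
  return coeffs;
}

// 64-bit hash: bit i is set when coefficient i exceeds the median.
function tileHash(tile: number[]): bigint {
  const coeffs = lowFrequencyDct(tile);
  // Median over the 63 non-DC coefficients (index 0 is the DC term).
  const sorted = coeffs.slice(1).sort((a, b) => a - b);
  const median = sorted[Math.floor(sorted.length / 2)];
  let hash = 0n;
  coeffs.forEach((c, i) => {
    if (c > median) hash |= 1n << BigInt(i);
  });
  return hash;
}

// A horizontal ramp and a half-brightness copy of it.
const ramp = Array.from({ length: TILE * TILE }, (_, i) => i % TILE);
const dimmed = ramp.map((value) => value * 0.5);
console.log(tileHash(ramp).toString(16), tileHash(dimmed).toString(16));
```

Because every coefficient and the median scale together, a uniform brightness change leaves the bits unchanged, which is the property that makes this kind of hash tolerant of exposure and compression differences.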

3. Grouping

All hashed images are compared pairwise using Hamming distance on their tile hash arrays. For each pair, the minimum distance across normal and mirrored orientations is computed. When color=true, any pair whose 4×4 color grids have an average channel difference > 30 or a max single-tile difference > 45 is rejected regardless of pHash distance — this prevents false positives between structurally similar images (e.g. line art) with completely different color palettes.
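
The pair check can be sketched as below. Several details are assumptions rather than documented behavior: similarity is taken as one minus the fraction of differing bits, and the "single-tile difference" is read as the mean absolute difference over a tile's three channels. All names are hypothetical.

```typescript
interface Hashed {
  pHash: bigint[];
  pHashFlipped?: bigint[];
  colors: number[]; // 16 tiles x R,G,B
}

// Number of set bits in a non-negative bigint.
function popcount(value: bigint): number {
  let count = 0;
  let rest = value;
  while (rest > 0n) {
    count += Number(rest & 1n);
    rest >>= 1n;
  }
  return count;
}

// Total Hamming distance across two tile-hash arrays of equal length.
function tileDistance(a: bigint[], b: bigint[]): number {
  let total = 0;
  for (let i = 0; i < a.length; i++) total += popcount(a[i] ^ b[i]);
  return total;
}

// Assumed mapping from distance to a 0..1 similarity score,
// taking the better of the normal and mirrored orientation.
function similarity(a: Hashed, b: Hashed): number {
  let best = tileDistance(a.pHash, b.pHash);
  if (b.pHashFlipped) best = Math.min(best, tileDistance(a.pHash, b.pHashFlipped));
  return 1 - best / (a.pHash.length * 64);
}

// Color filter with the thresholds quoted above (30 average, 45 per tile).
function colorsCompatible(a: number[], b: number[]): boolean {
  let sum = 0;
  let worstTile = 0;
  for (let t = 0; t < a.length; t += 3) {
    const diff =
      Math.abs(a[t] - b[t]) + Math.abs(a[t + 1] - b[t + 1]) + Math.abs(a[t + 2] - b[t + 2]);
    sum += diff;
    worstTile = Math.max(worstTile, diff / 3);
  }
  return sum / a.length <= 30 && worstTile <= 45;
}

function isMatch(a: Hashed, b: Hashed, threshold: number): boolean {
  return similarity(a, b) >= threshold && colorsCompatible(a.colors, b.colors);
}
```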

Pairs within the threshold are merged using Union-Find for transitive closure: if A matches B and B matches C, all three end up in one group.
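
A minimal Union-Find sketch shows how matched pairs collapse into groups. This is a generic textbook version with path halving, offered as an illustration rather than the package's code; `groupMatches` is a hypothetical name.

```typescript
class UnionFind {
  private parent: number[];

  constructor(size: number) {
    // Every element starts as its own root.
    this.parent = Array.from({ length: size }, (_, i) => i);
  }

  find(x: number): number {
    let node = x;
    while (this.parent[node] !== node) {
      // Path halving: point each visited node at its grandparent.
      this.parent[node] = this.parent[this.parent[node]];
      node = this.parent[node];
    }
    return node;
  }

  union(a: number, b: number): void {
    const rootA = this.find(a);
    const rootB = this.find(b);
    if (rootA !== rootB) this.parent[rootB] = rootA;
  }
}

// Turn matching index pairs into groups of two or more members.
function groupMatches(count: number, pairs: [number, number][]): number[][] {
  const uf = new UnionFind(count);
  for (const [a, b] of pairs) uf.union(a, b);
  const byRoot = new Map<number, number[]>();
  for (let i = 0; i < count; i++) {
    const root = uf.find(i);
    const members = byRoot.get(root) ?? [];
    members.push(i);
    byRoot.set(root, members);
  }
  return Array.from(byRoot.values()).filter((group) => group.length > 1);
}

// A matches B and B matches C, so indices 0, 1, 2 form one group.
const groups = groupMatches(5, [[0, 1], [1, 2]]);
console.log(groups);
```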

When the native Rust binary is available the entire O(n²) comparison loop runs in Rust.

Unless --no-bucket-pair is passed, a conservative candidate-bucketing pass runs before the exact pair checks. This speeds up matching on larger datasets at the cost of possible recall loss on borderline matches.

4. Original selection

Within each group one image is kept and the rest are duplicates. See highest-quality Strategy for how that strategy works. oldest, newest, and largest sort purely by mtime or file size.

5. Action

Duplicates are reported, moved, or deleted depending on your configuration. Moved files mirror their original directory structure inside the target folder.

Threshold Guide

| Threshold | What it catches |
|---|---|
| 1.0 | Perceptually identical only |
| 0.95 | Same image, different compression or very minor quality loss |
| 0.90 | Resized, lightly edited, slight crops |
| 0.80 | Moderate edits, heavy compression, more aggressive crops (default) |
| 0.70 | Balanced, good starting point for mixed photo sets |
| 0.50 | Very loose, different subjects can match |

Start at 0.8 and tighten toward 0.95 if you are getting false positives, or loosen toward 0.6 if obvious duplicates are being missed.

Cache

After the first scan, hashes are written to a cache file on disk. On subsequent scans any file whose path + size + mtime are unchanged is read from cache instead of being rehashed.

Smart feature reuse: a cache entry whose features are a superset of what is requested is reused — the extra data is just ignored. Concretely:

Cached with flip=true, run with --no-flip → cache hit, flip hashes ignored, no rehash.
Cached with flip=false, run with flip=true → cache miss, rehash only those files.
Same logic applies to chunk and color.
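
The reuse rule can be expressed as a small predicate. This is a sketch with hypothetical names (`CacheEntry`, `isReusable`); the actual cache record layout is not documented here.

```typescript
interface Features {
  chunk: boolean;
  flip: boolean;
  color: boolean;
}

// Hypothetical cache record; the real on-disk layout may differ.
interface CacheEntry {
  size: number;
  mtime: number;
  features: Features;
}

// An entry is reusable when the file is unchanged and every requested
// feature was also computed when the entry was written.
function isReusable(
  entry: CacheEntry | undefined,
  size: number,
  mtime: number,
  wanted: Features,
): boolean {
  if (!entry) return false;
  if (entry.size !== size || entry.mtime !== mtime) return false;
  if (wanted.chunk && !entry.features.chunk) return false;
  if (wanted.flip && !entry.features.flip) return false;
  if (wanted.color && !entry.features.color) return false;
  return true;
}

const full: CacheEntry = {
  size: 1024,
  mtime: 1700000000000,
  features: { chunk: true, flip: true, color: true },
};
// Equivalent of running with --no-flip against a full-feature entry.
console.log(isReusable(full, 1024, 1700000000000, { chunk: true, flip: false, color: true }));
```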

Automatic stale-entry eviction: when a folder is scanned, any cache entries for files under that folder that no longer exist on disk are removed and the cache file is rewritten. The cache does not grow unbounded as files are deleted.
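
Eviction amounts to dropping entries under the scanned folder whose files are gone. In the sketch below the existence check is injected so the logic stays testable; `evictStale` is a hypothetical name and forward-slash path keys are an assumption.

```typescript
// Hypothetical helper: remove entries under `folder` whose file no longer exists.
// Returns the number of entries removed.
function evictStale<T>(
  cache: Map<string, T>,
  folder: string,
  exists: (path: string) => boolean,
): number {
  let removed = 0;
  // Copy the keys first so entries can be deleted while looping.
  for (const key of Array.from(cache.keys())) {
    if (key.startsWith(folder + "/") && !exists(key)) {
      cache.delete(key);
      removed++;
    }
  }
  return removed;
}

const cache = new Map<string, string>([
  ["photos/a.jpg", "hash-a"],
  ["photos/gone.jpg", "hash-gone"],
  ["archive/old.jpg", "hash-old"],
]);
const onDisk = new Set(["photos/a.jpg"]);
const removedCount = evictStale(cache, "photos", (path) => onDisk.has(path));
console.log(removedCount, Array.from(cache.keys()));
```

Entries outside the scanned folder are left alone, even if their files are missing, since only the scanned tree is checked.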

Both the superset lookup and eviction run through the native Rust cache layer when available.

opticdd cache            # show location, entry count, size on disk
opticdd cache clear      # wipe the cache
opticdd scan ./photos --no-cache  # skip cache for this run only

Hash Features

All three hash features (chunk, flip, color) are enabled by default. The bottleneck is the sharp decode and resize rather than the hash math, so disabling flip or color saves very little wall-clock time. Disabling chunk saves more because the resize target drops from 128×128 to 32×32.

| Flag | What it does | Performance impact |
|---|---|---|
| chunk | Hashes each of 16 tiles (4×4 grid) independently. Catches images that differ only in a localised region (e.g. different text on a shared background). Off = single 32×32 global hash. | Moderate: smaller resize target when off |
| flip | Computes a second hash set for the horizontally mirrored image. Catches mirror duplicates. | Small: image is already decoded, this is another DCT pass on the same data |
| color | Stores a 4×4 RGB color grid per image. Used after a pHash match to reject false positives. | Negligible |

highest-quality Strategy

highest-quality is not a simple pixel-count sort. It applies a tiered comparison:

Resolution dominance — if one image has more than 1.5× the pixel count of the other, it wins outright.
Format tier — lossless/pristine formats (PNG, TIFF, BMP, RAW) beat lossy ones (JPEG, WebP, AVIF, etc.) regardless of size.
File size dominance — within the same format tier, if one file is more than 1.25× larger it wins (indicating higher bitrate / less compression).
Resolution — higher pixel count wins.
File size — larger file wins on identical resolution.
Age — older file wins on exact ties (likely the original source).
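
The tiers above translate naturally into a pairwise comparison. The sketch below is an interpretation, not the package's code: the names are hypothetical, format strings are assumed to be lowercase identifiers such as "png" or "jpeg", and folding a group with reduce is a simplification because threshold-based pairwise rules are not guaranteed to be transitive.

```typescript
interface Candidate {
  path: string;
  width: number;
  height: number;
  size: number;   // bytes
  format: string; // assumed lowercase, e.g. "png", "jpeg"
  mtime: number;  // ms since epoch
}

const LOSSLESS = new Set(["png", "tiff", "bmp", "raw"]);

// Returns whichever candidate the tiered rules prefer.
function pickBetter(a: Candidate, b: Candidate): Candidate {
  const pixelsA = a.width * a.height;
  const pixelsB = b.width * b.height;
  // 1. Resolution dominance: more than 1.5x the pixels wins outright.
  if (pixelsA > pixelsB * 1.5) return a;
  if (pixelsB > pixelsA * 1.5) return b;
  // 2. Format tier: lossless beats lossy.
  const losslessA = LOSSLESS.has(a.format.toLowerCase());
  const losslessB = LOSSLESS.has(b.format.toLowerCase());
  if (losslessA !== losslessB) return losslessA ? a : b;
  // 3. File size dominance within the same tier: more than 1.25x larger wins.
  if (a.size > b.size * 1.25) return a;
  if (b.size > a.size * 1.25) return b;
  // 4. Resolution, then 5. file size.
  if (pixelsA !== pixelsB) return pixelsA > pixelsB ? a : b;
  if (a.size !== b.size) return a.size > b.size ? a : b;
  // 6. Age: the older file wins exact ties.
  return a.mtime <= b.mtime ? a : b;
}

function pickOriginal(group: Candidate[]): Candidate {
  return group.reduce(pickBetter);
}

const png: Candidate = { path: "a.png", width: 1000, height: 1000, size: 500, format: "png", mtime: 2 };
const jpeg: Candidate = { path: "a.jpg", width: 1000, height: 1000, size: 2000, format: "jpeg", mtime: 1 };
console.log(pickOriginal([jpeg, png]).path);
```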

Native Rust Acceleration

Prebuilt binaries are included for:

| Platform | Architectures |
|---|---|
| Windows | x64, arm64, ia32 |
| Linux (glibc) | x64, arm64, armv7 |
| Linux (musl) | x64, arm64 |
| macOS | x64, arm64 |

The native module handles three things:

Hash math — tile extraction, DCT, and pHash generation in Rust.
Pairwise comparison — the O(n²) Hamming distance loop in compiled Rust with no GC pressure.
Cache layer — the NDJSON cache backed by a Rust HashMap, including superset-compatible lookup and stale-entry eviction.

If no matching binary is found for your platform the library falls back silently to the pure TypeScript implementation.

Supported Formats

Static images: JPEG, PNG, WebP, AVIF, TIFF, BMP

Animated images: GIF

Video (requires FFmpeg on PATH): MP4, MKV, AVI, MOV, WEBM, FLV, M4V