idedupebox
Image deduplication CLI tool
idedupebox is a simple tool that deduplicates images recursively in a directory using perceptual hashing (phash), via the sharp-phash library.
Full parallelisation is on the todo list, but the tool is reasonably efficient as it is (tested on ~166K images, which were handled in roughly half an hour on a Ryzen 7900). If this is a deal breaker for you, please get in touch and I'll add better parallelisation (or even better, send a PR?).
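For the curious, the core idea is to compute a perceptual hash per image and treat images whose hashes are within a small Hamming distance of each other as duplicates. The sketch below is a minimal illustration of that idea, not idedupebox's actual implementation; it assumes sharp-phash's default export accepts an image buffer and resolves to a 64-character binary string.

```js
// dedupe-sketch.mjs: minimal illustration of phash-based duplicate detection.
// Not idedupebox's actual code; assumes sharp-phash's default export accepts an
// image buffer and resolves to a 64-character binary string (e.g. "0110...01").
import { readFile } from 'node:fs/promises';
import phash from 'sharp-phash';

const THRESHOLD = 5; // max differing bits (out of 64) to call two images duplicates

// Hamming distance between two equal-length binary strings
function hamming(a, b) {
	let diff = 0;
	for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) diff++;
	return diff;
}

// Hash every image passed on the command line
const hashes = [];
for (const filepath of process.argv.slice(2)) {
	hashes.push({ filepath, hash: await phash(await readFile(filepath)) });
}

// Naive O(n^2) pairwise comparison: fine for a sketch, too slow for ~166K images
for (let i = 0; i < hashes.length; i++) {
	for (let j = i + 1; j < hashes.length; j++) {
		if (hamming(hashes[i].hash, hashes[j].hash) <= THRESHOLD)
			console.log(`likely duplicates: ${hashes[i].filepath} <-> ${hashes[j].filepath}`);
	}
}
```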
System requirements
- Node.js + npm (bundled by default)
- Some command-line knowledge
- Some images in a directory to deduplicate
Installation
Install this package via npm:
```
npm install -g idedupebox
```
...you may need to run the above command with sudo.
This will expose a command idedupebox.
From source
If you'd rather install from source, do so by cloning the repository:
```
git clone https://codeberg.org/sbrl/idedupebox.git;
cd idedupebox;
```
Then, install dependencies:
```
npm install
```
Now follow the getting started instructions below, replacing idedupebox with src/index.mjs - don't forget to cd into the repository's directory first.
Getting started
This tool has 3 subcommands:
- dedupe: Walks a directory recursively, hashing all images as it goes. Spits out a list of duplicate clusters in a .jsonl file.
- visualise: Uses the .jsonl file from dedupe to create a subdirectory that hard-links each cluster together into one folder for manual review (see the sketch after this list).
- delete: Deletes duplicates, leaving 1 image per cluster (careful to have a backup!).
These subcommands should be used in this order.
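To make the visualise step concrete, here is a hypothetical sketch of the hard-linking idea (not idedupebox's actual implementation): every file in a cluster gets a hard link inside a per-cluster review folder, so no image data is copied.

```js
// Hypothetical sketch of the "visualise" idea; not idedupebox's actual code.
import { mkdir, link } from 'node:fs/promises';
import path from 'node:path';

async function linkCluster(cluster, reviewDir) {
	// One folder per duplicate cluster, e.g. review/cluster-42/
	const clusterDir = path.join(reviewDir, `cluster-${cluster.id}`);
	await mkdir(clusterDir, { recursive: true });
	for (const filepath of cluster.filepaths) {
		// A hard link points at the same inode as the original, so nothing is copied.
		// (Basename collisions within a cluster are ignored in this sketch.)
		await link(filepath, path.join(clusterDir, path.basename(filepath)));
	}
}
```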
To get detailed help, run this command:
```
idedupebox --help
```
Examples
Some example command invocations are shown below.
Generate a duplicates file for a directory:
```
idedupebox dedupe --dirpath path/to/dir --output /tmp/x/20301120-duplicates.jsonl
```
Visualise an existing duplicates file:
```
idedupebox visualise --verbose --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```
Backup a directory:
```
tar -caf 20301120-backup.tar.gz path/to/same_dir_as_above
```
Dry-run a deletion of duplicates:
```
idedupebox delete --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```
(note: add --force to actually delete the duplicates)
[!NOTE] A built-in check ensures that the last file existing on disk in each cluster is never deleted. When a deletion is required, which file in a given duplicate cluster gets deleted is undefined: the candidates are shuffled with the Fisher-Yates algorithm.
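For reference, a minimal Fisher-Yates shuffle (illustrative only, not the tool's exact code) looks like this:

```js
// Fisher-Yates shuffle: uniformly permutes an array in place.
// Illustrative sketch only, not idedupebox's exact implementation.
function fisherYatesShuffle(items) {
	for (let i = items.length - 1; i > 0; i--) {
		const j = Math.floor(Math.random() * (i + 1)); // random index in [0, i]
		[items[i], items[j]] = [items[j], items[i]]; // swap
	}
	return items;
}
```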
Output format
Aside from ascii, a number of output formats are available. Their names (used as the section headings below) and example output structures are given below.
JSONL (default)
```
{ "id": number, "filepaths": string[] }
...
```
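Each line of the .jsonl file is a standalone JSON object, so downstream scripts can parse it line by line. A hypothetical consumer (the script name and file path are just examples, not part of idedupebox) might look like this:

```js
// list-clusters.mjs: hypothetical consumer of the dedupe .jsonl output.
import { readFile } from 'node:fs/promises';

const raw = await readFile('/tmp/x/20301120-duplicates.jsonl', 'utf8');
for (const line of raw.split('\n')) {
	if (line.trim().length === 0) continue; // skip blank lines
	const cluster = JSON.parse(line); // { id: number, filepaths: string[] }
	console.log(`cluster ${cluster.id}: ${cluster.filepaths.length} files`);
}
```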
TSV
```
filepath	cluster	phash
path/to/cat.jpg	0	base64_here
...
```
Contributing
Contributions are very welcome - both issues and pull requests! Please mention in your pull request that you release your work under the AGPL-3.0 (see below).
Licence
idedupebox is released under the GNU Affero General Public License 3.0. The full license text is included in the LICENSE file in this repository. Tldr legal have a great summary of the license if you're interested.
