idedupebox
Image deduplication CLI tool
idedupebox is a simple tool that deduplicates images recursively in a directory using perceptual hashing (phash), via the sharp-phash library.
Full parallelisation is on the todo list, but the tool is reasonably efficient as it is (tested on ~166K images, which were handled in roughly half an hour on a Ryzen 7900). If this is a deal breaker for you, please get in touch and I'll add better parallelisation (or even better, send a PR?).
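For the curious, the core idea is to compute a perceptual hash per image and treat images whose hashes are within a small Hamming distance of each other as duplicates. The sketch below is a minimal illustration of that idea, not idedupebox's actual implementation; it assumes sharp-phash's default export accepts an image buffer and resolves to a 64-character binary string.

```js
// dedupe-sketch.mjs: minimal illustration of phash-based duplicate detection.
// Not idedupebox's actual code; assumes sharp-phash's default export accepts an
// image buffer and resolves to a 64-character binary string (e.g. "0110...01").
import { readFile } from 'node:fs/promises';
import phash from 'sharp-phash';

const THRESHOLD = 5; // max differing bits (out of 64) to call two images duplicates

// Hamming distance between two equal-length binary strings
function hamming(a, b) {
	let diff = 0;
	for (let i = 0; i < a.length; i++) if (a[i] !== b[i]) diff++;
	return diff;
}

// Hash every image passed on the command line
const hashes = [];
for (const filepath of process.argv.slice(2)) {
	hashes.push({ filepath, hash: await phash(await readFile(filepath)) });
}

// Naive O(n^2) pairwise comparison: fine for a sketch, too slow for ~166K images
for (let i = 0; i < hashes.length; i++) {
	for (let j = i + 1; j < hashes.length; j++) {
		if (hamming(hashes[i].hash, hashes[j].hash) <= THRESHOLD)
			console.log(`likely duplicates: ${hashes[i].filepath} <-> ${hashes[j].filepath}`);
	}
}
```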
System requirements
- Node.js + npm (bundled by default)
- Some command-line knowledge
- Some images in a directory to deduplicate
Installation
Install this package via npm:
```
npm install -g idedupebox
```
...you may need to run the above command with sudo.
This will expose a command idedupebox.
From source
If you'd rather install from source, do so by cloning the repository:
```
git clone https://codeberg.org/sbrl/idedupebox.git;
cd idedupebox;
```
Then, install dependencies:
```
npm install
```
Now follow the getting started instructions below, replacing idedupebox with src/index.mjs - don't forget to cd into the repository's directory first.
Getting started
This tool has 3 subcommands:
- dedupe: Walks a directory recursively, hashing all images as it goes. Spits out a list of duplicate clusters in a .jsonl file.
- visualise: Uses the .jsonl file from dedupe to create a subdirectory that hard-links each cluster together into one folder for manual review (see the sketch after this list).
- delete: Deletes duplicates, leaving 1 image per cluster (careful to have a backup!).
These subcommands should be used in this order.
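To make the visualise step concrete, here is a hypothetical sketch of the hard-linking idea (not idedupebox's actual implementation): every file in a cluster gets a hard link inside a per-cluster review folder, so no image data is copied.

```js
// Hypothetical sketch of the "visualise" idea; not idedupebox's actual code.
import { mkdir, link } from 'node:fs/promises';
import path from 'node:path';

async function linkCluster(cluster, reviewDir) {
	// One folder per duplicate cluster, e.g. review/cluster-42/
	const clusterDir = path.join(reviewDir, `cluster-${cluster.id}`);
	await mkdir(clusterDir, { recursive: true });
	for (const filepath of cluster.filepaths) {
		// A hard link points at the same inode as the original, so nothing is copied.
		// (Basename collisions within a cluster are ignored in this sketch.)
		await link(filepath, path.join(clusterDir, path.basename(filepath)));
	}
}
```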
To get detailed help, run this command:
```
idedupebox --help
```
Examples
Some example command invocations are shown below.
Generate a duplicates file for a directory:
```
idedupebox dedupe --dirpath path/to/dir --output /tmp/x/20301120-duplicates.jsonl
```
Visualise an existing duplicates file:
```
idedupebox visualise --verbose --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```
Backup a directory:
```
tar -caf 20301120-backup.tar.gz path/to/same_dir_as_above
```
Dry-run a deletion of duplicates:
```
idedupebox delete --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl
```
(note: add --force to actually delete the duplicates)
[!NOTE] A built-in check ensures that the last file existing on disk in each cluster is never deleted. When a deletion is required, which file in a given duplicate cluster gets deleted is undefined: the candidates are shuffled with the Fisher-Yates algorithm.
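For reference, a minimal Fisher-Yates shuffle (illustrative only, not the tool's exact code) looks like this:

```js
// Fisher-Yates shuffle: uniformly permutes an array in place.
// Illustrative sketch only, not idedupebox's exact implementation.
function fisherYatesShuffle(items) {
	for (let i = items.length - 1; i > 0; i--) {
		const j = Math.floor(Math.random() * (i + 1)); // random index in [0, i]
		[items[i], items[j]] = [items[j], items[i]]; // swap
	}
	return items;
}
```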
Output format
Aside from ascii, a number of output formats are available. Their names (used as the section headings below) and example output structures are given below.
JSONL (default)
```
{ "id": number, "filepaths": string[] }
...
```
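Each line of the .jsonl file is a standalone JSON object, so downstream scripts can parse it line by line. A hypothetical consumer (the script name and file path are just examples, not part of idedupebox) might look like this:

```js
// list-clusters.mjs: hypothetical consumer of the dedupe .jsonl output.
import { readFile } from 'node:fs/promises';

const raw = await readFile('/tmp/x/20301120-duplicates.jsonl', 'utf8');
for (const line of raw.split('\n')) {
	if (line.trim().length === 0) continue; // skip blank lines
	const cluster = JSON.parse(line); // { id: number, filepaths: string[] }
	console.log(`cluster ${cluster.id}: ${cluster.filepaths.length} files`);
}
```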
TSV
```
filepath	cluster	phash
path/to/cat.jpg	0	base64_here
...
```
Contributing
Contributions are very welcome - both issues and pull requests! Please mention in your pull request that you release your work under the AGPL-3.0 (see below).
Licence
idedupebox is released under the GNU Affero General Public License 3.0. The full license text is included in the LICENSE file in this repository. Tldr legal have a great summary of the license if you're interested.
