
idedupebox

v1.0.0


image deduper cli tool: use perceptual hashing (phash) to deduplicate a directory of images


Image deduplication cli tool

idedupebox is a simple tool that uses perceptual hashing (phash) to deduplicate images recursively in a directory, using the sharp-phash library.
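To make the phash approach concrete, here is a minimal sketch (not idedupebox's actual code) of how two perceptual hashes might be compared: sharp-phash produces a 64-character binary string per image, and two images can be treated as duplicates when the Hamming distance between their hashes falls below some threshold. The threshold value here is purely illustrative.

```javascript
// Hamming distance between two equal-length binary hash strings
// (sharp-phash produces 64-character strings of '0' and '1').
function hammingDistance(a, b) {
  if (a.length !== b.length) throw new Error("hash length mismatch");
  let dist = 0;
  for (let i = 0; i < a.length; i++) {
    if (a[i] !== b[i]) dist++;
  }
  return dist;
}

// Two images are considered duplicates when their hashes differ in
// fewer bits than a threshold. The value 10 is an assumption for
// illustration, not the threshold idedupebox actually uses.
const THRESHOLD = 10;
function isDuplicate(hashA, hashB) {
  return hammingDistance(hashA, hashB) < THRESHOLD;
}
```

The appeal of this scheme is that near-identical images (resized, re-encoded, lightly edited) produce hashes that differ in only a few bits, so they cluster together even though their bytes differ completely.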

Full parallelisation is on the todo list, but the tool is reasonably efficient as it is (tested on ~166K images, which it handled in roughly half an hour on a Ryzen 7900). If that's a deal breaker for you, please get in touch and I'll add better parallelisation (or better yet, open a PR!).

System requirements

  • Node.js + npm (bundled by default)
  • Some command-line knowledge
  • Some images in a directory to deduplicate

Installation

Install this package via npm:

npm install -g idedupebox

...you may need to run the above command with sudo.

This will expose a command idedupebox.

From source

If you'd rather install from source, do so by cloning this directory:

git clone https://codeberg.org/sbrl/idedupebox.git;
cd idedupebox;

Then, install dependencies:

npm install

Now follow the getting started instructions below, replacing idedupebox with src/index.mjs in each command - don't forget to cd into the repository's directory first.

Getting started

This tool has 3 subcommands:

  1. dedupe: Walks a directory recursively, hashing all images as it goes, and writes a list of clusters of duplicate images to a .jsonl file.
  2. visualise: Uses the .jsonl file from dedupe to create a subdirectory that hard-links each cluster into a single folder for manual review.
  3. delete: Deletes duplicates, leaving one image per cluster (make sure you have a backup first!)

These subcommands should be used in this order.

To get detailed help, run this command:

idedupebox --help

Examples

Some example command invocations are shown below.

Generate a duplicates file for a directory:

idedupebox dedupe --dirpath path/to/dir --output /tmp/x/20301120-duplicates.jsonl

Visualise an existing duplicates file:

idedupebox visualise --verbose --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl

Backup a directory:

tar -caf 20301120-backup.tar.gz path/to/same_dir_as_above

Dry-run a deletion of duplicates:

idedupebox delete --dirpath path/to/same_dir_as_above --filepath /tmp/x/20301120-duplicates.jsonl

(note: add --force to actually delete the duplicates)

[!NOTE] A built-in check ensures that the last file existing on disk in each cluster is never deleted. When a deletion is required, which file in a given duplicates cluster gets deleted is undefined: the candidates are shuffled with the Fisher-Yates algorithm.
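For reference, the Fisher-Yates shuffle mentioned above can be sketched as follows; this is a generic illustration of the algorithm, not the tool's actual deletion code.

```javascript
// In-place Fisher-Yates shuffle: walk the array backwards, swapping
// each element with a uniformly random element at or before it, so
// that every permutation is equally likely.
function fisherYatesShuffle(arr) {
  for (let i = arr.length - 1; i > 0; i--) {
    const j = Math.floor(Math.random() * (i + 1));
    [arr[i], arr[j]] = [arr[j], arr[i]];
  }
  return arr;
}
```

Because the shuffle is uniform, no particular file in a cluster is favoured for deletion, which is why the surviving file is described as undefined.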

Output format

Aside from ASCII, there are a number of possible output formats. Their names (used as the section headings below) and example output structures are given here.

JSONL (default)

{ "id": number, "filepaths": string[] }
...
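Since the file contains one JSON object per line, it can be consumed with a few lines of code. A minimal sketch, assuming the field names follow the structure shown above:

```javascript
// Parse a JSONL duplicates file: one { id, filepaths } object per
// line, skipping any blank lines.
function parseClusters(jsonlText) {
  return jsonlText
    .split("\n")
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line));
}
```

This line-at-a-time structure also means large duplicates files can be streamed rather than loaded whole, if memory ever becomes a concern.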

TSV

filepath	cluster	phash
path/to/cat.jpg	0	base64_here
...
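The TSV variant is a plain tab-separated table with a header row, so it parses just as easily. A minimal sketch, assuming the columns shown above:

```javascript
// Parse TSV output: the first row is the header (filepath, cluster,
// phash); each following row becomes an object keyed by those
// column names.
function parseTsv(text) {
  const [header, ...rows] = text.trim().split("\n");
  const cols = header.split("\t");
  return rows.map((row) => {
    const cells = row.split("\t");
    return Object.fromEntries(cols.map((c, i) => [c, cells[i]]));
  });
}
```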

Contributing

Contributions are very welcome - both issues and pull requests! Please mention in your pull request that you release your work under the AGPL-3.0 (see below).

Licence

idedupebox is released under the GNU Affero General Public License 3.0. The full licence text is included in the LICENSE file in this repository. TLDRLegal has a great summary of the licence if you're interested.