
json-hashify v1.0.0

JSON-Hashify is a library for hashing JSON objects and arrays into compact signatures (sketches) that can be used to compare the similarity of JSON objects.


JSON-Hashify


JSON Structural Hashing.

Everyone has JSON! Do you need to know if your JSON is structurally and loosely semantically similar to other JSON? Not just === identical, but close in shape and content? We got you!

This utility takes any JSON object/array, analyzes its structure and content (paths, values, subtrees), generates k-shingles from these features, and then applies Grouped One Permutation Hashing (Grouped-OPH) to produce a compact signature ("sketch").

Compare sketches to estimate Jaccard similarity. Fast and effective for detecting structural likeness, and a good fit for Approximate Nearest Neighbor graphs over ASTs, code similarity detection, and more.
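To make the pipeline concrete, here is an illustrative sketch of the first two steps: flattening a JSON value into path:value feature strings and k-shingling them. The exact feature-string format json-hashify uses internally is not documented here, so treat `pathValueFeatures` and its output format as hypothetical.

```javascript
// Illustrative only: flatten a JSON value into "path:value" feature strings.
// The real library's internal feature format may differ.
function pathValueFeatures(node, path = '$') {
  if (node !== null && typeof node === 'object') {
    const entries = Array.isArray(node)
      ? node.map((v, i) => [`[${i}]`, v])
      : Object.entries(node).map(([k, v]) => [`.${k}`, v]);
    return entries.flatMap(([seg, v]) => pathValueFeatures(v, path + seg));
  }
  return [`${path}:${JSON.stringify(node)}`];
}

// Slide a window of size k over a string to get its k-shingles.
function kShingles(s, k = 5) {
  const out = [];
  for (let i = 0; i + k <= s.length; i++) out.push(s.slice(i, i + k));
  return out;
}

console.log(pathValueFeatures({ a: 1, b: { c: 2 } })); // [ '$.a:1', '$.b.c:2' ]
console.log(kShingles('$.a:1', 3)); // [ '$.a', '.a:', 'a:1' ]
```

Each shingle then becomes one element of the set that GOPH compresses into the sketch.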

Usage

Simple:

import { JSONHashify, generateJSONHashifySketch, compareJSONHashifySketches, estimateJaccardSimilarity } from 'json-hashify'; 

// Your JSONs
const json1 = { a: 1, b: { c: 2, d: [3, 4] }, e: "hello" };
const json2 = { a: 1, b: { c: 99, d: [3, 4] }, e: "world" }; // similar structure, different values
const json3 = { x: true, y: false, z: null }; // totally different

// Make a hasher instance (or don't, use the utility fns)
const hasher = new JSONHashify({
  shingleSize: 5,         // Default: 5. Size of k-shingles for path:value strings.
  subtreeDepth: 2,        // Default: 2. How deep to look into subtrees.
  frequencyThreshold: 1,  // Default: 1. Min times a shingle must appear.
  numHashFunctions: 128,  // Default: 128. Total hashes in the sketch.
  numGroups: 4,           // Default: 4. Groups for GOPH. numHashFunctions must be divisible by this.
  preserveArrayOrder: true, // Default: true. `arr[0]` vs `arr[1]`. If false, array elements are like a bag.
  ignoreKeys: ['position'], // Default: []. Keys to completely ignore.
  enableNodeStringCache: true, // Default: false. Cache shingle sets for node strings? Speeds up repeats.
  nodeStringCacheSize: 5000  // Default: 1000. Max items in node string cache if enabled.
});

const sketch1 = hasher.generateSketch(json1);
const sketch2 = hasher.generateSketch(json2);
const sketch3 = hasher.generateSketch(json3);

// Or use the quick util fn
const sketch1_alt = generateJSONHashifySketch(json1, { numHashFunctions: 128 });


console.log('Sketch 1:', sketch1);

// How similar are they? (0.0 to 1.0)
const similarity12 = hasher.compareSketches(sketch1, sketch2);
console.log('Similarity json1 vs json2:', similarity12); // Should be kinda high

const similarity13 = compareJSONHashifySketches(sketch1, sketch3); // Util fn for comparison too
console.log('Similarity json1 vs json3:', similarity13); // Should be pretty low

// You can also get the raw shingle set before GOPH if you're curious
const shingleSet1 = hasher.generateShingleSet(json1);
// console.log('Shingles for json1:', shingleSet1);

// If you're using the cache and processing lots of similar stuff, clear it sometimes:
hasher.clearNodeStringCache();

// The estimateJaccardSimilarity is also exported if you have sketches from elsewhere
// and know they were made with compatible GOPH params.
// const directSim = estimateJaccardSimilarity(sketch1, sketch2);

API

new JSONHashify(options?)

Creates a new JSONHashify instance.

  • options (Object, optional):
    • shingleSize (Number, default: 5): Size of k-shingles.
    • subtreeDepth (Number, default: 2): Depth for subtree extraction.
    • frequencyThreshold (Number, default: 1): Minimum shingle frequency.
    • numHashFunctions (Number, default: 128): Total hashes in the sketch (must be divisible by numGroups).
    • numGroups (Number, default: 4): Number of groups for GOPH.
    • preserveArrayOrder (Boolean, default: true): Distinguish array elements by index.
    • ignoreKeys (Array, default: []): Keys to ignore.
    • enableNodeStringCache (Boolean, default: false): Enable an LRU cache for node string shingle sets. Useful if processing many identical sub-structures or the same JSON repeatedly.
    • nodeStringCacheSize (Number, default: 1000): Max size of the node string cache if enabled.
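The preserveArrayOrder option is easiest to see in terms of feature paths. This is a conceptual sketch (not the library's internals): with order preserved, each array element keeps its index in the path; without it, elements share a positionless path, so arrays compare like bags.

```javascript
// Conceptual: how preserveArrayOrder changes array feature paths.
// `arrayFeaturePaths` is a hypothetical helper for illustration only.
function arrayFeaturePaths(arr, preserveArrayOrder) {
  return arr.map((v, i) =>
    preserveArrayOrder ? `$[${i}]:${v}` : `$[]:${v}`);
}

console.log(arrayFeaturePaths([3, 4], true));  // [ '$[0]:3', '$[1]:4' ]
console.log(arrayFeaturePaths([4, 3], false)); // [ '$[]:4', '$[]:3' ]
```

With preserveArrayOrder: false, [3, 4] and [4, 3] produce the same feature set, so reordered arrays look identical to the hasher.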

hasher.generateSketch(json)

Generates a GOPH sketch (an array of numbers) for the input json.

hasher.generateShingleSet(json)

Generates the set of unique shingle hashes (integers) for the input json after frequency thresholding but before GOPH.
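Frequency thresholding is straightforward: only shingles that appear at least frequencyThreshold times survive into the set. A minimal illustration (not the library's code, which hashes shingles to integers):

```javascript
// Illustrative: keep only shingles occurring at least `threshold` times.
function thresholdShingles(shingles, threshold) {
  const counts = new Map();
  for (const s of shingles) counts.set(s, (counts.get(s) || 0) + 1);
  return [...counts.keys()].filter((s) => counts.get(s) >= threshold);
}

console.log(thresholdShingles(['ab', 'bc', 'ab'], 2)); // [ 'ab' ]
```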

hasher.compareSketches(sketch1, sketch2, estimationOptions?)

Estimates Jaccard similarity (0 to 1) between two sketches.

  • sketch1 (Array): First MinHash sketch.
  • sketch2 (Array): Second MinHash sketch.
  • estimationOptions (Object, optional): Options for Jaccard similarity estimation, passed to the underlying grouped-oph library.
    • similarityThreshold (number): The Jaccard similarity threshold (0 to 1) for early termination. If the algorithm can confidently determine that the true similarity is above or below this threshold with an error probability less than errorTolerance, it may return an approximate result early (typically 0.0 or 1.0).
    • errorTolerance (number): The acceptable probability (0 to 1, e.g., 0.01 for 1%) of making an incorrect early termination decision when similarityThreshold is used.
    • Note: numGroups (from the hasher instance) is automatically provided to the estimation function when these options are used.

hasher.clearNodeStringCache()

Clears the internal node string shingle cache if it was enabled.

generateJSONHashifySketch(json, options?)

Utility function. Creates a temporary JSONHashify instance with options and returns hasher.generateSketch(json).

compareJSONHashifySketches(sketch1, sketch2, constructorOptions?, estimationOptions?)

Utility function. Creates a temporary JSONHashify instance with constructorOptions and returns hasher.compareSketches(sketch1, sketch2, estimationOptions).

estimateJaccardSimilarity(sketch1, sketch2, options?)

Directly estimates Jaccard similarity from two sketches. Assumes sketches are compatible. This is re-exported from grouped-oph. See grouped-oph documentation for details on its options for approximation.
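For intuition, the core idea behind MinHash-style estimation is that the fraction of sketch positions where two signatures agree approximates the Jaccard similarity of the underlying shingle sets. This is a simplified conceptual sketch; grouped-oph's actual estimator (with grouping, densification, and early termination) is more involved.

```javascript
// Conceptual MinHash-style estimator: fraction of matching sketch slots
// approximates |A ∩ B| / |A ∪ B|. Not grouped-oph's actual algorithm.
function estimateSimilarity(sketchA, sketchB) {
  if (sketchA.length !== sketchB.length) {
    throw new Error('Sketches must be the same length');
  }
  let matches = 0;
  for (let i = 0; i < sketchA.length; i++) {
    if (sketchA[i] === sketchB[i]) matches++;
  }
  return matches / sketchA.length;
}

console.log(estimateSimilarity([1, 7, 3, 9], [1, 7, 5, 9])); // 0.75
```

More hash functions (a longer sketch) tighten the estimate at the cost of generation time.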

Performance

Benchmarks are run with node bench/random-json.js.

Sketch Generation Performance

"Stateful" uses enableNodeStringCache: true, memoizing recurring subtrees to speed up hashing. "Stateless" creates a new hasher (or uses one with the cache disabled or cleared) for each operation on different random JSONs.

| Benchmark Configuration        | Mode      | HPS (Higher is Better) | Per Call Duration |
|--------------------------------|-----------|------------------------|-------------------|
| JSON (Depth 2, Max Children 3) | Stateless | 30790.41               | ~32.5 μs          |
| JSON (Depth 2, Max Children 3) | Stateful  | 35432.14               | ~28.2 μs          |
| JSON (Depth 3, Max Children 5) | Stateless | 4862.18                | ~206 μs           |
| JSON (Depth 3, Max Children 5) | Stateful  | 2895.05                | ~345 μs           |
| JSON (Depth 4, Max Children 5) | Stateless | 1579.90                | ~633 μs           |
| JSON (Depth 4, Max Children 5) | Stateful  | 1480.12                | ~676 μs           |
| JSON (Depth 3, Max Children 8) | Stateless | 1647.91                | ~607 μs           |
| JSON (Depth 3, Max Children 8) | Stateful  | 1055.50                | ~947 μs           |
| JSON (Depth 5, Max Children 3) | Stateless | 3107.03                | ~322 μs           |
| JSON (Depth 5, Max Children 3) | Stateful  | 3353.01                | ~298 μs           |

Note on Sketch Generation Cache: The enableNodeStringCache option is beneficial when processing the exact same JSON multiple times or when JSON objects share many identical sub-structures (leading to identical path:value strings for nodes). For highly diverse JSON inputs without repeated sub-structures, the overhead of cache management might slightly reduce performance compared to stateless generation.
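The README does not specify how the cache is implemented; a bounded LRU cache in the spirit of what nodeStringCacheSize limits can be sketched with a Map, whose insertion order makes eviction of the least-recently-used entry cheap.

```javascript
// Minimal LRU cache sketch built on Map insertion order. Hypothetical:
// json-hashify's actual cache implementation may differ.
class LruCache {
  constructor(maxSize) {
    this.maxSize = maxSize;
    this.map = new Map();
  }
  get(key) {
    if (!this.map.has(key)) return undefined;
    const value = this.map.get(key);
    this.map.delete(key);   // re-insert to mark as most recently used
    this.map.set(key, value);
    return value;
  }
  set(key, value) {
    this.map.delete(key);
    this.map.set(key, value);
    if (this.map.size > this.maxSize) {
      // evict least-recently-used entry (first key in insertion order)
      this.map.delete(this.map.keys().next().value);
    }
  }
}

const cache = new LruCache(2);
cache.set('a', 1);
cache.set('b', 2);
cache.get('a');            // touch 'a' so 'b' becomes least recently used
cache.set('c', 3);         // evicts 'b'
console.log(cache.get('b')); // undefined
console.log(cache.get('a')); // 1
```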

Underlying Improvements

JSONHashify relies on the grouped-oph library for its core signature generation (Grouped One Permutation Hashing) and Jaccard similarity estimation. This provides a robust and mathematically sound basis for the sketches.

  • Efficient Hashing: grouped-oph employs efficient hashing mechanisms (like MurmurHash3) for processing shingle data.
  • Optimized Sketching: The GOPH technique itself is designed to produce compact and effective sketches for Jaccard similarity estimation.

Why?

There are many cases where you want a compact vector for roughly comparing two objects, for instance deduplication or clustering on structural features. If you wanted to find code duplication, you could parse the codebase into ASTs, recursively JSONHashify each AST, and find duplicates much faster than any deterministic approach. Similarly, if you encode the neighborhood tree of a node in a graph, you can find similar structures far more rapidly than with exact graph-analysis algorithms. Because of the shingling, comparison is content-sensitive as well: structures that share keys cluster closer together than structures of identical shape with disjoint keys. This makes it a good fit for many common similarity use cases.

Install

npm install json-hashify

License

MIT. 2023