npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

embedding-search-node

v0.0.6

Published

A vector search utility written purely in javascript with no dependencies. Allows for the searching/ranking of a list of javascript objects with 'embedding' field defined.

Downloads

16

Readme

About

This repository contains an implementation of a JavaScript vector search algorithm, along with an analysis of its running times for different search space sizes. Allows for the searching/ranking of a list of javascript objects with "embedding" fields.

| Notes | Changelog |Test | Data |-----------------|-----------------------|---------------------|--------------|

Usage

Either clone this repository to your project or use npm i embedding-search-node

In a new file...


    //bring in functionalities you need
    const { 
        GetRankedEmbeddingSearch,
        smartVector, 
        EmbeddingInterface 
    } = require("embedding-search-node");

    //initialize the embedding interface
    const embeddingsInterface = new EmbeddingInterface(
        "<Your Openai Api Key>", //required
        "<Your openai org> ", //optional
    );


    //predict the cost of embedding some text
    const exampleText = "Some random text to be embedded by the interface "
    const { cost , tokens } = embeddingsInterface.predictCost(exampleText)
    console.log(`Embedding Prediction:\n# Tokens: ${tokens}\nCost: $${cost.toFixed(4)}`)


    //get the embedding of the text
    const {embedding,error} = await embeddingsInterface.getEmbedding(exampleText);
    
    //check for errors
    if(error) return console.error(error)
    
    //define some 'embeddedItems' or fetch from some database
    const items = [...embeddedItems]

    //optional some parameters to fine tune the search
    let n = 20 //default result cap is 10 here we are upping that to 20 
    let threshold = 0.75 //only vectors with similarity to vector within threshold will be considered in the ranking

    //use the example text as a query on the embeddedItems
    const result = await GetRankedEmbeddingSearch(
        items,
        embedding,
        threshold,
        n
    );
    console.log(result)


Most if not all of the functions are documented with JSDoc as well!


Updates

v0.0.6 - 😬 Fixed the creation of 'smartVectors' which were ultimately a poor understanding of embeddings. They are already normalized so just compute the dot product instead dot products / |a|*|b| nonsense.

v0.0.5 - 🧬 Created a wrapper for the openai embedding api. Generates embeddings using the 'text-embedding-ada-002' model.

v0.0.5 - 💰 Implemented a function to estimate price & token count. Based on OpenAI's Pricing

v0.0.4 - ❌ Removed the "anomaly" feature on vector search as it seemed out of scope for the ranking function.

v0.0.4 - ✨ Refactored the return value of the search results array, to include context from the original array instead of the old [score,index] format

v0.0.3 - 💪 Max size of vector space increased to 160,300 vectors of length 1536. (Roughly 7% increase)

v0.0.3 - 👨🏻‍💻 Changed implementation to use a maxheap for search spaces of sizes > 20,000 , opting for simple array and builtin sort for smaller spaces.

v0.0.3 - 🚀 Implemented a "smartVector" data structure that stores a vector along with its magnitude. This dramatically improved the performance of the cosineSimilarity function by preventing recalculation of magnitude for the query vector. The trade off is a slightly longer time to create a new vector, but seems worthwhile for the search improvements

v0.0.2 - 🚮 Switched to using a maxheap for larger inputs. Introduced a threshold parameter to avoid sorting clearly irrelevant embeddings. Lots of time spent in computation of magnitude.

v0.0.1 - ⚖️ A max capacity of 150,000 vectors of length 1536 established.

v0.0.1 - 💩 A basic implementation of calculating cosine similarity to search a vector space. Utilized mergesort.


Notes

Avoiding senseless sorting

Previously, every comparison result was stored to be sorted, however by incorporating a "threshold" parameter, we can omit clearly irrelevant results from being included in the list to be sorted. For n < 20,000 items, the builtin javascript sort function proved performant enough.

Maxheap for large n

for search spaces with more than 20,000 items, a maxheap is utilized. Introducing parameter "m" as the max number of results, instead of sorting through all "n" items, we simply perform "m" heap extractions.

Performance gains with "smart vectors"

In the prior implementation, magnitudes of vectors were recomputed each time a vector comparison was performed. In order to optimize this, I implemented a "smartVector" function that takes a regular number[] and converts it into an object with the format

{
    v: number[] //the original vector array
    m: number // the magnitude of the vector 
}

By computing the magnitude of the vector at creation, we save the comparison function from 2n calls to magnitude (previously each comparison performed 2 calls to magnitude for each of the n vectors).

Experiment

This experiment aimed to assess the performance of a vector comparison and ranking function that utilizes cosine similarity for determining the similarity between embedding vectors. The function was evaluated on various search space sizes, ranging from 10 to 163,000 items. Average execution times were measured for each search space size, and the distribution of computation time between vector comparisons and the merge sort step was analyzed. Currently the maximum searchable vectors in one call of the function is 163000 vectors of length 1536 ( without running out of memory and the process crashing ).

The experiement was run on my 2020 M1 Macbook Air, using Node v19.8.1

Data

The provided data includes average running times (in milliseconds) for the algorithm's execution on various search space sizes. The collected data points are as follows:

| Search Space | Running Time (ms) | |--------------|------------------| | 10 | 0.026 | | 100 | 0.185 | | 1000 | 2.610 | | 2500 | 6.108 | | 5000 | 12.216 | | 10000 | 23.942 | | 25000 | 52.884 | | 30000 | 67.020 | | 40000 | 192.223 | | 50000 | 126.527 | | 80000 | 160.684 | | 130000 | 284.091 | | 163000 | 344.077 |

Running Time v0.0.3 - v0.0.5

Version 0.0.3 Running Time(ms) over search space size (n). Improvements include implementation of "smartVector" format to reduce calls to expensive "magnitude" function. For small n (n ≤ 20000), use builtin sort to prevent heap overhead.

  • v0.0.1 and v0.0.2 took almost 800ms compared to the new version (v0.0.3+) taking an average of around 350ms for the same (n=150,000) test size.
  • Further optimizations may result in support for even larger search space sizes.

Insights

  • Variability: Some data points deviate from the overall increasing trend, indicating that factors other than search space alone can influence the algorithm's performance. This can likely be attributed to the fact that the sorting step is determined by the number of "over threshold" similarity elements included in the search space.

This library is a rough work in progress. Mostly just experimental but you are welcome to use it in any of your software, please be sure to leave feedback on what features you would find useful to add, or any general suggestions for improvements.