npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2024 – Pkg Stats / Ryan Hefner

inverted-index

v1.7.0

Published

inverted-index for level with pagination, sift3/cosine distance and tf-idf ranking

Downloads

23

Readme

inverted-index

NPM version Build Status Dependency Status Coverage Status

install

npm install [--save/--save-dev] inverted-index

api

var inverted = require('inverted-index')

inverted(db[, options[, getter]])

var level = require('level')('/path/to/my/db')
var sublevel = require('sublevel')

var index = inverted(sublevel(db, 'index'), {
  idf: true,
  stem: true,
  rank: true,
  rank_algorithm: 'cosine',
  facets: true
}, function(id, options, fn){
  level.get(id, options, fn)
})

db

Any level API-compatible instance is accepted.

options

The exemplified options is the default configuration.

idf

When idf is flagged as true, for each token indexed an idf (term frequency–inverse document frequency) is calculated. When querying the index, the terms with lowest idf are fetched first. Example:

"Julie loves me more than Linda loves me"
[
  {
    "word": "julie",
    "idf": 1.791759469228055
  },
  {
    "word": "linda",
    "idf": 1.791759469228055
  },
  {
    "word": "loves",
    "idf": 1.0986122886681098
  }
]

Notice that "me", "more" and "than" are not indexed, because those are considered stopwords.

stem

Whether the text should be stemmed or not. When true, the text is stemmed with the Porter stemming algorithm using NaturalNode/natural. Example:

"Fishing is a way of catching cats, he argued in his arguments"

is tokenized into:

["fishing", "is", "a", "way", "of", "catching", "cats", "he", "argued", "in", "his", "arguments"]

and stemmed into:

["fish", "is", "a", "wai", "of", "catch", "cat", "he", "argued", "in", "his", "argum"]
rank

With ranking enabled, when querying it ranks the results based on a defined algorithm. The rank is done AFTER the fetch, so it only ranks using the result set (that can be parcial depending on the size of matching results) comparing the query with the original indexed text, to the tokens.

So, idf is used to fetch tokens ordered by idf and then ranking is done with the original text of each token's correspondent document comparing with the query text. The "problem" with ranking is that if you have 100000 tokens that match the query tokens, only 100 (can be set on the query options) are fetched for each page and THEN the rank is done. Example:

{
  "1": "Fishing is a way of catching cats, Julie argued in her arguments",
  "2": "Julie loves me more than Linda loves me"
}

querying Julie loves would fetch:

[
  {
    "word": "loves",
    "idf": 1.0986122886681098,
    "id": "2"
  },
  {
    "word": "julie",
    "idf": 1.791759469228055,
    "id": "2"
  },
  {
    "word": "julie",
    "idf": 2.4849066497880004,
    "id": "1"
  }
]

and then rank them:

["2", "1"]
rank_algorithm

Only takes effect when rank is set to true. Valid options are cosine or sift3 using ramitos/cosine and ramitos/sift3.

Haven't made any benchmarks on that, but sift3 should be faster. Will get data on that soon.

facets

Enabling facets is useful to query based on types of models. Example:

{
  "1": {
    "text": "Hank Green",
    "facets": ["user"]
  },
  "2": {
    "text": "John Green",
    "facets": ["user"]
  },
  "3": {
    "text": "Johnnie Walker",
    "facets": ["user"]
  },
  "b": {
    "text": "Johnnie Walker",
    "facets": ["brand"]
  }
}

You can then query "Johnnie" with facets ["brand"] and only get:

["b"]

Notice how the result don't include the user 3 because it doesn't have the brand facet.

You can also combine facets with id's to provide property based queries:

{
  "3": {
    "text": "Johnnie Walker [email protected]",
    "facets": ["user"]
  },
  "3-name": {
    "text": "Johnnie Walker",
    "facets": ["user-name"]
  },
  "3-email": {
    "text": "[email protected]",
    "facets": ["user-email"]
  }
}

And then query the facets ["user-name"] with the text "johnnie" and get:

["3-name"]

And with that you can just split the results to get the id's.

getter

For ranking results, we need to store the original text. When indexing large amounts of data this can have an impact on disk usage. To prevent that, a function can be passed that receives id, options, and callback as the arguments to fetch the original indexed text for that id.

index(text, id[, facets], callback)

put(text, id[, facets], callback)

link(text, id[, facets], callback)

index.index('john green', 1, ['user'], function(err){
  assert(!err)
})
index.put('Fishing is a way of catching cats, he argued in his arguments', 'b', function(err){
  assert(!err)
})
index.link('Julie loves me more than Linda loves me', '1436ebc684b-c1039c76bdb2b054670f3a1256c98650', ['message'], function(err){
  assert(!err)
})

remove(id, callback)

del(id, callback)

unlink(id, callback)

index.remove(1, function(err){
  assert(!err)
})
index.del('b', function(err){
  assert(!err)
})
index.unlink('1436ebc684b-c1039c76bdb2b054670f3a1256c98650', function(err){
  assert(!err)
})

index.search(query[, facets[, options]], callback)

index.query(query[, facets[, options]], callback)

index.search('Fishing', function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})
index.query('Green', ['user'], function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})
index.search('Green', 'user', function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})
index.search('Green', {
  limit: 100, 
  ttl: 1000 * 60 * 60
}, function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})
index.search({
  last: '1436ec2e069-bf55e1ed64540b925e13d6bfd21a543c'
}, function(err, result){
  assert(!err)
  assert(result.last)
  assert(result.results)
})

pagination

Every query returns a last parameter. That can be passed to the query/search function to get the next results. When you pass last, you don't need to pass the search query again, because it is saved in the db.

Note that pagination expires in 1h, so if you do a query now, and 2 hours later you want to retrieve the next page, you'll get an error.

The ttl can, however, be tuned in the query options.

license

MIT