@picosearch/picosearch

v3.0.0-rc.8 · 233 downloads

Minimalistic full-text search, zero dependencies, local-first, browser-compatible.

picosearch

Minimalistic full-text search implemented in TypeScript.

  • 🔎 Full text search using the BM25F algorithm for multi-field matching
  • 🈯 Fully typed with TypeScript
  • 🧐 Benchmark tests in CI/CD
  • ♻️ JSON-serializable indexes

Installation

yarn add @picosearch/picosearch
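
Or, with npm:

npm install @picosearch/picosearch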

Quick Start

import { Picosearch } from '@picosearch/picosearch';

type MyDoc = {
  id: string;
  text: string;
  additionalText: string;
};

const documents: MyDoc[] = [
  { id: '1', text: 'The quick brown fox', additionalText: 'A speedy canine' },
  { id: '2', text: 'Jumps over the lazy dog', additionalText: 'High leap' },
  { id: '3', text: 'Bright blue sky', additionalText: 'Clear and sunny day' },
];

const pico = new Picosearch<MyDoc>();
pico.insertMultipleDocuments(documents);
console.log(pico.searchDocuments('fox'));
// returns
//[
//  {
//    "id": "1",
//    "score": 0.5406145489041012,
//    "doc": {
//      "id": "1",
//      "text": "The quick brown fox",
//      "additionalText": "A speedy canine"
//    }
//  }
//]

Please note that, currently, a document must be flat, may contain only string values, and must include an id field (also a string)!
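
For illustration, a sketch of the accepted document shape (the field names other than id are made up):

type ValidDoc = {
  id: string;    // required, and must be a string
  title: string; // every other field must be a string as well
  body: string;
};

// Shapes like these are NOT accepted:
// { id: 1, title: 'x' }             (id is not a string)
// { id: '1', views: 42 }            (non-string value)
// { id: '1', meta: { lang: 'en' } } (nested object, not flat)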

Syncing

Picosearch natively supports syncing with local storage and with a remote file server (read-only). Both of these components are optional.

TODO: add docs

Language-specific Preprocessing

By default, only generic preprocessing is applied (a simple regex tokenizer plus lowercasing). It is highly recommended to replace this with language-specific options. Currently, the following languages have an additional preprocessing package:

  • English (@picosearch/language-english)
  • German (@picosearch/language-german)

After installing the package for your language, use it like this:

import { Picosearch } from '@picosearch/picosearch';
import * as englishOptions from '@picosearch/language-english';

// MyDoc is the document type defined in the Quick Start example
const pico = new Picosearch<MyDoc>({ ...englishOptions });

Create an issue if you need another language!

Custom Preprocessing

You can also provide a custom tokenizer (which splits a document into words/tokens) and a custom analyzer (which processes a single token before it is indexed). Just implement the types Tokenizer and Analyzer and pass your implementations to the constructor. Example:

import {
  Picosearch,
  type Analyzer,
  type Tokenizer,
} from '@picosearch/picosearch';

const myTokenizer: Tokenizer = (doc: string): string[] => doc.split(' ');

const myAnalyzer: Analyzer = (token: string): string =>
  // when the analyzer returns '', it is removed
  ['and', 'I'].includes(token) ? '' : token.toLowerCase();

const pico = new Picosearch({
  tokenizer: myTokenizer,
  analyzer: myAnalyzer,
});
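
A short usage sketch with the document API from the Quick Start (the document and query here are made up):

// 'and' is dropped by the analyzer above, and all tokens are lowercased,
// so a lowercase query term matches regardless of the original casing
pico.insertMultipleDocuments([{ id: '1', text: 'Jack and Jill' }]);
console.log(pico.searchDocuments('jack')); // matches document '1'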

JSON Serialization

Indexes can be exported to and imported from JSON. This is useful, for example, for performing the more compute-heavy indexing offline when the search runtime is in the browser. It is very important that you pass the same tokenizer and analyzer to the new instance and don't change any other constructor options. Here's an example:

import { Picosearch } from '@picosearch/picosearch';
import * as englishOptions from '@picosearch/language-english';

const pico = new Picosearch<MyDoc>({ ...englishOptions, keepDocuments: true });
// ...index documents

const jsonIndex = pico.toJSON();

const fromSerialized = new Picosearch<MyDoc>({ ...englishOptions, jsonIndex });

Beware of the keepDocuments option! You might want to set it to false if you only need the index for search and can fetch individual documents at runtime by their ID in some other way.
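
As a quick sanity check of the round trip, the restored instance should answer queries exactly like the original:

// same hits as pico.searchDocuments('fox') from the Quick Start
console.log(fromSerialized.searchDocuments('fox'));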

Benchmark

The CI/CD pipeline includes a benchmarking step to ensure there are no performance regressions. It currently validates against three datasets from the BEIR benchmark [1]. Retrieval quality is checked to be comparable to the BM25 baseline, and in some cases slightly higher thanks to multi-field matching.

|                            | scidocs | nfcorpus | scifact |
| -------------------------- | ------- | -------- | ------- |
| Picosearch+English (BM25F) | 15.6%   | 32.9%    | 69.0%   |
| Baseline (BM25) [1]        | 15.8%   | 32.5%    | 66.5%   |

[1] Thakur, N., Reimers, N., Rücklé, A., Srivastava, A., & Gurevych, I. (2021). BEIR: A heterogeneous benchmark for zero-shot evaluation of information retrieval models. arXiv:2104.08663. https://arxiv.org/pdf/2104.08663