@chcaa/text-search-lite v0.5.0 — Full-text search engine for Node.js
Text Search Lite
A full-text search engine with support for phrase, prefix and fuzzy searches using the BM25F scoring algorithm. A built-in mini query language is provided for advanced search features, as well as a programmatic query interface. Aggregations and filters are supported as well.
Installation
npm install @chcaa/text-search-lite
Getting Started
Any POJO with an id property (>= 1) can be indexed by text-search-lite. Documents (objects) are indexed in a SearchIndex instance, which provides the main interface for adding, updating, deleting and searching documents in the index. When creating a new SearchIndex, the fields to search, sort, filter or aggregate on must be defined in a schema definition for the SearchIndex to handle them correctly. Each field must define the type it should be indexed/stored as, which determines how the values of the field will be processed for searches, filtering, and aggregations. Additional options can be configured, varying with the type of the field, to further control what should be indexed and stored and how values should be processed. This is discussed in detail in the Document Schema chapter.
import { SearchIndex } from '@chcaa/text-search-lite';
let persons = [
{ id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
{ id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
{ id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];
let personsIndex = new SearchIndex([
{ name: 'name', type: SearchIndex.fieldType.TEXT },
{ name: 'gender', type: SearchIndex.fieldType.KEYWORD },
{ name: 'age', type: SearchIndex.fieldType.NUMBER },
{ name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true }
]);
for (let person of persons) {
personsIndex.add(person);
}
When the search index is created and has some documents added, it can be searched using the search() method. The query can either be expressed in a string-based query-string language or as a combination of different query objects. The query-string language is used in the following examples.
// find all persons named "Jane", case does not matter
let janes = personsIndex.search('jane');
// find all female persons who can swim, "+" means the term must be present
let femalesWhoCanSwim = personsIndex.search('+female +swimming');
// narrow the search to only target specific fields
let femalesWhoCanSwimPrecise = personsIndex.search('+gender:(female) +hobbies:(swimming)');
// prefix search, wildcard single character, fuzzy search
let proximitySearch = personsIndex.search('J* 3? cyclist~');
The result of the queries will include an array of matching results with the id of the document and the relevance score of the document in relation to the query. If there are more than 10 results, only the first 10 results will be included (this can be controlled using the pagination option).
{
results: [
{
id: 1,
score: 0.4458314786416938
}
],
sorting: {
field: "_score",
order: "desc"
},
pagination: {
offset: 0,
limit: 10,
total: 1
},
query: {
queryString: "jane",
errors: []
}
}
To include the source object and/or a highlighted version of the source object, the highlight and includeSource query options can be set. For includeSource or highlight to be able to resolve the source objects, the idToSourceResolver function must be set as well.
let persons = [
{ id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
{ id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
{ id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];
let personsById = new Map(); // this could be from a db/repository
persons.forEach(p => personsById.set(p.id, p));
// find all persons named "Jane" and highlight them
let janes = personsIndex.search('jane', {
highlight: { enabled: true },
idToSourceResolver: ids => ids.map(id => personsById.get(id))
});
The result will for each document include a highlight.source property where the terms matching the search are enclosed in HTML <mark> elements.
{
results: [
{
id: 1,
score: 0.4458314786416938,
highlight: {
source: {
id: 1,
name: "<mark>Jane</mark>",
gender: "female",
age: 54,
hobbies: ["Cycling", "Swimming"]
}
}
}
],
// ...
}
Aggregations for all non-text fields can be collected using the aggregations part of the queryOptions. (See the Aggregations chapter for more.)
import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies'),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
When aggregations are requested, the result includes an array with each of the requested aggregations. Term aggregation buckets are sorted by docCount:DESC, term:ASC.
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'swimming', docCount: 2 },
{ key: 'cycling', docCount: 1 }
],
missingDocCount: 1 // person with id=3 does not have any hobbies
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
The aggregations are only collected for the documents matching the search query and filters (if applied), so if we search for "jane"
we only get aggregations for the documents matching this query.
// get aggregations about the documents matching the query
let all = personsIndex.search('jane', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies'),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'cycling', docCount: 1 },
{ key: 'swimming', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 0 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
Filters, e.g. coming from user-selected facets (created from the aggregations), can be applied using the filter part of the queryOptions. Multiple filters must be combined into a single composite filter using a BooleanFilter, which determines how the results of each filter should be combined. Filters can be nested using BooleanFilters in as many levels as needed.
import { greaterThanOrEqualFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35
let all = personsIndex.search('', {
filter: greaterThanOrEqualFilter('age', 35)
});
As we performed a match-all query (empty string) and only narrowed the results using a filter, the score for all documents is 0: filters do not score results, only queries do. For the same reason, a match-all query changes the default sorting to id instead of _score.
{
results: [
{ id: 1, score: 0 },
{ id: 3, score: 0 }
],
sorting: { field: 'id', order: 'asc' },
pagination: { offset: 0, limit: 10, total: 2 },
query: { queryString: '' }
}
The two filters below are combined using AND logic, meaning that a document must pass both filters to be included.
import { andFilter, greaterThanOrEqualFilter, termFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35 who can swim
let all = personsIndex.search('', {
filter: andFilter([
greaterThanOrEqualFilter('age', 35),
termFilter('hobbies', 'swimming')
])
});
Only a single person ("Jane") matches the filters.
{
results: [
{ id: 1, score: 0 }
],
// ...
}
Document Schema
Each document field to index for searching and/or to use for filtering, sorting and aggregations must be defined as part of the document schema for the SearchIndex. A field is defined as an object which must always have a name and a type, and depending on the type it can have a set of additional properties to further specify how the values of the field should be processed and stored.
The fields are passed to the SearchIndex as an array including all the fields the SearchIndex should know about. Additionally, a schema options object for advanced configuration of the search index can be passed as a second argument.
let personsIndex = new SearchIndex([
{ name: 'name', type: SearchIndex.fieldType.TEXT, },
{ name: 'gender', type: SearchIndex.fieldType.KEYWORD, index: false },
{ name: 'age', type: SearchIndex.fieldType.NUMBER, index: true },
{ name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true, docValues: false }
], { // general config of the schema
analyzer: SearchIndex.analyzer.LANGUAGE_ENGLISH,
score: { // ADVANCED change the default settings of scoring algorithm
k1: 1.5,
}
});
Schema options
The schema options can be used to change the default general settings of the schema. All properties are optional, so the argument is not required if no change is needed.
- `analyzer: string` - The name of the default analyzer to use for `text` fields. text-search-lite has a set of common analyzers built in for different languages, accessible through `SearchIndex.analyzer`, or a custom analyzer can be installed and used. Defaults to `'standard'`.
- `score: object` - Scoring config.
  - `k1: number` - `k1` for the BM25F scoring algorithm. The default value to use for all analyzers not having `k1` assigned specifically. Defaults to `1.2`.
  - `analyzerK1: object` - `k1` for individual analyzers. Register `k1` for a specific analyzer by using the name of the analyzer as the property name and setting `k1` as the value.
Field settings
The following properties are available for all field types:
- `name: string` - The path name of the field, e.g. "author.name" (for arrays of values the brackets should be excluded, e.g. "authors.name").
- `type: ('text'|'keyword'|'tag'|'number'|'date'|'boolean')` - The type of the field.
- `index?: boolean` - Set to `true` if the field should be searchable.
- `docValues?: boolean` - Set to `true` if the field should be available in filters or be used for aggregations.
- `array?: boolean` - Set to `true` if the property is an array or a descendant of an array. Defaults to `false`.
- `boost?: number` - The relevance of the field when scoring it in a search. Must be >= 1. Defaults to `1`.
- `prefix?: object` - Prefix config. Only relevant if `index=true`.
  - `eagerLoad?: boolean` - Set to `true` if prefix mappings should be eager loaded. If `false`, prefixes will first be loaded when queried on the field. Defaults to `false`.
  - `partitionDepth?: number` - The maximum partition depth of the prefix tree. In most cases the default is fine; only in cases where, e.g., all analyzed terms start with the same prefix, such as "000-SomeValue", should this be set to a higher number. Defaults to `3`.
- `fuzzy?: object` - Fuzzy config. Only relevant if `index=true`.
  - `enabled?: boolean` - `true` if fuzzy queries should be supported. Defaults to `true`.
- `score?: object` - The scoring parameters to use when calculating the score of the field. Only relevant if `index=true`.
  - `b?: number` - The hyperparameter `b` of BM25F. Defaults to `0.75`.
- `docStats?: object` - Document statistics config. Only relevant if `index=true`.
  - `length?: boolean` - Should doc field length be stored. Defaults to `false`.
  - `termFrequencies?: boolean` - Should doc term frequencies be stored. Defaults to `false`.
  - `termPositions?: boolean` - Should doc term positions be stored. Only relevant for `text` fields. This is required for phrase searches. Defaults to `false`.
Some properties are only available for specific field types or have another value than the default. The additional fields and different default values are described for each field below.
A note on docStats
Even though docStats can be enabled for all field types, it only makes sense for `text` fields, as all other field types are not tokenized. The only exception is if a field has `array=true` and the number of elements in the array should be taken into account when scoring the document. E.g. if we have documents with a hobbies array where doc-1 has ["bicycling"] and doc-2 has ["bicycling", "climbing"], and documents with more hobbies should score lower than documents with fewer hobbies, then `docStats.length` could be set to `true`, as the number of hobbies will then be used in calculating how relevant the match is.
text Fields
A text field is the primary field to use for full-text searches. A text field is analyzed (normalized and tokenized) when indexed, which makes it ideal for efficient lookup of terms and phrases in the text of the field.
docValues not allowed
A text field cannot have docValues=true because it is tokenized. Therefore, a text field cannot be used for filtering, aggregations or sorting.
Default settings override
- index: true
- docStats
  - length: true
  - termFrequencies: true
  - termPositions: true
Text field specific settings
- `indexExact?: boolean` - Set to `true` for `text` fields to enable phrase searches and more precise matching. Defaults to `true`.
- `analyzer?: string` - The name of the analyzer to use for this field. Defaults to `undefined`, which resolves to the default analyzer configured for the SearchIndex.
keyword Fields
A keyword field is indexed as-is without applying any form of analysis. To match the value of a keyword field, the same string as when the field was indexed must be used.
Default settings override
- index: true
- docValues: true
tag Fields
A tag field is indexed in the same way as a keyword field, except that lowercasing is applied, making the value of the field case-insensitive.
Default settings override
- index: true
- docValues: true
number Fields
A number field is used to store numeric values such as age, weight, length and other measures. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
- index: false
- docValues: true
- fuzzy
  - enabled: false
number fields should in most cases only be indexed if the vocabulary is relatively small and made up of integers. A large vocabulary, either because of floating-point numbers or large-scale integers, will be hard to match in a search and would furthermore result in a large inverted index.
date Fields
A date field is used to store date and date-time values. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
- index: false
- docValues: true
- fuzzy
  - enabled: false
Date field specific settings
- `format: string` - A format string in one of the formats `yyyy`, `yyyy-MM-dd` or `yyyy-MM-dd'T'HH-mm-ssZ`.
Document value types
The value of the document field can express a date in one of the following ways:
- `number` - An integer in epoch millis. Negative values are allowed.
- `string` - A date string, if `field.format` is defined.
Dates in BC time can for all string formats be defined as negative values with 6-digit years: -yyyyyy. E.g. -000001-01-01.
When format is defined, both epoch millis and date strings in the defined format are allowed as values for the field. If docValues is enabled for the field, the date will be converted to epoch millis before storing it, which will then be used for filtering, aggregations and sorting. If index is enabled for the field, the date will be converted to the defined string format before indexing, so the date can be searched for using the given format.
A date field can only be indexed if the format precision is set to yyyy or yyyy-MM-dd. A large vocabulary because of minute, second or even millisecond precision will be hard to match in a search and would furthermore result in a large inverted index.
Regarding Time Zones
Dates will internally always be stored as UTC. If date inputs include time using the yyyy-MM-dd'T'HH-mm-ssZ format and no time zone is present, the date will be parsed as UTC.
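The UTC rule can be illustrated with a small sketch in plain JavaScript (the helper name is hypothetical and this is not the library's internals): a date-time string in the yyyy-MM-dd'T'HH-mm-ssZ format with no zone designator is treated as UTC.

```javascript
// Hypothetical helper: parse a yyyy-MM-dd'T'HH-mm-ss[Z] string as UTC epoch millis.
function toEpochMillisUtc(dateString) {
  // The format uses hyphens in the time part; normalize to ISO colons first.
  let iso = dateString.replace(/T(\d{2})-(\d{2})-(\d{2})/, 'T$1:$2:$3');
  // Treat a missing zone designator as UTC by appending "Z".
  if (!/(?:Z|[+-]\d{2}:?\d{2})$/.test(iso)) {
    iso += 'Z';
  }
  return Date.parse(iso);
}

console.log(toEpochMillisUtc('1970-01-01T00-00-00'));  // 0 (epoch start, UTC)
console.log(toEpochMillisUtc('2000-01-01T00-00-00Z')); // 946684800000
```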
boolean Fields
A boolean field is used to store the boolean values true|false. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
- index: false
- docValues: true
- fuzzy
  - enabled: false
docId Fields
A special field type used for storing the id of a document in an optimized way. The field type cannot be configured on user-defined fields but can still be encountered, as the id field is publicly available.
Create, Update and Delete Documents
// TODO short description of CRUD operations, describe as methods #### Method: xxx
Searching
Searching the index is done using the search() method. The query part of the search can be expressed in the built-in query-string language or as a combination of query objects. Furthermore, filters, aggregations, sorting and pagination can be applied/requested through the optional queryOptions object, which can be passed as a second argument to search().
Method: search(query, [queryOptions])
Searches the index.
Parameters:
- `query: string|Query|Query[]` - The query to search for.
- `queryOptions?: object` - Query options.
  - `fields?: string[]` - The names of the fields to search. Defaults to all user-created indexed fields if not defined.
  - `pagination?: object` - The pagination to apply.
    - `offset?: number` - The pagination offset. Defaults to `0`.
    - `limit?: number` - The pagination limit. Defaults to `10`.
  - `sorting?: object` - The sorting to apply.
    - `field?: string` - The field to sort by, or `"_score"`.
    - `order?: ('asc'|'desc')` - The sorting order.
  - `filter?: Filter` - The filter to apply.
  - `aggregations?: Aggregation[]` - The aggregations to generate.
  - `highlight?: boolean|object` - Highlight options. Defaults to `false`.
    - `enabled: boolean` - Should highlight be enabled.
  - `includeSource?: boolean` - Should the source object be included in the result. Defaults to `false`.
  - `idToSourceResolver?: function(number[]):{id:number}[]` - A function for resolving source objects from an array of ids.
  - `queryString?: object` - Query string options.
    - `parseOptions: object` - Query string parse options. Enable/disable which query-string expressions to parse.
Returns:
- `object` - The result of the search.
  - `results: object[]` - Information about each document matching the search, with applied filters and pagination.
    - `results[].id: number` - The id of the document.
    - `results[].score: number` - The relevance score of the document. (This will be 0 when sorting on something other than `_score` or when a match-all query is performed.)
    - `results[].source: object` - The source object, if requested in the `queryOptions`.
    - `results[].highlight: object` - Highlight information, if requested in the `queryOptions`.
    - `results[].highlight.source: object` - A highlighted version of the source object.
  - `sorting: object` - The sorting applied to the result.
  - `pagination: object` - The pagination applied to the result and the total number of matches.
    - `offset: number`
    - `limit: number`
    - `total: number` - The total match count.
  - `aggregations: object` - The aggregation results. (See Aggregations for the different result object structures.)
  - `query: object` - The query-string and possible errors. This is only available if the query was performed using a query-string.
    - `queryString: string` - The query-string used for the search.
    - `errors: object[]` - The parse errors, if any, which occurred during parsing of the query-string.
Query String Language
Text-search-lite has a built-in query-string mini language for expressing text-based queries, with support for the same types of queries as the programmatic API, such as boolean modifiers, phrases, wildcards, targeting specific fields and grouping of statements. The query-string parser automatically converts any unparsable part of the query to regular "text", making it safe to expose the query-string language directly to the end user.
The following modifiers and expressions are supported and can be turned on/off individually to limit what should be parsed and what should just be treated as regular text.
Phrases "A phrase"
A phrase is one or more terms in a specific form which should be present in a particular order.
search for "a full sentence" or for a "single" specific spelling of a term
Must Operator +
The term, phrase or group content must be present in the document for it to match.
+peace in the +world
Must Not Operator -
The term, phrase or group content must not be present in the document for it to match.
peace not -war
Boost Operator ^NUMBER
Boost the relevance of the term, phrase or the content of a group.
peace^10 "love not war"^2
Prefix Operator *
The term must start with one or more characters, but the ending is undetermined. Prefix queries take the difference in length between the match and the prefix string into account when scoring is calculated.
love and pea*
Wildcard Operator ?, *
The term can have single and multiple character spans which are undetermined. The single character wildcard is expressed by ? and the multiple character wildcard by *. Wildcard queries take the difference in length between the match and the wildcard term into account when scoring is calculated.
love and p?a*e
Fuzzy Operator ~, ~[0, 1, 2]
The term must match other terms within a maximum edit distance. When the edit distance is not specified as one of [0, 1, 2], it is calculated based on the length of the term:
- length < 3: maxEdits = 0
- length < 6: maxEdits = 1
- length >= 6: maxEdits = 2
Fuzzy queries take the edit distance between the term and the result into account when scoring is calculated.
love~ and peace~2
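The length-based defaults above can be sketched as a small helper (an illustration of the documented rule, not the library's source):

```javascript
// Default max edit distance for a fuzzy term without an explicit ~[0, 1, 2].
function defaultMaxEdits(term) {
  if (term.length < 3) return 0; // very short terms must match exactly
  if (term.length < 6) return 1;
  return 2;
}

console.log(defaultMaxEdits('to'));      // 0
console.log(defaultMaxEdits('love'));    // 1
console.log(defaultMaxEdits('cyclist')); // 2
```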
Groups ()
Terms and phrases can be grouped together and boolean operators and boost can be applied to a group making it possible to express more complex queries.
+peace +(world earth) (love solidarity)^10
Field Groups FIELD1:FIELD2:()
Field groups offer the same possibilities as groups and additionally target one or more fields where the match must occur. Multiple fields must be separated by a colon (:).
Field Groups cannot be nested.
title:(world earth) title:description:(love solidarity)
Query String Options
The parsing of the query-string language can be configured in the queryOptions of search() and parseQueryStringToQueryObjects(), where each language feature can be enabled/disabled. All features are enabled by default.
- `queryString: object` - The query string options.
  - `parseOptions: object` - Enable/disable which query-string expressions to parse.
    - `quote: boolean` - Toggle parsing of `"exact strings and phrases"`.
    - `group: boolean` - Toggle parsing of `(terms in group)`.
    - `fieldGroup: boolean` - Toggle parsing of `title:(terms in field group)`.
    - `mustOperator: boolean` - Toggle parsing of `+mustOperator`.
    - `mustNotOperator: boolean` - Toggle parsing of `-mustNotOperator`.
    - `prefixOperator: boolean` - Toggle parsing of `prefix*`.
    - `wildcardOperator: boolean` - Toggle parsing of `wil_c*d`.
    - `fuzzyOperator: boolean` - Toggle parsing of `fuzzy~1`.
    - `boostOperator: boolean` - Toggle parsing of `boost^10`.
Parse Errors
When using the query-string language, the search() and parseQueryStringToQueryObjects() methods include information about any parse errors and their exact location in the string. The parse errors are structured in the following format.
- `errors: object[]` - An array of error objects.
  - `errors[].type: string` - The type of the error.
  - `errors[].message: string` - A user-friendly message.
  - `errors[].startIndex: number` - The start index in the source string where the reported error occurs.
  - `errors[].spanSize: number` - The character span of the reported error.
The query-string can also be validated directly using the validateQueryString() method, which could e.g. be used for user feedback while typing a query.
Method: validateQueryString(queryString, [parseOptions])
Validates the query string. Any problems with the query string will be reported in the errors array of the returned object.
Parameters:
- `queryString: string` - The query string to validate.
- `parseOptions: object` - Options for configuring which parts of the query string language should be enabled. (See Query String Options.)
Returns:
- `object` - The result of the validation.
  - `status: ('success'|'error')` - The status of the validation.
  - `errors: object[]` - The parse errors which occurred during parsing. (See above.)
  - `queryString: string` - The query-string which was validated.
Query Objects
// TODO describe the factory methods in detail and mention that the classes can also be imported directly for use as types and for instantiation with "new".
// TODO general querying: query language or object-based queries.
... TermQuery can additionally accept number and boolean types, which will then be transformed to the correct indexed version of the value before querying: (boolean: true -> "true"), (number: 1000 -> "1000"), (date: 0 -> "1970-01-01").
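These coercions can be illustrated with a plain-JavaScript sketch (`toIndexedTerm` and its `dateFormat` parameter are hypothetical names, not the library's API):

```javascript
// Hypothetical sketch of normalizing TermQuery values to their indexed string form.
function toIndexedTerm(value, dateFormat) {
  if (typeof value === 'boolean') {
    return String(value); // true -> "true"
  }
  if (typeof value === 'number' && dateFormat === 'yyyy-MM-dd') {
    // Epoch millis targeting a date field -> formatted date string.
    return new Date(value).toISOString().slice(0, 10); // 0 -> "1970-01-01"
  }
  if (typeof value === 'number') {
    return String(value); // 1000 -> "1000"
  }
  return value; // strings pass through unchanged
}
```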
Filters
// TODO ... range filters also work on strings --- term filters also work with numbers and dates. Dates accept epoch millis as input or a date string in the given format for the field.
Aggregations
Aggregations can be used to collect aggregated statistics about the result of a query. This could, e.g., be:
- the top 10 hobbies of documents
- document counts grouped by age ranges
- document counts grouped by birth-year decade
- etc.
Multiple aggregations can be requested at the same time, and aggregations can be nested to create drill-down detail hierarchies.
To request one or more aggregations, include the aggregations as part of the queryOptions object.
import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies', 2),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
The results of the requested aggregations are included as an array on the result object from the query. All aggregation results have the same set of base properties where only the bucket objects differ depending on the type of aggregation requested.
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'swimming', docCount: 2 },
{ key: 'cycling', docCount: 1 }
],
missingDocCount: 1 // person with id=3 does not have any hobbies
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
Peculiarities of aggregation buckets
As documents with array fields can occur more than once in the aggregated statistics, the sum of the counted document values may exceed the total number of documents in the query. This is expected.
Factory methods for creating the different kinds of aggregations are exported from the @chcaa/text-search-lite/aggregation package, along with the aggregation classes the factory methods produce. The factory methods are the suggested way of creating aggregation requests, while the classes can be used for type definitions.
Function: termAggregation(fieldName, [maxSize], [options])
Creates a new TermAggregation for collecting statistics about keyword, tag, number, date, and boolean fields. The occurrence of each distinct value will be counted once per document, and the buckets are returned in descending order with the value with the most documents at the top.
Parameters:
- `fieldName: string` - The name of the field to aggregate on.
- `maxSize?: number` - The maximum number of buckets. Defaults to `10`.
- `options?: object` - Config options.
Returns:
TermAggregation
Bucket results
Buckets are sorted by docCount:DESC, term:ASC
.
{
// name, fieldName, etc...
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
missingDocCount: 0
}
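The documented bucket ordering can be expressed as a comparator (an illustration of the sort order above, not the library's code):

```javascript
// Sort term buckets by docCount descending, then key (term) ascending.
function compareTermBuckets(a, b) {
  if (b.docCount !== a.docCount) {
    return b.docCount - a.docCount; // docCount:DESC
  }
  return a.key < b.key ? -1 : a.key > b.key ? 1 : 0; // term:ASC
}

let buckets = [
  { key: 'running', docCount: 1 },
  { key: 'swimming', docCount: 2 },
  { key: 'cycling', docCount: 1 }
];
buckets.sort(compareTermBuckets);
console.log(buckets.map(b => b.key)); // ['swimming', 'cycling', 'running']
```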
Function: rangeAggregation(fieldName, ranges, [options])
Creates a new RangeAggregation for collecting statistics about number, keyword and tag fields.
Parameters:
- `fieldName: string` - The name of the field to aggregate on.
- `ranges: object[]` - The ranges to create buckets for.
  - `ranges[].from: number|string` - The lower limit of the bucket, inclusive. Optional for the first range if no lower limit is required.
  - `ranges[].to: number|string` - The upper limit of the bucket, exclusive. Optional for the last range if no upper limit is required.
- `options?: object` - Config options.
Returns:
RangeAggregation
Bucket results
Buckets are sorted by the order they were requested.
{
// name, fieldName, etc...
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
],
missingDocCount: 0
}
Additionally, a set of convenience functions is supplied:
- `rangeAggregationWithIntegerAutoBuckets(fieldName, bucketCount, min, max, [options])` - Creates a new RangeAggregation where the buckets are auto-generated based on the input parameters.
- `rangeAggregationWithIntegerAutoBucketsOpenEnded(fieldName, bucketCount, min, max, [options])` - Creates a new open-ended RangeAggregation where the buckets are auto-generated based on the input parameters. The first bucket will only have `to` defined and the last bucket only `from` defined; the bucket ranges are thus open-ended.
- `rangeAggregationWithNumberAutoBuckets(fieldName, bucketCount, min, max, [options])` - Creates a new RangeAggregation where the buckets are auto-generated based on the input parameters.
- `rangeAggregationWithNumberAutoBucketsOpenEnded(fieldName, bucketCount, min, max, [options])` - Creates a new open-ended RangeAggregation where the buckets are auto-generated based on the input parameters. The first bucket will only have `to` defined and the last bucket only `from` defined; the bucket ranges are thus open-ended.
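How such auto buckets can be derived from bucketCount, min and max is sketched below (an illustration of the idea; the library's exact rounding may differ):

```javascript
// Generate evenly sized integer ranges covering [min, max).
function integerAutoBuckets(bucketCount, min, max) {
  let step = Math.ceil((max - min) / bucketCount);
  let ranges = [];
  for (let from = min; from < max; from += step) {
    ranges.push({ from, to: Math.min(from + step, max) });
  }
  return ranges;
}

console.log(integerAutoBuckets(5, 0, 100));
// -> [{ from: 0, to: 20 }, { from: 20, to: 40 }, { from: 40, to: 60 },
//     { from: 60, to: 80 }, { from: 80, to: 100 }]
```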
Function: dateRangeAggregation(fieldName, format, ranges, [options])
Creates a new DateRangeAggregation for collecting statistics about date fields. Date range aggregations work in the same way as range aggregations, except that the bucket ranges can be expressed in a string date format.
Parameters:
- `fieldName: string` - The name of the field to aggregate on.
- `format: string` - The date format of the ranges. One of `yyyy`, `yyyy-MM-dd` or `yyyy-MM-dd'T'HH-mm-ssZ`.
- `ranges: object[]` - The ranges to create buckets for.
  - `ranges[].from: number|string` - The lower limit of the bucket, inclusive. Optional for the first range if no lower limit is required.
  - `ranges[].to: number|string` - The upper limit of the bucket, exclusive. Optional for the last range if no upper limit is required.
- `options?: object` - Config options.
Returns:
DateRangeAggregation
Bucket results
Buckets are sorted by the order they were requested.
{
// name, fieldName, etc...
buckets: [
{ key: '1940-1950', from: '1940', to: '1950', fromMillis: -946771200000, toMillis: -631152000000, docCount: 0 },
{ key: '1990-2000', from: '1990', to: '2000', fromMillis: 631152000000, toMillis: 946684800000, docCount: 2 },
],
missingDocCount: 0
}
Aggregation Options
All aggregations can additionally be configured to have a user-defined name and to include nested aggregations using the following options object structure.
- `name?: string` - The name of the aggregation, e.g. to distinguish two aggregations on the same field. If undefined, the field name will be used.
- `aggregations?: Aggregation[]` - Child aggregations to collect for each bucket of the aggregation.
Child aggregations can be requested as follows:

```js
import { termAggregation } from '@chcaa/text-search-lite/aggregation';

// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
  aggregations: [
    termAggregation('gender', {
      aggregations: [
        termAggregation('hobbies', 2) // Top 2 hobbies for each gender
      ]
    })
  ]
});
```
The result of the child aggregation will be attached to each parent bucket.

```js
{
  // name, fieldName, etc...
  buckets: [
    {
      key: 'female', docCount: 2,
      aggregations: [
        {
          // name, fieldName, etc...
          buckets: [
            { key: 'swimming', docCount: 1 },
            { key: 'cycling', docCount: 1 }
          ]
        }
      ]
    },
    {
      key: 'male', docCount: 1,
      aggregations: [
        {
          // name, fieldName, etc...
          buckets: [
            { key: 'swimming', docCount: 1 }
          ]
        }
      ]
    }
  ],
  missingDocCount: 0
}
```
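The nesting above can be sketched in plain JavaScript: documents are first bucketed by the parent field, then each parent bucket's documents are bucketed again by the child field. The `termBuckets` helper below is hypothetical, a simplified stand-in for the library's internal counting, not its actual implementation.

```javascript
// Count term occurrences per document field and return the top `size` buckets.
// Values are lowercased, mirroring the keys in the result shown above.
function termBuckets(docs, field, size = Infinity) {
  let counts = new Map();
  for (let doc of docs) {
    let values = [].concat(doc[field] ?? []); // handle array and scalar fields
    for (let value of values) {
      let key = String(value).toLowerCase();
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return [...counts]
    .map(([key, docCount]) => ({ key, docCount }))
    .sort((a, b) => b.docCount - a.docCount)
    .slice(0, size);
}

let persons = [
  { id: 1, gender: 'female', hobbies: ['Cycling', 'Swimming'] },
  { id: 2, gender: 'male', hobbies: ['Swimming'] },
  { id: 3, gender: 'female', hobbies: [] }
];

// parent buckets, each carrying one child aggregation over its own documents
let genderBuckets = termBuckets(persons, 'gender').map(bucket => ({
  ...bucket,
  aggregations: [{
    buckets: termBuckets(
      persons.filter(p => String(p.gender).toLowerCase() === bucket.key),
      'hobbies', 2)
  }]
}));
console.log(genderBuckets[0].key, genderBuckets[0].docCount); // female 2
```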
Bm25f Scoring Algorithm

The scoring algorithm builds on the bm25f algorithm as described in foundations of bm25 review and Okapi bm25. The algorithm groups all the fields of a document (included in the search) that share the same analyzer into one virtual field before scoring each term against that virtual field.
This approach typically gives better results than scoring each field individually and then combining the results, as the importance of a term is considered across all fields instead of each field in isolation.
The boost of a field is integrated into the algorithm by using the boost as a multiplier for the term frequency in the given field, thereby making matches in boosted fields contribute more to the score.
Formula:

- streams/fields: s = 1, ..., S
- stream length: sl_s
- stream weight: v_s
- stream term frequency: tf_s,i
- avg. stream length across all docs: avsl_s
- term: i
- total docs with stream: n
- docs with i in stream: df_n,i
- stream length relevance: b
- term frequency relevance: k1
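Using the variable names above, the simple bm25f variant from the cited papers can be written out as follows. This is a reconstruction from the references, not copied from the library's source; details such as the exact idf smoothing may differ in the implementation.

```latex
% pseudo term frequency of term i, accumulated across all streams (fields)
% sharing the same analyzer; v_s boosts the stream, b controls how strongly
% the stream length sl_s (relative to the average avsl_s) normalizes it
\widetilde{tf}_i = \sum_{s=1}^{S} v_s \cdot \frac{tf_{s,i}}{1 + b\left(\frac{sl_s}{avsl_s} - 1\right)}

% document score: saturating term-frequency component (controlled by k_1)
% multiplied by the inverse document frequency of term i
score(d, q) = \sum_{i \in q} \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i}
  \cdot \log\frac{n - df_{n,i} + 0.5}{df_{n,i} + 0.5}
```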
Tuning k1 and b parameters

`b` determines the impact of the field's length when calculating the score. It defaults to `0.75` and must be in the range [0 - 1].
Lower values mean smaller length impact and vice versa. `b` can be configured on a per-field basis, and for fields with only short
text segments a lower value should be considered, so a change in length of only a few terms doesn't affect the score too much. E.g., a title field could have a `b` of `0.25`.
For fields like `person.name` even a `b` value of `0.0` should be considered, as a search for `Andersen` should probably yield the same score for both `Gillian Andersen` and `Hans Christian Andersen` and not include the length of the `name` in the score at all. Either the person has the name searched for or not; the length of the full name is not relevant.
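The effect of `b` can be illustrated by computing bm25's length-normalization factor directly. The `lengthNormalization` function below is an illustration of the parameter only, not the library's internal code.

```javascript
// bm25 divides the term frequency by 1 + b * (fieldLength / avgFieldLength - 1).
// With b = 0 this factor is always 1, so field length has no influence on the
// score; larger b values penalize fields longer than average more strongly.
function lengthNormalization(b, fieldLength, avgFieldLength) {
  return 1 + b * (fieldLength / avgFieldLength - 1);
}

// 'Gillian Andersen' (2 terms) vs 'Hans Christian Andersen' (3 terms), avg 2.5
console.log(lengthNormalization(0.75, 2, 2.5)); // ~0.85 -> shorter name boosted
console.log(lengthNormalization(0.75, 3, 2.5)); // ~1.15 -> longer name penalized
console.log(lengthNormalization(0, 2, 2.5));    // 1     -> b = 0: length ignored
console.log(lengthNormalization(0, 3, 2.5));    // 1
```

With `b = 0` both names yield the same factor, matching the intuition that the match on `Andersen` should score identically regardless of name length.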
`k1` determines the impact of the term frequency in matching fields and is in bm25f applied once per term to the score for all fields with the same analyzer (see formula above). `k1` has a default of `1.2` but can be changed for the whole document index or for each analyzer individually.
It is also possible to change how the term frequency of a document affects the score by turning `docStats.termFrequencies` off for a field. The field count will then always be `1` if the term exists in the field, no matter the actual term frequency, and `0` if the term does not exist in the field.
`docStats.termFrequencies` is by default turned off for all fields other than `text` fields, as other fields are not tokenized, so counting term frequencies would make no difference and just consume memory.
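The effect of turning `docStats.termFrequencies` off can be sketched as follows. The `termFrequency` function is a hypothetical illustration of the statistic being toggled, not the library's implementation.

```javascript
// With termFrequencies enabled the actual occurrence count feeds the score;
// with it disabled the count is capped to 1 (term present) or 0 (term absent).
function termFrequency(tokens, term, termFrequenciesEnabled) {
  let count = tokens.filter(t => t === term).length;
  return termFrequenciesEnabled ? count : (count > 0 ? 1 : 0);
}

let tokens = ['swimming', 'is', 'fun', 'swimming', 'daily'];
console.log(termFrequency(tokens, 'swimming', true));  // 2
console.log(termFrequency(tokens, 'swimming', false)); // 1
console.log(termFrequency(tokens, 'cycling', false));  // 0
```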
Method Summary: SearchIndex

The `SearchIndex` exposes the following properties and methods:

- `docCount` - The total number of documents in the index.
- `indexedFields` - Name and type of all indexed fields.
- `sortingFields` - Name and type of all fields that can be used for sorting.
- `filterFields` - Name and type of all fields that can be used for filtering.
- `hasField(fieldName)` - Tests if the field exists.
- `getFieldType(fieldName)` - Returns the type of the field.
- `add(document)` - Adds a document to the index.
- `update(document)` - Updates a document in the index.
- `deleteById(id)` - Removes a document from the index.
- `delete(document)` - Removes a document from the index.
- `search(query, queryOptions)` - Searches the index.
- `parseQueryStringToQueryObjects(queryString, parseOptions)` - Parses the query string to query objects.
- `validateAggregation(aggregation)` - Validates the aggregation and throws a `ValidationError` if not valid.
- `validateFilter(filter)` - Validates the filter and throws a `ValidationError` if not valid.
- `validateQueryString(queryString, parseOptions)` - Validates the query string.
- `clearCache()` - Clears any cached filters, aggregations and sorted results. This is done automatically on any change to the index, so this method typically does not need to be called manually.