@chcaa/text-search-lite v0.5.0 — Full-text search engine for Node.js
Text Search Lite
A full-text search engine with support for phrase, prefix and fuzzy searches using the BM25F scoring algorithm. A built-in mini query language is provided for advanced search features, as well as a programmatic query interface. Aggregations and filters are supported as well.
Installation
npm install @chcaa/text-search-lite
Getting Started
Any POJO with an id property (>= 1) can be indexed by text-search-lite. Documents (objects) are indexed in a SearchIndex instance, which provides the main interface for adding, updating, deleting and searching documents in the index. When creating a new SearchIndex, the fields to search, sort, filter or aggregate on must be defined in a schema definition for the SearchIndex to handle them correctly. Each field must define the type it should be indexed/stored as, which determines how the values of the field will be processed for searches, filtering, and aggregations. Additional options can be configured, varying with the type of the field, to further control what should be indexed and stored and how values should be processed. This is discussed in detail in the Document Schema chapter.
import { SearchIndex } from '@chcaa/text-search-lite';
let persons = [
{ id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
{ id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
{ id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];
let personsIndex = new SearchIndex([
{ name: 'name', type: SearchIndex.fieldType.TEXT },
{ name: 'gender', type: SearchIndex.fieldType.KEYWORD },
{ name: 'age', type: SearchIndex.fieldType.NUMBER },
{ name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true }
]);
for (let person of persons) {
personsIndex.add(person);
}
When the search index is created and has some documents added, it can be searched using the search() method. The query can either be expressed in a string-based query-string language or as a combination of different query objects. The query-string language is used in the following examples.
// find all persons named "Jane", case does not matter
let janes = personsIndex.search('jane');
// find all female persons who can swim, "+" means the term must be present
let femalesWhoCanSwim = personsIndex.search('+female +swimming');
// narrow the search to only target specific fields
let femalesWhoCanSwimPrecise = personsIndex.search('+gender:(female) +hobbies:(swimming)');
// prefix search, wildcard single character, fuzzy search
let proximitySearch = personsIndex.search('J* 3? cyclist~');
The result of the queries will include an array of matching results with the id of the document and the relevance score of the document in relation to the query. If there are more than 10 results, only the first 10 results will be included (this can be controlled using the pagination option).
{
results: [
{
id: 1,
score: 0.4458314786416938
}
],
sorting: {
field: "_score",
order: "desc"
},
pagination: {
offset: 0,
limit: 10,
total: 1
},
query: {
queryString: "jane",
errors: []
}
}
To include the source object and/or a highlighted version of the source object, the highlight and includeSource query options can be set. For includeSource or highlight to be able to resolve the source objects, the idToSourceResolver function must be set as well.
let persons = [
{ id: 1, name: 'Jane', gender: 'female', age: 54, hobbies: ['Cycling', 'Swimming'] },
{ id: 2, name: 'John', gender: 'male', age: 34, hobbies: ['Swimming'] },
{ id: 3, name: 'Rose', gender: 'female', age: 37, hobbies: [] }
];
let personsById = new Map(); // this could be from a db/repository
persons.forEach(p => personsById.set(p.id, p));
// find all persons named "Jane" and highlight them
let janes = personsIndex.search('jane', {
highlight: { enabled: true },
idToSourceResolver: ids => ids.map(id => personsById.get(id))
});
The result will for each document include a highlight.source property where the terms matching the search are enclosed in HTML <mark> elements.
{
results: [
{
id: 1,
score: 0.4458314786416938,
highlight: {
source: {
id: 1,
name: "<mark>Jane</mark>",
gender: "female",
age: 54,
hobbies: ["Cycling", "Swimming"]
}
}
}
],
// ...
}
Aggregations for all non-text fields can be collected using the aggregations part of the queryOptions. (See the Aggregations chapter for more.)
import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies'),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
When aggregations are requested, the result includes an array with each of the requested aggregations. Term aggregation buckets are sorted by docCount:DESC, term:ASC.
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'swimming', docCount: 2 },
{ key: 'cycling', docCount: 1 }
],
missingDocCount: 1 // person with id=3 does not have any hobbies
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
The aggregations are only collected for the documents matching the search query and filters (if applied), so if we search for "jane"
we only get aggregations for the documents matching this query.
// get aggregations about the documents matching the query
let all = personsIndex.search('jane', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies'),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'cycling', docCount: 1 },
{ key: 'swimming', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 0 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
Filters, e.g. coming from user-selected facets (created from the aggregations), can be applied using the filter part of the queryOptions. Multiple filters must be combined into a single composite filter using a BooleanFilter, which determines how the results of each filter should be combined. Filters can be nested using BooleanFilters in as many levels as needed.
import { greaterThanOrEqualFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35
let all = personsIndex.search('', {
filter: greaterThanOrEqualFilter('age', 35)
});
As we performed a match-all query (empty string) and only narrowed the results using a filter, the score for all documents is 0: filters do not score results, only queries do. For the same reason, a match-all query changes the default sorting to id instead of _score.
{
results: [
{ id: 1, score: 0 },
{ id: 3, score: 0 }
],
sorting: { field: 'id', order: 'asc' },
pagination: { offset: 0, limit: 10, total: 2 },
query: { queryString: '' }
}
The two filters below are combined using AND logic, meaning that a document must pass both filters to be included.
import { andFilter, greaterThanOrEqualFilter, termFilter } from "@chcaa/text-search-lite/filter";
// get all persons with age >= 35 who can swim
let all = personsIndex.search('', {
filter: andFilter([
greaterThanOrEqualFilter('age', 35),
termFilter('hobbies', 'swimming')
])
});
Only a single person ("Jane") matches the filters.
{
results: [
{ id: 1, score: 0 }
],
// ...
}
Document Schema
Each document field to index for searching and/or to use for filtering, sorting and aggregations must be defined as part of the document schema for the SearchIndex. A field is defined as an object which must always have a name and a type, and depending on the type it can have a set of additional properties to further specify how the values of the field should be processed and stored.
The fields are passed to the SearchIndex as an array including all the fields the SearchIndex should know about. Additionally, a schema options object for advanced configuration of the search index can be passed as a second argument.
let personsIndex = new SearchIndex([
{ name: 'name', type: SearchIndex.fieldType.TEXT, },
{ name: 'gender', type: SearchIndex.fieldType.KEYWORD, index: false },
{ name: 'age', type: SearchIndex.fieldType.NUMBER, index: true },
{ name: 'hobbies', type: SearchIndex.fieldType.TAG, array: true, docValues: false }
], { // general config of the schema
analyzer: SearchIndex.analyzer.LANGUAGE_ENGLISH,
score: { // ADVANCED change the default settings of scoring algorithm
k1: 1.5,
}
});
Schema options
The schema options can be used to change the default general settings of the schema. All properties are optional, so the argument is not required if no change is needed.
- `analyzer: string` - The name of the default analyzer to use for `text` fields. text-search-lite has a set of common analyzers built in for different languages, accessible through `SearchIndex.analyzer`, or a custom analyzer can be installed and used. Defaults to `'standard'`.
- `score: object` - Scoring config.
  - `k1: number` - `k1` for the BM25F scoring algorithm. The default value to use for all analyzers not having `k1` assigned specifically. Defaults to `1.2`.
  - `analyzerK1: object` - `k1` for individual analyzers. Register `k1` for a specific analyzer by using the name of the analyzer as the property name and setting `k1` as the value.
Field settings
The following properties are available for all field types:
- `name: string` - The path name of the field, e.g. "author.name" (for arrays of values the brackets should be excluded, e.g. "authors.name").
- `type: ('text'|'keyword'|'tag'|'number'|'date'|'boolean')` - The type of the field.
- `index?: boolean` - Set to `true` if the field should be searchable.
- `docValues?: boolean` - Set to `true` if the field should be available in filters or be used for aggregations.
- `array?: boolean` - Set to `true` if the property is an array or a descendant of an array. Defaults to `false`.
- `boost?: number` - The relevance of the field when scoring it in a search. Must be >= 1. Defaults to `1`.
- `prefix?: object` - Prefix config. Only relevant if `index=true`.
  - `eagerLoad?: boolean` - Set to `true` if prefix mappings should be eager loaded. If `false`, prefixes will first be loaded when queried on the field. Defaults to `false`.
  - `partitionDepth?: number` - The maximum partition depth of the prefix tree. In most cases the default is fine; only in cases where, e.g., all analyzed terms start with the same prefix, such as "000-SomeValue", should this be set to a higher number. Defaults to `3`.
- `fuzzy?: object` - Fuzzy config. Only relevant if `index=true`.
  - `enabled?: boolean` - `true` if fuzzy queries should be supported. Defaults to `true`.
- `score?: object` - The scoring parameters to use when calculating the score of the field. Only relevant if `index=true`.
  - `b?: number` - The hyperparameter `b` of BM25F. Defaults to `0.75`.
- `docStats?: object` - Document statistics config. Only relevant if `index=true`.
  - `length?: boolean` - Should doc field length be stored. Defaults to `false`.
  - `termFrequencies?: boolean` - Should doc term frequencies be stored. Defaults to `false`.
  - `termPositions?: boolean` - Should doc term positions be stored. Only relevant for `text` fields. This is required for phrase searches. Defaults to `false`.
Some properties are only available for specific field types or have another value than the default. The additional fields and different default values are described for each field below.
A note on docStats
Even though docStats can be enabled for all field types, it only makes sense for `text` fields, as all other field types are not tokenized. The only exception is if a field has `array=true` and the number of elements in the array should be taken into account when scoring the document. E.g. if we have documents with a hobbies array where doc-1 has ["bicycling"] and doc-2 has ["bicycling", "climbing"], and documents with more hobbies should score lower than documents with fewer hobbies, then `docStats.length` could be set to `true`, as the number of hobbies will then be used in calculating how relevant the match is.
text Fields
A text field is the primary field to use for full-text searches. A text field is analyzed (normalized and tokenized) when indexed, which makes it ideal for efficient lookup of terms and phrases in the text of the field.
docValues not allowed
A text field cannot have docValues=true because it is tokenized. Therefore, a text field cannot be used for filtering, aggregations or sorting.
Default settings override
- index: true
- docStats
  - length: true
  - termFrequencies: true
  - termPositions: true
Text field specific settings
- `indexExact?: boolean` - Set to `true` for `text` fields to enable phrase searches and more precise matching. Defaults to `true`.
- `analyzer?: string` - The name of the analyzer to use for this field. Defaults to `undefined`, which resolves to the default analyzer configured for the SearchIndex.
keyword Fields
A keyword field is indexed as-is without applying any form of analysis. To match the value of a keyword field, the same string as when the field was indexed must be used.
Default settings override
- index: true
- docValues: true
tag Fields
A tag field is indexed in the same way as a keyword field, except that lowercasing is applied, making the value of the field case-insensitive.
Default settings override
- index: true
- docValues: true
number Fields
A number field is used to store numeric values such as age, weight, length and other measures. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
- index: false
- docValues: true
- fuzzy
  - enabled: false
number fields should in most cases only be indexed if the vocabulary is relatively small and made up of integers. A large vocabulary, either because of floating-point numbers or large-scale integers, will be hard to match in a search and would furthermore result in a large inverted index.
date Fields
A date field is used to store date and date-time values. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
- index: false
- docValues: true
- fuzzy
  - enabled: false
Date field specific settings
- `format: string` - A format string in one of the formats `yyyy`, `yyyy-MM-dd` or `yyyy-MM-dd'T'HH-mm-ssZ`.
Document value types
The value of the document field can express a date in one of the following ways:
- `number` - An integer in epoch millis. Negative values are allowed.
- `string` - A date string, if `field.format` is defined.
Dates in BC time can for all string formats be defined as negative values with 6-digit years: -yyyyyy. E.g. -000001-01-01.
When format is defined, both epoch millis and date strings in the defined format are allowed as values for the field. If docValues is enabled for the field, the date will be converted to epoch millis before storing it, which will then be used for filtering, aggregations and sorting. If index is enabled for the field, the date will be converted to the defined string format before indexing, so the date can be searched for using the given format.
A date field can only be indexed if the format precision is set to yyyy or yyyy-MM-dd. A large vocabulary because of minute, second or even millisecond precision will be hard to match in a search and would furthermore result in a large inverted index.
Regarding Time Zones
Dates will internally always be stored as UTC. If date inputs include time using the yyyy-MM-dd'T'HH-mm-ssZ format and no time zone is present, the date will be parsed as UTC.
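The UTC rule can be illustrated with a small sketch in plain JavaScript (the helper name is hypothetical and this is not the library's internals): a date-time string in the yyyy-MM-dd'T'HH-mm-ssZ format with no zone designator is treated as UTC.

```javascript
// Hypothetical helper: parse a yyyy-MM-dd'T'HH-mm-ss[Z] string as UTC epoch millis.
function toEpochMillisUtc(dateString) {
  // The format uses hyphens in the time part; normalize to ISO colons first.
  let iso = dateString.replace(/T(\d{2})-(\d{2})-(\d{2})/, 'T$1:$2:$3');
  // Treat a missing zone designator as UTC by appending "Z".
  if (!/(?:Z|[+-]\d{2}:?\d{2})$/.test(iso)) {
    iso += 'Z';
  }
  return Date.parse(iso);
}

console.log(toEpochMillisUtc('1970-01-01T00-00-00'));  // 0 (epoch start, UTC)
console.log(toEpochMillisUtc('2000-01-01T00-00-00Z')); // 946684800000
```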
boolean Fields
A boolean field is used to store the boolean values true|false. It is typically used for filtering and sorting but can be indexed as well (disabled by default).
Default settings override
- index: false
- docValues: true
- fuzzy
  - enabled: false
docId Fields
A special field type used for storing the id of a document in an optimized way. The field type cannot be configured on user-defined fields but can still be encountered, as the id field is publicly available.
Create, Update and Delete Documents
// TODO short description of CRUD operations, describe as methods #### Method: xxx
Searching
Searching the index is done using the search() method. The query part of the search can be expressed in the built-in query-string language or as a combination of query objects. Furthermore, filters, aggregations, sorting and pagination can be applied/requested through the optional queryOptions object, which can be passed as a second argument to search().
Method: search(query, [queryOptions])
Searches the index.
Parameters:
- `query: string|Query|Query[]` - The query to search for.
- `queryOptions?: object` - Query options.
  - `fields?: string[]` - The names of the fields to search. Defaults to all user-created indexed fields if not defined.
  - `pagination?: object` - The pagination to apply.
    - `offset?: number` - The pagination offset. Defaults to `0`.
    - `limit?: number` - The pagination limit. Defaults to `10`.
  - `sorting?: object` - The sorting to apply.
    - `field?: string` - The field to sort by, or `"_score"`.
    - `order?: ('asc'|'desc')` - The sorting order.
  - `filter?: Filter` - The filter to apply.
  - `aggregations?: Aggregation[]` - The aggregations to generate.
  - `highlight?: boolean|object` - Highlight options. Defaults to `false`.
    - `enabled: boolean` - Should highlight be enabled.
  - `includeSource?: boolean` - Should the source object be included in the result. Defaults to `false`.
  - `idToSourceResolver?: function(number[]):{id:number}[]` - A function for resolving source objects from an array of ids.
  - `queryString?: object` - Query string options.
    - `parseOptions: object` - Query string parse options. Enable/disable which query-string expressions to parse.
Returns:
- `object` - The result of the search.
  - `results: object[]` - Information about each document matching the search, with applied filters and pagination.
    - `results[].id: number` - The id of the document.
    - `results[].score: number` - The relevance score of the document. (This will be 0 when sorting on something other than `_score` or when a match-all query is performed.)
    - `results[].source: object` - The source object, if requested in the `queryOptions`.
    - `results[].highlight: object` - Highlight information, if requested in the `queryOptions`.
    - `results[].highlight.source: object` - A highlighted version of the source object.
  - `sorting: object` - The sorting applied to the result.
  - `pagination: object` - The pagination applied to the result and the total number of matches.
    - `offset: number`
    - `limit: number`
    - `total: number` - The total match count.
  - `aggregations: object` - The aggregation results. (See Aggregations for the different result object structures.)
  - `query: object` - The query-string and possible errors. This is only available if the query was performed using a query-string.
    - `queryString: string` - The query-string used for the search.
    - `errors: object[]` - The parse errors, if any, which occurred during parsing of the query-string.
Query String Language
Text-search-lite has a built-in query-string mini language for expressing text-based queries, with support for the same types of queries as the programmatic API, such as boolean modifiers, phrases, wildcards, targeting specific fields and grouping of statements. The query-string parser automatically converts any unparsable part of the query to regular "text", making it safe to expose the query-string language directly to the end user.
The following modifiers and expressions are supported and can be turned on/off individually to limit what should be parsed and what should just be treated as regular text.
Phrases "A phrase"
A phrase is one or more terms in a specific form which should be present in a particular order.
search for "a full sentence" or for a "single" specific spelling of a term
Must Operator +
The term, phrase or group content must be present in the document for it to match.
+peace in the +world
Must Not Operator -
The term, phrase or group content must not be present in the document for it to match.
peace not -war
Boost Operator ^NUMBER
Boost the relevance of the term, phrase or the content of a group.
peace^10 "love not war"^2
Prefix Operator *
The term must start with one or more characters, but the ending is undetermined. Prefix queries take the difference in length between the match and the prefix string into account when scoring is calculated.
love and pea*
Wildcard Operator ?, *
The term can have single and multiple character spans which are undetermined. The single character wildcard is expressed by ? and the multiple character wildcard by *. Wildcard queries take the difference in length between the match and the wildcard term into account when scoring is calculated.
love and p?a*e
Fuzzy Operator ~, ~[0, 1, 2]
The term must match other terms within a maximum edit distance. When the edit distance is not specified as one of [0, 1, 2], it is calculated based on the length of the term:
- length < 3: maxEdits = 0
- length < 6: maxEdits = 1
- length >= 6: maxEdits = 2
Fuzzy queries take the edit distance between the term and the result into account when scoring is calculated.
love~ and peace~2
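The length-based defaults above can be sketched as a small helper (an illustration of the documented rule, not the library's source):

```javascript
// Default max edit distance for a fuzzy term without an explicit ~[0, 1, 2].
function defaultMaxEdits(term) {
  if (term.length < 3) return 0; // very short terms must match exactly
  if (term.length < 6) return 1;
  return 2;
}

console.log(defaultMaxEdits('to'));      // 0
console.log(defaultMaxEdits('love'));    // 1
console.log(defaultMaxEdits('cyclist')); // 2
```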
Groups ()
Terms and phrases can be grouped together and boolean operators and boost can be applied to a group making it possible to express more complex queries.
+peace +(world earth) (love solidarity)^10
Field Groups FIELD1:FIELD2:()
Field groups offer the same possibilities as groups and additionally target one or more fields where the match must occur. Multiple fields must be separated by a colon (:).
Field Groups cannot be nested.
title:(world earth) title:description:(love solidarity)
Query String Options
The parsing of the query-string language can be configured in the queryOptions of search() and parseQueryStringToQueryObjects(), where each language feature can be enabled/disabled. All features are enabled by default.
- `queryString: object` - The query string options.
  - `parseOptions: object` - Enable/disable which query-string expressions to parse.
    - `quote: boolean` - Toggle parsing of `"exact strings and phrases"`.
    - `group: boolean` - Toggle parsing of `(terms in group)`.
    - `fieldGroup: boolean` - Toggle parsing of `title:(terms in field group)`.
    - `mustOperator: boolean` - Toggle parsing of `+mustOperator`.
    - `mustNotOperator: boolean` - Toggle parsing of `-mustNotOperator`.
    - `prefixOperator: boolean` - Toggle parsing of `prefix*`.
    - `wildcardOperator: boolean` - Toggle parsing of `wil_c*d`.
    - `fuzzyOperator: boolean` - Toggle parsing of `fuzzy~1`.
    - `boostOperator: boolean` - Toggle parsing of `boost^10`.
Parse Errors
When using the query-string language, the search() and parseQueryStringToQueryObjects() methods include information about any parse errors and their exact location in the string. The parse errors are structured in the following format.
- `errors: object[]` - An array of error objects.
  - `errors[].type: string` - The type of the error.
  - `errors[].message: string` - A user-friendly message.
  - `errors[].startIndex: number` - The start index in the source string where the reported error occurs.
  - `errors[].spanSize: number` - The character span of the reported error.
The query-string can also be validated directly using the validateQueryString() method, which could e.g. be used for user feedback while typing a query.
Method: validateQueryString(queryString, [parseOptions])
Validates the query string. Any problems with the query string will be reported in the errors array of the returned object.
Parameters:
- `queryString: string` - The query string to validate.
- `parseOptions: object` - Options for configuring which parts of the query string language should be enabled. (See Query String Options.)
Returns:
- `object` - The result of the validation.
  - `status: ('success'|'error')` - The status of the validation.
  - `errors: object[]` - The parse errors which occurred during parsing. (See above.)
  - `queryString: string` - The query-string which was validated.
Query Objects
// TODO describe the factory methods in detail and mention that the classes can also be imported directly for use as types and for instantiation with "new".
// TODO general querying: query language or object-based queries.
... TermQuery can additionally accept number and boolean types, which will then be transformed to the correct indexed version of the value before querying: (boolean: true -> "true"), (number: 1000 -> "1000"), (date: 0 -> "1970-01-01").
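These coercions can be illustrated with a plain-JavaScript sketch (`toIndexedTerm` and its `dateFormat` parameter are hypothetical names, not the library's API):

```javascript
// Hypothetical sketch of normalizing TermQuery values to their indexed string form.
function toIndexedTerm(value, dateFormat) {
  if (typeof value === 'boolean') {
    return String(value); // true -> "true"
  }
  if (typeof value === 'number' && dateFormat === 'yyyy-MM-dd') {
    // Epoch millis targeting a date field -> formatted date string.
    return new Date(value).toISOString().slice(0, 10); // 0 -> "1970-01-01"
  }
  if (typeof value === 'number') {
    return String(value); // 1000 -> "1000"
  }
  return value; // strings pass through unchanged
}
```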
Filters
// TODO ... range filters also work on strings --- term filters also work with numbers and dates. Dates accept epoch millis as input or a date string in the given format for the field.
Aggregations
Aggregations can be used to collect aggregated statistics about the result of a query. This could, e.g., be:
- the top 10 hobbies of documents
- document counts grouped by age ranges
- document counts grouped by birth-year decade
- etc.
Multiple aggregations can be requested at the same time, and aggregations can be nested to create drill-down detail hierarchies.
To request one or more aggregations, include the aggregations as part of the queryOptions object.
import { rangeAggregationWithIntegerAutoBuckets, termAggregation } from "@chcaa/text-search-lite/aggregation";
// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
aggregations: [
termAggregation('gender'),
termAggregation('hobbies', 2),
rangeAggregationWithIntegerAutoBuckets('age', 5, 0, 100),
]
});
The results of the requested aggregations are included as an array on the result object from the query. All aggregation results have the same set of base properties where only the bucket objects differ depending on the type of aggregation requested.
{
results: [/*... */],
aggregations: [
{
name: 'gender',
fieldName: 'gender',
type: "term",
fieldType: "keyword",
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
missingDocCount: 0
},
{
name: 'hobbies',
fieldName: 'hobbies',
type: "term",
fieldType: "tag",
buckets: [
{ key: 'swimming', docCount: 2 },
{ key: 'cycling', docCount: 1 }
],
missingDocCount: 1 // person with id=3 does not have any hobbies
},
{
name: 'age',
fieldName: 'age',
type: "range",
fieldType: "number",
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
{ key: '60-80', from: 60, to: 80, docCount: 0 },
{ key: '80-100', from: 80, to: 100, docCount: 0 }
],
missingDocCount: 0
}
]
}
Peculiarities of aggregation buckets
As documents with array fields can occur more than once in the aggregated statistics, the sum of the counted document values may exceed the total number of documents in the query. This is expected.
Factory methods for creating the different kinds of aggregations are exported from the @chcaa/text-search-lite/aggregation package, along with the aggregation classes the factory methods produce. The factory methods are the suggested way of creating aggregation requests, while the classes can be used for type definitions.
Function: termAggregation(fieldName, [maxSize], [options])
Creates a new TermAggregation for collecting statistics about keyword, tag, number, date, and boolean fields. The occurrence of each distinct value will be counted once per document, and the buckets are returned in descending order with the value with the most documents at the top.
Parameters:
- `fieldName: string` - The name of the field to aggregate on.
- `maxSize?: number` - The maximum number of buckets. Defaults to `10`.
- `options?: object` - Config options.
Returns:
TermAggregation
Bucket results
Buckets are sorted by docCount:DESC, term:ASC
.
{
// name, fieldName, etc...
buckets: [
{ key: 'female', docCount: 2 },
{ key: 'male', docCount: 1 }
],
missingDocCount: 0
}
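The documented bucket ordering can be expressed as a comparator (an illustration of the sort order above, not the library's code):

```javascript
// Sort term buckets by docCount descending, then key (term) ascending.
function compareTermBuckets(a, b) {
  if (b.docCount !== a.docCount) {
    return b.docCount - a.docCount; // docCount:DESC
  }
  return a.key < b.key ? -1 : a.key > b.key ? 1 : 0; // term:ASC
}

let buckets = [
  { key: 'running', docCount: 1 },
  { key: 'swimming', docCount: 2 },
  { key: 'cycling', docCount: 1 }
];
buckets.sort(compareTermBuckets);
console.log(buckets.map(b => b.key)); // ['swimming', 'cycling', 'running']
```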
Function: rangeAggregation(fieldName, ranges, [options])
Creates a new RangeAggregation for collecting statistics about number, keyword and tag fields.
Parameters:
- `fieldName: string` - The name of the field to aggregate on.
- `ranges: object[]` - The ranges to create buckets for.
  - `ranges[].from: number|string` - The lower limit of the bucket, inclusive. Optional for the first range if no lower limit is required.
  - `ranges[].to: number|string` - The upper limit of the bucket, exclusive. Optional for the last range if no upper limit is required.
- `options?: object` - Config options.
Returns:
RangeAggregation
Bucket results
Buckets are sorted by the order they were requested.
{
// name, fieldName, etc...
buckets: [
{ key: '0-20', from: 0, to: 20, docCount: 0 },
{ key: '20-40', from: 20, to: 40, docCount: 2 },
{ key: '40-60', from: 40, to: 60, docCount: 1 },
],
missingDocCount: 0
}
Additionally, a set of convenience functions is supplied:
- `rangeAggregationWithIntegerAutoBuckets(fieldName, bucketCount, min, max, [options])` - Creates a new RangeAggregation where the buckets are auto-generated based on the input parameters.
- `rangeAggregationWithIntegerAutoBucketsOpenEnded(fieldName, bucketCount, min, max, [options])` - Creates a new open-ended RangeAggregation where the buckets are auto-generated based on the input parameters. The first bucket will only have `to` defined and the last bucket only `from` defined; the bucket ranges are thus open-ended.
- `rangeAggregationWithNumberAutoBuckets(fieldName, bucketCount, min, max, [options])` - Creates a new RangeAggregation where the buckets are auto-generated based on the input parameters.
- `rangeAggregationWithNumberAutoBucketsOpenEnded(fieldName, bucketCount, min, max, [options])` - Creates a new open-ended RangeAggregation where the buckets are auto-generated based on the input parameters. The first bucket will only have `to` defined and the last bucket only `from` defined; the bucket ranges are thus open-ended.
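How such auto buckets can be derived from bucketCount, min and max is sketched below (an illustration of the idea; the library's exact rounding may differ):

```javascript
// Generate evenly sized integer ranges covering [min, max).
function integerAutoBuckets(bucketCount, min, max) {
  let step = Math.ceil((max - min) / bucketCount);
  let ranges = [];
  for (let from = min; from < max; from += step) {
    ranges.push({ from, to: Math.min(from + step, max) });
  }
  return ranges;
}

console.log(integerAutoBuckets(5, 0, 100));
// -> [{ from: 0, to: 20 }, { from: 20, to: 40 }, { from: 40, to: 60 },
//     { from: 60, to: 80 }, { from: 80, to: 100 }]
```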
Function: dateRangeAggregation(fieldName, format, ranges, [options])
Creates a new DateRangeAggregation for collecting statistics about date fields. Date range aggregations work in the same way as range aggregations, except that the bucket ranges can be expressed in a string date format.
Parameters:
- `fieldName: string` - The name of the field to aggregate on.
- `format: string` - The date format of the ranges. One of `yyyy`, `yyyy-MM-dd` or `yyyy-MM-dd'T'HH-mm-ssZ`.
- `ranges: object[]` - The ranges to create buckets for.
  - `ranges[].from: number|string` - The lower limit of the bucket, inclusive. Optional for the first range if no lower limit is required.
  - `ranges[].to: number|string` - The upper limit of the bucket, exclusive. Optional for the last range if no upper limit is required.
- `options?: object` - Config options.
Returns:
DateRangeAggregation
Bucket results
Buckets are sorted by the order they were requested.
{
// name, fieldName, etc...
buckets: [
{ key: '1940-1950', from: '1940', to: '1950', fromMillis: -946771200000, toMillis: -631152000000, docCount: 0 },
{ key: '1990-2000', from: '1990', to: '2000', fromMillis: 631152000000, toMillis: 946684800000, docCount: 2 },
],
missingDocCount: 0
}
Aggregation Options
All aggregations can additionally be configured to have a user-defined name and to include nested aggregations using the following options object structure.
- `name?: string` - The name of the aggregation, e.g. to distinguish two aggregations on the same field. If undefined, the field name will be used.
- `aggregations?: Aggregation[]` - Child aggregations to collect for each bucket of the aggregation.
Child aggregations can be requested as follows:

```js
import { termAggregation } from '@chcaa/text-search-lite/aggregation';

// get aggregations about all (empty string = match all) documents' gender and hobbies
let all = personsIndex.search('', {
  aggregations: [
    termAggregation('gender', {
      aggregations: [
        termAggregation('hobbies', 2) // Top 2 hobbies for each gender
      ]
    })
  ]
});
```
The result of the child aggregation will be attached to each parent bucket.

```js
{
  // name, fieldName, etc...
  buckets: [
    {
      key: 'female', docCount: 2,
      aggregations: [
        {
          // name, fieldName, etc...
          buckets: [
            { key: 'swimming', docCount: 1 },
            { key: 'cycling', docCount: 1 }
          ]
        }
      ]
    },
    {
      key: 'male', docCount: 1,
      aggregations: [
        {
          // name, fieldName, etc...
          buckets: [
            { key: 'swimming', docCount: 1 }
          ]
        }
      ]
    }
  ],
  missingDocCount: 0
}
```
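The nesting above can be sketched in plain JavaScript: documents are first bucketed by the parent field, then each parent bucket's documents are bucketed again by the child field. The `termBuckets` helper below is hypothetical, a simplified stand-in for the library's internal counting, not its actual implementation.

```javascript
// Count term occurrences per document field and return the top `size` buckets.
// Values are lowercased, mirroring the keys in the result shown above.
function termBuckets(docs, field, size = Infinity) {
  let counts = new Map();
  for (let doc of docs) {
    let values = [].concat(doc[field] ?? []); // handle array and scalar fields
    for (let value of values) {
      let key = String(value).toLowerCase();
      counts.set(key, (counts.get(key) ?? 0) + 1);
    }
  }
  return [...counts]
    .map(([key, docCount]) => ({ key, docCount }))
    .sort((a, b) => b.docCount - a.docCount)
    .slice(0, size);
}

let persons = [
  { id: 1, gender: 'female', hobbies: ['Cycling', 'Swimming'] },
  { id: 2, gender: 'male', hobbies: ['Swimming'] },
  { id: 3, gender: 'female', hobbies: [] }
];

// parent buckets, each carrying one child aggregation over its own documents
let genderBuckets = termBuckets(persons, 'gender').map(bucket => ({
  ...bucket,
  aggregations: [{
    buckets: termBuckets(
      persons.filter(p => String(p.gender).toLowerCase() === bucket.key),
      'hobbies', 2)
  }]
}));
console.log(genderBuckets[0].key, genderBuckets[0].docCount); // female 2
```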
Bm25f Scoring Algorithm

The scoring algorithm builds on the bm25f algorithm as described in foundations of bm25 review and Okapi bm25. The algorithm groups all the fields of a document (included in the search) that share the same analyzer into one virtual field before scoring each term against that virtual field.
This approach typically gives better results than scoring each field individually and then combining the results, as the importance of a term is considered across all fields instead of each field in isolation.
The boost of a field is integrated into the algorithm by using the boost as a multiplier for the term frequency in the given field, thereby making matches in boosted fields contribute more to the score.
Formula:

- streams/fields: s = 1, ..., S
- stream length: sl_s
- stream weight: v_s
- stream term frequency: tf_s,i
- avg. stream length across all docs: avsl_s
- term: i
- total docs with stream: n
- docs with i in stream: df_n,i
- stream length relevance: b
- term frequency relevance: k1
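Using the variable names above, the simple bm25f variant from the cited papers can be written out as follows. This is a reconstruction from the references, not copied from the library's source; details such as the exact idf smoothing may differ in the implementation.

```latex
% pseudo term frequency of term i, accumulated across all streams (fields)
% sharing the same analyzer; v_s boosts the stream, b controls how strongly
% the stream length sl_s (relative to the average avsl_s) normalizes it
\widetilde{tf}_i = \sum_{s=1}^{S} v_s \cdot \frac{tf_{s,i}}{1 + b\left(\frac{sl_s}{avsl_s} - 1\right)}

% document score: saturating term-frequency component (controlled by k_1)
% multiplied by the inverse document frequency of term i
score(d, q) = \sum_{i \in q} \frac{\widetilde{tf}_i}{k_1 + \widetilde{tf}_i}
  \cdot \log\frac{n - df_{n,i} + 0.5}{df_{n,i} + 0.5}
```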
Tuning k1 and b parameters

`b` determines the impact of the field's length when calculating the score. It defaults to `0.75` and must be in the range [0 - 1].
Lower values mean smaller length impact and vice versa. `b` can be configured on a per-field basis, and for fields with only short
text segments a lower value should be considered, so a change in length of only a few terms doesn't affect the score too much. E.g., a title field could have a `b` of `0.25`.
For fields like `person.name` even a `b` value of `0.0` should be considered, as a search for `Andersen` should probably yield the same score for both `Gillian Andersen` and `Hans Christian Andersen` and not include the length of the `name` in the score at all. Either the person has the name searched for or not; the length of the full name is not relevant.
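The effect of `b` can be illustrated by computing bm25's length-normalization factor directly. The `lengthNormalization` function below is an illustration of the parameter only, not the library's internal code.

```javascript
// bm25 divides the term frequency by 1 + b * (fieldLength / avgFieldLength - 1).
// With b = 0 this factor is always 1, so field length has no influence on the
// score; larger b values penalize fields longer than average more strongly.
function lengthNormalization(b, fieldLength, avgFieldLength) {
  return 1 + b * (fieldLength / avgFieldLength - 1);
}

// 'Gillian Andersen' (2 terms) vs 'Hans Christian Andersen' (3 terms), avg 2.5
console.log(lengthNormalization(0.75, 2, 2.5)); // ~0.85 -> shorter name boosted
console.log(lengthNormalization(0.75, 3, 2.5)); // ~1.15 -> longer name penalized
console.log(lengthNormalization(0, 2, 2.5));    // 1     -> b = 0: length ignored
console.log(lengthNormalization(0, 3, 2.5));    // 1
```

With `b = 0` both names yield the same factor, matching the intuition that the match on `Andersen` should score identically regardless of name length.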
`k1` determines the impact of the term frequency in matching fields and is in bm25f applied once per term to the score for all fields with the same analyzer (see formula above). `k1` has a default of `1.2` but can be changed for the whole document index or for each analyzer individually.
It is also possible to change how the term frequency of a document affects the score by turning `docStats.termFrequencies` off for a field. The field count will then always be `1` if the term exists in the field, no matter the actual term frequency, and `0` if the term does not exist in the field.
`docStats.termFrequencies` is by default turned off for all fields other than `text` fields, as other fields are not tokenized, so counting term frequencies would make no difference and just consume memory.
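The effect of turning `docStats.termFrequencies` off can be sketched as follows. The `termFrequency` function is a hypothetical illustration of the statistic being toggled, not the library's implementation.

```javascript
// With termFrequencies enabled the actual occurrence count feeds the score;
// with it disabled the count is capped to 1 (term present) or 0 (term absent).
function termFrequency(tokens, term, termFrequenciesEnabled) {
  let count = tokens.filter(t => t === term).length;
  return termFrequenciesEnabled ? count : (count > 0 ? 1 : 0);
}

let tokens = ['swimming', 'is', 'fun', 'swimming', 'daily'];
console.log(termFrequency(tokens, 'swimming', true));  // 2
console.log(termFrequency(tokens, 'swimming', false)); // 1
console.log(termFrequency(tokens, 'cycling', false));  // 0
```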
Method Summary: SearchIndex

The `SearchIndex` exposes the following properties and methods:

- `docCount` - The total number of documents in the index.
- `indexedFields` - Name and type of all indexed fields.
- `sortingFields` - Name and type of all fields that can be used for sorting.
- `filterFields` - Name and type of all fields that can be used for filtering.
- `hasField(fieldName)` - Tests if the field exists.
- `getFieldType(fieldName)` - Returns the type of the field.
- `add(document)` - Adds a document to the index.
- `update(document)` - Updates a document in the index.
- `deleteById(id)` - Removes a document from the index.
- `delete(document)` - Removes a document from the index.
- `search(query, queryOptions)` - Searches the index.
- `parseQueryStringToQueryObjects(queryString, parseOptions)` - Parses the query string to query objects.
- `validateAggregation(aggregation)` - Validates the aggregation and throws a `ValidationError` if not valid.
- `validateFilter(filter)` - Validates the filter and throws a `ValidationError` if not valid.
- `validateQueryString(queryString, parseOptions)` - Validates the query string.
- `clearCache()` - Clears any cached filters, aggregations and sorted results. This is done automatically on any change to the index, so this method typically does not need to be called manually.