suffixer

v1.0.4

Published

7 months ago

heuristic-based linear-time generalized suffix tree builder with search functionalities

0High
0Medium
0Low

aptivator

`suffixer`

Introduction

Suffix tree is a hierarchical grouping of all possible strings' endings. suffixer adopts Ukkonen's algorithm to build generalized (i.e., multi-string) suffix trees. The latter are cornerstone data structures for text analysis. The library implements a heuristic approach that organizes each tree construction operation topically in accordance with the operation's purpose. Suffix tree algorithms are not the easiest to understand and build. "Intuitive" arrangement is employed to make the implementation more understandable. suffixer is not just an educational or a demonstration utility. The library was written primarily for production and uses a number of optimizations to efficiently build and query suffix trees. Multiple strings are handled without separators and are stored individually. This makes augmenting and searching a tree only a matter of traversal devoid of extraneous delimiter-related steps. For performance reasons, JavaScript Maps, instead of plain objects ({}), are used to store edges' (paths between nodes) and string endings' information. Edges that have a child node and are 1-character long are represented only as a character key in a respective map without other information such as edge's start, stop, and string identifier. suffixer also works with "complex" characters (e.g., emojis [🙂]) that take multiple bytes of storage.

Installation

Installing the Library

To fetch the library, run the following command.

npm install --save suffixer

Distributed Versions

suffixer's default export is either an EcmaScript (ES) or a CommonJS (as a UMD) module that bundles the source code without transpilation. The library makes use of private class methods, latest native methods (e.g., Array's isArray, Object.hasOwn), and data structures such as Set and Map. The defaults are provided as such with the expectation that suffixer will be augmented as a dependency to a host project that, in turn, will be transpiled for some target environment or used, as is, in a browser or server-side environment (e.g., Node 20+) that supports the utilized language features.

For those rare circumstances when suffixer has to be utilized in older backend environments or included in a larger bundle without transpilation (for older browsers), the EcmaScript 5 distributable is available from suffixer\es5.

Usage

Building a Suffix Tree

Instantiating a Suffix Tree

suffixer's default export is Suffixer class. The latter can be instantiated with a string or an array of strings that will compose a suffix tree. Suffixer() can also be called with configurations that affect how query results are presented. Two settings are available: returnStrings and includeIndices. Both of these are true by default. Setting returnStrings to false directs to return a string identifier. includeIndices as false instructs suffixer to exclude string indices at which a matched search string begins. Some use cases may require associating extra information to strings that are in a suffix tree. Such linkages could be more efficient to operationalize using an integer identifier versus a full text.

Instantiating a Blank Suffix Tree

import {Suffixer} from 'suffixer';

let tree = new Suffixer();

Creating a Suffix Tree with a String

let string = 'abracadabra';
let tree = new Suffixer(string);

Initializing a Suffix Tree with an Array of Strings

let strings = ['way', 'ways'];
let tree = new Suffixer(strings);

Invoking a Suffix Tree with Configurations

let configs = {includeIndices: false};
let tree = new Suffixer(configs);

Getting a Suffix Tree with Strings and Configurations

let strings = ['way', 'ways'];
let configs = {returnStrings: false};
let tree = new Suffixer(strings, configs);

Augmenting a Suffix Tree

suffixer is an online implementation and allows suffix trees to be expanded after instantiation. The library makes available addString and addStrings methods to add a string or an array of string to an already-created tree. Both functions return an identifier or an array of identifiers, respectively, of the added string(s).

Adding a String to a Suffix Tree

let configs = {includeIndices: false};
let tree = new Suffixer(configs);
let strId = tree.addString('way'); // 0

Adding Strings to a Suffix Tree

let configs = {includeIndices: false};
let strings = ['way', 'ways'];
let tree = new Suffixer(configs);
let strIds = tree.addStrings(strings); // [0, 1];

Available Instance Data

Three data can be accessed from a suffix tree instance: root, strings, and stringIds. root is a node that is an entry point into a tree. strings is an array of all texts composing a suffix tree. stringIds are internal identifiers of entered strings and are primarily used by excludes() method.

Changing Suffix Tree Configurations

The library provides setConfigs method to override configurations after a suffix tree is instantiated.

Overriding Configurations

let strings = ['way', 'ways'];
let tree = new Suffixer(strings);
tree.setConfigs({returnStrings: false});

Querying a Suffix Tree

Selecting Strings that Include a Substring

includes() searches a suffix tree for a string pattern. If successful, the method returns an array of results. Each result is an array containing a string that includes a searched-for substring plus indices at which a pattern occurs. When nothing is found, the function returns an empty array. includes() also takes an optional configurations' parameter that just for a method's invocation will override a tree's global settings.

Selecting Strings that Include a String Pattern

let strings = ['radar', 'bay'];
let tree = new Suffixer(strings);
let results = tree.includes('a'); // [['radar', [1, 3]], ['bay', [1]]]

Passing an Optional Configurations Object

let strings = ['radar', 'bay'];
let tree = new Suffixer(strings);
let results = tree.includes('a', {returnStrings: false}); // [[ 0, [1, 3]], [1, [1]]]

Returning an Empty Array when There is No Match

let strings = ['radar', 'bay'];
let tree = new Suffixer(strings);
let results = tree.includes('stop'); // []

Finding Strings that Start with a Substring

startsWith() searches for strings that begin with a specified pattern. When a search is fruitful, the function returns an array of matches. No indices are provided. By implication a pattern would start at the index of 0. The method also takes a configurations' parameter that can instruct to return only string identifiers. When no matches are found, an empty array is produced.

Selecting Strings that Start with a Pattern

let strings = ['way', 'ways'];
let tree = new Suffixer(strings);
let results = tree.startsWith('wa'); // ['way', 'ways']
results = tree.startsWith('ay'); // []

Locating Strings that End with a Substring

endsWith() will attempt to locate strings whose ending matches a search string. When there are search results, the method outputs a matched string and an index at which the ending begins. An empty array is returned when no strings match a query. endsWith() can also override default configurations to return strings and indices.

Locating Strings that end With a Pattern

let tree = new Suffixer(['way', 'ways']);
let results = tree.endsWith('ay'); // [['way', 1]]
results = tree.endsWith('ly') // []

Searching Strings that Exclude a Substring

excludes() uses includes() internally to find all strings that match a search string and then removes these strings' identifiers from a set of all string identifiers. Given this approach, excludes() is the second slowest query function. Performance tests show that finding words out of 370,105 that exclude a character a takes a little over a second. Results of course will vary depending on the hardware. The function returns just the strings or string identifiers that do not have a search pattern or an empty array when there are no exclusion results.

Searching Strings that Do not Include a Pattern

let tree = new Suffixer(['way', 'ways', 'silly']);
let results = tree.excludes('a'); // ['silly']
results = tree.excludes('y'); // []

Getting Strings that Equal to a Search String

equals() provides all strings that exactly match the search string. The method returns only the strings, string identifiers or an empty array when there is no match.

Getting Strings that Are an Exact Match

let tree = new Suffixer(['way', 'ways', 'silly']);
let results = tree.equals('way', {returnStrings: false}); // [0]

Determining the Longest Repeating Substring

findLongestRepeating() looks for the longest substring that occurs more than once. The function will error if a tree is built with more than one string. When successful, the method produces an object that contains a repeating substring plus indices at which the substring starts. Running this search on a string without repeats will return undefined.

Looking for the Longest Repeating Substring

let tree = new Suffixer('abracadabra');
let results = tree.findLongestRepeating(); // {indices: [7, 0], repeating: 'abra'}

Locating Single Character Repeats

let tree = new Suffixer('dmitriy');
let results = tree.findLongestRepeating(); // { indices: [2, 5], repeating: 'i' }

Returning undefined When no Repeats Exist

let tree = new Suffixer('abcdef');
let results = tree.findLongestRepeating(); // undefined

Isolating the Longest Common Substring

findLongestCommon() will attempt to find the lengthiest common substring across all strings that are in a suffix tree. If such a substring exists, the function returns an object with strData and common entries. strData is an array of words plus indices for each word at which the common substring begins. It is possible for a word to have more than one lengthiest substring that occurs across all words in a tree. common is the actual substring. findLongestCommon() is the slowest query function. It finds the deepest internal nodes and walks up to their parents until an internal node is encountered under which various suffixes of all strings fall. Performance tests show that finding a common substring across 5000 words takes less than 0.1 seconds. The method outputs undefined when no common substring exists.

Isolating the Longest Shared String

let tree = new Suffixer(['way', 'ways', 'abracadabra', 'saint']);
let results = tree.findLongestCommon();
console.log(results);
/*
  {
    strData: [
      ['abracadabra', [10, 3, 5, 7, 0]],
      ['saint', [1]],
      ['way', [1]],
      ['ways', [1]]
    ],
    common: 'a'
  }
*/

Development

Development Setup

Perform the following steps to setup the repository locally.

git clone https://github.com/aptivator/suffixer.git
cd suffixer
npm install

To start development mode run npm run dev or npm run dev:coverage.

Contributing Changes

The general recommendations for contributions are to use the latest JavaScript features, have tests with complete code coverage, and include documentation. The latter may be necessary only if a new feature is added or an existing documented feature is modified.

Potential New Features

Suffix trees can be used for a wide variety of applications such as text compression, least common extension, string approximation, and suffix array construction to name a few. Over time some of these features will be added. Perhaps a useful tree construction feature to have is an on-demand removal of a string(s) from a suffix tree.

Implementation Details

Those interested in details of the algorithm used should first get acquainted with the data structure that this implementation uses to store a generalized suffix tree. Algorithm description can be read next.

Performance

A variety of tests are provided to gauge suffixer's performance. Code from suffix-tree-demo was used for some comparisons. That project directly implements Ukkonen's algorithm for generalized trees. The package concatenates all strings together separated by a delimiter. Build and query tests show this to be not an efficient approach. suffixer can build a tree of 370,105 dictionary words in less than four seconds. The delimited approach (on the same machine) takes 10s of minutes. When ingesting just one very large string, suffixer is about a third faster than a typical Ukkonen implementation. Using Maps instead of plain objects contributes to speed boost. This library minimizes some function calls and eliminates full edge representation for 1-character edges that have child nodes further improving performance.

Run the following command to see performance tests.

npm run performance

Caveats

Very large suffix trees may take more space than a default heap allocation. For example, for node.js, --max-old-space-size command line option may need to be used to increase the amount of working memory. Suffix trees in general consume a lot of space.