@picosearch/language-english
v1.0.1
English text preprocessor for picosearch.
English Text Preprocessor
This module provides basic text preprocessing functions for English text, including tokenization, punctuation removal, stopword filtering, and stemming.
Functions
tokenizer(doc: string): string[]
This function takes a string as input and returns an array of tokens (words) extracted from it by matching runs of word characters. If the input is not a string, it returns an empty array.
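The behavior described above can be approximated in a few lines. This is a minimal sketch, not the package's source: the name `tokenizerSketch` and the `\w+` pattern are assumptions chosen to match the description of "word characters", and the real implementation may use a different expression.

```typescript
// Hypothetical re-implementation for illustration only.
// Assumption: "word characters" corresponds to the regex class \w.
function tokenizerSketch(doc: unknown): string[] {
  // Non-string input yields an empty array, as the README describes.
  if (typeof doc !== "string") return [];
  // match() returns null when nothing matches, so fall back to [].
  return doc.match(/\w+/g) ?? [];
}

const tokens = tokenizerSketch("The quick, brown fox!");
console.log(tokens); // ["The", "quick", "brown", "fox"]
console.log(tokenizerSketch(42)); // []
```

Note that under this assumption punctuation acts as a separator, so a contraction such as "It's" would split into two tokens.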
analyzer(token: string): string
This function processes a single token by removing punctuation and converting it to lowercase. It then checks the token against a list of English stopwords. If the token is a stopword, it is removed; otherwise, it is stemmed using the Porter stemmer.
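The three steps (clean, filter, stem) might look roughly like the sketch below. Everything here is a stand-in: `STOPWORDS` is a tiny hypothetical list rather than the one shipped by `stopword`, `stemSketch` is a crude suffix stripper and not the Porter algorithm, and returning an empty string for a removed token is an assumption, since the README does not say what `analyzer` returns in that case.

```typescript
// Hypothetical stopword list; the real package relies on the `stopword` library.
const STOPWORDS = new Set(["the", "a", "an", "and", "of", "is", "to"]);

// Crude stand-in for a stemmer: strips one trailing "ing", "ed", or "s".
// This is NOT the Porter algorithm used by the package.
function stemSketch(word: string): string {
  return word.replace(/(ing|ed|s)$/, "");
}

function analyzerSketch(token: string): string {
  // Step 1: drop punctuation and lowercase.
  const cleaned = token.replace(/[^\w\s]/g, "").toLowerCase();
  // Step 2: filter stopwords. Assumption: "" signals a removed token.
  if (cleaned === "" || STOPWORDS.has(cleaned)) return "";
  // Step 3: stem whatever survives.
  return stemSketch(cleaned);
}

// Chaining tokenization and analysis, then discarding removed tokens.
const terms = ("The cats jumped".match(/\w+/g) ?? [])
  .map(analyzerSketch)
  .filter((t) => t !== "");
console.log(terms); // ["cat", "jump"]
```

Keeping the analyzer per-token, as the signature `analyzer(token: string): string` suggests, lets a caller apply the same function to both indexed documents and incoming queries.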
Dependencies
porter-stemmer: English word stemmer.
stopword: A library containing a list of stopwords for various languages, including English.
