dawg-search
v2.0.1
Search for multiple words in a text by leveraging two types of automata
DAWG Search
Description
- It aims to solve a string-searching problem: how to find many words in one text.
- It's designed to work with non-English languages like Chinese and Japanese.
- It's based on two well-studied data structures: the Trie Automaton and the Suffix Automaton.
- If multiple matches overlap, the one that appears first or is longer takes precedence.
Install
npm i dawg-search
Usage
import { prepareSearch, refineMatches } from 'dawg-search'
const text = '举头望明月,低头思故乡'
const words = ['明月', '故乡', '月,低']
const { findWords } = prepareSearch(words)
const results = refineMatches(findWords(text))
// [{start: 3, end: 5}, {start: 9, end: 11}]
How it works
Trie Automaton
First you need to prepare a dictionary of words. It is processed into a trie that merges not only common prefixes but also common suffixes.
Historically, this structure is called a Deterministic Acyclic Finite State Automaton (DAFSA). (A regular trie would also work, but since many words share common suffixes, merging them can save a lot of memory.)
Suffix Automaton
When searching, the text is processed into a suffix automaton. The trie automaton can be reused for each new text.
Traversing Phase
Then, starting from the root node of the suffix automaton, it traverses every transition, which implicitly visits every substring of the text.
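To make the idea concrete, here is a minimal sketch of one way such a lockstep traversal could look. It is not the library's actual code: every name in it (`SamState`, `TrieNode`, `buildTrie`, `buildSuffixAutomaton`, `findOccurringWords`) is hypothetical, it uses a plain trie instead of a suffix-merged DAFSA, and it only reports which words occur, not their positions.

```typescript
// Hypothetical sketch; names and structure are assumptions, not dawg-search internals.
interface SamState { len: number; link: number; next: Map<string, number> }
interface TrieNode { next: Map<string, TrieNode>; word?: string }

// Plain trie over the dictionary (prefixes merged only, no suffix merging).
function buildTrie(words: string[]): TrieNode {
  const root: TrieNode = { next: new Map() }
  for (const word of words) {
    let node = root
    for (let i = 0; i < word.length; i++) {
      const ch = word.charAt(i)
      let child = node.next.get(ch)
      if (child === undefined) {
        child = { next: new Map() }
        node.next.set(ch, child)
      }
      node = child
    }
    node.word = word // mark the node where a dictionary word ends
  }
  return root
}

// Standard online suffix automaton construction; state 0 is the root.
function buildSuffixAutomaton(text: string): SamState[] {
  const st: SamState[] = [{ len: 0, link: -1, next: new Map() }]
  let last = 0
  for (let i = 0; i < text.length; i++) {
    const ch = text.charAt(i)
    const cur = st.length
    st.push({ len: st[last].len + 1, link: -1, next: new Map() })
    let p = last
    while (p !== -1 && !st[p].next.has(ch)) {
      st[p].next.set(ch, cur)
      p = st[p].link
    }
    if (p === -1) {
      st[cur].link = 0
    } else {
      const q = st[p].next.get(ch)!
      if (st[p].len + 1 === st[q].len) {
        st[cur].link = q
      } else {
        // Split q: the clone keeps q's transitions but has a shorter length.
        const clone = st.length
        st.push({ len: st[p].len + 1, link: st[q].link, next: new Map(st[q].next) })
        while (p !== -1 && st[p].next.get(ch) === q) {
          st[p].next.set(ch, clone)
          p = st[p].link
        }
        st[q].link = clone
        st[cur].link = clone
      }
    }
    last = cur
  }
  return st
}

// Walk both automata in lockstep from their roots. A path that exists in both
// spells a substring of the text that is also a prefix of a dictionary word.
function findOccurringWords(text: string, words: string[]): string[] {
  const sam = buildSuffixAutomaton(text)
  const found: string[] = []
  const stack: Array<[number, TrieNode]> = [[0, buildTrie(words)]]
  while (stack.length > 0) {
    const [state, node] = stack.pop()!
    if (node.word !== undefined) found.push(node.word)
    node.next.forEach((child, ch) => {
      const target = sam[state].next.get(ch)
      if (target !== undefined) stack.push([target, child])
    })
  }
  return found
}
```

Because every path from the root of a suffix automaton spells a substring of the text, following only the transitions that the trie also has keeps the work bounded by the size of the dictionary trie. Reporting positions, as the real library does, would need extra bookkeeping that this sketch leaves out.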
It runs in two passes: findWords returns the results as an unordered list, and refineMatches sorts them and eliminates overlaps.
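The second pass can be pictured with a small sketch. This is only one plausible reading of the "first or longer takes precedence" rule, not the library's refineMatches itself: the `Match` type and the `resolveOverlaps` function are hypothetical names, and `end` is assumed to be exclusive, as the usage example suggests.

```typescript
// Hypothetical sketch; not the actual refineMatches implementation.
interface Match { start: number; end: number } // end is exclusive

function resolveOverlaps(matches: Match[]): Match[] {
  // Sort by start position; for equal starts, put the longer match first.
  const sorted = matches.slice().sort(
    (a, b) => a.start - b.start || (b.end - b.start) - (a.end - a.start)
  )
  const kept: Match[] = []
  let lastEnd = 0
  for (const m of sorted) {
    // Keep a match only if it begins at or after the end of the last kept one.
    if (m.start >= lastEnd) {
      kept.push(m)
      lastEnd = m.end
    }
  }
  return kept
}

// Unordered raw matches for '明月' (3-5), '月,低' (4-7) and '故乡' (9-11).
const resolved = resolveOverlaps([
  { start: 4, end: 7 },
  { start: 9, end: 11 },
  { start: 3, end: 5 },
])
// [{ start: 3, end: 5 }, { start: 9, end: 11 }]
```

Here '月,低' is dropped because it starts inside '明月', which appears earlier in the text.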
Limitations
- This lib can only process concrete words, not regex.
- In both data structures, a long chain of single transitions could be compressed into one transition to achieve a more compact form, but this is not currently done.
License
MIT License © 2026 UnluckyNinja
