retext-lexrank
v1.4.0
Lexrank algorithm for retextjs
Retext Lexrank
A retext plugin for unsupervised text summarization using the LexRank algorithm.
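For readers unfamiliar with LexRank, the following is a minimal, simplified sketch of the idea: score each sentence by its centrality in a graph of similar sentences. It is not the plugin's source code. The function names, the plain term-frequency cosine similarity (no IDF weighting), the threshold, the damping factor, and the normalization to a maximum of 1 are all assumptions made for illustration.

```javascript
// Simplified LexRank sketch (illustrative only; not the plugin's implementation).
// Assumptions: naive tokenization, term-frequency cosine similarity without
// IDF weighting, and a fixed similarity threshold.

// Count how often each lowercase word occurs in a sentence.
function termFrequencies(sentence) {
  const counts = new Map()
  for (const word of sentence.toLowerCase().match(/[a-z']+/g) ?? []) {
    counts.set(word, (counts.get(word) ?? 0) + 1)
  }
  return counts
}

// Cosine similarity between two term-frequency maps.
function cosine(a, b) {
  let dot = 0
  for (const [word, count] of a) dot += count * (b.get(word) ?? 0)
  const norm = (v) => Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0))
  const denominator = norm(a) * norm(b)
  return denominator === 0 ? 0 : dot / denominator
}

// Score sentences by centrality in a thresholded similarity graph.
function lexrankSketch(sentences, { threshold = 0.1, damping = 0.85, iterations = 50 } = {}) {
  const n = sentences.length
  if (n === 0) return []
  const vectors = sentences.map(termFrequencies)

  // Connect every pair of distinct sentences whose similarity exceeds the threshold.
  const adjacency = vectors.map((a, i) =>
    vectors.map((b, j) => (i !== j && cosine(a, b) > threshold ? 1 : 0))
  )
  const degree = adjacency.map((row) => row.reduce((sum, x) => sum + x, 0))

  // Power iteration with damping, in the style of PageRank.
  let scores = new Array(n).fill(1 / n)
  for (let step = 0; step < iterations; step++) {
    scores = scores.map((_, i) => {
      let incoming = 0
      for (let j = 0; j < n; j++) {
        if (adjacency[j][i] && degree[j] > 0) incoming += scores[j] / degree[j]
      }
      return (1 - damping) / n + damping * incoming
    })
  }

  // Normalize so the top-ranked sentence scores 1.
  const max = Math.max(...scores)
  return scores.map((score) => score / max)
}

const sample = [
  'The cat sat on the mat.',
  'The cat lay on the mat.',
  'A cat sat down.',
  'Quantum physics is hard.'
]
console.log(lexrankSketch(sample).map((score) => score.toFixed(2)))
```

In this sketch, a sentence that shares no vocabulary with the others has no edges in the graph and therefore receives the lowest score.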
Install
npm i --save retext-lexrank
Use
import { unified } from 'unified'
import latin from 'retext-latin'
import lexrank from 'retext-lexrank'
const processor = unified()
  .use(latin)
  .use(lexrank)
const file = '...' // vfile or text string
const tree = processor.parse(file)
await processor.run(tree, file) // run() returns a promise when no callback is given
Options
retext-lexrank accepts an optional options object:
type Options = {
  /**
   * Maximum number of sentences per chunk.
   *
   * Defaults to `Infinity` (no chunking).
   */
  maxSentencesPerChunk?: number
  /**
   * Optional semantic chunk delimiter.
   *
   * If provided, matching paragraphs split the document into independent
   * sections. Delimiter paragraphs are not scored.
   */
  delimiter?: string | RegExp | ((text: string, node: Paragraph) => boolean)
}
Large documents
LexRank is based on pairwise sentence similarity and becomes expensive on very
large inputs. For long documents (for example, books or manuals), use
maxSentencesPerChunk to cap chunk size:
const processor = unified().use(latin).use(lexrank, {
  maxSentencesPerChunk: 500
})
Chunking uses a balanced strategy so you do not end up with tiny tail chunks.
For example, with 505 sentences and maxSentencesPerChunk: 500, the plugin
splits into two balanced chunks instead of 500 + 5.
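The arithmetic behind balanced chunking can be sketched as follows. This is an assumption about how such a strategy might work, not the plugin's actual code: `balancedChunkSizes` and `pairCount` are hypothetical helpers, and the exact sizes the plugin produces may differ.

```javascript
// Hypothetical helper: split `total` sentences into the fewest chunks that
// respect `maxPerChunk`, then spread the sentences as evenly as possible.
function balancedChunkSizes(total, maxPerChunk) {
  if (total <= 0) return []
  if (!(maxPerChunk > 0)) throw new RangeError('maxPerChunk must be positive')
  if (!Number.isFinite(maxPerChunk)) return [total] // Infinity means no chunking
  const chunkCount = Math.ceil(total / maxPerChunk)
  const base = Math.floor(total / chunkCount)
  const remainder = total % chunkCount
  // The first `remainder` chunks take one extra sentence each.
  return Array.from({ length: chunkCount }, (_, i) => base + (i < remainder ? 1 : 0))
}

// Number of unordered sentence pairs that must be compared within one chunk.
function pairCount(sentenceCount) {
  return (sentenceCount * (sentenceCount - 1)) / 2
}

const sizes = balancedChunkSizes(505, 500)
console.log(sizes) // [ 253, 252 ]

// Pairwise comparisons without chunking versus with balanced chunks.
console.log(pairCount(505)) // 127260
console.log(sizes.reduce((sum, size) => sum + pairCount(size), 0)) // 63504
```

Because the number of sentence pairs grows quadratically, two chunks of roughly equal size need about half as many comparisons as a single chunk holding all 505 sentences.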
Semantic chunking with delimiters
If you have meaningful separators (for example chapter markers), use
delimiter to split scoring by section:
const processor = unified().use(latin).use(lexrank, {
  delimiter: /^\[CHAPTER_BREAK\]$/
})
You can also provide a custom function:
const processor = unified().use(latin).use(lexrank, {
  delimiter(text) {
    return text.startsWith('Chapter ')
  }
})
Use with retext-keywords
Adding the part-of-speech and keywords plugins to the pipeline yields more polarized results.
import { unified } from 'unified'
import latin from 'retext-latin'
import pos from 'retext-pos'
import keywords from 'retext-keywords'
import lexrank from 'retext-lexrank'
const processor = unified()
  .use(latin)
  .use(pos)
  .use(keywords)
  .use(lexrank)
Example
Note: The retext-lexrank plugin works best on medium-to-long samples of text, such as web articles, blog posts, and essays. The following is a simple example.
Using the classic write-music sample from the unifiedjs use-cases:
Write Music (by Gary Provost)
This sentence has five words. Here are five more words.
Five word sentences are fine. But several together
become monotonous. Listen to what is happening. The
writing is getting boring. The sound of it drones. It's
like a stuck record. The ear demands some variety.
Now listen. I vary the sentence length, and I create
music. Music. The writing sings. It has a pleasant
rhythm, a lilt, a harmony. I use short sentences. And I
use sentences of medium length. And sometimes when I am
certain the reader is rested, I will engage him with a
sentence of considerable length, a sentence that burns
with energy and builds with all the impetus of a
crescendo, the roll of the drums, the crash of the
cymbals—sounds that say listen to this, it is important.
So write with a combination of short, medium, and long
sentences. Create a sound that pleases the reader's ear.
Don't just write words. Write music.
Supplying the above text to the processor, we can then find the top-ranked sentences:
import { selectAll } from 'unist-util-select'
import { toString } from 'nlcst-to-string'
selectAll('SentenceNode', tree)
  .sort(({ data: { lexrank: a } }, { data: { lexrank: b } }) => b - a)
  .slice(0, 3)
  .forEach(sentence => {
    const score = sentence.data.lexrank.toFixed(2)
    console.log(`[${score}]: ${toString(sentence)}`)
  })
Running the above yields:
[1.00]: I vary the sentence length, and I create music.
[0.85]: And I use sentences of medium length.
[0.71]: So write with a combination of short, medium, and long sentences.
Tests
Run npm test to run the tests.
Run npm run coverage to produce a test coverage report.
