retext-lexrank

v1.4.0

Published

20 days ago

Lexrank algorithm for retextjs

gorango

unist retext nlp lexrank salience

Retext Lexrank

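A retext plugin for unsupervised extractive text summarization using the LexRank algorithm. It assigns each sentence a score, so the highest-ranked sentences can be picked out as a summary.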

Install

npm i --save retext-lexrank

Use

import { unified } from 'unified'
import latin from 'retext-latin'
import lexrank from 'retext-lexrank'

const processor = unified()
  .use(latin)
  .use(lexrank)

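const file = '...' // vfile or text string
const tree = processor.parse(file)

// `run` returns a promise when called without a callback, so wait for it
// before reading scores from the tree.
await processor.run(tree, file)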

Options

retext-lexrank accepts an optional options object:

type Options = {
  /**
   * Maximum number of sentences per chunk.
   *
   * Defaults to `Infinity` (no chunking).
   */
  maxSentencesPerChunk?: number

  /**
   * Optional semantic chunk delimiter.
   *
   * If provided, matching paragraphs split the document into independent
   * sections. Delimiter paragraphs are not scored.
   */
  delimiter?: string | RegExp | ((text: string, node: Paragraph) => boolean)
}

Large documents

LexRank is based on pairwise sentence similarity and becomes expensive on very large inputs. For long documents (for example, books or manuals), use maxSentencesPerChunk to cap chunk size:

const processor = unified().use(latin).use(lexrank, {
  maxSentencesPerChunk: 500
})

Chunking uses a balanced strategy so you do not end up with tiny tail chunks. For example, with 505 sentences and maxSentencesPerChunk: 500, the plugin splits into two balanced chunks instead of 500 + 5.
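
To make the balanced split concrete, here is a minimal sketch of one way such chunk sizes could be computed. The helper names `balancedChunkSizes` and `pairCount` are hypothetical and are not part of the retext-lexrank API; the plugin's actual implementation may differ. The sketch assumes `maxPerChunk` is a positive number.

```javascript
// Hypothetical helper for illustration only; not exported by retext-lexrank.
// Splits `total` sentences into the fewest chunks that respect `maxPerChunk`,
// then spreads the sentences as evenly as possible across those chunks.
function balancedChunkSizes(total, maxPerChunk = Infinity) {
  if (total <= 0) return []
  if (!Number.isFinite(maxPerChunk) || total <= maxPerChunk) return [total]

  const chunks = Math.ceil(total / maxPerChunk)
  const base = Math.floor(total / chunks)
  const extra = total % chunks

  // The first `extra` chunks each take one additional sentence.
  return Array.from({ length: chunks }, (_, i) => base + (i < extra ? 1 : 0))
}

// Number of unordered sentence pairs that need a similarity score.
function pairCount(n) {
  return (n * (n - 1)) / 2
}

console.log(balancedChunkSizes(505, 500)) // [ 253, 252 ]
```

Because the number of sentence pairs grows quadratically, comparing `pairCount(505)` with the sum of `pairCount` over the two chunks shows why chunking reduces the work roughly by half in this case.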

Semantic chunking with delimiters

If you have meaningful separators (for example chapter markers), use delimiter to split scoring by section:

const processor = unified().use(latin).use(lexrank, {
  delimiter: /^\[CHAPTER_BREAK\]$/
})

You can also provide a custom function:

const processor = unified().use(latin).use(lexrank, {
  delimiter(text) {
    return text.startsWith('Chapter ')
  }
})
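
For intuition, the sketch below shows how the three delimiter forms (string, RegExp, function) could be normalized into a single predicate and used to split paragraphs into sections. The names `toMatcher` and `splitSections` are hypothetical, plain `{ text }` objects stand in for nlcst paragraph nodes, and treating a string delimiter as an exact match on the trimmed paragraph text is an assumption rather than documented behavior.

```javascript
// Hypothetical helpers for illustration only; not the plugin's internals.
function toMatcher(delimiter) {
  if (typeof delimiter === 'function') return delimiter
  if (delimiter instanceof RegExp) {
    return text => {
      delimiter.lastIndex = 0 // keep global or sticky regexes stateless
      return delimiter.test(text)
    }
  }
  // Assumption: a string delimiter must equal the trimmed paragraph text.
  if (typeof delimiter === 'string') return text => text.trim() === delimiter
  return () => false
}

function splitSections(paragraphs, delimiter) {
  const isDelimiter = toMatcher(delimiter)
  const sections = [[]]

  for (const paragraph of paragraphs) {
    if (isDelimiter(paragraph.text, paragraph)) {
      sections.push([]) // start a new section; the delimiter itself is dropped
    } else {
      sections[sections.length - 1].push(paragraph)
    }
  }

  return sections.filter(section => section.length > 0)
}

const paragraphs = [
  'Intro text.',
  '[CHAPTER_BREAK]',
  'Chapter one text.',
  'More text.'
].map(text => ({ text }))

const sections = splitSections(paragraphs, /^\[CHAPTER_BREAK\]$/)
console.log(sections.map(section => section.map(p => p.text)))
// [ [ 'Intro text.' ], [ 'Chapter one text.', 'More text.' ] ]
```

Each resulting section would then be scored independently, which matches the description of delimiter paragraphs not being scored.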

Use with `retext-keywords`

Adding the part-of-speech (retext-pos) and keywords (retext-keywords) plugins to the pipeline ahead of lexrank yields more polarized scores.

import { unified } from 'unified'
import latin from 'retext-latin'
import pos from 'retext-pos'
import keywords from 'retext-keywords'
import lexrank from 'retext-lexrank'

const processor = unified()
  .use(latin)
  .use(pos)
  .use(keywords)
  .use(lexrank)

Example

Note
The retext-lexrank plugin works best on medium-to-long samples of text, like web articles, blogs, and essays. The following is a simple example.

Using the classic write-music sample from the unifiedjs use-cases:

Write Music (by Gary Provost)

This sentence has five words. Here are five more words.
Five word sentences are fine. But several together
become monotonous. Listen to what is happening. The
writing is getting boring. The sound of it drones. It's
like a stuck record. The ear demands some variety.

Now listen. I vary the sentence length, and I create
music. Music. The writing sings. It has a pleasant
rhythm, a lilt, a harmony. I use short sentences. And I
use sentences of medium length. And sometimes when I am
certain the reader is rested, I will engage him with a
sentence of considerable length, a sentence that burns
with energy and builds with all the impetus of a
crescendo, the roll of the drums, the crash of the
cymbals—sounds that say listen to this, it is important.

So write with a combination of short, medium, and long
sentences. Create a sound that pleases the reader's ear.
Don't just write words. Write music.

Supplying the above text to the processor, we can then find the top-ranked sentences:

import { selectAll } from 'unist-util-select'
import { toString } from 'nlcst-to-string'

// Rank sentences by LexRank score, highest first, and print the top three.
selectAll('SentenceNode', tree)
  .sort((a, b) => b.data.lexrank - a.data.lexrank)
  .slice(0, 3)
  .forEach(sentence => {
    const score = sentence.data.lexrank.toFixed(2)
    console.log(`[${score}]: ${toString(sentence)}`)
  })

Running the above yields:

[1.00]: I vary the sentence length, and I create music.
[0.85]: And I use sentences of medium length.
[0.71]: So write with a combination of short, medium, and long sentences.
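
To give a rough idea of where such scores come from, here is a simplified, self-contained sketch of a LexRank-style ranking. It is not the plugin's implementation: it uses plain term-frequency vectors instead of TF-IDF, a naive regex tokenizer, and the names `lexrankSketch`, `damping`, and `iterations` are hypothetical. Scaling the scores so the top sentence gets 1 is an assumption based on the sample output above.

```javascript
// Simplified LexRank-style sketch for illustration; not the plugin's code.
function tokenize(sentence) {
  return sentence.toLowerCase().match(/[a-z0-9']+/g) ?? []
}

function termFrequencies(tokens) {
  const counts = new Map()
  for (const token of tokens) counts.set(token, (counts.get(token) ?? 0) + 1)
  return counts
}

function cosine(a, b) {
  let dot = 0
  for (const [term, count] of a) dot += count * (b.get(term) ?? 0)
  const norm = v => Math.sqrt([...v.values()].reduce((sum, c) => sum + c * c, 0))
  const denominator = norm(a) * norm(b)
  return denominator === 0 ? 0 : dot / denominator
}

function lexrankSketch(sentences, { damping = 0.85, iterations = 50 } = {}) {
  const n = sentences.length
  if (n === 0) return []

  const vectors = sentences.map(s => termFrequencies(tokenize(s)))

  // Pairwise similarity matrix: this is the quadratic part of the algorithm.
  const similarity = vectors.map(a => vectors.map(b => cosine(a, b)))

  // Row-normalize similarities into transition probabilities.
  const transition = similarity.map(row => {
    const total = row.reduce((sum, v) => sum + v, 0)
    return row.map(v => (total === 0 ? 1 / n : v / total))
  })

  // Damped power iteration, starting from a uniform distribution.
  let scores = new Array(n).fill(1 / n)
  for (let step = 0; step < iterations; step++) {
    scores = scores.map((_, j) => {
      let incoming = 0
      for (let i = 0; i < n; i++) incoming += scores[i] * transition[i][j]
      return (1 - damping) / n + damping * incoming
    })
  }

  // Assumption: scale so the top-ranked sentence scores 1.
  const max = Math.max(...scores)
  return scores.map(score => score / max)
}

const scores = lexrankSketch([
  'Cats chase mice.',
  'Cats chase birds and mice.',
  'Birds sing.'
])
console.log(scores.map(score => score.toFixed(2)))
```

In this toy input the second sentence shares words with both of the others, so it ends up with the highest score; the sentence with the least overlap ranks last.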

Tests

Run npm test to run tests.

Run npm run coverage to produce a test coverage report.
