longestrepeatedstrings

v1.0.41

Published

4 months ago

Finds duplicated text strings and generates a report about the longest substrings or most frequent words in supplied text


braksator

longest repeated substring string strings substrings duplicate duplicated text find finder words code

Longest Repeated Strings

Finds duplicated text and generates a report about the longest substrings or most frequent words in supplied text, weighted by how much space the string takes up overall (length * occurrences).

You supply input text or files. It returns raw data or a text report.

🧵 Try an online demo

(This module was designed to analyze JavaScript code for refactoring opportunities in a Gulp task.)

Stand-alone usage

See the online demo link above, or download the project zip file and open index.html to use the GUI.

Installation

This is a Node.js module available from the Node Package Manager (npm).

https://www.npmjs.com/package/longestrepeatedstrings

Here's the command to download and install from NPM:

npm install longestrepeatedstrings -S

or with Yarn:

yarn add longestrepeatedstrings

Usage

Include Longest Repeated Strings in your project:

var LRS = require('longestrepeatedstrings');

Finding Repeated Substrings in Text

You can analyze a single text by using the text function to find the longest repeated substrings:

const text = 'Your text content goes here';
const results = LRS.text(text, { maxRes: 20, minLen: 8 });
console.log(results);

Parameters:

text (String): The input text to analyze.
opts (Object, optional): A configuration object with the following properties:
- maxRes (Number, default: 50): The maximum number of results to return. Restricts the final list to highest scoring results and does not speed up processing.
- minLen (Number, default: 4): The minimum length of substrings to consider.
- maxLen (Number, default: 40): The maximum length of substrings to consider.
- minOcc (Number, default: 2): The minimum number of occurrences a substring must have to be included.
- penalty (Number, default: 0): Per-occurrence score penalty, helps order results for deduplication.
- split (Array, default: [' ', ',', '.', '\n']): Splits input after specified strings. If not using the words and clean options, setting this up properly for the expected input will be key to making this module effective.
- break (Array, default: []): Splits input ON these strings and won't include them in matches. Can be used to concatenate an array of texts with a special char.
- escSafe (Boolean, default: true): Will take extra care around escaped characters. May as well leave this on.
- words (Boolean, default: true): If true, matches only whole words.
- clean (Boolean, default: false): If true, strips all symbols from input.
- trim (Boolean, default: true): If true, trims white space from results.
- omit (Array, default: []): An array of substrings to omit from the results. Can be used to ignore accepted long/frequent words.

Note on penalty: using a value such as minLen, for example, will cause longer substrings to appear earlier in the results. A negative penalty will favor more frequent substrings.
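The difference between the split option (cut after a string, keeping it) and the break option (cut on a string, dropping it) can be sketched as follows. This is only an illustration of the two documented behaviours: the helper name `cut` and its `keepDelimiter` flag are hypothetical and are not part of the package.

```javascript
// Hypothetical helper, not the package's internals.
// keepDelimiter = true  -> "split" style: cut AFTER the delimiter and keep it.
// keepDelimiter = false -> "break" style: cut ON the delimiter and drop it.
function cut(input, delimiters, keepDelimiter) {
  const segments = [];
  let current = '';
  let i = 0;
  while (i < input.length) {
    // Find a delimiter that starts at the current position, if any.
    const hit = delimiters.find((d) => d.length > 0 && input.startsWith(d, i));
    if (hit !== undefined) {
      if (keepDelimiter) current += hit;
      if (current) segments.push(current);
      current = '';
      i += hit.length;
    } else {
      current += input[i];
      i += 1;
    }
  }
  if (current) segments.push(current);
  return segments;
}

// "split" style: delimiters stay attached to the preceding segment.
console.log(cut('a b,c', [' ', ','], true)); // [ 'a ', 'b,', 'c' ]

// "break" style: join several texts with a special char, then cut on it.
const joined = ['first text', 'second text'].join('\u0000');
console.log(cut(joined, ['\u0000'], false)); // [ 'first text', 'second text' ]
```

The second call mirrors the documented use of break for concatenating an array of texts with a special character, so that no match can span two texts.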

Returns: An array of objects containing the repeated substrings, their count, and a score for each.
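To make the meaning of minLen, maxLen, minOcc and maxRes concrete, here is a deliberately naive sketch that counts every substring in range and ranks the repeats. It is not the algorithm this package uses, and the function name `naiveRepeats` and the field names `text`, `count` and `score` are assumptions made for the example.

```javascript
// Naive illustration only: examines every substring between minLen and maxLen.
// Overlapping occurrences are counted, and no word or split handling is done.
function naiveRepeats(input, { minLen = 4, maxLen = 40, minOcc = 2, maxRes = 50 } = {}) {
  const counts = new Map();
  for (let start = 0; start < input.length; start++) {
    const longest = Math.min(maxLen, input.length - start);
    for (let len = minLen; len <= longest; len++) {
      const sub = input.slice(start, start + len);
      counts.set(sub, (counts.get(sub) || 0) + 1);
    }
  }
  return [...counts.entries()]
    .filter(([, count]) => count >= minOcc) // Keep only repeated substrings.
    .map(([text, count]) => ({ text, count, score: text.length * count }))
    .sort((a, b) => b.score - a.score) // Highest score first.
    .slice(0, maxRes); // Trim the final list; this does not speed anything up.
}

console.log(naiveRepeats('abcd-abcd', { minLen: 4 }));
// [ { text: 'abcd', count: 2, score: 8 } ]
```

As in the option description above, maxRes here only trims the sorted list at the end, which is why it has no effect on processing time.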

Analyzing Files

You can analyze multiple files by using the files function. This will read the contents of the files and find repeated substrings in each one.

const files = ['file1.txt', 'file2.txt'];
const opts = { maxRes: 20, minLen: 8 }; // Same options as the text function
const results = LRS.files(files, opts); // Reads the files for you
console.log(results);

Parameters:

files (Array): An array of file paths to analyze.
opts (Object, optional): Same options as in the text function.

Returns: An object where the keys are file names and the values are the repeated substrings found in each file.
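Because the return value is keyed by file name, it can be walked with Object.entries. The sketch below uses a hand-written stand-in object rather than real output. It assumes each value is an array, and the entry fields `text`, `count` and `score` are guesses at the shape. The helper `summarize` is hypothetical.

```javascript
// Stand-in for the object returned by the files function.
// Keys are file names; the entry fields (text, count, score) are assumed.
const resultsByFile = {
  'file1.txt': [
    { text: 'repeated phrase', count: 3, score: 45 },
    { text: 'another one', count: 2, score: 22 },
  ],
  'file2.txt': [],
};

// Hypothetical helper: build one summary line per file.
function summarize(byFile) {
  return Object.entries(byFile).map(
    ([file, found]) => `${file}: ${found.length} repeated substring(s)`
  );
}

console.log(summarize(resultsByFile).join('\n'));
```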

Creating Reports

File Analysis Report

const report = LRS.filesReport(results, 1); // Pass `1` to log to console
console.log(report);

Parameters:

results (Object): The results returned by the files function.
out (Number, optional, default: 0): If set to 1, the report will be logged to the console too.
chars (Object, optional): A configuration object with the following properties:
- delim (String, default: '★'): Character/s to insert between each result.
- open (String, default: '⦅'): Character/s to insert before the repeat count.
- close (String, default: '×⦆'): Character/s to insert after the repeat count.

Returns: A text report summarizing the repeated substrings found in each file.
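The role of the three chars settings can be pictured with a small formatter. This is a sketch under assumptions, not the package's report code: the function `formatReport`, the item fields `text` and `count`, and the choice to put the count after the string are all hypothetical. Only the default characters come from the parameter list above.

```javascript
// Hypothetical formatter showing how delim, open and close could combine.
// Placing the count after the string is an assumption for this sketch.
function formatReport(items, chars = {}) {
  const { delim = '★', open = '⦅', close = '×⦆' } = chars;
  return items
    .map((item) => `${item.text}${open}${item.count}${close}`)
    .join(delim);
}

const sample = [
  { text: 'foo', count: 3 },
  { text: 'bar', count: 2 },
];

console.log(formatReport(sample)); // foo⦅3×⦆★bar⦅2×⦆
console.log(formatReport(sample, { delim: ', ' })); // foo⦅3×⦆, bar⦅2×⦆
```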

Text Analysis Report

const report = LRS.textReport(results, 1); // Pass `1` to log to console
console.log(report);

Parameters:

results (Array): The results returned by the text function.
out (Number, optional, default: 0): If set to 1, the report will be logged to the console too.
chars (Object, optional): Same options as in the filesReport function.

Returns: A list of repeated substrings with their occurrence counts.

Example Workflow

First, analyze either a single text or multiple files:

const text = 'This is an example text with repeated substrings';
const textResults = LRS.text(text);

const files = ['file1.txt', 'file2.txt'];
const fileResults = LRS.files(files);

Afterward, generate the matching report:

const textReport = LRS.textReport(textResults, 1); // Logs the report to console
const filesReport = LRS.filesReport(fileResults, 1); // Logs the report to console

Notes

Results are sorted by a score, which is calculated based on the length of the substring and the number of occurrences.
This package is used in JCrush, a JavaScript code deduplicator.
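The scoring idea in these notes (length multiplied by occurrences) can be sketched in a few lines. This is a minimal illustration rather than the package's actual formula: the function name `weight` is hypothetical, and subtracting `penalty` from the length before multiplying is an assumption about how a per-occurrence penalty might apply.

```javascript
// Hypothetical helper: illustrates "length * occurrences" weighting.
// The per-occurrence penalty handling is an assumption, not the library's code.
function weight(substring, count, penalty = 0) {
  return (substring.length - penalty) * count;
}

const candidates = [
  { text: 'function', count: 5 },
  { text: 'document.getElementById', count: 3 },
];

// Sort highest weight first, as the notes describe for results.
const ranked = candidates
  .map((c) => ({ ...c, score: weight(c.text, c.count) }))
  .sort((a, b) => b.score - a.score);

console.log(ranked);
// 'document.getElementById' scores 23 * 3 = 69, 'function' scores 8 * 5 = 40
```

Under this assumed formula, a positive penalty costs frequent strings more than rare ones, which fits the documented effect of pushing longer substrings up the list.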

Gulp usage

In your gulpfile.mjs, use Longest Repeated Strings as a Gulp plugin:

Step 1: Import Longest Repeated Strings

import LRS from 'longestrepeatedstrings';

Step 2: Create a Gulp Task for Longest Repeated Strings

var analyzeStrings = true;
gulp.task('analyze', function (done) {
  if (analyzeStrings) {
    LRS.filesReport(LRS.files(['./script.min.js', './styles.min.css', './index.html'], {
      clean: 1, words: 1,
      omit: [
        // This is a list of words that we just accept we've used a lot in the
        // content, and we don't need to see them appear in repeated-strings
        // reports. (supply all with lower-case)
        'consciousness', 'enlightenment', 'ephemeral', 'watching', 'observing',
        'communication', 'inspiring', 'realizing', 'uplifting', 'illusion',
      ],
    }), 1, {delim: ", "});
    analyzeStrings = false;
    // Allow the next analysis in an hour; unref() stops the pending timer from keeping the process alive.
    setTimeout(() => { analyzeStrings = true; }, 1000 * 60 * 60).unref();
  }
  done(); // Signal completion
});
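The once-an-hour guard in the task above can also be written without a timer, by comparing timestamps. The sketch below is one possible approach, not something the package provides: the helper `oncePer` and its injectable `now` parameter are hypothetical names introduced for the example.

```javascript
// Hypothetical helper: runs `fn` at most once per `intervalMs`.
// `now` is injectable so the behaviour can be checked without waiting.
function oncePer(intervalMs, fn, now = Date.now) {
  let lastRun = -Infinity;
  return function throttled(...args) {
    const t = now();
    if (t - lastRun < intervalMs) return false; // Skipped: ran too recently.
    lastRun = t;
    fn(...args);
    return true;
  };
}

const HOUR = 1000 * 60 * 60;
let runs = 0;
const analyzeHourly = oncePer(HOUR, () => { runs += 1; });

analyzeHourly(); // Runs the callback.
analyzeHourly(); // Skipped, still inside the hour.
console.log(runs); // 1
```

In a gulpfile, the callback would hold the filesReport call, and the task would still call done() whether or not the analysis ran. No timer is left pending, so nothing keeps the process open.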

Step 3: Run Longest Repeated Strings After Minification

To run Longest Repeated Strings after your minification tasks, add the analyze task in series after them, as in this example:

gulp.task('default', gulp.series(
  gulp.parallel('minify-css', 'minify-js', 'minify-html'), // Run your minification tasks first
  'analyze' // Then run LRS
));

Contributing

https://github.com/braksator/LongestRepeatedStrings

In lieu of a formal style guide, take care to maintain the existing coding style.
