@jrc03c/js-nlp-tools

v0.0.13

Published

7 months ago

This is a little set of JS natural language processing tools.

jrc03c

Intro

This is a little set of JS natural language processing tools.

Installation

npm install --save @jrc03c/js-nlp-tools

Usage

import { Corpus, Document } from "@jrc03c/js-nlp-tools"
import fs from "node:fs"

const doc1 = new Document({
  name: "Frankenstein",
  raw: fs.readFileSync("path/to/frankenstein.txt", "utf8"),
})

const doc2 = new Document({
  name: "Pride & Prejudice",
  raw: fs.readFileSync("path/to/pride-and-prejudice.txt", "utf8"),
})

const doc3 = new Document({
  name: "Moby Dick",
  raw: fs.readFileSync("path/to/moby-dick.txt", "utf8"),
})

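// Build a corpus from the three documents
const corpus = new Corpus({ docs: [doc1, doc2, doc3] })

// Process (index) all documents, then compute a score
corpus.process().then(() => {
  // TF-IDF score of the word "Frankenstein" in doc1
  console.log(corpus.computeTFIDFScore("Frankenstein", doc1))
})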

API

`Corpus`

Methods

`Corpus(data)` (constructor)

Returns a new `Corpus` instance. Optionally takes a `data` argument: an object whose properties correspond to `Corpus` instance properties (e.g., `docs`).

`computeIDFScore(word)`

Returns the inverse document frequency score for a given word. It is computed as:

$$\text{IDF} = \log(N / n_t)$$

Where:

$N$ = the total number of documents in the corpus
$n_t$ = the number of documents in which the word appears
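
The formula above can be sketched as a stand-alone function. This is only an illustration, not the library's own code: `sketchIDF` and the plain word-count dictionaries are hypothetical, the natural logarithm is assumed (the base is not stated above), and returning 0 for a word that appears in no document is an assumption made to avoid dividing by zero.

```javascript
// Hypothetical helper for illustration; not the library's implementation.
// `allCounts` is an array of word-count dictionaries, one per document.
function sketchIDF(word, allCounts) {
  const N = allCounts.length
  // n_t: the number of documents in which the word appears at least once
  const nT = allCounts.filter(counts => Object.hasOwn(counts, word)).length
  // Assumption: a word found in no document scores 0 (avoids N / 0)
  if (nT === 0) return 0
  // Assumption: natural logarithm
  return Math.log(N / nT)
}

const idfDocs = [
  { whale: 3, sea: 2 },
  { sea: 1, ship: 4 },
  { letter: 5 },
]

console.log(sketchIDF("sea", idfDocs)) // log(3 / 2)
console.log(sketchIDF("letter", idfDocs)) // log(3 / 1)
```

A word that appears in every document gets a score of $\log(1) = 0$, which is what makes very common words contribute nothing to TF-IDF.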

`computeTFScore(word, doc)`

Returns the term frequency score for a given word and document. It is computed as:

$$\text{TF} = 0.5 + 0.5 \cdot \frac{f_{t, d}}{\max_{t' \in d} f_{t', d}}$$

Where:

$f_{t, d}$ = the number of times the word appears in the document
$\max_{t' \in d} f_{t', d}$ = the number of times the most frequently occurring word appears in the document
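
As a rough illustration of this formula, the sketch below computes the score from a plain word-count dictionary. `sketchTF` is a hypothetical name, and the handling of an empty document (returning 0.5) is an assumption; neither comes from the library.

```javascript
// Hypothetical helper for illustration; not the library's implementation.
// `counts` maps each word in one document to the number of times it appears.
function sketchTF(word, counts) {
  // Count of the most frequently occurring word in the document
  let maxCount = 0
  for (const c of Object.values(counts)) {
    if (c > maxCount) maxCount = c
  }
  // Assumption: an empty document yields the baseline score of 0.5
  if (maxCount === 0) return 0.5
  const f = Object.hasOwn(counts, word) ? counts[word] : 0
  return 0.5 + 0.5 * (f / maxCount)
}

const tfCounts = { whale: 4, sea: 2, ship: 1 }

console.log(sketchTF("whale", tfCounts)) // 1
console.log(sketchTF("sea", tfCounts)) // 0.75
console.log(sketchTF("harpoon", tfCounts)) // 0.5
```

Under this formula the score always lies between 0.5 and 1, and a word that does not appear in the document at all still scores 0.5.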

`computeTFIDFScore(word, doc)`

Returns the TF-IDF score for a given word and document, computed as the term frequency score multiplied by the inverse document frequency score.
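
To make the multiplication concrete, here is a small worked sketch that applies the TF and IDF definitions given in this README to three toy word-count dictionaries. The function `sketchTFIDF`, the sample counts, the natural logarithm, and the zero score for words absent from the whole corpus are all assumptions made for illustration, not the library's code.

```javascript
// Hypothetical helper for illustration; not the library's implementation.
function sketchTFIDF(word, counts, allCounts) {
  // Term frequency: 0.5 + 0.5 * (count of word / count of most frequent word)
  let maxCount = 0
  for (const c of Object.values(counts)) {
    if (c > maxCount) maxCount = c
  }
  const f = Object.hasOwn(counts, word) ? counts[word] : 0
  const tf = maxCount === 0 ? 0.5 : 0.5 + 0.5 * (f / maxCount)

  // Inverse document frequency: log(N / n_t), natural logarithm assumed
  const nT = allCounts.filter(c => Object.hasOwn(c, word)).length
  const idf = nT === 0 ? 0 : Math.log(allCounts.length / nT)

  return tf * idf
}

const docA = { monster: 4, sea: 2 }
const docB = { whale: 6, sea: 3 }
const docC = { letter: 5, sea: 1 }
const toyCorpus = [docA, docB, docC]

console.log(sketchTFIDF("monster", docA, toyCorpus)) // 1 * log(3)
console.log(sketchTFIDF("sea", docA, toyCorpus)) // 0.75 * log(1) = 0
```

Here "monster" is both the most frequent word in `docA` and unique to it, so it scores highest, while "sea" appears in every document and scores 0.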

`process(progress)`

Returns a Promise that resolves once all documents in the corpus have been processed. Optionally takes a `progress` callback, which is called with the fraction of documents processed so far as a value between 0 and 1.
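
The progress-callback contract described above can be sketched in isolation. The function `processAllSketch` below is hypothetical: it assumes items are processed one at a time and that the callback fires once after each item, neither of which is confirmed by this README.

```javascript
// Hypothetical sketch of the progress-callback pattern; not the library's code.
// Assumption: items are processed sequentially, reporting after each one.
async function processAllSketch(items, processOne, progress) {
  for (let i = 0; i < items.length; i++) {
    await processOne(items[i])
    // Report the fraction completed so far, a value between 0 and 1
    if (typeof progress === "function") progress((i + 1) / items.length)
  }
}

const seen = []
const done = processAllSketch(
  ["a", "b", "c", "d"],
  async () => {}, // stand-in for per-document processing
  p => seen.push(p),
)

done.then(() => console.log(seen)) // [0.25, 0.5, 0.75, 1]
```

With the library itself, the corresponding call would presumably look like `corpus.process(p => console.log(p))`.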

Properties

`docs`

An array of Document instances.

`hasBeenProcessed`

A boolean indicating whether the instance's `process` method has been invoked and has completed.

`Document`

Methods

`Document(data)` (constructor)

Returns a new `Document` instance. Optionally takes a `data` argument: an object whose properties correspond to `Document` instance properties (e.g., `wordCounts`).

`getWordCount(word)`

Returns the number of times `word` (a string) appears in the document.

`process()`

Returns a Promise that resolves once the document has been processed (indexed).

Properties

`hasBeenProcessed`

A boolean indicating whether the instance's `process` method has been invoked and has completed.

`isCaseSensitive`

A boolean representing whether or not case should matter when indexing words.

`mostFrequentWord`

A string representing the word that appears most frequently in the document.

`name`

A string representing the name of the document. If no name is given in the `data` object passed to the constructor, a random string is assigned as the document's name.

`raw`

A string representing the raw text on which the document is based.

`totalWordCount`

A non-negative integer representing the total number of words in the document.

`wordCounts`

A dictionary that maps words (as strings) to the number of times each word appears in the document (as non-negative integers).
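
To show how `wordCounts`, `totalWordCount`, and `mostFrequentWord` relate to one another, here is a minimal sketch that derives all three from an already-cleaned string. The function `sketchIndex` is hypothetical, and splitting on single spaces and breaking ties in favor of the word that reached the top count first are assumptions, not documented library behavior.

```javascript
// Hypothetical indexing sketch; not the library's implementation.
// Assumption: `cleanedText` has no punctuation and uses spaces as separators.
function sketchIndex(cleanedText) {
  // A prototype-free object, so words like "constructor" are counted safely
  const wordCounts = Object.create(null)
  let totalWordCount = 0
  let mostFrequentWord = null

  for (const word of cleanedText.split(" ")) {
    if (word.length === 0) continue // skip gaps left by repeated spaces
    wordCounts[word] = (wordCounts[word] ?? 0) + 1
    totalWordCount++
    // Assumption: on ties, the word that reached the count first is kept
    if (
      mostFrequentWord === null ||
      wordCounts[word] > wordCounts[mostFrequentWord]
    ) {
      mostFrequentWord = word
    }
  }

  return { wordCounts, totalWordCount, mostFrequentWord }
}

const index = sketchIndex("the whale and the sea and the ship")

console.log(index.wordCounts.the) // 3
console.log(index.totalWordCount) // 8
console.log(index.mostFrequentWord) // "the"
```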

Utility functions

`clean(raw, shouldPreserveCase)`

Given `raw` (a string) and optionally `shouldPreserveCase` (a boolean), returns a copy of `raw` in which all punctuation has been removed and every whitespace character has been replaced with a space. `shouldPreserveCase` defaults to `false`.
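
A function with this behavior might look roughly like the sketch below. It is not the library's implementation: `sketchClean` is a hypothetical name, treating "punctuation" as the Unicode punctuation category (`\p{P}`) is an assumption, and lowercasing when `shouldPreserveCase` is false is inferred from the parameter name.

```javascript
// Hypothetical sketch of a text-cleaning helper; not the library's code.
function sketchClean(raw, shouldPreserveCase = false) {
  const out = raw
    // Assumption: "punctuation" means the Unicode punctuation category
    .replace(/\p{P}/gu, "")
    // Replace each whitespace character (newline, tab, ...) with one space
    .replace(/\s/g, " ")
  // Assumption: case is folded to lowercase unless asked to preserve it
  return shouldPreserveCase ? out : out.toLowerCase()
}

const sample = "Call me Ishmael.\nSome years ago..."

console.log(sketchClean(sample)) // "call me ishmael some years ago"
console.log(sketchClean(sample, true)) // "Call me Ishmael Some years ago"
```

Note that with this approach apostrophes are removed too, so "don't" becomes "dont", and runs of whitespace are not collapsed into a single space.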

`defineReadOnlyProperty(object, name, value)`

Defines a read-only property called `name` on `object` with the value `value`. Returns `object`.

Note that assignments to properties defined this way fail silently: the property keeps its original value, and you won't be notified that the assignment attempt failed.
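
One way to get this silent-failure behavior is a getter paired with a no-op setter, as in the sketch below. This is an assumption about how such a helper could be written, not a description of the library's code; `sketchDefineReadOnlyProperty` is a hypothetical name. (A data property with `writable: false` would instead throw a `TypeError` on assignment in strict mode, which includes ES modules.)

```javascript
// Hypothetical sketch; not the library's implementation.
function sketchDefineReadOnlyProperty(object, name, value) {
  Object.defineProperty(object, name, {
    configurable: false,
    enumerable: true,
    get: () => value,
    // No-op setter: assignments are ignored without throwing
    set: () => {},
  })
  return object
}

const settings = sketchDefineReadOnlyProperty({}, "answer", 42)

settings.answer = 0 // silently ignored
console.log(settings.answer) // 42
```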
