punkt
v1.0.2
Punkt
A port of NLTK's Punkt sentence tokenizer to JS. The algorithm for this tokenizer is described in:
Kiss, Tibor and Strunk, Jan (2006): Unsupervised Multilingual Sentence
Boundary Detection. Computational Linguistics 32: 485-525.
Installation
Punkt is available through npm:
npm install punkt
You can also download the language packs via npm, under punkt-packs:
npm install punkt-packs
Examples
Base example
import { PunktParameters, PunktTokenizer } from "punkt"
import { english } from "punkt-packs"
const params = new PunktParameters(english)
const tokenizer = new PunktTokenizer(params)
const text = "Dr. Smith went to Washington. He met Sen. Warren at 3 p.m. They talked."
const sentences = tokenizer.tokenize(text)
console.log(sentences) // ["Dr. Smith went to Washington.", "He met Sen. Warren at 3 p.m.", "They talked."]
Custom language packs
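The Node example in this section hands PunktParameters a function that takes a pack file name and returns a promise of that file's text. Assuming that is the whole contract (the README does not spell it out), the same idea could be sketched for a browser, where the pack files are served over HTTP. The helper name makeFetchLoader, the baseUrl layout, and the injectable fetchImpl argument are all hypothetical and not part of punkt.

```javascript
// Hypothetical helper, not part of punkt: builds a loader that reads
// language-pack files over HTTP instead of from the file system.
// `fetchImpl` defaults to the global fetch and can be swapped out in tests.
function makeFetchLoader(baseUrl, fetchImpl = fetch) {
  // Drop trailing slashes so the joined URL has exactly one separator.
  const root = baseUrl.replace(/\/+$/, "")
  return async name => {
    const response = await fetchImpl(`${root}/${encodeURIComponent(name)}`)
    if (!response.ok) {
      // Fail loudly rather than passing an error page on as pack data.
      throw new Error(`Failed to load ${name}: HTTP ${response.status}`)
    }
    return response.text()
  }
}

// Intended use, assuming the constructor accepts any such loader:
// const params = new PunktParameters(makeFetchLoader("/packs/english"))
// await params.init()
```

Checking response.ok matters here because fetch resolves normally on a 404, so without the check a missing file would only surface later as a confusing parse error.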
import { PunktParameters, PunktTokenizer } from "punkt"
import { readFile } from "fs/promises"
import { join } from "path"
// The loader receives a pack file name and resolves with that file's text.
const params = new PunktParameters(n => readFile(join("punkt_tab", "english", n), "utf-8"))
await params.init()
const tokenizer = new PunktTokenizer(params)
const text = "The CEO of L.L.Bean, Shawn Gorman, spoke. Sales rose."
const sentences = tokenizer.tokenize(text)
console.log(sentences) // ["The CEO of L.L.Bean, Shawn Gorman, spoke.", "Sales rose."]
Documentation
The module exposes 3 classes:
PunktTokenizer: The main class, responsible for the sentence splitting.
PunktParameters: Per-language sentence-splitting heuristics data.
PunktLanguageVars: Fine-tuned configuration.
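Since the examples show tokenize(text) returning a plain array of sentence strings, small helpers can be layered on top of PunktTokenizer without touching its internals. As a minimal sketch, the function below groups sentences by paragraph, treating a blank line as a paragraph break. The name tokenizeParagraphs and the blank-line convention are assumptions made for illustration, not part of the module.

```javascript
// Hypothetical helper, not part of punkt: groups sentences by paragraph.
// `tokenizer` is any object with a tokenize(text) method that returns an
// array of strings, the shape the base example shows for PunktTokenizer.
function tokenizeParagraphs(tokenizer, text) {
  return text
    .split(/\n\s*\n/) // one or more blank lines end a paragraph
    .map(paragraph => paragraph.trim())
    .filter(paragraph => paragraph.length > 0) // skip empty leftovers
    .map(paragraph => tokenizer.tokenize(paragraph))
}

// Intended use with a tokenizer built as in the base example:
// const paragraphs = tokenizeParagraphs(tokenizer, text)
// paragraphs[0] would hold the sentences of the first paragraph.
```

Passing the tokenizer in as an argument keeps the helper independent of how the parameters were loaded, so it works the same with punkt-packs or a custom pack.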
