@clipperhouse/jargon

v0.3.0

A tokenizer and lemmatizer for canonical terms in text

Downloads: 126

Readme

Jargon is a TypeScript/JavaScript library for tokenization and lemmatization. It finds variations on canonical terms and converts them to a single form.

For example, in tech, you might see 'node js' or 'NodeJS' or 'node.js' and want them understood as the same term. That’s lemmatization.

Quick start

npm install "@clipperhouse/jargon@latest"

Then create a file, preferably TypeScript.

// demo.ts

import jargon from '@clipperhouse/jargon';
import stackexchange from '@clipperhouse/jargon/stackexchange'; // a dictionary

const text = 'I ❤️ Ruby on Rails and vue';

const lemmatized = jargon.Lemmatize(text, stackexchange);

console.log(lemmatized.toString());

// I ❤️ ruby-on-rails and vue.js

Or the same thing in plain JavaScript (CommonJS):

// demo.js

const jargon = require('@clipperhouse/jargon');
const stackexchange = require('@clipperhouse/jargon/stackexchange');

const text = 'I ❤️ Ruby on Rails and vue';

const lemmatized = jargon.Lemmatize(text, stackexchange);
console.log(lemmatized.toString());

// I ❤️ ruby-on-rails and vue.js

What’s it doing?

jargon tokenizes the incoming text — it’s not search-and-replace or regex. It understands tech-ish terms as single words, such as Node.js and TCP/IP, and #hashtags and @handles, where other tokenizers might split those words. It does pretty well with email addresses and URLs.

Tokens go to the lemmatizer, with a dictionary. The lemmatizer passes over tokens and asks the dictionary if it recognizes them. It greedily looks for multi-token phrases like 'Ruby on Rails', converting them to a single ruby-on-rails token.

The lemmatizer returns a lazy iterable. You should consume tokens with for...of.
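
A rough sketch of consuming that iterable (this assumes each yielded token stringifies via toString(); the exact token shape may differ):

// consume.ts (illustrative sketch, not verified against the token API)

import jargon from '@clipperhouse/jargon';
import stackexchange from '@clipperhouse/jargon/stackexchange';

const text = 'Deployed our node js app, see #devops';

const lemmatized = jargon.Lemmatize(text, stackexchange);

for (const token of lemmatized) {
  // Hashtags, @handles and terms like node.js arrive as single tokens;
  // how each token stringifies is an assumption here.
  console.log(token.toString());
}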

Dictionaries

Two dictionaries are included.

The stackexchange dictionary looks for technology terms, using tags and synonyms from Stack Overflow. It is insensitive to spaces, hyphens, dots, slashes and case, so variations like ASPNET and asp.net are recognized as the same thing. It also understands synonyms such as ecmascript ↔ javascript.
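
For instance (a sketch; the canonical spellings come from the dictionary data, so the comments describe expected behavior rather than a verified run):

// variants.ts

import jargon from '@clipperhouse/jargon';
import stackexchange from '@clipperhouse/jargon/stackexchange';

const text = 'We moved from ASPNET to asp.net and write ecmascript';

const lemmatized = jargon.Lemmatize(text, stackexchange);

console.log(lemmatized.toString());
// ASPNET and asp.net should map to the same canonical term,
// and ecmascript to its javascript synonym (exact forms depend on the dictionary).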

Another example is the contractions dictionary. It splits tokens like it'll into two tokens it and will.
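
A minimal sketch using only the contractions dictionary (the output comment reflects the behavior described above, not a verified run):

// contractions-demo.ts

import jargon from '@clipperhouse/jargon';
import contractions from '@clipperhouse/jargon/contractions';

const text = "It'll work";

const lemmatized = jargon.Lemmatize(text, contractions);

console.log(lemmatized.toString());
// Expected along the lines of: It will work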

You can pass multiple dictionaries to the lemmatizer.

// demo.ts

import jargon from '@clipperhouse/jargon';
import stackexchange from '@clipperhouse/jargon/stackexchange';
import contractions from '@clipperhouse/jargon/contractions';

const text = 'She’ll use react js and type script';

const lemmatized = jargon.Lemmatize(text, stackexchange, contractions);

console.log(lemmatized.toString());

// She will use react.js and typescript