pelias-analysis v1.2.1

Pelias analysis libraries

This repository contains prebuilt textual analysis functions (analyzers) which are composed of smaller modules (tokenizers). Each tokenizer performs actions such as transforming, filtering, and enriching word tokens.

Using Analyzers

Analyzers are available as functions and can be called like any regular function; the input is a single string and the output is also a single string:

var street = require('./analyzer/street')
var analyzer = street()

analyzer('main str s')
// Main Street South

Analyzers also accept a 'context' object which is available throughout the analysis pipeline:

var analyzer = street({ locale: 'de' })

analyzer('main str s')
// Main Strasse Sued

Using Tokenizers

Tokenizers are intended to be used as part of an analyzer, but can also be used independently by calling Array.reduce on an array of tokens:

var tokenizer = require('./tokenizer/diacritic');

// (the semicolon above is required, otherwise the array literal
// below would be parsed as a property access on the require() call)
[ 'žůžo', 'Cinématte' ].reduce( tokenizer, [] )
// [ 'zuzo', 'Cinematte' ]
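
Multiple tokenizers can be chained by running one reduce pass per tokenizer, feeding the output of one pass into the next. Here is a minimal sketch; the inline lowercase tokenizer is a hypothetical example for illustration, not a module shipped in this repository:

var diacritic = require('./tokenizer/diacritic')

// a hypothetical lowercase tokenizer, defined inline for illustration
var lowercase = function( res, word, pos, arr ){
  res.push( word.toLowerCase() )
  return res
}

// each reduce pass runs one tokenizer over the entire token stream
var tokens = [ 'žůžo', 'Cinématte' ].reduce( diacritic, [] ).reduce( lowercase, [] )
// [ 'zuzo', 'cinematte' ]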

Writing Tokenizers

Tokenizers are functions with the interface expected by Array.reduce: they receive the accumulator (res), the current word (word), its position in the stream (pos), and the complete token array (arr), and must return the accumulator.

In its simplest form, a tokenizer is written as:

// a delete-all tokenizer emits no words
var tokenizer = function( res, word, pos, arr ){

  // you must always return $res
  return res
}

For a tokenizer to have no effect on the token stream, it must call res.push() to push each word it receives onto the response array:

// a no-op tokenizer emits words verbatim as they were taken in
var tokenizer = function( res, word, pos, arr ){

  // push the word on to the response array unmodified
  res.push( word )

  // you must always return $res
  return res
}

A tokenizer can choose which words are pushed downstream; it can also modify words and push more than one word onto the response array:

// a split tokenizer cuts a string on word boundaries, producing multiple words
var tokenizer = function( res, word, pos, arr ){

  // split the input word on word boundaries
  var parts = word.split(/\b/g)

  // push each part downstream
  parts.forEach( function( part ){
    res.push( part )
  })

  // you must always return $res
  return res
}

Using these techniques, you can write tokenizers which delete, modify or create new words.
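
For example, a filtering tokenizer (a hypothetical sketch, not part of this repository) can delete some words while letting the rest pass through unmodified:

// a filter tokenizer deletes empty words and passes the rest through
var tokenizer = function( res, word, pos, arr ){

  // only push non-empty words downstream
  if( word.length > 0 ){
    res.push( word )
  }

  // you must always return $res
  return res
}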

Writing Tokenizers (advanced)

More advanced tokenizers require information about the context in which they are run. For example, knowing the locale of the input tokens allows a tokenizer to vary its behaviour accordingly.

Context is provided to tokenizers by using Function.bind to bind the context to the tokenizer. This information will then be available inside the tokenizer using the this keyword:

// an abbreviation tokenizer converts the contracted form of a word to its equivalent expanded form
var tokenizer = function( res, word, pos, arr ){

  // detect the input locale (or default to english)
  var locale = this.locale || 'en'

  if( 'str.' === word ){
    switch( locale ){
      case 'de':
        // transform to German expansion
        res.push( 'strasse' )
        return res
      case 'en':
        // transform to English expansion
        res.push( 'street' )
        return res
    }
  }

  // push the word on to the response array unmodified
  res.push( word )

  // you must always return $res
  return res
}

You can then control the runtime context of the tokenizer using Function.bind:

var english = tokenizer.bind({ locale: 'en' });
[ 'str.' ].reduce( english, [] )
// [ 'street' ]

var german = tokenizer.bind({ locale: 'de' });
[ 'str.' ].reduce( german, [] )
// [ 'strasse' ]

Command line interface

There is an included CLI script which lets you easily pipe in files for testing an analyzer:

# test a single input
$ node cli.js en street <<< "n foo st w"

North Foo Street West

# test multiple inputs
$ echo -e "n foo st w\nw 16th st" | node cli.js en street

North Foo Street West
West 16 Street

# test against the contents of a file
$ node cli.js en street < nyc.names

100 Avenue
100 Drive
100 Road
... etc

# test against openaddresses data
$ cut -d',' -f4 /data/oa/de/berlin.csv | sort | uniq | node cli.js de street

Aachener Strasse
Aalemannufer
Aalesunder Strasse
... etc

Using the Linux diff command, you can view a side-by-side comparison of the data before and after analysis:

$ diff \
  --side-by-side \
  --ignore-blank-lines \
  --suppress-common-lines \
  --width=100 \
  --expand-tabs \
  nyc.names \
  <(node cli.js en street < nyc.names)

ZEBRA PL                  | Zebra Place
ZECK CT                   | Zeck Court
ZEPHYR AVE                | Zephyr Avenue
... etc

Running tests

Unit tests are run with:

$ npm test

Functional tests are run with:

$ npm run funcs