@datagica/parse-companies

v0.0.7

Find companies in a text

Parse companies

Find companies in a text. Well, most of the time.

Free data sources used

  • French open corporation database (SIRENE)
  • http://www.opendata500.com/us/list/
  • https://github.com/GovLab/OpenData500
  • http://api.corpwatch.org
  • https://datahub.io/dataset/corpwatch

Guidelines

It is better to have a small amount of solid data than a lot of garbage, so I try to clean up and verify things by hand (with Google) as much as possible, or by using pattern matching to normalize anomalies.

When cleaning up the dataset, there are a couple of things to keep in mind:

Common problems

Examples of anomalies and issues:

  • The legal category of companies may or may not be present
  • The legal category might have variations (e.g. S.A.S., S.A.S, SAS...)
  • Or it can be at the beginning or the end (e.g. "FOOBAR SARL", "SARL FOOBAR")

The solution to this is to normalize the names whenever possible, and to generate aliases covering most of the cases.
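
The normalization and alias generation described above could be sketched as follows. This is only an illustration, not the package's actual code: `LEGAL_FORMS`, `normalizeName` and `generateAliases` are hypothetical names, and the list of legal forms is an assumption that a real dataset would need to extend.

```javascript
// Hypothetical list of French legal forms; a real dataset likely needs more.
const LEGAL_FORMS = ["SARL", "SAS", "SA", "EURL"];

// Upper-case the name and collapse dotted variants ("S.A.S.", "S.A.S")
// into their plain form ("SAS").
function normalizeName(rawName) {
  return rawName
    .trim()
    .toUpperCase()
    .split(/\s+/)
    .map((token) => {
      const undotted = token.replace(/\./g, "");
      return LEGAL_FORMS.includes(undotted) ? undotted : token;
    })
    .join(" ");
}

// Produce the bare name plus the prefix and suffix spellings.
function generateAliases(rawName) {
  const normalized = normalizeName(rawName);
  const tokens = normalized.split(" ");
  const legalForm = tokens.find((t) => LEGAL_FORMS.includes(t));
  const core = tokens.filter((t) => !LEGAL_FORMS.includes(t)).join(" ");
  // Nothing left once the legal form is removed: keep the name untouched.
  if (core === "") return [normalized];
  const aliases = new Set([core]);
  if (legalForm) {
    aliases.add(`${core} ${legalForm}`);
    aliases.add(`${legalForm} ${core}`);
  }
  return [...aliases];
}
```

With these assumptions, both "FOOBAR SARL" and "SARL FOOBAR" yield the same three aliases: the bare name, the suffix form and the prefix form.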

About aliases

You may wonder why we pre-generate aliases, since it could be done at runtime for entries in the dataset and/or for each word found in documents as we go.

It is important to understand that normalizing names at runtime is costly for the CPU, while disk, memory and network are getting faster and cheaper.

Up to a certain size, it is much faster to just load a big blob into memory than to do additional per-row work (especially since we have 1M+ rows).
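
As a rough sketch of that trade-off, the pre-generated aliases can be loaded once into an in-memory index so that each lookup is a single hash access. The row shape (`name`, `aliases`) and the function names are assumptions made for illustration, not the package's real data format.

```javascript
// Hypothetical shape of the pre-generated dataset: one row per company,
// with its aliases already computed offline.
const sampleRows = [
  { name: "FOOBAR", aliases: ["FOOBAR", "FOOBAR SARL", "SARL FOOBAR"] },
  { name: "ACME", aliases: ["ACME", "ACME SAS", "SAS ACME"] },
];

// Build the index once at load time: alias -> company row.
function buildAliasIndex(dataset) {
  const index = new Map();
  for (const row of dataset) {
    for (const alias of row.aliases) {
      index.set(alias, row);
    }
  }
  return index;
}

const aliasIndex = buildAliasIndex(sampleRows);

// At parse time there is no per-row normalization, only a Map lookup.
function findCompany(candidate) {
  return aliasIndex.get(candidate) || null;
}
```

The cost is paid in memory (one entry per alias) instead of CPU time per lookup.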

Admittedly, that doesn't mean we cannot perform normalization at runtime: @datagica/parse-entities actually performs basic normalization and allows one to define custom rules.

I didn't feel the need for this step in parse-companies until now, but if you want you can have a look at parse-institution to see how it is done.

Perhaps this technique could be used to solve cases such as "SAS FRENCH COMPANY" vs. "FRENCH COMPANY SAS". Still, it should be used in moderation, as we need to keep a good balance between loading time, CPU and memory.
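
One way such a runtime rule might look is sketched below. It does not use the parse-entities API (which is not shown here); `RUNTIME_LEGAL_FORMS` and `canonicalKey` are hypothetical names, and the set of legal forms is an assumption.

```javascript
// Hypothetical set of legal forms handled by the rule.
const RUNTIME_LEGAL_FORMS = new Set(["SAS", "SARL", "SA", "EURL"]);

// Move a leading legal form to the end, so that the prefix and suffix
// spellings of a company name share a single lookup key.
function canonicalKey(name) {
  const tokens = name.trim().toUpperCase().split(/\s+/);
  if (tokens.length > 1 && RUNTIME_LEGAL_FORMS.has(tokens[0])) {
    tokens.push(tokens.shift());
  }
  return tokens.join(" ");
}
```

Under these assumptions, "SAS FRENCH COMPANY" and "FRENCH COMPANY SAS" map to the same key, so only one of the two spellings has to be stored in the dataset, at the price of a little extra work per candidate name.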

Cover names

Many businesses have a corporate name that differs from the business name, e.g. the company that owns a restaurant might have a different name than the restaurant itself.

In that case, the restaurant name should be in the aliases.

Small businesses

For small businesses, it is better to prepend the category and/or append the street, e.g.:

    "BAR TABAC" => "Bar Tabac, 28 Smith Street, Brooklyn"
    "LE CARREFOUR" => "RESTAURANT LE CARREFOUR", "LE CARREFOUR, 42 RUE DU BLABLA"
    "MARCEL" => "GARAGE MARCEL"
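
The rule above can be sketched as a small helper. The function name and the `category` and `street` fields are hypothetical; they stand for whatever extra information the dataset provides about a given business.

```javascript
// Hypothetical helper: `category` and `street` are optional fields that
// may or may not be available for a given small business.
function smallBusinessAliases({ name, category, street }) {
  const aliases = [];
  // Prepend the category when known, e.g. "GARAGE MARCEL".
  if (category) aliases.push(`${category} ${name}`);
  // Append the street when known, e.g. "LE CARREFOUR, 42 RUE DU BLABLA".
  if (street) aliases.push(`${name}, ${street}`);
  return aliases;
}
```

The overly generic bare name (such as "MARCEL") is deliberately left out of the result, so that only the more specific forms are matched.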

Hedge fund subsidiaries / family trusts

  • they are mostly piggy banks and proxies for money, not "real" businesses with offices, services, customers...
  • usually owned by a single person or a hedge fund and located in a tax haven
  • they have weird names such as "LITTLE SUNSHINE HEDGE FUND CAIMAN B XXVII" or "JOHN DOE JR FAMILY HOLDING"

For the moment they are of little interest to us when analyzing news reports and curricula vitae, so we delete them when we find them (filter by name, company size; e.g. 1-person company => delete).

However, if the hedge fund is actually a major one, we can delete the subsidiary and keep only the parent company (e.g. "LITTLE SUNSHINE HEDGE FUND CAIMAN B XXVII" => "LITTLE SUNSHINE").
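
A minimal sketch of such a cleanup pass is shown below. The name pattern, the `employees` field and the way the two criteria are combined (either one is enough to flag a row) are all assumptions made for illustration; the actual cleanup is done partly by hand.

```javascript
// Hypothetical name patterns; a real list would be tuned against the data.
const SHELL_NAME_PATTERN = /\b(HEDGE FUND|FAMILY HOLDING|FAMILY TRUST)\b/;

// `employees` is an assumed field holding the company head count.
// A row is flagged if its name matches or if it is a 1-person company.
function isShellCompany(company) {
  return SHELL_NAME_PATTERN.test(company.name) || company.employees <= 1;
}

// Collapse a subsidiary name to its parent by cutting at " HEDGE FUND".
function parentName(name) {
  const cut = name.indexOf(" HEDGE FUND");
  return cut > 0 ? name.slice(0, cut) : name;
}

// Drop flagged rows from a list of companies.
function removeShellCompanies(companies) {
  return companies.filter((company) => !isShellCompany(company));
}
```

Rows without an `employees` value are only judged by their name, since a missing head count never satisfies the size check.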

This is not an issue, because if we want to use them in the future, we can still go back to CorpWatch (US) or SIRENE (FR) and download the full listings again.