@datagica/parse-companies

v0.0.7

Published

2 years ago

Find companies in a text

datagica

Parse companies

Find companies in a text. Well, most of the time.

Free data sources used

French open corporation database (SIRENE)
http://www.opendata500.com/us/list/
https://github.com/GovLab/OpenData500
http://api.corpwatch.org
https://datahub.io/dataset/corpwatch

Guidelines

It is better to have a little solid data than a lot of garbage, so I try to clean up and verify things by hand (with Google) as much as possible, or use pattern matching to normalize anomalies.

When cleaning up the dataset, there are a couple of things to keep in mind:

Common problems

Examples of anomalies and issues:

The legal category of a company may or may not be present.
The legal category may have variations (e.g. S.A.S., S.A.S, SAS).
It can appear at the beginning or at the end of the name (e.g. "FOOBAR SARL", "SARL FOOBAR").

The solution is to normalize names whenever possible, and to generate aliases covering most of these cases.
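The normalization and alias generation described above could be sketched as follows. This is only an illustration, not the code used by parse-companies: the LEGAL_FORMS list and the function names (canonicalForm, splitLegalForm, generateAliases) are assumptions made up for this example.

```javascript
// Hypothetical list of legal forms; a real dataset would need many more.
const LEGAL_FORMS = ["SAS", "SARL", "SA"];

// Return the canonical legal form for a token ("S.A.S." -> "SAS"), or null.
function canonicalForm(token) {
  const stripped = token.toUpperCase().replace(/\./g, "");
  return LEGAL_FORMS.includes(stripped) ? stripped : null;
}

// Split a name into its base and its legal form, whether the legal form
// is the first or the last word of the name.
function splitLegalForm(name) {
  const words = name.trim().split(/\s+/);
  if (words.length > 1) {
    const first = canonicalForm(words[0]);
    if (first) return { base: words.slice(1).join(" "), form: first };
    const last = canonicalForm(words[words.length - 1]);
    if (last) return { base: words.slice(0, -1).join(" "), form: last };
  }
  return { base: words.join(" "), form: null };
}

// "SAS" -> "S.A.S."
function dotted(form) {
  return form.split("").join(".") + ".";
}

// Generate aliases covering the bare name, each spelling of the legal form,
// and both positions (prefix and suffix).
function generateAliases(name) {
  const { base, form } = splitLegalForm(name);
  const aliases = new Set([base]);
  if (form) {
    const variants = [form, dotted(form), dotted(form).slice(0, -1)];
    for (const variant of variants) {
      aliases.add(`${base} ${variant}`);
      aliases.add(`${variant} ${base}`);
    }
  }
  return [...aliases];
}
```

With this sketch, generateAliases("FOOBAR SARL") and generateAliases("S.A.R.L. FOOBAR") produce the same set of aliases, so either spelling found in a document can be matched against the same entry.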

About aliases

You may wonder why we pre-generate aliases, since it could be done at runtime for entries in the dataset and/or for each word found in documents as we go.

It is important to understand that normalizing at runtime costs CPU time, whereas disk, memory and network keep getting faster and cheaper.

Up to a certain size, it is much faster to just load one big blob into memory than to do additional per-row work (especially since we have 1M+ rows).
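To illustrate the trade-off, here is a minimal sketch of what loading pre-generated aliases into memory could look like. The dataset shape (name, aliases) and the functions buildIndex and lookup are hypothetical and are not the actual parse-companies internals.

```javascript
// Hypothetical pre-generated dataset: every entry ships with its aliases
// already expanded, so nothing has to be normalized at load time.
const entries = [
  { name: "FRENCH COMPANY", aliases: ["FRENCH COMPANY SAS", "SAS FRENCH COMPANY"] },
  { name: "FOOBAR", aliases: ["FOOBAR SARL", "SARL FOOBAR"] },
];

// Build the index once: alias -> canonical entry.
function buildIndex(dataset) {
  const index = new Map();
  for (const entry of dataset) {
    index.set(entry.name, entry);
    for (const alias of entry.aliases) {
      index.set(alias, entry);
    }
  }
  return index;
}

// Each lookup is a plain Map access, with no per-row normalization work.
function lookup(index, candidate) {
  return index.get(candidate) || null;
}

const companyIndex = buildIndex(entries);
```

The cost is paid in memory (one key per alias) rather than in CPU time per candidate string, which is the balance the paragraphs above argue for.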

Admittedly, that doesn't mean we cannot perform normalization at runtime: @datagica/parse-entities actually does perform basic normalization, and allows one to define custom rules.

I haven't felt the need for this step in parse-companies so far, but if you want you can have a look at parse-institution to see how it is done.

Perhaps this technique could be used to solve cases such as "SAS FRENCH COMPANY" vs. "FRENCH COMPANY SAS". Still, it should be used in moderation, as we need to keep a good balance between loading time, CPU and memory.

Cover names

Many businesses have a corporate name that differs from the business name, e.g. the company owning a restaurant might have a different name than the restaurant itself.

In that case, the restaurant name should be in the aliases.

Small businesses

For small businesses, it is better to prepend the category and/or append the street address, e.g.:

"BAR TABAC" => "Bar Tabac, 28 Smith Street, Brooklyn"
"LE CARREFOUR" => "RESTAURANT LE CARREFOUR", "LE CARREFOUR, 42 RUE DU BLABLA"
"MARCEL" => "GARAGE MARCEL"
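The examples above could be produced by a small helper along these lines. This is a sketch only: smallBusinessAliases and its category and address options are hypothetical names, and the dataset may not store these fields in this form.

```javascript
// Build more specific aliases for an ambiguous small-business name by
// prepending its category and/or appending its street address.
// Both options are assumed to come from the source listing when available.
function smallBusinessAliases(name, { category, address } = {}) {
  const aliases = [];
  // Skip the prefix if the name already starts with the category.
  const hasCategory =
    category && name.toUpperCase().startsWith(category.toUpperCase() + " ");
  if (category && !hasCategory) {
    aliases.push(`${category} ${name}`);
  }
  if (address) {
    aliases.push(`${name}, ${address}`);
  }
  return aliases;
}
```

For example, smallBusinessAliases("MARCEL", { category: "GARAGE" }) gives a single alias, "GARAGE MARCEL", which is far less likely to produce false matches in a text than the bare first name.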

Hedge fund subsidiaries / family trusts

They are mostly piggy banks and proxies for money, not "real" businesses with offices, services, customers, etc.
They are usually owned by a single person or a hedge fund, and located in a tax haven.
They have weird names such as "LITTLE SUNSHINE HEDGE FUND CAIMAN B XXVII" or "JOHN DOE JR FAMILY HOLDING".

For the moment they are of little interest to us when analyzing news reports and résumés, so we delete them when we find one (filter by name, company size, e.g. a one-person company => delete).

However, if the hedge fund is actually a big one, we can also delete the subsidiary and keep only the parent company (e.g. "LITTLE SUNSHINE HEDGE FUND CAIMAN B XXVII" => "LITTLE SUNSHINE").
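A rough sketch of this clean-up step is shown below. Everything in it is an assumption made for illustration: the name patterns, the employees field, the decision to require both a matching name and a company size of at most one person, and the way the parent name is cut at "HEDGE FUND".

```javascript
// Hypothetical name patterns for shell-like entities.
const SHELL_PATTERNS = [/\bHEDGE FUND\b/, /\bFAMILY (HOLDING|TRUST)\b/];

// Flag a company when its name matches a pattern AND it has at most one
// employee. Entries without an employees value are never flagged.
function looksLikeShell(company) {
  const name = company.name.toUpperCase();
  return company.employees <= 1 && SHELL_PATTERNS.some((p) => p.test(name));
}

// Drop the flagged entries from the dataset.
function cleanDataset(companies) {
  return companies.filter((company) => !looksLikeShell(company));
}

// Keep only what precedes "HEDGE FUND" to recover a parent company name.
function parentName(name) {
  const match = name.match(/^(.+?)\s+HEDGE FUND\b/i);
  return match ? match[1] : name;
}
```

Requiring both conditions is a cautious choice: filtering on size alone would also delete legitimate one-person businesses, and filtering on name alone would delete large funds that may be worth keeping under their parent name.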

This is not an issue, because if we want to use them in the future, we can still go back to CorpWatch (US) or SIRENE (FR) and download the full listings again.
