

DataHorde

A data gathering and manipulation kit with support for a variety of database platforms like MsSQL, MySQL, PSQL, MariaDB and SQLite.


A toolkit that allows a user to easily and programmatically crawl a site for data. It includes a tool for extracting data to pipe-delimited CSV for easy import and manipulation. The source is available on GitLab and component documentation is available at https://rdfedor.gitlab.io/datahorde.

Table of Contents

  1. Installation
  2. Available Commands
    1. Crawler
    2. Devour
    3. Digest
    4. Egress
    5. Mimic
  3. Development Setup
  4. Release History
  5. Contributing

1. Installation

OS X & Linux:

To install the CLI globally so that it is accessible from the terminal,

npm install -g datahorde

To install as part of a project,

npm install --save datahorde
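
After a global install, you can sanity-check that the CLI is on your PATH by printing the version and the help output; both flags appear in the usage listing below,

dh --version
dh --help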

2. Available Commands

Usage: dh [options] [command]

Options:
  -V, --version                                       output the version number
  -h, --help                                          output usage information

Commands:
  crawl|c [options] <targetUrl> [otherTargetUrls...]  Searches a collection of urls, collects data and writes to csv
  devour|d [options] <inputFile>                      Allows for the easy ingestion of csvs to various datasources
  digest|di [options] <action>                        Processes digest action [execute,migrate,rollback] against a particular database
  mimic|m [options] <templateFile> <outputFile>       Templating engine which uses ejs to build customized scripts
  egress|v [options] <query>                          Exports data from a datasource using a query to csv

2.1 Crawler

Allows a user to crawl a website, collecting data as defined by the provided selectors and options.

Usage: dh crawl|c [options] <outputFile> <targetUrl> [otherTargetUrls...]

Searches a collection of urls, collects data and writes to csv

Options:
  -l, --load                                           load targetUrls from local csv
  -k, --loadUrlKey <loadUrlKey>                        the header that contains the url (default: "url")
  -e, --loadDelimiter <loadDelimiter>                  delimiter separating columns in the csv (default: "|")
  -a, --loadHeaders <column1>[,column2...]             header name of columns in the csv (default: true)
  -m, --loadRenameHeader                               map column data to headers (default: false)
  -p, --profile <name>                                 use a specific crawler profile (Available: generic) (default: "generic")
  -r, --rowSelector <selector>                         css selector to the rows to select the data from (default: "body")
  -c, --columnSelector <selector>[,otherSelectors...]  css selector to the values inside the rows (default: "a")
  -n, --nextPageLinkSelector <selector>                css selector to url for next page
  -o, --outputHeaders <name>[,otherColumns...]         displays headers at the top of the export
  -h, --headers <header>=<value>[,otherHeaders...]     set header values in the requests made
  -x, --proxy <proxy>                                  path to the socks proxy
  -t, --threads <threads>                              run multiple threads at the same time (default: 1)
  -b, --browser                                        emulate a browser request
  -h, --help                                           output usage information

As an example, let's write to CSV and display a list of the 2019 42nd weekend's top release titles, each release title link, the movie studio that published it, and the link to the studio's Box Office Mojo page,

dh crawl -r 'table.mojo-body-table tr' -c 'td.mojo-field-type-release a,td.mojo-field-type-release a:attr(href),td.mojo-field-type-release_studios a,td.mojo-field-type-release_studios a:attr(href)' -o 'release_title,release_title_link,movie_studio,movie_studio_link' mojo_movies_weekend_24.csv https://www.boxofficemojo.com/weekend/2019W42/

Then, taking the output of the above, we can collect details from each page such as the total domestic and international gross, genre, and feature length,

dh crawl -l -i release_title_link -h 'title,domestic_gross,international_gross,genre,feature_length' -r main -c 'h1,.mojo-performance-summary-table div:nth-child(2) span:nth-child(3) span:text,.mojo-performance-summary-table div:nth-child(3) span:nth-child(3) span:text,.mojo-summary-values > div:nth-child(4) > span:nth-child(2):text,.mojo-summary-values > div:nth-child(2) > span:nth-child(2):text' mojo_movie_details.csv mojo_movies_weekend_24.csv

2.2 Devour

Ingests a file-based CSV into a database table.

Usage: dh devour|d [options] <inputFile>

Allows for the easy ingestion of csvs to various datasources

Options:
  -V, --version                                         output the version number
  -s, --schema <name>=<columnType>,[<otherColumns>...]  describe the csv file and the import table model. (Types: int, str, bigint, double, float, date, uuid)
  -d, --delimiter <delimiter>                           Delimiter separating columns in csv. (default: "|")
  -u, --databaseUri <uri>                               DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
  -t, --tableName <tableName>                           Name of the table to create at the databaseUri. (default: "importTable")
  -h, --noHeaders                                       Csv file has no headers
  -r, --renameHeaders                                   Rename headers to match schema on import
  -h, --help                                            output usage information

Building on the output from the crawler, we can then ingest the CSV using devour into a file-based SQLite database,

dh devour --databaseUri='sqlite:./mojo_movies.sqlite' --schema 'release_title=str(75),release_title_link=str,movie_studio=str(75),movie_studio_link=str' mojo_movies_weekend_24.csv

We could also rename the columns as we import them if they don't line up with what we're aiming for,

dh devour --databaseUri='sqlite:mojo_movies.sqlite' --schema 'title=str(75),titleUrl=str,studio=str(75),studioUrl=str' --renameHeaders ./mojo_movies_weekend_24.csv

2.3 Digest

Digest executes a series of SQL statements against a data source. It supports database migrations and allows for the ingestion and processing of data from CSVs into a data source.

Usage: dh digest|di [options] <action> <basePath>

Processes digest action [execute,crawl,seed,egress,migrate,rollback] against a particular database

Options:
  -V, --version                               output the version number
  -d, --data <name>=<value>,[<otherData>...]  digest data used when generating migrations and sql
  -u, --databaseUri <uri>                     DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
  -h, --help                                  output usage information

Action Definitions

  • Execute (ex) - Runs crawl, migrate, and seed, processes the path files, then egresses the data back to CSV.
  • Crawl (cr) - Collects data from a given set of URLs and exports it to CSV.
  • Seed (se) - Imports data from a CSV into a particular table. Supports the same parameters as defined in the CLI.
  • Egress (eg) - Exports data to CSV from a file-based query. The query can be templatized to use the values passed by the data parameter.
  • Migrate (mi) - Applies schema migrations to a particular data source.
  • Rollback (ro) - Reverses schema migrations applied to a data source.

A basic example is available that shows how to build a digest; its parameters are based on the options from the respective CLI implementations. Migrations are handled by Sequelize.

dh digest exec ./examples/digest-basic-example/

A more complex example with crawlers is also available. Change year or week by passing them via the CLI as shown below; they will override the default values defined in the configuration.

dh digest exec -d 'year=2019,week=43' ./examples/digest-movie-studio-report/

2.4 Egress

Executes a given series of commands against a database and exports data from a table to a file-based CSV.

Usage: dh egress|v [options] <query>

Exports data from a datasource using a query to csv

Options:
  -V, --version                               output the version number
  -o, --output <filepath>                     write data to csv
  -d, --delimeter <delimeter>                 delimiter separating column data (default: "|")
  -u, --databaseUri <uri>                     DatabaseUri supported by Sequelize library. Mysql, MariaDB, MSSQL, SQLite, PSQL supported. (default: "sqlite::memory:")
  -h, --headers <headers>[, otherHeaders...]  comma separated list of headers in the csv (default: true)
  -h, --help                                  output usage information

Continuing from the devour example above, we can then export the same data using egress,

dh egress --databaseUri='sqlite:./mojo_movies.sqlite' --output ./mojo_movies_output.csv 'select * from importTable'
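
The delimiter and headers can also be overridden. A comma-delimited export with explicit headers might look something like the following; this is only a sketch built from the options listed above (note the flag is spelled --delimeter in the help output), and the exact header behaviour is assumed rather than verified,

dh egress --databaseUri='sqlite:./mojo_movies.sqlite' --delimeter ',' --headers 'title,studio' --output ./mojo_titles.csv 'select release_title as title, movie_studio as studio from importTable'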

2.5 Mimic

Allows for files to be templatized so they can be customized based on the parameters passed.

Usage: dh mimic|m [options] <templateFile> <outputFile>

Templating engine which uses ejs to build customized scripts

Options:
  -V, --version                               output the version number
  -d, --data <name>=<value>,[<otherData>...]  generates template using the given values
  -n, --noOverwrite                           does not overwrite the file if the outputFile already exists
  -h, --help                                  output usage information
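
As an example, a SQL template could interpolate the values passed with the data flag using standard EJS syntax. The file names below are hypothetical, and it is assumed that the data values are exposed directly to the template, so treat this as a sketch rather than verified behaviour. A template named weekend_report.sql.ejs might contain,

select * from importTable where year = <%= year %> and week = <%= week %>

and could be rendered with,

dh mimic -d 'year=2019,week=43' ./weekend_report.sql.ejs ./weekend_report.sql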

For more documentation, please refer to the jsdoc.

3. Development Setup

To develop locally, simply check out the repository and run,

npm install
npm test
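
To try the CLI against your local checkout, npm link is one option; this assumes the package.json registers the dh bin, as the usage output above suggests,

npm link
dh --help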

4. Release History

  • 0.0.9 - 0.0.14
    • Updates to documentation
  • 0.0.8
    • Extract common database components
    • Add extensions to common database components
    • Add digest tool to automate commands
    • Add better error handling
  • 0.0.7
    • Update version for npm release
  • 0.0.2
    • Refactor CLI into a single command
    • Refactor library to support csv export
    • Add csv export utility "vomit"
  • 0.0.1
    • Work in progress

5. Contributing

  1. Fork it (https://gitlab.com/rdfedor/datahorde/-/forks/new)
  2. Create your feature branch (git checkout -b feature/fooBar)
  3. Commit your changes (git commit -am 'Add some fooBar')
  4. Push to the branch (git push origin feature/fooBar)
  5. Create a new Pull Request