@wikiviews/wikiviews-importer
v1.0.0-beta3

Importer for the Wikipedia Pageviews (https://dumps.wikimedia.org/other/pageviews/) dataset.

Wikiviews Importer

A tool for importing excerpts from the Wikipedia Pageviews dataset into Elasticsearch, for use as the data backend of the Wikiviews application.

It automates the process of selecting, downloading, and parsing the dumps and inserting the data into Elasticsearch.

Installation

via NPM

The project provides an NPM package, which can be installed via

npm install -g @wikiviews/wikiviews-importer

This package is automatically generated for each repository tag via Travis CI.

Manually

After cloning the repository, you can install the package manually via

npm install && npm run build && npm install -g

Usage

The tool provides the command-line utility

wv-import

to perform the import.

It optionally accepts a path to a configuration file as its first parameter; additional configuration can be passed via command-line parameters.
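As a rough synopsis (the individual parameters are described under Configuration below, and both parts are optional), an invocation takes the general form

wv-import [path/to/config.json] [--{PARAMETERNAME}={VALUE} ...]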

Configuration

The importer can be configured via a configuration file in JSON format and via command-line parameters (in the format --{PARAMETERNAME}={VALUE}).

The following configuration options are available:

Option | Corresponding CLI parameter | Description | Default
-------|-----------------------------|-------------|--------
tasks.download | download | If set to true, the selected data dumps are downloaded. If set to false, the selection is applied to all already existing files in the destination directory, which are then used as source. | true
tasks.elasticsearch | elasticsearch | If set to true, the selected data dumps are inserted into Elasticsearch, otherwise not. | true
download.source | source | The source pattern. It needs to be a URL pointing to the data dumps and containing variables (consecutive appearances of a variable that are longer than the value are padded with leading zeros), which are then substituted by the selected date ranges. Keep in mind that ALL occurrences of the variables are substituted and that the same variables are used in the download.output pattern, so choose your variables wisely (or stick to the default). | 'https://dumps.wikimedia.org/other/pageviews/bbbb/bbbb-ff/pageviews-bbbbffjj-ll0000.gz'
download.output | output | The output filename pattern. It needs to contain the same variables as the download.source pattern, which are then substituted by the selected date ranges. | 'bbbb-ff-jj-ll.csv'
download.destination | destination | The destination directory the output files are written to. | ./data
download.years | years | The selected range of downloaded years. The variable you have chosen for years (by default b) is substituted by the values in this range. It needs to be specified in the format '{VARIABLE}:{BEGINNING}-{END}'. | 'b:2016-2016'
download.months | months | The selected range of downloaded months in each year. The variable you have chosen for months (by default f) is substituted by the values in this range. It needs to be specified in the format '{VARIABLE}:{BEGINNING}-{END}'. | 'f:1-1'
download.days | days | The selected range of downloaded days in each month. The variable you have chosen for days (by default j) is substituted by the values in this range. It needs to be specified in the format '{VARIABLE}:{BEGINNING}-{END}'. | 'j:1-31'
download.hours | hours | The selected range of downloaded hours in each day. The variable you have chosen for hours (by default l) is substituted by the values in this range. It needs to be specified in the format '{VARIABLE}:{BEGINNING}-{END}'. | 'l:0-23'
download.concurrent | concurrentDownloads | The number of concurrently downloaded dump files. The Wikipedia dump server allows only 3 concurrent downloads per source. | 3
download.compression | sourceCompression | The compression algorithm used to decompress the source files. The values 'gz' (for gzip) and 'zip' (for zip) are supported; all other values deactivate decompression. The Wikipedia dumps are compressed via gzip. | 'gz'
elasticsearch.port | esPort | The port on which the target Elasticsearch instance is listening. | 9200
elasticsearch.address | esAddress | The address / domain on which the target Elasticsearch instance is listening. | 'localhost'
elasticsearch.index | esIndex | The target index in the Elasticsearch cluster. The default matches if the cluster is set up via wikiviews-elasticsearch. | 'wikiviews'
elasticsearch.type | esType | The target type in the Elasticsearch cluster. The default matches if the cluster is set up via wikiviews-elasticsearch. | 'article'
elasticsearch.concurrent | concurrentInsertions | The number of files that are inserted into Elasticsearch concurrently. Either 'all' or a number of files. | 'all'
elasticsearch.batch | batchInsert | The number of dataset rows inserted per batch. Adapt this value to optimize memory consumption. | 10000
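To give a concrete idea of how the CLI parameters map onto these options, a purely command-line-driven run that restricts the selection to the first week of March 2017 and points at a remote Elasticsearch host might look like the following (the hostname es.example.org and the date range are only illustrative):

wv-import --years=b:2017-2017 --months=f:3-3 --days=j:1-7 --esAddress=es.example.org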

For example, a configuration file that sets the default values would look like this:

{
  "tasks": {
    "download": true,
    "elasticsearch": true
  },
  "download": {
    "source": "https://dumps.wikimedia.org/other/pageviews/bbbb/bbbb-ff/pageviews-bbbbffjj-ll0000.gz",
    "output": "bbbb-ff-jj-ll.csv",
    "destination": "./data",
    "years": "b:2016-2016",
    "months": "f:1-1",
    "days": "j:1-31",
    "hours": "l:0-23",
    "concurrent": 3,
    "compression": "gz"
  },
  "elasticsearch": {
    "port": 9200,
    "address": "localhost",
    "index": "wikiviews",
    "type": "article",
    "concurrent": "all",
    "batch": 10000
  }
}

Running

The tool is run via the wv-import command. Running it with the configuration file config.json and without downloading the source data would look like this:

wv-import config.json --download=false
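
Analogously, a run that only downloads the selected dumps and skips the Elasticsearch insertion (for example, to inspect the files first) could disable the elasticsearch task instead:

wv-import config.json --elasticsearch=false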