poolingh

v1.0.3

Published

a month ago

PoolinGH is a lightweight, easy-to-use open-source library designed to accelerate and ensure efficient mining of the GitHub Search API while taking full advantage of its potential. It enables automatic pooling of multiple tokens, parallelizes queries, opt


PoolinGH

📣 Description

📝 How to cite?


%%% Cite the paper

@inproceedings{andre2026poolingh,
  title         = {PoolinGH: Fast, Efficient, and Robust GitHub Repository Mining},
  author        = {Andr{\'e}, Maxime and Raglianti, Marco and Serbout, Souhaila and Cleve, Anthony and Lanza, Michele},
  booktitle     = {Proceedings of the 23rd International Mining Software Repositories Conference (MSR 2026): Data and Tool Showcase Track},
  year          = {2026},
  organization  = {ACM Press},
  doi           = {10.1145/3793302.3793321}
}

%%% Cite the software

@software{poolingh,
  author    = {Andr{\'e}, Maxime and Serbout, Souhaila and Raglianti, Marco},
  title     = {PoolinGH},
  month     = oct,
  year      = 2025,
  publisher = {Zenodo},
  doi       = {10.5281/zenodo.17574293},
  url       = {https://doi.org/10.5281/zenodo.17574293}
}

▶️ Getting started

Prerequisites

Node.js installation:

Install Node.js.

PoolinGH installation from npm:

  npm install poolingh

GitHub Search API tokens creation:

Generate and collect GitHub tokens from one or more active GitHub accounts. In GitHub, go to Settings > Developer Settings > Personal access tokens > Generate new token, then enter a "Token name", define an "Expiration" date, select "Public Repositories (read-only)", and click on "Generate token".

⚠️ Always keep your tokens secret!

⚠️ Use fine-grained personal access tokens instead of personal access tokens (classic) whenever possible.

⚠️ Use tokens from real active GitHub accounts. DO NOT CREATE FAKE ACCOUNTS!

More information: Managing your personal access tokens

⭐ Usage

Import

import { GitHubApiRequest, GitHubApiClient, GitHubApiQueue } from 'poolingh';

Create clients

A GitHub Search API client is associated with one GitHub access token. It allows querying the GitHub Search API as an authenticated user. Create as many clients as you have tokens.

let client1 = new GitHubApiClient(YOUR_TOKEN_1);
let client2 = new GitHubApiClient(YOUR_TOKEN_2);
let client3 = new GitHubApiClient(YOUR_TOKEN_3);
// ...
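To keep tokens out of the source code, one option is to read them from an environment variable and create one client per token. The following is a minimal sketch and not part of PoolinGH: the GITHUB_TOKENS variable name and the parseTokens helper are assumptions made for illustration.

```javascript
// Hypothetical helper: splits a comma-separated list of tokens.
// The GITHUB_TOKENS variable name is an assumption, not a PoolinGH convention.
function parseTokens(rawValue) {
  return (rawValue ?? '')
    .split(',')
    .map((token) => token.trim())
    .filter((token) => token.length > 0);
}

const tokens = parseTokens(process.env.GITHUB_TOKENS);

// Log only the count, never the tokens themselves.
console.log(`Loaded ${tokens.length} token(s)`);

// With PoolinGH imported as shown above, one client per token:
// const clients = tokens.map((token) => new GitHubApiClient(token));
```

The resulting array of clients can then be passed to the queue constructor.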

Create a queue

A GitHub Search API queue is a process that handles requests to the GitHub Search API while parallelizing and distributing the work across available GitHub API clients, respecting GitHub Search API's limits and best practices, and facilitating error handling and process monitoring.

let queue = new GitHubApiQueue([client1, client2, client3 /* , ... */]);

Create a request

A request corresponds to an HTTP call to the GitHub Search API. It contains a URL, options, and a callback function.

let request = new GitHubApiRequest(
  'https://api.github.com/search/repositories?q=stars:>=10000', // YOUR URL
  {}, // YOUR OPTIONS
  (result) => {
    console.log(result);
  }, // YOUR CALLBACK FUNCTION
);
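Search queries often contain characters such as ":" and ">" that should be URL-encoded, and the Search API returns results in pages. The sketch below shows one way to build the request URLs with the standard URL class. The buildSearchUrl helper is hypothetical and not part of PoolinGH; it relies on the q, per_page, and page parameters of the GitHub Search API, which serves at most 100 results per page and 1,000 results per query.

```javascript
// Hypothetical helper: builds a repository search URL with encoded parameters.
function buildSearchUrl(query, page = 1, perPage = 100) {
  const url = new URL('https://api.github.com/search/repositories');
  url.searchParams.set('q', query);
  url.searchParams.set('per_page', String(perPage));
  url.searchParams.set('page', String(page));
  return url.toString();
}

// One URL per result page: 10 pages of 100 results cover the 1,000-result cap.
const pageUrls = Array.from({ length: 10 }, (_, index) =>
  buildSearchUrl('stars:>=10000', index + 1),
);

console.log(pageUrls[0]);
```

Each URL can then be wrapped in its own GitHubApiRequest, using the constructor shown above.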

Queue a request

Inserts a request at the end of the queue.

queue.push(request);

An alternative is to insert a request at the beginning of the queue.

queue.unshift(request);

Each method accepts one or more requests at a time.

Start the queue

Starts the queue and invites clients to process requests one by one.

queue.start();

Requests are processed according to the LIFO (last in, first out) strategy. When a request fails, it is sent back to the front of the queue. This prevents a request from quickly reaching the "maximum failures per request" threshold in the event of a temporary network problem. Outside of such cases, a request that reaches this threshold is aborted and removed from the queue.
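The ordering described above can be illustrated with a plain array. This is a toy model and not PoolinGH's implementation: it assumes that "last in" corresponds to the end of the queue, that requests are taken from the end, and that a failed request is retried only once. The simulateProcessingOrder function is hypothetical.

```javascript
// Toy model of the processing order, using a plain array as the queue.
// Assumption: requests are taken from the end, failed ones go back to the front.
function simulateProcessingOrder(initialRequests, requestsFailingOnce) {
  const pending = [...initialRequests];
  const alreadyFailed = new Set();
  const processingOrder = [];

  while (pending.length > 0) {
    const request = pending.pop(); // LIFO: take the most recently added request
    processingOrder.push(request);

    if (requestsFailingOnce.has(request) && !alreadyFailed.has(request)) {
      alreadyFailed.add(request);
      pending.unshift(request); // retried after all other pending requests
    }
  }
  return processingOrder;
}

// 'c' fails on its first attempt and is retried after 'b' and 'a'.
console.log(simulateProcessingOrder(['a', 'b', 'c'], new Set(['c'])));
// prints [ 'c', 'b', 'a', 'c' ]
```

In this model, the failed request gets its second attempt only after the other pending requests, which leaves time for a temporary problem to clear.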

Stop the queue

Stops the queue.

queue.stop();

Advanced features

Configure a delay in a client

Depending on the current conditions, you may experience network difficulties or a less responsive API when it is overloaded. By default, all the clients start together. To smooth the load, you can use the pause(resetAt) method to pause a client until a given time, either on startup or later, and thereby delay it relative to the other clients.

In this example, we space each client startup by one minute:

for (let i = 1; i < queue.getClients().length; i++) {
  queue.getClients()[i].pause(Date.now() + 1000 * 60 * i);
}

Configuring the safety margin for the number of remaining requests in a client

Depending on the current conditions, the mining script may be faster than the API at updating the remaining request counter of a client. To avoid sending more requests than allowed and the risk of being flagged, you can adapt the safety margin for the number of remaining requests via the safetyRemainingRequestCount parameter in the constructor of the client. The default value is 5 remaining requests. This value can be increased or decreased.

In this example, we increase the value to 10.

let client1 = new GitHubApiClient(YOUR_TOKEN_1, 10);
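The idea behind this margin can be sketched as a check on the x-ratelimit-remaining response header returned by the GitHub API. This is a rough illustration of the concept, not PoolinGH's actual code: the hasReachedSafetyMargin function is hypothetical, and the exact comparison used by the library is an assumption.

```javascript
// Hypothetical check: should a client stop sending requests for now?
// `headers` is a plain object with lowercase header names.
function hasReachedSafetyMargin(headers, safetyRemainingRequestCount = 5) {
  const remaining = Number(headers['x-ratelimit-remaining']);

  // A missing or malformed header is treated conservatively: stop sending.
  if (!Number.isFinite(remaining)) {
    return true;
  }
  return remaining <= safetyRemainingRequestCount;
}

console.log(hasReachedSafetyMargin({ 'x-ratelimit-remaining': '8' })); // prints false
console.log(hasReachedSafetyMargin({ 'x-ratelimit-remaining': '8' }, 10)); // prints true
```

A larger margin stops the client earlier, trading a few unused requests for a lower risk of exceeding the limit.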

Configuring the safety margin for the resume time in a client

Depending on the current conditions, it may happen that the mining script is faster than the API at updating the resume time of a client. To avoid a premature resume of a client and the risk of being flagged, you can adapt the safety margin for the resume time via the tokenResumeBufferTime parameter in the constructor of the client. The default value is 2000ms. This value can be increased or decreased.

In this example, we set the tokenResumeBufferTime to 5s instead of the default 2s.

let client1 = new GitHubApiClient(YOUR_TOKEN_1, 10, 5000);
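As a rough illustration of what this buffer means, the sketch below derives a resume time from the x-ratelimit-reset response header, which GitHub expresses in UTC epoch seconds, and adds the buffer in milliseconds. The computeResumeTime function is hypothetical, and how PoolinGH computes the resume time internally is an assumption.

```javascript
// Hypothetical computation of the time at which a paused client may resume.
// Returns a timestamp in milliseconds, or null if the header is unusable.
function computeResumeTime(headers, tokenResumeBufferTime = 2000) {
  const resetSeconds = Number(headers['x-ratelimit-reset']);

  if (!Number.isFinite(resetSeconds)) {
    return null;
  }
  // Convert seconds to milliseconds, then add the safety buffer.
  return resetSeconds * 1000 + tokenResumeBufferTime;
}

const resumeAt = computeResumeTime({ 'x-ratelimit-reset': '1700000000' }, 5000);
console.log(resumeAt); // prints 1700000005000
```

The returned timestamp has the same form as the one passed to pause(resetAt) in the delay example.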

Configuring the safety margin for the number of errors per request in the queue

Depending on the current conditions, some requests may fail several times in a row. To save request "credits" and avoid the risk of being flagged, you can adapt the safety margin for the number of errors per request via the maxErrorCountPerRequest parameter in the constructor of the queue. The default value is 5 errors. This value can be increased or decreased.

In this example, we set the maxErrorCountPerRequest to 10 instead of the default 5.

let queue = new GitHubApiQueue([client1, client2, client3 /* , ... */], 10);

Configuring the safety margin for the total number of errors in the queue

Depending on the current conditions, a long queuing process may accumulate a large number of failures. To save request "credits" and avoid the risk of being flagged, you can adapt the safety margin for the total number of errors via the maxErrorCountInTotal parameter in the constructor of the queue. The default value is 1000 times the maxErrorCountPerRequest parameter. This value can be increased or decreased.

In this example, we set the maxErrorCountInTotal to 20,000 instead of the default 10,000 (1000 times the maxErrorCountPerRequest of 10).

let queue = new GitHubApiQueue([client1, client2, client3 /* , ... */], 10, 20000);
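The relationship between the two error limits can be written out as a small calculation. The resolveErrorLimits function below is a hypothetical helper that only mirrors the defaults documented above (5 errors per request, and a total of 1000 times that value); it is not the library's code.

```javascript
// Hypothetical helper mirroring the documented defaults of the queue.
function resolveErrorLimits(
  maxErrorCountPerRequest = 5,
  maxErrorCountInTotal = 1000 * maxErrorCountPerRequest,
) {
  return { maxErrorCountPerRequest, maxErrorCountInTotal };
}

console.log(resolveErrorLimits()); // prints { maxErrorCountPerRequest: 5, maxErrorCountInTotal: 5000 }
console.log(resolveErrorLimits(10)); // prints { maxErrorCountPerRequest: 10, maxErrorCountInTotal: 10000 }
console.log(resolveErrorLimits(10, 20000)); // prints { maxErrorCountPerRequest: 10, maxErrorCountInTotal: 20000 }
```

Raising the per-request limit therefore also raises the default total limit, unless the total is set explicitly.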

Monitoring the process through logs

The clients and the queue are decorated with an integrated logger that records the progress of the process. By default, the logs are written to the ./logs directory. You can change this destination via the loggingPath parameter in the constructors of the client and the queue. If the folder does not already exist, it is created automatically. Four files are created: combined.log, error.log, info.log, and warn.log.

A log line generally includes the timestamp and the short version of the token, followed by details that depend on the logged operation. For instance, the logger records when a queue starts and stops, the URL and the current rate limit of a request when it is consumed, and when a client is paused or resumed, along with the resume time. It also records all errors, such as when a queue hits the maximum number of failed requests, or when a request fails and is retried or aborted.

In this example, we set the logs folder destination to ./mining/logs.

let client1 = new GitHubApiClient(YOUR_TOKEN_1, 10, 5000, './mining/logs');
// ...
let queue = new GitHubApiQueue([client1, client2, client3 /* , ... */], 10, 20000, './mining/logs');

Examples

A few examples, basic and advanced, are available here.

👩‍💻 Development details

Structure of this repository

README.md: This file.
LICENSE.txt: The license.
/src: The implementation.
- /helper: The implementation of the helpers used for the GitHub Search API mining.
- /model: The implementation of the models used for the GitHub Search API mining.
/tests: The unit tests.
- /helper: The unit tests of the helpers.
- /model: The unit tests of the models.
package.json: Metadata and dependencies definition.

Build

Launch the build step with the following command.

npm run build

The build step will create the dist folder with the sources.

Launch the pack step with the following command.

npm pack

The pack step will create the .tgz archive with the sources.

Unit tests

Unit test suites are set up using the Vitest framework.

The tests are specified in the /tests directory and are named following the *.test.js pattern.

Launching the tests

Launch the unit tests with the following command.

npm run test

🪛 Technical details

Technologies

JavaScript
Node.js

Libraries

dotenv is the package for environment variables.
winston is the package for logging.
axios is the package for HTTP calls.
vitest is the package for unit tests.
chalk is the package for log coloring.

🤝 Contributing

If you want to contribute to the project, please consider the following instructions:

Any file, class, method, or attribute must be named clearly (i.e., no abbreviations).
More generally, any contribution must follow the conventions and keep the shape of previous contributions.
Any contribution must be tested (unit tests). See /tests directory.
All the tests and the CI/CD pipeline must pass before the contribution is merged.
Any contribution must be documented, especially through comments in the source code and by updating the README.md file.
Any contribution must be developed on a separate branch.
Any contribution must be approved via the pull request mechanism.