@mohssineaboutaj/scraper v0.1.0

Scrapar

A modular TypeScript scraping framework with fetchers, orchestration, and CLI tooling.

Features

  • Dual Fetchers: HTML (Cheerio) and API (JSON) fetchers with shared retry/timeout logic
  • Pipeline Orchestration: Dependency resolution, ordered execution, and lifecycle hooks
  • Persistence: JSON sink with configurable output and step logging for resume flows
  • CLI Tooling: Run, resume, retry-failed, and status commands
  • TypeScript First: Full type safety and IntelliSense support

Installation

npm install @mohssineaboutaj/scraper

Quick Start

1. Create a Pipeline Module

Create a TypeScript file (e.g., my-pipeline.ts):

import {
  HtmlFetcher,
  JsonSink,
  type PipelineStep,
  type RunnerConfig,
  type PipelineModule,
} from '@mohssineaboutaj/scraper';

const config: RunnerConfig = {
  mode: 'production',
  delay: 1000,
  maxItems: 100,
  resumeFromLog: true,
  rateLimit: {
    requests: 10,
    perMilliseconds: 1000,
  },
};

const htmlFetcher = new HtmlFetcher();
const jsonSink = new JsonSink({ outputDir: './data' });

const steps: PipelineStep[] = [
  {
    id: 'fetch-pages',
    label: 'Fetch HTML Pages',
    dependencies: [],
    async run(context) {
      const urls = ['https://example.com/page1', 'https://example.com/page2'];
      const results = [];

      for (const url of urls) {
        const response = await htmlFetcher.fetch({ url }, context);
        results.push({
          url: response.url,
          title: response.$('title').text(),
        });
      }

      context.data.pages = results;
    },
  },
  {
    id: 'save-results',
    label: 'Save Results',
    dependencies: ['fetch-pages'],
    async run(context) {
      await jsonSink.write(context.data.pages, context);
    },
  },
];

export const pipeline: PipelineModule = {
  config,
  steps,
};

2. Run the Pipeline

npx scrapar run -p my-pipeline.ts

CLI Commands

Run

Execute a pipeline from scratch:

scrapar run -p pipeline.ts [options]

Options:

  • -p, --pipeline <path>: Path to pipeline module (required)
  • -m, --mode <mode>: Execution mode (development|production, default: development)
  • --delay <ms>: Override delay between iterations
  • --max-items <count>: Override maximum items to process
  • --log-dir <path>: Directory for step logs (default: ./logs)
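
For example, a production run that slows the iteration delay and caps processing (the pipeline path is illustrative):

scrapar run -p my-pipeline.ts -m production --delay 500 --max-items 50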

Resume

Resume a pipeline from the latest persisted step log:

scrapar resume -p pipeline.ts --run-id <id> [options]

Options:

  • -p, --pipeline <path>: Path to pipeline module (required)
  • --run-id <id>: Run identifier to resume (required)
  • -m, --mode <mode>: Execution mode
  • --start-step <id>: Explicit step identifier to resume from
  • --log-dir <path>: Directory for step logs
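
For example, resuming a run from a specific step, where <id> is the identifier of the earlier run and the step ID matches the Quick Start pipeline:

scrapar resume -p my-pipeline.ts --run-id <id> --start-step fetch-pages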

Retry Failed

Retry failed items from logs:

scrapar retry-failed -p pipeline.ts --run-id <id> [options]

Options:

  • -p, --pipeline <path>: Path to pipeline module (required)
  • --run-id <id>: Run identifier (required)
  • --log-dir <path>: Directory for step logs

Status

Display current pipeline status and progress:

scrapar status --run-id <id> [options]

Options:

  • --run-id <id>: Run identifier (required)
  • --log-dir <path>: Directory for step logs

Configuration

RunnerConfig

interface RunnerConfig {
  mode: 'development' | 'production';
  delay: number;                    // Delay (ms) between iterations
  maxItems?: number;                // Optional safety cap
  resumeFromLog?: boolean;          // Enable resume capability
  rateLimit?: {
    requests: number;
    perMilliseconds: number;
  };
  retry?: {
    attempts: number;
    backoffStrategy: 'none' | 'linear' | 'exponential';
    baseDelay: number;
  };
  telemetry?: {
    enabled: boolean;
    logLevel: 'silent' | 'info' | 'debug';
  };
}
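
Putting it together, a config that enables exponential backoff and debug-level telemetry might look like this (all values are illustrative):

import { type RunnerConfig } from '@mohssineaboutaj/scraper';

const config: RunnerConfig = {
  mode: 'development',
  delay: 500,              // wait 500 ms between iterations
  maxItems: 50,            // safety cap: stop after 50 items
  resumeFromLog: true,
  retry: {
    attempts: 3,
    backoffStrategy: 'exponential',
    baseDelay: 250,        // 250 ms, then 500 ms, then 1 s
  },
  telemetry: {
    enabled: true,
    logLevel: 'debug',
  },
};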

Fetchers

HTML Fetcher

Fetch and parse HTML pages with Cheerio:

import { HtmlFetcher } from '@mohssineaboutaj/scraper';

const fetcher = new HtmlFetcher({
  axios: {
    timeout: 10000,
    headers: { 'User-Agent': 'MyBot/1.0' },
  },
  retry: {
    retries: 3,
    minTimeout: 500,
  },
  rateLimit: {
    requests: 5,
    perMilliseconds: 1000,
  },
});

const response = await fetcher.fetch(
  { url: 'https://example.com' },
  context
);

console.log(response.$('title').text()); // Access Cheerio API
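
Since response.$ exposes the standard Cheerio API, the usual selector patterns apply; for example, collecting every link href on the page:

// Map each <a> element to its href attribute and unwrap to a plain array
const links = response
  .$('a')
  .map((_, el) => response.$(el).attr('href'))
  .get();

console.log(links);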

API Fetcher

Make REST API calls with authentication:

import { ApiFetcher } from '@mohssineaboutaj/scraper';

const fetcher = new ApiFetcher({
  axios: {
    baseURL: 'https://api.example.com',
  },
});

// GET request
const getResponse = await fetcher.get('/users', context);

// POST request with auth
const postResponse = await fetcher.post(
  '/data',
  context,
  {
    data: { name: 'John' },
    auth: {
      type: 'bearer',
      token: 'your-token',
    },
  }
);

Sinks

JSON Sink

Persist data to JSON files:

import { JsonSink } from '@mohssineaboutaj/scraper';

const sink = new JsonSink({
  outputDir: './data',
  fileName: 'results',
  calculateItemCount: true,
  successMessage: 'Data saved',
});

await sink.write([{ id: 1, name: 'Item' }], context);

Lifecycle Hooks

Step Lifecycle

const step: PipelineStep = {
  id: 'my-step',
  label: 'My Step',
  dependencies: [],
  async run(context) {
    // Step logic
  },
  beforeStep(event) {
    console.log(`Starting step: ${event.step.id}`);
  },
  afterStep(event) {
    console.log(`Completed step: ${event.step.id}`);
  },
  onError(event) {
    console.error(`Error in step ${event.step.id}:`, event.error);
  },
};

Runner Lifecycle

import { ScrapeRunnerImpl } from '@mohssineaboutaj/scraper';

const runner = new ScrapeRunnerImpl(config, {
  beforeStep(event) {
    console.log(`[${event.stepIndex + 1}/${event.totalSteps}] ${event.step.id}`);
  },
  afterStep(event) {
    console.log(`✓ Completed: ${event.step.id}`);
  },
  onError(event) {
    console.error(`✗ Failed: ${event.step.id}`, event.error);
  },
});

Extension Points

Custom Fetcher

import { Fetcher, ScrapeContext } from '@mohssineaboutaj/scraper';

// Illustrative request/response shapes for this custom fetcher
interface CustomRequest { url: string }
interface CustomResponse { body: string }

class CustomFetcher implements Fetcher<CustomRequest, CustomResponse> {
  readonly id = 'custom-fetcher';

  async fetch(input: CustomRequest, context: ScrapeContext): Promise<CustomResponse> {
    // Your fetching logic; a plain fetch call stands in here
    const response = await fetch(input.url);
    return { body: await response.text() };
  }
}

Custom Sink

import { Sink, ScrapeContext } from '@mohssineaboutaj/scraper';

// Illustrative payload shape, matching the Quick Start "pages" records
type Payload = Array<{ url: string; title: string }>;

class DatabaseSink implements Sink<Payload> {
  readonly id = 'database-sink';

  async write(payload: Payload, context: ScrapeContext): Promise<void> {
    // Your persistence logic; logging stands in for real inserts
    for (const record of payload) {
      console.log(`would persist ${record.url}`);
    }
  }
}
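
A custom sink then drops into a pipeline step the same way JsonSink does in the Quick Start; the dbSink instance and step wiring below are illustrative:

const dbSink = new DatabaseSink();

const persistStep: PipelineStep = {
  id: 'persist-to-db',
  label: 'Persist to Database',
  dependencies: ['fetch-pages'],
  async run(context) {
    // context.data.pages was populated by an earlier step
    await dbSink.write(context.data.pages, context);
  },
};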

Examples

See the examples/ directory for complete pipeline examples:

  • multi-html-pipeline.ts: Scraping multiple HTML pages
  • mixed-html-api-pipeline.ts: Combining HTML and API fetchers

License

MIT