scraper-helper-core

v1.0.3

NestJS Scraping Engine Library - Declarative web scraping with human-like behavior

llx1

nestjs scraper playwright web-scraping automation

@scraper/helper-core

A standalone NestJS library for declarative web scraping with human-like behavior, browser pool management, and anti-detection capabilities.

Installation

npm install @scraper/helper-core

Peer Dependencies

This library requires the following peer dependencies, which must be installed in your host application:

npm install @nestjs/common @nestjs/core playwright rxjs
# Optional: mongoose (for database integrations)

Quick Start

1. Import the Module

import { Module } from '@nestjs/common';
import { ScraperHelperModule } from '@scraper/helper-core';

@Module({
  imports: [
    ScraperHelperModule.forRoot({
      browserPool: {
        maxInstances: 3,
        headless: true,
      },
      enableStealth: true,
      humanLikeInteraction: true,
    }),
  ],
})
export class AppModule {}

2. Inject and Use Services

import { Injectable } from '@nestjs/common';
import {
  ScraperEngineService,
  BrowserPoolService,
  TaskConfig,
  ScraperContext,
} from '@scraper/helper-core';

@Injectable()
export class MyScrapeService {
  constructor(
    private readonly scraperEngine: ScraperEngineService,
    private readonly browserPool: BrowserPoolService,
  ) {}

  async scrapeWebsite(url: string) {
    // Acquire browser context
    const { context, page, contextId } =
      await this.browserPool.acquireContext();

    try {
      // Define task configuration
      const taskConfig: TaskConfig = {
        name: 'Example Scrape',
        domain: new URL(url).hostname,
        steps: [
          { action: 'navigate', value: url },
          { action: 'wait', selector: '.content' },
          {
            action: 'extract',
            selector: '.item',
            schema: {
              title: '.item-title',
              price: '.item-price',
            },
          },
        ],
      };

      // Execute scrape
      const scraperContext: ScraperContext = { page, context };
      const result = await this.scraperEngine.execute(
        { jobId: 'job-1', taskConfig },
        scraperContext,
      );

      return result.extractedData;
    } finally {
      await this.browserPool.releaseContext(contextId);
    }
  }
}
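
Because a task is plain data, the step list can be produced by a small helper and reused across sites. The sketch below assumes only the fields that appear in the example above (`action`, `value`, `selector`, `schema`). The `SketchStep` and `SketchTaskConfig` types and the `buildListingTask` helper are hypothetical stand-ins, not library exports; the real `TaskConfig` and `ActionStep` may define more fields.

```typescript
// Local stand-in types covering only the fields used in the Quick Start example.
// Assumption: the library's real TaskConfig/ActionStep accept at least these.
type SketchStep = {
  action: 'navigate' | 'wait' | 'extract';
  value?: string;
  selector?: string;
  schema?: Record<string, string>;
};

type SketchTaskConfig = {
  name: string;
  domain: string;
  steps: SketchStep[];
};

// Hypothetical helper that emits the same navigate -> wait -> extract
// sequence as the Quick Start example, parameterized by selectors.
function buildListingTask(
  url: string,
  waitFor: string,
  itemSelector: string,
  schema: Record<string, string>,
): SketchTaskConfig {
  const { hostname } = new URL(url); // throws on a malformed URL
  return {
    name: `Listing scrape: ${hostname}`,
    domain: hostname,
    steps: [
      { action: 'navigate', value: url },
      { action: 'wait', selector: waitFor },
      { action: 'extract', selector: itemSelector, schema },
    ],
  };
}

const task = buildListingTask('https://shop.example.com/deals', '.content', '.item', {
  title: '.item-title',
  price: '.item-price',
});
```

If the stand-in types match the library's, the returned object could be passed as `taskConfig` to `scraperEngine.execute` in place of the inline literal.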

Integration with Host App Services

Proxy Provider

Implement IProxyProvider to provide proxies from your database:

import { Injectable } from '@nestjs/common';
import { IProxyProvider, ProxyConfig } from '@scraper/helper-core';

@Injectable()
export class MyProxyProvider implements IProxyProvider {
  async getProxy(domain?: string): Promise<ProxyConfig | null> {
    // Fetch proxy from your database
    return { host: 'proxy.example.com', port: 8080 };
  }

  async markProxyFailed(proxy: ProxyConfig, reason: string): Promise<void> {
    // Update proxy status in database
  }

  async releaseProxy(proxy: ProxyConfig): Promise<void> {
    // Release proxy back to pool
  }
}

Then register it:

ScraperHelperModule.forRoot({
  proxyProvider: MyProxyProvider,
});
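
For local testing without a database, the three provider methods can be backed by an in-memory list. The following is a minimal sketch of a round-robin provider that skips proxies marked as failed. `SketchProxy` and `SketchProxyProvider` are local mirrors of the shapes used in the example above (an assumption, since the full `ProxyConfig` type is not shown here), and `RoundRobinProxyProvider` is a hypothetical name.

```typescript
// Local mirrors of the shapes used in the Proxy Provider example (assumed minimal).
type SketchProxy = { host: string; port: number };

interface SketchProxyProvider {
  getProxy(domain?: string): Promise<SketchProxy | null>;
  markProxyFailed(proxy: SketchProxy, reason: string): Promise<void>;
  releaseProxy(proxy: SketchProxy): Promise<void>;
}

class RoundRobinProxyProvider implements SketchProxyProvider {
  private readonly proxies: SketchProxy[];
  private readonly failed = new Map<string, string>(); // "host:port" -> last failure reason
  private cursor = 0;

  constructor(proxies: SketchProxy[]) {
    this.proxies = proxies;
  }

  private key(proxy: SketchProxy): string {
    return `${proxy.host}:${proxy.port}`;
  }

  async getProxy(_domain?: string): Promise<SketchProxy | null> {
    // Visit each proxy at most once, starting at the cursor.
    for (let i = 0; i < this.proxies.length; i++) {
      const candidate = this.proxies[(this.cursor + i) % this.proxies.length];
      if (candidate && !this.failed.has(this.key(candidate))) {
        this.cursor = (this.cursor + i + 1) % this.proxies.length;
        return candidate;
      }
    }
    return null; // list is empty or every proxy has been marked failed
  }

  async markProxyFailed(proxy: SketchProxy, reason: string): Promise<void> {
    this.failed.set(this.key(proxy), reason);
  }

  async releaseProxy(_proxy: SketchProxy): Promise<void> {
    // No-op: this sketch does not lease proxies exclusively.
  }
}
```

In a Nest application the class would additionally carry `@Injectable()` and implement the library's `IProxyProvider`, as in the example above.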

Session Provider

Implement ISessionProvider for session management:

import { Injectable } from '@nestjs/common';
import { ISessionProvider, SessionState } from '@scraper/helper-core';

@Injectable()
export class MySessionProvider implements ISessionProvider {
  async getSession(domain: string): Promise<SessionState | null> {
    // Fetch session from your database; return null when none is stored
    return null;
  }

  async saveSession(domain: string, session: SessionState): Promise<void> {
    // Save session to your database
  }

  async invalidateSession(domain: string): Promise<void> {
    // Invalidate session in database
  }
}
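
As a starting point before wiring in a real database, the same three methods can be served from a `Map`. This is a sketch under stated assumptions: the shape of `SessionState` is not shown in this README, so the store is generic over it, and the `ttlMs` expiry and injectable clock are additions of this sketch rather than library behavior.

```typescript
// In-memory session store keyed by domain. Generic over the session shape
// because SessionState's fields are not documented here (assumption).
class InMemorySessionProvider<S> {
  private readonly sessions = new Map<string, { state: S; savedAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` is injectable so expiry can be tested without real waiting.
  constructor(ttlMs: number, now: () => number = () => Date.now()) {
    this.ttlMs = ttlMs;
    this.now = now;
  }

  async getSession(domain: string): Promise<S | null> {
    const entry = this.sessions.get(domain);
    if (!entry) return null;
    if (this.now() - entry.savedAt > this.ttlMs) {
      this.sessions.delete(domain); // expired: drop it and report no session
      return null;
    }
    return entry.state;
  }

  async saveSession(domain: string, session: S): Promise<void> {
    this.sessions.set(domain, { state: session, savedAt: this.now() });
  }

  async invalidateSession(domain: string): Promise<void> {
    this.sessions.delete(domain);
  }
}
```

Sessions held this way are lost on restart, so this is suited to tests and single-process experiments only.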

Job Signal Checker

Implement IJobSignalChecker for pause/cancel integration with your queue:

import { Injectable } from '@nestjs/common';
import { IJobSignalChecker } from '@scraper/helper-core';

@Injectable()
export class MyJobSignalChecker implements IJobSignalChecker {
  async checkSignal(jobId: string): Promise<'PAUSE' | 'CANCEL' | null> {
    // Check Redis or your queue for signals
    return null;
  }

  async waitForResume(jobId: string): Promise<void> {
    // Wait for resume signal
  }
}
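
The two methods imply a small protocol: `checkSignal` is polled, and `waitForResume` blocks while a job is paused. A minimal in-memory sketch of that protocol is shown below. The `pause`, `cancel`, and `resume` methods are hypothetical control hooks that your queue consumer would call; they are not part of `IJobSignalChecker`. Waking waiters on cancel is a design choice of this sketch, made so a paused job can observe the cancellation.

```typescript
type JobSignal = 'PAUSE' | 'CANCEL';

class InMemoryJobSignalChecker {
  private readonly signals = new Map<string, JobSignal>();
  private readonly waiters = new Map<string, Array<() => void>>();

  // Hypothetical control hooks (not part of the library interface).
  pause(jobId: string): void {
    this.signals.set(jobId, 'PAUSE');
  }

  cancel(jobId: string): void {
    this.signals.set(jobId, 'CANCEL');
    this.wake(jobId); // let a paused job wake up and see the CANCEL signal
  }

  resume(jobId: string): void {
    if (this.signals.get(jobId) === 'PAUSE') this.signals.delete(jobId);
    this.wake(jobId);
  }

  async checkSignal(jobId: string): Promise<JobSignal | null> {
    return this.signals.get(jobId) ?? null;
  }

  waitForResume(jobId: string): Promise<void> {
    // Not paused: nothing to wait for.
    if (this.signals.get(jobId) !== 'PAUSE') return Promise.resolve();
    return new Promise<void>((resolve) => {
      const pending = this.waiters.get(jobId) ?? [];
      pending.push(resolve);
      this.waiters.set(jobId, pending);
    });
  }

  // Resolve every promise currently waiting on this job.
  private wake(jobId: string): void {
    const pending = this.waiters.get(jobId) ?? [];
    this.waiters.delete(jobId);
    for (const resolve of pending) resolve();
  }
}
```

A Redis-backed version would keep the same method shapes and replace the two maps with keys and a pub/sub subscription.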

Async Configuration

Use forRootAsync for configuration that depends on other services:

import { Module } from '@nestjs/common';
import { ConfigModule, ConfigService } from '@nestjs/config';
import { ScraperHelperModule } from '@scraper/helper-core';

@Module({
  imports: [
    ConfigModule,
    ScraperHelperModule.forRootAsync({
      imports: [ConfigModule],
      useFactory: async (config: ConfigService) => ({
        browserPool: {
          maxInstances: config.get('BROWSER_POOL_SIZE'),
          headless: config.get('HEADLESS_MODE'),
        },
      }),
      inject: [ConfigService],
    }),
  ],
})
export class AppModule {}
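
One thing to watch in a factory like this: values that come from environment variables are strings unless your configuration layer converts them, so `'false'` would be truthy and `'3'` would not be a number. The helper below is a hedged sketch of the coercion. `browserPoolFromEnv` is a hypothetical name, and the fallbacks (3 instances, headless on) are assumptions borrowed from the Quick Start example, not documented library defaults.

```typescript
type BrowserPoolSketch = { maxInstances: number; headless: boolean };

// Coerce raw environment strings into typed browser pool options.
function browserPoolFromEnv(
  env: Record<string, string | undefined>,
): BrowserPoolSketch {
  // Fall back to 3 when the value is missing, non-numeric, or not positive.
  const parsed = Number.parseInt(env.BROWSER_POOL_SIZE ?? '', 10);
  const maxInstances = Number.isInteger(parsed) && parsed > 0 ? parsed : 3;

  // Headless unless the variable is explicitly the string "false".
  const headless =
    (env.HEADLESS_MODE ?? 'true').trim().toLowerCase() !== 'false';

  return { maxInstances, headless };
}
```

Inside `useFactory`, the result could be returned as `{ browserPool: browserPoolFromEnv(process.env) }`, or the same coercion applied to values read through `ConfigService`.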

Local Development & Testing

Option 1: npm link

# In the library directory
cd scraper-helper-core
npm install
npm run build
npm link

# In your main app directory
cd ../your-main-app
npm link @scraper/helper-core

Option 2: yalc (Recommended)

# Install yalc globally
npm install -g yalc

# In the library directory
cd scraper-helper-core
npm install
npm run build
yalc publish

# In your main app directory
cd ../your-main-app
yalc add @scraper/helper-core
npm install

# After making changes to the library
cd ../scraper-helper-core
npm run build
yalc push

Unlinking

# For npm link
cd ../your-main-app
npm unlink @scraper/helper-core

# For yalc
cd ../your-main-app
yalc remove @scraper/helper-core
npm install

API Reference

ScraperHelperModule

forRoot(options: ScraperHelperOptions) - Static configuration
forRootAsync(options: ScraperHelperAsyncOptions) - Async configuration

Services

| Service | Description |
| ------------------------- | --------------------------------------- |
| ScraperEngineService | Main orchestrator for task execution |
| BrowserPoolService | Browser instance and context management |
| ProxyManagerService | Proxy rotation and health tracking |
| FingerprintService | Browser fingerprint generation |
| StealthService | Anti-detection measures |
| HumanInteractionService | Human-like typing/clicking |
| ActionExecutorService | Individual action execution |
| NavigationGuardService | Anti-bot detection and recovery |

Interfaces

| Interface | Description |
| ------------------------ | ------------------------------ |
| TaskConfig | Declarative task configuration |
| ActionStep | Individual action in a task |
| ScraperContext | Browser page and context |
| ScraperExecutionResult | Result from task execution |
| IProxyProvider | Proxy provider integration |
| ISessionProvider | Session storage integration |
| IJobSignalChecker | Job control integration |

License

MIT