scraper-helper-core
v1.0.3
NestJS Scraping Engine Library - Declarative web scraping with human-like behavior
@scraper/helper-core
A standalone NestJS library for declarative web scraping with human-like behavior, browser pool management, and anti-detection capabilities.
Installation
npm install @scraper/helper-core
Peer Dependencies
This library requires the following peer dependencies (your main app should have these installed):
npm install @nestjs/common @nestjs/core playwright rxjs
# Optional: mongoose (for database integrations)
Quick Start
1. Import the Module
import { Module } from '@nestjs/common';
import { ScraperHelperModule } from '@scraper/helper-core';
@Module({
  imports: [
    ScraperHelperModule.forRoot({
      browserPool: {
        maxInstances: 3,
        headless: true,
      },
      enableStealth: true,
      humanLikeInteraction: true,
    }),
  ],
})
export class AppModule {}
2. Inject and Use Services
import { Injectable } from '@nestjs/common';
import {
  ScraperEngineService,
  BrowserPoolService,
  TaskConfig,
  ScraperContext,
} from '@scraper/helper-core';
@Injectable()
export class MyScrapeService {
  constructor(
    private readonly scraperEngine: ScraperEngineService,
    private readonly browserPool: BrowserPoolService,
  ) {}
  async scrapeWebsite(url: string) {
    // Acquire browser context
    const { context, page, contextId } =
      await this.browserPool.acquireContext();
    try {
      // Define task configuration
      const taskConfig: TaskConfig = {
        name: 'Example Scrape',
        domain: new URL(url).hostname,
        steps: [
          { action: 'navigate', value: url },
          { action: 'wait', selector: '.content' },
          {
            action: 'extract',
            selector: '.item',
            schema: {
              title: '.item-title',
              price: '.item-price',
            },
          },
        ],
      };
      // Execute scrape
      const scraperContext: ScraperContext = { page, context };
      const result = await this.scraperEngine.execute(
        { jobId: 'job-1', taskConfig },
        scraperContext,
      );
      return result.extractedData;
    } finally {
      await this.browserPool.releaseContext(contextId);
    }
  }
}
Integration with Host App Services
Proxy Provider
Implement IProxyProvider to provide proxies from your database:
import { Injectable } from '@nestjs/common';
import { IProxyProvider, ProxyConfig } from '@scraper/helper-core';
@Injectable()
export class MyProxyProvider implements IProxyProvider {
  async getProxy(domain?: string): Promise<ProxyConfig | null> {
    // Fetch proxy from your database
    return { host: 'proxy.example.com', port: 8080 };
  }
  async markProxyFailed(proxy: ProxyConfig, reason: string): Promise<void> {
    // Update proxy status in database
  }
  async releaseProxy(proxy: ProxyConfig): Promise<void> {
    // Release proxy back to pool
  }
}
Then register it:
ScraperHelperModule.forRoot({
  proxyProvider: MyProxyProvider,
});
Session Provider
Implement ISessionProvider for session management:
import { Injectable } from '@nestjs/common';
import { ISessionProvider, SessionState } from '@scraper/helper-core';
@Injectable()
export class MySessionProvider implements ISessionProvider {
  async getSession(domain: string): Promise<SessionState | null> {
    // Fetch session from your database; return null when none is stored
    return null;
  }
  async saveSession(domain: string, session: SessionState): Promise<void> {
    // Save session to your database
  }
  async invalidateSession(domain: string): Promise<void> {
    // Invalidate session in database
  }
}
Job Signal Checker
Implement IJobSignalChecker for pause/cancel integration with your queue:
import { Injectable } from '@nestjs/common';
import { IJobSignalChecker } from '@scraper/helper-core';
@Injectable()
export class MyJobSignalChecker implements IJobSignalChecker {
  async checkSignal(jobId: string): Promise<'PAUSE' | 'CANCEL' | null> {
    // Check Redis or your queue for signals
    return null;
  }
  async waitForResume(jobId: string): Promise<void> {
    // Wait for resume signal
  }
}
Async Configuration
Use forRootAsync for configuration that depends on other services:
import { Module } from '@nestjs/common';
import { ConfigModule, ConfigService } from '@nestjs/config';
import { ScraperHelperModule } from '@scraper/helper-core';
@Module({
  imports: [
    ConfigModule,
    ScraperHelperModule.forRootAsync({
      imports: [ConfigModule],
      useFactory: async (config: ConfigService) => ({
        browserPool: {
          // The generics only assert a type; raw env values stay strings unless your config schema converts them
          maxInstances: config.get<number>('BROWSER_POOL_SIZE'),
          headless: config.get<boolean>('HEADLESS_MODE'),
        },
      }),
      inject: [ConfigService],
    }),
  ],
})
export class AppModule {}
Local Development & Testing
Option 1: npm link
# In the library directory
cd scraper-helper-core
npm install
npm run build
npm link
# In your main app directory
cd ../your-main-app
npm link @scraper/helper-core
Option 2: yalc (Recommended)
# Install yalc globally
npm install -g yalc
# In the library directory
cd scraper-helper-core
npm install
npm run build
yalc publish
# In your main app directory
cd ../your-main-app
yalc add @scraper/helper-core
npm install
# After making changes to the library
cd ../scraper-helper-core
npm run build
yalc push
Unlinking
# For npm link
cd ../your-main-app
npm unlink @scraper/helper-core
# For yalc
cd ../your-main-app
yalc remove @scraper/helper-core
npm install
API Reference
ScraperHelperModule
forRoot(options: ScraperHelperOptions) - Static configuration
forRootAsync(options: ScraperHelperAsyncOptions) - Async configuration
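A forRootAsync factory often reads its values from environment variables, which arrive as strings. The following is a minimal sketch of converting them before handing them to the module. The function buildScraperOptions, the ScraperOptionsSketch type, and the fallback values are hypothetical and only mirror the browserPool fields shown in this README; the library's real ScraperHelperOptions type may contain more.

```typescript
// Hypothetical local mirror of the browserPool options used in this README.
interface BrowserPoolOptionsSketch {
  maxInstances: number;
  headless: boolean;
}

interface ScraperOptionsSketch {
  browserPool: BrowserPoolOptionsSketch;
}

// Convert raw environment strings into typed options, with assumed fallbacks.
function buildScraperOptions(
  env: Record<string, string | undefined>,
): ScraperOptionsSketch {
  const parsed = Number.parseInt(env['BROWSER_POOL_SIZE'] ?? '', 10);
  // Fall back to 3 instances when the value is missing or not a positive integer.
  const maxInstances = Number.isInteger(parsed) && parsed > 0 ? parsed : 3;
  // Only the string "false" (case-insensitive) turns headless mode off.
  const rawHeadless = (env['HEADLESS_MODE'] ?? 'true').trim().toLowerCase();
  const headless = rawHeadless !== 'false';
  return { browserPool: { maxInstances, headless } };
}
```

Inside a factory, this could be called as buildScraperOptions(process.env), which keeps the string handling in one place that can be tested without NestJS.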
Services
| Service | Description |
| ------------------------- | --------------------------------------- |
| ScraperEngineService | Main orchestrator for task execution |
| BrowserPoolService | Browser instance and context management |
| ProxyManagerService | Proxy rotation and health tracking |
| FingerprintService | Browser fingerprint generation |
| StealthService | Anti-detection measures |
| HumanInteractionService | Human-like typing/clicking |
| ActionExecutorService | Individual action execution |
| NavigationGuardService | Anti-bot detection and recovery |
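The table describes ProxyManagerService as handling proxy rotation and health tracking. A custom provider can apply the same idea on the host side. The sketch below is one possible approach, not the library's implementation: it follows the three IProxyProvider methods shown earlier in this README, but declares a local ProxyConfig stand-in (only host and port appear in the README) and omits the @Injectable() decorator so it runs standalone. The class name and the maxFailures threshold are hypothetical.

```typescript
// Local stand-in for the library's ProxyConfig; the real type may have more fields.
interface ProxyConfig {
  host: string;
  port: number;
}

class RoundRobinProxyProvider {
  private readonly proxies: ProxyConfig[];
  private readonly maxFailures: number; // hypothetical threshold
  private readonly failures = new Map<string, number>();
  private cursor = 0;

  constructor(proxies: ProxyConfig[], maxFailures = 2) {
    this.proxies = proxies;
    this.maxFailures = maxFailures;
  }

  private key(proxy: ProxyConfig): string {
    return `${proxy.host}:${proxy.port}`;
  }

  // Return the next proxy below the failure threshold, or null if none is left.
  async getProxy(_domain?: string): Promise<ProxyConfig | null> {
    const n = this.proxies.length;
    for (let i = 0; i < n; i++) {
      const index = (this.cursor + i) % n;
      const candidate = this.proxies[index];
      if (candidate && (this.failures.get(this.key(candidate)) ?? 0) < this.maxFailures) {
        // Advance the cursor past the proxy we hand out.
        this.cursor = (index + 1) % n;
        return candidate;
      }
    }
    return null;
  }

  // Count a failure; the proxy is skipped once it reaches maxFailures.
  async markProxyFailed(proxy: ProxyConfig, _reason: string): Promise<void> {
    const k = this.key(proxy);
    this.failures.set(k, (this.failures.get(k) ?? 0) + 1);
  }

  // No lease bookkeeping in this sketch, so releasing is a no-op.
  async releaseProxy(_proxy: ProxyConfig): Promise<void> {}
}
```

A database-backed provider would replace the in-memory array and map with queries, but the rotation and skip logic would stay the same.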
Interfaces
| Interface | Description |
| ------------------------ | ------------------------------ |
| TaskConfig | Declarative task configuration |
| ActionStep | Individual action in a task |
| ScraperContext | Browser page and context |
| ScraperExecutionResult | Result from task execution |
| IProxyProvider | Proxy provider integration |
| ISessionProvider | Session storage integration |
| IJobSignalChecker | Job control integration |
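Of the integration interfaces, IJobSignalChecker is the least obvious to implement because waitForResume has to block until the job is resumed. The sketch below shows one in-memory way to do that, assuming the two method signatures shown earlier in this README. The class name and the pause, cancel, and resume helpers are hypothetical host-side methods, not part of the library, and the choice to wake waiters on cancel is an assumption. The @Injectable() decorator is omitted so the block runs standalone.

```typescript
type JobSignal = 'PAUSE' | 'CANCEL' | null;

class InMemoryJobSignalChecker {
  private readonly signals = new Map<string, 'PAUSE' | 'CANCEL'>();
  private readonly waiters = new Map<string, Array<() => void>>();

  // Hypothetical host-side controls (for example, called from an admin endpoint).
  pause(jobId: string): void {
    this.signals.set(jobId, 'PAUSE');
  }

  cancel(jobId: string): void {
    this.signals.set(jobId, 'CANCEL');
    // Wake paused waiters so the caller can re-check and see CANCEL.
    this.wake(jobId);
  }

  resume(jobId: string): void {
    this.signals.delete(jobId);
    this.wake(jobId);
  }

  async checkSignal(jobId: string): Promise<JobSignal> {
    return this.signals.get(jobId) ?? null;
  }

  // Resolves immediately unless the job is paused; otherwise waits for resume() or cancel().
  waitForResume(jobId: string): Promise<void> {
    if (this.signals.get(jobId) !== 'PAUSE') {
      return Promise.resolve();
    }
    return new Promise<void>((resolve) => {
      const list = this.waiters.get(jobId) ?? [];
      list.push(resolve);
      this.waiters.set(jobId, list);
    });
  }

  private wake(jobId: string): void {
    const pending = this.waiters.get(jobId) ?? [];
    this.waiters.delete(jobId);
    for (const resolve of pending) resolve();
  }
}
```

A Redis-backed version would store the signal under a key per job and use pub/sub or polling in place of the in-memory waiter list.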
License
MIT
