# Scrapely
Declarative web scraping toolkit for Node.js built on Axios and Cheerio.
## Install

```bash
npm install @omindu/scrapely
```

## Quick Start

```js
const Scrapely = require('@omindu/scrapely');

const scraper = new Scrapely();
const title = await scraper.getText('https://example.com', 'h1');
```

## Features
- Schema-driven structured extraction
- Automatic retry with linear back-off
- Built-in rate limiting and user-agent rotation
- Auto-pagination
- Table, form, email, phone, and link extraction
- File and image downloads
- JSON / CSV export (RFC 4180)
- TTL-based response cache with max-size eviction
- Proxy support
- EventEmitter hooks (`request`, `response`, `retry`, `error`, `cacheHit`)
- Custom error hierarchy (`ScrapelyError`, `FetchError`, `ValidationError`, `ExportError`)
- TypeScript definitions included
## API
### `new Scrapely(options?)`

```js
const scraper = new Scrapely({
timeout: 30000, // request timeout (ms)
maxRetries: 3, // retry attempts
retryDelay: 1000, // base delay between retries (ms)
headers: {}, // default HTTP headers
followRedirects: true, // follow 3xx redirects
validateStatus: (s) => s >= 200 && s < 400,
rateLimit: 2, // max requests per second
proxy: null, // axios proxy config
rotateUserAgent: false, // cycle built-in UA strings
cache: true, // true | { ttl: 300000, maxSize: 200 }
});
```
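The `proxy` option is passed straight through to Axios. A minimal sketch, assuming Axios's standard proxy shape; the host, port, and credentials below are placeholders:

```js
// Hypothetical proxy settings — replace with your own proxy's details.
const proxied = new Scrapely({
  proxy: {
    protocol: 'http',
    host: '127.0.0.1',
    port: 8080,
    auth: { username: 'user', password: 'pass' }, // optional
  },
});
```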
## Core Methods

### `scraper.fetch(url, axiosConfig?) → Promise<string>`
Fetch raw HTML with retry, rate-limiting, UA rotation, and caching.
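For example (the URL is a placeholder):

```js
// Returns the raw HTML string; retries and rate limiting apply automatically.
const html = await scraper.fetch('https://example.com');
console.log(html.slice(0, 100));
```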
### `scraper.load(url, axiosConfig?) → Promise<CheerioAPI>`
Fetch and parse into a Cheerio instance.
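A minimal sketch; the returned Cheerio instance is used like jQuery on the server:

```js
const $ = await scraper.load('https://example.com');
const headings = $('h2').map((_, el) => $(el).text().trim()).get();
```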
### `scraper.getText(url, selector, opts?) → Promise<string | string[]>`

```js
const title = await scraper.getText('https://example.com', 'h1');
const items = await scraper.getText(url, 'li', { multiple: true });
```

### `scraper.getAttribute(url, selector, attribute, opts?) → Promise<string | string[]>`

```js
const href = await scraper.getAttribute(url, 'a', 'href');
const srcs = await scraper.getAttribute(url, 'img', 'src', { multiple: true });
```

### `scraper.getHtml(url, selector, opts?) → Promise<string | null>`
Returns inner HTML of matched elements.
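A minimal sketch (the selector is illustrative):

```js
// Inner HTML of the first match, or null when nothing matches.
const body = await scraper.getHtml('https://example.com', 'article');
```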
## Schema Extraction
### `scraper.extract(url, schema, axiosConfig?) → Promise<object>`

```js
const data = await scraper.extract('https://example.com', {
title: { selector: 'h1', type: 'text' },
description: { selector: 'meta[name="description"]', type: 'attribute', attribute: 'content' },
links: { selector: 'a', type: 'attribute', attribute: 'href', multiple: true },
price: { selector: '.price', type: 'text', transform: (v) => parseFloat(v.replace(/[^0-9.]/g, '')) },
});
```

Field definition:

| Property | Type | Default | Description |
|-------------|------------|----------|--------------------------------------|
| selector | string | — | CSS selector |
| type | string | 'text' | 'text' / 'html' / 'attribute' |
| attribute | string | — | Required when type is 'attribute' |
| multiple | boolean | false | Return array of matches |
| transform | Function | — | (raw, $, el) => value |
### `scraper.extractList(url, containerSelector, itemSchema, axiosConfig?) → Promise<object[]>`

```js
const products = await scraper.extractList(url, '.product-card', {
name: { selector: 'h3', type: 'text' },
price: { selector: '.price', type: 'text' },
image: { selector: 'img', type: 'attribute', attribute: 'src' },
});
```

## Pagination
### `scraper.paginate(startUrl, opts) → Promise<any[]>`

```js
const allProducts = await scraper.paginate('https://shop.com/products?page=1', {
nextSelector: 'a.next-page',
maxPages: 10,
dataExtractor: async ($, url, pageNum) => {
return $('.item').map((_, el) => $(el).text().trim()).get();
},
stopWhen: ($, collected, pageNum) => collected.length >= 100,
});
```

## Structured Extractors
### `scraper.extractTable(url, selector?, opts?) → Promise<{headers, rows} | null>`

```js
const table = await scraper.extractTable(url, 'table.data');
// { headers: ['Name', 'Price'], rows: [{ Name: 'A', Price: '10' }, ...] }
```

Pass `{ all: true }` to get every matched table.
### `scraper.extractForm(url, selector?) → Promise<object | object[]>`
Returns each form's action, method, and field definitions.
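A minimal sketch; the selector is illustrative, and the commented shape is an assumption based on the description above:

```js
const form = await scraper.extractForm('https://example.com', 'form#login');
// Assumed shape, per the description: { action: '/login', method: 'post', fields: [...] }
```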
### `scraper.extractEmails(url) → Promise<string[]>`

### `scraper.extractPhoneNumbers(url) → Promise<string[]>`
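Both return flat arrays of strings. A sketch (the URL is a placeholder):

```js
// Pull contact details found anywhere on the page.
const emails = await scraper.extractEmails('https://example.com/contact');
const phones = await scraper.extractPhoneNumbers('https://example.com/contact');
```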
### `scraper.extractLinks(url, opts?) → Promise<{href, text, title}[]>`

```js
const internal = await scraper.extractLinks(url, { internal: true, unique: true });
const external = await scraper.extractLinks(url, { external: true });
const filtered = await scraper.extractLinks(url, { pattern: '/products/' });
```

## Multi-page
### `scraper.scrapeMultiple(urls, handler, opts?) → Promise<any[]>`

```js
const results = await scraper.scrapeMultiple(
['https://a.com', 'https://b.com'],
async ($, url) => ({ url, title: $('h1').text().trim() }),
{ concurrency: 3, ignoreErrors: true }
);
```

## Downloads
### `scraper.downloadFile(url, dest, axiosConfig?) → Promise<string>`

### `scraper.downloadImages(url, selector?, dir?, opts?) → Promise<string[]>`

```js
await scraper.downloadFile('https://example.com/file.pdf', './downloads/file.pdf');
const saved = await scraper.downloadImages(url, 'img', './images', { ignoreErrors: true });
```

## Data Export
### `scraper.exportJSON(data, filepath) → Promise<string>`

### `scraper.exportCSV(data, filepath) → Promise<string>`

```js
await scraper.exportJSON(products, './output/products.json');
await scraper.exportCSV(products, './output/products.csv');
```

CSV output follows RFC 4180: values containing commas, quotes, or newlines are properly escaped.
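For instance, under RFC 4180 an embedded quote is doubled and the whole value is wrapped in quotes; the data below is illustrative:

```js
await scraper.exportCSV(
  [{ name: 'Widget, "Deluxe"', price: 9.99 }],
  './output/items.csv'
);
// items.csv:
// name,price
// "Widget, ""Deluxe""",9.99
```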
## Configuration
### `scraper.setHeaders(headers)`

```js
scraper.setHeaders({ 'Authorization': 'Bearer TOKEN' });
```

### `scraper.setCookies(cookies)`

```js
scraper.setCookies({ sessionId: '12345', userId: 'user1' });
scraper.setCookies('sessionId=12345; userId=user1');
```

### `scraper.clearCache()`
### `scraper.cacheSize → number`
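A sketch of checking and resetting the cache between runs:

```js
console.log(scraper.cacheSize); // number of cached responses
scraper.clearCache();           // drop all cached entries
console.log(scraper.cacheSize); // 0
```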
## Quick Scrape (Stateless)

```js
const { quickScrape } = require('@omindu/scrapely');
const title = await quickScrape.getText('https://example.com', 'h1');
const data = await quickScrape.extract(url, schema);
```

## Data Utilities

```js
const { DataUtils } = require('@omindu/scrapely');
DataUtils.cleanText(' extra spaces '); // 'extra spaces'
DataUtils.extractNumbers('price: $12.50, qty: 3'); // [12.5, 3]
DataUtils.parsePrice('$1,234.56'); // 1234.56
DataUtils.parsePrice('1.234,56 EUR'); // 1234.56
DataUtils.parseDate('2026-02-08'); // Date object
DataUtils.getDomain('https://example.com/path'); // 'example.com'
DataUtils.normalizeUrl(url, { removeTrailingSlash: true });
DataUtils.sanitizeFilename('file<name>.txt'); // 'file_name_.txt'
DataUtils.detectType('user@example.com'); // 'email'
```

## Events

```js
scraper.on('request', (url) => console.log('GET', url));
scraper.on('response', (url, status) => console.log(status, url));
scraper.on('retry', (url, attempt, err) => console.warn(`Retry ${attempt}`, url));
scraper.on('error', (err) => console.error(err.code, err.message));
scraper.on('cacheHit', (url) => console.log('Cache hit', url));
```

## Error Handling

```js
const { FetchError, ValidationError, ExportError } = require('@omindu/scrapely');
try {
await scraper.fetch(url);
} catch (err) {
if (err instanceof FetchError) {
console.error(err.code, err.meta.url, err.meta.attempts);
}
}
```

All errors extend `ScrapelyError` and include a `code` and `meta` object.

| Error Class | Code | When |
|-------------------|---------------------|-----------------------------|
| FetchError | ERR_FETCH_FAILED | All retry attempts exhausted |
| ValidationError | ERR_VALIDATION | Invalid parameter |
| ExportError | ERR_EXPORT_FAILED | File write failure |
## Best Practices

- Check `robots.txt` before scraping any site.
- Use `rateLimit` to avoid overwhelming servers.
- Keep `concurrency` low (2–3) for multi-page scraping.
- Set a descriptive `User-Agent` header identifying your bot.
- Always wrap scraping code in `try/catch`, as in the sketch below.
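A minimal sketch combining these practices; the User-Agent string and URLs are placeholders:

```js
const Scrapely = require('@omindu/scrapely');

const scraper = new Scrapely({
  rateLimit: 1, // stay well under the site's tolerance
  headers: { 'User-Agent': 'MyResearchBot/1.0 (+https://example.com/bot)' },
});

try {
  const results = await scraper.scrapeMultiple(
    ['https://example.com/a', 'https://example.com/b'],
    async ($, url) => ({ url, title: $('h1').text().trim() }),
    { concurrency: 2, ignoreErrors: true }
  );
  console.log(results);
} catch (err) {
  console.error(err.code, err.message);
}
```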
## License
MIT — Omindu Dissanayaka
