nemo-webminer

v2.0.0

Published

A powerful Node.js web scraping toolkit with pagination, batch processing, data transformation, and multiple export formats.


Nemo WebMiner


A powerful Node.js web scraping toolkit built on Playwright with advanced features including pagination, batch processing, data transformation, and multiple export formats.


✨ Features

Core Features

  • 🎯 Multiple Output Formats - Export to JSON, XLSX, CSV, Markdown, SQL, HTML
  • 🔄 Data Transformation - Transform data on-the-fly with custom or built-in functions
  • 📄 Pagination Support - Automatically scrape multiple pages
  • 🔢 Batch URL Scraping - Scrape multiple sites concurrently
  • 🔁 Retry Logic - Automatic retry with exponential backoff
  • 🌐 Browser Configuration - Custom viewports, user agents, and device presets
  • ⏱️ Wait Strategies - Handle dynamic content with flexible wait options
  • 🎬 Before-Scrape Actions - Execute clicks, scrolls, and form fills before scraping
  • 📸 Screenshot Capture - Take screenshots during scraping
  • 📋 Metadata Extraction - Extract SEO data

📦 Installation

npm install nemo-webminer
npm install playwright xlsx   # Peer dependencies
npx playwright install        # Install Playwright browsers

🚀 Quick Start

Basic Scraping

const { nemoMine } = require('nemo-webminer');

// Simple JSON export
const data = await nemoMine({
  url: 'https://example.com',
  selectors: { Title: 'h1', Links: 'a' },
  format: 'json'
});

console.log(data);
// { Title: ['Example Domain'], Links: ['More information...'] }

Export to Different Formats

// Excel (XLSX)
await nemoMine({
  url: 'https://example.com',
  selectors: { Title: 'h1', Content: 'p' },
  format: 'xlsx',
  output: 'data.xlsx'
});

// CSV
await nemoMine({
  url: 'https://example.com',
  selectors: { Name: '.name', Email: '.email' },
  format: 'csv',
  output: 'contacts.csv'
});

// Markdown
await nemoMine({
  url: 'https://example.com',
  selectors: { Title: 'h1', Content: 'p' },
  format: 'markdown',
  output: 'data.md'
});
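For intuition about the markdown format, the writer essentially renders the column-oriented result (`{ Column: [values] }`) as a pipe table. A standalone sketch of that rendering (illustrative only; nemo-webminer's actual renderer may differ):

```javascript
// Sketch: render { Column: [values] } data as a markdown table.
// Illustrative only -- the package's actual markdown output may differ.
function toMarkdownTable(data) {
  const columns = Object.keys(data);
  const rowCount = Math.max(...columns.map((c) => data[c].length));
  const lines = [
    `| ${columns.join(' | ')} |`,
    `| ${columns.map(() => '---').join(' | ')} |`,
  ];
  for (let i = 0; i < rowCount; i++) {
    // Pad short columns with empty cells so every row has the same width.
    lines.push(`| ${columns.map((c) => data[c][i] ?? '').join(' | ')} |`);
  }
  return lines.join('\n');
}

console.log(toMarkdownTable({ Title: ['Example Domain'], Links: ['More information...'] }));
// | Title | Links |
// | --- | --- |
// | Example Domain | More information... |
```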

💡 Advanced Features

Pagination

Scrape multiple pages automatically:

await nemoMine({
  url: 'https://example.com/page-1',
  selectors: { Product: '.product-name', Price: '.price' },
  pagination: {
    nextSelector: '.next-page',  // Click-based navigation
    maxPages: 10,
    waitBetweenPages: 2000
  },
  format: 'xlsx',
  output: 'all-products.xlsx'
});

// Or use URL patterns
await nemoMine({
  url: 'https://example.com/page-1',
  selectors: { Title: 'h1' },
  pagination: {
    urlPattern: 'https://example.com/page-{n}',
    maxPages: 5
  },
  format: 'json'
});
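Under the hood, a `urlPattern` like `'https://example.com/page-{n}'` presumably expands into one URL per page. A minimal sketch of that expansion (an assumption about the internals; it also assumes pages are numbered from 1):

```javascript
// Sketch: expand a {n} URL pattern into a list of page URLs.
// Illustrative only -- assumes page numbering starts at 1.
function expandUrlPattern(urlPattern, maxPages) {
  const urls = [];
  for (let n = 1; n <= maxPages; n++) {
    urls.push(urlPattern.replace('{n}', String(n)));
  }
  return urls;
}

console.log(expandUrlPattern('https://example.com/page-{n}', 3));
// [ 'https://example.com/page-1',
//   'https://example.com/page-2',
//   'https://example.com/page-3' ]
```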

Data Transformation

Transform data as you scrape:

const { nemoMine, commonTransforms } = require('nemo-webminer');

await nemoMine({
  url: 'https://shop.com/products',
  selectors: {
    Name: '.product-name',
    Price: '.price',
    Rating: '.rating'
  },
  transform: {
    Price: commonTransforms.price,  // "$1,234.56" → 1234.56
    Rating: (text) => parseFloat(text.match(/(\d+\.?\d*)/)?.[1] || '0')
  },
  format: 'json'
});

Available Common Transforms:

  • price - Extract numeric price values
  • email - Extract email addresses
  • phone - Extract phone numbers
  • url - Extract URLs
  • date - Parse dates to ISO format
  • number - Extract numbers
  • boolean - Convert to boolean
  • And more...
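The exact implementations are internal to the package, but each helper is a plain text → value function. A rough sketch of what `price` and `boolean` might look like (illustrative only; the built-in helpers may behave differently):

```javascript
// Sketch: plausible implementations of two common transforms.
// Illustrative only -- the package's built-in helpers may differ.
const sketchTransforms = {
  // "$1,234.56" -> 1234.56
  price: (text) => {
    const match = String(text).replace(/,/g, '').match(/-?\d+(\.\d+)?/);
    return match ? parseFloat(match[0]) : null;
  },
  // "Yes" / "true" / "1" -> true, everything else -> false
  boolean: (text) => /^(yes|true|1|on)$/i.test(String(text).trim()),
};

console.log(sketchTransforms.price('$1,234.56')); // 1234.56
console.log(sketchTransforms.boolean('Yes'));     // true
```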

Batch URL Scraping

Scrape multiple URLs concurrently:

await nemoMine({
  urls: [
    'https://site1.com',
    'https://site2.com',
    'https://site3.com'
  ],
  selectors: { Title: 'h1', Content: '.content' },
  batch: {
    concurrent: 3,
    retryFailed: true,
    continueOnError: true
  },
  format: 'json'
});

Retry Logic

Handle failures gracefully:

await nemoMine({
  url: 'https://unreliable-site.com',
  selectors: { Data: '.content' },
  retry: {
    maxAttempts: 3,
    backoff: 'exponential',  // or 'linear', 'fixed'
    initialDelay: 1000,
    onRetry: (attempt, error) => {
      console.log(`Retry ${attempt}: ${error.message}`);
    }
  },
  format: 'json'
});
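The delay between attempts follows the chosen backoff curve. A standalone sketch of the schedule, assuming the conventional formulas (the package's exact timing may differ):

```javascript
// Sketch: delay before retry attempt `attempt` (1-based), in milliseconds.
// Illustrative only -- assumes conventional backoff formulas.
function backoffDelay(attempt, { backoff = 'exponential', initialDelay = 1000 } = {}) {
  switch (backoff) {
    case 'exponential': return initialDelay * 2 ** (attempt - 1); // 1s, 2s, 4s, ...
    case 'linear':      return initialDelay * attempt;            // 1s, 2s, 3s, ...
    case 'fixed':       return initialDelay;                      // 1s, 1s, 1s, ...
    default: throw new Error(`Unknown backoff: ${backoff}`);
  }
}

console.log([1, 2, 3].map((a) => backoffDelay(a))); // [ 1000, 2000, 4000 ]
```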

Browser Configuration

Customize browser behavior:

// Use presets
await nemoMine({
  url: 'https://example.com',
  selectors: { Content: '.content' },
  browserPreset: 'mobileIPhone',  // or 'desktopChrome', 'tabletIPad'
  format: 'json'
});

// Custom configuration
await nemoMine({
  url: 'https://example.com',
  selectors: { Data: '.data' },
  browserOptions: {
    viewport: { width: 1920, height: 1080 },
    userAgent: 'Custom User Agent',
    headers: { 'Authorization': 'Bearer token' }
  },
  format: 'json'
});

Before-Scrape Actions

Execute actions before scraping:

await nemoMine({
  url: 'https://example.com',
  selectors: { Product: '.product' },
  beforeScrape: async (page) => {
    // Click "Load More" button
    await page.click('.load-more');
    await page.waitForTimeout(1000);
    
    // Scroll to bottom
    await page.evaluate(() => {
      window.scrollTo(0, document.body.scrollHeight);
    });
  },
  format: 'json'
});

// Or use configuration object
await nemoMine({
  url: 'https://example.com',
  selectors: { Content: '.content' },
  beforeScrape: {
    scroll: 'bottom',
    click: '.load-more',
    wait: 2000
  },
  format: 'json'
});

Screenshot Capture

Take screenshots during scraping:

await nemoMine({
  url: 'https://example.com',
  selectors: { Title: 'h1' },
  screenshot: {
    path: 'screenshot.png',
    fullPage: true
  },
  format: 'json'
});

Metadata Extraction

Extract SEO and social media metadata:

const result = await nemoMine({
  url: 'https://example.com',
  selectors: { Title: 'h1' },
  metadata: true,
  format: 'json'
});

console.log(result.metadata);
// {
//   title: 'Page Title',
//   description: 'Page description',
//   openGraph: { ... },
//   twitter: { ... }
// }

📚 API Reference

Main Function

nemoMine(options)

Options:

| Parameter | Type | Default | Description |
|-----------|------|---------|-------------|
| url | string | - | Single URL to scrape |
| urls | array | - | Multiple URLs for batch scraping |
| selectors | object/string | - | CSS selectors for data extraction |
| output | string | 'output.xlsx' | Output filename |
| format | string | 'file' | Output format: 'json', 'xlsx', 'csv', 'markdown', 'sql', 'html' |
| timeout | number | 30000 | Page load timeout (ms) |
| pagination | object | - | Pagination configuration |
| batch | object | - | Batch processing configuration |
| transform | object | - | Data transformation functions |
| retry | object | - | Retry configuration |
| browserOptions | object | - | Browser configuration |
| browserPreset | string | - | Browser preset name |
| waitStrategy | object | - | Wait strategy configuration |
| beforeScrape | function/object | - | Actions to execute before scraping |
| screenshot | object | - | Screenshot configuration |
| metadata | boolean | false | Extract page metadata |

Selector Formats

// 1. Object format
{ Title: 'h1', Links: 'a', Price: '.price' }

// 2. String format with names
'Title: h1, Links: a, Price: .price'

// 3. Simple string format
'h1, a, .price'

// 4. No selectors (scrape all tags)
undefined // or null

Output Formats

| Format | Extension | Description |
|--------|-----------|-------------|
| json | - | JavaScript object (in-memory) |
| xlsx | .xlsx | Excel spreadsheet |
| csv | .csv | Comma-separated values |
| markdown | .md | Markdown table |
| sql | .sql | SQL INSERT statements |
| html | .html | HTML table |
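To give a feel for the sql format, here is a sketch of how the column-oriented result could be turned into INSERT statements (illustrative only; the package's actual table naming, quoting, and escaping may differ):

```javascript
// Sketch: render { Column: [values] } data as SQL INSERT statements.
// Illustrative only -- actual table naming and escaping may differ.
function toInsertStatements(data, table = 'scraped_data') {
  const columns = Object.keys(data);
  const rowCount = Math.max(...columns.map((c) => data[c].length));
  // Double single quotes, the standard SQL string-escaping rule.
  const escape = (v) => `'${String(v ?? '').replace(/'/g, "''")}'`;
  const statements = [];
  for (let i = 0; i < rowCount; i++) {
    const values = columns.map((c) => escape(data[c][i])).join(', ');
    statements.push(`INSERT INTO ${table} (${columns.join(', ')}) VALUES (${values});`);
  }
  return statements.join('\n');
}

console.log(toInsertStatements({ Title: ['Example Domain'] }));
// INSERT INTO scraped_data (Title) VALUES ('Example Domain');
```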


🎯 Real-World Examples

E-commerce Product Scraper

const products = await nemoMine({
  url: 'https://shop.example.com/products',
  selectors: {
    Name: '.product-name',
    Price: '.product-price',
    Rating: '.product-rating',
    InStock: '.stock-status'
  },
  pagination: {
    nextSelector: '.next-page',
    maxPages: 5,
    waitBetweenPages: 2000
  },
  transform: {
    Price: commonTransforms.price,
    Rating: (text) => parseFloat(text.match(/(\d+\.?\d*)/)?.[1] || '0'),
    InStock: (text) => text.toLowerCase().includes('in stock')
  },
  format: 'csv',
  output: 'products.csv'
});

Job Listings Aggregator

const jobs = await nemoMine({
  urls: [
    'https://jobs1.com/listings',
    'https://jobs2.com/openings',
    'https://jobs3.com/careers'
  ],
  selectors: {
    Title: '.job-title',
    Company: '.company-name',
    Location: '.location',
    Salary: '.salary'
  },
  batch: {
    concurrent: 3,
    continueOnError: true
  },
  transform: {
    Salary: commonTransforms.price
  },
  retry: {
    maxAttempts: 3,
    backoff: 'exponential'
  },
  format: 'xlsx',
  output: 'all-jobs.xlsx'
});

News Article Scraper

const articles = await nemoMine({
  url: 'https://news.example.com',
  selectors: {
    Headline: 'h1',
    Author: '.author',
    Date: '.publish-date',
    Content: '.article-body'
  },
  beforeScrape: {
    scroll: 'bottom',
    wait: 1000
  },
  transform: {
    Date: commonTransforms.date
  },
  metadata: true,
  screenshot: {
    path: 'article.png',
    fullPage: true
  },
  format: 'markdown',
  output: 'articles.md'
});

📖 Step-by-Step Guides

This section provides detailed walkthroughs for common scenarios.

Basic Setup

  1. Install the module in your project:

    npm install nemo-webminer
       
  2. Import the module:

    const { nemoMine } = require('nemo-webminer');
    // Or if using directly:
    const { nemoMine } = require('./path/to/nemo-webminer/dist/index.js');
  3. Ensure Playwright browsers are installed:

    npx playwright install

Scenario 1: Scraping with Specific Selectors to JSON

  1. Define target URL and selectors:

    const url = 'https://example.com';
    const selectors = {
      'Title': 'h1',                   // Get main headings
      'Paragraphs': 'p',               // Get paragraphs
      'Links': 'a',                    // Get links
      'Images': 'img',                 // Get images (will include src and alt attributes)
      'ListItems': 'ul > li, ol > li'  // Get list items from unordered and ordered lists
    };
  2. Call nemoMine with JSON output format:

    nemoMine({
      url,
      selectors,
      format: 'json'
    })
    .then(data => {
      console.log('Scraped data:', data);
    })
    .catch(error => {
      console.error('Scraping failed:', error.message);
    });
  3. The resulting data structure will be:

    {
      "Title": ["Example Domain", "Another Heading", ...],
      "Paragraphs": ["This domain is for use in examples...", ...],
      "Links": ["More information...", ...],
      "Images": ["Image description", ...],
      "ListItems": ["List item 1", "List item 2", ...]
    }

Scenario 2: Scraping All Tags to Excel File

  1. Define target URL (no selectors necessary):

    const url = 'https://example.com';
    const outputFile = 'all_tags_data.xlsx';
  2. Call nemoMine with file output format:

    nemoMine({
      url,
      // No selectors provided will scrape all tags
      output: outputFile,
      format: 'file' // Default
    })
    .then(filePath => {
      console.log(`Excel file saved to: ${filePath}`);
      // Open the Excel file for further analysis
    })
    .catch(error => {
      console.error('Scraping failed:', error.message);
    });

Scenario 3: Using String-Format Selectors

  1. Define your target URL and selectors as a string:

    const url = 'https://example.com';
    // Format: "Name1: selector1, Name2: selector2"
    const selectors = 'Headings: h1, h2, h3, Content: p, .content, Article: article';
  2. Call nemoMine with your preferred output format:

    nemoMine({
      url,
      selectors,
      format: 'json' // Or 'file' with an output filename
    })
    .then(result => {
      console.log('Result:', result);
    })
    .catch(error => {
      console.error('Error:', error.message);
    });

Scenario 4: Integration with Express API

  1. Set up an Express route to handle scraping requests:
    const express = require('express');
    const { nemoMine } = require('nemo-webminer');
    const app = express();
       
    app.use(express.json());
       
    app.post('/api/scrape', async (req, res) => {
      try {
        const { url, selectors, format = 'json' } = req.body;
           
        if (!url) {
          return res.status(400).json({ error: 'URL is required' });
        }
           
        const result = await nemoMine({ url, selectors, format });
           
        if (format === 'json') {
          // Return JSON data
          return res.json(result);
        } else {
          // For file output, save the file
          return res.json({ filePath: result });
        }
      } catch (error) {
        res.status(500).json({ error: error.message });
      }
    });
       
    app.listen(3005, () => {
      console.log('API server running on port 3005');
    });

Demo Web UI

  1. Start the demo server:

    npm run demo
  2. Open your browser at http://localhost:3004
  3. Enter a URL and optionally enter selectors in one of these formats:
    • Comma-separated: h1, a, p
    • Comma-separated with names: Title: h1, Links: a
    • Or leave blank to scrape all HTML tags
  4. Click "Download XLSX" to get an Excel file or "Get JSON Data" to see the JSON output directly in the browser.

The demo interface provides immediate feedback and makes it easy to experiment with different selectors.


API Notes

  • JSON Output Structure:
    • With selectors: { "SelectorName": ["value1", "value2", ...] }
    • Without selectors: { "tagName": ["value1", "value2", ...] }

Selector Formats Explained

Nemo-webminer supports multiple ways to specify which elements to scrape. Below are simple explanations of each format:

1. Object Format

// Define selectors as an object
const selectors = {
  "Title": "h1",                  // Key is the column name, value is the CSS selector
  "Content": "p",
  "Links": "a",
  "ImportantText": ".highlight"    // Can use any valid CSS selector
};

// Pass to nemoMine
nemoMine({ url: 'https://example.com', selectors, format: 'json' });

2. String Format with Names

// Define selectors as a string with name:selector pairs
const selectors = "Title: h1, Content: p, Links: a, ImportantText: .highlight";

// Pass to nemoMine
nemoMine({ url: 'https://example.com', selectors, format: 'json' });

3. Simple String Format

// Just list the selectors (column names will be the same as selectors)
const selectors = "h1, p, a, .highlight";

// This is equivalent to:
// { "h1": "h1", "p": "p", "a": "a", ".highlight": ".highlight" }

nemoMine({ url: 'https://example.com', selectors, format: 'json' });

4. No Selectors (All Tags)

// Remove the selectors parameter or pass an empty string
nemoMine({ url: 'https://example.com', format: 'json' });

// Results will be grouped by HTML tag name
// { "h1": [...], "p": [...], "a": [...], "div": [...], ... }
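To make the relationship between the formats concrete, here is one plausible way the string formats could be normalized into the object format (a sketch; nemo-webminer's actual parser may differ on edge cases, notably CSS pseudo-selectors like a:hover, which this naive colon/comma split would misparse but which the real parser supports per the changelog):

```javascript
// Sketch: normalize string selector formats into the object format.
// Illustrative only -- the real parser handles more edge cases.
function normalizeSelectors(selectors) {
  if (!selectors) return null;                    // no selectors -> scrape all tags
  if (typeof selectors === 'object') return selectors;
  const result = {};
  for (const part of selectors.split(',')) {
    const trimmed = part.trim();
    if (!trimmed) continue;
    const colon = trimmed.indexOf(':');
    if (colon > -1) {
      // "Name: selector" pair
      result[trimmed.slice(0, colon).trim()] = trimmed.slice(colon + 1).trim();
    } else {
      // Bare selector: column name equals the selector itself.
      result[trimmed] = trimmed;
    }
  }
  return result;
}

console.log(normalizeSelectors('Title: h1, Links: a'));
// { Title: 'h1', Links: 'a' }
console.log(normalizeSelectors('h1, a, .highlight'));
// { h1: 'h1', a: 'a', '.highlight': '.highlight' }
```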

📝 Changelog

[2.0.0]

🎉 Major Release - 10 New Features

Added:

  • ✨ Multiple output formats (CSV, Markdown, SQL, HTML)
  • ✨ Data transformation with 11 built-in helpers
  • ✨ Pagination support (next button + URL patterns)
  • ✨ Batch URL scraping with concurrent processing
  • ✨ Retry logic
  • ✨ Browser configuration with device presets
  • ✨ Wait strategies for dynamic content
  • ✨ Before-scrape actions (click, scroll, fill)
  • ✨ Screenshot capture
  • ✨ Metadata extraction

Fixed:

  • 🐛 Selector parsing now supports CSS pseudo-selectors (:hover, :first-child, etc.)
  • 🐛 Added URL format validation with clear error messages
  • 🐛 Added configurable timeout parameter (default: 30s)
  • 🐛 Fixed memory leak in demo server with proper file cleanup

Maintained:

  • ✅ 100% backward compatibility - no breaking changes
  • ✅ All existing APIs continue to work
  • ✅ Zero migration required for existing users

[1.0.1]

Fixed:

  • Bug fixes and improvements

[1.0.0]

Initial Release:

  • Basic web scraping functionality
  • JSON and XLSX export
  • CSS selector support
  • Interactive demo UI

🙏 Acknowledgments

  • Built with Playwright for reliable browser automation
  • Uses xlsx for Excel file generation
  • Inspired by the need for a simple yet powerful web scraping solution