npm package discovery and stats viewer.

Discover Tips

  • General search

    [free text search, go nuts!]

  • Package details

    pkg:[package-name]

  • User packages

    @[username]

Sponsor

Optimize Toolset

I’ve always been into building performant and accessible sites, but lately I’ve been taking it extremely seriously. So much so that I’ve been building a tool to help me optimize and monitor the sites that I build to make sure that I’m making an attempt to offer the best experience to those who visit them. If you’re into performant, accessible and SEO friendly sites, you might like it too! You can check it out at Optimize Toolset.

About

Hi, 👋, I’m Ryan Hefner  and I built this site for me, and you! The goal of this site was to provide an easy way for me to check the stats on my npm packages, both for prioritizing issues and updates, and to give me a little kick in the pants to keep up on stuff.

As I was building it, I realized that I was actually using the tool to build the tool, and figured I might as well put this out there and hopefully others will find it to be a fast and useful way to search and browse npm packages as I have.

If you’re interested in other things I’m working on, follow me on Twitter or check out the open source projects I’ve been publishing on GitHub.

I am also working on a Twitter bot for this site to tweet the most popular, newest, random packages from npm. Please follow that account now and it will start sending out packages soon–ish.

Open Software & Tools

This site wouldn’t be possible without the immense generosity and tireless efforts from the people who make contributions to the world and share their work via open source initiatives. Thank you 🙏

© 2026 – Pkg Stats / Ryan Hefner

@yamiwamiyu/nodejs-crawl

v1.0.3

Published

Webpage crawler using puppeteer

Readme

nodejs-crawl

介绍

使用puppeteer的网页数据爬虫

可爱的爬虫

  • 魔兽世界代练 https://www.g2g.com/categories/wow-boosting-service?sort=most_recent
  • FIFA金币站评论 https://futcoin.net/en/reviews
  • FIFA23球员背景卡及CSS https://www.futbin.com/players

获取爬虫

  1. 获取已有爬虫可关注git源码库
  • https://gitee.com/yamiwamiyu/nodejs-crawl
  • https://github.com/yamiwamiyu/nodejs-crawl
  1. 定制爬虫和学习,可联系作者
  • QQ: 359602182
  • 微信: yamiwamiyu

安装教程

npm i @yamiwamiyu/nodejs-crawl

使用和参数说明

  1. 翻页爬行采集
const { pageCrawl } = require('@yamiwamiyu/nodejs-crawl');

pageCrawl({
    // 配置信息,'*' 开头表示必须


    // 浏览器相关设置 ↓

    // 使用你本地安装的浏览器进行操作
    chrome: "C:/Program Files (x86)/Google/Chrome/Application/chrome.exe",
    // 在你已经打开的Chrome浏览器的某个页签里操作(开发时推荐使用)
    // 这需要你的浏览器开启远程调试功能,开启方法如下
    // 假如你是Windows系统,Chrome浏览器位于C:\Program Files (x86)\Google\Chrome\Application\chrome.exe
    // - CMD: 打开控制台(Win + R, 输入cmd回车),输入"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222
    // - 快捷方式: 鼠标右键 -> 新建 -> 快捷方式,输入"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222,双击打开
    // - 代码开启:require('child_process').exec('"C:\Program Files (x86)\Google\Chrome\Application\chrome.exe" --remote-debugging-port=9222')
    port: 9222,
    // 不显示浏览器
    headless: true,


    // 采集器相关设置 ↓

    // * 采集的数据输出的文件名,不配置会不输出文件
    output: __filename,
    // 输出.csv文件,不配置默认输出.json文件
    csv: true,

    // * 采集数据的目标网站url,地址打开应该可以采集到数据的第一页
    url: "https://www.npmjs.com/package/@yamiwamiyu/nodejs-crawl",
    // * 翻页按钮的DOM元素selector,如果页面没有按钮或按钮禁用就会结束爬行
    // selector可参考教程https://www.runoob.com/cssref/css-selectors.html
    // 可以直接是字符串例如next: ".next-button-selector"
    // 可以是function(已经采集的页数),返回空时可以结束爬行
    next: (pageCount) => {
      // 采集10页后结束
      if (pageCount == 10)
        return;
      return ".next-button-selector";
    },
    // 每采集到一页数据时回调
    ondata: (data, url) => {
      console.log("当前页", url, "采集到的数据", data);
    },

    // 接口采集 ↓ (接口必须是返回json格式数据,否则请简单修改源码部分)

    // * 接口的url,包含部分即可
    request: 'api/test?',
    // * 请根据接口返回的json数据,返回最终的数据数组
    // 例如数据格式为 { code: 0, msg: null, results: [{},{},{},{}] }
    json: (i) => i.results,

    // 页面采集 ↓
    // 等待数据渲染完成的关键DOM元素的selector
    wait: ".all-datas-selector",
    // * 从页面采集数据,JS的DOM操作,最终返回数据数组
    // 你应该到浏览器的调试工具控制台中先写好再粘贴到这里
    dom: () => {
      var data = [];
      var comments = document.querySelectorAll(".all-datas-selector");
      for (var c of comments) {
        data.push({
          'key1': c.querySelector(".value1").innerText,
          'key2': c.querySelector(".value2").dataset.id,
          'key3': c.querySelector(".value3").src,
        })
      }
      return data;
    },
  })
  1. 直接接口采集
const { xhrCrawl } = require('@yamiwamiyu/nodejs-crawl');

// 循环构建每页的接口参数
const datas = [];
for (let i = 1; i <= 2; i++) {
  datas.push({
    service_id: 'lgc_service_18',
    brand_id: 'lgc_game_2299',
    sort: 'most_recent',
    page: i,
    page_size: 48,
    currency: 'CNY',
    country: 'CN',
  })
}
// 发起接口拉取数据
await xhrCrawl({
  // 接口并发的数量,默认4
  queue: 4,
  // GET或POST,默认GET
  method: "GET",
  // HTTP请求头
  headers: {
    header1: "value1",
    header2: "value2",
  },
  // * 需要采集的每页的接口参数
  datas: datas,

  // 以下参数均和 pageCrawl 方法的参数一致

  // * 接口地址
  url: "https://sls.g2g.com/offer/search",
  // * 请根据接口返回的json数据,返回最终的数据数组
  json: (i) => i.payload.results,
  // 每采集到一页数据时回调
  ondata: (data, param) => {
    console.log("当前接口参数", param, "采集到了数据", data.length, "条");
  },
  // * 采集的数据输出的文件名,不配置会不输出文件
  output: __filename,
  // 输出.csv文件,不配置默认输出.json文件
  csv: true,
});

实战使用示例

  1. 通过接口获取数据
const { pageCrawl } = require('@yamiwamiyu/nodejs-crawl');

pageCrawl({
  output: __filename,
  // 接口地址
  request: "offer/search?",
  // json数据
  json: (i) => i.payload.results,
  url: "https://www.g2g.com/categories/wow-boosting-service?sort=most_recent",
  next: (i) => {
    // 测试只采集2页
    if (i == 2)
      return;
    return '.q-pagination>button:last-child';
  },
})
  1. 通过页面获取数据
const { pageCrawl } = require('@yamiwamiyu/nodejs-crawl');

pageCrawl({
  output: __filename,
  url: "https://futcoin.net/en/reviews",
  // 等待页面渲染完成的关键DOM元素的selector
  wait: ".fc-comment .uk-first-column .uk-text-left",
  // 从页面采集数据
  dom: () => {
    var array = [];
    var comments = document.querySelectorAll(".fc-comment");
    for (var c of comments) {
      array.push({
        'user': c.querySelector(".uk-first-column .uk-text-left").innerText,
        'nation': ((i) => {
          var index = i.lastIndexOf('/') + 1;
          return i.substring(index, index + 2).toUpperCase();
        })(c.querySelector(".uk-first-column [data-uk-img]").dataset.src),
        'time': c.querySelector(".uk-first-column .uk-text-muted").innerText,
        'coin': ((i) => {
          return i = i.substring(1, i.indexOf(' ', 1)).replaceAll(',', '');
        })(c.querySelector(".uk-width-expand .uk-first-column").innerText),
        'platform': c.querySelector(".uk-width-expand .uk-text-nowrap").innerText,
        'star': c.querySelector(".fc-comment-rating").querySelectorAll(".fc-icon-star").length,
        'content': c.querySelector(".uk-card>.uk-text-left").innerText,
      })
    }
    return array;
  },
  next: (i) => {
    // 测试只采集2页
    if (i == 2)
      return;
    return '[data-uk-icon="arrow-right"]';
  },
  // 输出成csv
  csv: true,
})
  1. 一边拉取数据一边保存,下次从上次拉取到的位置继续
const { pageCrawl, writeCSVLines } = require('@yamiwamiyu/nodejs-crawl');
const fs = require('fs');

const file = __dirname + "/data.csv"
// 没有数据文件则先创建数据文件
if (!fs.existsSync(file))
  fs.writeFileSync(file, "");
pageCrawl({
  // ... 其它参数设置

  // 不再对最终采集到的所有数据进行文件输出
  output: '',
  // 采集到一页数据后就追加到数据文件末尾
  ondata(data) {
    fs.appendFileSync(file, writeCSVLines(data));
  },
})
  1. 直接接口并发采集,采集速度快
const { xhrCrawl } = require('@yamiwamiyu/nodejs-crawl');

// 循环构建每页的接口参数
const datas = [];
for (let i = 1; i <= 2; i++) {
  datas.push({
    service_id: 'lgc_service_18',
    brand_id: 'lgc_game_2299',
    sort: 'most_recent',
    page: i,
    page_size: 48,
    currency: 'CNY',
    country: 'CN',
  })
}
// 发起接口拉取数据
await xhrCrawl({
  datas,
  url: "https://sls.g2g.com/offer/search",
  json: (i) => i.payload.results,
  output: __filename,
});