
pup-crawler

v2.2.0


A simple web crawler that extracts data from websites, using Node.js and Puppeteer.

Readme

PUP Crawler

This is a simple crawler built on Puppeteer that can crawl both dynamically and statically loaded websites. It is typically used for sites with a list → detail → content structure, such as movie and video sites.

GitHub repository

Usage

npm install pup-crawler

Basic usage:

import { PupCrawler } from 'pup-crawler'

;(async () => {
  const crawler = new PupCrawler()
  await crawler.open()
  await crawler.crawlPage({
    name: 'list',
    url: 'https://www.example.com/list',
    target: {
      values: [{ label: 'detailData', attr: 'href', css: '.list-item > a', all: true }],
    },
    callback: async (result: any) => {
      const { detailData } = result
      console.log(detailData)
    },
  })
  await crawler.close()
})()

Advanced usage: see the example.ts file, which crawls list, detail, and content pages, using 腾讯动漫 (Tencent Comics) as an example.

// The Target type
target: {
    actives: [
      { type: 'click', css: 'a.btn-startchat', delay: 1000 },
      { type: 'input', css: '#chat-input', value: 'input text...', delay: 1000 },
      ...
    ],
    values: [
        // 1. Plain value: read the text content of .item > a. attr defaults to textContent
        {label: 'val', css: '.item > a'},
        // 2. Attribute value: read the href attribute of .item > a. attr = getAttribute('xxx')
        {label: 'val2', attr: 'href', css: '.item > a'},
        // 3. Equivalent to document.querySelectorAll('.item > a'). all: true = querySelectorAll, false = querySelector
        {label: 'val3', attr: 'href', css: '.item > a', all: true},
        // 4. Equivalent to document.querySelectorAll('.item > a')[3]. Set all: true, allIdx: 3
        {label: 'val4', attr: 'href', css: '.item > a', all: true, allIdx: 3},
        // 5. Equivalent to document.querySelectorAll('.item > a')[3].querySelector('.sub-item > a'). Set all: true, allIdx: 3
        {label: 'val5', attr: 'href', css: ['.item > a', '.sub-item > a'], all: true, allIdx: 3},
        // 6. Read window.location.href; no css needed, the lookup starts from the window object
        {label: 'val6',  attr: 'window.location.href'},
        // 7. Read the href of multiple <a> tags and iterate over them. Set loopOpt: CrawlOptions; the value produced by loopOpt becomes the next target.values object and is assigned to label
        {label: 'val7', attr: 'href', css: '.list-item > ul > li > a', all: true, loopOpt: NextPageOpt},
        ...
    ],
    // Loop within pages of this type, e.g. collect the playback sources for each episode of a TV series
    // loopKey: 1. the label from values above to loop over (usually one with all: true; its loopOpt does no further recursion)
    // loopVals: 2. the labels from values above to return per iteration. For example, the final page may only need val2 and val4
    recursion: { loopKey: 'playList', loopVals: ['val2', 'val4'] },
    // Pre-hook; return true to continue. Commonly used to gate crawling, e.g. checking a database for whether the current value already exists
    before: () => boolean | Promise<boolean>,
    // Post-hook; if it returns false the crawl result is returned immediately, otherwise it returns an object shaped like its input. Useful for filtering out unneeded values before further processing
    after: (obj: object) => false | Promise<obj>,
}
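The numbered `values` rules above can be sketched as plain TypeScript. This is a hypothetical illustration of the documented mapping, not pup-crawler's actual code; the `El` type is a stand-in for a DOM element so the logic runs without a browser.

```typescript
// Hypothetical sketch (not pup-crawler's real implementation): how a `values`
// entry maps onto DOM queries, per the numbered rules above.
interface ValueOpt {
  label: string
  css?: string | string[]
  attr?: string
  all?: boolean
  allIdx?: number
}

// Minimal stand-in for a DOM element, so the mapping can run without a browser.
interface El {
  textContent: string
  attrs: Record<string, string>
  children: Record<string, El[]> // selector -> matched descendants
}

const query = (el: El, sel: string): El[] => el.children[sel] ?? []

// attr missing -> textContent (rule 1); otherwise getAttribute(attr) (rule 2).
const readAttr = (el: El | undefined, attr?: string): string =>
  el ? (attr ? el.attrs[attr] ?? '' : el.textContent) : ''

function resolve(root: El, opt: ValueOpt): string | string[] {
  const sels = Array.isArray(opt.css) ? opt.css : [opt.css ?? '']
  const matches = query(root, sels[0])
  if (!opt.all) return readAttr(matches[0], opt.attr)  // querySelector (rules 1-2)
  if (opt.allIdx === undefined)
    return matches.map((m) => readAttr(m, opt.attr))   // querySelectorAll (rule 3)
  let el: El | undefined = matches[opt.allIdx]         // [allIdx] (rule 4)
  if (el && sels.length > 1) el = query(el, sels[1])[0] // chained selector (rule 5)
  return readAttr(el, opt.attr)
}
```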

Options

export interface CrawlOptions {
  /** Name */
  name?: string
  /** Crawl mode (default: dynamic) */
  mode?: 'static' | 'dynamic'
  /** URL of the page to crawl */
  url?: string
  /** Timeout (default: 60s) */
  timeout?: number
  /** Delay */
  delayTime?: number
  /** Page-load options (currently used for dynamic pages) */
  pageOpts?: GoToOptions
  /** Crawl target */
  target: Target
  /** Pre-hook */
  before?: () => boolean | Promise<boolean>
  /** Post-hook; returning false aborts the crawl, otherwise the value is passed on for further processing */
  after?: (obj: object) => false | Promise<obj>
  /** Callback */
  callback?: (obj: object) => obj | Promise<obj>
  /** Error handler */
  error?: (err: any) => void
  /** Always runs last */
  finally?: (obj?: object) => void
  /** Self-recursion */
  recursion?: {
    /** Key to loop over */
    loopKey: string
    /** Labels from target.values to collect per iteration, gathered into one object */
    loopVals: string[]
  }
}
export interface IProps {
  /** URL prefix (host) */
  host?: string
  /** Enable logging */
  showLog?: boolean
}
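The `recursion` option can be read as the following control flow. This is a hedged sketch of the documented behavior, not the library's code; `crawlOne` is an assumed stand-in for re-crawling a single item.

```typescript
// Hypothetical sketch of the documented recursion option: for each entry under
// result[loopKey], crawl again and keep only the labels listed in loopVals.
async function recurse(
  result: Record<string, unknown[]>,
  opt: { loopKey: string; loopVals: string[] },
  crawlOne: (item: unknown) => Promise<Record<string, unknown>>, // assumed re-crawl step
): Promise<Record<string, unknown>[]> {
  const out: Record<string, unknown>[] = []
  for (const item of result[opt.loopKey] ?? []) {
    const full = await crawlOne(item)
    const picked: Record<string, unknown> = {}
    for (const k of opt.loopVals) picked[k] = full[k] // keep only the requested labels
    out.push(picked)
  }
  return out
}
```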

API

  1. PupCrawler class: creates a crawler instance and provides methods to control it.
  2. open method: launches the browser and waits for startup to complete.
  3. close method: closes the browser and waits for shutdown to complete.
  4. crawlPage method: crawls a page.
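The hooks in CrawlOptions compose in a fixed order: before gates the crawl, after filters or aborts, callback consumes the result, and finally always runs. The sketch below is a hypothetical reading of that flow, not the library's actual implementation; `crawl` stands in for the page-crawling step.

```typescript
// Hypothetical sketch of the documented hook order (not pup-crawler's code).
interface Hooks {
  before?: () => boolean | Promise<boolean>
  after?: (obj: object) => false | object | Promise<false | object>
  callback?: (obj: object) => object | Promise<object>
  finally?: (obj?: object) => void
}

async function runHooks(
  hooks: Hooks,
  crawl: () => Promise<object>, // assumed stand-in for the actual page crawl
): Promise<object | undefined> {
  let result: object | undefined
  try {
    // before gates the crawl: returning false skips everything else.
    if (hooks.before && !(await hooks.before())) return undefined
    let obj: object | false = await crawl()
    // after can filter the object or abort by returning false.
    if (hooks.after) obj = await hooks.after(obj)
    if (obj === false) return undefined
    // callback receives whatever after let through.
    if (hooks.callback) obj = await hooks.callback(obj)
    result = obj
    return result
  } finally {
    // finally always runs, with the result if one was produced.
    hooks.finally?.(result)
  }
}
```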

⚠ Notes

    1. If you set a callback that logs results but nothing appears in the console, check whether a label name conflicts or a css selector is wrong.

🚀 Changelog

------------------ 2025-05-01 ------------------

  • 1. Code cleanup
  • 2. Added static page crawling (mode=static) to avoid excessive browser requests
  • 3. Removed the console output that was on by default
  • 4. Separated the crawling and parsing logic
  • 5. Adjusted the scroll parameters
  • 6. Improved the after hook

------------------ 2025-05-09 ------------------

  • 1. Removed autoScroll; its behavior moved into target.actives
  • 2. Added actives and finally