
@raphaellcs/web-scraper

v2.0.0

Published

Web scraping tool - proxy support, data transformation, batch processing


@raphaellcs/web-scraper


Web scraping tool - simple web page data extraction

🚀 Features

  • CSS selectors: pick elements with familiar CSS syntax
  • Attribute extraction: pull attributes from links, images, and other elements
  • Batch scraping: scrape multiple pages in one run
  • Multiple formats: JSON, CSV, and TXT output
  • Delay control: throttle requests to avoid getting blocked
  • Custom request headers: mimic browser requests
  • Proxy support: HTTP/HTTPS proxies (new)
  • User-Agent rotation: automatic rotation across 9 common browser UAs (new)
  • Proxy pool: automatic proxy selection with retry on failure (new)
  • Data transformation: type conversion, validation, filtering, aggregation (new)
  • Deduplication: by field or by whole object (new)

📦 Installation

npx @raphaellcs/web-scraper

📖 Quick Start

1. Command-line scraping

web-scraper scrape https://example.com -s "h1,p,a" -a "href"

2. Config file

Generate a config:

web-scraper init

Edit web-scraper.json:

{
  "url": "https://example.com",
  "selectors": ["h1", "h2", "p", "a"],
  "attributes": ["href", "src", "alt"],
  "options": {
    "saveHtml": true,
    "delay": 1000
  }
}

Run:

web-scraper run

📋 Configuration

Options

| Option | Description | Default |
|------|------|--------|
| -s, --selectors | CSS selectors (comma-separated) | h1,h2,h3,p,a,img |
| -a, --attributes | Attributes to extract (comma-separated) | - |
| -H, --headers | Request headers (JSON) | - |
| -t, --timeout | Timeout (milliseconds) | 30000 |
| -l, --limit | Limit the number of results shown | - |
| --html | Save HTML content | false |
| -o, --output | Save to file | - |
| --format | Output format (json/csv/txt) | json |
| -p, --proxy | Use a proxy (format: protocol://host:port) | - |
| -u, --user-agent | Custom User-Agent | - |
| --rotate-user-agent | Rotate User-Agents | false |

Transform command

| Parameter | Description |
|------|------|
| <input> | Input file (JSON) |
| -o, --output | Save to file |
| --rules <json> | Transformation rules (JSON) |
| --filter <json> | Filter conditions (JSON) |
| --dedup <field> | Field to deduplicate on |
| --aggregate <json> | Aggregation rules (JSON) |

Proxy test command

| Parameter | Description |
|------|------|
| <proxy> | Proxy address (format: protocol://host:port) |
| -u, --url | Test URL (default: https://httpbin.org/ip) |

🎯 Use Cases

1. Scrape news headlines

web-scraper scrape https://news.example.com -s "h2.title"

2. Extract all links

web-scraper scrape https://example.com -s "a" -a "href"

3. Scrape product info

web-scraper scrape https://shop.example.com \
  -s "div.product" \
  -a "data-id,data-price"

4. Batch scraping

{
  "urls": [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com"
  ],
  "selectors": ["h1", "p"],
  "options": {
    "delay": 2000
  }
}
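
A batch run like the one configured above boils down to "fetch each URL, waiting `delay` milliseconds between requests". This is a rough Python sketch of that loop, not the package's actual code; `fetch` is a stub standing in for the real HTTP request.

```python
import time

# Hypothetical sketch of a batch-scrape loop honoring options.delay.
config = {
    "urls": ["https://site1.com", "https://site2.com", "https://site3.com"],
    "selectors": ["h1", "p"],
    "options": {"delay": 2000},
}

def fetch(url):
    # Stub; a real implementation would perform an HTTP GET here.
    return f"<html>fetched {url}</html>"

def batch_scrape(config, fetch=fetch, sleep=time.sleep):
    results = []
    delay_ms = config.get("options", {}).get("delay", 0)
    for i, url in enumerate(config["urls"]):
        if i > 0 and delay_ms:
            sleep(delay_ms / 1000)  # throttle between consecutive requests
        results.append({"url": url, "html": fetch(url)})
    return results

pages = batch_scrape(config, sleep=lambda s: None)  # skip real sleeping in this demo
print(len(pages))  # → 3
```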

5. Save as CSV

web-scraper scrape https://example.com -s "a" -a "href" \
  -o results.csv --format csv

💡 Advanced Features

Proxy support

Scrape through a proxy:

web-scraper scrape https://example.com -p "http://127.0.0.1:8080"

Test a proxy connection:

web-scraper test-proxy "http://127.0.0.1:8080"

User-Agent rotation

Rotate User-Agents automatically:

web-scraper scrape https://example.com --rotate-user-agent

Data transformation

Type conversion:

web-scraper transform data.json --rules '{"price":{"type":"number"},"text":{"maxLength":50}}'

Regex replacement:

web-scraper transform data.json --rules '{"price":{"regex":{"pattern":"\\$","replacement":""}}}'

Filtering:

web-scraper transform data.json --filter '{"price":{"gte":100}}'

Deduplication:

web-scraper transform data.json --dedup "title"

Aggregation:

web-scraper transform data.json --aggregate '{"price":"sum","count":"count"}'
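
The exact semantics of these rules aren't documented beyond the examples, but the filter/dedup/aggregate steps plausibly behave like this Python sketch (illustrative only, not the package's code):

```python
# Sample rows, as they might appear in data.json.
sample = [
    {"title": "A", "price": 120},
    {"title": "A", "price": 120},   # duplicate title
    {"title": "B", "price": 80},
    {"title": "C", "price": 300},
]

def apply_filter(rows, cond):
    # {"price": {"gte": 100}} keeps rows whose price >= 100.
    def ok(row):
        return all(row.get(f, 0) >= c["gte"] for f, c in cond.items() if "gte" in c)
    return [r for r in rows if ok(r)]

def dedup(rows, field):
    # Keep the first row seen for each distinct value of `field`.
    seen, out = set(), []
    for r in rows:
        if r[field] not in seen:
            seen.add(r[field])
            out.append(r)
    return out

def aggregate(rows, rules):
    # Supports "sum" and "count", mirroring the --aggregate example above.
    out = {}
    for field, op in rules.items():
        if op == "sum":
            out[field] = sum(r.get(field, 0) for r in rows)
        elif op == "count":
            out[field] = len(rows)
    return out

rows = dedup(apply_filter(sample, {"price": {"gte": 100}}), "title")
print(aggregate(rows, {"price": "sum", "count": "count"}))  # → {'price': 420, 'count': 2}
```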

Custom request headers

web-scraper scrape https://example.com \
  -H '{"User-Agent":"Mozilla/5.0"}'

Or set headers in the config file:
{
  "url": "https://example.com",
  "headers": {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html"
  }
}

Batch scrape configuration

{
  "urls": [
    "https://site1.com",
    "https://site2.com"
  ],
  "selectors": ["h1", "p"],
  "attributes": ["href"],
  "options": {
    "delay": 2000,
    "timeout": 60000,
    "saveHtml": false
  }
}

Output formats

JSON

[
  {
    "selector": "h1",
    "index": 1,
    "text": "Example Title",
    "html": "<h1>Example Title</h1>"
  }
]
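
Results in this JSON shape are easy to post-process with any JSON tooling. A small Python sketch, with the sample array inlined (in practice you would read the file saved via `-o results.json`):

```python
import json

# Inline copy of the JSON output shown above; in practice:
# results = json.load(open("results.json"))
raw = """[
  {"selector": "h1", "index": 1, "text": "Example Title",
   "html": "<h1>Example Title</h1>"}
]"""

results = json.loads(raw)
texts = [item["text"] for item in results if item["selector"] == "h1"]
print(texts)  # → ['Example Title']
```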

CSV

selector,index,text,href
h1,1,Example Title,https://example.com
a,1,Link 1,https://example.com/page1

📊 Sample Output

🌐 Scraping: https://example.com

📊 Found 5 elements:

[1] h1
  Example Domain

[2] p
  This domain is for use in illustrative examples...

[3] a
  More information...
  Attributes: href="https://www.iana.org/domains/example"

[4] a
  Example link 1
  Attributes: href="/page1"

[5] a
  Example link 2
  Attributes: href="/page2"

⚠️ Notes

1. Respect robots.txt

Check the site's robots.txt file and follow its crawling rules.
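
Python's standard library can check a URL against robots.txt rules before you scrape it. A minimal sketch with made-up sample rules (in practice you would fetch the site's real https://example.com/robots.txt):

```python
from urllib import robotparser

# Parse sample robots.txt rules (inlined here for illustration).
rp = robotparser.RobotFileParser()
rp.parse("""
User-agent: *
Disallow: /private/
""".splitlines())

print(rp.can_fetch("*", "https://example.com/private/page"))  # → False
print(rp.can_fetch("*", "https://example.com/public/page"))   # → True
```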

2. Throttle requests

Use the delay option to avoid sending requests too quickly.

web-scraper run  # with "delay": 2000 in the config

3. Set a User-Agent

Some sites reject requests that look like crawlers; set a User-Agent.

web-scraper scrape https://example.com \
  -H '{"User-Agent":"Mozilla/5.0"}'

4. Legal compliance

Before scraping, make sure you comply with the site's terms of service and applicable laws.

🔧 FAQ

Scraping fails

Check that:

  • the URL is correct
  • the site is reachable
  • request headers are set if the site requires them
  • the request isn't timing out (increase timeout)

No results

Check that:

  • the CSS selector is correct
  • the page isn't loaded dynamically (this tool does not execute JavaScript)
  • the element actually exists

Getting blocked

  • increase the delay option
  • set a User-Agent
  • change IP (e.g. use a proxy)
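
When a site rate-limits you, retrying with exponential backoff is the usual complement to a fixed delay. A generic Python sketch (not part of this package); `flaky` stands in for a request that fails twice before succeeding:

```python
import time

# Retry a request with exponential backoff: wait 1s, 2s, 4s, ... between attempts.
def with_backoff(do_request, retries=3, base_delay=1.0, sleep=time.sleep):
    for attempt in range(retries):
        try:
            return do_request()
        except RuntimeError:                  # stand-in for an HTTP 429 / ban error
            if attempt == retries - 1:
                raise                         # out of retries: propagate the error
            sleep(base_delay * 2 ** attempt)  # back off before the next attempt

calls = {"n": 0}
def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("rate limited")
    return "ok"

print(with_backoff(flaky, sleep=lambda s: None))  # → ok
```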

🚧 Roadmap

  • [ ] Dynamic page support (Puppeteer/Playwright)
  • [ ] Form submission
  • [ ] Login / authentication
  • [ ] Incremental scraping
  • [ ] More export formats (Excel, SQL)

🤝 Contributing

Issues and PRs welcome!

📄 License

MIT © 梦心


Made with 🌙 by 梦心