@raphaellcs/web-scraper

v2.0.0 (published 2 months ago)

Web scraping tool - proxy support, data transformation, batch processing

Maintainer: raphaellcs

Keywords: scraper, web, data, extraction, crawling

Web scraping tool - simple data extraction from web pages

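🚀 Features

CSS selectors: select elements with familiar CSS syntax
Attribute extraction: extract attributes from links, images, and other elements
Batch scraping: scrape multiple pages in one run
Multiple formats: JSON, CSV, and TXT output
Delay control: space out requests to avoid being blocked
Custom request headers: imitate browser requests
Proxy support: HTTP and HTTPS proxies (new)
User-Agent rotation: rotates automatically among 9 common browser User-Agents (new)
Proxy pool: picks a proxy automatically and retries on failure (new)
Data transformation: type conversion, validation, filtering, aggregation (new)
Deduplication: by field or by whole object (new)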

📦 Installation

npx @raphaellcs/web-scraper

📖 Quick Start

1. Scrape from the command line

web-scraper scrape https://example.com -s "h1,p,a" -a "href"

2. Use a configuration file

Generate a configuration file:

web-scraper init

Edit web-scraper.json:

{
  "url": "https://example.com",
  "selectors": ["h1", "h2", "p", "a"],
  "attributes": ["href", "src", "alt"],
  "options": {
    "saveHtml": true,
    "delay": 1000
  }
}

Run it:

web-scraper run
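
The configuration fields used throughout this README can be written down as a type, which makes it easier to catch mistakes before running the tool. The sketch below is inferred from the examples here only; `ScraperConfig`, `ScraperOptions`, and `checkConfig` are hypothetical names, and the real schema may accept more fields or behave differently.

```typescript
// Shape inferred from the examples in this README; the real schema may differ.
interface ScraperOptions {
  saveHtml?: boolean;
  delay?: number; // milliseconds between requests
  timeout?: number; // milliseconds
}

interface ScraperConfig {
  url?: string;
  urls?: string[];
  selectors: string[];
  attributes?: string[];
  headers?: Record<string, string>;
  options?: ScraperOptions;
}

// Returns a list of problems; an empty list means the config looks usable.
function checkConfig(config: ScraperConfig): string[] {
  const problems: string[] = [];
  const targets = [...(config.url ? [config.url] : []), ...(config.urls ?? [])];
  if (targets.length === 0) problems.push("set either url or urls");
  for (const target of targets) {
    if (!/^https?:\/\//.test(target)) problems.push(`not an http(s) URL: ${target}`);
  }
  if (config.selectors.length === 0) problems.push("selectors is empty");
  const delay = config.options?.delay;
  if (delay !== undefined && delay < 0) problems.push("delay must not be negative");
  return problems;
}

const example: ScraperConfig = {
  url: "https://example.com",
  selectors: ["h1", "h2", "p", "a"],
  attributes: ["href", "src", "alt"],
  options: { saveHtml: true, delay: 1000 },
};
console.log(checkConfig(example)); // prints an empty array
```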

📋 Configuration Reference

Options

| Option | Description | Default |
|------|------|--------|
| -s, --selectors | CSS selectors (comma-separated) | h1,h2,h3,p,a,img |
| -a, --attributes | Attributes to extract (comma-separated) | - |
| -H, --headers | Request headers (JSON) | - |
| -t, --timeout | Timeout (milliseconds) | 30000 |
| -l, --limit | Limit the number of results displayed | - |
| --html | Save the HTML content | false |
| -o, --output | Save to a file | - |
| --format | Output format (json/csv/txt) | json |
| -p, --proxy | Use a proxy (format: protocol://host:port) | - |
| -u, --user-agent | Custom User-Agent | - |
| --rotate-user-agent | Rotate the User-Agent | false |

Data transformation command (transform)

| Argument | Description |
|------|------|
| <input> | Input file (JSON) |
| -o, --output | Save to a file |
| --rules <json> | Transformation rules (JSON) |
| --filter <json> | Filter conditions (JSON) |
| --dedup <field> | Field to deduplicate on |
| --aggregate <json> | Aggregation rules (JSON) |

Proxy test command (test-proxy)

| Argument | Description |
|------|------|
| <proxy> | Proxy address (format: protocol://host:port) |
| -u, --url | Test URL (default: https://httpbin.org/ip) |

🎯 Use Cases

1. Scrape news headlines

web-scraper scrape https://news.example.com -s "h2.title"

2. Extract all links

web-scraper scrape https://example.com -s "a" -a "href"

3. Scrape product information

web-scraper scrape https://shop.example.com \
  -s "div.product" \
  -a "data-id,data-price"

4. Batch scraping

{
  "urls": [
    "https://site1.com",
    "https://site2.com",
    "https://site3.com"
  ],
  "selectors": ["h1", "p"],
  "options": {
    "delay": 2000
  }
}
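
The `delay` value presumably spaces out requests across the listed URLs. As a minimal sketch of that pacing, assuming pages are fetched one at a time: `fetchSequentially` and the injected `fetchPage` function are hypothetical and do not describe how web-scraper performs requests internally.

```typescript
// Sketch only: fetchPage is an injected, hypothetical function so the pacing
// logic can be shown without assuming how web-scraper performs requests.
type FetchPage = (url: string) => Promise<string>;

const sleep = (ms: number): Promise<void> =>
  new Promise<void>((resolve) => setTimeout(resolve, ms));

// Fetches the URLs in order, waiting delayMs between consecutive requests.
async function fetchSequentially(
  urls: string[],
  delayMs: number,
  fetchPage: FetchPage,
): Promise<Map<string, string>> {
  const pages = new Map<string, string>();
  for (let i = 0; i < urls.length; i++) {
    if (i > 0) await sleep(delayMs); // pause between requests, not before the first
    const url = urls[i];
    if (url === undefined) continue;
    pages.set(url, await fetchPage(url));
  }
  return pages;
}
```

With three URLs and a delay of 2000, this approach spends at least four seconds waiting in total, because the pause applies only between requests.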

5. Save as CSV

web-scraper scrape https://example.com -s "a" -a "href" \
  -o results.csv --format csv

💡 Advanced Features

Proxy support

Scrape through a proxy:

web-scraper scrape https://example.com -p "http://127.0.0.1:8080"

Test a proxy connection:

web-scraper test-proxy "http://127.0.0.1:8080"
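
Both commands expect the proxy address in the form protocol://host:port. A small check like the following can catch a malformed address before it reaches the CLI. It is a sketch under assumptions: `parseProxy` is a hypothetical helper that is not part of web-scraper, it accepts only http and https, and it does not handle IPv6 hosts or credentials.

```typescript
// Hypothetical helper, not part of web-scraper: validates an address of the
// form protocol://host:port before it is passed to -p or test-proxy.
interface ProxyAddress {
  protocol: "http" | "https";
  host: string;
  port: number;
}

function parseProxy(address: string): ProxyAddress | null {
  const match = /^(https?):\/\/([^\s:\/]+):(\d{1,5})$/.exec(address);
  if (match === null) return null;
  const protocol = match[1] === "https" ? "https" : "http";
  const host = match[2] ?? "";
  const port = Number(match[3]);
  // Reject ports outside the valid TCP range.
  if (host === "" || !(port >= 1 && port <= 65535)) return null;
  return { protocol, host, port };
}

console.log(parseProxy("http://127.0.0.1:8080"));
console.log(parseProxy("127.0.0.1:8080")); // prints null because the protocol is missing
```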

User-Agent rotation

Rotate the User-Agent automatically:

web-scraper scrape https://example.com --rotate-user-agent
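
The README does not say how the rotation picks the next User-Agent. One common approach is round-robin, sketched below. `UserAgentRotator` is a hypothetical class, and the strings are placeholders rather than the nine User-Agents bundled with the tool.

```typescript
// Round-robin rotation sketch. The tool itself may rotate differently
// (for example, at random).
class UserAgentRotator {
  private readonly agents: string[];
  private next = 0;

  constructor(agents: string[]) {
    if (agents.length === 0) throw new Error("at least one User-Agent is required");
    this.agents = [...agents];
  }

  // Returns the next User-Agent, wrapping around at the end of the list.
  pick(): string {
    const agent = this.agents[this.next % this.agents.length] ?? "";
    this.next += 1;
    return agent;
  }
}

// Placeholder values, not real browser User-Agent strings.
const rotator = new UserAgentRotator(["ua-a", "ua-b"]);
console.log(rotator.pick(), rotator.pick(), rotator.pick()); // prints ua-a ua-b ua-a
```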

Data transformation

Type conversion:

web-scraper transform data.json --rules '{"price":{"type":"number"},"text":{"maxLength":50}}'
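
To make the rule format concrete, here is a sketch of what applying these two rules to an array of records could look like. It covers only the `type: "number"` and `maxLength` keys shown in the command; `applyRules` is a hypothetical function, and the tool's actual semantics (for example, how it treats values that cannot be converted) are assumptions.

```typescript
type Row = Record<string, unknown>;

// Only the two rule keys that appear in this README are modelled.
interface FieldRule {
  type?: "number";
  maxLength?: number;
}

function applyRules(rows: Row[], rules: Record<string, FieldRule>): Row[] {
  return rows.map((row) => {
    const out: Row = { ...row };
    for (const field of Object.keys(rules)) {
      const rule = rules[field];
      if (rule === undefined || !(field in out)) continue;
      let value = out[field];
      // Assumption: conversion uses Number(), so "abc" becomes NaN.
      if (rule.type === "number") value = Number(value);
      // Assumption: maxLength truncates strings and ignores other types.
      if (typeof value === "string" && rule.maxLength !== undefined) {
        value = value.slice(0, rule.maxLength);
      }
      out[field] = value;
    }
    return out;
  });
}

console.log(
  applyRules([{ price: "19.99", text: "abcdef" }], {
    price: { type: "number" },
    text: { maxLength: 3 },
  }),
);
```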

Regex replacement:

web-scraper transform data.json --rules '{"price":{"regex":{"pattern":"\\$","replacement":""}}}'
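
Inside the single-quoted shell argument, the JSON string `"\\$"` decodes to the two characters `\$`, which as a regular expression matches a literal dollar sign. The sketch below reproduces that replacement; `stripPattern` is a hypothetical name, and replacing every match rather than only the first is an assumption about the tool.

```typescript
// Replaces every match of `pattern` in `value`.
// Assumption: the global flag is used, so all occurrences are replaced.
function stripPattern(value: string, pattern: string, replacement: string): string {
  return value.replace(new RegExp(pattern, "g"), replacement);
}

// "\\$" in TypeScript source is the same two-character pattern \$ as in the JSON above.
console.log(stripPattern("$1,299", "\\$", "")); // prints 1,299
```

Note that the thousands separator survives this rule, so a value like `1,299` would still need a second replacement before it can be converted to a number.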

Filtering:

web-scraper transform data.json --filter '{"price":{"gte":100}}'
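
Read literally, this filter keeps records whose `price` is greater than or equal to 100. A sketch of that logic follows. Only the `gte` operator from the example is modelled, `filterRows` is a hypothetical function, and dropping records whose value is not a number is an assumption.

```typescript
type Row = Record<string, unknown>;

// Only the gte operator shown in this README is modelled.
type Condition = { gte?: number };

function filterRows(rows: Row[], filter: Record<string, Condition>): Row[] {
  return rows.filter((row) =>
    Object.keys(filter).every((field) => {
      const min = filter[field]?.gte;
      if (min === undefined) return true;
      const value = row[field];
      // Assumption: non-numeric values never satisfy a numeric condition.
      return typeof value === "number" && value >= min;
    }),
  );
}

const kept = filterRows(
  [{ price: 50 }, { price: 100 }, { price: 150 }, { price: "n/a" }],
  { price: { gte: 100 } },
);
console.log(kept.length); // prints 2
```

If prices were scraped as text, they would need the type conversion rule first, since this sketch compares numbers only.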

Deduplication:

web-scraper transform data.json --dedup "title"
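
Deduplicating on a field means keeping one record per distinct value of that field. The sketch below keeps the first record it sees for each value; whether the tool keeps the first or the last one is not stated, so treat that choice, and the name `dedupeBy`, as assumptions.

```typescript
type Row = Record<string, unknown>;

// Keeps the first row seen for each distinct value of `field`.
function dedupeBy(rows: Row[], field: string): Row[] {
  const seen = new Set<unknown>();
  const unique: Row[] = [];
  for (const row of rows) {
    const key = row[field];
    if (seen.has(key)) continue;
    seen.add(key);
    unique.push(row);
  }
  return unique;
}

const uniqueRows = dedupeBy(
  [
    { title: "A", price: 1 },
    { title: "B", price: 2 },
    { title: "A", price: 3 },
  ],
  "title",
);
console.log(uniqueRows.length); // prints 2
```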

Aggregation:

web-scraper transform data.json --aggregate '{"price":"sum","count":"count"}'
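
One plausible reading of this rule is: add up the `price` field and report the number of records under the key `count`. The sketch below implements that reading for the two operations shown. `aggregate` is a hypothetical function, and the tool may support further operations or interpret the keys differently.

```typescript
type Row = Record<string, unknown>;
type Operation = "sum" | "count";

function aggregate(rows: Row[], spec: Record<string, Operation>): Record<string, number> {
  const result: Record<string, number> = {};
  for (const key of Object.keys(spec)) {
    const operation = spec[key];
    if (operation === "count") {
      // Assumption: count reports the number of rows, whatever the key is called.
      result[key] = rows.length;
    } else if (operation === "sum") {
      let total = 0;
      for (const row of rows) {
        const value = row[key];
        if (typeof value === "number") total += value; // non-numeric values are skipped
      }
      result[key] = total;
    }
  }
  return result;
}

console.log(aggregate([{ price: 100 }, { price: 250 }], { price: "sum", count: "count" }));
```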

Custom request headers

web-scraper scrape https://example.com \
  -H '{"User-Agent":"Mozilla/5.0"}'

{
  "url": "https://example.com",
  "headers": {
    "User-Agent": "Mozilla/5.0",
    "Accept": "text/html"
  }
}

Batch scraping configuration

{
  "urls": [
    "https://site1.com",
    "https://site2.com"
  ],
  "selectors": ["h1", "p"],
  "attributes": ["href"],
  "options": {
    "delay": 2000,
    "timeout": 60000,
    "saveHtml": false
  }
}

Output formats

JSON

[
  {
    "selector": "h1",
    "index": 1,
    "text": "Example Title",
    "html": "<h1>Example Title</h1>"
  }
]

CSV

selector,index,text,href
h1,1,Example Title,https://example.com
a,1,Link 1,https://example.com/page1
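
The two formats carry the same records, so one can be derived from the other. The sketch below turns records shaped like the JSON example into the CSV layout shown above. `ScrapedItem`, `csvField`, and `toCsv` are hypothetical names, the column set is fixed to the one in the example, and the quoting follows the usual RFC 4180 convention rather than anything confirmed about the tool.

```typescript
// Record shape taken from the JSON and CSV examples in this README.
interface ScrapedItem {
  selector: string;
  index: number;
  text: string;
  href?: string;
}

// Quotes a field when it contains a comma, a double quote, or a line break.
function csvField(value: string): string {
  return /[",\r\n]/.test(value) ? `"${value.replace(/"/g, '""')}"` : value;
}

function toCsv(items: ScrapedItem[]): string {
  const header = "selector,index,text,href";
  const lines = items.map((item) =>
    [item.selector, String(item.index), item.text, item.href ?? ""].map(csvField).join(","),
  );
  return [header, ...lines].join("\n");
}

console.log(
  toCsv([
    { selector: "h1", index: 1, text: "Example Title", href: "https://example.com" },
    { selector: "a", index: 1, text: "Link 1", href: "https://example.com/page1" },
  ]),
);
```

Quoting matters in practice because scraped text often contains commas, which would otherwise shift the columns.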

📊 Sample Output

🌐 抓取: https://example.com

📊 找到 5 个元素:

[1] h1
  Example Domain

[2] p
  This domain is for use in illustrative examples...

[3] a
  More information...
  属性: href="https://www.iana.org/domains/example"

[4] a
  Example link 1
  属性: href="/page1"

[5] a
  Example link 2
  属性: href="/page2"
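
As the sample shows, `href` values are reported as they appear in the page, so some are relative (`/page1`). If absolute URLs are needed, they can be resolved against the page URL with the standard `URL` class. `resolveHref` is a hypothetical helper for post-processing, not something the tool provides.

```typescript
// Resolves a possibly relative href against the URL of the page it came from.
function resolveHref(href: string, pageUrl: string): string {
  return new URL(href, pageUrl).href;
}

console.log(resolveHref("/page1", "https://example.com")); // prints https://example.com/page1
```

Absolute links such as `https://www.iana.org/domains/example` pass through unchanged.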

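⚠️ Notes

1. Respect robots.txt

Check the site's robots.txt file and follow its crawling rules.

2. Throttle requests

Use the delay option so that requests are not sent too quickly.

web-scraper run  # with delay: 2000 set in the configuration file

3. Set a User-Agent

Some sites reject requests that look like they come from a crawler, so set a User-Agent header.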

web-scraper scrape https://example.com \
  -H '{"User-Agent":"Mozilla/5.0"}'

4. Legal compliance

Before scraping, make sure you comply with the site's terms of use and with applicable laws and regulations.

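🔧 FAQ

Scraping fails

Check whether:

The URL is correct
The site is reachable
Request headers need to be set
The request timed out (increase timeout)

No results

Check whether:

The CSS selectors are correct
The page is loaded dynamically (this tool does not execute JavaScript)
The elements actually exist

Blocked by the site

Increase the delay between requests with delay
Set a User-Agent
Change your IP address (for example, by using a proxy)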

🚧 Roadmap

[ ] Dynamic pages (Puppeteer/Playwright)
[ ] Form submission
[ ] Login and authentication
[x] Proxy support
[ ] Incremental scraping
[ ] More export formats (Excel, SQL)

🤝 Contributing

Issues and PRs are welcome!

📄 License

MIT © 梦心

Made with 🌙 by 梦心
