alou-fetch-mcp

v1.0.0

Published

5 months ago

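An MCP server for fetching web page content in HTML, Markdown, plain-text, and JSON formats, with fetching tuned for WeChat Official Account articles and academic papers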

mcp model-context-protocol fetch web-scraping html markdown text-extraction wechat wechat-articles academic-papers content-extraction ai-tools

@alou/fetch-mcp

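An MCP (Model Context Protocol) server for fetching web page content, with extraction tuned for WeChat Official Account articles and academic papers.

Features

🌐 Multiple formats: HTML, Markdown, plain text, JSON
📱 WeChat Official Account optimization: content extraction tailored to WeChat Official Account articles
📚 Academic paper support: ArXiv, IEEE, ACM, and other academic platforms
🖼️ Image extraction: automatically detects and extracts images from web pages
🔄 Batch processing: handles multiple URLs in one run
⚡ High performance: built on the Electron net module for better network requests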

Installation

npm install @alou/fetch-mcp

Usage

Use as an MCP server

Add the following to your MCP configuration file:

{
  "mcpServers": {
    "fetch": {
      "command": "npx",
      "args": ["@alou/fetch-mcp"]
    }
  }
}
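If you maintain the configuration programmatically, the entry above can be merged into an existing config object without disturbing other servers. The following is a minimal sketch, assuming the config has the `mcpServers` shape shown above; the `McpConfig` type and the `withFetchServer` helper are hypothetical names used for illustration, and where the file lives depends on your MCP client.

```typescript
// Assumed shape of an MCP client config, based on the JSON shown above.
interface McpServerEntry {
  command: string;
  args: string[];
}

interface McpConfig {
  mcpServers: Record<string, McpServerEntry>;
}

// Hypothetical helper: returns a copy of the config with the "fetch"
// server entry added, leaving any other configured servers untouched.
function withFetchServer(config: McpConfig): McpConfig {
  return {
    ...config,
    mcpServers: {
      ...config.mcpServers,
      fetch: { command: "npx", args: ["@alou/fetch-mcp"] },
    },
  };
}

// Example: a config that already contains another (made-up) server.
const existingConfig: McpConfig = {
  mcpServers: { other: { command: "node", args: ["other-server.js"] } },
};
const mergedConfig = withFetchServer(existingConfig);
```

The helper copies rather than mutates, so the original object can still be compared against the result before writing anything to disk.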

Available tools

1. fetch-html

Fetches the HTML content of a web page.

{
  "name": "fetch-html",
  "arguments": {
    "url": "https://example.com",
    "headers": {} // optional
  }
}

2. fetch-txt

Fetches the plain-text content of a web page (recommended for WeChat Official Account articles).

{
  "name": "fetch-txt",
  "arguments": {
    "url": "https://mp.weixin.qq.com/s/...",
    "headers": {} // optional
  }
}

3. fetch-markdown

Fetches the content of a web page as Markdown.

{
  "name": "fetch-markdown",
  "arguments": {
    "url": "https://example.com",
    "headers": {} // optional
  }
}

4. fetch-json

Fetches JSON data.

{
  "name": "fetch-json",
  "arguments": {
    "url": "https://api.example.com/data",
    "headers": {} // optional
  }
}
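For context, an MCP client invokes each of the four tools above by sending a JSON-RPC 2.0 `tools/call` request whose `params` carry the tool name and arguments. The sketch below builds such a request in TypeScript. The `buildToolCall` helper and its types are illustrative names, not part of this package, and most MCP clients construct this envelope for you.

```typescript
// Names of the four tools documented above.
type FetchToolName = "fetch-html" | "fetch-txt" | "fetch-markdown" | "fetch-json";

// Arguments as shown in the examples above; `headers` is optional.
interface FetchToolArguments {
  url: string;
  headers?: Record<string, string>;
}

interface ToolCallRequest {
  jsonrpc: "2.0";
  id: number;
  method: "tools/call";
  params: { name: FetchToolName; arguments: FetchToolArguments };
}

// Hypothetical helper: wraps a tool name and its arguments in the
// JSON-RPC envelope that MCP uses for tool invocation.
function buildToolCall(
  id: number,
  name: FetchToolName,
  args: FetchToolArguments,
): ToolCallRequest {
  return { jsonrpc: "2.0", id, method: "tools/call", params: { name, arguments: args } };
}

const toolRequest = buildToolCall(1, "fetch-markdown", { url: "https://example.com" });
```

Typing the tool name as a union means a misspelled name such as `"fetch-text"` is rejected at compile time rather than by the server.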

Available prompts

1. wechat-article-extractor

Dedicated to fetching WeChat Official Account articles.

{
  "name": "wechat-article-extractor",
  "arguments": {
    "url": "https://mp.weixin.qq.com/s/...",
    "extract_images": "true", // whether to extract images
    "save_format": "markdown", // output format: markdown, html, text
    "include_metadata": "true" // whether to include metadata
  }
}
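Note that the flags in the example above are the strings `"true"` rather than booleans: MCP prompt arguments are passed as string values in a `prompts/get` request. The sketch below shows one way to build such a request from typed values. The `buildPromptRequest` helper is a hypothetical name for illustration and is not provided by this package.

```typescript
type PromptArgValue = string | number | boolean;

interface GetPromptRequest {
  jsonrpc: "2.0";
  id: number;
  method: "prompts/get";
  params: { name: string; arguments: Record<string, string> };
}

// Hypothetical helper: converts every argument to a string, because
// prompt arguments travel as string values, then wraps them in the
// JSON-RPC envelope for prompts/get.
function buildPromptRequest(
  id: number,
  name: string,
  args: Record<string, PromptArgValue>,
): GetPromptRequest {
  const stringArgs: Record<string, string> = {};
  for (const key of Object.keys(args)) {
    stringArgs[key] = String(args[key]);
  }
  return { jsonrpc: "2.0", id, method: "prompts/get", params: { name, arguments: stringArgs } };
}

const promptRequest = buildPromptRequest(2, "wechat-article-extractor", {
  url: "https://mp.weixin.qq.com/s/...", // elided in the original example
  extract_images: true,
  save_format: "markdown",
  include_metadata: true,
});
```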

2. academic-paper-fetcher

Dedicated to fetching academic papers.

{
  "name": "academic-paper-fetcher",
  "arguments": {
    "paper_url": "https://arxiv.org/abs/...",
    "paper_source": "arxiv", // arxiv, ieee, acm, springer, elsevier, custom
    "extract_metadata": "true", // whether to extract metadata
    "download_pdf": "false" // whether to download the PDF
  }
}

3. content-batch-processor

Batch-processes multiple content sources.

{
  "name": "content-batch-processor",
  "arguments": {
    "urls": "url1,url2,url3", // comma-separated list of URLs
    "content_type": "wechat", // wechat, paper, mixed
    "output_directory": "./output", // output directory
    "naming_convention": "auto" // title, date, url, auto
  }
}
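Because `urls` is a single comma-separated string, it can help to normalize the list before passing it in. The following is a minimal sketch of one way to do that. The `parseUrlList` helper is hypothetical, and the choices to trim whitespace, skip empty entries, drop duplicates, and accept only http(s) URLs are assumptions rather than documented behavior of this package.

```typescript
// Hypothetical helper: splits a comma-separated `urls` argument into a
// clean, de-duplicated list, preserving the original order.
function parseUrlList(urls: string): string[] {
  const seen = new Set<string>();
  const result: string[] = [];
  for (const raw of urls.split(",")) {
    const url = raw.trim();
    if (url === "") continue; // tolerate stray or trailing commas
    if (!/^https?:\/\//i.test(url)) {
      throw new Error(`Not an http(s) URL: ${url}`);
    }
    if (!seen.has(url)) {
      seen.add(url);
      result.push(url);
    }
  }
  return result;
}

// Re-join the cleaned list to produce a value for the `urls` argument.
const cleanedUrls = parseUrlList(" https://a.example/1 ,https://b.example/2,,https://a.example/1");
const urlsArgument = cleanedUrls.join(",");
```

One limitation of any comma-separated format: a URL that itself contains a literal comma would be split incorrectly, so such URLs should be percent-encoded first.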

4. image-extractor

Extracts all images from a web page.

{
  "name": "image-extractor",
  "arguments": {
    "url": "https://example.com",
    "image_types": "all", // all, jpg, png, gif, webp
    "min_size": "100x100", // minimum image size
    "save_path": "./images" // path where images are saved
  }
}
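To illustrate what a `min_size` value such as `"100x100"` could mean in practice, the sketch below parses it and filters a list of images. This is an assumption-laden illustration: it reads the value as WIDTHxHEIGHT in pixels and requires both dimensions to meet the minimum, which the package does not document, and the `ImageInfo` type and both helpers are hypothetical names.

```typescript
// Hypothetical description of an image found on a page.
interface ImageInfo {
  src: string;
  width: number;
  height: number;
}

// Parses a "WIDTHxHEIGHT" string such as "100x100" (assumed format).
function parseMinSize(minSize: string): { width: number; height: number } {
  const match = /^(\d+)x(\d+)$/i.exec(minSize.trim());
  if (match === null) {
    throw new Error(`Expected WIDTHxHEIGHT, got: ${minSize}`);
  }
  return { width: Number(match[1]), height: Number(match[2]) };
}

// Keeps only images whose width AND height both reach the minimum
// (an assumed interpretation of `min_size`).
function filterBySize(images: ImageInfo[], minSize: string): ImageInfo[] {
  const min = parseMinSize(minSize);
  return images.filter((img) => img.width >= min.width && img.height >= min.height);
}

const sampleImages: ImageInfo[] = [
  { src: "https://example.com/icon.png", width: 16, height: 16 },
  { src: "https://example.com/banner.jpg", width: 1200, height: 80 },
  { src: "https://example.com/photo.webp", width: 800, height: 600 },
];
const largeImages = filterBySize(sampleImages, "100x100");
```

With this interpretation, small icons and thin banners are excluded, and only the 800x600 photo remains.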

Use cases

1. Fetching WeChat Official Account articles

# Fetch a WeChat Official Account article with the fetch-txt tool
npx @alou/fetch-mcp fetch-txt --url "https://mp.weixin.qq.com/s/..."

2. Extracting academic paper information

# Use the academic-paper-fetcher prompt
npx @alou/fetch-mcp academic-paper-fetcher --paper_url "https://arxiv.org/abs/..."

3. Batch content processing

# Batch-process multiple WeChat Official Account articles
npx @alou/fetch-mcp content-batch-processor --urls "url1,url2,url3" --content_type "wechat"

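Technical details

Electron compatibility: automatically uses electron.net.fetch when running in Electron, for better network performance
Error handling: thorough error handling with a retry mechanism
Type safety: type validation with TypeScript and Zod
Modular design: clear code structure that is easy to extend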
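The retry mechanism listed above is not described in detail, so the following is only a sketch of a common approach, retrying with exponential backoff, and not necessarily what this package does internally. The `backoffDelayMs` and `withRetry` names, the default of 3 attempts, and the 200 ms base delay are all assumptions chosen for illustration.

```typescript
// Delay before the retry that follows failed attempt `attempt` (0-based):
// baseMs * 2^attempt, capped at maxMs.
function backoffDelayMs(attempt: number, baseMs = 200, maxMs = 5000): number {
  return Math.min(maxMs, baseMs * 2 ** attempt);
}

// Runs `operation` up to `maxAttempts` times, sleeping between failures.
// `sleep` is injectable so that callers (and tests) can avoid real delays.
async function withRetry<T>(
  operation: () => Promise<T>,
  maxAttempts = 3,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => setTimeout(resolve, ms)),
): Promise<T> {
  let lastError: unknown;
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    try {
      return await operation();
    } catch (error) {
      lastError = error;
      // Do not sleep after the final failed attempt.
      if (attempt < maxAttempts - 1) {
        await sleep(backoffDelayMs(attempt));
      }
    }
  }
  throw lastError;
}
```

A caller would wrap a network request in it, for example `withRetry(() => fetch(url).then((r) => r.text()))`. In practice it is usually worth retrying only on transient failures such as timeouts or 5xx responses, not on every error.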

Development

# Install dependencies
npm install

# Development mode
npm run dev

# Build
npm run build

# Start
npm start

License

MIT License

Contributing

Issues and pull requests are welcome!

Changelog

v1.0.0

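Initial release
Supports fetching HTML, Markdown, plain text, and JSON
Optimized fetching of WeChat Official Account articles and academic papers
Supports batch processing and image extraction
Provides a set of prompt templates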
