
@mcptoolshop/repo-crawler-mcp

v1.3.1

MCP server that crawls GitHub repos and extracts structured data for AI agents

Why?

AI agents that work with code need to understand repositories. Not just the files, but the full context: who contributes, where the problems are, which dependencies are vulnerable, and how active the project is. Scraping this by hand burns API quota and context window.

Repo Crawler MCP exposes GitHub's data surface as structured MCP tools. A single crawl_repo call with tier: '3' returns metadata, file tree, languages, README, commits, contributors, branches, tags, releases, community health, CI workflows, issues, pull requests, traffic, milestones, Dependabot alerts, security advisories, SBOM, code scanning alerts, and secret scanning alerts. Every section is optional, rate-limited, and degrades gracefully.

Features

  • 5 MCP tools: crawl a repo, crawl an org, summarize, compare, export
  • Three-tier data model: start light, drill deeper as needed
  • Per-section fetching: only the APIs you request are called, saving quota
  • Graceful degradation: a 403 on Dependabot doesn't stop the crawl; permissions are tracked per section
  • Built-in rate limiting: Octokit throttling with automatic retry on 429
  • Safe exports: CSV with formula injection prevention, Markdown with pipe escaping
  • Adapter pattern: GitHub first, extensible to GitLab/Bitbucket
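The CSV hardening mentioned above is a standard pattern: neutralize cells that spreadsheets would interpret as formulas, and quote cells containing delimiters. A minimal sketch (the function name is illustrative, not the package's actual export):

```typescript
// Sketch of CSV cell hardening. Excel/Sheets treat leading =, +, -, @
// as formula starts (e.g. =HYPERLINK(...)), so risky cells get a
// single-quote prefix; cells with delimiters get standard CSV quoting.
// Illustrative only, not the package's actual implementation.
function escapeCsvCell(value: string): string {
  let cell = value;
  // Formula injection prevention
  if (/^[=+\-@\t\r]/.test(cell)) {
    cell = "'" + cell;
  }
  // Standard CSV quoting: wrap and double embedded quotes
  if (/[",\n\r]/.test(cell)) {
    cell = '"' + cell.replace(/"/g, '""') + '"';
  }
  return cell;
}
```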

Quick Start

With Claude Code

{
  "mcpServers": {
    "repo-crawler": {
      "command": "npx",
      "args": ["-y", "@mcptoolshop/repo-crawler-mcp"],
      "env": {
        "GITHUB_TOKEN": "ghp_your_token_here"
      }
    }
  }
}

With Claude Desktop

Add the same configuration to your claude_desktop_config.json.

Configuration

| Variable | Required | Description |
| -------- | -------- | ----------- |
| GITHUB_TOKEN | Recommended | GitHub personal access token. Without it: 60 requests/hour. With it: 5,000 requests/hour. |

Token scopes by tier:

| Tier | Required scopes |
| ---- | --------------- |
| Tier 1 | public_repo (or repo for private repositories) |
| Tier 2 | Same as above + push/admin access for traffic data |
| Tier 3 | Same as above + security_events for Dependabot, code scanning, and secret scanning |

Tools

crawl_repo

The primary tool. Crawls a single repository at any data tier.

crawl_repo({ owner: "facebook", repo: "react", tier: "2" })

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| owner | string | — | Repository owner |
| repo | string | — | Repository name |
| tier | '1' \| '2' \| '3' | '1' | Data tier |
| sections | string[] | all | Specific sections to include |
| exclude_sections | string[] | none | Sections to skip |
| commit_limit | number | 30 | Max commits (Tier 1) |
| contributor_limit | number | 30 | Max contributors (Tier 1) |
| issue_limit | number | 100 | Max issues (Tier 2) |
| pr_limit | number | 100 | Max pull requests (Tier 2) |
| issue_state | 'open' \| 'closed' \| 'all' | 'all' | Issue/PR filter (Tier 2) |
| alert_limit | number | 100 | Max security alerts (Tier 3) |

crawl_org

Crawls all repositories in an organization, with filters.

crawl_org({ org: "vercel", tier: "1", min_stars: 100, language: "TypeScript" })

| Parameter | Type | Default | Description |
| --------- | ---- | ------- | ----------- |
| org | string | — | Organization name |
| tier | '1' \| '2' \| '3' | '1' | Data tier per repo |
| min_stars | number | 0 | Minimum star count |
| language | string | any | Filter by primary language |
| include_forks | boolean | false | Include forked repos |
| include_archived | boolean | false | Include archived repos |
| repo_limit | number | 30 | Max repos to crawl |
| alert_limit | number | 30 | Max security alerts per repo (Tier 3) |

get_repo_summary

Quick human-readable summary. Only 4 API calls — ideal for triage.

get_repo_summary({ owner: "anthropics", repo: "claude-code" })

compare_repos

Side-by-side comparison of 2–5 repos. Stars, languages, activity, community health, size.

compare_repos({ repos: [
  { owner: "vitejs", repo: "vite" },
  { owner: "webpack", repo: "webpack" }
]})

export_data

Export crawl results as JSON, CSV, or Markdown. CSV includes formula injection prevention.

export_data({ data: crawlResult, format: "markdown", sections: ["metadata", "issues"] })

Data Tiers

Tier 1 — Repository Fundamentals

Everything you need to understand a repo at a glance.

| Section | API Endpoint | Calls |
| ------- | ------------ | ----- |
| metadata | GET /repos/{owner}/{repo} | 1 |
| tree | GET /repos/.../git/trees/{sha}?recursive=1 | 1 |
| languages | GET /repos/.../languages | 1 |
| readme | GET /repos/.../readme | 1 |
| commits | GET /repos/.../commits | 1+ |
| contributors | GET /repos/.../contributors | 1+ |
| branches | GET /repos/.../branches | 1+ |
| tags | GET /repos/.../tags | 1+ |
| releases | GET /repos/.../releases | 1+ |
| community | GET /repos/.../community/profile | 1 |
| workflows | GET /repos/.../actions/workflows | 1 |

Budget: ~11 API calls per full crawl. ~450 full crawls/hour with a token.

Tier 2 — Project Activity (includes Tier 1)

Issues, PRs, traffic, milestones — the pulse of the project.

| Section | API Endpoint | Calls | Notes |
| ------- | ------------ | ----- | ----- |
| traffic | .../traffic/views + .../traffic/clones | 2 | Requires push/admin access. Degrades gracefully on 403. |
| issues | GET /repos/.../issues | 1+ | Filters out PRs. Body capped at 500 chars. |
| pullRequests | GET /repos/.../pulls | 1+ | Includes draft/merged status, head/base refs. |
| milestones | GET /repos/.../milestones | 1+ | All states (open + closed). |
| discussions | (GraphQL — stub) | 0 | Returns empty. Planned for future release. |

Tier 3 — Security & Compliance (includes Tier 1 + 2)

Vulnerability data, dependency analysis, leaked secrets.

| Section | API Endpoint | Calls | Notes |
| ------- | ------------ | ----- | ----- |
| dependabotAlerts | GET /repos/.../dependabot/alerts | 1 | CVE/GHSA IDs, patched versions, severity. |
| securityAdvisories | GET /repos/.../security-advisories | 1 | Repo-level advisories with vulnerability details. |
| sbom | GET /repos/.../dependency-graph/sbom | 1 | SPDX format. Packages, versions, licenses, ecosystems. |
| codeScanningAlerts | GET /repos/.../code-scanning/alerts | 1 | CodeQL, Semgrep, etc. Rule IDs, file locations. |
| secretScanningAlerts | GET /repos/.../secret-scanning/alerts | 1 | Leaked tokens/keys. Push protection bypass tracking. |

Permission tracking: Every Tier 3 section returns a permission status (granted, denied, or not_enabled) so the agent knows exactly what's accessible and what requires elevated access.

Graceful degradation: Each section is fetched independently. A 403 on code scanning doesn't block Dependabot or SBOM.
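The per-section permission tracking described above can be implemented by wrapping each fetch and classifying the HTTP status on failure. A hedged sketch under the convention that 403 means missing scope and 404 on a security endpoint means the feature is disabled (names are illustrative, not the package's internals):

```typescript
type Permission = "granted" | "denied" | "not_enabled";

// Sketch: classify a section's outcome from the HTTP status GitHub
// returned. Illustrative only, not the package's actual code.
function classifyPermission(status: number): Permission {
  if (status === 403) return "denied";      // token lacks security_events
  if (status === 404) return "not_enabled"; // feature off for this repo
  return "granted";
}

// Each section is fetched independently, so one denial never blocks
// the others; failures degrade to a fallback value.
async function fetchSection<T>(
  call: () => Promise<T>,
  fallback: T
): Promise<{ data: T; permission: Permission }> {
  try {
    return { data: await call(), permission: "granted" };
  } catch (err: any) {
    return { data: fallback, permission: classifyPermission(err?.status ?? 0) };
  }
}
```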

Examples

Quick repo triage

get_repo_summary({ owner: "expressjs", repo: "express" })

Deep security audit

crawl_repo({ owner: "myorg", repo: "api-server", tier: "3" })

Compare frameworks

compare_repos({ repos: [
  { owner: "sveltejs", repo: "svelte" },
  { owner: "vuejs", repo: "core" },
  { owner: "facebook", repo: "react" }
], aspects: ["metadata", "activity", "community"] })

Export issues to CSV

const result = crawl_repo({ owner: "myorg", repo: "app", tier: "2", sections: ["issues"] })
export_data({ data: result, format: "csv" })

Org-wide vulnerability scan

crawl_org({ org: "myorg", tier: "3", alert_limit: 50 })

Development

npm install
npm run typecheck    # Type check with tsc
npm test             # Run tests with vitest
npm run build        # Compile to build/

Test Coverage

60 tests across 5 test files:

  • Validation — owner/repo regex, URL parsing, edge cases
  • CSV escaping — formula injection vectors, quoting, special chars
  • Markdown escaping — pipe and newline escaping
  • GitHub adapter — Tier 1/2/3 fetching, section filtering, error handling, permission tracking
  • Tool schemas — Zod validation, param defaults
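The Markdown escaping cases listed above come down to keeping a table cell on one line with literal pipes; a rough sketch (the function name is illustrative, not the package's actual mdEscape export):

```typescript
// Sketch of escaping a value for a Markdown table cell: an unescaped
// pipe would split the cell, and a newline would break the row.
// Illustrative only.
function escapeMarkdownCell(value: string): string {
  return value
    .replace(/\|/g, "\\|")  // literal pipe instead of a column break
    .replace(/\r?\n/g, " ") // collapse newlines to keep one row
    .trim();
}
```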

Architecture

src/
  index.ts              # Entry point (shebang for npx)
  server.ts             # MCP server setup + tool registration
  types.ts              # All interfaces, Zod schemas, error codes, tier constants
  adapters/
    types.ts            # Platform-agnostic adapter interface
    github.ts           # GitHub API via Octokit (Tier 1/2/3)
  tools/
    crawlRepo.ts        # crawl_repo — single repo crawling
    crawlOrg.ts         # crawl_org — org-wide crawling with filters
    repoSummary.ts      # get_repo_summary — lightweight 4-call summary
    compareRepos.ts     # compare_repos — side-by-side comparison
    exportData.ts       # export_data — JSON/CSV/Markdown export
  utils/
    logger.ts           # Stderr-only logger (stdout reserved for MCP)
    errors.ts           # CrawlerError class, structured error responses
    validation.ts       # Owner/repo/URL validation with regex
    csvEscape.ts        # Formula injection prevention + CSV quoting
    mdEscape.ts         # Pipe escaping, newline removal for tables

Design Principles

  • Per-section fetching — pay only for what you use. Request sections: ["metadata", "issues"] and only those APIs are called.
  • Parallel where safe — independent single-call endpoints (metadata, tree, languages, readme, community) run concurrently via Promise.allSettled. Paginated endpoints run sequentially with early termination.
  • Graceful degradation — every API call is wrapped in try/catch. A single failure returns a default value instead of crashing the crawl.
  • Permission awareness — Tier 3 tracks which security endpoints returned 403 (insufficient permissions) versus 404. The agent can infer its access level from what comes back.
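The "parallel where safe" principle can be sketched with Promise.allSettled, which never rejects as a whole, so one failed section cannot sink the batch (the fetcher map and function name are placeholders, not the package's actual code):

```typescript
// Sketch: run independent single-call sections concurrently and keep
// whatever succeeded. Failed sections degrade to null instead of
// aborting the crawl. Illustrative only.
async function crawlSectionsInParallel(
  fetchers: Record<string, () => Promise<unknown>>
): Promise<Record<string, unknown>> {
  const names = Object.keys(fetchers);
  const settled = await Promise.allSettled(names.map((n) => fetchers[n]()));
  const out: Record<string, unknown> = {};
  settled.forEach((result, i) => {
    out[names[i]] = result.status === "fulfilled" ? result.value : null;
  });
  return out;
}
```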

License

MIT


Built by MCP Tool Shop.