@cafitac/ai-crawler

v0.1.2

Published

20 days ago

npm delivery wrapper for the ai-crawler Python CLI

cafitac

crawler ai mcp wrapper network

ai-crawler

AI-driven network-first crawler compiler for authorized workflows.

ai-crawler turns captured network evidence into reusable crawler recipes. The browser is used as a short-lived probe for API discovery, not as the crawling engine. Bulk collection runs through deterministic HTTP replay with curl-cffi.

Browser is not the crawler. Browser is the probe.
AI is not the request loop. AI is the planner/debugger/recipe author.

What it is

ai-crawler is an early-stage Python OSS library and CLI for building crawler recipes from network evidence.

It focuses on:

Network-first API discovery and replay
Recipe generation, testing, repair, and deterministic execution
Simple CLI defaults for humans and AI harnesses
Python SDK facade for application integrations
stdio MCP server for Hermes, Claude Code, Codex, and other agents
Local-first tests with fake transports and fixture sites
Security boundaries: redaction, challenge detection, and no CAPTCHA/MFA/bot-challenge bypass logic

Install for local development

git clone https://github.com/cafitac/ai-crawler.git
cd ai-crawler
uv sync --extra dev --extra http --extra mcp

If you are already inside a local checkout:

uv sync --extra dev --extra http --extra mcp

npm wrapper

For npm-first onboarding, the repo also ships a thin Node wrapper that delegates to the Python core:

npx @cafitac/ai-crawler --help
npx @cafitac/ai-crawler auto evidence.json --json
npx @cafitac/ai-crawler mcp

Wrapper behavior:

inside the repo checkout: runs the local Python core with uv run --project <repo> ai-crawler ...
outside the repo checkout: runs the published Python core via a git-pinned uvx spec when the wrapper package includes gitHead, otherwise falls back to uvx --from "git+https://github.com/cafitac/ai-crawler.git[all]" ai-crawler ...
override the published Python package spec with AI_CRAWLER_PYTHON_SPEC
override the uvx Python version with AI_CRAWLER_UVX_PYTHON
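
The selection rules above can be sketched in Python for illustration. This is not the wrapper's actual Node code: the function name, the `repo_root` and `pinned_spec` parameters, the precedence of the environment override over the pinned spec, and forwarding the Python version through uvx's `--python` option are all assumptions.

```python
from __future__ import annotations

import os
from pathlib import Path

# Fallback spec quoted from the wrapper behavior list above.
DEFAULT_SPEC = "git+https://github.com/cafitac/ai-crawler.git[all]"


def build_command(
    args: list[str],
    repo_root: Path | None = None,
    pinned_spec: str | None = None,
    env: dict[str, str] | None = None,
) -> list[str]:
    """Return the argv a wrapper could hand to uv or uvx (illustrative only)."""
    env = dict(os.environ) if env is None else env

    # Inside a checkout: run the local Python core through uv.
    if repo_root is not None:
        return ["uv", "run", "--project", str(repo_root), "ai-crawler", *args]

    # Outside a checkout (assumed order): env override, git-pinned spec, fallback.
    spec = env.get("AI_CRAWLER_PYTHON_SPEC") or pinned_spec or DEFAULT_SPEC
    command = ["uvx"]
    python_version = env.get("AI_CRAWLER_UVX_PYTHON")
    if python_version:
        # Assumption: the version is forwarded via uvx's --python option.
        command += ["--python", python_version]
    return [*command, "--from", spec, "ai-crawler", *args]
```

Returning an argv list rather than a shell string keeps arguments such as file paths safe from shell quoting issues when the result is passed to `subprocess.run`.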

Quick start

The one-command path from URL to crawler artifacts is:

uv sync --extra browser --extra http
uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json

compile opens the page briefly, records normalized network response events into evidence.json, generates a recipe, tests it, repairs extraction when possible, retests, and writes final JSONL output. The browser is only used for discovery; the generated recipe and final crawl use deterministic HTTP replay. By default, probe evidence keeps replay-friendly fetch/xhr 2xx/3xx responses and drops static assets, failed responses, and other browser noise.

If you want to inspect or edit evidence before compiling, split the flow:

uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products"
uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --wait-ms 2500 --max-events 50 --include-resource-type fetch,xhr,document
uv run --extra http ai-crawler auto evidence.json --json

If you already have an evidence file, the main AI-harness command is:

ai-crawler auto evidence.json --json

With a local checkout:

uv run --extra http ai-crawler auto evidence.json --json

This writes default artifacts:

evidence.json            # browser probe evidence, if generated by probe
recipe.yaml              # initial generated recipe
repaired.recipe.yaml     # repaired/final recipe
test.jsonl               # initial diagnostic crawl output
crawl.jsonl              # final crawl output
auto.report.json         # stable machine-readable report

The JSON report includes:

final success/failure status
command_type (compile or auto)
failure_phase for quick triage (probe, generate, final_test, or empty on success)
ordered phase_diagnostics for probe -> generate -> initial_test -> repair -> final_test
recipe/output paths
initial and final crawl results
bounded/redacted diagnostic samples
failure classifications such as success, extraction_failed, http_error, no_response, challenge_detected, probe_failed, and no_endpoint_candidates

In --json mode, stdout is reserved for one machine-readable JSON object, and human-readable failures are written to stderr. A run that fails with exit code 2 still writes auto.report.json, so agents can inspect the failure.
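
A harness can rely on that contract when it consumes a run. The helper below is a minimal sketch: `summarize_run` is a hypothetical name, it reads only the `command_type` and `failure_phase` keys named above, and it assumes exit code 0 means success.

```python
import json


def summarize_run(stdout: str, stderr: str, returncode: int) -> dict:
    """Summarize one --json run from its captured output (illustrative only)."""
    # stdout carries exactly one JSON object in --json mode.
    report = json.loads(stdout)
    return {
        # Assumption: exit code 0 signals success.
        "ok": returncode == 0,
        "command_type": report.get("command_type"),
        # Empty on success per the report description; None if the key is absent.
        "failure_phase": report.get("failure_phase"),
        # Human-readable failure text goes to stderr, so keep the last line for logs.
        "stderr_tail": stderr.strip().splitlines()[-1:],
    }
```

The inputs would typically come from `subprocess.run([...], capture_output=True, text=True)`, passing the `stdout`, `stderr`, and `returncode` attributes of the completed process.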

Evidence format

Create evidence with a short browser probe:

uv run --extra browser ai-crawler probe https://example.com/products --goal "collect products" --output evidence.json

The probe tuning options are available on both probe and compile:

--wait-ms: browser settle time after network idle (default: 1000)
--max-events: maximum replay candidates retained after filtering (default: 200)
--include-resource-type: comma-separated Playwright resource types to retain (default: fetch,xhr)
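
To make the defaults concrete, here is a rough sketch of the kind of filtering they describe. It is not the project's implementation: `filter_events` and `STATIC_SUFFIXES` are hypothetical, and the event keys are borrowed from the evidence format documented in this section.

```python
from urllib.parse import urlsplit

# Hypothetical list of static-asset suffixes; the real filter may differ.
STATIC_SUFFIXES = (".css", ".js", ".png", ".jpg", ".svg", ".woff2", ".ico")


def filter_events(events, include_resource_types=("fetch", "xhr"), max_events=200):
    """Keep replay-friendly events, mirroring the documented defaults."""
    kept = []
    for event in events:
        if event.get("resource_type") not in include_resource_types:
            continue  # browser noise such as images, fonts, stylesheets
        status = event.get("status_code")
        if not isinstance(status, int) or not 200 <= status < 400:
            continue  # failed or missing responses are dropped
        if urlsplit(event.get("url", "")).path.lower().endswith(STATIC_SUFFIXES):
            continue  # static assets requested through fetch/xhr
        kept.append(event)
        if len(kept) >= max_events:
            break  # cap applies to candidates retained after filtering
    return kept
```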

Minimal evidence JSON:

{
  "target_url": "https://example.com/products",
  "goal": "collect products",
  "events": [
    {
      "method": "GET",
      "url": "https://example.com/api/products?page=1",
      "status_code": 200,
      "resource_type": "fetch"
    }
  ]
}
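
If you assemble evidence by hand or from another capture tool, a small helper can produce a file in this shape. This is a sketch under assumptions: `write_evidence` is a hypothetical helper, and treating all four event keys from the example as required is a guess rather than a documented schema rule.

```python
import json
from pathlib import Path

# Assumption: every key shown in the minimal example is required per event.
REQUIRED_EVENT_KEYS = {"method", "url", "status_code", "resource_type"}


def write_evidence(path, target_url, goal, events):
    """Write an evidence file shaped like the minimal example (illustrative only)."""
    for index, event in enumerate(events):
        missing = REQUIRED_EVENT_KEYS - event.keys()
        if missing:
            raise ValueError(f"event {index} is missing {sorted(missing)}")
    evidence = {"target_url": target_url, "goal": goal, "events": list(events)}
    Path(path).write_text(json.dumps(evidence, indent=2), encoding="utf-8")
    return evidence
```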

Compile end to end in one command:

uv run --extra browser --extra http ai-crawler compile https://example.com/products --goal "collect products" --json

Or run each artifact step yourself:

uv run --extra http ai-crawler generate-recipe evidence.json
uv run --extra http ai-crawler test-recipe recipe.yaml
uv run --extra http ai-crawler repair-recipe recipe.yaml
uv run --extra http ai-crawler test-recipe repaired.recipe.yaml --output crawl.jsonl

MCP usage

Generate client config snippets for local uv-project usage. For copy-paste examples across CLI/MCP/SDK flows, also see docs/harness-examples.md.

uv run ai-crawler mcp-config --client hermes --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client claude-code --project /path/to/ai-crawler
uv run ai-crawler mcp-config --client codex --project /path/to/ai-crawler

Generate npm-first snippets for the published wrapper:

uv run ai-crawler mcp-config --client hermes --launcher npm

Run as a stdio MCP server:

uv run --extra mcp --extra http ai-crawler mcp

Exposed tools:

compile_url
auto_compile
generate_recipe
test_recipe
repair_recipe

If you prefer npm-first installation for agent tooling, the wrapper can also launch the MCP server:

npx @cafitac/ai-crawler mcp

Hermes development snippet shape:

mcp_servers:
  ai-crawler:
    command: "uv"
    args: ["run", "--project", "/path/to/ai-crawler", "--extra", "mcp", "--extra", "http", "ai-crawler", "mcp"]
    timeout: 300
    connect_timeout: 60

Hermes npm-first snippet shape:

mcp_servers:
  ai-crawler:
    command: "npx"
    args: ["-y", "@cafitac/ai-crawler", "mcp"]
    timeout: 300
    connect_timeout: 60

npm publishing

npm publishing is automated with .github/workflows/npm-publish.yml.

push a tag matching the package version, for example npm-v0.1.2
or run the workflow manually with workflow_dispatch
the workflow validates that package.json, pyproject.toml, and src/ai_crawler/__init__.py agree on the release version before publish
tag-triggered publishes also validate that the pushed tag matches npm-v<package.json version>
use docs/release-runbook.md for the full version bump, tagging, and post-publish smoke checklist

Example tag flow:

git tag npm-v0.1.2
git push origin npm-v0.1.2

Python SDK

The Python SDK remains the stable embedded/programmatic surface. The npm package is only a launcher wrapper around this Python core. See docs/harness-examples.md for copy-paste SDK, MCP, and published-wrapper examples.

from ai_crawler import AICrawler

crawler = AICrawler()
result = crawler.auto("evidence.json")
print(result.ok)
print(result.exit_code)
print(result.report)

compile_result = crawler.compile_url("https://example.com/products", goal="collect products")
print(compile_result.report["command_type"])

For tests or embedded usage, inject a fake fetcher:

crawler = AICrawler(fetcher=my_fake_fetcher)
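
The fetcher contract is not spelled out here, so the following is only a sketch of what a fake could look like. The `FakeResponse` shape and the `fetch(method, url)` method name and signature are hypothetical; check the SDK source for the real protocol before relying on them.

```python
from dataclasses import dataclass


@dataclass
class FakeResponse:
    """Hypothetical response shape; the real fetcher contract may differ."""

    status_code: int
    text: str


class FakeFetcher:
    """Serve canned responses by URL and record every request for assertions."""

    def __init__(self, responses):
        self.responses = dict(responses)
        self.calls = []

    def fetch(self, method, url, **kwargs):  # hypothetical name and signature
        self.calls.append((method, url))
        # Unknown URLs answer 404, so a test never touches the network.
        return self.responses.get(url, FakeResponse(404, ""))


my_fake_fetcher = FakeFetcher(
    {"https://example.com/api/products?page=1": FakeResponse(200, '{"items": []}')}
)
```

Recording calls lets a test assert which endpoints a recipe replayed, in addition to what the crawl returned.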

Verification

Fast local lint/type checks while iterating:

bash scripts/check-python.sh

Full project verification:

bash scripts/verify-ai-harness.sh

MCP auto_compile fixture smoke test:

uv run --extra http python scripts/smoke-mcp-auto-compile.py

This starts a local fixture HTTP site and verifies generate -> test -> repair -> retest without external internet, a real browser, or a real LLM.

Security and compliance boundary

ai-crawler is intended for authorized crawling, internal QA/testing, research, owned or allowed web property monitoring, and data portability workflows.

It does not implement:

CAPTCHA solving
MFA bypass
Cloudflare/bot-challenge bypass
stealth fingerprint manipulation
evasion proxy rotation

Challenge-like responses are classified and surfaced as requiring human/manual handoff where appropriate.
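
As an illustration of classify-and-surface (rather than bypass), a detector might look roughly like this. The marker strings, the status codes checked, and the `classify_response` name are assumptions; only the returned labels come from the failure classifications listed for the JSON report.

```python
# Hypothetical markers; a real detector would likely use a richer rule set.
CHALLENGE_MARKERS = ("captcha", "verify you are human", "challenge")


def classify_response(status_code, body):
    """Map one HTTP result onto a failure classification (illustrative only)."""
    if status_code is None:
        return "no_response"
    lowered = (body or "").lower()
    if status_code in (403, 429, 503) and any(m in lowered for m in CHALLENGE_MARKERS):
        # Surface for manual handoff; nothing here tries to get past the challenge.
        return "challenge_detected"
    if status_code >= 400:
        return "http_error"
    return "success"
```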

Sensitive values in diagnostic reports are redacted, including common bearer tokens, cookies, session IDs, API keys, and JSON-embedded token fields.
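
A simplified sketch of this style of redaction is shown below. It is not the project's redactor: the key list, the bearer-token pattern, and the `[REDACTED]` placeholder are assumptions chosen for the example.

```python
import re

# Hypothetical set of sensitive keys, compared case-insensitively.
SENSITIVE_KEYS = {
    "authorization", "cookie", "set-cookie", "session_id",
    "api_key", "token", "access_token",
}
BEARER_RE = re.compile(r"(?i)\bbearer\s+[A-Za-z0-9._~+/=-]+")
REDACTED = "[REDACTED]"


def redact(value, key=""):
    """Recursively mask sensitive values in a JSON-like structure."""
    if str(key).lower() in SENSITIVE_KEYS:
        return REDACTED  # mask the whole value under a sensitive key
    if isinstance(value, dict):
        return {k: redact(v, k) for k, v in value.items()}
    if isinstance(value, list):
        return [redact(item) for item in value]
    if isinstance(value, str):
        # Mask bearer tokens embedded in free text such as log lines.
        return BEARER_RE.sub("Bearer " + REDACTED, value)
    return value
```

Redacting by key as well as by pattern covers both structured fields (headers, JSON token fields) and tokens that leak into free-text diagnostics.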

Documentation

Development docs live under .dev/:

.dev/README.md
.dev/03-ai/auto-harness-contract.md
.dev/04-mcp/server.md
.dev/08-operations/security-and-compliance.md
.dev/08-operations/challenge-handling-policy.md

Status

Alpha. The deterministic recipe compiler, one-command compile flow, browser probe, CLI, SDK facade, MCP server, redaction, failure classification, and fixture smoke tests are implemented. Real LLM provider integrations are intentionally optional/future layers behind adapter boundaries.

License

MIT