# bluewebcrawler (v0.2.0)
BlueWebCrawler is a Node + TypeScript command-line crawler that maps a site and exports AI-ready markdown.

It supports hybrid crawling:
- Fast HTTP-first fetching for normal pages.
- Playwright browser escalation for JS-rendered pages.
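As a rough illustration of the escalation decision, a heuristic like the following could flag pages whose plain-HTTP response looks like an empty JavaScript app shell. The function name, patterns, and text-length threshold are invented for illustration and are not the crawler's actual logic:

```typescript
// Hypothetical sketch: decide whether a page fetched over plain HTTP
// likely needs a real browser (Playwright) to render its content.
// Heuristic: very little visible text, or a bare SPA mount point.
export function needsBrowserRender(html: string): boolean {
  // Strip scripts, styles, and tags to estimate the readable text
  // that a plain HTTP fetch actually produced.
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, "")
    .replace(/<style[\s\S]*?<\/style>/gi, "")
    .replace(/<[^>]+>/g, " ")
    .trim();

  // Common SPA mount points that stay empty until JavaScript runs.
  const appShell =
    /<div[^>]*id=["'](root|app|__next)["'][^>]*>\s*<\/div>/i.test(html);

  return appShell || text.length < 200;
}
```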
## Capability Matrix

| Capability | v1 Status |
| --- | --- |
| Same-host domain crawl | ✅ |
| Link + sitemap discovery | ✅ |
| JS-rendered page support | ✅ |
| Robots.txt respected by default | ✅ |
| Per-page markdown output | ✅ |
| Index + errors report | ✅ |
| Optional JSON manifest | ✅ |
| Authenticated crawling | ❌ (future) |
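The same-host scope check can be sketched as follows. This is an illustrative helper, not part of the package's API; the `includeSubdomains` parameter mirrors the config option of the same name:

```typescript
// Illustrative same-host scope check (not the package's actual code).
// With includeSubdomains = false, only the exact seed host qualifies;
// with true, subdomains of the seed host qualify as well.
export function inCrawlScope(
  seed: string,
  candidate: string,
  includeSubdomains = false,
): boolean {
  const seedHost = new URL(seed).hostname;
  const candHost = new URL(candidate).hostname;
  if (candHost === seedHost) return true;
  // endsWith(".example.com") avoids matching lookalikes such as
  // "evilexample.com".
  return includeSubdomains && candHost.endsWith("." + seedHost);
}
```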
## Requirements

- Node.js >= 20
- npm >= 10
- Playwright browser binaries (for JS-rendered pages)
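If you want a script to fail fast on older runtimes, a small version guard like this can help. It is an illustrative snippet, not part of the package:

```typescript
// Illustrative guard: check that a Node version string such as
// process.version ("v20.11.1") satisfies a required major version.
export function checkNodeMajor(version: string, required = 20): boolean {
  const major = Number(version.replace(/^v/, "").split(".")[0]);
  return Number.isInteger(major) && major >= required;
}

// Usage (in a script's entry point):
//   if (!checkNodeMajor(process.version)) {
//     console.error(`Node >= 20 required, found ${process.version}`);
//     process.exit(1);
//   }
```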
If your environment does not have Node 20 yet, install it first (for example with nvm):

```
nvm install 20
nvm use 20
node -v
npm -v
```

## Quick Install (npx)
Run directly without installing globally:

```
npx bluewebcrawler crawl https://example.com
```

For JS-rendered pages, install Chromium once on your machine:

```
npx playwright install chromium
```

## Global Install (npm)
Install globally and run as a normal command:

```
npm install -g bluewebcrawler
bluewebcrawler crawl https://example.com
```

### Dependency Note (Important)

npm installs this package's dependencies automatically (everything listed in `dependencies`), including the `playwright` npm package.

If you still get Playwright runtime errors (for example, a missing Chromium executable), install the browser binaries manually:

```
npx playwright install chromium
```

For Linux CI/containers that also need system libraries:

```
npx playwright install --with-deps chromium
```

## Security Safeguards (Default On)
BlueWebCrawler treats crawled content as untrusted input.

Default behavior:

- Prompt-injection policy: `redact`
- Prompt-injection drop threshold (for `drop` mode): `3`
- Output encoding: `ascii-escape`
- Unicode normalization: `NFKC`
- Control and bidi direction characters removed
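As a rough sketch, the encoding defaults above could be approximated like this. It is illustrative only; the actual sanitizer may behave differently:

```typescript
// Simplified sketch of ascii-escape output encoding: NFKC-normalize,
// strip control and bidi-direction characters, then escape remaining
// non-ASCII code points as \u{...} so the output is pure ASCII.
export function sanitizeForOutput(input: string): string {
  const normalized = input.normalize("NFKC");
  const stripped = normalized.replace(
    // C0/C1 controls (except \t \n \r) plus Unicode bidi controls.
    /[\u0000-\u0008\u000B\u000C\u000E-\u001F\u007F-\u009F\u200E\u200F\u202A-\u202E\u2066-\u2069]/g,
    "",
  );
  return stripped.replace(/[^\x00-\x7F]/gu, (ch) =>
    `\\u{${ch.codePointAt(0)!.toString(16)}}`,
  );
}
```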
Useful overrides:
```
# Detect only (no redaction)
bluewebcrawler crawl https://example.com --prompt-injection-mode detect

# Strict drop mode
bluewebcrawler crawl https://example.com --prompt-injection-mode drop --prompt-injection-threshold 5

# Keep raw Unicode output (less strict)
bluewebcrawler crawl https://example.com --output-encoding utf8
```

## Run From Source (Clone + Build)
```
git clone https://github.com/plgonzalezrx8/bluewebcrawler.git
cd bluewebcrawler
npm ci
```

Install the Playwright Chromium runtime when you need JS-rendered crawling:

```
npx playwright install chromium
```

Build and run:

```
npm run build
npm run crawl -- https://example.com
```

Output is written to `./output` by default:

- `output/run-<timestamp>/index.md`
- `output/run-<timestamp>/errors.md`
- `output/run-<timestamp>/pages/*.md`
- `output/run-<timestamp>/manifest.json` (when format is `markdown+json`)
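A small helper to pick the most recent run directory could look like this. It is illustrative and assumes the `run-<timestamp>` naming shown above sorts lexicographically (true for ISO-style timestamps):

```typescript
// Illustrative helper: given the entries of the output directory,
// return the newest run-<timestamp> directory name, relying on
// lexicographic order of ISO-style timestamps.
export function newestRun(dirNames: string[]): string | undefined {
  return dirNames
    .filter((d) => d.startsWith("run-"))
    .sort()
    .at(-1);
}
```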
## CLI Usage

```
bluewebcrawler crawl <url> [options]
```

### Common Recipes

Full crawl with explicit limits:

```
bluewebcrawler crawl https://example.com \
  --max-pages 1500 \
  --max-depth 8 \
  --max-duration 3600
```

Sitemap discovery disabled:

```
bluewebcrawler crawl https://example.com --sitemap off
```

Strict robots compliance with conservative concurrency:

```
bluewebcrawler crawl https://example.com --respect-robots true --concurrency 2
```

## Configuration File
Create `crawler.config.json` in your project root and provide it via `--config`.
```json
{
  "output": "./output",
  "maxPages": 1000,
  "maxDepth": 6,
  "maxDurationSeconds": 1800,
  "respectRobots": true,
  "includeSubdomains": false,
  "queryPolicy": "drop",
  "queryAllowlist": [],
  "sitemap": "auto",
  "concurrency": { "min": 2, "max": 8 },
  "timeouts": { "requestMs": 15000, "renderMs": 20000 },
  "render": { "strategy": "hybrid", "waitUntil": "networkidle", "fallbackTimeoutMs": 20000 },
  "format": "markdown+json",
  "userAgent": "bluewebcrawler/0.1 (+https://github.com/)",
  "verbose": false,
  "security": {
    "promptInjection": { "mode": "redact", "threshold": 3 },
    "outputEncoding": {
      "mode": "ascii-escape",
      "normalize": "NFKC",
      "stripControlChars": true,
      "stripBidiControls": true
    }
  }
}
```

CLI flags override config file values.
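The precedence rule (CLI flags over config file over built-in defaults) can be illustrated as a shallow merge. This is a sketch of the idea, not the package's actual merge logic:

```typescript
// Illustrative precedence: defaults < config file < CLI flags.
// A shallow spread covers top-level keys; nested objects such as
// `concurrency` would need a deep merge in a real implementation.
type CrawlerOptions = Record<string, unknown>;

export function resolveOptions(
  defaults: CrawlerOptions,
  fileConfig: CrawlerOptions,
  cliFlags: CrawlerOptions,
): CrawlerOptions {
  return { ...defaults, ...fileConfig, ...cliFlags };
}
```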
## Development (Contributors)

```
npm run lint
npm run test
npm run build
```

Browser integration tests are opt-in:

```
RUN_BROWSER_TESTS=1 npm run test
```

## Publishing Checklist (Maintainers)
First release target from the original 0.1.0 state: 0.1.1.

- Run release checks locally.

  ```
  npm run release:check
  ```

- Confirm the pack output only contains runtime artifacts:

  ```
  dist/**
  README.md
  LICENSE
  package.json
  ```

- Set the release version and tag.

  ```
  npm version patch
  ```

- Push the commit and tag.

  ```
  git push origin main --follow-tags
  ```

- Authenticate and publish.

  ```
  npm login
  npm publish
  ```

- Verify install and runtime.

  ```
  npm view bluewebcrawler version
  npx bluewebcrawler --version
  npx bluewebcrawler crawl https://example.com --max-pages 1
  ```

## Continuous Integration
GitHub Actions runs on every push and pull_request using `.github/workflows/ci.yml`:

- The **Lint, Test, Build** job runs `npm run lint`, `npm run test`, and `npm run build`.
- The **Browser Integration Tests** job installs Playwright Chromium and runs `tests/integration/js-crawl.test.ts` with `RUN_BROWSER_TESTS=1`.
