gimirick-brave-search-scraper

v1.1.3

Published

3 days ago

Server-side scraper that queries Brave Search and extracts URLs

Downloads

1,527

gimirickofficial

Brave Search Scraper

Brave Search Scraper is a Node.js library for scraping Brave Search. It uses axios and cheerio to fetch and parse Brave Search results, and returns a clean array of external URLs. It features input validation with Zod, structured logging with Pino, multi-page pagination, and a built-in health check.

npm

Install globally (CLI use):

npm i -g gimirick-brave-search-scraper

Install locally (programmatic use):

npm i gimirick-brave-search-scraper

Programmatic usage (npm)

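const { scrapeBraveSearch } = require('gimirick-brave-search-scraper');

// Top-level await is not available in CommonJS, so wrap the call in an async function.
(async () => {
  const urls = await scrapeBraveSearch('machine learning');
  console.log(urls);
})();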

Output:

["https://en.wikipedia.org/wiki/Machine_learning", "https://www.ibm.com/topics/machine-learning"]

CLI (npm)

brave-search-scraper "your search query"

Or via npx without installing:

npx brave-search-scraper "your search query"

With a SEARCH_QUERY environment variable:

SEARCH_QUERY="your search query" brave-search-scraper

Git Clone

Clone and install locally:

git clone https://github.com/GimiRick/Brave-Search-Scraper.git
cd Brave-Search-Scraper
npm install

Programmatic usage (git clone)

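const { scrapeBraveSearch } = require('./src/scraper');

// Top-level await is not available in CommonJS, so wrap the call in an async function.
(async () => {
  const urls = await scrapeBraveSearch('machine learning');
  console.log(urls);
})();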

CLI (git clone)

node src/scraper.js "your search query"

With a SEARCH_QUERY environment variable:

SEARCH_QUERY="your search query" node src/scraper.js

Additional options

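All examples below use require('gimirick-brave-search-scraper') (npm). If using a git clone, replace it with require('./src/scraper'). Snippets that use await must run inside an async function, because top-level await is not available in CommonJS modules.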

Import only what you need

const {
  scrapeBraveSearch,
  extractUrls,
  extractCookies,
  fetchWithRetry,
  isBraveDomain,
  randomItem,
  sleep,
  main,
  validateSearchQuery,
  healthCheck,
} = require('gimirick-brave-search-scraper');

Searching multiple queries

const { scrapeBraveSearch } = require('gimirick-brave-search-scraper');

const queries = ['node.js tutorial', 'python vs javascript', 'rust programming'];

for (const query of queries) {
  const urls = await scrapeBraveSearch(query);
  console.log(`"${query}" → ${urls.length} results`);
  console.log(urls.join('\n'));
}

Custom retry count

By default, fetchWithRetry retries up to 3 times on failures or rate limits. Pass a custom count as the fourth argument:

const { fetchWithRetry } = require('gimirick-brave-search-scraper');

const response = await fetchWithRetry(
  'https://search.brave.com/search',
  { q: 'artificial intelligence' },
  { 'User-Agent': 'Mozilla/5.0 ...' },
  5
);

Parse HTML you already have

const cheerio = require('cheerio');
const { extractUrls } = require('gimirick-brave-search-scraper');

const $ = cheerio.load(existingHtml);
const urls = extractUrls($);
console.log(urls);

Extract cookies manually

const axios = require('axios');
const { extractCookies } = require('gimirick-brave-search-scraper');

const response = await axios.get('https://search.brave.com/', {
  headers: { 'User-Agent': 'Mozilla/5.0 ...' },
});

const cookies = extractCookies(response.headers['set-cookie']);
console.log(cookies);
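
The return shape of extractCookies is not documented here. As a rough illustration of the general technique, the sketch below turns an array of Set-Cookie header values into a single Cookie request header string. The name toCookieHeader is hypothetical and is not part of this package, so treat this as one possible approach rather than what the library does.

```javascript
// Hypothetical helper, not part of the package: build a Cookie request header
// from the array of Set-Cookie values that axios exposes as headers['set-cookie'].
function toCookieHeader(setCookieValues) {
  if (!Array.isArray(setCookieValues)) return '';
  return setCookieValues
    // Keep only the leading "name=value" pair; drop attributes such as Path or HttpOnly.
    .map((value) => String(value).split(';')[0].trim())
    .filter((pair) => pair.includes('='))
    .join('; ');
}

const header = toCookieHeader([
  'session=abc123; Path=/; HttpOnly',
  'theme=dark; Max-Age=3600',
]);
console.log(header); // session=abc123; theme=dark
```

The resulting string could be sent as the Cookie header on a follow-up request.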

Filter Brave domains from a URL list

const { isBraveDomain } = require('gimirick-brave-search-scraper');

const urls = [
  'https://brave.com/download',
  'https://example.com/article',
  'https://support.brave.com/help',
  'https://en.wikipedia.org/wiki/Brave',
];

const external = urls.filter((url) => !isBraveDomain(new URL(url).hostname));
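
For readers curious how such a hostname check might work, here is a minimal sketch. It assumes the filtered domains are brave.com and brave.app plus their subdomains, as listed in the "How it works" section. The function isBraveHost and the BRAVE_ROOTS list are hypothetical; the package's own isBraveDomain may differ in detail.

```javascript
// Hypothetical re-implementation for illustration only.
const BRAVE_ROOTS = ['brave.com', 'brave.app'];

function isBraveHost(hostname) {
  const host = String(hostname).toLowerCase();
  // Match the root itself or any subdomain, but not look-alikes such as "notbrave.com".
  return BRAVE_ROOTS.some((root) => host === root || host.endsWith('.' + root));
}

const candidates = [
  'https://brave.com/download',
  'https://example.com/article',
  'https://support.brave.com/help',
  'https://notbrave.com/',
];

const externalOnly = candidates.filter((url) => !isBraveHost(new URL(url).hostname));
console.log(externalOnly);
// [ 'https://example.com/article', 'https://notbrave.com/' ]
```

Comparing against '.' + root rather than the bare suffix is what keeps unrelated hosts that merely end in "brave.com" from being filtered out.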

Throttle requests

const { sleep } = require('gimirick-brave-search-scraper');

await sleep(2000); // wait 2 seconds
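
The scraper is described as waiting 1 to 3 seconds with random jitter between requests. The sketch below shows one way to combine a promise-based sleep with a jittered delay. The names sleepMs, jitterMs, and politePause are hypothetical and are not exports of this package.

```javascript
// Hypothetical helpers for illustration; names are not taken from the package.
function sleepMs(ms) {
  // Resolve after roughly `ms` milliseconds.
  return new Promise((resolve) => setTimeout(resolve, ms));
}

function jitterMs(minMs, maxMs) {
  // Uniform random integer in the inclusive range [minMs, maxMs].
  return minMs + Math.floor(Math.random() * (maxMs - minMs + 1));
}

async function politePause() {
  const delay = jitterMs(1000, 3000); // 1 to 3 seconds
  await sleepMs(delay);
  return delay; // returned so callers can log how long they waited
}
```

Inside an async function, await politePause() between consecutive requests spaces them out by a different amount each time.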

Rotate user agents

const { randomItem } = require('gimirick-brave-search-scraper');

const agents = [
  'Mozilla/5.0 (Windows NT 10.0; Win64; x64) Chrome/125.0',
  'Mozilla/5.0 (Macintosh; Intel Mac OS X 14_5) Safari/605.1',
];

const agent = randomItem(agents);

Input validation with Zod

scrapeBraveSearch validates every query before making any network request:

const { scrapeBraveSearch } = require('gimirick-brave-search-scraper');

// Each call is shown independently; an invalid query rejects before any request is sent.
await scrapeBraveSearch(''); // throws ZodError: empty
await scrapeBraveSearch('   '); // throws ZodError: only whitespace
await scrapeBraveSearch(null); // throws ZodError: not a string
await scrapeBraveSearch(42); // throws ZodError: not a string
await scrapeBraveSearch('hello'); // passes validation, then runs the search for 'hello'

The schema is also exported as searchQuerySchema, so you can use it directly:

const { validateSearchQuery, searchQuerySchema } = require('gimirick-brave-search-scraper');

validateSearchQuery('machine learning'); // 'machine learning'

// Use the schema with your own validation:
const result = searchQuerySchema.safeParse(userInput);
if (!result.success) {
  console.log(result.error.issues);
}

Rules:

Must be a string (not null, undefined, number, object, array)
Must be non-empty after trimming whitespace
Maximum 500 characters
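
To make the three rules concrete, here is a plain-JavaScript sketch that mirrors them without Zod. The function checkQuery and its return shape are hypothetical, and applying the 500-character limit after trimming is an assumption; the package's actual schema may order these checks differently.

```javascript
// Hypothetical plain-JavaScript check mirroring the three rules above.
// The package itself uses a Zod schema, which this does not reproduce.
const MAX_QUERY_LENGTH = 500;

function checkQuery(input) {
  if (typeof input !== 'string') {
    return { ok: false, reason: 'must be a string' };
  }
  const trimmed = input.trim();
  if (trimmed.length === 0) {
    return { ok: false, reason: 'must be non-empty after trimming' };
  }
  // Assumption: the length limit is applied to the trimmed value.
  if (trimmed.length > MAX_QUERY_LENGTH) {
    return { ok: false, reason: 'must be at most 500 characters' };
  }
  return { ok: true, value: trimmed };
}

console.log(checkQuery('  hello  ')); // { ok: true, value: 'hello' }
console.log(checkQuery('   '));       // { ok: false, reason: 'must be non-empty after trimming' }
console.log(checkQuery(42));          // { ok: false, reason: 'must be a string' }
```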

Health check

Run diagnostics from the CLI or programmatically.

CLI:

brave-search-scraper --health
# or via git clone:
node src/scraper.js --health

Output:

{
  "status": "ok",
  "version": "1.1.3",
  "timestamp": "2026-06-18T11:11:01.244Z",
  "checks": {
    "node": { "status": "ok", "version": "v24.15.0", "minRequired": ">=20.18.1" },
    "dependencies": { "status": "ok", "loaded": ["axios", "cheerio", "zod", "pino"], "missing": [] },
    "network": { "status": "ok", "reachable": true, "latencyMs": 128, "detail": "HTTP 200" }
  }
}

Exit codes: 0 if all checks pass, 1 if any check fails.

Programmatic:

const { healthCheck } = require('gimirick-brave-search-scraper');

const status = await healthCheck();
console.log(status.status); // 'ok' | 'degraded' | 'fail'
console.log(status.checks.node.version);
console.log(status.checks.dependencies.loaded);
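
If you want to reproduce the CLI's exit-code rule in your own script, the following sketch derives a code from a report shaped like the sample output above. The helper exitCodeFromReport is hypothetical, and the 'fail' value used for an individual check is an assumption, since only 'ok' appears in the sample.

```javascript
// Hypothetical helper: derive a process exit code from a health report
// shaped like the sample output above.
function exitCodeFromReport(report) {
  const checks = Object.values(report.checks || {});
  // 0 only when there is at least one check and every check reports "ok".
  const allOk = checks.length > 0 && checks.every((check) => check.status === 'ok');
  return allOk ? 0 : 1;
}

const sampleReport = {
  status: 'ok',
  checks: {
    node: { status: 'ok', version: 'v24.15.0' },
    dependencies: { status: 'ok', loaded: ['axios', 'cheerio', 'zod', 'pino'], missing: [] },
    network: { status: 'ok', reachable: true },
  },
};
console.log(exitCodeFromReport(sampleReport)); // 0

// Assumed failure shape for a single check.
const failingReport = {
  ...sampleReport,
  checks: { ...sampleReport.checks, network: { status: 'fail', reachable: false } },
};
console.log(exitCodeFromReport(failingReport)); // 1
```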

Version check

Print the installed version:

brave-search-scraper --version
# or via git clone:
node src/scraper.js --version

Output: 1.1.3

Exit code: 0.

Pagination

Scrape multiple pages of results by passing the number of pages as the second argument:

const { scrapeBraveSearch } = require('gimirick-brave-search-scraper');

// Single page (default):
const page1 = await scrapeBraveSearch('machine learning');

// Three pages (offset=10 per page, 1–3s delay between pages):
const allUrls = await scrapeBraveSearch('machine learning', 3);
console.log(`Got ${allUrls.length} results across 3 pages`);

The pages parameter is clamped between 1 and 5. URLs are deduplicated across pages.
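
The clamping and deduplication described above can be sketched as follows. Both clampPages and dedupe are hypothetical helpers written for illustration, and falling back to 1 for a non-numeric value is an assumption about behaviour the text does not specify.

```javascript
// Hypothetical helpers illustrating the two behaviours described above.
function clampPages(pages) {
  // Assumption: anything that is not a finite number falls back to a single page.
  const n = Number.isFinite(pages) ? Math.trunc(pages) : 1;
  return Math.min(5, Math.max(1, n));
}

function dedupe(urls) {
  // A Set keeps the first occurrence of each URL and preserves insertion order.
  return [...new Set(urls)];
}

console.log(clampPages(0));  // 1
console.log(clampPages(3));  // 3
console.log(clampPages(12)); // 5

const pageOne = ['https://a.example/', 'https://b.example/'];
const pageTwo = ['https://b.example/', 'https://c.example/'];
console.log(dedupe([...pageOne, ...pageTwo]));
// [ 'https://a.example/', 'https://b.example/', 'https://c.example/' ]
```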

Coverage

Generate a test coverage report:

npm run coverage

Output includes a terminal summary and an lcov report under coverage/. Current coverage: 93.57% (100% function coverage).

Tests cover retry paths via a local HTTP server, CLI behavior via child processes, and the main() entry point via in-process mocking of process.exit.

Structured logging with Pino

All diagnostic messages are logged as structured JSON to stderr. No more parsing console.error output.

# JSON logs to stderr (human-readable stdout unaffected):
brave-search-scraper "machine learning"

# stderr output looks like:
# {"level":"info","time":...,"name":"brave-search-scraper","msg":"Search completed"}
# {"level":"warn","time":...,"name":"brave-search-scraper","retry":1,"maxRetries":3,"msg":"Rate limited..."}

DEBUG=true node src/scraper.js "rust programming"
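
Because each log line is a JSON object, captured stderr can be filtered with a few lines of code. The sketch below assumes newline-delimited JSON with a string "level" field, as in the sample lines above. The helper filterLogLines is hypothetical, and the "time" values in the sample text are made up.

```javascript
// Hypothetical helper: parse newline-delimited JSON log text and keep entries
// whose "level" field matches. Lines that are not valid JSON are skipped.
function filterLogLines(text, level) {
  const entries = [];
  for (const line of text.split('\n')) {
    if (line.trim() === '') continue;
    try {
      const entry = JSON.parse(line);
      if (entry && entry.level === level) entries.push(entry);
    } catch {
      // Ignore lines that are not valid JSON.
    }
  }
  return entries;
}

// Sample text modelled on the log lines shown above; the "time" values are made up.
const sampleStderr = [
  '{"level":"info","time":1,"name":"brave-search-scraper","msg":"Search completed"}',
  '{"level":"warn","time":2,"name":"brave-search-scraper","retry":1,"maxRetries":3,"msg":"Rate limited..."}',
  'not json',
].join('\n');

const warnings = filterLogLines(sampleStderr, 'warn');
console.log(warnings.length);   // 1
console.log(warnings[0].retry); // 1
```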

Docker

No Node.js installation required.

docker build -t brave-scraper .
docker run --rm brave-scraper "your search query"

With an environment variable:

docker run --rm -e SEARCH_QUERY="your query" brave-scraper

Docker also supports the health check and version flag:

docker run --rm brave-scraper --health
docker run --rm brave-scraper --version

How it works under the hood

1. Validates the search query (Zod): fails fast on bad input, no network call made.
2. Visits the Brave Search homepage to collect session cookies.
3. Waits 1–3 seconds with random jitter to avoid detection.
4. Sends the search request with a rotated User-Agent and the collected cookies.
5. If Brave returns a 429 Too Many Requests, waits with exponential backoff and retries (up to 3 times by default).
6. All retries, warnings, and errors are logged as structured JSON to stderr via Pino.
7. Repeats steps 4–6 for each additional page (if pages > 1), with 1–3s delay between pages.
8. Parses the HTML with cheerio, extracting URLs from <a href>, [data-result-url], and [data-url] attributes.
9. Filters out all Brave-owned domains (brave.com, brave.app and subdomains).
10. Deduplicates across all pages and returns a clean array of external URLs.
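
The retry-on-429 step can be sketched as a small loop with exponential backoff. Everything here is illustrative: the names backoffDelayMs, requestWithRetry, and defaultWait are hypothetical, the 1000 ms base delay and the doubling factor are assumptions, and the injected doRequest is assumed to resolve with an object carrying a status field rather than throwing on 429.

```javascript
// Hypothetical sketch of retry-on-429 with exponential backoff.
function backoffDelayMs(attempt, baseMs = 1000) {
  // attempt 1 -> baseMs, attempt 2 -> 2 * baseMs, attempt 3 -> 4 * baseMs
  return baseMs * 2 ** (attempt - 1);
}

function defaultWait(ms) {
  return new Promise((resolve) => setTimeout(resolve, ms));
}

async function requestWithRetry(doRequest, maxRetries = 3, wait = defaultWait) {
  for (let attempt = 0; ; attempt += 1) {
    const response = await doRequest();
    if (response.status !== 429) return response;
    if (attempt >= maxRetries) {
      throw new Error('Still rate limited after ' + maxRetries + ' retries');
    }
    await wait(backoffDelayMs(attempt + 1));
  }
}

// Usage with a stubbed request that is rate limited twice, then succeeds.
// The wait function is replaced so the example finishes immediately.
let calls = 0;
const stubRequest = async () => ({ status: ++calls < 3 ? 429 : 200 });
const waits = [];
const done = requestWithRetry(stubRequest, 3, async (ms) => { waits.push(ms); });
done.then((res) => console.log(res.status, waits)); // 200 [ 1000, 2000 ]
```

Injecting the wait function keeps the delay policy separate from the retry loop, which also makes the loop easy to exercise without real timers.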

Architecture

User Input (argv / env)
       │
       ▼
┌─────────────────────────────┐
│  validateSearchQuery (Zod)  │────► ZodError on invalid input
└─────────────────────────────┘
       │ (validated query)
       ▼
┌──────────────────────────┐
│    scrapeBraveSearch     │
│        (query)           │
│                          │
│  1. GET homepage         │────► extractCookies()
│     (collect cookies)    │
│                          │
│  2. Sleep 1-3s (jitter)  │────► sleep()
│                          │
│  ┌─ Pagination loop ──── │
│  │ 3. GET search         │────► fetchWithRetry()
│  │    (UA rotation,      │       └── axios.get()
│  │     cookies)          │       └── exponential backoff
│  │                       │       └── logger.warn/error (Pino)
│  │ 4. Parse HTML         │────► cheerio.load()
│  │                       │
│  │ 5. Extract URLs       │────► extractUrls()
│  │      ├── a[href]      │       └── isBraveDomain()
│  │      ├── [data-       │
│  │      │   result-url]  │
│  │      └── [data-url]   │
│  │ 6. Sleep 1-3s         │────► (if more pages)
│  └────────────────────── │
│  7. Deduplicate + Return │────► logger.info + JSON array
└──────────────────────────┘

┌──────────────────────────┐
│     healthCheck()        │
│  ┌───────────────────┐   │
│  │ node version      │   │
│  │ dependencies      │   │
│  │ network reachable │   │
│  └───────────────────┘   │
│  Returns structured JSON │
└──────────────────────────┘

Exit codes (CLI)

| Code | Meaning |
| :--- | :-------------------------------------------- |
| 0 | Success: results printed, or empty array [] |
| 0 | Health check passed (--health flag) |
| 0 | Version printed (--version flag) |
| 1 | Error: no query provided, or scraping failed |
| 1 | Health check failed (--health flag) |

Project structure

brave-search-scraper/
  src/scraper.js        main scraper (also the module entry point)
  src/logger.js         Pino structured logger setup
  test/
    scraper.test.js     core unit and integration tests
    cli.test.js         CLI behavior tests via child process
    main.test.js        main() entry point tests via process mocking
    retry.test.js       fetchWithRetry retry tests via local HTTP server

  Dockerfile            production Docker image
  package.json          dependencies and scripts
  example/              usage examples for each feature

About

Part of the GimiRick toolchain. We build open source LLMs and AI systems. Founded by Mohammad Faiz.

License

CC BY-NC-ND 4.0: Attribution-NonCommercial-NoDerivatives 4.0 International.

Permission is granted to view and run this code. No modifications, alterations, or derivative works are permitted.

See the LICENSE file for the full legal text.