smart-web-mcp

v0.15.2

Published

12 hours ago

Portable MCP server for web search, direct-URL fetch, site crawling, and docs export.

richjojo

smart-web

Local MCP for agentic web retrieval.

smart-web bundles three tools in one stdio server:

smartfetch for direct URLs
smartsearch for discovery
smartcrawl for same-site multi-page traversal

It is designed for agents and MCP hosts, not for manual browser automation. The job is to return structured retrieval output first, then signal when a browser-task runtime would be a better next step.

If a page is login-gated, keep the login UI, session persistence, and capture flow in a companion runtime instead of moving that stateful behavior into the smart-web core.

npm: https://www.npmjs.com/package/smart-web-mcp
MCP Registry: io.github.jojo-labs/smart-web
Issues: https://github.com/jojo-labs/smart-web/issues

If smart-web returns reproducible incorrect or misleading output, open a GitHub issue with the exact input, tool arguments, observed output, expected behavior, and version. Skip obvious transient network, auth, or rate-limit failures unless the smart-web classification itself looks wrong.

Highlights

one local MCP instead of separate search, fetch, and crawl servers
explicit routing: URL → smartfetch, query → smartsearch, known site → smartcrawl
up-front smartfetch acquisition lanes instead of broad direct → browser retry chains
optional Jina Reader relay for weak generic/article fetches when you explicitly enable it, with JSON mode and alternate-link preservation
shared document extraction for article-like pages keeps browser primary content separate from raw shell HTML, preserving headings, links, blocks, and Markdown when requested
budgeted smartfetch projection records explicit truncation metadata instead of scattering silent hardcoded caps
weak generic pages can recover through JSON-LD or Next.js payload extraction and same-origin RSS/Atom feed discovery before falling back to a handoff
paywalled article pages prefer lawful archive recovery and search-indexed reference previews before telling the host to escalate elsewhere
LinkedIn authwall handling prefers compliant public fallbacks such as archives and search-indexed reference metadata before giving up
legal paper fallback for academic URLs: DOI-aware OpenAlex, Unpaywall, Semantic Scholar, CORE discovery, and Europe PMC enrichment plus bioRxiv/medRxiv API fallback when a direct paper page is thin or blocked
site-native public search fallbacks cover Reddit, GitHub repository discovery, npm, Hacker News, Stack Exchange, and Velog when a matching site: query is available
structured output with assessment and pipeline fields for host-side decision making
local-first defaults with privacy-aware search/fetch behavior
useful normalization for common public surfaces instead of raw HTML dumps

Tool routing

smartfetch first when the user already gave a URL or shortlink
smartsearch when the user gave a topic, keywords, or a site: query
smartcrawl when you already know the site and need multiple relevant pages

| Situation | Tool |
| ------------------------------------------------- | ------------- |
| User already gave a URL or shortlink | smartfetch |
| User gave a topic, keywords, or site: query | smartsearch |
| You already know the site and need multiple pages | smartcrawl |
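The routing table can be approximated on the host side with a small helper. This is a minimal sketch, not smart-web's own routing logic: the function name `pick_tool` and the `want_multiple_pages` flag are illustrative assumptions, and shortlinks are treated like any other http(s) URL.

```python
from urllib.parse import urlparse


def pick_tool(user_input: str, want_multiple_pages: bool = False) -> str:
    """Map a raw user request to a smart-web tool name (host-side heuristic)."""
    parsed = urlparse(user_input.strip())
    # Only absolute http(s) URLs count as "the user already gave a URL".
    is_url = parsed.scheme in ("http", "https") and bool(parsed.netloc)
    if is_url and want_multiple_pages:
        return "smartcrawl"   # known site, several relevant pages needed
    if is_url:
        return "smartfetch"   # direct URL or shortlink
    return "smartsearch"      # topic, keywords, or a site: query


if __name__ == "__main__":
    for sample in ("https://example.com/post/1", "site:github.com mcp server"):
        print(sample, "->", pick_tool(sample))
```

A `site:` query has no http(s) scheme, so it falls through to smartsearch, which matches the table.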

smart-web is a retrieval layer. It is not a general-purpose click/type/form-fill browser agent.[^1]

What the tools return

All three tools are shaped for agent consumption. Common high-signal fields include:

assessment: confidence, block/auth hints, recommended handoff
pipeline: resolve/acquire/normalize/assess stages, plus policy metadata for relay-style usage and authwall likelihood
budget: output projection metadata showing which fields were truncated and by how much
errors: structured failure reasons
post, thread, comments: normalized content surfaces
assets and download: file-like outputs when present

This lets hosts decide whether deterministic retrieval was good enough or whether they should escalate to an auth-aware companion flow or a broader browser-task runtime.[^2]
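A host-side decision over these fields might look like the sketch below. Only the top-level names `assessment` and `errors` come from the list above; the inner keys `recommended_handoff` and `confidence`, the numeric confidence scale, and the 0.6 threshold are hypothetical placeholders, so check real tool output before relying on them.

```python
def next_step(result: dict, min_confidence: float = 0.6) -> str:
    """Decide what the host does after a smart-web call.

    Inner key names and the confidence scale are assumptions for illustration.
    """
    if result.get("errors"):
        return "inspect_errors"
    assessment = result.get("assessment") or {}
    handoff = assessment.get("recommended_handoff")  # hypothetical key
    if handoff:
        return f"handoff:{handoff}"
    confidence = assessment.get("confidence")  # hypothetical numeric score
    if isinstance(confidence, (int, float)) and confidence < min_confidence:
        return "escalate"
    return "accept"
```

The order is deliberate: structured errors win, then an explicit handoff recommendation, and only then a confidence threshold chosen by the host.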

For smartfetch, MCP structuredContent always carries the projected JSON result. The human-readable content[0].text defaults to compact text instead of duplicating that JSON payload; set format: "json" only when a host needs JSON repeated in the text channel. format: "markdown" keeps the compact response shape and requests Markdown extraction when available.
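As a rough illustration of the two channels, the sketch below builds an MCP `tools/call` request for smartfetch and splits a result into its structured and text parts. The `url` argument name is an assumption (the tool's input schema is not shown here); `format`, `structuredContent`, and `content[0].text` are taken from the paragraph above.

```python
from typing import Optional, Tuple


def build_smartfetch_call(url: str, fmt: Optional[str] = None, request_id: int = 1) -> dict:
    """Build a JSON-RPC tools/call request; the "url" argument name is assumed."""
    arguments = {"url": url}
    if fmt is not None:
        arguments["format"] = fmt  # "json" or "markdown", per the text above
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": "smartfetch", "arguments": arguments},
    }


def read_result(result: dict) -> Tuple[dict, str]:
    """Return (structured JSON, first human-readable text item) from a tool result."""
    structured = result.get("structuredContent") or {}
    text = ""
    for item in result.get("content", []):
        if item.get("type") == "text":
            text = item.get("text", "")
            break
    return structured, text
```

A host that parses `structuredContent` directly can leave `format` unset and treat the text channel as a compact summary.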

Supported high-signal surfaces

communities: Reddit, DCInside, LinkedIn public posts with public fallback recovery, Algumon
search-native surfaces: Reddit, GitHub repository discovery, npm, Hacker News, Stack Exchange, Velog
competitive-programming surfaces: solved.ac problem/search/profile routes, BOJ problem/workbook/user pages, Codeforces problem/profile pages, AtCoder task/contest pages, QOJ problem pages, Jungol problem pages
local map place pages: Naver Map, Kakao Map, and redirected naver.me place-share links
reference/article pages: NamuWiki, Wikipedia, Naver Blog, Tistory, Velog
media/social: YouTube, X, Threads, Instagram, Telegram
commerce pages: Amazon, Coupang, Danawa, Aladin, AliExpress
archives/reference wrappers: Wayback snapshots, archive.md lookup pages, arXiv abstract pages with direct paper links
academic paper surfaces: arXiv, PubMed, PMC, bioRxiv, medRxiv, plus legal DOI-linked OA copies surfaced from OpenAlex, Unpaywall, Semantic Scholar, CORE, and Europe PMC when a publisher page is thin or paywalled

When a source stays blocked or under-specified, smart-web prefers partial but honest output over pretending to have verified content.

Install

npx -y smart-web-mcp

When smartfetch or smartcrawl needs Playwright-backed page loading, smart-web checks whether the bundled Chromium revision is available. If the local Playwright browser cache is missing or outdated, smart-web warns at startup. On first browser use it tries to run npx playwright install chromium automatically, and it surfaces a structured error only if the browser is still unavailable afterwards.

Host setup

Claude Code

claude mcp add smart-web -- npx -y smart-web-mcp

Codex

codex mcp add smart-web -- npx -y smart-web-mcp

Or via ~/.codex/config.toml (or project-scoped .codex/config.toml):

[mcp_servers.smart-web]
command = "npx"
args = ["-y", "smart-web-mcp"]

OpenCode

{
  "$schema": "https://opencode.ai/config.json",
  "mcp": {
    "smart-web": {
      "type": "local",
      "command": ["npx", "-y", "smart-web-mcp"],
      "enabled": true
    }
  }
}

Custom settings file

To use a non-default settings path, append --settings-file to the command:

npx -y smart-web-mcp --settings-file /absolute/path/to/smart-web.settings.json

For Claude Code:

claude mcp add smart-web -- npx -y smart-web-mcp --settings-file /absolute/path/to/smart-web.settings.json

For OpenCode, append the flag and path to the "command" array:

{
  "mcp": {
    "smart-web": {
      "type": "local",
      "command": ["npx", "-y", "smart-web-mcp", "--settings-file", "/absolute/path/to/smart-web.settings.json"],
      "enabled": true
    }
  }
}
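If you generate host configs from a script, the OpenCode entry can be built programmatically. This is a small sketch: the helper name `opencode_config` is illustrative, and the output mirrors the JSON shown above rather than any official generator.

```python
import json
from typing import Optional


def opencode_config(settings_file: Optional[str] = None) -> dict:
    """Build the OpenCode MCP entry for smart-web, optionally with --settings-file."""
    command = ["npx", "-y", "smart-web-mcp"]
    if settings_file:
        # The flag and its path go into the same "command" array.
        command += ["--settings-file", settings_file]
    return {
        "$schema": "https://opencode.ai/config.json",
        "mcp": {
            "smart-web": {"type": "local", "command": command, "enabled": True}
        },
    }


if __name__ == "__main__":
    print(json.dumps(opencode_config("/absolute/path/to/smart-web.settings.json"), indent=2))
```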

Configuration

Settings path

Default: ~/.config/smart-web/settings.json

Print a template:

npx -y smart-web-mcp --print-settings-example

Initialize the default file:

npx -y smart-web-mcp --init-settings

smart-web treats the settings file as the single runtime config surface.

Full settings reference

{
  // "balanced" (default) or "private"
  // "private" disables relay-style providers and public search helpers
  "profile": "balanced",

  "runtime": {
    // Optional override for staging/export temp files
    // Default: platform cache root (e.g. ~/.cache/smart-web/tmp)
    "tempDir": ""
  },

  "search": {
    // API keys — leave empty to skip that provider
    "exaApiKey": "",
    "braveSearchApiKey": "",
    // Self-hosted SearXNG instance URL
    "searxngBaseUrl": "",
    "enableSearxng": true
  },

  "fetch": {
    // Browser overrides — most users leave these empty
    "chromeChannel": "",
    "chromePath": "",
    // Auto-install Playwright Chromium when missing (default: true)
    "autoInstallPlaywright": true,
    // Jina Reader relay — opt-in third-party path for weak article pages
    "enableJinaReader": false,
    "jinaReaderBaseUrl": "https://r.jina.ai/",
    // Academic fallback — legal OA enrichment for paper URLs
    "enableAcademicFallback": true,
    "enableOpenAlex": true,
    "enableEuropePmc": true,
    "enableBiorxivApi": true,
    // Unpaywall — requires a contact email
    "enableUnpaywall": true,
    "unpaywallEmail": "",
    // Semantic Scholar — optional API key for higher-rate access
    "enableSemanticScholar": true,
    "semanticScholarApiKey": "",
    // CORE — optional API key for richer search-backed enrichment
    "enableCoreDiscovery": true,
    "coreApiKey": "",
    // FxTwitter — transparent x.com → fxtwitter redirect
    "enableFxTwitter": true,
    // Undetected-chromedriver — optional Python Selenium fallback for tough anti-bot pages
    "enableUndetectedChromedriver": true,
    "undetectedChromedriverPython": "python3",
    // Site-specific fetches
    "enableRedditJson": true,
    "enableYoutubeTranscript": true,
    // Archive fallback — Wayback and archive.md recovery
    "enableArchiveFallback": true,
    "enableWayback": true,
    "enableArchiveMd": true,
    // Output projection budget for smartfetch responses
    "outputBudget": {
      "maxPostTextChars": 24000,
      "maxPostMarkdownChars": 32000,
      "maxThreadItems": 40,
      "maxCommentItems": 40,
      "maxOutboundLinks": 120,
      "maxAssets": 30,
      "maxBlocks": 60,
      "maxBlockTextChars": 1000,
      "maxItemTextChars": 1200
    },
    // Optional default named browser profile for authenticated smartfetch calls.
    // Names are 1-64 characters: letters, numbers, dots, underscores, or hyphens;
    // they must start with a letter or number and cannot end with a dot.
    "browserProfile": ""
  },

  "network": {
    // Allow localhost/private/reserved fetch targets (default: false)
    // Applies to smartfetch, smartcrawl, and Korean public API helpers.
    "allowPrivateHosts": false
  }
}
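The browserProfile naming rule in the comments above can be checked before writing the settings file. This is a minimal sketch that assumes "letters" and "numbers" mean ASCII characters; the function name is illustrative, and smart-web's own validation may differ.

```python
import re

# First char: letter or digit. If longer than one char, the last char may be
# a letter, digit, underscore, or hyphen, but not a dot. Total length 1-64.
_PROFILE_NAME = re.compile(r"[A-Za-z0-9](?:[A-Za-z0-9._-]{0,62}[A-Za-z0-9_-])?")


def is_valid_profile_name(name: str) -> bool:
    """Return True when name satisfies the documented browserProfile rule."""
    return _PROFILE_NAME.fullmatch(name) is not None
```

For example, `work` and `team.v2_x-1` pass, while `.hidden`, `team.`, and any name longer than 64 characters are rejected.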

Common setups

Default — no config file needed

Out of the box smart-web works with sensible defaults. Create a settings file only when you need to change something.

Minimal: profile only

{
  "profile": "balanced"
}

Private / local-first

{
  "profile": "private",
  "search": {
    "searxngBaseUrl": "http://localhost:8080"
  }
}

In private mode, relay-style provider requests are blocked, Exa and Brave default off unless you explicitly re-enable them, public no-key search fallbacks are disabled, and SearXNG bases must resolve to localhost or private addresses.
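To pre-check a SearXNG base URL against the private-mode rule, a host-side sketch could look like this. It is an approximation under stated assumptions: the function name is hypothetical, literal IPs are checked directly, other hostnames are resolved with the system resolver, and smart-web's own check may classify addresses differently.

```python
import ipaddress
import socket
from urllib.parse import urlparse


def is_local_or_private(base_url: str) -> bool:
    """Return True if every address behind base_url is loopback or private."""
    host = urlparse(base_url).hostname
    if not host:
        return False
    try:
        addrs = [ipaddress.ip_address(host)]  # literal IP, no DNS needed
    except ValueError:
        try:
            infos = socket.getaddrinfo(host, None)
        except socket.gaierror:
            return False  # unresolvable names are treated as not local
        # Drop any IPv6 scope id ("%eth0") before parsing.
        addrs = [ipaddress.ip_address(info[4][0].split("%")[0]) for info in infos]
    return bool(addrs) and all(a.is_loopback or a.is_private for a in addrs)
```

Requiring every resolved address to be local is the conservative choice: a name that resolves to both a private and a public address is rejected.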

Private with Exa allowed back in

{
  "profile": "private",
  "search": {
    "exaApiKey": "...",
    "enableExa": true
  }
}

With Exa and Brave API keys

{
  "profile": "balanced",
  "search": {
    "exaApiKey": "exa-...",
    "braveSearchApiKey": "BSA-..."
  }
}

Disable Playwright auto-install

{
  "fetch": {
    "autoInstallPlaywright": false
  }
}

With auto-install disabled, the server returns an actionable playwright_browsers_missing or playwright_browsers_outdated error instead of attempting the install.

Opt in to Jina Reader for weak generic article pages

{
  "fetch": {
    "enableJinaReader": true
  }
}

This stays opt-in because it is a third-party relay path. When enabled, it only applies to weak generic/article direct fetches and prefers Jina JSON mode when that improves the page. profile: "private" hard-disables relay-style providers.

Relay eligibility is provider-policy based. Generic and document-like providers can opt in, while browser/auth/social/commerce providers stay off unless their provider declares support. pipeline.policy.relay_style_allowed and pipeline.policy.relay_style_used tell hosts whether a relay path was allowed and whether it was used.
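A host can read the two policy fields with a few lines. This sketch assumes both values are booleans nested under `pipeline.policy`, as the field paths above suggest; the returned labels are illustrative.

```python
def relay_status(result: dict) -> str:
    """Summarize relay policy from pipeline.policy (boolean values assumed)."""
    policy = (result.get("pipeline") or {}).get("policy") or {}
    allowed = bool(policy.get("relay_style_allowed"))
    used = bool(policy.get("relay_style_used"))
    if used:
        return "relay_used"
    return "relay_allowed_not_used" if allowed else "relay_not_allowed"
```

This is useful for audit logging, for example to confirm that no third-party relay was used under profile: "private".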

Tune smartfetch output budgets

{
  "fetch": {
    "outputBudget": {
      "maxPostTextChars": 16000,
      "maxThreadItems": 25,
      "maxOutboundLinks": 80
    }
  }
}

Extraction runs before budgeting so providers can work from complete page content. The budget applies to returned smartfetch projections and records truncation under budget.fields.
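The idea of projecting after extraction and recording what was cut can be sketched as follows. This is not smart-web's implementation: the function name, the `post.text` and `thread` keys, and the shape of each `budget.fields` entry are hypothetical. Only the setting names and their defaults (24000 and 40) come from the settings reference.

```python
def apply_budget(post_text: str, thread: list, budget: dict) -> dict:
    """Truncate already-extracted content and record each cut under budget.fields."""
    max_chars = budget.get("maxPostTextChars", 24000)
    max_items = budget.get("maxThreadItems", 40)
    fields = {}  # hypothetical metadata shape: original size vs. kept size
    if len(post_text) > max_chars:
        fields["post.text"] = {"original_length": len(post_text), "kept": max_chars}
        post_text = post_text[:max_chars]
    if len(thread) > max_items:
        fields["thread"] = {"original_length": len(thread), "kept": max_items}
        thread = thread[:max_items]
    return {
        "post": {"text": post_text},
        "thread": thread,
        "budget": {"fields": fields},
    }
```

Because truncation is explicit, a host can tell a short page from a long page that was cut and decide whether to request a larger budget.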

Force archive fallback off

{
  "fetch": {
    "enableArchiveFallback": false
  }
}

Legal OA DOI recovery for paywalled paper pages

{
  "fetch": {
    "enableUnpaywall": true,
    "unpaywallEmail": "you@example.com",
    "enableSemanticScholar": true,
    "enableCoreDiscovery": true
  }
}

Enable undetected-chromedriver from a dedicated virtualenv

{
  "fetch": {
    "enableUndetectedChromedriver": true,
    "undetectedChromedriverPython": "/absolute/path/to/venv/bin/python"
  }
}

This helper is optional. smart-web still works without it, but when installed and reachable it can act as a second browser engine for stubborn pages where Playwright alone is not enough.

Custom temp directory

{
  "runtime": {
    "tempDir": "/absolute/path/to/smart-web-tmp"
  }
}

Advanced overrides

These keys live inside settings.json when you need to force one provider on or off:

Search

search.enableExa
search.enableBrave
search.enableSearxng
search.enableDuckDuckGo
search.enableBraveHtml

Fetch

fetch.enableFxTwitter
fetch.enableXOembed
fetch.autoInstallPlaywright
fetch.enableJinaReader
fetch.jinaReaderBaseUrl
fetch.enableAcademicFallback
fetch.enableOpenAlex
fetch.enableEuropePmc
fetch.enableBiorxivApi
fetch.enableUnpaywall
fetch.unpaywallEmail
fetch.enableSemanticScholar
fetch.semanticScholarApiKey
fetch.enableCoreDiscovery
fetch.coreApiKey
fetch.enableUndetectedChromedriver
fetch.undetectedChromedriverPython
fetch.enableRedditJson
fetch.enableYoutubeTranscript
fetch.enableArchiveFallback
fetch.enableWayback
fetch.enableArchiveMd
fetch.outputBudget.maxPostTextChars
fetch.outputBudget.maxPostMarkdownChars
fetch.outputBudget.maxThreadItems
fetch.outputBudget.maxCommentItems
fetch.outputBudget.maxOutboundLinks
fetch.outputBudget.maxAssets
fetch.outputBudget.maxBlocks
fetch.outputBudget.maxBlockTextChars
fetch.outputBudget.maxItemTextChars

Compatibility

network.localOnly: advanced override that forces local-only behavior regardless of profile

Quick verification

Sanity-check the server with a few real calls:

smartfetch on a normal article URL
smartfetch on a Medium or other member-only article URL — confirm it returns either archive-backed content or an honest reference_only preview
smartfetch on a Naver Map, Kakao Map, or naver.me place-share URL
smartfetch on an arXiv abstract URL
smartfetch on a PubMed, PMC, or bioRxiv/medRxiv paper URL
smartfetch on a known product URL
smartsearch on a site: query
smartcrawl on a docs site or board you actually use

Reporting issues

If smart-web returns reproducible incorrect or misleading output, open a GitHub issue with:

exact input and tool arguments
observed output
expected behavior
smart-web-mcp version

Skip obvious transient network, auth, or rate-limit failures unless the classification itself looks wrong.

License

This software is licensed under a proprietary license that permits installation and use as a local MCP server but prohibits modification, redistribution, reverse engineering, or use in competing products.

References

[^1]: Model Context Protocol, "Tools" specification — MCP tools are a retrieval and integration surface, not a requirement to emulate full browser-automation flows inside one tool.

[^2]: Model Context Protocol Blog, "Tool Annotations as Risk Vocabulary: What Hints Can and Can't Do" (2026-03-16) — structured tool output and accurate hints improve host-side routing and escalation decisions.