smart-web-mcp
v0.15.2
Published
Portable MCP server for web search, direct-URL fetch, site crawling, and docs export.
Maintainers
Readme
smart-web
Local MCP for agentic web retrieval.
smart-web bundles three tools in one stdio server:
smartfetchfor direct URLssmartsearchfor discoverysmartcrawlfor same-site multi-page traversal
It is designed for agents and MCP hosts, not for manual browser automation. The job is to return structured retrieval output first, then signal when a browser-task runtime would be a better next step.
If a page is login-gated, keep the login UI, session persistence, and capture flow in a companion runtime instead of moving that stateful behavior into the smart-web core.
- npm: https://www.npmjs.com/package/smart-web-mcp
- MCP Registry:
io.github.jojo-labs/smart-web - Issues: https://github.com/jojo-labs/smart-web/issues
If smart-web returns reproducible incorrect or misleading output, open a GitHub issue with the exact input, tool arguments, observed output, expected behavior, and version. Skip obvious transient network, auth, or rate-limit failures unless the smart-web classification itself looks wrong.
Highlights
- one local MCP instead of separate search, fetch, and crawl servers
- explicit routing: URL →
smartfetch, query →smartsearch, known site →smartcrawl - up-front smartfetch acquisition lanes instead of broad direct → browser retry chains
- optional Jina Reader relay for weak generic/article fetches when you explicitly enable it, with JSON mode and alternate-link preservation
- shared document extraction for article-like pages keeps browser primary content separate from raw shell HTML, preserving headings, links, blocks, and Markdown when requested
- budgeted
smartfetchprojection records explicit truncation metadata instead of scattering silent hardcoded caps - weak generic pages can recover through JSON-LD or Next.js payload extraction and same-origin RSS/Atom feed discovery before falling back to a handoff
- paywalled article pages prefer lawful archive recovery and search-indexed reference previews before telling the host to escalate elsewhere
- LinkedIn authwall handling prefers compliant public fallbacks such as archives and search-indexed reference metadata before giving up
- legal paper fallback for academic URLs: DOI-aware OpenAlex, Unpaywall, Semantic Scholar, CORE discovery, and Europe PMC enrichment plus bioRxiv/medRxiv API fallback when a direct paper page is thin or blocked
- site-native public search fallbacks cover Reddit, GitHub repository discovery, npm, Hacker News, Stack Exchange, and Velog when a matching
site:query is available - structured output with
assessmentandpipelinefields for host-side decision making - local-first defaults with privacy-aware search/fetch behavior
- useful normalization for common public surfaces instead of raw HTML dumps
Tool routing
smartfetchfirst when the user already gave a URL or shortlinksmartsearchwhen the user gave a topic, keywords, or asite:querysmartcrawlwhen you already know the site and need multiple relevant pages
| Situation | Tool |
| ------------------------------------------------- | ------------- |
| User already gave a URL or shortlink | smartfetch |
| User gave a topic, keywords, or site: query | smartsearch |
| You already know the site and need multiple pages | smartcrawl |
smart-web is a retrieval layer. It is not a general-purpose click/type/form-fill browser agent.[^1]
What the tools return
All three tools are shaped for agent consumption. Common high-signal fields include:
assessment: confidence, block/auth hints, recommended handoffpipeline: resolve/acquire/normalize/assess stages, plus policy metadata for relay-style usage and authwall likelihoodbudget: output projection metadata showing which fields were truncated and by how mucherrors: structured failure reasonspost,thread,comments: normalized content surfacesassetsanddownload: file-like outputs when present
This lets hosts decide whether deterministic retrieval was good enough or whether they should escalate to an auth-aware companion flow or a broader browser-task runtime.[^2]
For smartfetch, MCP structuredContent always carries the projected JSON result. The human-readable content[0].text defaults to compact text instead of duplicating that JSON payload; set format: "json" only when a host needs JSON repeated in the text channel. format: "markdown" keeps the compact response shape and requests Markdown extraction when available.
Supported high-signal surfaces
- communities: Reddit, DCInside, LinkedIn public posts with public fallback recovery, Algumon
- search-native surfaces: Reddit, GitHub repository discovery, npm, Hacker News, Stack Exchange, Velog
- competitive-programming surfaces: solved.ac problem/search/profile routes, BOJ problem/workbook/user pages, Codeforces problem/profile pages, AtCoder task/contest pages, QOJ problem pages, Jungol problem pages
- local map place pages: Naver Map, Kakao Map, and redirected
naver.meplace-share links - reference/article pages: NamuWiki, Wikipedia, Naver Blog, Tistory, Velog
- media/social: YouTube, X, Threads, Instagram, Telegram
- commerce pages: Amazon, Coupang, Danawa, Aladin, AliExpress
- archives/reference wrappers: Wayback snapshots, archive.md lookup pages, arXiv abstract pages with direct paper links
- academic paper surfaces: arXiv, PubMed, PMC, bioRxiv, medRxiv, plus legal DOI-linked OA copies surfaced from OpenAlex, Unpaywall, Semantic Scholar, CORE, and Europe PMC when a publisher page is thin or paywalled
When a source stays blocked or under-specified, smart-web prefers partial but honest output over pretending to have verified content.
Install
npx -y smart-web-mcpWhen smartfetch or smartcrawl needs Playwright-backed page loading, smart-web checks whether the bundled Chromium revision is available. If the local Playwright browser cache is missing or outdated, smart-web warns at startup and will try npx playwright install chromium automatically on first browser use before surfacing a structured error.
Host setup
Claude Code
claude mcp add smart-web -- npx -y smart-web-mcpCodex
codex mcp add smart-web -- npx -y smart-web-mcpOr via ~/.codex/config.toml (or project-scoped .codex/config.toml):
[mcp_servers.smart-web]
command = "npx"
args = ["-y", "smart-web-mcp"]OpenCode
{
"$schema": "https://opencode.ai/config.json",
"mcp": {
"smart-web": {
"type": "local",
"command": ["npx", "-y", "smart-web-mcp"],
"enabled": true
}
}
}Custom settings file
To use a non-default settings path, append --settings-file to the command:
npx -y smart-web-mcp --settings-file /absolute/path/to/smart-web.settings.jsonFor Claude Code:
claude mcp add smart-web -- npx -y smart-web-mcp --settings-file /absolute/path/to/smart-web.settings.jsonFor OpenCode, add "args" after "command":
{
"mcp": {
"smart-web": {
"type": "local",
"command": ["npx", "-y", "smart-web-mcp", "--settings-file", "/absolute/path/to/smart-web.settings.json"],
"enabled": true
}
}
}Configuration
Settings path
Default: ~/.config/smart-web/settings.json
Print a template:
npx -y smart-web-mcp --print-settings-exampleInitialize the default file:
npx -y smart-web-mcp --init-settingssmart-web treats the settings file as the single runtime config surface.
Full settings reference
{
// "balanced" (default) or "private"
// "private" disables relay-style providers and public search helpers
"profile": "balanced",
"runtime": {
// Optional override for staging/export temp files
// Default: platform cache root (e.g. ~/.cache/smart-web/tmp)
"tempDir": ""
},
"search": {
// API keys — leave empty to skip that provider
"exaApiKey": "",
"braveSearchApiKey": "",
// Self-hosted SearXNG instance URL
"searxngBaseUrl": "",
"enableSearxng": true
},
"fetch": {
// Browser overrides — most users leave these empty
"chromeChannel": "",
"chromePath": "",
// Auto-install Playwright Chromium when missing (default: true)
"autoInstallPlaywright": true,
// Jina Reader relay — opt-in third-party path for weak article pages
"enableJinaReader": false,
"jinaReaderBaseUrl": "https://r.jina.ai/",
// Academic fallback — legal OA enrichment for paper URLs
"enableAcademicFallback": true,
"enableOpenAlex": true,
"enableEuropePmc": true,
"enableBiorxivApi": true,
// Unpaywall — requires a contact email
"enableUnpaywall": true,
"unpaywallEmail": "",
// Semantic Scholar — optional API key for higher-rate access
"enableSemanticScholar": true,
"semanticScholarApiKey": "",
// CORE — optional API key for richer search-backed enrichment
"enableCoreDiscovery": true,
"coreApiKey": "",
// FxTwitter — transparent x.com → fxtwitter redirect
"enableFxTwitter": true,
// Undetected-chromedriver — optional Python Selenium fallback for tough anti-bot pages
"enableUndetectedChromedriver": true,
"undetectedChromedriverPython": "python3",
// Site-specific fetches
"enableRedditJson": true,
"enableYoutubeTranscript": true,
// Archive fallback — Wayback and archive.md recovery
"enableArchiveFallback": true,
"enableWayback": true,
"enableArchiveMd": true,
// Output projection budget for smartfetch responses
"outputBudget": {
"maxPostTextChars": 24000,
"maxPostMarkdownChars": 32000,
"maxThreadItems": 40,
"maxCommentItems": 40,
"maxOutboundLinks": 120,
"maxAssets": 30,
"maxBlocks": 60,
"maxBlockTextChars": 1000,
"maxItemTextChars": 1200
},
// Optional default named browser profile for authenticated smartfetch calls.
// Names are 1-64 characters: letters, numbers, dots, underscores, or hyphens;
// they must start with a letter or number and cannot end with a dot.
"browserProfile": ""
},
"network": {
// Allow localhost/private/reserved fetch targets (default: false)
// Applies to smartfetch, smartcrawl, and Korean public API helpers.
"allowPrivateHosts": false
}
}Common setups
Default — no config file needed
Out of the box smart-web works with sensible defaults. Create a settings file only when you need to change something.
Minimal: profile only
{
"profile": "balanced"
}Private / local-first
{
"profile": "private",
"search": {
"searxngBaseUrl": "http://localhost:8080"
}
}In private mode, relay-style provider requests are blocked, Exa and Brave default off unless you explicitly re-enable them, public no-key search fallbacks are disabled, and SearXNG bases must resolve to localhost or private addresses.
Private with Exa allowed back in
{
"profile": "private",
"search": {
"exaApiKey": "...",
"enableExa": true
}
}With Exa and Brave API keys
{
"profile": "balanced",
"search": {
"exaApiKey": "exa-...",
"braveSearchApiKey": "BSA-..."
}
}Disable Playwright auto-install
{
"fetch": {
"autoInstallPlaywright": false
}
}The server will return an actionable playwright_browsers_missing or playwright_browsers_outdated error instead.
Opt in to Jina Reader for weak generic article pages
{
"fetch": {
"enableJinaReader": true
}
}This stays opt-in because it is a third-party relay path. When enabled, it only applies to weak generic/article direct fetches and prefers Jina JSON mode when that improves the page. profile: "private" hard-disables relay-style providers.
Relay eligibility is provider-policy based. Generic and document-like providers can opt in, while browser/auth/social/commerce providers stay off unless their provider declares support. pipeline.policy.relay_style_allowed and pipeline.policy.relay_style_used tell hosts whether a relay path was allowed and whether it was used.
Tune smartfetch output budgets
{
"fetch": {
"outputBudget": {
"maxPostTextChars": 16000,
"maxThreadItems": 25,
"maxOutboundLinks": 80
}
}
}Extraction runs before budgeting so providers can work from complete page content. The budget applies to returned smartfetch projections and records truncation under budget.fields.
Force archive fallback off
{
"fetch": {
"enableArchiveFallback": false
}
}Legal OA DOI recovery for paywalled paper pages
{
"fetch": {
"enableUnpaywall": true,
"unpaywallEmail": "[email protected]",
"enableSemanticScholar": true,
"enableCoreDiscovery": true
}
}Enable undetected-chromedriver from a dedicated virtualenv
{
"fetch": {
"enableUndetectedChromedriver": true,
"undetectedChromedriverPython": "/absolute/path/to/venv/bin/python"
}
}This helper is optional. smart-web still works without it, but when installed and reachable it can act as a second browser engine for stubborn pages where Playwright alone is not enough.
Custom temp directory
{
"runtime": {
"tempDir": "/absolute/path/to/smart-web-tmp"
}
}Advanced overrides
These keys live inside settings.json when you need to force one provider on or off:
Search
search.enableExasearch.enableBravesearch.enableSearxngsearch.enableDuckDuckGosearch.enableBraveHtml
Fetch
fetch.enableFxTwitterfetch.enableXOembedfetch.autoInstallPlaywrightfetch.enableJinaReaderfetch.jinaReaderBaseUrlfetch.enableAcademicFallbackfetch.enableOpenAlexfetch.enableEuropePmcfetch.enableBiorxivApifetch.enableUnpaywallfetch.unpaywallEmailfetch.enableSemanticScholarfetch.semanticScholarApiKeyfetch.enableCoreDiscoveryfetch.coreApiKeyfetch.enableUndetectedChromedriverfetch.undetectedChromedriverPythonfetch.enableRedditJsonfetch.enableYoutubeTranscriptfetch.enableArchiveFallbackfetch.enableWaybackfetch.enableArchiveMdfetch.outputBudget.maxPostTextCharsfetch.outputBudget.maxPostMarkdownCharsfetch.outputBudget.maxThreadItemsfetch.outputBudget.maxCommentItemsfetch.outputBudget.maxOutboundLinksfetch.outputBudget.maxAssetsfetch.outputBudget.maxBlocksfetch.outputBudget.maxBlockTextCharsfetch.outputBudget.maxItemTextChars
Compatibility
network.localOnly: advanced override that forces local-only behavior regardless of profile
Quick verification
Sanity-check the server with a few real calls:
smartfetchon a normal article URLsmartfetchon a Medium or other member-only article URL — confirm it returns either archive-backed content or an honestreference_onlypreviewsmartfetchon a Naver Map, Kakao Map, ornaver.meplace-share URLsmartfetchon an arXiv abstract URLsmartfetchon a PubMed, PMC, or bioRxiv/medRxiv paper URLsmartfetchon a known product URLsmartsearchon asite:querysmartcrawlon a docs site or board you actually use
Reporting issues
If smart-web returns reproducible incorrect or misleading output, open a GitHub issue with:
- exact input and tool arguments
- observed output
- expected behavior
smart-web-mcpversion
Skip obvious transient network, auth, or rate-limit failures unless the classification itself looks wrong.
License
All rights reserved. See LICENSE for terms.
This software is licensed under a proprietary license that permits installation and use as a local MCP server but prohibits modification, redistribution, reverse engineering, or use in competing products.
References
[^1]: Model Context Protocol, "Tools" specification — MCP tools are a retrieval and integration surface, not a requirement to emulate full browser-automation flows inside one tool.
[^2]: Model Context Protocol Blog, "Tool Annotations as Risk Vocabulary: What Hints Can and Can't Do" (2026-03-16) — structured tool output and accurate hints improve host-side routing and escalation decisions.
