webskim

v1.8.0

Published

12 days ago

Context-efficient web search and reading for AI agents. MCP server powered by Jina AI.

ciborro

mcp model-context-protocol jina search web reader ai claude

Built-in WebFetch dumps entire pages into context. One page = thousands of tokens gone.

webskim saves pages to disk and returns a table of contents. Your agent reads only what it needs.

Prerequisites

webskim uses Jina AI APIs under the hood — you need a Jina API key to use it.

Get your free API key at jina.ai — 1M tokens included, no credit card required.

Quick Start

Claude Code — add to .mcp.json in your project:

{
  "mcpServers": {
    "webskim": {
      "command": "npx",
      "args": ["-y", "webskim"],
      "env": { "JINA_API_KEY": "jina_..." }
    }
  }
}

Tip: Keep your key in a .env file instead of hardcoding it in .mcp.json:

# .env (gitignored)
JINA_API_KEY=jina_...

Reference the variable in .mcp.json:

"env": { "JINA_API_KEY": "${JINA_API_KEY}" }

Then launch Claude Code with the env loaded:

alias c='set -a; source .env 2>/dev/null; set +a; claude'

Claude Desktop — add to claude_desktop_config.json:

{
  "mcpServers": {
    "webskim": {
      "command": "npx",
      "args": ["-y", "webskim"],
      "env": {
        "JINA_API_KEY": "jina_...",
        "WEBSKIM_CACHE_DIR": "/Users/you/.webskim-pages"
      }
    }
  }
}

Desktop notes: Claude Desktop launches MCP servers with cwd /, so the default cache dir (<cwd>/.ai_pages) is not writable — always set WEBSKIM_CACHE_DIR to a writable absolute path. Claude Desktop also has no Read tool, so use inline: true (optionally with head_lines) to get page content back directly; the default file-path + TOC response is designed for agentic clients like Claude Code.

Cursor / Windsurf / other MCP clients — same pattern, point at npx -y webskim with JINA_API_KEY in env.

How It Works

Agent: webskim_search("react server components")
  → 5 results: title, URL, snippet (minimal tokens)

Agent: webskim_read("https://react.dev/reference/rsc/server-components")
  → Saved: .ai_pages/20260220_143052_react_dev__reference__rsc.md
  → Lines: 342 | ~2800 tokens
  → Table of Contents:
      L1:   # Server Components
      L18:  ## Reference
      L45:  ## Usage
      L89:  ### Fetching data
      L156: ### Streaming

Agent: Read(".ai_pages/..._rsc.md", offset=89, limit=67)
  → reads only the section it needs

No full pages in context. No wasted tokens. The agent decides what to read.
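
The Read call above uses offset=89 and limit=67 because the section starts at L89 and the next heading sits at L156. Below is a minimal Python sketch of that arithmetic, assuming the TOC has already been parsed into (line, heading) pairs. `section_range` is a hypothetical helper for illustration, not part of webskim.

```python
from typing import List, Optional, Tuple

TocEntry = Tuple[int, str]  # (1-based line number, heading text)


def section_range(toc: List[TocEntry], heading: str, total_lines: int) -> Optional[Tuple[int, int]]:
    """Return (offset, limit) covering one TOC section, or None if not found.

    The section runs from its heading line up to, but not including,
    the next TOC entry (or through the end of the file for the last entry).
    """
    for idx, (line, text) in enumerate(toc):
        if text == heading:
            if idx + 1 < len(toc):
                end = toc[idx + 1][0]
            else:
                end = total_lines + 1
            return line, end - line
    return None


toc = [
    (1, "# Server Components"),
    (18, "## Reference"),
    (45, "## Usage"),
    (89, "### Fetching data"),
    (156, "### Streaming"),
]

print(section_range(toc, "### Fetching data", total_lines=342))  # (89, 67)
print(section_range(toc, "### Streaming", total_lines=342))      # (156, 187)
```

This sketch stops at the next TOC entry regardless of heading level, so a parent section such as "## Usage" would exclude its subsections. An agent that wants the whole subtree would need to skip ahead to the next heading of the same or a higher level.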

Tools

| Tool | What it does |
|------|-------------|
| webskim_search | Web search → titles, URLs, snippets |
| webskim_read | Fetch URL/PDF → save as markdown, return TOC |
| webskim_grep | Regex search in a saved page → matching lines + context |

webskim_search

| Param | Description |
|-------|-------------|
| query | Search query |
| num_results | 1–10 (default 5) |
| site | Restrict to domain, e.g. "python.org" |
| country | Locale code, e.g. "US", "PL" |
| format | Default markdown. json returns {results:[{i,title,url,snippet,host}]} (preferred for weak models) |

webskim_read

| Param | Description |
|-------|-------------|
| url | Page or PDF URL |
| inline | true returns markdown directly (default false → file path + TOC) |
| head_lines | With inline: true, return only the first N lines |
| target_selector | CSS: extract only this element |
| remove_selector | CSS: drop these elements (overrides default chrome stripper). Empty string "" opts out of the default. |
| include_images | Default false. Keep <img> tags. Default off saves 30–70% tokens on news pages. |
| links | Default referenced. How to render links: referenced = footer notation, discarded = plain text, inline = full markdown |
| max_tokens | Server-side truncation (saves context) |
| no_cache | Force a fresh fetch: bypasses both the local webskim cache and Jina's cache (see Caching) |

Defaults note: webskim removes site chrome (nav/footer/aside/ads/cookie banners) before extraction. If the content you want sits inside one of those elements (e.g. your target_selector is aside.article-content), pass an explicit remove_selector to override the default.

Inline mode

For small pages or "give me the top" lookups you can skip the follow-up Read call and get the markdown back directly. The page is still saved to disk so you can fall back to Read(file, offset, limit) if head_lines truncated.

Agent: webskim_read("https://example.com/short", inline=true)
  → **Title**
    <full markdown content>

Agent: webskim_read("https://big-doc.com", inline=true, head_lines=80)
  → **Title**
    <first 80 lines>
    --- Showing 80/420 lines. Full file: .ai_pages/..._big_doc_com.md

# now decide whether to Read more from the file or move on

head_lines requires inline: true (otherwise the saved file would not match what was returned). Lines are 1-indexed and include the title header, so they line up with Read tool offsets.
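
The truncation described above can be sketched in a few lines of Python. This is an illustration of the documented behaviour under stated assumptions, not webskim's actual code: the helper name `head` and the path `.ai_pages/example.md` are hypothetical, and the footer wording is copied from the example output above.

```python
from typing import List


def head(lines: List[str], head_lines: int, file_path: str) -> str:
    """Return the first `head_lines` lines, plus a footer if anything was cut."""
    if head_lines >= len(lines):
        return "\n".join(lines)
    shown = lines[:head_lines]
    footer = f"--- Showing {head_lines}/{len(lines)} lines. Full file: {file_path}"
    return "\n".join(shown + [footer])


page = [f"line {n}" for n in range(1, 421)]  # stand-in for a 420-line page
out = head(page, 80, ".ai_pages/example.md")
print(out.splitlines()[-1])
# Lines 1-80 were shown, so a follow-up Read would start at offset=81.
```

Because the returned lines are the first N lines of the saved file, the footer numbers translate directly into a Read offset without any adjustment.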

webskim_grep

Regex search in a page already saved by webskim_read. Use it when you're after a specific term (API name, version number, quoted fragment) instead of guessing which TOC section holds it.

| Param | Description |
|-------|-------------|
| file_path | Path returned by webskim_read. Must be inside the cache directory. |
| pattern | Regex (ECMAScript), e.g. '\\bversion\\s+\\d+' |
| case_sensitive | Default false |
| context_lines | 0–10 lines of context around each match (default 2) |
| max_matches | Truncate beyond this many matches (default 50, max 200) |

Output: L<n>: blocks with > marking the matching line; line numbers match Read tool offsets on the same file. Paths outside the cache dir and files over 10 MB are rejected.
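
To make the parameters concrete, here is a rough Python sketch of a grep with context lines and 1-based line numbers. It is an approximation under assumptions: the exact block layout webskim emits is not specified here, so the `> L3: ...` rendering is a guess, `grep_lines` is a hypothetical name, and Python's `re` syntax differs from ECMAScript in some features.

```python
import re
from typing import List


def grep_lines(lines: List[str], pattern: str, case_sensitive: bool = False,
               context_lines: int = 2, max_matches: int = 50) -> List[str]:
    """Return one text block per match, with 1-based line numbers."""
    flags = 0 if case_sensitive else re.IGNORECASE
    regex = re.compile(pattern, flags)
    blocks: List[str] = []
    for idx, line in enumerate(lines):
        if not regex.search(line):
            continue
        if len(blocks) == max_matches:
            break  # truncate beyond max_matches
        start = max(0, idx - context_lines)
        end = min(len(lines), idx + context_lines + 1)
        rows = []
        for j in range(start, end):
            marker = ">" if j == idx else " "  # ">" marks the matching line
            rows.append(f"{marker} L{j + 1}: {lines[j]}")
        blocks.append("\n".join(rows))
    return blocks


page = ["# Changelog", "", "Released version 18 today.", "", "Older notes."]
for block in grep_lines(page, r"\bversion\s+\d+", context_lines=1):
    print(block)
```

The line numbers in each block are 1-based positions in the saved file, which is what lets an agent pass them straight to Read as an offset.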

Caching

webskim_read keeps a manifest (.manifest.json inside the cache dir). A repeat read of the same URL with the same content options within the TTL is served from disk — no Jina call, no tokens billed. The response ends with a CACHED: fetched <age> ago line so the agent knows the content may be stale.

- WEBSKIM_CACHE_TTL: TTL in seconds (default 86400, i.e. 24h). 0 disables the local cache.
- no_cache: true on webskim_read: force a fresh fetch (bypasses both the local cache and Jina's cache).
- Only content-affecting options (target_selector, remove_selector, max_tokens, include_images, links) participate in the cache key; inline/head_lines are presentation-only, so both display modes share one cached file.
- Deleting a cached file is safe: the next read re-fetches it. A corrupt manifest is archived and rebuilt automatically.
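
The cache-key and TTL rules above can be illustrated with a short Python sketch. Everything here is an assumption made for illustration: the manifest format and the real key derivation are not documented in this README, and `cache_key` and `is_fresh` are hypothetical names.

```python
import hashlib
import json
import time
from typing import Any, Dict, Optional

# Options that change the fetched content, as listed above.
CONTENT_OPTIONS = ("target_selector", "remove_selector", "max_tokens", "include_images", "links")


def cache_key(url: str, options: Dict[str, Any]) -> str:
    """Hash the URL plus only the content-affecting options."""
    relevant = {name: options[name] for name in CONTENT_OPTIONS if name in options}
    payload = json.dumps({"url": url, "options": relevant}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def is_fresh(fetched_at: float, ttl_seconds: int, now: Optional[float] = None) -> bool:
    """A TTL of 0 disables the cache; otherwise entries expire after ttl_seconds."""
    if ttl_seconds <= 0:
        return False
    current = time.time() if now is None else now
    return current - fetched_at < ttl_seconds


a = cache_key("https://example.com", {"links": "referenced", "inline": True})
b = cache_key("https://example.com", {"links": "referenced", "head_lines": 80})
c = cache_key("https://example.com", {"links": "discarded"})
print(a == b)  # True: inline/head_lines do not affect the key
print(a == c)  # False: links is a content-affecting option
```

This sketch treats an omitted option and its explicit default as different keys. Whether webskim normalizes defaults before hashing is not stated here.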

Output Contract

Stable, versioned shape of what each tool returns. Consumers (gateways, weak models) should rely on these guarantees rather than re-parsing free text.

webskim_search

format: "json" (preferred for programmatic consumers) — a single text block containing valid JSON with this exact schema:
```
{
  "results": [
    { "i": 1, "title": "string", "url": "string", "snippet": "string", "host": "string" }
  ]
}
```
Field semantics:
- i — 1-based result index.
- title, url — as returned by the source.
- snippet — source description; always a string, "" when the source provides none (the key is never omitted).
- host — hostname parsed from url ("" if url is unparseable).
- No hits → { "results": [] }.
format: "markdown" (default) — compact text, one result per block: [i] title / url / snippet. Intended for direct human/agent reading, not machine parsing. No hits → the literal string No results found.
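
A consumer of the JSON format might load it into typed records like this. This is a minimal Python sketch: the `SearchResult` class and `parse_search` function are hypothetical names, and the sample payload is made up to match the schema above.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass
class SearchResult:
    i: int
    title: str
    url: str
    snippet: str
    host: str


def parse_search(text: str) -> List[SearchResult]:
    """Parse the text block returned for format="json" into typed results."""
    data = json.loads(text)
    return [SearchResult(**item) for item in data["results"]]


# Illustrative payload; real results come from the webskim_search tool.
sample = (
    '{"results": [{"i": 1, "title": "Server Components", '
    '"url": "https://react.dev/reference/rsc/server-components", '
    '"snippet": "", "host": "react.dev"}]}'
)
results = parse_search(sample)
print(results[0].host)                  # react.dev
print(parse_search('{"results": []}'))  # []
```

`SearchResult(**item)` raises if a future version adds a field, which makes schema drift visible early. A more tolerant consumer could pick out only the five known keys.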

webskim_read

Default mode (inline: false) returns a text block:

**<title>**
File: <path>
Lines: <n> | ~<tokens> tokens (estimate)

**Table of Contents:**
L<n>: <heading>
...

Use Read tool on the file path above to view content. ...

- File: <path> is relative to the current working directory when the cache lives under cwd (the default <cwd>/.ai_pages), so the response never leaks an absolute home path that a sandboxed client may not share. When WEBSKIM_CACHE_DIR points outside cwd, the absolute path is returned (the client configured that location). The path is always resolvable by the client's Read tool.
- inline: true returns the markdown content directly; with head_lines: N it is truncated to the first N lines plus a footer pointing at the saved file (same path rules as above).
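
A gateway that needs the file path and TOC from the default-mode text block could extract them along these lines. This is a sketch that assumes the block keeps exactly the shape shown above; `ReadSummary` and `parse_read_response` are hypothetical names, and the sample title is made up.

```python
import re
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class ReadSummary:
    title: str = ""
    path: str = ""
    lines: int = 0
    tokens: int = 0
    toc: List[Tuple[int, str]] = field(default_factory=list)


def parse_read_response(text: str) -> ReadSummary:
    """Extract title, file path, size, and TOC from a default-mode response."""
    summary = ReadSummary()
    for raw in text.splitlines():
        line = raw.strip()
        file_match = re.fullmatch(r"File: (.+)", line)
        size_match = re.fullmatch(r"Lines: (\d+) \| ~(\d+) tokens.*", line)
        toc_match = re.fullmatch(r"L(\d+):\s*(.+)", line)
        title_match = re.fullmatch(r"\*\*(.+)\*\*", line)
        if file_match:
            summary.path = file_match.group(1)
        elif size_match:
            summary.lines = int(size_match.group(1))
            summary.tokens = int(size_match.group(2))
        elif toc_match:
            summary.toc.append((int(toc_match.group(1)), toc_match.group(2)))
        elif title_match and not summary.title and line != "**Table of Contents:**":
            summary.title = title_match.group(1)
    return summary


sample = """**Server Components**
File: .ai_pages/20260220_143052_react_dev__reference__rsc.md
Lines: 342 | ~2800 tokens (estimate)

**Table of Contents:**
L1: # Server Components
L89: ### Fetching data

Use Read tool on the file path above to view content."""

summary = parse_read_response(sample)
print(summary.path)
print(summary.toc)
```

The parser ignores any line it does not recognise, so optional trailing lines such as the CACHED notice would pass through without breaking it.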

Versioning

The output contract follows the package version (semver). Current: 1.8.0. Schema-affecting changes bump the minor version and are noted here.

1.8.0 changes:

- New tool webskim_grep: regex search in saved pages (L<n>: blocks with > on matching lines).
- webskim_read (both modes) may append a final line, "CACHED: fetched <age> ago — pass no_cache: true to force refresh.", when the response was served from the local cache manifest.
- New webskim_read param no_cache; new env var WEBSKIM_CACHE_TTL; .manifest.json appears inside the cache dir.

1.7.0 changes:

- webskim_read (both modes) may append a final line, "Note: content is very short and the default chrome stripper was active...", when extracted content is under 500 chars and the default remove_selector was applied.
- Cache filenames now encode a sanitized query string (?day=6 → ..._pogoda__day-6.md); fragments are still dropped.
- webskim_search no longer fetches page content server-side (X-Respond-With: no-content): same response shape, lower Jina credit usage and latency.

Why webskim?

Context efficiency — pages saved to .ai_pages/ on disk, not dumped into context. Agent reads sections via offset/limit.

Tiny footprint — three lean tool definitions in the system prompt. Minimal overhead vs. built-in alternatives.

Smart search — returns snippets, not full pages. Agent picks which URLs are worth reading.

PDF support — Jina Reader handles PDFs natively. Same API, same workflow.

Server-side token budget — max_tokens truncates on the server before content reaches your agent.

CSS selectors — target_selector / remove_selector extract exactly the part of the page you need.

Clean markdown — no HTML soup, no boilerplate, just readable content.

Fast and cheap — search returns snippets only (no server-side page fetching since 1.7.0), read ~8s. Jina API costs $0.02/1M tokens.

Make It the Default

The tool descriptions already tell the agent to prefer webskim, but for maximum reliability add this to your project's CLAUDE.md:

## Web Research

Always use webskim MCP tools as the primary choice for all web operations:
- **`webskim_search`** instead of `WebSearch` — returns lightweight snippets (title, URL, description)
- **`webskim_read`** instead of `WebFetch` — saves page to disk as markdown, returns file path + TOC

Workflow: webskim_search → webskim_read URL to disk → Read file with offset/limit.
Use WebSearch/WebFetch only as fallback when webskim tools are unavailable or fail.

Add .ai_pages/ to your .gitignore.

Configuration

| Env var | Required | Default | Description |
|---------|----------|---------|-------------|
| JINA_API_KEY | yes | (none) | Jina AI API key. Get one at https://jina.ai. |
| WEBSKIM_CACHE_DIR | no | <cwd>/.ai_pages | Directory where webskim_read saves fetched pages. Created on demand. Useful for shared volumes or read-only CWDs. |
| WEBSKIM_CACHE_TTL | no | 86400 | Local cache TTL in seconds for repeat reads of the same URL+options. 0 disables the local cache. |
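
How these variables and defaults combine can be sketched in Python. This illustrates the table only, not webskim's own startup code: `Settings` and `load_settings` are hypothetical names, and the `/work/project` directory is a placeholder.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass
class Settings:
    api_key: str
    cache_dir: str
    cache_ttl: int


def load_settings(env: Mapping[str, str], cwd: str) -> Settings:
    """Resolve settings from environment variables, applying the documented defaults."""
    api_key = env.get("JINA_API_KEY", "")
    if not api_key:
        raise ValueError("JINA_API_KEY is required")
    # Fall back to <cwd>/.ai_pages when no cache dir is configured.
    cache_dir = env.get("WEBSKIM_CACHE_DIR") or os.path.join(cwd, ".ai_pages")
    cache_ttl = int(env.get("WEBSKIM_CACHE_TTL", "86400"))
    return Settings(api_key, cache_dir, cache_ttl)


settings = load_settings({"JINA_API_KEY": "jina_..."}, cwd="/work/project")
print(settings.cache_dir)
print(settings.cache_ttl)  # 86400
```

Passing the environment in as a mapping, rather than reading `os.environ` inside the function, keeps the sketch easy to test. A real caller would pass `os.environ` and `os.getcwd()`.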

Development

git clone <repo-url> && cd webskim
npm install && npm run build
npm test

License

MIT
