# anysite

v0.2.0
Turn any website into structured JSON. One command.
```sh
# Auto-extract — just give it a URL
anysite extract "https://news.ycombinator.com"
```

```json
{
  "posts": [
    { "rank": 1, "title": "Show HN: I built a thing", "url": "https://...", "points": 141, "comments": 54, "author": "j0rg3" }
  ]
}
```

```sh
# Natural language — tell it what you want
anysite extract "https://www.amazon.com/dp/B0DCCSJN5C" --prompt "price, rating, and top 3 reviews"
```

```json
{
  "price": 29.99,
  "rating": 4.6,
  "reviews": [
    { "stars": 5, "title": "Game changer", "text": "..." },
    { "stars": 5, "title": "Worth every penny", "text": "..." },
    { "stars": 4, "title": "Good but not perfect", "text": "..." }
  ]
}
```

```sh
# Schema mode — Zod-validated, typed output
anysite extract "https://github.com/trending" --schema gh-trending
```

```json
{
  "repos": [
    { "owner": "vercel", "name": "ai", "description": "Build AI-powered apps...", "language": "TypeScript", "stars": 12400, "todayStars": 380 }
  ]
}
```

## Install

```sh
npm install -g anysite
```

Or run without installing:

```sh
npx anysite extract "https://news.ycombinator.com"
```

Prerequisites: Node.js 18+ and an LLM API key (DeepSeek, OpenAI, Anthropic, or local Ollama). On first run, anysite prompts you to pick a provider and paste your key.
## Why anysite?

- One command, zero config. Pass a URL, get JSON. No selectors, no parsing logic, no maintenance.
- Runs locally. Playwright + your LLM key. Your data never touches a third-party scraping service.
- Handles JS-heavy sites. SPAs, client-rendered pages, infinite scroll — Playwright renders them, then the LLM reads them. BeautifulSoup and Scrapy see an empty `<div id="root"></div>`.
- The LLM understands semantics, not markup. Site redesign? Different CSS classes? Doesn't matter. The model reads meaning, not selectors.
## Comparison

| | anysite | Firecrawl | llm-scraper | BeautifulSoup | Scrapy |
|---|:---:|:---:|:---:|:---:|:---:|
| One command | Yes | No (API setup) | No (code required) | No (code required) | No (code required) |
| JS rendering | Yes | Yes | Yes | No | No (needs Splash) |
| Runs locally | Yes | No (SaaS) | Yes | Yes | Yes |
| Schema extraction | Yes (Zod) | Partial | Yes | No | No |
| Anti-bot handling | Yes (UA rotation, stealth) | Yes | Partial | No | Partial |
| Cost | ~$0.001/page (LLM tokens) | $0.01+/page | LLM tokens | Free | Free |
## Usage

### Basic — auto-extract
Pass a URL with no options. The LLM infers the most useful structure.
```sh
anysite extract "https://books.toscrape.com"
```

### Natural language — `--prompt`
Describe what you want in plain English.
```sh
anysite extract "https://en.wikipedia.org/wiki/Rust_(programming_language)" \
  --prompt "language name, year created, creator, and key features as a list"
```

### Schema mode — `--schema`
Use a built-in schema for validated, typed output.
```sh
anysite extract "https://news.ycombinator.com" --schema hn-top
```

### External schema — `--schema-file`
Point to your own Zod schema file.
```ts
// my-schema.ts
import { z } from 'zod'

export const jobsSchema = z.object({
  jobs: z.array(z.object({
    title: z.string(),
    company: z.string(),
    location: z.string().nullable(),
    salary: z.string().nullable(),
  })),
})

export const description = 'Extract job listings'
```

```sh
anysite extract "https://some-job-board.com" --schema-file ./my-schema.ts
```

### Provider switching — `--llm`
```sh
# Use OpenAI instead of the default
anysite extract "https://example.com" --llm openai

# Use local Ollama (free, no API key)
anysite extract "https://example.com" --llm ollama

# Use Claude
anysite extract "https://example.com" --llm claude
```

Supported providers: `deepseek` (default, cheapest), `openai`, `claude`, `ollama`, `custom`.
### Caching — `--no-cache`, `--cache-ttl`
Results are cached by default (HTML for 5 minutes, LLM extractions for 1 hour).
```sh
# Skip cache entirely
anysite extract "https://example.com" --no-cache

# Set a custom TTL (in seconds)
anysite extract "https://example.com" --cache-ttl 600
```

Cache lives in `~/.anysite/cache/`.
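The TTL behavior can be pictured as a simple freshness check per cache kind. A minimal illustrative sketch — the `isFresh` helper is hypothetical, not anysite's internals; only the defaults (5 minutes for HTML, 1 hour for LLM extractions) come from the docs above:

```typescript
// Documented default TTLs, in seconds.
const TTL_SECONDS = { html: 300, llm: 3600 } as const

// A cached entry is fresh if its age is below the TTL for its kind,
// or below an explicit override (the --cache-ttl flag, in seconds).
function isFresh(
  kind: keyof typeof TTL_SECONDS,
  savedAtMs: number,
  nowMs: number,
  ttlOverride?: number
): boolean {
  const ttl = ttlOverride ?? TTL_SECONDS[kind]
  return (nowMs - savedAtMs) / 1000 < ttl
}
```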
### Max content — `--max-chars`
Limit how much cleaned content is sent to the LLM.
```sh
anysite extract "https://example.com" --max-chars 8000
```

Default: 16,000 characters.
### Vision mode — `--vision`
Use a Playwright screenshot alongside markdown for extraction. Helps with canvas-rendered content, charts, and visual-only layouts.
```sh
anysite extract "https://example.com/dashboard" --vision --llm openai
```

Requires a vision-capable model (GPT-4o, Claude). DeepSeek falls back to text-only.
### Proxy — `--proxy`
Route all requests through a proxy (HTTP/HTTPS/SOCKS5).
```sh
anysite extract "https://example.com" --proxy "http://user:pass@proxy:8080"
```

Works with rotating proxy services (Bright Data, ScraperAPI, etc.) out of the box.
## MCP Server
anysite includes an MCP server for AI agent integration (Claude Desktop, Cursor, etc.).
```json
{
  "mcpServers": {
    "anysite": {
      "command": "npx",
      "args": ["-y", "anysite", "anysite-mcp"]
    }
  }
}
```

Available tools: `extract`, `fetch_and_clean`, `list_schemas`.
## Programmatic API
```ts
import { z } from 'zod'
import { anysite } from 'anysite'

// Auto-extract
const data = await anysite('https://news.ycombinator.com')

// With a prompt
const prices = await anysite('https://amazon.com/dp/B0DCCSJN5C', {
  prompt: 'price, rating, and availability'
})

// With a Zod schema
const schema = z.object({
  posts: z.array(z.object({
    title: z.string(),
    url: z.string().nullable(),
    points: z.number().nullable(),
  }))
})

const result = await anysite<z.infer<typeof schema>>('https://news.ycombinator.com', {
  schema,
  description: 'Extract top posts',
  llm: 'openai',
  cache: false,
})
```

### API Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| schema | z.ZodType | — | Zod schema for validated extraction |
| description | string | 'Extract data' | Describes the extraction task for the LLM |
| prompt | string | — | Natural language extraction prompt |
| maxChars | number | 16000 | Max markdown characters sent to LLM |
| llm | string | auto-detect | LLM provider name |
| cache | boolean | true | Enable/disable caching |
## Built-in Schemas
| Schema | Command | Description |
|--------|---------|-------------|
| hn-top | --schema hn-top | Hacker News front page posts |
| gh-trending | --schema gh-trending | GitHub Trending repositories |
| product | --schema product | Generic product page (title, price, rating) |
| amazon-product | --schema amazon-product | Amazon product with reviews, bullets, images |
| google-serp | --schema google-serp | Google search results with featured snippets |
| linkedin-job | --schema linkedin-job | LinkedIn job posting details |
Create your own by exporting a Zod schema (named `*Schema`) and a `description` string from a `.ts` file.
## Configuration
anysite resolves configuration in this order (highest priority first):

1. CLI flags — `--llm openai`, `--no-cache`, etc.
2. Environment variables — `DEEPSEEK_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `ANYSITE_LLM`, `ANYSITE_MODEL`
3. Config file — `~/.anysite/config.json` (created on first run via interactive setup)
4. Auto-detect — picks the first provider with a valid API key
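This resolution order amounts to a first-defined-wins chain. A minimal illustrative sketch — `resolveProvider` and its argument shape are hypothetical, not anysite's actual code; auto-detect is simplified here to the documented default provider:

```typescript
// Each source may or may not specify a provider.
type Source = { llm?: string }

// Hypothetical helper: CLI flag > environment variable > config file > default.
function resolveProvider(flags: Source, env: Source, config: Source): string {
  // The ?? chain returns the first source that actually set a value;
  // 'deepseek' stands in for the auto-detect step.
  return flags.llm ?? env.llm ?? config.llm ?? 'deepseek'
}
```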
### Environment Variables
| Variable | Description |
|----------|-------------|
| DEEPSEEK_API_KEY | DeepSeek API key |
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key |
| ANYSITE_LLM | Force a provider: deepseek, openai, claude, ollama, custom |
| ANYSITE_MODEL | Override the default model for any provider |
| ANYSITE_API_BASE | Custom provider base URL (use with ANYSITE_LLM=custom) |
| ANYSITE_API_KEY | Custom provider API key |
| BROWSER_PATH | Path to a Chromium binary for Playwright |
Config File
// ~/.anysite/config.json
{
"DEEPSEEK_API_KEY": "sk-...",
"ANYSITE_LLM": "deepseek"
}How It Works
1. Fetch — tries plain HTTP first. If the page looks JS-rendered (body under 500 characters), it falls back to headless Chromium via Playwright.
2. Clean — strips scripts, nav, footer, and SVG; extracts `<main>` content; converts the HTML to Markdown. A 548 KB page becomes about 6 KB of clean text.
3. Extract — sends the cleaned Markdown to your LLM with a schema or prompt, validates the output with Zod, and retries with exponential backoff on failure.
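The fallback heuristic in the fetch step can be sketched as a length check on the visible text. Illustrative only — `needsBrowser` is a hypothetical helper, not anysite's source; the 500-character threshold is the one stated above:

```typescript
const JS_RENDER_THRESHOLD = 500 // characters, per the heuristic above

function needsBrowser(html: string): boolean {
  // Approximate the visible body text by stripping scripts and tags.
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim()
  // A near-empty body suggests client-side rendering: use Playwright.
  return text.length < JS_RENDER_THRESHOLD
}
```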
## Contributing
Issues, PRs, and ideas welcome. If you find a site that doesn't extract well, open an issue with the URL and expected output.
