# anysite

v0.2.0
Turn any website into structured JSON. One command.
```sh
# Auto-extract — just give it a URL
anysite extract "https://news.ycombinator.com"
```

```json
{
  "posts": [
    { "rank": 1, "title": "Show HN: I built a thing", "url": "https://...", "points": 141, "comments": 54, "author": "j0rg3" }
  ]
}
```

```sh
# Natural language — tell it what you want
anysite extract "https://www.amazon.com/dp/B0DCCSJN5C" --prompt "price, rating, and top 3 reviews"
```

```json
{
  "price": 29.99,
  "rating": 4.6,
  "reviews": [
    { "stars": 5, "title": "Game changer", "text": "..." },
    { "stars": 5, "title": "Worth every penny", "text": "..." },
    { "stars": 4, "title": "Good but not perfect", "text": "..." }
  ]
}
```

```sh
# Schema mode — Zod-validated, typed output
anysite extract "https://github.com/trending" --schema gh-trending
```

```json
{
  "repos": [
    { "owner": "vercel", "name": "ai", "description": "Build AI-powered apps...", "language": "TypeScript", "stars": 12400, "todayStars": 380 }
  ]
}
```

## Install

```sh
npm install -g anysite
```

Or run without installing:

```sh
npx anysite extract "https://news.ycombinator.com"
```

Prerequisites: Node.js 18+ and an LLM API key (DeepSeek, OpenAI, Anthropic, or local Ollama). On first run, anysite prompts you to pick a provider and paste your key.
## Why anysite?

- One command, zero config. Pass a URL, get JSON. No selectors, no parsing logic, no maintenance.
- Runs locally. Playwright + your LLM key. Your data never touches a third-party scraping service.
- Handles JS-heavy sites. SPAs, client-rendered pages, infinite scroll — Playwright renders them, then the LLM reads them. BeautifulSoup and Scrapy see an empty `<div id="root"></div>`.
- The LLM understands semantics, not markup. Site redesign? Different CSS classes? Doesn't matter. The model reads meaning, not selectors.
## Comparison

| | anysite | Firecrawl | llm-scraper | BeautifulSoup | Scrapy |
|---|:---:|:---:|:---:|:---:|:---:|
| One command | Yes | No (API setup) | No (code required) | No (code required) | No (code required) |
| JS rendering | Yes | Yes | Yes | No | No (needs Splash) |
| Runs locally | Yes | No (SaaS) | Yes | Yes | Yes |
| Schema extraction | Yes (Zod) | Partial | Yes | No | No |
| Anti-bot handling | Yes (UA rotation, stealth) | Yes | Partial | No | Partial |
| Cost | ~$0.001/page (LLM tokens) | $0.01+/page | LLM tokens | Free | Free |
## Usage

### Basic — auto-extract
Pass a URL with no options. The LLM infers the most useful structure.
```sh
anysite extract "https://books.toscrape.com"
```

### Natural language — `--prompt`
Describe what you want in plain English.
```sh
anysite extract "https://en.wikipedia.org/wiki/Rust_(programming_language)" \
  --prompt "language name, year created, creator, and key features as a list"
```

### Schema mode — `--schema`
Use a built-in schema for validated, typed output.
```sh
anysite extract "https://news.ycombinator.com" --schema hn-top
```

### External schema — `--schema-file`
Point to your own Zod schema file.
```ts
// my-schema.ts
import { z } from 'zod'

export const jobsSchema = z.object({
  jobs: z.array(z.object({
    title: z.string(),
    company: z.string(),
    location: z.string().nullable(),
    salary: z.string().nullable(),
  })),
})

export const description = 'Extract job listings'
```

```sh
anysite extract "https://some-job-board.com" --schema-file ./my-schema.ts
```

### Provider switching — `--llm`
```sh
# Use OpenAI instead of the default
anysite extract "https://example.com" --llm openai

# Use local Ollama (free, no API key)
anysite extract "https://example.com" --llm ollama

# Use Claude
anysite extract "https://example.com" --llm claude
```

Supported providers: `deepseek` (default, cheapest), `openai`, `claude`, `ollama`, `custom`.
### Caching — `--no-cache`, `--cache-ttl`
Results are cached by default (HTML for 5 minutes, LLM extractions for 1 hour).
```sh
# Skip cache entirely
anysite extract "https://example.com" --no-cache

# Set a custom TTL (in seconds)
anysite extract "https://example.com" --cache-ttl 600
```

Cache lives in `~/.anysite/cache/`.
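The TTL behavior can be pictured as a simple freshness check per cache kind. A minimal illustrative sketch — the `isFresh` helper is hypothetical, not anysite's internals; only the defaults (5 minutes for HTML, 1 hour for LLM extractions) come from the docs above:

```typescript
// Documented default TTLs, in seconds.
const TTL_SECONDS = { html: 300, llm: 3600 } as const

// A cached entry is fresh if its age is below the TTL for its kind,
// or below an explicit override (the --cache-ttl flag, in seconds).
function isFresh(
  kind: keyof typeof TTL_SECONDS,
  savedAtMs: number,
  nowMs: number,
  ttlOverride?: number
): boolean {
  const ttl = ttlOverride ?? TTL_SECONDS[kind]
  return (nowMs - savedAtMs) / 1000 < ttl
}
```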
### Max content — `--max-chars`
Limit how much cleaned content is sent to the LLM.
```sh
anysite extract "https://example.com" --max-chars 8000
```

Default: 16,000 characters.
### Vision mode — `--vision`
Use a Playwright screenshot alongside markdown for extraction. Helps with canvas-rendered content, charts, and visual-only layouts.
```sh
anysite extract "https://example.com/dashboard" --vision --llm openai
```

Requires a vision-capable model (GPT-4o, Claude). DeepSeek falls back to text-only.
### Proxy — `--proxy`
Route all requests through a proxy (HTTP/HTTPS/SOCKS5).
```sh
anysite extract "https://example.com" --proxy "http://user:pass@proxy:8080"
```

Works with rotating proxy services (Bright Data, ScraperAPI, etc.) out of the box.
## MCP Server
anysite includes an MCP server for AI agent integration (Claude Desktop, Cursor, etc.).
```json
{
  "mcpServers": {
    "anysite": {
      "command": "npx",
      "args": ["-y", "anysite", "anysite-mcp"]
    }
  }
}
```

Available tools: `extract`, `fetch_and_clean`, `list_schemas`.
## Programmatic API
```ts
import { z } from 'zod'
import { anysite } from 'anysite'

// Auto-extract
const data = await anysite('https://news.ycombinator.com')

// With a prompt
const prices = await anysite('https://amazon.com/dp/B0DCCSJN5C', {
  prompt: 'price, rating, and availability'
})

// With a Zod schema
const schema = z.object({
  posts: z.array(z.object({
    title: z.string(),
    url: z.string().nullable(),
    points: z.number().nullable(),
  }))
})

const result = await anysite<z.infer<typeof schema>>('https://news.ycombinator.com', {
  schema,
  description: 'Extract top posts',
  llm: 'openai',
  cache: false,
})
```

### API Options
| Option | Type | Default | Description |
|--------|------|---------|-------------|
| schema | z.ZodType | — | Zod schema for validated extraction |
| description | string | 'Extract data' | Describes the extraction task for the LLM |
| prompt | string | — | Natural language extraction prompt |
| maxChars | number | 16000 | Max markdown characters sent to LLM |
| llm | string | auto-detect | LLM provider name |
| cache | boolean | true | Enable/disable caching |
## Built-in Schemas
| Schema | Command | Description |
|--------|---------|-------------|
| hn-top | --schema hn-top | Hacker News front page posts |
| gh-trending | --schema gh-trending | GitHub Trending repositories |
| product | --schema product | Generic product page (title, price, rating) |
| amazon-product | --schema amazon-product | Amazon product with reviews, bullets, images |
| google-serp | --schema google-serp | Google search results with featured snippets |
| linkedin-job | --schema linkedin-job | LinkedIn job posting details |
Create your own by exporting a Zod schema (named `*Schema`) and a `description` string from a `.ts` file.
## Configuration
anysite resolves configuration in this order (highest priority first):

1. CLI flags — `--llm openai`, `--no-cache`, etc.
2. Environment variables — `DEEPSEEK_API_KEY`, `OPENAI_API_KEY`, `ANTHROPIC_API_KEY`, `ANYSITE_LLM`, `ANYSITE_MODEL`
3. Config file — `~/.anysite/config.json` (created on first run via interactive setup)
4. Auto-detect — picks the first provider with a valid API key
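This resolution order amounts to a first-defined-wins chain. A minimal illustrative sketch — `resolveProvider` and its argument shape are hypothetical, not anysite's actual code; auto-detect is simplified here to the documented default provider:

```typescript
// Each source may or may not specify a provider.
type Source = { llm?: string }

// Hypothetical helper: CLI flag > environment variable > config file > default.
function resolveProvider(flags: Source, env: Source, config: Source): string {
  // The ?? chain returns the first source that actually set a value;
  // 'deepseek' stands in for the auto-detect step.
  return flags.llm ?? env.llm ?? config.llm ?? 'deepseek'
}
```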
### Environment Variables
| Variable | Description |
|----------|-------------|
| DEEPSEEK_API_KEY | DeepSeek API key |
| OPENAI_API_KEY | OpenAI API key |
| ANTHROPIC_API_KEY | Anthropic API key |
| ANYSITE_LLM | Force a provider: deepseek, openai, claude, ollama, custom |
| ANYSITE_MODEL | Override the default model for any provider |
| ANYSITE_API_BASE | Custom provider base URL (use with ANYSITE_LLM=custom) |
| ANYSITE_API_KEY | Custom provider API key |
| BROWSER_PATH | Path to a Chromium binary for Playwright |
Config File
// ~/.anysite/config.json
{
"DEEPSEEK_API_KEY": "sk-...",
"ANYSITE_LLM": "deepseek"
}How It Works
1. Fetch — tries plain HTTP first. If the page looks JS-rendered (body under 500 characters), it falls back to headless Chromium via Playwright.
2. Clean — strips scripts, nav, footer, and SVG; extracts `<main>` content; converts the HTML to Markdown. A 548 KB page becomes about 6 KB of clean text.
3. Extract — sends the cleaned Markdown to your LLM with a schema or prompt, validates the output with Zod, and retries with exponential backoff on failure.
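The fallback heuristic in the fetch step can be sketched as a length check on the visible text. Illustrative only — `needsBrowser` is a hypothetical helper, not anysite's source; the 500-character threshold is the one stated above:

```typescript
const JS_RENDER_THRESHOLD = 500 // characters, per the heuristic above

function needsBrowser(html: string): boolean {
  // Approximate the visible body text by stripping scripts and tags.
  const text = html
    .replace(/<script[\s\S]*?<\/script>/gi, '')
    .replace(/<[^>]+>/g, '')
    .trim()
  // A near-empty body suggests client-side rendering: use Playwright.
  return text.length < JS_RENDER_THRESHOLD
}
```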
## Contributing
Issues, PRs, and ideas welcome. If you find a site that doesn't extract well, open an issue with the URL and expected output.
