jsonrocks
v0.0.67
turn any page into json
json.rocks
A web scraping service that converts any webpage into structured JSON data - perfect for generating rich link previews in chat applications.
Turn any URL into clean, structured metadata including title, description, images, videos, and more. Ideal for Slack-style link unfurling, Discord embeds, WhatsApp previews, and any chat application that needs rich link previews.
🎯 Use Case: Chat Link Previews
When users paste URLs in your chat app, json.rocks automatically extracts:
- Title - Page title for the preview card
- Description - Meta description or article excerpt
- Image - Open Graph image, featured image, or thumbnail
- Logo - Site favicon or publisher logo
- Author - Article author name
- Publisher - Site/publication name
- Videos - Embedded videos (YouTube, Vimeo, etc.)
- Date - Publication date
- URL - Canonical URL
Perfect for building chat features like:
- Slack-style link unfurling
- Discord rich embeds
- WhatsApp link previews
- iMessage link bubbles
- Telegram instant view
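Before a preview can be requested, the chat app has to find the URLs in a message. The following is a minimal sketch of that step, not part of json.rocks itself; `extractUrls` is a hypothetical helper name, and the trailing-punctuation rule is an assumption that will occasionally trim a character that really belonged to the URL.

```javascript
// Hypothetical helper: find http(s) URLs in a chat message so each one
// can be sent to json.rocks for a preview.
function extractUrls(message) {
  const candidates = message.match(/https?:\/\/[^\s<>"']+/g) || []
  const urls = []
  for (const candidate of candidates) {
    // Drop trailing punctuation that usually belongs to the sentence, not the URL
    const trimmed = candidate.replace(/[.,!?;:)\]]+$/, '')
    try {
      // new URL() throws on malformed input, so only parseable URLs are kept
      const href = new URL(trimmed).href
      if (!urls.includes(href)) urls.push(href)
    } catch {
      // Ignore anything that does not parse as a URL
    }
  }
  return urls
}
```

Each returned URL can then be passed to the main endpoint as the `uri` parameter.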
🚀 Quick Start
Installation
npm install -g jsonrocks
Run the Server
# Start on default port 9980
jsonrocks
# Or run directly from source
./bin/server.js
# Custom port
./bin/server.js --port 8080
# HTTPS mode (requires SSL certificates)
./bin/server.js --https --key ./privkey.pem --cert ./fullchain.pem
Test the API
# Extract metadata from a URL
curl "http://localhost:9980/?uri=https://github.com/spux/json.rocks"
# Search the web (returns JSON results)
curl "http://localhost:9980/?uri=chat+applications"
📖 API Reference
Main Endpoint: GET /
Extract structured JSON metadata from any URL.
Parameters:
- uri (required) - The URL to scrape or a search query
- refresh (optional) - Set to true to bypass cache
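Because the target URL is itself passed as a query parameter, it needs to be percent-encoded. A small sketch of a request-URL builder follows; `buildRequestUrl` and its `baseUrl` option are hypothetical names, and the default base simply assumes a local instance on port 9980.

```javascript
// Hypothetical helper: build a request URL for the main endpoint.
// baseUrl is an assumption; point it at wherever your instance runs.
function buildRequestUrl(target, { baseUrl = 'http://localhost:9980', refresh = false } = {}) {
  const url = new URL('/', baseUrl)
  // searchParams percent-encodes the value, so a query string inside
  // the target URL is not mistaken for a json.rocks parameter
  url.searchParams.set('uri', target)
  if (refresh) url.searchParams.set('refresh', 'true')
  return url.toString()
}

// buildRequestUrl('https://example.com/a?b=1')
// → 'http://localhost:9980/?uri=https%3A%2F%2Fexample.com%2Fa%3Fb%3D1'
```

Spaces in a search query are encoded as `+`, which matches the form used in the curl examples.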
Example Request:
// Fetch link preview data
const response = await fetch('http://localhost:9980/?uri=https://example.com/article')
const data = await response.json()
Example Response:
{
  "title": "Amazing Article Title",
  "description": "A compelling description of the article content that will appear in your chat preview.",
  "image": "https://example.com/images/featured.jpg",
  "logo": "https://example.com/favicon.ico",
  "author": "Jane Doe",
  "publisher": "Example News",
  "date": "2025-01-13T10:30:00.000Z",
  "url": "https://example.com/article",
  "lang": "en",
  "videos": [
    {
      "url": "https://youtube.com/watch?v=...",
      "width": 1280,
      "height": 720
    }
  ],
  "images": [
    {
      "src": "https://example.com/image1.jpg",
      "alt": "Image description"
    }
  ],
  "links": [
    {
      "href": "https://example.com/related",
      "text": "Related Article"
    }
  ]
}
Health Check: GET /health
Returns server status.
curl http://localhost:9980/health
# Response: { "status": "ok" }
Admin Endpoints
View allowed domains:
curl http://localhost:9980/admin/domains
Reload domain configuration:
curl -X POST http://localhost:9980/admin/reload-domains
💬 Chat App Integration Examples
Basic Link Preview
async function generateLinkPreview(url) {
  const response = await fetch(
    `http://localhost:9980/?uri=${encodeURIComponent(url)}`
  )
  const data = await response.json()
  return {
    title: data.title,
    description: data.description,
    thumbnail: data.image,
    favicon: data.logo,
    siteName: data.publisher,
    author: data.author,
    publishedAt: data.date
  }
}
// Usage in your chat app
const preview = await generateLinkPreview('https://github.com/spux/json.rocks')
console.log(preview)
React Component Example
import { useState, useEffect } from 'react'
function LinkPreview({ url }) {
  const [preview, setPreview] = useState(null)
  const [loading, setLoading] = useState(true)
  useEffect(() => {
    fetch(`http://localhost:9980/?uri=${encodeURIComponent(url)}`)
      .then(res => res.json())
      .then(data => setPreview(data))
      // On failure, leave preview as null so nothing is rendered
      .catch(() => setPreview(null))
      // Clear the loading state whether the request succeeded or failed
      .finally(() => setLoading(false))
  }, [url])
  if (loading) return <div>Loading preview...</div>
  if (!preview) return null
  return (
    <div className="link-preview">
      {preview.image && <img src={preview.image} alt={preview.title} />}
      <h3>{preview.title}</h3>
      <p>{preview.description}</p>
      <span className="source">{preview.publisher}</span>
    </div>
  )
}
Node.js Backend Integration
import express from 'express'
import fetch from 'node-fetch'
const app = express()
// API endpoint for your chat app to request link previews
app.get('/api/preview', async (req, res) => {
  const { url } = req.query
  if (!url) {
    return res.status(400).json({ error: 'URL required' })
  }
  try {
    const response = await fetch(
      `http://localhost:9980/?uri=${encodeURIComponent(url)}`
    )
    const data = await response.json()
    // Transform to your chat app's format
    res.json({
      title: data.title,
      description: data.description,
      image: data.image,
      siteName: data.publisher,
      url: data.url
    })
  } catch (error) {
    res.status(500).json({ error: 'Failed to generate preview' })
  }
})
app.listen(3000)
Python/Flask Integration
from flask import Flask, request, jsonify
import requests
app = Flask(__name__)
@app.route('/api/preview')
def get_preview():
    url = request.args.get('url')
    if not url:
        return jsonify({'error': 'URL required'}), 400
    try:
        # Pass the target via params so requests percent-encodes it;
        # the timeout keeps a stalled scrape from hanging this endpoint
        response = requests.get(
            'http://localhost:9980/', params={'uri': url}, timeout=10
        )
        data = response.json()
        return jsonify({
            'title': data.get('title'),
            'description': data.get('description'),
            'image': data.get('image'),
            'siteName': data.get('publisher'),
            'url': data.get('url')
        })
    except Exception as e:
        return jsonify({'error': str(e)}), 500
if __name__ == '__main__':
    app.run(port=3000)
⚙️ Configuration
Domain Allowlist
For security, only whitelisted domains can be scraped. json.rocks includes 1000+ popular domains by default, covering:
- Social media (Twitter, Reddit, LinkedIn, etc.)
- Development platforms (GitHub, GitLab, npm, etc.)
- News sites (NYTimes, BBC, CNN, etc.)
- Cloud providers (AWS, Azure, Vercel, Netlify, etc.)
- Documentation sites (MDN, React, Node.js, etc.)
- And many more...
Default domains: Loaded from data/allowed-domains-top1000.json (committed to git)
Adding Custom Domains
To add your own domains without modifying the default list:
- Create data/allowed-domains-custom.json:
[ "yourdomain.com", "blog.yourdomain.com", "mycompany.com" ]
- Restart the server - custom domains are automatically merged with defaults
Or reload without restart:
export ADMIN_PASS=your-password
curl -u admin:$ADMIN_PASS -X POST http://localhost:9980/admin/reload-domains
Note: Subdomains are automatically included (e.g., github.com allows api.github.com, gist.github.com)
Total domains: the default list (1000+) plus your custom additions, merged and deduplicated automatically.
See DOMAIN_MANAGEMENT.md for detailed configuration guide.
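The merge and subdomain rules described above can be illustrated with a short sketch. This is not the server's actual code; `mergeAllowlists` and `isDomainAllowed` are hypothetical names, and matching on whole dot-separated labels is an assumption about how "subdomains are automatically included" is meant.

```javascript
// Merge the default and custom lists, lowercasing and removing duplicates
function mergeAllowlists(defaults, custom) {
  const merged = new Set()
  for (const domain of [...defaults, ...custom]) {
    merged.add(domain.trim().toLowerCase())
  }
  return merged
}

// A hostname is allowed if it equals a listed domain or is a subdomain of one
function isDomainAllowed(hostname, allowlist) {
  const labels = hostname.toLowerCase().split('.')
  // For "api.github.com" this checks "api.github.com", "github.com", then "com"
  for (let i = 0; i < labels.length; i++) {
    if (allowlist.has(labels.slice(i).join('.'))) return true
  }
  return false
}
```

Comparing whole labels rather than using a plain `endsWith` check means that `evilgithub.com` does not slip through just because it ends in `github.com`.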
Rate Limiting
Default limits:
- Cached requests: 100 requests/minute per IP
- Non-cached requests: 5 requests/minute per IP
Configured in bin/server.js (lines 229-251).
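To make the two limits concrete, here is a minimal fixed-window limiter using the default numbers above. It is an illustration only: the algorithm the server actually uses is not stated here, and `createRateLimiter` with its injectable clock is a hypothetical design chosen to keep the sketch testable.

```javascript
// Limits taken from the defaults listed above (requests per minute per IP)
const LIMITS = { cached: 100, uncached: 5 }
const WINDOW_MS = 60 * 1000

function createRateLimiter(now = () => Date.now()) {
  const windows = new Map() // key: "ip|kind" -> { start, count }

  // Returns true if the request may proceed, false if the limit is reached
  return function allow(ip, kind) {
    const limit = LIMITS[kind]
    if (limit === undefined) throw new TypeError(`Unknown request kind: ${kind}`)
    const key = `${ip}|${kind}`
    const current = now()
    let entry = windows.get(key)
    // Start a fresh window on first sight or once the old window has expired
    if (!entry || current - entry.start >= WINDOW_MS) {
      entry = { start: current, count: 0 }
      windows.set(key, entry)
    }
    if (entry.count >= limit) return false
    entry.count += 1
    return true
  }
}
```

Cached and uncached requests are counted separately per IP, so exhausting the uncached budget does not block cache hits.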
Caching
Two-tier caching strategy:
- Memory cache: LRU cache (100 items, 30-minute TTL)
- Disk cache: JSON files in the data/ directory, organized by domain
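The memory tier can be pictured as a small LRU map with expiry. The sketch below is dependency-free and uses the numbers above as defaults; it is not the LRU-Cache package the project lists in its tech stack, and `createLruCache` is a hypothetical name.

```javascript
// In-memory LRU cache with a TTL; defaults mirror the numbers above
function createLruCache({ maxItems = 100, ttlMs = 30 * 60 * 1000, now = () => Date.now() } = {}) {
  const entries = new Map() // a Map keeps insertion order: oldest first

  return {
    get(key) {
      const entry = entries.get(key)
      if (!entry) return undefined
      if (now() - entry.storedAt >= ttlMs) {
        entries.delete(key) // expired
        return undefined
      }
      // Re-insert to mark this key as most recently used
      entries.delete(key)
      entries.set(key, entry)
      return entry.value
    },
    set(key, value) {
      entries.delete(key)
      entries.set(key, { value, storedAt: now() })
      if (entries.size > maxItems) {
        // Evict the least recently used entry (the first key in the Map)
        entries.delete(entries.keys().next().value)
      }
    },
    get size() {
      return entries.size
    }
  }
}
```

Expired entries are removed lazily on read, which keeps the sketch short at the cost of holding stale items until they are touched or evicted.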
Bypass cache:
curl "http://localhost:9980/?uri=https://example.com&refresh=true"
🔒 Security Features
- Domain allowlist - Only whitelisted domains can be scraped
- IP blocking - Blocks private IP ranges, localhost, AWS metadata endpoints
- Rate limiting - Prevents abuse with per-IP limits
- Content size limits - Max 5MB per request
- Concurrent request limits - Max 5 concurrent requests per IP
- Request timeout - 5-second timeout with a limit of 3 redirects
- XSS protection - HTML escaping throughout
- SSRF protection - URL validation and private IP blocking
- Security headers - CSP, X-Frame-Options, X-Content-Type-Options
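As an illustration of the private-IP blocking mentioned above, the following sketch rejects IPv4 addresses in loopback, private, and link-local ranges (the last of which contains the 169.254.169.254 cloud metadata address). `isBlockedIPv4` is a hypothetical helper, it does not cover IPv6, and the exact ranges the server blocks are an assumption.

```javascript
// Hypothetical check: is this IPv4 address in a range that should never be fetched?
function isBlockedIPv4(ip) {
  const parts = ip.split('.')
  // Anything that is not a plain dotted-quad address is refused by default
  if (parts.length !== 4 || parts.some(p => !/^\d{1,3}$/.test(p))) return true
  const octets = parts.map(Number)
  if (octets.some(o => o > 255)) return true
  const [a, b] = octets
  return (
    a === 0 ||                           // 0.0.0.0/8 "this network"
    a === 10 ||                          // 10.0.0.0/8 private
    a === 127 ||                         // 127.0.0.0/8 loopback
    (a === 169 && b === 254) ||          // 169.254.0.0/16 link-local, incl. metadata endpoint
    (a === 172 && b >= 16 && b <= 31) || // 172.16.0.0/12 private
    (a === 192 && b === 168)             // 192.168.0.0/16 private
  )
}
```

For SSRF protection the check has to run on the address a hostname resolves to, not only on the hostname string, since a public-looking name can point at an internal address.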
🏗️ Architecture
┌─────────────┐
│ Chat App │
└──────┬──────┘
│ HTTP GET /?uri=...
▼
┌─────────────────┐
│ json.rocks │
│ (Port 9980) │
├─────────────────┤
│ • Rate Limiter │
│ • Domain Check │
│ • IP Validator │
└──────┬──────────┘
│
├─→ Memory Cache (LRU)
│ └─→ Return if cached
│
├─→ Disk Cache
│ └─→ Return if cached
│
└─→ Web Scraper
├─→ Fetch HTML (axios)
├─→ Extract Content (unfluff)
├─→ Extract Metadata (metascraper)
└─→ Return JSON
Tech Stack:
- Fastify - Web framework
- Axios - HTTP client
- Unfluff - Content extraction
- Metascraper - Metadata extraction with platform-specific plugins
- Cheerio - HTML parsing
- LRU-Cache - In-memory caching
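The lookup order in the diagram (memory cache, then disk cache, then scraper) can be sketched as a single function. The `memory`, `disk`, and `scrape` arguments are hypothetical interfaces supplied by the caller, and promoting disk hits into memory is an assumption rather than documented behaviour.

```javascript
// Sketch of the lookup order in the diagram: memory cache, disk cache, scraper.
// memory needs get/set, disk needs async read/write, scrape is an async function.
async function resolveMetadata(uri, { memory, disk, scrape, refresh = false }) {
  if (!refresh) {
    const fromMemory = memory.get(uri)
    if (fromMemory !== undefined) return { source: 'memory', data: fromMemory }

    const fromDisk = await disk.read(uri)
    if (fromDisk !== undefined) {
      memory.set(uri, fromDisk) // promote to the faster tier
      return { source: 'disk', data: fromDisk }
    }
  }
  // Cache miss (or refresh requested): scrape and populate both tiers
  const data = await scrape(uri)
  memory.set(uri, data)
  await disk.write(uri, data)
  return { source: 'scraper', data }
}
```

A plain `Map` satisfies the `memory` interface, which makes the function easy to exercise without a running server.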
📦 Deployment
Docker (Recommended)
FROM node:18-alpine
WORKDIR /app
COPY package*.json ./
RUN npm ci --production
COPY . .
EXPOSE 9980
CMD ["node", "bin/server.js"]
# Build and run
docker build -t json-rocks .
docker run -p 9980:9980 json-rocks
Docker Compose
version: '3.8'
services:
  json-rocks:
    build: .
    ports:
      - "9980:9980"
    volumes:
      - ./data:/app/data
    environment:
      - NODE_ENV=production
    restart: unless-stopped
Production Deployment
Using PM2:
npm install -g pm2
pm2 start bin/server.js --name json-rocks
pm2 save
pm2 startup
Nginx reverse proxy:
server {
    listen 80;
    server_name json.rocks;
    location / {
        proxy_pass http://localhost:9980;
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
    }
}
🛠️ Development
Requirements
- Node.js 14+
- npm or yarn
Setup
# Clone the repository
git clone https://github.com/spux/json.rocks.git
cd json.rocks
# Install dependencies
npm install
# Start development server
npm start
Project Structure
json.rocks/
├── bin/
│ └── server.js # Main server (738 lines)
├── js/
│ ├── jr.js # Utility library
│ └── json-renderer.js # Frontend JSON renderer
├── data/
│ ├── allowed-domains.json # Security domain allowlist
│ └── [cached data]/ # Disk cache organized by domain
├── images/ # UI assets
├── index.html # Landing page (search interface)
├── package.json
└── README.md
🎨 Features
Automatic Content Extraction
- Article text - Main content extraction
- Metadata - Title, description, author, date
- Media - Images with alt text, videos, iframes
- Links - All hyperlinks extracted
- Schema.org - Structured data with JSON-LD
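To show what "structured data with JSON-LD" refers to, here is a dependency-free sketch that pulls JSON-LD blocks out of an HTML string. It is not how json.rocks does it (the project lists Cheerio and Metascraper for parsing); `extractJsonLd` is a hypothetical helper, and a regex is used only to keep the example self-contained.

```javascript
// Hypothetical helper: collect JSON-LD objects from an HTML string.
// A regex keeps the sketch dependency-free; a real HTML parser is more robust.
function extractJsonLd(html) {
  const pattern = /<script[^>]*type=["']application\/ld\+json["'][^>]*>([\s\S]*?)<\/script>/gi
  const results = []
  for (const match of html.matchAll(pattern)) {
    try {
      results.push(JSON.parse(match[1]))
    } catch {
      // Skip blocks that are not valid JSON
    }
  }
  return results
}
```

Each parsed object typically carries a Schema.org `@type` such as `Article` or `Product`, which is where fields like author and publication date often come from.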
Platform-Specific Extractors
- Amazon - Product info
- Spotify - Track/album metadata
- YouTube - Video details
- Twitter - Tweet embeds
- Generic fallback - Works on any webpage
Search Functionality
# Search the web for JSON content
curl "http://localhost:9980/?uri=best+javascript+frameworks"
Returns search results in structured JSON format.
📊 Response Examples
GitHub Repository
{
  "title": "spux/json.rocks: Search the web of JSON",
  "description": "A web scraping service that converts any webpage into structured JSON data",
  "image": "https://opengraph.githubassets.com/...",
  "publisher": "GitHub",
  "author": "spux",
  "url": "https://github.com/spux/json.rocks"
}
News Article
{
  "title": "Breaking News: Important Event Occurs",
  "description": "Detailed coverage of the significant event...",
  "image": "https://news.example.com/images/featured.jpg",
  "author": "Jane Reporter",
  "publisher": "News Network",
  "date": "2025-01-13T15:30:00.000Z",
  "lang": "en"
}
YouTube Video
{
  "title": "Amazing Video Title",
  "description": "Video description and details",
  "image": "https://i.ytimg.com/vi/VIDEO_ID/maxresdefault.jpg",
  "publisher": "YouTube",
  "videos": [{
    "url": "https://www.youtube.com/embed/VIDEO_ID",
    "width": 1280,
    "height": 720
  }]
}
🤝 Contributing
Contributions welcome! Please:
- Fork the repository
- Create a feature branch (git checkout -b feature/amazing-feature)
- Commit your changes (git commit -m 'Add amazing feature')
- Push to the branch (git push origin feature/amazing-feature)
- Open a Pull Request
📄 License
ISC License - See package.json for details
🔗 Links
- NPM Package: https://www.npmjs.com/package/jsonrocks
- GitHub: https://github.com/spux/json.rocks
- Live Demo: https://json.rocks
💡 FAQ
Q: Can I use this for my chat application?
A: Yes! That's exactly what it's designed for. See the integration examples above.
Q: How do I add support for more domains?
A: Add them to data/allowed-domains-custom.json, then restart the server or reload the domain configuration. See DOMAIN_MANAGEMENT.md.
Q: Is there a hosted version?
A: Yes, available at https://json.rocks (GitHub Pages deployment).
Q: What about rate limiting?
A: Default limits are 5 req/min uncached, 100 req/min cached per IP. Configure in server.js.
Q: Does it support authentication?
A: Currently no. Add authentication middleware for production use.
Q: Can I self-host?
A: Absolutely! See the deployment section above.
⚠️ Known Limitations
- Domain allowlist required for security (not all sites allowed by default)
- Rate limits apply (5 req/min uncached)
- File-based disk cache (consider Redis for production)
- No built-in authentication (add your own middleware)
- Admin endpoints are public (secure before production)
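For the last two points, one option is to check HTTP Basic credentials before serving admin routes. The sketch below only validates an `Authorization` header; wiring it into the server is left out, `isAuthorized` and `safeEqual` are hypothetical names, and the choice of Basic auth is an assumption based on the `curl -u admin:$ADMIN_PASS` example in the configuration section.

```javascript
// Compare two strings without returning early on the first mismatch,
// so response timing reveals less about how much of the value matched
function safeEqual(a, b) {
  const bufA = Buffer.from(a)
  const bufB = Buffer.from(b)
  let diff = bufA.length ^ bufB.length
  for (let i = 0; i < bufA.length; i++) {
    diff |= bufA[i] ^ (bufB[i % (bufB.length || 1)] || 0)
  }
  return diff === 0
}

// Validate an "Authorization: Basic ..." header against expected credentials
function isAuthorized(header, expectedUser, expectedPass) {
  if (typeof header !== 'string' || !header.startsWith('Basic ')) return false
  const decoded = Buffer.from(header.slice(6), 'base64').toString('utf8')
  const separator = decoded.indexOf(':')
  if (separator === -1) return false
  const user = decoded.slice(0, separator)
  const pass = decoded.slice(separator + 1)
  // Evaluate both comparisons before combining so neither is skipped
  const userOk = safeEqual(user, expectedUser)
  const passOk = safeEqual(pass, expectedPass)
  return userOk && passOk
}
```

A request handler would call `isAuthorized(request.headers.authorization, 'admin', process.env.ADMIN_PASS)` and respond with 401 when it returns false; Basic auth should only be used over HTTPS.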
🗺️ Roadmap
- [ ] Add authentication support
- [ ] Redis cache option
- [ ] GraphQL API
- [ ] Webhook support
- [ ] Batch URL processing
- [ ] Screenshot capture
- [ ] PDF generation
- [ ] Custom extraction rules
Built with ❤️ for the chat app community
