convex-firecrawl-scrape
v0.1.2
A Convex component for scraping web pages using the Firecrawl API with durable caching and reactive queries.
Convex Firecrawl Scrape Component
Scrape any URL and get clean markdown, HTML, screenshots, or structured JSON - with durable caching and reactive queries.
const { jobId } = await scrape({ url: "https://example.com" });
// Status updates reactively as the scrape completes
const status = useQuery(api.firecrawl.getStatus, { id: jobId });
- Durable caching with configurable TTL (default 30 days)
- Reactive status updates via Convex subscriptions
- Multiple output formats: markdown, HTML, raw HTML, screenshots, links, images, AI summaries
- JSON extraction via schema-based LLM processing
- Built-in SSRF protection blocks private IPs and localhost
- Secure by default with required auth wrapper
Play with the example:
git clone https://github.com/gitmaxd/convex-firecrawl-scrape.git
cd convex-firecrawl-scrape
npm install
npm run dev
Pre-requisite: Convex
You'll need an existing Convex project. Convex is a hosted backend platform with a database, serverless functions, and more. Learn more here.
Run npm create convex or follow any of the
quickstarts to set one up.
Installation
npm install convex-firecrawl-scrape
Install the component in your convex/convex.config.ts:
// convex/convex.config.ts
import { defineApp } from "convex/server";
import firecrawlScrape from "convex-firecrawl-scrape/convex.config.js";
const app = defineApp();
app.use(firecrawlScrape);
export default app;
Set your Firecrawl API key:
npx convex env set FIRECRAWL_API_KEY your_api_key_here
Get your API key at firecrawl.dev.
Usage
Always use exposeApi() to expose component functionality. This wrapper
enforces authentication and controls API key access.
// convex/firecrawl.ts
import { exposeApi } from "convex-firecrawl-scrape";
import { components } from "./_generated/api";
export const { scrape, getCached, getStatus, getContent, invalidate } =
  exposeApi(components.firecrawlScrape, {
    auth: async (ctx, operation) => {
      const identity = await ctx.auth.getUserIdentity();
      if (!identity) throw new Error("Unauthorized");
      return process.env.FIRECRAWL_API_KEY!;
    },
  });
React Integration
import { useMutation, useQuery } from "convex/react";
import { api } from "../convex/_generated/api";
import { useState } from "react";
function ScrapeButton({ url }: { url: string }) {
  const [jobId, setJobId] = useState<string | null>(null);
  const scrape = useMutation(api.firecrawl.scrape);
  const status = useQuery(
    api.firecrawl.getStatus,
    jobId ? { id: jobId } : "skip",
  );
  const content = useQuery(
    api.firecrawl.getContent,
    jobId && status?.status === "completed" ? { id: jobId } : "skip",
  );
  return (
    <div>
      <button
        onClick={async () => setJobId((await scrape({ url })).jobId)}
        disabled={status?.status === "scraping"}
      >
        {status?.status === "scraping" ? "Scraping..." : "Scrape"}
      </button>
      {status?.status === "completed" && <pre>{content?.markdown}</pre>}
      {status?.status === "failed" && <p>Error: {status.error}</p>}
    </div>
  );
}
Output Formats
const { jobId } = await scrape({
  url: "https://example.com",
  options: {
    formats: ["markdown", "html", "links", "images", "screenshot"],
    storeScreenshot: true,
  },
});
| Format       | Description                                             |
| ------------ | ------------------------------------------------------- |
| markdown | Clean markdown content (default) |
| html | Cleaned HTML |
| rawHtml | Original HTML source |
| links | URLs found on the page |
| images | Image URLs found on the page |
| summary | AI-generated page summary |
| screenshot | Screenshot URL (use storeScreenshot: true to persist) |
JSON Extraction
Extract structured data using a JSON schema:
const { jobId } = await scrape({
  url: "https://example.com/product",
  options: {
    extractionSchema: {
      type: "object",
      properties: {
        name: { type: "string" },
        price: { type: "number" },
      },
      required: ["name", "price"],
    },
  },
});
const content = await getContent({ id: jobId });
console.log(content.extractedJson); // { name: "Widget", price: 99.99 }
Cache Management
Cached results use superset matching: a cache entry with
["markdown", "screenshot"] satisfies a request for ["markdown"].
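The matching rule can be pictured with a minimal sketch. This is not the component's actual code: the `CacheEntry` shape and the function names are hypothetical, and only the superset rule and the 30-day default TTL come from this README.

```typescript
// Hypothetical shape of a cache entry; the component's real schema may differ.
interface CacheEntry {
  formats: string[]; // formats stored with the cached scrape
  scrapedAt: number; // epoch milliseconds when the scrape finished
}

// Default TTL stated in the feature list: 30 days.
const THIRTY_DAYS_MS = 30 * 24 * 60 * 60 * 1000;

// True when every requested format is already present in the cached entry.
function coversFormats(cached: string[], requested: string[]): boolean {
  const have = new Set(cached);
  return requested.every((format) => have.has(format));
}

// A hit needs both format coverage and an entry younger than the TTL.
function isCacheHit(
  entry: CacheEntry,
  requested: string[],
  now: number,
  ttlMs: number = THIRTY_DAYS_MS,
): boolean {
  return coversFormats(entry.formats, requested) && now - entry.scrapedAt < ttlMs;
}
```

Under this sketch, an entry holding ["markdown", "screenshot"] serves a request for ["markdown"], while a request for ["markdown", "html"] misses and would trigger a fresh scrape.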
// Check cache
const cached = await getCached({ url: "https://example.com" });
// Force refresh
const { jobId } = await scrape({ url, options: { force: true } });
// Invalidate cache
await invalidate({ url: "https://example.com" });
Proxy Options
For anti-bot protected sites:
const { jobId } = await scrape({
  url: "https://protected-site.com",
  options: {
    proxy: "stealth", // Residential proxy
    waitFor: 3000, // Wait for dynamic content
  },
});
Security
Always use exposeApi() - never expose component functions directly to
clients. Server-side code can call component internals directly, but doing so
bypasses authentication. The exposeApi() wrapper ensures:
- Authentication before any operation
- API key controlled by your callback, not callers
- Operation-specific authorization support
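As an illustration of operation-specific authorization, here is a minimal sketch of a policy the auth callback could apply. It rests on assumptions: the operation strings are guessed from the exported function names, and the admin-only rule for invalidate is purely hypothetical.

```typescript
// Assumed operation names, mirroring the functions returned by exposeApi();
// check the component's types for the real set.
type Operation = "scrape" | "getCached" | "getStatus" | "getContent" | "invalidate";

// Minimal stand-in for the identity returned by ctx.auth.getUserIdentity().
interface Identity {
  subject: string;
}

// Hypothetical policy: any signed-in user may scrape and read,
// but only listed admins may invalidate cache entries.
function authorize(
  identity: Identity | null,
  operation: Operation,
  adminSubjects: ReadonlySet<string>,
): void {
  if (!identity) throw new Error("Unauthorized");
  if (operation === "invalidate" && !adminSubjects.has(identity.subject)) {
    throw new Error("Forbidden");
  }
}
```

Inside the auth callback, a check like this would run before the API key is returned, so a rejected caller never reaches Firecrawl.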
// ❌ DANGEROUS - bypasses auth
export const scrape = components.firecrawlScrape.lib.startScrape;
// ✅ SAFE - auth enforced
export const { scrape } = exposeApi(components.firecrawlScrape, { auth: ... });
SSRF Protection: Built-in validation blocks localhost, private IPs, and non-HTTP schemes.
For domain allowlists, rate limiting, and detailed security guidance, see docs/SECURITY.md.
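To make the SSRF rule concrete, the following is a rough sketch of the kind of check involved. It is not the component's implementation: `isBlockedUrl` is a hypothetical name, the address ranges are the common private and loopback IPv4 blocks, and a production check would also need to handle DNS resolution and IPv6 ranges.

```typescript
// Returns true when the URL should be rejected before any request is made.
function isBlockedUrl(raw: string): boolean {
  let url: URL;
  try {
    url = new URL(raw);
  } catch (_err) {
    return true; // unparseable input is rejected
  }

  // Only plain web schemes are allowed.
  if (url.protocol !== "http:" && url.protocol !== "https:") return true;

  const host = url.hostname.toLowerCase();
  if (host === "localhost" || host.endsWith(".localhost")) return true;
  if (host === "[::1]" || host === "::1") return true; // IPv6 loopback

  // Dotted IPv4 literal: compare against loopback, private, and link-local blocks.
  const m = /^(\d{1,3})\.(\d{1,3})\.(\d{1,3})\.(\d{1,3})$/.exec(host);
  if (!m) return false; // ordinary hostname; DNS resolution is out of scope here
  const a = Number(m[1]);
  const b = Number(m[2]);
  return (
    a === 0 ||
    a === 10 ||
    a === 127 ||
    (a === 172 && b >= 16 && b <= 31) ||
    (a === 192 && b === 168) ||
    (a === 169 && b === 254)
  );
}
```

Checking the parsed hostname rather than the raw string matters because the URL parser normalizes alternative spellings of an address before the comparison runs.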
Error Handling
const status = await getStatus({ id: jobId });
if (status?.status === "failed") {
  console.error(status.error, status.errorCode);
  // errorCode is the HTTP status from Firecrawl (e.g., 402, 403, 429, 500)
}
Found a bug? Feature request? File it here.
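Since errorCode carries an HTTP status, a caller can branch on it to decide what to do after a failed scrape. The helpers below are a hypothetical sketch, not part of the component; the messages follow the standard HTTP meanings of the codes listed in the Error Handling example.

```typescript
// Hypothetical helper: is a failed scrape worth retrying later?
function isRetryable(errorCode: number | undefined): boolean {
  if (errorCode === undefined) return false;
  // 429 signals rate limiting and 5xx a server-side fault; both may succeed later.
  return errorCode === 429 || (errorCode >= 500 && errorCode <= 599);
}

// Hypothetical helper: turn the status code into a short user-facing message.
function describeFailure(errorCode: number | undefined): string {
  switch (errorCode) {
    case 402:
      return "Payment required: check billing for the API key.";
    case 403:
      return "Forbidden: the request was refused.";
    case 429:
      return "Rate limited: retry after a delay.";
    default:
      return isRetryable(errorCode)
        ? "Upstream server error: retry later."
        : "Scrape failed.";
  }
}
```

A UI could show describeFailure(status.errorCode) next to status.error and only offer a retry button when isRetryable returns true.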
Advanced Usage
For configuration options, the FirecrawlScrape class API, and URL utilities,
see docs/ADVANCED.md.
Development
npm install
npm run dev
License
Apache-2.0
