@scrappey/langchain
v0.1.1
LangChain.js document loader for Scrappey — scrape web pages as clean Markdown for RAG / LLM ingestion, bypassing anti-bot protections (Cloudflare, DataDome, PerimeterX, etc.).
Scrappey returns pre-converted Markdown on the server, so no local HTML → Markdown conversion is needed — the output drops straight into a splitter + vector store.
- Zero runtime dependencies (native `fetch`). `@langchain/core` is the only peer dep.
- ESM + CJS dual build with first-class TypeScript types.
- Node 18+.
Installation
```sh
npm install @scrappey/langchain @langchain/core
```
Setup
- Sign up at scrappey.com and grab your API key.
- Set it in your environment (or pass it directly to the loader):
```sh
export SCRAPPEY_API_KEY="your_api_key"
```
Security: never hardcode your API key in committed code. Use a secret manager or a `.env` file (the bundled `.gitignore` covers `.env*`). See SECURITY.md.
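When `apiKey` isn't passed explicitly, the loader falls back to the `SCRAPPEY_API_KEY` environment variable. A minimal sketch of that fallback pattern — the `resolveApiKey` helper here is illustrative, not an export of the package:

```ts
// Resolve the API key: an explicit option wins, then the environment variable.
function resolveApiKey(explicit?: string): string {
  const key = explicit ?? process.env.SCRAPPEY_API_KEY;
  if (!key) {
    throw new Error("Missing Scrappey API key: pass apiKey or set SCRAPPEY_API_KEY");
  }
  return key;
}

process.env.SCRAPPEY_API_KEY = "demo-key"; // for illustration only
const key = resolveApiKey();           // "demo-key" (from the environment)
const explicitKey = resolveApiKey("explicit-key"); // explicit option takes precedence
```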
Quickstart
```ts
import { ScrappeyLoader } from "@scrappey/langchain";

const loader = new ScrappeyLoader({
  urls: ["https://example.com", "https://news.ycombinator.com"],
});

const docs = await loader.load();
console.log(docs[0].pageContent.slice(0, 120));
```
End-to-end RAG
```ts
import { ScrappeyLoader } from "@scrappey/langchain";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

const loader = new ScrappeyLoader({
  urls: ["https://en.wikipedia.org/wiki/Web_scraping"],
  concurrency: 2,
});

const docs = await loader.load();

const splits = await new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 150,
}).splitDocuments(docs);

const store = await MemoryVectorStore.fromDocuments(splits, new OpenAIEmbeddings());
const hits = await store.similaritySearch("What is web scraping?", 3);
```
Streaming with lazyLoad
```ts
const loader = new ScrappeyLoader({ urls: bigUrlList });

for await (const doc of loader.lazyLoad()) {
  // embed / persist as soon as each page lands
}
```
Document schema
Each URL produces one Document:
| Field | Type | Value |
| ---------------------------------- | --------- | ---------------------------------------------------------- |
| pageContent | string | Markdown (default) or HTML body |
| metadata.source | string | The source URL |
| metadata.statusCode | number | Upstream HTTP status (inferred from data: "success" when Scrappey omits it) |
| metadata.verified | boolean | Scrappey's anti-bot verification flag |
| metadata.timeElapsedMs | number | How long Scrappey took to fetch the page |
| metadata.scrappey.cmd | string | Always "request.get" in v0.1 |
| metadata.scrappey.data? | string | Scrappey's top-level success marker, typically "success" |
| metadata.scrappey.session? | string | Session ID Scrappey used for the request |
| metadata.scrappey.type? | string | How Scrappey executed the scrape ("browser", "request", …) |
| metadata.scrappey.currentUrl? | string | Final URL after redirects |
| metadata.scrappey.cookieString? | string | Cookies set by the page, if any |
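The metadata above makes it easy to drop unverified or failed fetches before indexing. A minimal sketch of that filtering step — the documents here are hand-built samples shaped like the schema, not real Scrappey output:

```ts
// Minimal Document shape matching the schema table above.
interface ScrapedDoc {
  pageContent: string;
  metadata: {
    source: string;
    statusCode: number;
    verified: boolean;
    timeElapsedMs: number;
  };
}

// Keep only pages that returned a 2xx status and passed anti-bot verification.
function pickIndexable(docs: ScrapedDoc[]): ScrapedDoc[] {
  return docs.filter(
    (d) =>
      d.metadata.verified &&
      d.metadata.statusCode >= 200 &&
      d.metadata.statusCode < 300
  );
}

// Hand-built sample documents for illustration.
const sampleDocs: ScrapedDoc[] = [
  {
    pageContent: "# Ok page",
    metadata: { source: "https://a.example", statusCode: 200, verified: true, timeElapsedMs: 900 },
  },
  {
    pageContent: "",
    metadata: { source: "https://b.example", statusCode: 403, verified: false, timeElapsedMs: 1200 },
  },
];

const indexable = pickIndexable(sampleDocs); // only the verified 200 survives
```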
Constructor options
| Option | Default | Purpose |
| -------------- | ----------------------------------------- | ----------------------------------------------------------------- |
| apiKey | process.env.SCRAPPEY_API_KEY | Scrappey API key |
| urls | required | URL or array of URLs to fetch |
| mode | "markdown" | "markdown" (server-side) or "html" (raw) |
| apiUrl | https://publisher.scrappey.com/api/v1 | Override for self-hosted / proxied endpoints |
| timeoutMs | 120_000 | Per-request timeout (AbortController) |
| concurrency | 1 | Parallel fetches |
| skipOnError | false | If true, failed URLs are console.warn'd and omitted from output |
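The `concurrency` option bounds how many URLs are in flight at once. A minimal sketch of that worker-pool pattern — the `mapWithConcurrency` helper is illustrative of the idea, not the loader's actual internals:

```ts
// Run an async mapper over items with at most `limit` tasks in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}

// Squares computed with at most 2 tasks running concurrently.
const squares = await mapWithConcurrency([1, 2, 3, 4], 2, async (n) => n * n);
```

Results land at their original index, so output order is preserved regardless of which task finishes first.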
Development
```sh
npm install
npm run typecheck   # strict TS, no emit
npm test            # vitest, mocked
npm run build       # tsup -> ESM + CJS + d.ts
```
To run the live integration test, set SCRAPPEY_LIVE_API_KEY — see CONTRIBUTING.md.
Security
The loader never logs, persists, or transmits your API key to any host
other than https://publisher.scrappey.com/api/v1 (or the apiUrl you
explicitly configure). See SECURITY.md for the
reporting process and threat model.
Roadmap
v0.1 is intentionally minimal (URLs → Markdown Documents). Planned for later releases:
- Session reuse (`sessions.create` / `sessions.destroy`) for cheaper crawls and consistent fingerprinting
- `proxyCountry`, `premiumProxy`, `browser` configuration
- `browserActions`, `customHeaders`, `cookies`, `postData` passthrough for JS-heavy and auth-gated pages
- POST-body scraping via `cmd: "request.post"`
- An agent `Tool` wrapper for live web access inside LLM tool-calling loops
License
MIT © pim
Links
- Scrappey homepage
- Scrappey wiki
- Companion LlamaIndex reader: `llama-index-readers-scrappey`
- Contributing · Security · Changelog
