@scrappey/langchain
v0.1.1
LangChain.js document loader for Scrappey — scrape web pages as clean Markdown for RAG / LLM ingestion, bypassing anti-bot protections (Cloudflare, DataDome, PerimeterX, etc.).
Scrappey returns pre-converted Markdown on the server, so no local HTML → Markdown conversion is needed — the output drops straight into a splitter + vector store.
- Zero runtime dependencies (native `fetch`). `@langchain/core` is the only peer dep.
- ESM + CJS dual build with first-class TypeScript types.
- Node 18+.
Installation
```sh
npm install @scrappey/langchain @langchain/core
```
Setup
- Sign up at scrappey.com and grab your API key.
- Set it in your environment (or pass it directly to the loader):
```sh
export SCRAPPEY_API_KEY="your_api_key"
```
Security: never hardcode your API key in committed code. Use a secret manager or a `.env` file (the bundled `.gitignore` covers `.env*`). See SECURITY.md.
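When `apiKey` isn't passed explicitly, the loader falls back to the `SCRAPPEY_API_KEY` environment variable. A minimal sketch of that fallback pattern — the `resolveApiKey` helper here is illustrative, not an export of the package:

```ts
// Resolve the API key: an explicit option wins, then the environment variable.
function resolveApiKey(explicit?: string): string {
  const key = explicit ?? process.env.SCRAPPEY_API_KEY;
  if (!key) {
    throw new Error("Missing Scrappey API key: pass apiKey or set SCRAPPEY_API_KEY");
  }
  return key;
}

process.env.SCRAPPEY_API_KEY = "demo-key"; // for illustration only
const key = resolveApiKey();           // "demo-key" (from the environment)
const explicitKey = resolveApiKey("explicit-key"); // explicit option takes precedence
```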
Quickstart
```ts
import { ScrappeyLoader } from "@scrappey/langchain";

const loader = new ScrappeyLoader({
  urls: ["https://example.com", "https://news.ycombinator.com"],
});

const docs = await loader.load();
console.log(docs[0].pageContent.slice(0, 120));
```
End-to-end RAG
```ts
import { ScrappeyLoader } from "@scrappey/langchain";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

const loader = new ScrappeyLoader({
  urls: ["https://en.wikipedia.org/wiki/Web_scraping"],
  concurrency: 2,
});

const docs = await loader.load();

const splits = await new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 150,
}).splitDocuments(docs);

const store = await MemoryVectorStore.fromDocuments(splits, new OpenAIEmbeddings());
const hits = await store.similaritySearch("What is web scraping?", 3);
```
Streaming with lazyLoad
```ts
const loader = new ScrappeyLoader({ urls: bigUrlList });

for await (const doc of loader.lazyLoad()) {
  // embed / persist as soon as each page lands
}
```
Document schema
Each URL produces one Document:
| Field | Type | Value |
| ---------------------------------- | --------- | ---------------------------------------------------------- |
| pageContent | string | Markdown (default) or HTML body |
| metadata.source | string | The source URL |
| metadata.statusCode | number | Upstream HTTP status (inferred from data: "success" when Scrappey omits it) |
| metadata.verified | boolean | Scrappey's anti-bot verification flag |
| metadata.timeElapsedMs | number | How long Scrappey took to fetch the page |
| metadata.scrappey.cmd | string | Always "request.get" in v0.1 |
| metadata.scrappey.data? | string | Scrappey's top-level success marker, typically "success" |
| metadata.scrappey.session? | string | Session ID Scrappey used for the request |
| metadata.scrappey.type? | string | How Scrappey executed the scrape ("browser", "request", …) |
| metadata.scrappey.currentUrl? | string | Final URL after redirects |
| metadata.scrappey.cookieString? | string | Cookies set by the page, if any |
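The metadata above makes it easy to drop unverified or failed fetches before indexing. A minimal sketch of that filtering step — the documents here are hand-built samples shaped like the schema, not real Scrappey output:

```ts
// Minimal Document shape matching the schema table above.
interface ScrapedDoc {
  pageContent: string;
  metadata: {
    source: string;
    statusCode: number;
    verified: boolean;
    timeElapsedMs: number;
  };
}

// Keep only pages that returned a 2xx status and passed anti-bot verification.
function pickIndexable(docs: ScrapedDoc[]): ScrapedDoc[] {
  return docs.filter(
    (d) =>
      d.metadata.verified &&
      d.metadata.statusCode >= 200 &&
      d.metadata.statusCode < 300
  );
}

// Hand-built sample documents for illustration.
const sampleDocs: ScrapedDoc[] = [
  {
    pageContent: "# Ok page",
    metadata: { source: "https://a.example", statusCode: 200, verified: true, timeElapsedMs: 900 },
  },
  {
    pageContent: "",
    metadata: { source: "https://b.example", statusCode: 403, verified: false, timeElapsedMs: 1200 },
  },
];

const indexable = pickIndexable(sampleDocs); // only the verified 200 survives
```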
Constructor options
| Option | Default | Purpose |
| -------------- | ----------------------------------------- | ----------------------------------------------------------------- |
| apiKey | process.env.SCRAPPEY_API_KEY | Scrappey API key |
| urls | required | URL or array of URLs to fetch |
| mode | "markdown" | "markdown" (server-side) or "html" (raw) |
| apiUrl | https://publisher.scrappey.com/api/v1 | Override for self-hosted / proxied endpoints |
| timeoutMs | 120_000 | Per-request timeout (AbortController) |
| concurrency | 1 | Parallel fetches |
| skipOnError | false | If true, failed URLs are console.warn'd and omitted from output |
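The `concurrency` option bounds how many URLs are in flight at once. A minimal sketch of that worker-pool pattern — the `mapWithConcurrency` helper is illustrative of the idea, not the loader's actual internals:

```ts
// Run an async mapper over items with at most `limit` tasks in flight.
async function mapWithConcurrency<T, R>(
  items: T[],
  limit: number,
  fn: (item: T) => Promise<R>
): Promise<R[]> {
  const results: R[] = new Array(items.length);
  let next = 0;

  // Each worker repeatedly claims the next unprocessed index until none remain.
  async function worker(): Promise<void> {
    while (next < items.length) {
      const i = next++;
      results[i] = await fn(items[i]);
    }
  }

  await Promise.all(
    Array.from({ length: Math.min(limit, items.length) }, worker)
  );
  return results;
}

// Squares computed with at most 2 tasks running concurrently.
const squares = await mapWithConcurrency([1, 2, 3, 4], 2, async (n) => n * n);
```

Results land at their original index, so output order is preserved regardless of which task finishes first.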
Development
```sh
npm install
npm run typecheck   # strict TS, no emit
npm test            # vitest, mocked
npm run build       # tsup -> ESM + CJS + d.ts
```
To run the live integration test, set SCRAPPEY_LIVE_API_KEY — see CONTRIBUTING.md.
Security
The loader never logs, persists, or transmits your API key to any host
other than https://publisher.scrappey.com/api/v1 (or the apiUrl you
explicitly configure). See SECURITY.md for the
reporting process and threat model.
Roadmap
v0.1 is intentionally minimal (URLs → Markdown Documents). Planned for later releases:
- Session reuse (`sessions.create` / `sessions.destroy`) for cheaper crawls and consistent fingerprinting
- `proxyCountry`, `premiumProxy`, `browser` configuration
- `browserActions`, `customHeaders`, `cookies`, `postData` passthrough for JS-heavy and auth-gated pages
- POST-body scraping via `cmd: "request.post"`
- An agent `Tool` wrapper for live web access inside LLM tool-calling loops
License
MIT © pim
Links
- Scrappey homepage
- Scrappey wiki
- Companion LlamaIndex reader: `llama-index-readers-scrappey`
- Contributing · Security · Changelog
