@scrappey/langchain


LangChain.js document loader for Scrappey — scrape web pages as clean Markdown for RAG / LLM ingestion, bypassing anti-bot protections (Cloudflare, DataDome, PerimeterX, etc.).

Scrappey returns pre-converted Markdown on the server, so no local HTML → Markdown conversion is needed — the output drops straight into a splitter + vector store.


  • Zero runtime dependencies (native fetch).
  • @langchain/core is the only peer dep.
  • ESM + CJS dual build with first-class TypeScript types.
  • Node 18+.

Installation

npm install @scrappey/langchain @langchain/core

Setup

  1. Sign up at scrappey.com and grab your API key.
  2. Set it in your environment (or pass it directly to the loader):
export SCRAPPEY_API_KEY="your_api_key"

Security: never hardcode your API key in committed code. Use a secret manager or .env file (the bundled .gitignore covers .env*). See SECURITY.md.
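
If the key lives in a secret manager rather than the environment, it can also be passed directly via the apiKey constructor option (documented below). A minimal sketch — getSecret is a hypothetical stand-in for your own retrieval helper:

import { ScrappeyLoader } from "@scrappey/langchain";

// getSecret is a hypothetical helper standing in for your secret manager.
const apiKey = await getSecret("scrappey-api-key");

const loader = new ScrappeyLoader({
  apiKey,
  urls: ["https://example.com"],
});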

Quickstart

import { ScrappeyLoader } from "@scrappey/langchain";

const loader = new ScrappeyLoader({
  urls: ["https://example.com", "https://news.ycombinator.com"],
});

const docs = await loader.load();
console.log(docs[0].pageContent.slice(0, 120));
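
load() resolves to one Document per URL — see the Document schema section below for the metadata attached to each.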

End-to-end RAG

import { ScrappeyLoader } from "@scrappey/langchain";
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

const loader = new ScrappeyLoader({
  urls: ["https://en.wikipedia.org/wiki/Web_scraping"],
  concurrency: 2,
});

const docs = await loader.load();
const splits = await new RecursiveCharacterTextSplitter({
  chunkSize: 1000,
  chunkOverlap: 150,
}).splitDocuments(docs);

const store = await MemoryVectorStore.fromDocuments(splits, new OpenAIEmbeddings());
const hits = await store.similaritySearch("What is web scraping?", 3);
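
similaritySearch returns the top-matching chunks as Documents; the splitter copies each page's metadata onto its chunks, so hits[0].metadata.source still points back at the scraped URL.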

Streaming with lazyLoad
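
lazyLoad() yields each Document as soon as its page has been fetched, so large URL lists can be processed incrementally instead of waiting for the whole batch: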

const loader = new ScrappeyLoader({ urls: bigUrlList }); // bigUrlList: your string[] of URLs
for await (const doc of loader.lazyLoad()) {
  // embed / persist as soon as each page lands, e.g.:
  console.log(doc.metadata.source, doc.pageContent.length);
}

Document schema

Each URL produces one Document:

| Field | Type | Value |
| ------------------------------- | ------- | ------------------------------------------------------------------------------ |
| pageContent | string | Markdown (default) or HTML body |
| metadata.source | string | The source URL |
| metadata.statusCode | number | Upstream HTTP status (inferred from data: "success" when Scrappey omits it) |
| metadata.verified | boolean | Scrappey's anti-bot verification flag |
| metadata.timeElapsedMs | number | How long Scrappey took to fetch the page |
| metadata.scrappey.cmd | string | Always "request.get" in v0.1 |
| metadata.scrappey.data? | string | Scrappey's top-level success marker, typically "success" |
| metadata.scrappey.session? | string | Session ID Scrappey used for the request |
| metadata.scrappey.type? | string | How Scrappey executed the scrape ("browser", "request", …) |
| metadata.scrappey.currentUrl? | string | Final URL after redirects |
| metadata.scrappey.cookieString? | string | Cookies set by the page, if any |
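
As a sketch of how these fields might be consumed (field names come from the table above; actual values depend on Scrappey's response):

const docs = await loader.load();
for (const doc of docs) {
  const { source, statusCode, verified, scrappey } = doc.metadata;
  console.log(`${source} -> ${statusCode} (verified: ${verified})`);
  // currentUrl differs from source when Scrappey followed redirects
  if (scrappey?.currentUrl && scrappey.currentUrl !== source) {
    console.log(`  redirected to ${scrappey.currentUrl}`);
  }
}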

Constructor options

| Option | Default | Purpose |
| ------------ | ---------------------------------------- | ----------------------------------------------------------------- |
| apiKey | process.env.SCRAPPEY_API_KEY | Scrappey API key |
| urls | required | URL or array of URLs to fetch |
| mode | "markdown" | "markdown" (server-side) or "html" (raw) |
| apiUrl | https://publisher.scrappey.com/api/v1 | Override for self-hosted / proxied endpoints |
| timeoutMs | 120_000 | Per-request timeout (AbortController) |
| concurrency | 1 | Parallel fetches |
| skipOnError | false | If true, failed URLs are console.warn'd and omitted from output |
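
Putting a few of these together, a more defensive setup might look like this sketch (all option names come from the table above):

const loader = new ScrappeyLoader({
  urls: ["https://example.com/a", "https://example.com/b"],
  mode: "html",      // raw HTML instead of server-side Markdown
  timeoutMs: 60_000, // abort any single request after 60 seconds
  concurrency: 4,    // fetch up to four URLs in parallel
  skipOnError: true, // warn and skip failed URLs instead of throwing
});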

Development

npm install
npm run typecheck    # strict TS, no emit
npm test             # vitest, mocked
npm run build        # tsup -> ESM + CJS + d.ts

To run the live integration test, set SCRAPPEY_LIVE_API_KEY — see CONTRIBUTING.md.

Security

The loader never logs, persists, or transmits your API key to any host other than https://publisher.scrappey.com/api/v1 (or the apiUrl you explicitly configure). See SECURITY.md for the reporting process and threat model.

Roadmap

v0.1 is intentionally minimal (URLs → Markdown Documents). Planned for later releases:

  • Session reuse (sessions.create / sessions.destroy) for cheaper crawls and consistent fingerprinting
  • proxyCountry, premiumProxy, browser configuration
  • browserActions, customHeaders, cookies, postData passthrough for JS-heavy and auth-gated pages
  • POST-body scraping via cmd: "request.post"
  • An agent Tool wrapper for live web access inside LLM tool-calling loops

License

MIT © pim
