scrapefrom
v2.7.3
Published
Scrape data from any publicly accessible webpage.
Downloads
515
Maintainers
Readme
scrapefrom
Scrape data from any publicly accessible webpage.
Installation
npm install scrapefrom
# or
yarn add scrapefromNode v16 Support
npm install [email protected]
# or
yarn add [email protected]Usage
const scrapefrom = require("scrapefrom"); // commonjs
// or
import scrapefrom from "scrapefrom"; // esmScrape full page data
Returns the document head and body as JSON, a dot path map of all available properties, and an extract function for pulling values by path.
await scrapefrom("https://example.com");
// { head: {...}, body: {...}, map: [...], extract: (path) => unknown }Extract elements by selector
await scrapefrom({ url: "https://example.com", extract: "h1" });
// { h1: [...] }Extract with a custom name
await scrapefrom({
url: "https://example.com",
extract: { name: "titles", selector: "h1" },
});
// { titles: [...] }Extract as a delimited string
await scrapefrom({
url: "https://example.com",
extract: { name: "title", selector: "h1", delimiter: "," },
});
// { title: "...,..." }Extract an attribute value
await scrapefrom({
url: "https://example.com",
extract: { name: "dates", selector: "time", attribute: "datetime" },
});
// { dates: [...] }Extract multiple selectors
await scrapefrom({
url: "https://example.com",
extracts: [
{ name: "titles", selector: "h1" },
{ name: "dates", selector: "time", attribute: "datetime" },
],
});
// { titles: [...], dates: [...] }Extract from multiple URLs
await scrapefrom(
{
url: "https://example.com",
extracts: [{ name: "titles", selector: "h1" }],
},
{
url: "https://example.org",
extracts: [{ name: "titles", selector: "h1" }],
},
);
// [{ titles: [...] }, { titles: [...] }]Extract structured row data
await scrapefrom({
url: "https://example.com",
extract: {
selector: "tbody tr",
name: "rows",
extracts: [
{ selector: "td:nth-child(1)", name: "key" },
{ selector: "td:nth-child(2)", name: "type" },
{ selector: "td:nth-child(3)", name: "definition" },
],
},
});
// { rows: [{ key: "...", type: "...", definition: "..." }, ...] }Extract JSON-LD / application/json script tags
await scrapefrom({
url: "https://example.com",
extract: { name: "schema", json: true },
});
// { schema: [...] }Use a custom extractor function
await scrapefrom({
url: "https://example.com",
extractor: ($) => $("h1").text(),
});Extract values by dot path key map
await scrapefrom({
url: "https://example.com",
parser: "json",
keyPath: { title: "data.title", author: "data.author" },
});
// { title: "...", author: "..." }If a page requires JavaScript
By default scrapefrom uses fetch under the hood. If a page requires JavaScript to render, use puppeteer instead — it runs a headless Chrome browser to bypass this requirement.
First install puppeteer:
npm install puppeteer
# or
yarn add puppeteerThen set use: "puppeteer" in your config:
await scrapefrom({
url: "https://example.com",
use: "puppeteer",
extracts: [{ name: "titles", selector: "h1" }],
});Config options
| Option | Type | Default | Description |
| ------------------------ | ------------------------ | -------------- | -------------------------------------- |
| url | string \| URL | | URL to scrape |
| name | string | url.hostname | Custom name assigned to config |
| use | "fetch" \| "puppeteer" | "fetch" | HTTP strategy to use |
| log | boolean | false | Enable request/response logging |
| timeout | number | 30000 | Request timeout in milliseconds |
| fetch | RequestInit | | Options passed to the fetch call |
| parser | "json" \| "text" | "text" | Response parser |
| launch | LaunchOptions | | Puppeteer launch options |
| cookies | CookieData[] | | Cookies to set before page request |
| preNavigate | { url, pageGoTo? } | | Optional pre-navigation page request |
| pageGoTo | GoToOptions | | Puppeteer page navigation options |
| waitForSelector | string | | Selector to wait for before parsing |
| waitForSelectorOptions | WaitForSelectorOptions | | Options for waitForSelector |
| select | string[] | | Selector and values for page.select |
| selects | string[][] | | Multiple page.select calls |
| keyPath | PathResolver | | Dot path key map for JSON responses |
| extractor | (res, raw) => unknown | | Custom extractor function |
| extract | ExtractConfig | | Single extraction config |
| extracts | ExtractConfig[] | | Multiple extraction configs |
| delimiter | string \| null | | Delimiter for joining extracted arrays |
| includeResponse | boolean | false | Include raw response in result |
| includeTimeout | boolean | false | Include timeout in result |
ExtractConfig options
| Option | Type | Description |
| ----------- | ----------------------------- | ----------------------------------------------- |
| name | string | Config name assigned to result |
| delimiter | string \| null | Join extracted array with this delimiter |
| selector | string | CSS selector to query |
| attribute | string | Element attribute to extract instead of content |
| json | boolean | Extract JSON-LD or application/json script tags |
| filter | (res) => boolean | Filter function for JSON extraction results |
| keyPath | PathResolver | Dot path key map for JSON extraction results |
| extract | ExtractConfig | Nested extraction config |
| extracts | ExtractConfig[] | Multiple nested extraction configs |
| extractor | ($, parentNode?) => unknown | Custom extractor function |
License
MIT
