# @kreisler/js-scraper

v1.0.0

A powerful TypeScript/JavaScript web scraping library with JSDOM HTML parsing and Cloudflare bypass support.
## Features
- 🚀 Easy to use - Simple API for web scraping
- 🛡️ Cloudflare bypass - Built-in support for Cloudflare protection
- 📄 HTML Parsing - JSDOM-based DOM manipulation with jQuery-like syntax
- 🔄 Automatic retries - Configurable retry logic with exponential backoff
- 📊 Metadata extraction - Get response metadata (status, headers, content type, size, response time)
- 🎯 TypeScript support - Full TypeScript type definitions
- ⚡ Lightweight - Minimal footprint with only essential dependencies
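The automatic-retry feature above follows the usual exponential-backoff pattern. As a rough illustration of that pattern (a standalone sketch, not the package's source; the `retryWithBackoff` name and its defaults are invented for this example):

```typescript
// Illustrative sketch of retry with exponential backoff, the pattern the
// feature list describes. NOT the package's actual implementation.
async function retryWithBackoff<T>(
  fn: () => Promise<T>,
  retries = 3,
  baseDelayMs = 100
): Promise<T> {
  let lastError: unknown
  for (let attempt = 0; attempt <= retries; attempt++) {
    try {
      return await fn()
    } catch (err) {
      lastError = err
      if (attempt < retries) {
        // Wait 100 ms, 200 ms, 400 ms, ... before the next attempt
        await new Promise(resolve => setTimeout(resolve, baseDelayMs * 2 ** attempt))
      }
    }
  }
  throw lastError
}
```

With `retries: 3`, a request is attempted up to four times before the last error is rethrown.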
## Installation

```bash
npm install @kreisler/js-scraper
# or
yarn add @kreisler/js-scraper
# or
pnpm add @kreisler/js-scraper
```

## Quick Start
### Basic Scraping

```typescript
import { scrape } from '@kreisler/js-scraper'

const dom = await scrape('https://example.com')

// Extract elements
const title = dom.$('title')?.textContent
const paragraphs = dom.$$('p')
```

### With Metadata
```typescript
import { scrapeWithMetadata } from '@kreisler/js-scraper'

const response = await scrapeWithMetadata('https://example.com')
console.log(`Status: ${response.statusCode}`)
console.log(`Content-Type: ${response.contentType}`)
console.log(`Response Time: ${response.responseTime}ms`)
console.log(`Content Size: ${response.size} bytes`)
```

### Advanced Usage
```typescript
import { RequestService, jQuery } from '@kreisler/js-scraper'

// Custom options
const html = await RequestService.fetchData({
  url: 'https://example.com',
  method: 'GET',
  timeout: 30000,
  retries: 3,
  headers: {
    'Custom-Header': 'value'
  }
})

// Parse HTML
const dom = jQuery(html)

// DOM manipulation
const $ = dom.$   // querySelector
const $$ = dom.$$ // querySelectorAll
const title = dom.title
const document = dom.document
```

### Utility Functions
```typescript
import { scrape, getTexts, getAttrs, getText, getAttr } from '@kreisler/js-scraper'

const dom = await scrape('https://example.com')

// Get text from a single element
const heading = getText(dom.$('h1'))

// Get an attribute from a single element
const link = getAttr(dom.$('a'), 'href')

// Get text from multiple elements
const paragraphs = getTexts(dom.$$('p'))

// Get attributes from multiple elements
const links = getAttrs(dom.$$('a'), 'href')
```

## API Reference
### `scrape(url, options?)`
Fetch and parse HTML content from a URL.
**Parameters:**

- `url` (string) - The URL to scrape
- `options` (object, optional)
  - `timeout` (number) - Request timeout in ms (default: 30000)
  - `retries` (number) - Number of retry attempts (default: 3)
  - `headers` (object) - Custom headers

**Returns:** DOM object with jQuery-like syntax
### `scrapeWithMetadata(url, options?)`
Fetch content with metadata.
**Parameters:**

- `url` (string) - The URL to fetch
- `options` (object, optional)
  - `timeout` (number) - Request timeout in ms
  - `headers` (object) - Custom headers

**Returns:** `ParsedResponse` object with:

- `content` (string) - HTML content
- `statusCode` (number) - HTTP status code
- `headers` (object) - Response headers
- `contentType` (string) - Content type
- `size` (number) - Content size in bytes
- `responseTime` (number) - Response time in ms
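The `size` and `responseTime` fields combine naturally into derived metrics. A small illustration using mock data rather than a live request (the `ParsedResponse` interface below is transcribed from the field list above; the `throughputKBps` helper is hypothetical and not part of the package):

```typescript
// Shape transcribed from the ParsedResponse field list in this README.
interface ParsedResponse {
  content: string
  statusCode: number
  headers: Record<string, string>
  contentType: string
  size: number
  responseTime: number
}

// Estimate download throughput in KB/s from size (bytes) and responseTime (ms).
function throughputKBps(res: Pick<ParsedResponse, 'size' | 'responseTime'>): number {
  return (res.size / 1024) / (res.responseTime / 1000)
}
```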
### `ScraperService.fetchData(options)`
Low-level fetch with retry logic.
**Parameters:** `FetchOptions`

- `url` (string)
- `method` (string) - 'GET', 'POST', 'HEAD'
- `headers` (object, optional)
- `timeout` (number, optional)
- `retries` (number, optional)

**Returns:** `Promise<string>` - HTML content
### `ScraperService.fetchWithMetadata(options)`

Fetch with full metadata.

**Returns:** `Promise<ParsedResponse>`
### `jQuery(html)`
Parse HTML string and return DOM object.
**Returns:** DOM object with:

- `$(selector)` - querySelector
- `$$(selector)` - querySelectorAll
- `document` - JSDOM document
- `title` - Page title
### Utility Functions
- `getText(element)` - Get text content from element
- `getAttr(element, attr)` - Get attribute from element
- `getTexts(elements)` - Get text from multiple elements
- `getAttrs(elements, attr)` - Get attributes from multiple elements
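For readers curious how these helpers behave on missing elements, here is one plausible null-safe implementation (an illustration only; the package's actual source may differ, e.g. in its fallback values):

```typescript
// Minimal structural types standing in for DOM elements, so the sketch
// is self-contained. Each helper falls back to '' when input is missing.
type TextNode = { textContent: string | null }
type AttrNode = { getAttribute(name: string): string | null }

function getText(el: TextNode | null): string {
  return el?.textContent ?? ''
}

function getAttr(el: AttrNode | null, attr: string): string {
  return el?.getAttribute(attr) ?? ''
}

function getTexts(els: Iterable<TextNode>): string[] {
  return Array.from(els, el => el.textContent ?? '')
}

function getAttrs(els: Iterable<AttrNode>, attr: string): string[] {
  return Array.from(els, el => el.getAttribute(attr) ?? '')
}
```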
## Error Handling
```typescript
import { scrape, FetchError } from '@kreisler/js-scraper'

try {
  const dom = await scrape('https://example.com')
} catch (error) {
  if (error instanceof FetchError) {
    console.error(`Failed to fetch ${error.url}: ${error.message}`)
    console.error(`Status: ${error.statusCode}`)
  }
}
```

## Examples
### Scrape Product Information
```typescript
import { scrape } from '@kreisler/js-scraper'

async function scrapeProducts(url) {
  const dom = await scrape(url)
  const products = dom.$$('.product-item')

  return Array.from(products).map(product => ({
    name: product.querySelector('.name')?.textContent,
    price: product.querySelector('.price')?.textContent,
    link: product.querySelector('a')?.getAttribute('href')
  }))
}
```

### Scrape Data Table
```typescript
import { scrape } from '@kreisler/js-scraper'

async function scrapeTable(url) {
  const dom = await scrape(url)
  const rows = dom.$$('table tbody tr')

  return Array.from(rows).map(row => {
    const cells = row.querySelectorAll('td')
    return Array.from(cells).map(cell => cell.textContent)
  })
}
```

## Configuration
### Disable SSL Certificate Verification (Development Only)

```typescript
process.env.NODE_TLS_REJECT_UNAUTHORIZED = '0'
```

### Custom User-Agent
```typescript
import { scrape } from '@kreisler/js-scraper'

const dom = await scrape('https://example.com', {
  headers: {
    'User-Agent': 'Custom User Agent'
  }
})
```

## License
MIT
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
