xcrap

v2.0.1

Published

6 days ago

marcuth

web scraping xcrap scrapy

🕷️ xcrap

xcrap is a powerful and flexible web scraping framework for Node.js. It bundles best-in-class tools for data extraction and transformation into a single, easy-to-use package.

With xcrap, you can extract data from HTML, JSON, and Markdown using declarative models, and then clean, validate, and transform that data using a robust pipeline system.

📦 Installation

Install xcrap using your preferred package manager:

npm install xcrap

Or with yarn/pnpm:

yarn add xcrap
pnpm add xcrap

🚀 Features

Declarative Extraction: Extract data from HTML, JSON, and Markdown using simple, readable models (@xcrap/extractor).
Data Transformation: Clean, validate, and transform your extracted data with a powerful pipeline (@xcrap/transformer).
Modular Design: Built on top of a solid core (@xcrap/core) for reliability and extensibility.
Type-Safe: Written in TypeScript with full type definitions included.

📚 Quick Start

Here's a complete example showing how to extract data from an HTML string and then transform it into a clean, structured format.

1. Extraction

First, let's extract raw data from an HTML source using HtmlParser and HtmlExtractionModel.

import { HtmlParser, HtmlExtractionModel, css, extract } from "xcrap/extractor"

const html = `
<html>
  <body>
    <div class="product">
      <h1 id="title">  Cool Gadget  </h1>
      <span class="price">$99.99</span>
      <div class="details">
        <span data-spec="weight">250g</span>
        <a href="/specs.pdf">Download Specs</a>
      </div>
    </div>
  </body>
</html>
`

// Define the extraction model
const extractionModel = new HtmlExtractionModel({
  name: {
    query: css("#title"),
    extractor: extract("innerText")
  },
  price: {
    query: css(".price"),
    extractor: extract("innerText")
  },
  weight: {
    query: css("[data-spec='weight']"),
    extractor: extract("innerText")
  },
  specsUrl: {
    query: css("a"),
    extractor: extract("href")
  }
})

// Run the extraction
const parser = new HtmlParser(html)
const rawData = await parser.extractModel({ model: extractionModel })

console.log(rawData)
/*
Output:
{
  name: "  Cool Gadget  ",
  price: "$99.99",
  weight: "250g",
  specsUrl: "/specs.pdf"
}
*/
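
Every field in the output above is a string, and the transformation step that follows relies on that. As a minimal sketch, assuming you want to check the shape before moving on, a plain TypeScript type guard can narrow the result. The `RawProduct` interface and `isRawProduct` function are hypothetical names for illustration and are not exported by xcrap.

```typescript
// Hypothetical shape of the Quick Start result; not a type exported by xcrap.
interface RawProduct {
  name: string
  price: string
  weight: string
  specsUrl: string
}

const REQUIRED_KEYS = ["name", "price", "weight", "specsUrl"] as const

// Type guard: true only when every expected field is present and is a string.
function isRawProduct(value: unknown): value is RawProduct {
  if (typeof value !== "object" || value === null) return false
  const record = value as Record<string, unknown>
  return REQUIRED_KEYS.every((key) => typeof record[key] === "string")
}

// Stand-in for the object printed above.
const sample: unknown = {
  name: "  Cool Gadget  ",
  price: "$99.99",
  weight: "250g",
  specsUrl: "/specs.pdf"
}

if (isRawProduct(sample)) {
  // Inside this branch TypeScript treats sample as RawProduct.
  console.log(sample.price)
}
```

A guard like this fails fast when a selector stops matching, instead of letting a missing value surface later as a confusing error inside a transformer.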

2. Transformation

Now, let's clean up that raw data using Transformer and TransformingModel.

import { Transformer, TransformingModel, transform, StringTransformer, StringValidator } from "xcrap/transformer"

// Define the transformation model
const transformerModel = new TransformingModel({
  name: [
    transform({
      key: "name", // Use the 'name' field from rawData
      transformer: StringTransformer.trim // Trim whitespace
    })
  ],
  price: [
    transform({
      key: "price",
      transformer: (val) => parseFloat(val.replace("$", "")) // Custom cleanup
    })
  ],
  specsUrl: [
    transform({
      key: "specsUrl",
      transformer: StringTransformer.resolveUrl("https://myshop.com") // Resolve relative URL
    })
  ]
})

// Run the transformation
const transformer = new Transformer(rawData)
const cleanData = await transformer.transform(transformerModel)

console.log(cleanData)
/*
Output:
{
  name: "Cool Gadget",
  price: 99.99,
  specsUrl: "https://myshop.com/specs.pdf",
  weight: "250g"
}
*/
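
To make the three transformations above concrete, here is a rough plain TypeScript equivalent that does not use xcrap at all. The helper names (`trimText`, `parsePrice`, `resolveAgainst`) are hypothetical and only approximate what the library's transformers might do; URL resolution uses the standard `URL` constructor. The spread of `raw` mirrors the output shown above, where `weight` appears unchanged.

```typescript
// Raw values as printed by the extraction step.
const raw = {
  name: "  Cool Gadget  ",
  price: "$99.99",
  weight: "250g",
  specsUrl: "/specs.pdf"
}

// Hypothetical helpers for illustration; these are not xcrap exports.
const trimText = (value: string): string => value.trim()
const parsePrice = (value: string): number => parseFloat(value.replace("$", ""))
const resolveAgainst = (base: string) => (value: string): string =>
  new URL(value, base).href // standard WHATWG URL resolution

const clean = {
  ...raw, // fields without a transform are carried over as-is
  name: trimText(raw.name),
  price: parsePrice(raw.price),
  specsUrl: resolveAgainst("https://myshop.com")(raw.specsUrl)
}

console.log(clean)
```

Writing the steps out by hand shows what the declarative model saves you: each field's cleanup is stated once, next to the field name, rather than scattered through imperative code.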

📖 Core Concepts

Extractor (`@xcrap/extractor`)

The extraction engine allows you to parse structured data from various sources.

HtmlParser: For parsing HTML documents.
JsonParser: For traversing and extracting from JSON objects.
MarkdownParser: For extracting content from Markdown files.
HtmlExtractionModel: Defines the structure of the data you want to extract, using query builders (css, xpath) and extractors (extract).

Transformer (`@xcrap/transformer`)

The transformation engine allows you to process raw data into its final form.

Transformer: The main class that applies transformations to a dataset.
TransformingModel: Defines a declarative pipeline of transformations for each field.
StringTransformer: A collection of utility functions for common string operations (trim, replace, split, etc.).
StringValidator: A collection of utility functions for validating string content (isNumeric, isEmail, etc.).
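
The idea of a per-field pipeline that mixes validation and transformation can be sketched in plain TypeScript. This is an illustration of the concept only, not xcrap's implementation: `Step`, `runPipeline`, and the local `isNumeric` are hypothetical stand-ins, and how the library actually wires validators into a model is not shown in this README.

```typescript
// One pipeline step: takes the current value, returns the next one.
type Step = (value: unknown) => unknown

// Apply each step in order, feeding every result into the following step.
function runPipeline(input: unknown, steps: Step[]): unknown {
  return steps.reduce((current, step) => step(current), input)
}

// Hypothetical validator in the spirit of an isNumeric check.
const isNumeric = (value: string): boolean =>
  value.trim() !== "" && !Number.isNaN(Number(value))

const priceSteps: Step[] = [
  // 1. Strip the currency symbol.
  (value) => String(value).replace("$", ""),
  // 2. Validate: stop the pipeline if the remainder is not a number.
  (value) => {
    if (!isNumeric(String(value))) throw new Error(`not numeric: ${String(value)}`)
    return value
  },
  // 3. Convert to a number.
  (value) => Number(value)
]

console.log(runPipeline("$99.99", priceSteps))
```

Placing the validation step before the conversion means a value such as "free" raises an error instead of silently becoming NaN.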

🤝 Contributing

We welcome contributions, whether that means fixing bugs, improving documentation, or adding new features.

Fork the repository.
Create your feature branch (git checkout -b feature/amazing-feature).
Commit your changes (git commit -m 'Add some amazing feature').
Push to the branch (git push origin feature/amazing-feature).
Open a Pull Request.

📄 License

This project is licensed under the MIT License. See the LICENSE file for details.
