xtor

v0.1.0

Published

8 months ago

Declarative HTML data extraction library with schema-based selectors

Vulnerabilities: 0 High, 0 Medium, 0 Low

daijiahua

html extractor scraper cheerio data-extraction web-scraping schema declarative

Features

Declarative Schema - Define the data structure with simple JSON-like schemas
Cheerio-based - Fast and reliable HTML parsing
Loop Accumulation - Built-in support for paginated data extraction
Merge Strategies - Automatic data merging with concat/collect/merge strategies
Deduplication - Remove duplicates by single or multiple fields
TypeScript - Full type safety and IntelliSense support

Installation

npm install xtor
# or
pnpm add xtor
# or
yarn add xtor

Quick Start

Single Extraction

import { Extractor } from 'xtor';

const html = `
  <div class="product">
    <h3>iPhone 15</h3>
    <span class="price">$999</span>
  </div>
`;

const schema = {
  name: 'h3',
  price: '.price'
};

const extractor = new Extractor(schema);
const result = extractor.extract(html);

console.log(result);
// { name: 'iPhone 15', price: '$999' }

Array Extraction

const html = `
  <div class="products">
    <div class="product">
      <h3>iPhone 15</h3>
      <span class="price">$999</span>
    </div>
    <div class="product">
      <h3>MacBook Pro</h3>
      <span class="price">$2499</span>
    </div>
  </div>
`;

const schema = {
  products: ['.product', {
    name: 'h3',
    price: '.price'
  }]
};

const extractor = new Extractor(schema);
const result = extractor.extract(html);

console.log(result);
// {
//   products: [
//     { name: 'iPhone 15', price: '$999' },
//     { name: 'MacBook Pro', price: '$2499' }
//   ]
// }

Loop Accumulation

Useful for pagination, where the data you want is spread across several pages:

const schema = {
  products: ['.product', {
    id: '@data-id',
    name: 'h3',
    price: '.price'
  }]
};

const extractor = new Extractor(schema, {
  merge: 'concat',      // Concatenate arrays
  unique: 'id'          // Remove duplicates by id
});

const accumulator = extractor.loop();

// Extract from page 1 (page1Html and page2Html are HTML strings fetched elsewhere)
accumulator.extract(page1Html);

// Extract from page 2
accumulator.extract(page2Html);

// Get final merged result
const result = accumulator.getResult();

Schema Syntax

Basic Selectors

{
  title: 'h1',              // Text content
  image: 'img@src',         // Attribute value
  html: 'div@html',         // Inner HTML
  description: 'p'          // Text content
}

Arrays

// Simple array
{ links: ['a@href'] }

// Object array
{
  products: ['.product', {
    name: 'h3',
    price: '.price'
  }]
}

Nested Objects

{
  author: {
    name: '.author-name',
    avatar: '.author-avatar@src'
  }
}

Current Element

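Use an empty string to extract the text of the current element. A selector that starts with @ reads an attribute of the current element:

['.product', {
  id: '@data-id',       // Attribute of the current element
  text: ''              // Text of the current element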
}]
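
The selector forms in this section (text, `selector@attr`, `@attr`, and the empty string) can be read as a CSS part plus an optional attribute part. The sketch below illustrates that reading only. `parseSelector` and `ParsedSelector` are hypothetical names, not part of xtor's API, and the split on the last `@` is an assumption that would not hold for CSS selectors that themselves contain `@`.

```typescript
// Hypothetical helper, not part of xtor: shows one way to read the
// selector strings used in this README.
interface ParsedSelector {
  css: string;          // CSS selector; '' stands for the current element
  attr: string | null;  // attribute name ('html' for inner HTML), null for text
}

function parseSelector(selector: string): ParsedSelector {
  // Assumption: the attribute part follows the last '@' in the string.
  const at = selector.lastIndexOf('@');
  if (at === -1) {
    return { css: selector.trim(), attr: null };
  }
  return {
    css: selector.slice(0, at).trim(),
    attr: selector.slice(at + 1).trim(),
  };
}

// Print the parsed form of each selector style shown above.
for (const s of ['h1', 'img@src', 'div@html', '@data-id', '']) {
  console.log(JSON.stringify(s), parseSelector(s));
}
```

Under this reading, `'@data-id'` and `''` both have an empty CSS part, which matches their use inside an object array where the current element is each `.product` match.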

Merge Strategies

concat (default for arrays)

Concatenates arrays from multiple extractions:

// Page 1: ['A', 'B']
// Page 2: ['C', 'D']
// Result: ['A', 'B', 'C', 'D']

collect (default for objects)

Collects objects into an array:

// Page 1: { name: 'Alice' }
// Page 2: { name: 'Bob' }
// Result: [{ name: 'Alice' }, { name: 'Bob' }]

merge

Merges objects using Object.assign, so a later page overwrites an earlier value for the same key:

// Page 1: { name: 'Alice' }
// Page 2: { age: 25 }
// Result: { name: 'Alice', age: 25 }
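
The three strategies can be compared side by side in plain TypeScript. This is a minimal sketch of the behaviour described above, not xtor's implementation. `combine` and `MergeStrategy` are hypothetical names, and the handling of the first page (starting from `undefined`) is an assumption.

```typescript
type MergeStrategy = 'concat' | 'collect' | 'merge';

// Hypothetical sketch: folds one more page's result into what was
// accumulated so far. `previous` is undefined before the first page.
function combine(strategy: MergeStrategy, previous: unknown, next: unknown): unknown {
  switch (strategy) {
    case 'concat': {
      // Append the new page's items to the existing array.
      const before = Array.isArray(previous) ? previous : [];
      const after = Array.isArray(next) ? next : [next];
      return [...before, ...after];
    }
    case 'collect': {
      // Keep each page's object as one element of an array.
      const before = Array.isArray(previous) ? previous : [];
      return [...before, next];
    }
    case 'merge':
      // Later pages overwrite earlier values for the same key.
      return Object.assign({}, previous, next);
  }
}

const concatResult = combine('concat', combine('concat', undefined, ['A', 'B']), ['C', 'D']);
console.log(concatResult); // [ 'A', 'B', 'C', 'D' ]

const collectResult = combine('collect', combine('collect', undefined, { name: 'Alice' }), { name: 'Bob' });
console.log(collectResult); // [ { name: 'Alice' }, { name: 'Bob' } ]

const mergeResult = combine('merge', combine('merge', undefined, { name: 'Alice' }), { age: 25 });
console.log(mergeResult); // { name: 'Alice', age: 25 }
```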

Deduplication

By Single Field

const extractor = new Extractor(schema, {
  merge: 'concat',
  unique: 'id'              // Keeps the first occurrence by default
});

By Multiple Fields

const extractor = new Extractor(schema, {
  merge: 'concat',
  unique: ['id', 'type']    // Composite key
});

Keep Last

const extractor = new Extractor(schema, {
  merge: 'concat',
  unique: {
    by: 'id',
    keep: 'last'            // Keep last occurrence
  }
});
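
To make the three `unique` forms concrete, here is a rough sketch of key-based deduplication in plain TypeScript. It is not xtor's code. `dedupe`, `UniqueOption`, and `Row` are hypothetical names, and two details are assumptions: composite keys are built with `JSON.stringify`, and with `keep: 'last'` the surviving item stays at the position of its first occurrence.

```typescript
type UniqueOption =
  | string
  | string[]
  | { by: string | string[]; keep?: 'first' | 'last' };

type Row = Record<string, unknown>;

// Hypothetical sketch of deduplication by one or more fields.
function dedupe(rows: Row[], unique: UniqueOption): Row[] {
  // Normalise the shorthand forms ('id', ['id', 'type']) to the object form.
  const options: { by: string | string[]; keep?: 'first' | 'last' } =
    typeof unique === 'string' || Array.isArray(unique) ? { by: unique } : unique;
  const fields = Array.isArray(options.by) ? options.by : [options.by];
  const keep = options.keep ?? 'first';

  // Assumption: the composite key is the JSON of the selected field values.
  const keyOf = (row: Row): string => JSON.stringify(fields.map((f) => row[f]));

  const seen = new Map<string, Row>();
  for (const row of rows) {
    const key = keyOf(row);
    if (keep === 'last' || !seen.has(key)) {
      seen.set(key, row); // 'last' replaces the stored row, 'first' never does
    }
  }
  return Array.from(seen.values());
}

const sampleRows: Row[] = [
  { id: '1', name: 'iPhone 15' },
  { id: '2', name: 'MacBook Pro' },
  { id: '1', name: 'iPhone 15 (updated)' },
];

console.log(dedupe(sampleRows, 'id'));
console.log(dedupe(sampleRows, { by: 'id', keep: 'last' }));
```

In the first call the original `iPhone 15` row survives; in the second the `(updated)` row replaces it.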

API Reference

Extractor

class Extractor {
  constructor(schema: XRaySchema, strategy?: LoopStrategy);

  // Single extraction
  extract(html: string): XRayResult | any[];

  // Create accumulator for loop extraction
  loop(): ExtractionAccumulator;
}

ExtractionAccumulator

class ExtractionAccumulator {
  // Extract and accumulate
  extract(html: string): any;

  // Get current result without extraction
  getResult(): any;

  // Get iteration count
  getIterationCount(): number;

  // Reset accumulated state
  reset(): void;
}
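
The accumulator's lifecycle (extract, read, count, reset) can be pictured with a small stand-in class. This is a sketch under stated assumptions, not the library's class: `SketchAccumulator` is a hypothetical name, it only implements concat-style accumulation, it takes an extraction function instead of a schema, and it assumes `extract` returns the accumulated result so far, which the signature above does not confirm.

```typescript
// Hypothetical stand-in for the accumulator contract listed above.
class SketchAccumulator<T> {
  private items: T[] = [];
  private iterations = 0;
  private readonly extractOnce: (html: string) => T[];

  constructor(extractOnce: (html: string) => T[]) {
    this.extractOnce = extractOnce;
  }

  // Extract from one page and append to the accumulated items.
  extract(html: string): T[] {
    this.items = this.items.concat(this.extractOnce(html));
    this.iterations += 1;
    return this.getResult();
  }

  // Return a copy so callers cannot mutate internal state.
  getResult(): T[] {
    return this.items.slice();
  }

  getIterationCount(): number {
    return this.iterations;
  }

  reset(): void {
    this.items = [];
    this.iterations = 0;
  }
}

// Stand-in extractor: treats each non-empty line as one item.
const splitLines = (text: string): string[] =>
  text.split('\n').filter((line) => line.length > 0);

const sketch = new SketchAccumulator<string>(splitLines);
sketch.extract('A\nB');
sketch.extract('C\nD');
console.log(sketch.getResult());         // [ 'A', 'B', 'C', 'D' ]
console.log(sketch.getIterationCount()); // 2
sketch.reset();
console.log(sketch.getIterationCount()); // 0
```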

Types

interface XRaySchemaObject {
  [key: string]: XRayValue;
}

type XRaySchema = XRaySchemaObject | [string, XRaySchemaObject];

type XRayValue =
  | string                          // Simple selector
  | XRaySchemaObject                // Nested object
  | Array<string>                   // Simple array
  | [string, XRaySchemaObject];     // Object array

interface LoopStrategy {
  merge: 'concat' | 'collect' | 'merge';
  unique?: string | string[] | {
    by: string | string[];
    keep?: 'first' | 'last';
  };
}
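
The four `XRayValue` variants can be told apart at runtime with ordinary type checks. The sketch below copies the two type declarations from above so it is self-contained. `describeValue` is a hypothetical helper, not an xtor export, and treating a two-element array whose second element is an object as an object array is an assumption based on the schema examples in this README.

```typescript
// Copied from the Types section so the example compiles on its own.
interface XRaySchemaObject {
  [key: string]: XRayValue;
}

type XRayValue =
  | string
  | XRaySchemaObject
  | Array<string>
  | [string, XRaySchemaObject];

// Hypothetical helper: names the variant a schema value belongs to.
function describeValue(value: XRayValue): string {
  if (typeof value === 'string') {
    return 'selector';
  }
  if (Array.isArray(value)) {
    // [selector, { ... }] is an object array; anything else is a simple array.
    return value.length === 2 && typeof value[1] === 'object'
      ? 'object array'
      : 'simple array';
  }
  return 'nested object';
}

console.log(describeValue('img@src'));                     // selector
console.log(describeValue(['a@href']));                    // simple array
console.log(describeValue(['.product', { name: 'h3' }]));  // object array
console.log(describeValue({ name: '.author-name' }));      // nested object
```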

License

MIT

Contributing

Contributions are welcome! Please feel free to submit a Pull Request.
