xtor
v0.1.0
Declarative HTML data extraction library with schema-based selectors
Features
- Declarative Schema - Define data structure with simple JSON schemas
- Cheerio-based - Fast and reliable HTML parsing
- Loop Accumulation - Built-in support for paginated data extraction
- Merge Strategies - Automatic data merging with concat/collect/merge strategies
- Deduplication - Remove duplicates by single or multiple fields
- TypeScript - Full type safety and IntelliSense support
Installation
npm install xtor
# or
pnpm add xtor
# or
yarn add xtor
Quick Start
Single Extraction
import { Extractor } from 'xtor';
const html = `
  <div class="product">
    <h3>iPhone 15</h3>
    <span class="price">$999</span>
  </div>
`;
const schema = {
  name: 'h3',
  price: '.price'
};
const extractor = new Extractor(schema);
const result = extractor.extract(html);
console.log(result);
// { name: 'iPhone 15', price: '$999' }
Array Extraction
const html = `
  <div class="products">
    <div class="product">
      <h3>iPhone 15</h3>
      <span class="price">$999</span>
    </div>
    <div class="product">
      <h3>MacBook Pro</h3>
      <span class="price">$2499</span>
    </div>
  </div>
`;
const schema = {
  products: ['.product', {
    name: 'h3',
    price: '.price'
  }]
};
const extractor = new Extractor(schema);
const result = extractor.extract(html);
console.log(result);
// {
//   products: [
//     { name: 'iPhone 15', price: '$999' },
//     { name: 'MacBook Pro', price: '$2499' }
//   ]
// }
Loop Accumulation
Use loop accumulation to combine results extracted from several pages, for example when paginating:
const schema = {
  products: ['.product', {
    id: '@data-id',
    name: 'h3',
    price: '.price'
  }]
};
const extractor = new Extractor(schema, {
  merge: 'concat', // Concatenate arrays
  unique: 'id'     // Remove duplicates by id
});
const accumulator = extractor.loop();
// Extract from page 1
accumulator.extract(page1Html);
// Extract from page 2
accumulator.extract(page2Html);
// Get final merged result
const result = accumulator.getResult();
Schema Syntax
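The selector strings in this section follow a `selector@attribute` pattern, where an empty selector refers to the current element. As a rough illustration of how such a string could be split, here is a minimal sketch. `parseSelector` and `ParsedSelector` are hypothetical names invented for this example, not part of xtor's API, and the sketch does not handle an `@` that appears inside a CSS attribute selector.

```typescript
// Hypothetical helper (an assumption, not xtor's implementation):
// splits a schema string such as 'img@src' into its two parts.
interface ParsedSelector {
  selector: string;         // CSS selector; '' means "the current element"
  attribute: string | null; // attribute name ('html' for inner HTML), or null for text content
}

function parseSelector(expr: string): ParsedSelector {
  // Split on the last '@' so the attribute name is whatever follows it.
  const at = expr.lastIndexOf('@');
  if (at === -1) {
    // No '@': the whole string is a selector and the text content is wanted.
    return { selector: expr.trim(), attribute: null };
  }
  return {
    selector: expr.slice(0, at).trim(),
    attribute: expr.slice(at + 1).trim(),
  };
}
```

Under these assumptions, `parseSelector('img@src')` yields `{ selector: 'img', attribute: 'src' }`, and `parseSelector('@data-id')` yields an empty selector with the attribute `'data-id'`.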
Basic Selectors
{
  title: 'h1',       // Text content
  image: 'img@src',  // Attribute value
  html: 'div@html',  // Inner HTML
  description: 'p'   // Text content
}
Arrays
// Simple array
{ links: ['a@href'] }
// Object array
{
  products: ['.product', {
    name: 'h3',
    price: '.price'
  }]
}
Nested Objects
{
  author: {
    name: '.author-name',
    avatar: '.author-avatar@src'
  }
}
Current Element
Use an empty string to extract from the current element:
['.product', {
  id: '@data-id',
  text: '' // Extract current element text
}]
Merge Strategies
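The three strategies described in this section can be pictured as a small reducer over successive extraction results. The sketch below is illustrative only: `accumulate` and `MergeStrategy` are hypothetical names, not xtor's API, and it assumes each strategy is applied to a whole value rather than field by field.

```typescript
type MergeStrategy = 'concat' | 'collect' | 'merge';

// Hypothetical reducer (an assumption, not xtor's implementation):
// folds the next extraction result into the accumulated value.
function accumulate(acc: unknown, next: unknown, strategy: MergeStrategy): unknown {
  switch (strategy) {
    case 'concat': {
      // Append the elements of the next array to the accumulated array.
      const left = acc === undefined ? [] : (acc as unknown[]);
      return [...left, ...(next as unknown[])];
    }
    case 'collect': {
      // Push the next result as a single element of the accumulated array.
      const left = acc === undefined ? [] : (acc as unknown[]);
      return [...left, next];
    }
    case 'merge':
      // Shallow-merge objects; later keys overwrite earlier ones.
      return Object.assign(
        {},
        (acc ?? {}) as Record<string, unknown>,
        next as Record<string, unknown>
      );
  }
}
```

Starting from `undefined` and calling `accumulate` once per page reproduces the page-by-page results shown for each strategy in this section.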
concat (default for arrays)
Concatenates arrays from multiple extractions:
// Page 1: ['A', 'B']
// Page 2: ['C', 'D']
// Result: ['A', 'B', 'C', 'D']
collect (default for objects)
Collects objects into an array:
// Page 1: { name: 'Alice' }
// Page 2: { name: 'Bob' }
// Result: [{ name: 'Alice' }, { name: 'Bob' }]
merge
Merges objects using Object.assign:
// Page 1: { name: 'Alice' }
// Page 2: { age: 25 }
// Result: { name: 'Alice', age: 25 }
Deduplication
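The `unique` option shown in this section accepts a field name, a list of field names, or an object with `by` and `keep`. As a minimal sketch of what such deduplication might look like, the function below builds a composite key from the chosen fields. `dedupe` and `UniqueOption` are hypothetical names for illustration, and the ordering used for `keep: 'last'` (the later item takes the position of the earlier one) is an assumption, not documented xtor behavior.

```typescript
type UniqueOption =
  | string
  | string[]
  | { by: string | string[]; keep?: 'first' | 'last' };

// Hypothetical helper (an assumption, not xtor's implementation).
function dedupe<T extends Record<string, unknown>>(items: T[], unique: UniqueOption): T[] {
  // Normalize the three accepted shapes into one { by, keep } object.
  const spec: { by: string | string[]; keep?: 'first' | 'last' } =
    typeof unique === 'string' || Array.isArray(unique) ? { by: unique } : unique;
  const fields = Array.isArray(spec.by) ? spec.by : [spec.by];
  const keep = spec.keep ?? 'first';

  const indexByKey: Record<string, number> = {};
  const out: T[] = [];
  for (const item of items) {
    // Composite key: the JSON-encoded list of the selected field values.
    const key = JSON.stringify(fields.map((f) => item[f]));
    const seenAt = indexByKey[key];
    if (seenAt === undefined) {
      indexByKey[key] = out.length;
      out.push(item);
    } else if (keep === 'last') {
      // Replace the earlier occurrence in place with the later one.
      out[seenAt] = item;
    }
  }
  return out;
}
```

With `keep` left unset, the first item seen for each key wins, matching the "Keep first by default" comment in the single-field example.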
By Single Field
const extractor = new Extractor(schema, {
  merge: 'concat',
  unique: 'id' // Keep first by default
});
By Multiple Fields
const extractor = new Extractor(schema, {
  merge: 'concat',
  unique: ['id', 'type'] // Composite key
});
Keep Last
const extractor = new Extractor(schema, {
  merge: 'concat',
  unique: {
    by: 'id',
    keep: 'last' // Keep last occurrence
  }
});
API Reference
Extractor
class Extractor {
  constructor(schema: XRaySchema, strategy?: LoopStrategy);
  // Single extraction
  extract(html: string): XRayResult | any[];
  // Create accumulator for loop extraction
  loop(): ExtractionAccumulator;
}
ExtractionAccumulator
class ExtractionAccumulator {
  // Extract and accumulate
  extract(html: string): any;
  // Get current result without extraction
  getResult(): any;
  // Get iteration count
  getIterationCount(): number;
  // Reset accumulated state
  reset(): void;
}
Types
interface XRaySchemaObject {
  [key: string]: XRayValue;
}
type XRaySchema = XRaySchemaObject | [string, XRaySchemaObject];
type XRayValue =
  | string                      // Simple selector
  | XRaySchemaObject            // Nested object
  | Array<string>               // Simple array
  | [string, XRaySchemaObject]; // Object array
interface LoopStrategy {
  merge: 'concat' | 'collect' | 'merge';
  unique?: string | string[] | {
    by: string | string[];
    keep?: 'first' | 'last';
  };
}
License
MIT
Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
