@nikx/dory-util
v1.0.10
Shared utilities for Dory platform
Shared utilities for the Dory web scraping platform. This package contains common code used across dory-core, dory-api, and dory-ui to maintain a single source of truth and reduce code duplication.
Features
- Unique Key Generation: Deterministic hash generation for duplicate detection
- Merged Output Format: Standardized output structure with metadata and content
- Type Definitions: Shared TypeScript types across all Dory projects
Installation
npm install @nikx/dory-util
Usage
Generate Unique Keys
import { generateUniqueKey } from '@nikx/dory-util';
// Generate key from entire dataset item
const key1 = generateUniqueKey(datasetItem);
// Generate key from specific fields
const key2 = generateUniqueKey(datasetItem, ['url', 'title']);
// Generate key using request metadata
const key3 = generateUniqueKey(datasetItem, [], requestData);
Build Merged Output
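Merging depends on every item having a stable key, so that the same record scraped twice collapses into a single entry. The sketch below illustrates that idea only and is not the package's implementation: `fnv1a`, `stableStringify`, `sketchKey`, and `dedupe` are hypothetical names, FNV-1a is a stand-in for whatever hash `generateUniqueKey` actually uses, and the request-metadata argument is not modelled.

```typescript
type Item = Record<string, unknown>;

// Stand-in hash (32-bit FNV-1a over UTF-16 code units); an assumption, not the package's algorithm.
function fnv1a(text: string): string {
  let hash = 0x811c9dc5;
  for (let i = 0; i < text.length; i++) {
    hash ^= text.charCodeAt(i);
    hash = Math.imul(hash, 0x01000193);
  }
  const hex = (hash >>> 0).toString(16);
  return ('0000000' + hex).slice(-8);
}

// Serialize with sorted object keys so property order does not change the result.
function stableStringify(value: unknown): string {
  if (Array.isArray(value)) {
    return `[${value.map((v) => stableStringify(v)).join(',')}]`;
  }
  if (value !== null && typeof value === 'object') {
    const obj = value as Item;
    const body = Object.keys(obj)
      .sort()
      .map((k) => `${JSON.stringify(k)}:${stableStringify(obj[k])}`)
      .join(',');
    return `{${body}}`;
  }
  return String(JSON.stringify(value));
}

// Hypothetical key function: hash the selected fields, or the whole item when none are given.
function sketchKey(item: Item, fields: string[] = []): string {
  let source: Item = item;
  if (fields.length > 0) {
    source = {};
    for (const field of fields) {
      source[field] = item[field];
    }
  }
  return fnv1a(stableStringify(source));
}

// Keep the first item seen for each key.
function dedupe(items: Item[], fields: string[] = []): Item[] {
  const seen = new Set<string>();
  const unique: Item[] = [];
  for (const item of items) {
    const key = sketchKey(item, fields);
    if (!seen.has(key)) {
      seen.add(key);
      unique.push(item);
    }
  }
  return unique;
}
```

With a selector such as `['url']`, two items that share a URL but differ in other fields count as duplicates; with no selector, the whole item is compared. For real use, a cryptographic hash such as SHA-256 would be a more typical choice than FNV-1a. The package's own merge helpers are called as follows.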
import { buildMergedOutput, buildSimpleMergedOutput } from '@nikx/dory-util';
// With request queue data
const mergedOutput = buildMergedOutput(
  datasets,
  requestQueueData,
  ['url'] // optional uniqueKeySelector
);
// Simple format (without request queue)
const simpleOutput = buildSimpleMergedOutput(
  datasets,
  ['url'] // optional uniqueKeySelector
);
Types
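The fields listed under Output Format can be written as TypeScript interfaces roughly as follows, together with a runtime check for data read back from disk or an API. This is a sketch: `SketchMeta`, `SketchMergedOutputItem`, `isPlainObject`, and `isMergedOutputItem` are hypothetical names, and the exported `MergedOutputItem` type remains the authoritative definition.

```typescript
// Mirrors the meta fields documented in the Output Format section.
interface SketchMeta {
  contentType: string;
  uniqueKey: string;
  sourceUrl: string | null;
  requestMethod?: string;
  loadedUrl?: string;
  label?: string;
  requestDepth?: number;
  retryCount?: number;
}

interface SketchMergedOutputItem {
  meta: SketchMeta;
  content: Record<string, unknown>;
}

function isPlainObject(value: unknown): value is Record<string, unknown> {
  return typeof value === 'object' && value !== null && !Array.isArray(value);
}

// Hypothetical type guard: checks only the three required meta fields.
function isMergedOutputItem(value: unknown): value is SketchMergedOutputItem {
  if (!isPlainObject(value)) return false;
  const meta = value['meta'];
  const content = value['content'];
  if (!isPlainObject(meta) || !isPlainObject(content)) return false;
  const sourceUrl = meta['sourceUrl'];
  return (
    typeof meta['contentType'] === 'string' &&
    typeof meta['uniqueKey'] === 'string' &&
    (typeof sourceUrl === 'string' || sourceUrl === null)
  );
}
```

The guard deliberately ignores the optional fields, so items produced without request queue data still pass. The exported types are imported like this: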
import type { MergedOutputItem, RequestQueueData } from '@nikx/dory-util';
const output: MergedOutputItem = {
  meta: {
    contentType: 'text/html',
    uniqueKey: 'abc123...',
    sourceUrl: 'https://example.com',
  },
  content: {
    title: 'Example',
    // ... scraped data
  }
};
Development
# Install dependencies
npm install
# Build the package
npm run build
# Watch for changes
npm run watch
# Clean build artifacts
npm run clean
Output Format
The merged output format provides a consistent structure:
{
  meta: {
    contentType: string;        // MIME type of the source
    uniqueKey: string;          // Content-based unique identifier
    sourceUrl: string | null;   // Original URL
    requestMethod?: string;     // HTTP method used
    loadedUrl?: string;         // Final URL after redirects
    label?: string;             // Request label from Crawlee
    requestDepth?: number;      // Crawl depth
    retryCount?: number;        // Number of retries
  },
  content: {
    // ... actual scraped data
  }
}
License
ISC
