@fustilio/data-loader
v0.7.0
import { FormatProcessor } from '@fustilio/data-loader';
const processor = new FormatProcessor();
const result = await processor.processData(arrayBuffer, {
  projectId: 'my-project:1.0',
  enableTarExtraction: true
});
if (result.success) {
  console.log('XML content:', result.xmlContent);
  console.log('Processing steps:', result.processingSteps);
  console.log('Content type:', result.contentType);
  console.log('Confidence:', result.confidence);
} else {
  console.error('Processing failed:', result.error);
}
ContentTypeDetector - Format Detection
Detects the format of decompressed content (XML, tar, TSV, or unknown) and reports a confidence level (high, medium, or low).
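To make the idea of detection with confidence levels concrete, here is a minimal sketch of one way such heuristics could be written. It is not the package's actual detection logic: the function name `sketchDetectContentType`, the `Sketch*` types, and the specific rules (XML declaration, `ustar` magic, consistent tab counts) are all assumptions chosen for illustration.

```typescript
// Hypothetical sketch; the real ContentTypeDetector may use different rules.
type SketchType = 'xml' | 'tar' | 'tsv' | 'unknown';
type SketchConfidence = 'high' | 'medium' | 'low';

interface SketchAnalysis {
  type: SketchType;
  confidence: SketchConfidence;
  indicators: string[];
}

function sketchDetectContentType(content: string): SketchAnalysis {
  const head = content.slice(0, 1024).trimStart();

  // An XML declaration is a strong signal; a bare leading tag is a weaker one.
  if (head.startsWith('<?xml')) {
    return { type: 'xml', confidence: 'high', indicators: ['XML declaration'] };
  }
  if (/^<[A-Za-z_][\w.:-]*[\s>\/]/.test(head)) {
    return { type: 'xml', confidence: 'medium', indicators: ['leading element tag'] };
  }

  // POSIX tar headers carry the magic string "ustar" at byte offset 257.
  if (content.slice(257, 262) === 'ustar') {
    return { type: 'tar', confidence: 'high', indicators: ['ustar magic at offset 257'] };
  }

  // TSV: the first few non-empty lines all split into the same number (>1) of fields.
  const lines = content.split(/\r?\n/, 5).filter((line) => line.length > 0);
  const counts = lines.map((line) => line.split('\t').length);
  if (lines.length >= 2 && counts[0] > 1 && counts.every((c) => c === counts[0])) {
    return {
      type: 'tsv',
      confidence: 'medium',
      indicators: [`${counts[0]} tab-separated columns`]
    };
  }

  return { type: 'unknown', confidence: 'low', indicators: [] };
}

console.log(sketchDetectContentType('<?xml version="1.0"?><root/>'));
console.log(sketchDetectContentType('id\tname\n1\talpha\n'));
```

Returning the matched indicators alongside the type keeps the decision explainable, which is useful when a low-confidence result needs to be investigated.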
import { ContentTypeDetector } from '@fustilio/data-loader';
const detector = new ContentTypeDetector();
const analysis = detector.detectContentType(xmlText, 'my-project:1.0');
console.log('Detected type:', analysis.type); // 'xml', 'tar', 'tsv', 'unknown'
console.log('Confidence:', analysis.confidence); // 'high', 'medium', 'low'
console.log('Indicators:', analysis.indicators); // Detailed detection info
GzipHandler - Gzip Decompression
Handles gzip decompression with detailed logging and timeout protection.
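For readers who want to see what gzip detection and decompression involve, the sketch below checks the two gzip magic bytes (`0x1f 0x8b`, per RFC 1952) and decompresses with Node's built-in `node:zlib`. This is a stand-in, not the package's `GzipHandler`: the names `looksLikeGzip`, `sketchGunzip`, and `SketchDecompressionResult` are hypothetical, and the result shape only loosely mirrors the fields used in this README.

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';

// Hypothetical result shape, loosely modelled on the fields shown in this README.
interface SketchDecompressionResult {
  success: boolean;
  data?: string;
  decompressedSize?: number;
  error?: string;
}

// Gzip streams start with the two magic bytes 0x1f 0x8b (RFC 1952).
function looksLikeGzip(data: Uint8Array): boolean {
  return data.length >= 2 && data[0] === 0x1f && data[1] === 0x8b;
}

// Decompress and report failures as a result object instead of throwing.
function sketchGunzip(data: Uint8Array): SketchDecompressionResult {
  if (!looksLikeGzip(data)) {
    return { success: false, error: 'Missing gzip magic bytes' };
  }
  try {
    const out = gunzipSync(data);
    return { success: true, data: out.toString('utf8'), decompressedSize: out.length };
  } catch (err) {
    return { success: false, error: err instanceof Error ? err.message : String(err) };
  }
}

// Round trip: compress a small XML string, then decompress it again.
const sampleGzip = gzipSync(Buffer.from('<root>hello</root>', 'utf8'));
const roundTrip = sketchGunzip(sampleGzip);
console.log(roundTrip.success, roundTrip.data);
```

Checking the magic bytes first gives a clear error for non-gzip input rather than a low-level zlib message.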
import { GzipHandler } from '@fustilio/data-loader';
const gzipHandler = new GzipHandler();
if (gzipHandler.isGzipCompressed(data)) {
  const result = await gzipHandler.decompress(data);
  if (result.success) {
    console.log('Decompressed size:', result.decompressedSize);
    console.log('XML content:', result.data);
  } else {
    console.error('Decompression failed:', result.error);
  }
}
XzHandler - XZ Decompression
Handles XZ decompression using the xz-decompress library.
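As background on how XZ data can be recognised before it is handed to a decompressor, the sketch below checks the six-byte XZ stream header magic (`FD 37 7A 58 5A 00`) and the two-byte footer magic (`59 5A`, ASCII "YZ"). The function names are hypothetical and this is not the package's `XzHandler`. The footer check assumes a single stream with no trailing stream padding, so treat a "truncated" verdict as a hint rather than proof.

```typescript
// Magic bytes from the .xz file format: stream header and stream footer.
const XZ_HEADER_MAGIC = [0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00];
const XZ_FOOTER_MAGIC = [0x59, 0x5a]; // ASCII "YZ"

function hasXzHeader(data: Uint8Array): boolean {
  return (
    data.length >= XZ_HEADER_MAGIC.length &&
    XZ_HEADER_MAGIC.every((byte, i) => data[i] === byte)
  );
}

function hasXzFooter(data: Uint8Array): boolean {
  const n = data.length;
  return n >= 2 && data[n - 2] === XZ_FOOTER_MAGIC[0] && data[n - 1] === XZ_FOOTER_MAGIC[1];
}

// Hypothetical helper: classify a buffer before attempting decompression.
function checkXzEnvelope(data: Uint8Array): 'ok' | 'not-xz' | 'possibly-truncated' {
  if (!hasXzHeader(data)) return 'not-xz';
  return hasXzFooter(data) ? 'ok' : 'possibly-truncated';
}

// Synthetic bytes for illustration only; this is not a decodable XZ stream.
const fakeXz = new Uint8Array([0xfd, 0x37, 0x7a, 0x58, 0x5a, 0x00, 1, 2, 3, 0x59, 0x5a]);
console.log(checkXzEnvelope(fakeXz));
console.log(checkXzEnvelope(fakeXz.subarray(0, 8)));
```

A check like this is cheap and can catch an interrupted download before any decompression work starts.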
import { XzHandler } from '@fustilio/data-loader';
const xzHandler = new XzHandler();
if (xzHandler.isXzCompressed(data)) {
  const result = await xzHandler.decompress(data);
  if (result.success) {
    console.log('Decompressed size:', result.decompressedSize);
    console.log('XML content:', result.data);
  }
}
TarHandler - Archive Extraction
Extracts tar archives to find XML files with fallback methods.
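To illustrate what finding XML files in a tar archive involves, here is a minimal sketch that walks ustar headers by hand: each entry is a 512-byte header (name at offset 0, octal size at offset 124, `ustar` magic at offset 257) followed by data padded to a 512-byte boundary. The names `listTarEntries`, `SketchTarEntry`, and `buildSketchArchive` are hypothetical, checksums and long-name extensions are ignored, and the package itself credits tar-stream for this job.

```typescript
interface SketchTarEntry {
  name: string;
  size: number;
  dataOffset: number; // where the entry's bytes start in the archive
}

const TAR_BLOCK = 512;

// Read a NUL-terminated ASCII field from a header block.
function readTarField(block: Uint8Array, start: number, length: number): string {
  const raw = block.subarray(start, start + length);
  const end = raw.indexOf(0);
  return new TextDecoder().decode(end === -1 ? raw : raw.subarray(0, end));
}

function listTarEntries(archive: Uint8Array): SketchTarEntry[] {
  const entries: SketchTarEntry[] = [];
  let offset = 0;
  while (offset + TAR_BLOCK <= archive.length) {
    const header = archive.subarray(offset, offset + TAR_BLOCK);
    if (header.every((b) => b === 0)) break; // zero block marks the end of the archive
    if (readTarField(header, 257, 5) !== 'ustar') break; // not a ustar header; stop
    const name = readTarField(header, 0, 100);
    const size = parseInt(readTarField(header, 124, 12).trim(), 8) || 0;
    entries.push({ name, size, dataOffset: offset + TAR_BLOCK });
    offset += TAR_BLOCK + Math.ceil(size / TAR_BLOCK) * TAR_BLOCK;
  }
  return entries;
}

// Build a one-file archive in memory for the demo (no checksum, illustration only).
function buildSketchArchive(name: string, body: string): Uint8Array {
  const enc = new TextEncoder();
  const data = enc.encode(body);
  const padded = Math.ceil(data.length / TAR_BLOCK) * TAR_BLOCK;
  const archive = new Uint8Array(TAR_BLOCK + padded + 2 * TAR_BLOCK);
  archive.set(enc.encode(name), 0);
  archive.set(enc.encode(data.length.toString(8).padStart(11, '0')), 124);
  archive.set(enc.encode('ustar'), 257);
  archive.set(data, TAR_BLOCK);
  return archive;
}

const sketchArchive = buildSketchArchive('data.xml', '<root/>');
const xmlEntries = listTarEntries(sketchArchive).filter((e) => e.name.endsWith('.xml'));
console.log(xmlEntries);
```

Given an entry, its content is the slice from `dataOffset` to `dataOffset + size`, which can then be decoded as UTF-8 text.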
import { TarHandler } from '@fustilio/data-loader';
const tarHandler = new TarHandler();
if (tarHandler.isTarArchive(content)) {
  const result = await tarHandler.extractTarArchive(data);
  if (result.success) {
    console.log('Extracted files:', result.extractedFiles.length);
    console.log('XML content:', result.xmlContent);
  }
}
🌍 Data Processing Examples
Common Use Cases
The system is designed to handle various types of compressed and archived data:
- XML Datasets: Large XML files with complex structures
- Compressed Archives: Multi-file packages with different compression methods
- Tabular Data: TSV files with structured content
- Configuration Files: Various text-based configuration formats
- Data Archives: Multi-file packages and distributions
URL Processing
import { FormatProcessor } from '@fustilio/data-loader';
const processor = new FormatProcessor();
// Download and process compressed data
const response = await fetch('https://example.com/data.xml.gz');
// Fail early on HTTP errors so an error page is not processed as data
if (!response.ok) {
  throw new Error(`Download failed: ${response.status} ${response.statusText}`);
}
const arrayBuffer = await response.arrayBuffer();
const result = await processor.processData(arrayBuffer, {
  projectId: 'my-project:1.0',
  enableTarExtraction: true
});
if (result.success) {
  // Process the extracted content
  console.log('Successfully processed data');
  console.log('Processing steps:', result.processingSteps);
}
Format Detection Examples
// XML files
const xmlResult = await processor.processData(xmlData, {
  projectId: 'xml-project:1.0',
  enableTarExtraction: true
});
// Result: { contentType: 'xml', confidence: 'high' }
// Compressed tar archives
const tarResult = await processor.processData(tarData, {
  projectId: 'archive-project:1.0',
  enableTarExtraction: true
});
// Result: { contentType: 'tar', confidence: 'high' }
// Tabular data
const tsvResult = await processor.processData(tsvData, {
  projectId: 'data-project:1.0',
  enableTarExtraction: true
});
// Result: { contentType: 'tsv', confidence: 'high' }
🧪 Testing
Comprehensive Test Coverage
The package includes extensive tests with real data:
# Run all format handler tests
pnpm test src/formats/
# Run specific handler tests
pnpm test src/formats/__tests__/format-processor.test.ts
pnpm test src/formats/__tests__/gzip-handler.test.ts
pnpm test src/formats/__tests__/xz-handler.test.ts
pnpm test src/formats/__tests__/tar-handler.test.ts
Test Categories
- Unit Tests: Individual handler functionality
- Integration Tests: End-to-end processing pipelines
- Real Data Tests: Actual data processing with various formats
- Error Handling Tests: Graceful failure scenarios
- Performance Tests: Large file processing
Test Data
Tests use real data from various sources:
- XML files: Various XML structures and schemas
- Compressed formats: gzip, xz, tar combinations
- Large files: 100MB+ datasets
- Edge cases: Malformed data, network errors
🔧 Configuration Options
FormatProcessingOptions
interface FormatProcessingOptions {
  projectId: string; // Project identifier for format detection
  forceType?: ContentType; // Force specific content type
  enableTarExtraction?: boolean; // Enable tar archive extraction
}
ContentType
type ContentType = 'xml' | 'tar' | 'tsv' | 'unknown';
FormatProcessingResult
interface FormatProcessingResult {
  success: boolean; // Processing success status
  xmlContent?: string; // Extracted XML content
  error?: string; // Error message if failed
  contentType: ContentType; // Detected content type
  confidence: 'high' | 'medium' | 'low'; // Detection confidence
  processingSteps: string[]; // Processing step log
  totalProcessingTime: number; // Total processing time (ms)
  originalSize: number; // Original data size (bytes)
  finalSize: number; // Final content size (chars)
  extractedXmlFiles?: Array<{name: string, size: number}>; // Info about extracted files
}
🚀 Performance
Optimization Features
- Streaming Processing: Handle files of any size
- Memory Management: Efficient memory usage for large files
- Timeout Protection: Prevent hanging on corrupted data
- Error Recovery: Graceful handling of processing failures
- Caching: Reuse decompressed data when possible
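The timeout protection listed above can be pictured as racing the work against a timer. The sketch below shows that general pattern; `withTimeout` is a hypothetical helper, not an export of this package, and it only stops waiting for the work rather than cancelling it.

```typescript
// Hypothetical helper: reject if `work` has not settled within `ms` milliseconds.
async function withTimeout<T>(work: Promise<T>, ms: number, label = 'operation'): Promise<T> {
  let timer: ReturnType<typeof setTimeout> | undefined;
  const timeout = new Promise<never>((_resolve, reject) => {
    timer = setTimeout(() => reject(new Error(`${label} timed out after ${ms} ms`)), ms);
  });
  try {
    // Whichever promise settles first decides the outcome.
    return await Promise.race([work, timeout]);
  } finally {
    // Always clear the timer so it cannot keep the process alive.
    clearTimeout(timer);
  }
}

// Usage: a task that finishes in time, and one that never settles.
const quickTask = new Promise<string>((resolve) => setTimeout(() => resolve('done'), 5));
withTimeout(quickTask, 1000, 'quick task').then((value) => console.log(value));

const stuckTask = new Promise<string>(() => { /* never settles */ });
withTimeout(stuckTask, 20, 'stuck task').catch((error: Error) => console.log(error.message));
```

Because the underlying work keeps running after the timeout fires, a real implementation would also need a way to abort it, for example by passing an `AbortSignal` to operations that support one.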
Performance Metrics
- Gzip Decompression: ~50MB/s on modern hardware
- XZ Decompression: ~20MB/s on modern hardware
- Tar Extraction: ~100MB/s on modern hardware
- Memory Usage: <100MB for 1GB+ files
- Processing Time: <30s for typical large files
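Throughput figures like these depend heavily on hardware and on the data being decompressed, so it can help to measure on your own machine. The sketch below times gzip decompression with Node's built-in `node:zlib` as a rough stand-in; `measureGunzipThroughput` is a hypothetical helper, it does not exercise this package's handlers, and the highly repetitive synthetic payload will tend to decompress faster than real datasets.

```typescript
import { gzipSync, gunzipSync } from 'node:zlib';
import { performance } from 'node:perf_hooks';

// Returns decompression throughput in MB/s, measured on decompressed output size.
function measureGunzipThroughput(decompressedBytes: number): number {
  // Build a repetitive XML-like payload of roughly the requested size.
  const line = '<entry id="1"><form>example</form></entry>\n';
  const payload = Buffer.from(line.repeat(Math.ceil(decompressedBytes / line.length)));
  const compressed = gzipSync(payload);

  const start = performance.now();
  const out = gunzipSync(compressed);
  // Guard against a zero reading on very small inputs.
  const seconds = Math.max((performance.now() - start) / 1000, 1e-9);

  return out.length / (1024 * 1024) / seconds;
}

const mbPerSecond = measureGunzipThroughput(8 * 1024 * 1024);
console.log(`gunzip throughput: ~${mbPerSecond.toFixed(0)} MB/s`);
```

For more stable numbers, run the measurement several times and use the median, and prefer a sample of your real data over a synthetic payload.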
🛠️ Development
Adding New Format Handlers
- Create Handler Class:
export class NewFormatHandler {
  isNewFormat(data: Uint8Array): boolean {
    // Detection logic
    return false; // Placeholder until detection is implemented
  }
  async process(data: ArrayBuffer): Promise<ProcessingResult> {
    // Processing logic
    throw new Error('Not implemented'); // Placeholder
  }
}
- Integrate with FormatProcessor:
// Add to FormatProcessor constructor
private newFormatHandler: NewFormatHandler;
// Add to processData method
if (this.newFormatHandler.isNewFormat(view)) {
  // Processing logic
}
- Add Tests:
describe('NewFormatHandler', () => {
  it('should detect new format', () => {
    // Test detection
  });
  it('should process new format', async () => {
    // Test processing
  });
});
Error Handling
try {
  const result = await processor.processData(data, options);
  if (!result.success) {
    console.error('Processing failed:', result.error);
    console.log('Processing steps:', result.processingSteps);
    return;
  }
  // Use result.xmlContent
} catch (error) {
  console.error('Unexpected error:', error);
}
🔍 Troubleshooting
Common Issues
Decompression Failures:
- Check data integrity and format
- Verify compression method compatibility
- Check available memory and disk space
Format Detection Issues:
- Verify content type indicators
- Check project ID configuration
- Review detection confidence levels
Tar Extraction Problems:
- Verify tar archive integrity
- Check for expected files in archive
- Review extraction permissions
Debug Information
const result = await processor.processData(data, options);
console.log('Processing steps:', result.processingSteps);
console.log('Content type:', result.contentType);
console.log('Confidence:', result.confidence);
console.log('Processing time:', result.totalProcessingTime + 'ms');
console.log('Size change:', result.originalSize + ' bytes → ' + result.finalSize + ' chars');
📚 API Reference
FormatProcessor
Methods
processData(data: ArrayBuffer, options: FormatProcessingOptions): Promise<FormatProcessingResult>
getProcessingStats(): ProcessingStats
ContentTypeDetector
Methods
detectContentType(content: string, projectId: string): ContentAnalysis
GzipHandler
Methods
isGzipCompressed(data: Uint8Array): boolean
decompress(data: Uint8Array): Promise<GzipDecompressionResult>
XzHandler
Methods
isXzCompressed(data: Uint8Array): boolean
decompress(data: Uint8Array): Promise<XzDecompressionResult>
TarHandler
Methods
isTarArchive(content: string): boolean
extractTarArchive(data: ArrayBuffer): Promise<TarExtractionResult>
🤝 Contributing
We welcome contributions! Please see our Contributing Guide for details.
Development Workflow
- Fork the repository
- Create a feature branch
- Add tests for new functionality
- Ensure all tests pass
- Submit a pull request
📄 License
MIT License - see LICENSE file for details.
🙏 Acknowledgments
- xz-decompress: For XZ decompression capabilities
- pako: For gzip decompression
- tar-stream: For tar archive handling
🔌 Extensibility & Plugin Architecture
The data-loader is designed to be highly extensible and overridable, allowing you to add custom handlers, modify behavior, and integrate with your specific use cases.
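As a mental model for how registering, disabling, and prioritising handlers can fit together, here is a minimal registry sketch. It is not the package's `ProcessorFactory`: `SketchRegistry`, `SketchHandler`, and the numeric priority scheme are all hypothetical names and choices made for illustration.

```typescript
// Hypothetical handler shape; the package's own handler interfaces differ.
interface SketchHandler {
  name: string;
  priority: number; // higher values are tried first
  enabled: boolean;
  canHandle(data: Uint8Array): boolean;
}

class SketchRegistry {
  private handlers = new Map<string, SketchHandler>();

  // Registering an existing name replaces (overrides) the earlier handler.
  register(handler: SketchHandler): void {
    this.handlers.set(handler.name, handler);
  }

  unregister(name: string): boolean {
    return this.handlers.delete(name);
  }

  setEnabled(name: string, enabled: boolean): void {
    const handler = this.handlers.get(name);
    if (handler) handler.enabled = enabled;
  }

  // Pick the enabled handler with the highest priority that accepts the data.
  resolve(data: Uint8Array): SketchHandler | undefined {
    return Array.from(this.handlers.values())
      .filter((h) => h.enabled)
      .sort((a, b) => b.priority - a.priority)
      .find((h) => h.canHandle(data));
  }
}

const sketchRegistry = new SketchRegistry();
sketchRegistry.register({ name: 'generic', priority: 0, enabled: true, canHandle: () => true });
sketchRegistry.register({
  name: 'zip',
  priority: 10,
  enabled: true,
  canHandle: (d) => d[0] === 0x50 && d[1] === 0x4b // "PK" signature
});

const zipBytes = new Uint8Array([0x50, 0x4b, 0x03, 0x04]);
console.log(sketchRegistry.resolve(zipBytes)?.name);
sketchRegistry.setEnabled('zip', false);
console.log(sketchRegistry.resolve(zipBytes)?.name);
```

Keying handlers by name is what makes overriding simple in this model: registering a second handler under the same name replaces the first.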
🚀 Quick Start with Extensibility
import { createDefaultProcessorFactory, ProcessorFactory } from '@fustilio/data-loader';
// Use default handlers
const factory = createDefaultProcessorFactory();
const processor = factory.createProcessor();
// Or create custom configuration
const customFactory = new ProcessorFactory({
  processing: {
    maxFileSize: 50 * 1024 * 1024,
    enableLogging: true
  }
});
🛠️ Custom Handlers
Create your own handlers for new formats:
import { CompressionHandler, ContentTypeHandler, ArchiveHandler, ProcessorFactory } from '@fustilio/data-loader';
// Custom compression handler
class ZipCompressionHandler implements CompressionHandler {
  getFormatName() { return "zip"; }
  getMimeType() { return "application/zip"; }
  getFileExtensions() { return [".zip"]; }
  isCompressed(data: Uint8Array) { /* detection logic */ }
  async decompress(data: Uint8Array) { /* decompression logic */ }
}
// Custom content type handler
class JsonContentTypeHandler implements ContentTypeHandler {
  getFormatName() { return "json"; }
  getMimeType() { return "application/json"; }
  getFileExtensions() { return [".json"]; }
  detectContentType(content: string, projectId: string) { /* detection logic */ }
}
// Register custom handlers
const factory = new ProcessorFactory();
factory.registerCompressionHandler(new ZipCompressionHandler());
factory.registerContentTypeHandler(new JsonContentTypeHandler());
⚙️ Configuration & Overrides
import { createDefaultProcessorFactory, DataLoaderConfig, HandlerPriority } from '@fustilio/data-loader';
const config: DataLoaderConfig = {
  // Enable/disable specific handlers
  enabledHandlers: {
    compression: ["gzip", "zip"],
    contentType: ["content-type", "json"]
  },
  // Set handler priorities
  handlerPriorities: {
    "zip": HandlerPriority.HIGH,
    "json": HandlerPriority.CRITICAL
  },
  // Processing options
  processing: {
    maxFileSize: 100 * 1024 * 1024,
    processingTimeout: 60000,
    enableLogging: true
  }
};
const factory = createDefaultProcessorFactory(config);
🔄 Handler Management
// Enable/disable handlers dynamically
factory.setHandlerEnabled("gzip", false);
factory.setHandlerEnabled("zip", true);
// Unregister handlers
factory.unregisterHandler("xz");
// Get statistics
const stats = factory.getStats();
console.log("Available handlers:", stats);
// Get available handlers
const processor = factory.createProcessor();
const available = processor.getAvailableHandlers();
📚 Advanced Examples
See the extensibility documentation for comprehensive examples including:
- Custom handler implementations
- Handler override patterns
- Dynamic handler loading
- Configuration management
- Integration examples
🔗 Chainable Processing
The data-loader supports chainable processing patterns for complex data workflows:
import { createChainManager, ChainAPI } from '@fustilio/data-loader';
// Quick chain creation
const decompressionChain = ChainAPI.decompress(["gzip", "xz"]);
const validationChain = ChainAPI.validate(["xml-validator", "schema-validator"]);
// Custom chain building
const manager = createChainManager();
const customChain = manager.createCustomChain()
  .addStep({
    id: "detect-format",
    operation: "detect",
    handler: "content-type",
    next: ["decompress", "validate"]
  })
  .addStep({
    id: "decompress",
    operation: "decompress",
    handler: "auto",
    conditions: [{
      type: "if",
      expression: "context.metadata.contentType === 'compressed'"
    }],
    retry: { attempts: 3, delay: 1000 }
  })
  .addStep({
    id: "validate",
    operation: "validate",
    handler: "validator",
    next: []
  })
  .setEntryPoint("detect-format")
  .setExitPoints(["validate"])
  .build();
// Execute chain
const result = await manager.executePattern(customChain.name, data);
🎯 Use Cases
The extensibility and chainability systems enable:
- Domain-specific formats: Add support for specialized data formats
- Custom validation: Implement project-specific validation logic
- Performance optimization: Override handlers with optimized implementations
- Integration: Seamlessly integrate with existing data processing pipelines
- Testing: Mock handlers for comprehensive testing scenarios
- Complex workflows: Build sophisticated processing chains with conditions and retries
- Pattern reuse: Create reusable processing patterns and templates
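To show how conditions and retries might behave at run time, here is a small self-contained sketch of a step runner. It is a simplification under stated assumptions, not the package's chain engine: the `Sketch*` types, `runWithRetry`, and `runChain` are hypothetical, steps run in a fixed linear order rather than following `next` links, and conditions are plain predicate functions instead of expression strings.

```typescript
interface SketchRetry { attempts: number; delay: number } // delay in milliseconds
interface SketchContext { data: string; log: string[] }
interface SketchStep {
  id: string;
  when?: (ctx: SketchContext) => boolean; // step is skipped when this returns false
  run: (ctx: SketchContext) => Promise<void>;
  retry?: SketchRetry;
}

// Run `step` up to `retry.attempts` times, waiting `retry.delay` ms between tries.
async function runWithRetry<T>(step: () => Promise<T>, retry: SketchRetry): Promise<T> {
  let lastError: unknown = new Error('retry.attempts must be at least 1');
  for (let attempt = 1; attempt <= retry.attempts; attempt++) {
    try {
      return await step();
    } catch (error) {
      lastError = error;
      if (attempt < retry.attempts) {
        await new Promise<void>((resolve) => setTimeout(resolve, retry.delay));
      }
    }
  }
  throw lastError;
}

// Execute steps in order, honouring each step's condition and retry policy.
async function runChain(steps: SketchStep[], ctx: SketchContext): Promise<SketchContext> {
  for (const step of steps) {
    if (step.when && !step.when(ctx)) {
      ctx.log.push(`${step.id}: skipped`);
      continue;
    }
    await runWithRetry(() => step.run(ctx), step.retry ?? { attempts: 1, delay: 0 });
    ctx.log.push(`${step.id}: ok`);
  }
  return ctx;
}

// Demo: the first step fails once before succeeding; the second is skipped.
let demoCalls = 0;
runChain(
  [
    {
      id: 'normalize',
      retry: { attempts: 3, delay: 10 },
      run: async (ctx) => {
        demoCalls++;
        if (demoCalls < 2) throw new Error('transient failure');
        ctx.data = ctx.data.trim();
      }
    },
    { id: 'decompress', when: (ctx) => ctx.data.startsWith('gz:'), run: async () => {} }
  ],
  { data: '  <root/>  ', log: [] }
).then((ctx) => console.log(ctx.log));
```

Keeping the retry policy on the step, rather than inside the handler, lets the same handler be reused with different policies in different chains.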
Made with ❤️ by fustilio
