unimodaly-ingest
v1.0.0
Unimodaly Ingest
A unified data-ingestion CLI that auto-detects and converts text, image, audio and tabular sources into standardized training datasets with schema validation, sampling, and augmentation capabilities.
Features
- Multi-modal Data Detection: Automatically detects and processes text, image, audio, and tabular data formats
- Schema Validation: Validates output datasets against custom or default schemas
- Data Augmentation: Built-in augmentation techniques for each data type
- Flexible Sampling: Control dataset size with sampling ratios
- Multiple Output Formats: Export to JSON, JSONL, or CSV formats
- Batch Processing: Efficient processing of large datasets
- Configuration Management: Customizable processing pipelines
- Comprehensive Metadata: Rich metadata and feature extraction for each data type
Installation
npm install -g unimodaly-ingest
Quick Start
# Process all data in a directory
unimodaly-ingest ingest ./data --output ./processed
# Process specific data types with augmentation
unimodaly-ingest ingest ./images --type image --augment --output ./processed
# Sample 50% of data and export to CSV
unimodaly-ingest ingest ./data --sample 0.5 --format csv
# Initialize configuration
unimodaly-ingest config --init
Supported Data Types
Text Files
.txt, .md, .json, .xml, .html
- Encoding detection and validation
- Language detection
- Text augmentation (synonym replacement, random operations)
Image Files
.jpg, .jpeg, .png, .gif, .webp, .svg, .bmp, .tiff
- Metadata extraction (dimensions, color space, etc.)
- Feature extraction (intensity statistics, aspect ratio)
- Image augmentation (rotation, brightness, contrast, flipping)
Audio Files
.mp3, .wav, .flac, .ogg, .m4a, .aac
- Audio metadata extraction
- Duration, sample rate, channel analysis
- Audio augmentation capabilities
Tabular Data
.csv, .tsv, .xlsx, .json
- Schema inference
- Statistical analysis
- Data type detection
- Duplicate and null value analysis
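The extension lists above suggest a simple lookup from file suffix to data type. The following Python sketch shows one way such a lookup could work; the function name `candidate_types` and the lookup logic are assumptions for illustration, not the CLI's documented detection method. Only the extension lists come from this README.

```python
from pathlib import Path

# Extension lists copied from this README; the lookup itself is a hypothetical sketch.
EXTENSIONS = {
    "text": {".txt", ".md", ".json", ".xml", ".html"},
    "image": {".jpg", ".jpeg", ".png", ".gif", ".webp", ".svg", ".bmp", ".tiff"},
    "audio": {".mp3", ".wav", ".flac", ".ogg", ".m4a", ".aac"},
    "tabular": {".csv", ".tsv", ".xlsx", ".json"},
}


def candidate_types(path: str) -> list[str]:
    """Return every data type whose extension list contains the file's suffix."""
    suffix = Path(path).suffix.lower()
    return [kind for kind, exts in EXTENSIONS.items() if suffix in exts]


print(candidate_types("photos/cat.PNG"))     # ['image']
print(candidate_types("data/records.json"))  # ['text', 'tabular']
print(candidate_types("notes.docx"))         # []
```

Note that .json appears under both Text Files and Tabular Data, so a suffix-only lookup returns two candidates for it. How the CLI itself resolves that overlap is not stated here.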
Commands
ingest
Main command for processing data sources.
unimodaly-ingest ingest <input> [options]
Options:
- -o, --output <path> - Output directory (default: ./output)
- -f, --format <format> - Output format: json, jsonl, csv (default: json)
- -s, --sample <ratio> - Sampling ratio 0-1 (default: 1.0)
- -a, --augment - Enable data augmentation
- --schema <path> - Custom schema validation file
- --config <path> - Configuration file path
- -v, --verbose - Verbose output
- -t, --type <types...> - Specific data types: text, image, audio, tabular
- --batch-size <size> - Batch processing size (default: 100)
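To make the effect of a sampling ratio and a batch size concrete, here is a minimal Python sketch. It is not the CLI's implementation: the function `sample_and_batch`, the fixed seed, and the choice of uniform random sampling are all assumptions. Only the defaults (ratio 1.0, batch size 100) mirror the options listed for this command.

```python
import random


def sample_and_batch(paths, ratio=1.0, batch_size=100, seed=0):
    """Keep a `ratio` share of the paths, then yield them in fixed-size batches."""
    if not 0.0 <= ratio <= 1.0:
        raise ValueError("ratio must be between 0 and 1")
    rng = random.Random(seed)          # fixed seed so the sketch is repeatable
    keep = round(len(paths) * ratio)   # number of files to retain
    chosen = rng.sample(paths, keep)   # uniform sampling without replacement (assumed)
    for start in range(0, len(chosen), batch_size):
        yield chosen[start:start + batch_size]


files = [f"file_{i}.txt" for i in range(250)]
batches = list(sample_and_batch(files, ratio=0.5, batch_size=50))
print(len(batches), [len(b) for b in batches])  # 3 [50, 50, 25]
```

With 250 input files, a ratio of 0.5 keeps 125 of them, which a batch size of 50 splits into batches of 50, 50, and 25.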
config
Manage configuration settings.
unimodaly-ingest config [options]
Options:
- --init - Initialize default configuration
- --show - Show current configuration
- --set <key=value> - Set configuration value
validate
Validate a dataset against a schema.
unimodaly-ingest validate <dataset> [options]
Options:
- --schema <path> - Schema file path
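What such a check involves can be sketched in a few lines of Python. The helper `check_record` below is hypothetical and handles only the `required` and `enum` keywords, using an item schema abbreviated from the Schema Validation section of this README. It is not a full schema validator and makes no claim about how the validate command works internally.

```python
def check_record(record, item_schema):
    """Return a list of problems for one record; an empty list means it passed."""
    problems = [
        f"missing field: {name}"
        for name in item_schema.get("required", [])
        if name not in record
    ]
    for name, rules in item_schema.get("properties", {}).items():
        allowed = rules.get("enum")
        if allowed is not None and name in record and record[name] not in allowed:
            problems.append(f"{name}={record[name]!r} not in {allowed}")
    return problems


# Abbreviated from the schema example in this README.
item_schema = {
    "required": ["type", "source", "content"],
    "properties": {"type": {"enum": ["text", "image", "audio", "tabular"]}},
}

good = {"type": "text", "source": "/path/to/file.txt", "content": "processed content..."}
bad = {"type": "video", "source": "/path/to/clip.mp4"}  # made-up failing record

print(check_record(good, item_schema))  # []
print(check_record(bad, item_schema))
```

The second record fails twice: it has no `content` field, and `video` is not one of the allowed `type` values. For complete checks, a dedicated validator library such as the Python `jsonschema` package would be the more robust choice.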
Configuration
Initialize a configuration file to customize processing behavior:
unimodaly-ingest config --init
This creates unimodaly.config.json with settings for:
- Data type specific processing options
- Augmentation parameters
- Output formats and compression
- Performance settings
- Schema validation rules
Example configuration:
{
  "text": {
    "encoding": "utf8",
    "maxSize": "10MB",
    "augmentation": {
      "enabled": false,
      "synonymReplacement": 0.1,
      "randomInsertion": 0.1
    }
  },
  "image": {
    "maxSize": "50MB",
    "augmentation": {
      "enabled": false,
      "rotation": 15,
      "brightness": 0.2,
      "flip": true
    }
  }
}
Output Format
The CLI generates standardized datasets with rich metadata:
[
  {
    "type": "text",
    "source": "/path/to/file.txt",
    "timestamp": "2025-01-27T10:30:00.000Z",
    "content": "processed content...",
    "metadata": {
      "originalLength": 1500,
      "fileSize": 1024,
      "lines": 25,
      "words": 200
    },
    "features": {
      "wordCount": 200,
      "sentenceCount": 12,
      "language": "en"
    }
  }
]
Schema Validation
Define custom schemas for validation:
{
  "type": "array",
  "items": {
    "type": "object",
    "required": ["type", "source", "content"],
    "properties": {
      "type": {
        "type": "string",
        "enum": ["text", "image", "audio", "tabular"]
      },
      "source": {
        "type": "string"
      },
      "content": {
        "type": ["string", "object"]
      }
    }
  }
}
Examples
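Summarize a Dataset in Python
Because the JSON output is an array of records, downstream code can read it with any JSON library. The sketch below is an illustration only: the dataset is an inline string rather than a file (the output file name is not specified in this README), and the record is shortened from the Output Format example. It counts records per type and re-serializes them as JSON Lines.

```python
import json
from collections import Counter

# Inline stand-in for a dataset file; record shortened from the Output Format example.
dataset_json = """
[
  {"type": "text", "source": "/path/to/file.txt", "content": "processed content...",
   "features": {"wordCount": 200, "sentenceCount": 12, "language": "en"}}
]
"""

records = json.loads(dataset_json)

# Count how many records belong to each data type.
print(Counter(r["type"] for r in records))  # Counter({'text': 1})

# Re-serialize as JSON Lines: one compact JSON object per line.
jsonl = "\n".join(json.dumps(r) for r in records)
print(len(jsonl.splitlines()))  # 1
```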
Process Mixed Media Directory
unimodaly-ingest ingest ./media_folder \
  --output ./datasets \
  --format json \
  --augment \
  --sample 0.8 \
  --verbose
Text-Only Processing with Custom Schema
unimodaly-ingest ingest ./documents \
  --type text \
  --schema ./text_schema.json \
  --output ./text_dataset \
  --format jsonl
Image Dataset with Augmentation
unimodaly-ingest ingest ./images \
  --type image \
  --augment \
  --batch-size 50 \
  --output ./image_dataset
License
MIT
