# @dataset.sh/file
A TypeScript library for reading and writing the DatasetFile ZIP-based archive format. It provides an efficient way to store and access structured datasets, with support for named collections, type annotations, and binary files.
## Features
- 📦 ZIP-based format - Efficient compression and packaging of datasets
- 📊 Multiple collections - Organize data into named collections (train, test, validation, etc.)
- 🏷️ Type annotations - Include type information for each collection
- 🔤 Typelang support - Define schemas using TypeScript-like syntax for cross-platform compatibility
- 📎 Binary files - Attach model weights, images, or other binary assets
## Installation

```shell
pnpm add @dataset.sh/file
```

### Optional: Typelang Compiler

For enhanced type validation and cross-platform type generation, you can also install the Typelang compiler:

```shell
pnpm add @dataset.sh/typelang
```

## Quick Start
### Writing a Dataset

```typescript
import {DatasetFile, DatasetFileWriter} from '@dataset.sh/file';

// Create a new dataset file
const writer = DatasetFile.open('my-dataset.dataset', 'w') as DatasetFileWriter;

// Add metadata
writer.updateMeta({
  author: 'Your Name',
  authorEmail: '[email protected]',
  description: 'My awesome dataset',
  tags: ['nlp', 'classification'],
  dataset_metadata: {
    version: '1.0.0',
    created: new Date().toISOString()
  }
});

// Add a collection with data and a Typelang schema
const trainData = [
  {id: 1, text: 'Hello world', label: 'greeting'},
  {id: 2, text: 'How are you?', label: 'question'}
];

// Define the schema using Typelang syntax
const typeSchema = `// use TrainItem
type TrainItem = {
  id: int
  text: string
  label: string
}`;

await writer.addCollection('train', trainData, typeSchema);

// Add binary files (optional)
const modelWeights = Buffer.from('...');
writer.addBinaryFile('model.bin', modelWeights);

await writer.close();
```

### Reading a Dataset
```typescript
import {DatasetFile, DatasetFileReader} from '@dataset.sh/file';

// Open an existing dataset
const reader = DatasetFile.open('my-dataset.dataset', 'r') as DatasetFileReader;

// Access metadata
console.log('Author:', reader.meta.author);
console.log('Collections:', reader.collections());

// Read a collection
const trainCollection = reader.collection('train');

// Get the type annotation (raw Typelang schema)
const typeAnnotation = await trainCollection.typeAnnotation();
console.log('Type annotation:', typeAnnotation);

// Generate code from the type annotation
const codeUsage = await trainCollection.generateCode();
if (codeUsage) {
  console.log('Type name:', codeUsage.useClass);
  console.log('Compilation result:', codeUsage.result);
}

// Access data
console.log('First 5 items:', trainCollection.top(5));
console.log('Random sample:', trainCollection.randomSample(3));

// Iterate through items
for (const item of trainCollection) {
  console.log(item);
}

// Convert to an array
const allData = trainCollection.toList();

// Access binary files
const modelData = reader.openBinaryFile('model.bin');

reader.close();
```

## API Reference
### DatasetFile

The main entry point for opening dataset files.

#### `DatasetFile.open(filePath: string, mode: 'r' | 'w')`

Opens a dataset file for reading or writing.

- `filePath`: Path to the dataset file
- `mode`: `'r'` for reading, `'w'` for writing
- Returns: `DatasetFileReader` or `DatasetFileWriter`
### DatasetFileWriter

Used for creating new dataset files.

#### Methods

- `updateMeta(meta: Partial<DatasetFileMeta>)`: Update dataset metadata
- `async addCollection(name: string, data: any[], type_annotation?: string)`: Add a data collection with an optional Typelang schema
- `addBinaryFile(fileName: string, data: Buffer)`: Add a binary file
- `async close()`: Close and save the dataset file
### DatasetFileReader

Used for reading existing dataset files.

#### Properties

- `meta`: Dataset metadata

#### Methods

- `collections()`: Get the list of collection names
- `collection(name: string)`: Get a collection reader
- `coll(name: string)`: Shorthand for `collection()`
- `binaryFiles()`: List binary file names
- `openBinaryFile(fileName: string)`: Read a binary file
- `close()`: Close the dataset file
### CollectionReader

Reader for an individual collection within a dataset.

#### Properties

- `length`: Number of items in the collection

#### Methods

- `async typeAnnotation()`: Get the raw Typelang schema string
- `async generateCode()`: Generate code usage information from the type annotation (returns `CodeUsage` with source, useClass, and compile result)
- `top(n: number)`: Get the first n items
- `randomSample(n: number)`: Get a random sample of n items
- `toList()`: Convert to an array
- `[Symbol.iterator]()`: Iterate through items
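As an illustration of what `randomSample(n)` provides, here is one way uniform sampling without replacement can be implemented — a partial Fisher-Yates shuffle. This is a standalone sketch, not necessarily this library's actual implementation:

```typescript
// Illustrative only: draw a uniform random sample of n items from an array
// via a partial Fisher-Yates shuffle. The library's strategy may differ.
function randomSample<T>(items: T[], n: number): T[] {
  const pool = items.slice();                 // avoid mutating the input
  const k = Math.min(n, pool.length);
  for (let i = 0; i < k; i++) {
    // Pick a random index from the not-yet-chosen suffix and swap it in
    const j = i + Math.floor(Math.random() * (pool.length - i));
    [pool[i], pool[j]] = [pool[j], pool[i]];
  }
  return pool.slice(0, k);                    // first k slots hold the sample
}

const sample = randomSample([1, 2, 3, 4, 5], 3);
```

Each of the first `k` slots receives an element chosen uniformly from the remaining pool, so every item has an equal chance of being selected.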
## File Format

DatasetFile uses a ZIP archive with the following structure:

```
dataset.dataset/
├── meta.json           # Dataset metadata
├── coll/               # Collections folder
│   ├── train/
│   │   ├── data.jsonl  # Data in JSON Lines format
│   │   └── type.tl     # Typelang schema (optional)
│   └── test/
│       ├── data.jsonl
│       └── type.tl
└── bin/                # Binary files folder
    └── model.bin
```

## Typelang Support
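Collection data is stored as JSON Lines (`data.jsonl`): one JSON object per line. A minimal, library-independent sketch of how records round-trip through that encoding:

```typescript
// Encode records as JSON Lines, the format used by coll/<name>/data.jsonl:
// one JSON object per line, newline-separated.
function toJsonl(records: object[]): string {
  return records.map((r) => JSON.stringify(r)).join('\n');
}

// Decode JSON Lines text back into records, skipping blank lines.
function fromJsonl<T>(text: string): T[] {
  return text
    .split('\n')
    .filter((line) => line.trim().length > 0)
    .map((line) => JSON.parse(line) as T);
}

const records = [
  {id: 1, text: 'Hello world', label: 'greeting'},
  {id: 2, text: 'How are you?', label: 'question'}
];
const encoded = toJsonl(records);
const decoded = fromJsonl<{id: number; text: string; label: string}>(encoded);
```

Because each record sits on its own line, readers can stream a collection item by item without parsing the whole file at once.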
This library supports Typelang, a TypeScript-flavored schema definition language for cross-platform type generation.
### Using Typelang Schemas
```typescript
import {DatasetFile, DatasetFileWriter} from '@dataset.sh/file';

const writer = DatasetFile.open('typed-dataset.dataset', 'w') as DatasetFileWriter;

// Define complex types with Typelang
const userSchema = `// use User
type Address = {
  street: string
  city: string
  country: string
  postalCode?: string
}

type User = {
  id: string
  name: string
  email: string
  age: int
  address: Address
  tags: string[]
  status: "active" | "inactive" | "pending"
}`;

const userData = [{
  id: 'u1',
  name: 'Alice',
  email: '[email protected]',
  age: 30,
  address: {
    street: '123 Main St',
    city: 'San Francisco',
    country: 'USA'
  },
  tags: ['developer', 'team-lead'],
  status: 'active'
}];

await writer.addCollection('users', userData, userSchema);
await writer.close();
```

### Generic Types
```typescript
const responseSchema = `// use ApiResponse
type Response<T> = {
  success: bool
  data?: T
  error?: string
  timestamp: string
}

type UserData = {
  userId: string
  username: string
}

type ApiResponse = Response<UserData>`;

const responseData = [{
  success: true,
  data: {userId: 'u1', username: 'alice'},
  timestamp: new Date().toISOString()
}];

await writer.addCollection('responses', responseData, responseSchema);
```

## Examples
### Working with NLP Datasets

```typescript
const writer = DatasetFile.open('nlp-dataset.dataset', 'w') as DatasetFileWriter;

writer.updateMeta({
  description: 'Sentiment analysis dataset',
  tags: ['nlp', 'sentiment', 'classification']
});

const data = [
  {text: 'This movie is great!', sentiment: 'positive'},
  {text: 'Terrible experience.', sentiment: 'negative'}
];

const sentimentSchema = `// use SentimentItem
type SentimentItem = {
  text: string
  sentiment: "positive" | "negative" | "neutral"
}`;

await writer.addCollection('train', data, sentimentSchema);
await writer.close();
```

### Reading Python-created Datasets
This library is fully compatible with datasets created using the Python dataset-sh library, including those with Typelang type annotations:
```typescript
const reader = DatasetFile.open('python-dataset.dataset', 'r') as DatasetFileReader;

// Read collections created in Python
const collection = reader.collection('data');

// Check for a type annotation and generate code
const typeAnnotation = await collection.typeAnnotation();
if (typeAnnotation) {
  console.log('Type annotation:', typeAnnotation);

  const codeUsage = await collection.generateCode();
  if (codeUsage) {
    console.log('Type name:', codeUsage.useClass);
    console.log('Validation errors:', codeUsage.result.errors);
  }
}

// Iterate through data
for (const item of collection) {
  console.log(item);
}

reader.close();
```

## Development
### Building

```shell
pnpm build
```

### Testing

```shell
pnpm test
pnpm test:watch
pnpm test:coverage
```

### Running Examples

```shell
pnpm example
pnpm verify-python
```

## Requirements
- Node.js >= 16.0.0
- TypeScript >= 5.0.0
## License
MIT
## Contributing
Contributions are welcome! Please feel free to submit a Pull Request.
## Support
For issues and feature requests, please use the GitHub issue tracker.
