sampleshard
v0.1.0
SampleShard - Training sample storage format
SampleShard TypeScript/JavaScript
TypeScript/JavaScript implementation of the SampleShard format for storing training samples.
Installation
npm install @sampleshard/core
Quick Start
import { SampleShardWriter, SampleShardReader } from '@sampleshard/core';
// Writing samples
const writer = new SampleShardWriter('train.smpl');
await writer.open();
await writer.addSample(1, { input: [1, 2, 3], label: 0 });
await writer.addSample(2, { input: [4, 5, 6], label: 1 });
await writer.addSample(3, { input: [7, 8, 9], label: 2 });
await writer.close();
// Reading samples
const reader = new SampleShardReader('train.smpl');
await reader.open();
// Get sample count
console.log(`Total samples: ${reader.sampleCount()}`);
// Random access by ID
const sample = await reader.getSample(1);
console.log(sample); // { input: [1, 2, 3], label: 0 }
// Check if sample exists
if (reader.hasSample(2)) {
  console.log('Sample 2 exists!');
}
// Iterate all samples
for await (const [sampleId, sample] of reader) {
  console.log(`Sample ${sampleId}:`, sample);
}
// Batch access
const batch = await reader.getBatch([1, 2, 3]);
const rangeBatch = await reader.getBatchByRange(0, 10);
await reader.close();
Features
- Fast random access by sample ID (O(1) lookup)
- Deterministic iteration order
- Metadata-safe: reserved entries (names starting with __) are excluded from sample counts
- Async/await API with async iterators
- CRC32 checksums for data integrity
- BigInt support for sample IDs
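The properties above can be illustrated with a small in-memory model. This is only a sketch, not the @sampleshard/core implementation: the SampleIndex class and its method names are hypothetical, and it assumes reserved entries are keyed by string names while samples are keyed by numeric IDs.

```typescript
// Hypothetical in-memory model of the listed features; NOT the library's code.
type SampleId = number | bigint;

class SampleIndex {
  // A Map gives average O(1) lookup and iterates in insertion order,
  // which makes iteration deterministic.
  private readonly entries = new Map<bigint | string, unknown>();

  // Normalize IDs so that 1 and BigInt(1) refer to the same sample.
  private static toKey(id: SampleId): bigint {
    if (typeof id === 'number' && !Number.isSafeInteger(id)) {
      throw new RangeError(`Sample ID ${id} is not a safe integer; pass a bigint instead`);
    }
    return BigInt(id);
  }

  add(id: SampleId, sample: unknown): void {
    this.entries.set(SampleIndex.toKey(id), sample);
  }

  // Assumption: reserved metadata entries are stored under "__"-prefixed names.
  addReserved(name: string, value: unknown): void {
    if (!name.startsWith('__')) {
      throw new Error('Reserved entry names must start with "__"');
    }
    this.entries.set(name, value);
  }

  has(id: SampleId): boolean {
    return this.entries.has(SampleIndex.toKey(id));
  }

  get(id: SampleId): unknown {
    return this.entries.get(SampleIndex.toKey(id));
  }

  // Sample IDs in insertion order, skipping reserved entries.
  sampleIds(): bigint[] {
    return Array.from(this.entries.keys()).filter(
      (key): key is bigint => typeof key === 'bigint'
    );
  }

  // Reserved entries do not contribute to the sample count.
  sampleCount(): number {
    return this.sampleIds().length;
  }
}
```

In this model, adding a reserved entry such as __meta followed by two samples yields a sample count of 2, and IDs larger than Number.MAX_SAFE_INTEGER have to be passed as bigint values.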
File Format
SampleShard uses the .smpl extension and the Shard v2 binary format:
- 64-byte header with magic bytes SHRD
- Role byte = 0x02 (Sample)
- 48-byte index entries with name hashes
- JSON-encoded sample data
- CRC32 checksums per entry
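To make the per-entry integrity check concrete, here is a minimal sketch of JSON-encoding a sample, computing a CRC32 over the payload, and verifying it on read. It rests on assumptions this README does not confirm: that the checksum is the common IEEE CRC32 (as used by zlib), that it covers the UTF-8 JSON bytes, and that the magic bytes sit at offset 0 of the header. The helper names (crc32, hasShardMagic, encodeSample, verifySample) are hypothetical, and the sketch does not reproduce the actual header or index layout.

```typescript
// Lookup table for the reflected IEEE polynomial 0xEDB88320 (assumed variant).
const CRC_TABLE: Uint32Array = (() => {
  const table = new Uint32Array(256);
  for (let n = 0; n < 256; n++) {
    let c = n;
    for (let k = 0; k < 8; k++) {
      c = c & 1 ? 0xedb88320 ^ (c >>> 1) : c >>> 1;
    }
    table[n] = c >>> 0;
  }
  return table;
})();

// Table-driven CRC32 over a byte array, returned as an unsigned 32-bit number.
function crc32(bytes: Uint8Array): number {
  let crc = 0xffffffff;
  for (let i = 0; i < bytes.length; i++) {
    crc = CRC_TABLE[(crc ^ bytes[i]) & 0xff] ^ (crc >>> 8);
  }
  return (crc ^ 0xffffffff) >>> 0;
}

// Assumption: the 4 magic bytes "SHRD" start the 64-byte header.
function hasShardMagic(file: Uint8Array): boolean {
  const magic = [0x53, 0x48, 0x52, 0x44]; // 'S', 'H', 'R', 'D'
  return file.length >= 64 && magic.every((byte, i) => file[i] === byte);
}

// JSON-encode a sample and compute the checksum of the resulting bytes.
function encodeSample(sample: unknown): { payload: Uint8Array; checksum: number } {
  const payload = new TextEncoder().encode(JSON.stringify(sample));
  return { payload, checksum: crc32(payload) };
}

// Recompute the checksum before decoding; throw if the payload was altered.
function verifySample(payload: Uint8Array, checksum: number): unknown {
  if (crc32(payload) !== checksum) {
    throw new Error('CRC32 mismatch: entry data is corrupted');
  }
  return JSON.parse(new TextDecoder().decode(payload));
}
```

A reader following this approach would call verifySample for each entry it loads, so a flipped bit in the payload surfaces as an error rather than as a silently wrong sample.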
Interoperability
SampleShard files created with TypeScript can be read by:
- Go: agentscope/cowrie/ucodec.OpenSampleShard()
- Python: sampleshard.SampleShardReader()
API Reference
SampleShardWriter
class SampleShardWriter {
  constructor(path: string, options?: { alignment?: number; compression?: number });
  async open(): Promise<void>;
  async addSample(sampleId: number | bigint, sample: unknown): Promise<void>;
  async addSampleRaw(sampleId: number | bigint, data: Buffer): Promise<void>;
  async close(): Promise<void>;
}
SampleShardReader
class SampleShardReader {
  constructor(path: string);
  async open(): Promise<void>;
  sampleCount(): number;
  getSampleIds(): bigint[];
  sampleIdByIndex(index: number): bigint;
  hasSample(sampleId: number | bigint): boolean;
  async getSample(sampleId: number | bigint): Promise<unknown>;
  async getSampleByIndex(index: number): Promise<unknown>;
  async getBatch(sampleIds: (number | bigint)[]): Promise<unknown[]>;
  async getBatchByRange(start: number, end: number): Promise<unknown[]>;
  async close(): Promise<void>;
  [Symbol.asyncIterator](): AsyncIterableIterator<[bigint, unknown]>;
}
License
MIT
