pretrain
v0.1.0
Orchestrate reproducible model pretraining pipelines with dataset, checkpoint, and resume utilities.
Features
- Config-driven pipeline definition for datasets, model initialization, and training steps
- Built-in checkpointing and resumable runs to support long pretraining jobs
- Utilities for deterministic data shuffling, split management, and synthetic data generation
- Integrations with common experiment trackers and optional cloud storage for artifacts
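To make the deterministic shuffling feature concrete, here is a minimal sketch of the underlying idea: a Fisher-Yates shuffle driven by a seeded pseudo-random generator, so the same seed always yields the same order. The names mulberry32 and shuffleWithSeed are illustrative assumptions and are not part of the pretrain API; the package's own utilities may work differently.

```javascript
// Hypothetical helper (not part of pretrain): mulberry32, a small seeded PRNG.
// Returns a function that yields floats in [0, 1).
function mulberry32(seed) {
  let state = seed >>> 0;
  return function next() {
    state = (state + 0x6d2b79f5) >>> 0;
    let t = state;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Hypothetical helper (not part of pretrain): Fisher-Yates shuffle that
// draws from the seeded generator and returns a new array.
function shuffleWithSeed(items, seed) {
  const out = items.slice(); // leave the input untouched
  const rand = mulberry32(seed);
  for (let i = out.length - 1; i > 0; i--) {
    const j = Math.floor(rand() * (i + 1));
    [out[i], out[j]] = [out[j], out[i]];
  }
  return out;
}

const lines = ['a', 'b', 'c', 'd', 'e'];
const first = shuffleWithSeed(lines, 42);
const second = shuffleWithSeed(lines, 42);

// The same seed reproduces the same order across calls and across runs.
console.log(JSON.stringify(first) === JSON.stringify(second)); // prints true
```

Storing the seed alongside each checkpoint is one way a resumed run could replay the same data order; whether pretrain does this internally is not stated here.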
Install
npm install pretrain
Quick Start
Create a simple pretraining pipeline and run it programmatically:
const { Pipeline } = require('pretrain');
const pipeline = new Pipeline({
  name: 'bert-pretrain',
  dataset: {
    loader: './loaders/textLines',
    path: './data/corpus.txt',
    batchSize: 64,
    deterministic: true
  },
  model: {
    type: 'transformer',
    config: './models/bert-small.json'
  },
  trainer: {
    epochs: 10,
    optimizer: 'adamw',
    checkpointDir: './checkpoints'
  }
});
// Run the pipeline (automatically checkpoints)
pipeline.run()
  .then(() => console.log('Pretraining complete'))
  .catch(err => console.error(err));
// Resume from latest checkpoint
// pipeline.resume();
You can also run pipelines from the CLI with a YAML config:
pretrain run --config pretrain.config.yml
License
MIT
