create-llm
CLI tool for scaffolding LLM training projects
Create production-ready LLM training projects in seconds. Similar to create-next-app but for training custom language models.
npm Package • Documentation • Report Bug • Request Feature
npx @theanikrtgiri/create-llm my-awesome-llm
cd my-awesome-llm
pip install -r requirements.txt
python training/train.py
Why create-llm?
Training a language model from scratch requires:
- Model architecture (GPT, BERT, T5...)
- Data preprocessing pipeline
- Tokenizer training
- Training loop with callbacks
- Checkpoint management
- Evaluation metrics
- Text generation
- Deployment tools
create-llm provides all of this in one command.
Features
Right-Sized Templates
Choose from 4 templates optimized for different use cases:
- NANO (1M params) - Learn in 2 minutes on any laptop
- TINY (6M params) - Prototype in 15 minutes on CPU
- SMALL (100M params) - Production models in hours
- BASE (1B params) - Research-grade in days
Complete Toolkit
Everything you need out of the box:
- PyTorch training infrastructure
- Data preprocessing pipeline
- Tokenizer training (BPE, WordPiece, Unigram)
- Checkpoint management with auto-save (see the sketch after this list)
- TensorBoard integration
- Live training dashboard
- Interactive chat interface
- Model comparison tools
- Deployment scripts
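Checkpointing follows the standard PyTorch pattern. As a rough illustration of what auto-save boils down to (a sketch only, not the generated trainer's exact checkpoint format):

```python
import torch

# Save model, optimizer, and progress so training can resume later.
# Illustrative only -- the generated trainer defines its own checkpoint layout.
def save_checkpoint(model, optimizer, step, path):
    torch.save({
        "model": model.state_dict(),
        "optimizer": optimizer.state_dict(),
        "step": step,
    }, path)

def load_checkpoint(model, optimizer, path, device="cpu"):
    ckpt = torch.load(path, map_location=device)
    model.load_state_dict(ckpt["model"])
    optimizer.load_state_dict(ckpt["optimizer"])
    return ckpt["step"]
```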
Smart Defaults
Intelligent configuration that:
- Auto-detects vocab size from tokenizer (sketched below)
- Automatically handles sequence length mismatches
- Warns about model/data size mismatches
- Detects overfitting during training
- Suggests optimal hyperparameters
- Handles cross-platform paths
- Provides detailed diagnostic messages for errors
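For example, vocab-size auto-detection means the trained tokenizer, not the config, is the source of truth for the embedding size. A minimal sketch of the idea using the Hugging Face tokenizers library (the real check lives inside the generated training code):

```python
from tokenizers import Tokenizer

# The trained tokenizer decides the real vocab size; the config value is
# overridden if it disagrees.
tokenizer = Tokenizer.from_file("tokenizer/tokenizer.json")
actual_vocab = tokenizer.get_vocab_size()

config_vocab = 10000  # value from llm.config.js
if config_vocab != actual_vocab:
    print(f"Vocab size mismatch: config={config_vocab}, tokenizer={actual_vocab}; using {actual_vocab}")
    config_vocab = actual_vocab
```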
Plugin System
Optional integrations:
- WandB - Experiment tracking (see the sketch below)
- HuggingFace - Model sharing
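The WandB plugin wires experiment tracking into the training loop for you. Doing the equivalent by hand looks roughly like this (requires a free W&B account; the project name and logged values are just examples):

```python
import wandb

# Log metrics to Weights & Biases; the wandb plugin automates this.
wandb.init(project="my-llm", config={"template": "tiny", "learning_rate": 6e-4})
for step in range(3):
    wandb.log({"train/loss": 3.2 - 0.1 * step, "step": step})
wandb.finish()
```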
Quick Start
One-Command Setup
# Using npx (recommended - no installation needed)
npx @theanikrtgiri/create-llm my-llm
# Or install globally
npm install -g @theanikrtgiri/create-llm
create-llm my-llm
Interactive Setup
npx @theanikrtgiri/create-llm
You'll be prompted for:
- Project name
- Template (NANO, TINY, SMALL, BASE)
- Tokenizer type (BPE, WordPiece, Unigram)
- Optional plugins (WandB, HuggingFace)
Quick Mode
# Specify everything upfront
npx create-llm my-llm --template tiny --tokenizer bpe --skip-install
Templates
NANO
For learning and quick experiments
Parameters: ~1M
Hardware: Any CPU (2GB RAM)
Time: 1-2 minutes
Data: 100+ examples
Use: Learning, testing, demos
When to use:
- First time training an LLM
- Quick experiments and testing
- Educational purposes
- Understanding the pipeline
- Limited data (100-1000 examples)
TINY
For prototyping and small projects
Parameters: ~6M
Hardware: CPU or basic GPU (4GB RAM)
Time: 5-15 minutes
Data: 1,000+ examples
Use: Prototypes, small projects
When to use:
- Small-scale projects
- Limited data (1K-10K examples)
- Prototyping before scaling
- Personal experiments
- CPU-only environments
SMALL
For production applications
Parameters: ~100M
Hardware: RTX 3060+ (12GB VRAM)
Time: 1-3 hours
Data: 10,000+ examples
Use: Production, real apps
When to use:
- Production applications
- Domain-specific models
- Real-world deployments
- Good data availability
- GPU available
BASE
For research and high-quality models
Parameters: ~1B
Hardware: A100 or multi-GPU
Time: 1-3 days
Data: 100,000+ examples
Use: Research, high-quality models
When to use:
- Research projects
- High-quality requirements
- Large datasets available
- Multi-GPU setup
- Competitive performance needed
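The parameter counts above follow from the architecture settings in llm.config.js (see Configuration below). A rough back-of-the-envelope estimate for a GPT-style model, ignoring biases, layer norms, and position embeddings:

```python
def approx_gpt_params(vocab_size, layers, dim):
    """Rough GPT-style parameter count: embeddings + ~12*dim^2 per block."""
    embeddings = vocab_size * dim
    blocks = 12 * layers * dim * dim  # attention (~4*dim^2) + MLP (~8*dim^2)
    return embeddings + blocks

# TINY defaults from llm.config.js: vocab 10,000, 4 layers, dim 256
print(approx_gpt_params(10_000, 4, 256))  # ~5.7M, close to the ~6M TINY figure
```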
Complete Workflow
1. Create Your Project
npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe
cd my-llm
2. Install Dependencies
pip install -r requirements.txt
3. Add Your Data
Place your text files in data/raw/:
# Example: Download Shakespeare
curl https://www.gutenberg.org/files/100/100-0.txt > data/raw/shakespeare.txt
# Or add your own files
cp /path/to/your/data.txt data/raw/
Tip: Start with at least 1MB of text for meaningful results.
4. Train Tokenizer
python tokenizer/train.py --data data/raw/
This creates a vocabulary from your data.
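As an optional sanity check, you can load the freshly trained tokenizer/tokenizer.json and encode a sample line (uses the Hugging Face tokenizers package from requirements.txt):

```python
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer/tokenizer.json")
enc = tok.encode("To be, or not to be")
print(enc.tokens)  # subword pieces
print(enc.ids)     # integer ids fed to the model
```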
5. Prepare Dataset
python data/prepare.py
This tokenizes and prepares your data for training.
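Conceptually, this step turns raw text into fixed-length sequences of token ids. The bundled data/prepare.py handles it for you; a simplified sketch of the idea (the real script's output format may differ):

```python
from pathlib import Path
from tokenizers import Tokenizer

tok = Tokenizer.from_file("tokenizer/tokenizer.json")
max_length = 512  # matches llm.config.js

# Tokenize every raw file and pack the ids into max_length-sized blocks.
ids = []
for path in Path("data/raw").glob("*.txt"):
    ids.extend(tok.encode(path.read_text(encoding="utf-8")).ids)

blocks = [ids[i:i + max_length] for i in range(0, len(ids) - max_length, max_length)]
print(f"{len(ids)} tokens -> {len(blocks)} training sequences")
```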
6. Start Training
# Basic training
python training/train.py
# With live dashboard
python training/train.py --dashboard
# Then open http://localhost:5000
# Resume from checkpoint
python training/train.py --resume checkpoints/checkpoint-1000.pt
7. Evaluate Your Model
python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt
8. Generate Text
python evaluation/generate.py \
--checkpoint checkpoints/checkpoint-best.pt \
--prompt "Once upon a time" \
--temperature 0.8
9. Interactive Chat
python chat.py --checkpoint checkpoints/checkpoint-best.pt
10. Deploy
# To Hugging Face
python deploy.py --to huggingface --repo-id username/my-model
# To Replicate
python deploy.py --to replicate --model-name my-model
Project Structure
my-llm/
├── data/
│ ├── raw/ # Your training data goes here
│ ├── processed/ # Tokenized data (auto-generated)
│ ├── dataset.py # PyTorch dataset classes
│ └── prepare.py # Data preprocessing script
│
├── models/
│ ├── architectures/ # Model implementations
│ │ ├── gpt.py # GPT architecture
│ │ ├── nano.py # 1M parameter model
│ │ ├── tiny.py # 6M parameter model
│ │ ├── small.py # 100M parameter model
│ │ └── base.py # 1B parameter model
│ ├── __init__.py
│ └── config.py # Configuration loader
│
├── tokenizer/
│ ├── train.py # Tokenizer training script
│ └── tokenizer.json # Trained tokenizer (auto-generated)
│
├── training/
│ ├── train.py # Main training script
│ ├── trainer.py # Trainer class
│ ├── callbacks/ # Training callbacks
│ └── dashboard/ # Live training dashboard
│
├── evaluation/
│ ├── evaluate.py # Model evaluation
│ └── generate.py # Text generation
│
├── plugins/ # Optional integrations
├── checkpoints/ # Saved models (auto-generated)
├── logs/ # Training logs (auto-generated)
│
├── llm.config.js # Main configuration file
├── requirements.txt # Python dependencies
├── chat.py # Interactive chat interface
├── deploy.py # Deployment script
└── README.md # Project documentation
Configuration
Everything is controlled via llm.config.js:
module.exports = {
// Model architecture
model: {
type: 'gpt',
size: 'tiny',
vocab_size: 10000, // Auto-detected from tokenizer
max_length: 512,
layers: 4,
heads: 4,
dim: 256,
dropout: 0.2,
},
// Training settings
training: {
batch_size: 16,
learning_rate: 0.0006,
warmup_steps: 500,
max_steps: 10000,
eval_interval: 500,
save_interval: 2000,
},
// Plugins
plugins: [
// 'wandb',
// 'huggingface',
],
};
CLI Reference
Commands
npx @theanikrtgiri/create-llm [project-name] [options]
Options
| Option | Description | Default |
|--------|-------------|---------|
| --template <name> | Template to use (nano, tiny, small, base, custom) | Interactive |
| --tokenizer <type> | Tokenizer type (bpe, wordpiece, unigram) | Interactive |
| --skip-install | Skip npm/pip installation | false |
| -y, --yes | Skip all prompts, use defaults | false |
| -h, --help | Show help | - |
| -v, --version | Show version | - |
Examples
# Interactive mode (recommended for first time)
npx @theanikrtgiri/create-llm
# Quick start with defaults
npx @theanikrtgiri/create-llm my-project
# Specify everything
npx @theanikrtgiri/create-llm my-project --template nano --tokenizer bpe --skip-install
# Skip prompts
npx @theanikrtgiri/create-llm my-project -y
Best Practices
Data Preparation
Minimum Data Requirements:
- NANO: 100+ examples (good for learning)
- TINY: 1,000+ examples (minimum for decent results)
- SMALL: 10,000+ examples (recommended)
- BASE: 100,000+ examples (for quality)
Data Quality:
- Use clean, well-formatted text
- Remove HTML, markdown, or special formatting
- Ensure consistent encoding (UTF-8)
- Remove duplicates (see the cleaning sketch after this list)
- Balance different content types
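A minimal cleaning pass covering the encoding and duplicate points above might look like this (illustrative only; the output filename is just an example, and you should review the result before moving it into data/raw/):

```python
from pathlib import Path

seen = set()
clean_lines = []
for path in Path("data/raw").glob("*.txt"):
    # Read as UTF-8, dropping undecodable bytes, then deduplicate lines.
    for line in path.read_text(encoding="utf-8", errors="ignore").splitlines():
        line = line.strip()
        if line and line not in seen:
            seen.add(line)
            clean_lines.append(line)

Path("data/cleaned.txt").write_text("\n".join(clean_lines), encoding="utf-8")
```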
Training Tips
Avoid Overfitting:
- Watch for perplexity < 1.5 (a warning sign of overfitting; see the note after this list)
- Use validation split (10% recommended)
- Increase dropout if overfitting
- Add more data if possible
- Use smaller model for small datasets
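Perplexity is the exponential of the mean per-token cross-entropy loss, so you can compute it directly from the reported validation loss:

```python
import math

val_loss = 0.35              # mean cross-entropy per token (example value)
perplexity = math.exp(val_loss)
print(perplexity)            # ~1.42 -- below 1.5, likely memorizing the training data
```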
Optimize Training:
- Start with NANO to test pipeline
- Use mixed precision on GPU (mixed_precision: true); see the sketch after this list
- Increase gradient_accumulation_steps if OOM
- Monitor GPU usage with the dashboard
- Save checkpoints frequently
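Under the hood, mixed precision and gradient accumulation correspond to a PyTorch loop roughly like the one below (a generic sketch with toy stand-ins, not the generated trainer's code):

```python
import torch
from torch import nn

# Toy stand-ins so the sketch runs; the generated project supplies the real ones.
model = nn.Linear(16, 16)
optimizer = torch.optim.AdamW(model.parameters(), lr=6e-4)
loader = [(torch.randn(8, 16), torch.randn(8, 16)) for _ in range(8)]

scaler = torch.cuda.amp.GradScaler(enabled=torch.cuda.is_available())
accum_steps = 4  # plays the role of gradient_accumulation_steps

optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    with torch.cuda.amp.autocast(enabled=torch.cuda.is_available()):
        # Scale the loss so accumulated gradients average over micro-batches.
        loss = nn.functional.mse_loss(model(inputs), targets) / accum_steps
    scaler.scale(loss).backward()
    if (step + 1) % accum_steps == 0:
        scaler.step(optimizer)      # one optimizer step per accum_steps micro-batches
        scaler.update()
        optimizer.zero_grad()
```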
Troubleshooting
Common Issues
"Vocab size mismatch detected"
- This is normal. The tool auto-detects and fixes it.
- The model will use the actual tokenizer vocab size.
"Position embedding index error" or sequences too long
- Automatically handled. Sequences exceeding max_length are truncated.
- The model logs warnings when truncation occurs.
- Check your data preprocessing if you see frequent truncation warnings.
- Consider increasing max_length in config if you need longer sequences.
"Model may be too large for dataset"
- Warning: Risk of overfitting
- Solutions: Add more data, use smaller template, increase dropout
"CUDA out of memory"
- Reduce batch_size in llm.config.js
- Enable mixed_precision: true
- Increase gradient_accumulation_steps
- Use smaller model template
"Training loss not decreasing"
- Check learning rate (try 1e-4 to 1e-3)
- Verify data is loading correctly
- Check for data preprocessing issues
- Try longer warmup period
Getting Help
Requirements
For CLI Tool
- Node.js 18.0.0 or higher
- npm 8.0.0 or higher
For Training
- Python 3.8 or higher
- PyTorch 2.0.0 or higher
- 4GB RAM minimum (NANO/TINY)
- 12GB VRAM recommended (SMALL)
- 40GB+ VRAM for BASE
Operating Systems
- Windows 10/11
- macOS 10.15+
- Linux (Ubuntu 20.04+)
Development
See DEVELOPMENT.md for development setup and guidelines.
Contributing
We welcome contributions! See CONTRIBUTING.md for guidelines.
Areas Where We Need Help
| Area | Description | Difficulty |
|------|-------------|------------|
| Bug Fixes | Fix issues and improve stability | Easy |
| Documentation | Improve guides and examples | Easy |
| New Templates | Add BERT, T5, custom architectures | Medium |
| Plugins | Integrate new services | Medium |
| Testing | Increase test coverage | Medium |
| i18n | Internationalization support | Hard |
License
MIT © Aniket Giri
See LICENSE for more information.
Acknowledgments
Built with:
- PyTorch - Deep learning framework
- Transformers - Model implementations
- Tokenizers - Fast tokenization
- Commander.js - CLI framework
- Inquirer.js - Interactive prompts
If you find this project useful, please consider giving it a star!
