create-llm

CLI tool for scaffolding LLM training projects

Create production-ready LLM training projects in seconds. Similar to create-next-app but for training custom language models.


npm Package · Documentation · Report Bug · Request Feature

npx @theanikrtgiri/create-llm my-awesome-llm
cd my-awesome-llm
pip install -r requirements.txt
python training/train.py

Why create-llm?

Training a language model from scratch requires:

  • Model architecture (GPT, BERT, T5...)
  • Data preprocessing pipeline
  • Tokenizer training
  • Training loop with callbacks
  • Checkpoint management
  • Evaluation metrics
  • Text generation
  • Deployment tools

create-llm provides all of this in one command.


Features

Right-Sized Templates

Choose from 4 templates optimized for different use cases:

  • NANO (1M params) - Learn in 2 minutes on any laptop
  • TINY (6M params) - Prototype in 15 minutes on CPU
  • SMALL (100M params) - Production models in hours
  • BASE (1B params) - Research-grade in days

Complete Toolkit

Everything you need out of the box:

  • PyTorch training infrastructure
  • Data preprocessing pipeline
  • Tokenizer training (BPE, WordPiece, Unigram)
  • Checkpoint management with auto-save
  • TensorBoard integration
  • Live training dashboard
  • Interactive chat interface
  • Model comparison tools
  • Deployment scripts

Smart Defaults

Intelligent configuration that:

  • Auto-detects vocab size from tokenizer
  • Automatically handles sequence length mismatches
  • Warns about model/data size mismatches
  • Detects overfitting during training
  • Suggests optimal hyperparameters
  • Handles cross-platform paths
  • Provides detailed diagnostic messages for errors
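
The vocab-size auto-detection listed above can be pictured with a short sketch. This is an illustration only, not the tool's actual code; it assumes the generated tokenizer/tokenizer.json and the Hugging Face tokenizers library:

from tokenizers import Tokenizer

# Illustrative stand-in for the model section of llm.config.js
config = {"model": {"vocab_size": 10000}}

# Load the tokenizer trained for the project (path from the generated layout)
tokenizer = Tokenizer.from_file("tokenizer/tokenizer.json")

# Prefer the tokenizer's real vocabulary size over the configured value
actual = tokenizer.get_vocab_size()
if config["model"]["vocab_size"] != actual:
    print(f"Vocab size mismatch: config has {config['model']['vocab_size']}, tokenizer has {actual}; using {actual}.")
    config["model"]["vocab_size"] = actual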

Plugin System

Optional integrations:

  • WandB - Experiment tracking
  • HuggingFace - Model sharing

Quick Start

One-Command Setup

# Using npx (recommended - no installation needed)
npx @theanikrtgiri/create-llm my-llm

# Or install globally
npm install -g @theanikrtgiri/create-llm
create-llm my-llm

Interactive Setup

npx @theanikrtgiri/create-llm

You'll be prompted for:

  • Project name
  • Template (NANO, TINY, SMALL, BASE)
  • Tokenizer type (BPE, WordPiece, Unigram)
  • Optional plugins (WandB, HuggingFace)

Quick Mode

# Specify everything upfront
npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe --skip-install

Templates

NANO

For learning and quick experiments

Parameters: ~1M
Hardware:   Any CPU (2GB RAM)
Time:       1-2 minutes
Data:       100+ examples
Use:        Learning, testing, demos

When to use:

  • First time training an LLM
  • Quick experiments and testing
  • Educational purposes
  • Understanding the pipeline
  • Limited data (100-1000 examples)

TINY

For prototyping and small projects

Parameters: ~6M
Hardware:   CPU or basic GPU (4GB RAM)
Time:       5-15 minutes
Data:       1,000+ examples
Use:        Prototypes, small projects

When to use:

  • Small-scale projects
  • Limited data (1K-10K examples)
  • Prototyping before scaling
  • Personal experiments
  • CPU-only environments

SMALL

For production applications

Parameters: ~100M
Hardware:   RTX 3060+ (12GB VRAM)
Time:       1-3 hours
Data:       10,000+ examples
Use:        Production, real apps

When to use:

  • Production applications
  • Domain-specific models
  • Real-world deployments
  • Good data availability
  • GPU available

BASE

For research and high-quality models

Parameters: ~1B
Hardware:   A100 or multi-GPU
Time:       1-3 days
Data:       100,000+ examples
Use:        Research, high-quality

When to use:

  • Research projects
  • High-quality requirements
  • Large datasets available
  • Multi-GPU setup
  • Competitive performance needed

Complete Workflow

1. Create Your Project

npx @theanikrtgiri/create-llm my-llm --template tiny --tokenizer bpe
cd my-llm

2. Install Dependencies

pip install -r requirements.txt

3. Add Your Data

Place your text files in data/raw/:

# Example: Download Shakespeare
curl https://www.gutenberg.org/files/100/100-0.txt > data/raw/shakespeare.txt

# Or add your own files
cp /path/to/your/data.txt data/raw/

Tip: Start with at least 1MB of text for meaningful results

4. Train Tokenizer

python tokenizer/train.py --data data/raw/

This creates a vocabulary from your data.
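
For context, BPE tokenizer training with the Hugging Face tokenizers library follows roughly this pattern. This is a simplified sketch rather than the generated tokenizer/train.py, and the file paths are assumptions:

from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Empty BPE tokenizer plus a trainer with the target vocabulary size
tokenizer = Tokenizer(models.BPE(unk_token="[UNK]"))
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
trainer = trainers.BpeTrainer(vocab_size=10000, special_tokens=["[UNK]", "[PAD]"])

# Learn merges from the raw text, then save the vocabulary for later steps
tokenizer.train(files=["data/raw/shakespeare.txt"], trainer=trainer)
tokenizer.save("tokenizer/tokenizer.json")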

5. Prepare Dataset

python data/prepare.py

This tokenizes and prepares your data for training.
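
Conceptually, this step encodes the raw text into token IDs and writes them in a compact binary form that the PyTorch dataset can load quickly. A minimal sketch, not the generated data/prepare.py; the output filename is an assumption:

import numpy as np
from tokenizers import Tokenizer

tokenizer = Tokenizer.from_file("tokenizer/tokenizer.json")

# Encode the raw corpus into a flat sequence of token IDs
with open("data/raw/shakespeare.txt", encoding="utf-8") as f:
    ids = tokenizer.encode(f.read()).ids

# uint16 is enough for a 10,000-token vocabulary and keeps the file small
np.array(ids, dtype=np.uint16).tofile("data/processed/train.bin")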

6. Start Training

# Basic training
python training/train.py

# With live dashboard
python training/train.py --dashboard
# Then open http://localhost:5000

# Resume from checkpoint
python training/train.py --resume checkpoints/checkpoint-1000.pt

7. Evaluate Your Model

python evaluation/evaluate.py --checkpoint checkpoints/checkpoint-best.pt

8. Generate Text

python evaluation/generate.py \
  --checkpoint checkpoints/checkpoint-best.pt \
  --prompt "Once upon a time" \
  --temperature 0.8
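
The --temperature flag controls how sharply the next-token distribution is peaked: lower values make generation more deterministic, higher values more diverse. A minimal PyTorch sketch of temperature sampling (illustrative, not the generated evaluation/generate.py):

import torch

def sample_next_token(logits: torch.Tensor, temperature: float = 0.8) -> int:
    # Divide logits by the temperature before softmax, then sample
    probs = torch.softmax(logits / temperature, dim=-1)
    return torch.multinomial(probs, num_samples=1).item()

# Example with dummy logits over a 10,000-token vocabulary
next_id = sample_next_token(torch.randn(10000), temperature=0.8)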

9. Interactive Chat

python chat.py --checkpoint checkpoints/checkpoint-best.pt

10. Deploy

# To Hugging Face
python deploy.py --to huggingface --repo-id username/my-model

# To Replicate
python deploy.py --to replicate --model-name my-model
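
For reference, the Hugging Face deployment amounts to uploading the checkpoint folder to the Hub. A rough sketch using the huggingface_hub library, not the generated deploy.py; the repo ID and folder are placeholders:

from huggingface_hub import HfApi

api = HfApi()  # picks up the token saved by `huggingface-cli login`
api.create_repo(repo_id="username/my-model", repo_type="model", exist_ok=True)
api.upload_folder(
    folder_path="checkpoints",   # local files to publish
    repo_id="username/my-model",
    repo_type="model",
)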

Project Structure

my-llm/
├── data/
│   ├── raw/              # Your training data goes here
│   ├── processed/        # Tokenized data (auto-generated)
│   ├── dataset.py        # PyTorch dataset classes
│   └── prepare.py        # Data preprocessing script
│
├── models/
│   ├── architectures/    # Model implementations
│   │   ├── gpt.py       # GPT architecture
│   │   ├── nano.py      # 1M parameter model
│   │   ├── tiny.py      # 6M parameter model
│   │   ├── small.py     # 100M parameter model
│   │   └── base.py      # 1B parameter model
│   ├── __init__.py
│   └── config.py        # Configuration loader
│
├── tokenizer/
│   ├── train.py         # Tokenizer training script
│   └── tokenizer.json   # Trained tokenizer (auto-generated)
│
├── training/
│   ├── train.py         # Main training script
│   ├── trainer.py       # Trainer class
│   ├── callbacks/       # Training callbacks
│   └── dashboard/       # Live training dashboard
│
├── evaluation/
│   ├── evaluate.py      # Model evaluation
│   └── generate.py      # Text generation
│
├── plugins/             # Optional integrations
├── checkpoints/         # Saved models (auto-generated)
├── logs/               # Training logs (auto-generated)
│
├── llm.config.js       # Main configuration file
├── requirements.txt    # Python dependencies
├── chat.py            # Interactive chat interface
├── deploy.py          # Deployment script
└── README.md          # Project documentation

Configuration

Everything is controlled via llm.config.js:

module.exports = {
  // Model architecture
  model: {
    type: 'gpt',
    size: 'tiny',
    vocab_size: 10000,      // Auto-detected from tokenizer
    max_length: 512,
    layers: 4,
    heads: 4,
    dim: 256,
    dropout: 0.2,
  },

  // Training settings
  training: {
    batch_size: 16,
    learning_rate: 0.0006,
    warmup_steps: 500,
    max_steps: 10000,
    eval_interval: 500,
    save_interval: 2000,
  },

  // Plugins
  plugins: [
    // 'wandb',
    // 'huggingface',
  ],
};

CLI Reference

Commands

npx @theanikrtgiri/create-llm [project-name] [options]

Options

| Option | Description | Default |
|--------|-------------|---------|
| --template <name> | Template to use (nano, tiny, small, base, custom) | Interactive |
| --tokenizer <type> | Tokenizer type (bpe, wordpiece, unigram) | Interactive |
| --skip-install | Skip npm/pip installation | false |
| -y, --yes | Skip all prompts, use defaults | false |
| -h, --help | Show help | - |
| -v, --version | Show version | - |

Examples

# Interactive mode (recommended for first time)
npx @theanikrtgiri/create-llm

# Quick start with defaults
npx @theanikrtgiri/create-llm my-project

# Specify everything
npx @theanikrtgiri/create-llm my-project --template nano --tokenizer bpe --skip-install

# Skip prompts
npx @theanikrtgiri/create-llm my-project -y

Best Practices

Data Preparation

Minimum Data Requirements:

  • NANO: 100+ examples (good for learning)
  • TINY: 1,000+ examples (minimum for decent results)
  • SMALL: 10,000+ examples (recommended)
  • BASE: 100,000+ examples (for quality)

Data Quality:

  • Use clean, well-formatted text
  • Remove HTML, markdown, or special formatting
  • Ensure consistent encoding (UTF-8)
  • Remove duplicates
  • Balance different content types

Training Tips

Avoid Overfitting:

  • Watch for perplexity < 1.5 (warning sign)
  • Use validation split (10% recommended)
  • Increase dropout if overfitting
  • Add more data if possible
  • Use smaller model for small datasets
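
The perplexity threshold above follows directly from the validation loss, since perplexity is exp(cross-entropy loss in nats). A tiny sketch of the check (illustrative only):

import math

def check_overfitting(val_loss: float) -> None:
    perplexity = math.exp(val_loss)  # perplexity = e^loss for cross-entropy in nats
    if perplexity < 1.5:
        print(f"Warning: perplexity {perplexity:.2f} is below 1.5, a sign of possible overfitting.")

check_overfitting(0.35)  # exp(0.35) is about 1.42, so this triggers the warning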

Optimize Training:

  • Start with NANO to test pipeline
  • Use mixed precision on GPU (mixed_precision: true)
  • Increase gradient_accumulation_steps if OOM
  • Monitor GPU usage with dashboard
  • Save checkpoints frequently
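
Gradient accumulation, mentioned in the OOM tip above, trades speed for memory by accumulating gradients over several small batches before each optimizer step. A runnable toy sketch of the pattern in PyTorch (the real trainer would use the project's model and DataLoader):

import torch
from torch import nn

# Toy model and data so the pattern is runnable end to end
model = nn.Linear(8, 1)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loader = [(torch.randn(4, 8), torch.randn(4, 1)) for _ in range(8)]

accumulation_steps = 4  # effective batch size = micro-batch size * 4
optimizer.zero_grad()
for step, (inputs, targets) in enumerate(loader):
    loss = nn.functional.mse_loss(model(inputs), targets)
    (loss / accumulation_steps).backward()  # scale so the accumulated gradient averages correctly
    if (step + 1) % accumulation_steps == 0:
        optimizer.step()
        optimizer.zero_grad()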

Troubleshooting

Common Issues

"Vocab size mismatch detected"

  • This is normal. The tool auto-detects and fixes it.
  • The model will use the actual tokenizer vocab size.

"Position embedding index error" or sequences too long

  • Automatically handled. Sequences exceeding max_length are truncated.
  • The model logs warnings when truncation occurs.
  • Check your data preprocessing if you see frequent truncation warnings.
  • Consider increasing max_length in config if you need longer sequences.

"Model may be too large for dataset"

  • Warning: Risk of overfitting
  • Solutions: Add more data, use smaller template, increase dropout

"CUDA out of memory"

  • Reduce batch_size in llm.config.js
  • Enable mixed_precision: true
  • Increase gradient_accumulation_steps
  • Use smaller model template

"Training loss not decreasing"

  • Check learning rate (try 1e-4 to 1e-3)
  • Verify data is loading correctly
  • Check for data preprocessing issues
  • Try longer warmup period

Getting Help


Requirements

For CLI Tool

  • Node.js 18.0.0 or higher
  • npm 8.0.0 or higher

For Training

  • Python 3.8 or higher
  • PyTorch 2.0.0 or higher
  • 4GB RAM minimum (NANO/TINY)
  • 12GB VRAM recommended (SMALL)
  • 40GB+ VRAM for BASE

Operating Systems

  • Windows 10/11
  • macOS 10.15+
  • Linux (Ubuntu 20.04+)

Development

See DEVELOPMENT.md for development setup and guidelines.


Contributing

We welcome contributions! See CONTRIBUTING.md for guidelines.

Areas We Need Help

| Area | Description | Difficulty |
|------|-------------|------------|
| Bug Fixes | Fix issues and improve stability | Easy |
| Documentation | Improve guides and examples | Easy |
| New Templates | Add BERT, T5, custom architectures | Medium |
| Plugins | Integrate new services | Medium |
| Testing | Increase test coverage | Medium |
| i18n | Internationalization support | Hard |


License

MIT © Aniket Giri

See LICENSE for more information.


Acknowledgments

Built with:


If you find this project useful, please consider giving it a star!

GitHub · npm · Issues