KothaSet
KothaSet is a powerful CLI tool for generating high-quality datasets using Large Language Models (LLMs) as teacher models. Create diverse training data for fine-tuning smaller models.
Features
- Multi-Provider — OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
- Flexible Schemas — Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
- Streaming Output — Real-time generation with progress tracking
- Resumable — Atomic checkpointing, never lose progress
- Multiple Formats — JSONL, Native Parquet, HuggingFace datasets
- Reproducible — required --seed flag for deterministic generation
- Diversity Control — Input files for sequential topic coverage
- Validation — Validate configs, schemas, datasets, and provider connectivity
Installation
pip (Python)
```bash
pip install kothaset
```
npm (Node.js)
```bash
npm install -g kothaset
```
Homebrew (macOS/Linux)
```bash
brew install shantoislamdev/tap/kothaset
```
Binary Download
Download from GitHub Releases.
From Source
```bash
go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest
```
Quick Start
Initialize configuration:
```bash
kothaset init
```
Set your API key:
```powershell
# Windows PowerShell
$env:OPENAI_API_KEY = "sk-..."
```
```bash
# Linux/macOS
export OPENAI_API_KEY="sk-..."
```
Generate a dataset:
```bash
kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
```
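To sanity-check the result, read a record back. With the instruction schema, each line should be an Alpaca-style JSON object with instruction, input, and output fields (see the schema table under Usage). The snippet below is a generic sketch, not part of KothaSet:

```python
import json

# Read the first generated record (assumes the instruction schema, which
# produces Alpaca-style {"instruction", "input", "output"} objects).
with open("dataset.jsonl", "r", encoding="utf-8") as f:
    record = json.loads(f.readline())

print(sorted(record.keys()))   # expected: ['input', 'instruction', 'output']
print(record["instruction"])
```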
Configuration
KothaSet uses a two-file configuration system for better security and organization:
1. kothaset.yaml (Public)
Contains shared settings, context, and instructions. Safe to commit to git.
```yaml
version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output

  # Context: Background info or persona injected into every prompt
  context: |
    Generate high-quality training data for an AI assistant.
    The data should be helpful, accurate, and well-formatted.

  # Instructions: Specific rules and guidelines for generation
  instructions:
    - Be creative and diverse in topics and approaches
    - Vary the style and complexity of responses
    - Use clear and concise language
```
2. .secrets.yaml (Private)
Contains sensitive provider credentials. Add this to your .gitignore!
```yaml
providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY # Reads from the OPENAI_API_KEY environment variable
    # api_key: sk-...           # Or hardcode the key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
```
Usage
Selecting a Schema
| Schema | Description | Use Case |
|--------|-------------|----------|
| instruction | Alpaca-style {instruction, input, output} | SFT |
| chat | ShareGPT multi-turn conversations | Chat fine-tuning |
| preference | {prompt, chosen, rejected} pairs | DPO/RLHF |
| classification | {text, label} pairs | Classifiers |
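For orientation, records in each schema roughly take the shapes below. These are illustrative examples, not KothaSet output; the field names come from the table above, and the chat layout assumes ShareGPT's conversations/from/value convention.

```python
# Illustrative record shapes for each schema; all values are invented.
instruction_record = {                      # Alpaca-style, for SFT
    "instruction": "Explain what a JOIN does in SQL.",
    "input": "",
    "output": "A JOIN combines rows from two tables based on a related column...",
}

chat_record = {                             # ShareGPT-style multi-turn (assumed layout)
    "conversations": [
        {"from": "human", "value": "How do I reverse a list in Python?"},
        {"from": "gpt", "value": "Use my_list[::-1], or my_list.reverse() in place."},
    ]
}

preference_record = {                       # DPO/RLHF pair
    "prompt": "Summarize the water cycle in one sentence.",
    "chosen": "Water evaporates, condenses into clouds, and falls back as precipitation.",
    "rejected": "The water cycle is when water does stuff.",
}

classification_record = {"text": "This movie was fantastic!", "label": "positive"}
```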
```bash
# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl
```
Output Formats
```bash
# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl
# Parquet
kothaset generate -n 100 --seed 42 -i topics.txt -f parquet -o dataset.parquet
# HuggingFace datasets format
kothaset generate -n 100 --seed 42 -i topics.txt -f hf -o ./my_dataset
```
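Generated files can be loaded with the Hugging Face datasets library for downstream fine-tuning. A minimal sketch, assuming the hf format writes a save_to_disk-style dataset directory:

```python
from datasets import load_dataset, load_from_disk

# JSONL and Parquet outputs load as plain data files.
ds_jsonl = load_dataset("json", data_files="dataset.jsonl", split="train")
ds_parquet = load_dataset("parquet", data_files="dataset.parquet", split="train")

# Assumption: the hf output is a datasets.save_to_disk()-style directory.
ds_hf = load_from_disk("./my_dataset")

print(ds_jsonl[0])
```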
Advanced Options
```bash
# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl
# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl
# Resume interrupted generation
kothaset generate --resume dataset.jsonl.checkpoint
# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt
```
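The file passed with -i drives topic coverage during generation. Its exact format isn't documented in this README; the sketch below assumes one topic per line, so verify against the full docs before relying on it.

```python
# Hypothetical helper for building a topics file for the -i flag.
# Assumption: KothaSet reads one topic per line (unverified).
topics = [
    "Python decorators",
    "SQL window functions",
    "Unit testing strategies",
    "REST API design",
]
with open("topics.txt", "w", encoding="utf-8") as f:
    f.write("\n".join(topics) + "\n")
```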
Documentation
- Getting Started
- Reference
- Help
Contributing
Contributions welcome! See CONTRIBUTING.md.
License
Apache 2.0 License. See LICENSE.
