KothaSet

KothaSet is a powerful CLI tool for generating high-quality datasets using Large Language Models (LLMs) as teacher models. Create diverse training data for fine-tuning smaller models.

Features

  • Multi-Provider — OpenAI and OpenAI-compatible APIs (DeepSeek, vLLM, Ollama)
  • Flexible Schemas — Instruction (Alpaca), Chat (ShareGPT), Preference (DPO), Classification
  • Streaming Output — Real-time generation with progress tracking
  • Resumable — Atomic checkpointing, never lose progress
  • Multiple Formats — JSONL, Native Parquet, HuggingFace datasets
  • Reproducible — Required seed for deterministic LLM generation
  • Diversity Control — Input files for sequential topic coverage
  • Validation — Validate configs, schemas, datasets, and provider connectivity

Installation

pip (Python)

pip install kothaset

npm (Node.js)

npm install -g kothaset

Homebrew (macOS/Linux)

brew install shantoislamdev/tap/kothaset

Binary Download

Download from GitHub Releases.

From Source

go install github.com/shantoislamdev/kothaset/cmd/kothaset@latest

Quick Start

  1. Initialize configuration:

    kothaset init
  2. Set your API key:

    # Windows PowerShell
    $env:OPENAI_API_KEY = "sk-..."
       
    # Linux/macOS
    export OPENAI_API_KEY="sk-..."
  3. Generate a dataset:

    kothaset generate -n 100 -s instruction --seed 42 -i topics.txt -o dataset.jsonl
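
The -i flag above reads from a topics file that drives topic coverage (see Diversity Control under Features). Its exact format isn't documented in this README; a minimal sketch, assuming one topic per line:

topics.txt

explaining scientific concepts to beginners
debugging common Python errors
writing short product descriptions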

Configuration

KothaSet uses a two-file configuration system for better security and organization:

1. kothaset.yaml (Public)

Contains shared settings, context, and instructions. Safe to commit to git.

version: "1.0"
global:
  provider: openai
  schema: instruction
  model: gpt-5.2
  concurrency: 4
  output_dir: ./output

# Context: Background info or persona injected into every prompt
context: |
  Generate high-quality training data for an AI assistant.
  The data should be helpful, accurate, and well-formatted.

# Instructions: Specific rules and guidelines for generation
instructions:
  - Be creative and diverse in topics and approaches
  - Vary the style and complexity of responses
  - Use clear and concise language

2. .secrets.yaml (Private)

Contains sensitive provider credentials. Add this to your .gitignore!

providers:
  - name: openai
    type: openai
    api_key: env.OPENAI_API_KEY  # Reads from environment variable
    # api_key: sk-...            # Or hardcode key directly
    timeout: 1m
    rate_limit:
      requests_per_minute: 60

  # Custom endpoint example (DeepSeek, vLLM)
  - name: local
    type: openai
    base_url: http://localhost:8000/v1
    api_key: not-needed
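
To keep credentials out of version control, add the file to your ignore list before the first commit:

echo ".secrets.yaml" >> .gitignore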

Usage

Selecting a Schema

| Schema         | Description                               | Use Case         |
|----------------|-------------------------------------------|------------------|
| instruction    | Alpaca-style {instruction, input, output} | SFT              |
| chat           | ShareGPT multi-turn conversations         | Chat fine-tuning |
| preference     | {prompt, chosen, rejected} pairs          | DPO/RLHF         |
| classification | {text, label} pairs                       | Classifiers      |
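
The field names in the table map onto the generated records. As an illustrative sketch (not verbatim tool output; the chat layout assumes the common ShareGPT convention), one JSONL line per schema might look like:

{"instruction": "Explain recursion in one paragraph.", "input": "", "output": "Recursion is a technique where a function calls itself on a smaller subproblem until it reaches a base case..."}
{"conversations": [{"from": "human", "value": "What is overfitting?"}, {"from": "gpt", "value": "Overfitting is when a model memorizes its training data instead of generalizing..."}]}
{"prompt": "Write a haiku about autumn.", "chosen": "Crisp leaves drift and fall...", "rejected": "Autumn is a season with leaves."}
{"text": "This charger stopped working after a week.", "label": "negative"}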

# Instruction dataset
kothaset generate -n 1000 -s instruction --seed 42 -i topics.txt -o instructions.jsonl

# Chat conversations
kothaset generate -n 500 -s chat --seed 123 -i conversations.txt -o conversations.jsonl

# Preference pairs for DPO  
kothaset generate -n 500 -s preference --seed 456 -i pairs.txt -o dpo_data.jsonl

Output Formats

# JSONL (default)
kothaset generate -n 100 --seed 42 -i topics.txt -f jsonl -o dataset.jsonl

# Parquet
kothaset generate -n 100 --seed 42 -i topics.txt -f parquet -o dataset.parquet

# HuggingFace datasets format
kothaset generate -n 100 --seed 42 -i topics.txt -f hf -o ./my_dataset
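
Whichever format you pick, it helps to spot-check the result before training. For JSONL output (one JSON record per line), standard shell tools are enough:

# Peek at the first few records and count how many were written
head -n 3 dataset.jsonl
wc -l dataset.jsonl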

Advanced Options

# Use custom provider
kothaset generate -n 100 --seed 42 -i topics.txt -p local -o dataset.jsonl

# Control diversity with input file
kothaset generate -n 1000 --seed 42 -i topics.txt -o diverse.jsonl

# Resume interrupted generation
kothaset generate --resume dataset.jsonl.checkpoint

# Dry run (validate config)
kothaset generate --dry-run -n 100 --seed 42 -i topics.txt

Documentation

  • Getting Started
  • Reference
  • Help

Contributing

Contributions welcome! See CONTRIBUTING.md.

License

Apache 2.0 License. See LICENSE.