convokit
v1.0.2
A flexible TypeScript framework for ingesting, processing, and exporting chat/conversation data for LLM training and analysis.
ConvoKit: Flexible Conversation Processing & Export Toolkit
ConvoKit is a TypeScript-based framework for ingesting, normalizing, filtering, sampling, formatting, and exporting chat or conversation data for LLM training and analysis. It provides:
- A provider registry to plug in new data sources (Discord, Slack, custom exports, etc.).
- A plugin registry for formatters, converters, and filters to transform and export data to ChatML, Gemini JSONL, custom context formats, and more.
- A fully configurable, extensible pipeline: ingest → normalize → filter → importance‑score → sample → format → export.
ConvoKit saves you time building data preprocessing pipelines and lets you focus on models and prompts.
Table of Contents
- Key Features
- What It Can & Cannot Do
- Who Should Use It
- Installation
- Quick Start
- Configuration
- CLI Usage
- Provider Registry
- Built‑in Providers
- Writing Your Own Provider
- Plugin Registry
- Formatters
- Converters
- Filters
- Writing Your Own Plugin
- Contributing
- License
Key Features
- Dynamic Provider Loading: Automatically discover and load data providers from your project’s providers folder.
- Normalized Conversation Format: All data converges to a ConvoKitConversation interface (metadata + message arrays).
- Context Formatting: Generate a single, line-delimited training string (CKContext) with options for time‑gaps, new‑conversation markers, and importance scoring.
- Turn‑List Conversion: Break context into turn lists (CKTurnListConversation) for sampling or LLM‑specific export.
- Weighted Sampling: Sample by conversation importance to focus on high‑value exchanges.
- Export Plugins: Export to ChatML JSONL, Gemini JSONL, or add your own converter for other LLM formats.
- Filter Plugins: Drop unwanted messages (e.g. links‑only, emoji‑only, code‑only) via a simple plugin API.
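To make the weighted-sampling idea concrete, here is a minimal sketch of importance-weighted sampling without replacement. It is not ConvoKit's actual implementation: `ScoredConversation`, `weightedSample`, and the injectable `rng` parameter are hypothetical names used only for illustration, and the sketch assumes importance is a non-negative number.

```typescript
// Hypothetical shape; ConvoKit's real conversation type carries more fields.
interface ScoredConversation {
  id: string;
  importance: number; // assumed to be a non-negative score
}

// Draw up to `size` conversations without replacement. At each draw, the
// chance of picking a remaining conversation is proportional to its importance.
function weightedSample(
  items: ScoredConversation[],
  size: number,
  rng: () => number = Math.random,
): ScoredConversation[] {
  // Conversations with zero (or negative) importance can never be drawn.
  const pool = items.filter((c) => c.importance > 0);
  const picked: ScoredConversation[] = [];
  while (picked.length < size && pool.length > 0) {
    const total = pool.reduce((sum, c) => sum + c.importance, 0);
    let threshold = rng() * total;
    let index = pool.length - 1; // fallback guards against rounding error
    for (let i = 0; i < pool.length; i++) {
      threshold -= pool[i].importance;
      if (threshold < 0) {
        index = i;
        break;
      }
    }
    picked.push(pool.splice(index, 1)[0]);
  }
  return picked;
}

const sample = weightedSample(
  [
    { id: "a", importance: 300 },
    { id: "b", importance: 120 },
    { id: "c", importance: 0 },
  ],
  2,
);
console.log(sample.map((c) => c.id));
```

Passing a fixed `rng` makes the draw deterministic, which is handy when you want reproducible training sets.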
What It Can & Cannot Do
Can:
- Ingest JSON exports from Discord (via DiscordChatExporter), or any custom source you add via the Provider Registry.
- Normalize and filter conversations by message content, length, or custom rules.
- Score message & conversation importance automatically based on time, length, and frequency.
- Sample highly‑important conversations for training budgets.
- Export to popular LLM chat formats (ChatML, Gemini), or extend it with your own converters.
Cannot:
- Perform LLM inference or model training directly. Not yet, anyway ;)
- Resolve references across conversations (thread linking across channels).
- Guarantee perfect import schema for every data source—you may need to write a provider to handle custom formats.
- Handle binary or non‑JSON data without extending a provider to preprocess it.
Who Should Use It
- NLP / ML Engineers preparing chat‑based LLM fine‑tuning or analysis datasets.
- Bot / Chat Service Developers needing to transform raw chat logs into structured training data.
- Researchers studying conversation dynamics or designing importance‑based sampling strategies.
- Community Contributors eager to add support for new platforms or export formats.
Possibly upcoming features
- Personality: Generate a deep and comprehensive personality prompt based on your output ck_context.
- Fine-tuning: Fine-tune models with exported training data (currently mainly looking at Gemini; contributions welcome!).
- Model Testing: Test your fine-tuned model via the terminal (currently mainly looking at Gemini; contributions welcome!).
- Unit Tests: Adding unit tests would help keep everything maintainable and stable (or so I've heard).
Installation
# Install globally (recommended for CLI use)
npm install -g convokit
# Or install locally in your project
npm install convokit
Quick Start (Using the Library)
import { ConvoKit, loadConfig, getConfig } from 'convokit';
import { config } from 'dotenv';
config();
await loadConfig();
async function run() {
  const ck = new ConvoKit();
  // Loads all included providers, plus the providers in localProvidersDir if set in the config.
  await ck.loadProviders();
  // Included plugins, plus the plugins in localPluginsDir if set in the config, are loaded automatically.
  const convoData = await ck.processDataFromProviders();
  const context = await ck.parseToContext({ targetUsers: getConfig().targetUsers });
  await ck.convertToCKTurnList();
  await ck.getWeightedSample(getConfig().sampleSize);
  const chatml = await ck.exportToChatML(getConfig().systemPrompt);
  const gemini = await ck.exportToGemini(getConfig().systemPrompt);
  // Do whatever you want with the outputs
}
run();
Make sure you have set up your providers and directory structure first.
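The turn-list step can merge consecutive messages from the same author (see shouldMergeConsecutiveMessages in the Configuration section). A minimal sketch of that idea follows. `SimpleMessage` and `mergeConsecutive` are hypothetical names, and joining with a newline is an assumption rather than ConvoKit's documented behavior.

```typescript
// Hypothetical message shape, for illustration only.
interface SimpleMessage {
  authorId: string;
  content: string;
}

// Collapse runs of messages by the same author into a single turn.
function mergeConsecutive(
  messages: SimpleMessage[],
  separator: string = "\n", // assumed separator
): SimpleMessage[] {
  const turns: SimpleMessage[] = [];
  for (const msg of messages) {
    const last = turns[turns.length - 1];
    if (last && last.authorId === msg.authorId) {
      last.content += separator + msg.content;
    } else {
      turns.push({ ...msg }); // copy so the input array is not mutated
    }
  }
  return turns;
}

const merged = mergeConsecutive([
  { authorId: "alice", content: "hi" },
  { authorId: "alice", content: "are you there?" },
  { authorId: "bob", content: "yes" },
]);
console.log(merged);
```

Merging like this keeps a strict user/assistant alternation, which most chat export formats expect.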
Configuration
By default, ConvoKit reads its settings from convokit.config.json or from environment variables. Here is an example config file:
{
  "inputDataDirName": "InputData",
  "outputDataDirName": "OutputData",
  "targetUsers": [
    { "providerId": "discord", "id": "YOUR_DISCORD_USER_ID" }
  ],
  "sampleSize": 5000,
  "systemPrompt": "You are a helpful assistant.",
  "minImportanceChat": 120,
  "minImportanceMessage": 100,
  "enableDebugging": false,
  "enablePerformanceStats": false,
  "shouldMergeConsecutiveMessages": true,
  "enableWarnings": true,
  "anonymizeProviderConversationIds": false,
  "localProvidersDir": "LocalProviders",
  "localPluginsDir": "LocalPlugins"
}

| Key | Description |
|-----|-------------|
| inputDataDirName | Directory containing raw chat exports (relative to project root). |
| outputDataDirName | Directory to write formatted outputs. |
| targetUsers | JSON array mapping each provider to a target user ID for context generation. |
| sampleSize | Number of conversations to sample by importance. |
| systemPrompt | System prompt used in ChatML/Gemini exports. |
| minImportanceChat (optional) | Minimum average importance score for a conversation (default: 120). |
| minImportanceMessage (optional) | Minimum importance score for a single message (default: 100). |
| enableDebugging (optional) | Enable or disable debug-level logs. |
| enablePerformanceStats (optional) | Enable or disable performance stats (timers). |
| shouldMergeConsecutiveMessages (optional) | Merge consecutive messages when converting to CKTurnList. |
| enableWarnings (optional) | Toggle the display of warning messages. |
| anonymizeProviderConversationIds (optional) | Anonymize provider conversation IDs to protect sensitive data. |
| localProvidersDir (optional) | Name of the directory to load custom providers from. |
| localPluginsDir (optional) | Name of the directory to load custom plugins from. It contains one folder per plugin type (formatters, filters, converters). |
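If you load the config yourself, a typed view with the documented defaults applied can catch mistakes early. The sketch below covers only a subset of the keys. `ConvoKitConfigSketch` and `withDefaults` are hypothetical names, not ConvoKit exports, and the sampleSize check is an added assumption.

```typescript
interface TargetUser {
  providerId: string;
  id: string;
}

// Subset of the config keys; optional keys mirror the "(optional)" rows.
interface ConvoKitConfigSketch {
  inputDataDirName: string;
  outputDataDirName: string;
  targetUsers: TargetUser[];
  sampleSize: number;
  systemPrompt: string;
  minImportanceChat?: number;
  minImportanceMessage?: number;
}

// Fill in the documented defaults (120 and 100) for the importance thresholds.
function withDefaults(raw: ConvoKitConfigSketch): Required<ConvoKitConfigSketch> {
  // Assumption: a sample size only makes sense as a positive integer.
  if (!Number.isInteger(raw.sampleSize) || raw.sampleSize <= 0) {
    throw new Error("sampleSize must be a positive integer");
  }
  return {
    ...raw,
    minImportanceChat: raw.minImportanceChat ?? 120,
    minImportanceMessage: raw.minImportanceMessage ?? 100,
  };
}

const resolved = withDefaults({
  inputDataDirName: "InputData",
  outputDataDirName: "OutputData",
  targetUsers: [{ providerId: "discord", id: "YOUR_DISCORD_USER_ID" }],
  sampleSize: 5000,
  systemPrompt: "You are a helpful assistant.",
});
console.log(resolved.minImportanceChat, resolved.minImportanceMessage);
```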
Directory Structure
In your convokit.config.json file you set an inputDataDirName. That directory needs one subdirectory per provider, and each subdirectory holds that provider's exported data.
Example for use with the Discord provider, with inputDataDirName set to InputData:
convokit/
├── index.ts
├── convokit.config.json
├── ... other files and folders
└── InputData
    └── discord
        └── Direct Messages - fishylunar [000000000000000].json
Note: the filenames of the exported data don't matter, but the extension does.
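The layout rule (one directory per provider, any filename, fixed extension) can be expressed as a small predicate. This is only a sketch: `isProviderInput` is a hypothetical helper, the `InputDataInfo` fields mirror the ones shown in the provider example later in this README, and treating the extension as case-insensitive is an assumption.

```typescript
// Same two fields as the InputDataInfo object in the provider example.
interface InputDataInfo {
  directoryName: string;
  fileExtension: string;
}

// True when a project-relative path such as "InputData/discord/foo.json"
// sits directly inside the provider's directory and has the right extension.
function isProviderInput(
  relativePath: string,
  inputDataDirName: string,
  info: InputDataInfo,
): boolean {
  const parts = relativePath.split("/");
  if (parts.length !== 3) return false;
  const [root, providerDir, fileName] = parts;
  return (
    root === inputDataDirName &&
    providerDir === info.directoryName &&
    // Assumption: extension matching ignores case.
    fileName.toLowerCase().endsWith(info.fileExtension.toLowerCase())
  );
}

const discordInfo: InputDataInfo = { directoryName: "discord", fileExtension: ".json" };
console.log(
  isProviderInput(
    "InputData/discord/Direct Messages - fishylunar [000000000000000].json",
    "InputData",
    discordInfo,
  ),
);
```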
CLI Usage
ConvoKit provides a command-line interface (CLI) for running the processing pipeline without writing TypeScript code. Ensure you have a valid convokit.config.json file in your project root or have set the corresponding environment variables.
Running Commands:
# If installed globally
convokit <command> [options]
# If installed locally, using npx
npx convokit <command> [options]
# Or via package.json script
# "scripts": { "ck": "convokit" }
# npm run ck -- <command> [options]
Common Options:
- -p, --providers <ids>: Specify a comma-separated list of provider IDs (e.g., discord,telegram) to process data from. If omitted, ConvoKit will attempt to use data from all registered providers found in your inputDataDirName.
- -o, --output <file>: Specify an output file path to save the results of commands like context or export. If omitted, results are generated but not saved to a file (stats/logs will still be shown).
Commands:
- create-config (alias: cfg): Creates an example convokit.config.json file in the current directory. Run this first if you don't have a config file.
  convokit create-config
- providers: Lists all registered providers (built-in and local) found by ConvoKit, including their ID, name, version, and expected input directory/extension. Useful for verifying provider setup and getting IDs for the --providers option.
  convokit providers
- plugins: Lists all registered plugins (formatters, converters, filters), including built-in and local ones. Shows plugin ID, name, and version. Useful for finding the <converter_id> for the export command.
  convokit plugins
- context: Processes data from specified (or all) providers and generates the CKContext output based on your configuration (targetUsers, importance scores, etc.).
  # Generate context from all providers and save to context.txt
  convokit context -o context.txt
  # Generate context using only 'discord' provider data and save
  convokit context --providers discord -o discord_context.txt
  # Generate context from all providers and save to context.json including stats
  convokit context -o context.json --stats
- export <converter_id>: Runs the full pipeline: loads data, processes it, generates context, converts to turn list, performs weighted sampling (using sampleSize from config), and finally exports the data using the specified <converter_id>.
  # Export data using the 'chatml' converter, save to chatml_export.jsonl
  convokit export chatml -o chatml_export.jsonl
  # Export using 'gemini' converter from 'telegram' provider only, save output
  convokit export gemini --providers telegram -o telegram_gemini.jsonl
Example Workflow:
# 1. Create a config file if you don't have one
convokit create-config
# (Edit convokit.config.json with your settings: input dir, target users, etc.)
# 2. Check which providers are available
convokit providers
# Output might show: ID: discord, ID: telegram
# 3. Check available export formats (converters)
convokit plugins
# Output might show Converters: ID: chatml, ID: gemini
# 4. Run the full export pipeline for ChatML using all providers
convokit export chatml -o training_data.jsonl
# 5. (Alternative) Generate only the CKContext for analysis
convokit context -o analysis_context.json
Provider Registry
ConvoKit discovers providers from the providers folder via ProviderRegistry. Each provider must:
- Implement ConvoKitProvider with Test() and Convert().
- Export a static ProviderInfo object.
- Register itself via ProviderRegistry.register(id, ProviderClass, ProviderInfo).
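To illustrate how these three requirements fit together, here is a self-contained toy registry. It is a sketch of the pattern only: `MiniProviderRegistry`, `EchoProvider`, and the simplified `Conversation` type are hypothetical stand-ins, and ConvoKit's real ProviderRegistry and interfaces may differ.

```typescript
// Simplified stand-ins for ConvoKit's types.
interface Conversation {
  id: string;
  messages: { author: string; content: string }[];
}

interface Provider {
  Test(): boolean; // does the raw data match this provider's schema?
  Convert(): Conversation; // transform raw data into the normalized shape
}

interface ProviderInfo {
  name: string;
  version: string;
}

type ProviderClass = new (raw: unknown) => Provider;

class MiniProviderRegistry {
  private static entries = new Map<string, { cls: ProviderClass; info: ProviderInfo }>();

  static register(id: string, cls: ProviderClass, info: ProviderInfo): void {
    if (this.entries.has(id)) throw new Error(`Provider "${id}" is already registered`);
    this.entries.set(id, { cls, info });
  }

  // Run one registered provider on parsed data; undefined if Test() rejects it.
  static convert(id: string, raw: unknown): Conversation | undefined {
    const entry = this.entries.get(id);
    if (!entry) throw new Error(`Unknown provider "${id}"`);
    const provider = new entry.cls(raw);
    return provider.Test() ? provider.Convert() : undefined;
  }
}

// Toy provider: accepts a plain array of strings.
class EchoProvider implements Provider {
  private raw: unknown;
  constructor(raw: unknown) {
    this.raw = raw;
  }
  Test(): boolean {
    return Array.isArray(this.raw) && this.raw.every((m) => typeof m === "string");
  }
  Convert(): Conversation {
    const lines = this.raw as string[];
    return { id: "echo-1", messages: lines.map((content) => ({ author: "unknown", content })) };
  }
}

MiniProviderRegistry.register("echo", EchoProvider, { name: "Echo", version: "1.0.0" });
console.log(MiniProviderRegistry.convert("echo", ["hi", "there"]));
```

Keeping Test() separate from Convert() lets a registry skip files that do not match a provider's schema instead of failing halfway through a conversion.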
Built‑in Providers
- Discord (providers/discord.ts): Reads JSON exports from DiscordChatExporter.
- Telegram (providers/telegram.ts): Reads JSON exports from the Telegram Desktop app.
Contributions are more than welcome! <3
Writing Your Own Provider
- Create /providers/MyPlatform.ts.
  To make a local provider, put the MyPlatform.ts file in the localProvidersDir you specified in your config. If you are contributing a provider to be included in ConvoKit, put it in /providers/MyPlatform.ts.
- Define your data schema, compatibility check, and conversion:
export const ProviderInfo = {
  name: "MyPlatform Exporter",
  description: "Imports MyPlatform chat JSON.",
  version: "1.0.0",
  author: "You",
  InputDataInfo: { directoryName: "MyPlatform", fileExtension: ".json" }
};
export class Provider implements ConvoKitProvider {
  constructor(private raw: any) {}
  Test(): boolean {
    // Return true if raw matches your schema.
    throw new Error("Test() not implemented"); // placeholder so the stub compiles
  }
  Convert(): ConvoKitConversation {
    // Transform raw → ConvoKitConversation.
    throw new Error("Convert() not implemented"); // placeholder so the stub compiles
  }
}
// Self-register
ProviderRegistry.register("myplatform", Provider, ProviderInfo);
- Place your exports in InputData/MyPlatform/*.json.
- Run ck.loadProviders() and ck.processDataFromProviders() to include your data.
Plugin Registry
Plugins extend ConvoKit’s pipeline at three points:
- Formatters (formatters)
- Converters (converters)
- Filters (filters)
They self‑register via PluginRegistry.registerFormatter/Converter/Filter().
Formatters
- Context Formatter (id: context): Builds the CKContext string with importance and markers.
Converters
- ChatML Converter (id: chatml): Exports ChatML JSONL.
- Gemini Converter (id: gemini): Exports Gemini‑style JSONL.
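For orientation, this is roughly what a chat-style JSONL export looks like. The exact record layout ConvoKit emits is not documented in this README. The sketch assumes the widely used layout of one JSON object per line with a messages array of role/content pairs. `Turn` and `toChatJsonl` are hypothetical names.

```typescript
type Role = "system" | "user" | "assistant";

// Hypothetical turn shape; the target user's messages become "assistant".
interface Turn {
  role: "user" | "assistant";
  content: string;
}

// Serialize each conversation as one JSON line, prefixed with the system prompt.
function toChatJsonl(conversations: Turn[][], systemPrompt: string): string {
  return conversations
    .map((turns) => {
      const system: { role: Role; content: string } = { role: "system", content: systemPrompt };
      return JSON.stringify({ messages: [system, ...turns] });
    })
    .join("\n");
}

const jsonl = toChatJsonl(
  [
    [
      { role: "user", content: "hey, are you around?" },
      { role: "assistant", content: "yep, what's up?" },
    ],
    [
      { role: "user", content: "lunch?" },
      { role: "assistant", content: "sure" },
    ],
  ],
  "You are a helpful assistant.",
);
console.log(jsonl);
```

Because JSON.stringify escapes newlines inside message content, each conversation stays on exactly one line even when messages are multi-line.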
Filters
- LinkOnlyFilter (id: link-only): Excludes messages that are URLs only.
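A link-only check can be as simple as testing whether every whitespace-separated token is a URL. The sketch below is an illustration, not the built-in filter's source. `isLinkOnly` is a hypothetical name, and restricting the check to http(s) URLs is an assumption.

```typescript
// True when the message consists solely of http(s) URLs (one or more).
function isLinkOnly(content: string): boolean {
  const tokens = content
    .trim()
    .split(/\s+/)
    .filter((t) => t.length > 0);
  return tokens.length > 0 && tokens.every((t) => /^https?:\/\/\S+$/i.test(t));
}

console.log(isLinkOnly("https://example.com/cat.gif"));
console.log(isLinkOnly("look at this https://example.com/cat.gif"));
```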
Writing Your Own Plugin
Formatters
export class MyFormatter implements FormatterPluginClass {
  PluginInfo = { id: "myfmt", name: "...", type: "formatter", version: "1.0.0" };
  apply(data, options) { /* return CKContextResult */ }
}
PluginRegistry.registerFormatter(MyFormatter);
Converters
export class MyConverter implements ConverterPluginClass {
  PluginInfo = { id: "myconv", name: "...", type: "converter", version: "1.0.0" };
  async apply(convs, prompt) { /* return string[] */ }
}
PluginRegistry.registerConverter(MyConverter);
Filters
export class MyFilter implements FilterPluginClass {
  PluginInfo = { id: "myfilter", name: "...", type: "filter", version: "1.0.0" };
  filterType: 'MUST' | 'MUST_NOT' = 'MUST_NOT';
  apply(content) { /* return boolean */ }
}
PluginRegistry.registerFilter(MyFilter);
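How ConvoKit interprets filterType is not spelled out here. One plausible reading, assumed in the sketch below, is that a message is kept only if every MUST filter returns true and every MUST_NOT filter returns false. `FilterSketch`, `passesFilters`, and the two example filters are hypothetical.

```typescript
// Minimal filter shape modeled on the FilterPluginClass example.
interface FilterSketch {
  id: string;
  filterType: "MUST" | "MUST_NOT";
  apply(content: string): boolean;
}

// Assumed semantics: MUST filters have to match, MUST_NOT filters must not.
function passesFilters(content: string, filters: FilterSketch[]): boolean {
  return filters.every((f) =>
    f.filterType === "MUST" ? f.apply(content) : !f.apply(content),
  );
}

const exampleFilters: FilterSketch[] = [
  { id: "has-letters", filterType: "MUST", apply: (c) => /[a-z]/i.test(c) },
  { id: "too-short", filterType: "MUST_NOT", apply: (c) => c.trim().length < 3 },
];

const kept = ["hello there", "ok", "12345"].filter((m) => passesFilters(m, exampleFilters));
console.log(kept);
```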
Contributing
Contributions are very welcome!
- Suggest a feature via GitHub Issues.
- Report bugs or raise PRs to fix them.
- Add new providers (Slack, Teams, custom exports).
- Write plugins for new formats or filters.
License
This project is licensed under the MIT License.
Feel free to use, modify, and distribute as you see fit!
