onkol
v0.5.6
Published
Decentralized on-call agent system powered by Claude Code
Downloads
1,044
Readme
Onkol
Your AI on-call team. One command per VM, and you get an autonomous agent on Discord that handles bugs, features, analysis, and ops so you don't have to.
Onkol turns Claude Code into a decentralized on-call system. Each VM runs an orchestrator that listens on Discord. You describe a problem in plain English, it spins up a dedicated worker session to solve it, and reports back when it's done.
How it works
You on Discord: "the auth endpoint is returning 403 after token refresh"
|
Orchestrator (Claude Code)
reads your message, understands intent,
prepares context, spawns a worker
|
Worker (new Claude Code session)
diagnoses the bug, fixes auth.py,
runs tests, commits to a branch
|
You on Discord: "Fixed. Clock skew between auth server and app server.
Added 5s tolerance. Tests pass. Branch: fix/auth-403"What makes it different:
- Decentralized. Each VM is self-contained. No central server. 10 VMs = 10 independent agents.
- Intent-driven. Say "fix this" and it fixes autonomously. Say "look into this" and it investigates without touching code. Your phrasing controls the behavior.
- Gets smarter. Every resolved task leaves behind a learning. Next time a similar issue comes up, the agent already knows what to look for.
- Works behind firewalls. All connections are outbound to Discord. No inbound ports, no SSH tunnels, no VPN required.
Real-world setup
The intended way to use Onkol is with a dedicated Discord server that becomes your ops control center.
I manage about 10 applications across prod and staging. I created one Discord server and set it up exclusively for Onkol. Each VM I onboard creates its own category with an orchestrator channel. My Discord sidebar looks like this:
MY-INFRA (Discord server)
│
├── API-SERVER-PROD ← VM running in GCP
│ ├── #orchestrator ← talk to this VM's brain here
│ ├── #fix-auth-403 ← active worker (auto-created)
│ └── #analyze-error-logs ← active worker (auto-created)
│
├── WEB-APP-STAGING ← VM running in AWS
│ └── #orchestrator
│
├── BACKEND-PROD ← VM behind corporate VPN
│ ├── #orchestrator
│ └── #add-export-endpoint ← active worker
│
├── DATA-PIPELINE-STAGING ← Another GCP VM
│ └── #orchestrator
│
└── ... (as many VMs as you have)The workflow
From your phone, laptop, or anywhere with Discord:
- Open the server, go to
#orchestratorunder the VM you care about - Type what you need: "there's a bug where users get 403 after token refresh"
- The orchestrator creates a new channel
#fix-auth-403and spawns a worker - The worker posts its progress and findings in
#fix-auth-403 - You can jump into that channel to give more context or redirect
- When it's done, the orchestrator dissolves the worker, the channel disappears, learnings are saved
You can do this from a party, a flight, or bed at 2 AM. You're just texting on Discord. The agent does the SSH, the debugging, the code reading, the fixing.
Multiple VMs, one view
Every VM is a category. Every task is a channel. You see your entire infrastructure at a glance in the Discord sidebar. No dashboards to build, no web apps to deploy. Discord IS the dashboard.
The VMs don't need to know about each other. Each one connects outbound to Discord independently. If a VM is behind a VPN you can only reach from one specific laptop, doesn't matter. As long as it has outbound HTTPS, it can connect to Discord and you can talk to it.
Setting up a new VM
# SSH into the VM (one time only)
ssh user@my-new-vm
# Run setup (2 minutes)
npx onkol@latest setup
# Answer the questions, done.
# A new category appears in your Discord server.
# You never need to SSH into this VM again.Quick start
Prerequisites
You need these on the VM where you're setting up:
| Tool | Why | Install |
|------|-----|---------|
| Node.js 18+ | Runs the setup CLI | nodejs.org |
| Bun | Runs the Discord channel plugin | curl -fsSL https://bun.sh/install \| bash |
| Claude Code | The AI that does the work | docs.anthropic.com |
| tmux | Keeps sessions alive | apt install tmux / yum install tmux |
| jq | JSON processing in scripts | apt install jq / yum install jq |
Claude Code must be logged in via claude.ai OAuth on the VM (not API key).
The setup wizard checks all dependencies before asking any questions. If something's missing, it tells you exactly what to install and exits without wasting your time.
Create a Discord bot
- Go to discord.com/developers/applications
- New Application, name it, Create
- Bot, Reset Token, copy it (you only see it once)
- Bot, Privileged Gateway Intents, enable Message Content Intent, Save
- OAuth2, URL Generator, check
bot, check permissions:- View Channels, Send Messages, Read Message History, Attach Files, Manage Channels
- Copy the URL, open in browser, invite to your Discord server
The setup wizard validates your bot token and checks that Message Content Intent is enabled before proceeding. If something's wrong, it tells you exactly what to fix.
Run setup
npx onkol@latest setupThe wizard walks you through everything:
Welcome to Onkol Setup
Checking dependencies...
✓ claude
✓ bun
✓ tmux
✓ jq
✓ curl
All dependencies found.
✔ Where should Onkol live? ~/onkol
✔ What should this node be called? api-server-prod
✔ Discord bot token: ****
✔ Discord server (guild) ID: 1234567890
✔ Your Discord user ID: 9876543210
✔ Registry file? Write a prompt — tell Claude what to find
✔ Describe: Find the API endpoints and database URLs from .env
✔ Service summary? Auto-discover
✔ CLAUDE.md? Yes — This is a Node.js API server deployed via docker...
✔ Plugins? context7, superpowers, code-simplifier
✓ Bot token is valid
✓ Message Content intent is enabled
✓ Discord category and #orchestrator channel created
✓ 6 scripts installed
✓ Plugin installed with 4 files + dependencies
✓ Systemd service installed and enabled
✓ Orchestrator started in tmux session "onkol-api-server-prod"
✓ Onkol node "api-server-prod" is live!Go to your Discord server. You'll see a new category with an #orchestrator channel. Send it a message.
Usage
Talking to the orchestrator
The orchestrator lives in the #orchestrator channel of your node's category. It reads your intent from how you phrase things:
| You say | What happens | |---------|-------------| | "fix the 403 bug in auth" | Spawns a worker that diagnoses, fixes, tests, and commits | | "look into why response times are high" | Spawns a worker that investigates and reports, no code changes | | "add retry logic to the webhook handler" | Spawns a worker that implements, tests, and waits for your approval | | "analyze transferred calls for the last 3 weeks" | Spawns a worker that reads logs/data and produces an analysis | | "just ship it" | Fully autonomous, pushes and deploys (asks for confirmation first) |
How workers work
When the orchestrator spawns a worker:
- A new Discord channel appears in your category (e.g.,
#fix-auth-bug) - A new Claude Code session starts in tmux on the VM
- The worker posts progress and results in its Discord channel
- You can talk to the worker directly in that channel
- When done, tell the orchestrator to dissolve it. The channel disappears, learnings are saved.
Managing workers
From the orchestrator channel:
- "dissolve fix-auth-bug" kills the worker, saves learnings, deletes channel
- "list workers" shows all active workers
- "check on fix-auth-bug" gets the worker's current status
Setup prompts
During setup, you can describe things in plain English instead of providing config files:
- Registry: "Find the API endpoints from .env and the S3 bucket from AWS CLI"
- Services: Auto-discovers running services, or you describe what to look for
- CLAUDE.md: "This is a Node.js API server, Express, deployed via docker..."
The orchestrator executes these prompts on first boot and generates the structured files.
Architecture
Your Discord Server
├── Category: api-server-prod ← VM 1
│ ├── #orchestrator ← persistent Claude Code session
│ ├── #fix-auth-bug ← worker (temporary)
│ └── #analyze-error-logs ← worker (temporary)
├── Category: web-app-staging ← VM 2
│ └── #orchestrator
└── Category: backend-prod ← VM 3
└── #orchestratorEach VM runs independently:
- Orchestrator. Long-running Claude Code session in tmux. Receives Discord messages, spawns workers, manages lifecycle.
- Workers. Ephemeral Claude Code sessions. One per task. Each gets its own Discord channel, its own context, its own instructions.
- discord-filtered plugin. Custom MCP channel server that routes Discord messages by channel ID. All sessions share one bot but each only hears its own channel.
On-disk structure
~/onkol/
├── config.json # Node config (bot token, server ID, etc.)
├── registry.json # VM-specific secrets, endpoints, ports
├── services.md # What runs on this VM
├── CLAUDE.md # Orchestrator instructions
├── knowledge/ # Learnings from dissolved workers
│ ├── index.json
│ └── 2026-03-22-fix-auth-clock-skew.md
├── workers/
│ ├── tracking.json # Active workers
│ └── fix-auth-bug/ # Worker directory (while active)
├── scripts/ # Lifecycle scripts
└── plugins/
└── discord-filtered/ # MCP channel pluginKnowledge base
Every dissolved worker leaves behind a learning:
## What happened
Token validation rejected valid tokens for 2-3 seconds after refresh.
## Root cause
No clock skew tolerance between auth server and app server.
## Fix
Added 5-second CLOCK_SKEW_TOLERANCE in auth.py:47.
## For next time
If 403 errors appear after token operations, check clock sync first.The orchestrator includes relevant past learnings when spawning new workers. The system gets better at diagnosing issues over time.
Resumable setup
If setup fails midway (missing dependency, network error, wrong bot token), your answers are saved automatically. Next time you run npx onkol setup, it offers to resume:
? Found a previous setup attempt (4 steps completed). What do you want to do?
❯ Resume from where it left off (node: api-server-prod)
Start freshNo re-entering bot tokens or server IDs. It picks up right where it left off.
Commands
npx onkol setup # Interactive setup wizard
npx onkol@latest setup # Force latest versionOn the VM after setup:
# Attach to the orchestrator
tmux attach -t onkol-<node-name>
# Check service status
systemctl status onkol-<node-name>
# Restart orchestrator
sudo systemctl restart onkol-<node-name>
# View active workers
bash ~/onkol/scripts/list-workers.sh
# Manually dissolve a worker
bash ~/onkol/scripts/dissolve-worker.sh --name "worker-name"Requirements
- Claude Code with
claude.aiOAuth login (Max plan recommended for concurrent sessions) - Node.js 18+ and Bun on each VM
- tmux and jq on each VM
- A Discord server with a bot that has Manage Channels permission
- VMs need outbound HTTPS access (no inbound ports needed)
How it's built
| Component | Tech | Lines | |-----------|------|-------| | Setup wizard | Node.js, TypeScript, Inquirer | ~500 | | Discord channel plugin | Bun, MCP SDK, discord.js | ~300 | | Worker lifecycle scripts | Bash | ~400 | | Orchestrator/worker templates | Handlebars | ~150 |
The core mechanism is Claude Code Channels, an MCP-based system that pushes Discord messages into Claude Code sessions. The discord-filtered plugin is a custom channel that routes by Discord channel ID, allowing multiple sessions to share one bot.
License
MIT
