@hyperdrive.bot/resilient-channel
v0.2.4
Transport-agnostic self-healing MCP channel middleware for Claude Code
Resilient Channel
Transport-agnostic self-healing MCP channel middleware for Claude Code.
Events buffer during session disconnects and replay automatically on reconnection. Works with Telegram, Slack, webhooks, or any custom transport.
Why
Claude Code channels are session-scoped — when a session ends, events are lost. Resilient Channel adds a persistence layer so nothing gets dropped:
- Events buffer to disk when Claude disconnects
- Automatic replay when a new session connects
- Context summaries survive across sessions via the update_context tool
- Audit trail logs every event per session
Inspired by NVIDIA NemoClaw's state persistence patterns.
Quick Start
1. Add to your .mcp.json
Pick a transport and add it to your project or user-level .mcp.json:
Webhook (simplest — test with curl):
{
"mcpServers": {
"resilient-webhook": {
"command": "bunx",
"args": ["-y", "@hyperdrive.bot/resilient-channel/src/server.ts"]
}
}
}
Telegram:
{
"mcpServers": {
"resilient-telegram": {
"command": "bunx",
"args": ["-y", "@hyperdrive.bot/resilient-channel/src/server-telegram.ts"]
}
}
}
Slack:
{
"mcpServers": {
"resilient-slack": {
"command": "bunx",
"args": ["-y", "@hyperdrive.bot/resilient-channel/src/server-slack.ts"]
}
}
}
Or clone locally:
git clone https://github.com/hyperdrive-bot/claude-code-resilient-channel.git
cd claude-code-resilient-channel
bun install
Then use absolute paths in .mcp.json:
{
"mcpServers": {
"resilient-telegram": {
"command": "bun",
"args": ["/path/to/claude-code-resilient-channel/src/server-telegram.ts"]
}
}
}
2. Launch Claude Code
# Set env vars for your transport, then:
claude --dangerously-load-development-channels server:resilient-webhook
3. Send a message
# Webhook
curl -X POST localhost:8788 -d "build failed on main"
# Telegram — DM your bot
# Slack — DM @your-bot or @mention it in a channel
Transports
Webhook (one-way)
HTTP POST listener on localhost. No auth, no external deps. Perfect for CI webhooks, monitoring alerts, or testing.
claude --dangerously-load-development-channels server:resilient-webhook
# Send events
curl -X POST localhost:8788 -d "deploy succeeded"
curl -X POST localhost:8788 -H "x-chat-id: ci" -d "tests passed (42/42)"
Telegram (two-way)
Polls Telegram for messages. Supports text, voice notes (auto-transcribed via Whisper), photos, documents, and video (with key frame extraction + transcription).
TELEGRAM_BOT_TOKEN=<token> \
TELEGRAM_ALLOWED_IDS=<your-user-id> \
claude --dangerously-skip-permissions \
--dangerously-load-development-channels server:resilient-telegram
Setup:
- Create a bot via @BotFather
- Find your user ID via @userinfobot
- Set OPENAI_API_KEY for voice/video transcription (optional)
File support:
| Type | Handling |
|------|----------|
| Text | Forwarded directly |
| Voice/Audio | Downloaded + transcribed via Whisper |
| Photo | Downloaded, passed as file_path (Claude sees it) |
| Video | Audio transcribed + up to 4 key frames extracted |
| Document | Downloaded, passed as file_path |
Slack (two-way)
Uses Socket Mode — no public URL needed. Reacts with :eyes: to channel mentions, replies via DM.
SLACK_BOT_TOKEN=xoxb-... \
SLACK_APP_TOKEN=xapp-... \
SLACK_ALLOWED_IDS=U12345678 \
claude --dangerously-skip-permissions \
--dangerously-load-development-channels server:resilient-slack
Also reads from CLAUDE_CODE_SLACK_BOT_TOKEN, CLAUDE_CODE_SLACK_APP_TOKEN, CLAUDE_CODE_SLACK_USER_ID env vars.
Slack App requirements:
- Socket Mode enabled
- App-Level Token with connections:write
- Event subscriptions: message.im, message.channels, app_mention
- Bot scopes: chat:write, channels:history, im:history, app_mentions:read
Behavior:
| Interaction | Response |
|-------------|----------|
| DM | Reply in same DM thread |
| @mention in channel | :eyes: react + reply via DM with thread link |
| Thread follow-up | Forwarded (no re-mention needed) |
Architecture
External System → ChannelTransport → ResilientChannel → MCP stdio → Claude Code
                  (thin adapter)     (self-healing core)
Self-Healing Mechanisms
| Mechanism | How |
|-----------|-----|
| State persistence | ~/.claude/channels/{name}/state.json updated after every event |
| Event buffering | Events queue to disk when Claude disconnects |
| Session replay | Buffered events replayed with is_replay="true" on reconnect |
| Instructions survival | Injected into system prompt, survives context compaction |
| Deduplication | externalId tracking prevents double-processing |
| Audit trail | Per-session events.jsonl in runs/{runId}/ |
| Context handoff | update_context tool persists summaries across sessions |
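As a rough illustration of how the buffering, replay, and deduplication rows above could fit together, here is a minimal in-memory sketch. It is not the package's implementation: the EventBuffer class, its method names, and the event shapes are hypothetical, and the queue lives in memory where the real channel persists to disk.

```typescript
// Hypothetical sketch: every name below is illustrative, not the package's API.
interface BufferedEvent {
  externalId: string;
  text: string;
}

interface DeliveredEvent extends BufferedEvent {
  is_replay: "true" | "false";
}

class EventBuffer {
  private pending: BufferedEvent[] = [];
  private seen = new Set<string>();
  private connected = false;
  private deliver: (event: DeliveredEvent) => void;

  constructor(deliver: (event: DeliveredEvent) => void) {
    this.deliver = deliver;
  }

  // Returns false when the externalId was already processed (deduplication).
  receive(event: BufferedEvent): boolean {
    if (this.seen.has(event.externalId)) return false;
    this.seen.add(event.externalId);
    if (this.connected) {
      this.deliver({ ...event, is_replay: "false" });
    } else {
      this.pending.push(event); // the real channel would write this to disk
    }
    return true;
  }

  // On reconnect, flush buffered events in arrival order, marked as replays.
  connect(): void {
    this.connected = true;
    const queued = this.pending;
    this.pending = [];
    for (const event of queued) {
      this.deliver({ ...event, is_replay: "true" });
    }
  }

  disconnect(): void {
    this.connected = false;
  }
}
```

Events received while disconnected are held until connect() is called, and a repeated externalId is rejected whether or not a session is attached.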
State File
{
"runId": "ch-20260321-143022-a1b2c3d4",
"sessionCount": 3,
"lastEventId": "evt-42",
"pendingEvents": [],
"contextSummary": "User debugging CI pipeline failure",
"status": "connected"
}
Custom Transport
Implement the ChannelTransport interface:
import type { ChannelTransport, InboundEvent } from './src/types.js'
// Exported so the entry point below can import it from './my-transport.js'
export class MyTransport implements ChannelTransport {
  name = 'my-transport'
  onEvent: (event: InboundEvent) => void = () => {}
  async start() { /* connect to your system */ }
  async stop() { /* disconnect */ }
  // Optional: two-way
  async send(destination: string, text: string) { /* send reply */ }
}
Then create an entry point:
import { ResilientChannel } from './src/resilient-channel.js'
import { MyTransport } from './my-transport.js'
const channel = new ResilientChannel({
name: 'my-channel',
transport: new MyTransport(),
twoWay: true,
allowlist: new Set(['allowed-sender-id']),
})
channel.start()
Register in .mcp.json:
{
"mcpServers": {
"my-channel": {
"command": "bun",
"args": ["./server-my-channel.ts"]
}
}
}
Lifecycle hooks
Heartbeats
The orchestrator can distinguish "alive but busy in a long Bash invocation" from "dead, OS killed it" via heartbeat events emitted from each Claude Code session's PreToolUse:Bash hook.
Add call-home-heartbeat.sh to your ~/.claude/settings.json:
{
"hooks": {
"PreToolUse": [
{
"matcher": "Bash",
"hooks": [
{
"type": "command",
"command": "$HOME/Developer/ds/super-repo/scripts/telemetry/call-home-heartbeat.sh"
}
]
}
]
}
}
- Fires every Bash invocation, ≤60s effective cadence during active tool use.
- Silent — no MCP notification per heartbeat (NFR3 token-cost guard). The orchestrator updates registry.json[sessionId].lastHeartbeatAt and short-circuits before the MCP fan-out.
- Adaptive prune: pruneStale() uses max(30min, 3 × heartbeatIntervalMs) so genuinely slow sessions survive.
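The adaptive prune rule is simple arithmetic, sketched below. The helper names are hypothetical and this is not the body of pruneStale(). The strict greater-than comparison at the boundary is also an assumption.

```typescript
const THIRTY_MINUTES_MS = 30 * 60 * 1000;

// Hypothetical helper mirroring the documented max(30min, 3 × heartbeatIntervalMs) rule.
function staleThresholdMs(heartbeatIntervalMs: number): number {
  return Math.max(THIRTY_MINUTES_MS, 3 * heartbeatIntervalMs);
}

// A session counts as stale once its last heartbeat is older than the threshold.
// Whether the real check is strict (>) or inclusive (>=) is an assumption here.
function isStale(lastHeartbeatAt: number, heartbeatIntervalMs: number, now: number): boolean {
  return now - lastHeartbeatAt > staleThresholdMs(heartbeatIntervalMs);
}
```

With a 60-second heartbeat the 30-minute floor applies. A session that only heartbeats every 20 minutes gets a 60-minute allowance instead.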
Crash-loop detection
Persists per-session crash records to ~/.claude/channels/orchestrator/crash-history.json (capped at 20 entries per session, oldest dropped on overflow). When ≥3 crashes occur within a 15-minute sliding window for the same session, the supervisor fires exactly one alert via the first available transport (Slack → Telegram → stderr fallback) with a 30-minute cooldown that survives supervisor restarts (lastAlertAt is wall-clock-persisted in the JSON file, not a setTimeout that dies with the process).
Crashes classified as killed-by-supervisor (intentional rolling restarts, manual kill, idle-timeout cleanups) do not count toward the threshold. This requires the caller to populate classification correctly — Story 1.3's .terminated markers and the SessionTerminated classifier are the upstream source of that field.
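The threshold, window, and cooldown described above can be sketched as a single pure function. This is an illustration under assumptions, not the supervisor's code: the CrashRecord and CrashState shapes and the recordAndCheck name are hypothetical, and persistence to crash-history.json is left out.

```typescript
// Hypothetical shapes; the real SessionCrashState may differ.
interface CrashRecord {
  ts: number; // epoch milliseconds
  classification: "unexpected" | "killed-by-supervisor";
}

interface CrashState {
  crashes: CrashRecord[];
  lastAlertAt: number | null;
}

const MAX_ENTRIES = 20;
const WINDOW_MS = 15 * 60 * 1000;
const THRESHOLD = 3;
const COOLDOWN_MS = 30 * 60 * 1000;

// Records a crash and reports whether an alert should fire for it.
function recordAndCheck(state: CrashState, crash: CrashRecord): boolean {
  state.crashes.push(crash);
  if (state.crashes.length > MAX_ENTRIES) {
    state.crashes.splice(0, state.crashes.length - MAX_ENTRIES); // drop oldest
  }
  // Intentional kills are recorded but never trigger or count toward an alert.
  if (crash.classification === "killed-by-supervisor") return false;

  const inWindow = state.crashes.filter(
    (c) => c.classification === "unexpected" && crash.ts - c.ts <= WINDOW_MS,
  );
  if (inWindow.length < THRESHOLD) return false;

  const coolingDown =
    state.lastAlertAt !== null && crash.ts - state.lastAlertAt < COOLDOWN_MS;
  if (coolingDown) return false;

  state.lastAlertAt = crash.ts; // plain timestamp, so it survives a JSON round-trip
  return true;
}
```

Because lastAlertAt is a plain number in the state object, writing the state to disk and reading it back after a restart preserves the cooldown.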
Inspect the per-session crash buffer with relative ages:
$ claude-channels crash-history a1b2c3d4
Session a1b2c3d4 — 3 crash(es) in window, lastAlertAt=2026-04-25T03:14:09.000Z
[2026-04-25T03:11:09.000Z] exit=139 signal=SIGSEGV classification=unexpected "boom" (3min ago)
[2026-04-25T03:12:09.000Z] exit=139 signal=SIGSEGV classification=unexpected "boom" (2min ago)
[2026-04-25T03:13:09.000Z] exit=139 signal=SIGSEGV classification=unexpected "boom" (1min ago)
--json emits the raw SessionCrashState. Prefix matching (8-char prefix → unique session) mirrors claude-channels session log; ambiguous prefixes exit 1 with the matching session ids.
Silent — no MCP <channel> notification per crash (NFR3 token-cost guard). The single alert fires only on the threshold transition, never on every crash.
Smoke walkthrough
# Populate 3 crashes 1s apart into a fresh sessionId, then read back via CLI
SID=$(uuidgen)
HIST=~/.claude/channels/orchestrator/crash-history.json
for i in 1 2 3; do
node -e "
const { recordCrash } = require('./packages/cli/claude-channels/dist/supervisor/health-monitor.js');
recordCrash('$SID', { ts: Date.now(), errStr: 'boom #$i', exitCode: 139, signal: 'SIGSEGV', classification: 'unexpected' });
"
sleep 1
done
claude-channels crash-history "$SID"
# Cleanup
node -e "
const fs=require('fs');
const path=process.env.HOME+'/.claude/channels/orchestrator/crash-history.json';
const s=JSON.parse(fs.readFileSync(path,'utf-8'));
delete s.sessions['$SID'];
fs.writeFileSync(path, JSON.stringify(s,null,2));
"
rm -rf ~/.claude/channels/orchestrator/sessions/$SID
The smoke calls recordCrash directly (not handleCrash), so lastAlertAt stays null — no transport dispatch is exercised.
Termination signals
Every session termination is classified into one of five reasons and recorded silently in the registry's recentlyTerminated[] ring buffer (cap 50, oldest dropped on overflow). The orchestrator never emits an MCP notification per termination — operators see unexpected terminations on-demand via formatStatus().
Classification table — applied top-down by scripts/telemetry/call-home-session-end.sh, first match wins:
| Rule | Condition | Reason | killedBy |
|---|---|---|---|
| 1 | ~/.claude/channels/session-webhook-<port>/.terminated exists, mtime ≤ 60s | from marker reason= | from marker killed_by= |
| 2 | exitCode == 0 (or Claude Code reason ∈ {clear, logout, prompt_input_exit}) | user-exit | null |
| 3 | signal == "SIGKILL" AND ~/.claude/channels/orchestrator/supervisor.log mtime ≤ 30s | supervisor-kill | "supervisor" |
| 4 | signal == "SIGKILL" AND no recent supervisor.log | oom (heuristic, best-effort) | null |
| 5 | else (SIGTERM, SIGINT, SIGHUP, …) | os-signal | null |
| Fallback | legacy SessionEnd envelope (no reason field) | unknown | null |
The marker file is single-use — call-home-session-end.sh deletes it after read so a stale marker can't falsely classify the next exit.
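The first-match-wins table can be read as a short decision function. The sketch below restates rules 1 to 5 in TypeScript for clarity. The real classifier is the shell script named above, and the ExitFacts input shape and classifyTermination name are hypothetical stand-ins for what the script reads from the filesystem and the hook payload.

```typescript
// Hypothetical input shape, pre-computed from the marker file, hook payload, and supervisor.log.
interface ExitFacts {
  marker: { reason: string; killedBy: string | null; ageSeconds: number } | null;
  exitCode: number | null;
  claudeReason: string | null;
  signal: string | null;
  supervisorLogAgeSeconds: number | null; // null when supervisor.log is absent
}

interface Classification {
  reason: string;
  killedBy: string | null;
}

const USER_EXIT_REASONS = new Set(["clear", "logout", "prompt_input_exit"]);

function classifyTermination(f: ExitFacts): Classification {
  // Rule 1: a fresh marker written by the caller wins outright.
  if (f.marker !== null && f.marker.ageSeconds <= 60) {
    return { reason: f.marker.reason, killedBy: f.marker.killedBy };
  }
  // Rule 2: clean exit, or a Claude Code reason that implies the user left.
  if (f.exitCode === 0 || (f.claudeReason !== null && USER_EXIT_REASONS.has(f.claudeReason))) {
    return { reason: "user-exit", killedBy: null };
  }
  // Rules 3 and 4: SIGKILL is attributed to the supervisor only if its log is fresh.
  if (f.signal === "SIGKILL") {
    const recent = f.supervisorLogAgeSeconds !== null && f.supervisorLogAgeSeconds <= 30;
    return recent
      ? { reason: "supervisor-kill", killedBy: "supervisor" }
      : { reason: "oom", killedBy: null };
  }
  // Rule 5: everything else (SIGTERM, SIGINT, SIGHUP, ...).
  return { reason: "os-signal", killedBy: null };
}
```

The Fallback row is not modelled here because it applies to legacy envelopes that never reach the classifier.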
_marker-write.sh — invocation contract
Callers (e.g. supervisor cleanup, manual ops) thread shutdown context to the SessionEnd hook by writing a marker file:
scripts/telemetry/_marker-write.sh <port> <reason>
- port — numeric (matches ^[0-9]+$); per-session webhook port from ~/.claude/channels/session-ports/<claude_pid>.json.
- reason — one of: user-exit | supervisor-kill | os-signal | oom | crash-loop-breaker | manual-cli | idle-timeout.
- Atomic write: tmp-file + chmod 0600 + mv -f rename (POSIX rename is atomic).
- Parent dir is created with mode 0700 if absent.
- Exit codes: 0 success, 1 validation failure, 2 filesystem error.
Override the target dir for tests via MARKER_BASE_DIR=/tmp/... (defaults to $HOME/.claude/channels).
recentlyTerminated[] ring buffer
The registry persists the last 50 terminations to ~/.claude/channels/orchestrator/registry.json. Pre-rollout registries (no field) read as [] — no migration script required (NFR1).
formatStatus() appends one warning line when at least one entry has terminatedAt > now − 1h AND reason ∈ {oom, os-signal, supervisor-kill}:
⚠️ N session(s) terminated unexpectedly in the last hour: <name1> (<reason1>), <name2> (<reason2>), ...
Lists at most 5 names; if more match, appends , +<extra> more. reason=user-exit and reason=unknown are never counted as unexpected (a healthy system stays quiet).
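The filtering and truncation rules for that warning line can be sketched as follows. This is not the body of formatStatus(). The TerminatedEntry shape and the function name are hypothetical, and timestamps are assumed to be epoch milliseconds.

```typescript
// Hypothetical entry shape for the recentlyTerminated[] ring buffer.
interface TerminatedEntry {
  name: string;
  reason: string;
  terminatedAt: number; // epoch milliseconds (assumption)
}

const UNEXPECTED_REASONS = new Set(["oom", "os-signal", "supervisor-kill"]);
const ONE_HOUR_MS = 60 * 60 * 1000;
const MAX_NAMES = 5;

// Returns the warning line, or null when nothing unexpected happened in the last hour.
function unexpectedTerminationWarning(entries: TerminatedEntry[], now: number): string | null {
  const hits = entries.filter(
    (e) => e.terminatedAt > now - ONE_HOUR_MS && UNEXPECTED_REASONS.has(e.reason),
  );
  if (hits.length === 0) return null;

  const shown = hits
    .slice(0, MAX_NAMES)
    .map((e) => `${e.name} (${e.reason})`)
    .join(", ");
  const extra = hits.length - MAX_NAMES;
  const suffix = extra > 0 ? `, +${extra} more` : "";
  return `⚠️ ${hits.length} session(s) terminated unexpectedly in the last hour: ${shown}${suffix}`;
}
```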
Per-session restarts.log
Every termination — new SessionTerminated envelope OR legacy SessionEnd fallback — appends one line to ~/.claude/channels/orchestrator/sessions/<sessionId>/restarts.log:
[2026-04-25T07:48:33.000Z] TERMINATED: reason=user-exit exitCode=0 signal=null killedBy=null
Errors from the log writer are swallowed silently — logging must never break a recovery flow.
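A small sketch of that line format and the swallow-on-error behaviour is shown below. Both function names are hypothetical, and the writer is passed in as a callback so that no file path or file API is assumed.

```typescript
// Hypothetical formatter reproducing the restarts.log line layout shown above.
function formatRestartLine(
  at: Date,
  reason: string,
  exitCode: number | null,
  signal: string | null,
  killedBy: string | null,
): string {
  // Template literals render null as the text "null", matching the sample line.
  return `[${at.toISOString()}] TERMINATED: reason=${reason} exitCode=${exitCode} signal=${signal} killedBy=${killedBy}`;
}

// Appends a line through the supplied writer and deliberately ignores any failure.
function appendSafely(write: (line: string) => void, line: string): void {
  try {
    write(line + "\n");
  } catch {
    // Swallowed on purpose: logging must never break a recovery flow.
  }
}
```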
Backwards compatibility
Clients that have NOT upgraded to the new call-home-session-end.sh continue to work. The orchestrator's existing case 'SessionEnd': arm pushes reason=unknown onto recentlyTerminated[] and writes a TERMINATED reason=unknown line to restarts.log — so the audit trail is preserved even for un-upgraded clients.
Security
All transports enforce deny-first sender gating. No allowlist = no messages get through.
- Telegram: TELEGRAM_ALLOWED_IDS is mandatory
- Slack: SLACK_ALLOWED_IDS / CLAUDE_CODE_SLACK_USER_ID is mandatory
- Webhook: optional x-sender-id header + allowlist
HMAC Authentication
Wire-level authentication for the orchestrator and per-session webhook surfaces. Optional, off by default, opt-in via key-file presence. Designed for the single-machine threat model — symmetric HMAC-SHA256 with a shared key on disk. Cross-machine deployments can swap the verifier without changing the wire format.
Key generation (orchestrator):
openssl rand -hex 32 > ~/.claude/channels/orchestrator/key
chmod 600 ~/.claude/channels/orchestrator/key
Wire format: every authenticated POST carries x-sender-sig: <hex> where <hex> is the lowercase hex output of hmac-sha256(key, body). The hook scripts under scripts/telemetry/ compute this header automatically when the key file exists; when absent, they POST as before (no header added).
Behaviour matrix:
| Key file | Header | Result |
|---|---|---|
| absent | absent | 200 + warn-once on stderr per process boot |
| absent | present (any value) | 200 (verification skipped) |
| present | absent | 401 unsigned request rejected |
| present | bad signature | 401 invalid signature |
| present | valid signature | 200 |
Strict mode (RESILIENT_CHANNEL_REQUIRE_AUTH=1): when set, the orchestrator refuses to boot without a key file — process.exit(2) with FATAL: RESILIENT_CHANNEL_REQUIRE_AUTH=1 but no key file at <path> — aborting on stderr. Default behaviour (env unset or 0) is the warn-and-accept fallback above.
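For readers who want to see the wire format and behaviour matrix in code, here is a minimal sketch using Node's built-in crypto module, which Bun also provides. It is an illustration, not the orchestrator's verifier. The sign and verify names are hypothetical, the key is assumed to be used as a UTF-8 string exactly as passed in, and the warn-once logging and strict-mode boot check are left out.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Lowercase hex HMAC-SHA256 of the raw request body, as described for x-sender-sig.
function sign(key: string, body: string): string {
  return createHmac("sha256", key).update(body).digest("hex");
}

type Verdict = { status: 200 } | { status: 401; error: string };

// key === null models "key file absent"; header === undefined models "no x-sender-sig".
function verify(key: string | null, body: string, header: string | undefined): Verdict {
  if (key === null) return { status: 200 }; // verification skipped, header ignored
  if (header === undefined) return { status: 401, error: "unsigned request rejected" };

  const expected = Buffer.from(sign(key, body), "utf8");
  const received = Buffer.from(header, "utf8");
  // timingSafeEqual throws on length mismatch, so compare lengths first.
  const ok = expected.length === received.length && timingSafeEqual(expected, received);
  return ok ? { status: 200 } : { status: 401, error: "invalid signature" };
}
```

Comparing with timingSafeEqual instead of === is a common precaution against timing side channels. Whether the package does the same is not stated in this README.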
Per-session keys: each server-session-webhook.ts instance generates its own 32-byte hex key on startup at ~/.claude/channels/session-webhook-<port>/key (mode 0o600). The key path is auto-managed: created on session boot, removed on SIGTERM/SIGINT. The key value is advertised to the orchestrator via the webhookKey field on the SessionStart call-home payload and persisted in ~/.claude/channels/orchestrator/registry.json under sessions.<sessionId>.webhookKey. The orchestrator's approval-response relay (Story 3.1) signs POSTs back into the session using this per-session key.
Approvals — inline-keyboard human-in-the-loop (Story 3.1, Steal #4)
Surface Claude Code's AskUserQuestion (and opt-in sensitive PreToolUse) events as inline-keyboard messages on Telegram and Slack. The user taps a button on their phone; the orchestrator HMAC-signs the response and POSTs it back to the originating session's webhook.
Sequence
sequenceDiagram
participant CC as Claude Code (session)
participant Hook as call-home-ask-question.sh
participant Orch as Orchestrator
participant Slack
participant Telegram
participant User as Operator (phone)
CC->>Hook: AskUserQuestion event
Hook->>Orch: POST approval-request envelope (HMAC)
Orch->>Orch: PendingApprovals.add()
Orch->>Slack: dispatchApprovalRequest (Block Kit)
Orch->>Telegram: dispatchApprovalRequest (inline keyboard)
User->>Telegram: tap button
Telegram->>Orch: callback_query
Orch->>Orch: PendingApprovals.resolve()
Orch->>CC: POST approval-response envelope (HMAC sig)
CC->>CC: receive via session-webhook → MCP notification
Hook wiring (~/.claude/settings.json)
{
"hooks": {
"AskUserQuestion": [
{ "type": "command", "command": "/abs/path/to/scripts/telemetry/call-home-ask-question.sh" }
],
"PreToolUse": [
{ "type": "command", "command": "/abs/path/to/scripts/telemetry/call-home-permission-check.sh" }
]
}
}
CLAUDE_APPROVAL_TOOLS — opt-in for sensitive PreToolUse
call-home-permission-check.sh is gated on the CLAUDE_APPROVAL_TOOLS env var. Unset → exit 0 immediately, no envelope emitted. Colon-separated tool names → matching PreToolUse events route through the approval flow:
export CLAUDE_APPROVAL_TOOLS=Bash:Edit
Match is case-sensitive and exact.
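The gating rule amounts to a split-and-compare, sketched below. The real gate lives in the shell script, and the requiresApproval name is hypothetical. Treating an empty string the same as unset is an assumption.

```typescript
// Hypothetical helper restating the CLAUDE_APPROVAL_TOOLS gate.
function requiresApproval(envValue: string | undefined, toolName: string): boolean {
  // Unset means no approval flow. An empty value is treated the same way (assumption).
  if (envValue === undefined || envValue === "") return false;
  // Exact, case-sensitive match against each colon-separated entry.
  return envValue.split(":").includes(toolName);
}
```

With the value Bash:Edit, the tool names Bash and Edit match, while bash and BashOutput do not.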
Security note — HMAC required (Story 2.1)
Approval-response POSTs from orchestrator → session are HMAC-signed with the destination session's webhookKey from the registry. The per-session webhook (server-session-webhook.ts) returns 401 on unsigned/wrong-signature POSTs. Story 3.1 cannot ship without Story 2.1; verify with approval-flow.test.ts — "missing-HMAC rejection" and "wrong-HMAC rejection" cases.
If a session's registry entry has no webhookKey (legacy / pre-Story-2.1 sessions), the orchestrator drops the approval-request server-side, logs [orchestrator] approval-request dropped: session <id> has no webhookKey, and returns 200 to the originating hook. No transport dispatch occurs.
Timeout + dedup behaviour
- timeoutMs on the request (default 30 min) — TTL sweeper synthesises a decision: 'timeout' envelope when the window elapses with no callback. The session's hook script translates that to "no answer received" for Claude.
- Dedup is first-response-wins: a second callback for the same questionId returns 200 to the transport, drops on the orchestrator side, and emits exactly one dropped-dedup audit line.
- Map<questionId, Entry> is in-memory only — orchestrator restart drops in-flight approvals (the session's hook then waits on its own timeout fallback).
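The timeout and first-response-wins rules above can be sketched with a plain in-memory map. This is not the package's PendingApprovals class: the class name, the Entry fields, and the return values are hypothetical, and the caller supplies the clock to keep the example deterministic.

```typescript
type Decision = "approved" | "denied" | "timeout";
type ResolveResult = Decision | "dropped-dedup" | "dropped-unknown";

// Hypothetical entry shape; the real Entry may carry more fields.
interface Entry {
  sessionId: string;
  expiresAt: number; // epoch milliseconds
  settled: boolean;
}

class PendingApprovalsSketch {
  // In-memory only, so a process restart loses every in-flight approval.
  private entries = new Map<string, Entry>();

  add(questionId: string, sessionId: string, now: number, timeoutMs = 30 * 60 * 1000): void {
    this.entries.set(questionId, { sessionId, expiresAt: now + timeoutMs, settled: false });
  }

  // First response wins; later callbacks for the same questionId are dropped.
  resolve(questionId: string, decision: Decision): ResolveResult {
    const entry = this.entries.get(questionId);
    if (entry === undefined) return "dropped-unknown";
    if (entry.settled) return "dropped-dedup";
    entry.settled = true;
    return decision;
  }

  // TTL sweep: settle every expired, unanswered entry and report it as timed out.
  sweep(now: number): string[] {
    const timedOut: string[] = [];
    this.entries.forEach((entry, questionId) => {
      if (!entry.settled && now >= entry.expiresAt) {
        entry.settled = true;
        timedOut.push(questionId);
      }
    });
    return timedOut;
  }
}
```

A callback that arrives after the sweep has settled an entry is treated like any other late duplicate.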
Audit stream (events.jsonl)
Every approval lifecycle event lands as one JSON line in ~/.claude/channels/orchestrator/events.jsonl:
{"event":"approval-request","questionId":"...","sessionId":"...","timestamp":"..."}
{"event":"approval-status","questionId":"...","terminalStatus":"approved","timestamp":"..."}
terminalStatus ∈ { 'approved', 'denied', 'timeout', 'dropped-dedup', 'dropped-no-key', 'dropped-unknown' }.
Query the audit stream with jq:
jq -c 'select(.event=="approval-status")' ~/.claude/channels/orchestrator/events.jsonl
jq -c 'select(.event=="approval-status" and .terminalStatus=="timeout")' ~/.claude/channels/orchestrator/events.jsonl
Outbound transport configuration
The orchestrator opportunistically instantiates Telegram and Slack outbound transports when the relevant env vars are set. Without them, an approval-request is stored and waits — operator must answer at the laptop (degraded mode).
| Env var | Purpose |
|---|---|
| TELEGRAM_BOT_TOKEN | Bot token for outbound approval messages |
| TELEGRAM_DEFAULT_CHAT_ID | Chat id to receive approval messages |
| SLACK_BOT_TOKEN + SLACK_APP_TOKEN | Slack credentials |
| SLACK_DEFAULT_CHANNEL | Channel id (e.g. C0AFEKF68SH) for approval messages |
Dispatch order is intentional: Slack first, Telegram second. Both transports share pendingApprovals, so whichever callback arrives first wins via the dedup path.
Requirements
- Bun runtime
- Claude Code v2.1.80+
- OPENAI_API_KEY for voice/video transcription (optional)
- ffmpeg for video key frame extraction (optional)
License
MIT
