@hyperdrive.bot/resilient-channel

v0.2.4

Published

8 days ago

marcelodevsquad

claude-code mcp channels telegram slack self-healing

Resilient Channel

Transport-agnostic self-healing MCP channel middleware for Claude Code.

Events buffer during session disconnects and replay automatically on reconnection. Works with Telegram, Slack, webhooks, or any custom transport.

Why

Claude Code channels are session-scoped — when a session ends, events are lost. Resilient Channel adds a persistence layer so nothing gets dropped:

Events buffer to disk when Claude disconnects
Automatic replay when a new session connects
Context summaries survive across sessions via the update_context tool
Audit trail logs every event per session

Inspired by NVIDIA NemoClaw's state persistence patterns.

Quick Start

1. Add to your `.mcp.json`

Pick a transport and add it to your project or user-level .mcp.json:

Webhook (simplest — test with curl):

{
  "mcpServers": {
    "resilient-webhook": {
      "command": "bunx",
      "args": ["-y", "@hyperdrive.bot/resilient-channel/src/server.ts"]
    }
  }
}

Telegram:

{
  "mcpServers": {
    "resilient-telegram": {
      "command": "bunx",
      "args": ["-y", "@hyperdrive.bot/resilient-channel/src/server-telegram.ts"]
    }
  }
}

Slack:

{
  "mcpServers": {
    "resilient-slack": {
      "command": "bunx",
      "args": ["-y", "@hyperdrive.bot/resilient-channel/src/server-slack.ts"]
    }
  }
}

Or clone locally:

git clone https://github.com/hyperdrive-bot/claude-code-resilient-channel.git
cd claude-code-resilient-channel
bun install

Then use absolute paths in .mcp.json:

{
  "mcpServers": {
    "resilient-telegram": {
      "command": "bun",
      "args": ["/path/to/claude-code-resilient-channel/src/server-telegram.ts"]
    }
  }
}

2. Launch Claude Code

# Set env vars for your transport, then:
claude --dangerously-load-development-channels server:resilient-webhook

3. Send a message

# Webhook
curl -X POST localhost:8788 -d "build failed on main"

# Telegram — DM your bot
# Slack — DM @your-bot or @mention it in a channel

Transports

Webhook (one-way)

HTTP POST listener on localhost. No auth, no external deps. Perfect for CI webhooks, monitoring alerts, or testing.

claude --dangerously-load-development-channels server:resilient-webhook

# Send events
curl -X POST localhost:8788 -d "deploy succeeded"
curl -X POST localhost:8788 -H "x-chat-id: ci" -d "tests passed (42/42)"

Telegram (two-way)

Polls Telegram for messages. Supports text, voice notes (auto-transcribed via Whisper), photos, documents, and video (with key frame extraction + transcription).

TELEGRAM_BOT_TOKEN=<token> \
TELEGRAM_ALLOWED_IDS=<your-user-id> \
claude --dangerously-skip-permissions \
  --dangerously-load-development-channels server:resilient-telegram

Setup:

Create a bot via @BotFather
Find your user ID via @userinfobot
Set OPENAI_API_KEY for voice/video transcription (optional)

File support:

| Type | Handling |
|------|----------|
| Text | Forwarded directly |
| Voice/Audio | Downloaded + transcribed via Whisper |
| Photo | Downloaded, passed as file_path (Claude sees it) |
| Video | Audio transcribed + up to 4 key frames extracted |
| Document | Downloaded, passed as file_path |

Slack (two-way)

Uses Socket Mode — no public URL needed. Reacts with :eyes: to channel mentions, replies via DM.

SLACK_BOT_TOKEN=xoxb-... \
SLACK_APP_TOKEN=xapp-... \
SLACK_ALLOWED_IDS=U12345678 \
claude --dangerously-skip-permissions \
  --dangerously-load-development-channels server:resilient-slack

Also reads from CLAUDE_CODE_SLACK_BOT_TOKEN, CLAUDE_CODE_SLACK_APP_TOKEN, CLAUDE_CODE_SLACK_USER_ID env vars.

Slack App requirements:

Socket Mode enabled
App-Level Token with connections:write
Event subscriptions: message.im, message.channels, app_mention
Bot scopes: chat:write, channels:history, im:history, app_mentions:read

Behavior:

| Interaction | Response |
|-------------|----------|
| DM | Reply in same DM thread |
| @mention in channel | :eyes: react + reply via DM with thread link |
| Thread follow-up | Forwarded (no re-mention needed) |

Architecture

External System → ChannelTransport → ResilientChannel → MCP stdio → Claude Code
                   (thin adapter)    (self-healing core)

Self-Healing Mechanisms

| Mechanism | How |
|-----------|-----|
| State persistence | ~/.claude/channels/{name}/state.json updated after every event |
| Event buffering | Events queue to disk when Claude disconnects |
| Session replay | Buffered events replayed with is_replay="true" on reconnect |
| Instructions survival | Injected into system prompt, survives context compaction |
| Deduplication | externalId tracking prevents double-processing |
| Audit trail | Per-session events.jsonl in runs/{runId}/ |
| Context handoff | update_context tool persists summaries across sessions |
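
The buffering, replay, and deduplication rows above can be illustrated with a small in-memory sketch. Everything here (ReplayBuffer, BufferedEvent, Delivery) is a hypothetical name chosen for illustration, and the queue lives in memory rather than on disk, so treat it as a rough model of the behaviour and not the package's implementation.

```typescript
// Hypothetical sketch: none of these names come from the package itself.
interface BufferedEvent {
  externalId: string; // id from the external system, used for deduplication
  text: string;
}

type Delivery = BufferedEvent & { is_replay: "true" | "false" };

class ReplayBuffer {
  private pending: BufferedEvent[] = []; // the real channel queues to disk
  private seen = new Set<string>();
  private connected = false;
  private deliver: (d: Delivery) => void;

  constructor(deliver: (d: Delivery) => void) {
    this.deliver = deliver;
  }

  // Drop duplicates, deliver live events, queue everything else.
  push(event: BufferedEvent): void {
    if (this.seen.has(event.externalId)) return;
    this.seen.add(event.externalId);
    if (this.connected) this.deliver({ ...event, is_replay: "false" });
    else this.pending.push(event);
  }

  disconnect(): void {
    this.connected = false;
  }

  // Flush the queue in arrival order, marking each event as a replay.
  connect(): void {
    this.connected = true;
    for (const e of this.pending.splice(0)) {
      this.deliver({ ...e, is_replay: "true" });
    }
  }
}
```

An event pushed while disconnected is held until connect() is called and then arrives flagged as a replay; a second push with the same externalId is ignored.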

State File

{
  "runId": "ch-20260321-143022-a1b2c3d4",
  "sessionCount": 3,
  "lastEventId": "evt-42",
  "pendingEvents": [],
  "contextSummary": "User debugging CI pipeline failure",
  "status": "connected"
}
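
For readers who want to consume the state file from their own tooling, the shape above could be typed roughly as follows. The ChannelState interface and parseState helper are assumptions derived only from the sample shown; the package's real types may differ (for example, status is kept as a plain string because only "connected" appears in the sample).

```typescript
// Assumed shape, inferred from the sample state.json above.
interface ChannelState {
  runId: string;
  sessionCount: number;
  lastEventId: string;
  pendingEvents: unknown[];
  contextSummary: string;
  status: string;
}

// Parse and minimally validate a state file; returns null on any mismatch.
function parseState(json: string): ChannelState | null {
  try {
    const v = JSON.parse(json);
    const ok =
      typeof v === "object" && v !== null &&
      typeof v.runId === "string" &&
      typeof v.sessionCount === "number" &&
      typeof v.lastEventId === "string" &&
      Array.isArray(v.pendingEvents) &&
      typeof v.contextSummary === "string" &&
      typeof v.status === "string";
    return ok ? (v as ChannelState) : null;
  } catch {
    return null; // not valid JSON
  }
}
```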

Custom Transport

Implement the ChannelTransport interface:

import type { ChannelTransport, InboundEvent } from './src/types.js'

export class MyTransport implements ChannelTransport {
  name = 'my-transport'
  onEvent: (event: InboundEvent) => void = () => {}

  async start() { /* connect to your system */ }
  async stop() { /* disconnect */ }

  // Optional: two-way
  async send(destination: string, text: string) { /* send reply */ }
}

Then create an entry point:

import { ResilientChannel } from './src/resilient-channel.js'
import { MyTransport } from './my-transport.js'

const channel = new ResilientChannel({
  name: 'my-channel',
  transport: new MyTransport(),
  twoWay: true,
  allowlist: new Set(['allowed-sender-id']),
})

channel.start()

Save the entry point as server-my-channel.ts and register it in .mcp.json:

{
  "mcpServers": {
    "my-channel": {
      "command": "bun",
      "args": ["./server-my-channel.ts"]
    }
  }
}

Lifecycle hooks

Heartbeats

The orchestrator can distinguish "alive but busy in a long Bash invocation" from "dead, OS killed it" via heartbeat events emitted from each Claude Code session's PreToolUse:Bash hook.

Add call-home-heartbeat.sh to your ~/.claude/settings.json:

{
  "hooks": {
    "PreToolUse": [
      {
        "matcher": "Bash",
        "hooks": [
          {
            "type": "command",
            "command": "$HOME/Developer/ds/super-repo/scripts/telemetry/call-home-heartbeat.sh"
          }
        ]
      }
    ]
  }
}

Fires on every Bash invocation, giving an effective cadence of ≤60s during active tool use.
Silent — no MCP notification per heartbeat (NFR3 token-cost guard). The orchestrator updates registry.json[sessionId].lastHeartbeatAt and short-circuits before the MCP fan-out.
Adaptive prune: pruneStale() uses max(30min, 3 × heartbeatIntervalMs) so genuinely slow sessions survive.
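
The adaptive prune rule is simple arithmetic, sketched below. The helper names staleThresholdMs and isStale are hypothetical; only the max(30min, 3 × heartbeatIntervalMs) formula comes from the text above.

```typescript
const MIN_STALE_MS = 30 * 60 * 1000; // the 30-minute floor

// Threshold after which a session with no heartbeat is considered stale.
function staleThresholdMs(heartbeatIntervalMs: number): number {
  return Math.max(MIN_STALE_MS, 3 * heartbeatIntervalMs);
}

// Hypothetical helper: true when the last heartbeat is older than the threshold.
function isStale(lastHeartbeatAt: number, heartbeatIntervalMs: number, now: number): boolean {
  return now - lastHeartbeatAt > staleThresholdMs(heartbeatIntervalMs);
}
```

With a 60-second interval the 30-minute floor applies; with a 20-minute interval the threshold stretches to 60 minutes, which is what lets slow sessions survive.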

Crash-loop detection

Per-session crash records persist to ~/.claude/channels/orchestrator/crash-history.json (capped at 20 entries per session, oldest dropped on overflow). When ≥3 crashes occur within a 15-minute sliding window for the same session, the supervisor fires exactly one alert via the first available transport (Slack → Telegram → stderr fallback). A 30-minute cooldown follows, and it survives supervisor restarts because lastAlertAt is persisted as a wall-clock timestamp in the JSON file rather than held in a setTimeout that dies with the process.

Crashes classified as killed-by-supervisor (intentional rolling restarts, manual kill, idle-timeout cleanups) do not count toward the threshold. This requires the caller to populate classification correctly — Story 1.3's .terminated markers and the SessionTerminated classifier are the upstream source of that field.
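
One way to picture the threshold logic is the sketch below. It assumes a simplified record shape (only ts and classification) and a hypothetical recordAndCheck function; the 20-entry cap, 15-minute window, 3-crash threshold, and 30-minute cooldown are taken from the description above, while the exact field layout of the real SessionCrashState is not shown in this document.

```typescript
// Assumed, simplified shapes for illustration only.
type Classification = "unexpected" | "killed-by-supervisor";
interface CrashRecord { ts: number; classification: Classification }
interface CrashState { crashes: CrashRecord[]; lastAlertAt: number | null }

const MAX_ENTRIES = 20;
const WINDOW_MS = 15 * 60 * 1000;
const THRESHOLD = 3;
const COOLDOWN_MS = 30 * 60 * 1000;

// Record a crash and report whether an alert should fire now.
function recordAndCheck(state: CrashState, crash: CrashRecord, now: number): boolean {
  state.crashes.push(crash);
  if (state.crashes.length > MAX_ENTRIES) {
    state.crashes.splice(0, state.crashes.length - MAX_ENTRIES); // drop oldest
  }
  // Supervisor-initiated kills never count toward the threshold.
  const counted = state.crashes.filter(
    (c) => c.classification !== "killed-by-supervisor" && now - c.ts <= WINDOW_MS
  );
  if (counted.length < THRESHOLD) return false;
  // Wall-clock cooldown: works across restarts if lastAlertAt is persisted.
  if (state.lastAlertAt !== null && now - state.lastAlertAt < COOLDOWN_MS) return false;
  state.lastAlertAt = now;
  return true;
}
```

Because the cooldown compares timestamps instead of relying on a timer, reloading the state from disk after a restart gives the same answer.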

Inspect the per-session crash buffer with relative ages:

$ claude-channels crash-history a1b2c3d4
Session a1b2c3d4 — 3 crash(es) in window, lastAlertAt=2026-04-25T03:14:09.000Z
[2026-04-25T03:11:09.000Z] exit=139 signal=SIGSEGV classification=unexpected "boom" (3min ago)
[2026-04-25T03:12:09.000Z] exit=139 signal=SIGSEGV classification=unexpected "boom" (2min ago)
[2026-04-25T03:13:09.000Z] exit=139 signal=SIGSEGV classification=unexpected "boom" (1min ago)

--json emits the raw SessionCrashState. Prefix matching (8-char prefix → unique session) mirrors claude-channels session log; ambiguous prefixes exit 1 with the matching session ids.

Silent — no MCP <channel> notification per crash (NFR3 token-cost guard). The single alert fires only on the threshold transition, never on every crash.

Smoke walkthrough

# Populate 3 crashes 1s apart into a fresh sessionId, then read back via CLI
SID=$(uuidgen)
HIST=~/.claude/channels/orchestrator/crash-history.json
for i in 1 2 3; do
  node -e "
    const { recordCrash } = require('./packages/cli/claude-channels/dist/supervisor/health-monitor.js');
    recordCrash('$SID', { ts: Date.now(), errStr: 'boom #$i', exitCode: 139, signal: 'SIGSEGV', classification: 'unexpected' });
  "
  sleep 1
done
claude-channels crash-history "$SID"
# Cleanup: remove the test session from the history file (reuses $HIST from above)
HIST="$HIST" node -e "
  const fs = require('fs');
  const path = process.env.HIST;
  const s = JSON.parse(fs.readFileSync(path, 'utf-8'));
  delete s.sessions['$SID'];
  fs.writeFileSync(path, JSON.stringify(s, null, 2));
"
rm -rf ~/.claude/channels/orchestrator/sessions/"$SID"

The smoke test calls recordCrash directly (not handleCrash), so lastAlertAt stays null and no transport dispatch is exercised.

Termination signals

Every session termination is classified into one of five reasons and recorded silently in the registry's recentlyTerminated[] ring buffer (cap 50, oldest dropped on overflow). The orchestrator never emits an MCP notification per termination — operators see unexpected terminations on-demand via formatStatus().

Classification table — applied top-down by scripts/telemetry/call-home-session-end.sh, first match wins:

| Rule | Condition | Reason | killedBy |
|---|---|---|---|
| 1 | ~/.claude/channels/session-webhook-<port>/.terminated exists, mtime ≤ 60s | from marker reason= | from marker killed_by= |
| 2 | exitCode == 0 (or Claude Code reason ∈ {clear, logout, prompt_input_exit}) | user-exit | null |
| 3 | signal == "SIGKILL" AND ~/.claude/channels/orchestrator/supervisor.log mtime ≤ 30s | supervisor-kill | "supervisor" |
| 4 | signal == "SIGKILL" AND no recent supervisor.log | oom (heuristic, best-effort) | null |
| 5 | else (SIGTERM, SIGINT, SIGHUP, …) | os-signal | null |
| Fallback | legacy SessionEnd envelope (no reason field) | unknown | null |

The marker file is single-use — call-home-session-end.sh deletes it after read so a stale marker can't falsely classify the next exit.
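
The first-match-wins rules can be read as a small decision function. The real classifier is the shell script named above; this TypeScript rendering is only an illustration, and the ExitInfo and Marker shapes are assumptions (the caller is presumed to have already read the marker and checked file ages). The legacy fallback row is handled by the orchestrator and is not modelled here.

```typescript
// Assumed input shapes; the actual script reads these from files and env.
interface Marker { reason: string; killedBy: string | null; ageSeconds: number }
interface ExitInfo {
  exitCode: number | null;
  signal: string | null;
  claudeReason?: string;            // Claude Code's own exit reason, if any
  marker?: Marker;                  // parsed .terminated file, if present
  supervisorLogAgeSeconds?: number; // undefined when supervisor.log is missing
}
interface Verdict { reason: string; killedBy: string | null }

const USER_EXIT_REASONS = new Set(["clear", "logout", "prompt_input_exit"]);

function classify(info: ExitInfo): Verdict {
  // Rule 1: a fresh marker overrides everything else.
  if (info.marker && info.marker.ageSeconds <= 60) {
    return { reason: info.marker.reason, killedBy: info.marker.killedBy };
  }
  // Rule 2: clean exit.
  if (info.exitCode === 0 || USER_EXIT_REASONS.has(info.claudeReason ?? "")) {
    return { reason: "user-exit", killedBy: null };
  }
  // Rules 3 and 4: SIGKILL is attributed by supervisor.log recency.
  if (info.signal === "SIGKILL") {
    const recent =
      info.supervisorLogAgeSeconds !== undefined && info.supervisorLogAgeSeconds <= 30;
    return recent
      ? { reason: "supervisor-kill", killedBy: "supervisor" }
      : { reason: "oom", killedBy: null };
  }
  // Rule 5: any other signal.
  return { reason: "os-signal", killedBy: null };
}
```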

`_marker-write.sh` — invocation contract

Callers (e.g. supervisor cleanup, manual ops) thread shutdown context to the SessionEnd hook by writing a marker file:

scripts/telemetry/_marker-write.sh <port> <reason>

port — numeric (matches ^[0-9]+$); per-session webhook port from ~/.claude/channels/session-ports/<claude_pid>.json.
reason — one of: user-exit | supervisor-kill | os-signal | oom | crash-loop-breaker | manual-cli | idle-timeout.
Atomic write: tmp-file + chmod 0600 + mv -f rename (POSIX rename is atomic).
Parent dir is created with mode 0700 if absent.
Exit codes: 0 success, 1 validation failure, 2 filesystem error.

Override the target dir for tests via MARKER_BASE_DIR=/tmp/... (defaults to $HOME/.claude/channels).
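
The argument checks in the contract can be expressed as a short validator. This is a TypeScript sketch of the validation step only (the script itself is shell); validateMarkerArgs is a hypothetical name, and its return value mirrors exit codes 0 and 1 from the list above. Filesystem handling and exit code 2 are not modelled.

```typescript
// The seven reasons accepted by the marker contract.
const MARKER_REASONS = new Set<string>([
  "user-exit",
  "supervisor-kill",
  "os-signal",
  "oom",
  "crash-loop-breaker",
  "manual-cli",
  "idle-timeout",
]);

// Returns 0 when both arguments are valid, 1 on validation failure.
function validateMarkerArgs(port: string, reason: string): 0 | 1 {
  if (!/^[0-9]+$/.test(port)) return 1; // port must be purely numeric
  if (!MARKER_REASONS.has(reason)) return 1;
  return 0;
}
```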

`recentlyTerminated[]` ring buffer

The registry persists the last 50 terminations to ~/.claude/channels/orchestrator/registry.json. Pre-rollout registries (no field) read as [] — no migration script required (NFR1).

formatStatus() appends one warning line when at least one entry has terminatedAt > now − 1h AND reason ∈ {oom, os-signal, supervisor-kill}:

⚠️ N session(s) terminated unexpectedly in the last hour: <name1> (<reason1>), <name2> (<reason2>), ...

Lists at most 5 names; if more match, appends , +<extra> more. reason=user-exit and reason=unknown are never counted as unexpected (a healthy system stays quiet).
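
As a rough illustration of the ring buffer and the warning line, consider the sketch below. The Terminated shape and both function names are assumptions; the cap of 50, the one-hour window, the set of unexpected reasons, and the five-name limit follow the description above.

```typescript
// Assumed entry shape for illustration.
interface Terminated { name: string; reason: string; terminatedAt: number }

const UNEXPECTED = new Set(["oom", "os-signal", "supervisor-kill"]);
const ONE_HOUR_MS = 60 * 60 * 1000;

// Append an entry, dropping the oldest once the cap is exceeded.
function pushTerminated(buffer: Terminated[], entry: Terminated, cap = 50): void {
  buffer.push(entry);
  if (buffer.length > cap) buffer.splice(0, buffer.length - cap);
}

// Build the warning line, or return null when nothing unexpected happened.
function unexpectedWarning(entries: Terminated[], now: number): string | null {
  const hits = entries.filter(
    (e) => e.terminatedAt > now - ONE_HOUR_MS && UNEXPECTED.has(e.reason)
  );
  if (hits.length === 0) return null;
  const shown = hits.slice(0, 5).map((e) => `${e.name} (${e.reason})`).join(", ");
  const extra = hits.length - 5;
  const tail = extra > 0 ? `, +${extra} more` : "";
  return `⚠️ ${hits.length} session(s) terminated unexpectedly in the last hour: ${shown}${tail}`;
}
```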

Per-session restarts.log

Every termination — new SessionTerminated envelope OR legacy SessionEnd fallback — appends one line to ~/.claude/channels/orchestrator/sessions/<sessionId>/restarts.log:

[2026-04-25T07:48:33.000Z] TERMINATED: reason=user-exit exitCode=0 signal=null killedBy=null

Errors from the log writer are swallowed silently — logging must never break a recovery flow.

Backwards compatibility

Clients that have NOT upgraded to the new call-home-session-end.sh continue to work. The orchestrator's existing case 'SessionEnd': arm pushes reason=unknown onto recentlyTerminated[] and writes a TERMINATED reason=unknown line to restarts.log — so the audit trail is preserved even for un-upgraded clients.

Security

All transports enforce deny-first sender gating. No allowlist = no messages get through.

Telegram: TELEGRAM_ALLOWED_IDS is mandatory
Slack: SLACK_ALLOWED_IDS / CLAUDE_CODE_SLACK_USER_ID is mandatory
Webhook: optional x-sender-id header + allowlist

HMAC Authentication

Wire-level authentication for the orchestrator and per-session webhook surfaces. Optional, off by default, opt-in via key-file presence. Designed for the single-machine threat model — symmetric HMAC-SHA256 with a shared key on disk. Cross-machine deployments can swap the verifier without changing the wire format.

Key generation (orchestrator):

openssl rand -hex 32 > ~/.claude/channels/orchestrator/key
chmod 600 ~/.claude/channels/orchestrator/key

Wire format: every authenticated POST carries x-sender-sig: <hex> where <hex> is the lowercase hex output of hmac-sha256(key, body). The hook scripts under scripts/telemetry/ compute this header automatically when the key file exists; when absent, they POST as before (no header added).

Behaviour matrix:

| Key file | Header | Result |
|---|---|---|
| absent | absent | 200 + warn-once on stderr per process boot |
| absent | present (any value) | 200 (verification skipped) |
| present | absent | 401 unsigned request rejected |
| present | bad signature | 401 invalid signature |
| present | valid signature | 200 |
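
The wire format and the matrix above can be sketched with Node's built-in crypto module. The function names (sign, verify, checkRequest) are hypothetical, and one detail is an explicit assumption: the key is used as the literal text of the key file, not hex-decoded into raw bytes. The document does not say which form the hook scripts use, so check the scripts before relying on this.

```typescript
import { createHmac, timingSafeEqual } from "node:crypto";

// Lowercase hex HMAC-SHA256 of the request body.
// ASSUMPTION: key is the key file's text as-is, not hex-decoded bytes.
function sign(key: string, body: string): string {
  return createHmac("sha256", key).update(body).digest("hex");
}

// Constant-time comparison of the received header against the expected value.
function verify(key: string, body: string, headerValue: string): boolean {
  const expected = Buffer.from(sign(key, body), "utf8");
  const received = Buffer.from(headerValue, "utf8");
  return expected.length === received.length && timingSafeEqual(expected, received);
}

// Mirrors the behaviour matrix: key === null means the key file is absent.
function checkRequest(key: string | null, body: string, header: string | undefined): 200 | 401 {
  if (key === null) return 200;         // no key file: verification skipped
  if (header === undefined) return 401; // unsigned request rejected
  return verify(key, body, header) ? 200 : 401;
}
```

The length check before timingSafeEqual matters because that function throws when its two inputs differ in length.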

Strict mode (RESILIENT_CHANNEL_REQUIRE_AUTH=1): when set, the orchestrator refuses to boot without a key file — process.exit(2) with FATAL: RESILIENT_CHANNEL_REQUIRE_AUTH=1 but no key file at <path> — aborting on stderr. Default behaviour (env unset or 0) is the warn-and-accept fallback above.

Per-session keys: each server-session-webhook.ts instance generates its own 32-byte hex key on startup at ~/.claude/channels/session-webhook-<port>/key (mode 0o600). The key path is auto-managed: created on session boot, removed on SIGTERM/SIGINT. The key value is advertised to the orchestrator via the webhookKey field on the SessionStart call-home payload and persisted in ~/.claude/channels/orchestrator/registry.json under sessions.<sessionId>.webhookKey. The orchestrator's approval-response relay (Story 3.1) signs POSTs back into the session using this per-session key.

Approvals — inline-keyboard human-in-the-loop (Story 3.1, Steal #4)

Surface Claude Code's AskUserQuestion (and opt-in sensitive PreToolUse) events as inline-keyboard messages on Telegram and Slack. The user taps a button on their phone; the orchestrator HMAC-signs the response and POSTs it back to the originating session's webhook.

Sequence

sequenceDiagram
    participant CC as Claude Code (session)
    participant Hook as call-home-ask-question.sh
    participant Orch as Orchestrator
    participant Slack
    participant Telegram
    participant User as Operator (phone)

    CC->>Hook: AskUserQuestion event
    Hook->>Orch: POST approval-request envelope (HMAC)
    Orch->>Orch: PendingApprovals.add()
    Orch->>Slack: dispatchApprovalRequest (Block Kit)
    Orch->>Telegram: dispatchApprovalRequest (inline keyboard)
    User->>Telegram: tap button
    Telegram->>Orch: callback_query
    Orch->>Orch: PendingApprovals.resolve()
    Orch->>CC: POST approval-response envelope (HMAC sig)
    CC->>CC: receive via session-webhook → MCP notification

Hook wiring (`~/.claude/settings.json`)

{
  "hooks": {
    "AskUserQuestion": [
      { "hooks": [{ "type": "command", "command": "/abs/path/to/scripts/telemetry/call-home-ask-question.sh" }] }
    ],
    "PreToolUse": [
      { "hooks": [{ "type": "command", "command": "/abs/path/to/scripts/telemetry/call-home-permission-check.sh" }] }
    ]
  }
}

`CLAUDE_APPROVAL_TOOLS` — opt-in for sensitive PreToolUse

call-home-permission-check.sh is gated on the CLAUDE_APPROVAL_TOOLS env var. Unset → exit 0 immediately, no envelope emitted. Colon-separated tool names → matching PreToolUse events route through the approval flow:

export CLAUDE_APPROVAL_TOOLS=Bash:Edit

Match is case-sensitive and exact.
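
The gate itself lives in a shell script, but its matching rule is easy to state in TypeScript. The requiresApproval function below is a hypothetical helper that mirrors the described behaviour; treating an empty string like an unset variable is an assumption.

```typescript
// True when toolName appears in the colon-separated CLAUDE_APPROVAL_TOOLS value.
// Matching is exact and case-sensitive; unset (or empty, by assumption) disables the gate.
function requiresApproval(envValue: string | undefined, toolName: string): boolean {
  if (!envValue) return false;
  return envValue.split(":").includes(toolName);
}
```

With CLAUDE_APPROVAL_TOOLS=Bash:Edit, a Bash or Edit event routes through approval, while bash (wrong case) and BashOutput (not an exact match) do not.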

Security note — HMAC required (Story 2.1)

Approval-response POSTs from orchestrator → session are HMAC-signed with the destination session's webhookKey from the registry. The per-session webhook (server-session-webhook.ts) returns 401 on unsigned/wrong-signature POSTs. Story 3.1 cannot ship without Story 2.1; verify with approval-flow.test.ts — "missing-HMAC rejection" and "wrong-HMAC rejection" cases.

If a session's registry entry has no webhookKey (legacy / pre-Story-2.1 sessions), the orchestrator drops the approval-request server-side, logs [orchestrator] approval-request dropped: session <id> has no webhookKey, and returns 200 to the originating hook. No transport dispatch occurs.

Timeout + dedup behaviour

timeoutMs on the request (default 30 min) — TTL sweeper synthesises a decision: 'timeout' envelope when the window elapses with no callback. The session's hook script translates that to "no answer received" for Claude.
Dedup is first-response-wins: a second callback for the same questionId returns 200 to the transport, drops on the orchestrator side, and emits exactly one dropped-dedup audit line.
Map<questionId, Entry> is in-memory only — orchestrator restart drops in-flight approvals (the session's hook then waits on its own timeout fallback).
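
A minimal model of the timeout and first-response-wins behaviour might look like this. PendingApprovals, add, and resolve are names used in this document, but the signatures, the Entry fields, and the sweep method below are assumptions made for illustration.

```typescript
type Decision = "approved" | "denied";
type Outcome = Decision | "dropped-dedup" | "dropped-unknown";

// Assumed entry shape; the real Entry type is not shown in this document.
interface Entry { sessionId: string; expiresAt: number; done: boolean }

class PendingApprovals {
  private entries = new Map<string, Entry>(); // in-memory only, lost on restart

  add(questionId: string, sessionId: string, now: number, timeoutMs = 30 * 60 * 1000): void {
    this.entries.set(questionId, { sessionId, expiresAt: now + timeoutMs, done: false });
  }

  // First response wins; later callbacks for the same question are dropped.
  resolve(questionId: string, decision: Decision): Outcome {
    const entry = this.entries.get(questionId);
    if (!entry) return "dropped-unknown";
    if (entry.done) return "dropped-dedup";
    entry.done = true;
    return decision;
  }

  // TTL sweep: returns the ids whose window elapsed with no callback.
  sweep(now: number): string[] {
    const timedOut: string[] = [];
    this.entries.forEach((entry, id) => {
      if (!entry.done && now >= entry.expiresAt) {
        entry.done = true;
        timedOut.push(id);
      }
    });
    return timedOut;
  }
}
```

Resolved entries are kept in the map here so a late duplicate is reported as dropped-dedup instead of dropped-unknown; a longer-lived implementation would need to evict them eventually.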

Audit stream (`events.jsonl`)

Every approval lifecycle event lands as one JSON line in ~/.claude/channels/orchestrator/events.jsonl:

{"event":"approval-request","questionId":"...","sessionId":"...","timestamp":"..."}
{"event":"approval-status","questionId":"...","terminalStatus":"approved","timestamp":"..."}

terminalStatus ∈ { 'approved', 'denied', 'timeout', 'dropped-dedup', 'dropped-no-key', 'dropped-unknown' }.

Query the audit stream with jq:

jq -c 'select(.event=="approval-status")' ~/.claude/channels/orchestrator/events.jsonl
jq -c 'select(.event=="approval-status" and .terminalStatus=="timeout")' ~/.claude/channels/orchestrator/events.jsonl

Outbound transport configuration

The orchestrator opportunistically instantiates Telegram and Slack outbound transports when the relevant env vars are set. Without them, an approval-request is stored and waits — operator must answer at the laptop (degraded mode).

| Env var | Purpose |
|---|---|
| TELEGRAM_BOT_TOKEN | Bot token for outbound approval messages |
| TELEGRAM_DEFAULT_CHAT_ID | Chat id to receive approval messages |
| SLACK_BOT_TOKEN + SLACK_APP_TOKEN | Slack credentials |
| SLACK_DEFAULT_CHANNEL | Channel id (e.g. C0AFEKF68SH) for approval messages |

Dispatch order is intentional: Slack first, Telegram second. Both transports share pendingApprovals, so whichever callback arrives first wins via the dedup path.

Requirements

Bun runtime
Claude Code v2.1.80+
OPENAI_API_KEY for voice/video transcription (optional)
ffmpeg for video key frame extraction (optional)

License

MIT