nori-premortem

v1.0.2

Published

a month ago

![Node Version](https://img.shields.io/badge/node-%3E%3D20-brightgreen)


theahura

Nori Premortem

A system monitoring daemon that uses Claude AI to diagnose machine issues before they become critical failures.

Why Premortem?

When a machine dies, you are often left with no real idea of what happened or why, because the machine takes everything with it. Traditional monitoring rarely captures meaningful diagnostics: it makes strong assumptions up front about possible sources of failure and cannot adjust dynamically based on in-stream information. You can figure out that your system OOM'd, but you won't easily figure out why or, even more importantly, where in your code the problem came from.

Premortem spawns a Claude agent the moment issues arise, analyzing the system in real time and streaming diagnostics to a safe backend. Instead of metric graphs, engineers get AI-powered root cause analysis.

Overview

Premortem continuously watches system vitals (CPU, memory, disk, processes) and spawns Claude agents to diagnose problems when thresholds are breached. All diagnostic output is streamed in real time to a configured webhook endpoint. Because the analysis is done by an AI agent rather than by fixed rules, it can follow the evidence and dig into the actual cause of the problem.

Features

Continuous Monitoring: Polls system metrics at configurable intervals
Intelligent Diagnosis: Spawns Claude agents with full system context when thresholds breach
Real-Time Streaming: Fire-and-forget webhook delivery sends each message as soon as it is produced, so diagnostics already delivered survive even if the machine crashes
Reset on Completion: Daemon resets after each agent run, allowing multiple diagnostic sessions
Configurable Thresholds: Monitor memory %, disk %, CPU %, and process counts

Quick Start

# Clone and install
git clone git@github.com:tilework-tech/nori-premortem.git
cd nori-premortem
npm install
npm run build

# Configure
cp defaultConfig.example.json defaultConfig.json
# Edit defaultConfig.json with your webhook URL and Anthropic API key

# Run
node build/cli.js --config ./defaultConfig.json

Installation

Clone and install:

git clone git@github.com:tilework-tech/nori-premortem.git
cd nori-premortem
npm install
npm run build

Or install globally:

npm install -g .

Configuration

Create your configuration file from the example template:

cp defaultConfig.example.json defaultConfig.json
# Edit defaultConfig.json with your webhookUrl, anthropicApiKey, and desired thresholds

Note: defaultConfig.json is gitignored to prevent accidentally committing sensitive credentials.

Example configuration:

{
  "webhookUrl": "https://your-server.com/webhook-endpoint",
  "anthropicApiKey": "sk-ant-your-api-key-here",
  "pollingInterval": 10000,
  "thresholds": {
    "memoryPercent": 90,
    "diskPercent": 85,
    "cpuPercent": 80
  },
  "agentConfig": {
    "customPrompt": "You are diagnosing system performance issues. Focus on memory usage, disk space, CPU utilization, and process behavior."
  },
  "heartbeat": {
    "url": "https://your-server.com/heartbeat-endpoint",
    "interval": 60000,
    "processName": "my-process"
  }
}

Configuration Options

webhookUrl (required): HTTP endpoint to receive diagnostic output
- Must accept POST requests with JSON payloads containing Claude SDK message objects
- Messages are grouped by session_id field
- Each message follows the format: {type: string, session_id: string, ...other_fields}
anthropicApiKey (required): Your Anthropic API key for Claude
pollingInterval (optional, default: 10000): Milliseconds between system checks
thresholds (required): At least one threshold must be configured
- memoryPercent: Trigger when memory usage exceeds this percentage (uses "available" memory, not "used", to avoid false alerts from Linux buffer/cache)
- diskPercent: Trigger when disk usage exceeds this percentage
- cpuPercent: Trigger when CPU usage exceeds this percentage
agentConfig (optional): Claude agent configuration
- customPrompt: Additional context prepended to the diagnostic prompt (default: null)
- Note: Model, allowed tools, and max turns are controlled by SDK defaults and not user-configurable
heartbeat (optional): Health check configuration
- url: Endpoint to receive periodic heartbeat signals
- interval (default: 60000): Milliseconds between heartbeat signals
- processName: Process name to monitor and report in heartbeat
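
The options above can be written down as a type plus a small validation pass. This is only a sketch: the field types are inferred from the example configuration, `validateConfig` is a hypothetical helper and not part of Premortem, and treating every `heartbeat` field as optional is an assumption, since the README does not say which are required.

```typescript
// Threshold names as documented; each one is optional on its own.
interface Thresholds {
  memoryPercent?: number;
  diskPercent?: number;
  cpuPercent?: number;
}

// Config shape inferred from the example file (an assumption, not the project's own type).
interface PremortemConfig {
  webhookUrl: string;
  anthropicApiKey: string;
  pollingInterval?: number; // milliseconds, documented default 10000
  thresholds: Thresholds;
  agentConfig?: { customPrompt?: string | null };
  heartbeat?: { url?: string; interval?: number; processName?: string };
}

// Hypothetical helper: returns a list of problems, empty when the config looks valid.
function validateConfig(raw: unknown): string[] {
  if (typeof raw !== "object" || raw === null) {
    return ["config must be a JSON object"];
  }
  const errors: string[] = [];
  const cfg = raw as Partial<PremortemConfig>;

  if (typeof cfg.webhookUrl !== "string" || cfg.webhookUrl.length === 0) {
    errors.push("webhookUrl is required");
  }
  if (typeof cfg.anthropicApiKey !== "string" || cfg.anthropicApiKey.length === 0) {
    errors.push("anthropicApiKey is required");
  }

  // The README requires at least one configured threshold.
  const t = cfg.thresholds;
  const values = t && typeof t === "object"
    ? [t.memoryPercent, t.diskPercent, t.cpuPercent]
    : [];
  if (values.filter((v) => typeof v === "number").length === 0) {
    errors.push("thresholds must set at least one of memoryPercent, diskPercent, cpuPercent");
  }
  return errors;
}
```

Running a check like this before starting the monitoring loop would match the fail-fast behavior the daemon already applies to the API key.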

Usage

Run the daemon:

nori-premortem --config ./defaultConfig.json

The daemon will:

Validate the Anthropic API key with a test query (fail-fast if invalid)
Create the archive directory at ~/.premortem-logs if it doesn't exist
Validate the archive directory is writable (fail-fast if not)
Start monitoring system metrics
When a threshold is breached, spawn a Claude agent with system context
Stream all agent output to your webhook endpoint
Save complete session transcripts to ~/.premortem-logs/agent-{sessionId}.jsonl
Reset after the agent completes, ready to trigger again
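
The trigger-and-reset part of that sequence can be pictured as a two-state cycle. The sketch below is an illustration under assumptions, not Premortem's actual code: `createDaemon`, `checkBreach`, `startAgent`, and `transcriptPath` are hypothetical names, and the agent is replaced by an injected callback so the cycle can be shown without real metrics or a real Claude session.

```typescript
type DaemonState = "monitoring" | "diagnosing";

// Hypothetical dependencies, injected so the cycle is easy to follow and to test.
interface DaemonDeps {
  // Returns the name of a breached threshold, or null when everything is healthy.
  checkBreach: () => string | null;
  // Starts an agent run and calls onComplete once the agent has finished.
  startAgent: (reason: string, onComplete: () => void) => void;
}

function createDaemon(deps: DaemonDeps) {
  let state: DaemonState = "monitoring";

  // One polling tick. Returns true only when this tick started an agent.
  function tick(): boolean {
    if (state === "diagnosing") return false; // an agent is already running
    const reason = deps.checkBreach();
    if (reason === null) return false; // no threshold breached
    state = "diagnosing";
    deps.startAgent(reason, () => {
      state = "monitoring"; // reset on completion, ready to trigger again
    });
    return true;
  }

  return { tick, getState: (): DaemonState => state };
}

// Transcript location as documented: ~/.premortem-logs/agent-{sessionId}.jsonl
function transcriptPath(homeDir: string, sessionId: string): string {
  return `${homeDir}/.premortem-logs/agent-${sessionId}.jsonl`;
}
```

In a real daemon, `tick` would be driven by a timer at `pollingInterval` milliseconds; here it is called directly so the state changes are explicit.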

Stop the daemon with Ctrl+C.

Webhook Integration

Premortem streams diagnostic data to any HTTP endpoint that accepts POST requests. This allows integration with existing monitoring infrastructure, logging systems, or custom backends.

Webhook Endpoint Requirements

The configured webhook endpoint must:

Accept POST requests with raw Claude SDK message payloads
Handle messages grouped by session_id field
Be highly available (premortem uses fire-and-forget delivery with no retry logic)
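
To make the last requirement concrete, here is a minimal sketch of what fire-and-forget delivery looks like on the sending side. `fireAndForget` and `Sender` are hypothetical names used for illustration; the README does not show how Premortem actually issues its requests.

```typescript
// Hypothetical transport: anything that POSTs a string body to a URL.
type Sender = (url: string, body: string) => Promise<unknown>;

// Start the request and return immediately. Failures are dropped, never retried.
function fireAndForget(send: Sender, url: string, message: object): void {
  try {
    send(url, JSON.stringify(message)).catch(() => {
      // Delivery failed: the message is lost, which is why the endpoint must be reliable.
    });
  } catch {
    // The sender threw synchronously: the message is dropped as well.
  }
}

// Example wiring with the fetch built into Node 18+ (kept as a comment so the
// sketch has no runtime dependencies):
// const send: Sender = (url, body) =>
//   fetch(url, { method: "POST", headers: { "content-type": "application/json" }, body });
```

The caller never waits on the network, so a slow or dead endpoint cannot stall diagnosis on a machine that is already struggling. The cost is that any message sent while the endpoint is down is gone.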

Message Format

Messages are sent as raw Claude SDK output, one message per POST:

{
  "type": "assistant",
  "session_id": "session-abc123",
  "message": {
    "role": "assistant",
    "content": "Analyzing system metrics..."
  }
}

The session_id field groups messages into a single diagnostic transcript artifact on the backend.
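
On the receiving side, grouping by `session_id` might look like the sketch below. It assumes only the documented message shape (`type`, `session_id`, plus other fields); `parseMessage`, `appendMessage`, and the in-memory `Map` are hypothetical and stand in for whatever storage your backend uses.

```typescript
// Documented shape: {type: string, session_id: string, ...other_fields}
interface SdkMessage {
  type: string;
  session_id: string;
  [key: string]: unknown;
}

// Parse one POST body; returns null if it is not a well-formed message.
function parseMessage(body: string): SdkMessage | null {
  let parsed: unknown;
  try {
    parsed = JSON.parse(body);
  } catch {
    return null;
  }
  if (typeof parsed !== "object" || parsed === null) return null;
  const candidate = parsed as Record<string, unknown>;
  if (typeof candidate["type"] !== "string") return null;
  if (typeof candidate["session_id"] !== "string") return null;
  return candidate as SdkMessage;
}

// Append a message to its session's transcript, keeping arrival order.
function appendMessage(store: Map<string, SdkMessage[]>, msg: SdkMessage): void {
  const transcript = store.get(msg.session_id);
  if (transcript) {
    transcript.push(msg);
  } else {
    store.set(msg.session_id, [msg]);
  }
}
```

An HTTP handler would call `parseMessage` on each request body and pass valid results to `appendMessage`. Since each POST carries a single message and delivery is not retried, a transcript may have gaps if the endpoint was briefly unavailable.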

Architecture

Daemon (monitoring loop)
  ↓ (threshold breach detected)
Agent SDK (Claude diagnostics)
  ↓ (immediate streaming)
Webhook Endpoint (your server)

Key Design Decisions:

First-breach-only: When multiple thresholds breach at once, only the first in priority order (memory, then disk, then CPU) triggers an agent
Reset on completion: Agent finish resets daemon state, allowing new breaches to trigger
Fire-and-forget webhooks: No retries - webhook endpoint must be reliable
API key in config: anthropicApiKey is stored in the config file and set in the environment before SDK calls
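
The first-breach-only rule can be sketched as a priority scan over the configured thresholds. Everything named here (`firstBreach`, `memoryPercentFromAvailable`, `PRIORITY`) is hypothetical, and the memory formula is an assumption about how a percentage based on "available" memory would be derived; the README states only that available memory is used and that a threshold triggers when usage exceeds it.

```typescript
type ThresholdName = "memoryPercent" | "diskPercent" | "cpuPercent";
type Readings = Record<ThresholdName, number>;
type Limits = Partial<Record<ThresholdName, number>>;

// Documented priority: memory, then disk, then CPU.
const PRIORITY: ThresholdName[] = ["memoryPercent", "diskPercent", "cpuPercent"];

// Assumed derivation: memory that is not "available" counts as in use,
// so Linux buffer/cache does not inflate the figure.
function memoryPercentFromAvailable(totalBytes: number, availableBytes: number): number {
  return ((totalBytes - availableBytes) / totalBytes) * 100;
}

// Return the highest-priority breached threshold, or null if none is breached.
function firstBreach(readings: Readings, limits: Limits): ThresholdName | null {
  for (const name of PRIORITY) {
    const limit = limits[name];
    // "Exceeds" is read as strictly greater than the configured value.
    if (limit !== undefined && readings[name] > limit) {
      return name;
    }
  }
  return null;
}
```

With readings of 95% memory and 99% disk against limits of 90 and 85, this returns `"memoryPercent"`: the disk breach is real but does not start a second agent.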

Development

Run tests:

npm test

Watch mode:

npm run test:watch

Build:

npm run build

Troubleshooting

Daemon not starting:

Check that anthropicApiKey is valid in config
Verify webhook URL is reachable

No agent triggering:

Check threshold values - may need to lower them for testing
Review daemon logs for system metrics

Webhook not receiving data:

Test webhook endpoint separately
Check firewall/network settings
Remember: no retries, so endpoint must be reliable

License

MIT