# node-eventloop-watchdog

Production watchdog that detects Node.js event loop stalls and can trigger recovery actions before your app silently freezes.
## Why This Exists
Most Node monitoring tells you the event loop is slow. That is useful, but it does not answer the production question:
> If the event loop is blocked, then what happens?
node-eventloop-watchdog is a small production safety layer for that exact moment. It can log, emit events, call your handler, post a webhook, exit, or terminate a stuck process so a supervisor such as Kubernetes, systemd, PM2, Docker, or a platform runtime can restart it.
## What Makes It Different
| Tool category | What it usually does | Limitation |
|---|---|---|
| Event loop metrics | Tracks lag, averages, percentiles | Tells you something is wrong, but does not act |
| Native watchdogs | Kill or supervise the process | Often require native dependencies or separate setup |
| Simple timers | Detect lag after the loop resumes | Cannot handle a loop that never comes back |
| node-eventloop-watchdog | Detects stalls, adds context, and can act | Zero runtime dependencies, opt-in recovery |
## Ecosystem
node-eventloop-watchdog is part of a small Node.js observability ecosystem you can adopt independently or together:
- node-actuator-lite — Spring Boot-style `/actuator/health`, `/info`, `/metrics`, `/env`, `/threaddump`, `/heapdump`, and `/prometheus` endpoints.
- node-eventloop-watchdog — Detects event-loop stalls, captures stack traces and hotspots, and triggers recovery.
- node-request-trace — Per-request timelines, browser dashboard, and CLI without OpenTelemetry.
When all three are installed:
- This watchdog automatically registers `/actuator/eventloop`, `/actuator/eventloop/history`, `/actuator/eventloop/hotspots`, and `/actuator/eventloop/metrics` under node-actuator-lite.
- Block events include the active request id, route, and method captured by node-request-trace.

Runnable example: `node-actuator-lite/examples/ecosystem`.
Quickest setup: Use node-observability-lite to wire the three packages together with production-safe presets in one line.

```js
const observability = require('node-observability-lite');

observability.express(app, {
  preset: 'production',
  auth: req => req.get('authorization') === `Bearer ${process.env.OPS_TOKEN}`,
});
```
## Install
```bash
npm install node-eventloop-watchdog
```

CommonJS and bundled TypeScript declarations are included.
```js
const watchdog = require('node-eventloop-watchdog');
```

## Quick Start: Observe Mode
Use `start()` when you want safe, backwards-compatible monitoring. It logs event loop block events and keeps history, metrics, hotspots, and request context.
```js
const watchdog = require('node-eventloop-watchdog');

watchdog.start();
```

When a block crosses the threshold, you get a structured event:
```
[node-eventloop-watchdog] [WARN] Event Loop Blocked
Duration: 142ms
Severity: warning
Threshold: 50ms
Action: log
Route: POST /checkout
Suspected Blocking Operation
JSON.stringify
Location
checkoutService.js:84
```

## Production Mode: Protect
Use `protect()` when you want opinionated production behavior. It enables recovery defaults designed for apps already managed by a process supervisor.
```js
const watchdog = require('node-eventloop-watchdog');

watchdog.protect();
```

Default protection behavior:
| Trigger | Default action |
|---|---|
| Event loop lag >= 100ms | Log warning, record metrics, emit block event |
| Event loop lag >= 500ms | Mark event critical and terminate with SIGTERM |
| Main event loop never resumes for 1000ms | Worker-backed hard watchdog terminates with SIGTERM |
The intended production pattern is simple: the watchdog terminates the unhealthy process, and your supervisor restarts it.
```js
watchdog.protect({
recovery: {
action: 'kill',
signal: 'SIGTERM',
hardTimeout: 1000
}
});
```

## Brutal Demo
This demo intentionally freezes the main event loop forever. A normal timer-based monitor cannot recover from this because the timer callback never runs. `protect()` also starts a worker-backed hard watchdog, so the process can still be terminated.
```bash
node examples/brutal-demo.js
```

```js
const watchdog = require('node-eventloop-watchdog');
watchdog.protect({
criticalThreshold: 100,
recovery: {
enabled: true,
action: 'kill',
hardTimeout: 500,
signal: 'SIGTERM'
}
});
setTimeout(() => {
while (true) {}
}, 2000);
```

Expected output:
```
Watchdog armed. This process will freeze in 2 seconds.
Expected result: the hard watchdog logs the stall and terminates the process.
[node-eventloop-watchdog] [ERROR] Event loop hard-stalled for 500ms. Action: kill
Terminated: 15
```

## Trigger To Action
You can choose the action that matches your runtime:
| Action | What happens | Good for |
|---|---|---|
| log | Record and log the event only | Local dev, dashboards, low-risk rollout |
| callback | Call recovery.handler(event) | Custom alerting or diagnostics |
| webhook | POST the event as JSON | Alertmanager, incident bots, automation |
| exit | Stop the monitor and call process.exit(exitCode) | Graceful process-manager restart |
| kill | Send a signal to the process | Kubernetes, systemd, PM2, Docker restart |
| abort | Hard watchdog aborts the process | Core dumps and severe failure analysis |
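For a graceful process-manager restart, the exit action can be wired up like this (a minimal sketch built only from the options documented in the configuration table below):

```js
const watchdog = require('node-eventloop-watchdog');

// Sketch: on a critical block, stop the monitor and exit non-zero
// so a supervisor (PM2, systemd, Kubernetes) restarts the process.
watchdog.start({
  criticalThreshold: 500,
  recovery: {
    enabled: true,
    minSeverity: 'critical',
    action: 'exit',
    exitCode: 1
  }
});
```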
Webhook action:

```js
watchdog.start({
warningThreshold: 100,
criticalThreshold: 500,
recovery: {
enabled: true,
minSeverity: 'critical',
action: 'webhook',
webhookUrl: 'https://alerts.example.com/event-loop-block'
}
});
```

Callback action:

```js
watchdog.start({
recovery: {
enabled: true,
action: 'callback',
handler(event) {
pagerDuty.alert({
summary: `Event loop blocked for ${event.duration}ms`,
route: event.request?.route,
location: event.location
});
}
}
});
```

## Real Problems This Solves
- Infinite loops that leave a Node process alive but useless.
- CPU-heavy synchronous code blocking requests.
- Large JSON serialization or parsing on hot paths.
- Synchronous filesystem, crypto, compression, or child-process calls in request handlers.
- Stuck production servers that pass process liveness checks but stop serving traffic.
- Incidents where you need recent block history, request correlation, and likely hotspots after recovery.
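For instance, a route handler like this hypothetical one serializes a large payload synchronously and would surface as a JSON.stringify hotspot:

```js
const express = require('express');
const app = express();

// Hypothetical hot path: buildHugeReport() stands in for any code that
// returns a very large object; JSON.stringify then blocks the event loop
// for the full duration of the serialization.
app.get('/report', (req, res) => {
  const report = buildHugeReport();
  res.type('json').send(JSON.stringify(report));
});
```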
## API
### watchdog.start(config?)
Starts observe mode. This is the safest default for adding visibility without changing process lifecycle behavior.
```js
watchdog.start({
warningThreshold: 50,
criticalThreshold: 100,
captureStackTrace: true,
historySize: 50,
enableMetrics: true,
detectBlockingPatterns: true,
checkInterval: 20,
logLevel: 'warn',
jsonLogs: false,
onBlock: null,
recovery: false
});
```

### watchdog.protect(config?)
Starts protect mode with opinionated recovery defaults.
```js
watchdog.protect({
warningThreshold: 100,
criticalThreshold: 500,
recovery: {
action: 'kill',
hardTimeout: 1000,
signal: 'SIGTERM'
}
});
```

### watchdog.stop()
Stops monitoring and disables the hard watchdog worker.
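A graceful-shutdown sketch; `server` is assumed to be your HTTP server instance:

```js
// Stop the watchdog first so the hard watchdog worker cannot
// terminate the process while in-flight requests drain.
process.on('SIGTERM', () => {
  watchdog.stop();
  server.close(() => process.exit(0));
});
```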
### watchdog.on('block', listener)
Subscribe to block events.
```js
watchdog.on('block', (event) => {
console.log(event.duration, event.severity, event.action.type);
});
```

### watchdog.getStats()
Returns runtime state, lag metrics, memory snapshot, and active mode.
```js
watchdog.getStats();
// {
// avgLag: 12,
// maxLag: 121,
// minLag: 1,
// totalBlocks: 14,
// blocksLastMinute: 6,
// running: true,
// config: { mode: 'protect', warningThreshold: 100, criticalThreshold: 500, recoveryAction: 'kill' },
// memory: { heapUsed: 42, heapTotal: 64, rss: 91, external: 2, arrayBuffers: 1 }
// }
```

### watchdog.getRecentBlocks(count?)
Returns the most recent blocking events.
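A usage sketch, assuming the event shape shown in the `'block'` listener example above:

```js
// Inspect the last five blocking events after an incident.
for (const event of watchdog.getRecentBlocks(5)) {
  console.log(event.duration, event.severity, event.location);
}
```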
### watchdog.getBlockingHotspots(limit?)
Returns best-effort user-code locations captured when blocks were detected.
```js
watchdog.getBlockingHotspots();
// [
// { file: 'reportService.js', line: 142, blocks: 18, maxLag: 221, avgLag: 145 },
// { file: 'orderController.js', line: 51, blocks: 7, maxLag: 94, avgLag: 62 }
// ]
```

### watchdog.middleware()
Returns Connect / Express-style middleware for request correlation.
```js
const express = require('express');
const watchdog = require('node-eventloop-watchdog');
const app = express();
watchdog.start();
app.use(watchdog.middleware());
app.post('/checkout', (req, res) => {
res.json({ ok: true });
});
```

## Configuration
| Option | Type | Default | Description |
|---|---|---|---|
| mode | 'observe' \| 'protect' | 'observe' | Runtime posture |
| warningThreshold | number | 50 | Lag in ms before warning |
| criticalThreshold | number | 100 | Lag in ms before critical event |
| captureStackTrace | boolean | true | Capture best-effort stack context |
| historySize | number | 50 | Max blocking events retained |
| enableMetrics | boolean | true | Collect lag and memory metrics |
| detectBlockingPatterns | boolean | true | Identify likely sync blocking patterns |
| checkInterval | number | 20 | Poll interval in ms |
| logLevel | string | 'warn' | debug, info, warn, error, or silent |
| jsonLogs | boolean | false | Emit JSON logs |
| onBlock | function | null | Callback for every block |
| recovery.enabled | boolean | false | Enable recovery actions |
| recovery.action | string | 'log' | log, callback, webhook, exit, kill, or abort |
| recovery.minSeverity | string | 'critical' | Minimum severity before action runs |
| recovery.hardTimeout | number | 0 | Worker-backed timeout for never-returning stalls |
| recovery.signal | string | 'SIGTERM' | Signal for kill action |
| recovery.exitCode | number | 1 | Exit code for exit action |
| recovery.webhookUrl | string | null | URL for webhook action |
| recovery.handler | function | null | Function for callback action |
## Blocking Pattern Hints
The watchdog looks for common synchronous patterns in captured stack context:
| Pattern | Category |
|---|---|
| JSON.stringify / JSON.parse | Serialization |
| fs.readFileSync, fs.writeFileSync, etc. | Sync filesystem |
| crypto.pbkdf2Sync, crypto.scryptSync, crypto.createHash | Sync crypto |
| zlib.*Sync | Sync compression |
| child_process.execSync, spawnSync | Sync child process |
| RegExp.exec | Regex backtracking |
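These hints point at likely culprits; the usual remedy is moving the synchronous work off the hot path. A generic sketch (not part of this package) for the sync-filesystem case:

```js
// Blocking: const text = require('fs').readFileSync('report.txt', 'utf8');

// Non-blocking alternative: the read happens off the event loop.
const { readFile } = require('fs/promises');

async function loadReport() {
  return readFile('report.txt', 'utf8');
}
```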
## Important Attribution Note
Timer-based lag detection runs after the event loop resumes. Stack traces, `location`, `userFrame`, and hotspots are therefore best-effort context captured around detection time, not guaranteed blame for the exact blocking line.

For a loop that never resumes, enable `recovery.hardTimeout` through `protect()` or explicit recovery config. The hard watchdog runs in a worker thread and can terminate the process even when the main event loop is permanently stuck.
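An explicit-config sketch using only options from the configuration table above:

```js
// Worker-backed hard watchdog without protect() defaults: fires
// even if the main event loop never resumes.
watchdog.start({
  recovery: {
    enabled: true,
    action: 'kill',
    signal: 'SIGTERM',
    hardTimeout: 2000
  }
});
```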
## Integrations
### JSON Logs
```js
watchdog.start({ jsonLogs: true });
```

### node-request-trace
If node-request-trace is installed, active request data is automatically attached to block events.
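A listener sketch, assuming the request fields listed in the Ecosystem section (id, route, method):

```js
watchdog.on('block', (event) => {
  // event.request is attached only when node-request-trace is active.
  if (event.request) {
    console.log(`Blocked during ${event.request.method} ${event.request.route}`);
  }
});
```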
### node-actuator-lite
If node-actuator-lite is installed, these endpoints are registered automatically:
| Endpoint | Description |
|---|---|
| GET /actuator/eventloop | Status, metrics, top hotspots |
| GET /actuator/eventloop/history | Recent blocking events |
| GET /actuator/eventloop/hotspots | Hotspot ranking |
| GET /actuator/eventloop/metrics | Lag and memory metrics |
## Operational Guidance
- Use `start()` first when rolling out to an existing app.
- Use `protect()` when the app runs under a supervisor that restarts failed processes.
- Keep `hardTimeout` comfortably above normal CPU spikes to avoid killing legitimate long work.
- Prefer `SIGTERM` for graceful runtime restarts; use `abort` only when you need crash diagnostics.
- Run `npm run bench` against your own workload if overhead matters.
## Development
```bash
npm ci
npm run lint
npm run typecheck
npm test
npm run test:coverage:check
```

The CI gate requires at least 90% coverage across statements, branches, functions, and lines.
## License
MIT
