Heartbeat monitoring with automatic restart, exponential backoff, and escalation.
SelfHealer
The SelfHealer monitors registered processes via heartbeats and error rates. When a process degrades or fails, it automatically diagnoses the issue, restarts the process with exponential backoff, and escalates to alerts when restart attempts are exhausted.
import { SelfHealer } from '@codmir/cortex';
import type { RestartFn } from '@codmir/cortex';
Constructor
new SelfHealer(config?: Partial<SelfHealerConfig>)
SelfHealerConfig
interface SelfHealerConfig {
  heartbeatExpectedMs: number; // default: 15000
  heartbeatDriftThresholdMs: number; // default: 5000
  maxConsecutiveFailures: number; // default: 3
  restartBackoffMs: number; // default: 5000
  maxRestartAttempts: number; // default: 5
  degradedErrorRateThreshold: number; // default: 0.1
  unhealthyErrorRateThreshold: number; // default: 0.5
  checkIntervalMs: number; // default: 10000
}
| Property | Default | Description |
|---|---|---|
| heartbeatExpectedMs | 15000 | Expected interval between heartbeats |
| heartbeatDriftThresholdMs | 5000 | Drift above expected that triggers degraded |
| maxConsecutiveFailures | 3 | Consecutive failures before unhealthy |
| restartBackoffMs | 5000 | Base backoff for restart attempts |
| maxRestartAttempts | 5 | Max restarts before escalation to alert |
| degradedErrorRateThreshold | 0.1 | Error rate (0-1) threshold for degraded |
| unhealthyErrorRateThreshold | 0.5 | Error rate threshold for unhealthy |
| checkIntervalMs | 10000 | Default healing check interval |
Methods
register()
Registers a process for monitoring. Initializes it as healthy with zero drift and no failures.
register(pid: number): void
healer.register(1001);
healer.register(1002);
unregister()
Removes a process from monitoring.
unregister(pid: number): void
heartbeat()
Records a heartbeat for a process, resetting its consecutive failure count and updating the last heartbeat timestamp.
heartbeat(pid: number): void
// Call periodically from each worker
healer.heartbeat(1001);
recordRequest()
Records a request outcome for a process, updating the EMA response time and error rate.
recordRequest(pid: number, durationMs: number, success: boolean): void
| Parameter | Type | Description |
|---|---|---|
| pid | number | Process identifier |
| durationMs | number | Request duration in milliseconds |
| success | boolean | Whether the request succeeded |
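As a rough picture of what an EMA update for response time and error rate can look like, here is a minimal standalone sketch. The smoothing factor `ALPHA`, the `RequestStats` shape, and the `applyRequest` helper are all assumptions made for illustration. This page does not state how SelfHealer weights new samples, or whether failed requests count toward the response-time average.

```typescript
// Hypothetical smoothing factor; the value SelfHealer actually uses is not documented here.
const ALPHA = 0.2;

// Hypothetical shape holding the two running averages.
interface RequestStats {
  avgResponseMs: number; // EMA of request duration in milliseconds
  errorRate: number;     // EMA of failures, in the range 0-1
}

// Blend one request outcome into the running averages.
function applyRequest(
  stats: RequestStats,
  durationMs: number,
  success: boolean,
): RequestStats {
  const failure = success ? 0 : 1;
  return {
    avgResponseMs: ALPHA * durationMs + (1 - ALPHA) * stats.avgResponseMs,
    errorRate: ALPHA * failure + (1 - ALPHA) * stats.errorRate,
  };
}

let stats: RequestStats = { avgResponseMs: 0, errorRate: 0 };
stats = applyRequest(stats, 250, true); // one successful request
stats = applyRequest(stats, 0, false);  // one failed request
console.log(stats);
```

With an EMA, a single failure moves the error rate by only a fraction of the way toward 1, so short bursts of errors are smoothed out while sustained failures still push the rate past the configured thresholds.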
healer.recordRequest(1001, 250, true);
healer.recordRequest(1001, 0, false); // failure
onRestart()
Registers the restart function called when the healer decides to restart a process.
onRestart(fn: RestartFn): void
type RestartFn = (pid: number) => Promise<boolean>;
healer.onRestart(async (pid) => {
  await killProcess(pid);
  await spawnProcess(pid);
  return true; // return false if restart failed
});
check()
Runs a full healing cycle. For each registered process, diagnoses its health status. If the status changed, takes the appropriate action (restart or alert) and returns the resulting healing events.
async check(): Promise<HealingEvent[]>
Restart uses exponential backoff: restartBackoffMs * 2^(restartCount - 1), capped at 30 seconds.
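To make the backoff formula concrete, the sketch below computes the delay for each restart attempt. The helper name `restartDelayMs` is hypothetical, and it assumes `restartCount` is 1 for the first restart attempt; neither detail is confirmed by this page.

```typescript
const MAX_BACKOFF_MS = 30_000; // the 30-second cap stated above

// Hypothetical helper. Assumes restartCount is 1 for the first restart attempt.
function restartDelayMs(restartBackoffMs: number, restartCount: number): number {
  const delay = restartBackoffMs * 2 ** (restartCount - 1);
  return Math.min(delay, MAX_BACKOFF_MS);
}

// Delays for the default restartBackoffMs (5000) across the default 5 attempts.
const delays = [1, 2, 3, 4, 5].map((n) => restartDelayMs(5000, n));
console.log(delays); // [5000, 10000, 20000, 30000, 30000]
```

Under these assumptions the cap takes effect from the fourth attempt onward with the default base of 5000 ms.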
const events = await healer.check();
for (const event of events) {
  console.log(`[${event.pid}] ${event.previousStatus} -> ${event.newStatus}: ${event.action} (${event.reason})`);
}
getHealth()
Returns the health status for a single process.
getHealth(pid: number): ProcessHealth | undefined
getAllHealth()
Returns the health status for all registered processes.
getAllHealth(): ProcessHealth[]
getEventLog()
Returns the most recent healing events.
getEventLog(limit?: number): HealingEvent[]
| Parameter | Type | Description |
|---|---|---|
| limit | number (optional) | Max events to return (default: 50) |
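One way to picture the `limit` behavior is a helper that returns the tail of an append-only log. The `mostRecent` function below is a hypothetical stand-in, not the library's implementation, and it assumes events are stored oldest-first. The order in which `getEventLog()` returns events is not specified on this page.

```typescript
// Hypothetical helper: return at most `limit` of the newest entries,
// assuming the log array is stored oldest-first.
function mostRecent<T>(log: readonly T[], limit = 50): T[] {
  // slice(-0) would return the whole array, so non-positive limits are handled explicitly.
  return limit > 0 ? log.slice(-limit) : [];
}

// Example log of 120 numbered entries.
const sampleLog = Array.from({ length: 120 }, (_, i) => ({ seq: i }));
console.log(mostRecent(sampleLog).length);     // 50
console.log(mostRecent(sampleLog, 10)[0].seq); // 110
```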
flush()
Clears all registered processes and the event log.
flush(): void
Types
HealthStatus
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy' | 'dead';
Diagnosis rules:
| Condition | Status |
|---|---|
| Heartbeat drift > heartbeatExpectedMs * 3 | dead |
| Consecutive failures >= maxConsecutiveFailures | unhealthy |
| Error rate > unhealthyErrorRateThreshold | unhealthy |
| Heartbeat drift > expected + threshold, or error rate > degradedErrorRateThreshold | degraded |
| Otherwise | healthy |
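The diagnosis rules can be sketched as a single function. This is an illustration and not the library's code; the `diagnose` function, the `Sample` and `Thresholds` shapes, and the `DEFAULTS` constant are hypothetical names. It makes two assumptions: the rules are evaluated top to bottom so the most severe match wins, and "expected + threshold" means `heartbeatExpectedMs + heartbeatDriftThresholdMs`.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy' | 'dead';

// Subset of SelfHealerConfig that the rules refer to.
interface Thresholds {
  heartbeatExpectedMs: number;
  heartbeatDriftThresholdMs: number;
  maxConsecutiveFailures: number;
  degradedErrorRateThreshold: number;
  unhealthyErrorRateThreshold: number;
}

// Hypothetical snapshot of the measurements the rules compare.
interface Sample {
  heartbeatDriftMs: number;
  consecutiveFailures: number;
  errorRate: number;
}

// Documented defaults from SelfHealerConfig.
const DEFAULTS: Thresholds = {
  heartbeatExpectedMs: 15000,
  heartbeatDriftThresholdMs: 5000,
  maxConsecutiveFailures: 3,
  degradedErrorRateThreshold: 0.1,
  unhealthyErrorRateThreshold: 0.5,
};

// Assumption: rules are checked in table order, most severe first.
function diagnose(s: Sample, t: Thresholds = DEFAULTS): HealthStatus {
  if (s.heartbeatDriftMs > t.heartbeatExpectedMs * 3) return 'dead';
  if (s.consecutiveFailures >= t.maxConsecutiveFailures) return 'unhealthy';
  if (s.errorRate > t.unhealthyErrorRateThreshold) return 'unhealthy';
  const driftLimit = t.heartbeatExpectedMs + t.heartbeatDriftThresholdMs; // assumed reading
  if (s.heartbeatDriftMs > driftLimit || s.errorRate > t.degradedErrorRateThreshold) {
    return 'degraded';
  }
  return 'healthy';
}

console.log(diagnose({ heartbeatDriftMs: 25000, consecutiveFailures: 0, errorRate: 0 }));
```

With the defaults and these assumptions, a drift above 20000 ms would count as degraded and a drift above 45000 ms as dead.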
HealingAction
type HealingAction = 'none' | 'restart' | 'migrate' | 'scale_down' | 'alert';
Action selection:
| Status | Restarts Remaining | Action |
|---|---|---|
| healthy | any | none |
| degraded | any | none |
| unhealthy | yes | restart |
| unhealthy | no | alert |
| dead | yes | restart |
| dead | no | alert |
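The action table reduces to a small decision function, sketched below. The `selectAction` name and its `restartCount` parameter are hypothetical, and the sketch assumes "restarts remaining" means the number of restarts already attempted is below `maxRestartAttempts`.

```typescript
type HealthStatus = 'healthy' | 'degraded' | 'unhealthy' | 'dead';
type HealingAction = 'none' | 'restart' | 'migrate' | 'scale_down' | 'alert';

// Hypothetical helper mirroring the action selection table.
// Assumption: restarts remain while restartCount < maxRestartAttempts.
function selectAction(
  status: HealthStatus,
  restartCount: number,
  maxRestartAttempts = 5,
): HealingAction {
  if (status === 'healthy' || status === 'degraded') return 'none';
  return restartCount < maxRestartAttempts ? 'restart' : 'alert';
}

console.log(selectAction('unhealthy', 2)); // restart
console.log(selectAction('dead', 5));      // alert
```

The table never selects `migrate` or `scale_down`, so the sketch never returns them either.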
ProcessHealth
interface ProcessHealth {
  pid: number;
  status: HealthStatus;
  heartbeatDriftMs: number;
  errorRate: number;
  avgResponseMs: number;
  consecutiveFailures: number;
  lastAction: HealingAction;
  lastActionAt: number;
}
HealingEvent
interface HealingEvent {
  pid: number;
  action: HealingAction;
  reason: string;
  previousStatus: HealthStatus;
  newStatus: HealthStatus;
  timestamp: number;
  success: boolean;
}
Example
import { SelfHealer } from '@codmir/cortex';
const healer = new SelfHealer({
  heartbeatExpectedMs: 10_000,
  maxConsecutiveFailures: 2,
  maxRestartAttempts: 3,
  restartBackoffMs: 2000,
});
// Register restart handler
healer.onRestart(async (pid) => {
  console.log(`Restarting process ${pid}...`);
  // Restart logic here
  return true;
});
// Register processes
healer.register(1);
healer.register(2);
// Simulate healthy heartbeats
healer.heartbeat(1);
healer.heartbeat(2);
// Simulate failures on process 2
healer.recordRequest(2, 0, false);
healer.recordRequest(2, 0, false);
// Run healing cycle
const events = await healer.check();
for (const e of events) {
  console.log(`PID ${e.pid}: ${e.action} - ${e.reason}`);
  // PID 2: restart - 2 consecutive failures
}
// Inspect health
const health = healer.getAllHealth();
for (const h of health) {
  console.log(`PID ${h.pid}: ${h.status}, error rate: ${(h.errorRate * 100).toFixed(1)}%`);
}
// View event log
const log = healer.getEventLog(10);
console.log(`${log.length} healing events recorded`);