Incident Runbooks

Use this guide for production incident response across Excalibur workloads.

Severity Model

Severity	Definition	Initial Response Target
Sev 1	full outage, data-loss risk, or halt in critical message processing	15 minutes
Sev 2	major degradation without total outage	30 minutes
Sev 3	localized degradation or non-critical impact	1 business day
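
The response targets in the table can be encoded so that tooling can flag a late initial response. The following is a minimal sketch, not part of Excalibur: the names `RESPONSE_TARGETS`, `response_deadline`, and `is_overdue` are hypothetical, and "1 business day" is approximated as 24 hours because the table does not define a business calendar.

```python
from datetime import datetime, timedelta

# Initial response targets taken from the severity table.
# Assumption: "1 business day" is approximated as 24 hours for illustration;
# a real implementation would need a business-calendar rule.
RESPONSE_TARGETS = {
    "Sev 1": timedelta(minutes=15),
    "Sev 2": timedelta(minutes=30),
    "Sev 3": timedelta(days=1),
}


def response_deadline(severity: str, declared_at: datetime) -> datetime:
    """Return the latest time by which the initial response is due."""
    try:
        return declared_at + RESPONSE_TARGETS[severity]
    except KeyError:
        raise ValueError(f"unknown severity: {severity!r}") from None


def is_overdue(severity: str, declared_at: datetime, now: datetime) -> bool:
    """Return True if the initial response target has already passed."""
    return now > response_deadline(severity, declared_at)
```

For example, a Sev 1 incident declared at 12:00 has a response deadline of 12:15, and `is_overdue` returns `True` for any later timestamp.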

Ownership

Area	Primary Owner	Escalation
Dispatch core runtime	Platform team	Architecture lead
Transport/provider failures	Platform + provider owner	Release engineer
Security/compliance incidents	Compliance/security owner	Security lead

Standard Response Flow

1. Declare incident and assign incident commander.
2. Identify blast radius (message types, transports, tenants).
3. Stabilize (pause/scale/reroute as needed).
4. Roll back or patch.
5. Verify recovery using health metrics and queue/DLQ indicators.
6. Publish post-incident actions.

Common Playbooks

Transport backlog surge

check queue depth and lag,
verify consumer health and recent config/deploy changes,
scale consumers or throttle producers,
roll back recent runtime/transport changes if there is no recovery trend.
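
The "no recovery trend" decision in this playbook can be made explicit. The sketch below is one possible rule, assuming queue depth is sampled at regular intervals: it treats a drop of at least 10% from the peak to the latest sample as a recovery trend. The function names and the threshold are hypothetical and should be tuned per transport.

```python
from typing import Sequence


def has_recovery_trend(depths: Sequence[int], min_drop_ratio: float = 0.1) -> bool:
    """Return True if the latest depth is at least min_drop_ratio below the peak."""
    if len(depths) < 2:
        return False  # not enough samples to call a trend
    peak = max(depths)
    if peak == 0:
        return True  # queue is empty; nothing left to recover
    return (peak - depths[-1]) / peak >= min_drop_ratio


def next_action(depths: Sequence[int], mitigated: bool) -> str:
    """Suggest the next playbook step.

    mitigated: True once consumers were scaled or producers throttled.
    """
    if has_recovery_trend(depths):
        return "monitor"
    if not mitigated:
        return "scale consumers or throttle producers"
    return "roll back recent runtime/transport changes"
```

With samples `[100, 200, 300]` the backlog is still growing, so the suggestion is to scale or throttle first and to roll back only if the same picture persists after mitigation.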

Dead-letter spike

identify the dominant failure reason,
validate retry/poison policy config,
patch the root cause and replay the DLQ in controlled batches.
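
Two of these steps lend themselves to a small script: finding the dominant failure reason and replaying in controlled batches. The sketch below assumes dead-lettered messages have been exported as dictionaries with a `reason` field and that a `replay` callback returns `True` on success. Both are hypothetical, as are the batch size and the 5% failure threshold that halts the replay.

```python
from collections import Counter
from itertools import islice
from typing import Callable, Iterable, Iterator, Tuple


def dominant_failure_reason(messages: Iterable[dict]) -> Tuple[str, float]:
    """Return the most common failure reason and its share of the sample."""
    counts = Counter(m.get("reason", "unknown") for m in messages)
    if not counts:
        raise ValueError("no dead-lettered messages to analyse")
    reason, n = counts.most_common(1)[0]
    return reason, n / sum(counts.values())


def batches(items: Iterable, size: int) -> Iterator[list]:
    """Yield consecutive lists of at most `size` items."""
    if size < 1:
        raise ValueError("size must be at least 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch


def replay_in_batches(
    messages: Iterable,
    replay: Callable[[object], bool],
    batch_size: int = 100,
    max_failure_ratio: float = 0.05,
) -> int:
    """Replay batch by batch and stop after the first batch that fails too often.

    Returns the number of messages replayed successfully.
    """
    replayed = 0
    for batch in batches(messages, batch_size):
        ok = sum(1 for m in batch if replay(m))
        replayed += ok
        if (len(batch) - ok) / len(batch) > max_failure_ratio:
            break  # root cause is probably not fixed; stop before refilling the DLQ
    return replayed
```

Stopping on the first bad batch keeps a premature replay from pushing the whole DLQ back through a consumer that still fails.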

Cancellation/timeout regression

verify cancellation-token propagation from the HTTP/job trigger to the dispatcher and transport calls,
compare behavior with the last known-good release,
roll back if leaked work continues after cancellation.
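
A propagation check of this kind can be reproduced in a small test harness. The sketch below is an analogy only: it uses Python's `asyncio` task cancellation in place of whatever token mechanism the workload actually uses, and `transport_call`, `dispatch`, and `trigger_with_timeout` are hypothetical stand-ins for the real trigger, dispatcher, and transport layers.

```python
import asyncio


async def transport_call(log: list) -> None:
    """Stand-in for a slow transport call that honours cancellation."""
    try:
        await asyncio.sleep(10)
        log.append("transport finished")
    except asyncio.CancelledError:
        log.append("transport cancelled")
        raise  # re-raise so the cancellation keeps propagating upward


async def dispatch(log: list) -> None:
    """Stand-in for the dispatcher; it simply awaits the transport call."""
    await transport_call(log)


async def trigger_with_timeout(timeout: float) -> list:
    """Stand-in for an HTTP/job trigger that enforces a timeout."""
    log: list = []
    try:
        await asyncio.wait_for(dispatch(log), timeout)
    except asyncio.TimeoutError:
        log.append("trigger timed out")
    # Short grace period: leaked work would still be running and could
    # append to the log after the trigger has already given up.
    await asyncio.sleep(0.05)
    return log


if __name__ == "__main__":
    print(asyncio.run(trigger_with_timeout(0.01)))
```

If the transport layer swallowed the cancellation instead of re-raising it, the log would lack the "transport cancelled" entry or would later gain "transport finished", which is the leaked-work symptom this playbook treats as grounds for rollback.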

Post-Incident Requirements

For Sev 1 and Sev 2 incidents:

create and track remediation issues,
add regression tests for the failure mode the incident uncovered,
update runbooks and reliability docs if behavior changed.

See Also
