SLO, SLI, and Telemetry
This page defines baseline production objectives and the telemetry required to operate Excalibur safely.
Suggested SLO Baseline
| Objective | Target |
|---|---|
| Dispatch success rate | >= 99.9% over rolling 30 days |
| Local dispatch p95 latency | <= 5 ms |
| Transport dispatch p95 latency | <= 100 ms (provider dependent) |
| Dead-letter growth | no sustained growth in steady state |
| Queue lag recovery | return to baseline within 15 minutes after spikes |
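To make the baseline easier to check automatically, the targets can be held in a small data structure. The following is a minimal sketch, not part of Excalibur: the names `SloBaseline`, `allowed_failures`, and `violations` and the keys of the `measured` dictionary are hypothetical, and only the numeric targets come from the table above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SloBaseline:
    """Targets copied from the baseline table; field names are hypothetical."""
    success_rate: float = 0.999        # dispatch success over rolling 30 days
    local_p95_ms: float = 5.0          # local dispatch p95 latency
    transport_p95_ms: float = 100.0    # transport dispatch p95 (provider dependent)
    lag_recovery_minutes: float = 15.0 # time to return to baseline after a spike


def allowed_failures(total_dispatches: int, slo: SloBaseline = SloBaseline()) -> int:
    """Failed dispatches the success-rate target tolerates for a given volume."""
    return round(total_dispatches * (1.0 - slo.success_rate))


def violations(measured: dict, slo: SloBaseline = SloBaseline()) -> list:
    """Return the names of the baseline objectives that the measured values miss."""
    missed = []
    if measured["success_rate"] < slo.success_rate:
        missed.append("success_rate")
    if measured["local_p95_ms"] > slo.local_p95_ms:
        missed.append("local_p95_ms")
    if measured["transport_p95_ms"] > slo.transport_p95_ms:
        missed.append("transport_p95_ms")
    return missed
```

For example, at 10,000,000 dispatches in a 30-day window, the 99.9% target leaves room for 10,000 failed dispatches.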
Required SLIs
| SLI | Formula / Meaning |
|---|---|
| Success rate | success_count / total_count |
| Latency p95/p99 | percentile latency by message type and route |
| Error budget burn | observed failure rate relative to the failure rate the SLO allows (1 - target) |
| Queue lag | age/depth of unprocessed transport messages |
| Dead-letter rate | dead-letter additions per minute |
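The first three SLIs can be computed from raw counters and latency samples roughly as follows. This is an illustrative sketch: the function names are hypothetical, the percentile uses the nearest-rank method (a metrics backend may interpolate differently), and treating an empty window as fully successful is an assumption.

```python
import math


def success_rate(success_count: int, total_count: int) -> float:
    """success_count / total_count; an empty window is treated as 1.0 (assumption)."""
    if total_count == 0:
        return 1.0
    return success_count / total_count


def percentile(samples: list, pct: float) -> float:
    """Nearest-rank percentile of latency samples, e.g. pct=95 for p95."""
    if not samples:
        raise ValueError("percentile of an empty sample set is undefined")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def burn_rate(success_count: int, total_count: int, slo_target: float = 0.999) -> float:
    """Observed failure rate divided by the failure rate the SLO allows.

    1.0 means the budget is being spent exactly as fast as the SLO permits;
    values above 1.0 mean it will run out before the window ends.
    """
    allowed = 1.0 - slo_target
    observed = 1.0 - success_rate(success_count, total_count)
    return observed / allowed
```

In practice, latency percentiles would be computed per message type and route, as the table specifies, by grouping samples on those dimensions before calling `percentile`.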
Required Telemetry Dimensions
Use consistent dimensions across metrics, traces, and logs:
- message.type
- route
- operation
- result
- error.type
- transport.name
- correlation.id
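One way to keep these dimensions consistent is to build them in a single place and validate them before emitting telemetry. The sketch below is hypothetical and not an Excalibur API: `build_dimensions` and `missing_dimensions` are illustrative names, and using an empty `error.type` for successful dispatches is an assumption.

```python
# Dimension keys as listed above; the helpers below are hypothetical.
REQUIRED_DIMENSIONS = (
    "message.type",
    "route",
    "operation",
    "result",
    "error.type",
    "transport.name",
    "correlation.id",
)


def build_dimensions(message_type, route, operation, result,
                     transport_name, correlation_id, error_type=""):
    """Return one attribute set to attach to a metric, span, or log record.

    error_type defaults to "" for successful dispatches (an assumption).
    """
    return {
        "message.type": message_type,
        "route": route,
        "operation": operation,
        "result": result,
        "error.type": error_type,
        "transport.name": transport_name,
        "correlation.id": correlation_id,
    }


def missing_dimensions(attributes):
    """List the required dimension keys absent from an attribute set."""
    return [key for key in REQUIRED_DIMENSIONS if key not in attributes]
```

Reusing the same dictionary for the metric tags, the span attributes, and the structured log fields keeps the three signals joinable on `correlation.id`.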
Minimum Alert Set
- error rate above threshold for a sustained window (for example > 2% for 5m),
- p95 latency above its SLO target for a sustained window (for example 10m),
- sustained dead-letter growth,
- sustained queue lag.
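The "sustained" conditions in this alert set can be evaluated over a series of periodic samples. The sketch below assumes one sample per minute and uses hypothetical function names. A real deployment would normally express the same conditions in the alerting system's own rule language.

```python
def sustained_breach(values, threshold, periods):
    """True when each of the last `periods` samples exceeds `threshold`.

    Example: a per-minute error rate with threshold=0.02 and periods=5
    corresponds to "> 2% for 5m".
    """
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(values) < periods:
        return False  # not enough history to call the breach sustained
    return all(v > threshold for v in values[-periods:])


def sustained_growth(depths, periods):
    """True when depth rose in each of the last `periods` intervals.

    Suitable for dead-letter queue depth or queue lag samples.
    """
    if periods <= 0:
        raise ValueError("periods must be positive")
    if len(depths) < periods + 1:
        return False
    recent = depths[-(periods + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

Requiring every sample in the window to breach avoids paging on a single spike, at the cost of firing only once the whole window has elapsed.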