
SLO, SLI, and Telemetry

This page defines baseline production objectives and the telemetry required to operate Excalibur safely.

Suggested SLO Baseline

| Objective | Target |
|---|---|
| Dispatch success rate | >= 99.9% over rolling 30 days |
| Local dispatch p95 latency | <= 5 ms |
| Transport dispatch p95 latency | <= 100 ms (provider dependent) |
| Dead-letter growth | no sustained growth in steady-state |
| Queue lag recovery | return to baseline within 15 minutes after spikes |
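To make the 99.9% success target concrete, it can help to translate it into a count of tolerated failures for the 30-day window. The sketch below is illustrative only: the function names `allowed_failures` and `budget_remaining` are hypothetical and not part of Excalibur, and the target is passed as a decimal string so the arithmetic stays exact.

```python
import math
from fractions import Fraction


def allowed_failures(total_dispatches: int, slo_target: str = "0.999") -> int:
    """Failed dispatches the SLO tolerates for the given volume.

    slo_target is a decimal string (assumption) so Fraction parses it exactly.
    """
    budget_fraction = 1 - Fraction(slo_target)
    return math.floor(total_dispatches * budget_fraction)


def budget_remaining(total_dispatches: int, failed: int, slo_target: str = "0.999") -> float:
    """Share of the error budget still unspent (negative once overspent)."""
    allowed = allowed_failures(total_dispatches, slo_target)
    if allowed == 0:
        # Assumption: with no budget at all, any failure exhausts it.
        return 0.0 if failed else 1.0
    return 1 - failed / allowed
```

For example, one million dispatches in the window leave room for 1,000 failures at a 99.9% target; after 250 failures, three quarters of the budget remains.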

Required SLIs

| SLI | Formula / Meaning |
|---|---|
| Success rate | success_count / total_count |
| Latency p95/p99 | percentile latency by message type and route |
| Error budget burn | failure trend against allowed SLO budget |
| Queue lag | age/depth of unprocessed transport messages |
| Dead-letter rate | dead-letter additions per minute |
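The first three SLIs can be computed from raw counters and latency samples. The following is a minimal sketch, not Excalibur's implementation: the helper names are hypothetical, the percentile uses the nearest-rank method (a metrics backend may interpolate differently), and "burn" is interpreted here as the observed failure rate divided by the rate the SLO allows.

```python
import math
from typing import Sequence


def success_rate(success_count: int, total_count: int) -> float:
    """success_count / total_count, as in the SLI table."""
    if total_count == 0:
        return 1.0  # Assumption: no traffic counts as meeting the objective.
    return success_count / total_count


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile of latency samples (pct in 1..100)."""
    if not samples_ms:
        raise ValueError("at least one sample is required")
    ordered = sorted(samples_ms)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[max(rank, 1) - 1]


def burn_rate(success_count: int, total_count: int, slo_target: float = 0.999) -> float:
    """Observed failure rate relative to the failure rate the SLO allows.

    A value above 1.0 means the error budget is being spent faster than planned.
    """
    observed_failure = 1.0 - success_rate(success_count, total_count)
    return observed_failure / (1.0 - slo_target)
```

In practice these would be evaluated per message type and route, so each call would receive the counters or samples for one combination of those dimensions.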

Required Telemetry Dimensions

Use consistent dimensions across metrics/traces/logs:

  • message.type
  • route
  • operation
  • result
  • error.type
  • transport.name
  • correlation.id
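One way to keep these dimensions consistent is to build them in a single place and reject anything outside the agreed set. The sketch below assumes a hypothetical helper, `telemetry_dimensions`, and a placeholder value of `"none"` for dimensions that do not apply (for example `error.type` on a successful dispatch); neither is defined by Excalibur.

```python
from typing import Dict, Mapping

# Dimension names taken from the list above.
REQUIRED_DIMENSIONS = (
    "message.type",
    "route",
    "operation",
    "result",
    "error.type",
    "transport.name",
    "correlation.id",
)


def telemetry_dimensions(values: Mapping[str, str]) -> Dict[str, str]:
    """Return a dict holding exactly the required dimensions.

    Unknown keys are rejected; missing keys get the placeholder "none"
    (an assumption made for this sketch).
    """
    unknown = set(values) - set(REQUIRED_DIMENSIONS)
    if unknown:
        raise ValueError(f"unknown dimensions: {sorted(unknown)}")
    return {name: values.get(name, "none") for name in REQUIRED_DIMENSIONS}
```

The same dictionary can then be attached to a metric, a trace span, and a log entry so that all three can be joined on identical keys.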

Minimum Alert Set

  1. error rate above threshold (for example, > 2% for 5m),
  2. p95 latency above target for a sustained period (for example, 10m),
  3. dead-letter growth sustained,
  4. queue lag sustained.
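Each alert above fires on a sustained condition rather than a single bad sample. The sketch below shows one possible reading, assuming one sample per minute; the function names and the strict "every sample in the window" rule are illustrative choices, and a real alerting system would express this in its own rule language.

```python
from typing import Sequence


def sustained_above(samples: Sequence[float], threshold: float, window: int) -> bool:
    """True when each of the last `window` samples exceeds `threshold`."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window:
        return False  # Not enough history to call the condition sustained.
    return all(value > threshold for value in samples[-window:])


def sustained_growth(samples: Sequence[float], window: int) -> bool:
    """True when the series rose at each of the last `window` steps."""
    if window <= 0:
        raise ValueError("window must be positive")
    if len(samples) < window + 1:
        return False
    recent = samples[-(window + 1):]
    return all(later > earlier for earlier, later in zip(recent, recent[1:]))
```

With per-minute error rates, `sustained_above(rates, 0.02, 5)` corresponds to the "> 2% for 5m" example, and `sustained_growth` can be applied to dead-letter counts or queue lag.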

See Also