
Operations

Operational guidance for running Excalibur in production environments, including resilience, recovery procedures, and maintenance runbooks.

Before You Start

Guides

| Topic | Description |
| --- | --- |
| Runtime Contract | Canonical runtime semantics for dispatch ordering, cancellation, retries, and context propagation |
| Reliability Guarantees | Delivery/ordering/deduplication/dead-letter guarantees by execution path and provider family |
| SLO, SLI, and Telemetry | Production objectives and telemetry schema for release readiness and operations |
| Incident Runbooks | Escalation model and step-by-step response playbooks for common runtime incidents |
| Operational Resilience | Transient error handling, retry policies, and recovery strategies |
| Recovery Runbooks | Step-by-step recovery procedures for common failure scenarios |

Quick Reference

Provider Resilience Matrix

| Provider | Retry Policy | Recovery Options | CDC Position Recovery |
| --- | --- | --- | --- |
| SQL Server | SqlServerRetryPolicy | Automatic reconnect | CdcRecoveryOptions |
| PostgreSQL | PostgresRetryPolicy | Automatic reconnect | PostgresCdcRecoveryOptions |
| CosmosDB | SDK-managed | Automatic | Continuation token |
| DynamoDB | SDK-managed | Automatic | Stream ARN |
| MongoDB | Driver pool | Automatic | Resume token |
| Redis | Manual reconnect | ConnectionMultiplexer | N/A |

Key Error Codes

SQL Server Transient Errors:

  • 596 - Session killed by backup/restore (critical for CDC)
  • 9001, 9002 - Transaction log unavailable (9001) or full (9002)
  • 1205 - Deadlock victim
  • 40613 - Database unavailable
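
To illustrate how a list like this is typically used, here is a minimal Python sketch that treats the error numbers above as retryable and retries with exponential backoff. It is not Excalibur's `SqlServerRetryPolicy`; the names `SqlServerError`, `is_transient_sqlserver`, and `run_with_retry`, as well as the attempt count and delays, are assumptions made for this example.

```python
import time

# Error numbers copied from the list above; the set name is hypothetical.
SQLSERVER_TRANSIENT_ERRORS = frozenset({596, 9001, 9002, 1205, 40613})


class SqlServerError(Exception):
    """Hypothetical stand-in for a driver exception that carries an error number."""

    def __init__(self, number, message=""):
        super().__init__(message or f"SQL Server error {number}")
        self.number = number


def is_transient_sqlserver(number):
    """Return True if the error number is in the transient list above."""
    return number in SQLSERVER_TRANSIENT_ERRORS


def run_with_retry(operation, max_attempts=3, base_delay=0.5, sleep=time.sleep):
    """Call operation(); retry transient failures with exponential backoff.

    Non-transient errors, and the final failed attempt, are re-raised unchanged.
    The defaults are illustrative assumptions, not recommended production values.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except SqlServerError as exc:
            if not is_transient_sqlserver(exc.number) or attempt == max_attempts:
                raise
            # Delay doubles each attempt: base_delay, 2 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter keeps the helper easy to test without real waiting. A production policy would usually also add jitter and a cap on the delay.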

PostgreSQL Transient Errors:

  • 08xxx - Connection errors
  • 40001, 40P01 - Serialization/deadlock
  • 57Pxx - Admin/crash shutdown
  • 53xxx - Insufficient resources
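
The PostgreSQL entries above mix whole SQLSTATE classes (`08xxx`, `53xxx`, `57Pxx`) with individual codes (`40001`, `40P01`), so a classifier needs both prefix and exact matching. The sketch below shows one way to do that in Python; the function and constant names are hypothetical and do not describe how `PostgresRetryPolicy` is implemented.

```python
# Patterns copied from the list above; the constant names are hypothetical.
TRANSIENT_SQLSTATE_PREFIXES = ("08", "53", "57P")          # 08xxx, 53xxx, 57Pxx
TRANSIENT_SQLSTATE_CODES = frozenset({"40001", "40P01"})   # serialization / deadlock


def is_transient_postgres(sqlstate):
    """Return True if a five-character SQLSTATE matches the transient list above."""
    if sqlstate is None or len(sqlstate) != 5:
        return False
    code = sqlstate.upper()
    # str.startswith accepts a tuple and matches any of the prefixes.
    return code in TRANSIENT_SQLSTATE_CODES or code.startswith(TRANSIENT_SQLSTATE_PREFIXES)
```

Matching `57P` rather than the whole `57` class keeps codes outside the `57Pxx` range, such as `57014`, classified as non-transient, in line with the list above.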

See Also