Operational Resilience

Excalibur providers are designed for operational resilience, handling transient failures, connection disruptions, and recovery scenarios automatically. This guide covers the retry policies, transient error catalogs, and recovery options available across all supported providers.

Before You Start

  • .NET 8.0+ (or .NET 9/10 for latest features)
  • Install the packages for your data provider:
    dotnet add package Excalibur.Data.SqlServer  # or Excalibur.Data.Postgres
  • Familiarity with data access and Polly resilience concepts

Retry Policies

SQL Server

The SqlServerRetryPolicy handles transient failures with exponential backoff automatically. Configure SQL Server stores with their storage-specific options:

// Event sourcing - configure storage options
services.AddSqlServerEventSourcing(options =>
{
    options.ConnectionString = connectionString;
    options.EventStoreSchema = "dbo";
    options.EventStoreTable = "Events";
});

// Outbox - configure storage and processing via fluent builder
services.AddExcaliburOutbox(outbox =>
{
    outbox.UseSqlServer(connectionString)
        .WithProcessing(p => p.MaxRetryCount(5)
            .RetryDelay(TimeSpan.FromMinutes(5)))
        .EnableBackgroundProcessing();
});

Transient Error Codes:

| Error Code | Description | Impact |
| --- | --- | --- |
| 596 | Session killed by backup/restore | Critical for CDC processors |
| 9001, 9002 | Transaction log unavailable/full | Write operations fail |
| 3960, 3961 | Snapshot isolation conflicts | Concurrent write conflicts |
| 1204, 1205, 1222 | Lock/deadlock errors | Transaction conflicts |
| 40143, 40613, 40501 | Azure SQL service errors | Service unavailable |
| 49918-49920 | Resource governance | Throttling |
| 20, 64, 233 | Connection errors | Network issues |
| -2, 2, 53 | Network errors | Connectivity loss |

Recovery Behavior:

  • Automatic retry with exponential backoff (1s, 2s, 4s, 8s...) - see the sketch after this list
  • Connection recreation on retry
  • Logging of retry attempts for observability
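
To make the behavior concrete, here is a minimal Polly v8 sketch that approximates what the built-in policy does: retry on the transient error numbers cataloged above with exponential backoff. This is an illustration, not the actual SqlServerRetryPolicy implementation; the TransientErrorNumbers set below is copied from the table.

using System;
using System.Collections.Generic;
using System.Linq;
using Microsoft.Data.SqlClient;
using Polly;
using Polly.Retry;

// Illustrative only: mirrors the documented error catalog and backoff schedule.
static class TransientSqlRetry
{
    // Transient error numbers taken from the table above.
    private static readonly HashSet<int> TransientErrorNumbers = new()
    {
        596, 9001, 9002, 3960, 3961, 1204, 1205, 1222,
        40143, 40613, 40501, 49918, 49919, 49920,
        20, 64, 233, -2, 2, 53
    };

    public static readonly ResiliencePipeline Pipeline = new ResiliencePipelineBuilder()
        .AddRetry(new RetryStrategyOptions
        {
            // Retry only when the SqlException carries a cataloged transient error.
            ShouldHandle = new PredicateBuilder().Handle<SqlException>(ex =>
                ex.Errors.Cast<SqlError>().Any(e => TransientErrorNumbers.Contains(e.Number))),
            MaxRetryAttempts = 4,
            BackoffType = DelayBackoffType.Exponential, // 1s, 2s, 4s, 8s
            Delay = TimeSpan.FromSeconds(1)
        })
        .Build();
}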

PostgreSQL

The PostgresRetryPolicy handles PostgreSQL-specific transient failures automatically:

// Event sourcing with PostgreSQL - configure storage options
services.AddPostgresEventStore(connectionString, options =>
{
    options.SchemaName = "public";
    options.EventsTableName = "event_store_events";
});

Transient Error Codes:

| Error Code | Description | Impact |
| --- | --- | --- |
| 08xxx | Connection exceptions | All connection failures |
| 08007 | Connection failure during transaction | Transaction rollback |
| 40001, 40P01 | Serialization/deadlock | Concurrent conflicts |
| 53xxx | Insufficient resources | Memory/disk pressure |
| 57P01-57P04 | Admin/crash shutdown | Server unavailable |
| 58000, 58030 | System/IO errors | Infrastructure issues |
| 25P02, 25006 | Failed/readonly transaction | Transaction state |
| 55P03 | Lock not available | Advisory lock contention |
| XX000 | Internal errors | Unexpected failures |
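
PostgreSQL reports these as five-character SQLSTATE codes, which Npgsql exposes on PostgresException.SqlState. As a rough illustration of how the catalog above maps onto exceptions (the shipped PostgresRetryPolicy does this for you), a hand-rolled classifier might look like:

using System;
using Npgsql;

// Illustrative only: classify an exception as transient using the SQLSTATE
// classes and codes from the table above.
static bool IsTransientPostgresError(Exception ex)
{
    if (ex is not PostgresException pg)
    {
        // An NpgsqlException without a server SQLSTATE usually indicates a
        // network-level failure (broken socket, timeout), also worth retrying.
        return ex is NpgsqlException { IsTransient: true };
    }

    var code = pg.SqlState;
    return code.StartsWith("08")                          // connection exceptions
        || code.StartsWith("53")                          // insufficient resources
        || code is "40001" or "40P01"                     // serialization/deadlock
        || code is "57P01" or "57P02" or "57P03" or "57P04" // admin/crash shutdown
        || code is "58000" or "58030"                     // system/IO errors
        || code is "25P02" or "25006"                     // failed/readonly transaction
        || code is "55P03"                                // lock not available
        || code == "XX000";                               // internal errors
}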

Cloud Providers

Cloud providers (CosmosDB, DynamoDB, Firestore) primarily use SDK-managed retry policies:

// CosmosDB - SDK handles 408, 503, 504, 429 automatically
services.AddCosmosDbEventStore(options =>
{
    options.MaxRetryAttemptsOnRateLimitedRequests = 9;
    options.MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30);
});

CDC Position Recovery

CDC (Change Data Capture) processors maintain position state to resume after failures.

SQL Server CDC Recovery

// SQL Server CDC with fluent builder configuration
services.AddCdcProcessor(cdc =>
{
    cdc.UseSqlServer(connectionString, sql =>
        {
            sql.SchemaName("Cdc")
               .StateTableName("CdcProcessingState");
        })
        .WithRecovery(recovery =>
        {
            recovery.Strategy(StalePositionRecoveryStrategy.FallbackToEarliest)
                .MaxAttempts(3);
        })
        .EnableBackgroundProcessing();
});

Recovery Strategies:

| Strategy | Behavior | Use Case |
| --- | --- | --- |
| FallbackToEarliest | Resume from oldest available position | Data consistency priority |
| FallbackToLatest | Resume from current position | Skip missing events |
| Throw | Fail with detailed error | Manual intervention required |
| InvokeCallback | Custom handling via callback | Complex recovery scenarios |

PostgreSQL CDC Recovery

services.AddPostgresCdc(options =>
{
    options.ConnectionString = connectionString;
    options.PublicationName = "excalibur_cdc_publication"; // Default
    options.ReplicationSlotName = "excalibur_cdc_slot"; // Default
    options.RecoveryOptions = new PostgresCdcRecoveryOptions
    {
        // Configure recovery behavior for stale WAL positions
    };
});

CosmosDB CDC Recovery

services.AddCosmosDbCdc(options =>
{
    options.ConnectionString = connectionString;
    options.DatabaseId = "mydb";
    options.ContainerId = "events";
    options.ProcessorName = "cdc-processor"; // Default
    options.Mode = CosmosDbCdcMode.LatestVersion; // or AllVersionsAndDeletes
});

Connection Recovery

Long-Running Processors

CDC and projection processors use long-lived connections. Handle connection loss gracefully:

public class ResilientProjectionProcessor : BackgroundService
{
    // IProjectionProcessor is a placeholder for your processor abstraction.
    private readonly IProjectionProcessor _processor;
    private readonly ILogger<ResilientProjectionProcessor> _logger;

    public ResilientProjectionProcessor(
        IProjectionProcessor processor,
        ILogger<ResilientProjectionProcessor> logger)
    {
        _processor = processor;
        _logger = logger;
    }

    protected override async Task ExecuteAsync(CancellationToken stoppingToken)
    {
        while (!stoppingToken.IsCancellationRequested)
        {
            try
            {
                await _processor.RunAsync(stoppingToken);
            }
            catch (Exception ex) when (IsTransient(ex))
            {
                _logger.LogWarning(ex, "Transient failure in projection processor, restarting...");
                await Task.Delay(TimeSpan.FromSeconds(5), stoppingToken);
            }
        }
    }

    // Classify against your provider's transient error catalog (see tables above).
    private static bool IsTransient(Exception ex) =>
        ex is TimeoutException or DbException { IsTransient: true };
}

Connection Pool Health

For SQL Server and PostgreSQL, ensure connection pool health after failures:

// SQL Server - ClearAllPools after major failures
SqlConnection.ClearAllPools();

// PostgreSQL - Clear connection pool
NpgsqlConnection.ClearAllPools();
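
A common pattern is to clear pools only after a failure that suggests the pooled connections are stale (for example, right after a failover), so subsequent attempts open fresh physical connections. The RunWithPoolResetAsync helper below is illustrative, not a library API:

using System;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;

// Illustrative helper: on a connection-level failure, clear the pool so the
// next attempt opens a fresh physical connection instead of reusing a dead one.
static async Task RunWithPoolResetAsync(Func<Task> operation)
{
    try
    {
        await operation();
    }
    catch (SqlException)
    {
        SqlConnection.ClearAllPools();
        throw; // Let the caller's retry policy decide whether to try again.
    }
}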

Failure Mode Coverage

| Failure Mode | SQL Server | PostgreSQL | CosmosDB | DynamoDB | MongoDB |
| --- | --- | --- | --- | --- | --- |
| Process restart | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Database restart | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Backup/restore (LSN rollback) | ✅ Full | ✅ Full | N/A | N/A | N/A |
| Killed session (error 596) | ✅ Full | ✅ Full | N/A | N/A | N/A |
| Network partition/timeout | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Throttling/rate limits | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |
| Failover/replica promotion | ✅ Full | ✅ Full | ✅ Full | ✅ Full | ✅ Full |

Legend:

  • ✅ Full - Automatic recovery with no manual intervention
  • ⚠️ Partial - May require manual intervention
  • N/A - Not applicable to provider

Observability

Metrics

All retry operations emit metrics via OpenTelemetry:

  • dispatch.write_store.operations_total - Total number of write-side store operations (tagged by store, provider, operation, result)
  • dispatch.write_store.operation_duration_ms - Duration of write-side store operations in milliseconds
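
To collect these, register the corresponding meter with your OpenTelemetry metrics pipeline. A minimal sketch, assuming the standard OpenTelemetry.Extensions.Hosting setup; the meter name "Excalibur.Dispatch" is a placeholder, so confirm the actual meter name from the package:

using OpenTelemetry.Metrics;

// Sketch: export the dispatch.write_store.* instruments.
// "Excalibur.Dispatch" is a placeholder meter name - confirm it from the package.
services.AddOpenTelemetry()
    .WithMetrics(metrics => metrics
        .AddMeter("Excalibur.Dispatch")
        .AddPrometheusExporter());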

Logging

Retry attempts are logged at Warning level:

SQL Server operation failed with transient error. Retry 1 after 1000ms
PostgreSQL operation failed with transient error. Retry 2 after 2000ms

Best Practices

  1. Configure appropriate retry counts - Balance between recovery and fail-fast
  2. Monitor retry metrics - High retry rates indicate underlying issues
  3. Use CDC recovery options - Configure stale position handling for your use case
  4. Implement circuit breakers - Prevent cascade failures with Polly (see the sketch after this list)
  5. Clear connection pools - After major failures, clear stale connections
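
For item 4, a minimal Polly v8 circuit breaker might look like the following. The thresholds, and the store.SaveAsync call it wraps, are illustrative; tune them against your observed retry metrics:

using System;
using Polly;
using Polly.CircuitBreaker;

// Illustrative circuit breaker: opens for 30 seconds once half of the calls in
// a 30-second sampling window fail, so a persistent outage fails fast instead
// of stacking retries on a struggling store.
var breaker = new ResiliencePipelineBuilder()
    .AddCircuitBreaker(new CircuitBreakerStrategyOptions
    {
        FailureRatio = 0.5,
        MinimumThroughput = 10,
        SamplingDuration = TimeSpan.FromSeconds(30),
        BreakDuration = TimeSpan.FromSeconds(30)
    })
    .Build();

// Wrap write-side store calls (store and events are placeholders).
await breaker.ExecuteAsync(async ct => await store.SaveAsync(events, ct), cancellationToken);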

See Also