Recovery Runbooks
This guide provides step-by-step procedures for recovering from common operational failures in Excalibur applications.
Before You Start
- .NET 8.0+ (or .NET 9/10 for latest features)
- Access to your production or staging SQL Server instance
- Familiarity with performance tuning and health checks
SQL Server Recovery Scenarios
Session Killed During CDC Processing (Error 596)
Symptoms:
- CDC processor stops processing events
- Log shows error 596: "Cannot continue the execution because the session is in the kill state"
- Projection updates stop
Diagnosis:
-- Check recent CDC log scan sessions
SELECT session_id, scan_phase, start_time, end_time
FROM sys.dm_cdc_log_scan_sessions
ORDER BY start_time DESC;
-- Check CDC capture position
SELECT * FROM cdc.lsn_time_mapping
ORDER BY tran_begin_time DESC;
Recovery Steps:
- The retry policy automatically handles error 596
- If the processor doesn't recover within the retry attempts, restart the application:
  # Restart the application
  systemctl restart myapp
- If the position is invalid after restart, configure recovery options:
  // Configure recovery options
  options.Recovery = new CdcRecoveryOptions
  {
      RecoveryStrategy = StalePositionRecoveryStrategy.FallbackToEarliest
  };
Prevention:
- Configure CDC recovery options in application startup (see the sketch below)
- Monitor CDC processor health via metrics
- Alert on repeated retry attempts
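A minimal startup sketch for the first prevention item, assuming a hypothetical AddCdcProcessor registration extension (substitute whatever registration call your application already uses); CdcRecoveryOptions and FallbackToEarliest come from the recovery steps above:
using Microsoft.Extensions.Hosting;

var builder = Host.CreateApplicationBuilder(args);

// Hypothetical registration method; substitute your actual CDC processor setup.
builder.Services.AddCdcProcessor(options =>
{
    // From the recovery steps above: fall back to the earliest available LSN
    // when the saved position is no longer valid.
    options.Recovery = new CdcRecoveryOptions
    {
        RecoveryStrategy = StalePositionRecoveryStrategy.FallbackToEarliest
    };
});

builder.Build().Run();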
Database Backup/Restore Invalidates LSN Position
Symptoms:
- CDC processor fails with "Invalid LSN" error
- Event store cannot find expected position
- Projection processor stuck
Diagnosis:
-- Check the earliest and latest LSNs still available in CDC
SELECT capture_instance,
       sys.fn_cdc_get_min_lsn(capture_instance) AS min_lsn,
       sys.fn_cdc_get_max_lsn() AS max_lsn
FROM cdc.change_tables;
-- Compare with saved position
SELECT * FROM [dbo].[CdcState]
WHERE ProcessorName = 'YourProcessor';
Recovery Steps:
- Automatic (Recommended): Configure recovery options:
  options.Recovery = new CdcRecoveryOptions
  {
      RecoveryStrategy = StalePositionRecoveryStrategy.FallbackToEarliest
  };
- Manual Reset: If automatic recovery fails:
  -- Reset CDC state to the earliest available position
  UPDATE [dbo].[CdcState]
  SET Position = (SELECT MIN(start_lsn) FROM cdc.change_tables)
  WHERE ProcessorName = 'YourProcessor';
- Restart the processor and verify events are processing
Prevention:
- Use the FallbackToEarliest strategy for data consistency
- Monitor position age vs CDC retention (see the sketch below)
- Schedule backups during low-activity periods
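For the position-age check in the prevention list, a minimal sketch assuming the [dbo].[CdcState] table from the diagnosis query stores the saved position as a binary(10) LSN; the connection string and processor name are placeholders:
using Microsoft.Data.SqlClient;

const string connectionString = "<your-connection-string>";

await using var connection = new SqlConnection(connectionString);
await connection.OpenAsync();

// Translate the saved LSN and the newest LSN to times and compare them.
var command = new SqlCommand(
    @"SELECT DATEDIFF(minute,
             sys.fn_cdc_map_lsn_to_time(s.Position),
             sys.fn_cdc_map_lsn_to_time(sys.fn_cdc_get_max_lsn()))
      FROM [dbo].[CdcState] s
      WHERE s.ProcessorName = @processor", connection);
command.Parameters.AddWithValue("@processor", "YourProcessor");

var lagMinutes = Convert.ToInt32(await command.ExecuteScalarAsync());

// Alert well before the lag approaches the CDC cleanup retention (default 3 days).
Console.WriteLine($"Saved CDC position is {lagMinutes} minute(s) behind the log.");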
Connection Pool Corruption
Symptoms:
- Intermittent connection failures
- "Connection is broken" errors
- Some operations succeed, others fail
Diagnosis:
-- Check for orphaned connections
SELECT session_id, login_name, status, last_request_end_time
FROM sys.dm_exec_sessions
WHERE program_name LIKE '%YourApp%'
ORDER BY last_request_end_time;
Recovery Steps:
- Clear all connection pools:
  SqlConnection.ClearAllPools();
- If in Kubernetes, rolling restart:
  kubectl rollout restart deployment/myapp
- Monitor for recurring issues
Prevention:
- Configure connection lifetime limits
- Implement health checks that test connections (see the sketch below)
- Use Azure SQL maintenance windows
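A minimal health-check sketch for the second prevention item; it exercises a pooled connection and clears the pool when the connection turns out to be broken (the class name and wiring are illustrative):
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Diagnostics.HealthChecks;

public sealed class SqlConnectionHealthCheck(string connectionString) : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken cancellationToken = default)
    {
        try
        {
            await using var connection = new SqlConnection(connectionString);
            await connection.OpenAsync(cancellationToken);

            await using var command = new SqlCommand("SELECT 1", connection);
            await command.ExecuteScalarAsync(cancellationToken);

            return HealthCheckResult.Healthy();
        }
        catch (SqlException ex)
        {
            // A corrupted pooled connection surfaces here; clearing the pool lets
            // subsequent requests re-establish fresh connections.
            SqlConnection.ClearAllPools();
            return HealthCheckResult.Unhealthy("SQL connection check failed", ex);
        }
    }
}
Register it with builder.Services.AddHealthChecks().AddCheck("sql", new SqlConnectionHealthCheck(connectionString)) so the orchestrator can recycle instances that keep failing.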
PostgreSQL Recovery Scenarios
Broken Pipe / Connection Lost (08xxx Errors)
Symptoms:
- "Connection broken" or "57P01 admin_shutdown" errors
- Operations fail mid-transaction
- CDC processor stops
Diagnosis:
-- Check active connections
SELECT pid, usename, application_name, state, query_start
FROM pg_stat_activity
WHERE application_name LIKE '%YourApp%';
-- Check for sessions stuck idle in a transaction
SELECT * FROM pg_stat_activity WHERE state = 'idle in transaction';
Recovery Steps:
- Retry policy handles transient errors automatically
- If persistent, clear connection pool:
  NpgsqlConnection.ClearAllPools();
- Check PostgreSQL logs for root cause:
  tail -100 /var/log/postgresql/postgresql-15-main.log
Prevention:
- Configure tcp_keepalives_idle in the connection string (see the sketch below)
- Monitor pg_stat_activity for long-running transactions
- Use connection pool health checks
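A minimal Npgsql sketch of client-side keepalives for the first prevention item; it covers the same failure mode as the tcp_keepalives_idle server setting, and the host, database, and intervals are placeholders:
using Npgsql;

var csb = new NpgsqlConnectionStringBuilder
{
    Host = "db.example.internal",   // placeholder
    Database = "app",
    Username = "app_user",
    KeepAlive = 30,        // seconds of inactivity before Npgsql sends a keepalive query
    TcpKeepAlive = true    // also enable OS-level TCP keepalives on the socket
};

await using var dataSource = NpgsqlDataSource.Create(csb.ConnectionString);
await using var connection = await dataSource.OpenConnectionAsync();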
Deadlock Detection (40P01)
Symptoms:
- Operations fail with "deadlock detected" error
- Concurrent writes to same aggregates
Diagnosis:
-- Find blocked sessions and the sessions blocking them
SELECT pid,
       pg_blocking_pids(pid) AS blocked_by,
       state,
       query AS blocked_statement
FROM pg_stat_activity
WHERE cardinality(pg_blocking_pids(pid)) > 0;
Recovery Steps:
- Automatic retry handles deadlocks
- If frequent, review aggregate design:
- Reduce aggregate scope
- Implement optimistic concurrency
- Use advisory locks for coordination (see the sketch after this list)
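A minimal sketch of the advisory-lock option above, assuming Npgsql; deriving the lock key from the aggregate id's hash code is purely illustrative:
using Npgsql;

public static async Task WithAggregateLockAsync(
    NpgsqlConnection connection, Guid aggregateId, Func<Task> writeAction)
{
    await using var transaction = await connection.BeginTransactionAsync();

    // pg_advisory_xact_lock blocks until the lock is granted and releases it
    // automatically when the transaction commits or rolls back.
    await using (var cmd = new NpgsqlCommand(
        "SELECT pg_advisory_xact_lock(@key)", connection, transaction))
    {
        cmd.Parameters.AddWithValue("key", (long)aggregateId.GetHashCode());
        await cmd.ExecuteNonQueryAsync();
    }

    await writeAction();
    await transaction.CommitAsync();
}
Serializing writers per aggregate this way trades some throughput for predictable lock ordering, which removes the deadlock cycle entirely.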
Prevention:
- Design aggregates to minimize contention
- Use consistent lock ordering
- Monitor deadlock frequency via metrics
Cloud Provider Recovery Scenarios
CosmosDB Rate Limiting (429)
Symptoms:
- Operations fail with 429 "Request rate too large"
- Throughput drops dramatically
- SDK retries exhausted
Recovery Steps:
- SDK handles 429 automatically with backoff
- If persistent, increase RU/s:
  az cosmosdb sql container throughput update \
    --account-name myaccount \
    --database-name mydb \
    --name mycontainer \
    --resource-group mygroup \
    --throughput 10000
- Enable autoscale:
  az cosmosdb sql container throughput migrate \
    --account-name myaccount \
    --database-name mydb \
    --name mycontainer \
    --resource-group mygroup \
    --throughput-type autoscale
Prevention:
- Enable autoscale for variable workloads
- Monitor RU consumption via Azure Monitor
- Implement bulk operations for high-throughput scenarios (see the sketch below)
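A minimal sketch for the bulk-operations item, using the .NET Cosmos SDK's bulk mode together with SDK-side 429 retries; the endpoint, key, and limits are placeholders:
using Microsoft.Azure.Cosmos;

var client = new CosmosClient(
    "https://myaccount.documents.azure.com:443/",
    "<account-key>",
    new CosmosClientOptions
    {
        AllowBulkExecution = true,                  // group point operations into fewer, larger requests
        MaxRetryAttemptsOnRateLimitedRequests = 9,  // SDK-level retries for 429 responses
        MaxRetryWaitTimeOnRateLimitedRequests = TimeSpan.FromSeconds(30)
    });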
DynamoDB Throttling
Symptoms:
- Operations fail with ProvisionedThroughputExceededException
- Latency spikes
Recovery Steps:
- SDK handles throttling automatically
- If persistent, increase capacity:
aws dynamodb update-table \
--table-name MyTable \
--provisioned-throughput ReadCapacityUnits=100,WriteCapacityUnits=100
Prevention:
- Enable on-demand capacity for unpredictable workloads
- Monitor consumed capacity via CloudWatch
- Implement exponential backoff in application code (see the sketch below)
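A minimal sketch of SDK-side backoff for the last prevention item, using the AWS SDK for .NET; the retry mode and count are illustrative starting points:
using Amazon.DynamoDBv2;
using Amazon.Runtime;

var config = new AmazonDynamoDBConfig
{
    RetryMode = RequestRetryMode.Adaptive,  // exponential backoff plus client-side rate limiting
    MaxErrorRetry = 5
};

var client = new AmazonDynamoDBClient(config);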
General Recovery Procedures
Event Store Recovery
When to use: Event store corruption or position drift
- Verify event store integrity:
  -- SQL Server
  SELECT StreamId, COUNT(*) as EventCount, MAX(Version) as MaxVersion
  FROM [dbo].[Events]
  GROUP BY StreamId
  HAVING COUNT(*) != MAX(Version) + 1;
- Rebuild projections if needed (see the sketch after this list):
  await projectionRebuilder.RebuildAsync<MyProjection>(cancellationToken);
- Verify projection state:
  SELECT * FROM [dbo].[ProjectionCheckpoints]
  WHERE ProjectionName = 'MyProjection';
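The first two steps above can be combined into a small routine that rebuilds only when the integrity query finds gaps. IProjectionRebuilder is an assumed name for the rebuilder used in step 2, and the connection handling is illustrative:
using Microsoft.Data.SqlClient;

public static async Task RebuildIfCorruptedAsync(
    string connectionString,
    IProjectionRebuilder projectionRebuilder,   // assumed interface name
    CancellationToken cancellationToken)
{
    const string integrityQuery = @"
        SELECT StreamId
        FROM [dbo].[Events]
        GROUP BY StreamId
        HAVING COUNT(*) != MAX(Version) + 1;";

    await using var connection = new SqlConnection(connectionString);
    await connection.OpenAsync(cancellationToken);

    await using var command = new SqlCommand(integrityQuery, connection);
    await using var reader = await command.ExecuteReaderAsync(cancellationToken);

    // Any returned row is a stream whose event count and max version disagree.
    if (await reader.ReadAsync(cancellationToken))
    {
        await projectionRebuilder.RebuildAsync<MyProjection>(cancellationToken);
    }
}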
Outbox Recovery
When to use: Messages stuck in outbox, duplicate delivery suspected
- Check outbox status:
  SELECT Status, COUNT(*) as Count
  FROM [dbo].[OutboxMessages]
  GROUP BY Status;
- Reprocess stuck messages (see the sketch after this list):
  UPDATE [dbo].[OutboxMessages]
  SET Status = 'Pending', RetryCount = 0
  WHERE Status = 'Failed' AND CreatedAt > DATEADD(hour, -24, GETUTCDATE());
- Monitor for successful delivery
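A small helper for step 2 that applies the re-queue statement and reports how many messages were reset; the table and column names come from the runbook, everything else is illustrative:
using Microsoft.Data.SqlClient;

public static async Task<int> RequeueFailedOutboxMessagesAsync(
    string connectionString, CancellationToken cancellationToken)
{
    const string sql = @"
        UPDATE [dbo].[OutboxMessages]
        SET Status = 'Pending', RetryCount = 0
        WHERE Status = 'Failed' AND CreatedAt > DATEADD(hour, -24, GETUTCDATE());";

    await using var connection = new SqlConnection(connectionString);
    await connection.OpenAsync(cancellationToken);

    await using var command = new SqlCommand(sql, connection);

    // Rows affected = messages put back into the pending queue.
    return await command.ExecuteNonQueryAsync(cancellationToken);
}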
Monitoring and Alerting
Key Metrics to Monitor
| Metric | Warning Threshold | Critical Threshold |
|---|---|---|
| Retry rate | > 5% of operations | > 20% of operations |
| CDC lag | > 1 minute | > 5 minutes |
| Connection errors | > 1/minute | > 10/minute |
| Deadlock rate | > 1/hour | > 10/hour |
Recommended Alerts
# Example Prometheus alerting rules
groups:
  - name: excalibur-resilience
    rules:
      - alert: HighRetryRate
        expr: rate(dispatch_write_store_retry_count_total[5m]) > 0.05
        for: 5m
        labels:
          severity: warning
      - alert: CdcLagHigh
        expr: dispatch_cdc_lag_seconds > 300
        for: 5m
        labels:
          severity: critical
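The rules above assume the dispatch_* metrics are already scraped from the application. If you also need to publish a comparable gauge from custom processors, a minimal sketch with System.Diagnostics.Metrics (the meter and metric names here are illustrative, not the library's built-in instrumentation):
using System.Diagnostics.Metrics;

var meter = new Meter("MyApp.Cdc");

// Report how far the processor lags behind the newest change, in seconds.
meter.CreateObservableGauge(
    "cdc_lag_seconds",
    observeValue: () => (DateTime.UtcNow - GetLastProcessedEventTimeUtc()).TotalSeconds,
    unit: "s",
    description: "Age of the most recently processed CDC event");

static DateTime GetLastProcessedEventTimeUtc()
{
    // Placeholder: read the commit time of the last processed position from your CDC state store.
    return DateTime.UtcNow;
}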
Related Documentation
- Operational Resilience - Retry policies and configuration
- Observability - Monitoring setup
- Health Checks - Application health monitoring
See Also
- CDC Troubleshooting — Diagnose and recover from Change Data Capture issues
- Performance Tuning — Optimize event store, outbox, and projection throughput
- Health Checks — Application health monitoring and diagnostics
- Dead Letter Pattern — Handling failed messages with dead letter queues