CDC Troubleshooting
Change Data Capture (CDC) issues can cause projection lag, missed events, and data inconsistency. This guide covers common problems and recovery procedures.
Before You Start
- .NET 10.0
- A running CDC deployment with SQL Server CDC enabled
- Familiarity with CDC patterns and recovery runbooks
Common CDC Issues
| Issue | Symptoms | Severity |
|---|---|---|
| Stale position | Projection lag increasing | High |
| Missing events | Data gaps in projections | Critical |
| Position corruption | CDC processor errors | Critical |
| Log truncation | Events unavailable | Critical |
Diagnosing Stale Positions
Check CDC Position
// Check CDC processor position via provider-specific processor
// (e.g., IPostgresCdcProcessor, ISqlServerCdcProcessor)
var position = await _cdcProcessor.GetCurrentPositionAsync(ct);
_logger.LogInformation("CDC position: {Position}", position);
SQL Server CDC Status
-- Check CDC is enabled
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = DB_NAME();
-- Check capture instance
SELECT * FROM cdc.change_tables;
-- Check current LSN vs max available LSN
SELECT
sys.fn_cdc_get_min_lsn('EventSourcing_Events') AS MinLsn,
sys.fn_cdc_get_max_lsn() AS MaxLsn;
-- Check for stale position (a checkpoint below a capture instance's start_lsn is stale)
SELECT
capture_instance,
start_lsn,
DATEDIFF(MINUTE, create_date, GETDATE()) AS MinutesSinceStart
FROM cdc.change_tables
ORDER BY create_date DESC;
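SQL Server returns LSNs as binary(10) values, so client code that compares a stored checkpoint against the MinLsn/MaxLsn range from the query above needs a byte-wise comparison. The following is a minimal sketch, assuming the checkpoint and both bounds are available as 10-byte arrays. `LsnRange` and its members are hypothetical names for illustration, not part of the Excalibur API.

```csharp
using System;

// Hypothetical helper (not an Excalibur API): classifies a checkpoint LSN
// against the [min, max] range reported by SQL Server.
public static class LsnRange
{
    public enum Status { Valid, BehindMinimum, AheadOfMaximum }

    // binary(10) LSNs order correctly under a left-to-right byte comparison.
    public static int Compare(byte[] a, byte[] b)
    {
        if (a.Length != 10 || b.Length != 10)
            throw new ArgumentException("LSNs must be exactly 10 bytes.");
        for (int i = 0; i < 10; i++)
        {
            int diff = a[i].CompareTo(b[i]);
            if (diff != 0) return diff;
        }
        return 0;
    }

    public static Status Classify(byte[] checkpoint, byte[] min, byte[] max)
    {
        if (Compare(checkpoint, min) < 0) return Status.BehindMinimum;
        if (Compare(checkpoint, max) > 0) return Status.AheadOfMaximum;
        return Status.Valid;
    }

    // Parses the hex form shown by query tools, e.g. "0x0000002A000001B80003".
    public static byte[] Parse(string hex)
    {
        if (hex.StartsWith("0x", StringComparison.OrdinalIgnoreCase))
            hex = hex.Substring(2);
        return Convert.FromHexString(hex);
    }
}
```

A `BehindMinimum` result corresponds to the stale-position case described in this guide: the change rows the checkpoint points at are no longer available.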
PostgreSQL Replication Status
-- Check replication slot
SELECT * FROM pg_replication_slots WHERE slot_name = 'excalibur_cdc';
-- Check replication lag
SELECT
slot_name,
pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
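The lag value above is the byte distance between two WAL positions. If the processor only has the textual LSNs (for example from logs), the same difference can be computed client-side. This is a minimal sketch, assuming LSNs in PostgreSQL's standard `high/low` hexadecimal text form. `PgLsn` is a hypothetical helper, not an Excalibur or Npgsql type.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only.
public static class PgLsn
{
    // A PostgreSQL LSN prints as "X/Y": the high and low 32 bits of a
    // 64-bit WAL byte position, both in hexadecimal.
    public static ulong Parse(string lsn)
    {
        var parts = lsn.Split('/');
        if (parts.Length != 2)
            throw new FormatException($"Invalid LSN: {lsn}");
        ulong high = uint.Parse(parts[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        ulong low = uint.Parse(parts[1], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        return (high << 32) | low;
    }

    // Bytes of WAL between the slot's restart_lsn and the current WAL position.
    public static long LagBytes(string currentLsn, string restartLsn)
        => (long)Parse(currentLsn) - (long)Parse(restartLsn);
}
```

For example, `PgLsn.LagBytes("0/3000000", "0/2000000")` is 16777216 bytes (16 MB), which is what `pg_wal_lsn_diff` would report for the same pair.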
Stale Position Recovery
Automatic Recovery
Excalibur includes automatic stale position detection:
services.AddCdcProcessor(cdc =>
{
cdc.UseSqlServer(sql => sql.ConnectionString(connectionString))
.WithRecovery(recovery =>
{
recovery.Strategy(StalePositionRecoveryStrategy.FallbackToEarliest)
.MaxAttempts(5)
.AttemptDelay(TimeSpan.FromSeconds(30));
})
.EnableBackgroundProcessing();
});
Recovery Strategies
| Strategy | When to Use | Data Impact |
|---|---|---|
| FallbackToEarliest | Data consistency priority | Reprocesses events from earliest available |
| FallbackToLatest | Data gaps acceptable | Skips missed events |
| Throw | Manual intervention required | Fails with detailed error |
| InvokeCallback | Complex scenarios | Custom handling via callback |
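To make the data impact of each strategy concrete, the sketch below resolves a resume position from a stale checkpoint. It is an illustration under stated assumptions, not the framework's implementation: `RecoveryChoice` is a hypothetical enum that mirrors the strategy names in the table, positions are modelled as plain `long` values, and the callback signature is invented for this example.

```csharp
using System;

// Hypothetical mirror of the strategy names in the table above; the real
// StalePositionRecoveryStrategy type ships with the framework.
public enum RecoveryChoice { FallbackToEarliest, FallbackToLatest, Throw, InvokeCallback }

public static class StaleRecovery
{
    // Returns the position to resume from, given the stale checkpoint and
    // the range that is still available in the change source.
    public static long Resolve(
        RecoveryChoice choice, long stale, long earliest, long latest,
        Func<long, long, long, long> callback = null)
    {
        return choice switch
        {
            // Reprocess everything still available: no gaps, possible duplicates.
            RecoveryChoice.FallbackToEarliest => earliest,
            // Jump to the head: no duplicates, but missed events are skipped.
            RecoveryChoice.FallbackToLatest => latest,
            // Let the caller decide (assumed signature: stale, earliest, latest).
            RecoveryChoice.InvokeCallback when callback != null => callback(stale, earliest, latest),
            RecoveryChoice.InvokeCallback => throw new InvalidOperationException("No callback configured."),
            // Throw: surface the problem for manual intervention.
            _ => throw new InvalidOperationException(
                $"Stale position {stale}; available range is [{earliest}, {latest}]."),
        };
    }
}
```

With `FallbackToEarliest`, handlers see some events a second time, which is why the idempotency advice elsewhere in this guide matters.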
SQL Error 313: Insufficient Arguments
SQL Server may raise error 313 ("An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_*") when the CDC table-valued function receives an LSN outside the valid range. This is a boundary condition variant of the more common errors 22037/22029.
Symptoms:
- SqlException with Number = 313 in CDC processor logs
- Processing loop stops advancing for affected capture instances
Resolution: The framework detects error 313 automatically via CdcStalePositionDetector and maps it to StalePositionReasonCodes.TvfInsufficientArguments. If you have a recovery strategy configured (e.g., FallbackToEarliest), the position resets and processing resumes. If using the default Throw strategy, you will see the exception in logs and must reset the position manually.
Tip: Pair recovery with idempotency filtering to safely reprocess events after a position reset without duplicate side effects.
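One way to picture idempotency filtering is a guard that runs a handler only the first time it sees a given event key. The sketch below is a deliberately simple in-memory version. `IdempotencyFilter` is a hypothetical class, and a production filter would persist the seen keys, ideally in the same transaction as the side effect.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory filter for illustration; not an Excalibur type.
public sealed class IdempotencyFilter
{
    private readonly HashSet<string> _seen = new HashSet<string>();

    // Runs the handler only the first time a given event key is observed.
    // Returns true if the handler ran, false if the event was a duplicate.
    public bool TryHandle(string eventKey, Action handler)
    {
        if (_seen.Contains(eventKey)) return false;
        handler();
        // Record the key only after success so a failed handler is retried.
        _seen.Add(eventKey);
        return true;
    }
}
```

After a position reset, replayed events hit the `false` branch and their side effects are not repeated.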
Manual Recovery Procedure
- Stop CDC processor
kubectl scale deployment cdc-processor --replicas=0
- Determine recovery point
-- Find safe starting position
SELECT MIN(SequenceNumber) AS SafeStart
FROM EventSourcing.Events
WHERE Timestamp > DATEADD(DAY, -1, GETDATE());
- Reset position
await _cdcPositionStore.SetPositionAsync(
new CdcPosition { SequenceNumber = safeStart },
CancellationToken.None);
- Rebuild affected projections (if needed)
await _projectionRebuildService.RebuildAsync(
projectionName: "OrderSummary",
fromSequence: safeStart,
CancellationToken.None);
- Restart CDC processor
kubectl scale deployment cdc-processor --replicas=1
Log Truncation Issues
SQL Server Log Truncation
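CDC reads changes from the transaction log, and the cleanup job purges change-table rows older than the retention period. If the events the processor needs are no longer available: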
-- Check if CDC capture job is running
EXEC sys.sp_cdc_help_jobs;
-- Start capture job if stopped
EXEC sys.sp_cdc_start_job @job_type = N'capture';
-- Set retention period (in minutes); restart the cleanup job for the change to take effect
EXEC sys.sp_cdc_change_job
@job_type = N'cleanup',
@retention = 4320; -- 3 days in minutes
Prevention
-- Set adequate retention
EXEC sys.sp_cdc_change_job
@job_type = N'cleanup',
@retention = 10080; -- 7 days
-- Monitor log space
SELECT
DB_NAME(database_id) AS DatabaseName,
log_reuse_wait_desc
FROM sys.databases
WHERE database_id = DB_ID();
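The @retention values above are in minutes (4320 for 3 days, 10080 for 7 days). A small helper can keep that conversion out of hand-written scripts and check whether a retention window outlasts the longest outage you plan for. This is a sketch: `CdcRetention` is a hypothetical name, and the default safety factor of 2 is an assumption, not a framework or SQL Server recommendation.

```csharp
using System;

// Hypothetical helper for illustration only.
public static class CdcRetention
{
    // sp_cdc_change_job expects @retention in whole minutes.
    public static int ToMinutes(TimeSpan retention)
        => checked((int)retention.TotalMinutes);

    // True when change data is kept at least safetyFactor times longer than
    // the longest processor outage you expect to recover from.
    public static bool Covers(int retentionMinutes, TimeSpan longestExpectedOutage,
                              double safetyFactor = 2.0)
        => retentionMinutes >= longestExpectedOutage.TotalMinutes * safetyFactor;
}
```

Under these assumptions, a 3-day retention does not cover a 2-day outage with a factor of 2, while a 7-day retention does.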
PostgreSQL WAL Retention
-- Check replication slot status
SELECT * FROM pg_replication_slots;
-- If the slot cannot catch up, drop and recreate it (unconsumed changes are discarded, so rebuild affected projections afterwards)
SELECT pg_drop_replication_slot('excalibur_cdc');
SELECT pg_create_logical_replication_slot('excalibur_cdc', 'pgoutput');
Position Validation
Detect Invalid Position
public class CdcPositionValidator
{
public async Task<PositionValidation> ValidateAsync(CancellationToken ct)
{
var currentPosition = await _positionStore.GetPositionAsync(ct);
var minAvailable = await _cdcSource.GetMinAvailableAsync(ct);
var maxAvailable = await _cdcSource.GetMaxAvailableAsync(ct);
if (currentPosition < minAvailable)
{
return new PositionValidation
{
IsValid = false,
Issue = PositionIssue.BehindMinimum,
CurrentPosition = currentPosition,
MinAvailable = minAvailable,
RecommendedAction = "Reset to minimum available position"
};
}
if (currentPosition > maxAvailable)
{
return new PositionValidation
{
IsValid = false,
Issue = PositionIssue.AheadOfMaximum,
CurrentPosition = currentPosition,
MaxAvailable = maxAvailable,
RecommendedAction = "Reset to maximum available position"
};
}
return new PositionValidation { IsValid = true };
}
}
Projection Rebuild
When CDC recovery requires projection rebuild, inject IProjectionRebuildService from the framework:
public class CdcProjectionRecoveryHandler
{
public async Task RebuildAsync(
string projectionName,
long fromSequence,
CancellationToken ct)
{
_logger.LogWarning(
"Rebuilding projection {Name} from sequence {Sequence}",
projectionName, fromSequence);
// 1. Clear existing projection data
await _projectionStore.ClearAsync(projectionName, ct);
// 2. Replay events from the event store
var events = await _eventStore.LoadAsync(
aggregateId: "*",
aggregateType: projectionName,
fromVersion: fromSequence,
ct);
// Resolve the projector once; it does not change between events
var projector = _projectorFactory.GetProjector(projectionName);
foreach (var @event in events)
{
await projector.ApplyAsync(@event, ct);
}
// 3. Update rebuild metadata
await _projectionStore.SetLastRebuiltAsync(
projectionName,
DateTime.UtcNow,
ct);
_logger.LogInformation(
"Projection {Name} rebuild complete",
projectionName);
}
}
Monitoring and Alerting
Health Check
public class CdcHealthCheck : IHealthCheck
{
public async Task<HealthCheckResult> CheckHealthAsync(
HealthCheckContext context,
CancellationToken ct)
{
var validation = await _validator.ValidateAsync(ct);
if (!validation.IsValid)
{
return HealthCheckResult.Unhealthy(
$"CDC position invalid: {validation.Issue}");
}
var lag = await _cdcProcessor.GetLagAsync(ct);
if (lag > TimeSpan.FromMinutes(5))
{
return HealthCheckResult.Degraded(
$"CDC lag: {lag.TotalSeconds:F0}s");
}
return HealthCheckResult.Healthy();
}
}
Alerting
# Prometheus alert rules
groups:
  - name: cdc
    rules:
      - alert: CDCPositionStale
        expr: excalibur_cdc_lag_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CDC position is stale"
          runbook: "https://docs/operations/cdc-troubleshooting"
      - alert: CDCPositionInvalid
        expr: excalibur_cdc_position_valid == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CDC position is invalid"
Database Restore Handling
When a database is restored from a backup (common in development/staging environments), the CDC processor handles two scenarios automatically:
During the Restore (Database Unavailable)
The CDC processor survives database unavailability without crashing:
- All DB operations are wrapped in a resilience policy (retry with exponential backoff + circuit breaker) via IDataAccessPolicyFactory
- Checkpoint updates and state store writes are guarded with try-catch; failures are logged but don't terminate the processing loop
- The circuit breaker opens after sustained failure, reducing load on the recovering database
- The health check transitions through Degraded → Unhealthy as inactivity duration increases
No operator intervention required — the processor automatically resumes when the database comes back online.
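The two mechanisms named above, exponential backoff and a circuit breaker, can be sketched in a few lines. This is an illustration of the general technique only. It is not the policy that IDataAccessPolicyFactory produces, and the `Backoff` and `CircuitBreaker` names, thresholds, and durations are all hypothetical.

```csharp
using System;

// Hypothetical illustration of the techniques; not the framework's policy.
public static class Backoff
{
    // Delay before retry number `attempt` (1-based):
    // baseDelay * 2^(attempt - 1), capped at maxDelay.
    public static TimeSpan Delay(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        if (attempt < 1) throw new ArgumentOutOfRangeException(nameof(attempt));
        double factor = Math.Pow(2, attempt - 1);
        double ms = Math.Min(baseDelay.TotalMilliseconds * factor, maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }
}

public sealed class CircuitBreaker
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _openDuration;
    private int _consecutiveFailures;
    private DateTime _openedAtUtc;

    public CircuitBreaker(int failureThreshold, TimeSpan openDuration)
    {
        _failureThreshold = failureThreshold;
        _openDuration = openDuration;
    }

    // While open, callers skip the database call entirely. Once openDuration
    // has elapsed, one trial call is allowed through (half-open behaviour).
    public bool IsOpen(DateTime nowUtc)
        => _consecutiveFailures >= _failureThreshold
           && nowUtc - _openedAtUtc < _openDuration;

    public void RecordFailure(DateTime nowUtc)
    {
        _consecutiveFailures++;
        // Reaching the threshold, or failing a trial call, (re)opens the circuit.
        if (_consecutiveFailures >= _failureThreshold) _openedAtUtc = nowUtc;
    }

    public void RecordSuccess() => _consecutiveFailures = 0;
}
```

Passing the clock in as a parameter keeps the breaker deterministic, which makes the open and half-open transitions easy to unit test.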
After the Restore (Data Replaced)
A restored database may have different CDC LSN ranges than what the processor has checkpointed:
| Scenario | What Happens | Resolution |
|---|---|---|
| Checkpoint LSN is within the restored range | Processing resumes normally | Automatic |
| Checkpoint LSN is outside the restored range (stale) | CdcStalePositionException is raised | Handled by recovery strategy |
| CDC tables were not restored | No change data available | Re-enable CDC on restored database |
Configure a recovery strategy to handle stale positions automatically:
services.AddCdcProcessor(cdc =>
{
cdc.UseSqlServer(sql => sql.ConnectionString(connectionString))
.TrackTable("dbo.Orders", t => t.MapAll<OrderChangedEvent>())
.WithRecovery(recovery =>
{
// FallbackToEarliest: resume from earliest available position
// (may reprocess some events — handlers should be idempotent)
recovery.Strategy(StalePositionRecoveryStrategy.FallbackToEarliest)
.MaxAttempts(5)
.AttemptDelay(TimeSpan.FromSeconds(30));
})
.EnableBackgroundProcessing();
});
In environments where databases are frequently restored from production backups, use FallbackToEarliest or FallbackToLatest instead of the default Throw strategy. Ensure your event handlers are idempotent to safely handle reprocessed events.
Prevention Best Practices
| Practice | Benefit |
|---|---|
| Enable automatic recovery | Reduces manual intervention |
| Register IDataAccessPolicyFactory | Automatic retry and circuit breaker for all DB operations |
| Set adequate log retention | Prevents truncation issues |
| Monitor CDC lag | Early warning of problems |
| Regular position validation | Detect issues before impact |
| Checkpoint frequently | Faster recovery |
| Make event handlers idempotent | Safe reprocessing after restore or recovery |
| Test recovery procedures | Confidence in recovery |
Quick Reference
Recovery Commands
# Stop CDC processor
kubectl scale deployment cdc-processor --replicas=0
# Check current position
kubectl exec -it cdc-processor -- dotnet cdc position show
# Reset position
kubectl exec -it cdc-processor -- dotnet cdc position reset --to-latest
# Start CDC processor
kubectl scale deployment cdc-processor --replicas=1
# Trigger projection rebuild
kubectl exec -it cdc-processor -- dotnet projection rebuild OrderSummary
SQL Server Quick Checks
-- CDC status
SELECT is_cdc_enabled FROM sys.databases WHERE name = DB_NAME();
-- Capture job status
EXEC sys.sp_cdc_help_jobs;
-- Available LSN range
SELECT
sys.fn_cdc_get_min_lsn('EventSourcing_Events') AS MinLsn,
sys.fn_cdc_get_max_lsn() AS MaxLsn;
See Also
- Change Data Capture Pattern — Architecture and implementation of the CDC pattern
- Recovery Runbooks — Step-by-step recovery procedures for common failure scenarios
- Production Observability — Monitoring and alerting for production environments