CDC Troubleshooting

Change Data Capture (CDC) issues can cause projection lag, missed events, and data inconsistency. This guide covers common problems and recovery procedures.

Before You Start

Common CDC Issues

| Issue | Symptoms | Severity |
|---|---|---|
| Stale position | Projection lag increasing | High |
| Missing events | Data gaps in projections | Critical |
| Position corruption | CDC processor errors | Critical |
| Log truncation | Events unavailable | Critical |

Diagnosing Stale Positions

Check CDC Position

// Check CDC processor position via provider-specific processor
// (e.g., IPostgresCdcProcessor, ISqlServerCdcProcessor)
var position = await _cdcProcessor.GetCurrentPositionAsync(ct);
_logger.LogInformation("CDC position: {Position}", position);

SQL Server CDC Status

-- Check CDC is enabled
SELECT name, is_cdc_enabled FROM sys.databases WHERE name = DB_NAME();

-- Check capture instance
SELECT * FROM cdc.change_tables;

-- Check current LSN vs max available LSN
SELECT
    sys.fn_cdc_get_min_lsn('EventSourcing_Events') AS MinLsn,
    sys.fn_cdc_get_max_lsn() AS MaxLsn;

-- Check for stale position: a checkpoint below a capture instance's
-- start_lsn is no longer available
SELECT
    capture_instance,
    start_lsn,
    DATEDIFF(MINUTE, create_date, GETDATE()) AS MinutesSinceStart
FROM cdc.change_tables
ORDER BY create_date DESC;

PostgreSQL Replication Status

-- Check replication slot
SELECT * FROM pg_replication_slots WHERE slot_name = 'excalibur_cdc';

-- Check replication lag
SELECT
    slot_name,
    pg_size_pretty(pg_wal_lsn_diff(pg_current_wal_lsn(), restart_lsn)) AS lag
FROM pg_replication_slots;
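
To make the lag figure concrete, the sketch below shows the arithmetic behind an LSN difference. PostgreSQL prints an LSN as two hexadecimal halves separated by a slash (high 32 bits, then low 32 bits), so the lag in bytes is the difference of the two 64-bit values. The `PgLsn` class and its method names are hypothetical helpers for illustration and are not part of Excalibur. Unlike `pg_wal_lsn_diff`, this sketch clamps a negative difference to zero.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration only; not part of the framework.
public static class PgLsn
{
    // Parses the textual LSN form "X/Y" (two hexadecimal halves) into a
    // single 64-bit byte position in the WAL stream.
    public static ulong Parse(string lsn)
    {
        var parts = lsn.Split('/');
        if (parts.Length != 2)
        {
            throw new FormatException($"Invalid LSN: '{lsn}'");
        }

        var high = uint.Parse(parts[0], NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);
        var low = uint.Parse(parts[1], NumberStyles.AllowHexSpecifier, CultureInfo.InvariantCulture);
        return ((ulong)high << 32) | low;
    }

    // Bytes of WAL between the slot's restart point and the current write
    // position. Returns 0 if the restart point is ahead of the current position.
    public static ulong LagBytes(string currentWalLsn, string restartLsn)
    {
        var current = Parse(currentWalLsn);
        var restart = Parse(restartLsn);
        return current >= restart ? current - restart : 0;
    }
}
```

For example, `PgLsn.LagBytes("0/3000060", "0/3000000")` yields 96 bytes, the value `pg_size_pretty` would format for display.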

Stale Position Recovery

Automatic Recovery

Excalibur includes automatic stale position detection:

services.AddCdcProcessor(cdc =>
{
    cdc.UseSqlServer(sql => sql.ConnectionString(connectionString))
       .WithRecovery(recovery =>
       {
           recovery.Strategy(StalePositionRecoveryStrategy.FallbackToEarliest)
                   .MaxAttempts(5)
                   .AttemptDelay(TimeSpan.FromSeconds(30));
       })
       .EnableBackgroundProcessing();
});

Recovery Strategies

| Strategy | When to Use | Data Impact |
|---|---|---|
| FallbackToEarliest | Data consistency priority | Reprocesses events from earliest available |
| FallbackToLatest | Data gaps acceptable | Skips missed events |
| Throw | Manual intervention required | Fails with detailed error |
| InvokeCallback | Complex scenarios | Custom handling via callback |
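
The effect of each strategy on the resume position can be sketched as a small decision function. This is an illustration of the table's semantics under simplifying assumptions (positions modeled as `long`, a callback that returns the new position). The `RecoveryChoice` enum and `StalePositionResolver` class are hypothetical and do not reflect the framework's internal implementation or its callback signature.

```csharp
#nullable enable
using System;

// Hypothetical local mirror of the four strategies in the table above.
public enum RecoveryChoice { FallbackToEarliest, FallbackToLatest, Throw, InvokeCallback }

public static class StalePositionResolver
{
    // Returns the position processing should resume from, or throws when
    // the strategy calls for manual intervention.
    public static long Resolve(
        RecoveryChoice choice,
        long stalePosition,
        long earliestAvailable,
        long latestAvailable,
        Func<long, long>? callback = null)
    {
        switch (choice)
        {
            case RecoveryChoice.FallbackToEarliest:
                // Reprocesses everything still available; handlers must be idempotent.
                return earliestAvailable;

            case RecoveryChoice.FallbackToLatest:
                // Skips every event between the stale position and the latest one.
                return latestAvailable;

            case RecoveryChoice.InvokeCallback:
                if (callback is null)
                {
                    throw new ArgumentNullException(nameof(callback));
                }
                return callback(stalePosition);

            default:
                // Throw strategy: surface the problem and stop.
                throw new InvalidOperationException(
                    $"Position {stalePosition} is outside the available range " +
                    $"[{earliestAvailable}, {latestAvailable}].");
        }
    }
}
```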

SQL Error 313: Insufficient Arguments

SQL Server may raise error 313 ("An insufficient number of arguments were supplied for the procedure or function cdc.fn_cdc_get_all_changes_*") when the CDC table-valued function receives an LSN outside the valid range. This is a boundary condition variant of the more common errors 22037/22029.

Symptoms:

  • SqlException with Number = 313 in CDC processor logs
  • Processing loop stops advancing for affected capture instances

Resolution: The framework detects error 313 automatically via CdcStalePositionDetector and maps it to StalePositionReasonCodes.TvfInsufficientArguments. If you have a recovery strategy configured (e.g., FallbackToEarliest), the position resets and processing resumes. If using the default Throw strategy, you will see the exception in logs and must reset the position manually.
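
If you handle these exceptions in your own code, for example around a manual reset, a classification helper along the following lines may be useful. It is a minimal sketch that only covers the three error numbers named in this section. `StaleLsnErrors` is a hypothetical name, and the framework's `CdcStalePositionDetector` may recognize additional conditions.

```csharp
using System.Collections.Generic;

// Hypothetical helper; covers only the error numbers named in this guide.
public static class StaleLsnErrors
{
    // 313: insufficient arguments for cdc.fn_cdc_get_all_changes_*
    // 22037 / 22029: the more common out-of-range LSN errors
    private static readonly HashSet<int> Numbers = new HashSet<int> { 313, 22037, 22029 };

    // True when the SQL Server error number indicates an LSN outside the valid range.
    public static bool IsStalePosition(int sqlErrorNumber) => Numbers.Contains(sqlErrorNumber);
}
```

With Microsoft.Data.SqlClient, this could be applied as an exception filter, for example `catch (SqlException ex) when (StaleLsnErrors.IsStalePosition(ex.Number))`.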

Tip: Pair recovery with idempotency filtering to safely reprocess events after a position reset without duplicate side effects.
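
The idea behind idempotency filtering can be sketched as follows: record the ID of each event whose side effect has completed, and skip any event whose ID has already been recorded. `IdempotentApplier` is a hypothetical in-memory illustration, not the framework's filter. A production version would persist the processed IDs durably, ideally in the same transaction as the side effect.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory illustration of idempotency filtering.
public sealed class IdempotentApplier
{
    private readonly HashSet<string> _processed = new HashSet<string>();

    // Runs the side effect only if this event ID has not completed before.
    // Returns true when the side effect ran, false when it was skipped.
    public bool TryApply(string eventId, Action sideEffect)
    {
        if (_processed.Contains(eventId))
        {
            return false; // duplicate delivery after a position reset
        }

        sideEffect();

        // Mark as processed only after the side effect succeeds, so a
        // failed attempt is retried on the next delivery.
        _processed.Add(eventId);
        return true;
    }
}
```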

Manual Recovery Procedure

  1. Stop CDC processor
     kubectl scale deployment cdc-processor --replicas=0
  2. Determine recovery point
     -- Find safe starting position
     SELECT MIN(SequenceNumber) AS SafeStart
     FROM EventSourcing.Events
     WHERE Timestamp > DATEADD(DAY, -1, GETDATE());
  3. Reset position
     await _cdcPositionStore.SetPositionAsync(
         new CdcPosition { SequenceNumber = safeStart },
         CancellationToken.None);
  4. Rebuild affected projections (if needed)
     await _projectionRebuildService.RebuildAsync(
         projectionName: "OrderSummary",
         fromSequence: safeStart,
         CancellationToken.None);
  5. Restart CDC processor
     kubectl scale deployment cdc-processor --replicas=1

Log Truncation Issues

SQL Server Log Truncation

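CDC reads changes from the transaction log, and the cleanup job deletes change-table rows older than the retention period. If the capture job has stopped or retention is too short, events become unavailable:

-- Check that the capture and cleanup jobs exist and view their settings
-- (including the current retention)
EXEC sys.sp_cdc_help_jobs;

-- Start capture job if stopped
EXEC sys.sp_cdc_start_job @job_type = N'capture';

-- Set retention period (takes effect after the cleanup job is restarted)
EXEC sys.sp_cdc_change_job
    @job_type = N'cleanup',
    @retention = 4320; -- 3 days in minutes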

Prevention

-- Set adequate retention
EXEC sys.sp_cdc_change_job
    @job_type = N'cleanup',
    @retention = 10080; -- 7 days in minutes

-- Monitor what is blocking log reuse
-- (REPLICATION means CDC has not yet scanned that part of the log)
SELECT
    DB_NAME(database_id) AS DatabaseName,
    log_reuse_wait_desc
FROM sys.databases
WHERE database_id = DB_ID();

PostgreSQL WAL Retention

-- Check replication slot status
SELECT * FROM pg_replication_slots;

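-- If the slot cannot catch up, it may need to be dropped and recreated.
-- The slot must be inactive to be dropped. Changes made before the new slot
-- is created are not captured, so rebuild affected projections afterwards.
SELECT pg_drop_replication_slot('excalibur_cdc');
SELECT pg_create_logical_replication_slot('excalibur_cdc', 'pgoutput');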

Position Validation

Detect Invalid Position

public class CdcPositionValidator
{
    // _positionStore and _cdcSource are injected dependencies
    // (field declarations and constructor omitted).

    public async Task<PositionValidation> ValidateAsync(CancellationToken ct)
    {
        var currentPosition = await _positionStore.GetPositionAsync(ct);
        var minAvailable = await _cdcSource.GetMinAvailableAsync(ct);
        var maxAvailable = await _cdcSource.GetMaxAvailableAsync(ct);

        // Position has been cleaned up or truncated away
        if (currentPosition < minAvailable)
        {
            return new PositionValidation
            {
                IsValid = false,
                Issue = PositionIssue.BehindMinimum,
                CurrentPosition = currentPosition,
                MinAvailable = minAvailable,
                RecommendedAction = "Reset to minimum available position"
            };
        }

        // Position points past the newest change (for example, after a restore)
        if (currentPosition > maxAvailable)
        {
            return new PositionValidation
            {
                IsValid = false,
                Issue = PositionIssue.AheadOfMaximum,
                CurrentPosition = currentPosition,
                MaxAvailable = maxAvailable,
                RecommendedAction = "Reset to maximum available position"
            };
        }

        return new PositionValidation { IsValid = true };
    }
}
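
The range checks above assume that positions can be ordered. For SQL Server, an LSN is a `binary(10)` value that is compared byte by byte from the most significant byte. If positions are held as raw `byte[]` values, the comparison can be sketched as below. `LsnComparer` is a hypothetical helper, and the framework's own position type may already provide this ordering.

```csharp
using System;

// Hypothetical helper; assumes positions are stored as raw 10-byte LSNs.
public static class LsnComparer
{
    // Returns a negative number, zero, or a positive number when left is
    // lower than, equal to, or higher than right.
    public static int Compare(byte[] left, byte[] right)
    {
        if (left.Length != right.Length)
        {
            throw new ArgumentException("LSNs must have the same length.");
        }

        // Most significant byte first, matching binary ordering in SQL Server.
        for (var i = 0; i < left.Length; i++)
        {
            var diff = left[i].CompareTo(right[i]);
            if (diff != 0)
            {
                return diff;
            }
        }

        return 0;
    }

    // True when min <= position <= max.
    public static bool IsWithin(byte[] position, byte[] min, byte[] max) =>
        Compare(position, min) >= 0 && Compare(position, max) <= 0;
}
```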

Projection Rebuild

When CDC recovery requires a projection rebuild, inject IProjectionRebuildService from the framework:

public class CdcProjectionRecoveryHandler
{
    // _logger, _projectionStore, _eventStore, and _projectorFactory are
    // injected dependencies (field declarations and constructor omitted).

    public async Task RebuildAsync(
        string projectionName,
        long fromSequence,
        CancellationToken ct)
    {
        _logger.LogWarning(
            "Rebuilding projection {Name} from sequence {Sequence}",
            projectionName, fromSequence);

        // 1. Clear existing projection data
        await _projectionStore.ClearAsync(projectionName, ct);

        // 2. Replay events from the event store
        var events = await _eventStore.LoadAsync(
            aggregateId: "*",
            aggregateType: projectionName,
            fromVersion: fromSequence,
            ct);

        // Resolve the projector once instead of once per event
        var projector = _projectorFactory.GetProjector(projectionName);

        foreach (var @event in events)
        {
            await projector.ApplyAsync(@event, ct);
        }

        // 3. Update rebuild metadata
        await _projectionStore.SetLastRebuiltAsync(
            projectionName,
            DateTime.UtcNow,
            ct);

        _logger.LogInformation(
            "Projection {Name} rebuild complete",
            projectionName);
    }
}

Monitoring and Alerting

Health Check

public class CdcHealthCheck : IHealthCheck
{
    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken ct)
    {
        var validation = await _validator.ValidateAsync(ct);

        if (!validation.IsValid)
        {
            return HealthCheckResult.Unhealthy(
                $"CDC position invalid: {validation.Issue}");
        }

        var lag = await _cdcProcessor.GetLagAsync(ct);

        if (lag > TimeSpan.FromMinutes(5))
        {
            return HealthCheckResult.Degraded(
                $"CDC lag: {lag.TotalSeconds}s");
        }

        return HealthCheckResult.Healthy();
    }
}
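
The health check above reports Degraded once lag exceeds five minutes. Keeping the threshold logic in a pure function makes it easy to unit test and to extend with an Unhealthy tier. The sketch below is one way to do that. `CdcHealth`, `CdcHealthThresholds`, and the idea of a separate Unhealthy threshold are assumptions for illustration, not framework types or defaults.

```csharp
using System;

// Hypothetical status enum, independent of the ASP.NET Core health check types.
public enum CdcHealth { Healthy, Degraded, Unhealthy }

public static class CdcHealthThresholds
{
    // Maps a lag measurement to a status using two caller-supplied thresholds.
    public static CdcHealth Classify(TimeSpan lag, TimeSpan degradedAfter, TimeSpan unhealthyAfter)
    {
        if (unhealthyAfter < degradedAfter)
        {
            throw new ArgumentException("unhealthyAfter must not be shorter than degradedAfter.");
        }

        if (lag > unhealthyAfter)
        {
            return CdcHealth.Unhealthy;
        }

        return lag > degradedAfter ? CdcHealth.Degraded : CdcHealth.Healthy;
    }
}
```

For example, with thresholds of 5 and 15 minutes, a lag of 6 minutes classifies as Degraded and a lag of 20 minutes as Unhealthy.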

Alerting

# Prometheus alert rules
groups:
  - name: cdc
    rules:
      - alert: CDCPositionStale
        expr: excalibur_cdc_lag_seconds > 300
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "CDC position is stale"
          runbook: "https://docs/operations/cdc-troubleshooting"

      - alert: CDCPositionInvalid
        expr: excalibur_cdc_position_valid == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "CDC position is invalid"

Database Restore Handling

When a database is restored from a backup (common in development/staging environments), the CDC processor handles two scenarios automatically:

During the Restore (Database Unavailable)

The CDC processor survives database unavailability without crashing:

  • All DB operations are wrapped in a resilience policy (retry with exponential backoff + circuit breaker) via IDataAccessPolicyFactory
  • Checkpoint updates and state store writes are guarded with try-catch — failures are logged but don't terminate the processing loop
  • The circuit breaker opens after sustained failure, reducing load on the recovering database
  • The health check transitions through Degraded → Unhealthy as inactivity duration increases

No operator intervention required — the processor automatically resumes when the database comes back online.
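
To illustrate the retry half of such a resilience policy, the sketch below retries an operation with exponentially growing, capped delays. It is a generic illustration, not the policy that IDataAccessPolicyFactory produces. `RetryWithBackoff` and its parameters are hypothetical, and the circuit breaker is omitted.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Hypothetical illustration of retry with exponential backoff.
public static class RetryWithBackoff
{
    // Delay before the retry that follows attempt number `attempt` (1-based):
    // baseDelay * 2^(attempt - 1), capped at maxDelay.
    public static TimeSpan DelayFor(int attempt, TimeSpan baseDelay, TimeSpan maxDelay)
    {
        var factor = Math.Pow(2, attempt - 1);
        var ms = Math.Min(baseDelay.TotalMilliseconds * factor, maxDelay.TotalMilliseconds);
        return TimeSpan.FromMilliseconds(ms);
    }

    // Runs the operation up to maxAttempts times. The final failure, and any
    // cancellation, propagates to the caller.
    public static async Task<T> ExecuteAsync<T>(
        Func<CancellationToken, Task<T>> operation,
        int maxAttempts,
        TimeSpan baseDelay,
        TimeSpan maxDelay,
        CancellationToken ct = default)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                return await operation(ct).ConfigureAwait(false);
            }
            catch (Exception ex) when (attempt < maxAttempts && !(ex is OperationCanceledException))
            {
                await Task.Delay(DelayFor(attempt, baseDelay, maxDelay), ct).ConfigureAwait(false);
            }
        }
    }
}
```

With a base delay of 1 second and a cap of 30 seconds, the waits between attempts are 1 s, 2 s, 4 s, 8 s, 16 s, and then 30 s for every later retry.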

After the Restore (Data Replaced)

A restored database may have different CDC LSN ranges than what the processor has checkpointed:

| Scenario | What Happens | Resolution |
|---|---|---|
| Checkpoint LSN is within the restored range | Processing resumes normally | Automatic |
| Checkpoint LSN is outside the restored range (stale) | CdcStalePositionException is raised | Handled by recovery strategy |
| CDC tables were not restored | No change data available | Re-enable CDC on restored database |

Configure a recovery strategy to handle stale positions automatically:

services.AddCdcProcessor(cdc =>
{
    cdc.UseSqlServer(sql => sql.ConnectionString(connectionString))
       .TrackTable("dbo.Orders", t => t.MapAll<OrderChangedEvent>())
       .WithRecovery(recovery =>
       {
           // FallbackToEarliest: resume from earliest available position
           // (may reprocess some events; handlers should be idempotent)
           recovery.Strategy(StalePositionRecoveryStrategy.FallbackToEarliest)
                   .MaxAttempts(5)
                   .AttemptDelay(TimeSpan.FromSeconds(30));
       })
       .EnableBackgroundProcessing();
});

Development Environments

In environments where databases are frequently restored from production backups, use FallbackToEarliest or FallbackToLatest instead of the default Throw strategy. Ensure your event handlers are idempotent to safely handle reprocessed events.

Prevention Best Practices

| Practice | Benefit |
|---|---|
| Enable automatic recovery | Reduces manual intervention |
| Register IDataAccessPolicyFactory | Automatic retry and circuit breaker for all DB operations |
| Set adequate log retention | Prevents truncation issues |
| Monitor CDC lag | Early warning of problems |
| Regular position validation | Detect issues before impact |
| Checkpoint frequently | Faster recovery |
| Make event handlers idempotent | Safe reprocessing after restore or recovery |
| Test recovery procedures | Confidence in recovery |

Quick Reference

Recovery Commands

# Stop CDC processor
kubectl scale deployment cdc-processor --replicas=0

# Check current position
kubectl exec -it cdc-processor -- dotnet cdc position show

# Reset position
kubectl exec -it cdc-processor -- dotnet cdc position reset --to-latest

# Start CDC processor
kubectl scale deployment cdc-processor --replicas=1

# Trigger projection rebuild
kubectl exec -it cdc-processor -- dotnet projection rebuild OrderSummary

SQL Server Quick Checks

-- CDC status
SELECT is_cdc_enabled FROM sys.databases WHERE name = DB_NAME();

-- Capture job status
EXEC sys.sp_cdc_help_jobs;

-- Available LSN range
SELECT
    sys.fn_cdc_get_min_lsn('EventSourcing_Events') AS Min,
    sys.fn_cdc_get_max_lsn() AS Max;

See Also