Error Handling & Recovery Guide

Every message in a distributed system can fail. Networks drop, databases lock, external APIs time out, and services crash. The question is not whether failures will happen, but what happens to the message when they do.

This guide explains the complete failure lifecycle in Excalibur.Dispatch: from the first failed attempt, through retries and circuit breakers, into the dead letter queue, and back out via recovery. Understanding how these features compose is the key to building a system that handles failure gracefully.

Why Messages Fail

Message processing failures fall into two categories, and the distinction matters because it determines the correct response.

Transient Failures

Temporary problems that resolve on their own. Retrying is the correct response.

  • Database connection timeout
  • Network blip to an external API
  • SQL deadlock between concurrent transactions
  • Cloud service throttling (HTTP 429)
  • Temporary resource exhaustion

Permanent Failures

Problems that will never resolve by retrying. Continuing to retry wastes resources and delays other messages.

  • Message payload cannot be deserialized (corrupt or wrong schema)
  • No handler registered for the message type
  • Business rule violation (order amount is negative)
  • Authorization failed (invalid credentials)
  • External resource permanently gone (HTTP 404/410)

The framework's job is to retry transient failures, stop retrying permanent failures quickly, and capture everything that fails for later analysis.

The Failure Lifecycle

When a message fails processing, it moves through up to four layers of protection. Each layer operates independently, and you configure each one based on your needs.

flowchart TD
M[Message arrives] --> H[Handler executes]
H -->|Success| Done[Done]
H -->|Failure| CB{Circuit breaker<br/>open?}
CB -->|Yes| DLQ[Dead letter queue]
CB -->|No| R{Retry policy}
R -->|Retry| H
R -->|Exhausted| PD{Poison message<br/>detection}
PD -->|Poison| DLQ
PD -->|Not poison| DLQ
DLQ --> Review[Review & analyze]
Review --> Replay[Replay when fixed]
Replay --> M

Layer 1: Retry Policy

The first line of defense. When a handler throws, the retry policy determines whether and how to retry.

services.AddDispatch(dispatch =>
{
    dispatch.AddDispatchResilience(options =>
    {
        options.EnableRetry = true;
        options.EnableCircuitBreaker = true;
    });
});

// Named retry policies for specific operations
services.AddPollyRetryPolicy("transient-retry", options =>
{
    options.MaxRetries = 3;
    options.BaseDelay = TimeSpan.FromMilliseconds(200);
    options.UseExponentialBackoff = true; // 200ms, 400ms, 800ms
    options.UseJitter = true; // Prevents thundering herd
});

Dispatch supports two retry strategies:

| Strategy | Delay Pattern | Best For |
|---|---|---|
| FixedDelay | Same delay each time (e.g., 1s, 1s, 1s) | Predictable backoff, simple scenarios |
| ExponentialBackoff | Increasing delay (e.g., 200ms, 400ms, 800ms) | Network/API failures, database contention |

Jitter adds randomness to each delay. Without it, 100 consumers that fail at the same moment also retry at the same moment, and the synchronized burst can overload the recovering dependency and cause another failure. Jitter spreads those retries across a time window.
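
To make the delay arithmetic concrete, here is a minimal sketch of exponential backoff with one common jitter scheme ("full jitter"). `BackoffSketch` and its methods are hypothetical helpers written for illustration. They are not part of Excalibur.Dispatch or Polly, and the jitter formula that `UseJitter` applies may differ.

```csharp
using System;

// Hypothetical helper for illustration only; not an Excalibur.Dispatch or Polly API.
public static class BackoffSketch
{
    // Delay before retry number `attempt` (1-based): baseDelay * 2^(attempt - 1).
    public static TimeSpan ExponentialDelay(TimeSpan baseDelay, int attempt)
    {
        if (attempt < 1)
        {
            throw new ArgumentOutOfRangeException(nameof(attempt));
        }

        var factor = Math.Pow(2, attempt - 1);
        return TimeSpan.FromMilliseconds(baseDelay.TotalMilliseconds * factor);
    }

    // "Full jitter": pick a uniformly random delay between zero and the computed delay,
    // so consumers that failed together do not retry together.
    public static TimeSpan WithFullJitter(TimeSpan delay, Random rng)
    {
        return TimeSpan.FromMilliseconds(rng.NextDouble() * delay.TotalMilliseconds);
    }
}
```

With a 200ms base delay, `ExponentialDelay` gives 200ms, 400ms, and 800ms for attempts 1 to 3, matching the table above. `WithFullJitter` then replaces each value with a random delay no larger than it.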

You can also classify exceptions to control retry behavior:

services.AddPoisonMessageHandling(options =>
{
    // Never retry these (permanent failures)
    options.PoisonExceptionTypes.Add(typeof(InvalidOperationException));
    options.PoisonExceptionTypes.Add(typeof(ArgumentNullException));

    // Always retry these (transient failures)
    options.TransientExceptionTypes.Add(typeof(TimeoutException));
    options.TransientExceptionTypes.Add(typeof(HttpRequestException));
});
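
The effect of classification can be sketched as a retry loop that keeps going only for exception types marked transient and gives up immediately on anything else. `ClassifiedRetrySketch` is a hypothetical illustration, not the framework's implementation. It matches exact exception types only and omits the delay between attempts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sketch for illustration only; not the Excalibur.Dispatch retry implementation.
public static class ClassifiedRetrySketch
{
    // Runs `action`, retrying only when the thrown exception's exact type is in `transientTypes`.
    // Returns the number of attempts made. Rethrows as soon as the exception is not transient
    // or the attempts run out.
    public static int Execute(Action action, ISet<Type> transientTypes, int maxAttempts)
    {
        for (var attempt = 1; ; attempt++)
        {
            try
            {
                action();
                return attempt;
            }
            catch (Exception ex) when (attempt < maxAttempts && transientTypes.Contains(ex.GetType()))
            {
                // Transient failure with attempts remaining: loop and try again.
                // A real policy would wait for the backoff delay here.
            }
        }
    }
}
```

A handler that throws `TimeoutException` twice and then succeeds completes on the third attempt, while one that throws `ArgumentNullException` fails after a single attempt because that type is not in the transient set.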

Layer 2: Circuit Breaker

The circuit breaker prevents your system from hammering a failing dependency. If a service is down, retrying every message against it wastes resources and can cascade the failure.

services.AddPollyCircuitBreaker("payment-service", options =>
{
    options.FailureThreshold = 5; // Open after 5 failures
    options.SamplingDuration = TimeSpan.FromSeconds(30);
    options.MinimumThroughput = 10; // Need 10 requests in window
    options.DurationOfBreak = TimeSpan.FromSeconds(60);
});

The circuit breaker has three states:

stateDiagram-v2
[*] --> Closed
Closed --> Open: Failure threshold exceeded
Open --> HalfOpen: Break duration elapsed
HalfOpen --> Closed: Test requests succeed
HalfOpen --> Open: Test request fails

| State | Behavior |
|---|---|
| Closed | Normal operation. Requests flow through. Failures are counted. |
| Open | All requests are rejected immediately without calling the handler. Messages go directly to the dead letter queue with reason CircuitBreakerOpen. |
| Half-Open | A limited number of test requests are allowed through. If they succeed, the circuit closes. If they fail, it opens again. |
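
The three states can be sketched as a small state machine. `CircuitBreakerSketch` and `BreakerState` are hypothetical names used for illustration. The sketch is deliberately simplified: it counts consecutive failures rather than sampling a time window with a minimum throughput, and it does not cap the number of test requests in the half-open state, so it should not be read as the Polly or Dispatch implementation.

```csharp
using System;

public enum BreakerState { Closed, Open, HalfOpen }

// Hypothetical, simplified breaker for illustration only.
public sealed class CircuitBreakerSketch
{
    private readonly int _failureThreshold;
    private readonly TimeSpan _breakDuration;
    private int _failures;
    private DateTimeOffset _openedAt;

    public CircuitBreakerSketch(int failureThreshold, TimeSpan breakDuration)
    {
        _failureThreshold = failureThreshold;
        _breakDuration = breakDuration;
    }

    public BreakerState State { get; private set; } = BreakerState.Closed;

    // Returns true if a request may proceed.
    // Moves Open -> HalfOpen once the break duration has elapsed.
    public bool AllowRequest(DateTimeOffset now)
    {
        if (State == BreakerState.Open && now - _openedAt >= _breakDuration)
        {
            State = BreakerState.HalfOpen;
        }

        return State != BreakerState.Open;
    }

    // Any success closes the circuit and clears the failure count.
    public void RecordSuccess()
    {
        _failures = 0;
        State = BreakerState.Closed;
    }

    // A failure in HalfOpen reopens immediately; in Closed it opens at the threshold.
    public void RecordFailure(DateTimeOffset now)
    {
        _failures++;
        if (State == BreakerState.HalfOpen || _failures >= _failureThreshold)
        {
            State = BreakerState.Open;
            _openedAt = now;
        }
    }
}
```

Passing the current time in as a parameter keeps the transitions deterministic, which makes the state machine easy to exercise in a unit test.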

Transport isolation is built in. Each transport gets its own circuit breaker, so a Kafka failure doesn't affect RabbitMQ processing:

// Per-transport circuit breaker registry
var registry = serviceProvider.GetRequiredService<ITransportCircuitBreakerRegistry>();

// Check state
var states = registry.GetAllStates();
// { "kafka": Closed, "rabbitmq": Open, "azure-service-bus": Closed }

// Look up the breaker for a specific transport (e.g., to inspect or reset it)
var cb = registry.TryGet("rabbitmq");

Layer 3: Poison Message Detection

After retries are exhausted, the PoisonMessageMiddleware evaluates whether the message is a poison message -- one that will always fail regardless of how many times you retry it.

services.AddDispatch(dispatch =>
{
    dispatch.UsePoisonMessageDetection(options =>
    {
        options.MaxRetryAttempts = 5;
        options.MaxProcessingTime = TimeSpan.FromMinutes(5);
        options.PoisonExceptionTypes.Add(typeof(JsonException));
    });
});

Dispatch includes four built-in detectors:

| Detector | Triggers When |
|---|---|
| RetryCountPoisonDetector | Message has exceeded the max retry count |
| ExceptionTypePoisonDetector | Exception type is in the poison list (e.g., JsonException, ArgumentNullException) |
| TimespanPoisonDetector | Message has been in processing for longer than the max age |
| CompositePoisonDetector | Any of the above detectors triggers |

You can add custom detectors for domain-specific logic:

public class BusinessRulePoisonDetector : IPoisonMessageDetector
{
    public Task<PoisonDetectionResult> IsPoisonMessageAsync(
        IDispatchMessage message,
        IMessageContext context,
        MessageProcessingInfo processingInfo,
        Exception? exception = null)
    {
        // Non-retryable business exception
        if (exception is BusinessRuleViolationException)
        {
            return Task.FromResult(PoisonDetectionResult.Poison(
                reason: "Business rule violation - will never succeed on retry",
                detectorName: nameof(BusinessRulePoisonDetector)));
        }

        return Task.FromResult(PoisonDetectionResult.NotPoison());
    }
}

services.AddDispatch(dispatch =>
{
    dispatch.AddPoisonDetector<BusinessRulePoisonDetector>();
});

Layer 4: Dead Letter Queue

When a message cannot be processed -- because retries are exhausted, the circuit breaker is open, or a detector has flagged it as poison -- it moves to the dead letter queue.

The DLQ preserves the full context of the failure: the original message, the exception, the number of delivery attempts, the reason, and any metadata. Nothing is lost.

services.AddDispatch(dispatch =>
{
    dispatch.AddPoisonMessageHandling(options =>
    {
        options.MaxRetryAttempts = 3;
        options.DeadLetterRetentionPeriod = TimeSpan.FromDays(30);
        options.EnableAutoCleanup = true;
    });
});

Every message that enters the DLQ is tagged with a reason:

| Reason | What Happened |
|---|---|
| MaxRetriesExceeded | Handler failed on every retry attempt |
| CircuitBreakerOpen | Circuit breaker rejected the message |
| DeserializationFailed | Message payload could not be deserialized |
| HandlerNotFound | No handler registered for this message type |
| ValidationFailed | Message failed validation middleware |
| AuthorizationFailed | Authorization check rejected the message |
| MessageExpired | Message TTL expired before processing |
| ManualRejection | Handler explicitly rejected the message |
| PoisonMessage | Detected as poison by a detector |
| UnhandledException | Unhandled exception in processing |

Two Levels of Dead Letter Handling

Dispatch provides dead letter handling at two independent levels. They can be used separately or together.

Application-Level DLQ

The IDeadLetterQueue interface provides a transport-agnostic DLQ that works with any transport. Messages are captured at the Dispatch pipeline level.

// Query dead letter entries
var entries = await dlq.GetEntriesAsync(
    DeadLetterQueryFilter.ByReason(DeadLetterReason.MaxRetriesExceeded),
    limit: 50,
    cancellationToken);

// Replay a single entry after fixing the issue
await dlq.ReplayAsync(entryId, cancellationToken);

// Batch replay all validation failures after updating validation logic
var replayedCount = await dlq.ReplayBatchAsync(
    DeadLetterQueryFilter.ByReason(DeadLetterReason.ValidationFailed),
    cancellationToken);

// Purge old entries
await dlq.PurgeOlderThanAsync(TimeSpan.FromDays(30), cancellationToken);

Storage providers for the application-level DLQ:

| Provider | Package | Use Case |
|---|---|---|
| In-Memory | Built-in | Testing, development |
| SQL Server | Excalibur.Data.SqlServer | SQL Server production |
| Elasticsearch | Excalibur.Data.ElasticSearch | Analytics, search, audit |

Transport-Level DLQ

Each transport can also implement IDeadLetterQueueManager for native DLQ support. Transport-level DLQs use the broker's own infrastructure, which is important when messages fail before they even reach the Dispatch pipeline (e.g., deserialization failures in the transport adapter).

| Transport | Mechanism | Status |
|---|---|---|
| Kafka | Topic-based ({topic}.dead-letter) | Available |
| AWS SQS | Queue-based (native redrive via IAmazonSQS) | Available |
| Google Pub/Sub | Subscription-based | Available |
| Azure Service Bus | Subqueue-based | Planned |
| RabbitMQ | Exchange-based | Planned |

The transport-level interface supports a comparable set of operations: move, retrieve, reprocess, statistics, and purge.

// Get statistics about failed messages
var stats = await dlqManager.GetStatisticsAsync(cancellationToken);
// stats.MessageCount, stats.ReasonBreakdown, stats.OldestMessageAge

// Reprocess messages back to their original queue
var result = await dlqManager.ReprocessDeadLetterMessagesAsync(
    messages,
    new ReprocessOptions
    {
        RemoveFromDlq = true,
        ProcessInParallel = true,
        MaxDegreeOfParallelism = 4
    },
    cancellationToken);
// result.SuccessCount, result.FailureCount, result.ProcessingTime

How Retries and Idempotency Work Together

A common question: if a message is retried 3 times, does the inbox see 3 different messages or the same message 3 times?

The answer depends on who is doing the retrying:

| Retry Source | Same MessageId? | Inbox Behavior |
|---|---|---|
| Pipeline retry (Polly/built-in) | Yes -- same pipeline invocation | Inbox is not consulted during retries within one pipeline execution |
| Broker redelivery | Yes -- same transport message | Inbox detects duplicate and skips |
| Publisher retry (outbox re-publish) | Depends on outbox implementation | Use MessageIdStrategy.Custom with business keys for safety |

Pipeline retries happen inside a single message processing attempt. The inbox operates at the pipeline boundary -- it checks on message arrival and records on completion. So pipeline retries don't conflict with idempotency; they complement it.
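
The boundary behavior described above can be sketched with an in-memory inbox that is checked once on arrival and written once on completion, with retries happening in between. `InboxSketch` is a hypothetical illustration. A real inbox store is persistent and is not implemented this way.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical in-memory inbox for illustration only; real inbox stores are persistent.
public sealed class InboxSketch
{
    private readonly HashSet<string> _processed = new HashSet<string>();

    // Returns true if the handler ran to completion, false if the message was skipped
    // as a duplicate. Rethrows the handler's exception when all attempts fail.
    public bool Process(string messageId, Action handler, int maxAttempts)
    {
        // Boundary check: consulted once on arrival, never between retries.
        if (_processed.Contains(messageId))
        {
            return false;
        }

        for (var attempt = 1; ; attempt++)
        {
            try
            {
                handler();
                break;
            }
            catch (Exception) when (attempt < maxAttempts)
            {
                // Pipeline retry: same message, same invocation, inbox untouched.
            }
        }

        // Recorded on completion only. If the final attempt throws, nothing is recorded,
        // so a later broker redelivery is processed rather than skipped.
        _processed.Add(messageId);
        return true;
    }
}
```

A handler that fails twice and succeeds on the third attempt is recorded once. Delivering the same `messageId` again returns `false` without invoking the handler.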

See the Idempotent Consumer Guide for the full inbox pattern.

Monitoring Failed Messages

Health Checks

Monitor DLQ depth as a health indicator:

public class DeadLetterHealthCheck : IHealthCheck
{
    private readonly IDeadLetterQueue _dlq;

    public DeadLetterHealthCheck(IDeadLetterQueue dlq) => _dlq = dlq;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context,
        CancellationToken ct)
    {
        var count = await _dlq.GetCountAsync(
            DeadLetterQueryFilter.PendingOnly(), ct);

        return count switch
        {
            > 100 => HealthCheckResult.Degraded(
                $"Dead letter queue has {count} pending entries"),
            > 0 => HealthCheckResult.Healthy(
                $"{count} entries in DLQ"),
            _ => HealthCheckResult.Healthy()
        };
    }
}

Alerting on Thresholds

services.AddPoisonMessageHandling(options =>
{
    options.EnableAlerting = true;
    options.AlertThreshold = 10; // Alert if 10+ failures
    options.AlertTimeWindow = TimeSpan.FromMinutes(15); // Within 15 minutes
});
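
A threshold combined with a time window amounts to a sliding-window count. The sketch below shows one way such a check could work. `FailureAlertWindowSketch` is a hypothetical illustration, and how Dispatch evaluates `AlertThreshold` and `AlertTimeWindow` internally is not confirmed here.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical sliding-window counter for illustration only.
public sealed class FailureAlertWindowSketch
{
    private readonly int _threshold;
    private readonly TimeSpan _window;
    private readonly Queue<DateTimeOffset> _failures = new Queue<DateTimeOffset>();

    public FailureAlertWindowSketch(int threshold, TimeSpan window)
    {
        _threshold = threshold;
        _window = window;
    }

    // Records a failure at `now` and returns true when the number of failures
    // still inside the window has reached the threshold.
    public bool RecordFailure(DateTimeOffset now)
    {
        _failures.Enqueue(now);

        // Drop failures that have aged out of the window.
        while (_failures.Count > 0 && now - _failures.Peek() > _window)
        {
            _failures.Dequeue();
        }

        return _failures.Count >= _threshold;
    }
}
```

With a threshold of 10 and a 15-minute window, the tenth failure inside any 15-minute span returns `true`, while failures spread further apart never accumulate enough to alert.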

Circuit Breaker State

var registry = serviceProvider.GetRequiredService<ITransportCircuitBreakerRegistry>();
var states = registry.GetAllStates();

foreach (var (transport, state) in states)
{
    if (state == CircuitState.Open)
    {
        logger.LogWarning("Circuit breaker OPEN for transport {Transport}", transport);
    }
}

Recovery Strategies

When messages land in the DLQ, the recovery approach depends on the failure category.

Transient Failures (MaxRetriesExceeded)

These usually resolve on their own. Wait for the underlying issue to clear, then replay:

// After the database/service recovers, replay all transient failures
var count = await dlq.ReplayBatchAsync(
    DeadLetterQueryFilter.ByReason(DeadLetterReason.MaxRetriesExceeded),
    cancellationToken);

logger.LogInformation("Replayed {Count} messages after service recovery", count);

Validation Failures

Fix the validation logic or data, then replay:

var failures = await dlq.GetEntriesAsync(
    DeadLetterQueryFilter.ByReason(DeadLetterReason.ValidationFailed),
    cancellationToken: ct);

// Review each entry
foreach (var entry in failures)
{
    logger.LogInformation(
        "Validation failure: {Type} at {Time} - {Exception}",
        entry.MessageType,
        entry.EnqueuedAt,
        entry.ExceptionMessage);
}

// After updating validation rules, replay the batch
await dlq.ReplayBatchAsync(
    DeadLetterQueryFilter.ByReason(DeadLetterReason.ValidationFailed), ct);

Deserialization Failures

These are permanent. The message payload is corrupt or the schema has changed. Review, fix the data or schema, then decide whether to replay or purge:

var deserializationFailures = await dlq.GetEntriesAsync(
    DeadLetterQueryFilter.ByReason(DeadLetterReason.DeserializationFailed),
    cancellationToken: ct);

// These rarely succeed on replay -- purge after investigation.
// Note: as called here, the purge is by age only; no reason filter is passed,
// so it is not limited to deserialization failures.
await dlq.PurgeOlderThanAsync(TimeSpan.FromDays(7), ct);

Transport-Level Recovery (Reprocessing)

For transport-native DLQs, use ReprocessOptions for fine-grained control:

var result = await dlqManager.ReprocessDeadLetterMessagesAsync(
    messages,
    new ReprocessOptions
    {
        // Only reprocess messages from the last hour
        MessageFilter = msg => msg.DeadLetteredAt > DateTimeOffset.UtcNow.AddHours(-1),
        // Transform messages before replay (e.g., fix headers)
        MessageTransform = msg =>
        {
            msg.Headers["RetrySource"] = "manual-recovery";
            return msg;
        },
        RemoveFromDlq = true,
        RetryDelay = TimeSpan.FromMilliseconds(100),
        MaxMessages = 100
    },
    cancellationToken);

logger.LogInformation(
    "Reprocessed {Success}/{Total} messages in {Duration}",
    result.SuccessCount, result.TotalCount, result.ProcessingTime);

Decision Guide: What Level of Protection Do You Need?

Not every handler needs every layer. Match the protection to the consequence of failure.

Is the message from an external system (broker redelivery possible)?
  NO  --> Internal pipeline only? Basic retry is usually sufficient.
  YES --> What happens if processing fails permanently?
            Nothing (metrics, logs)       --> Retry only, no DLQ needed
            Minor issue (duplicate email) --> Retry + DLQ for monitoring
            Data loss or corruption       --> Retry + Circuit breaker + DLQ + Idempotent inbox
            Financial impact              --> All layers + transport-level DLQ

Minimal -- Internal messages, low consequence:

services.AddDispatch(dispatch =>
{
    dispatch.AddDispatchResilience(options =>
    {
        options.EnableRetry = true;
    });
});

Standard -- External messages, moderate consequence:

services.AddDispatch(dispatch =>
{
    dispatch.AddDispatchResilience(options =>
    {
        options.EnableRetry = true;
        options.EnableCircuitBreaker = true;
    });

    dispatch.AddPoisonMessageHandling(options =>
    {
        options.MaxRetryAttempts = 3;
        options.DeadLetterRetentionPeriod = TimeSpan.FromDays(30);
        options.EnableAutoCleanup = true;
    });
});

Full protection -- Financial transactions, critical business processes:

services.AddDispatch(dispatch =>
{
    dispatch.AddHandlersFromAssembly(typeof(Program).Assembly);

    // Resilience: retry + circuit breaker
    dispatch.AddDispatchResilience(options =>
    {
        options.EnableRetry = true;
        options.EnableCircuitBreaker = true;
    });

    // Poison message detection
    dispatch.UsePoisonMessageDetection(options =>
    {
        options.MaxRetryAttempts = 5;
        options.MaxAge = TimeSpan.FromHours(24);
        options.PoisonExceptionTypes.Add(typeof(JsonException));
    });

    // Dead letter queue with alerting
    dispatch.AddPoisonMessageHandling(options =>
    {
        options.DeadLetterRetentionPeriod = TimeSpan.FromDays(90);
        options.EnableAlerting = true;
        options.AlertThreshold = 5;
    });
});

// Transport-specific DLQ (e.g., Kafka)
services.AddKafkaDeadLetterQueue(options =>
{
    options.MaxDeliveryAttempts = 3;
});

// Named policies for critical operations
services.AddPollyRetryPolicy("payment-retry", options =>
{
    options.MaxRetries = 5;
    options.UseExponentialBackoff = true;
    options.UseJitter = true;
});

services.AddPollyCircuitBreaker("payment-cb", options =>
{
    options.FailureThreshold = 3;
    options.DurationOfBreak = TimeSpan.FromSeconds(60);
});

// Persistent DLQ store
services.AddSqlServerDeadLetterStore(connectionString);

// Idempotent consumer for critical handlers
services.AddSqlServerInboxStore(options =>
{
    options.ConnectionString = connectionString;
});

Summary

| Layer | Purpose | Handles | Reference |
|---|---|---|---|
| Retry | Recover from transient failures | Timeouts, deadlocks, throttling | Polly Resilience |
| Circuit Breaker | Stop cascading failures | Downstream service outages | Polly Resilience |
| Poison Detection | Identify permanent failures early | Bad payloads, missing handlers | Dead Letter |
| Dead Letter Queue | Preserve failed messages for analysis | All failures after retry exhaustion | Dead Letter |
| Recovery/Replay | Reprocess after fixing the issue | DLQ entries | Dead Letter |
| Idempotent Consumer | Prevent duplicate processing on replay | Redelivered messages | Idempotent Consumer |

These layers are independent. You can use retry without a DLQ, a DLQ without circuit breakers, or all of them together. Start with retry and DLQ, then add circuit breakers and poison detection as your system grows.

See Also

  • Dead Letter Handling -- DLQ interface reference, filtering, replay, and provider configuration
  • Polly Resilience -- Detailed retry, circuit breaker, timeout, and bulkhead policy setup
  • Recovery Runbooks -- Step-by-step operational procedures for production incident recovery
  • Validation Middleware -- Configure message validation that feeds into the error handling pipeline