Production Observability Guide

Knowing that your system can emit metrics is different from knowing which metrics matter. Excalibur exposes over 100 OpenTelemetry metrics, but in practice you need to watch about a dozen signals to know whether your system is healthy.

This guide explains what to monitor, what to alert on, and how to set up dashboards that tell you something useful.

Before You Start

  • .NET 8.0+ (or .NET 9/10 for latest features)
  • Install the required packages:
    dotnet add package Excalibur.Dispatch
    dotnet add package Excalibur.Dispatch.Observability
    dotnet add package OpenTelemetry.Extensions.Hosting
  • Familiarity with OpenTelemetry concepts; see the metrics reference for the full list of emitted metrics

Enabling Observability

Before anything else, enable tracing and metrics:

builder.Services.AddDispatch(dispatch =>
{
    dispatch.AddHandlersFromAssembly(typeof(Program).Assembly);
    dispatch.UseOpenTelemetry(); // Enables both tracing and metrics
});

builder.Services.AddOpenTelemetry()
    .WithTracing(tracing =>
    {
        tracing.AddSource("Excalibur.Dispatch.*");
        // Add your exporter (Jaeger, Zipkin, OTLP, etc.)
        tracing.AddOtlpExporter();
    })
    .WithMetrics(metrics =>
    {
        metrics.AddMeter("Excalibur.Dispatch.*");
        metrics.AddMeter("Excalibur.Data.*");
        metrics.AddMeter("Excalibur.EventSourcing.*");
        // Add your exporter (Prometheus, OTLP, etc.)
        metrics.AddPrometheusExporter();
    });
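
If you chose the Prometheus exporter above, the application also needs to serve a scrape endpoint. With the OpenTelemetry.Exporter.Prometheus.AspNetCore package that is a single call on the built app (a setup sketch, not Excalibur-specific):

```csharp
var app = builder.Build();

// Exposes GET /metrics in the Prometheus text format
app.MapPrometheusScrapingEndpoint();

app.Run();
```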

The Five Signals That Matter

Out of 100+ available metrics, these are the ones that tell you whether your system is working.

1. Message Processing Latency

Metric: dispatch.messages.duration (histogram, milliseconds)

This is the single most important metric. If processing latency increases, something is degrading -- slow database, overloaded handler, or resource contention.

What to watch:

  • p50 (median): Your typical processing time
  • p99: Your worst-case processing time
  • p99 / p50 ratio: If this ratio suddenly increases, you have a tail latency problem

Alert when: p99 exceeds 2x your normal baseline for 5+ minutes.
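
In PromQL, the p99 is a histogram_quantile over the duration histogram. A sketch, assuming the Prometheus exporter renders the metric as dispatch_messages_duration_milliseconds_bucket (exporter naming conventions vary, so check your actual series names):

```promql
# p99 message processing latency over a 5-minute window
histogram_quantile(0.99, sum by (le) (rate(dispatch_messages_duration_milliseconds_bucket[5m])))
```

Recording this expression as a rule makes the "p99 > 2x baseline" comparison cheap to alert on.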

2. Message Failure Rate

Metric: dispatch.messages.failed (counter) vs dispatch.messages.processed (counter)

Calculate the failure rate: failed / (processed + failed) * 100. A healthy system should have a failure rate under 1%. Spikes indicate handler bugs, external service outages, or bad data.

Alert when: Failure rate exceeds 5% over a 5-minute window.
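
The failure-rate formula translates to PromQL as follows. A sketch, assuming the exporter appends _total to counter names:

```promql
# Failure rate (%) over a 5-minute window
100 * sum(rate(dispatch_messages_failed_total[5m]))
  / (sum(rate(dispatch_messages_processed_total[5m]))
     + sum(rate(dispatch_messages_failed_total[5m])))
```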

3. Dead Letter Queue Depth

Metric: dispatch.dlq.depth (gauge)

A growing DLQ means messages are failing faster than they are being reviewed and replayed. A flat, non-zero DLQ is normal (pending review). A continuously growing DLQ is an incident.

Alert when: DLQ depth increases by more than 50 entries in 15 minutes.
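
In PromQL, "growing" is easiest to express as a delta over the alert window. A sketch, assuming the gauge is exported as dispatch_dlq_depth:

```promql
# Fires when the DLQ grew by more than 50 entries in 15 minutes
delta(dispatch_dlq_depth[15m]) > 50
```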

4. Circuit Breaker State

Metric: dispatch.circuitbreaker.state (gauge: 0=Closed, 1=Open, 2=HalfOpen)

An open circuit breaker means a downstream dependency is unhealthy and messages to that transport are being rejected. This is the circuit breaker doing its job, but you need to know about it.

Alert when: Any circuit breaker enters Open state.
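
A sketch of the corresponding PromQL condition, assuming the gauge is exported as dispatch_circuitbreaker_state with a transport label:

```promql
# Any transport whose breaker reports Open (state == 1)
max by (transport) (dispatch_circuitbreaker_state) == 1
```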

5. Outbox Lag

Metric: dispatch.transport.pending_messages (gauge)

If you use the outbox pattern, this tells you how many messages are waiting to be published. A small, stable number is normal (outbox processor is keeping up). A growing number means the processor is falling behind.

Alert when: Pending count exceeds 1,000 for more than 10 minutes.
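
A sketch of the matching PromQL condition; pair it with a 10-minute `for:` clause in the alert rule so short bursts don't page you:

```promql
# Outbox backlog above the 1,000-message threshold
dispatch_transport_pending_messages > 1000
```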

What Traces Look Like

When UseOpenTelemetry() is enabled, each processed message produces a trace whose spans mirror the middleware pipeline:

[Excalibur.Dispatch.Pipeline] ProcessMessage OrderCreatedEvent
├── [Excalibur.Dispatch.Middleware] ValidationMiddleware (0.2ms)
├── [Excalibur.Dispatch.Middleware] AuthorizationMiddleware (0.1ms)
├── [Excalibur.Dispatch.Middleware] IdempotentHandlerMiddleware (1.5ms)
│   └── [Excalibur.Dispatch.Inbox] CheckProcessed (1.2ms)
├── [Excalibur.Dispatch.Handler] ProcessOrderHandler (45ms)
│   ├── [Database] INSERT Orders (12ms)
│   └── [HTTP] POST /api/payments (30ms)
└── [Excalibur.Dispatch.Middleware] MetricsMiddleware (0.1ms)

Total: 47ms

Each span carries tags that you can filter and group by:

Tag                  Example              Use for
message_type         OrderCreatedEvent    Filter traces by message type
handler              ProcessOrderHandler  Identify slow handlers
result               Success / Failure    Filter for failures
transport            kafka / rabbitmq     Filter by transport
dispatch.message_id  abc-123              Trace a specific message

Health Check Setup

Health checks provide a quick binary signal for load balancers and orchestrators.

builder.Services.AddHealthChecks()
    // Pipeline health
    .AddCheck("self", () => HealthCheckResult.Healthy())
    // Transport connectivity
    .AddTransportHealthChecks()
    // Dead letter queue depth
    .AddCheck<DeadLetterHealthCheck>("dlq");

var app = builder.Build();

// Kubernetes-style endpoints
app.MapHealthChecks("/health/live", new HealthCheckOptions
{
    // Liveness: is the process running?
    Predicate = check => check.Name == "self"
});

app.MapHealthChecks("/health/ready", new HealthCheckOptions
{
    // Readiness: can we process messages?
    Predicate = _ => true
});

Custom Dead Letter Health Check

public class DeadLetterHealthCheck : IHealthCheck
{
    private readonly IDeadLetterQueue _dlq;

    public DeadLetterHealthCheck(IDeadLetterQueue dlq) => _dlq = dlq;

    public async Task<HealthCheckResult> CheckHealthAsync(
        HealthCheckContext context, CancellationToken ct)
    {
        var count = await _dlq.GetCountAsync(DeadLetterQueryFilter.PendingOnly(), ct);

        return count switch
        {
            > 100 => HealthCheckResult.Degraded($"{count} messages in DLQ"),
            > 0 => HealthCheckResult.Healthy($"{count} messages in DLQ"),
            _ => HealthCheckResult.Healthy("DLQ empty")
        };
    }
}

Dashboard Patterns

The Overview Dashboard

A single dashboard that answers "is the system healthy?" at a glance:

Panel             Metric                                       Visualization
Messages/sec      rate(dispatch.messages.processed)            Time series
Failure rate      failed / (processed + failed) * 100          Gauge (0-100%)
p99 latency       dispatch.messages.duration p99               Time series
DLQ depth         dispatch.dlq.depth                           Single stat
Circuit breakers  dispatch.circuitbreaker.state per transport  Status map
Outbox lag        dispatch.transport.pending_messages          Single stat

The Debug Dashboard

When something is wrong, switch to the debug dashboard to drill down:

Panel                     Metric                                            Purpose
Latency by handler        dispatch.messages.duration grouped by handler     Find the slow handler
Failures by type          dispatch.messages.failed grouped by message_type  Find the failing message
DLQ by reason             dispatch.dlq.enqueued grouped by reason           Understand why messages fail
Circuit breaker timeline  dispatch.circuitbreaker.state_changes             Correlate outages with circuit trips
Retry rate                dispatch.messages.failed by retry_attempt         Measure transient failure frequency

Alert Thresholds

Start with these thresholds and tune based on your system's baseline:

Alert                   Condition                    Severity  Action
High failure rate       >5% failures over 5 min      Warning   Check handler logs
Very high failure rate  >20% failures over 5 min     Critical  Potential outage
Latency spike           p99 > 2x baseline for 5 min  Warning   Check dependency health
DLQ growing             +50 entries in 15 min        Warning   Review DLQ entries
Circuit breaker open    Any breaker in Open state    Warning   Check downstream service
Outbox backlog          >1,000 pending for 10 min    Warning   Check outbox processor
Health check failing    Readiness probe fails        Critical  Service cannot process
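
These thresholds translate into Prometheus alerting rules roughly as follows. A sketch covering two of the rows, assuming the exporter renders dots as underscores and appends _total to counters; adjust names and labels to your setup:

```yaml
groups:
  - name: dispatch-alerts
    rules:
      - alert: DispatchHighFailureRate
        expr: |
          sum(rate(dispatch_messages_failed_total[5m]))
            / (sum(rate(dispatch_messages_processed_total[5m]))
               + sum(rate(dispatch_messages_failed_total[5m]))) > 0.05
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Dispatch failure rate above 5% for 5 minutes"
      - alert: DispatchOutboxBacklog
        expr: dispatch_transport_pending_messages > 1000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Outbox backlog above 1,000 pending messages"
```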

Common Troubleshooting Scenarios

"Latency suddenly spiked"

  1. Check dispatch.messages.duration grouped by handler to find the slow handler
  2. Check that handler's traces to find the slow operation (database? API call?)
  3. Check dispatch.circuitbreaker.state -- is a downstream circuit open, causing retries?

"DLQ is growing"

  1. Check dispatch.dlq.enqueued grouped by reason -- what's failing?
  2. If MaxRetriesExceeded: transient failures, check downstream health
  3. If DeserializationFailed: schema mismatch, check message publishers
  4. If ValidationFailed: bad data, check upstream systems

"Messages are processing but nothing happens"

  1. Check dispatch.transport.pending_messages -- is the outbox processor running?
  2. Check transport health checks -- is the broker reachable?
  3. Check dispatch.messages.processed -- are messages actually being consumed?

See Also

  • Observability Overview - Monitor Dispatch applications with OpenTelemetry, health checks, and integrations
  • Health Checks - Application health monitoring for load balancers and orchestrators
  • Performance Tuning - Optimize throughput and latency for production workloads