Your Observability Stack Is All Dashboards and No Answers
Most engineering teams I work with have observability tooling. They have Grafana dashboards, Datadog agents, CloudWatch alarms, maybe an OpenTelemetry collector stitching it together. They are paying five or six figures a year for the privilege.
And when something breaks at 2am, someone still spends 45 minutes clicking through dashboards trying to figure out which service is actually broken.
The problem is not a lack of data. It is a lack of structure. Most observability stacks are built bottom-up: instrument everything, ship it somewhere, build dashboards when someone asks. The result is telemetry sprawl: hundreds of dashboards, thousands of metrics, and no clear path from alert to root cause.
The Three Pillars Are Table Stakes, Not a Strategy
Metrics, logs, and traces are the raw materials. Having all three does not mean you have observability. You have observability when an on-call engineer can go from page to root cause in under 10 minutes without tribal knowledge.
That is the bar. Most teams are nowhere near it.
The gap is usually in correlation. Your metrics tell you CPU spiked. Your logs tell you errors increased. Your traces show a slow database call. But unless those three signals are connected, with shared request IDs, consistent service labels, and correlated timestamps, you are still doing manual detective work across three different tools.
I worked with a FinTech platform running 40+ microservices on EKS. They had Prometheus, Loki, and Tempo all deployed. Three pillars, check. Their mean time to resolution (MTTR) for P1 incidents was still averaging 52 minutes. The issue was not missing data. It was that none of it was connected.
What We Changed
We restructured their observability around three principles:
1. Service-level objectives drive alerting, not thresholds.
We replaced 180+ static threshold alerts with 12 SLO-based alerts tied to actual user impact. Instead of alerting on "CPU > 80%" (which fires constantly and means nothing), we alerted on "error budget burn rate exceeding 10x in the last 5 minutes."
This cut alert volume by 87% and increased the percentage of actionable alerts from roughly 15% to over 90%.
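For intuition, here is a minimal sketch of the burn-rate arithmetic behind that kind of alert. The SLO target and request counts below are illustrative, and the production alerts evaluated the same ratio over Prometheus data rather than in application code.

# Illustrative burn-rate math; the SLO target and counts are made up.
SLO_TARGET = 0.999              # 99.9% of requests must succeed
ERROR_BUDGET = 1 - SLO_TARGET   # so 0.1% of requests are allowed to fail

def burn_rate(errors: int, total: int) -> float:
    """How many times faster than allowed the error budget is being spent."""
    if total == 0:
        return 0.0
    return (errors / total) / ERROR_BUDGET

# 1,200 failures out of 80,000 requests in 5 minutes is a 1.5% error rate,
# 15x the budgeted rate -- well past the 10x threshold, so it pages.
assert burn_rate(1_200, 80_000) > 10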
2. Every request gets a correlation ID, no exceptions.
We enforced a trace-id header across all services using a thin wrapper around the OpenTelemetry SDK, similar to this:
from opentelemetry import trace
from opentelemetry.propagate import extract

tracer = trace.get_tracer(__name__)

def traced_handler(request, handler):
    # Reject any request that arrives without a correlation ID.
    trace_id = request.headers.get("trace-id")
    if not trace_id:
        raise ValueError("Missing required trace-id header")

    # Continue the caller's trace context rather than starting a new one.
    ctx = extract(request.headers)
    with tracer.start_as_current_span(
        "handle_request",
        context=ctx,
        attributes={"trace.id": trace_id},
    ):
        return handler(request)
This was not optional. Any service that did not propagate the header failed a CI check:
def test_missing_trace_id_is_rejected():
    response = client.get("/endpoint", headers={})  # no trace-id
    assert response.status_code == 400
With consistent trace IDs, an engineer can now click from a failing SLO alert to the specific traces that violated the objective, then drill into the exact span where latency or errors originated.
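The other half of the correlation is logs: the same ID has to land on every log line, or the jump from a trace to its logs still breaks. Here is a minimal sketch using Python's standard logging module and the OpenTelemetry SDK; the log format is illustrative, not the client's exact schema.

import logging

from opentelemetry import trace

class TraceIdFilter(logging.Filter):
    """Stamp every log record with the trace ID of the active span."""

    def filter(self, record: logging.LogRecord) -> bool:
        ctx = trace.get_current_span().get_span_context()
        record.trace_id = format(ctx.trace_id, "032x") if ctx.is_valid else "-"
        return True

handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter(
    "%(asctime)s %(levelname)s trace_id=%(trace_id)s %(message)s"
))
handler.addFilter(TraceIdFilter())
logging.getLogger().addHandler(handler)

With that in place, filtering Loki on the same trace_id returns exactly the log lines behind the span you are looking at.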
3. Runbooks are linked to alerts, not buried in Confluence.
Every SLO alert includes a runbook_url annotation that points to a living document with three sections: what this alert means, how to investigate, and who owns the upstream dependency. We templated this in Terraform so new alerts cannot be created without a runbook link.
resource "grafana_rule_group" "payment_slo" {
name = "payment-service-slo"
folder_uid = grafana_folder.slo_alerts.uid
interval_seconds = 60
rule {
name = "PaymentErrorBudgetBurnRate"
condition = "burn_rate"
annotations = {
summary = "Payment service error budget burning too fast"
runbook_url = "https://runbooks.internal/payment-slo-breach"
}
}
}
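The enforcement itself lived in the Terraform module, but the same rule is easy to express as a CI gate. Here is a hypothetical sketch that assumes alert rules are exported to JSON before apply; the paths and field names are illustrative.

import json
import sys
from pathlib import Path

def rules_missing_runbooks(rules_dir: str) -> list[str]:
    """List alert rules that have no runbook_url annotation."""
    missing = []
    for path in Path(rules_dir).rglob("*.json"):
        for rule in json.loads(path.read_text()).get("rules", []):
            if not rule.get("annotations", {}).get("runbook_url"):
                missing.append(f"{path.name}: {rule.get('name', '<unnamed>')}")
    return missing

if __name__ == "__main__":
    offenders = rules_missing_runbooks("alert-rules")
    if offenders:
        print("Alert rules without a runbook_url:", *offenders, sep="\n  ")
        sys.exit(1)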
The Results
After 60 days:
- MTTR for P1 incidents dropped from 52 minutes to 11 minutes
- Alert noise reduced by 87%, from ~210 weekly alerts to ~27
- On-call escalation rate dropped by 63%, because the first responder could actually resolve the issue
- Dashboard count went from 140+ to 22 curated service dashboards, each built around SLO performance rather than raw infrastructure metrics
The monthly Datadog bill also dropped because we stopped ingesting metrics nobody looked at. Custom metric cardinality was the biggest offender: teams were emitting high-cardinality labels on metrics that had no alert and no dashboard. We wrote a script that flagged any metric not referenced by an alert rule or dashboard panel in the last 90 days, then dropped those series from the collector config.
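The audit script does not need to be clever. Here is a simplified sketch that assumes dashboards and alert rules have been exported to a local directory and the ingested series names dumped to a text file; the paths are hypothetical and the 90-day reference window is not modeled here.

from pathlib import Path

CONFIG_GLOBS = ("*.json", "*.yaml", "*.yml", "*.tf")

def referenced_blob(config_dir: str) -> str:
    """Concatenate every dashboard and alert-rule definition into one string."""
    parts = []
    for pattern in CONFIG_GLOBS:
        parts.extend(p.read_text() for p in Path(config_dir).rglob(pattern))
    return "\n".join(parts)

def unused_metrics(ingested: set[str], config_dir: str) -> set[str]:
    """Flag ingested series whose metric name appears in no config file."""
    blob = referenced_blob(config_dir)
    return {name for name in ingested if name not in blob}

if __name__ == "__main__":
    ingested = set(Path("ingested_metrics.txt").read_text().split())
    for name in sorted(unused_metrics(ingested, "observability-config")):
        print(name)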
The Uncomfortable Truth
Most observability problems are not tooling problems. Switching from Datadog to Grafana Cloud, or vice versa, does not fix the underlying issue. The problem is almost always:
- No service-level objectives, so you do not know what "broken" means
- No correlation between signals, so investigation is manual
- No maintenance discipline, so dashboards and alerts accumulate without review
- No ownership model, so nobody is responsible for keeping the observability stack useful
Observability is not a monitoring upgrade. It is an engineering discipline. Treat it like one.
If your team is drowning in alerts but still slow to resolve incidents, book a strategy call and let's work through your observability architecture together.