You have metrics. So why are incidents still a mystery?
Teams have more dashboards, metrics, and alerts than ever, but still struggle to respond effectively during incidents. Learn how to build systems that guide action, not just display data.
In modern engineering teams, we have more data available than ever before. Every application emits telemetry, every engineer would recognize the Grafana color scheme in seconds, and alerts fire (unfortunately) around the clock. And yet, when something breaks, it takes too long to understand what's going on, what has changed, and how to fix it. Sometimes, dashboards aren't even looked at when an incident occurs!
Observability is the term that has become more and more popular as a replacement for the less-sexy “monitoring”. However, in many organizations, changing the name didn’t solve the problem: observability became another set of tools that show symptoms without revealing causes and generate alerts without guiding action. And we end up in the same situation: engineers staring at dashboards, unsure whether the red line matters, whether the metrics are correct, or whether anyone else has seen it before.
So we ignore them. If nobody complains, nothing is broken, right? It’s not that teams don’t care. It’s that we confuse visibility with insight, and insight with action. And so everybody keeps fighting symptoms.
The illusion: metrics without meaning
Nowadays, most teams can confidently say, “We have observability.” They’ve deployed metrics, logs, and traces. They’ve built dashboards and set up alerts. But do they trust the data, understand its meaning, or know how to act on it? The answers are less clear.
Too often, observability is treated as a production-readiness checkbox: something to tick off before deploying to production because the release team requires it. So teams copy dashboards from past projects, tweak a few parameters, and keep the default thresholds. Typically, that includes the default alert of “CPU usage is over 80%”.
Therefore, when such alerts are triggered, engineers are unsure how to handle them. More often than not, the response is silence or a Slack thread that fades away unresolved.
The deeper problem is that most observability data is technical, not contextual. It tells you that a system is slow, but not whether a customer is affected. It shows that the error rate is rising, but not whether it’s costing the business money. Without that bridge between telemetry and business impact, it’s hard to prioritize or interpret the metrics effectively. If you have to ask around about the impact when an incident occurs (or during the RCA, for example), you are likely in this situation.
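To make that bridge concrete, here is a back-of-the-envelope sketch. The numbers and metric names are invented; the point is that combining one technical signal (an error rate) with two business figures (throughput and order value) turns “the error rate is rising” into something everyone can prioritize.

```python
# A minimal sketch of the telemetry-to-business-impact bridge.
# All numbers and sources below are illustrative assumptions.

error_rate = 0.02           # 2% of checkout requests failing (from your metrics backend)
checkouts_per_minute = 500  # current checkout throughput (from product analytics)
avg_order_value = 42.0      # average order value in EUR (from the data/finance team)

failed_checkouts_per_minute = error_rate * checkouts_per_minute
revenue_at_risk_per_hour = failed_checkouts_per_minute * 60 * avg_order_value

print(f"~{failed_checkouts_per_minute:.0f} failed checkouts/min, "
      f"~{revenue_at_risk_per_hour:,.0f} EUR/hour at risk")
# -> ~10 failed checkouts/min, ~25,200 EUR/hour at risk
```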
As engineers, we tend to focus on dashboards and alerts, but often overlook other sources of signal, such as logs, batch jobs, feature flag usage, user interactions, or recent deployments. All of these can provide insight into system behavior and failure patterns. But unlike metrics, they are rarely collected in one place.
Meanwhile, we chase the newest observability tools that promise to solve all the problems, even though many teams never use half the features their current tools provide. Predictive analytics, anomaly detection, and historical correlation are available but underutilized, or too expensive to apply without a strategy. That’s why, in many war rooms, we are still guessing at root causes and hunting for a previous incident that might match the pattern.
With all these observability tools, we have made things visible, but not yet meaningful. And meaning is what it takes to act.
Great tools don’t replace culture
The problem isn’t just the tools or the data. It’s the environment.
Many teams have access to several observability tools in their organization: one for AWS, one for GCP, another for traces, a separate one for logs, and so on. This fragments understanding and visibility. Engineers must jump between tools with different interfaces, data models, and alerts, and when an incident hits, it can be hard to get the whole picture. In the end, teams fall back on the tools they are most comfortable with. In war rooms, people share links to different tools, and sometimes others don’t even have access to them. This leads to slower diagnosis, more guesswork, and rising cognitive load. But is this really the fault of the tools?
Culture is probably as important as tooling, and it takes time to build, especially when engineers prefer to focus on the technical side.
Observability is often seen as a cost center rather than a strategic function. This discourages investment in refining signals, reviewing false positives, or making metrics business-relevant. Therefore, many teams are stuck in a reactive posture, waiting for alerts to fire. There’s less incentive to anticipate problems using historical data or patterns.
The problem is made worse by silos. The SRE team often owns the observability tools, while business metrics live disconnected in analytics dashboards. Ownership of alerts is unclear. Should SRE be the first responder? When should an alert be escalated to the product team? And when everyone is monitoring different things, it’s hard to build shared intuition or develop transparent processes.
Finally, few teams take the time to document their observability practices. There’s no registry of key alerts, no defined thresholds aligned to SLOs (if you even have SLOs), and no clear definition of what “normal” looks like. Think about how long a new engineer would need to build a good understanding of your production system.
All this leads to an observability stack that exists, but doesn’t evolve. It keeps the lights on, but doesn’t help teams learn, anticipate, or move faster. And that’s where the gap lies.
Making observability actionable
If the goal is to drive action, then teams need more than dashboards and alerts.
First, not every alert should be treated the same. A CPU spike or a memory warning is useful for diagnosis, but unless it ties to a degraded user experience or failed transactions, it shouldn’t trigger a high-severity incident, and certainly not a phone call at 3 AM. A common practice is to separate “health” from “impact”:
Technical alerts: for awareness and system tuning. Treat them as warnings.
Business alerts: for when critical user-facing or revenue-generating processes are at risk.
This doesn’t mean we should completely silence technical alerts, but we should put the engineering effort where it matters most first. The sketch below shows one way to encode that split when routing alerts.
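As a minimal illustration (the label names severity and business_impact are assumptions, not a standard; use whatever convention your alerting stack supports), the health-versus-impact split can be expressed directly in how alerts are routed:

```python
from dataclasses import dataclass, field

@dataclass
class Alert:
    name: str
    labels: dict = field(default_factory=dict)

def route(alert: Alert) -> str:
    """Decide where an alert goes: page a human, or stay visible as a warning."""
    severity = alert.labels.get("severity", "warning")
    business_impact = alert.labels.get("business_impact", "none")

    # Business alerts: a critical user-facing or revenue-generating flow is at risk.
    if severity == "critical" and business_impact != "none":
        return "page-oncall"      # wake someone up
    # Technical alerts: useful for awareness, tuning, and diagnosis.
    return "team-channel"         # visible, but nobody's 3 AM problem

print(route(Alert("HighCPU", {"severity": "warning"})))
# -> team-channel
print(route(Alert("CheckoutErrors",
                  {"severity": "critical", "business_impact": "checkout"})))
# -> page-oncall
```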
Nor does it mean you should have technical alerts for everything you can think of; that only leads to alert fatigue. To prevent it, track incidents caused by poor observability coverage and review your alerts regularly. For example, schedule quarterly reviews of active alerts across teams: eliminate outdated ones, adjust thresholds based on historical data, and realign them with new priorities. Bring in other teams, such as product and data, to connect infrastructure health to business impact. This also builds trust in the system: alerts gain meaning and stop getting ignored in the middle of the night. Similarly, before adding any new alert, make sure it is relevant to your strategy and limit who it notifies until it has been thoroughly vetted.
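A small script can help prepare such a review. The sketch below assumes Prometheus (other alerting backends expose similar rule APIs) and assumes your team has adopted owner and runbook annotations as a convention; both are assumptions, not requirements:

```python
import requests

# Minimal sketch of a quarterly alert-review helper.
# PROM_URL and the expected annotations ("owner", "runbook") are assumptions.
PROM_URL = "http://prometheus.example.internal:9090"

resp = requests.get(f"{PROM_URL}/api/v1/rules")
resp.raise_for_status()
groups = resp.json()["data"]["groups"]

for group in groups:
    for rule in group["rules"]:
        if rule.get("type") != "alerting":
            continue  # skip recording rules
        annotations = rule.get("annotations", {})
        missing = [key for key in ("owner", "runbook") if key not in annotations]
        # Rules that nobody owns or that have no runbook are prime review candidates.
        if missing:
            print(f"[review] {rule['name']} (group: {group['name']}) "
                  f"is missing: {', '.join(missing)}")
```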
You can also consider creating a central registry of business and technical metrics: not just what’s monitored, but what it means and why it matters. This builds a shared language between engineers, product managers, and operations. For example, define ownership, normal baselines, thresholds, or SLOs where needed. And no, it is not the same as pointing at a Git repository with all the alert definitions. A readable document improves collaboration and on-call effectiveness, and accelerates decision-making during incidents. You no longer need to wait for a few senior engineers to join the call to explain the impact, and your on-call engineers can coordinate the response with more confidence.
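Whether the registry lives in a wiki or is generated from a structured source, the key is that every entry answers the same questions. As a sketch of what one entry might capture (the field names and example values are illustrative, not a standard):

```python
from dataclasses import dataclass
from typing import Optional

# Illustrative shape of a registry entry; adapt the fields to your organization.
@dataclass
class MetricEntry:
    name: str
    kind: str               # "technical" or "business"
    meaning: str            # what it tells you, in plain language
    owner: str              # team accountable for responding
    normal_range: str       # what "normal" looks like
    slo: Optional[str]      # target, if one exists
    runbook: Optional[str]  # where to look when it misbehaves

REGISTRY = [
    MetricEntry(
        name="checkout_error_rate",
        kind="business",
        meaning="Share of checkout attempts that fail; directly affects revenue.",
        owner="payments-team",
        normal_range="< 0.5% over 5 minutes",
        slo="99.5% successful checkouts per month",
        runbook="https://wiki.example.internal/runbooks/checkout-errors",
    ),
    MetricEntry(
        name="api_cpu_utilization",
        kind="technical",
        meaning="CPU usage of the API instances; a capacity signal, not an incident by itself.",
        owner="platform-team",
        normal_range="30-70% at peak",
        slo=None,
        runbook=None,
    ),
]
```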
Finally, invest in training. Engineers often pour their time into technical implementation and tooling, but that effort has to be balanced with adoption. Train engineers to interpret signals, create runbooks, document alert-response playbooks, and socialize the observability strategy throughout the organization. You don’t need more tools. Just better habits.
Conclusion
Observability is excellent, but clarity and action are better. Most teams have metrics, dashboards, and alerts. Yet they still struggle to understand what matters, why it matters, and what to do when things go wrong. That’s not a tooling problem. It’s an organizational and cultural one.
To unlock real value, observability must evolve from a data collection exercise into a shared practice of informed decision-making. That means aligning on what to measure, distinguishing between noise and signal, and embedding observability into the way teams build, review, and improve systems.
It’s not just about seeing more; it’s about acting with confidence.
So take a step back and ask yourself: do you have an observability strategy that your engineers can follow, or do you only provide tools? What would be the first thing you would change?