Upstream Alert: Another patch gone wrong
A routine firewall upgrade blocked emergency calls on Australia’s Optus network for about 13 hours. It’s a chilling reminder that "minor" infrastructure changes can have dramatic consequences.
Source: https://www.arnnet.com.au/article/4060616/firewall-upgrade-behind-optus-triple-zero-failure.html
What happened?
In September 2025, Australia’s Optus telecommunications network failed at its most basic mission: Triple Zero emergency calls, the country’s equivalent of 911, stopped connecting for hours.
The culprit? A firewall upgrade. But here’s the thing: Optus itself admits it departed from “established processes” during that upgrade, which led to the system incorrectly blocking emergency traffic.
The outage lasted about 13 hours before Optus rolled back the changes. During that time, calls weren’t routed to alternative networks, some contact centre warnings were ignored, and the company delayed notifying emergency services.
Tragically, some people tried to call emergency services during the outage and did not get help in time.
What’s in it for you?
This one goes beyond the usual "cloud went down" stories. This outage had terrible consequences for people's lives, and as with any other incident, it must be examined thoroughly to prevent it from happening again.
When systems fail, it’s not just data or money at stake. It can be trust, safety, and perhaps lives. Knowing which components of your systems are critical, and making sure everyone else in the company knows it too, is crucial.
In the Optus case, several things went wrong: applying the change, following the process, and responding to and mitigating the incident. That is a lot for a crucial service. The lessons here are universal:
Upgrades are dangerous. Even “routine firewall patches” can disable the most crucial services. I have “hoped nothing would break” more than once when making changes to minor services, but hope is not a strategy when failures carry critical consequences.
Process discipline is non-negotiable. Optus admitted it departed from its own standard operating procedures. We have all been annoyed by those “boring controls” in some companies, but they exist for a reason. If a control is outdated, raise it and get it updated. Don’t skip it just because you don’t like it.
Delayed escalation increases damage. Customers tried to report issues early, but their calls weren’t escalated, and it is worth understanding why. If the core feature of your system breaks, there should be no doubt about which escalation procedure to follow.
Redundancy and fallback matter. Emergency calls weren’t routed to alternate networks. Expect any single path in your system to fail and be prepared to handle it, especially for critical features; a minimal sketch of the fallback pattern follows this list.
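To make the fallback point concrete, here is a minimal, hypothetical Python sketch of the pattern: try the primary path, fall back to an alternate one, and treat a failed path as an alert in its own right. The function and path names (send_via_primary, send_via_backup, route_emergency_call) are placeholders for illustration only; this is not how Optus or any carrier actually routes calls.

```python
import logging

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("call-routing")


class RoutingError(Exception):
    """Raised when a network path cannot carry the call."""


def send_via_primary(call_id: str) -> str:
    # Placeholder for the primary network path; assume it may fail.
    raise RoutingError("primary path blocked by firewall rule")


def send_via_backup(call_id: str) -> str:
    # Placeholder for an alternate carrier or path.
    return f"call {call_id} routed via backup network"


def route_emergency_call(call_id: str) -> str:
    """Try the primary path, then the backup; escalate loudly if both fail."""
    for name, path in (("primary", send_via_primary), ("backup", send_via_backup)):
        try:
            result = path(call_id)
            log.info("routed %s via %s path", call_id, name)
            return result
        except RoutingError as exc:
            # A failed path is an alert in itself, not just something to retry quietly.
            log.error("%s path failed for %s: %s", name, call_id, exc)
    # Both paths failed: page a human immediately rather than queue silently.
    raise RuntimeError(f"all routing paths failed for call {call_id}")


if __name__ == "__main__":
    print(route_emergency_call("demo-001"))
```

The detail that matters is the last line of the function: when every path is exhausted, the system should escalate, not degrade silently.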
If you run anything critical, you probably have regulations to follow, and you are accountable. When it fails, be ready to handle it, because sometimes it will.
Final advice
Failures like this are reminders that your change process, monitoring, and fallback logic must be as robust as possible for critical services. Mistakes will happen, but never let a routine change become the weakest link in your system.
So what are you waiting for? If you haven’t done it recently, simulate an incident in your most critical service. Observe how well your teams respond. What you learn may surprise you.
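If you want a starting point, a game day can be as simple as a switch that makes a critical dependency behave as if it were down. The sketch below is a hypothetical Python example: the GAMEDAY_FAIL_DEPENDENCY environment variable and fetch_critical_data function are made up for illustration, and a real exercise should of course run in a controlled environment with the on-call team watching.

```python
import logging
import os

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("gameday")


class DependencyError(Exception):
    """Simulated or real failure of a critical dependency."""


def fetch_critical_data(query: str) -> dict:
    """Stand-in for a call to your most critical dependency."""
    if os.environ.get("GAMEDAY_FAIL_DEPENDENCY") == "1":
        # Fault injection: behave exactly as if the dependency were down.
        raise DependencyError("simulated outage (game day)")
    return {"query": query, "result": "ok"}


def handle_request(query: str) -> dict:
    try:
        return fetch_critical_data(query)
    except DependencyError as exc:
        # This is what the exercise evaluates: does the alert fire,
        # does the fallback engage, does the on-call know what to do?
        log.error("critical dependency failed: %s", exc)
        return {"query": query, "result": "degraded", "error": str(exc)}


if __name__ == "__main__":
    # Run once normally, then with GAMEDAY_FAIL_DEPENDENCY=1 to rehearse the failure.
    print(handle_request("status"))
```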