On February 22, a massive service interruption in AT&T’s cellular network affected subscribers across the nation. Although outage reports numbered in the hundreds of thousands, that is likely just the tip of the iceberg. What lies beneath is the far larger number of subscribers who experienced issues but didn’t or couldn’t report them, as well as the services that ride on cellular networks (e.g., tracking services, point-of-sale terminals, etc.). The outage lasted approximately 11 hours, and based on the impact of similar past outages on areas such as financial transactions and supply chains, we estimate the cost to the US economy at $500 million. Here’s what we know happened and what will happen next:
- A mundane network change caused the massive outage. AT&T officially released a statement on February 22 attributing the outage to “ … the application and execution of an incorrect process used as we were expanding our network, not a cyber attack … ” — what’s the big deal? For most of us in IT, cellular technologies have served as a backup transport for wide-area networks, which kept the impact minimal. But for some enterprises, cellular connectivity is the lifeline of core business functions such as operations (e.g., field and fleet operations or asset tracking and management) or sales (e.g., payment terminals, kiosks, etc.). In these circumstances, an outage like this can be devastating.
- There will be investigations and significant costs to AT&T … and, ultimately, its customers. A chain of events will unfold following the outage, starting with AT&T submitting its official outage root-cause report to the FCC. In parallel, US government agencies will support efforts to rule out any possible cyberattack. Customer rebates and credits will start to flow, as will lawsuits from consumers and businesses alike. AT&T will implement process and technology improvements addressing the root cause(s), and the FCC will be forced to review its rules. If we use the July 8, 2022, Rogers outage in Canada as a guide and account for the outage duration and the relative population sizes, we estimate that AT&T will see as much as $1.5 billion in impact. The remediation work could be bundled into a multiyear improvement plan, as Rogers did (C$10 billion over three years); if AT&T puts together such a plan, we expect it to be in the vicinity of US$20 billion to US$30 billion. Customers will likely see the result of this in higher prices, similar to what Rogers subscribers experienced a few months after its outage.
That’s not great news for anyone. It is important to remember that networks will always have outages and performance degradations; it’s a matter of physics, human intervention, and technology complexity. What made this newsworthy is that a major carrier that enterprises and citizens depend on went down. For these reasons, carriers are held to the highest standards — often with SLAs of five nines (99.999%) of availability, which means being unavailable for no more than 5 minutes and 15 seconds a year. Being down for 11 hours … that’s a different ballpark entirely. What are the key lessons for carriers and IT leaders from this unfortunate event?
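For context, the math behind those availability figures is simple. Here is a back-of-the-envelope sketch, assuming a 365-day year and illustrative availability tiers rather than AT&T’s actual SLA terms:

```python
# Back-of-the-envelope downtime budgets (illustrative only; assumes a 365-day year).
SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000 seconds

def downtime_budget_seconds(availability: float) -> float:
    """Allowed downtime per year, in seconds, for a given availability target."""
    return SECONDS_PER_YEAR * (1 - availability)

for label, availability in [("three nines", 0.999), ("four nines", 0.9999), ("five nines", 0.99999)]:
    minutes = downtime_budget_seconds(availability) / 60
    print(f"{label} ({availability:.3%}): about {minutes:.1f} minutes of downtime per year")

# An 11-hour outage versus the five-nines budget of roughly 5 minutes and 15 seconds:
outage_seconds = 11 * 60 * 60
print(f"11 hours is roughly {outage_seconds / downtime_budget_seconds(0.99999):.0f}x the five-nines annual budget")
```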
- IT leaders must revisit their end-device wireless connectivity capabilities. Especially for companies that rely on single-carrier cellular connectivity, it may be time to reconsider that approach and ask whether other technologies might better serve your needs — for example, multi-SIM/eSIM redundant carrier connectivity or multiple wireless connectivity options, such as satellite, LoRa, Sigfox, or even Wi-Fi, in your end devices (a minimal failover sketch appears after this list). But there’s more to learn here. As much as we hold carriers to higher standards, we can try to avoid their mistakes …
- All networking orgs must accelerate monitoring, visibility, observability, and AI investments. As noted above, networks will always have outages and performance degradations. However, networking teams aren’t known for diligent planning ahead or proactive resilience measures. Network monitoring solutions, for example, are usually an afterthought: Only after an issue arises, especially when the root cause can’t be found, will networking teams invest in a monitoring solution (a simple proactive-monitoring sketch appears after this list). Part of the issue is a lack of budget for fundamentals versus flashy new concepts such as autonomous networks, intent-based networking, and networking as a service. But that approach is nothing more than taping over a crack in an airplane wing and must be phased out. Uptime and fast remediation are essential for customer experience. This makes network automation, performance management (including visibility, observability, and AIOps), fast analytics for root-cause analysis, and systemwide improvements via AI all essential. Automation and AI won’t eliminate every outage, but they can help uncover and avoid many outages and performance degradations and support running simulations before changes are made or issues arise.
- Advanced companies, like carriers, should seek out advanced practices. The expectations for large enterprises, especially carriers, are even higher. It is no longer enough to invest fully in the items above. They need to push into advanced practices such as businesswide networking fabrics, simulations/digital twins, real-time event communication, etc. Why are these so important? In the past, segmented networks were discrete components, manually controlled, with changes rolled out to each network point sequentially over a long period. With the emergence of software-controlled, businesswide networking fabrics, a single change can hit hundreds if not thousands of devices simultaneously. That pushes the need to run scenarios through digital twins so that the full scope of a change (network config changes, updates, upgrades, etc.) is understood before it occurs (a pre-change simulation sketch appears after this list). Carriers should accelerate the adoption of these technologies — similar to the simulations that the aerospace industry runs before building components, aircraft, or rockets.
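To make the redundant-connectivity idea concrete, here is a minimal sketch of the uplink-failover loop an edge device (say, a payment terminal) might run across a primary SIM, a secondary eSIM, and a Wi-Fi backup. The interface names, probe target, and switch_to() hook are illustrative assumptions, not any vendor’s API; a real device would use its modem and OS interfaces.

```python
"""Minimal sketch of uplink failover for an edge device with redundant wireless paths."""
import socket
import time

# Candidate uplinks in priority order: primary carrier SIM, secondary eSIM, Wi-Fi backup.
UPLINKS = ["wwan0_primary_sim", "wwan1_secondary_esim", "wlan0_backup"]

def probe(interface: str, host: str = "1.1.1.1", port: int = 53, timeout: float = 3.0) -> bool:
    """Crude reachability check. This sketch ignores `interface`; a real implementation
    would bind the socket to that interface (e.g., SO_BINDTODEVICE on Linux) and probe
    the actual application endpoint rather than a public DNS resolver."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

def switch_to(interface: str) -> None:
    """Hypothetical hook: reconfigure routing so traffic uses the given interface."""
    print(f"Activating uplink: {interface}")

def failover_loop(interval_s: int = 30) -> None:
    """Re-evaluate the uplinks every `interval_s` seconds and switch when needed."""
    active = None
    while True:
        for interface in UPLINKS:
            if probe(interface):
                if interface != active:
                    switch_to(interface)
                    active = interface
                break
        else:
            print("No uplink healthy: buffer transactions locally and raise an alert")
            active = None
        time.sleep(interval_s)

if __name__ == "__main__":
    failover_loop()
```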
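To illustrate the point that monitoring shouldn’t be an afterthought, here is a simple proactive-monitoring sketch: sample the latency of a critical endpoint, keep a rolling baseline, and alert on unreachability or deviation instead of waiting for user complaints. The endpoint, window size, thresholds, and alert() hook are assumptions for illustration only.

```python
"""Minimal sketch of proactive latency monitoring with a rolling baseline."""
from collections import deque
import statistics
import time
import urllib.request

WINDOW = deque(maxlen=60)               # Rolling window of recent latency samples (seconds)
DEVIATION_THRESHOLD = 3.0               # Alert when latency exceeds mean + 3 standard deviations
TARGET = "https://example.com/health"   # Placeholder endpoint; use your critical service here

def measure_latency(url: str, timeout: float = 5.0) -> float | None:
    """Return the request latency in seconds, or None if the endpoint is unreachable."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=timeout):
            return time.monotonic() - start
    except OSError:
        return None

def alert(message: str) -> None:
    print(f"ALERT: {message}")  # In practice: page the on-call team, open an incident

def monitor(interval_s: int = 10) -> None:
    while True:
        latency = measure_latency(TARGET)
        if latency is None:
            alert(f"{TARGET} unreachable")
        else:
            if len(WINDOW) >= 10:  # Wait for a baseline before judging deviations
                mean = statistics.mean(WINDOW)
                stdev = statistics.pstdev(WINDOW) or 0.001
                if latency > mean + DEVIATION_THRESHOLD * stdev:
                    alert(f"Latency {latency:.3f}s deviates from baseline {mean:.3f}s")
            WINDOW.append(latency)
        time.sleep(interval_s)

if __name__ == "__main__":
    monitor()
```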
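Finally, here is a minimal sketch of the “simulate before you push” practice: apply a proposed change to a toy digital twin of the topology and verify a reachability invariant before rolling the change out. The topology, the change, and the invariant are illustrative assumptions; production digital twins model far more, including config semantics, capacity, and protocol behavior.

```python
"""Minimal sketch: validate a proposed change against a toy digital twin of the network."""
from collections import deque
from copy import deepcopy

# Toy twin of the network: node -> set of directly connected neighbors.
TWIN = {
    "core-1": {"core-2", "region-east", "region-west"},
    "core-2": {"core-1", "region-east", "region-west"},
    "region-east": {"core-1", "core-2", "site-nyc"},
    "region-west": {"core-1", "core-2", "site-sfo"},
    "site-nyc": {"region-east"},
    "site-sfo": {"region-west"},
}

def reachable(topology: dict[str, set[str]], source: str) -> set[str]:
    """Breadth-first search from `source`; returns every reachable node."""
    seen, queue = {source}, deque([source])
    while queue:
        node = queue.popleft()
        for neighbor in topology.get(node, set()):
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append(neighbor)
    return seen

def apply_change(topology: dict[str, set[str]], remove_link: tuple[str, str]) -> dict[str, set[str]]:
    """Return a copy of the topology with one link removed (the proposed change)."""
    twin = deepcopy(topology)
    a, b = remove_link
    twin[a].discard(b)
    twin[b].discard(a)
    return twin

def change_is_safe(topology: dict[str, set[str]], remove_link: tuple[str, str]) -> bool:
    """Invariant: every site must remain reachable from core-1 after the change."""
    candidate = apply_change(topology, remove_link)
    return {"site-nyc", "site-sfo"} <= reachable(candidate, "core-1")

if __name__ == "__main__":
    for link in [("core-1", "region-east"), ("region-east", "site-nyc")]:
        verdict = "safe to roll out" if change_is_safe(TWIN, link) else "BLOCK: would isolate a site"
        print(f"Removing link {link}: {verdict}")
```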
Engage with us via an inquiry call by emailing inquiry@forrester.com.