Forrester analysts from our security, risk, tech exec, and technology architecture and delivery teams have been working around the clock to assemble an in-depth point of view on the massive global disruption caused by the CrowdStrike update of Friday, July 19. This is our second blog on the topic; see also our initial blog, CrowdStrike Global Outage: Critical Next Steps For Tech And Security Leaders.
As of July 25, CrowdStrike reported that approximately 97% of systems had recovered. So, after a long week of system crashes, endlessly typing BitLocker recovery keys, and ordering food for staff members, tech executives need to shift to a different kind of recovery mode: one where they cast off the stress — and adrenaline — of having thousands (or more) of systems down and instead think about recharging the mental energy of their teams, giving everyone some much-needed time off, and evaluating what they need to change because of this event.
Crises like these shine a bright light on many aspects of business and tech, from strategic vendor relationships to your IT workers laboring in the trenches. As the dust settles on the immediate crisis, technology leaders face the unfolding long-term repercussions. With the significant impact of the event, it’s crucial for tech leaders to brace for probing inquiries from executives, board members, customers, and employees alike. To prevent a recurrence and to rebuild trust with these key stakeholders, a thorough reassessment of concentration risk, third-party risk, and auto-update strategies is imperative. This necessitates a critical review of IT management, monitoring of vital infrastructure, and the robustness of incident response.
In short, there are many dimensions to investigate and many conversations ahead. In our new report, Redefining Resilience In The Aftermath Of The CrowdStrike Outage: Turn The Crisis Into Strategy With Forrester’s Recommendations (client-only access), we provide a thorough overview of recommended actions, and in this blog post, we highlight some of the report’s key points.
The Foibles Of Testing
Whenever a major outage strikes, message boards and social media fill up with comments such as “clearly it was a simple failure of testing.” The reality is unfortunately much messier and more complex. CrowdStrike’s preliminary incident review outlined an existing set of testing/QA protocols, including an architecture that was supposed to validate new content. The trouble was that this validation itself had a bug. CrowdStrike presumably used traditional software testing techniques to exercise this feature, but these clearly fell short. The company has said it will also add “fuzzing,” throwing randomized data at its inputs, as a further strategy to identify defects.
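To illustrate the idea (this is a minimal sketch, not CrowdStrike’s actual code), a fuzz harness feeds random byte strings into a parser and checks that malformed input produces a controlled error rather than a crash. The `parse_channel_file` function below is entirely hypothetical:

```python
import random

def parse_channel_file(data: bytes) -> list[int]:
    """Hypothetical parser for a 'content update' file: a sequence of
    4-byte big-endian fields. Malformed input must raise ValueError,
    never crash with an unexpected exception."""
    if len(data) % 4 != 0:
        raise ValueError("truncated field")
    fields = [int.from_bytes(data[i:i + 4], "big") for i in range(0, len(data), 4)]
    if any(f == 0 for f in fields):
        raise ValueError("null field not allowed")
    return fields

def fuzz(iterations: int = 10_000) -> None:
    """Throw randomized inputs at the parser; only ValueError is acceptable."""
    rng = random.Random(42)  # fixed seed so failures are reproducible
    for _ in range(iterations):
        data = bytes(rng.randrange(256) for _ in range(rng.randrange(64)))
        try:
            parse_channel_file(data)
        except ValueError:
            pass  # controlled rejection is fine; any other exception is a defect

fuzz()
```

The value of fuzzing is precisely that it exercises inputs no human test author thought to write down, which is why it complements rather than replaces conventional validation.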
There is broad consensus, however, that CrowdStrike’s big bang approach to rolling out content packs was at least as important. The error was, in fact, a classic example of why current incident management thinking discourages looking for a single “root” cause. Here, we clearly see multiple factors converging into a global catastrophe: failed validation logic, a big bang release, a Windows monoculture, and the particulars of BitLocker drive encryption (among other factors) all contributed.
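The alternative to a big bang push is a staged (ring-based) rollout, in which an update reaches a small slice of the fleet first and only proceeds while health signals stay green. A minimal sketch, with hypothetical ring sizes and `deploy`/`healthy` callbacks standing in for a real fleet-management layer:

```python
def staged_rollout(hosts, deploy, healthy, rings=(0.01, 0.10, 1.0)):
    """Deploy an update ring by ring, halting if the health signal
    degrades. 'deploy' pushes the update to one host; 'healthy' checks
    the hosts updated so far. Both are hypothetical callbacks."""
    done = 0
    for fraction in rings:
        target = int(len(hosts) * fraction)
        for host in hosts[done:target]:
            deploy(host)
        done = target
        if not healthy(hosts[:done]):
            return ("halted", done)  # stop before the blast radius grows
    return ("complete", done)

# Illustrative run: 1,000 hosts, no-op deploy, always-healthy signal.
hosts = [f"host-{i}" for i in range(1000)]
result = staged_rollout(hosts, deploy=lambda h: None, healthy=lambda hs: True)
# result == ("complete", 1000)
```

With rings of 1%, 10%, and 100%, a defect that crashes machines on boot would have been caught after roughly 10 hosts in this example, not an entire global fleet — which is why staged deployment appears prominently in CrowdStrike’s own remediation plans.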
A spotlight is now on the near-universal acceptance of increasingly frequent vendor-driven updates, and the accelerating abandonment of user-side quality assurance. (Does your IT organization preview and test all new Office 365 updates?) This is a wicked problem when it comes to security. Unlike a new set of PowerPoint icons, the content CrowdStrike is continually pushing out is to prevent the latest and nastiest exploits and viruses from taking hold in your infrastructure. Hackers only need a narrow window of your systems being unprotected to compromise them.
So, the tradeoff between regression testing and zero-day preparedness is getting a lot of attention. Many are calling for IT pros to test all incoming updates more thoroughly. However, this kind of testing regimen does not come for free. If you want to test all updates from a given vendor, you must: 1) invest in regression testing capabilities (expensive); and 2) in the case of security definitions, accept the risk that a zero-day issue could be exploited by hackers while you are testing the security content.
In terms of core IT operational practices, incident and crisis management and business continuity are also having a moment. Despite perpetual claims of eventual complete autonomy and automation (at which point human beings will do what? Lounge around with piña coladas? I hope so…), IT systems do not run themselves, nor does Forrester see this happening anytime soon. As many companies again discovered this week, the ability to identify an exceptional situation, declare an incident or crisis, and turn to a well-defined response plan remains essential.
The Ripple Effects Go Far Beyond CrowdStrike And Security Tools
CrowdStrike isn’t the only vendor affected by this. Every XDR, EDR, and EPP vendor is under a microscope from their customers. Tech executives and security leaders should demand clear explanations of how their security vendors access the kernel, what updates are being introduced and when, and how they conduct quality assurance on security software. That applies not just to the agent itself but to content updates of all types. Product and support teams across cybersecurity will be busy answering these questions from their customer base, and rightfully so.
Granting kernel access to end-user software has been in decline in recent years for exactly the reasons we saw with CrowdStrike. However, current examples remain, along with much legacy code. One key proactive control employed by large IT organizations is technology lifecycle management (TLM): the systematic assessment of new incoming tech for value, suitability, and risk, often run by an enterprise architecture organization. Organizations that formalize this process call in security and technology architects for the initial evaluation of any new vendor technology to assess a wide variety of questions, including required OS privileges and how the software updates itself.
Trust Is Paramount
CrowdStrike needs to earn back the trust of its customers. That will not be easy, but Forrester’s research on trust shows several key areas it can focus on. Of the seven levers of trust, Competence, Consistency, and Dependability all took major hits. However, CrowdStrike showed transparency and accountability throughout the event, for which the vendor has been, and should be, praised. We hope — and expect — more of this. Its initial Post Incident Review outlines a number of reasonable steps it plans to take, including improved testing protocols, staged deployment, and external QA of both its code and its end-to-end processes.