The CrowdStrike software update issue highlighted the importance of robust software testing and the software development lifecycle, domain elements of the CSSLP certification. Following on from our look at dealing with IT outages, we consider how this incident can serve as a learning opportunity for those producing and deploying software.

In the early hours of Friday, July 19th, 2024, cybersecurity vendor CrowdStrike began to roll out an update to its Falcon sensor program. This usually routine activity didn’t go as planned, with an undetected bug resulting in widespread IT disruption that impacted client computers and servers. This in turn led to a wide array of hosted and on-premises applications, services and data sources being adversely affected or rendered inaccessible.

The disruption evoked memories of how a mistake in a switch configuration knocked out the global BlackBerry email device platform for three days after the error was replicated to every other switch on the network. Human error happens and it’s not always avoidable. Nonetheless, there are always things we can learn to try to minimize the risk in the future.

With the error in the update corrected and instructions now available on how to recover affected machines, let’s look at what we can use from this experience to improve approaches and processes elsewhere and avoid falling foul of a similar incident in the future.

Risk Factor of Automated Updates

The routine update to CrowdStrike’s Falcon software contained a small code error in one of its supporting files. Microsoft estimates that around 8.5 million Windows machines, about 1% of the worldwide installed base, were affected, failing to a critical error display commonly known as the “Blue Screen of Death” or BSOD.

The program is locally installed but serviced from the cloud: it operates in conjunction with, and regularly connects to, CrowdStrike’s servers, removing the need for users to manually check for or trigger the installation of updates. That same convenience accelerated the spread of the defective update.

Remote deployment and updating of software rely on the receiving endpoint being functional and connected. If the endpoint becomes inoperable, especially when the software responsible sits close to the core operating system as security applications often do, manual intervention is usually needed to reverse the change and recover. In this instance, many endpoints also had Microsoft’s BitLocker hard drive volume encryption feature enabled. While this is good security practice, it had the knock-on effect of making the manual recovery process more complicated.

Depending on the organization and the operating scenario, it may or may not be worth the risk of allowing systems to acquire and deploy their own updates automatically. The alternative means either pulling those updates into an internal testing environment, or at least disabling direct automated updates in favor of polling a centrally managed update server as part of a managed enterprise patch deployment solution. The latter allows time to be built into the deployment process, lowering the risk of defective updates reaching critical systems and users before issues are discovered.
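
As a minimal sketch of what “building time into the deployment process” can look like, the Python below models a hypothetical ring-based policy: each group of machines only becomes eligible for an update after a soak period, and an operator can halt the rollout if a defect is reported. The ring names and durations are illustrative assumptions, not taken from any specific vendor or product.

```python
from datetime import datetime, timedelta

# Hypothetical deployment rings: each ring becomes eligible only after a
# "soak" period has elapsed since release. Names and durations are
# illustrative, not vendor-specific.
RINGS = [
    {"name": "test-lab",         "soak": timedelta(hours=0)},
    {"name": "pilot-users",      "soak": timedelta(hours=24)},
    {"name": "general-fleet",    "soak": timedelta(hours=72)},
    {"name": "critical-systems", "soak": timedelta(hours=168)},
]

def approved_rings(release_time: datetime, now: datetime, halted: bool = False):
    """Return the rings an update may be pushed to at 'now'.

    'halted' models an operator pulling the release after a defect report:
    no rings are approved, regardless of elapsed time.
    """
    if halted:
        return []
    elapsed = now - release_time
    return [ring["name"] for ring in RINGS if elapsed >= ring["soak"]]

if __name__ == "__main__":
    released = datetime(2024, 7, 19, 4, 0)
    # 30 hours after release, only the first two rings are eligible.
    print(approved_rings(released, released + timedelta(hours=30)))
    # -> ['test-lab', 'pilot-users']
```

Even a simple policy like this buys the time needed for a defective update to surface in the pilot rings before it ever reaches critical systems.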

Test, Test and Test Again

The immediate takeaway is to examine which systems rely on direct, automated updates that bypass software testing processes. That includes bypassing sandboxes, patch management servers or any other quarantine and internally managed deployment measures applied to software before it’s rolled out to users and critical systems. It’s commonplace for anti-malware software to be allowed to self-update, effectively a form of continuous delivery direct from the vendor. The benefits of getting the latest updates, malware profiles and other protective data out to users as fast as possible have often outweighed the risk of an update breaking an operating system or another application. However, the CrowdStrike experience highlights the risk taken when any application is allowed to self-manage its updates.

Looking beyond deployment to the creation of software updates, there are additional risk factors that need to be considered if problems arise, including:

  • Communications and disclosure requirements
  • Reputational risk
  • Loss of sales
  • Compensation claims
  • Legal liability

For any organization developing and deploying its own code, whether it’s internally or to external customers, extensive and exhaustive code testing is critical.

Conducting rigorous code reviews and using static analysis tools to detect potential issues before code is merged into the main codebase helps maintain code quality and catch bugs early. More extensive beta testing, via a phased rollout to a small group of opted-in, engaged users to gather feedback and identify issues before a full-scale rollout, reduces the risk and impact if bad code does slip through to users.
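
As a rough illustration of the kind of pre-merge gate described here, the sketch below refuses a merge if any changed Python file fails to parse. It is a deliberately minimal stand-in for a real pipeline, which would layer linters, type checkers and unit tests on top; the assumption that the CI system supplies the list of changed files as command-line arguments is ours.

```python
import ast
import pathlib
import sys

def gate_changed_files(paths):
    """Pre-merge gate sketch: collect a failure for any changed Python file
    that does not parse. Real pipelines would add static analysis tools,
    type checking and tests on top of this basic check."""
    failures = []
    for path in paths:
        source = pathlib.Path(path).read_text(encoding="utf-8")
        try:
            ast.parse(source, filename=path)
        except SyntaxError as exc:
            failures.append(f"{path}:{exc.lineno}: {exc.msg}")
    return failures

if __name__ == "__main__":
    problems = gate_changed_files(sys.argv[1:])
    for line in problems:
        print(line)
    sys.exit(1 if problems else 0)  # a non-zero exit code blocks the merge
```

Wiring even a simple gate like this into the merge pipeline means a defective file is caught before it reaches the build that gets deployed.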

Software testing is one of the eight domains ISC2 covers in the Certified Secure Software Lifecycle Professional (CSSLP) certification, underlining the important role such testing plays in the wider Software Development Life Cycle (SDLC). The experiences of July 19, 2024, illustrate the value of software deployment personnel holding certifications and other accredited training and education, ensuring a more robust, repeatable and measurable approach to testing software before it is rolled out, as well as to handling remedial changes, whether they address security vulnerabilities or any other kind of defect that could cause disruption.

Mitigation and Recovery

While CrowdStrike has addressed the bad update and issued fixes and instructions to recover affected systems, the vendor is not alone in providing help and support. It is always important to consider additional guidance and alerts from other affected parties and independent sources.

For example, Microsoft issued an updated Windows recovery tool with two additional repair options designed specifically for recovering from the CrowdStrike update. It also provided other resources, including a blog post and alternative workarounds, to help recover affected Windows-based machines.

The U.S. Cybersecurity and Infrastructure Security Agency (CISA) produced a timeline-based resource page with links to information and statements in the order they were released, allowing it to capture and alert followers to updates in guidance and instructions. CISA has also been able to provide a broader view of the situation, identifying instances of cybercriminals trying to take advantage of the disruption – which itself was not a cybersecurity incident – by circulating phishing emails and other compromised links, capitalizing on the concern of individuals trying to recover from a BSOD state. CISA, along with cybersecurity agencies in other nations, is working closely with CrowdStrike as well as with critical infrastructure operators to help with fixes.

Media coverage noted that malicious actors were already sending phishing emails from a variety of domains impersonating CrowdStrike. One reported email falsely claimed it could “fix the CrowdStrike apocalypse” if the recipient paid a fee worth several hundred euros to a random crypto wallet.

In reality, the only confirmed fixes in this instance are either to repeatedly restart affected computers in the hope that they stay online long enough for the corrected update to download and install, or to remove the defective file from every affected computer manually or with Microsoft’s recovery tool. Given these are the solutions, it’s understandable that some users may be enticed by the promise of an easier, albeit false, fix.
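
For illustration only, the sketch below shows what the manual removal amounts to when scripted, written in Python for consistency with the other examples; in practice it was typically done by hand or via the recovery tool. The directory and filename pattern reflect the publicly reported remediation guidance at the time and should be treated as illustrative; always follow the vendor’s current instructions.

```python
import glob
import os

# Illustrative only: path and filename pattern reflect public guidance at the
# time of the incident; follow the vendor's current remediation instructions.
DRIVER_DIR = r"C:\Windows\System32\drivers\CrowdStrike"
PATTERN = "C-00000291*.sys"

def remove_defective_channel_files(dry_run: bool = True):
    """List (and optionally delete) files matching the defective channel file
    pattern. Intended to be run with administrative rights from Safe Mode or
    the Windows Recovery Environment."""
    matches = glob.glob(os.path.join(DRIVER_DIR, PATTERN))
    for path in matches:
        print(("Would remove: " if dry_run else "Removing: ") + path)
        if not dry_run:
            os.remove(path)
    return matches

if __name__ == "__main__":
    remove_defective_channel_files(dry_run=True)  # set dry_run=False to delete
```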

A Word About Effective Communication

Whenever an organization is affected by an IT issue that impacts others, there is an immediate need to understand what has to be communicated and to do so in an effective and appropriate manner. That may mean disclosure to comply with local laws, or transparency to ensure affected parties know what has happened, what you are doing about it, what they need to do, and what is being done to prevent it from happening again.

By implementing strategies like these, organizations can ensure the robustness of their code, identify more issues early in the development cycle and reduce the risk of an IT outage like the one that occurred. Robust code testing and deployment processes that consider the risk an application poses to the organization or customer ensure higher reliability, security and operating performance. They contribute to overall software quality and user satisfaction and reduce the risk assumed by all parties.