Blue Screens to Blackouts: The Story Behind the CrowdStrike Outage
A Deep Dive into the Causes and Consequences of the CrowdStrike Incident
Introduction
On 19th July 2024, the world witnessed one of the biggest IT disruptions in recent years. Corporations worldwide reported outages and disruptions, with Windows computers displaying the dreaded Blue Screen of Death (BSOD) Error.
The outage impacted sectors such as airlines, banking, trading, media companies and many more. It was confirmed that there was no security incident or cyberattack resulting in this disruption.
The outage was caused by a software update to Falcon, a tool built by the cybersecurity firm CrowdStrike. The update crashed the Windows operating system on affected machines, which led to the worldwide outage.
In this article, we will dive deep and understand the issue. We will start by demystifying the Blue Screen of Death by revisiting the operating system fundamentals. Later, we will look at CrowdStrike’s architecture, the faulty release and the mitigation strategy. We will conclude by going over some key learnings from this incident.
Blue Screen of Death (BSOD)
If you use Windows, you've likely encountered the Blue Screen of Death (BSOD). This error screen indicates that the system has malfunctioned, usually due to hardware or software issues.
After a reboot, everything usually works again. But do you know why the BSOD appears? And why does it appear only on Windows machines and not on Linux/Mac? Let’s understand this in detail.
Operating System
Before going into the details, we will revisit some operating system concepts. This would help us grasp complex details in an easy manner.
The operating system manages hardware components like memory, CPU, and I/O devices, with the Kernel at its core.
The Kernel is responsible for primary functionalities such as :-
Memory management - Allocating/Deallocating memory for the running processes. Tracking the memory used by the processes.
Process management - Managing the process lifecycle and ensuring two or more processes run in the memory without any issue.
Device management - Communication with devices like monitor, mouse, keyboard, printers, etc.
File System management - Organizing the files on the hard drive or any external storage.
The processes running on any computer/machine run in two modes - a) Kernel Mode b) User Mode.
Kernel Mode
This mode has the highest privilege and can directly interact with all the hardware. Operating system processes run in this mode.
It’s a critical mode, and any error in the code can result in system instability and an eventual crash. Hence, the code has to be written meticulously and should be error-free.
As users, we don’t directly deal with processes that run in the kernel mode. However, you can view them in the Windows Task Manager. The following snippet shows processes such as Host processes and NT Kernel & System running in kernel mode.
User Mode
The user mode has the least privilege and can’t directly interact with the hardware. It relies on the operating system to get data from any hardware device.
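User-mode processes make system calls to send data to and receive data from the hardware. For example, when a user program calls the C library function fopen to open a file on the hard disk, the library issues the underlying system call (such as open on Linux) on the program’s behalf.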
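The same boundary can be seen from Python. The following is a minimal sketch: the functions in the os module are thin wrappers around the operating system's file calls, so each one hands control to the kernel and back. The file name and contents are made up for illustration.

```python
import os
import tempfile


def read_via_os_calls(path: str, max_bytes: int = 1024) -> bytes:
    """Read a file using the thin OS-level wrappers in the os module."""
    # os.open asks the kernel for a file descriptor.
    fd = os.open(path, os.O_RDONLY)
    try:
        # os.read asks the kernel to copy up to max_bytes from the file.
        return os.read(fd, max_bytes)
    finally:
        # os.close hands the descriptor back to the kernel.
        os.close(fd)


def demo() -> bytes:
    # Create a throwaway file so the example is self-contained.
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "example.txt")
        with open(path, "wb") as f:
            f.write(b"hello from user mode")
        return read_via_os_calls(path)


if __name__ == "__main__":
    print(demo())
```

The program never touches the disk directly. It only asks the kernel to do so, which is exactly what separates user mode from kernel mode.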
The operating system ensures that a user process doesn’t access any memory that’s not assigned to it. If a user process accesses memory belonging to a different process or to the kernel, the operating system stops it. On Linux and Mac, the kernel sends the SIGSEGV signal to the process, the process terminates, and the user sees a segmentation fault (segfault).
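To see this protection in action, here is a small sketch that deliberately makes an invalid memory access inside a child process. It uses ctypes.string_at(0) to read from address 0, which no user process is allowed to touch. The exact exit status depends on the platform, so the sketch only relies on it being non-zero.

```python
import subprocess
import sys

# Child program: try to read a C string from address 0 (an invalid access).
CHILD_CODE = "import ctypes; ctypes.string_at(0)"


def run_crashing_child() -> int:
    """Run the faulty code in a separate process and return its exit status."""
    result = subprocess.run(
        [sys.executable, "-c", CHILD_CODE],
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    return result.returncode


if __name__ == "__main__":
    code = run_crashing_child()
    # On POSIX systems a negative value -N means the child was killed by
    # signal N (SIGSEGV in the typical case).
    print("child exit status:", code)
    print("parent is still running")
```

Only the offending process dies. The parent process and the rest of the system keep running, and that isolation is exactly what kernel-mode code does not get.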
Applications that you use daily such as web browsers, MS Excel, MS Word, games, media players like VLC, etc run in user mode.
What causes BSOD ?
BSOD is only shown for Windows devices. It indicates that the system has malfunctioned. It could be either due to a hardware or software issue.
The following are a few reasons for BSOD :-
File system corruption - In case the files required by the Kernel are corrupted, the system can crash.
Device driver issues - Incompatible or corrupted device drivers can lead to communication errors and eventually cause a BSOD.
Software bugs - Bugs that modify critical Kernel data structures or access forbidden memory locations can bring the system to a halt.
Malware - Malware can interfere with the system processes leading to a BSOD.
In most of the cases, we restart the system and as expected it starts functioning. We can often troubleshoot the issue by looking at the system logs.
To ensure smooth functioning of operating system, it is essential to prevent scenarios such as kernel’s data corruption, accidental modification of kernel’s memory, etc.
Mac and Linux use different kernels, data structures and error handling mechanisms. When the OS crashes, both report a kernel panic. Instead of Windows’s BSOD, Mac/Linux show a screen with error messages.
Now that you understand BSOD, let’s look at the reason for 19th July’s worldwide BSOD.
19th July’s BSOD
Roughly 72% of computer users in the world run Windows. Many organizations protect their Windows devices with cybersecurity tools developed by CrowdStrike (CRWD).
CrowdStrike has developed a suite of cloud-native tools. One such product in their suite is known as CrowdStrike Falcon.
CrowdStrike Falcon offers the following features :-
Endpoint security - It detects and prevents malware, ransomware and other threats.
Extended detection and response (XDR) - Helps in investigating security incidents, identifying root causes, and taking the right steps.
Cloud workload protection - It monitors cloud environments on Azure, AWS, and Google Cloud Platform (GCP) for malicious activity.
CrowdStrike Falcon architecture
Falcon is installed on all the devices and runs as an agent (sensor). It continuously monitors the device activity.
CrowdStrike runs a cloud platform that collects data from all the Falcon agents. This platform also serves as a control plane for CrowdStrike to view the threats and security incidents.
Falcon agents run as background processes. They collect the security data from the device and send it to the CrowdStrike platform.
The following image shows the architecture (for representation only) of Falcon :-
Falcon Agent runs as a kernel process since it has to monitor activities such as -
Device driver activity
Network traffic
Restricted file accesses
The above activities require the highest privileges and, as a result, the Falcon Agent runs in kernel mode.
Unlike Windows updates, the Falcon Agent updates itself silently in the background. It gets the latest update from the cloud platform.
As seen from the above architecture, the Falcon Agent gets the latest update, restarts and runs the latest version.
Faulty deployment
On 19th July, CrowdStrike deployed an update for the Falcon Agent. The file C-00000291*.sys was updated and the running agents downloaded the file.
After consuming the update, Windows machines started crashing. The issue wasn’t observed on machines where the agent hadn’t received the update.
This was observed worldwide and brought activities such as air travel, hospital services and stock trading to a standstill.
CrowdStrike discovered that the issue was due to a faulty update in the C-00000291*.sys file. A fix was deployed: the C-00000291*.sys file with a timestamp of 0527 UTC contains the fix, while the one with a timestamp of 0409 UTC is the problematic one. Reference - Falcon update for Windows host.
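The timestamp rule can be written down as a small check. This is only a sketch under a few assumptions: both timestamps fall on 19th July 2024, any matching file stamped 0527 UTC or later counts as fixed, and the file name used in the example is hypothetical.

```python
from datetime import datetime, timezone
from fnmatch import fnmatch

PATTERN = "C-00000291*.sys"
# Timestamps quoted above, assumed to be on 19 July 2024 (UTC).
BAD_TIMESTAMP = datetime(2024, 7, 19, 4, 9, tzinfo=timezone.utc)
FIXED_TIMESTAMP = datetime(2024, 7, 19, 5, 27, tzinfo=timezone.utc)


def classify_channel_file(name: str, modified: datetime) -> str:
    """Return 'unrelated', 'problematic', 'fixed' or 'unknown'.

    `modified` is expected to be a timezone-aware datetime.
    """
    if not fnmatch(name, PATTERN):
        return "unrelated"
    # Compare at minute precision, since only HHMM is quoted.
    stamp = modified.astimezone(timezone.utc).replace(second=0, microsecond=0)
    if stamp == BAD_TIMESTAMP:
        return "problematic"
    if stamp >= FIXED_TIMESTAMP:
        # Assumption: anything at or after the fix timestamp is the fixed file.
        return "fixed"
    return "unknown"


if __name__ == "__main__":
    when = datetime(2024, 7, 19, 4, 9, 30, tzinfo=timezone.utc)
    # "C-00000291-example.sys" is a made-up name that matches the pattern.
    print(classify_channel_file("C-00000291-example.sys", when))
```

An administrator could feed it the file names and modification times collected from a machine to decide whether that machine still holds the faulty file.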
CrowdStrike hasn’t published the exact technical details. However, I suspect it could be due to an incompatibility between the latest agent update and its interaction with the Kernel.
The latest Agent might have a bug that wrote to the Kernel’s memory and corrupted it, which might have resulted in the crash. And since the crash kept repeating, the fault might have been in the Falcon Agent’s initialization.
Immediate Fix
CrowdStrike mentioned the steps to mitigate the issue immediately. They deployed a fix and machines using the latest Agent wouldn’t face the same issue.
They also stated that users should manually delete the faulty C-00000291*.sys file from the affected machines and reboot them. Similar workarounds were suggested for workloads running in the cloud.
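The manual step boils down to deleting every file that matches the pattern. Here is a rough sketch of that cleanup. The directory is passed in as a parameter because the exact location isn't covered here, the file names in the demo are hypothetical, and the sketch assumes the machine can boot far enough to run a script at all.

```python
import tempfile
from pathlib import Path

PATTERN = "C-00000291*.sys"


def remove_faulty_files(directory: Path, dry_run: bool = True) -> list[str]:
    """Find files matching PATTERN in `directory` and return their names.

    The files are deleted only when dry_run is False.
    """
    matched = sorted(p for p in directory.glob(PATTERN) if p.is_file())
    for path in matched:
        if not dry_run:
            path.unlink()
    return [p.name for p in matched]


def demo() -> tuple[list[str], list[str]]:
    # Build a throwaway folder with one matching and one non-matching file.
    with tempfile.TemporaryDirectory() as tmp:
        folder = Path(tmp)
        (folder / "C-00000291-example.sys").write_bytes(b"")
        (folder / "C-00000290-example.sys").write_bytes(b"")
        removed = remove_faulty_files(folder, dry_run=False)
        remaining = sorted(p.name for p in folder.iterdir())
        return removed, remaining


if __name__ == "__main__":
    print(demo())
```

The dry_run default is deliberate: it lists what would be deleted first, which is a sensible habit for anything that removes files from a system folder.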
In my view, it would take a couple of days for the fix to reach all the machines worldwide. Also, since it requires manual intervention, non-technical users would face a challenge. But, hopefully, it will get resolved soon.
Long-term mitigation
On 19th July, we witnessed how a simple software update can disrupt our daily life. In the future, such issues must be avoided at all costs.
In my opinion, there are two angles to the 19th July’s BSOD issue :-
Agent updates - The impact could have been contained if the update hadn’t been rolled out to all users at once.
OS design - Linux/Mac didn’t face the same issue, while Windows devices running the agent were hit hard. A better design of the OS and its kernel modules could have prevented this havoc.
Agent updates
Since the agents are installed on end-user devices, deployment is more challenging. Unlike cloud services, these changes can’t be rolled back automatically. The blast radius is also huge, as it covers every device running the agent worldwide.
However, in the future, CrowdStrike can follow a strategy similar to blue-green deployment (closer to what is usually called a canary or staged rollout). Instead of all the agents receiving the update at once, a small set of computers (green environment) would receive it first. The update would bake for a few days in the green environment, and then the change would roll out to the blue environment (end-user devices).
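The idea can be sketched in a few lines. This is a simplified simulation and not how CrowdStrike deploys: the ring names, the install_and_check callback and the 1% failure threshold are all assumptions made for illustration, and the bake time between rings is left out.

```python
from typing import Callable, Sequence


def staged_rollout(
    rings: Sequence[Sequence[str]],
    install_and_check: Callable[[str], bool],
    max_failure_rate: float = 0.01,
) -> tuple[list[str], bool]:
    """Roll out ring by ring and stop when a ring's failure rate is too high.

    Returns (devices that received the update, whether the rollout completed).
    """
    updated: list[str] = []
    for ring in rings:
        if not ring:
            continue
        failures = 0
        for device in ring:
            healthy = install_and_check(device)
            updated.append(device)
            if not healthy:
                failures += 1
        if failures / len(ring) > max_failure_rate:
            # Halt here: later rings never receive the faulty update.
            return updated, False
    return updated, True


def demo() -> tuple[list[str], bool]:
    rings = [
        ["lab-1", "lab-2"],                # internal test machines
        ["pilot-1", "pilot-2", "pilot-3"],  # early adopters
        ["prod-1", "prod-2"],              # everyone else
    ]
    # Pretend the update crashes every device it lands on.
    return staged_rollout(rings, install_and_check=lambda device: False)


if __name__ == "__main__":
    print(demo())
```

With a faulty update, only the two lab machines are affected and the rollout stops there. The pilot and production rings never see it.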
Similarly, since the Agent runs in the kernel mode, the testing must be rigorous. Also, the code reviews must be done thoroughly and critically. There should be enough guardrails to prevent any accidental updates to the source operating system.
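One possible guardrail is to validate every update before activating it and to keep the last version that worked. The sketch below is purely illustrative: the UpdateManager class, the looks_sane check and the RULES1 header are invented for this example and have nothing to do with Falcon's real file format.

```python
from typing import Callable


class UpdateManager:
    """Keeps the last update that validated and falls back to it."""

    def __init__(self, validate: Callable[[bytes], bool], initial: bytes) -> None:
        self._validate = validate
        self.active = initial           # content currently in use
        self.last_known_good = initial  # content to fall back to

    def apply(self, candidate: bytes) -> bool:
        """Activate `candidate` if it validates, else keep the good content."""
        try:
            ok = self._validate(candidate)
        except Exception:
            # A validator that blows up counts as a failed check.
            ok = False
        if not ok:
            self.active = self.last_known_good
            return False
        self.active = candidate
        self.last_known_good = candidate
        return True


HEADER = b"RULES1\n"  # hypothetical file header, for illustration only


def looks_sane(content: bytes) -> bool:
    # Hypothetical check: expected header followed by at least one byte.
    return content.startswith(HEADER) and len(content) > len(HEADER)


if __name__ == "__main__":
    manager = UpdateManager(looks_sane, HEADER + b"old rules")
    print(manager.apply(b"\x00\x00"))             # rejected
    print(manager.apply(HEADER + b"new rules"))   # accepted
```

A rejected update costs nothing: the agent simply keeps running with the content it already trusts, and no user has to intervene.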
OS design
Windows users have a history of dealing with BSOD. Over the years, Windows has become stable and users are seeing BSOD less frequently.
One of the primary reasons for BSOD can be attributed to the design of Windows OS. While kernel extensions provide low-level access, hardware interaction and customization, the downside is instability.
Mac followed a similar approach until macOS 10.15 (Catalina). Since then, Mac has introduced system extensions. System extensions run as user processes and have limited access to the core kernel. This prevents unexpected crashes and reduces security vulnerabilities.
Windows can benefit from following Mac’s footsteps in the future. It would improve overall security and stability.
Additionally, the Windows kernel is implemented in C/C++. These languages are prone to errors such as buffer overflows, dangling pointers, and null pointer dereferences. In user mode such errors crash the process; in kernel mode they can crash the whole system.
Microsoft has started rewriting parts of its kernel in Rust. Rust’s strong typing and ownership system make it memory safe. In the future, many 19th July-like BSOD issues could be prevented with Windows components written in Rust.
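To illustrate what bounds checking buys, here is a tiny sketch. It is written in Python rather than Rust, only to keep it short. Both languages catch an access past the end of a list at run time (Python raises an exception, Rust panics) instead of silently reading unrelated memory, as C/C++ may do. The read_field helper is made up for this example.

```python
def read_field(fields: list[str], index: int) -> str:
    """Return fields[index], or raise a clear error if index is past the end."""
    try:
        return fields[index]
    except IndexError:
        # The runtime refused the access; nothing outside the list was read.
        raise ValueError(
            f"field {index} requested but only {len(fields)} available"
        ) from None


if __name__ == "__main__":
    print(read_field(["a", "b"], 1))
    try:
        read_field(["a", "b"], 5)
    except ValueError as err:
        print("rejected:", err)
```

The bug is still a bug, but it surfaces as a well-defined error that the caller can handle, not as corrupted memory.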
Conclusion
On 19th July, a large number of systems running Windows faced BSOD. Moreover, cloud services running on Windows servers were impacted, disrupting many clients relying on these services.
The root cause of the issue was a faulty update of CrowdStrike’s Falcon Agent (sensor). The latest update caused an OS crash, and users started seeing BSOD.
CrowdStrike identified the issue and rolled out a fix. They also suggested ways to mitigate the issues on their website. Some users had to manually reboot their machines and delete the problematic file from the update as an immediate fix. In a couple of days, all systems are expected to function normally.
As a software developer, key takeaways from this incident include :-
Before deployment, assess the impact of the change and the blast radius.
Adopt blue-green deployments to minimize the disruption to end-users.
Minimize the manual steps for users in case of software roll-backs. Roll-backs must be seamless, and user functionality shouldn’t get impacted.
When dealing with the OS kernel, follow best practices for testing and code reviews.
If developing an OS, use memory-safe languages like Rust over C/C++, and use concepts such as system extensions that limit a process’s access to the OS Kernel.
Let me know in the comments below what your views are and how CrowdStrike or any other company could prevent such issues in the future. Also, restack and share the post if you liked it.
Very well articulated about the BSOD issue thank you for presenting in depth. Felt like every line touched my Kernel level 😁
While reviewing this incident with my Quality and Risk analysis experience, I came up with the following questions:
1. Deployment and release strategies: When a security company releases this type of critical update, it can account for the risk by deploying first to customers in low-impact environments and iteratively expanding to medium- and high-impact environments. This approach gives a great opportunity to uncover potential risks at every stage, so the impact can easily be mitigated.
2. Sandbox approach: Many organizations now release to sandboxes that resemble production and look for possible issues before migrating to real production.
3. Deploying iteratively, region by region across the globe, rather than to the whole world at once.
Wondering what made CrowdStrike’s release management team not consider these options? Any thoughts?
I do wonder what combination of events led to the mass deployment of this catastrophic update.
Surely Crowdstrike tests their deployments against a significant pool of test clients running various combinations of Windows and other 3rd party software?
My guess is that one of the following happened:
(1) there was an error in some automated deployment test which failed to detect the issue and gave a "false positive" that ended up clearing the "gate" for deployment. A lot of CI/CD systems have such gates that are fed from unit-tests. If this pipeline failed for some reason, it could have led to the deployment of faulty code
(2) Some other update from Microsoft or a third party that also deploys kernel-mode code happened *just before* the CrowdStrike deployment, and the latter's testing did not take this into account. One would hope that CrowdStrike would coordinate their deployments with Microsoft to ensure that any recent updates are also tested, but maybe this didn't happen?
In any case, it goes to show that
(a) it's a really bad idea to do a global deployment of something that can take down a significant part of the world's IT infrastructure - a blue/green deployment practice seems like a *really* good idea here.
(b) MS Windows lacks the robustness to allow these kinds of deployments without oversight/ control from Microsoft themselves