Discussion about this post

Srinivasa Tupurani

A very well-articulated piece on the BSOD issue; thank you for presenting it in such depth. It felt like every line touched my kernel level 😁

Reviewing this incident in light of my Quality and Risk analysis experience, I came up with the following questions:

1. Deployment and release strategies: When a security company releases this type of critical update, it typically weighs the risks, starts by deploying to customers with low-impact environments, and iteratively expands to medium- and high-impact environments. This approach gives a good opportunity to uncover potential risks at every stage, so the impact can be mitigated before it spreads.

2. Sandbox approach: Many organizations now release to production-like sandboxes first and look for possible issues before migrating to real production.

3. Deploying iteratively, region by region across the globe, rather than to the whole world at once.

I wonder what led CrowdStrike's release management team not to consider these options. Any thoughts?
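
To make the staged approach in points 1 to 3 concrete, here is a minimal Python sketch of a ring-by-ring rollout that halts as soon as one stage looks unhealthy. Everything in it is an assumption for illustration: the `Ring` type, the ring names, the `deploy` and `healthy` callbacks, and the 1% failure threshold are hypothetical and say nothing about how CrowdStrike's pipeline actually works.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Ring:
    """One rollout stage (hypothetical), e.g. a sandbox, a region, or an impact tier."""
    name: str
    hosts: List[str]


def staged_rollout(
    rings: List[Ring],
    deploy: Callable[[str], None],
    healthy: Callable[[str], bool],
    max_failure_rate: float = 0.01,  # assumed threshold, not a real-world value
) -> List[str]:
    """Deploy ring by ring and stop at the first ring that exceeds the threshold.

    Returns the names of the rings that completed successfully.
    """
    completed: List[str] = []
    for ring in rings:
        failures = 0
        for host in ring.hosts:
            deploy(host)
            if not healthy(host):
                failures += 1
        rate = failures / len(ring.hosts) if ring.hosts else 0.0
        if rate > max_failure_rate:
            # Halt here: later (higher-impact) rings never receive the update.
            break
        completed.append(ring.name)
    return completed
```

With rings ordered from sandbox to low impact to high impact, a bad update that breaks the low-impact ring stops there, and the high-impact hosts are never touched.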

John

I do wonder what combination of events led to the mass deployment of this catastrophic update.

Surely CrowdStrike tests their deployments against a significant pool of test clients running various combinations of Windows and other third-party software?

My guess is that one of the following happened:

(1) There was an error in some automated deployment test, which failed to detect the issue and reported a false "pass" that ended up clearing the "gate" for deployment. A lot of CI/CD systems have such gates that are fed from unit tests. If this pipeline failed for some reason, it could have led to the deployment of faulty code.

(2) Some other update from Microsoft or a third party that also deploys kernel-mode code happened *just before* the CrowdStrike deployment, and the latter's testing did not take this into account. One would hope that CrowdStrike would coordinate their deployments with Microsoft to ensure that any recent updates are also tested, but maybe this didn't happen?
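
The failure mode in guess (1) is a gate that "fails open": a test that crashes or never reports is treated the same as a test that passed. As a rough illustration, the Python sketch below shows a gate that fails closed instead. The `CheckStatus` enum, the `gate_allows_deploy` function, and the check names are hypothetical and do not correspond to any particular CI/CD product.

```python
from enum import Enum
from typing import Dict, Iterable


class CheckStatus(Enum):
    PASSED = "passed"
    FAILED = "failed"
    ERRORED = "errored"  # the check itself crashed or timed out


def gate_allows_deploy(
    results: Dict[str, CheckStatus], required: Iterable[str]
) -> bool:
    """Fail-closed gate: every required check must be present and PASSED.

    A missing or errored check blocks the deployment instead of being
    silently treated as a pass.
    """
    required = list(required)
    if not required:
        # No required checks configured: refuse rather than wave the build through.
        return False
    for name in required:
        if results.get(name) is not CheckStatus.PASSED:
            return False
    return True
```

Under this design, a broken test pipeline delays a release rather than clearing it, which is the safer default for code that runs in kernel mode.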

In any case, it goes to show that

(a) it's a really bad idea to do a global deployment of something that can take down a significant part of the world's IT infrastructure - a blue/green deployment practice seems like a *really* good idea here.

(b) MS Windows lacks the robustness to allow these kinds of deployments without oversight or control from Microsoft themselves.
