20 Comments

Very well articulated analysis of the BSOD issue; thank you for presenting it in such depth. It felt like every line touched my kernel level 😁

While reviewing this incident with my Quality and Risk analysis experience, I came up with the following questions:

1. Deployment and release strategies: when a security company releases this type of critical update, considering the risks, it typically starts by deploying to low-impact customer environments and iteratively expands to medium- and high-impact environments. This approach gives a great opportunity to uncover potential risks at every stage, and the impact can be mitigated easily.

2. Sandbox approach: many organizations now release first to a production-like sandbox and look for possible issues before migrating to real production.

3. Deploying iteratively, region by region across the globe, rather than to the whole world at once (see the sketch below).
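For illustration, here is a minimal sketch of what such ring-based gating could look like; the ring names, sizes, thresholds, and the health-check stub are purely hypothetical and not anything from CrowdStrike's actual pipeline:

```rust
// Hypothetical ring-based rollout gate: promote an update to the next ring
// only while the error rate observed in the current ring stays below a threshold.
struct Ring {
    name: &'static str,
    hosts: u64,          // how many endpoints are in this ring
    max_error_rate: f64, // gate: halt the rollout if exceeded
}

fn observed_error_rate(_ring: &Ring) -> f64 {
    // Placeholder for real telemetry (crash reports, heartbeats, etc.)
    0.0
}

fn rollout(rings: &[Ring]) -> Result<(), String> {
    for ring in rings {
        println!("Deploying to ring '{}' ({} hosts)...", ring.name, ring.hosts);
        let rate = observed_error_rate(ring);
        if rate > ring.max_error_rate {
            return Err(format!(
                "halting rollout: ring '{}' error rate {:.4} exceeds {:.4}",
                ring.name, rate, ring.max_error_rate
            ));
        }
    }
    Ok(())
}

fn main() {
    // Low-impact environments first, then progressively larger blast radius.
    let rings = [
        Ring { name: "internal-canary", hosts: 500, max_error_rate: 0.001 },
        Ring { name: "early-adopters", hosts: 50_000, max_error_rate: 0.001 },
        Ring { name: "general-availability", hosts: 8_500_000, max_error_rate: 0.0005 },
    ];
    if let Err(e) = rollout(&rings) {
        eprintln!("{e}");
    }
}
```

Each ring acts as a tripwire before the blast radius grows, which is exactly the opportunity to uncover risk that point 1 describes.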

Wondering what made CrowdStrike's release management team not consider these options? Any thoughts?

author

Thanks a lot, Srinivasa.

I suspect that some misconfiguration triggered this and that it then spread in an uncontrolled manner. Given the blast radius of the change, I am sure the CrowdStrike team would have had sufficient guardrails in place to contain catastrophic consequences from new changes.


I do wonder what combination of events led to the mass deployment of this catastrophic update.

Surely CrowdStrike tests their deployments against a significant pool of test clients running various combinations of Windows and other third-party software?

My guess is that one of the following happened:

(1) There was an error in some automated deployment test which failed to detect the issue and incorrectly reported a pass (a false "all clear") that ended up clearing the "gate" for deployment. A lot of CI/CD systems have such gates that are fed from unit tests. If this pipeline failed for some reason, it could have led to the deployment of faulty code (see the sketch at the end of this comment).

(2) Some other update from Microsoft or a third party that also deploys kernel-mode code happened *just before* the CrowdStrike deployment, and the latter's testing did not take this into account. One would hope that CrowdStrike would coordinate their deployments with Microsoft to ensure that any recent updates are also tested, but maybe this didn't happen?

In any case, it goes to show that

(a) it's a really bad idea to do a global deployment of something that can take down a significant part of the world's IT infrastructure - a blue/green deployment practice seems like a *really* good idea here.

(b) MS Windows lacks the robustness to allow these kinds of deployments without oversight/control from Microsoft themselves.
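As a rough illustration of point (1) above: a deployment gate is often just a boolean check over reported test results, and if the harness itself dies and reports nothing, a naive gate can read "no failures" as "all passed". Everything below is a hypothetical sketch, not CrowdStrike's actual pipeline:

```rust
// Hypothetical CI gate: deployment proceeds only if every smoke test passed.
struct TestResult {
    name: &'static str,
    passed: bool,
}

fn naive_gate_clears(results: &[TestResult]) -> bool {
    // `all` on an empty iterator is vacuously true, so a harness failure that
    // yields zero results still clears the gate.
    results.iter().all(|r| r.passed)
}

fn safer_gate_clears(results: &[TestResult]) -> bool {
    // Require at least some evidence that tests actually ran.
    !results.is_empty() && results.iter().all(|r| r.passed)
}

fn main() {
    // Failure scenario: the test harness crashed and reported nothing.
    let harness_crashed: Vec<TestResult> = Vec::new();
    println!("naive gate clears: {}", naive_gate_clears(&harness_crashed)); // true  -> faulty code ships
    println!("safer gate clears: {}", safer_gate_clears(&harness_crashed)); // false -> rollout blocked

    // A normal run, where a real failure is visible to either gate.
    let normal_run = vec![
        TestResult { name: "boots_without_bsod", passed: false },
        TestResult { name: "loads_channel_file", passed: true },
    ];
    for t in normal_run.iter().filter(|t| !t.passed) {
        println!("failed: {}", t.name);
    }
    println!("naive gate clears: {}", naive_gate_clears(&normal_run)); // false
}
```

This is obviously a toy, but gate logic of roughly this shape is common, and it fails quietly the moment the inputs it trusts disappear.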

author

In my opinion, the operating system should at least let the user know what software is being updated when that software runs in kernel mode, like Falcon. Silent updates simplify the user experience, but giving them up is a fair tradeoff for kernel-mode software.

Also, I feel the primary reason for the failure could be some misconfiguration in the pipelines or in a critical piece of infrastructure that led to this deployment. However, it could be something else too. We will have to wait for the CrowdStrike team to come back with the RCA.


Depending on the actual root cause, CrowdStrike may not be too keen to share this with the wider world or even a limited subset of the tech community. It could either demonstrate gross incompetence on their part (bad for business), or reveal too many details of their deployment mechanism that could benefit bad actors planning future attacks.


That was a great summary written at a technical level that is understandable to a wide audience, but goes into enough detail to describe the underlying issues - this balance of technical writing is often quite a challenge!

author

Thanks a lot, John.


Your article goes under the hood of the problem in a way that satisfies me greatly. You need to put a PayPal Donate button on articles like this one. Try doing this soon and I will come back in a few days to try and make a donation.

Given the large number of machines affected and their diverse locations, comments I see on YouTube tell me that there is a very large number of people for whom one of the following is true: they do not know how to get into safe mode and no help is nearby; BitLocker has been deployed and the unlock key cannot be found; or the machine simply refuses to reboot, no matter what you try. The total cost to these people, plus the cost to those who managed to do the fix after a great deal of effort, will run into multiple millions of dollars of business lost for one reason or another. Many people who paid $184 for the program must be wondering what exactly they bought.

When they read your explanation of how certain unpredictable low-level interactions might be involved, interactions that surely should be on the minds of the software engineers, they should be very angry. The whole deployment process represented human systems engineering at a pre-freshman level. Brilliance in software engineering is no substitute for adequate sophistication in human systems engineering, so what we're looking at here is astonishing corporate incompetence.

author

Thanks, Theodore, for the kind words. I will add a PayPal donate button to my articles soon.

I agree that it will be very painful for non-technical people, especially the ones who don't have immediate help. Looking at the blast radius, we realize how careful companies must be while deploying software that is installed on users' machines. Additionally, this is super-critical for Falcon-like software that runs in kernel mode; at least software that runs in user mode wouldn't impact the functioning of the entire system.

I anticipate the emergence of new practices following this whole fiasco. Companies will come up with innovative ways to tackle such rollouts in the future.


Thanks! I am here looking for your PayPal Donate button. Assuming that the Substack software does not give authors this much flexibility, I will shortly check the Upgrade Plans. (My problem here is that I do not wish to cope with automatic monthly subscription withdrawals from my accounts due to travelling etc., so I will try to see if some kind of one-time payment will work.)


At the “Subscription Plan” page I see $7 per month under “Monthly” and $125 per month under “Annual”. A message at the bottom says there is a “currency mismatch”. Perhaps you could check with your payment host to see what the problem is.

I suppose one way to do a one-time payment is to select Annual, pay the money, and then cancel immediately; but a Substack guide should make it clear that this route is available and working.

Assuming that their software does not allow a Donate button, please discuss with your payment host exactly what they are doing to permit people to make one-time payments if they do not accept automatic monthly withdrawals from their accounts. Thanks in advance.

author

Thanks for the reminder, Theodore. Unfortunately, Substack only supports Stripe, and due to a recent change in Indian regulations, Stripe has stopped sending payments.

However, PayPal would work and I would receive the money later. Here is the link: https://www.paypal.com/ncp/payment/EUGP54EHQFDZC. I will add the same link to the article too.


I really don't think that Rust is a suitable replacement for C, and its advocates will take the industry into a whole new era of chaotic software design.

Memory safety via Rust's opinionated restrictive design is not right, and is not an elegant solution. This is why many programmers need to side-step the whole Borrow Checker mechanism, and end up in manual memory management territory anyway.

There are better solutions than Rust.

Also, while I am here... writing memory-safe code in C is not difficult.

author

Agree that writing memory-safe code in C isn't difficult, but that's subjective and each developer might perceive it differently.

I am not sure why programmers have to side-step the borrow checker mechanism and end up with manual memory management. If they are doing it, then either they aren't using the semantics correctly or they don't fully comprehend them. Are you aware of which limitations of Rust lead to this practice?

From what I have heard, Microsoft is rewriting parts of the Windows kernel in Rust. We will have to wait and observe how the implementation goes and whether that's the right decision. They might end up either enhancing the language by overcoming its limitations, or ditching the effort and continuing to maintain and contribute in C/C++.


"In case they are doing it, then either they aren't using the semantics correctly or they completely don't comprehend it."

NONSENSE. The whole concept of data ownership in a program is just weird.

For example: if someone wanted to implement a doubly-linked list (or some other kind of cyclic ownership structure), a pair of linked items would 'hold on' to each other, and the Borrow Checker would prevent them from being removed from memory, or prevent a new item from being inserted between them. This is a clear (and common) example of where the draconian language design fails.

The solution is either to use an esoteric, ugly patch on the language's design, or to forgo the whole Borrow Checker and write "unsafe" code, rolling one's own memory allocation and management. Perhaps an arena approach, or even just an array, allowing for sane management of the data in question.
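To make that concrete, here is a minimal sketch of the arena/index approach I mean: nodes live in a Vec and refer to each other by index, so ownership stays with the arena and there is no cycle for the Borrow Checker to fight. It is only a sketch, not a complete list implementation:

```rust
// Arena-backed doubly-linked list: nodes refer to each other by index, so the
// arena owns everything and there is no shared ownership or `unsafe` needed.
// Removed slots are not reused here; a real implementation would keep a free list.
struct Node<T> {
    value: T,
    prev: Option<usize>,
    next: Option<usize>,
}

struct DoublyLinkedArena<T> {
    nodes: Vec<Node<T>>,
    head: Option<usize>,
    tail: Option<usize>,
}

impl<T> DoublyLinkedArena<T> {
    fn new() -> Self {
        Self { nodes: Vec::new(), head: None, tail: None }
    }

    /// Append a value at the back; returns the new node's index.
    fn push_back(&mut self, value: T) -> usize {
        let idx = self.nodes.len();
        self.nodes.push(Node { value, prev: self.tail, next: None });
        match self.tail {
            Some(t) => self.nodes[t].next = Some(idx),
            None => self.head = Some(idx),
        }
        self.tail = Some(idx);
        idx
    }

    /// Insert a value right after the node at `after`; returns the new index.
    fn insert_after(&mut self, after: usize, value: T) -> usize {
        let idx = self.nodes.len();
        let next = self.nodes[after].next;
        self.nodes.push(Node { value, prev: Some(after), next });
        self.nodes[after].next = Some(idx);
        match next {
            Some(n) => self.nodes[n].prev = Some(idx),
            None => self.tail = Some(idx),
        }
        idx
    }

    /// Walk front to back via `next` links.
    fn to_vec(&self) -> Vec<&T> {
        let mut out = Vec::new();
        let mut cur = self.head;
        while let Some(i) = cur {
            out.push(&self.nodes[i].value);
            cur = self.nodes[i].next;
        }
        out
    }

    /// Walk back to front via `prev` links (what the prev pointers buy you).
    fn to_vec_rev(&self) -> Vec<&T> {
        let mut out = Vec::new();
        let mut cur = self.tail;
        while let Some(i) = cur {
            out.push(&self.nodes[i].value);
            cur = self.nodes[i].prev;
        }
        out
    }
}

fn main() {
    let mut list = DoublyLinkedArena::new();
    let a = list.push_back("a");
    list.push_back("c");
    list.insert_after(a, "b"); // a <-> b <-> c, no Rc, RefCell, or unsafe
    assert_eq!(list.to_vec(), vec![&"a", &"b", &"c"]);
    assert_eq!(list.to_vec_rev(), vec![&"c", &"b", &"a"]);
}
```

It works, and it is entirely safe Rust, but notice that the indices are really just hand-rolled pointers.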

Having a language dictate data design so specifically will come back to bite us in the future. But this is the way of the software industry: it is embarrassingly unaware of how much damage it does to itself. Inelegance has become ingrained.

Side note: just because some people at Microsoft decide to do something, that doesn't make it correct or authoritative.


What confuses me are the procedures that failed to prevent this. Why did CrowdStrike push an untested update to all agents at 4 AM UTC? That's not just irresponsible; it's bizarre. It would suggest that some CrowdStrike developer was working in the middle of the night (if their Austin HQ was where it was developed and deployed) and had the privileges to completely bypass the code review and testing processes. This isn't a subtle impact from an uncommon configuration; this BSOD'd essentially every Windows machine that tried to run it within hours or minutes, and any testing should have caught that.

That makes me suspect some kind of threat actor was involved. A disgruntled CrowdStrike employee might have wanted to tank their reputation, but I doubt that. A more likely option to me is a cyberattack that created or altered agent software and then deployed it as an update. CrowdStrike says it's not a cyber incident, but they may just mean that no customer data was accessed or exfiltrated. It's very early to conclude that no files were tampered with, no logs were altered, no unauthorized users are in their systems, or that the update was deployed using legitimate procedures. State actors have been known to pull these kinds of major service denial operations, such as the Sandworm attacks that brought down Ukraine's power systems in 2015 and 2016.

That's just a theory. Do you have any information about why the agent update was deployed like this? Does CrowdStrike typically push updates at 11 PM CST on a Thursday? Is the update file that caused this significantly different than the software in previous updates?

author

Is the news that the change was pushed without testing true? CrowdStrike is a renowned cybersecurity firm, and like other software firms it would have established testing and deployment practices.

Unless a change is urgent or critical, companies don't rush it out. There is a possibility of a cyberattack as well; although it hasn't been officially declared, we might come to know about it a couple of years later.

I believe there will be rumours and speculation on the internet for a few weeks. After that, things will come back to normal.


I mean, how could the update have possibly not been tested? Install it on an up-to-date Windows VM, wait 10 minutes, and you've got a BSOD. The update file was pulled an hour and 21 minutes after the original update was pushed, so it didn't take long for CrowdStrike to notice that it was causing major negative impact, but it still disabled a ton of systems in that short timeframe. It's possible that there was a mistake and the developers deployed the update when they meant to send it to their test environment (which would mean it never got tested), but that doesn't explain the timing. Why would they be working on a patch and attempting to send it to testing at 11 PM on a Thursday? There could have been a non-US based team doing the work (India would work at 9:30 AM), but an international development organization would be sophisticated enough to build in safeguards against deploying untested code before trying to develop with follow-the-sun international teams.
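Even a crude automated smoke check would have caught it: push the update to a single test host, then verify the host still answers a heartbeat a few minutes later. A minimal sketch, where the host name, port, and timings are purely hypothetical:

```rust
use std::net::{TcpStream, ToSocketAddrs};
use std::thread::sleep;
use std::time::Duration;

// Hypothetical smoke check: after pushing the update to one test host, poll a
// heartbeat port for a few minutes. A machine stuck in a BSOD boot loop never
// answers, so the check fails long before anything ships globally.
fn host_alive(addr: &str, timeout: Duration) -> bool {
    addr.to_socket_addrs()
        .ok()
        .and_then(|mut addrs| addrs.next())
        .map(|a| TcpStream::connect_timeout(&a, timeout).is_ok())
        .unwrap_or(false)
}

fn main() {
    let test_host = "win-test-01.example.internal:7071"; // hypothetical host and port
    let mut healthy = false;
    for attempt in 1..=10 {
        sleep(Duration::from_secs(60)); // give the update time to load and the OS time to fall over
        if host_alive(test_host, Duration::from_secs(5)) {
            println!("attempt {attempt}: host is up, smoke check passed");
            healthy = true;
            break;
        }
        println!("attempt {attempt}: no heartbeat yet");
    }
    if !healthy {
        eprintln!("smoke check failed: blocking wider rollout");
        std::process::exit(1);
    }
}
```

If a check that simple can't reach the test host, nothing should be allowed to go out to 8.5 million endpoints.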

The point of CrowdStrike being a renowned cybersecurity firm only reinforces that this is unlikely to have happened by accident. I've worked in 20 person development teams for internal applications, and even those teams had separation of duties and code checks that would prevent deploying code without successful testing. It would almost be difficult to not catch this issue, as the normal course of checking to see if the code runs properly would cause a crash unless they develop in an entirely non-Windows environment. It's just so unlikely to me that they intentionally deployed untested, dramatically unstable software to 8.5 million endpoints that I would spend months digging through logs to find the indicators of compromise before I'd accept that it was something that slipped through the cracks. Of course, I could be wrong and it's possible some dev with elevated privileges admitted to royally screwing up, but barring that I'd be looking for the malicious actor's footprints in the development environment.

author

Agreed, a few things don't tie up. Such things happen only if the change is deployed directly to prod systems.


Even if no state actor was involved, they now see ways to create major nationwide shutdowns of critical systems without any use of traditional military hardware. Years ago an IT expert told me that this was what worried him most about the well-known vulnerabilities in the design of the Windows OS.
