
IT End of World - STW going strong

Posts: 78655
Full Member
 

Right there with you until the last paragraph.

There is no "simple mechanism" to roll back because of how early in the boot process Falcon is called.  It's not loaded by the OS, it's loaded by the boot manager.  The boot logic is basically "check for malware, if no then start Windows Kernel, if yes then Halt."  It's not an oversight.  Rather, it's not possible.

As I Understand It.
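The boot-time decision described above can be sketched roughly as follows. This is illustrative pseudo-logic only, not actual Falcon or Windows code; the function and message names are made up:

```python
def early_boot(sensor_loads_cleanly: bool) -> str:
    """Illustrative sketch of the boot-start gate described above.

    A boot-start security driver runs before the OS proper is up; if it
    fails, there is no running OS to roll anything back, so the machine
    halts rather than booting unprotected.
    """
    if sensor_loads_cleanly:
        return "start Windows kernel"
    return "halt (no fallback available this early)"
```

The key point is that the "roll back" branch simply doesn't exist at this stage of boot: the only safe choices are proceed or halt.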


 
Posted : 21/07/2024 9:33 pm
mattyfez reacted
Posts: 15555
Free Member
 

Yeah, Windows machines quite rightly cacked themselves due to 'unexpected item in bagging area'.

There's no automatic rollback for such a low-level security update on an endpoint/desktop PC.

If it were a server, then any 'org' could just take that server offline and fail over to an unpatched mirror/backup whilst the issue was figured out...


 
Posted : 21/07/2024 10:08 pm
Cougar reacted
Posts: 3279
Free Member
 

It’s not loaded by the OS, it’s loaded by the boot manager.

I would suggest the OS instantiates the Falcon drivers at a very early stage.  Falcon will undoubtedly reference a whole raft of Windows DLLs for things like low-level IO access and the like.

But agreed, if this part fails to work then there is no easy way to "roll back" hence Windows halts - and correctly so.


 
Posted : 21/07/2024 10:37 pm
 MSP
Posts: 15842
Free Member
 

I am guessing that the solution would be to have some sort of integrity check on the update files. From my understanding of the problem (which isn't great), even a digital signature in the file would have highlighted in this case that the content wasn't sound, and a checksum would have highlighted if the file was corrupted in the distribution network.
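The checksum half of that idea can be sketched in a few lines (a minimal illustration, not how CrowdStrike actually distributes files; the function name is made up). Note the limitation the post hints at: a hash only proves the bytes arrived unchanged, and a signature would additionally prove who produced them, but neither proves the content is logically sound:

```python
import hashlib


def channel_file_intact(data: bytes, expected_sha256: str) -> bool:
    """Detect corruption in transit by comparing against a published hash.

    Catches the 'corrupted in the distribution network' case, but not
    a file that was validly signed and hashed yet logically broken.
    """
    return hashlib.sha256(data).hexdigest() == expected_sha256
```

The publisher would ship the expected hash alongside (or inside a signed manifest with) the update, and the endpoint would refuse to load a file that fails the comparison.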


 
Posted : 22/07/2024 7:56 am
Posts: 8777
Full Member
 

I'm sure they can and will add some better error checking into the driver code. It's not the driver itself that's being updated frequently; it's the channel files the driver reads, which contain the updated content for the detection code that runs at the kernel layer. It appears the driver does little validation of those channel files, as it just assumes they are correctly formatted since they come from CrowdStrike. That will need to change (although it's unlikely to be able to detect every anomaly), and a rollback process might be an option (if an anomaly is detected in the latest channel file, revert to the previous update rather than the sensor disabling itself).
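That validate-then-fall-back idea can be sketched like this. Purely hypothetical: the real channel files are a proprietary binary format, and these sanity checks are made-up stand-ins:

```python
def load_channel_file(candidate: bytes, previous: bytes) -> bytes:
    """Sketch of reverting to the last known-good channel file.

    If the new content fails basic sanity checks, keep running with the
    previous file instead of crashing or disabling protection entirely.
    """

    def looks_sane(data: bytes) -> bool:
        # Illustrative checks only: non-empty and not all null bytes.
        return bool(data) and data.strip(b"\x00") != b""

    return candidate if looks_sane(candidate) else previous
```

The design trade-off is that falling back silently leaves the endpoint running older detection logic, so in practice such a fallback would need to raise a loud alert rather than just carry on.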

I still don't understand how CrowdStrike missed it in their testing. It made more sense when it was speculated that the updated channel file 291 had null bytes in it (which might have been caused by corruption whilst copying it to their public staging locations post-validation, although even that process should have file hash checks). But CrowdStrike has said that wasn't the case, implying it was just the new detection logic in the channel file that triggered a logic issue in the driver when it processed it (and if a kernel-mode driver crashes, it intentionally crashes the OS).


 
Posted : 22/07/2024 8:44 am
 dlr
Posts: 701
Free Member
 

Yes, full of zeros from the posts I saw. It was a busy Friday: ~25 servers, ~100 desktops, half of which are installed in random areas of a manufacturing plant. Great fun... One Hyper-V host in my cluster got itself messed up and would no longer live-migrate; fixed now, along with a couple of remaining desktops which I CBA to deal with on Friday and weren't important.


 
Posted : 22/07/2024 11:39 am
Posts: 4155
Free Member
 

@ahsat

Please can you ask your bro about Sky News' choice of content around 7am on Friday.

(see my post on page 2 ?!?)

Ta


 
Posted : 22/07/2024 11:46 am
Posts: 857
Full Member
 

Good explanation of the technicals by Dave who used to work at MS (*):

TLDR the Crowdstrike driver is a kernel driver that marks itself as required to boot ('a bootstart driver').

The driver is tested and certified by MS... but the definition files that the driver loads, which are almost certainly code, are not. The definition file made an invalid memory access, causing a SEGV. The kernel quite reasonably gives up at this point - reasonable given its architecture and CrowdStrike's use of it, anyway.

Still, of course, how CrowdStrike allowed something of such large scope to happen is anyone's guess.

(*) and by the looks of things was in early enough to make an absolute boatload!


 
Posted : 22/07/2024 1:25 pm
Posts: 7128
Full Member
 

Initial root cause analysis:

https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/

On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

It still doesn't answer the question of why they were not doing staged rollouts of these new named-pipe templates.

The first template for spotting named pipe usage went out in February, and the named pipe monitoring itself is just another way to possibly spot malicious programs - it wasn't actually handling some kind of 0-day attack - i.e. they could have done a staged rollout without impacting their ability to protect customer systems.
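The staged rollout being argued for here can be sketched as successive deployment rings. This is a generic illustration of the canary pattern, not CrowdStrike's actual deployment machinery, and the ring sizes are made up:

```python
def rollout_rings(hosts, ring_fractions=(0.01, 0.10, 1.0)):
    """Yield successive deployment batches: a small canary ring first,
    then progressively larger rings.

    A bad update surfaces in the 1% canary ring, where it can be
    halted before reaching every endpoint.
    """
    done = 0
    for frac in ring_fractions:
        target = max(1, int(len(hosts) * frac))
        batch = hosts[done:target]
        if batch:
            yield batch
        done = max(done, target)
```

Between rings the operator would watch crash telemetry and only promote the update if the canary population stays healthy, which is exactly the safety margin a non-0-day detection update could have afforded.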

It also seems like a poor design choice to put so much complex code into the kernel - is it really not possible to do the complicated stuff in userspace? I don't know anything about Windows, but in Linux all of this could have been in userspace (auditd, apparmor, etc). Maybe there's some reason I don't understand.


 
Posted : 25/07/2024 9:40 am
Posts: 8088
Full Member
 

It still doesn’t answer the question of why they were not doing staged rollouts of these new named-pipe templates.

It's worse than that.  Whilst they did initially test their "template type" properly, once it was bedded in they apparently just switched to using a "content validator", and so were throwing these into prod without real testing.

On the plus side, they have handed out some gift vouchers to their partners for the inconvenience caused.

On the downside, at $10 it is probably one of those times they shouldn't have bothered at all.


 
Posted : 25/07/2024 10:02 am
Posts: 8777
Full Member
 

I've also heard that a lot of companies configure staged deployments of CrowdStrike updates to their endpoints (I'm not involved with managing it myself, though), but the way they pushed this update (I guess the Rapid Response option) ignores all that and pushes out to all endpoints at once - which is probably why it took down services at companies like Microsoft, where you'd expect staged roll-outs to be configured. If I were MS, I'd certainly be suing CrowdStrike.


 
Posted : 25/07/2024 12:12 pm
Posts: 78655
Full Member
 

Initial root cause analysis:

The executive overview is worth a read:

Adopt a staggered deployment strategy, starting with a canary deployment to a small subset of systems before a further staged rollout.

- Called by oldnpastit at the top of this page.

Conduct multiple independent third-party security code reviews.

- Called by me a couple of pages back.

Strengthen error handling mechanisms in the Falcon sensor to ensure errors from problematic content are managed gracefully.

- Called by multiple people on here.

Make no mistake, even fully understanding how this happened, it really shouldn't have.  But credit where it's due, I cannot fault CrowdStrike's subsequent handling of it.  They'd resolved the initial issue within 90 minutes - which I'd expect from a rapid response company - and have been transparent as to what went wrong and what they're doing about it.


 
Posted : 25/07/2024 12:36 pm
Page 4 / 4