IT End of World – STW going strong

Viewing 27 posts - 121 through 147 (of 147 total)
    oldnpastit
    Full Member

    Genius move to push a potentially bricking update to every single client machine in one go!

    My employer has many millions of embedded (not Windows) devices with updates of one sort or another going out pretty regularly. All of those updates go through some kind of “Canary” phase – deploy to internal alpha/beta, then to a small population, and then rollout to the entire population while monitoring various metrics. It’s not rocket surgery.
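
    Purely for illustration (the cohort names, threshold and stub functions below are all invented, not our actual tooling), the shape of that loop is roughly this:

    ```python
    # Toy staged-rollout loop: push to progressively larger cohorts, watch the
    # metrics, and stop the moment a cohort looks unhealthy.
    import time

    COHORTS = ["internal-alpha", "internal-beta", "canary-1pct", "fleet-10pct", "fleet-100pct"]
    MAX_CRASH_RATE = 0.001  # abort if more than 0.1% of updated devices go unhealthy


    def deploy(update_id: str, cohort: str) -> None:
        """Push the update to one cohort (stub - wire up to your update service)."""
        raise NotImplementedError


    def crash_rate(cohort: str) -> float:
        """Fetch the crash/boot-loop rate for a cohort (stub - wire up to your metrics)."""
        raise NotImplementedError


    def staged_rollout(update_id: str, soak_minutes: int = 60) -> bool:
        for cohort in COHORTS:
            deploy(update_id, cohort)
            time.sleep(soak_minutes * 60)       # let the cohort soak before judging it
            if crash_rate(cohort) > MAX_CRASH_RATE:
                print(f"halting rollout of {update_id}: {cohort} looks unhealthy")
                return False                    # stop before the blast radius grows
        return True
    ```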

    Anything that ends up affecting code like bootloaders – where bricking a device is a real possibility – gets a huge amount of care taken over it; everyone’s nightmare is waking up to a Slack message from someone you’ve never met asking you to join an urgent 2am call.

    On the one hand, I do feel a lot of sympathy for whoever made whatever change it was that did this, and I’m sure it won’t be much fun being that person, or writing the RCA.

    On the other hand, they’ve got a huge market cap and an insane valuation, so they must have huge amounts of cash sloshing around; surely they could afford to do a better job than they did, and to foresee this kind of thing and defend against it?

    As a wise old engineer once said to me when I was a young whippersnapper, “If it hasn’t been tested, it doesn’t work”.

    mattyfez
    Full Member

    As a wise old engineer once said to me when I was a young whippersnapper, “If it hasn’t been tested, it doesn’t work”.

    As someone else upthread said though…

    …do you trust your security firm for a zero day fix, or do you run a multi-million pound business unpatched for 48hrs to allow for testing and hope you don’t get hacked? Either way it’s a risk.

    zomg
    Full Member

    48 hours? You’re doing it wrong.

    edit: Ah, you’re talking about staging in the customer environment? That’s probably fair, though a smoke test could hopefully be automated and be done much quicker. Perhaps there’s now a product niche there, courtesy of engineering management at Crowdstrike who presided over a pipeline that didn’t test what they were publishing.

    oldnpastit
    Full Member

    .

    somafunk
    Full Member

    Our Tesla thinks it’s a 30mph speed limit everywhere today until the cameras pick up an actual sign. It’s normally eerily accurate.

    My MG HS Trophy thinks every single road is 40mph and constantly flashes red in the display; it’s quite distracting and there’s no fix for it according to the dealer. I fixed it myself with a bit of duct tape over the flashing icon.

    1
    Cougar
    Full Member

    …do you trust your security firm for a zero day fix, or do you run a multi-million pound business unpatched for 48hrs to allow for testing and hope you don’t get hacked? Either way it’s a risk.

    This, really.

    The IoT example above is all well and good, but it’s apples and oranges.  EDR/XDR is not like “normal” software.  Falcon’s very raison d’etre is to respond to threats fast. How often does your lightbulb get an update?(*)  Falcon Sensor receives multiple updates every day.

    If the building’s on fire, do you say “well, the hosepipe is still in Alpha so we’ll get to you in a couple of weeks?”  I’m increasingly of the mind that this wasn’t a testing issue, it was a QA issue.

    Quite what the solution is, I do not know.  But as I said at the outset, I expect there will be some robust exchanges of views when it’s mostly all over.  Vendors like CrowdStrike essentially mark their own homework; that surely has to change.  If this incident had been malicious rather than a big whoopsie we would be in a VERY bad place right now.

    (* – probably answer: “not often enough”)

    Jamze
    Full Member

    A config file change that blue-screens the device and puts it in a boot-loop obviously would never get through CrowdStrike’s testing, so IMO something has gone drastically wrong with the deployment process. Either what was distributed was not the intended update, the file got corrupted somehow, human error, etc.

    Cougar
    Full Member

    I’ve seen all of those posited and more.  I too find it hard to believe from CrowdStrike, but here we are.

    I just tripped over this blog post, which seems to be as comprehensive and accurate a technical overview as any I’ve found.

    oldnpastit
    Full Member

    That Medium article is interesting. Sounds like they rolled out some new and broken code without testing it.

    Very poor. And nothing to do with urgently needing to fix threats as soon as possible (not that that is an excuse anyway).

    If the building’s on fire, do you say “well, the hosepipe is still in Alpha so we’ll get to you in a couple of weeks?”

    The hosepipe had not been tested on your fire. Hard to believe there even was a fire.

    Cougar
    Full Member

    Hard to believe there even was a fire.

    You don’t have a fire brigade because there’s a fire.  You have a fire brigade in case there is.

    1
    mattyfez
    Full Member

    Vendors like CrowdStrike essentially mark their own homework, that surely has to change.

    True, but… cost? Given the frequency of updates of this nature, I imagine clients would have to have a permanent ‘security test and release’ team whose only job is to test and release security patches/AV definition files etc. It sounds like a full-time job; even if it’s just 2 or 3 people it could easily cost £100k a year or more.

    The bean counters would not like that… I have a hard enough time trying to convince clients to slightly up-spec their VMs at a sensible cost, despite… oh look, their SQL server has bombed again as it’s out of RAM… again 😀

    “but our environment didn’t go down, so why should we pay?”

    “because we had to manually fail over environment A to environment B when we were getting critical resource alerts, AGAIN!”

    Maybe it could be automated as a halfway house, if it’s just a simple ‘smoke test’: is the definition file in the expected/correct format, simple stuff like that. But then you’d think that would happen at CrowdStrike anyway as part of the automated deployments…
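
    Something like this, say (the magic bytes, minimum size and the idea that the file even has a recognisable header are entirely made up here, just to show the shape of the check):

    ```python
    # Toy "is this definition file even plausible?" pre-deployment check.
    import sys

    EXPECTED_MAGIC = b"\xAA\xAA\xAA\xAA"   # hypothetical file signature
    MIN_SIZE_BYTES = 1024                  # hypothetical sanity floor


    def smoke_test(path: str) -> bool:
        with open(path, "rb") as f:
            data = f.read()
        if len(data) < MIN_SIZE_BYTES:
            print(f"{path}: too small ({len(data)} bytes)")
            return False
        if not data.startswith(EXPECTED_MAGIC):
            print(f"{path}: wrong magic bytes")
            return False
        if data.count(b"\x00") == len(data):
            print(f"{path}: nothing but zero bytes")
            return False
        return True


    if __name__ == "__main__":
        sys.exit(0 if smoke_test(sys.argv[1]) else 1)
    ```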

    2
    zomg
    Full Member

    Crowdstrike could be publishing their homework along with their product. Testing isn’t a sideband activity. It is the product too.

    mattyfez
    Full Member

    Virgin Radio calling it a ‘Microsoft Windows outage’ just now…

    That’s like me crashing my car into a crowded bus stop and calling it a Ford issue, FFS, lol

    2
    fooman
    Full Member

    We know what went wrong but there’s still a question over how and why it happened. It’s almost unthinkable that some level of testing didn’t take place before making the update, so why was it inadequate? I think the clue is in CrowdStrike’s own blog: these channel files are updated several times a day. This is the Falcon USP, that they are responding to threats as they emerge, so the normal develop / test / release cycle is highly compressed and probably highly automated.

    Instead of a phased rollout, every machine online got updated at the same time. A little over an hour after release CrowdStrike realised there was a problem and pulled the channel file, but by that time 8.5 million machines had already been affected. CrowdStrike themselves seemed surprised that an issue could even occur, as they state there’s not been an issue with Falcon before, so I think a combination of trying to be the fastest to respond and their own hubris created the perfect storm.

    FWIW there could have been a simple failsafe – if Falcon fails after channel update, roll back that channel update, reboot, and you are back. The fact that a simple mechanism like this wasn’t considered leads me to think they didn’t believe a channel file could take down Falcon, which may have fed into a minimal testing strategy.

    Jamze
    Full Member

     if Falcon fails after channel update, roll back that channel update, reboot, and you are back

    Once you’ve caused the memory exception and blue-screened, I don’t think you can then have a script do something else.

    1
    Cougar
    Full Member

    Right there with you until the last paragraph.

    There is no “simple mechanism” to roll back because of how early in the boot process Falcon is called.  It’s not loaded by the OS, it’s loaded by the boot manager.  The boot logic is basically “check for malware, if no then start Windows Kernel, if yes then Halt.”  It’s not an oversight.  Rather, it’s not possible.

    As I Understand It.

    1
    mattyfez
    Full Member

    Yeah, Windows machines quite rightly cacked themselves due to ‘unexpected item in bagging area’.

    There’s no automatic rollback for such a low-level security update on an endpoint/desktop PC.

    If it were a server, then any ‘org’ could just take that server offline and fail over to an unpatched mirror/backup whilst the issue was figured out…

    GlennQuagmire
    Free Member

    It’s not loaded by the OS, it’s loaded by the boot manager.

    I would suggest the OS would instantiate the Falcon drivers at a very early stage.  Falcon will undoubtedly reference a whole raft of Windows DLLs for things like low-level IO access and the like.

    But agreed, if this part fails to work then there is no easy way to “roll back” hence Windows halts – and correctly so.

    MSP
    Full Member

    I am guessing that the solution would be to have some sort of integrity check on the update files. From my understanding of the problem (which isn’t great), even a digital signature in the file would have highlighted in this case that the content wasn’t sound, and a checksum would have highlighted if the file was corrupted in the distribution network.
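
    As a rough sketch of what I mean (hashlib/hmac just to keep the example short; a real scheme would use proper asymmetric signing, and the key handling is completely hand-waved here):

    ```python
    # Two separate checks: a checksum catches corruption in the distribution
    # network, a signature catches content that never came from the publisher.
    import hashlib
    import hmac


    def verify_update(payload: bytes, expected_sha256: str,
                      signature: bytes, signing_key: bytes) -> bool:
        # 1. Checksum: did the file survive the distribution network intact?
        if hashlib.sha256(payload).hexdigest() != expected_sha256:
            return False
        # 2. Signature: was this exact content produced by the vendor's release pipeline?
        expected_sig = hmac.new(signing_key, payload, hashlib.sha256).digest()
        return hmac.compare_digest(expected_sig, signature)
    ```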

    FuzzyWuzzy
    Full Member

    I’m sure they can and will add some better error checking into the driver code. It’s not the driver that’s being updated frequently; it’s the channel files the driver calls which contain the updated content for the detection code that runs in the kernel layer. It appears there isn’t much validation done of those channel files by the driver, as it just assumes they are correctly formatted etc. because they come from Crowdstrike. That will need to change (although it’s unlikely to be able to detect every anomaly) and a rollback process might be an option (as in, if an anomaly in the latest channel file is detected it reverts to using the previous update, rather than disabling itself).
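
    As a very rough sketch of that fallback idea (the file paths and the validation checks are hypothetical, and the real thing would have to live in the sensor’s kernel driver rather than anything like this):

    ```python
    # "Use the newest channel file that passes validation, otherwise the previous one."

    def validate_channel_file(data: bytes) -> bool:
        """Cheap structural checks before the content is ever interpreted (placeholder)."""
        return len(data) > 0 and any(b != 0 for b in data)


    def load_with_fallback(latest_path: str, previous_path: str) -> bytes:
        for path in (latest_path, previous_path):
            try:
                with open(path, "rb") as f:
                    data = f.read()
            except OSError:
                continue
            if validate_channel_file(data):
                return data    # newest file that passes validation wins
        # What to do when neither file is usable is a policy question in itself.
        raise RuntimeError("no usable channel file")
    ```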

    I still don’t understand how it was missed by Crowdstrike in their testing. It made more sense when it was speculated that the updated channel file 291 had null bytes in it (which might have been caused by corruption whilst copying it to their public staging locations post-validation, although even that process should have file hash checks), but Crowdstrike has said that wasn’t the case and implied it was just the new detection logic in the channel file that triggered a logic issue in the driver when it processed it (and if a kernel-mode driver crashes it will intentionally crash the OS).

    dlr
    Full Member

    Yes, full of zeros from the posts I saw. Was a busy Friday: ~25 servers, ~100 desktops, half of which are installed in random areas of a manufacturing plant, great fun… One Hyper-V host in my cluster got itself messed up and would no longer live migrate; fixed now, along with a couple of remaining desktops which I CBA to deal with on Friday and weren’t important.

    Ro5ey
    Free Member

    @ahsat

    Please can you ask your bro about Sky News’ choice of content around 7am on Friday.

    (see my post on page 2 ?!?)

    Ta

    branes
    Free Member

    Good explanation of the technicals by Dave who used to work at MS (*):

    https://www.youtube.com/watch?v=wAzEJxOo1ts

    TLDR the Crowdstrike driver is a kernel driver that marks itself as required to boot (‘a bootstart driver’).

    The driver is tested and certified by MS… the definition files that the driver loads, which are almost certainly code, are not. The definition file made an invalid memory access, causing a SEGV. The kernel quite reasonably gives up at this point, at least given its architecture and CrowdStrike’s use of it.

    Still, how Crowdstrike allowed something of this scope to happen is of course anyone’s guess.

    (*) and by the looks of things was in early enough to make an absolute boatload!

    oldnpastit
    Full Member

    Initial root cause analysis:

    https://www.crowdstrike.com/falcon-content-update-remediation-and-guidance-hub/

    On July 19, 2024, two additional IPC Template Instances were deployed. Due to a bug in the Content Validator, one of the two Template Instances passed validation despite containing problematic content data.

    It still doesn’t answer the question of why they were not doing staged rollouts of these new named-pipe templates.

    The first template for spotting named pipe usage went out in February, and the named pipe monitoring itself is just another way to possibly spot malicious programs – it wasn’t actually handling some kind of 0-day attack – i.e. they could have done a staged rollout without impacting their ability to protect customer systems.

    It also seems like a poor design choice to put so much complex code into the kernel – is it really not possible to do the complicated stuff in userspace? I don’t know anything about Windows, but in Linux all of this could have been in userspace (auditd, apparmor, etc). Maybe there’s some reason I don’t understand.

    1
    dissonance
    Full Member

    It still doesn’t answer the question of why they were not doing staged rollouts of these new named-pipe templates.

    It’s worse than that.  Whilst initially they did test their “template type” properly, once it was bedded in they apparently just switched to using a “content validator”, and so were throwing these into prod without real testing.

    On the plus side they have handed out some gift vouchers to their partners for the inconvenience caused.

    On the downside, at $10 it is probably one of those times they shouldn’t have bothered at all.

    FuzzyWuzzy
    Full Member

    I’ve also heard that a lot of companies configure staged deployments of Crowdstrike updates to their endpoints (not involved with managing it myself though), but the way they pushed this update (I guess the Rapid Response option) ignores all that and pushes out to all the endpoints at once – which is probably why it took down services in companies like Microsoft, where you’d expect them to have staged roll-outs configured. If I were MS I’d certainly be suing Crowdstrike.

    Cougar
    Full Member

    Initial root cause analysis:

    The executive overview is worth a read:

    Adopt a staggered deployment strategy, starting with a canary deployment to a small subset of systems before a further staged rollout.

    – called by oldnpastit at the top of this page

    Conduct multiple independent third-party security code reviews.

    – Called by me a couple of pages back.

    Strengthen error handling mechanisms in the Falcon sensor to ensure errors from problematic content are managed gracefully.

    – Called by multiple people on here.

    Make no mistake, even fully understanding how this happened, it really shouldn’t have.  But credit where it’s due, I cannot fault CrowdStrike’s subsequent handling of it.  They’d resolved the initial issue within 90 minutes – which I’d expect from a rapid response company – and have been transparent as to what went wrong and what they’re doing about it.

