The 2024 CrowdStrike Outages
Connor Ell - 28 Jan, 2026
Introduction
On July 19, 2024, I, like many others in the IT space, woke to chaos online as major companies around the world reported massive outages of their IT infrastructure. From large banks to major airlines, it was a truly global incident, one that will be remembered in the IT space for years to come. The incident, which affected over 8.5 million Windows machines and caused billions of dollars in economic fallout, highlighted the immense trust global companies place in IT security vendors, and the responsibility those vendors have to their customers.
I was working in IT for a large Canadian company at the time, so I experienced the ripple effects firsthand: remediating damage to our own systems, fielding frantic questions from colleagues and leadership, and finally breathing a sigh of relief when we managed to get all our systems back online in record time.
In this post I’ll break down exactly what happened, the fallout I watched unfold in real time, and the lessons that still feel relevant more than a year later. There’s already a mountain of coverage out there, but I hope my on-the-ground view from inside a Canadian company adds something useful to the conversation.
CrowdStrike and the CrowdStrike logo are trademarks of CrowdStrike, Inc. Used for informational purposes.
First confirmation
July 19th started like any other morning for me: waking up and turning on the news. I immediately knew I was in for a rough day. News of the worldwide outages had spread, and everyone already knew CrowdStrike was to blame. I knew we used CrowdStrike, and that meant I had to get to the office immediately. Quickly checking my email, I saw that my supervisor, the head of IT, had already sent out a company-wide email advising everyone that we were aware of the incident and working on remediation. For me and my colleagues, the day would be a trial by fire: my supervisor was still on vacation and had sent that email from his hotel room.
The office was a nightmare. Almost every workstation in the open area was stuck on the infamous blue screen. Worse: all of our physical Hyper-V hosts in the server room had crashed overnight. That meant every VM, from domain controllers to file servers, was unreachable. Remote workers couldn’t authenticate, shared drives were gone, and all business operations were halted indefinitely.
Those first hours highlighted the importance of communication during an IT incident. As junior team members, we didn’t normally hold the keys (literally) to half the systems we needed to fix. User passwords, Hyper-V admin credentials, everything had to come straight from our boss, who was on vacation in another country. We were calling back and forth on our personal phones because that was the fastest way to get what we needed. A direct, trusted channel that didn’t depend on our own downed infrastructure, in this case plain cell service, was the difference between recovery taking minutes and taking hours.
Triage and initial remediation
Most cybersecurity professionals will know this step of incident response well. For me and my colleagues, triage was fairly simple. The first things we had to get back online were the handful of physical machines in the site’s server room. These machines ran Windows and used Hyper-V to host virtual machines such as the file server, domain controller, and Wi-Fi access point controllers. Unfortunately for us, every single physical machine, as well as some of the virtual machines themselves, had been taken down overnight. That made our first remediation step clear: get the domain controller back online so our supervisor could regain remote access. After that, we brought the file server back online so staff could remote in from home and continue to work.
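The order mattered: nothing on the domain could authenticate until the domain controller was up. As a rough illustration of that dependency-ordered restart, here’s a minimal Python sketch that shells out to Hyper-V’s Start-VM and Get-VM cmdlets. The VM names are invented for illustration, and in reality we did this interactively at the hosts’ consoles rather than with a script.

```python
# Illustrative sketch only: restart VMs in dependency order on a recovered
# Hyper-V host. Assumes Windows with the Hyper-V PowerShell module and
# admin rights. VM names below are hypothetical.
import subprocess
import time

# Nothing authenticates until the domain controller is up, so it goes first.
BOOT_ORDER = ["DC01", "FILESERVER01", "WIFI-CTRL01"]

def start_vm(name: str) -> None:
    """Start a Hyper-V VM via PowerShell."""
    subprocess.run(
        ["powershell", "-NoProfile", "-Command", f"Start-VM -Name '{name}'"],
        check=True,
    )

def wait_for_heartbeat(name: str, timeout: int = 300) -> None:
    """Poll the VM's integration-services heartbeat before moving on."""
    deadline = time.time() + timeout
    while time.time() < deadline:
        result = subprocess.run(
            ["powershell", "-NoProfile", "-Command",
             f"(Get-VM -Name '{name}').Heartbeat"],
            capture_output=True, text=True,
        )
        if "Ok" in result.stdout:  # e.g. 'OkApplicationsHealthy'
            return
        time.sleep(10)
    raise TimeoutError(f"{name} did not come up within {timeout}s")

if __name__ == "__main__":
    for vm in BOOT_ORDER:
        start_vm(vm)
        wait_for_heartbeat(vm)
        print(f"{vm} is up")
```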
I think one of the clearest lessons here is the importance of redundancy. Having one machine run multiple critical services is convenient and cost-effective, but a single point of failure like this will inevitably rear its ugly head. For me and my colleagues, it severely hampered our efficiency bringing systems back online; you can only have so many hands on a single keyboard. A less-discussed aspect of an incident like this is business continuity planning. In our modern, interconnected world, having systems go down is not a matter of if but when. This time my employer was lucky and normal business operations could resume quickly. Had the outage happened a few weeks earlier, during a major event, the lost revenue would have been enormous.
Final remediation and aftermath
After the initial systems in the server room had been brought back online, we moved on to physically visiting and fixing each remaining machine. This task was complicated by the fact that the site spans over 100 acres, with multiple offices and many smaller buildings built up over 50+ years. For my colleagues and me, it turned into a sort of treasure hunt to track down every machine hidden away in the more dated buildings. Since the site was so big, we made liberal use of the golf carts provided to the department, which let us finish fixing all the machines in record time. We had the 100+ acre site up and running within the day, while comparable sites weren’t fully back online for another day or two. In the aftermath of the outage, I don’t think my supervisor was the only one seriously reconsidering using CrowdStrike in the future.
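For context, the fix itself was mechanically simple once you had hands on a machine: CrowdStrike’s published workaround was to boot into Safe Mode or the Windows Recovery Environment and delete the faulty channel file (files matching C-00000291*.sys) from the CrowdStrike driver directory. Here’s an illustrative Python sketch of that cleanup step; treat it as pseudocode for the manual process, since a bare recovery environment won’t have Python, and BitLocker-protected drives first needed their recovery keys.

```python
# Illustrative sketch of CrowdStrike's published workaround: remove the
# faulty channel file so the sensor stops crash-looping the machine.
# In practice this was done by hand from Safe Mode / WinRE; treat this
# as pseudocode for those manual steps.
from pathlib import Path

# The OS volume may not be mounted as C: from a recovery environment.
DRIVER_DIR = Path(r"C:\Windows\System32\drivers\CrowdStrike")

def remove_faulty_channel_files(driver_dir: Path) -> int:
    """Delete any Channel File 291 variants and report how many were removed."""
    removed = 0
    for channel_file in driver_dir.glob("C-00000291*.sys"):
        channel_file.unlink()
        removed += 1
    return removed

if __name__ == "__main__":
    count = remove_faulty_channel_files(DRIVER_DIR)
    print(f"Removed {count} faulty channel file(s); reboot normally.")
```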
I don’t think I was alone in breathing a massive sigh of relief when I learned the outage wasn’t a cyber attack. SolarWinds 2020, anyone? Had this been a supply chain attack akin to SolarWinds, it would have been the biggest one in history.
Final thoughts
The biggest takeaway for me, after living through the CrowdStrike incident in real time, was just how much power today’s IT supply chain concentrates in a handful of vendors. One non-malicious, defective content update from a trusted security company brought airlines, hospitals, stock exchanges, and our own 100-acre site to their knees in minutes. The damage was already historic, and it was an accident. Now imagine the same update channel (or any other single point so many organizations depend on) had been weaponized by a nation-state or ransomware group. The destruction from a deliberate attack could have been orders of magnitude worse.
Modern IT is deeply interdependent, and securing the supply chain isn’t some checkbox exercise anymore. A single trusted vendor can accidentally become the biggest risk in the room, and a malicious actor exploiting that same trust could take down critical infrastructure around the world in an instant.