
Manoeuvring the CrowdStrike Outage: A Behind-the-Scenes Look at Performanta’s Response

When the CrowdStrike outage blindsided us, it quickly became apparent that we were facing a significant challenge. Our Safe Platform (part of our Safe XDR MDR) notified us of an unusual number of endpoints becoming unresponsive. At that point, there was no official fix from CrowdStrike, but our team is accustomed to thinking on its feet. We knew that every minute counted to ensure our clients' security operations remained intact.

 


World-First Response: The Power of the Performanta Risk Operations Centre (ROC) and Safe Platform

The morning began with a sense of urgency. At around 6am (GMT+1) on Friday, 19 July 2024, our Safe Platform alerted us to a drop in our endpoint coverage; simultaneously, some of our fastest-to-respond clients had already reached out to confirm the outage.


Our SOC Technical Leadership and the rest of the tech team were already on high alert. Despite the uncertainty surrounding the situation, Performanta’s Safe Platform, underpinned by Encore Attack Surface Management (ASM) and the world-first Risk Operations Centre (ROC), played an instrumental role in quickly detecting and responding to the outage.


Prior to the outage, our Safe Platform was actively monitoring coverage in real time, including a live coverage view for CrowdStrike. This enabled us to detect the outage promptly, respond more swiftly, and minimise disruption while maintaining high-level coverage for our clients. By leveraging our innovative technologies and processes, we got most of our CrowdStrike clients up and running within hours of the outage. This proactive, streamlined approach minimised downtime and ensured that security operations were restored swiftly.
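To make the idea concrete, here is a minimal, purely illustrative sketch in Python of how a fleet-wide coverage-drop alert of this kind could work. The names, timeout, and threshold are hypothetical, not the Safe Platform’s actual API:

from dataclasses import dataclass
from datetime import datetime, timedelta

@dataclass
class Endpoint:
    hostname: str
    last_seen: datetime   # timestamp of the last sensor heartbeat

def coverage_ratio(endpoints: list[Endpoint], now: datetime,
                   heartbeat_timeout: timedelta = timedelta(minutes=10)) -> float:
    """Fraction of the fleet whose sensor has checked in recently."""
    if not endpoints:
        return 0.0
    responsive = sum(1 for e in endpoints if now - e.last_seen <= heartbeat_timeout)
    return responsive / len(endpoints)

def coverage_dropped(endpoints: list[Endpoint], now: datetime,
                     threshold: float = 0.9) -> bool:
    """Alert condition: fleet-wide coverage has fallen below the threshold."""
    return coverage_ratio(endpoints, now) < threshold

In essence, a sudden fall in the ratio of recently seen endpoints, across every client at once, is exactly the signal that fired that morning.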




Proactive Communication: Keeping Our Clients in the Loop

As soon as we identified potential workarounds, our team of Service Delivery Managers (SDMs) swung into action. We understood that our clients needed to be kept informed in real time, even though the solutions were not yet official. Each SDM reached out to their respective clients, sharing the unofficial workarounds we had tested and advising them on how to mitigate the impact of the outage on their operations.


Initially, many clients were reluctant to accept that nearly every machine would need to be manually touched to restore access. However, with consistent communication and support, they gradually accepted the situation and took the necessary steps.
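For context, the community-sourced workaround circulating that morning, later reflected in CrowdStrike’s official guidance, was to boot each affected machine into Safe Mode or the Windows Recovery Environment and delete the faulty channel file(s) matching C-00000291*.sys. The sketch below expresses that file-level step in Python for illustration only; in practice the deletion was done by hand from the recovery console, which is why every machine needed to be touched:

from pathlib import Path

def remove_faulty_channel_files(windir: str = r"C:\Windows") -> list[str]:
    """Delete CrowdStrike channel files matching C-00000291*.sys and return their paths."""
    driver_dir = Path(windir) / "System32" / "drivers" / "CrowdStrike"
    removed = []
    for channel_file in driver_dir.glob("C-00000291*.sys"):
        channel_file.unlink()          # remove the defective channel file
        removed.append(str(channel_file))
    return removed                      # reboot the machine normally afterwards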


At the same time, we proactively evaluated the systems impacted by the outage. This enabled us to offer specific guidance to our clients, helping them quickly restore essential services. Although we anticipated that some systems would take many hours to repair, we used our Safe Platform to prioritise the most critical machines for immediate attention, leaving less urgent remediation for later. As a result, our clients' critical systems were operational within two hours, because their vital systems were restored first.
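As a rough illustration of that triage logic, prioritisation amounts to ordering the impacted machines by business criticality and working down the list. The field names and scale here are hypothetical, not the Safe Platform’s actual data model:

from dataclasses import dataclass

@dataclass
class ImpactedMachine:
    hostname: str
    criticality: int    # 1 = mission-critical ... 5 = low priority (illustrative scale)
    is_server: bool

def remediation_queue(machines: list[ImpactedMachine]) -> list[ImpactedMachine]:
    """Order machines so mission-critical systems come first, servers before workstations on ties."""
    return sorted(machines, key=lambda m: (m.criticality, not m.is_server))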

 


Challenges and Client Resilience

One client was particularly hard hit, with many servers across multiple locations showing blue-screen errors, which made the restoration process especially tedious. Despite these challenges, our clients showed remarkable resilience. Most were far less concerned about workstation issues, even though some workstations had to be fully reinstalled due to boot loops.


Throughout the crisis, our clients handled the situation far better than we initially expected. They understood that this was a widespread issue and appreciated the team effort required to get everything back online. Their understanding and cooperation were instrumental in the recovery process, and our Safe Platform, which let us respond quickly and prioritise systems, enabled us to provide rapid support.



Waiting for Official Guidance: Balancing Speed and Accuracy

During the morning of the outage (GMT+1), we kept a close watch on the situation and helped our clients prioritise their essential systems using our platform. We offered guidance and temporary solutions to keep crucial systems running without interruption. While we implemented temporary, tested fixes based on community-sourced information, we knew that an official solution from CrowdStrike was essential for a full resolution. It wasn’t until close to midday on Friday that we received official communication from CrowdStrike, detailing the methods to restore services and addressing the root cause of the outage. We immediately shared this information with our clients, ensuring they had the correct steps to fully restore their systems.

 


From Crisis to Success in Just Over 2 Hours: Response in Minutes

Time (GMT+1) – Event

5.57am – Our SOC's Safe XDR alerts on an unusual number of unresponsive endpoints

6.00am – Safe Platform (part of Safe XDR) notices a significant decrease in CrowdStrike (CS) coverage across all clients (a first)

6.03am – First client contacts us with their observations

6.05am – Rapid check that our alert rules aren't compromised or damaged

6.07am – Verification that the issue lies with CS

6.10am – CS contacted; an informal fix is given (the formal fix took a few hours)

6.20am – The scale of the challenge is understood and a plan of action is agreed

6.25am – Communication begins with all our clients

6.27am – Service Delivery Managers (SDMs) contact all impacted clients (more than 50 of them, across 3 continents) with our communication and suggested remediation

6.30am – ROC (Risk Operations Centre, part of Safe XDR) finishes analysing impacted machines and maps the mission-critical machines at each client; communication with clients continues, with suggestions from our side

6.40am – Remediation starts across all clients (some faster, some slower)

8.00am–10.15am – Most clients' mission-critical systems are back in action; no major impact to clients

Next few hours – Remedial work continues on the rest of the impacted machines, with guidance from Performanta's ROC and prioritisation based on our Safe Platform


Reflections on Response: Lessons Learned and Next Steps

The CrowdStrike outage underlined that times of crisis demand agility and communication. We minimised the negative impact of the incident through a combination of proactive monitoring powered by our platform, fast action, leveraging collective wisdom, and maintaining open communication with our customers.


For instance, one customer mentioned how satisfied they were with our quick, contextual, and direct communication, which saved them from searching for answers elsewhere and gave them much-needed guidance and prioritisation in a moment of crisis.


We've learned a lot from this and are dedicated to improving our response tactics even further. Our approach proved effective and thorough, especially as many businesses worldwide faced challenges that lasted many hours or even days. In addition to our proactive stance on technical matters, and while being a Microsoft Elite Partner and a Microsoft-first organisation, it was very comforting to note that our clients were satisfied with the way we handled ourselves during this ordeal. That also saved CrowdStrike from significant revenue losses and embarrassment.


Our clients’ resilience and cooperation in the face of adversity deserve commendation. By staying ahead of events and remaining watchful, and by leveraging our Safe Platform’s real-time monitoring and our ROC for proactive data insights and advice, we shall continue to provide the unsurpassed services and support that our customers rely on, even in the face of unexpected challenges.


