Steel Aegis | IT Managed Services done right

While technology outages are catastrophic for many businesses, they can still have silver linings. In the case of the recent CrowdStrike outage, the silver lining comes in the form of key learnings businesses can gather and implement to strengthen or reinforce their existing disaster recovery strategies. In some cases, the takeaway may simply be how costly it is not to have these strategies in place.

This article will touch on top lessons learned that may be helpful in assessing the state of your current disaster recovery strategy – and planning for the next unlikely incident.

The Need For a Sound Disaster Recovery Strategy

Disaster recovery strategies are comprehensive plans designed to ensure an organization's IT infrastructure and operations can quickly recover and continue functioning after a catastrophic event – like the recent CrowdStrike outage.

Businesses without such a plan likely experienced significant challenges and disruptions: extended downtime because no procedures were in place to quickly restore systems and services, data loss where regular and reliable backups were missing, and resulting damage to customer trust and satisfaction.

By implementing a robust disaster recovery strategy, organizations can ensure they are well-prepared to handle disruptions, maintain operational resilience, and protect critical assets and data. The primary goal is to minimize downtime and data loss, ensuring business continuity and maintaining trust.

The Importance of Site Reliability Engineering

Site Reliability Engineering (SRE) applies the principles of software engineering to infrastructure and operations, with a strong focus on observability and automation. SRE gives you a way to get ahead of disaster recovery rather than react to it: as your observability and automation increase, your reliability increases. Comprehensive monitoring and tracking systems gather insights into application performance and behavior, and those insights are then harnessed in an automated way to ensure quick recovery from what could otherwise be a disastrous situation. By building and maintaining reliable, scalable, and efficient systems to handle routine tasks, incident response, and system recoveries, you reduce human error and achieve faster, more consistent resolutions.
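To make the idea of monitoring feeding automated recovery more concrete, here is a minimal sketch in Python. It is an illustration under stated assumptions, not a production design: the `Watchdog` class, its `check` and `recover` callables, and the default threshold of three consecutive failures are all hypothetical names and values chosen for this example.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Watchdog:
    """Run a recovery action after several consecutive failed health checks."""

    check: Callable[[], bool]    # hypothetical probe: returns True when healthy
    recover: Callable[[], None]  # hypothetical remediation: restart, restore, etc.
    threshold: int = 3           # consecutive failures tolerated before recovering
    failures: int = 0            # current streak of failed checks

    def tick(self) -> bool:
        """Run one health check; return True if recovery was triggered."""
        try:
            healthy = self.check()
        except Exception:
            healthy = False  # a probe that raises counts as a failed check
        if healthy:
            self.failures = 0
            return False
        self.failures += 1
        if self.failures < self.threshold:
            return False
        self.recover()
        self.failures = 0  # start counting again after remediation
        return True
```

In practice `tick()` would be called on a schedule, `check` would wrap a real probe such as an HTTP health endpoint, and `recover` would call whatever tooling restores the service. Requiring several consecutive failures is one simple way to avoid reacting to a single transient blip.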

Suppose you're administering the quintessential perfect storm of an infrastructure environment: a thousand or more AWS EC2 instances that cannot be part of an Auto Scaling group due to various constraints. In this hypothetical environment, you may choose to restore every instance from its snapshot. Determining the associated snapshot, creating a volume from it, replacing the volume attached to the instance, and bringing the box back up would be quite the task when repeated across the whole fleet.

So, how can a tremendous task like this be eased? Automation...

Any repetitive task should ultimately be addressed with automation, or tooling that's as close to automated as possible. In the given example, writing a CLI application that interacts with the AWS CLI would be a tremendous help. We're talking about reducing the man-hours involved to a fraction of what a manual restore would take.

This application could be written in any number of programming languages: Python, Go, Java, Rust, etc. Some of these languages lend themselves to the scenario better than others. Go, for example, comes close to Python in how quickly a tool can be written and deployed, while typically offering a substantial performance improvement over Python.
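As a rough illustration of the restore steps described above (find the snapshot, create a volume, swap it in, bring the instance back up), here is a minimal sketch in Python. It assumes `ec2` is a boto3 EC2 client (for example `boto3.client("ec2")`), that the root volume is the one being restored, and that the newest completed snapshot is the one you want. The function names `latest_snapshot_id` and `restore_root_volume` are hypothetical and chosen for this example. It is a sketch, not a hardened tool: it has no retries, pagination, or rollback.

```python
from operator import itemgetter


def latest_snapshot_id(ec2, volume_id):
    """Return the ID of the newest completed snapshot taken of volume_id."""
    resp = ec2.describe_snapshots(
        OwnerIds=["self"],
        Filters=[
            {"Name": "volume-id", "Values": [volume_id]},
            {"Name": "status", "Values": ["completed"]},
        ],
    )
    snapshots = resp["Snapshots"]
    if not snapshots:
        raise LookupError(f"no completed snapshot found for {volume_id}")
    return max(snapshots, key=itemgetter("StartTime"))["SnapshotId"]


def restore_root_volume(ec2, instance_id):
    """Swap an instance's root volume for one built from its latest snapshot."""
    reservations = ec2.describe_instances(InstanceIds=[instance_id])["Reservations"]
    instance = reservations[0]["Instances"][0]
    zone = instance["Placement"]["AvailabilityZone"]
    root_device = instance["RootDeviceName"]
    old_volume_id = next(
        m["Ebs"]["VolumeId"]
        for m in instance["BlockDeviceMappings"]
        if m["DeviceName"] == root_device
    )

    # 1. Determine the snapshot and create a replacement volume in the same AZ.
    snapshot_id = latest_snapshot_id(ec2, old_volume_id)
    new_volume_id = ec2.create_volume(
        SnapshotId=snapshot_id, AvailabilityZone=zone
    )["VolumeId"]
    ec2.get_waiter("volume_available").wait(VolumeIds=[new_volume_id])

    # 2. Stop the instance so the root volume can be detached.
    ec2.stop_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_stopped").wait(InstanceIds=[instance_id])

    # 3. Replace the attached volume with the restored one.
    ec2.detach_volume(VolumeId=old_volume_id, InstanceId=instance_id)
    ec2.get_waiter("volume_available").wait(VolumeIds=[old_volume_id])
    ec2.attach_volume(
        VolumeId=new_volume_id, InstanceId=instance_id, Device=root_device
    )
    ec2.get_waiter("volume_in_use").wait(VolumeIds=[new_volume_id])

    # 4. Bring the box back up.
    ec2.start_instances(InstanceIds=[instance_id])
    ec2.get_waiter("instance_running").wait(InstanceIds=[instance_id])
    return new_volume_id
```

Passing the client in as a parameter keeps the function easy to test against a stub. Across a large fleet, the same function could be mapped over a list of instance IDs, ideally with some concurrency. The old volume is deliberately left in place rather than deleted, so a bad restore can still be undone by hand.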

In another article, I’ll go into more detail about how such a tool could be designed and utilized.

The Danger of Providing One Company Environment-Wide Admin Access

Over the past few decades, various companies have established some form of hook into the environments of their customers. CrowdStrike's tooling holds admin-level control over every system that has the CrowdStrike agent installed on it. It's a bit scary to think of one company having that much power.

As we saw in the recent CrowdStrike-induced outage, a bad update being pushed took down Windows machines running CrowdStrike on a massive scale. Had this instead been a scenario where CrowdStrike was hacked and a bad actor obtained access to every customer environment with the agent installed, things could have been much worse.

While this is certainly a risk vs. reward consideration, there may be ways to alleviate some of the risk while maintaining the reward. As an example, there are other products that you can self-host. A self-hosted solution can be equally cost-effective while providing you with more control over when and how things are updated or changed.

At the end of the day, if you’re not implementing the above strategies, you’ll be left on the back foot when the next major outage, cyber breach, or other unpredictable disaster comes along.

Thomas Ryan

Connoisseur of Tech

Top Takeaways for Tech Teams in the Wake of CrowdStrike’s Failure