I recently sat down with a newer member of the Red Canary team to review the way we talk about our managed detection and response (MDR) product. For people new to information security (as I was seven years ago), it can be completely overwhelming to try to understand the fundamentals and nuances associated with our seemingly simple job: keep bad guys out so organizations can make their greatest impact.
One of the most common questions we hear from people learning about our product is why we believe our endpoint-centric approach to detection and response is superior to other approaches. It’s certainly not the easiest way. Here’s a glimpse of what goes into it:
- We had to build a detection engine from scratch capable of handling massive data volumes (petabytes per day, as of this writing) without costing us so much that we can’t pay our bills.
- We continually invest huge amounts of research and development (R&D) money into developing our custom analyst workbench so our detection engineers can efficiently investigate threats without burning out on false positives.
- We’ve built internal testing and open source tools like Atomic Red Team to make sure our detection is resilient.
So why bother with all this when the rest of the MDR market takes an approach of “send us your alerts and we’ll look at them”? Why does processing this Mt. Everest of telemetry data every day lead to better outcomes than others can deliver?
At the risk of alienating most of the subscribers to this blog, I’m going to attempt a sports analogy to help answer some of these questions. Even better, it’s a sport I know relatively little about (unlike my previous sports analogy about golf). We’re going to talk about car racing, specifically Formula 1 racing. So buckle up (get it?) and let’s get started.
Off to the races
If you’ve never had a chance to go to a Formula 1 race, I highly suggest you do it. A few years ago, I had the opportunity to go down to Tampa to watch George Kurtz of CrowdStrike race sports cars and then stuck around for the IndyCar race afterwards. The speed at which these cars move is mind-blowing, and the margin for error is microscopic. One loose wheel nut, one late braking point, one turn missed by an inch, and the driver is dead. Not “out of the race and sad” dead, but dead dead.
So for the sake of our somewhat hyperbolic analogy, let’s equate the driver dying in a crash with an organization getting hit with a breach: a business-crippling, stock-price-destroying, company-ending data breach. To add a little depth, let’s equate a car part breaking without causing a crash with a more typical security incident, one that an organization can recover from and keep going.
Anyone who has worked in security for a business knows there is a tension between safety and results. For simplicity, let’s break this into two teams: the security team is responsible for defending the organization and its data, and the business team is responsible for driving profit and measurable results. Using our racing analogy, the security team is trying to protect the driver, while the business team wants to succeed in the race. They both have to work closely together to balance risk. If they are too safe, they’ll never win races. If they ignore safety, they might kill their driver.
Their priority list might look something like this:
| Goal | Who's responsible | Business equivalent |
|---|---|---|
| Keep the driver alive | | Prevent catastrophic data breach |
| Prevent injury to the driver | | Prevent security incident |
| Finish the race | | Stay in business |
| Win the race | | Grow and flourish |
These priorities may not be fixed for all races. For instance, if our team really needs a win, we might prioritize winning over finishing. Doing so would mean we might push the car harder and risk it breaking in exchange for a better chance to win if it survives. That might also be true in our security programs. Sometimes we might have to accept more risk of an incident because it’s more important for the business to move fast to beat a competitor. The driver might even be willing to accept higher risk of getting hurt if it means they have a better chance to win. Some businesses operate this way, sacrificing security if it means they can grow faster.
Building a winning safety program
Let’s focus our attention on the security team. We know our mission and priorities, so now we need to build a program to get it done. A modern Formula 1 car has more than 5,000 parts. Any one of them could be the cause of a devastating crash, but some are obviously much more likely than others. For instance, a fundamental malfunction in the braking system is more likely to send the car into a wall at 200 mph than a slightly damaged rear wing is.
So as a security team, our first project is likely to be getting a list of every part and assessing:
- Ways it can fail
- The impact of it failing
- The likelihood that it will fail
Armed with this information, we’d work through each possible failure and determine what we could do to prevent it. That might include testing a component extensively to make sure the vendor who made it built it right. We’ll also want to do some inspection to make sure the component is installed correctly. If it is something the driver is going to interact with directly, such as a safety harness, we need to train them on how to use it.
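To make the assessment concrete, here is a minimal sketch of how a team might rank failure modes by combining impact and likelihood. The `FailureMode` class, the 1-to-5 impact scale, and every number below are assumptions made up for illustration, not real racing or Red Canary data.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class FailureMode:
    part: str
    description: str   # one way the part can fail
    impact: int        # hypothetical scale: 1 (cosmetic) to 5 (driver's life at risk)
    likelihood: float  # estimated chance of failure during a race, 0.0 to 1.0

    @property
    def risk(self) -> float:
        # Simple impact-times-likelihood score; real programs may weight these differently.
        return self.impact * self.likelihood


def prioritize(modes: List[FailureMode]) -> List[FailureMode]:
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Illustrative values only.
modes = [
    FailureMode("rear wing", "minor damage reduces downforce", impact=2, likelihood=0.10),
    FailureMode("brakes", "hydraulic failure", impact=5, likelihood=0.05),
    FailureMode("safety harness", "buckle comes loose", impact=5, likelihood=0.01),
]

for mode in prioritize(modes):
    print(f"{mode.part:15} risk={mode.risk:.2f}")
```

With these made-up numbers the brakes land at the top of the list, which is where we would spend our testing and inspection time first.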
With preventative controls in place, it’s now time to shift our focus to detecting issues during the race. If a safety harness comes loose during the race, the driver could go flying. If a tire overheats and fails, the car could go out of control and hit a wall. It’s our job to monitor the car for safety and work with the business team to fix any condition that arises.
To do this, we’re going to need sensors. A Formula 1 car has hundreds of sensors to capture a wide range of performance and safety information. One approach to designing these sensors is for them to sense specific conditions and send alerts when those conditions occur. We’ll call this “on-device” alerting. The sensor does nothing until an alert condition occurs, then it sends a message saying what happened.
While this works for simple conditions like a seatbelt coming loose, there are some inherent flaws:
- We have to pre-program all the alert conditions into the device ahead of time. It may or may not be possible to push updates to our alert conditions during the race. The alerts are rigid, which is problematic in a dynamic race environment.
- The sensor has to be smart enough to know when to send an alert and when not to. This complexity makes the sensor more prone to failure.
- By sending data only when an alert has occurred, we can’t see what led up to it. For example, did the tire temperature spike or gradually rise to the alert threshold?
- We likely can’t add alerts during the race if we need to. Say bad weather changes the threshold where things can go wrong, but we’re stuck with a pre-programmed alert.
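The flaws above are easier to see in a small sketch of on-device alerting. The `OnDeviceAlertSensor` class, the 120 C threshold, and the readings are all hypothetical; the point is only that the sensor transmits nothing until its fixed condition is met.

```python
from typing import Optional


class OnDeviceAlertSensor:
    """Hypothetical sensor that stays silent until a fixed threshold is crossed."""

    def __init__(self, name: str, threshold_c: float):
        self.name = name
        self.threshold_c = threshold_c  # programmed before the race, fixed afterwards

    def read(self, temperature_c: float) -> Optional[str]:
        if temperature_c >= self.threshold_c:
            return f"ALERT {self.name}: {temperature_c:.0f} C >= {self.threshold_c:.0f} C"
        return None  # nothing is sent, so this reading is lost to the team


sensor = OnDeviceAlertSensor("front-left tire", threshold_c=120.0)
readings = [95.0, 101.0, 108.0, 114.0, 121.0]  # illustrative temperatures

# Only the messages that the sensor chose to send ever reach the pit wall.
messages = [m for m in (sensor.read(t) for t in readings) if m is not None]
print(messages)
```

Five readings went in, but the team receives a single message. The four readings that led up to it never leave the car.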
So what is the alternative? We use sensors that are “dumb” data collectors that reliably and continuously send information back to a central data collector. We refer to this data as telemetry, and using it to detect issues comes with some huge benefits:
- It doesn’t require updating the end device to adjust alerting conditions.
- We get context on how the situation came to be, which makes for better alerting.
- Our devices and sensors can be far less complex.
- We have the ability to do forensics in the event of an issue we didn’t prevent.
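As a rough sketch of the telemetry approach, the collector below stores every reading and evaluates detection rules centrally, so a rule can be swapped mid-race without touching the sensor. `TelemetryCollector`, the rule name, and the thresholds are assumptions for illustration and do not describe any real product's API.

```python
from typing import Callable, Dict, List

Reading = Dict[str, float]
Rule = Callable[[List[Reading]], bool]  # a rule sees the full history so far


class TelemetryCollector:
    """Hypothetical central collector: keeps all readings, runs rules on arrival."""

    def __init__(self) -> None:
        self.history: List[Reading] = []
        self.rules: Dict[str, Rule] = {}

    def set_rule(self, name: str, rule: Rule) -> None:
        # Adding or replacing a rule happens here, not on the device.
        self.rules[name] = rule

    def ingest(self, reading: Reading) -> List[str]:
        self.history.append(reading)
        return [name for name, rule in self.rules.items() if rule(self.history)]


collector = TelemetryCollector()
collector.set_rule("tire_overheat", lambda h: h[-1]["tire_temp_c"] >= 120.0)
before = collector.ingest({"t": 0.0, "tire_temp_c": 112.0})

# Conditions change mid-race, so we adjust the threshold centrally.
collector.set_rule("tire_overheat", lambda h: h[-1]["tire_temp_c"] >= 110.0)
after = collector.ingest({"t": 1.0, "tire_temp_c": 112.0})

print(before, after)
```

The same 112 C reading is quiet under the first rule and fires under the second, and both readings remain in `collector.history` for later forensics.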
To illustrate the difference in approach, let’s return to our tire blowout example. Using telemetry collection, we receive a continuous data stream describing the state of the tire: the temperature, the wear level, the air pressure. Some measurements on F1 cars are taken thousands of times a second. This results in a crazy amount of data: 35 megabytes of raw telemetry per 2-minute lap. Why bother with that? If we know the conditions that lead to a blowout, why not just program our sensors to tell us when those conditions occur? If the condition never occurs, we never receive a single byte of data, much less the gigabytes of raw telemetry.
There are many reasons, but the most important are:
- Without the telemetry, we can’t know what led to the alert condition. Did the temperature spike rapidly or was it a slow build-up? This context is critical to determining the right response.
- We can’t assume we know every condition ahead of time. Having telemetry gives us the flexibility to rapidly implement new detection criteria without making changes on the device.
- If our alerts don’t fire and a tire blows out anyway, we have no ability to figure out why and prevent it in the future. With telemetry, we can rewind the clock and see why we missed it.
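The first and third reasons can be sketched with stored telemetry: one helper replays a threshold over history, and another checks how quickly the temperature was rising when the threshold was crossed. The function names, the 10-second lookback window, the 2 C-per-second spike cutoff, and the sample data are all assumptions chosen for illustration.

```python
from typing import List, Optional, Tuple

Sample = Tuple[float, float]  # (seconds since start, tire temperature in C)


def first_crossing(history: List[Sample], threshold_c: float) -> Optional[int]:
    """Index of the first sample at or above the threshold, or None if never crossed."""
    for i, (_, temp) in enumerate(history):
        if temp >= threshold_c:
            return i
    return None


def classify_rise(history: List[Sample], threshold_c: float,
                  window_s: float = 10.0, spike_c_per_s: float = 2.0) -> Optional[str]:
    """Label the lead-up to a threshold crossing as a 'spike' or a 'gradual' rise."""
    idx = first_crossing(history, threshold_c)
    if idx is None:
        return None
    t_cross, temp_cross = history[idx]
    # Earliest sample that falls inside the lookback window before the crossing.
    start = next(s for s in history[: idx + 1] if t_cross - s[0] <= window_s)
    elapsed = t_cross - start[0]
    if elapsed == 0:
        return "spike"  # crossed on the first sample in the window; no lead-up recorded
    rate = (temp_cross - start[1]) / elapsed
    return "spike" if rate >= spike_c_per_s else "gradual"


# Illustrative histories: a slow climb and a sudden jump.
slow = [(float(t), 100.0 + 0.5 * t) for t in range(0, 41, 5)]
fast = [(0.0, 100.0), (5.0, 101.0), (10.0, 102.0), (12.0, 121.0)]

print(classify_rise(slow, 120.0), classify_rise(fast, 120.0))

# "Rewinding the clock": replay a stricter threshold over data we already hold.
print(first_crossing(slow, 110.0))
```

Neither call is possible with alert-only sensors, because the samples before the crossing would never have been transmitted.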
Similarly, when we monitor by gathering data from outside of the device (for example, using binoculars from the top of a nearby building), we’re really just guessing at what is actually happening in the car. This is the equivalent of using network data when monitoring a corporate environment. We can see from the outside that an endpoint might be doing something bad through the network connections it is making, but we don’t really know until we go observe what is happening on the device.
This isn’t to entirely discount the value of other information. An IDS alert about connections to evil domains may well lead you to discovering an active compromise. Every security program is a bit different in terms of tolerances, environmental conditions, etc. We believe endpoint telemetry is the most valuable asset because it tells you exactly what is happening at the point where the attackers operate.
I hope this analogy helps to illuminate why Red Canary believes so strongly that you can’t achieve great detection and response without collecting and analyzing telemetry from where the action is: endpoints. It’s certainly not the easy or conventional approach, but it’s the most effective way to keep our driver alive—er, protect our organizations from security incidents.