These priorities may not be fixed for all races. For instance, if our team really needs a win, we might prioritize winning over finishing. Doing so would mean we might push the car harder and risk it breaking in exchange for a better chance to win if it survives. That might also be true in our security programs. Sometimes we might have to accept more risk of an incident because it’s more important for the business to move fast to beat a competitor. The driver might even be willing to accept higher risk of getting hurt if it means they have a better chance to win. Some businesses operate this way, sacrificing security if it means they can grow faster.
Building a winning safety program
Let’s focus our attention on the security team. We know our mission and priorities; now we need to build a program that delivers on them. A modern Formula 1 car has more than 5,000 parts. Any one of them could be the cause of a devastating crash, but some are obviously much more likely culprits than others. For instance, a fundamental malfunction in the braking system is more likely to lead to slamming into a wall at 200 mph than a slightly damaged rear wing is.
So as a security team, our first project is likely to get a list of every part and do an assessment of:
- Ways it can fail
- The impact of it failing
- The likelihood that it will fail
Armed with this information, we’d work through each possible failure and determine what we could do to prevent it. That might include testing a component extensively to make sure the vendor who made it built it right. We’ll also want to do some inspection to make sure the component is installed correctly. If it is something the driver is going to interact with directly, such as a safety harness, we need to train them on how to use it.
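As a rough illustration of that assessment, here is a minimal sketch that scores each failure mode and sorts the list so the riskiest parts get attention first. The 1-to-5 scales, the "impact times likelihood" score, and the sample numbers are all assumptions made up for this example, not figures from a real team.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    part: str
    failure: str
    impact: int      # hypothetical scale: 1 (minor) to 5 (catastrophic)
    likelihood: int  # hypothetical scale: 1 (rare) to 5 (frequent)

    @property
    def risk(self) -> int:
        # Simple assumed scoring: impact multiplied by likelihood.
        return self.impact * self.likelihood


def rank_by_risk(modes):
    """Return failure modes ordered from highest to lowest risk score."""
    return sorted(modes, key=lambda m: m.risk, reverse=True)


# Illustrative entries only; the scores are invented for the example.
modes = [
    FailureMode("rear wing", "slight damage", impact=2, likelihood=3),
    FailureMode("braking system", "fundamental malfunction", impact=5, likelihood=2),
    FailureMode("safety harness", "comes loose", impact=5, likelihood=1),
]

ranked = rank_by_risk(modes)
for m in ranked:
    print(f"{m.risk:>2}  {m.part}: {m.failure}")
```

A real program would likely use a richer model, but even a crude ranking like this tells us where to spend our testing, inspection, and training effort first.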
With preventative controls in place, it’s now time to shift our focus to detecting issues during the race. If a safety harness comes loose during the race, the driver could go flying. If a tire overheats and fails, the car could go out of control and hit a wall. It’s our job to monitor the car for safety and work with the business team to fix any condition that arises.
To do this, we’re going to need sensors. A Formula 1 car has hundreds of sensors to capture a wide range of performance and safety information. One approach to designing these sensors is for them to sense specific conditions and send alerts when those conditions occur. We’ll call this “on-device” alerting. The sensor does nothing until an alert condition occurs, then it sends a message saying what happened.
While this works for simple conditions like a seatbelt coming loose, there are some inherent flaws:
- We have to pre-program all the alert conditions into the device ahead of time. It may or may not be possible to push updates to our alert conditions during the race. The alerts are rigid, which is problematic in a dynamic race environment.
- The sensor has to be smart enough to know when to send an alert and when not to. This complexity makes the sensor more prone to failure.
- By sending data only when an alert has occurred, we can’t see what led up to it. For example, did the tire temperature spike or gradually rise to the alert threshold?
- We likely can’t add or change alerts mid-race. If bad weather changes the threshold at which things go wrong, we’re stuck with the pre-programmed alert.
So what is the alternative? We use sensors that are “dumb” data collectors that reliably and continuously send information back to a central data collector. We refer to this data as telemetry, and using it to detect issues comes with some huge benefits:
- It doesn’t require updating the end device to adjust alerting conditions.
- We get context on how a situation developed, which makes for better alerting.
- Our devices and sensors can be far less complex.
- We have the ability to do forensics in the event of an issue we didn’t prevent.
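To make the first and last of these benefits concrete, here is a minimal sketch of a central collector. The sensor just sends readings; the alert rules live in the collector and can be swapped mid-race. The class name `TelemetryCollector`, the `tire_temp_c` field, and the temperature thresholds are hypothetical and chosen only for illustration.

```python
from collections import deque


class TelemetryCollector:
    """Central store that keeps recent readings and evaluates rules off-device."""

    def __init__(self, history_size=1000):
        # Bounded history so we can look back at what led up to an alert.
        self.history = deque(maxlen=history_size)
        self.rules = {}  # rule name -> predicate over a single reading

    def set_rule(self, name, predicate):
        """Add or replace an alert rule without touching any sensor."""
        self.rules[name] = predicate

    def ingest(self, reading):
        """Store a reading and return the names of any rules it triggers."""
        self.history.append(reading)
        return [name for name, pred in self.rules.items() if pred(reading)]


collector = TelemetryCollector()
collector.set_rule("tire_overheat", lambda r: r["tire_temp_c"] > 120)  # assumed threshold

first_alerts = collector.ingest({"t": 0.0, "tire_temp_c": 115})

# Conditions change mid-race: adjust the threshold centrally (assumed value).
collector.set_rule("tire_overheat", lambda r: r["tire_temp_c"] > 110)
second_alerts = collector.ingest({"t": 0.1, "tire_temp_c": 115})

print(first_alerts, second_alerts)
```

The same reading produces no alert under the first rule and an alert under the second, and the "dumb" sensor never had to change. The retained history is what makes after-the-fact forensics possible.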
To illustrate the difference in approach, let’s return to our tire blowout example. Using telemetry collection, we will receive a continuous data stream of the state of the tire: the temperature, the wear level, the air pressure. Some measurements on F1 cars are taken thousands of times a second. This results in a crazy amount of data: 35 megabytes of raw telemetry per 2-minute lap. Why bother with that? If we know the conditions that lead to a blowout, why not just program our sensors to tell us when those conditions occur? If the condition never occurs, we never get a single byte of data, much less the gigabytes we get through the raw telemetry.
There are many reasons, but the most important are:
- Without the telemetry, we can’t know what led to the alert condition. Did the temperature spike rapidly or was it a slow build-up? This context is critical to determining the right response.
- We can’t assume we know every condition ahead of time. Having telemetry gives us the flexibility to rapidly implement new detection criteria without making changes on the device.
- If our alerts don’t fire and a tire blows out anyway, we have no ability to figure out why and prevent it in the future. With telemetry, we can rewind the clock and see why we missed it.
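The first point can be sketched in a few lines. Given the stored temperature history, we can look back from the moment the threshold was crossed and ask how fast we got there. The function name `classify_rise`, the 5-second look-back window, and the 15-degree cutoff are assumptions for this example, not real race engineering values.

```python
def classify_rise(samples, threshold, window_s=5.0, spike_delta=15.0):
    """Classify how a reading reached the alert threshold.

    samples: list of (timestamp_seconds, temperature) in time order.
    Returns None if the threshold was never reached, otherwise
    "spike" or "gradual" based on the rise over the look-back window.
    """
    crossing = next(((t, temp) for t, temp in samples if temp >= threshold), None)
    if crossing is None:
        return None
    t_cross, temp_cross = crossing

    # Baseline: latest sample at least window_s before the crossing,
    # falling back to the first sample if history is shorter than that.
    baseline = samples[0][1]
    for t, temp in samples:
        if t <= t_cross - window_s:
            baseline = temp
        else:
            break

    return "spike" if temp_cross - baseline >= spike_delta else "gradual"


# Synthetic histories: a steady climb and a sudden jump (made-up data).
gradual = [(float(t), 100.0 + t) for t in range(21)]
spike = [(float(t), 100.0) for t in range(18)] + [
    (18.0, 110.0), (19.0, 118.0), (20.0, 125.0),
]

print(classify_rise(gradual, threshold=120.0))
print(classify_rise(spike, threshold=120.0))
```

An on-device alert would report the same thing in both cases: the threshold was hit. Only the telemetry history lets us tell the two situations apart and respond accordingly.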
Similarly, when we monitor by gathering data from outside the device (for example, using binoculars from the top of a nearby building), we’re really just guessing at what might actually be happening in the car. This is the equivalent of using network data when monitoring a corporate environment. We can see from the outside that an endpoint might be doing something bad through the network connections it is making, but we don’t really know until we go observe what is happening on the device.
This isn’t to entirely discount the value of other information. An IDS alert about connections to evil domains may well lead you to discovering an active compromise. Every security program is a bit different in terms of tolerances, environmental conditions, etc. We believe endpoint telemetry is the most valuable asset because it tells you exactly what is happening at the point where the attackers operate.
I hope this analogy helps to illuminate why Red Canary believes so strongly that you can’t achieve great detection and response without collecting and analyzing telemetry from where the action is: endpoints. It’s certainly not the easy or conventional approach, but it’s the most effective way to keep our driver alive—er, protect our organizations from security incidents.