Grading on a curve: How to assess a pentest

Some people say that defenders need to be right every time but attackers only need to be right once. Those people are wrong. The reality is that breaches or intrusions are the result of threats or campaigns consisting of distinct sequences of actions. Catching any one of those actions might be enough to hinder an adversary, evict a threat, and avoid a breach or incident altogether.

Ideally, you want to have a depth of coverage that is capable of detecting adversary tactics and techniques as early and as redundantly as possible. Importantly though, you don’t have to detect every single adversary behavior to effectively detect a threat. You can detect and isolate a threat quickly based on a subset of adversary behaviors, and then investigate the incident in more depth during your response phase once the risk is entirely gone.

The same is true of red teaming, penetration testing, and adversary emulation. However, one common misconception we encounter pretty regularly is that any kind of testing is an exhaustive report card for your internal detection team, managed detection and response (MDR) vendor, or other security service providers. These sorts of tests sometimes bear the unrealistic (and unnecessary) expectation that every single atomic action generates an alert. The same problem plagues industry product evaluations like ATT&CK Evals (no offense to our friends at MITRE; this is just the evaluation we’re most familiar with). We’ve written about this before.

Most tests are unrealistic because they don’t look or behave like real threats—and your detection and response program should be optimized for real threats.

A brief detection manifesto: Detect to disrupt

It’s unnecessary to detect every single atomic action an adversary (or pentester) takes. We don’t attempt to do so and neither should you. Instead, we focus on detecting the critical behaviors across an intrusion or campaign that allow us to confidently determine whether activity is malicious, disrupt the threat, and ultimately mitigate the risk posed by it.

Real threats vs. emulated ones

Threats aren’t singular actions. They are multi-stage campaigns designed to achieve specific goals like data exfiltration or financial gain. Real adversaries execute a series of tactics, techniques, and procedures (TTPs) to achieve an objective. These TTPs are categorized and standardized by frameworks like MITRE ATT&CK®, which give defenders a common language for describing adversary behavior.

Whatever an adversary’s goal, there’s a required set of tactics they must accomplish to get there, and if you can disrupt the adversary sufficiently early in that sequence, the adversary fails and you succeed.

Consider an analogy: A bank robber’s goal is to steal money from the bank. They don’t just cut a wire and call it a day. They case the joint, disable security cameras, tell everyone to stay calm and be cool, crack open the vault, put the money in their duffel bags, and then run for their lives. Each of these steps represents an opportunity for the police to intervene and prevent the robbery. Likewise, a detection and response capability is engineered to identify the similarly critical steps of an intrusion and intervene as early as possible.

By contrast, penetration tests, red team exercises, and the various forms and flavors of adversary emulation often focus on demonstrating access, exploiting specific vulnerabilities, exercising certain techniques or procedures, or stress testing individual security controls. They frequently consist of isolated actions or small portions of a larger attack. They are often time-boxed, scope-constrained, and designed to find specific vulnerabilities or validate control effectiveness. While they simulate adversarial tactics, they rarely replicate the full adaptive, persistent, and evasive nature of a real adversary with an enduring objective.

The key difference here is intent and scope. A pentester might gain initial access via a novel exploit and then immediately stop or perform only very specific, non-TTP-generating actions to achieve their test objective. Their goal is often to prove that access can be achieved, not to achieve a full, multi-stage objective like data exfiltration or sustained network presence. A real attacker, on the other hand, would continue toward their objective, carrying out additional and clearly malicious behaviors, and we would almost certainly detect and disrupt the intrusion because Red Canary is engineered to counter the full, behavioral attack chain and disrupt real threats.

Breaking the chain, not every link

Our primary objective is to disrupt the adversary’s attack chain at the earliest feasible point. We do not aim to detect every single TTP within a chain to neutralize the threat. We identify the most indicative and actionable TTPs to break the attacker’s progression.

Let’s say an attack chain has five distinct sequential tactics:

Reconnaissance
Initial access
Privilege escalation
Lateral movement
Impact

If we detect and respond to any one of the first four tactics, we can effectively hinder the adversary, break the chain, and stop the threat before any material damage is done. Earlier is better because we always want to limit the amount of time that an adversary has access to a system. However, you don’t need to detect every distinct reconnaissance technique in order to successfully defend against this threat.
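To put rough numbers on the idea, here is a minimal sketch of the arithmetic. The per-tactic detection probabilities are invented for illustration (they are not measured values), and the calculation assumes detections at each stage are independent of one another, which real environments won't strictly satisfy.

```python
from math import prod

# Hypothetical per-tactic detection probabilities (illustrative only).
# "Impact" is left out because catching a threat there is too late
# to prevent damage.
detection_probability = {
    "reconnaissance": 0.10,
    "initial access": 0.40,
    "privilege escalation": 0.60,
    "lateral movement": 0.50,
}

def chance_of_disruption(probabilities):
    """Probability that at least one stage is detected, assuming each
    stage's detection is independent of the others."""
    miss_every_stage = prod(1.0 - p for p in probabilities)
    return 1.0 - miss_every_stage

overall = chance_of_disruption(detection_probability.values())

# Same calculation with the noisy reconnaissance detector removed.
without_recon = chance_of_disruption(
    p for tactic, p in detection_probability.items()
    if tactic != "reconnaissance"
)

print(f"At least one pre-impact detection: {overall:.1%}")
print(f"Without reconnaissance coverage:  {without_recon:.1%}")
```

With these made-up inputs, the chance of catching at least one pre-impact tactic is about 89%, and dropping reconnaissance coverage entirely only lowers it to 88%. The exact figures don't matter; the shape of the result is what supports the argument.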

Catching a mid-chain tactic that is a strong indicator of malicious intent is often more efficient and effective than developing comprehensive detection coverage for techniques that may generate a prohibitive number of false positives or can be easily modified by adversaries.

Having the ability to retroactively identify reconnaissance is great and security teams should want to be able to do that, but spinning your wheels to reliably detect that first tactic every time doesn’t necessarily lead to better outcomes. In fact, it might lead to worse outcomes if, for example, your analysts are getting lit up with alerts every time someone runs a network scan.

Beyond atomic events: Understanding malicious intent

Detection isn’t merely about flagging every single process execution, suspicious login, or cloud API call. That approach would generate an overwhelming volume of alerts, burying true threats in a mountain of false positives. Instead, the goal is to discern patterns of malicious intent from legitimate activity. We focus on the context and purpose behind actions. File writes, logins, and API calls are atomic events, and it’s hard to know whether any one of them is malicious or suspicious in isolation. The broader context that surrounds these atomic events is where you can start to discern malicious intent. For example:

If a file on an endpoint gets written to disk and then an unusual process executes that file and makes a network connection to an unusual external IP address, now you have a compelling pattern of malicious intent that’s certainly worth further scrutiny.
An identity alert for a login from a suspicious IP range might be ambiguous on its own, but certainly warrants further attention if the login is also happening from an unusual device or browser at an unexpected time.
Similarly, a user making a get-caller-identity API call in AWS is somewhat ambiguous on its own but is extremely suspicious if it follows identity activity like that described in the previous bullet.
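The first of these patterns (a file write, then an unusual process executing that file, then an unusual outbound connection) can be sketched as a simple correlation over atomic events. This is an illustrative toy, not how any particular product works: the `Event` fields, the `unusual` flag (assumed to be set upstream by some baselining step), and the five-minute window are all assumptions made for the example.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Event:
    timestamp: float       # seconds since an arbitrary reference point
    host: str
    kind: str              # "file_write", "process_start", "network_connect"
    path: str = ""         # file written, or image executed
    pid: int = 0           # process performing the action
    unusual: bool = False  # hypothetical flag set by upstream baselining

def find_write_exec_connect(events, window_seconds=300.0):
    """Return (write, run, connect) triples where a file is written, an
    unusual process executes that same file on the same host, and that
    process makes an unusual network connection, all within the window."""
    ordered = sorted(events, key=lambda e: e.timestamp)
    matches = []
    for i, write in enumerate(ordered):
        if write.kind != "file_write":
            continue
        for j in range(i + 1, len(ordered)):
            run = ordered[j]
            if run.timestamp - write.timestamp > window_seconds:
                break  # events are sorted, so nothing later can match
            if not (run.kind == "process_start" and run.unusual
                    and run.host == write.host and run.path == write.path):
                continue
            for conn in ordered[j + 1:]:
                if conn.timestamp - write.timestamp > window_seconds:
                    break
                if (conn.kind == "network_connect" and conn.unusual
                        and conn.host == run.host and conn.pid == run.pid):
                    matches.append((write, run, conn))
                    break
    return matches

events = [
    Event(0.0, "host-a", "file_write", path="/tmp/updater", pid=100),
    Event(5.0, "host-a", "process_start", path="/tmp/updater", pid=200, unusual=True),
    Event(9.0, "host-a", "network_connect", pid=200, unusual=True),
    Event(20.0, "host-b", "file_write", path="/tmp/report.csv", pid=300),
]
matches = find_write_exec_connect(events)
print(f"{len(matches)} correlated pattern(s) found")
```

The lone file write on `host-b` never produces a match, while the three related events on `host-a` do. None of the four events would justify an alert by itself.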

Not all TTPs are created equal

We prioritize developing detectors for TTPs that have the highest fidelity (i.e., least prone to false positives) and most directly indicate malicious intent and progression.

For example, TTPs like LSASS credential dumping, the creation of suspicious persistence mechanisms, or the use of legitimate tools for illegitimate purposes are strong indicators of malicious progression and provide reliable points of intervention. Focusing on these high-fidelity signals ensures that the alerts you receive are truly actionable and representative of genuine threats.
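One rough way to compare detectors on fidelity is to look at what fraction of each detector's alerts turned out to be real threats. The sketch below assumes fidelity can be approximated by that ratio (precision); the detector names and tallies are made up for illustration and do not describe any real detector.

```python
# Hypothetical alert review tallies per detector (made-up numbers).
review_tallies = {
    "lsass_memory_access": {"true_positive": 18, "false_positive": 2},
    "new_scheduled_task": {"true_positive": 12, "false_positive": 28},
    "internal_port_scan": {"true_positive": 3, "false_positive": 297},
}

def precision(tally):
    """Fraction of a detector's alerts that were confirmed threats."""
    total = tally["true_positive"] + tally["false_positive"]
    return tally["true_positive"] / total if total else 0.0

# Highest-fidelity detectors first.
ranked = sorted(
    review_tallies,
    key=lambda name: precision(review_tallies[name]),
    reverse=True,
)

for name in ranked:
    share = precision(review_tallies[name])
    print(f"{name}: {share:.0%} of alerts were confirmed threats")
```

In this toy data, the port scan detector fires 300 times to surface three real threats, which is the kind of trade-off that makes a noisy early-stage detector a poor place to spend analyst time.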

It’s okay to not detect every isolated TTP. A test might not warrant detection if it’s emulating a behavior that’s too atomic or otherwise infeasible to detect. It’s not that the behavior is undetectable; it’s that real attackers invariably gravitate toward chokepoints that are higher fidelity and more indicative of intent. These chokepoints are often highly prevalent techniques that defenders can prioritize to develop reliable detection coverage without generating overwhelming volumes of noise. Your security team didn’t fail the test just because they missed an isolated action.

Catching a threat (or test) sufficiently early, before any potential risk is realized, leads to the same outcome as detecting every component of the test.
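To illustrate the difference between grading a test on coverage and grading it on outcome, here is a minimal sketch that scores the five-tactic chain from earlier both ways. The `assess_test` function and its output fields are hypothetical, and the sketch assumes that a detection leads to a timely response.

```python
ATTACK_CHAIN = [
    "reconnaissance",
    "initial access",
    "privilege escalation",
    "lateral movement",
    "impact",
]

def assess_test(chain, detected):
    """Score a test two ways: the share of steps detected (coverage) and
    whether any step was detected before the final objective (outcome)."""
    detected = set(detected)
    coverage = sum(1 for step in chain if step in detected) / len(chain)
    first_hit = next(
        (i for i, step in enumerate(chain) if step in detected), None
    )
    # Only a detection before the last step counts as a disruption.
    disrupted = first_hit is not None and first_hit < len(chain) - 1
    return {
        "coverage": coverage,
        "first_detection": chain[first_hit] if first_hit is not None else None,
        "disrupted_before_objective": disrupted,
    }

result = assess_test(ATTACK_CHAIN, {"privilege escalation"})
print(result)
```

A report card view of this result says the team caught one step in five. The outcome view says the intrusion was stopped before impact, which is the result that matters.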

Ultimately, your detection and response capability has to strike a delicate balance where you’re able to reliably detect malicious and suspicious behavior without generating an overwhelming number of false positives.

Conclusion

Every security team should routinely run different kinds of tests to validate that their security controls are working as intended. However, there are a lot of genuinely good reasons you might not detect a test or a component of a test. By all means, review the output of your tests and strive to expand detection coverage in ways that scale well, but don’t beat yourself up when you detect a threat but miss an isolated technique or procedure. And certainly don’t open the alert floodgate and drown your analysts for the sake of perfect detection across every atomic indicator or behavior.

We’re rapidly headed in a direction where organizations will be able to let loose with Mythos-like pentest-o-bots, and you’ll never make any progress if you get wrapped around the axle prioritizing the detection of everything over the detection of what actually matters.


Brian Donohue • Mak Foss
