October 4, 2019 Security operations
Justin Schoenfeld

Expediting false positive identification with string comparison algorithms

Our cyber incident response team (CIRT) recently developed a technique for correlating substantially similar events in our analysis stream. Here’s how we did it, and why it matters:

The security industry faces as much of a technology problem as it does a people problem. For many years now, the industry has been plagued by a cybersecurity talent shortage, a problem made worse by high turnover rates in security operations centers (SOC).

While there are plenty of explanations for analyst churn and the skills gap, a large part of the problem arises from a collective inability to provide analysts with the context they need to effectively investigate the mountains of alerts they face on a daily basis. A complex—sometimes convoluted—suite of security tools providing actionable, meaningful, and outright accurate alerts is the stuff of dreams for most security teams.

Unfortunately, there probably won’t ever be a silver bullet security solution to solve all of our needs. However, we can empower our analysts with data and innovative solutions that improve analytical tradecraft and help SOCs maintain their sanity.

Suppress all the things?

We’ve detailed extensively how the Red Canary CIRT operates. In doing so, we revealed how we enable detection engineers with technologies like our suppression engine, which allows them to create one-to-one rules against false positive events. If a rule matches an event based on some criteria (explained fully here), the event is hidden from our event stream. This happens millions of times every week, freeing up countless thousands of hours of valuable analyst time.

Essentially, the whole idea of suppression is to prevent detection engineers from analyzing the same activity (i.e. event) more than once. Suppression helps overall efficiency because once a detection engineer investigates an event and identifies it as benign, suppression rules will keep the same behavior from reappearing in the analysis stream.

What if an event slips by suppression logic?

Unfortunately, our suppression engine only works on events that consist of multiple identical variables. For example, given the following two events, can you spot the difference between the event and the suppression rule?

Suppression rule:

Process MD5: 40945b8b68c48961b354dfede17ec04e

CLI: curl -L -b "oraclelicense=a" -O http://download.oracle.com/releases/2/3/10.0.2+13/19aef6/jdk-10.0.2_linux-x64_bin.rpm

New event command line:

Process MD5: 40945b8b68c48961b354dfede17ec04e

CLI: curl -L -b "oraclelicense=a" -O http://download.oracle.com/releases/2/4/10.0.2+13/19aef6/jdk-10.0.2_linux-x64_bin.rpm

Did you notice the difference between the two command lines? This event would bypass that suppression rule—even though, to the human eye, they are nearly identical. We observe hundreds of events like this each day. Given the volume of events we analyze and the number of suppression rules we create, human memory isn’t always an effective tool for recognizing small command line deviations like the one above. In many cases, our detection engineers would end up reinvestigating the above alert, potentially without realizing that a suppression rule already existed for a nearly and functionally identical event.


This would be a great place to offer our detection engineers a bit more context. Specifically, it would be tremendously useful to know that our CIRT had previously created a rule to suppress a substantially similar event. To accomplish this, we recently developed a process that automatically compares every suppression rule associated with a given detector against new events triggered by that detector, presenting our detection engineers with this information in the event block as they are analyzing events.

If a suppression rule is similar to the command line of an incoming event, we can preemptively analyze old rules and present additional context to the detection engineer. If we compare the above suppression rule and the above event’s command line, we see that these two command lines are roughly 95.5 percent similar. Presenting this information to our detection engineers gives them the confidence to know that we’ve analyzed this same activity before, and that the new event is likely a false positive.

Into the weeds

Examining string comparison algorithms

Of course, this sort of string comparison is easier said than done. There are many different string metric functions that will display the similarity of two strings. We focused on the following functions during our testing: Damerau-Levenshtein distance, Sorensen-Dice coefficient, Jaccard Similarity Index, and Jaro Winkler. Each function calculates distance in distinct ways and each could accomplish our tasks individually, but there must only be one.

The Sorensen-Dice Coefficient and Jaccard Similarity Index share an input in that they operate on tokens. The main difference is how they calculate the result of an intersection between two sets. One bonus is that they are very fast—requiring far less code to execute. The input tokens operate through sets of trigrams that are sequences of three consecutive characters taken from a string. The use of trigrams is beneficial because as the consecutive character counts grow in size, so too does the performance impact.

The Damerau-Levenshtein distance will calculate the number of operations (like substitution, transposition, deletions, and insertions) it takes to transform one string into another. The main downside to this function is that it’s much slower compared to the others and can severely impact performance at scale due the number of operations it calculates. Choosing a similarity threshold for this function is not easy since the resulting calculation is an integer and not percentage-based. As strings grow in size, so will the output value.

Jaro Winkler also calculates distance based on the distance two strings are from each other. It favors whether or not the start of one string is similar to the start of another. This characteristic makes this algorithm more favorable when comparing small strings. In our case, we could have strings upwards of 5,000 characters in length, which is not ideal for our use-case. Compared to Damerau-Levenshtein, it’s quick, but not nearly as fast as using a combo of trigrams and the Jaccard Similarity.

Ultimately, we went with trigrams and the Jaccard Similarity Index.

Now let’s talk string distance

In order to get that 95.5 percent number listed in the previous section, we needed to feed our two strings through a distance calculator and compare them. Using a combination of trigrams and the Jaccard Similarity Index, we’re basically creating trigram sets from strings (i.e., command lines).

An example trigram set for ‘the quick brown fox’ looks like:

["__t", "_th", "the", "he ", "e q", " qu", "qui", "uic", "ick", "ck ", "k b", " br", "bro", "row", "own", "wn ", "n f", " fo", "fox"]

And ‘the quick red fox’ turns into:

["__t", "_th", "the", "he ", "e q", " qu", "qui", "uic", "ick", "ck ", "k r", " re", "red", "ed ", "d f", " fo", "fox"]

Creating these deduplicated trigrams can be done by the following Ruby code:

def _gramify(string, num_chars)
gram_array = []
# front padded
padding = Array.new(num_chars - 1, '_')
character_array = padding + string.chars

character_array.each_cons(num_chars) {|v| gram_array << v.join('')}

The num_chars variable tells the function how large you want each item in the set to be. Calling the function like: s1_set = _gramfiy(“the quick brown fox”, 3) you will be output with a set of trigrams as seen above.

The Jaccard Similarity Index accepts two sets as input. We take the intersection of the two, which outputs a new set containing items found in each set. Our resulting intersection looks like this:

["__t", "_th", "the", "he ", "e q", " qu", "qui", "uic", "ick", "ck ", " fo", "fox"]

After performing our calculations we see the two strings are 50 percent similar. This number is generated by taking the size of the largest set of the two, which was our first set with a size of 19. Then we take the size of the intersection set, which was 12, and divide by 19.

This can be generated by the following Ruby code:

def jaccard_value(s1_set, s2_set)
max_count = (s1_set.length | s2_set.length).max
similarity_count = (s1_set & s2_set).size

similarity_count.to_f / max_count


Ultimately, having the ability to identify events that are substantially similar—albeit not quite similar enough to automatically suppress them outright—empowers our detection engineers to vastly expedite the process of rooting out false positive events. In turn, spending less time on false positives allows the Red Canary CIRT to spend more time investigating truly malicious events and detecting actual threats.

While we hope that these descriptions and the code and tools shared here can enable other security teams to improve analytical tradecraft and relieve the burdens associated with false positives, suppression is but one use-case for this string comparison method.

In the coming weeks and months, we hope to publish a second version of this blog exploring how this technology is helping us combat malware obfuscation techniques.


Breaking down the modern security operations center


It’s time for better cloud workload security


Cloud workload security: 7 reasons why it’s complicated


Onboarding log: My first 30 days at Red Canary

Subscribe to our blog