
Shape shifting: How to wrangle unpredictable data at scale


Here’s how Red Canary’s engineering team transforms petabytes of data from dozens of third-party vendors into a consistent and readable format for threat detection

Ian Woodley

Ingesting and transforming data is easy. Ingesting and transforming nearly a petabyte of data per day in a reliable, predictable, and cost-effective way is quite hard.

Red Canary ingests data from dozens of third-party providers. Once ingested, this data passes through many stages of transformation before arriving in the Red Canary Portal. Here is a simplified visualization of this pipeline, detailed further in our previous post:

Red Canary's data transforming process

Each stage of this pipeline produces a new file to be processed by the next stage; this is critical for a couple of reasons:

  • Observability: We can see exactly how each stage of the pipeline is producing its output, given its input.
  • Replayability: If an issue is detected at any stage of the pipeline, data can be replayed starting at that stage once the issue is remediated (see the sketch after this list).
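To make that replayability concrete, here's a minimal sketch of what replaying a stage could look like, assuming a layout with one S3 prefix per stage and a hypothetical enqueue callback. This is an illustration of the idea, not Red Canary's actual tooling:

```python
import boto3

s3 = boto3.client("s3")

def replay_stage(bucket: str, stage_prefix: str, enqueue) -> None:
    """Re-enqueue every file a stage has produced so the next stage
    can reprocess it. `enqueue` is a hypothetical callback that hands
    an S3 key to the next stage's work queue."""
    paginator = s3.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=stage_prefix):
        for obj in page.get("Contents", []):
            enqueue(obj["Key"])
```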

So, you may be wondering, what’s the problem?

Big data challenges

Each third-party provider exports their data in different ways. Some common categories we’ve encountered:

  • Tiny files containing a single event, produced frequently
    • Generally no more than a few kilobytes in size
  • Smaller files containing many events, produced frequently
    • Sizes may range from several kilobytes to several megabytes
  • Larger files containing many events, produced less frequently
    • Sizes may range from several megabytes to several gigabytes

Inconsistencies can occur as well; sometimes a provider that generally produces smaller files will suddenly produce a massive file.

Without predictable, reasonably sized files, it's difficult to allocate resources appropriately for each transformation stage. If files are too small, the time required to establish S3 connections dwarfs the time it takes to actually transform each file. If files are too large, fewer S3 connections need to be established, but transformation time and memory usage increase significantly.


Both of these scenarios require more aggressive scaling strategies to manage, which can quickly inflate infrastructure costs and multiply operational headaches. Generally speaking, applications have to account for the worst-case scenario (massive files), meaning more processing power and memory are allocated to every stage in the transformation pipeline, even if most of those resources are rarely used.

Red Canary addresses these challenges by shaping our data prior to any transformation.

An intro to data shaping

Let’s introduce a “Shape” step prior to transformation:

Red Canary's data transforming process with a Shaping phase added

Shaping describes:

  • Batching several smaller files into one larger file
  • Splitting one large file into several smaller files

We use newline-delimited JSON files throughout our pipeline to simplify this shaping; each line in a given file represents a single JSON event we’ve received from a provider. As such, any number of lines can be extracted from a given file to batch or split appropriately.
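As a minimal illustration (not the production shaper), batching and splitting newline-delimited JSON reduce to simple line operations:

```python
def batch(payloads: list[str]) -> str:
    """Concatenate several NDJSON payloads into one larger payload.
    Each line is a complete JSON event, so no parsing is required."""
    return "".join(p if p.endswith("\n") else p + "\n" for p in payloads)

def split(payload: str, max_lines: int) -> list[str]:
    """Split one large NDJSON payload into chunks of at most
    `max_lines` events, never cutting an event in half."""
    lines = payload.splitlines(keepends=True)
    return ["".join(lines[i:i + max_lines])
            for i in range(0, len(lines), max_lines)]
```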

Red Canary uses a homegrown shaper application to enforce consistent file sizes across all transform stages. The shaper is composed of two components: a producer and a consumer.

The producer records metadata about incoming files in its database, including the customer name and file size. It routinely queries this database, aggregating small-file references into batch instructions once one of the following criteria is met:

  • The desired batched file size has been reached.
  • The maximum “time to batch” has been reached.
    • This limit ensures the producer does not wait too long to produce results, preventing delays in threat detection.

If any single file surpasses the configured size limit, a split instruction is produced for that file instead.
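Sketched in Python, the producer's decision logic might look something like this; the record fields and thresholds are illustrative, not Red Canary's actual schema or configuration:

```python
from dataclasses import dataclass

@dataclass
class FileRecord:
    key: str            # S3 key of the incoming file
    size_bytes: int
    observed_at: float  # epoch seconds when the producer saw it

# Illustrative thresholds, not Red Canary's actual configuration
TARGET_BATCH_BYTES = 64 * 1024 * 1024   # desired batched file size
MAX_FILE_BYTES = 256 * 1024 * 1024      # beyond this, split instead
MAX_TIME_TO_BATCH = 300.0               # seconds before flushing anyway

def plan(pending: list[FileRecord], now: float) -> list[tuple[str, list[str]]]:
    """Turn pending file records into ("batch", keys) or
    ("split", [key]) instructions for the consumer."""
    instructions: list[tuple[str, list[str]]] = []
    batch: list[str] = []
    batch_bytes = 0
    oldest: float | None = None
    for rec in sorted(pending, key=lambda r: r.observed_at):
        if rec.size_bytes > MAX_FILE_BYTES:
            # Oversized files get a split instruction of their own
            instructions.append(("split", [rec.key]))
            continue
        batch.append(rec.key)
        batch_bytes += rec.size_bytes
        if oldest is None:
            oldest = rec.observed_at
        # Flush once the target size or the time-to-batch limit is hit
        if batch_bytes >= TARGET_BATCH_BYTES or now - oldest >= MAX_TIME_TO_BATCH:
            instructions.append(("batch", batch))
            batch, batch_bytes, oldest = [], 0, None
    # Anything left over waits in the database for the next polling cycle
    return instructions
```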

The consumer simply accepts instructions from the producer. For batch instructions, the consumer downloads the referenced files, aggregates them, and uploads that aggregated file for transformation. For split instructions, the consumer downloads the referenced file, splits it into several smaller files, and uploads each split file for transformation.

Chart of the data shaper architecture in the Red Canary Portal
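A stripped-down consumer might look like the following sketch, reusing the `split` helper from the NDJSON example above (the bucket names, key layout, and split size are all illustrative assumptions):

```python
import boto3

s3 = boto3.client("s3")
SOURCE_BUCKET = "incoming-files"   # illustrative bucket names
OUTPUT_BUCKET = "shaped-files"
MAX_LINES_PER_SPLIT = 500_000      # illustrative split size

def handle(instruction: tuple[str, list[str]]) -> None:
    kind, keys = instruction
    if kind == "batch":
        # Download each referenced NDJSON file and concatenate the lines
        body = "".join(
            s3.get_object(Bucket=SOURCE_BUCKET, Key=key)["Body"].read().decode()
            for key in keys
        )
        s3.put_object(Bucket=OUTPUT_BUCKET, Key=f"batched/{keys[0]}",
                      Body=body.encode())
    elif kind == "split":
        payload = s3.get_object(Bucket=SOURCE_BUCKET,
                                Key=keys[0])["Body"].read().decode()
        # `split` is the line-chunking helper from the NDJSON sketch above
        for i, chunk in enumerate(split(payload, MAX_LINES_PER_SPLIT)):
            s3.put_object(Bucket=OUTPUT_BUCKET,
                          Key=f"split/{keys[0]}.{i}",
                          Body=chunk.encode())
```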

This effectively addresses files containing several events, but it's not a great solution for managing individual events arriving from data streams. Uploading individual events to S3 would drive up storage costs, increase the risk of hitting rate limits, and slow down overall processing due to the overhead of establishing an S3 connection per event.

Enter stage left: Amazon Data Firehose

Turning on the Firehose

Amazon Data Firehose was designed for this exact use case: reliably aggregating individual events from data streams at scale.

For integrations utilizing such data streams, we route their data to Firehose in place of the shaper:

How Red Canary shapes data using Amazon Data Firehose

We use dynamic partitioning to aggregate and isolate incoming data per stream.
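As a rough sketch of what enabling that could look like with boto3 (the stream name, bucket, role, and `stream_id` partition field are all illustrative assumptions), a JQ expression extracts a partition key from each event, and the delivery stream writes each stream's events under its own S3 prefix:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="example-events",  # illustrative name
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-role",  # illustrative
        "BucketARN": "arn:aws:s3:::shaped-files",                   # illustrative
        # Dynamic partitioning requires a buffer of at least 64 MB
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "MetadataExtraction",
                "Parameters": [
                    # Pull the partition key out of each JSON event
                    {"ParameterName": "MetadataExtractionQuery",
                     "ParameterValue": "{stream_id: .stream_id}"},
                    {"ParameterName": "JsonParsingEngine",
                     "ParameterValue": "JQ-1.6"},
                ],
            }],
        },
        # Events land under one prefix per stream
        "Prefix": "streams/!{partitionKeyFromQuery:stream_id}/",
        "ErrorOutputPrefix": "errors/",
    },
)
```

Firehose then buffers events per partition and writes one aggregated object per prefix once the buffering hints are satisfied, which mirrors the batching behavior the shaper provides for file-based integrations.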

Read about how our Google Cloud Platform (GCP) integration uses Firehose.

The results: Cutting the file load by millions

Let's compare a couple of days' worth of data. During its busiest hour in that window, the producer observed more than 4.1 million files, while the consumer emitted close to 185,000: a reduction of more than 3.9 million files, or about 95 percent.

Graph showing the number of files observed by the producer component of Red Canary's shaping architecture

Files observed by the producer component


Graph showing the number of files emitted by the consumer component of Red Canary's shaping architecture

Files emitted by the consumer component

 

That's nearly 4 million files that no longer need to be downloaded and processed by downstream transformation stages. Not only does this minimize S3 costs, it also increases system throughput, facilitating fast, effective threat detection.

As you can see, shaping is a crucial piece of managing Red Canary’s large-scale data ingestion. The shaping strategies we’ve implemented effectively handle unpredictable incoming data, ensuring a more efficient and cost-effective pipeline.

 
