Ingesting and transforming data is easy. Ingesting and transforming nearly a petabyte of data per day in a reliable, predictable, and cost-effective way is quite hard.
Red Canary ingests data from dozens of third-party providers. Once ingested, this data passes through many stages of transformation before arriving in the Red Canary Portal. Here is a simplified visualization of this pipeline, detailed further in our previous post:
Each stage of this pipeline produces a new file to be processed by the next stage; this is critical for a couple of reasons:
- Observability: We can see exactly how each stage of the pipeline is producing its output, given its input.
- Replayability: If an issue is detected at any stage of the pipeline, data can be replayed starting at that stage once the issue is remediated.
So, you may be wondering, what’s the problem?
Big data challenges
Each third-party provider exports their data in different ways. Some common categories we’ve encountered:
- Tiny files containing a single event, produced frequently
  - Generally no more than a few kilobytes in size
- Smaller files containing many events, produced frequently
  - Sizes may range from several kilobytes to several megabytes
- Larger files containing many events, produced less frequently
  - Sizes may range from several megabytes to several gigabytes
Inconsistencies can occur as well; sometimes a provider that generally produces smaller files will suddenly produce a massive file.
Without predictable, reasonably sized files, it’s difficult to allocate resources appropriately for each transformation stage. If files are too small, the time required to establish S3 connections dwarfs the time it takes to actually transform each file. If files are too large, fewer S3 connections need to be established, but transformation time and memory usage increase significantly.
Both of these scenarios require more aggressive scaling strategies to manage, which quickly inflates infrastructure costs and creates operational headaches. Generally speaking, applications have to account for the worst-case scenario (massive files), meaning more processing power and memory are allocated to every stage in the transformation pipeline, even if most of those resources are rarely used.
Red Canary addresses these challenges by shaping our data prior to any transformation.
An intro to data shaping
Let’s introduce a “Shape” step prior to transformation:
Shaping describes:
- Batching several smaller files into one larger file
- Splitting one large file into several smaller files
We use newline-delimited JSON files throughout our pipeline to simplify this shaping; each line in a given file represents a single JSON event we’ve received from a provider. As such, any number of lines can be extracted from a given file to batch or split appropriately.
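To make this concrete, here’s a simplified sketch of line-level batching and splitting for newline-delimited JSON. The function names, chunk size, and file naming are illustrative rather than our production implementation:

```python
from pathlib import Path

def batch_files(paths: list[Path], out_path: Path) -> None:
    """Concatenate several small NDJSON files into one larger file.

    Because each line is a self-contained JSON event, files can be
    combined by simple line concatenation.
    """
    with out_path.open("w") as out:
        for path in paths:
            with path.open() as f:
                for line in f:
                    if line.strip():  # skip blank lines
                        out.write(line.rstrip("\n") + "\n")

def split_file(path: Path, out_dir: Path, lines_per_chunk: int = 10_000) -> list[Path]:
    """Split one large NDJSON file into smaller files of at most
    `lines_per_chunk` events each."""
    chunks: list[Path] = []
    buffer: list[str] = []

    def flush() -> None:
        if buffer:
            chunk = out_dir / f"{path.stem}.part{len(chunks)}.ndjson"
            chunk.write_text("".join(buffer))
            chunks.append(chunk)
            buffer.clear()

    with path.open() as f:
        for line in f:
            if line.strip():
                buffer.append(line if line.endswith("\n") else line + "\n")
                if len(buffer) >= lines_per_chunk:
                    flush()
    flush()  # write any remaining events as a final, smaller chunk
    return chunks
```

Because no event ever spans a line boundary, both operations stay simple and streamable regardless of how large the input files get.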
Red Canary uses a homegrown shaper application to enforce consistent file sizes across all transform stages. This shaper is composed of two components: a producer and a consumer.
The producer records metadata about incoming files in its database, including the customer name and file size. The producer routinely queries this database, aggregating small file references into batch instructions once one of its criteria has been met:
- The desired batched file size has been reached.
- The maximum “time to batch” has been reached.
  - This limit ensures the producer does not wait too long to produce results, preventing delays in threat detection.
If any single file surpasses the configured size limit, a split instruction is produced for that file instead.
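Here’s a simplified sketch of that decision logic. The thresholds, record fields, and instruction format are illustrative, and details like per-customer grouping are omitted:

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# Illustrative thresholds -- not our actual configuration.
TARGET_BATCH_BYTES = 64 * 1024 * 1024     # desired batched file size
MAX_TIME_TO_BATCH = timedelta(minutes=5)  # don't delay threat detection
SPLIT_THRESHOLD_BYTES = 256 * 1024 * 1024

@dataclass
class FileRecord:
    """Metadata the producer records for each incoming file."""
    key: str          # S3 object key
    customer: str
    size_bytes: int
    observed_at: datetime

def plan_instructions(pending: list[FileRecord], now: datetime) -> list[dict]:
    """Turn pending file records into batch/split instructions."""
    instructions: list[dict] = []
    batch: list[FileRecord] = []
    batch_bytes = 0

    for record in sorted(pending, key=lambda r: r.observed_at):
        # Oversized files get a split instruction of their own.
        if record.size_bytes > SPLIT_THRESHOLD_BYTES:
            instructions.append({"action": "split", "keys": [record.key]})
            continue
        batch.append(record)
        batch_bytes += record.size_bytes
        # Emit a batch once the desired size is reached.
        if batch_bytes >= TARGET_BATCH_BYTES:
            instructions.append({"action": "batch", "keys": [r.key for r in batch]})
            batch, batch_bytes = [], 0

    # Emit a partial batch if the oldest file has waited too long.
    if batch and now - batch[0].observed_at >= MAX_TIME_TO_BATCH:
        instructions.append({"action": "batch", "keys": [r.key for r in batch]})

    return instructions
```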
The consumer simply accepts instructions from the producer. For batch instructions, the consumer downloads the referenced files, aggregates them, and uploads that aggregated file for transformation. For split instructions, the consumer downloads the referenced file, splits it into several smaller files, and uploads each split file for transformation.
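A simplified sketch of the consumer side, using boto3; the bucket names and output key format are placeholders:

```python
import io
import boto3

s3 = boto3.client("s3")

# Placeholder bucket names, not our actual infrastructure.
SOURCE_BUCKET = "incoming-provider-data"
SHAPED_BUCKET = "shaped-for-transformation"

def handle_batch(keys: list[str], out_key: str) -> None:
    """Download the referenced files, aggregate them, and upload the result."""
    buffer = io.BytesIO()
    for key in keys:
        body = s3.get_object(Bucket=SOURCE_BUCKET, Key=key)["Body"].read()
        buffer.write(body if body.endswith(b"\n") else body + b"\n")
    buffer.seek(0)
    s3.upload_fileobj(buffer, SHAPED_BUCKET, out_key)

def handle_split(key: str, lines_per_chunk: int = 10_000) -> None:
    """Download the referenced file, split it by line count, and upload each chunk."""
    lines = s3.get_object(Bucket=SOURCE_BUCKET, Key=key)["Body"].read().splitlines()
    for i in range(0, len(lines), lines_per_chunk):
        chunk = b"\n".join(lines[i : i + lines_per_chunk]) + b"\n"
        out_key = f"{key}.part{i // lines_per_chunk}"
        s3.put_object(Bucket=SHAPED_BUCKET, Key=out_key, Body=chunk)
```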
This effectively addresses files containing many events, but it’s not such a great solution for managing individual events from data streams. Uploading individual events to S3 would inflate storage costs, increase the risk of hitting rate limits, and slow down overall processing due to the overhead of establishing an S3 connection per event.
Enter stage left: Amazon Data Firehose
Turning on the Firehose
Amazon Data Firehose was designed for this exact use case: reliably aggregating individual events from data streams at scale.
For integrations utilizing such data streams, we route their data to Firehose in place of the shaper:
We use dynamic partitioning to aggregate and isolate incoming data per stream.
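As a rough illustration, a Firehose delivery stream with dynamic partitioning can be configured along these lines. The stream name, ARNs, buffering values, and the `.stream_name` field used as the partition key are all placeholders rather than our actual configuration:

```python
import boto3

firehose = boto3.client("firehose")

firehose.create_delivery_stream(
    DeliveryStreamName="example-event-stream",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/example-firehose-role",
        "BucketARN": "arn:aws:s3:::example-shaped-bucket",
        # Partition objects by the stream each event came from.
        "Prefix": "events/stream=!{partitionKeyFromQuery:stream}/",
        "ErrorOutputPrefix": "errors/",
        # Buffer events into reasonably sized objects before writing to S3.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 300},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [
                {
                    "Type": "MetadataExtraction",
                    "Parameters": [
                        # Extract the partition key from each JSON event.
                        {"ParameterName": "MetadataExtractionQuery",
                         "ParameterValue": "{stream: .stream_name}"},
                        {"ParameterName": "JsonParsingEngine",
                         "ParameterValue": "JQ-1.6"},
                    ],
                },
            ],
        },
    },
)
```

Firehose then buffers incoming events per partition and writes them to S3 as consolidated, consistently sized objects, ready for the same downstream transformation stages.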
Read about how our Google Cloud Platform (GCP) integration uses Firehose.
The results: Cutting the file load by millions
Let’s compare a couple of days’ worth of data. Within the same period, the producer observed more than 4.1 million files during its busiest hour, while the consumer emitted close to 185,000 files, a reduction of more than 3.9 million files (about 95 percent).
Files observed by the producer component
Files emitted by the consumer component
That’s nearly 4 million files that don’t need to be downloaded and processed by downstream transformation stages. Not only does this reduce S3 costs, it also increases system throughput to facilitate fast, effective threat detection.
As you can see, shaping is a crucial piece of managing Red Canary’s large-scale data ingestion. The shaping strategies we’ve implemented effectively handle unpredictable incoming data, ensuring a more efficient and cost-effective pipeline.