What is a security data lake?

Security data lake overview

A security data lake is a central repository for reams of security data from various sources across an organization’s network. Security data lakes store the following kinds of structured, semi-structured, and unstructured data:

logs
alerts
network traffic
endpoint telemetry
threat intelligence feeds
malware databases
vulnerability scans
JSON/XML/NoSQL data formats

Data lakes as a whole are well suited for big data and artificial intelligence (AI)-based analytics, machine learning (ML), and business intelligence. Security data lakes can also help security teams with threat detection, incident response, and threat hunting, and they support compliance and auditing through long-term data retention.

Unlike other types of data storage, such as data warehouses, security data lakes:

accept all types of data
store data in its original form vs. applying a predefined schema first
transform data on demand
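
The three properties above can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: the in-memory `raw_store`, the source labels, the sample records, and the log pattern do not come from any particular product. It keeps each payload exactly as received and applies structure only when a question is asked.

```python
import json
import re

# The "lake" is a plain list of (source, raw_payload) pairs kept exactly as received.
raw_store: list[tuple[str, str]] = []

def ingest(source: str, payload: str) -> None:
    """Store the payload untouched; no schema is applied at write time."""
    raw_store.append((source, payload))

# Hypothetical syslog-style pattern, for illustration only.
SSH_FAIL = re.compile(r"Failed password for (?P<user>\S+) from (?P<ip>\S+)")

def failed_logins() -> list[dict]:
    """Transform on demand: project each raw record into {user, ip} at read time."""
    results = []
    for source, payload in raw_store:
        if source == "edr_json":
            event = json.loads(payload)
            if event.get("action") == "login_failed":
                results.append({"user": event["user"], "ip": event["src_ip"]})
        elif source == "syslog":
            match = SSH_FAIL.search(payload)
            if match:
                results.append(match.groupdict())
    return results

# Structured and unstructured records land in the same store.
ingest("edr_json", '{"action": "login_failed", "user": "alice", "src_ip": "203.0.113.7"}')
ingest("syslog", "sshd[311]: Failed password for bob from 198.51.100.4 port 22 ssh2")
ingest("syslog", "cron[99]: job started")

print(failed_logins())
```

Because the raw records are never altered, a different question later (for example, about cron activity) only needs a new read-time function, not a re-ingest.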

Building a security data lake on your own requires solutions, services, and tools offering scalable storage capabilities, data processing power, and strong security controls. You can buy a data lake service from your cloud service provider or a vendor like Red Canary. Open source options are also available.

The term “data lake” was coined by James Dixon of Pentaho in 2010, giving a name to a new concept for storing, analyzing, and managing raw data. Today, the data lake market is growing by double digits and its size in 2024 is estimated at between $16 billion and $20 billion.

According to Gartner, “A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores.”

The security data lake emerged to ingest, store, and analyze security data exclusively. Data lakes are better suited for storing and managing security data than legacy security information and event management (SIEM) systems. SIEM tools, which predate the Big Data era, are not designed to handle the increase in security data generated in the last few years. Further, SIEMs were designed for on-premises environments, not the cloud.

Why companies are adopting security data lakes

Analyzing security data is fundamental to gaining insights into risks, threats, and security posture. But storing data in structured databases can lead to unwitting deletion of information or context that analysts may need in the future. A security data lake stores all types of data in a central location for long periods. This makes it easier for analysts to find what they need to investigate incidents, compared to having data siloed in different tools and systems.

Similarly, centralization gives analytics tools and ML technologies access to very large data sets that can reveal subtle threat patterns and anomalies.
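
As a rough illustration of how a larger history can surface anomalies, the sketch below flags users whose activity today sits far above their own historical baseline. The z-score method, the threshold of 3.0, and the sample counts are assumptions chosen for the example; production ML pipelines are considerably more involved.

```python
from statistics import mean, pstdev

def flag_anomalies(history, today, threshold=3.0):
    """Return users whose count today exceeds their baseline by `threshold` standard deviations."""
    flagged = []
    for user, counts in history.items():
        baseline = mean(counts)
        spread = pstdev(counts)
        observed = today.get(user, 0)
        if spread == 0:
            # A perfectly flat history gives no scale to measure against; skip it.
            continue
        z_score = (observed - baseline) / spread
        if z_score > threshold:
            flagged.append(user)
    return flagged

# Hypothetical daily failed-login counts per user, pulled from retained data.
HISTORY = {
    "alice": [2, 3, 2, 4, 3, 2, 3],
    "bob": [10, 12, 11, 9, 10, 11, 12],
}
TODAY = {"alice": 40, "bob": 11}

print(flag_anomalies(HISTORY, TODAY))
```

The longer the retained history per user, the more stable the baseline becomes, which is one reason centralized long-term storage helps this kind of analysis.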

Cost control is another reason for adopting security data lakes. As security data increases exponentially, organizations need larger and more affordable storage options than SIEMs can offer. SIEMs often require maintaining expensive, dedicated hardware for constant processing. A security data lake can leverage cost-effective cloud storage from providers that charge only for the compute power used when analyzing data.

Finally, because a security data lake can be useful to a broad range of users, it offers an attractive return on investment.

Analysts can take advantage of the extended data retention period of a security data lake, which can be months or years, compared to a SIEM’s typical 30- to 90-day retention period. Access to historical data provides valuable context for investigations into security threats and incidents.
Data scientists can use the data to construct AI and ML models and other analytical applications.
Less-technical users can more easily learn how to use a security data lake than a highly complex SIEM tool.
Security leaders can improve decision making and reporting to the board of directors with data-driven, actionable insights.

Security data lake benefits

A specialized repository for security data and threat intelligence helps security teams identify, understand, and respond to threats more effectively.

What features and functions of a security data lake improve threat detection, analysis, and response?

Holistic view: By consolidating information from a wide range of sources, a security data lake provides a comprehensive, centralized view of security data. Analyzing all this security data together makes it easier to identify threats that might be overlooked when taking a piecemeal approach.
Stronger security: When security data is dispersed across different tools, cloud services, and functions, identifying potential threats can be difficult. A security data lake addresses this issue by centralizing all security data.
Cross-functional collaboration: Centralized access to the same data set for security, IT, compliance, and operations functions facilitates teamwork.
Scalability: The security data lake accommodates fast-growing data volumes for compute, analysis, and long-term storage, so security teams can gain the full value of this historical data for activities like threat hunting.
Flexibility: A security data lake can aggregate and correlate data in any format. It ingests data without having to define schema.
On-demand advanced analytics: Without requiring a complex infrastructure to pre-process and organize data, security analysts can query using SQL, full-text search, and AI or ML. In particular, the combination of structured and unstructured data lends itself to the use of algorithms for identifying patterns, correlations, and anomalies.
Cost-effectiveness: Security data lakes take advantage of low-cost cloud storage that can handle terabytes or even petabytes of data.
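
To make the on-demand analytics point concrete, here is a small sketch that loads raw JSON events into a throwaway in-memory SQLite table at query time and then runs ordinary SQL over them. The event fields, the table layout, and the use of SQLite are all assumptions for illustration; a real data lake would use its own query engine.

```python
import json
import sqlite3

# Hypothetical raw events, stored as received.
RAW_EVENTS = [
    '{"ts": "2024-05-01T10:00:00Z", "host": "web-1", "action": "login_failed", "username": "alice"}',
    '{"ts": "2024-05-01T10:00:05Z", "host": "web-1", "action": "login_failed", "username": "alice"}',
    '{"ts": "2024-05-01T10:01:00Z", "host": "db-1", "action": "login_ok", "username": "bob", "mfa": true}',
]

def failed_logins_by_user(raw_lines):
    """Project raw records into a temporary table, then answer the question with SQL."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (ts TEXT, host TEXT, action TEXT, username TEXT)")
    rows = []
    for line in raw_lines:
        record = json.loads(line)
        # Only the fields this query needs are projected; the raw line keeps the rest.
        rows.append((record.get("ts"), record.get("host"),
                     record.get("action"), record.get("username")))
    conn.executemany("INSERT INTO events VALUES (?, ?, ?, ?)", rows)
    result = conn.execute(
        "SELECT username, COUNT(*) FROM events "
        "WHERE action = 'login_failed' GROUP BY username ORDER BY username"
    ).fetchall()
    conn.close()
    return result

print(failed_logins_by_user(RAW_EVENTS))
```

The table exists only for the duration of the query, so no pre-processing infrastructure has to be maintained between questions.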

Security data lake downsides

The huge amounts of sensitive data stored in security data lakes make them a tempting target for malicious actors, including insiders with easy access to the data lake. Attack vectors can include phishing, SQL injection, zero-day exploits, and man-in-the-middle attacks. Protection requires robust access controls based on roles and responsibilities, which can be a major challenge due to the many different types of stored data.
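
A role-based access check can be sketched in a few lines. The role names, data categories, and permission mapping below are hypothetical; the point is only that access is denied unless a role has been explicitly granted a category.

```python
# Hypothetical mapping of roles to the data categories they may read.
ROLE_PERMISSIONS = {
    "soc_analyst": {"alerts", "endpoint_telemetry", "network_traffic"},
    "compliance_auditor": {"audit_logs"},
    "data_scientist": {"alerts", "network_traffic"},
}

def can_read(role: str, category: str) -> bool:
    """Deny by default: unknown roles and unlisted categories are refused."""
    return category in ROLE_PERMISSIONS.get(role, set())

print(can_read("soc_analyst", "alerts"))         # granted explicitly
print(can_read("compliance_auditor", "alerts"))  # not in the auditor's set
print(can_read("contractor", "alerts"))          # unknown role
```

The difficulty the text describes shows up in the mapping itself: every new data type added to the lake needs a decision about which roles may see it.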

In addition, these vast data volumes must be kept private to comply with regulations such as Europe’s GDPR, the U.S. federal HIPAA statute, and the California Consumer Privacy Act.

Following are other concerns and drawbacks regarding security data lakes:

Complexity of integration with the security stack: For data ingestion, security data lakes must seamlessly integrate with existing security infrastructure. Integration can be complex and time-consuming.
Shortage of qualified workers: Since security data lakes are relatively new, organizations may have trouble finding experienced staff.
High costs of in-house development: Building and maintaining a security data lake in house requires a substantial upfront investment in hardware, software, and storage infrastructure on top of ongoing operational and administrative costs.

Security data lake vs. SIEM

Traditional SIEM software aggregates, analyzes, and stores data from logs, enabling security teams to query data in real time, correlate data from different sources to detect anomalies, and respond quickly to incidents. But SIEM tools were not designed to handle today’s enormous volumes of security data. As data exploded, SIEM vendors sharply increased their processing and storage fees, forcing organizations to either reduce the amount of telemetry sent to the SIEM or shorten the data retention period in an effort to control costs.

Besides data quantities, another scalability issue for SIEMs involves data correlation. While SIEMs perform log collection and aggregation very well, they can run into problems when correlating vast amounts of data.

Capability drawbacks include fragmented automation and a lack of advanced analytics, which are necessary to derive useful insights from large data volumes. Legacy SIEM platforms do not usually provide robust user and entity behavior analytics (UEBA), which prevents effective analysis and correlation of user behavior patterns to detect compromised accounts or insider threats.

Finally, SIEMs are complex and require extensive training and specialized expertise.

Following are the main differences between these two solutions.

Data ingestion

SIEM solutions typically ingest structured data or auto-normalize security data as it is ingested (known as schema-on-write). In contrast, data lakes ingest raw data in its source format (known as schema-on-read).
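
The difference can be shown with a small sketch. The fixed schema and the sample event are assumptions for illustration, and real SIEM normalization is far richer, but the trade-off is the same: a field outside the write-time schema is gone, while the raw copy still has it.

```python
import json

# Hypothetical fixed schema applied at ingest time (schema-on-write).
FIXED_SCHEMA = ("timestamp", "host", "action")

def normalize_on_write(raw: str) -> dict:
    """Keep only the fields the schema knows about; everything else is dropped."""
    record = json.loads(raw)
    return {field: record.get(field) for field in FIXED_SCHEMA}

def store_raw(raw: str) -> str:
    """Schema-on-read storage: keep the payload exactly as it arrived."""
    return raw

def read_field(stored_raw: str, field: str):
    """Apply structure only at query time."""
    return json.loads(stored_raw).get(field)

raw = ('{"timestamp": "2024-05-01T10:00:00Z", "host": "web-1", '
       '"action": "process_start", "parent_process": "winword.exe"}')

normalized_row = normalize_on_write(raw)
lake_row = store_raw(raw)

print("parent_process" in normalized_row)      # the field did not survive normalization
print(read_field(lake_row, "parent_process"))  # still recoverable from the raw copy
```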

Data storage and retention

SIEM vendors typically charge based on the amount of data processed and the length of time it is stored in their systems, and fees can be very high. Security data lakes use commodity storage in the cloud to sharply reduce fees. Further, a SIEM will often store logs and alert data for less than a year, hampering analyst efforts to spot long-term trends and patterns. Security data lakes can retain security data for years.
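
The practical effect of the retention gap can be sketched as a simple date filter. The event dates and the three-year lake window are hypothetical; the 90-day figure is the upper end of the typical SIEM range cited earlier.

```python
from datetime import date, timedelta

def queryable(event_dates, today, retention_days):
    """Return the events that are still inside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [d for d in event_dates if d >= cutoff]

today = date(2024, 6, 30)
# Hypothetical dates of events tied to one slow-moving intrusion.
events = [date(2024, 6, 15), date(2024, 2, 1), date(2022, 11, 20)]

short_window = queryable(events, today, retention_days=90)
long_window = queryable(events, today, retention_days=3 * 365)

print(len(short_window), "of", len(events), "events visible with 90-day retention")
print(len(long_window), "of", len(events), "events visible with three-year retention")
```

With the shorter window, an analyst sees only the most recent event and loses the earlier activity that would show how long the intrusion had been under way.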

Infrastructure

In general, there is no inherent infrastructure advantage for either security data lakes or SIEM tools. However, if an organization already has a data lake, the security team could add its data to that existing repository.

Security teams usually know how to connect their SIEM to data feeds, secure the infrastructure, and host data in the tool. Adopting a security data lake may introduce a learning curve.

Threat hunting

To support threat hunting, security data lakes offer advantages over SIEMs: they can ingest more data types, accommodate more data, and host it for longer periods, giving threat hunting teams access to far greater data resources. Security data lakes also provide a data query interface to help with investigating key alerts.

SIEMs issue alerts and can flag specific events for further investigation, but threat hunting must be performed outside of the tool.
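
A hunt over a lake's retained data might look like the following sketch, which searches every source for a single indicator regardless of record format. The `LAKE` contents, the source names, and the plain substring match are assumptions for illustration; a real hunt would use the lake's own query interface and stricter matching.

```python
# Hypothetical lake contents: several sources, mixed formats, multi-year history.
LAKE = {
    "firewall": [
        "2023-01-12 ALLOW 10.0.0.5 -> 203.0.113.50:443",
        "2024-05-02 DENY 10.0.0.9 -> 198.51.100.20:22",
    ],
    "dns": [
        '{"ts": "2023-01-12", "client": "10.0.0.5", "answer": "203.0.113.50"}',
    ],
    "proxy": [
        "2024-05-03 GET http://example.com/ 200",
    ],
}

def hunt(indicator: str):
    """Full-text search for one indicator across every source in the lake."""
    hits = []
    for source, records in LAKE.items():
        for record in records:
            # Substring matching is deliberately naive; a shorter indicator such as
            # "203.0.113.5" would also match "203.0.113.50".
            if indicator in record:
                hits.append((source, record))
    return hits

for source, record in hunt("203.0.113.50"):
    print(source, "|", record)
```

Here the indicator turns up in both firewall and DNS records from the same day in 2023, a correlation that depends on both sources being retained long enough to search together.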

AI and ML

A limited data set can potentially inhibit algorithm training. For this reason, a security data lake’s large and unfiltered data set offers better opportunities to train AI and ML models in threat detection than a SIEM.

Next-gen SIEM + security data lake: solving the data problem

Acknowledging the limitations of legacy SIEM tools, the industry has developed “next-gen” SIEM technologies. These systems incorporate important features and capabilities that security data lakes also offer:

Advanced analytics, ML, and UEBA
A broader range of data sources and types and larger data volumes
Automation and orchestration to avoid manual intervention
Support for cloud-native and hybrid environments
Built-in connectors for data ingestion from existing tools

Still, next-gen SIEMs can be complex and resource-intensive to implement and manage. To address these challenges, many organizations turn to managed or co-managed SIEM services. Outsourcing SIEM management allows organizations to focus on strategic security initiatives while experts take care of scaling resources, automating routine tasks, and optimizing SIEM performance.

At the same time, next-gen SIEM technologies often struggle with data storage capacity, scalability, and the ability to handle diverse data types. Security data lakes, however, provide a flexible and scalable solution for storing and analyzing large volumes of data, including low-fidelity logs and network flows. Together, these technologies create a more comprehensive security posture. Next-gen SIEMs excel at real-time threat detection and response, while security data lakes enable long-term data retention, advanced analytics, and proactive threat hunting.

By combining the strengths of next-gen SIEMs and security data lakes, organizations can leverage the best of both worlds. This approach allows them to detect and respond to threats in real time while maintaining the scalability and flexibility needed to manage large and varied datasets. Understanding the unique strengths and limitations of each technology enables organizations to build a security strategy that protects their critical assets and adapts to emerging threats.

Red Canary SIEM services and Security Data Lake

Organizations often face tough trade-offs when executing their security data strategies. Red Canary’s Security Data Lake helps unlock valuable insights while optimizing storage costs, empowering teams to focus on what matters most. With flexible retention and efficient data handling, organizations can leverage their data without compromising on security.

