What is a security data lake?

A security data lake (SDL) is a specialized data lake explicitly designed to store, manage, and analyze large volumes of security-related data from various sources across an organization's network.

Security data lake components

Enormous volumes of raw security data spread across different tools, cloud platforms, and functions; the need for faster responses to complex, emerging cyberthreats; and increasingly stringent security and privacy regulations are key forces that have given rise to the security data lake (SDL).

As its name indicates, a security data lake is a specialized data lake explicitly designed to store, manage, and analyze large volumes of security-related data from various sources across an organization’s network. It’s a centralized repository of structured, semi-structured, and unstructured security data, such as logs, alerts, network traffic, endpoint telemetry, threat intelligence feeds, malware databases, and vulnerability scans, in formats such as JSON and XML as well as records from NoSQL stores.

Data lakes as a whole are well suited for big data and artificial intelligence (AI)-based analytics, machine learning (ML), and business intelligence. Security data lakes can also help security teams with threat detection, incident response, and threat hunting, and support compliance and auditing through long-term data retention.

Unlike other types of data storage, such as data warehouses, security data lakes:

  1. accept all types of data
  2. store data in its original form rather than applying a predefined schema first
  3. transform data on demand
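
The three properties above can be illustrated with a minimal Python sketch. It is not how any particular product works: the in-memory `raw_store`, the source names (`edr_json`, `firewall_csv`, `syslog`), and the sample payloads are all hypothetical, chosen only to show data being kept in its original form and parsed when a query needs it.

```python
import json
from datetime import datetime, timezone

# Hypothetical raw store: every record is kept exactly as it was received.
raw_store = []

def ingest(source, payload):
    """Append the payload untouched, plus minimal arrival metadata."""
    raw_store.append({
        "source": source,
        "received_at": datetime.now(timezone.utc).isoformat(),
        "raw": payload,
    })

def parse_on_demand(record):
    """Apply a parser only when a query needs structured fields."""
    raw = record["raw"]
    if record["source"] == "edr_json":
        return json.loads(raw)
    if record["source"] == "firewall_csv":
        src_ip, dst_ip, action = raw.split(",")
        return {"src_ip": src_ip, "dst_ip": dst_ip, "action": action}
    # Formats without a parser stay searchable as plain text.
    return {"message": raw}

# 1. Accept all types of data.
ingest("edr_json", '{"host": "ws-01", "process": "powershell.exe"}')
ingest("firewall_csv", "10.0.0.5,203.0.113.9,deny")
ingest("syslog", "sshd[811]: Failed password for root from 198.51.100.7")

# 2. The stored payloads are unchanged; 3. structure is applied at query time.
parsed = [parse_on_demand(record) for record in raw_store]
```

Because the raw payload is never modified, a better parser written later can be applied to data ingested long before it existed.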

Building a security data lake requires solutions, services, and tools offering scalable storage capabilities, data processing power, and strong security controls. Open-source and proprietary tools are available, as well as services from leading public cloud companies.

The term “data lake” was coined by James Dixon of Pentaho in 2010, giving a name to a new concept for storing, analyzing, and managing raw data. Today, the data lake market is growing at double-digit rates, and its size in 2024 is estimated at between $16 billion and $20 billion.

According to Gartner, “A data lake is a concept consisting of a collection of storage instances of various data assets. These assets are stored in a near-exact, or even exact, copy of the source format and are in addition to the originating data stores.”

The security data lake emerged to ingest, store, and analyze security data – exclusively. A key reason is the need for a solution that is better suited to storing and managing security data than legacy security information and event management (SIEM) systems. As security data volumes have skyrocketed, organizations are acknowledging that SIEM tools, which predate the Big Data era, are not designed to handle the increase. Further, SIEMs were created for on-premises environments, not the cloud-based infrastructures that predominate today.

Why organizations need security data lakes

Analyzing security data is fundamental to gaining insights into risks, threats, and security posture. However, storing data in structured databases can lead to unwitting deletion of information or context that analysts may need in the future. A security data lake, which stores all types of data in a central location for long periods, makes it easier for analysts to find what they need to investigate incidents, compared to having data siloed in different tools and systems.

Similarly, centralization gives analytics tools and ML technologies access to very large data sets that can reveal subtle threat patterns and anomalies.

Cost control is another reason for adopting security data lakes. As security data increases exponentially, organizations need larger and more affordable storage options than SIEMs can offer. Compared to a SIEM system, which typically involves maintaining expensive, dedicated hardware for constant processing, a security data lake can use cost-effective cloud storage and pay for compute power only when data is actually being analyzed.

Finally, because a security data lake can be useful to a broad range of users, it offers an attractive return on investment.

  • Analysts can take advantage of the extended data retention period of a security data lake – which is typically months or years, compared to a SIEM’s typical 30- to 90-day retention period. Access to historical data provides valuable context for investigations into security threats and incidents.
  • Data scientists can use the data to construct AI and ML models and other analytical applications.
  • Less-technical users can more easily learn how to use a security data lake than a highly complex SIEM tool.
  • Security leaders can improve decision making and reporting to the board of directors with data-driven, actionable insights based on huge security data volumes.

Security data lake benefits

Access to a specialized repository for security data and threat intelligence helps security teams identify, understand, and respond to threats more effectively.

What features and functions of a security data lake improve threat detection, analysis, and response?

  • Holistic view: By consolidating information from a wide range of sources, a security data lake provides a comprehensive, centralized view of security data. Analyzing all this security data together makes it easier to identify threats that might be overlooked when taking a piecemeal approach.
  • Stronger security: When security data is dispersed across different tools, cloud services, and functions, identifying potential threats can be difficult. A security data lake addresses this issue by centralizing all security data.
  • Cross-functional collaboration: Centralized access to the same data set for security, IT, compliance, and operations functions facilitates teamwork.
  • Scalability: The security data lake easily accommodates fast-growing data volumes for compute, analysis, and long-term storage, so security teams can gain the full value of this historical data for activities like threat hunting.
  • Flexibility: A security data lake can aggregate and correlate data in any format. It ingests data without having to define schema.
  • On-demand advanced analytics: Without requiring a complex infrastructure to pre-process and organize data, security analysts can query using SQL, full-text search, and AI or ML. In particular, the combination of structured and unstructured data lends itself to the use of algorithms for identifying patterns, correlations, and anomalies.
  • Cost-effectiveness: Security data lakes take advantage of low-cost cloud storage that can handle terabytes or even petabytes of data.
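
As a rough illustration of the on-demand SQL querying mentioned in the list above, the following sketch uses Python's built-in `sqlite3` module as a stand-in for a data lake query engine. The `auth_events` table, its columns, the sample rows, and the threshold of three failures are assumptions made for the example, not part of any specific product.

```python
import sqlite3

# In-memory database standing in for a lake query engine (assumption).
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE auth_events (ts TEXT, username TEXT, src_ip TEXT, outcome TEXT)"
)

# Hypothetical authentication events.
rows = [
    ("2024-05-01T09:00:00", "alice", "10.0.0.5", "success"),
    ("2024-05-01T09:01:00", "bob", "198.51.100.7", "failure"),
    ("2024-05-01T09:01:05", "bob", "198.51.100.7", "failure"),
    ("2024-05-01T09:01:09", "bob", "198.51.100.7", "failure"),
    ("2024-05-01T09:02:00", "carol", "10.0.0.8", "failure"),
]
conn.executemany("INSERT INTO auth_events VALUES (?, ?, ?, ?)", rows)

# Find accounts with repeated failed logins from the same source address.
query = """
SELECT username, src_ip, COUNT(*) AS failures
FROM auth_events
WHERE outcome = 'failure'
GROUP BY username, src_ip
HAVING COUNT(*) >= 3
ORDER BY failures DESC
"""
suspicious = conn.execute(query).fetchall()
print(suspicious)  # [('bob', '198.51.100.7', 3)]
```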

Security data lake downsides

The huge amounts of sensitive data stored in security data lakes make them a tempting target for malicious actors, including insiders with easy access to the data lake. Attack vectors can include phishing, SQL injection, zero-day exploits, and man-in-the-middle attacks. Protection requires robust access controls based on roles and responsibilities, which can be a major challenge due to the many different types of stored data.

In addition, these vast data volumes must be kept private to comply with regulations such as Europe’s GDPR, the U.S. federal HIPAA statute, and the California Consumer Privacy Act.
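
The role-based access controls mentioned above can be sketched in a few lines of Python. This is a simplified, deny-by-default illustration only: the role names, data categories, and the `read_records` helper are hypothetical, and a real deployment would rely on the access-control features of its storage and identity platforms.

```python
# Hypothetical mapping of roles to the data categories they may read.
ROLE_PERMISSIONS = {
    "soc_analyst": {"alerts", "endpoint_telemetry", "network_traffic"},
    "compliance_auditor": {"audit_logs"},
    "data_scientist": {"alerts", "network_traffic"},
}

def can_read(role, category):
    """Deny by default: unknown roles or categories get no access."""
    return category in ROLE_PERMISSIONS.get(role, set())

def read_records(role, category, records):
    """Return records of one category, or refuse if the role lacks access."""
    if not can_read(role, category):
        raise PermissionError(f"{role} may not read {category}")
    return [r for r in records if r["category"] == category]

# Hypothetical stored records of different types.
records = [
    {"category": "alerts", "detail": "suspicious login"},
    {"category": "audit_logs", "detail": "policy change"},
]

analyst_view = read_records("soc_analyst", "alerts", records)
```

The difficulty the text describes shows up in the mapping itself: every new data type added to the lake needs an explicit decision about which roles may read it.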

Following are other concerns and drawbacks regarding security data lakes:

  • Complexity of integration with the security stack: For data ingestion, security data lakes must seamlessly integrate with existing security infrastructure. Integration can be complex and time-consuming.
  • Shortage of qualified workers: Since security data lakes are relatively new, and there is an ongoing shortage of cybersecurity talent, organizations may have trouble finding staff to administer the data lake.
  • High costs of in-house development: Building and maintaining a security data lake in house requires a substantial upfront investment in hardware, software, and storage infrastructure, as well as ongoing operational and administrative costs.

Legacy SIEM limitations

Traditional SIEM software aggregates, analyzes, and stores data from logs, enabling security teams to query data in real time, correlate data from different sources to detect anomalies, and respond quickly to incidents. However, SIEM tools were not designed to handle today’s enormous volumes of security data. As data exploded, SIEM vendors sharply increased their processing and storage fees, forcing organizations to either reduce the amount of telemetry sent to the SIEM or shorten the data retention period in an effort to control costs.

Besides data quantities, another scalability issue for SIEMs involves data correlation. While SIEMs perform log collection and aggregation very well, they can run into problems when correlating vast amounts of data.

Capability drawbacks include fragmented automation and a lack of advanced analytics, which are necessary to derive useful insights from large data volumes. Legacy SIEM platforms do not usually provide robust user and entity behavior analytics (UEBA), which prevents effective analysis and correlation of user behavior patterns to detect compromised accounts or insider threats.
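
To make the idea behind UEBA concrete, here is a deliberately simple sketch that compares each user's activity today against that user's own history. It is not a description of any vendor's analytics: the z-score method, the threshold of 3.0, and the sample login counts are assumptions chosen for illustration.

```python
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Flag today's count if it is far from the user's own baseline."""
    if len(history) < 2:
        return False  # not enough history to build a baseline
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu
    return abs(today - mu) / sigma > threshold

# Hypothetical daily login counts per user over the past week.
daily_logins = {
    "alice": [4, 5, 6, 5, 4, 5, 6],
    "bob": [2, 3, 2, 3, 2, 3, 2],
}
today_counts = {"alice": 5, "bob": 40}

flagged = [
    user for user, count in today_counts.items()
    if is_anomalous(daily_logins[user], count)
]
print(flagged)  # ['bob']
```

A baseline like this depends on history: the longer and more complete the record of each user's behavior, the more meaningful the comparison.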

Finally, SIEMs are complex and require extensive training and specialized expertise.

Security data lake vs. SIEM

Following are the main differences between these two solutions.

Data ingestion

SIEM solutions typically ingest structured data or auto-normalize security data as it is ingested (known as schema-on-write). In contrast, data lakes ingest raw data in its source format (known as schema-on-read).
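
The practical difference between the two approaches can be sketched as follows. The fixed schema, the field names, and the sample event are hypothetical; the point is only that schema-on-write decides at ingestion time which fields survive, while schema-on-read keeps the raw event and lets each query choose its fields.

```python
import json

# Hypothetical fixed schema applied at ingestion time.
SIEM_SCHEMA = ("timestamp", "host", "event_type")

def schema_on_write(raw_event):
    """Normalize at ingestion: fields outside the schema are discarded."""
    event = json.loads(raw_event)
    return {field: event.get(field) for field in SIEM_SCHEMA}

def schema_on_read(raw_event, fields):
    """Keep the raw event; pick the fields a query needs at read time."""
    event = json.loads(raw_event)
    return {field: event.get(field) for field in fields}

raw = (
    '{"timestamp": "2024-05-01T09:00:00Z", "host": "ws-01", '
    '"event_type": "process_start", "parent_process": "winword.exe"}'
)

normalized = schema_on_write(raw)  # parent_process is gone after ingestion
later_view = schema_on_read(raw, ["host", "parent_process"])  # still available
```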

Data storage and retention

SIEM vendors typically charge based on the amount of data processed and the length of time it is stored in their systems, and fees can be very high. Security data lakes use commodity storage in the cloud to sharply reduce fees. Further, a SIEM will often store logs and alert data for less than a year, hampering analyst efforts to spot long-term trends and patterns. Security data lakes can retain security data for years.

Infrastructure

In general, there is no inherent advantage regarding infrastructure for either security data lakes or SIEM tools. However, if an organization already has a data lake, the security team could add its data to that existing repository.

On the other hand, while security teams usually know how to connect their SIEM to data feeds, secure the infrastructure, and host data in the tool, adopting a security data lake will require the team to acquire new skills in these areas.

Threat hunting

To support threat hunting, security data lakes offer advantages over SIEMs: they can ingest more data types, accommodate more data, and host it for longer periods, giving threat hunting teams access to far greater data resources. Security data lakes also provide a data query interface to help with investigating key alerts, and supply the context needed to understand them.

SIEMs issue alerts and can flag specific events for further investigation, but threat hunting must be performed outside of the tool.
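
The effect of retention on threat hunting can be shown with a small worked example. Suppose a threat intelligence feed publishes a new malicious IP address today, and the only connection to it happened more than six months ago. The dates, hosts, and addresses below are hypothetical, and the 365-day window is an assumed stand-in for a longer data lake retention period; the 90-day window is the upper end of a typical SIEM retention period.

```python
from datetime import date, timedelta

def searchable(events, today, retention_days):
    """Return only the events still inside the retention window."""
    cutoff = today - timedelta(days=retention_days)
    return [e for e in events if e["day"] >= cutoff]

def hunt(events, indicator):
    """Find past connections to a newly published malicious address."""
    return [e for e in events if e["dst_ip"] == indicator]

today = date(2024, 6, 30)
events = [
    {"day": date(2023, 12, 15), "host": "ws-07", "dst_ip": "203.0.113.9"},
    {"day": date(2024, 6, 20), "host": "ws-02", "dst_ip": "198.51.100.7"},
]
ioc = "203.0.113.9"  # hypothetical indicator published today

short_hits = hunt(searchable(events, today, 90), ioc)   # 90-day retention
long_hits = hunt(searchable(events, today, 365), ioc)   # 365-day retention

print(len(short_hits), len(long_hits))  # 0 1
```

With the shorter window, the compromise of `ws-07` is invisible; with the longer one, the hunt finds it.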

AI and ML

A limited data set can potentially inhibit algorithm training. For this reason, a security data lake’s large and unfiltered data set offers better opportunities to train AI and ML models in threat detection than a SIEM.

Next-gen SIEM + security data lake: solving the data problem

Acknowledging the limitations of legacy SIEM tools, the industry has developed “next-gen” SIEM technologies. These systems incorporate important features and capabilities that may also be provided by security data lakes:

  • Advanced analytics, ML, and UEBA
  • A broader range of data sources and types and larger data volumes
  • Automation and orchestration to avoid manual intervention
  • Support for cloud-native and hybrid environments
  • Built-in connectors for data ingestion from existing tools

Still, next-gen SIEMs can be complex and resource intensive to implement and manage. To address these challenges, many organizations turn to managed or co-managed SIEM services. By outsourcing the management and maintenance of SIEM infrastructure, organizations can focus on strategic security initiatives while benefiting from expert support to scale resources, automate routine tasks, and optimize SIEM performance.

At the same time, next-gen SIEM technologies often face limitations related to data storage capacity, scalability, and the ability to handle diverse data types. Security data lakes, however, provide a flexible and scalable solution for storing and analyzing large volumes of data, including low-fidelity logs and network flows. Together, these technologies create a more comprehensive security posture: next-gen SIEMs excel at real-time threat detection and response, while security data lakes enable long-term data retention, advanced analytics, and proactive threat hunting.

By combining the strengths of next-gen SIEMs and security data lakes, organizations can leverage the best of both worlds. This approach allows them to detect and respond to threats in real time while maintaining the scalability and flexibility needed to manage large and varied datasets. Understanding the unique strengths and limitations of each technology enables organizations to build a security strategy that protects their critical assets and adapts to emerging threats.

Red Canary SIEM services and Security Data Lake

Organizations often face tough trade-offs when executing their security data strategies. Red Canary’s Security Data Lake helps unlock valuable insights while optimizing storage costs, empowering teams to focus on what matters most. With flexible retention and efficient data handling, organizations can leverage their data without compromising on security.
