The Architect’s Guide: A Modern Data Lake Reference Architecture

Businesses aiming to maximize the value of their data assets are adopting scalable, flexible and unified approaches to data storage and analytics. This trend is driven by enterprise architects tasked with crafting infrastructures that align with evolving business demands. A modern data lake architecture addresses this need by integrating the scalability and flexibility of a data lake with the structure and performance optimizations of a data warehouse. This post provides a reference architecture for understanding and implementing a modern data lake.
What is a Modern Data Lake?
A modern data lake is one part data warehouse and one part data lake, with object storage underpinning both. This may sound like a marketing trick (put two products in one package and call it something new), but the data warehouse described here improves on a conventional one. Because it is built on object storage, it inherits object storage's scalability and performance benefits: organizations pay only for the capacity they need, and they achieve performance by equipping the underlying object store with NVMe drives connected by a high-speed network.
Using object storage in this fashion is enabled by the rise of open table formats (OTFs) such as Apache Iceberg, Apache Hudi and Delta Lake. These specifications, once implemented, make it seamless to use object storage as the underlying storage solution for a data warehouse. They also provide features that may not exist in a conventional data warehouse, including snapshots (also known as time travel), schema evolution, partitions, partition evolution and zero-copy branching.
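To make these features concrete, the sketch below uses PySpark with an Apache Iceberg catalog to create a table, evolve its schema and inspect its snapshots. It is a minimal sketch, assuming a Spark installation with the Iceberg Spark runtime and an S3-compatible connector on the classpath; the catalog name, warehouse path and table names are illustrative.

```python
from pyspark.sql import SparkSession

# Minimal sketch: assumes the Iceberg Spark runtime and an S3 connector are on the classpath.
spark = (
    SparkSession.builder
    .appName("otf-features-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")  # illustrative bucket
    .getOrCreate()
)

# Create an Iceberg table in data warehouse storage and load a row
spark.sql("CREATE TABLE IF NOT EXISTS lake.db.events (id BIGINT, payload STRING) USING iceberg")
spark.sql("INSERT INTO lake.db.events VALUES (1, 'first load')")

# Schema evolution: add a column without rewriting existing data files
spark.sql("ALTER TABLE lake.db.events ADD COLUMN source STRING")

# Snapshots (time travel): list the table's snapshots, then query an earlier one by id
spark.sql("SELECT snapshot_id, committed_at FROM lake.db.events.snapshots").show()
# spark.sql("SELECT * FROM lake.db.events VERSION AS OF <snapshot_id>").show()
```

Apache Hudi and Delta Lake expose similar capabilities through their own APIs and SQL extensions.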
But the modern data lake is more than a fancy data warehouse: it also contains a data lake for unstructured data. OTFs also provide integration with external data in the data lake, allowing that data to be queried as a SQL table when needed, or transformed and routed into the data warehouse using high-speed processing engines and familiar SQL commands.
So the modern data lake is not simply a data warehouse and a data lake in one package with a different name. Together, the two halves provide more value than either a conventional data warehouse or a standalone data lake on its own.
Conceptual Architecture
Layering is a convenient way to present the components and services needed by the modern data lake. Layering provides a clear way to group services that provide similar functionality. It also allows a hierarchy to be established, with consumers on top and data sources (with their raw data) on the bottom. The layers of the modern data lake from top to bottom are:
- Consumption layer: Contains the tools used by power users to analyze data. Also contains applications and AI/ML workloads that will programmatically access the modern data lake.
- Semantic layer: An optional metadata layer for data discovery and governance.
- Processing layer: This layer contains the compute clusters needed to query the modern data lake. It also contains compute clusters used for distributed model training. Complex transformations can occur in the processing layer using the storage layer’s integration between the data lake and the data warehouse.
- Storage layer: Object storage is the primary storage service for the modern data lake; however, machine learning operations (MLOps) tools may need other storage services, such as relational databases. If you are pursuing generative AI, you will need a vector database.
- Ingestion layer: Contains the services needed to receive data. Advanced ingestion can retrieve data based on a schedule. The modern data lake should support a variety of protocols. It should also support data arriving in streams and batches. Simple and complex data transformations can occur in the ingestion layer.
- Data sources: The data sources layer is technically not a part of the modern data lake solution, but it is included in this article because a well-constructed modern data lake must support a variety of data sources with varying capabilities for sending data.
The diagram below depicts these layers and the capabilities that may be needed to implement them. It is an end-to-end architecture with a modern data lake at its heart, showing the components needed to ingest, transform, discover, govern and consume data, as well as the tools that support important use cases built on the modern data lake, such as MLOps storage, vector databases and machine learning clusters.
The storage layer and the processing layer are at the heart of a modern data lake. These two layers also contain the fastest evolving technologies for building a data infrastructure: data warehouses built with open table formats, high-speed object storage and vector databases.
The Storage Layer
The storage layer is the bedrock that all other layers depend upon. Its purpose is to store data reliably and serve it efficiently. It contains separate object storage services for the data lake side and the data warehouse side of the modern data lake.
These two object storage services can be combined into one physical instance of an object store if needed by using buckets to keep data warehouse storage separate from data lake storage. However, if your consumption layer and data pipelines will be putting different workloads on these two storage services, consider keeping them separate and installing them on different hardware.
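As a rough illustration of the single-instance approach, the snippet below uses boto3 against an S3-compatible object store to create separate buckets for the data lake and data warehouse sides; the endpoint, credentials, bucket names and object key are placeholders.

```python
import boto3

# Sketch: one physical object store, two buckets keeping lake and warehouse data separate.
# The endpoint URL, credentials, bucket names and object key are placeholders.
s3 = boto3.client(
    "s3",
    endpoint_url="http://object-store.internal:9000",
    aws_access_key_id="ACCESS_KEY",
    aws_secret_access_key="SECRET_KEY",
)

for bucket in ("datalake", "warehouse"):
    s3.create_bucket(Bucket=bucket)

# Raw data lands in the data lake bucket; curated tables live in the warehouse bucket.
s3.put_object(Bucket="datalake", Key="landing/orders/2024-01-01.json", Body=b"{}")
```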
For example, a common data flow is to have all new data land in the data lake. From there it can be transformed and ingested into the data warehouse, where it can be consumed by other applications and used for data science and data analytics. In this data flow, most of the load falls on the data warehouse side, so you will want to run its storage on high-end hardware (storage devices, storage clusters and network).
External table functionality allows data warehouses and processing engines to read objects in the data lake as if they were SQL tables. If the data lake is used as the landing zone for raw data, then this capability, along with the data warehouse's SQL capabilities, can be used to transform raw data before inserting it into the data warehouse. Alternatively, the external table can be used as-is and joined with other tables and resources inside the data warehouse, without the data ever leaving the data lake. This pattern can help save on migration costs and address some data security concerns by keeping the data in one place while making it available to outside services.
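Continuing the PySpark sketch from earlier, the example below exposes raw Parquet objects in the data lake as a SQL view, joins them with a warehouse table while the raw data stays in place, and optionally routes a transformed result into the warehouse. The bucket paths and table names are illustrative.

```python
from pyspark.sql import SparkSession

# Same assumptions as the earlier sketch: Iceberg runtime and S3 connector on the classpath.
spark = (
    SparkSession.builder.appName("external-table-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    .getOrCreate()
)

# Expose raw Parquet objects in the data lake as a SQL view without copying them
raw_orders = spark.read.parquet("s3a://datalake/landing/orders/")
raw_orders.createOrReplaceTempView("orders_raw")

# Join the external data with a warehouse table while the raw data stays in the lake
spark.sql("""
    SELECT o.order_id, o.amount, c.region
    FROM orders_raw o
    JOIN lake.db.customers c ON o.customer_id = c.customer_id
""").show()

# Or transform the raw data and route it into the warehouse with plain SQL
spark.sql("""
    INSERT INTO lake.db.orders
    SELECT order_id, customer_id, CAST(amount AS DECIMAL(10, 2)) AS amount
    FROM orders_raw
    WHERE amount IS NOT NULL
""")
```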
You can also pursue an AI storage strategy with this reference architecture, but this is beyond the scope of this article. Our reference architecture for an AI/ML modern data lake provides information on building an AI data infrastructure.
The Processing Layer
The processing layer contains the compute needed for all the workloads supported by the modern data lake. At a high level, compute comes in two varieties: processing engines for the data warehouse and clusters for distributed machine learning.
The data warehouse processing engine supports the distributed execution of SQL commands against the data in data warehouse storage. Transformations that are part of the ingestion process may also need the processing layer’s compute power. For example, in some data warehouses, you may wish to use a medallion architecture; in others, you may choose a star schema with dimensional tables. These designs often require substantial extract, transform and load (ETL) against the raw data during ingestion.
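For instance, a bronze-to-silver step in a medallion design might look like the hedged sketch below, in which raw records from a bronze table are deduplicated and typed before landing in a silver table. Table names, columns and cleansing rules are illustrative, and the sketch assumes an Iceberg-enabled Spark session as in the earlier examples.

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Assumes an Iceberg-enabled Spark session as in the earlier sketches.
spark = (
    SparkSession.builder.appName("medallion-etl-sketch")
    .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
    .config("spark.sql.catalog.lake.type", "hadoop")
    .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")
    .getOrCreate()
)

# Bronze: raw, append-only records as they arrived from the ingestion layer
bronze = spark.table("lake.db.orders_bronze")

# Silver: deduplicated, typed and cleansed records ready for analytics
silver = (
    bronze.dropDuplicates(["order_id"])
    .withColumn("amount", F.col("amount").cast("decimal(10,2)"))
    .filter(F.col("order_id").isNotNull())
)

# Replace the silver table; a gold layer would aggregate from silver in the same way
silver.writeTo("lake.db.orders_silver").createOrReplace()
```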
The data warehouse within a modern data lake disaggregates compute from storage. So, if needed, multiple processing engines can exist for a single data warehouse data store. (This differs from a conventional relational database, where compute and storage are tightly coupled and there is one compute resource for every storage device.)
A possible processing layer design is to set up one processing engine for each entity in the consumption layer: for example, one processing cluster for business intelligence (BI), a separate cluster for data analytics and yet another for data science. Each processing engine queries the same data warehouse storage service, but because each team has its own dedicated cluster, the teams do not compete with each other for compute. If the BI team is running compute-intensive month-end reports, it will not interfere with another team running daily reports.
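To make this concrete, the sketch below parameterizes a Spark-based processing engine by cluster while pointing every engine at the same warehouse storage. In practice each team would launch its own application on its own cluster; the master URLs, catalog configuration and table names are placeholders.

```python
from pyspark.sql import SparkSession

def make_engine(app_name: str, master_url: str) -> SparkSession:
    """Build a processing engine on a team's dedicated cluster, pointed at shared warehouse storage."""
    return (
        SparkSession.builder.appName(app_name)
        .master(master_url)  # each team's own compute cluster (placeholder URL)
        .config("spark.sql.catalog.lake", "org.apache.iceberg.spark.SparkCatalog")
        .config("spark.sql.catalog.lake.type", "hadoop")
        .config("spark.sql.catalog.lake.warehouse", "s3a://warehouse/")  # shared storage
        .getOrCreate()
    )

# Each team launches its own application on its own cluster, for example:
#   BI team:           make_engine("bi-reports", "spark://bi-cluster:7077")
#   Data science team: make_engine("ds-exploration", "spark://ds-cluster:7077")
# Both engines read the same warehouse tables without competing for compute, e.g.:
#   engine.sql("SELECT region, SUM(amount) FROM lake.db.orders GROUP BY region").show()
```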
Machine learning models, especially large language models, can be trained faster when training is distributed across a cluster. The machine learning cluster supports this distributed training, which should be integrated with an MLOps tool for experiment tracking and checkpointing.
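As a rough sketch, the training loop below uses PyTorch DistributedDataParallel, assumed to be launched with torchrun on the machine learning cluster, with rank 0 logging metrics to an MLflow tracking server and writing checkpoints to shared storage. The model, data, tracking URI and checkpoint path are all placeholders.

```python
import os
import torch
import torch.distributed as dist
import mlflow
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every worker in the cluster
    dist.init_process_group(backend="nccl")
    local_rank = int(os.environ["LOCAL_RANK"])
    device = torch.device(f"cuda:{local_rank}")
    torch.cuda.set_device(device)

    model = DDP(torch.nn.Linear(128, 1).to(device), device_ids=[local_rank])  # placeholder model
    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    if dist.get_rank() == 0:
        mlflow.set_tracking_uri("http://mlflow.internal:5000")  # placeholder MLOps endpoint
        mlflow.start_run(run_name="ddp-sketch")

    for step in range(100):
        x = torch.randn(32, 128, device=device)  # placeholder batch
        y = torch.randn(32, 1, device=device)
        loss = torch.nn.functional.mse_loss(model(x), y)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Rank 0 handles experiment tracking and checkpointing to shared storage
        if dist.get_rank() == 0 and step % 10 == 0:
            mlflow.log_metric("loss", loss.item(), step=step)
            torch.save(model.module.state_dict(), f"/mnt/checkpoints/step_{step}.pt")

    if dist.get_rank() == 0:
        mlflow.end_run()
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```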
Summary
This article presents a high-level reference architecture for a modern data lake and explores its core components. The goal is to provide organizations with a strategic blueprint for building a platform that efficiently manages and extracts value from their vast and diverse data sets.
The modern data lake combines the strengths of an OTF-based data warehouse and a flexible data lake, offering a unified and scalable solution for storing, processing and analyzing data. If you would like to go deeper into these concepts, reach out to the Min.io team at hello@min.io.