The Data Lake traces its origin back to the heyday of Hadoop and the emergence of cloud computing. The idea was intriguing: store all of your raw data (log files, CSV, JSON, and so forth) on commodity scale-out storage and compute, and make sense of it at read time, for example with MapReduce.
That was great for evolving schemas, scalable writes, and computing that was inexpensive compared to Oracle or similar RDBMS. However, MapReduce is expensive in engineering time and excludes most users from working with the data directly.
Innovations like Apache Hive and Apache Parquet addressed some of these shortcomings. Hive provides a familiar SQL interface to the data, and Parquet makes storing and retrieving data more efficient. Yet a key aspect of Data Lakes remained the same: they are usually append-only and optimised for large-scale processing, making small-data retrieval slow compared to an RDBMS or an indexed NoSQL store.
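To illustrate why a columnar format like Parquet speeds up analytical reads, here is a minimal sketch in plain Python. The data and layout are illustrative assumptions; real Parquet adds encodings, compression, and per-column statistics on top of this basic idea.

```python
# Illustrative sketch: a columnar layout (the idea behind Parquet) lets an
# analytical query touch one field without walking every whole record.

rows = [
    {"user": "a", "bytes": 120, "status": 200},
    {"user": "b", "bytes": 340, "status": 404},
    {"user": "c", "bytes": 95,  "status": 200},
]

# Row-oriented storage: records kept whole, as in CSV or JSON lines.
row_store = rows

# Column-oriented storage: each field stored contiguously, readable alone.
column_store = {
    "user":   [r["user"] for r in rows],
    "bytes":  [r["bytes"] for r in rows],
    "status": [r["status"] for r in rows],
}

# A query like AVG(bytes) only needs to scan the one column it uses.
avg_bytes = sum(column_store["bytes"]) / len(column_store["bytes"])
print(avg_bytes)  # 185.0
```

With terabytes of data, scanning only the referenced columns (and skipping files via column statistics) is where the efficiency gain comes from.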
Delta Lake offers several improvements intended to make it more usable for more users and in more scenarios than a Data Lake. It is sometimes framed as an additional layer on top of, or an evolution of, the Data Lake. However, as expert readers might note, some of these features and improvements have made their way into the ever-evolving Data Lake ecosystem, or can be achieved with effort, making the lines between the two blurry.
Something most RDBMS users miss in a Data Lake is ACID transactions, which Delta Lake provides as a powerful feature, including the ability to version data and time travel back and forth between versions. Delta Lake also improves metadata management with more scalability, consistency, and flexibility, and it integrates well with batch and streaming workloads through SQL, Python, Scala and Java APIs.
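The versioning and time travel rest on an ordered, append-only commit log from which any past table version can be reconstructed. The toy sketch below illustrates that principle only; Delta Lake's real `_delta_log` stores commit actions with full ACID guarantees and a different format, and all names here are made up for illustration.

```python
# Toy sketch of time travel via an append-only commit log. Each table
# version is a replay of commits 0..N; nothing here is Delta's actual
# on-disk format.

commits = []  # each commit records the data files added and removed

def commit(add=(), remove=()):
    commits.append({"add": set(add), "remove": set(remove)})

def files_as_of(version):
    """Reconstruct the set of live data files at a given table version."""
    live = set()
    for c in commits[: version + 1]:
        live |= c["add"]
        live -= c["remove"]
    return live

commit(add={"part-000.parquet"})                               # version 0
commit(add={"part-001.parquet"})                               # version 1
commit(add={"part-002.parquet"}, remove={"part-000.parquet"})  # version 2

print(files_as_of(0))  # {'part-000.parquet'}
print(files_as_of(2))  # {'part-001.parquet', 'part-002.parquet'}
```

Because commits are never rewritten, reading "as of version 0" stays possible even after later commits remove files, which is exactly what makes time travel cheap to offer.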
Ideally, Delta Lakes offer more usability to engineers and savvy data end-users, reducing or eliminating the need to land data aggregates in specialised stores for use cases with differing access patterns, and with it the need to reconcile data across multiple stores.
While Delta Lake provides significant improvements and features, they come at a cost. Firstly, Delta Lakes are more expensive in money, time, infrastructure and complexity than Data Lakes. Data Lakes and technologies like Hive, Trino and Athena are cost-efficient for their ideal use cases. AWS Athena, for example, is priced at $5 per TB scanned, which can go a long way with efficient Parquet storage.
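A back-of-the-envelope calculation makes the cost point concrete at the quoted $5 per TB scanned. The 1 TB dataset size and the roughly 10x scan reduction from compressed, column-pruned Parquet are illustrative assumptions, not benchmarks.

```python
# Back-of-the-envelope Athena cost at the quoted $5 per TB scanned.
# Dataset size and the ~10x Parquet reduction are assumed for illustration.

PRICE_PER_TB_SCANNED = 5.00   # USD, AWS Athena's quoted rate

raw_tb = 1.0                  # scanning raw CSV/JSON reads the whole dataset
parquet_tb = raw_tb / 10      # assumed: compression + column pruning

cost_raw = raw_tb * PRICE_PER_TB_SCANNED
cost_parquet = parquet_tb * PRICE_PER_TB_SCANNED

print(f"raw scan: ${cost_raw:.2f}, parquet scan: ${cost_parquet:.2f}")
# raw scan: $5.00, parquet scan: $0.50
```

At that rate, a query pattern that only touches a few columns of well-laid-out Parquet can run for cents, which is hard to beat with heavier infrastructure.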
You should only consider a Delta Lake if a Data Lake proves insufficient. The connectivity and integration options of Delta Lake make this evolution possible: upgrading from a Data Lake to a Delta Lake when needed is a viable path. Use Hive, Athena, Trino, Flink or similar Data Lake technology if it works for your use case, especially when you are cost-sensitive.
Secondly, the underlying technologies and the architecture of Delta Lakes dictate constraints. Limits are natural when you run on distributed computing and storage. You can lessen them with cutting-edge and expensive machines, with local storage, or by optimising and tuning in private clouds. But you will not beat systems built for a purpose. Do not expect to outperform RDBMS in their domain or specialised NoSQL stores in theirs.
Depending on the use case, you could choose a simple Data Lake and optimised storage. However, Delta Lakes can be an appealing option where performance expectations allow some flexibility to address multiple access patterns and needs in one architecture.
Christian Prokopp, Bold Data, Founder
AWS released Athena Spark at re:Invent 2022. Is Athena Spark a Delta Lake alternative to Databricks?