Delta Lake vs Data Lake

November 2nd, 2022
Should you switch your Data Lake to a Delta Lake? At first glance, Delta Lakes offer compelling features like ACID transactions. But at what cost?

Data Lake

The Data Lake traces its origin back to the heyday of Hadoop and the emergence of cloud computing. The idea was intriguing: store all of your raw data, e.g. log files, CSV and JSON, on commodity scale-out storage and compute, and make sense of it at read time, for example with MapReduce.
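The "make sense of it at read time" idea is often called schema-on-read. A minimal sketch of the principle, with made-up log lines and field names:

```python
# Schema-on-read sketch: raw records land in the lake as-is, and structure is
# imposed only when reading. The log lines and fields are made up.
import json

raw_lines = [
    '{"user": "a", "bytes": 120}',
    '{"user": "b"}',                 # fields can be missing in raw data
]
records = [json.loads(line) for line in raw_lines]

# The "schema" lives in the read-side code, not in the store, so the reader
# must tolerate missing or unexpected fields.
total_bytes = sum(r.get("bytes", 0) for r in records)
```

The flexibility at write time is paid for at read time: every consumer has to handle the messiness of the raw data.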

That was great for evolving schemas, scalable writes, and comparatively inexpensive computing, e.g. versus Oracle or similar RDBMS. However, MapReduce is expensive in engineering time and excludes most users from working with the data directly.

Innovations like Apache Hive and Apache Parquet addressed some of these shortcomings. Hive provides a familiar SQL interface to the data, and Parquet makes storing and retrieving data more efficient. One key aspect of Data Lakes remained the same: they are usually append-only and optimised for large-scale processing, making small, selective reads slow compared to an RDBMS or indexed NoSQL store.

Delta Lake

Delta Lake offers several improvements intended to make it more usable for more users and in more scenarios than a Data Lake. It is sometimes framed as an additional layer on top of, or an evolution of, the Data Lake. However, as expert readers might note, some features and improvements made their way into the ever-evolving Data Lake ecosystem or can be achieved with effort, making the lines between them blurry.

Something most RDBMS users miss in a Data Lake is ACID transactions, which Delta Lake provides as a powerful feature, including data versioning and time travel back and forth between versions. Delta Lake also improves metadata management with more scalability, consistency, and flexibility, and it integrates well with batch and streaming workloads through SQL, Python, Scala and Java APIs.

Ideally, Delta Lakes offer more usability to engineers and savvy data end-users, reducing or eliminating the need to land data aggregates in specialised stores for use cases with differing access patterns, and with it the reconciliation of data across multiple stores.

Data Lake vs Delta Lake

While Delta Lake provides significant improvements and features, they come at a cost. Firstly, Delta Lakes are more expensive than Data Lakes in money, time, infrastructure and complexity. Data Lakes and technologies like Hive, Trino and Athena are cost-efficient for their ideal use cases. AWS Athena, for example, is priced at $5 per TB scanned, which can go a long way with efficient Parquet storage.
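A back-of-envelope calculation shows how far the $5 per TB scanned quoted above can stretch; the scan volumes are hypothetical:

```python
# Athena cost sketch, assuming the $5 per TB scanned list price mentioned
# above; regions and current pricing may differ.
PRICE_PER_TB_USD = 5.0

def athena_scan_cost(tb_scanned: float) -> float:
    """Return the query cost in USD for a given volume scanned."""
    return tb_scanned * PRICE_PER_TB_USD

# Example: scanning 100 GB of Parquet per day for 30 days (about 2.93 TB).
monthly_cost = athena_scan_cost(100 / 1024 * 30)
```

With column pruning and compression keeping scans small, a month of daily queries can cost less than a single always-on cluster hour.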

You should only consider a Delta Lake if a Data Lake proves insufficient. Delta Lake's connectivity and integration options allow it to be adopted as an evolution: upgrading from a Data Lake to a Delta Lake when needed is a viable option. Use Hive, Athena, Trino, Flink or similar Data Lake technology if it works for your use case, especially when you are cost-sensitive.

Secondly, the underlying technologies and architecture of Delta Lakes dictate constraints. Limits are natural when you run on distributed compute and storage. You can lessen them with cutting-edge, expensive machines, local storage, or by optimising and tuning in private clouds. But you will not beat systems built for a purpose: do not expect to outperform an RDBMS in its domain or a specialised NoSQL store in its.

Depending on the use case, a simple Data Lake with optimised storage may suffice. However, where performance expectations allow some flexibility, a Delta Lake is an appealing option for serving multiple access patterns and needs in one architecture.


Christian Prokopp, Bold Data, Founder


Update

AWS released Athena Spark at re:Invent 2022. Is Athena Spark a Delta Lake alternative to Databricks?

