Is Athena Spark a Delta Lake alternative to Databricks?

December 2nd, 2022
Finally. AWS re:Invent 2022 brought the answer to both Databricks and Athena's worst limitations. Athena Spark promises to bring Delta Lake scale-out processing effortlessly and inexpensively.

Athena Data Lake

Athena started as a serverless Presto engine six years ago, in 2016. Its key selling point was to do away with the cost and complexity of Hadoop at a fixed price per TB of data scanned. As an experienced and avid user of Hadoop and Hive at the time, I fell in love with Athena. It took away the pain and complexity of Hadoop, and with S3, it had the perfect durable, scalable, performant storage layer.

AWS Glue joined as its metastore in 2017. Later on, this made it possible to access and process the same data from Athena or from Spark via EMR and Hadoop. That was an important addition since Athena's Presto core has limits, and Athena queries time out at a hard limit of 15 minutes. So Athena could not execute complex, extremely memory-intensive or long-running processing. Of course, with enough creativity, breaking down processes and data into more tables or steps, you can work around the limit to a degree, but at the cost of complexity, stability and your sanity.

Athena Delta Lake

Athena grew further with the addition of Iceberg and the Trino engine update. These promised ACID compliance and performance improvements. It is noteworthy that the underlying concept of storing the data in S3 with serverless processing means Athena was never intended to compete with the access patterns of RDBMS or low-latency NoSQL stores. ACID compliance, in this case, is aimed more at helping with the overlapping or continuous landing of data that previously needed to be reconciled before analysis or loading into low-latency stores.

I have yet to explore these features in depth. Do check out how Athena engine v2 vs v3 performs for you, though. I was surprised by how much slower instead of faster it was in a specific circumstance of mine. That may be resolved with future updates, or I may have to change my data processing strategy a bit, e.g. embracing Iceberg over my (now) slow legacy workarounds.

Athena Spark

While I am a fan of Athena, I had the opportunity to work with Databricks for a while. Its ease of use, performance and accessibility to Data Engineers, Data Scientists and Data Analysts/BI make it a valuable part of a company's data stack. However, that value comes at a cost in licensing and computing. Athena was always mind-blowingly cheaper but limited in applicable use cases. Extending the use cases with less expensive EMR/Hadoop adds so much complexity that it removes the cost benefits compared to Databricks in many scenarios.

That is the sweet spot that Athena Spark could fill and make a huge impact. If Athena Spark turns out to be nearly as easy to use and as cost-effective as Athena was so far, then it could cover a massive range of use cases requiring something like Hadoop with Spark or Databricks today.

'Legacy' Athena with Trino is excellent at scale-out SQL for BI analytics. The addition of Athena Spark promises to break into advanced analytics, ML, Data Science and extensive data processing and movement. The parts are here; it is up to AWS to prove the ease of use and cost.

Athena Spark Pricing

Anyone familiar with Databricks pricing will find Athena Spark pricing oddly familiar. Instead of DBUs (Databricks Units), AWS uses DPUs (Data Processing Units). One DPU is normalised at 4 vCPUs and 16 GB of RAM, and AWS charges an hourly rate per DPU (e.g. $0.35 in us-east-1) for the machines doing the processing. The cost depends on the length of sessions, the driver machine, and the number of machines processing data and how long they run.
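To make the DPU arithmetic concrete, here is a minimal sketch of a session cost estimate. The `SparkSession` class, its field names and the example sizing are hypothetical and only serve as illustration. The sketch also assumes that every machine runs for the full session and ignores any billing granularity or minimums, so treat the result as a rough estimate, not as AWS's billing logic.

```python
from dataclasses import dataclass

# Example rate quoted in the text for us-east-1; check the current price list.
DPU_HOUR_RATE_USD = 0.35


@dataclass
class SparkSession:
    """Hypothetical description of one session, for illustration only."""
    driver_dpus: float    # DPUs used by the driver
    executor_dpus: float  # total DPUs across all executors
    hours: float          # session length in hours


def session_cost(session: SparkSession, rate: float = DPU_HOUR_RATE_USD) -> float:
    """Estimate cost as (driver + executor DPUs) x hours x hourly rate.

    Assumption: all machines are active for the whole session.
    """
    dpu_hours = (session.driver_dpus + session.executor_dpus) * session.hours
    return dpu_hours * rate


# Hypothetical sizing: 1 driver DPU and 4 executor DPUs for half an hour.
example = SparkSession(driver_dpus=1, executor_dpus=4, hours=0.5)
print(f"Estimated cost: ${session_cost(example):.3f}")
```

In this example, 5 DPUs for half an hour are 2.5 DPU-hours, so the estimate is 2.5 times the hourly rate.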

Sessions are abstracted away from you with transparent starts and timed-out shutdowns, making it less complicated than DIY Hadoop on EC2 or EMR. However, it is less predictable and straightforward than the Athena Trino version's $5 per TB scanned. It is understandable why AWS chose this model to tie pricing to costs, but from a user perspective, it is as complicated as Databricks' model, with one significant benefit. Databricks encourages contracts and pre-buying DBUs with its pricing and discounting model, which is a severe headache when you budget and try to predict the future. The AWS pay-as-you-go option with a single best price is preferable.
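The two pricing models can be put side by side with a rough break-even calculation. This is only a sketch using the two example prices from the text. The function names are hypothetical, and it assumes a workload could run on either engine while scanning the same amount of data, which will not hold for every job.

```python
# Example prices quoted in the text; check current pricing for your region.
TRINO_USD_PER_TB = 5.00        # Athena (Trino) price per TB scanned
SPARK_USD_PER_DPU_HOUR = 0.35  # Athena Spark price per DPU-hour


def break_even_dpu_hours(tb_scanned: float) -> float:
    """DPU-hours that cost the same as scanning `tb_scanned` TB with Trino."""
    return tb_scanned * TRINO_USD_PER_TB / SPARK_USD_PER_DPU_HOUR


def cheaper_engine(tb_scanned: float, dpu_hours: float) -> str:
    """Return which pricing model is cheaper for a hypothetical job."""
    trino_cost = tb_scanned * TRINO_USD_PER_TB
    spark_cost = dpu_hours * SPARK_USD_PER_DPU_HOUR
    return "spark" if spark_cost < trino_cost else "trino"


print(f"1 TB scanned costs as much as {break_even_dpu_hours(1.0):.1f} DPU-hours")
print(cheaper_engine(tb_scanned=1.0, dpu_hours=10))  # 3.50 USD vs 5.00 USD
print(cheaper_engine(tb_scanned=0.1, dpu_hours=10))  # 3.50 USD vs 0.50 USD
```

The point of the comparison is predictability: the per-TB price depends only on the data scanned, while the DPU price depends on how long and how wide a session runs.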

Lastly, I have yet to use the service. It is not available in the regions where I need it most; AWS promised to roll it out to more regions over the following months. When I attempted to start an Athena Spark session in a supported region, it failed with a permission error. Teething problems, I hope. I am also under no illusion: its feature set will be significantly smaller than Databricks'. The promise of an accessible, cost-effective Athena Spark is exciting, but it now has to prove itself in the real world.

Christian Prokopp, Bold Data, Founder
