By Christian Prokopp on 2022-12-02
Finally. AWS re:Invent 2022 brought the answer to both Databricks and Athena's worst limitations. Athena Spark promises to bring Delta Lake scale-out processing effortlessly and inexpensively.
Athena started as a serverless Presto engine six years ago, in 2016. Its key selling point was to do away with the cost and complexity of Hadoop at a fixed price per TB of data scanned. As an experienced and avid user of Hadoop and Hive at the time, I fell in love with Athena. It took away the pain and complexity of Hadoop, and with S3, it had the perfect durable, scalable, performant storage layer.
AWS Glue joined as its metastore in the 2017 product lineup. Later on, it made it possible to access and process the same data from Athena or from Spark via EMR and Hadoop. That was an important addition since Athena's Presto core has limits, and Athena queries time out at a hard limit of 15 minutes. So Athena could not execute complex, extremely memory-intensive or long-running processing. Of course, with enough creativity, breaking down processes and data into more tables or steps, you can work around the limit to a degree, but at the cost of complexity, stability and your sanity.
Athena grew further with the addition of Iceberg and the Trino engine update. These promised ACID compliance and performance improvements. It is noteworthy that the underlying concept of storing the data in S3 with serverless processing means Athena never intended to compete with the access patterns of RDBMS or low latency NoSQL stores. ACID compliance, in this case, is more aimed at helping with the overlapping or continuous landing of data that previously needed to be reconciled before analysis or loading into low-latency stores.
I have yet to explore these features in depth. Do check out how Athena engine v2 vs v3 performs for you, though. I was surprised by how much slower instead of faster it was in a specific circumstance of mine. That may be resolved with future updates, or I may have to change my data processing strategy a bit, e.g. embracing Iceberg over my (now) slow legacy workarounds.
While I am a fan of Athena, I had the opportunity to work with Databricks for a while. Its ease of use, performance and accessibility to Data Engineers, Data Scientists and Data Analysts/BI make it a valuable part of a company's data stack. However, that value comes at a cost in licensing and computing. Athena was always mind-blowingly cheaper but limited in applicable use cases. Extending the use cases with less expensive EMR/Hadoop adds so much complexity that it removes the cost benefits compared to Databricks in many scenarios.
That is the sweet spot that Athena Spark could fill and make a huge impact. If Athena Spark turns out to be nearly as easy to use and as cost-effective as Athena was so far, then it could cover a massive range of use cases requiring something like Hadoop with Spark or Databricks today.
'Legacy' Athena with Trino is excellent at scale-out SQL for BI analytics. The addition of Athena Spark promises to break into advanced analytics, ML, Data Science and extensive data processing and movement. The parts are here; it is up to AWS to prove the ease of use and cost.
Anyone familiar with Databricks pricing will find Athena Spark pricing oddly familiar. Instead of DBUs (Databricks Units), AWS uses DPUs (Data Processing Units). You pay an hourly rate per DPU (e.g. $0.35 in us-east-1) for the machines doing the processing, with one DPU normalised to 4 vCPUs and 16 GB of RAM. The cost depends on the length of sessions, the driver machine, and the number of machines processing data and how long they run.
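To make the DPU arithmetic concrete, here is a minimal Python sketch of a session cost estimate. It is an illustration, not an AWS API: the `SparkSession` class and its fields are hypothetical names, the $0.35 rate is the us-east-1 example quoted above, and the model assumes every worker runs for the whole session and ignores any minimum billing or rounding rules AWS may apply.

```python
from dataclasses import dataclass

# Example rate from the text (us-east-1); check current AWS pricing before relying on it.
DPU_HOUR_RATE_USD = 0.35


@dataclass
class SparkSession:
    """Hypothetical description of one Athena Spark session (not an AWS object)."""
    driver_dpus: float   # DPUs allocated to the driver
    worker_dpus: float   # DPUs allocated to each worker
    workers: int         # number of workers (assumed constant for the session)
    hours: float         # session duration in hours


def session_cost(session: SparkSession, rate: float = DPU_HOUR_RATE_USD) -> float:
    """Estimate cost as total DPU-hours multiplied by the hourly DPU rate."""
    total_dpus = session.driver_dpus + session.worker_dpus * session.workers
    dpu_hours = total_dpus * session.hours
    return dpu_hours * rate
```

For example, `session_cost(SparkSession(driver_dpus=1, worker_dpus=1, workers=4, hours=0.5))` covers 2.5 DPU-hours and comes to roughly $0.88 at the example rate.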
Sessions are abstracted away from you with transparent starts and timed-out shutdowns, making it less complicated than DIY Hadoop on EC2 or EMR. However, it is less predictable and straightforward than the Athena Trino version's $5 per TB scanned. It is understandable why AWS chose this model to tie pricing to costs, but from a user perspective, it is as complicated as Databricks' model, with one significant benefit. Databricks encourages contracts and pre-buying DBUs through its pricing and discounting model, which is a severe headache when you budget and try to predict the future. The AWS pay-as-you-go option with a single, best price is preferable.
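One rough way to compare the two pricing models is to ask how many DPU-hours the same money would buy. The Python sketch below uses only the two example prices mentioned in this post ($5 per TB scanned and $0.35 per DPU-hour); the function names are hypothetical, and the comparison says nothing about which engine finishes a given job faster.

```python
# Example prices taken from the text; verify against current AWS pricing.
TRINO_USD_PER_TB = 5.00
DPU_HOUR_RATE_USD = 0.35


def scan_cost(tb_scanned: float, usd_per_tb: float = TRINO_USD_PER_TB) -> float:
    """Cost of an Athena SQL query that scans the given number of terabytes."""
    return tb_scanned * usd_per_tb


def break_even_dpu_hours(tb_scanned: float,
                         usd_per_tb: float = TRINO_USD_PER_TB,
                         dpu_rate: float = DPU_HOUR_RATE_USD) -> float:
    """DPU-hours a Spark session could consume for the same money as the scan."""
    return scan_cost(tb_scanned, usd_per_tb) / dpu_rate
```

At these example prices, scanning 1 TB with the SQL engine costs the same as a little over 14 DPU-hours of Spark. Whether a Spark job stays under that budget depends entirely on the workload.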
Lastly, I have yet to try the service in earnest. It is not available in the regions where I need it most, although AWS promised to roll it out to more regions over the following months. When I attempted to start an Athena Spark session in a supported region, it failed with a permission error. Teething errors, I hope. I also have no illusions: its feature set will be significantly smaller than Databricks'. The promise of an accessible, cost-effective Athena Spark is exciting, but it has to prove itself in the real world now.
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at firstname.lastname@example.org for inquiries.