Securely Deploying Delta Lake on AWS: Best Practices for Cloud Engineers

February 15th, 2023
Five tips for Cloud Engineers to deploy Databricks' Delta Lake on AWS safely.

Cloud engineering underpins modern business infrastructure. As more companies rely on cloud-based systems to store and analyse data, they need engineers who can deploy and manage those systems well. Delta Lake on AWS has become a popular choice in recent years for advanced analytics and data processing. Deploying it can be complex, however, and a secure, successful deployment depends on following established best practices. This article gives five recommendations to help cloud engineers deploy Delta Lake on AWS safely.

Customer-Managed VPC

The customer-managed VPC is the foundation of any standard Delta Lake deployment. Size the VPC for your expected workloads, and enable both DNS hostnames and DNS resolution on it. Each workspace needs at least two dedicated private subnets in separate availability zones; aim to keep all other AWS resources out of that subnet space. Use the automatic availability zone (auto-AZ) option for all clusters to make the best use of the IP addresses available within a single workspace.
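
A rough way to sanity-check these requirements before creating a workspace is sketched below. It is an illustration, not an official tool: the function names, the `az` key, and the figure of two private IPs per cluster node are assumptions made for this example, so confirm the per-node figure against the current platform documentation. The five reserved addresses per subnet is standard AWS behaviour.

```python
import ipaddress

AWS_RESERVED_IPS_PER_SUBNET = 5  # AWS reserves five addresses in every subnet
IPS_PER_NODE = 2  # assumption for this sketch: each cluster node uses two private IPs


def max_nodes(subnet_cidr: str) -> int:
    """Upper bound on the cluster nodes that fit in one subnet."""
    total = ipaddress.ip_network(subnet_cidr).num_addresses
    usable = max(total - AWS_RESERVED_IPS_PER_SUBNET, 0)
    return usable // IPS_PER_NODE


def check_vpc(dns_hostnames: bool, dns_support: bool, subnets: list) -> list:
    """Return a list of problems; an empty list means the layout passes."""
    problems = []
    if not dns_hostnames:
        problems.append("DNS hostnames are disabled")
    if not dns_support:
        problems.append("DNS resolution is disabled")
    if len(subnets) < 2:
        problems.append("fewer than two private subnets")
    # Each subnet is a dict with a hypothetical "az" key naming its zone.
    if len({subnet["az"] for subnet in subnets}) < 2:
        problems.append("subnets are not spread across two availability zones")
    return problems
```

For example, `max_nodes("10.0.0.0/26")` gives 29 under these assumptions: 64 addresses, minus 5 reserved, divided by 2 per node.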

Security Groups

Security group rules for the Delta Lake EC2 instances must allow all inbound (ingress) traffic from other resources in the same security group. Outbound (egress) rules must allow all TCP and UDP traffic to the workspace security group itself, and must open ports such as 443, 3306, and 6666. If your clusters connect to services like Apache Kafka or Redshift, also allow outbound traffic on the ports those services use to avoid connection issues.
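
The rules above can be written down as data before they are applied. The sketch below builds them in the `IpPermissions` shape that the boto3 EC2 calls `authorize_security_group_ingress` and `authorize_security_group_egress` accept, without calling AWS itself. The function name is hypothetical, and opening the three ports to `0.0.0.0/0` is an assumption; narrow the destination ranges if your network design allows it.

```python
def workspace_security_group_rules(group_id: str) -> dict:
    """Build ingress and egress rules for a workspace security group."""
    # All TCP and UDP ports, but only to or from members of the same group.
    same_group = [
        {
            "IpProtocol": protocol,
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": [{"GroupId": group_id}],
        }
        for protocol in ("tcp", "udp")
    ]
    # Outbound TCP ports named in the text (destination is an assumption).
    open_ports = [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": "0.0.0.0/0"}],
        }
        for port in (443, 3306, 6666)
    ]
    return {"ingress": same_group, "egress": same_group + open_ports}
```

Ports for additional services such as Kafka or Redshift would be appended to the egress list in the same way.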

Data Plane to Control Plane Connectivity

When the cross-account IAM role spins up an EC2 instance in your AWS account, the cluster tries to call home to the Delta Lake AWS account. This call home goes through the secure cluster connectivity (SCC) relay. Ideally, use AWS PrivateLink endpoints to route the SCC relay and REST API traffic over the AWS backbone infrastructure, which adds another layer of security. You can also lock down the cluster entirely by routing to the control plane through PrivateLink and leaving no internet gateway for the EC2 instances to reach.
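
One way to confirm that a subnet is fully locked down is to inspect its route table for any path to the internet. The sketch below assumes route entries shaped like the `Routes` items returned by boto3's `describe_route_tables`; the function name is hypothetical. It also flags NAT gateway routes, on the assumption that a NAT gateway ultimately leads to an internet gateway.

```python
def internet_routes(routes: list) -> list:
    """Return the routes that send traffic to an internet or NAT gateway."""
    found = []
    for route in routes:
        # GatewayId is "local" for in-VPC traffic and "igw-..." for an internet gateway.
        via_internet_gateway = route.get("GatewayId", "").startswith("igw-")
        via_nat_gateway = "NatGatewayId" in route
        if via_internet_gateway or via_nat_gateway:
            found.append(route)
    return found
```

An empty result for every route table attached to the workspace subnets suggests the instances can only reach the control plane through the PrivateLink endpoints.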

Cross-Account IAM Role

The cross-account IAM role is a required part of the deployment and is used when initially spinning up EC2 instances within your environment. Restrict the role to only the necessary resources, and restrict where the EC2 AMIs are sourced from. Be sure to use the proper IAM roles for both Unity Catalog and Serverless.
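
Restricting the AMI source can be expressed as an IAM policy statement that uses the `ec2:Owner` condition key on the image resource. The sketch below builds only that one statement; the function name and `Sid` are hypothetical, and the account ID is a parameter you must take from the vendor's documentation rather than from this example. A complete `ec2:RunInstances` policy also needs statements for the other resources involved, such as instances, subnets, and volumes.

```python
import json


def restrict_ami_statement(ami_owner_account_id: str) -> dict:
    """Allow instance launches only from AMIs owned by the given account."""
    return {
        "Sid": "AllowLaunchOnlyFromApprovedAmiOwner",
        "Effect": "Allow",
        "Action": "ec2:RunInstances",
        # Image ARNs have an empty account field.
        "Resource": "arn:aws:ec2:*::image/*",
        "Condition": {"StringEquals": {"ec2:Owner": ami_owner_account_id}},
    }


if __name__ == "__main__":
    # "111122223333" is a placeholder account ID, not a real AMI owner.
    print(json.dumps(restrict_ami_statement("111122223333"), indent=2))
```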

Workspace Root Bucket

The workspace root bucket stores workspace objects such as cluster logs, notebook revisions, job results, and libraries. Create a specific bucket policy that allows the Delta Lake control plane to write to it, and do not use the bucket for customer data or share it between multiple workspaces. Plan to encrypt your workspace storage, EBS volumes, and managed objects.
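
A bucket policy of this kind might look like the sketch below, which grants one principal read and write access to a single root bucket. The function name, the `Sid`, and the exact action list are assumptions for illustration, and the control plane principal ARN is a placeholder to be filled in from the vendor's documentation.

```python
def root_bucket_policy(bucket: str, control_plane_principal_arn: str) -> dict:
    """Build an S3 bucket policy granting one principal access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrantControlPlaneAccess",
                "Effect": "Allow",
                "Principal": {"AWS": control_plane_principal_arn},
                "Action": [
                    "s3:GetObject",
                    "s3:GetObjectVersion",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                # Bucket-level actions need the bucket ARN; object-level ones need "/*".
                "Resource": [
                    f"arn:aws:s3:::{bucket}",
                    f"arn:aws:s3:::{bucket}/*",
                ],
            }
        ],
    }
```

Generating one policy per workspace from its own bucket name keeps each root bucket dedicated to a single workspace.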

In the end, deploying Delta Lake on AWS safely comes down to five components: the customer-managed VPC, security groups, data plane to control plane connectivity, the cross-account IAM role, and the workspace root bucket. Cloud engineers must configure each one correctly to avoid connection or security issues. By following these recommendations, they can give their organisation the analytics and data processing capabilities it needs.

Read more about it in a related Databricks blog post.

