By Christian Prokopp on 2023-02-15
Five tips for Cloud Engineers to deploy Databricks' Delta Lake on AWS safely.
Cloud engineering is a critical component of any modern business infrastructure. With more and more companies relying on cloud-based systems to store and analyse data, it's essential to have knowledgeable engineers who can manage and deploy these systems effectively. One tool that has become increasingly popular in recent years is Delta Lake on AWS, a platform that provides advanced analytics and data processing capabilities. However, deploying Delta Lake on AWS can be complex, and a successful deployment depends on a thorough understanding of the relevant best practices. This article offers five tips to help cloud engineers deploy Delta Lake on AWS effectively.
The customer-managed VPC is the foundation of any standard Delta Lake deployment. It's crucial to ensure that the VPC is appropriately sized for your specific needs and has DNS hostnames and DNS resolution enabled. Additionally, each workspace must have a minimum of two dedicated private subnets in separate availability zones, and ideally no other AWS resources should be placed in that subnet space. It's also advisable to use the automatic availability zone (auto-AZ) option for all clusters, so that each cluster can launch in whichever zone has IP addresses available and the workspace makes the best use of its address space.
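The VPC requirements above can be expressed as a small pre-flight check. This is a minimal sketch in plain Python, not an official validation tool: the `Subnet` type and the `check_workspace_network` function are hypothetical names introduced for illustration, and the inputs are assumed to have been collected beforehand (for example from the AWS console or API).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    """Hypothetical, simplified description of one subnet."""
    subnet_id: str
    availability_zone: str
    is_private: bool


def check_workspace_network(
    dns_hostnames: bool, dns_resolution: bool, subnets: list[Subnet]
) -> list[str]:
    """Return a list of problems; an empty list means the checks passed."""
    problems = []
    if not dns_hostnames:
        problems.append("DNS hostnames are disabled on the VPC")
    if not dns_resolution:
        problems.append("DNS resolution is disabled on the VPC")

    # Only private subnets count towards the workspace requirement.
    private = [s for s in subnets if s.is_private]
    if len(private) < 2:
        problems.append("fewer than two private subnets")
    if len({s.availability_zone for s in private}) < 2:
        problems.append("private subnets do not span two availability zones")
    return problems
```

Running a check like this before creating the workspace surfaces the most common layout mistakes early. It does not verify subnet sizing, which depends on how many cluster nodes you expect to run.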
Security group rules for the Delta Lake EC2 instances must allow all inbound (ingress) traffic from other resources in the same security group. It's also crucial to ensure that the outbound (egress) traffic rules allow all TCP and UDP access to the workspace security group and that ports such as 443, 3306, and 6666 are open. If your clusters connect to services like Apache Kafka or Redshift, make sure the workspace security group also allows the outbound ports those services use, to avoid connection issues.
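As an illustration, the rules described above can be written down in the `IpPermissions` shape that the EC2 API uses for security group rules. This is a sketch under assumptions: `build_workspace_sg_rules` is a hypothetical helper, and the destination CIDR for ports 443, 3306, and 6666 is assumed to be `0.0.0.0/0` because the text does not name a destination. Narrow it if your environment allows.

```python
def build_workspace_sg_rules(sg_id, egress_cidr="0.0.0.0/0"):
    """Build ingress and egress rule dictionaries for a workspace security group.

    egress_cidr is an assumption for illustration, not a value from the article.
    """
    same_group = [{"GroupId": sg_id}]

    # Ingress: all traffic from members of the same security group.
    ingress = [{"IpProtocol": "-1", "UserIdGroupPairs": same_group}]

    # Egress: all TCP and UDP to the workspace security group itself.
    egress = [
        {
            "IpProtocol": protocol,
            "FromPort": 0,
            "ToPort": 65535,
            "UserIdGroupPairs": same_group,
        }
        for protocol in ("tcp", "udp")
    ]

    # Egress: the specific ports named in the text.
    egress += [
        {
            "IpProtocol": "tcp",
            "FromPort": port,
            "ToPort": port,
            "IpRanges": [{"CidrIp": egress_cidr}],
        }
        for port in (443, 3306, 6666)
    ]
    return {"ingress": ingress, "egress": egress}
```

The resulting lists could be passed as the `IpPermissions` argument of the boto3 EC2 client's `authorize_security_group_ingress` and `authorize_security_group_egress` calls. Extra ports for services such as Kafka or Redshift would be appended in the same way.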
When the cross-account IAM role spins up an EC2 instance in your AWS account, the cluster will try to call home to the Delta Lake AWS account. This process of calling home is known as the secure cluster connectivity (SCC) relay. Ideally, use AWS PrivateLink endpoints to route the SCC relay and REST API traffic over the AWS backbone infrastructure, adding another layer of security. Additionally, you can lock down the cluster entirely if you route to the control plane through PrivateLink and give the EC2 instances no route to an internet gateway.
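One way to confirm the locked-down setup is to scan the route tables attached to the workspace subnets for routes that target an internet gateway. The sketch below assumes the input has the shape of the `RouteTables` list returned by the EC2 `DescribeRouteTables` API; `internet_gateway_routes` is a hypothetical helper name. It only looks for internet gateways, as the text describes, and does not detect other paths to the internet such as NAT gateways.

```python
def internet_gateway_routes(route_tables):
    """Return (route_table_id, destination) pairs for routes via an internet gateway."""
    found = []
    for table in route_tables:
        for route in table.get("Routes", []):
            # Internet gateway IDs start with "igw-"; local routes use "local".
            gateway_id = route.get("GatewayId") or ""
            if gateway_id.startswith("igw-"):
                found.append(
                    (table.get("RouteTableId"), route.get("DestinationCidrBlock"))
                )
    return found
```

An empty result for every workspace route table is consistent with the fully private setup; any entry points at a route worth reviewing.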
The cross-account IAM role is a required part of the deployment and is used when initially spinning up EC2 instances within your environment. It's recommended to restrict your cross-account role to only the necessary resources and to restrict where the EC2 AMIs are sourced from. Be sure to use the proper IAM roles for both Unity Catalog and Serverless.
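Restricting the AMI source can be sketched as an IAM policy statement that only allows `ec2:RunInstances` on images owned by a specific account, using the `ec2:Owner` condition key. The helper name `ami_restriction_statement` is hypothetical and the account ID is a placeholder parameter; take the real value from the vendor's documentation. The statement only has a restricting effect if no other statement in the role grants broader access to images.

```python
import json


def ami_restriction_statement(ami_owner_account_id):
    """Build an IAM statement allowing instance launches only from one AMI owner.

    ami_owner_account_id is a placeholder supplied by the caller.
    """
    return {
        "Sid": "RestrictAmiSource",
        "Effect": "Allow",
        "Action": "ec2:RunInstances",
        # Image ARNs have no account segment, hence the double colon.
        "Resource": "arn:aws:ec2:*::image/*",
        "Condition": {"StringEquals": {"ec2:Owner": ami_owner_account_id}},
    }


if __name__ == "__main__":
    # 111122223333 is a dummy account ID used purely as an example.
    print(json.dumps(ami_restriction_statement("111122223333"), indent=2))
```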
The workspace root bucket is used to store workspace objects like cluster logs, notebook revisions, job results, and libraries. Create a specific bucket policy that allows the Delta Lake control plane to write to the bucket, and do not use the bucket for customer data or share it across multiple workspaces. Also plan to implement encryption on your workspace storage, EBS volumes, and managed objects.
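A bucket policy of the kind described above might look roughly like the following sketch. The function name `root_bucket_policy`, the control plane account ID parameter, and the exact list of S3 actions are assumptions for illustration; the authoritative principal and actions should come from the vendor's documentation.

```python
import json


def root_bucket_policy(bucket_name, control_plane_account_id):
    """Build a bucket policy granting one external account access to one bucket.

    Both arguments are supplied by the caller; the action list is an assumption.
    """
    bucket_arn = f"arn:aws:s3:::{bucket_name}"
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "GrantControlPlaneAccess",
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{control_plane_account_id}:root"
                },
                "Action": [
                    "s3:GetObject",
                    "s3:PutObject",
                    "s3:DeleteObject",
                    "s3:ListBucket",
                    "s3:GetBucketLocation",
                ],
                # Bucket-level and object-level actions need both ARNs.
                "Resource": [bucket_arn, f"{bucket_arn}/*"],
            }
        ],
    }


if __name__ == "__main__":
    # Dummy bucket name and account ID, used purely as an example.
    print(json.dumps(root_bucket_policy("example-workspace-root", "111122223333"), indent=2))
```

Using one such bucket per workspace keeps the policy scoped to a single bucket ARN, which matches the advice not to share the root bucket.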
In the end, deploying Delta Lake on AWS requires a thorough understanding of the best practices and guidance. The customer-managed VPC, security groups, data plane to control plane connectivity, cross-account IAM role, and workspace root bucket are all critical components of a standard Delta Lake deployment. Cloud engineers must ensure that each component is appropriately configured and optimised to avoid connection or security issues. By following these tips and recommendations, cloud engineers can deploy Delta Lake on AWS effectively and ensure that their organisation has the analytics and data processing capabilities needed to succeed in the modern business world.
Read more about it in a related Databricks blog post.
Christian Prokopp, PhD, is an experienced data and AI advisor and founder who has worked with Cloud Computing, Data and AI for decades, from hands-on engineering in startups to senior executive positions in global corporations. You can contact him at firstname.lastname@example.org for inquiries.