My First Experience with Databricks
--
The synergy between Terraform, AWS, Databricks and GitHub Actions surpassed my expectations!
As a consultant, part of my job is constantly facing new clients who come with new projects and different technology requirements. My last client was a big Global Trade Management enterprise looking to revolutionize the supply chain industry. Needless to say, they had major expectations about their soon-to-be brand-new data platform.
With an array of microservices generating large amounts of data and Postgres databases starting to pile up, they were looking to create a robust data sink where one could understand what was happening system-wide, for example without having to join data from tables that could live not just in separate databases but in separate geographic locations.
Now, the cloud is well suited for workloads that have variable demand or require a lot of computing power or storage capacity; any of the major hyperscalers can handle large system loads, no matter how big our systems grow. In this case, AWS was the chosen one.
With most, if not all, of AWS's services supported by Terraform, and with the help of GitHub Actions to maintain healthy, consistent deployments, building up this solution with its multiple environments (dev/test/pre-prod/prod) was fairly straightforward.
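To give an idea of what that looked like, here is a minimal sketch of the pattern, assuming a single root module parameterized by an environment variable, with each GitHub Actions job applying its own .tfvars file. The names, region, and bucket below are hypothetical, not the client's actual configuration.

```hcl
# Hypothetical root module reused across dev/test/pre-prod/prod.

variable "environment" {
  description = "Target environment: dev, test, pre-prod or prod"
  type        = string
}

variable "aws_region" {
  description = "AWS region for this environment"
  type        = string
  default     = "us-east-1"
}

provider "aws" {
  region = var.aws_region

  # Every resource gets tagged with its environment automatically
  default_tags {
    tags = {
      Environment = var.environment
      ManagedBy   = "terraform"
    }
  }
}

# Example: an environment-scoped landing bucket for raw data
resource "aws_s3_bucket" "landing" {
  bucket = "data-platform-landing-${var.environment}" # hypothetical naming scheme
}
```

In CI, each environment's GitHub Actions job would then run something like `terraform apply -var-file=env/dev.tfvars`, so the same code is promoted unchanged from dev all the way up to prod.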
Nevertheless, building a system from scratch can be a daunting task, particularly when it comes to data platforms, so it is important to approach the process with a carefully considered plan. An improperly designed system can create significant challenges for both upstream and downstream systems, leading to inefficiencies and potential roadblocks in the data pipeline. Careful planning and design are therefore crucial to a successful implementation.
For this reason, one has to research and consider multiple options. In my opinion, I had to find something that could scale quickly yet be replicable and secure, backed by a well-documented Terraform provider, preferably with a large number of examples.
Security was a concern since the platform handles user data, and for that same reason, data governance was another. I wanted to allow my client to retain total ownership of their data.
Avoiding dependencies between storage and processing so that they could scale independently was a must; I wanted even the data pipelines to scale independently.
Here is when a colleague mentioned Databricks, which offers all of the features I was looking for. After learning about their many offerings, spending some time clicking through their documentation, and reading many of their public examples on CI/CD, governance, and data analysis, Databricks was chosen to build the data platform on.
So, what was built?
Let’s take a piece of the data platform as an example to understand why I was looking for so much flexibility.
In this sample pipeline, every component of the architecture can be provisioned and managed using Terraform, and any of them can be replaced. At the same time, every component has its own security scope. Databricks is able to interact with any of these components, or any new one, as long as it has authorization (which we can grant or restrict at any given time).
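As an illustration of how that authorization can be granted (and later revoked) purely through Terraform, here is a minimal sketch, assuming a hypothetical S3 landing bucket and hypothetical role names: an IAM role that allows read access to one bucket, registered in the workspace as an instance profile so clusters can use it.

```hcl
# Hypothetical IAM role that Databricks cluster nodes can assume to read one bucket.
resource "aws_iam_role" "pipeline_reader" {
  name = "databricks-pipeline-reader" # hypothetical name
  assume_role_policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect    = "Allow"
      Action    = "sts:AssumeRole"
      Principal = { Service = "ec2.amazonaws.com" } # assumed by the cluster's EC2 nodes
    }]
  })
}

# Read-only access to a single, hypothetical landing bucket.
resource "aws_iam_role_policy" "s3_read_only" {
  name = "s3-read-only"
  role = aws_iam_role.pipeline_reader.id
  policy = jsonencode({
    Version = "2012-10-17"
    Statement = [{
      Effect = "Allow"
      Action = ["s3:GetObject", "s3:ListBucket"]
      Resource = [
        "arn:aws:s3:::example-landing-bucket",
        "arn:aws:s3:::example-landing-bucket/*",
      ]
    }]
  })
}

resource "aws_iam_instance_profile" "pipeline_reader" {
  name = "databricks-pipeline-reader"
  role = aws_iam_role.pipeline_reader.name
}

# Registering the profile with the workspace is what authorizes clusters to use it;
# removing this resource revokes that access without touching the data itself.
resource "databricks_instance_profile" "pipeline_reader" {
  instance_profile_arn = aws_iam_instance_profile.pipeline_reader.arn
}
```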
Now, this pipeline uses a compute-optimized cluster, while in other pipelines a memory-optimized one was the better option. It all depends on the use case. It is thanks to Databricks' flexibility that we can choose the proper compute option through Terraform configuration.
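As a rough sketch of what that looks like (the cluster name, autoscaling bounds, and termination timeout below are hypothetical), the compute profile becomes just another Terraform input:

```hcl
# Pick the latest long-term-support Databricks runtime.
data "databricks_spark_version" "lts" {
  long_term_support = true
}

# Pick the smallest node that matches the desired compute profile.
data "databricks_node_type" "pipeline" {
  category   = "Compute Optimized"
  local_disk = true
}

resource "databricks_cluster" "pipeline" {
  cluster_name            = "sample-pipeline" # hypothetical name
  spark_version           = data.databricks_spark_version.lts.id
  node_type_id            = data.databricks_node_type.pipeline.id
  autotermination_minutes = 30

  autoscale {
    min_workers = 2
    max_workers = 8
  }
}
```

Switching `category` from "Compute Optimized" to "Memory Optimized" is essentially the whole change needed to re-profile a pipeline's cluster.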
I found the flexibility I was looking for.
After a few iterations on how to properly cover the initial requirements, the resulting architecture exceeded the timeliness requirements and met the data accuracy ones.
Writing the Terraform code to provision the required infrastructure went, in general, as documented. Provisioning a Databricks workspace for the first time was tricky, but thanks to the detailed documentation by Databricks and the documentation in the Terraform Registry, the learning curve was not too steep.
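For context, this is roughly the shape of the account-level resources involved in standing up an AWS workspace with the Databricks provider. It is a condensed sketch: authentication, networking, and the IAM and S3 resources behind the variables are omitted, the names are hypothetical, and exact arguments vary between provider versions.

```hcl
variable "databricks_account_id" { type = string }
variable "cross_account_role_arn" { type = string } # cross-account IAM role for Databricks
variable "root_bucket_name" { type = string }       # workspace root (DBFS) bucket

# Account-level provider (authentication details omitted).
provider "databricks" {
  alias      = "mws"
  host       = "https://accounts.cloud.databricks.com"
  account_id = var.databricks_account_id
}

resource "databricks_mws_credentials" "this" {
  provider         = databricks.mws
  credentials_name = "dev-credentials"
  role_arn         = var.cross_account_role_arn
}

resource "databricks_mws_storage_configurations" "this" {
  provider                   = databricks.mws
  account_id                 = var.databricks_account_id
  storage_configuration_name = "dev-root-storage"
  bucket_name                = var.root_bucket_name
}

resource "databricks_mws_workspaces" "dev" {
  provider                 = databricks.mws
  account_id               = var.databricks_account_id
  workspace_name           = "data-platform-dev" # hypothetical name
  aws_region               = "us-east-1"
  credentials_id           = databricks_mws_credentials.this.credentials_id
  storage_configuration_id = databricks_mws_storage_configurations.this.storage_configuration_id
}
```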
After the initial POC and a lot of PySpark code written, automating the pipeline deployments was the only thing left. To accomplish this, I leveraged GitHub Actions and Databricks' tool called DBX, which is built specifically for CD actions on Databricks.
With this in place, it all seemed to be going smoothly. That is, until I started working on governance, which (from my point of view) was the most challenging aspect of the entire platform.
Certainly, one has to sit down and read the documentation about Unity Catalog. Understanding how it works and how it is deployed and configured is essentially simple. But for me, the confusing part was understanding how ONE catalog interacts with ALL your environments.
My initial deployments resulted in me having to destroy a few resources (that was OK, it was dev) and then create a new repository with its own actions just for Unity Catalog (no, I don't believe in mono-repos, even less so when you have to mix different code lifecycles).
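To make the "one catalog for all environments" idea concrete, here is a minimal sketch of the kind of code that ended up living in that separate repository, with hypothetical names and with the account-level versus workspace-level provider configuration omitted: a single metastore shared by every workspace, and environment-scoped catalogs inside it.

```hcl
variable "dev_workspace_id"  { type = number }
variable "prod_workspace_id" { type = number }

# One metastore for the whole region.
resource "databricks_metastore" "this" {
  name          = "primary"
  storage_root  = "s3://example-unity-catalog-root/metastore" # hypothetical bucket
  region        = "us-east-1"
  force_destroy = false
}

# The SAME metastore is assigned to both the dev and prod workspaces...
resource "databricks_metastore_assignment" "dev" {
  metastore_id = databricks_metastore.this.id
  workspace_id = var.dev_workspace_id
}

resource "databricks_metastore_assignment" "prod" {
  metastore_id = databricks_metastore.this.id
  workspace_id = var.prod_workspace_id
}

# ...while each environment gets its own catalog inside that shared metastore.
resource "databricks_catalog" "dev" {
  name    = "dev"
  comment = "Catalog for the dev environment"
}

resource "databricks_catalog" "prod" {
  name    = "prod"
  comment = "Catalog for the prod environment"
}
```

Once that mental model clicked (shared metastore, environment-scoped catalogs, its own repository and lifecycle), the rest of the governance setup followed naturally.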
In the end, I found this experience quite enriching. Databricks is a tool I want to use again in the future. Sadly, as a consultant, I have to wait until I come across a project or client willing to use it.
If you ask me, Databricks is definitely a tool worth investing in. Development with Databricks is so easy: a plug-and-play piece of architecture where you set up storage, then assign permissions to create compute resources, which you can remove at any time while keeping your data intact. That, to me, is priceless.