DIY Analytics Platform
There are plenty of platforms and providers out there offering great tools and solutions that fall within the scope of an analytics platform. Many of them require zero manual effort to set up. But sometimes what is missing is just that little bit of customization to fit our needs perfectly. I often struggled to find a place where I could do exploratory analysis across multiple data sources and have the tables, table structures, and intermediate results available when developing the production data pipeline. In this post, I will share how to build a tiny analytics platform on a local machine, with the power to customize every aspect of it. The documentation and code used in this tutorial are on GitHub.
Requirements for an Analytics Platform
There are four main aspects such a platform needs to satisfy:
- Data Source Discovery — enables access to many data sources. Preferably to all of them, because the more data the better, right? Well, not always, but let’s not get into that sort of argument, shall we?
- Experimentation — whoever has worked on an analytics-related project knows the actual data is never quite what it looks like. Many ugly, and sometimes nice, surprises hide in there. Hence, it is important to have the freedom to experiment with different tools, libraries, programming languages, etc.
- Workload Handling — experimentation alone will not bring in any revenue. The actual $$$ come from making the full results available to target consumers, be it recommendations for end-users or actions triggered based on data insights. That is why there is currently such huge demand in the job market for Data Engineering and ML Engineering.
- Orchestration — most of the time, data flows through multiple stages, getting cleansed, enriched, transformed, and so on. Manually triggering one step after another is time-consuming at best. Orchestration capabilities take care of triggering the sequence of steps in the right order, automatically.
Tooling
As the basis for the platform, my recommendation is a framework that is widely adopted by the community, easy to work with, and accessible regardless of changes to your personal infrastructure. I have chosen minikube (local Kubernetes); however, there are other great options out there that can be used.
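Before deploying anything, it is worth checking that the local cluster is actually up and reachable. A minimal sketch using the official kubernetes Python client (it simply reads the kubeconfig that minikube writes for you):

```python
from kubernetes import client, config

# Load credentials from ~/.kube/config, which "minikube start" configures
config.load_kube_config()

# List the cluster nodes as a simple connectivity check
v1 = client.CoreV1Api()
for node in v1.list_node().items:
    print(node.metadata.name, node.status.node_info.kubelet_version)
```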
The first step for the platform is to create a metastore. A metastore is a database that stores data about the data: information about the databases and tables in the platform, along with a certain degree of authorization control. Most importantly, this service should be shared between the other players on the platform. The Apache Hive metastore is one such service. Sadly, there is no easy way to deploy it in a Kubernetes cluster, so it has to be done manually, step by step. See the ReadMe for details.
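To illustrate what "shared" means in practice: any engine that speaks the Hive metastore protocol can be pointed at the same thrift endpoint and will see the same databases and tables. A minimal PySpark sketch, where the service name and port are assumptions that depend on your deployment:

```python
from pyspark.sql import SparkSession

# Point Spark at the shared Hive metastore (service name and port are assumptions)
spark = (
    SparkSession.builder
    .appName("metastore-check")
    .config("hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)

# Any database or table created here is also visible to Trino and other engines
spark.sql("SHOW DATABASES").show()
```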
From here on, things become less complicated and involve fewer moving parts. With the metastore problem solved, the next step is to satisfy the platform requirements. To ensure discoverability of a wide range of data sources, Trino (formerly PrestoSQL) is a great option. It is a distributed SQL query engine designed to query large data sets spread over one or more heterogeneous data sources. Deployment is very straightforward with a Helm chart.
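Once the chart is installed, every configured catalog is reachable through a single SQL endpoint. A small sketch with the trino Python client, where host, user, and catalog names are assumptions for illustration:

```python
from trino.dbapi import connect

# Connect to the Trino coordinator (host, port, user, and catalog are assumptions)
conn = connect(host="localhost", port=8080, user="analyst", catalog="hive", schema="default")
cur = conn.cursor()

# One endpoint, many sources: list everything Trino can reach
cur.execute("SHOW CATALOGS")
print(cur.fetchall())
```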
Next in line is experimentation. Jupyter Notebooks and JupyterHub have become the go-to tools for every task related to data exploration. They are easy to use, provide a wide range of capabilities, and are ridiculously easy to extend with community-provided plugins or external packages.
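Inside a notebook, the same Trino connection turns into quick exploratory work. A sketch assuming a hypothetical events table in the hive catalog:

```python
import pandas as pd
from trino.dbapi import connect

conn = connect(host="trino", port=8080, user="analyst", catalog="hive", schema="default")
cur = conn.cursor()
cur.execute("SELECT * FROM events LIMIT 1000")  # "events" is a hypothetical table

# Turn the result set into a DataFrame for quick profiling
df = pd.DataFrame(cur.fetchall(), columns=[c[0] for c in cur.description])
df.describe(include="all")
```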
To handle distributed, large-scale workload execution, Spark is the de facto standard across industries. The spark-on-k8s-operator makes it especially easy to add Spark capabilities to the platform.
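A typical workload is a plain PySpark script that the operator submits to the cluster (referenced as the mainApplicationFile in a SparkApplication manifest). The sketch below uses hypothetical table names and reads from and writes back to the shared metastore:

```python
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = (
    SparkSession.builder
    .appName("daily-event-counts")
    .enableHiveSupport()  # reuse the shared metastore configured for the cluster
    .getOrCreate()
)

# "raw.events" and "analytics.daily_event_counts" are hypothetical tables
daily = (
    spark.table("raw.events")
    .groupBy("event_date")
    .agg(F.count("*").alias("events"))
)
daily.write.mode("overwrite").saveAsTable("analytics.daily_event_counts")

spark.stop()
```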
Orchestration is not included in the repository; it would be a choice between Airflow and Argo. Both tools offer pipeline-as-code capabilities and are actively supported by great communities.
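To give an idea of what pipeline-as-code looks like, here is a minimal Airflow DAG sketch. The task commands are placeholders; in a real pipeline they would submit the Spark job and Trino queries described above:

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="daily_analytics",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Placeholder commands; in practice these would trigger Spark jobs or Trino queries
    extract = BashOperator(task_id="extract", bash_command="echo extract")
    transform = BashOperator(task_id="transform", bash_command="echo transform")
    publish = BashOperator(task_id="publish", bash_command="echo publish")

    extract >> transform >> publish
```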
Next Steps
There are a few things I did not get around to adding to the deployment:
- Dremio — would be an alternative to Trino. It uses Apache Arrow and is easy to deploy in a Kubernetes cluster. Check out https://github.com/dremio/dremio-cloud-tools.
- Argo or Airflow — used for orchestration. Although Argo has a tighter Kubernetes integration, Airflow is a bit more mature for data-related tasks. It is a difficult choice between the two.
- MinIO — widely used to provide a performant, unified access layer for object storage, be it AWS S3 or GCS on GCP. A sketch of what client access through such a layer could look like follows this list.
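Since MinIO speaks the S3 API, the standard boto3 client can simply be pointed at it. Everything in the sketch below, the endpoint, credentials, and bucket name, is an assumption for illustration:

```python
import boto3

# Endpoint, credentials, and bucket name are illustrative assumptions
s3 = boto3.client(
    "s3",
    endpoint_url="http://minio:9000",
    aws_access_key_id="minio",
    aws_secret_access_key="minio123",
)

for obj in s3.list_objects_v2(Bucket="raw-data").get("Contents", []):
    print(obj["Key"])
```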
Summary
While there are plenty of analytics platform offerings on the market, we often prefer the ones that can be tailored to our needs. Tools like Hive, Trino, Jupyter, and Spark cover a large portion of the responsibilities of an analytics platform. Just as important is a base layer that allows easy extension. Kubernetes, Helm, and friends certainly fit the bill and make building a platform tailored to our infrastructure and personal preferences a piece of cake.