DIY Analytics Platform

Plenty of platforms and providers offer great tools and solutions that fall within the scope of an analytics platform, and many of them require zero manual effort to set up. But sometimes what is missing is just that little bit of customization to fit our needs perfectly. I often struggled to find a place where I could run an exploratory analysis across multiple data sources and still have the tables, table structures, and intermediate results available when developing the production data pipeline. In this post, I will share how to build a tiny analytics platform on a local machine, with the power to customize every aspect of it. The documentation and code used in this tutorial are on GitHub.

Requirements for an Analytics Platform

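There are four main aspects such a platform needs to satisfy: discoverability of data across a wide range of sources, experimentation, distributed large-scale workload execution, and orchestration.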

Tooling

As the basis for the platform, I recommend a framework that is widely adopted by the community, easy to work with, and accessible regardless of changes to your personal infrastructure. I have chosen minikube (local Kubernetes); however, there are other great options out there.

The first step is to create a metastore. A metastore is a database that stores data about the data: information about the databases and tables in the platform. It also provides a certain degree of authorization control. Most importantly, this service should be shared by the other components of the platform. The metastore of Apache Hive is one such service. Sadly, there is no easy way to deploy it in a Kubernetes cluster, so it has to be done manually, step by step. See the README for details.
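To make "data about the data" concrete, here is a minimal sketch of the kind of information a metastore keeps. It is a toy in-memory model, not the Hive Metastore API; the class names, the `sales.orders` table, and the storage path are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Table:
    """Metadata for one table: its schema and where the data files live."""
    name: str
    columns: dict[str, str]  # column name -> column type
    location: str            # storage path of the data files (hypothetical)


@dataclass
class Metastore:
    """Toy catalog: database name -> table name -> Table metadata."""
    databases: dict[str, dict[str, Table]] = field(default_factory=dict)

    def create_table(self, database: str, table: Table) -> None:
        # Register the table, creating the database entry on first use.
        self.databases.setdefault(database, {})[table.name] = table

    def list_tables(self, database: str) -> list[str]:
        # Unknown databases simply have no tables.
        return sorted(self.databases.get(database, {}))

    def describe(self, database: str, name: str) -> Table:
        return self.databases[database][name]


store = Metastore()
store.create_table(
    "sales",
    Table(
        name="orders",
        columns={"order_id": "bigint", "amount": "double"},
        location="s3a://example-bucket/sales/orders",  # made-up path
    ),
)
print(store.list_tables("sales"))
print(store.describe("sales", "orders").columns)
```

Because every engine on the platform reads the same catalog, a table registered once can be found by the query engine, the notebooks, and the batch jobs alike.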

From here on, things become less complicated and involve fewer moving parts. With the metastore in place, the next step is to satisfy the requirements for the platform. To ensure the discoverability of a wide range of data sources, Trino (formerly known as PrestoSQL) is a great option. It is a distributed SQL query engine designed to query large data sets distributed over one or more heterogeneous data sources. Deployment is very straightforward with a Helm chart.
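As a rough illustration of querying heterogeneous sources, the sketch below builds a single SQL statement that joins a table from a Hive catalog with one from a PostgreSQL catalog. Trino addresses tables as catalog.schema.table; the catalog, schema, table, and column names here are assumptions, as are the host, port, and user. Running the query needs the third-party `trino` client package and a reachable Trino coordinator, so that part is kept in a separate function.

```python
def qualified(catalog: str, schema: str, table: str) -> str:
    """Return the fully qualified catalog.schema.table name Trino expects."""
    return f"{catalog}.{schema}.{table}"


def build_federated_query() -> str:
    """Join a Hive table with a PostgreSQL table in one statement."""
    orders = qualified("hive", "sales", "orders")               # hypothetical
    customers = qualified("postgresql", "public", "customers")  # hypothetical
    return (
        "SELECT c.name, sum(o.amount) AS total "
        f"FROM {orders} AS o "
        f"JOIN {customers} AS c ON o.customer_id = c.id "
        "GROUP BY c.name"
    )


def run_query(sql: str, host: str = "localhost", port: int = 8080) -> list:
    """Execute the query; assumes `pip install trino` and a running cluster."""
    from trino.dbapi import connect  # imported lazily, third-party package

    conn = connect(host=host, port=port, user="analyst")  # assumed user name
    cur = conn.cursor()
    cur.execute(sql)
    return cur.fetchall()


print(build_federated_query())
```

With minikube, something like a port-forward to the Trino service would be needed before calling `run_query`; the exact service name depends on how the chart was installed.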

Next in line is experimentation. Jupyter Notebooks and JupyterHub have become the go-to tools for data exploration. They are easy to use, provide a wide range of capabilities, and are ridiculously easy to extend with community-provided plugins or external packages.

To handle distributed large-scale workload execution, Spark is the de facto standard in many industries. The spark-on-k8s-operator makes it especially easy to add Spark capabilities to the platform.
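With the operator installed, a Spark job is described as a SparkApplication resource. The sketch below builds such a manifest as a plain dictionary and prints it as JSON, which kubectl accepts. The field names follow the operator's v1beta2 examples as far as I know them, so check them against the CRD version you installed. The image name, file path, Spark version, and service account are placeholders.

```python
import json


def spark_application(name: str, image: str, main_file: str,
                      executors: int = 2) -> dict:
    """Build a SparkApplication manifest for the spark-on-k8s-operator."""
    return {
        "apiVersion": "sparkoperator.k8s.io/v1beta2",
        "kind": "SparkApplication",
        "metadata": {"name": name, "namespace": "default"},
        "spec": {
            "type": "Python",
            "mode": "cluster",
            "image": image,                    # placeholder image
            "mainApplicationFile": main_file,  # path inside the image
            "sparkVersion": "3.1.1",           # assumed version
            "driver": {
                "cores": 1,
                "memory": "512m",
                "serviceAccount": "spark",     # assumed service account
            },
            "executor": {
                "cores": 1,
                "instances": executors,
                "memory": "512m",
            },
        },
    }


manifest = spark_application(
    name="example-job",
    image="example/spark-py:latest",
    main_file="local:///opt/app/job.py",
)
# The JSON output can be saved to a file and applied with kubectl.
print(json.dumps(manifest, indent=2))
```

Generating the manifest from code keeps job definitions next to the job itself and makes it easy to vary the executor count per run.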

Orchestration is not included in the repository; however, the choice would be between Airflow and Argo. Both tools offer pipeline-as-code capabilities and are actively supported by great communities.
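To show what pipeline as code means in practice, here is a small sketch using only the Python standard library. It is neither the Airflow nor the Argo API. It declares tasks and their dependencies as data and runs them in dependency order, which is the core idea both tools build on. The task names are made up.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def run_pipeline(tasks: dict, dependencies: dict) -> list:
    """Run each task after all of its upstream tasks; return the run order."""
    order = list(TopologicalSorter(dependencies).static_order())
    for name in order:
        tasks[name]()
    return order


log = []
# Each task is a plain callable; here they only record that they ran.
tasks = {
    "extract": lambda: log.append("extract"),
    "transform": lambda: log.append("transform"),
    "load": lambda: log.append("load"),
}
# Map each task to the set of tasks it depends on.
dependencies = {
    "transform": {"extract"},
    "load": {"transform"},
}

order = run_pipeline(tasks, dependencies)
print(order)
```

A real orchestrator adds scheduling, retries, and monitoring on top, but the pipeline definition stays a versioned piece of code like this one.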

Next Steps

There are a few things I did not get around to adding to the deployment:

Summary

While there are a lot of offerings on the market for analytics platforms, we prefer the ones that can be tailored to our needs. Tools like Hive, Trino, Jupyter, and Spark cover a large portion of the responsibilities of an analytics platform. Nonetheless, it is important to have a base layer that allows easy extension. Kubernetes, Helm, and similar tools certainly fit the bill and make building a platform tailored to our infrastructure and personal preferences a piece of cake.

Senior ML/Data Engineer
