DIY Analytics Platform


There are many platforms and providers offering great tools and solutions that fall within the scope of an analytics platform, and many of them require zero manual effort to set up. But sometimes what is missing is just that little bit of customization to fit our needs perfectly. I have often struggled to find a place where I could do exploratory analysis across multiple data sources and have the tables, table structures, and intermediate results available when developing the production data pipeline. In this post, I will share how to build a tiny analytics platform on a local machine, with the power to customize every aspect of it. The documentation and code used in this tutorial are on GitHub.

Requirements for an Analytics Platform

  • Data Source Discovery — enables access to many data sources. Preferably to all of them, because the more data the better, right? Well, not always, but let’s not get into that sort of argument, shall we?
  • Experimentation — whoever has worked on an analytics-related project knows that the actual data is rarely what it appears to be at first glance. Many ugly, and sometimes nice, surprises hide in there. Hence, it is important to have the freedom to experiment with different tools, libraries, programming languages, etc.
  • Handle Workload — experimentation alone will not bring in any revenue. The actual $$$ come from making the full results available to target consumers, be it recommendations to end-users or actions triggered based on data insight. That is why there is currently such a huge demand for Data Engineers and ML Engineers in the job market.
  • Orchestration — most of the time, data flows through multiple stages, getting cleansed, enriched, transformed, and so on. Manually triggering one step after the other is time-consuming at best. Orchestration capabilities take care of triggering the sequence of steps in the right order, automatically.

Tooling

The first step for the platform is to create a metastore. A metastore is a database that stores data about the data: which databases and tables exist on the platform, how they are structured, and, to a certain degree, who is authorized to access them. Most importantly, this service should be shared between the other players on the platform. Apache Hive provides one such service. Sadly, there is no easy way to deploy it in a Kubernetes cluster, so it has to be done manually, step by step. See the README for details.
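Once the metastore is up, it is worth checking that other components can actually talk to it. Below is a minimal sketch that points a local SparkSession at the metastore's Thrift endpoint and lists its databases; the service name hive-metastore and port 9083 are assumptions, so adjust them to whatever your deployment exposes.

```python
# Minimal sketch: verify the metastore is reachable by pointing a local
# SparkSession at its Thrift endpoint and listing the databases it knows about.
# "hive-metastore" and port 9083 are assumptions for illustration.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("metastore-smoke-test")
    .config("spark.hadoop.hive.metastore.uris", "thrift://hive-metastore:9083")
    .enableHiveSupport()
    .getOrCreate()
)

spark.sql("SHOW DATABASES").show()
spark.stop()
```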

From here on, things get less complicated and involve fewer moving parts. With the metastore problem solved, the next step is to satisfy the remaining requirements for the platform. To ensure the discoverability of a wide range of data sources, Trino (formerly known as Presto) is a great option. It is a distributed SQL query engine designed to query large data sets spread over one or more heterogeneous data sources. Deployment is very straightforward with a Helm chart.
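After the chart is installed, the cluster can be queried with the trino Python client. The sketch below simply lists the catalogs Trino can see; the in-cluster host name and user are assumptions and should be replaced with the values from your own deployment.

```python
# Minimal sketch: connect to Trino with the trino Python client and list
# the configured catalogs. Host, port, and user are assumptions.
import trino

conn = trino.dbapi.connect(
    host="trino.analytics.svc.cluster.local",  # assumed in-cluster service name
    port=8080,
    user="analyst",
)
cur = conn.cursor()

cur.execute("SHOW CATALOGS")
for (catalog,) in cur.fetchall():
    print(catalog)
```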

Next in line is experimentation. Jupyter Notebooks and JupyterHub have become the go-to tools for every task related to data exploration. They are easy to use, provide a wide range of capabilities, and are ridiculously easy to extend with community-provided plugins or external packages.
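A typical exploratory notebook cell, sketched below, runs an ad-hoc query against Trino and loads the result into a pandas DataFrame. The hive.default.events table is a hypothetical example, and the connection details are the same assumptions as above.

```python
# Minimal sketch of an exploratory notebook cell: run an ad-hoc Trino query
# and load the result into a pandas DataFrame.
# The table "hive.default.events" is hypothetical.
import pandas as pd
import trino

conn = trino.dbapi.connect(
    host="trino.analytics.svc.cluster.local", port=8080, user="analyst"
)
cur = conn.cursor()
cur.execute("SELECT * FROM hive.default.events LIMIT 100")

columns = [col[0] for col in cur.description]
df = pd.DataFrame(cur.fetchall(), columns=columns)
df.head()
```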

To handle distributed large-scale workload execution, Spark is the de facto standard across all industries. The spark-on-k8s-operator makes it especially easy to add Spark capabilities to the platform.
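With the operator installed, a Spark job is submitted by creating a SparkApplication custom resource. The sketch below does this with the official kubernetes Python client; the namespace, container image, script path, and service account are assumptions for illustration.

```python
# Minimal sketch: submit a SparkApplication custom resource to the
# spark-on-k8s-operator using the official kubernetes Python client.
# Namespace, image, script path, and service account are assumptions.
from kubernetes import client, config

spark_app = {
    "apiVersion": "sparkoperator.k8s.io/v1beta2",
    "kind": "SparkApplication",
    "metadata": {"name": "daily-aggregation", "namespace": "analytics"},
    "spec": {
        "type": "Python",
        "mode": "cluster",
        "image": "my-registry/spark-py:3.3.0",           # hypothetical image
        "mainApplicationFile": "local:///opt/jobs/aggregate.py",
        "sparkVersion": "3.3.0",
        "driver": {"cores": 1, "memory": "1g", "serviceAccount": "spark"},
        "executor": {"cores": 1, "instances": 2, "memory": "1g"},
    },
}

config.load_kube_config()  # or config.load_incluster_config() inside the cluster
api = client.CustomObjectsApi()
api.create_namespaced_custom_object(
    group="sparkoperator.k8s.io",
    version="v1beta2",
    namespace="analytics",
    plural="sparkapplications",
    body=spark_app,
)
```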

Orchestration is not included in the repository; it would come down to a choice between Airflow and Argo. Both tools offer pipeline-as-code capabilities and are actively supported by great communities.
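If Airflow were chosen, the pipeline could be expressed as a DAG along the lines of the sketch below. The two kubectl commands are placeholders for whatever triggers each stage (for example, applying SparkApplication manifests like the one above); the DAG id, schedule, and file paths are assumptions.

```python
# Minimal sketch of an Airflow 2.x DAG chaining two pipeline steps.
# The kubectl commands and manifest paths are placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="analytics_pipeline",
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    ingest = BashOperator(
        task_id="ingest",
        bash_command="kubectl apply -f /opt/pipelines/ingest-spark-app.yaml",
    )
    aggregate = BashOperator(
        task_id="aggregate",
        bash_command="kubectl apply -f /opt/pipelines/aggregate-spark-app.yaml",
    )

    ingest >> aggregate
```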

Next Steps

  • Dremio — would be an alternative to Trino. It uses Apache Arrow and is easy to deploy in a Kubernetes cluster. Check out https://github.com/dremio/dremio-cloud-tools .
  • Argo or Airflow — used for orchestration. Although Argo has a tighter Kubernetes integration, Airflow is a bit more mature for data-related tasks. It is a difficult choice between the two.
  • MinIO — is widely used to provide a performant, unified access layer for object storage, be it AWS S3 or GCS on GCP; see the access sketch after this list.
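Because MinIO exposes an S3-compatible API, existing S3 tooling works against it with only the endpoint changed. The sketch below uses boto3; the endpoint URL, credentials, and bucket name are assumptions.

```python
# Minimal sketch: talk to MinIO through its S3-compatible API with boto3.
# Endpoint URL, credentials, and bucket name are assumptions.
import boto3

s3 = boto3.client(
    "s3",
    endpoint_url="http://minio.analytics.svc.cluster.local:9000",
    aws_access_key_id="minio-access-key",
    aws_secret_access_key="minio-secret-key",
)

s3.upload_file("local-report.parquet", "analytics-bucket", "reports/report.parquet")
for obj in s3.list_objects_v2(Bucket="analytics-bucket").get("Contents", []):
    print(obj["Key"])
```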

Summary

With a Hive metastore at the core, Trino for querying heterogeneous data sources, JupyterHub for experimentation, the spark-on-k8s-operator for large-scale workloads, and Airflow or Argo for orchestration, a small but fully customizable analytics platform can be assembled on a local Kubernetes cluster.