Image: a maestro conductor in front of an orchestra (by mohamed Hassan from Pixabay).

This post is about Kubeflow, Spark, and how they interact. A year ago, I was trying to trigger a Spark job from a Kubeflow pipeline; it proved to be a stubborn but exciting problem. The other weekend, I started tinkering with it again, and in the end persistence and patience yielded satisfactory results.


Why is it necessary to orchestrate Spark jobs with Kubeflow? Isn't it sufficient to simply submit these jobs to a Kubernetes cluster and wait until they finish? The answer, of course, is: it depends. …

Apache Spark has become the de facto standard tool for distributed data processing. Similarly, Kubernetes has become the de facto standard container orchestration tool. Deploying Spark jobs on Kubernetes has been natively supported since Spark 2.3. An alternative deployment route uses the operator pattern; see spark-on-k8s-operator. Both solutions start by preparing a container image and then deploying it in the cluster. However, they follow different principles when building their images.
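To make the native route concrete, here is a minimal sketch of submitting a Spark application directly to a Kubernetes cluster with spark-submit. The API server address, image name, and instance counts are placeholders, not values from this post.

```shell
# Submit a Spark job in cluster mode against a Kubernetes master.
# <k8s-apiserver> and <registry> are placeholders for your environment.
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver>:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.container.image=<registry>/spark:latest \
  local:///opt/spark/examples/jars/spark-examples.jar
```

The `local://` scheme tells Spark the application jar is already present inside the container image rather than on the submitting machine.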

Spark-provided container image build

On the one hand, the Spark-provided build scripts start by building Spark locally and then add the application-specific artifacts. …
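The build flow described above can be sketched with the docker-image-tool.sh script that ships with a Spark distribution; the registry and tag below are hypothetical placeholders.

```shell
# Build and push the base Spark image from a local Spark distribution.
./bin/docker-image-tool.sh -r <registry> -t my-tag build
./bin/docker-image-tool.sh -r <registry> -t my-tag push

# Application-specific artifacts are then layered on top,
# e.g. in a small Dockerfile:
#   FROM <registry>/spark:my-tag
#   COPY target/my-app.jar /opt/spark/jars/
```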

Photo by Robin Glauser on Unsplash

There are a lot of platforms and providers out there offering great tools and solutions that fall within the scope of an analytics platform. Many of them require zero manual effort to set up. But sometimes what is missing is just that little bit of customization to fit our needs perfectly. I have often struggled to find a place where I could do exploratory analysis of multiple data sources and have the tables, table structures, and intermediate results available when developing the production data pipeline. In this post, I will share how to build a tiny analytics platform on…

Photo by Connor McSheffrey on Unsplash

A big challenge businesses face is deploying machine learning models in production environments. It requires dealing with a complex set of moving parts across different pipelines. Once the models are developed, they need to be trained, deployed, monitored, and tracked.

This post describes how AWS EKS, Kubeflow Pipelines, and seldon-core enable productionizing the deployment of ML models, including important components like a CI/CD pipeline, a model registry, and a scalable inference layer built on a modern microservice architecture. All the steps for creating this architecture are available on GitHub.

There are essentially 3 stages in the ML…

Sadik Bakiu

Senior ML/Data Engineer @ data-max.io
