Orchestrating Spark Jobs with Kubeflow for ML Workflows

A maestro conductor in front of an orchestra.
Image by mohamed Hassan from Pixabay.

This post is about Kubeflow, Spark, and how they interact. A year ago, I tried to trigger a Spark job from a Kubeflow pipeline; it proved to be a stubborn but exciting problem. The other weekend I started tinkering with it again, and in the end persistence and patience yielded satisfactory results.

Motivation

A typical Machine Learning (ML) workflow consists of three stages: preprocessing, training, and serving. During preprocessing, the data is cleaned and transformed. During training, one or more models are trained on that data. In the serving stage, these models are made available to the target audience. The data generated during serving can be fed back into the pipeline to improve model accuracy and keep the model relevant and up-to-date as older training data loses relevance.

Typical Machine Learning Workflow. Icon credits: Eucalyp, Freepik

This post is about the preprocessing stage.

The ever-growing amount of data available as input for ML workflows increases the need to process it efficiently. Scaling up to bigger machines is expensive and at some point no longer possible, so the answer is distributed computing. Over time, distributed computing frameworks have matured and become accessible to a broader audience beyond researchers and expert data engineers. Naturally, the complexity of the workflows grows as well: some steps depend on the output of others and must run sequentially, while others have no such dependencies and can run in parallel. In this post, I will use Apache Spark for distributed computing; there are, however, alternatives such as Dask or Ray.

Preprocessing, multiple dependent steps. Icon credits: phatplus, Eucalyp, Freepik

In addition to orchestration complexity, data transformations that happen outside the common infrastructure create data silos. It is therefore important to integrate these workflows into the existing infrastructure, which also saves the time and money of maintaining a separate one. Nowadays, companies use Kubernetes as a platform to run an ample mix of workload types, and ML workloads should be among them.

Using common infrastructure for many types of workloads.

Is it possible to orchestrate and integrate these workloads in a Kubernetes cluster? Let’s start by taking a look into the tools and concepts involved here.

Spark

In Spark, a driver process plans the work and distributes it across a set of executor processes, which perform the actual computation; a cluster manager provides the resources they run on.

Spark Components. Source

Kubernetes

Similar to Spark, in Kubernetes the control plane is responsible for distributing the workload across a set of nodes and making sure it keeps running. In the context of this post, the workload is a single step, e.g. a processing step, model training, or model serving. The workload is defined in a manifest file (a YAML file).

Kubernetes components. Source
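As a minimal illustration of what such a manifest looks like, here is a sketch of a single-container Pod definition; the name, image, and command are placeholders, not taken from the original pipeline:

```yaml
# pod.yaml -- minimal illustrative workload manifest (name/image/command are placeholders)
apiVersion: v1
kind: Pod
metadata:
  name: preprocess-step            # hypothetical name for a single workflow step
spec:
  restartPolicy: Never             # run the step once, do not restart it
  containers:
    - name: preprocess
      image: python:3.10-slim      # any image that contains the step's code
      command: ["python", "-c", "print('preprocessing...')"]
```

Applying such a file (e.g. with kubectl apply -f pod.yaml) asks the control plane to schedule the step on one of the nodes.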

Spark on Kubernetes

Since version 2.3, Spark can use Kubernetes itself as the cluster manager. The Kubernetes Operator for Apache Spark (the spark operator) makes this convenient: a Spark job is described declaratively as a SparkApplication custom resource, and the operator takes care of submitting it to the cluster and tracking its state.

Submitting a Spark Job to k8s cluster using spark operator. Source

Kubeflow

Kubeflow is an open-source platform for running ML workloads on Kubernetes. Its main components are:

  • KFServing is the component for model deployment and serving. It handles multiple frameworks out of the box, e.g. TensorFlow, PyTorch, ONNX, and scikit-learn.
  • Training Operators are a collection of operators that help train ML models. Supported frameworks include TensorFlow, PyTorch, MXNet, and others.
  • Katib is the component dedicated to hyperparameter tuning and neural architecture search.
  • KF Pipelines helps orchestrate and execute each step of the pipeline on the platform.

Considering again the stages of an ML workflow, Kubeflow helps in all of them. In the preprocessing stage, one can perform all sorts of transformations, as will be shown below. For model training, the training operators can be used out of the box, or custom code can be wrapped in a Docker container. The same logic applies to the serving stage: models can be made available via KFServing or via containerized custom code.

Kubeflow Pipelines

A sample KF pipeline.
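A pipeline is written in Python with the KFP SDK: each step becomes a task in a directed acyclic graph, and the compiled pipeline is uploaded to Kubeflow. Below is a minimal sketch using the KFP v1 SDK; the component and pipeline names are placeholders:

```python
# A minimal Kubeflow Pipelines (KFP v1 SDK) pipeline -- names are placeholders.
import kfp
from kfp import dsl


def echo_op(message):
    # A trivial step: run a container that just echoes its input.
    return dsl.ContainerOp(
        name="echo",
        image="alpine:3.18",
        command=["echo"],
        arguments=[message],
    )


@dsl.pipeline(name="sample-pipeline", description="A toy two-step pipeline.")
def sample_pipeline(message: str = "hello"):
    step1 = echo_op(message)
    step2 = echo_op("done")
    step2.after(step1)  # enforce sequential execution


if __name__ == "__main__":
    # Compile to a workflow YAML that can be uploaded to Kubeflow Pipelines.
    kfp.compiler.Compiler().compile(sample_pipeline, "sample_pipeline.yaml")
```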

Components and Component Specification
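Each step of a pipeline is a component: a containerized program plus a component specification (a YAML file) describing its inputs, outputs, and the image and command to run. A component can be loaded from its specification and then used as a step in a pipeline. Here is a minimal, illustrative sketch using the KFP v1 SDK; the component itself is a made-up example, not from the original post:

```python
# Loading a component from its specification (KFP v1 SDK).
from kfp import components

# An inline component specification; normally this lives in a component.yaml file.
COMPONENT_SPEC = """
name: Say hello
description: A toy component that prints a greeting.
inputs:
  - {name: name, type: String}
implementation:
  container:
    image: alpine:3.18
    command: [sh, -c, 'echo "Hello, $0"', {inputValue: name}]
"""

# Turns the specification into a Python factory function usable inside a pipeline.
say_hello_op = components.load_component_from_text(COMPONENT_SPEC)
```

The k8s apply and k8s get components used below are loaded in exactly the same way from their own specifications.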

Spark on Kubeflow on Kubernetes

KF Pipeline Spark job orchestration. Icon credits: Freepik, document-clipart

Here is an example manifest file for a Spark job. It calculates an approximate value of Pi (π):

Spark operator Job Manifest
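Such a manifest looks roughly like this; the image tag, jar path, and resource settings are illustrative and depend on the Spark version in use:

```yaml
# spark-pi.yaml -- illustrative SparkApplication for the spark operator
apiVersion: sparkoperator.k8s.io/v1beta2
kind: SparkApplication
metadata:
  name: spark-pi
  namespace: default
spec:
  type: Scala
  mode: cluster
  image: gcr.io/spark-operator/spark:v3.1.1        # example image; use the one matching your setup
  mainClass: org.apache.spark.examples.SparkPi
  mainApplicationFile: local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar
  sparkVersion: "3.1.1"
  restartPolicy:
    type: Never
  driver:
    cores: 1
    memory: 512m
    serviceAccount: spark                          # a service account allowed to create executor pods
  executor:
    cores: 1
    instances: 2
    memory: 512m
```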

The next step is to use the k8s apply component. Its input is the manifest file, and its outputs include, among others, the SparkApplication name. The following code snippet makes sure the job is submitted.

K8s apply component
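In pipeline code, this can be sketched as follows. The component specification path and the output name are assumptions and must match the k8s apply component actually used:

```python
# Submitting the SparkApplication from a pipeline step (KFP v1 SDK).
# The spec path and the "name" output below are assumptions -- adjust them
# to the k8s apply component specification you actually use.
from kfp import components, dsl

k8s_apply_op = components.load_component_from_file("k8s_apply_component.yaml")

# The manifest is read at pipeline-compilation time and baked into the pipeline.
with open("spark-pi.yaml") as f:
    SPARK_PI_MANIFEST = f.read()


@dsl.pipeline(name="spark-job-pipeline")
def spark_job_pipeline():
    # Apply the SparkApplication manifest to the cluster.
    apply_task = k8s_apply_op(SPARK_PI_MANIFEST)
    # The created resource's name (assumed output name: "name") is consumed
    # by the waiting step shown in the next snippet.
    app_name = apply_task.outputs["name"]
```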

After the apply operation, the execution engine has to wait for the job to complete before moving on. Using the k8s get component, it fetches the application's state and iterates until the job reaches the COMPLETED state. The iteration is implemented with recursion and the dsl.Condition construct; the @graph_component decorator marks the function as recursively executable.

KF Pipelines Graph Component
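Continuing the sketch above, the waiting logic can be expressed roughly as follows; the spec path and the "status" output name are again assumptions:

```python
# Polling the SparkApplication state with a recursive graph component (KFP v1 SDK).
# The spec path and the "status" output name below are assumptions.
from kfp import components, dsl

k8s_get_op = components.load_component_from_file("k8s_get_component.yaml")


@dsl.graph_component
def wait_for_completion(app_name):
    # Fetch the current state of the SparkApplication.
    get_task = k8s_get_op(app_name)
    # Recurse (i.e. poll again) as long as the job has not reached COMPLETED.
    with dsl.Condition(get_task.outputs["status"] != "COMPLETED"):
        wait_for_completion(app_name)


@dsl.pipeline(name="spark-job-pipeline")
def spark_job_pipeline():
    # k8s_apply_op and SPARK_PI_MANIFEST come from the previous snippet.
    apply_task = k8s_apply_op(SPARK_PI_MANIFEST)
    wait_for_completion(apply_task.outputs["name"])
    # Downstream steps are then ordered to run after the waiting step.
```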

Once the application has completed, execution carries on with the remaining pipeline steps.

The full source code used here can be found on GitHub.

Summary

Great and exciting times are ahead!

Let me know what you think about this post, and feel free to connect with me on LinkedIn.