Docker best-practices in Apache Spark application deployments

Two approaches to building the container images are covered:
- Spark-provided container image build
- Kubernetes operator container image build

Spark on Kubernetes

Starting from a client machine with access to the Kubernetes cluster configured, spark-submit submits an application in cluster mode: the driver runs as a pod inside the cluster and, in turn, requests executor pods from the API server.

Figure: Spark on Kubernetes cluster-mode architecture. Image credits: Spark Docs, https://spark.apache.org/docs/latest/running-on-kubernetes.html (https://spark.apache.org/docs/latest/img/k8s-cluster-mode.png)
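Once an application has been submitted, you can watch the driver and executor pods appear; a minimal sketch, assuming the default namespace (the driver pod name is a placeholder that differs per run):

```bash
# Watch the driver and executor pods come up after spark-submit
kubectl get pods --watch

# Follow the driver log to track the job
kubectl logs -f <driver-pod-name>
```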

Multi-stage builds

Multi-stage builds are used to reduce the size of the resulting Docker image while also shortening build times. The template files therefore follow the split sketched in the per-language sections below:
1. Starting from an openjdk base image, a few required packages are installed

For Scala/Java Applications
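A minimal sketch of what such a multi-stage Dockerfile could look like for a Scala/Java application. The Maven wrapper, the apache/spark base image tag, and the jar location are assumptions, not the article's exact template; substitute an image built with Spark's docker-image-tool.sh and your own build tool as needed.

```dockerfile
# ---- Build stage: compile the application on an openjdk base ----
FROM openjdk:11-jdk-slim AS builder
WORKDIR /build
COPY . .
# Assumes the project ships the Maven wrapper; use the sbt/Gradle equivalent if needed
RUN ./mvnw -q -DskipTests package

# ---- Runtime stage: Spark base image plus only the application jar ----
# apache/spark:v3.3.2 is just an example tag; use the base image that matches your Spark version
FROM apache/spark:v3.3.2
COPY --from=builder /build/target/*.jar /opt/spark/jars/
```

With the jar baked into the image, the application can be referenced at submit time via the local:// scheme, e.g. local:///opt/spark/jars/<your-app>.jar.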
For PySpark Applications
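For PySpark, a similar sketch could install Python dependencies in a build stage and copy only the installed packages into the runtime image. The apache/spark-py tag, the /opt/spark/deps path, and app.py are assumptions for illustration.

```dockerfile
# ---- Build stage: resolve Python dependencies in isolation ----
# For packages with native extensions, the builder's Python minor version
# should match the one inside the runtime image
FROM python:3.9-slim AS builder
COPY requirements.txt .
RUN pip install --no-cache-dir --target=/deps -r requirements.txt

# ---- Runtime stage: PySpark base image plus dependencies and app code ----
# apache/spark-py:v3.3.2 is just an example tag; substitute your own base image
FROM apache/spark-py:v3.3.2
COPY --from=builder /deps /opt/spark/deps
ENV PYTHONPATH="/opt/spark/deps:${PYTHONPATH}"
# app.py is a hypothetical entry point for your PySpark job
COPY app.py /opt/spark/work-dir/app.py
```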

Testing

To test the application deployment, the example spark-pi application will be used. Make sure you have access to a Kubernetes cluster. If you don’t have one, you can use minikube to create a local one.
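If you go the minikube route, something along these lines gets you a local cluster plus a service account the Spark driver can use to create executor pods; the resource sizes are only a reasonable guess for a driver and a couple of executors.

```bash
# Local cluster with enough room for a driver and two executors
minikube start --cpus 4 --memory 8192

# Service account and RBAC so the driver pod may create executor pods
kubectl create serviceaccount spark
kubectl create clusterrolebinding spark-role \
  --clusterrole=edit --serviceaccount=default:spark --namespace=default
```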

Scala Spark Test Application
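A sketch of submitting the bundled SparkPi example in cluster mode, assuming a local Spark distribution on the client machine; the API server address, image tag, and examples-jar version are placeholders to adjust to your setup.

```bash
# The API server URL is printed by `kubectl cluster-info`
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=apache/spark:v3.3.2 \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.3.2.jar  # version must match the image
```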
PySpark Test Application
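The PySpark equivalent submits the pi.py example that ships inside the image; again, the address and image tag are placeholders.

```bash
./bin/spark-submit \
  --master k8s://https://<k8s-apiserver-host>:<port> \
  --deploy-mode cluster \
  --name pyspark-pi \
  --conf spark.executor.instances=2 \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --conf spark.kubernetes.container.image=apache/spark-py:v3.3.2 \
  local:///opt/spark/examples/src/main/python/pi.py
```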

Next Steps

There are a few things that can be improved in this deployment.

Summary

While Spark and Kubernetes have come a long way together, container image building best practices are not always followed. Here, I focus on bringing the multi-stage build principle to Spark job container images.
