Docker best practices in Apache Spark application deployments

Apache Spark has become the de facto standard tool for distributed data processing. Similarly, Kubernetes has become the de facto standard container orchestration tool. Deploying Spark jobs on Kubernetes has been natively supported since Spark 2.3. An alternative to the native deployment is the operator pattern, as implemented by spark-on-k8s-operator. Both solutions start by preparing a container image and then deploying it in the cluster. However, they follow different principles when building their images.

Figure: Spark-provided container image build

On the one hand, the Spark-provided build scripts start by building Spark locally and then add the application-specific artifacts. On the other hand, in the operator solution the application-specific artifacts are added by building an image on top of a provided base image.

Figure: Kubernetes operator container image build

In my experience, building Spark locally is rarely needed, and starting from a predefined base image can be limiting. Therefore, I will present a couple of optimized and extensible Dockerfile templates. The aim is to use Docker multi-stage builds to reduce the build time of the container image, especially the time needed to download Spark itself and other dependencies that do not change often. The source code including the Dockerfile(s) can be found here.

Spark on Kubernetes

From a client machine that has access to the cluster configured, spark-submit submits an application to the cluster.

Image credits: https://spark.apache.org/docs/latest/img/k8s-cluster-mode.png
Image credits: Spark Docs https://spark.apache.org/docs/latest/running-on-kubernetes.html

1. Initially, the Spark driver is created in a Pod.
2. The driver then creates executors, which also run in Pods, and connects to them.
3. Once the processing is done, the executors are terminated and cleaned up. The driver Pod remains in the completed state.

Multi-stage builds

Multi-stage builds are used to reduce the size of the resulting Docker image while also reducing the build time. The template files are therefore split as follows:
1. Starting from an openjdk base image, a few required packages are installed.
2. Spark is installed.
3. A few system configurations are made, such as setting the SPARK_HOME environment variable.
4. Support for public cloud object storage systems such as S3 and GCS is added.
5. In the final step, Scala/Java Spark applications differ from PySpark applications. The former only need to copy the .jar artifact to the designated location, while the latter first need Python installed (together with the application-specific dependencies) before the artifacts are copied to the designated directory.

The templates are Dockerfile for Scala/Java applications and Dockerfile-python for PySpark applications.
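To make the split more concrete, here is a minimal sketch of how a two-stage Dockerfile for a Scala/Java application could look. It is not the template from the repository: the file name Dockerfile.sketch, the openjdk:11-jre-slim tag, the Spark and Hadoop versions, the stage name spark-download and the path target/my-app.jar are all assumptions chosen for illustration, and the object storage step is left out. The snippet only writes the file; it does not build anything.

```shell
# Write an illustrative multi-stage Dockerfile to ./Dockerfile.sketch.
# All versions, tags and paths below are assumptions, not the article's templates.
cat > Dockerfile.sketch <<'EOF'
# Stage 1: download and unpack Spark. This stage is only rebuilt when the
# version arguments change, so the slow download is normally served from cache.
FROM openjdk:11-jre-slim AS spark-download
ARG SPARK_VERSION=3.1.1
ARG HADOOP_VERSION=3.2
RUN apt-get update \
 && apt-get install -y --no-install-recommends curl ca-certificates \
 && rm -rf /var/lib/apt/lists/*
RUN curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}.tgz" \
  | tar -xz -C /opt \
 && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}" /opt/spark

# Stage 2: runtime image that copies only what is needed from stage 1.
FROM openjdk:11-jre-slim
RUN apt-get update \
 && apt-get install -y --no-install-recommends bash tini procps \
 && rm -rf /var/lib/apt/lists/*
ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"
COPY --from=spark-download /opt/spark/jars ${SPARK_HOME}/jars
COPY --from=spark-download /opt/spark/bin ${SPARK_HOME}/bin
COPY --from=spark-download /opt/spark/sbin ${SPARK_HOME}/sbin
COPY --from=spark-download /opt/spark/kubernetes/dockerfiles/spark/entrypoint.sh /opt/entrypoint.sh
# Hypothetical application artifact; this layer changes most often, so it comes last.
COPY target/my-app.jar ${SPARK_HOME}/work-dir/my-app.jar
WORKDIR ${SPARK_HOME}/work-dir
ENTRYPOINT ["/opt/entrypoint.sh"]
EOF

echo "Wrote Dockerfile.sketch with $(grep -c '^FROM ' Dockerfile.sketch) stages"
```

The ordering is the point of the sketch: the Spark download sits in its own early stage and the application artifact is copied last, so rebuilding after a code change should only redo the final COPY. A PySpark variant would additionally install Python and the application's dependencies in the final stage before copying the .py files.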

Testing

To test the application deployment, the example spark-pi application will be used. Make sure you have access to a Kubernetes cluster. If you don’t have one, you can use minikube to create a local one.

Run kubectl cluster-info to get the master IP address and the port to which jobs can be submitted. Next, build the Docker images.

For Scala run: docker build -t localhost:5000/spark-local -f Dockerfile .

For Python run: docker build -t localhost:5000/spark-local-py -f Dockerfile-python .

Finally, deploy the applications on the cluster:

Scala Spark Test Application
PySpark Test Application
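As a rough sketch, a cluster-mode submission of the Scala example might look like the following. The helper extract_master_url, the fallback address https://127.0.0.1:8443, the RUN_SUBMIT switch and the path of the examples jar inside the image are assumptions made for illustration; adjust them to your cluster and image. By default the script only prints the command it would run.

```shell
# Read `kubectl cluster-info` output on stdin, strip ANSI colour codes,
# and print the first URL (the API server address).
extract_master_url() {
  esc=$(printf '\033')
  sed "s/${esc}\[[0-9;]*m//g" | grep -Eo 'https?://[^[:space:]]+' | head -n 1
}

# Ask the cluster for its address if kubectl is available,
# otherwise fall back to a placeholder value.
K8S_MASTER=""
if command -v kubectl >/dev/null 2>&1; then
  K8S_MASTER=$(kubectl cluster-info 2>/dev/null | extract_master_url)
fi
K8S_MASTER="${K8S_MASTER:-https://127.0.0.1:8443}"  # placeholder, not a real cluster

IMAGE="localhost:5000/spark-local"  # tag used in the docker build step
# Assumed location of the Spark examples jar inside the image.
APP_JAR="local:///opt/spark/examples/jars/spark-examples_2.12-3.1.1.jar"

# Assemble the spark-submit command as positional parameters.
set -- spark-submit \
  --master "k8s://${K8S_MASTER}" \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf "spark.kubernetes.container.image=${IMAGE}" \
  "${APP_JAR}"

SUBMIT_CMD="$*"
echo "${SUBMIT_CMD}"

# Submit only when explicitly requested, e.g. RUN_SUBMIT=1 sh submit.sh
if [ "${RUN_SUBMIT:-0}" = "1" ]; then
  "$@"
fi
```

For the PySpark image, the --class option would be dropped and the primary resource would be a .py file inside the image instead of a jar. Depending on how the cluster is secured, the driver may also need a service account, which can be passed with --conf spark.kubernetes.authenticate.driver.serviceAccountName=<name>.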

Next Steps

There are a few things that can be improved in this deployment.

Firstly, the applications are being executed by the root user. This is of course not recommended in production deployments because of the extensive permissions this user has.
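One possible way to address this, shown here only as a sketch, is to create an unprivileged user in the image and switch to it as the last step. The snippet below writes a small Dockerfile that layers this on top of the image built earlier; the file name Dockerfile.nonroot.sketch, the user name spark and the UID 185 are assumptions (185 happens to be the default UID in the Dockerfile shipped with Spark), and it presumes a Debian-based image where useradd is available.

```shell
# Write an illustrative Dockerfile that switches the image to a non-root user.
# User name, UID and base image tag are assumptions for this sketch.
cat > Dockerfile.nonroot.sketch <<'EOF'
FROM localhost:5000/spark-local
ARG spark_uid=185
# Create an unprivileged user in the root group, without a home directory.
RUN useradd --uid ${spark_uid} --gid 0 --no-create-home --shell /usr/sbin/nologin spark
# Everything after this line, including the Spark driver and executors, runs as that user.
USER ${spark_uid}
EOF

echo "Wrote Dockerfile.nonroot.sketch"
```

Any directory the application writes to inside the container would then need to be writable by that user, which is worth checking before relying on such an image.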

Secondly, depending on the organizational setup, credentials need to be injected into the containers. There are many ways to achieve this, so it is not feasible to include one in a template solution. Nonetheless, it cannot be avoided 😄.

Summary

While Spark and Kubernetes have come a long way together, container image building best practices are not always followed. Here, I focused on bringing the multi-stage build principle to Spark job container images.

Senior ML/Data Engineer