Docker best-practices in Apache Spark application deployments
Apache Spark has become the de facto standard tool for distributed data processing. Similarly, Kubernetes has become the de facto standard container orchestration tool. Deploying Spark jobs on Kubernetes has been natively supported since Spark 2.3. An alternative to this deployment is available through the operator pattern; see spark-on-k8s-operator. Both solutions start by preparing a container image and then deploying it in the cluster. However, they follow different principles when building their images.
On the one hand, the Spark-provided build scripts start by building Spark locally and then add the application-specific artifacts. On the other hand, in the operator solution the application-specific artifacts are added by building an image on top of a provided base image.
In my experience, building Spark locally is rarely needed, and starting with a predefined base image can be limiting. Therefore, here I will present a couple of optimized and extensible Dockerfile templates. The aim is to use Docker multi-stage builds to reduce build time for the container image, especially time needed to download Spark itself and other dependencies that do not change often. The source code including the Dockerfile(s) can be found here.
Spark on Kubernetes
Starting from a client machine, with access configured to the cluster, spark-submit submits an application to the cluster.
1. Initially, the Spark driver is created in a Pod.
2. The driver then creates executors, which also run in Pods, and connects to them.
3. Once the processing is done, executors are terminated and cleaned up. The driver pod remains in
Multi-stage builds are used to reduce the resulting Docker image size, while also reducing the build time. Therefore in the template files, the following split has been followed:
1. Starting from the openjdk base image, a few required packages are installed
2. Spark is installed
3. A few system configurations are made, like setting the SPARK_HOME environment variable
4. Support is added for public cloud object storage systems like S3 and GCS
5. In this final step, Scala/Java Spark applications differ from PySpark applications. The former only need to copy the .jar artifact to the designated location, while the latter first need to have Python installed (together with the application-specific dependencies) and then copy the artifacts to the designated directory.
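To make the split above more concrete, here is a minimal sketch that writes an illustrative multi-stage Dockerfile for the Scala/Java case. It is not the template from the repository: the stage names, the openjdk tag, the Spark and hadoop-aws versions, the download URLs, the /opt/spark location and the target/my-app.jar path are all assumptions chosen for illustration.

```shell
# Write an illustrative multi-stage Dockerfile. Stage names, versions, URLs
# and paths are assumptions for this sketch, not the article's templates.
cat > Dockerfile.sketch <<'EOF'
# 1. Base image plus a few required packages
FROM openjdk:11-jre-slim AS base
RUN apt-get update \
 && apt-get install -y --no-install-recommends bash curl procps tini \
 && rm -rf /var/lib/apt/lists/*

# 2. Install Spark; this layer is reused from cache until SPARK_VERSION changes
FROM base AS spark
ARG SPARK_VERSION=3.5.1
RUN curl -fsSL "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/spark-${SPARK_VERSION}-bin-hadoop3.tgz" \
    | tar -xz -C /opt \
 && mv "/opt/spark-${SPARK_VERSION}-bin-hadoop3" /opt/spark

# 3. System configuration, including SPARK_HOME and the container entrypoint
FROM spark AS configured
ENV SPARK_HOME=/opt/spark
ENV PATH="${SPARK_HOME}/bin:${PATH}"
RUN cp "${SPARK_HOME}/kubernetes/dockerfiles/spark/entrypoint.sh" /opt/entrypoint.sh \
 && chmod +x /opt/entrypoint.sh
WORKDIR /opt/spark/work-dir
ENTRYPOINT ["/opt/entrypoint.sh"]

# 4. Object storage support. Only the S3 connector is shown; it also needs a
#    matching AWS SDK bundle jar, and a GCS connector jar would be added the same way.
FROM configured AS cloud
ARG HADOOP_AWS_VERSION=3.3.4
RUN curl -fsSL -o "${SPARK_HOME}/jars/hadoop-aws-${HADOOP_AWS_VERSION}.jar" \
    "https://repo1.maven.org/maven2/org/apache/hadoop/hadoop-aws/${HADOOP_AWS_VERSION}/hadoop-aws-${HADOOP_AWS_VERSION}.jar"

# 5. Application artifact (hypothetical jar path)
FROM cloud AS app
COPY target/my-app.jar /opt/spark/jars/
EOF

# List the stages that were written.
grep '^FROM ' Dockerfile.sketch
```

With named stages like these, docker build --target cloud -f Dockerfile.sketch . would stop after the connector stage, which can be handy when only the slow-changing layers need to be rebuilt or inspected.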
To test the application deployment, the example spark-pi application will be used. Make sure you have access to a Kubernetes cluster. If you don’t have one, you can use minikube to create a local one. Run kubectl cluster-info to get the master IP address and the port where the jobs can be submitted. Next, build the Docker images.
For Scala run:
docker build -t localhost:5000/spark-local -f Dockerfile .
For Python run:
docker build -t localhost:5000/spark-local-py -f Dockerfile-python .
And finally deploy the applications on the cluster:
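A minimal sketch of such a submission for the Scala image might look like the following. The master address is a placeholder for whatever kubectl cluster-info reports, and the examples jar path assumes Spark 3.5.1 installed under /opt/spark inside the image; both are assumptions, as are the K8S_MASTER, IMAGE, EXAMPLES_JAR and DRY_RUN variable names. By default the script only prints the command.

```shell
# Assumed values: replace K8S_MASTER with the address reported by
# `kubectl cluster-info`; the jar path depends on the Spark version in the image.
K8S_MASTER="${K8S_MASTER:-https://127.0.0.1:8443}"
IMAGE="${IMAGE:-localhost:5000/spark-local}"
EXAMPLES_JAR="${EXAMPLES_JAR:-local:///opt/spark/examples/jars/spark-examples_2.12-3.5.1.jar}"

# Assemble the spark-submit invocation as positional parameters.
set -- spark-submit \
  --master "k8s://${K8S_MASTER}" \
  --deploy-mode cluster \
  --name spark-pi \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=2 \
  --conf "spark.kubernetes.container.image=${IMAGE}" \
  "${EXAMPLES_JAR}"

SUBMIT_CMD="$*"
if [ "${DRY_RUN:-1}" = "1" ]; then
  # Default: only print the command so it can be reviewed first.
  printf '%s\n' "${SUBMIT_CMD}"
else
  # DRY_RUN=0 actually submits the job to the cluster.
  "$@"
fi
```

For the Python image, the same pattern should apply with IMAGE set to localhost:5000/spark-local-py, the --class option dropped, and the application pointing at a Python file such as local:///opt/spark/examples/src/main/python/pi.py (again assuming Spark lives under /opt/spark). Depending on the cluster's RBAC setup, a service account may also have to be passed via spark.kubernetes.authenticate.driver.serviceAccountName.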
There are a few things that can be improved in this deployment.
Firstly, the applications are being executed by the root user. This is of course not recommended in production deployments because of the extensive permissions this user has.
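One possible way to address this, sketched under assumptions, is to add a thin layer on top of the image built above that switches to a non-root UID. The UID 185 mirrors the default used in Spark's own Kubernetes Dockerfile; the /opt/spark/work-dir path and the file name Dockerfile.nonroot.sketch are hypothetical.

```shell
# Write an illustrative Dockerfile that drops root privileges.
cat > Dockerfile.nonroot.sketch <<'EOF'
# Build on the Scala image from the earlier build command.
FROM localhost:5000/spark-local
# Any non-zero UID should do; 185 is only a convention borrowed from Spark.
ARG spark_uid=185
# While still root, make the working directory group-writable. A UID without
# a passwd entry runs with the root group, so it can then write there.
USER root
RUN mkdir -p /opt/spark/work-dir && chmod g+w /opt/spark/work-dir
# From here on, the container no longer runs as root.
USER ${spark_uid}
EOF

# Show the result; build it with, for example:
#   docker build -t localhost:5000/spark-local-nonroot -f Dockerfile.nonroot.sketch .
cat Dockerfile.nonroot.sketch
```

The image tag localhost:5000/spark-local-nonroot is only an example name. Whether further directories need adjusted permissions depends on what the application writes at runtime.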
Secondly, depending on the organizational setup, credentials need to be added to (or ingested into) the containers. There are many ways to achieve this, so it is not feasible to include one in a template solution. Nonetheless, it cannot be avoided 😄.
While Spark and Kubernetes have come a long way together, container image building best practices are not always followed. Here, I focused on bringing the multi-stage build principle to Spark job container images.