How to build a Spark Docker image and deploy it to Kubernetes

In this blog post, I will demonstrate how to build the latest Apache Spark Docker image from the official Dockerfile and deploy it to a Kubernetes cluster. At the time of writing, the latest version of Apache Spark is v3.2.1.

Prerequisites

  • A Kubernetes cluster. Please make sure that the worker node has a minimum of 3 CPUs and 4 GB of memory; you can verify this with the command shown after this list.
  • Docker to build the image (this post uses Docker version 20.10.12).
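
A quick way to check the worker node's capacity before you start (assuming your kubeconfig already points at the cluster) is to describe the node and look at the Capacity section:

kubectl describe node kworker1 | grep -A 6 "Capacity:"

The cpu and memory values listed there should meet the minimum mentioned above.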

 Lab setup

Node name   Role            IP address       OS
kmaster     Control plane   192.168.56.111   Ubuntu 18.04
kworker1    Worker node     192.168.56.121   Ubuntu 18.04

Build the Spark image

To create the Apache Spark Docker image, download the complete Spark package from the link below. This package includes the Hadoop libraries as well.

spark-3.2.1-bin-hadoop3.2.tgz

Make a directory, then download and untar the package:

mkdir spark-build
cd spark-build
wget https://downloads.apache.org/spark/spark-3.2.1/spark-3.2.1-bin-hadoop3.2.tgz
tar -xf spark-3.2.1-bin-hadoop3.2.tgz 
cd spark-3.2.1-bin-hadoop3.2

The official Apache Spark documentation provides a script (docker-image-tool.sh) to build the image from the Dockerfile.

Execute the below command from the spark-3.2.1-bin-hadoop3.2 folder to generate the Spark images. The command will take a few minutes to complete.

./bin/docker-image-tool.sh -r techiescorner -t spark -p kubernetes/dockerfiles/spark/Dockerfile build

Once the script completes, execute the “docker images” command to list the available images on the machine. Please note that, by default, the script builds a Spark Docker image for running JVM jobs.

$ docker images
REPOSITORY               TAG       IMAGE ID       CREATED         SIZE
techiescorner/spark-py   spark     d6143553585a   8 minutes ago   602MB
techiescorner/spark      spark     d6143553585a   8 minutes ago   602MB
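
Note that the spark-py image above was built from the same base Dockerfile because the -p flag pointed at it. If you want a dedicated image for PySpark jobs, the same script can be pointed at the Python bindings Dockerfile instead; a sketch, assuming the default layout of the spark-3.2.1-bin-hadoop3.2 distribution:

./bin/docker-image-tool.sh -r techiescorner -t spark -p ./kubernetes/dockerfiles/spark/bindings/python/Dockerfile build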

The image has been created. Next, I will tag the image and push it to my Docker Hub repository. To do this, execute the below commands.

docker tag d6143553585a techiescorner/spark:v3.2.1
docker login
docker push techiescorner/spark:v3.2.1
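
To confirm the push succeeded, you can optionally pull the tag back from Docker Hub:

docker pull techiescorner/spark:v3.2.1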

Now we have a Spark Docker image; the next step is to deploy it to a Kubernetes cluster.

Prepare the Kubernetes cluster

To deploy a Spark job to the Kubernetes cluster, we first need to create a namespace, a service account, and a cluster role binding. The namespace is optional but good to have. Execute the below kubectl commands on the Kubernetes control-plane node to create these three resources.

kubectl create ns techies-spark
kubectl create serviceaccount spark -n techies-spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=techies-spark:spark --namespace=default
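
As a quick sanity check (not strictly required), you can ask the API server whether the new service account is allowed to manage pods, which is the permission the Spark driver will need:

kubectl auth can-i create pods --as=system:serviceaccount:techies-spark:spark -n techies-spark
kubectl auth can-i delete pods --as=system:serviceaccount:techies-spark:spark -n techies-spark

Both commands should print “yes”.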

How spark-submit works

An application is submitted to the Kubernetes cluster with spark-submit. The Kubernetes control plane acts as the cluster manager and creates a driver pod. The driver pod then starts the application and asks the cluster manager to create executor pods, which use their allocated CPU and memory to finish the job. This makes running Spark on Kubernetes straightforward, because once networking is in place the cluster manager handles the connectivity between the driver and executor pods. The driver pod needs additional permissions to spin up executor pods, which is why we created the service account and cluster role binding above.

[Diagram: Spark driver and executor pods on a Kubernetes cluster]
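
If you want to watch this driver/executor lifecycle as it happens, keep a watch running in a second terminal while you submit the job in the next section:

kubectl get pods -n techies-spark -w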

Scheduling an application

The Kubernetes resources are ready; next, submit a sample Spark job by executing the below spark-submit command from the directory where we extracted the Spark package. (Please make sure that the cluster's kubeconfig file is copied to your home directory.)

./bin/spark-submit \
  --master k8s://https://192.168.56.111:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.namespace=techies-spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=techiescorner/spark:v3.2.1 \
  --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar

  • --master The Kubernetes API server URL; to get it, execute “kubectl cluster-info” on the cluster (see the example after this list).
  • --deploy-mode Run Spark in cluster mode, so the driver runs inside the cluster.
  • --name The name used for the Spark driver and executor pods.
  • spark.kubernetes.namespace The namespace we created earlier.
  • spark.executor.instances Run one Spark executor replica on Kubernetes; it will be spawned by the Spark driver.
  • spark.kubernetes.container.image The Spark image we created earlier; you can specify a local image or a repository name.
  • spark.kubernetes.container.image.pullPolicy When to pull the image.
  • local:// The example jar's location inside the container.
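
As mentioned for the --master option, the API server URL comes from kubectl cluster-info; on this lab cluster the output looks similar to the below (trimmed; older Kubernetes versions print “Kubernetes master is running at” instead):

$ kubectl cluster-info
Kubernetes control plane is running at https://192.168.56.111:6443

When you run the spark-submit command above, it prints pod status updates similar to the following: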
22/02/11 16:14:44 INFO LoggingPodStatusWatcherImpl: State changed, new state: 
	 pod name: spark-pi-6943a17ee8619171-driver
	 namespace: techies-spark
	 labels: spark-app-selector -> spark-3c8526636473481cb1bc03b012723c29, spark-role -> driver
	 pod uid: e34a6090-b950-43ff-b643-473bcd31d020
	 creation time: 2022-02-11T10:41:32Z
	 service account name: spark
	 volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-ds2nx
	 node name: kworker1
	 start time: 2022-02-11T10:41:32Z
	 phase: Running
	 container status: 
		 container name: spark-kubernetes-driver
		 container image: techiescorner/spark:v3.2.1
		 container state: running
		 container started at: 2022-02-11T10:42:38Z
22/02/11 16:14:44 INFO LoggingPodStatusWatcherImpl: Application status for spark-3c8526636473481cb1bc03b012723c29 (phase: Running)
22/02/11 16:14:45 INFO LoggingPodStatusWatcherImpl: Application status for spark-3c8526636473481cb1bc03b012723c29 (phase: Running)

If you see a message like the above, your application was submitted successfully. There are now two pods running in the techies-spark namespace: one is the driver pod and the other is the executor pod. Execute the below command to list the pods.

$ kubectl get pods -n techies-spark
NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-2c6e437ee860c440-exec-1   1/1     Running   0          4s
spark-pi-6943a17ee8619171-driver   1/1     Running   0          76s
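
To see the result of the SparkPi job, check the driver pod's logs once it reaches the Completed state (the pod name will be different in your run):

kubectl logs spark-pi-6943a17ee8619171-driver -n techies-spark | grep "Pi is roughly"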

Once the job is completed, the executor pods are automatically terminated and deleted from the cluster, while the driver pod remains in the Completed state so its logs can still be inspected.
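
When you are finished, the completed driver pods can be cleaned up using the spark-role label shown in the submission log, or you can delete the whole namespace to remove everything created in this post:

kubectl delete pods -n techies-spark -l spark-role=driver
kubectl delete ns techies-spark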
