In this blog post, I will demonstrate how to build an Apache Spark Docker image from the official Dockerfile and deploy it to a Kubernetes cluster. At the time of writing, the latest version of Apache Spark is v3.2.1.
- A Kubernetes cluster whose worker node has at least 3 CPUs and 4 GB of memory.
- Docker to build the image (Docker version 20.10.12 is used here).
Build the Spark image
To create the Apache Spark Docker image, download the complete Spark package (spark-3.2.1-bin-hadoop3.2.tgz) from the Apache Spark download site. This package also bundles the Hadoop libraries.
Make a directory and untar the downloaded package in it:
tar -xf spark-3.2.1-bin-hadoop3.2.tgz
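Before building, it can help to confirm that the extracted folder contains the two files the build step relies on. The following is a minimal sketch; check_spark_dist is a hypothetical helper name, and the two relative paths are the ones used by the build command in this post.

```shell
# Sketch: confirm an unpacked Spark distribution contains the files the
# image build relies on. check_spark_dist is a hypothetical helper name.
# The body runs in a subshell, so "exit" only ends the function call.
check_spark_dist() (
    dist_dir="$1"
    for path in bin/docker-image-tool.sh kubernetes/dockerfiles/spark/Dockerfile; do
        if [ ! -f "$dist_dir/$path" ]; then
            echo "missing: $dist_dir/$path" >&2
            exit 1
        fi
    done
    echo "ok: $dist_dir"
)
```

For example, check_spark_dist spark-3.2.1-bin-hadoop3.2 && cd spark-3.2.1-bin-hadoop3.2 changes into the folder only when both files are present.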
The Spark distribution includes a script (bin/docker-image-tool.sh), described in the official documentation, that builds the image from the bundled Dockerfile.
Run the command below from the spark-3.2.1-bin-hadoop3.2 folder to generate the Spark images.
The command takes a few minutes to complete.
./bin/docker-image-tool.sh -r techiescorner -t spark -p kubernetes/dockerfiles/spark/Dockerfile build
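Once the script completes, run "docker images" to list the images available on the machine. By default the script builds a Spark Docker image for running JVM jobs. The -p option names the Dockerfile for the PySpark image; because it points at the same JVM Dockerfile here, the spark-py image in the listing is built from that Dockerfile too and shares its image ID.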
$ docker images
REPOSITORY               TAG     IMAGE ID       CREATED         SIZE
techiescorner/spark-py   spark   d6143553585a   8 minutes ago   602MB
techiescorner/spark      spark   d6143553585a   8 minutes ago   602MB
The image has been created. Next, I tag the image and push it to my Docker Hub repository with the commands below.
docker tag d6143553585a techiescorner/spark:v3.2.1
docker push techiescorner/spark:v3.2.1
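The image ID changes with every rebuild, so a script may prefer to retag by name instead. The sketch below assumes the image was built with the tag "spark" as in the build command above; tag_and_push is a hypothetical helper name, and it simply wraps the same two docker commands.

```shell
# Sketch: retag the freshly built image by name and push it.
# tag_and_push is a hypothetical helper; "spark" is the tag that was
# passed to docker-image-tool.sh with -t in the build step.
tag_and_push() {
    repo="$1"       # Docker Hub repository, e.g. techiescorner
    version="$2"    # version tag to publish, e.g. v3.2.1
    # Push only if the retag succeeded.
    docker tag "$repo/spark:spark" "$repo/spark:$version" &&
        docker push "$repo/spark:$version"
}
```

Calling tag_and_push techiescorner v3.2.1 has the same effect as the two commands above, without copying the image ID by hand.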
Now that we have a Spark Docker image, the next step is to deploy it to a Kubernetes cluster.
Prepare the Kubernetes cluster
To deploy a Spark job to the Kubernetes cluster, we first need to create a namespace, a service account, and a cluster role binding. The namespace is optional, but it is good to have. Run the kubectl commands below on the Kubernetes control-plane node to create these three resources.
kubectl create ns techies-spark
kubectl create serviceaccount spark -n techies-spark
kubectl create clusterrolebinding spark-role --clusterrole=edit --serviceaccount=techies-spark:spark --namespace=default
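To confirm that the binding took effect, kubectl can be asked whether the service account is allowed to create pods. This is a minimal sketch; can_create_pods is a hypothetical helper name, and it relies only on the standard kubectl auth can-i command with the --as impersonation flag.

```shell
# Sketch: check whether a service account may create pods in a namespace.
# can_create_pods is a hypothetical helper name.
can_create_pods() {
    ns="$1"    # namespace, e.g. techies-spark
    sa="$2"    # service account, e.g. spark
    # "kubectl auth can-i" prints "yes" or "no"; a "no" also exits non-zero,
    # so the exit status is ignored and only the printed answer is compared.
    answer="$(kubectl auth can-i create pods --namespace "$ns" --as "system:serviceaccount:$ns:$sa" 2>/dev/null || true)"
    [ "$answer" = "yes" ]
}
```

For example, can_create_pods techies-spark spark && echo "driver can create executors" prints the message only when the permission is granted.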
How spark-submit works
An application is submitted to the Kubernetes cluster with spark-submit. The control plane of the cluster acts as the cluster manager and creates a driver pod. The driver then starts the application and asks the cluster manager to create executor pods, which use their allocated resources to run the job. Running Spark on Kubernetes is therefore straightforward: once cluster networking is in place, the cluster manager takes care of connectivity between the driver and executor pods. The driver pod needs additional permissions to create executor pods, which is why we created the service account and role binding above.
Scheduling an application
With the Kubernetes resources ready, submit a sample Spark job by running the spark-submit command below from the directory where the Spark package was extracted. (Make sure the cluster's kubeconfig file is copied to your home directory, typically as ~/.kube/config.)
./bin/spark-submit \
  --master k8s://https://192.168.56.111:6443 \
  --deploy-mode cluster \
  --name spark-pi \
  --conf spark.kubernetes.namespace=techies-spark \
  --conf spark.kubernetes.authenticate.driver.serviceAccountName=spark \
  --class org.apache.spark.examples.SparkPi \
  --conf spark.executor.instances=1 \
  --conf spark.kubernetes.container.image=techiescorner/spark:v3.2.1 \
  --conf spark.kubernetes.container.image.pullPolicy=IfNotPresent \
  local:///opt/spark/examples/jars/spark-examples_2.12-3.2.1.jar
- --master The address of the Kubernetes API server, prefixed with k8s://. Run "kubectl cluster-info" against the cluster to find it.
- --deploy-mode Spark runs in cluster mode, so the driver itself runs in a pod on the cluster.
- --name The application name, used as the prefix for the driver and executor pod names.
- spark.kubernetes.namespace The namespace we created earlier.
- spark.executor.instances The number of executor pods the Spark driver spawns on Kubernetes; here, 1.
- spark.kubernetes.container.image The Spark image we built earlier; it can be a local image or a repository reference.
- spark.kubernetes.container.image.pullPolicy When to pull the image; IfNotPresent pulls it only if the node does not already have it.
- local:// The location of the example jar inside the container image.
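The value for --master can also be derived from the active kubeconfig instead of being typed by hand. The sketch below assumes kubectl already points at the target cluster; k8s_master_url is a hypothetical helper name, and it uses the standard kubectl config view command with a jsonpath output.

```shell
# Sketch: build the --master value from the current kubeconfig context.
# k8s_master_url is a hypothetical helper name.
k8s_master_url() {
    # --minify limits the output to the current context's cluster entry.
    server="$(kubectl config view --minify --output 'jsonpath={.clusters[0].cluster.server}')" || return 1
    [ -n "$server" ] || return 1
    # Spark expects the API server address prefixed with k8s://
    printf 'k8s://%s\n' "$server"
}
```

The result can then be passed as --master "$(k8s_master_url)" in the spark-submit command.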
22/02/11 16:14:44 INFO LoggingPodStatusWatcherImpl: State changed, new state:
pod name: spark-pi-6943a17ee8619171-driver
labels: spark-app-selector -> spark-3c8526636473481cb1bc03b012723c29, spark-role -> driver
pod uid: e34a6090-b950-43ff-b643-473bcd31d020
creation time: 2022-02-11T10:41:32Z
service account name: spark
volumes: spark-local-dir-1, spark-conf-volume-driver, kube-api-access-ds2nx
node name: kworker1
start time: 2022-02-11T10:41:32Z
container name: spark-kubernetes-driver
container image: techiescorner/spark:v3.2.1
container state: running
container started at: 2022-02-11T10:42:38Z
22/02/11 16:14:44 INFO LoggingPodStatusWatcherImpl: Application status for spark-3c8526636473481cb1bc03b012723c29 (phase: Running)
22/02/11 16:14:45 INFO LoggingPodStatusWatcherImpl: Application status for spark-3c8526636473481cb1bc03b012723c29 (phase: Running)
If you see output like the above, your application was submitted successfully. Two pods are now running in the techies-spark namespace: one is the driver pod and the other is the executor pod. Run the command below to list the pods:
$ kubectl get pods -n techies-spark
NAME                               READY   STATUS    RESTARTS   AGE
spark-pi-2c6e437ee860c440-exec-1   1/1     Running   0          4s
spark-pi-6943a17ee8619171-driver   1/1     Running   0          76s
Once the job completes, the executor pods are terminated and deleted from the cluster automatically. The driver pod stays in the Completed state, so its logs remain available until it is deleted.
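The driver writes the job's output to its log, and the SparkPi example prints a line beginning with "Pi is roughly". The following is a minimal sketch for fetching that line; spark_pi_result is a hypothetical helper name, and it selects the driver through the spark-role=driver label shown in the submission log.

```shell
# Sketch: print the SparkPi result line from the newest driver pod's log.
# spark_pi_result is a hypothetical helper name.
spark_pi_result() {
    ns="$1"    # namespace the job ran in, e.g. techies-spark
    # List driver pods oldest first and keep the most recent one.
    driver="$(kubectl get pods --namespace "$ns" --selector spark-role=driver \
        --sort-by=.metadata.creationTimestamp --output name | tail -n 1)"
    if [ -z "$driver" ]; then
        echo "no driver pod found in namespace $ns" >&2
        return 1
    fi
    # $driver has the form pod/<name>, which kubectl logs accepts directly.
    kubectl logs --namespace "$ns" "$driver" | grep 'Pi is roughly'
}
```

Running spark_pi_result techies-spark after the job finishes prints the computed approximation, or returns a non-zero status if no driver pod or result line is found.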