Enabling cluster deploy-mode for the Spark service

What is cluster deploy-mode?

Cluster deploy-mode is the submission of a Spark job to a Spark cluster in which the driver is executed on one of the worker nodes.

Our current Spark service uses client deploy-mode, where the driver is executed on the local machine (Jupyter, web shell, Zeppelin).
For more info, see the Spark documentation on submitting applications.
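
The difference is only visible at submission time - the same job is submitted with a different --deploy-mode flag. A rough illustration (the master URL, class, and JAR below are placeholders; the ports match the ones used later in this guide):

  # client deploy-mode (current behavior): the driver runs on the submitting machine
  spark-submit --deploy-mode client --master spark://<spark-master-svc-name>:7077 --class <main-class> <application>.jar

  # cluster deploy-mode: the driver runs on one of the worker nodes
  spark-submit --deploy-mode cluster --master spark://<spark-master-svc-name>:6066 --class <main-class> <application>.jar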


What is it good for? 

  1. Automation.
  2. Triggering Spark jobs from Nuclio, based on an event or a schedule.
  3. Remote submission of Spark jobs.
  4. Better performance, by reducing the network latency between the driver and the executors.
  5. No more cron jobs on the app node to trigger a script on the shell service!


Supported languages

Scala and Java.

Python is currently NOT supported (as of Spark v2.4.4) for cluster deploy-mode on standalone clusters.
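
For reference, this is a minimal sketch of a Scala application that can be submitted in cluster deploy-mode. The package and class name match the example spark-submit command at the end of this guide, but the job body is illustrative only - replace it with your own logic:

  package com.iguazio.customers

  import org.apache.spark.sql.SparkSession

  object SparkExample {
    def main(args: Array[String]): Unit = {
      // The master URL and deploy-mode are supplied by spark-submit, not hard-coded here.
      val spark = SparkSession.builder().appName("SparkExample").getOrCreate()

      // Illustrative work: count the rows of a small generated dataset.
      val count = spark.range(0, 1000).count()
      println(s"Row count: $count")

      spark.stop()
    }
  }

Note that in cluster deploy-mode the driver runs on a worker node, so the application JAR must be readable from the workers - which is why it is uploaded to v3io in the last step of this guide.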


Required steps (high level):

  1. Enable the spark-worker and spark-master pods to be resolved by their pod names.
    To achieve this, I used an open-source operator that creates a headless service for pods with a specific annotation.
  2. Open port 6066 (the standalone master's REST submission port) on the spark-master pod.
  3. Add the v3io JAR files to the spark-worker's classpath.


 

Step-by-step guide

  1. Create a Spark service.
  2. Log in to the app node, or use kubectl with a kubeconfig, for the next steps.
    1. Install the k8s-pod-headless-service-operator. It enables resolving the worker and master pods of the Spark service by pod name.
      1. git clone https://github.com/src-d/k8s-pod-headless-service-operator
      2. cd k8s-pod-headless-service-operator
      3. git checkout tags/v0.1.1
      4. cd manifests
      5. Edit deployment.yaml so that the value of the 'NAMESPACE' environment variable is 'default-tenant':

        env:
          - name: NAMESPACE
            value: default-tenant
        
        


      6. kubectl apply -f rbac.yaml

      7. kubectl apply -f deployment.yaml
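
      8. Optionally, verify that the operator pod is running. The namespace and pod name depend on what deployment.yaml defines, so adjust the command accordingly:

        kubectl get pods --all-namespaces | grep headless-service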

    2. Modify the spark-worker deployment to include the v3io-jars in its classpath,
      and add an annotation to the created pods for the k8s-pod-headless-service-operator.

      1. kubectl -n default-tenant edit deployment <spark-worker-deployment name>
        To include the v3io JARs, edit the command and args under 'spec.template.spec.containers'.
        The annotation should be added under 'spec.template.metadata'.
        It should look like this -

        spec:
          template:
            metadata:
               annotations:
                 srcd.host/create-headless-service: "true"
            spec:
              containers:
              - args:
                - cp /igz/java/libs/v3io-*.jar /spark/jars; /bin/bash /etc/config/v3io/v3io-spark.sh
                command:
                - /bin/bash
                - -c


      2. Verify by checking that the new worker pods were deployed successfully.
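
        For example (pod names vary, so adjust the placeholders and grep patterns to your deployment):

        kubectl -n default-tenant get pods | grep spark-worker
        # confirm the v3io JARs were copied into the classpath of one of the new worker pods
        kubectl -n default-tenant exec <spark-worker-pod-name> -- ls /spark/jars | grep v3io
        # confirm the operator created a headless service for the pod
        kubectl -n default-tenant get svc | grep <spark-worker-pod-name>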

    3. Modify the spark-master deployment to open port 6066,
      and add an annotation to the created pods for the k8s-pod-headless-service-operator.
      1. kubectl -n default-tenant edit deployment <spark-master-deployment name>
        The ports section to modify is under 'spec.template.spec.containers'.
        The annotation should be added under 'spec.template.metadata'.
        It should look like this -

        spec:
          template:
            metadata:
              annotations:
                srcd.host/create-headless-service: "true"
            spec:
              containers:
              - ports:
                - containerPort: 6066
                  protocol: TCP
                - containerPort: 7077
                  protocol: TCP
                - containerPort: 8088
                  protocol: TCP


      2. Verify by checking that the new master pod was deployed successfully.
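
        For example, check that the new pod picked up the extra port (the pod name is a placeholder):

        kubectl -n default-tenant get pod <spark-master-pod-name> -o yaml | grep 6066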

  3. Upload your Spark application JAR to v3io, and submit your job from the web shell/Jupyter/Zeppelin:
    spark-submit --deploy-mode cluster --master spark://<spark-master-svc-name>:6066 --class com.iguazio.customers.SparkExample v3io://<container-name>/path/to/SparkJob.jar
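
    In cluster deploy-mode, spark-submit talks to the standalone master's REST submission server on port 6066 and prints a submission ID. If needed, you can query the driver's state afterwards through the same port (a sketch; <submission-id> is the ID that spark-submit printed):

    curl http://<spark-master-svc-name>:6066/v1/submissions/status/<submission-id>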