Accessing Kerberized Sources From Spark2 in Cluster Mode on YARN


Many of phData’s customers need to connect to a source secured via Kerberos from a Spark application. A source can be a JDBC connection like Impala, or a web URL that uses Kerberos for authentication. While a simple workaround is to run the application on YARN with the deploy-mode client, phData recommends running all Spark applications in cluster mode. This post covers how to connect to a secured source in cluster mode, using the example of connecting to secured Kafka from a Spark streaming app.

What’s the Problem?

When Spark runs on YARN on a secured cluster, the user first needs to obtain a Kerberos ticket with kinit. When a job is then submitted, delegation tokens are sent to the Application Master (AM) and the executors. Those delegation tokens cover only HDFS, HBase, and YARN. When a developer wants to connect to a different kerberized source and runs the application in cluster mode, the connection fails, because the driver and the executors run on cluster nodes where the user's Kerberos ticket is not available and no delegation token exists for that source. Common use cases are accessing Kudu through the Impala JDBC driver and accessing secured Kafka.

The Solution

The solution described in this article relies on a JAAS configuration file (jaas.conf).

Please note that in order to run in cluster mode, all references to keytabs need to use relative paths.

Example jaas.conf (frank_jaas.conf):
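KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  …
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  …
};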
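
Since the listing above lost its login options, here is a minimal sketch of how such a file could be generated. The keytab name (frank.keytab) and the principal (frank@EXAMPLE.COM) are hypothetical placeholders, and the chosen Krb5LoginModule options are an assumption about a typical keytab-based login, not necessarily the settings of the original file. Note that the keytab is referenced by its bare file name, in line with the relative-path requirement mentioned earlier.

```shell
#!/bin/sh
# Hypothetical placeholders: replace with your own keytab, principal and file name.
KEYTAB="frank.keytab"
PRINCIPAL="frank@EXAMPLE.COM"
JAAS_CONF="frank_jaas.conf"

# Write a JAAS file with a KafkaClient section (Kafka) and a Client section (ZooKeeper).
# The keytab path is relative, so it resolves inside the YARN container working directory.
cat > "${JAAS_CONF}" <<EOF
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="${KEYTAB}"
  principal="${PRINCIPAL}";
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  storeKey=true
  useTicketCache=false
  keyTab="${KEYTAB}"
  principal="${PRINCIPAL}";
};
EOF

echo "wrote ${JAAS_CONF}"
```

Generating the file from variables keeps the keytab name and principal in one place, so the same values can be reused when assembling the spark-submit arguments.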

We ship the jaas.conf along with a keytab to the application master and the executors by specifying the --files option in spark-submit. This is different from using the --principal and --keytab options in spark-submit. Please note the extraJavaOptions configuration for both the driver and the executors.

In the example below, Kerberos debug output is switched off, but you can set the flag to true if you want to get debug information from Kerberos for troubleshooting.


If a custom truststore is required, the same approach can be followed: ship the truststore file with --files and pass the corresponding system properties through the …extraJavaOptions… settings.
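
As a rough sketch of the truststore variant, the snippet below builds the extra arguments. It assumes the client honours the standard JVM properties javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword. The file name truststore.jks, the password, and the other file names are hypothetical placeholders. Some clients are configured through their own settings instead, so treat this as one possible setup rather than the definitive one.

```shell
#!/bin/sh
# Hypothetical placeholders: adjust to your environment.
TRUSTSTORE="truststore.jks"
TRUSTSTORE_PASSWORD="changeit"

# Ship the truststore next to the jaas.conf and the keytab (file names are assumptions).
FILES="frank_jaas.conf,frank.keytab,${TRUSTSTORE}"

# Standard JVM truststore properties; the relative path resolves in the container working directory.
SSL_OPTS="-Djavax.net.ssl.trustStore=${TRUSTSTORE} -Djavax.net.ssl.trustStorePassword=${TRUSTSTORE_PASSWORD}"

# Print the arguments that would be appended to the spark-submit command.
echo "--files ${FILES}"
echo "--conf spark.driver.extraJavaOptions=${SSL_OPTS}"
echo "--conf spark.executor.extraJavaOptions=${SSL_OPTS}"
```

Keep in mind that a password passed this way is visible in the process list and in the Spark configuration, so a dedicated, low-privilege truststore is advisable.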

Spark submit example command:

spark2-submit \
--name my_streaming_app \
--master yarn --deploy-mode cluster \
--num-executors 1 \
--files $FILES \
--conf "" \
--conf "" \
--class io.phdata.spark.streaming.StreamingDriver \
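
The values of the --conf options and the application jar are missing from the command above. The following sketch shows one way the pieces could fit together. The file names, the jar name (my-streaming-app.jar), and the exact Java options are assumptions for illustration. java.security.auth.login.config is the standard JAAS system property, and sun.security.krb5.debug is the usual JDK flag for Kerberos debug output. The script only prints the command by default, so it can be inspected before anything is submitted.

```shell
#!/bin/sh
# Hypothetical placeholders: adjust to your environment.
JAAS_CONF="frank_jaas.conf"
KEYTAB="frank.keytab"
APP_JAR="my-streaming-app.jar"

# Files listed here are copied into the working directory of the AM and of each executor.
FILES="${JAAS_CONF},${KEYTAB}"

# Point the JVM at the shipped jaas.conf by relative path; set the debug flag to true for troubleshooting.
JAVA_OPTS="-Djava.security.auth.login.config=${JAAS_CONF} -Dsun.security.krb5.debug=false"

# Dry run by default: the command is echoed. Run with DRY_RUN= (empty) to actually submit.
${DRY_RUN-echo} spark2-submit \
  --name my_streaming_app \
  --master yarn --deploy-mode cluster \
  --num-executors 1 \
  --files "${FILES}" \
  --conf "spark.driver.extraJavaOptions=${JAVA_OPTS}" \
  --conf "spark.executor.extraJavaOptions=${JAVA_OPTS}" \
  --class io.phdata.spark.streaming.StreamingDriver \
  "${APP_JAR}"
```

The same Java options are passed to the driver and to the executors because, in cluster mode, both run inside YARN containers and both need to find the jaas.conf and the keytab in their working directory.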
