July 17, 2018

Accessing Kerberized Sources From Spark2 In Cluster Mode on Yarn

By Frank Rischner

Introduction

Many of phData’s customers need to connect to a Kerberos-secured source from a Spark application. The source might be a JDBC endpoint such as Impala, or a web URL that uses Kerberos for authentication. A simple workaround is to run the application on YARN in client deploy mode, but phData recommends running all Spark applications in cluster mode. This post covers how to connect to a secured source in cluster mode, using the example of connecting to secured Kafka from a Spark Streaming application.

What’s the Problem?

When Spark runs on YARN on a secured cluster, the user needs to kinit before submitting. When the job is submitted, delegation tokens for HDFS, HBase, and YARN are sent to the Application Master (AM) and the executors. The Kerberos ticket obtained via kinit, however, stays on the gateway node, and no tokens are issued for other services. As a result, an application running in cluster mode fails when it tries to connect to a different Kerberized source. Accessing Kudu through the Impala JDBC driver and consuming from secured Kafka are common examples.

The Solution

The solution covered in this article uses a JAAS configuration file (jaas.conf), which lets the driver and the executors log in with a shipped keytab via the Krb5LoginModule.

Please note that in order to run in cluster mode, all references to keytabs (and other shipped files) need to use relative paths, because files distributed with --files are placed in the working directory of each YARN container.

Example jaas.conf (frank_jaas.conf):
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="frank.keytab"
  principal="frank@PHDATA.IO";
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useKeyTab=true
  keyTab="frank.keytab"
  principal="frank@PHDATA.IO"
  storeKey=true
  useTicketCache=false;
};
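
With this jaas.conf in place, the Kafka client only needs to be told to authenticate via SASL/GSSAPI; the Krb5LoginModule in the KafkaClient section performs the keytab login. Below is a minimal sketch of what a streaming driver like io.phdata.spark.streaming.StreamingDriver could look like using the spark-streaming-kafka-0-10 integration. The broker addresses, topic, and group id are placeholders, not values from the original application.

package io.phdata.spark.streaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDriver {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("my_streaming_app"), Seconds(10))

    // Kafka consumer settings. The SASL/Kerberos login itself is handled by the
    // KafkaClient entry in frank_jaas.conf that is shipped with --files.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1.phdata.io:9092",   // placeholder broker list
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my_streaming_app",
      "security.protocol" -> "SASL_PLAINTEXT",            // SASL_SSL if TLS is enabled (see below)
      "sasl.kerberos.service.name" -> "kafka"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my_topic"), kafkaParams)
    )

    // Placeholder processing: print the message values of each batch.
    stream.map(record => record.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}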

We ship the jaas.conf along with the keytab to the application master and the executors by listing them in the --files option of spark-submit. This is different from using the --principal and --keytab options of spark-submit. Also note the extraJavaOptions configuration for the driver and the executors, which points the JVM at the shipped jaas.conf via java.security.auth.login.config.

In the example below, we set -Dsun.security.krb5.debug=false, but you can set it to true to get Kerberos debug output for troubleshooting.

TLS/SSL

If a custom truststore is required, the same approach can be followed: ship the truststore file with --files and set the system properties javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword via the same extraJavaOptions settings.
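
For secured Kafka with TLS enabled, this means switching the client to SASL_SSL; the consumer can also be pointed directly at the shipped truststore. A minimal extension of the earlier sketch, with the file name and password mirroring the example command below:

// Extra consumer settings for SASL over TLS. The truststore is shipped via --files,
// so a relative path resolves inside the YARN container's working directory.
val secureKafkaParams = kafkaParams ++ Map[String, Object](
  "security.protocol" -> "SASL_SSL",
  "ssl.truststore.location" -> "my_truststore.truststore",
  "ssl.truststore.password" -> "changeit"
)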

Spark submit example command:

FILES=frank_jaas.conf,frank.keytab,my_truststore.truststore
spark2-submit \
--name my_streaming_app \
--master yarn --deploy-mode cluster \
--num-executors 1 \
--files $FILES \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=frank_jaas.conf -Dsun.security.krb5.debug=false -Djavax.net.ssl.trustStore=my_truststore.truststore -Djavax.net.ssl.trustStorePassword=changeit"\
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=frank_jaas.conf -Dsun.security.krb5.debug=false -Djavax.net.ssl.trustStore=my_truststore.truststore -Djavax.net.ssl.trustStorePassword=changeit"\
--class io.phdata.spark.streaming.StreamingDriver \
/home/frank/streaming-driver-2.0.7-jar-with-dependencies.jar
