July 17, 2018

Accessing Kerberized Sources From Spark2 In Cluster Mode on Yarn

By Frank Rischner

Introduction

Many of phData’s customers need to connect to a Kerberos-secured source from a Spark application. The source might be a JDBC endpoint such as Impala, or a web URL that uses Kerberos for authentication. A simple workaround is to run the application on YARN in client deploy mode, but phData recommends running all Spark applications in cluster mode. This post covers how to connect to a secured source in cluster mode, using the example of connecting to secured Kafka from a Spark Streaming application.

What’s the Problem?

When Spark runs on YARN on a secured cluster, the user needs to kinit before submitting. When the job is submitted, delegation tokens for HDFS, HBase, and YARN are sent to the Application Master (AM) and the executors. The Kerberos ticket obtained via kinit, however, stays on the gateway node, and no tokens are issued for other services. As a result, an application running in cluster mode fails when it tries to connect to a different Kerberized source. Accessing Kudu through the Impala JDBC driver and consuming from secured Kafka are common examples.

The Solution

The solution covered in this article uses a JAAS configuration file (jaas.conf), which lets the driver and the executors log in with a shipped keytab via the Krb5LoginModule.

Please note that in order to run in cluster mode, all references to keytabs (and other shipped files) need to use relative paths, because files distributed with --files are placed in the working directory of each YARN container.

Example jaas.conf (frank_jaas.conf):
KafkaClient {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="frank.keytab"
  principal="frank@PHDATA.IO";
};
Client {
  com.sun.security.auth.module.Krb5LoginModule required
  doNotPrompt=true
  useKeyTab=true
  keyTab="frank.keytab"
  principal="frank@PHDATA.IO"
  storeKey=true
  useTicketCache=false;
};
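
With this jaas.conf in place, the Kafka client only needs to be told to authenticate via SASL/GSSAPI; the Krb5LoginModule in the KafkaClient section performs the keytab login. Below is a minimal sketch of what a streaming driver like io.phdata.spark.streaming.StreamingDriver could look like using the spark-streaming-kafka-0-10 integration. The broker addresses, topic, and group id are placeholders, not values from the original application.

package io.phdata.spark.streaming

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Seconds, StreamingContext}

object StreamingDriver {
  def main(args: Array[String]): Unit = {
    val ssc = new StreamingContext(new SparkConf().setAppName("my_streaming_app"), Seconds(10))

    // Kafka consumer settings. The SASL/Kerberos login itself is handled by the
    // KafkaClient entry in frank_jaas.conf that is shipped with --files.
    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "broker1.phdata.io:9092",   // placeholder broker list
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "my_streaming_app",
      "security.protocol" -> "SASL_PLAINTEXT",            // SASL_SSL if TLS is enabled (see below)
      "sasl.kerberos.service.name" -> "kafka"
    )

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc,
      LocationStrategies.PreferConsistent,
      ConsumerStrategies.Subscribe[String, String](Seq("my_topic"), kafkaParams)
    )

    // Placeholder processing: print the message values of each batch.
    stream.map(record => record.value).print()

    ssc.start()
    ssc.awaitTermination()
  }
}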

We ship the jaas.conf along with the keytab to the application master and the executors by listing them in the --files option of spark-submit. This is different from using the --principal and --keytab options of spark-submit. Also note the extraJavaOptions configuration for the driver and the executors, which points the JVM at the shipped jaas.conf via java.security.auth.login.config.

In the example below, we set -Dsun.security.krb5.debug=false, but you can set it to true to get Kerberos debug output for troubleshooting.

TLS/SSL

If a custom truststore is required, the same approach can be followed: ship the truststore file with --files and set the system properties javax.net.ssl.trustStore and javax.net.ssl.trustStorePassword via the same extraJavaOptions settings.
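
For secured Kafka with TLS enabled, this means switching the client to SASL_SSL; the consumer can also be pointed directly at the shipped truststore. A minimal extension of the earlier sketch, with the file name and password mirroring the example command below:

// Extra consumer settings for SASL over TLS. The truststore is shipped via --files,
// so a relative path resolves inside the YARN container's working directory.
val secureKafkaParams = kafkaParams ++ Map[String, Object](
  "security.protocol" -> "SASL_SSL",
  "ssl.truststore.location" -> "my_truststore.truststore",
  "ssl.truststore.password" -> "changeit"
)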

Spark submit example command:

FILES=frank_jaas.conf,frank.keytab,my_truststore.truststore
spark2-submit \
--name my_streaming_app \
--master yarn --deploy-mode cluster \
--num-executors 1 \
--files $FILES \
--conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=frank_jaas.conf -Dsun.security.krb5.debug=false -Djavax.net.ssl.trustStore=my_truststore.truststore -Djavax.net.ssl.trustStorePassword=changeit"\
--conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=frank_jaas.conf -Dsun.security.krb5.debug=false -Djavax.net.ssl.trustStore=my_truststore.truststore -Djavax.net.ssl.trustStorePassword=changeit"\
--class io.phdata.spark.streaming.StreamingDriver \
/home/frank/streaming-driver-2.0.7-jar-with-dependencies.jar
