Configuring Oozie for Spark SQL on a Secure Hadoop Cluster

A secure Hadoop cluster requires actions in Oozie to be authenticated. However, because of the way Oozie executes workflow actions, the user's Kerberos credentials are not available to the actions it launches. Oozie runs actions on the Hadoop cluster itself and, for legacy reasons, starts each action inside a single-task, map-only MapReduce job (the launcher job).

Spark does allow specifying a keytab and principal as options, and the distributed cache can be used to ship the keytab:

<spark-opts>--files hive-site.xml --keytab user.keytab --principal user@your-realm.com</spark-opts>

However, specifying a keytab file in <spark-opts> results in the following confusing error:

“Failing Oozie Launcher, Main class [org.apache.oozie.action.hadoop.SparkMain], main() threw exception, Delegation Token can be issued only with kerberos or web authentication at org.apache.hadoop.hdfs.server.namenode.FSNamesystem.getDelegationToken”

Oozie schedules and executes workflows submitted by users. A workflow ultimately launches jobs that communicate with various services in the Hadoop ecosystem, such as the Hive metastore and the HBase service. For security purposes, each service may require a different type of credential, granted by the corresponding service, and each job must present the service-specific credential when contacting that service. Hence, the right solution is to configure Oozie to use its own Kerberos credentials to obtain "delegation tokens" on behalf of the user from the service in question. Delegation tokens are secret keys, shared with the NameNode or Hive metastore, that can be used to authenticate inside the cluster.

This can be done by adding a credentials section to the top of the workflow. The credentials section is available in Oozie workflow schema version 0.3 and later. Note that for the purposes of this discussion, you can think of HCatalog as equivalent to the Hive metastore, which isn’t far from the truth generally speaking.

   <credentials>
      <credential name="hcatauth" type="hcat">
         <property>
            <name>hcat.metastore.uri</name>
            <value>HIVE_METASTORE_URI</value>
         </property>
         <property>
            <name>hcat.metastore.principal</name>
            <value>HIVE_METASTORE_PRINCIPAL</value>
         </property>
      </credential>
   </credentials>

Oozie provides a unified credential framework for obtaining any custom credential. The admin, or more commonly the distribution provider, pre-defines a mapping between each credential type and the class that obtains it. Oozie currently ships with the following credential implementations:

  1. HCatalog and Hive Metastore: org.apache.oozie.action.hadoop.HCatCredentials
  2. HBase: org.apache.oozie.action.hadoop.HBaseCredentials
  3. Hive Server 2: org.apache.oozie.action.hadoop.Hive2Credentials
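
The mapping from a credential type name (such as "hcat") to its implementation class is defined by the Oozie administrator via the oozie.credentials.credentialclasses property. A sketch of what this typically looks like in oozie-site.xml (distributions usually pre-configure it, and exact values may vary):

```xml
<!-- oozie-site.xml: maps credential type names to implementation classes.
     Usually pre-configured by the distribution; shown here for illustration. -->
<property>
   <name>oozie.credentials.credentialclasses</name>
   <value>hcat=org.apache.oozie.action.hadoop.HCatCredentials,hbase=org.apache.oozie.action.hadoop.HBaseCredentials,hive2=org.apache.oozie.action.hadoop.Hive2Credentials</value>
</property>
```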

Use Case

Suppose we have an Oozie workflow containing a Spark SQL job that uses HiveContext. In order to communicate with the Hive metastore, Spark SQL needs Kerberos configuration information, which can be provided through a credentials section. In this case you would use HCatCredentials.

Oozie's HCatCredentials implementation requires the following two properties in order to retrieve the token:

  1. hcat.metastore.principal
  2. hcat.metastore.uri

The user declares a new credential named "hcatauth" of type "hcat", with the configuration needed to obtain the token, and then references it through the "cred" attribute in the action definition. The cred attribute accepts a comma-separated list of credential names.
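
For instance, an action that needed both Hive metastore and HBase tokens could reference two credentials. The "hbaseauth" name below is an illustrative assumption; it would be declared in the <credentials> section the same way as "hcatauth", with type "hbase":

```xml
<!-- Hypothetical action referencing two credentials -->
<action name="spark-node" cred="hcatauth,hbaseauth">
   ...
</action>
```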

The values for hcat.metastore.uri and hcat.metastore.principal are available in hive-site.xml.
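
For reference, the corresponding entries in hive-site.xml typically look like the following (host and realm are placeholders):

```xml
<!-- Excerpt from hive-site.xml: hive.metastore.uris supplies the value for
     hcat.metastore.uri, and hive.metastore.kerberos.principal supplies
     hcat.metastore.principal. -->
<property>
   <name>hive.metastore.uris</name>
   <value>thrift://metastore-host.example.com:9083</value>
</property>
<property>
   <name>hive.metastore.kerberos.principal</name>
   <value>hive/_HOST@EXAMPLE.COM</value>
</property>
```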

Sample Workflow

<workflow-app name="Spark-wf" xmlns="uri:oozie:workflow:0.5">
   <credentials>
      <credential name="hcatauth" type="hcat">
         <property>
            <name>hcat.metastore.uri</name>
            <value>HIVE_METASTORE_URI</value>
         </property>
         <property>
            <name>hcat.metastore.principal</name>
            <value>HIVE_METASTORE_PRINCIPAL</value>
         </property>
      </credential>
   </credentials>

   <start to="spark-node"/>
   <action name="spark-node" cred="hcatauth">
      <spark xmlns="uri:oozie:spark-action:0.2">
         <job-tracker>${jobTracker}</job-tracker>
         <name-node>${nameNode}</name-node>
         <master>${master}</master>
         <name>Spark-Job</name>
         <class>Spark-Job-Class</class>
         <jar>${nameNode}/user/${wf:user()}/${sparkRoot}/lib/Spark-Job-Jar.jar</jar>
         <spark-opts>--files hive-site.xml</spark-opts>
         <file>${nameNode}/user/${wf:user()}/hive-site.xml</file>
      </spark>
      <ok to="end"/>
      <error to="fail"/>
   </action>
   <kill name="fail">
      <message>Workflow failed, error message[${wf:errorMessage(wf:lastErrorNode())}]</message>
   </kill>
   <end name="end"/>
</workflow-app>
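
The parameters referenced in the workflow (${jobTracker}, ${nameNode}, ${master}, ${sparkRoot}) are supplied at submission time, typically through a job.properties file. A minimal sketch, with placeholder host names:

```properties
# Placeholder hosts and paths; adjust for your cluster.
nameNode=hdfs://namenode-host.example.com:8020
jobTracker=resourcemanager-host.example.com:8032
master=yarn-cluster
sparkRoot=spark
oozie.use.system.libpath=true
oozie.wf.application.path=${nameNode}/user/${user.name}/spark-wf
```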

Notes

  • Ensure that you ship a hive-site.xml file along with your submission; otherwise you will hit the following exception: “java.lang.RuntimeException: Unable to instantiate org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient”
  • When using the <file> tag with the Spark action, make sure the action uses schema version “uri:oozie:spark-action:0.2”, as version 0.1 does not support the <file> tag
