Spark

Configuring Oozie for Spark SQL on a Secure Hadoop Cluster

A secure Hadoop cluster requires actions in Oozie to be authenticated. However, due to the way Oozie workflows execute actions, Kerberos credentials are not available to actions launched by Oozie. Oozie runs actions on the Hadoop cluster; specifically, for legacy reasons, each action is started inside a single-task, map-only MapReduce job. Spark does […]
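A common workaround (not necessarily the approach the full post takes) is to ship a keytab with the workflow and have the Spark code log in explicitly through Hadoop's UserGroupInformation API. A minimal sketch, assuming a hypothetical principal and a keytab distributed alongside the action:

    // Sketch: explicit Kerberos login inside a Spark driver launched by Oozie.
    // The principal and keytab name are hypothetical; the keytab would be
    // shipped with the workflow (e.g. listed as a <file> in the action).
    import org.apache.hadoop.security.UserGroupInformation

    val principal = "etl-user@EXAMPLE.COM" // hypothetical principal
    val keytab    = "etl-user.keytab"      // hypothetical keytab in the container's working dir

    // Log in before touching HDFS or the Hive metastore, since the launcher
    // job does not forward Kerberos credentials to the action it starts.
    UserGroupInformation.loginUserFromKeytab(principal, keytab)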


Spark Job History Server OutOfMemoryError

One of our customers hit an issue where the Spark Job History Server was running out of memory every few hours. The heap size was set to 4 GB, and the customer was not a heavy user of Spark, submitting no more than a couple of jobs a day. We did notice that they had many long-running spark-shell […]


Troubleshooting Spark and Kafka Escalations for Our Managed Services Customers

My favorite technical task is handling a troubleshooting escalation from one of our Managed Services customers. This week I was lucky enough to work on a problem worth sharing. A Spark job was producing messages to Kafka. Halfway through the Spark job, it froze: no more messages were produced to Kafka. After […]
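For context, the usual pattern for producing to Kafka from Spark is one producer per partition, closed when the partition finishes. A minimal sketch, assuming an RDD[String] named rdd and a hypothetical broker list and topic:

    // Sketch: writing an RDD of strings to Kafka, one producer per partition.
    // rdd, the broker list, and the topic name are all assumptions.
    import java.util.Properties
    import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

    rdd.foreachPartition { records =>
      val props = new Properties()
      props.put("bootstrap.servers", "broker1:9092")
      props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
      props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

      val producer = new KafkaProducer[String, String](props)
      try {
        records.foreach(r => producer.send(new ProducerRecord[String, String]("events", r)))
      } finally {
        // close() flushes buffered messages; forgetting it is a classic way
        // for a producing job to appear stuck.
        producer.close()
      }
    }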


Introduction to Apache Spark GraphX – Part 2

For this follow-up to Introduction to Apache Spark GraphX – Part 1, we decided we’d traverse a real-world graph. As developers, we use GitHub, which hosts public git repositories. Caught up in the “social” buzz, GitHub has added social features. In addition to having a public profile, which optionally lists your location, they allow you […]
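To give a flavor of what traversing such a graph looks like, here is a minimal sketch that loads a follower edge list and sizes its connected components; the file path and format ("followerId followedId" per line) are assumptions, not the post's actual dataset:

    // Sketch: load a GitHub-style follower graph from an edge list and
    // compute connected components. Runs in the spark-shell, where sc exists.
    import org.apache.spark.graphx.GraphLoader

    val graph = GraphLoader.edgeListFile(sc, "hdfs:///data/github_followers.txt")
    val cc = graph.connectedComponents().vertices // (vertexId, componentId) pairs

    // Size of each component, largest first.
    cc.map { case (_, componentId) => (componentId, 1L) }
      .reduceByKey(_ + _)
      .sortBy(_._2, ascending = false)
      .take(5)
      .foreach(println)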


Getting Started with Apache Spark GraphX – Part 1

This is a two-part series on Apache Spark GraphX. The second part is located here. From DNA sequencing to anti-terrorism efforts, graph computation is increasingly ubiquitous in today’s world. Graphs have vertices and edges: the vertices are the entities in the graph, while the edges are the connections between them. Graphs can be undirected or directed. […]
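To make the vertex/edge distinction concrete, here is a minimal sketch of a tiny directed graph in GraphX; the people and relationships are made up for illustration:

    // Sketch: vertices are entities (people), edges are connections between
    // them. Runs in the spark-shell, where sc is predefined.
    import org.apache.spark.graphx.{Edge, Graph}

    val vertices = sc.parallelize(Seq(
      (1L, "alice"), (2L, "bob"), (3L, "carol")
    ))
    val edges = sc.parallelize(Seq(
      Edge(1L, 2L, "follows"), // direction matters: alice follows bob
      Edge(2L, 3L, "follows")
    ))

    val graph = Graph(vertices, edges)
    graph.triplets.collect.foreach(t => println(s"${t.srcAttr} ${t.attr} ${t.dstAttr}"))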


Exploring Spark MLlib: Part 3 – Transformation and Model Creation

In the first two posts (1, 2) we ingested the data and explored it with the spark-shell. Now we’ll move on to creating and submitting our code as a standalone Spark application. Again, all the code covered in these posts can be found here. We’ll start by creating a case class and a function for parsing […]
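The real schema lives in the linked code, but the shape of such a parser is roughly this; the case class name and fields below are placeholders:

    // Sketch: a case class plus a parsing function for CSV-style input.
    // Reading and its fields are hypothetical, not the series' actual schema.
    case class Reading(id: String, feature1: Double, feature2: Double, label: Double)

    def parse(line: String): Reading = {
      val Array(id, f1, f2, label) = line.split(',')
      Reading(id, f1.toDouble, f2.toDouble, label.toDouble)
    }

    // In the standalone app the parser would be applied to the ingested file:
    //   val readings = sc.textFile("hdfs:///data/readings.csv").map(parse)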


Exploring Spark MLlib: Part 2 – Exploring the data

In the last post we got the environment set up. Now that the data is in the cluster and Spark is configured, we can begin to explore the data. A common way to start is with the spark-shell, a powerful command-line interpreter (REPL) for the Spark environment. Let’s get started. Execute the spark-shell […]
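A typical first session looks something like the sketch below; the HDFS path is a placeholder, and sc is the SparkContext the shell provides:

    // Sketch: first-pass exploration of ingested data in the spark-shell.
    val data = sc.textFile("hdfs:///user/demo/data.csv")

    data.count()                  // how many records were ingested?
    data.first()                  // peek at the header / first record
    data.take(5).foreach(println) // eyeball a small sample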


Exploring Spark MLlib: Part 1 – Setup and Ingest

This four-part series will introduce Spark MLlib by walking through a basic example, much like a chapter in Advanced Analytics with Spark (which phdata highly recommends). The goal is to cover an MLlib workflow end to end. The posts assume a basic understanding of Spark and of the Scala programming language. As much as possible, the code and examples […]
