Technical

Configuring Oozie for Spark SQL on a Secure Hadoop Cluster

A secure Hadoop cluster requires actions in Oozie to be authenticated. However, because of the way Oozie workflows execute actions, Kerberos credentials are not available to actions launched by Oozie. Oozie runs actions on the Hadoop cluster. Specifically, for legacy reasons, each action is started inside a single-task, map-only MapReduce job. Spark does […]

Spark Job History Server OutOfMemoryError

One of our customers hit an issue where the Spark Job History Server was running out of memory every few hours. The heap size was set to 4 GB and the customer was not a heavy user of Spark, submitting no more than a couple of jobs a day. We did notice that they had many long-running spark-shell […]

Troubleshooting Spark and Kafka escalations for our managed services customers

My favorite technical task is handling a troubleshooting escalation from one of our Managed Services customers. This week I was lucky enough to work on a problem worth sharing. A Spark job was producing messages to Kafka. Halfway through the Spark job, the job froze and no more messages were produced to Kafka. After […]

Apache Kafka Performance Numbers

A search for “Apache Kafka performance” returns dozens of articles, but few are useful for estimating real-world performance. Specifically, I’ve found few results that run on hardware common to modern data centers, replicate the data with the common factor of 3, and use many parallel producers and consumers. These results are meant to be used […]

Binary Stream Ingest: Flume vs Kafka vs Kinesis

Introduction: The Internet of Things will put new demands on Hadoop ingest methods, specifically on their ability to capture raw sensor data (binary streams). As discussed, big data will remove previous data storage constraints and allow streaming of raw sensor data at granularities dictated by the sensors themselves. The focus of this post will […]

4 Strategies for Updating Hive Tables

Apache Hive and complementary technologies such as Cloudera Impala provide scalable SQL on Apache Hadoop. Unlike legacy database systems, Hive and Impala have traditionally not provided any update functionality. However, many use cases require periodically updating rows, for example in slowly changing dimension tables. SQL-on-Hadoop technologies typically use one of two storage engines, Apache HBase […]

The Truth about SQL on Hadoop (part 3)

This is a multi-part blog post meant to be an exhaustive introduction to SQL-on-Hadoop. The first part in this series covered Storage Engines and Online Transaction Processing (OLTP). The next post covered Online Analytical Processing (OLAP), while this post will cover engine retrofits for Hadoop and choosing among the alternatives. Retrofits: When breaking this topic […]

The Truth about SQL on Hadoop (part 2)

This is a multi-part blog post meant to be an exhaustive introduction to SQL-on-Hadoop. The first part in this series covered Storage Engines and Online Transaction Processing (OLTP). This post will cover Online Analytical Processing (OLAP), while the third in the series will cover engine retrofits for Hadoop and choosing among the alternatives. Data processing and […]
